ENH: Feature request, date type #32473

Closed
zbrookle opened this issue Mar 5, 2020 · 16 comments
Labels
Closing Candidate May be closeable, needs more eyeballs datetime.date stdlib datetime.date support Enhancement

Comments

@zbrookle
Contributor

zbrookle commented Mar 5, 2020

There currently isn’t any native date dtype in pandas, which makes it impossible to integrate this type into file formats like parquet, where schema is defined

@jbrockmendel
Member

Period[D] behaves a lot like date. Could that be adapted to solve the parquet issue?
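As a rough illustration of the suggestion above (a sketch only, with made-up sample values), timestamps can be truncated to a daily period with `Series.dt.to_period`, after which the usual date fields remain available:

```python
import pandas as pd

# Sample timestamps (illustrative values, not from the issue).
stamps = pd.Series(pd.to_datetime(["2020-03-05 14:30", "2020-03-06 09:00"]))

# Convert to daily periods; the time-of-day component is dropped.
days = stamps.dt.to_period("D")

# Date-like field access still works on the period values.
day_of_month = days.dt.day
month = days.dt.month
```

Whether this dtype can be mapped cleanly onto a parquet date column is exactly the open question in this thread; the sketch only shows the date-like behavior.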

@zbrookle
Contributor Author

zbrookle commented Mar 6, 2020

That seems kind of like a workaround. Is there a real reason that pandas doesn’t have a date type? It’s a pretty essential and standard type for most data platforms and frameworks.

@jreback
Contributor

jreback commented Mar 6, 2020

That seems kind of like a workaround. Is there a real reason that pandas doesn’t have a date type? It’s a pretty essential and standard type for most data platforms and frameworks.

@zbrookle you are welcome to spend the time to develop one

normalized Timestamps are pretty easy to understand and easily act like a date type
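A minimal sketch of what "normalized Timestamps" means here, using made-up sample values: `Series.dt.normalize` sets the time component to midnight, so each value identifies a calendar day while the dtype stays a regular datetime64 one.

```python
import pandas as pd

# Sample timestamps (illustrative values).
stamps = pd.Series(pd.to_datetime(["2020-03-05 14:30", "2020-03-06 09:00"]))

# Set every time-of-day to midnight; the result is still a datetime64 Series.
dates = stamps.dt.normalize()

# Two timestamps on the same day now compare equal.
same_day = pd.Timestamp("2020-03-05 23:59").normalize() == dates.iloc[0]
```

The trade-off is that nothing in the dtype itself records that the column holds dates only, which is the schema problem raised in the opening comment.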

@zbrookle
Contributor Author

zbrookle commented Mar 6, 2020

@jreback Okay awesome I’d definitely love to work on it

@jreback
Contributor

jreback commented Mar 6, 2020

our current setup actually would be pretty reasonable to support a dtype and extension array of type datetime64[D] (and maybe other freqs); may also have to be named slightly different

could be backed by a nullable integer array that represents ordinals from epoch
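The idea of "ordinals from epoch in a nullable integer array" could look roughly like the following. This is only a sketch of the storage layout under that assumption, not the design that was implemented; it uses NumPy's `datetime64[D]` to compute day counts and `pd.arrays.IntegerArray` to carry the missing-value mask.

```python
import numpy as np
import pandas as pd

# Dates with one missing value (illustrative values).
dates = np.array(["2020-03-05", "NaT", "1970-01-02"], dtype="datetime64[D]")

# Mask of missing entries, and day counts since 1970-01-01.
mask = np.isnat(dates)
ordinals = dates.astype("int64")

# Nullable integer array: values plus a boolean mask marking missing entries.
backing = pd.arrays.IntegerArray(ordinals, mask)
```

Here `backing[0]` is the number of days from 1970-01-01 to 2020-03-05, and `backing[1]` is `pd.NA`.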

I don’t think it’s that hard actually

@jbrockmendel jbrockmendel added Dtype Conversions Unexpected or buggy dtype conversions Enhancement and removed Dtype Conversions Unexpected or buggy dtype conversions labels May 16, 2020
@zbrookle zbrookle mentioned this issue May 28, 2020
@jorisvandenbossche
Member

As reference, Apache Arrow has two different date types: https://github.com/apache/arrow/blob/21a6474fd6e4b5c0444d8364ebfdefcf856ef7fd/format/Schema.fbs#L144-L157
There is date64 (milliseconds since 1970 in 64bit) and date32 (days since 1970 in 32bit).

For pandas, having two different types feels like overkill, though. Compatibility with numpy's datetime64[D] would also be nice, but that doesn't map to either of the Arrow types (Arrow stores the days in 32 bits, while datetime64[D] uses 64).
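To make the two Arrow encodings concrete, here is a small standard-library sketch of the values each one stores for a given calendar date. The helper names `to_date32` and `to_date64` are hypothetical and are not part of pyarrow.

```python
import datetime

EPOCH = datetime.date(1970, 1, 1)
MS_PER_DAY = 86_400_000


def to_date32(d: datetime.date) -> int:
    """Days since 1970-01-01, the value an Arrow date32 would hold."""
    return (d - EPOCH).days


def to_date64(d: datetime.date) -> int:
    """Milliseconds since 1970-01-01, the value an Arrow date64 would hold."""
    return to_date32(d) * MS_PER_DAY


day_value = to_date32(datetime.date(1970, 1, 2))
ms_value = to_date64(datetime.date(1970, 1, 2))
```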

@zbrookle
Contributor Author

@jorisvandenbossche I think the compatibility with NumPy is probably the most important since that's what the backend of pandas is. It wouldn't be too hard to have logic that converts to the appropriate pyarrow format when a dataframe writes to parquet. I think there would just have to be an error raised if they tried to write out a date that is beyond 32 bits, which I don't think will happen very frequently
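The overflow check described here might be sketched as follows. The function name `days_to_date32` is hypothetical, and the sketch assumes the dates are already available as 64-bit day counts since the epoch.

```python
import numpy as np

INT32 = np.iinfo(np.int32)


def days_to_date32(days) -> np.ndarray:
    """Narrow 64-bit day counts to int32, raising if any value does not fit."""
    days = np.asarray(days, dtype=np.int64)
    if days.size and (days.min() < INT32.min or days.max() > INT32.max):
        raise OverflowError("date is outside the range of a 32-bit day count")
    return days.astype(np.int32)


narrow = days_to_date32([0, 18326])
```

An int32 day count spans several million years on either side of 1970, which supports the point that the error would rarely be hit in practice.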

@jorisvandenbossche
Member

I think the compatibility with NumPy is probably the most important since that's what the backend of pandas is

Compatibility with numpy is important, but I am not fully sure the numpy being the backend is important in this case. I think many of the datetime-specific functionalities are implemented in pandas, and are not coming from numpy (although I am not very much up to date with the exact details here)

@zbrookle
Contributor Author

zbrookle commented Jun 1, 2020

@jorisvandenbossche I'm actually very confident that the backend for at least the datetime64 object in pandas is NumPy (because a lot of my implementation of the date was based off the datetime type) and so I think the biggest priority in terms of conversion is that these two types at the very least work well and efficiently together

@jorisvandenbossche
Member

jorisvandenbossche commented Jun 1, 2020

Yes, we use the numpy datetime64 dtype, but that doesn't mean we use much of numpy to do datetime-related things with it (in the end, it's just an int64 array with an annotation).
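The "int64 array with an annotation" point can be seen directly in NumPy by viewing the same buffer under different dtypes. This is an illustrative sketch with made-up values:

```python
import numpy as np

# One nanosecond and one day after the epoch, stored as datetime64[ns].
stamps = np.array(
    ["1970-01-01T00:00:00.000000001", "1970-01-02"], dtype="datetime64[ns]"
)

# The underlying storage is plain int64 nanosecond counts.
raw = stamps.view("int64")

# The same integer means something different under a different unit annotation.
one = np.array([1], dtype="int64")
as_ns = one.view("datetime64[ns]")[0]
as_days = one.view("datetime64[D]")[0]
```

Here `as_ns` is one nanosecond past the epoch, while `as_days` is 1970-01-02.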

Eg the code to get "fields" from a date (like the month, or the day of the month, etc): https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/tslibs/fields.pyx

@jbrockmendel
Member

I think many of the datetime-specific functionalities are implemented in pandas, and are not coming from numpy (although I am not very much up to date with the exact details here)

This is correct, largely because the relevant C functions in numpy are not exposed. If they become exposed (xref numpy/numpy#16364) I expect we'll start using them (once our minimum np version catches up) instead of having our own mostly-copy/pasted versions.

@jbrockmendel
Member

Eg the code to get "fields" from a date (like the month, or the day of the month, etc): https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/tslibs/fields.pyx

Those functions all assume the int64s being passed represent nanosecond unit timestamps, which wouldn't apply here. What would apply is wrapping Period[D]
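A small sketch of why the unit assumption matters, using the public pandas API rather than the internal `fields.pyx` functions: the same integer gives a different month depending on whether it is read as nanoseconds or as days, and `Period[D]` already uses day counts as its ordinal.

```python
import pandas as pd

day_count = 18326  # days from 1970-01-01 to 2020-03-05

# Integers are read as nanoseconds by default, so this lands in January 1970.
as_ns = pd.to_datetime([day_count])

# Read explicitly as days, the same integer is 2020-03-05.
as_days = pd.to_datetime([day_count], unit="D")

# Period[D] stores this day count as its ordinal.
period_ordinal = pd.Period("2020-03-05", freq="D").ordinal
```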

@zbrookle
Contributor Author

zbrookle commented Jun 1, 2020

I’m aware that the actual date functions and abstractions are not through NumPy, but the backing array containing the data is, and that's what will impact conversion between the different data types within pandas using the .astype method. I think the only real decision here is whether to have the DateDtype backed by an int32 array or an int64 array.

  • If it's an int32 array, then when converting to another type of date, there would need to be a type conversion in addition to the conversion from datetime epoch to date epoch
  • If it's an int64 array, then only the epoch conversion would be needed, but additional conversion and handling would be necessary for pyarrow

So really it's a trade-off between pyarrow and numpy, so whatever you all think is most important I'll use
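The two conversions being weighed can be sketched in NumPy as below. The helper names `days_to_ns` and `ns_to_days` are hypothetical; the sketch assumes day counts and nanosecond counts both measured from 1970-01-01.

```python
import numpy as np

NS_PER_DAY = 86_400 * 10**9


def days_to_ns(days) -> np.ndarray:
    """Day counts -> nanosecond counts. int32 input is widened to int64 first."""
    return np.asarray(days).astype(np.int64) * NS_PER_DAY


def ns_to_days(ns) -> np.ndarray:
    """Nanosecond counts -> day counts, flooring so pre-1970 values round down."""
    return np.asarray(ns, dtype=np.int64) // NS_PER_DAY


from_int32 = days_to_ns(np.array([1, 18326], dtype=np.int32))
back = ns_to_days(from_int32)
```

With an int32 backing array the widening step in `days_to_ns` is the extra type conversion mentioned in the first bullet; with an int64 backing array it is a no-op, and the narrowing cost moves to the pyarrow side instead.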

@tswast
Contributor

tswast commented Oct 14, 2021

Copying my comment from here: googleapis/python-db-dtypes-pandas#30

[In the db-dtypes package], we did try datetime64[D]. It made the conversion to timestamp much more difficult. Also, I suspect adding time to date to get a timestamp will be a common operation, so the numpy datetime64[ns] backing array was a better fit for that too.
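The "add a time to a date to get a timestamp" operation mentioned here is straightforward when dates are kept as midnight datetime64 values. A minimal sketch with made-up values (this is plain pandas, not the db-dtypes API):

```python
import pandas as pd

# Dates stored as midnight timestamps, times stored as offsets from midnight.
dates = pd.Series(pd.to_datetime(["2021-10-14", "2021-10-15"]))
times = pd.Series(pd.to_timedelta(["09:30:00", "17:00:00"]))

# Element-wise addition yields full timestamps.
stamps = dates + times
```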

@jbrockmendel
Member

we now have pyarrow-backed date dtypes. I'm curious if that handles this use case

@jbrockmendel jbrockmendel added the Closing Candidate May be closeable, needs more eyeballs label Apr 10, 2023
@mroeschke
Member

I think this is handled by the ArrowDtype pa.date type so closing
