REGR: casting datetime strings with offset to tz-naive datetime64 fails #50140

Closed
jorisvandenbossche opened this issue Dec 9, 2022 · 8 comments · Fixed by #51514
Labels: Astype · Blocker (Blocking issue or pull request for an upcoming release) · Regression (Functionality that used to work in a prior pandas version)
Milestone: 2.0

Comments

@jorisvandenbossche
Member

On pandas 1.5 (so no deprecation warning):

>>> pd.Index(['2021-01-01 00:00:00+01:00', '2021-01-02 00:00:00+01:00']).astype("datetime64[ns]")
DatetimeIndex(['2020-12-31 23:00:00', '2021-01-01 23:00:00'], dtype='datetime64[ns]', freq=None)

On the main branch:

>>> pd.Index(['2021-01-01 00:00:00+01:00', '2021-01-02 00:00:00+01:00']).astype("datetime64[ns]")
...
TypeError: Cannot use .astype to convert from timezone-aware dtype to timezone-naive dtype. 
Use obj.tz_localize(None) or obj.tz_convert('UTC').tz_localize(None) instead.
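
A minimal sketch of the explicit route the error message points to, assuming the strings are first parsed with `pd.to_datetime(..., utc=True)` (that parsing step is my assumption, not something the message itself prescribes). It should give the same UTC wall-clock values that the 1.5 `astype` call returned:

```python
import pandas as pd

strings = pd.Index(["2021-01-01 00:00:00+01:00", "2021-01-02 00:00:00+01:00"])

# Parse to a tz-aware index normalised to UTC, so the offset is honoured.
aware_utc = pd.to_datetime(strings, utc=True)

# Drop the timezone explicitly; the remaining values are UTC wall-clock times.
naive_utc = aware_utc.tz_localize(None)

print(naive_utc)
```

Making the two steps explicit means the reader can see that the offset is applied (converted to UTC) rather than silently discarded.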

This started to fail a few days ago on pyarrow's CI. It comes up when you roundtrip a pandas DataFrame whose columns are a tz-aware DatetimeIndex: in Arrow the column labels are stored as strings, and in the arrow->pandas conversion we try to cast those strings to datetime64 and then localize. (We should probably cast directly to the tz-aware dtype instead.)
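
A rough sketch of what "cast directly to the tz-aware dtype" could look like on the pandas side. This is not the actual pyarrow code; `target_tz` is a hypothetical stand-in for whatever timezone the stored metadata records, and a fixed +01:00 offset is assumed here purely for illustration:

```python
from datetime import timedelta, timezone

import pandas as pd

# Column labels as they would come back from Arrow: plain strings.
labels = pd.Index(["2021-01-01 00:00:00+01:00", "2021-01-02 00:00:00+01:00"])

# Hypothetical: the timezone recorded alongside the labels (assumed +01:00).
target_tz = timezone(timedelta(hours=1))

# Parse straight to tz-aware values, then convert to the recorded timezone.
# No tz-naive intermediate is created, so the failing astype is never hit.
restored = pd.to_datetime(labels, utc=True).tz_convert(target_tz)

print(restored)
```

The point of the design is that the values never pass through a tz-naive dtype, so the result does not depend on how `.astype("datetime64[ns]")` treats offset-bearing strings.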

From a quick look at the recent commits and the code path this takes, might be caused by #50015 (cc @jbrockmendel)

jorisvandenbossche added the Regression and Astype labels Dec 9, 2022
jorisvandenbossche added this to the 2.0 milestone Dec 9, 2022
@jorisvandenbossche
Member Author

And to be clear, this is certainly a dubious case where it is not clear whether we actually want this behaviour. I think it might make sense to raise for this in the future. But if we want to do that, it should still be a conscious decision (not an accidental side effect of an unrelated PR), and it should probably be deprecated first.
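
To illustrate what a deprecate-first path might look like from the user's side, here is a small sketch. `astype_naive_with_deprecation` is a hypothetical helper, not pandas API, and it assumes all input strings share one offset:

```python
import warnings

import pandas as pd


def astype_naive_with_deprecation(strings):
    """Hypothetical helper (not pandas API): keep the old result but warn."""
    parsed = pd.to_datetime(strings)
    if parsed.tz is None:
        # No offset in the input, so there is nothing dubious about the cast.
        return parsed
    warnings.warn(
        "Casting datetime strings with an offset to tz-naive datetime64 is "
        "deprecated; convert to UTC and call tz_localize(None) explicitly.",
        FutureWarning,
        stacklevel=2,
    )
    # Old behaviour: convert to UTC, then drop the timezone.
    return parsed.tz_convert("UTC").tz_localize(None)


result = astype_naive_with_deprecation(
    ["2021-01-01 00:00:00+01:00", "2021-01-02 00:00:00+01:00"]
)
print(result)
```

With this shape, existing code keeps working for a release cycle while users get pointed at the explicit spelling before the cast starts raising.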

@MarcoGorelli
Member

> From a quick look at the recent commits and the code path this takes, might be caused by #50015 (cc @jbrockmendel)

git bisect confirms: https://www.kaggle.com/code/marcogorelli/pandas-regression-example?scriptVersionId=113357910

@jbrockmendel
Member

Yah, that particular .astype is definitely wonky. We should add a whatsnew note announcing the new behavior as an API change in 2.0.

@jorisvandenbossche
Member Author

It's something we can easily deprecate first, I think?

jorisvandenbossche added a commit to jorisvandenbossche/arrow that referenced this issue Dec 20, 2022
@jorisvandenbossche
Member Author

@jbrockmendel do you have time to take a look at this?

@jbrockmendel
Member

will take a look

@simonjayhawkins
Member

> It's something we can easily deprecate first, I think?

#51514 closes this by documenting the new behaviour as expected.

simonjayhawkins added the Blocker label Feb 22, 2023
@jorisvandenbossche
Member Author

To be explicit about the implications of doing this as a breaking change: this breaks pandas<->pyarrow roundtrips, and so e.g. also pandas<->parquet roundtripping (admittedly only in the somewhat corner case of using a DatetimeIndex for the column labels).

So users doing pd.read_parquet(..) with a file they wrote earlier will now get an error (with released pyarrow) that they can't easily solve themselves.

jorisvandenbossche added a commit to apache/arrow that referenced this issue Apr 4, 2023
### What changes are included in this PR?

- The issue with numpy 1.25 in the assert equal helper was fixed in pandas 1.5.3 -> removing the skip (in theory can still run into this error when using an older pandas version with the latest numpy, but that's not something you should do)
- Casting tz-aware strings to datetime64[ns] was not fixed in pandas (pandas-dev/pandas#50140) -> updating our implementation to work around it
- Casting to numpy string dtype (pandas-dev/pandas#50127) is not yet fixed -> updating the skip

### Are there any user-facing changes?

No
* Closes: #15070

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this issue May 15, 2023
rtpsw pushed a commit to rtpsw/arrow that referenced this issue May 16, 2023