Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: to_datetime re-parsing Arrow-backed objects #53301

Merged
merged 13 commits into from
Jul 11, 2023

Conversation

lithomas1
Copy link
Member

@lithomas1 lithomas1 commented May 19, 2023

Ideally, we could push this down into DTA/DTI, but that doesn't seem to support Arrow arrays ATM.

@lithomas1 lithomas1 added Datetime Datetime data dtype Arrow pyarrow functionality labels May 19, 2023
Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use a whatsnew too.

Also, does something similar happen to to_timedelta?

@lithomas1
Copy link
Member Author

Also, does something similar happen to to_timedelta?

I think so, I'll try to get at it at a followup.

I think in the future, it might also make sense for DF inputs with Arrow backed year, month, etc. columns to convert to Arrow Backed dt columns.

e.g.

>>> import pandas as pd
+ /Users/thomasli/opt/miniconda3/envs/pandas-dev/bin/ninja
[1/1] Generating write_version_file with a custom command
>>> df = pd.DataFrame({'year': [2015, 2016],
...                    'month': [2, 3],
...                    'day': [4, 5]})
>>> df = df.astype("int64[pyarrow]")
>>> pd.to_datetime(df)
0   2015-02-04
1   2016-03-05
dtype: datetime64[ns]

should return something with dtype: timestamp[ns][pyarrow]

Main thing I'm worried about with this PR is doing the wrong thing for utc=True (e.g. localizing instead of converting and vice versa).

(cc @phofl in case any thoughts and to double check my logic).

Copy link
Member

@phofl phofl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very small comment

elif isinstance(arg_dtype, ArrowDtype) and arg_dtype.kind == "M":
# TODO: Combine with above if DTI/DTA supports Arrow timestamps
if utc:
arg = arg.astype("timestamp[ns, UTC][pyarrow]")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather use pd.ArrowDtype(pa...) here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.astype shouldn't allow converting a tz-naive to tzaware (or vice-versa). i think this should use tz_convert/tz_localize instead

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(btw if .astype works on ArrowEA to convert tznaive to tzaware we probably want to change that to match the non-arrow raising behavior)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(btw if .astype works on ArrowEA to convert tznaive to tzaware we probably want to change that to match the non-arrow raising behavior)

Sure, I'll raise a PR after this one gets in.

@github-actions
Copy link
Contributor

github-actions bot commented Jul 6, 2023

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jul 6, 2023
@lithomas1 lithomas1 removed the Stale label Jul 6, 2023
Comment on lines 416 to 419
arg = Index(arg_array._dt_tz_localize("UTC"))
else:
# ArrowExtensionArray
arg = arg._dt_tz_localize("UTC")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is tz_localize always the right one, or do we need tz_convert if the input was timezone-aware to begin with?

In the test you've added, dti_arrow is always timezone-naive - would we make a timezone-aware example too please, just to check this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx for the catch.

I messed up the test (it does parametrize over UTC and US/Central), but I forgot to propagate the tz information in the tests.

@MarcoGorelli MarcoGorelli added this to the 2.1 milestone Jul 11, 2023
Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well done, looks good to me!

@mroeschke mroeschke merged commit 9d1d1b1 into pandas-dev:main Jul 11, 2023
@mroeschke
Copy link
Member

Thanks @lithomas1

@lithomas1 lithomas1 deleted the arrow-to-datetime branch July 11, 2023 17:07
@jbrockmendel
Copy link
Member

Does the to_datetime docstring need to be updated to reflect non-DTI return?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Arrow pyarrow functionality Datetime Datetime data dtype
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PERF: to_datetime doesn't recognise Arrow date time dtypes and converts them again
5 participants