BUG: Series.describe treating pyarrow timestamps/timedeltas as categorical #53001

Merged: 3 commits, May 1, 2023

@lukemanley (Member) commented Apr 29, 2023

Fixes `describe` so that pyarrow-backed timestamps and timedeltas are summarized the same way as numpy-backed timestamps and timedeltas, instead of being treated as categorical data.

Current behavior:

In [1]: import pandas as pd

In [2]: df_numpy = pd.DataFrame({
   ...:     "datetime": pd.to_datetime(range(10)),
   ...:     "timedelta": pd.to_timedelta(range(10)),
   ...:     "numeric": range(10),
   ...: })

In [3]: df_pyarrow = df_numpy.convert_dtypes(dtype_backend="pyarrow")

In [4]: df_numpy.describe()
Out[4]: 
                            datetime                  timedelta   numeric
count                             10                         10  10.00000
mean   1970-01-01 00:00:00.000000004  0 days 00:00:00.000000004   4.50000
min              1970-01-01 00:00:00            0 days 00:00:00   0.00000
25%    1970-01-01 00:00:00.000000002  0 days 00:00:00.000000002   2.25000
50%    1970-01-01 00:00:00.000000004  0 days 00:00:00.000000004   4.50000
75%    1970-01-01 00:00:00.000000006  0 days 00:00:00.000000006   6.75000
max    1970-01-01 00:00:00.000000009  0 days 00:00:00.000000009   9.00000
std                              NaN  0 days 00:00:00.000000003   3.02765
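As a cross-check on the numbers above, here is a minimal sketch that reproduces the `numeric` column with the Python standard library only. It assumes linear interpolation for the quartiles (the `"inclusive"` method in `statistics.quantiles`) and a sample standard deviation. The final step drops the fractional part of each statistic, which is consistent with the whole-nanosecond values shown in the `datetime` and `timedelta` columns; that is an observation from the output, not a statement about pandas internals.

```python
import statistics

values = list(range(10))  # same data as the "numeric" column

count = len(values)
mean = statistics.mean(values)
std = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
# Quartiles with linear interpolation between the closest ranks.
q25, q50, q75 = statistics.quantiles(values, n=4, method="inclusive")

print(count, mean, round(std, 5), q25, q50, q75)
# prints: 10 4.5 3.02765 2.25 4.5 6.75

# Dropping the fractional part gives whole nanoseconds, matching the
# .000000004 / .000000002 / .000000006 / .000000003 values in the table.
whole_ns = [int(x) for x in (mean, q25, q50, q75, std)]
print(whole_ns)
# prints: [4, 2, 4, 6, 3]
```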

In [5]: df_pyarrow.describe()
Out[5]: 
                   datetime        timedelta  numeric
count                    10               10     10.0
unique                   10               10     <NA>
top     1970-01-01 00:00:00  0 days 00:00:00     <NA>
freq                      1                1     <NA>
mean                    NaN              NaN      4.5
std                     NaN              NaN  3.02765
min                     NaN              NaN      0.0
25%                     NaN              NaN     2.25
50%                     NaN              NaN      4.5
75%                     NaN              NaN     6.75
max                     NaN              NaN      9.0

New behavior:

In [4]: df_pyarrow.describe()
Out[4]: 
                            datetime                  timedelta   numeric
count                             10                         10  10.00000
mean   1970-01-01 00:00:00.000000004  0 days 00:00:00.000000004   4.50000
min              1970-01-01 00:00:00            0 days 00:00:00   0.00000
25%    1970-01-01 00:00:00.000000002  0 days 00:00:00.000000002   2.25000
50%    1970-01-01 00:00:00.000000004  0 days 00:00:00.000000004   4.50000
75%    1970-01-01 00:00:00.000000006  0 days 00:00:00.000000006   6.75000
max    1970-01-01 00:00:00.000000009  0 days 00:00:00.000000009   9.00000
std                              NaN  0 days 00:00:00.000000003   3.02765
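The difference between the two layouts comes down to which summary routine a column is routed to. The toy sketch below illustrates one way such routing could key on a NumPy-style dtype kind code (`"M"` for datetimes, `"m"` for timedeltas) rather than on the concrete storage backend. `ToyDtype` and `pick_describe_style` are hypothetical names made up for this illustration; this is not the code changed in this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyDtype:
    """Hypothetical stand-in for a dtype; holds only what the sketch needs."""
    name: str
    kind: str  # NumPy-style kind code, e.g. "i", "f", "M", "m", "O"


NUMERIC_KINDS = frozenset("iuf")


def pick_describe_style(dtype: ToyDtype) -> str:
    """Return the summary layout a column of this dtype would get."""
    if dtype.kind in NUMERIC_KINDS:
        return "numeric"      # count, mean, std, min, quartiles, max
    if dtype.kind == "M":
        return "timestamp"    # count, mean, min, quartiles, max (no std)
    if dtype.kind == "m":
        return "numeric"      # timedeltas are summarized like numbers
    return "categorical"      # count, unique, top, freq


numpy_ts = ToyDtype("datetime64[ns]", "M")
arrow_ts = ToyDtype("timestamp[ns][pyarrow]", "M")
arrow_td = ToyDtype("duration[ns][pyarrow]", "m")
text = ToyDtype("object", "O")

for dt in (numpy_ts, arrow_ts, arrow_td, text):
    print(f"{dt.name}: {pick_describe_style(dt)}")
```

Because the routing in this sketch looks only at the kind code, the numpy-backed and pyarrow-backed timestamp dtypes land in the same branch, which is the parity the new output shows.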

@lukemanley lukemanley added Bug Arrow pyarrow functionality labels Apr 29, 2023
@lukemanley lukemanley added this to the 2.0.2 milestone Apr 29, 2023
dtype=object,
index=["count", "mean", "min", "25%", "50%", "75%", "max"],
)
for k, v in expected.items():
@phofl (Member) commented Apr 30, 2023

Could you do this explicitly? I might be missing something...

Edit: When creating the Series

@lukemanley (Member, Author) replied:

sure - updated

@mroeschke (Member) left a comment

Nice catch! Thanks @lukemanley

@mroeschke mroeschke merged commit 9442dfc into pandas-dev:main May 1, 2023
lumberbot-app (bot) commented May 1, 2023

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict; please backport manually. Here are approximate instructions:

  1. Check out the backport branch and update it.
git checkout 2.0.x
git pull
  2. Cherry-pick the first parent branch of this PR on top of the older branch:
git cherry-pick -x -m1 9442dfcc4f87c68632eb3ffbaf8dae6da4b532bd
  3. You will likely have some merge/cherry-pick conflicts here; fix them and commit:
git commit -am 'Backport PR #53001: BUG: Series.describe treating pyarrow timestamps/timedeltas as categorical'
  4. Push to a named branch:
git push YOURFORK 2.0.x:auto-backport-of-pr-53001-on-2.0.x
  5. Create a PR against branch 2.0.x. I would have named this PR:

"Backport PR #53001 on branch 2.0.x (BUG: Series.describe treating pyarrow timestamps/timedeltas as categorical)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

@mroeschke (Member) commented:

Mind backporting @lukemanley?

NumanIjaz pushed a commit to NumanIjaz/pandas that referenced this pull request May 1, 2023
…rical (pandas-dev#53001)

* Series.describe treating pyarrow timestamps and timedeltas as categorical

* gh refs

* cleanup
lukemanley added a commit to lukemanley/pandas that referenced this pull request May 2, 2023
@lukemanley (Member, Author) commented:

> Mind backporting @lukemanley?

just backported in #53031

mroeschke pushed a commit that referenced this pull request May 2, 2023
…rrow timestamps/timedeltas as categorical) (#53031)

* Backport PR #53001: BUG: Series.describe treating pyarrow timestamps/timedeltas as categorical

* clean
@lukemanley lukemanley deleted the arrow-describe-temporal branch May 30, 2023 22:16