API: Make describe changes backwards compatible #34798

TomAugspurger · 2020-06-15T13:46:51Z

Adds the new behavior as a feature flag / deprecation.

Closes #33903

(Do we have a list of issues for deprecations introduced in 1.x?)
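
For readers following along, here is a minimal sketch of the "feature flag / deprecation" pattern described above. It is not the pandas implementation: `describe_datetimes` is a hypothetical stand-in that uses only the standard library, and the categorical-style keys in the default branch are illustrative. Only the keyword name `datetime_is_numeric` and the use of `FutureWarning` come from this PR.

```python
import warnings
from collections import Counter
from datetime import datetime, timedelta


def describe_datetimes(values, datetime_is_numeric=False):
    """Toy summary of a list of datetimes (hypothetical, not pandas code)."""
    values = sorted(values)
    if datetime_is_numeric:
        # Opt-in future behavior: numeric-style summary, no warning.
        base = values[0]
        total = sum((v - base for v in values), timedelta(0))
        return {
            "count": len(values),
            "mean": base + total / len(values),
            "min": values[0],
            "max": values[-1],
        }
    # Default: keep the old categorical-style summary, but warn that it will change.
    warnings.warn(
        "Treating datetime data as categorical rather than numeric is deprecated; "
        "pass datetime_is_numeric=True to adopt the future behavior now.",
        FutureWarning,
        stacklevel=2,
    )
    top, freq = Counter(values).most_common(1)[0]
    return {
        "count": len(values),
        "unique": len(set(values)),
        "top": top,
        "freq": freq,
        "first": values[0],
        "last": values[-1],
    }
```

Callers who pass nothing keep the old output and see a warning; callers who pass `datetime_is_numeric=True` get the new output silently, which is what makes the change backwards compatible.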

TomAugspurger · 2020-06-15T13:47:38Z

cc @david-cortes.

jreback

can you add to the deprecation removal list as well for 2.0

pandas/core/generic.py

jreback · 2020-06-15T14:23:14Z

pandas/tests/frame/methods/test_describe.py

+        result = df.describe(include="all", datetime_is_numeric=True)
+        tm.assert_frame_equal(result, expected)
+
+        s1_ = s1.describe()


can you make as a separate test

jreback · 2020-06-15T14:23:21Z

pandas/tests/series/methods/test_describe.py

@@ -98,3 +98,19 @@ def test_describe_with_tz(self, tz_naive_fixture):
            index=["count", "mean", "min", "25%", "50%", "75%", "max"],
        )
        tm.assert_series_equal(result, expected)
+
+        with tm.assert_produces_warning(FutureWarning):


jorisvandenbossche · 2020-06-15T14:34:34Z

(Do we have a list of issues for deprecations introduced in 1.x?)

#30228

jorisvandenbossche · 2020-06-15T14:52:55Z

BTW, some other remarks on the new behaviour: shouldn't this also be enabled in DataFrame.describe by default?
Now (on master), a datetime column is not included by default. That made sense before, as it was regarded as a categorical and not a numeric column. But with datetime_is_numeric, it should be included by default?

Also, I would include "std" as well for datetimes, but with a NaN entry. That way the keys are always the same for all numeric types, regardless of the exact dtype (and that also ensures the ordering is preserved when doing describe on a dataframe with both numeric and datetime columns).
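
A small sketch of the idea in the comment above: if the datetime summary carries a "std" key filled with NaN, it has exactly the same keys, in the same order, as the numeric summary. This uses only the standard library; `summarize_numbers` and `summarize_datetimes` are hypothetical helpers, and the reduced key set is chosen for brevity, not taken from pandas.

```python
import math
import statistics
from datetime import datetime, timedelta


def summarize_numbers(xs):
    """Numeric summary with a fixed key order."""
    return {
        "count": len(xs),
        "mean": statistics.mean(xs),
        "std": statistics.stdev(xs),
        "min": min(xs),
        "50%": statistics.median(xs),
        "max": max(xs),
    }


def summarize_datetimes(ds):
    """Datetime summary with the same keys; "std" is a NaN placeholder."""
    base = min(ds)
    # Work in seconds relative to the earliest value so mean/median are defined.
    secs = [(d - base).total_seconds() for d in ds]

    def shift(seconds):
        return base + timedelta(seconds=seconds)

    return {
        "count": len(ds),
        "mean": shift(statistics.mean(secs)),
        "std": math.nan,  # placeholder so the keys line up with the numeric case
        "min": base,
        "50%": shift(statistics.median(secs)),
        "max": max(ds),
    }
```

Because both dictionaries share one key order, combining them column by column needs no reindexing, which is the ordering point raised above.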

TomAugspurger · 2020-06-15T15:29:47Z

I was surprised to see that master didn't treat datetimes as numeric in describe. But I don't know how high-priority fixing that is (though a keyword called datetime_is_numeric does bump the priority).

jorisvandenbossche · 2020-06-15T15:32:52Z

I think we should fix it, because if we add a keyword to enable the future behaviour, we should ensure we have the proper future behaviour we want (which is now not yet the case, IMO). Unless we disallow the keyword for dataframe ..
(now, doesn't necessarily need to happen in this PR though)

TomAugspurger · 2020-06-15T15:41:43Z

Fixing that will get a bit messy, since it overlaps with the include and exclude keywords, and it's unclear how timedelta & period should behave (are they also numeric-like? Should the single keyword datetime_is_numeric control all of those? If so, should it be renamed?)

WillAyd · 2020-06-15T18:57:39Z

In the spirit of practicality over purity do we really need to do this? Outside of the Dask test do we expect end users would be relying on the old behavior?

jorisvandenbossche · 2020-06-15T19:53:40Z

it's unclear how timedelta & period should behave (are they also numeric-like? Should the single keyword datetime_is_numeric control all of those? If so, should it be renamed?)

For timedelta that is clear, I think, since that also returns the numeric describe output, while period does not?

jorisvandenbossche · 2020-06-15T19:56:07Z

It seems timedelta is already included by default on the released version:

In [7]: pd.DataFrame({'a': pd.timedelta_range("2012", periods=3), 'b': [1, 2, 3]}).describe()
Out[7]: 
                            a    b
count                       3  3.0
mean   1 days 00:00:00.000002  2.0
std           1 days 00:00:00  1.0
min    0 days 00:00:00.000002  1.0
25%    0 days 12:00:00.000002  1.5
50%    1 days 00:00:00.000002  2.0
75%    1 days 12:00:00.000002  2.5
max    2 days 00:00:00.000002  3.0

TomAugspurger · 2020-06-15T20:33:09Z

In the spirit of practicality over purity do we really need to do this? Outside of the Dask test do we expect end users would be relying on the old behavior?

Hard to say. It's not hard to construct code that relies on the old behavior though.

WillAyd · 2020-06-16T01:35:27Z

My point is that this seems like a lot of churn for potentially negligible added value. I can't think of a pipeline where the old behavior is actually useful, so I don't think it's worth adding a keyword that we subsequently plan to deprecate just for the sake of maintaining compat.

jorisvandenbossche · 2020-06-16T06:39:33Z

There are a lot of things in pandas that I personally don't find particularly useful, but there are probably still a lot of people using those things. So I also find it hard to say whether it is important here. But it also doesn't seem difficult to actually do it with a deprecation.

But whether we go through a deprecation or not, we still need to agree on what we think the new / future behaviour should be.
I would say that it should follow what we already do for timedelta, so include it by default for dataframe.

TomAugspurger · 2020-07-07T14:14:20Z

Split the tests and fixed the merge conflicts.

I'm happy with this as is since it restores the default 1.0 behavior. If we want additional changes then let's do them as followups.

jorisvandenbossche · 2020-07-08T14:36:47Z

I'm happy with this as is since it restores the default 1.0 behavior. If we want additional changes then let's do them as followups.

So I suppose this means you didn't change the dataframe behaviour (which is certainly fine to leave for a follow-up). But should we then maybe raise a NotImplementedError when specifying datetime_is_numeric=True for DataFrame?

jreback · 2020-07-08T15:40:41Z

lgtm. @jorisvandenbossche comment here or in a followup ok

TomAugspurger · 2020-07-13T14:56:15Z

@jorisvandenbossche can you remind me what the dataframe issue is? That datetimes are not included in this output while timedeltas are?

In [4]: pd.DataFrame({'a': pd.date_range("2012", periods=3), 'b': [1, 2, 3]}).describe()
   ...:
Out[4]:
         b
count  3.0
mean   2.0
std    1.0
min    1.0
25%    1.5
50%    2.0
75%    2.5
max    3.0

In [6]: pd.DataFrame({'a': pd.timedelta_range("2012", periods=3), 'b': [1, 2, 3]}).describe()
   ...:
Out[6]:
                               a    b
count                          3  3.0
mean   1 days 00:00:00.000002012  2.0
std              1 days 00:00:00  1.0
min    0 days 00:00:00.000002012  1.0
25%    0 days 12:00:00.000002012  1.5
50%    1 days 00:00:00.000002012  2.0
75%    1 days 12:00:00.000002012  2.5
max    2 days 00:00:00.000002012  3.0

Interestingly, if you have datetime, numeric, and period columns, then the period is not included either:

In [5]: pd.DataFrame({'a': pd.date_range("2012", periods=3), 'b': [1, 2, 3], 'c': pd.period_range('2000', periods=3)}).describe()
   ...:
Out[5]:
         b
count  3.0
mean   2.0
std    1.0
min    1.0
25%    1.5
50%    2.0
75%    2.5
max    3.0

This all seems buggy, but I hope can be handled separately.

jorisvandenbossche · 2020-07-13T19:42:00Z

Yes, that needs to be cleaned up and can indeed be handled separately.

But the one thing that might be relevant for this PR:

pd.DataFrame({'a': pd.date_range("2012", periods=3), 'b': [1, 2, 3]}).describe(datetime_is_numeric=True)

is not actually doing what the keyword is expected to be doing (I suppose, didn't try with this branch).
And if we want to enable that in a follow-up PR, we should maybe not allow one to pass that now for the DataFrame case (because if we would then "fix" it in 1.2, it's kind of a breaking change, although of course nobody should rely on it). So we could raise a NotImplementedError if that keyword is specified and self is a DataFrame.
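
The guard suggested here could look roughly like the sketch below. `ToySeries` and `ToyFrame` are hypothetical stand-ins rather than pandas classes, and the return values are placeholders; the sketch only illustrates rejecting the keyword for the frame case so that later support would not be a silent behavior change. It is not a claim about what was merged.

```python
class ToySeries:
    """Hypothetical stand-in for a Series-like object."""

    def describe(self, datetime_is_numeric=False):
        # The keyword is accepted for the one-dimensional case.
        return "numeric summary" if datetime_is_numeric else "categorical summary"


class ToyFrame:
    """Hypothetical stand-in for a DataFrame-like object."""

    def describe(self, datetime_is_numeric=False):
        if datetime_is_numeric:
            # Refuse rather than silently ignore the flag, so enabling it
            # later does not change results for existing callers.
            raise NotImplementedError(
                "datetime_is_numeric is not supported for frames yet"
            )
        return "frame summary"
```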

TomAugspurger · 2020-07-14T12:49:42Z

@jorisvandenbossche gotcha. Added a test and implemented that.

In [3]: pd.DataFrame({'a': pd.date_range("2012", periods=3), 'b': [1, 2, 3]}).describe(datetime_is_numeric=True)
Out[3]:
                         a    b
count                    3  3.0
mean   2012-01-02 00:00:00  2.0
min    2012-01-01 00:00:00  1.0
25%    2012-01-01 12:00:00  1.5
50%    2012-01-02 00:00:00  2.0
75%    2012-01-02 12:00:00  2.5
max    2012-01-03 00:00:00  3.0
std                    NaN  1.0

jreback · 2020-07-14T17:12:17Z

hopefully fixed the conflict correctly. merging on green.

TomAugspurger · 2020-07-14T19:04:27Z

Thanks. All green.

TomAugspurger · 2020-07-14T19:04:51Z

Err, wait, the whatsnew doesn't look right. Will fix.

jreback · 2020-07-14T20:22:02Z

cool

API: Make describe changes backwards compatible

05bd224

TomAugspurger added the "API Design" and "Deprecate" (Functionality to remove in pandas) labels Jun 15, 2020

TomAugspurger added this to the 1.1 milestone Jun 15, 2020

jorisvandenbossche mentioned this pull request Jun 15, 2020

DEPR: log of deprecations in 1.x (to be removed in 2.0) #30228

Closed

doctest

27e9768

jreback requested changes Jun 15, 2020

TomAugspurger added 3 commits July 7, 2020 09:09

Merge remote-tracking branch 'upstream/master' into 33903-describe

12b4ae2

whatsnew

53cfee8

fixups

6586280

jreback approved these changes Jul 8, 2020

fixup

0222e76

Merge branch 'master' into 33903-describe

5be87df

TomAugspurger added 3 commits July 14, 2020 14:05

Merge remote-tracking branch 'upstream/master' into 33903-describe

c6d5454

doc fixup

166f6f4

newline

122e2f5

jreback merged commit b018691 into pandas-dev:master Jul 14, 2020

TomAugspurger deleted the 33903-describe branch July 14, 2020 20:30

fangchenli pushed a commit to fangchenli/pandas that referenced this pull request Jul 16, 2020

API: Make describe changes backwards compatible (pandas-dev#34798)

cf1b141

jseabold mentioned this pull request Nov 17, 2020

Silence deprecation warning CamDavidsonPilon/lifelines#1165

Merged

jorisvandenbossche mentioned this pull request Jan 22, 2021

Follow-up compatibility issues with pandas dask/dask#7100

Closed

7 tasks

mroeschke mentioned this pull request Oct 28, 2022

DEPR: Remove datetime_is_numeric in describe #49368

Merged

anmyachev mentioned this pull request Jun 3, 2023

FEAT-#5936: support pandas 2.0.2 modin-project/modin#5995

Merged

7 tasks

itholic mentioned this pull request Aug 3, 2023

[SPARK-43873][PS] Enabling FrameDescribeTests apache/spark#42319

Closed

Galaxy3696 mentioned this pull request Feb 6, 2025

Pandas version pinned to <2 OSOceanAcoustics/echoshader#143

Closed
