Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: Make describe changes backwards compatible #34798

Merged
merged 10 commits into from Jul 14, 2020

Conversation

TomAugspurger
Copy link
Contributor

Adds the new behavior as a feature flag / deprecation.

Closes #33903

(Do we have a list of issues for deprecations introduced in 1.x?)

Adds the new behavior as a feature flag / deprecation.

Closes pandas-dev#33903
@TomAugspurger TomAugspurger added API Design Deprecate Functionality to remove in pandas labels Jun 15, 2020
@TomAugspurger TomAugspurger added this to the 1.1 milestone Jun 15, 2020
@TomAugspurger
Copy link
Contributor Author

cc @david-cortes.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add to the deprecation removal list as well for 2.0

pandas/core/generic.py Show resolved Hide resolved
result = df.describe(include="all", datetime_is_numeric=True)
tm.assert_frame_equal(result, expected)

s1_ = s1.describe()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make as a separate test

@@ -98,3 +98,19 @@ def test_describe_with_tz(self, tz_naive_fixture):
index=["count", "mean", "min", "25%", "50%", "75%", "max"],
)
tm.assert_series_equal(result, expected)

with tm.assert_produces_warning(FutureWarning):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

@jorisvandenbossche
Copy link
Member

(Do we have a list of issues for deprecations introduced in 1.x?)

#30228

@jorisvandenbossche
Copy link
Member

BTW, some other remarks on the new behaviour: this should also be enabled in DataFrame.describe by default?
Now (on master), a datetime is not included by default. Which made sense before, as it was regarded as a categorical and not numeric columns. But with datetime_as_numeric, it should be included by default?

Also, I would include "std" as well for datetimes, but with a NaN entry. That makes that the keys are always all the same for all numeric types, regardless of the exact dtype (and that also ensures the ordering is preserved when when doing describe on a dataframe with both numeric and datetime columns)

@TomAugspurger
Copy link
Contributor Author

I was surprised to see that master didn't treat datetimes as numeric in describe. But I don't know how high-priority fixing that is (though a keyword calling datetime_is_numeric does bump the priority).

@jorisvandenbossche
Copy link
Member

I think we should fix it, because if we add a keyword to enable the future behaviour, we should ensure we have the proper future behaviour we want (which is now not yet the case, IMO). Unless we disallow the keyword for dataframe ..
(now, doesn't necessarily need to happen in this PR though)

@TomAugspurger
Copy link
Contributor Author

Fixing that will get a bit messy, since it overlaps with the include and exclude keywords, and it's unclear how timedelta & period should behave (are they also numeric-like? Should the single keyword datetime_is_numeric control all of those? If so, should it be renamed?)

@WillAyd
Copy link
Member

WillAyd commented Jun 15, 2020

In the spirit of practicality over purity do we really need to do this? Outside of the Dask test do we expect end users would be relying on the old behavior?

@jorisvandenbossche
Copy link
Member

it's unclear how timedelta & period should behave (are they also numeric-like? Should the single keyword datetime_is_numeric control all of those? If so, should it be renamed?)

For timedelta that is clear, I think, since that also returns the numeric describe output, while period does not?

@jorisvandenbossche
Copy link
Member

It seems timedelta is alraedy included by default, on released version:

In [7]: pd.DataFrame({'a': pd.timedelta_range("2012", periods=3), 'b': [1, 2, 3]}).describe()                                                                                                                      
Out[7]: 
                            a    b
count                       3  3.0
mean   1 days 00:00:00.000002  2.0
std           1 days 00:00:00  1.0
min    0 days 00:00:00.000002  1.0
25%    0 days 12:00:00.000002  1.5
50%    1 days 00:00:00.000002  2.0
75%    1 days 12:00:00.000002  2.5
max    2 days 00:00:00.000002  3.0

@TomAugspurger
Copy link
Contributor Author

In the spirit of practicality over purity do we really need to do this? Outside of the Dask test do we expect end users would be relying on the old behavior?

Hard to say. It's not hard to construct code that relies on the old behavior though.

@WillAyd
Copy link
Member

WillAyd commented Jun 16, 2020

My point is that this seems like a lot of churn for potentially negligible value add. I can't think of a pipeline where the old behavior is actually useful, so I don't think worth adding a keyword that we subsequently plan to deprecate for the sake of maintaining compat

@jorisvandenbossche
Copy link
Member

There are a lot of things in pandas that I personally don't find particularly useful, but there are probably still a lot of people using those things. So I also find it hard to say whether it is important here. But it also doesn't seem difficult to actually do it with a deprecation.

But whether we go through a deprecation or not, we still need to agree on what we think the new / future behaviour should be.
I would say that it should follow what we already do for timedelta, so include it by default for dataframe.

@TomAugspurger
Copy link
Contributor Author

Split the tests and fixed the merge conflicts.

I'm happy with this as is since it restores the default 1.0 behavior. If we want additional changes then let's do them as followups.

@jorisvandenbossche
Copy link
Member

I'm happy with this as is since it restores the default 1.0 behavior. If we want additional changes then let's do them as followups.

So I suppose this means you didn't change the dataframe behaviour (which is certainly fine to leave for a follow-up). But should be then maybe raise a NotImplementedError when specifying datetime_as_numeric=True for DataFrame?

@jreback
Copy link
Contributor

jreback commented Jul 8, 2020

lgtm. @jorisvandenbossche comment here or in a followup ok

@TomAugspurger
Copy link
Contributor Author

@jorisvandenbossche can you remind me what the dataframe issue is? That datetimes are not included in this output while timedeltas are?

In [4]: pd.DataFrame({'a': pd.date_range("2012", periods=3), 'b': [1, 2, 3]}).describe()
   ...:
Out[4]:
         b
count  3.0
mean   2.0
std    1.0
min    1.0
25%    1.5
50%    2.0
75%    2.5
max    3.0

In [6]: pd.DataFrame({'a': pd.timedelta_range("2012", periods=3), 'b': [1, 2, 3]}).describe()
   ...:
Out[6]:
                               a    b
count                          3  3.0
mean   1 days 00:00:00.000002012  2.0
std              1 days 00:00:00  1.0
min    0 days 00:00:00.000002012  1.0
25%    0 days 12:00:00.000002012  1.5
50%    1 days 00:00:00.000002012  2.0
75%    1 days 12:00:00.000002012  2.5
max    2 days 00:00:00.000002012  3.0

Interestingly if you have datetime, numeric, and timedelta, then the timedelta is not included:

In [5]: pd.DataFrame({'a': pd.date_range("2012", periods=3), 'b': [1, 2, 3], 'c': pd.period_range('2000', periods=3)}).describe()
   ...:
Out[5]:
         b
count  3.0
mean   2.0
std    1.0
min    1.0
25%    1.5
50%    2.0
75%    2.5
max    3.0

This all seems buggy, but I hope can be handled separately.

@jorisvandenbossche
Copy link
Member

Yes, that needs to be cleaned up and can indeed be handled separately.

But the one thing that might be relevant for this PR:

pd.DataFrame({'a': pd.date_range("2012", periods=3), 'b': [1, 2, 3]}).describe(datetime_is_numeric=True)

is not actually doing what the keyword is expected to be doing (I suppose, didn't try with this branch).
And if we want to enable that in a follow-up PR, we should maybe not allow one to pass that now for the DataFrame case (because if we would then "fix" it in 1.2, it's kind of a breaking change, although nobody of course should rely on it). So we could raise a NotImplementedError that keyword is specified and self is a DataFrame.

@TomAugspurger
Copy link
Contributor Author

@jorisvandenbossche gotcha. Added a test and implemented that.

In [3]: pd.DataFrame({'a': pd.date_range("2012", periods=3), 'b': [1, 2, 3]}).describe(datetime_is_numeric=True)
Out[3]:
                         a    b
count                    3  3.0
mean   2012-01-02 00:00:00  2.0
min    2012-01-01 00:00:00  1.0
25%    2012-01-01 12:00:00  1.5
50%    2012-01-02 00:00:00  2.0
75%    2012-01-02 12:00:00  2.5
max    2012-01-03 00:00:00  3.0
std                    NaN  1.0

@jreback
Copy link
Contributor

jreback commented Jul 14, 2020

hopefully fixed the conflict correctly. merging on green.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Jul 14, 2020

Thanks. All green.

@TomAugspurger
Copy link
Contributor Author

Err, wait, the whatsnew doesn't look right. Will fix.

@jreback jreback merged commit b018691 into pandas-dev:master Jul 14, 2020
@jreback
Copy link
Contributor

jreback commented Jul 14, 2020

cool

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Deprecate Functionality to remove in pandas
Projects
None yet
Development

Successfully merging this pull request may close these issues.

API: Revert changes to describe
4 participants