
API/DEPR: numeric_only kwarg for apply/reductions #28900

Closed
jbrockmendel opened this issue Oct 10, 2019 · 17 comments
Labels
Apply Apply, Aggregate, Transform Deprecate Functionality to remove in pandas Needs Discussion Requires discussion from core team before further action Nuisance Columns Identifying/Dropping nuisance columns in reductions, groupby.add, DataFrame.apply Reduction Operations sum, mean, min, max, etc.

Comments

@jbrockmendel
Member

A lot of complexity in DataFrame._reduce/apply/groupby ops is driven by numeric_only=None. In that case we try to apply the reduction to non-numeric dtypes and, if it fails, exclude them. This hugely complicates our exception handling.

We should consider removing that option and requiring users to subset the appropriate columns before doing their operations.
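
The explicit alternative being proposed can be sketched like this (the frame and column names here are illustrative, not from the issue):

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3],          # numeric: kept by the reduction
    "b": [1.5, 2.5, 3.5],    # numeric: kept
    "c": ["x", "y", "z"],    # non-numeric "nuisance" column
})

# Instead of relying on numeric_only=None silently dropping "c",
# the user subsets the numeric columns explicitly first.
result = df.select_dtypes(include="number").mean()
print(result)  # a -> 2.0, b -> 2.5; "c" is simply not included
```
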

@WillAyd
Member

WillAyd commented Oct 10, 2019

This is in need of some cleanup and could certainly be refactored; it might be a tough sell to do away with the implicit dropping of "nuisance" columns, though.

I know the OP mentions DataFrame ops, but it's worth mentioning that this is very prevalent in GroupBy as well.

@TomAugspurger
Contributor

TomAugspurger commented Oct 10, 2019

might be a tough sell to do away with the implicit dropping of "nuisance" columns though.

Agreed with this.

Ignoring object-dtype for the moment, can we determine the valid dtypes to include / drop before doing the operation?

@jbrockmendel
Member Author

@TomAugspurger can you provide some context for the block quote? It looks like a dask thing that might be unrelated (or could be related in a really interesting way).

Ignoring object-dtype for the moment, can we determine the valid dtypes to include / drop before doing the operation?

I think so. Will become more certain as the exception-catching PRs in groupby progress.
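
One way to sketch that up-front determination, using the public `pandas.api.types.is_numeric_dtype` rather than whatever the internals actually do:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

df = pd.DataFrame({
    "vals": [1.0, 2.0],
    "when": pd.to_datetime(["2019-01-01", "2019-01-02"]),
    "name": ["a", "b"],
})

# Decide up front which columns a numeric reduction keeps, instead of
# trying the op on each column and catching whatever it raises.
keep = [col for col in df.columns if is_numeric_dtype(df[col])]
print(keep)  # ['vals']
```
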

@TomAugspurger
Contributor

Copy-paste fail. Fixed.

@jbrockmendel
Member Author

Related: what would you expect the behavior to be if you did df.groupby("group").mean(numeric_only=False) on a DataFrame that includes non-numeric columns?

Motivated by a case I'm troubleshooting in tests.groupby.test_function.test_arg_passthru where it looks like it is trying the op on non-numeric (in this case categorical[str]) columns and then dropping those.

@jbrockmendel
Member Author

@jorisvandenbossche you might be interested in weighing in here. The current behavior of suppressing exceptions and excluding columns is a big part of the reason why groupby is so hard to reason about.

@jorisvandenbossche
Member

I agree that it seems a big breakage to go away from the automatic dropping of "nuisance" columns.

what would you expect the behavior to be if you did df.groupby("group").mean(numeric_only=False) on a DataFrame that includes non-numeric columns?

I would expect it to raise, like it does for df.mean(numeric_only=False) ...
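
For comparison, a sketch of the DataFrame case being referenced; in recent pandas versions the string column makes the reduction raise `TypeError` (the exact message varies by version):

```python
import pandas as pd

df = pd.DataFrame({"val": [1, 2, 3], "txt": ["x", "y", "z"]})

# numeric_only=False means "really try every column", so the string
# column makes the reduction raise instead of being silently dropped.
try:
    df.mean(numeric_only=False)
    raised = False
except TypeError:
    raised = True
print(raised)
```
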

@jbrockmendel
Member Author

@jreback on the call today you seemed positive on this idea and asked how many tests rely on the current behavior. Looks like 50 total: roughly a third in DataFrame reductions (mostly trying to do math on strings), a third in groupby.apply/agg (some strings, some Categorical), and the rest in window ops (mostly dt64).

I've spent a big part of the afternoon trying to get the deprecation for #36076 right and it's a PITA; it would be a lot easier after this change.

@jorisvandenbossche
Member

jorisvandenbossche commented Sep 5, 2020

I've spent a big part of the afternoon trying to get the deprecation for #36076 right and its a PITA, would be a lot easier following this.

Note that datetime/timedelta are numeric dtypes (correct?) but still don't support any/all, so I don't think it's fully related? (for categorical it's of course different)

@jbrockmendel jbrockmendel reopened this Sep 17, 2020
@jbrockmendel jbrockmendel added the Nuisance Columns Identifying/Dropping nuisance columns in reductions, groupby.add, DataFrame.apply label Sep 21, 2020
@jbrockmendel
Member Author

Thinking about how to actually implement this deprecation, it would look something like:

if numeric_only is not False:
    warnings.warn(
        "numeric_only=whatever is deprecated and will raise in a future "
        "version. Instead, select the columns you want to operate on and "
        "use `df[columns].whatever(numeric_only=False)`",
        FutureWarning,
    )

and then I guess, after that is enforced in 2.0, we do another deprecation to get rid of the kwarg altogether?

@jreback
Contributor

jreback commented Sep 25, 2020

don't we normally just change the default to None? Then we can easily tell if the user changed it

and in 2.0 just drop it entirely

@jbrockmendel
Member Author

numeric_only=None is the default, and it's the most problem-causing of the three options (True, False, None)

@jreback
Contributor

jreback commented Sep 25, 2020

Ah, then can we use lib.no_default?

we want to warn if it's explicitly passed

@jbrockmendel
Member Author

we want to warn if it's explicitly passed

Yes, but the behavior is going to change even if it is not explicitly passed, so we need to warn in that case too.
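
The sentinel idea under discussion might look roughly like the following pure-Python sketch; `no_default` here is a stand-in object, not pandas' actual `lib.no_default`, and `mean` is a toy function, not the real method:

```python
import warnings

no_default = object()  # stand-in for pandas._libs.lib.no_default

def mean(numeric_only=no_default):
    explicitly_passed = numeric_only is not no_default
    if not explicitly_passed:
        numeric_only = None  # the current behavioural default
    # Warn whenever the future behaviour would differ, whether or not
    # the kwarg was spelled out explicitly.
    if numeric_only is not False:
        warnings.warn(
            "numeric_only=None/True is deprecated; select columns "
            "explicitly and pass numeric_only=False instead.",
            FutureWarning,
        )

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    mean()                    # default still warns: behaviour will change
    mean(numeric_only=False)  # opted in to future behaviour: no warning

print(len(caught))  # 1
```
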

@jorisvandenbossche
Member

Personally, I am not yet fully convinced we actually want to deprecate this. Automatically dropping columns that are not relevant for the aggregation is also quite convenient, and I am sure a ton of code relies on this behaviour.
(and I have the impression others also are not yet convinced in the comments above)

But that doesn't mean we can't try to improve the complexity of the situation. For example echoing @TomAugspurger's comment above (#28900 (comment)), would it be possible to more deterministically know which columns will be included?
And if it is only the object dtype columns that are the problem / introduce the complexities, we could maybe also think about only changing something for object dtype?

Thinking about how to actually implement this deprecation, it would look something like:

If we deprecate, I think we should ensure we only raise a warning when something would actually change for the user. So e.g. when a frame has only numerical columns, the exact value of numeric_only has no impact, and you shouldn't see any warning?
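
A sketch of that "only warn when it matters" check, using a hypothetical `maybe_warn` helper (not a real pandas function):

```python
import warnings

import pandas as pd
from pandas.api.types import is_numeric_dtype

def maybe_warn(df):
    """Warn about numeric_only only when a column would actually be dropped."""
    if not all(is_numeric_dtype(df[col]) for col in df.columns):
        warnings.warn(
            "numeric_only=None will silently drop non-numeric columns; "
            "this behaviour is deprecated.",
            FutureWarning,
        )

all_numeric = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})
mixed = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    maybe_warn(all_numeric)  # nothing would change: no warning
    maybe_warn(mixed)        # "b" would be dropped: warn

print(len(caught))  # 1
```
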

@jbrockmendel
Member Author

AFAICT there are two options to achieve internally consistent behavior:

  1. Remove numeric_only altogether (this issue)
  2. Keep it, but make DataFrame._reduce either:
    a) always operate block-wise (column-wise would be fine too), or
    b) operate column-wise for object-dtype;
    c) even then, axis=1 is a PITA

My main preference is to fix as much as possible of these without waiting for 2.0.

If we deprecate, I think we should ensure we only raise a warning when something would actually change for the user. So e.g. when a frame has only numerical columns, the exact value of numeric_only has no impact, and you shouldn't see any warning?

For sufficiently simple cases this is doable, but things get pretty FUBAR pretty quickly.

@jbrockmendel
Member Author

Closed by #43743, #43741, #42834, #41480, #41475
