
API/DEPR: numeric_only kwarg for apply/reductions #28900

Closed
jbrockmendel opened this issue Oct 10, 2019 · 17 comments
Labels
Apply Apply, Aggregate, Transform Deprecate Functionality to remove in pandas Needs Discussion Requires discussion from core team before further action Nuisance Columns Identifying/Dropping nuisance columns in reductions, groupby.add, DataFrame.apply Reduction Operations sum, mean, min, max, etc.

Comments

@jbrockmendel
Member

A lot of complexity in DataFrame._reduce/apply/groupby ops is driven by numeric_only=None. In that case we try to apply the reduction to non-numeric dtypes and, if it fails, exclude them. This hugely complicates our exception handling.

We should consider removing that option and requiring users to subset the appropriate columns before doing their operations.
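
The explicit alternative being proposed can be sketched like this (the frame and column names here are illustrative, not from the issue):

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3],          # numeric: kept by the reduction
    "b": [1.5, 2.5, 3.5],    # numeric: kept
    "c": ["x", "y", "z"],    # non-numeric "nuisance" column
})

# Instead of relying on numeric_only=None silently dropping "c",
# the user subsets the numeric columns explicitly first.
result = df.select_dtypes(include="number").mean()
print(result)  # a -> 2.0, b -> 2.5; "c" is simply not included
```
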

@WillAyd
Member

WillAyd commented Oct 10, 2019

This is in need of some cleanup and could certainly be refactored; it might be a tough sell to do away with the implicit dropping of "nuisance" columns, though.

I know the OP mentions DataFrame ops, but it's worth mentioning that this is very prevalent in GroupBy as well.

@TomAugspurger
Contributor

TomAugspurger commented Oct 10, 2019

might be a tough sell to do away with the implicit dropping of "nuisance" columns though.

Agreed with this.

Ignoring object-dtype for the moment, can we determine the valid dtypes to include / drop before doing the operation?

@jbrockmendel
Member Author

@TomAugspurger can you provide some context for the block quote? It looks like a dask thing that might be unrelated (or could be related in a really interesting way).

Ignoring object-dtype for the moment, can we determine the valid dtypes to include / drop before doing the operation?

I think so. Will become more certain as the exception-catching PRs in groupby progress.
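
One way to sketch that up-front determination, using the public `pandas.api.types.is_numeric_dtype` rather than whatever the internals actually do:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

df = pd.DataFrame({
    "vals": [1.0, 2.0],
    "when": pd.to_datetime(["2019-01-01", "2019-01-02"]),
    "name": ["a", "b"],
})

# Decide up front which columns a numeric reduction keeps, instead of
# trying the op on each column and catching whatever it raises.
keep = [col for col in df.columns if is_numeric_dtype(df[col])]
print(keep)  # ['vals']
```
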

@TomAugspurger
Contributor

Copy-paste fail. Fixed.

@jbrockmendel
Member Author

Related: what would you expect the behavior to be if you did df.groupby("group").mean(numeric_only=False) on a DataFrame that includes non-numeric columns?

Motivated by a case I'm troubleshooting in tests.groupby.test_function.test_arg_passthru where it looks like it is trying the op on non-numeric (in this case categorical[str]) columns and then dropping those.

@jbrockmendel
Member Author

@jorisvandenbossche you might be interested in weighing in here. The current behavior of suppressing exceptions and excluding columns is a big part of the reason why groupby is so hard to reason about.

@jorisvandenbossche
Member

I agree that it seems a big breakage to go away from the automatic dropping of "nuisance" columns.

what would you expect the behavior to be if you did df.groupby("group").mean(numeric_only=False) on a DataFrame that includes non-numeric columns?

I would expect it to raise, like it does for df.mean(numeric_only=False) ...
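
For comparison, a sketch of the DataFrame case being referenced; in recent pandas versions the string column makes the reduction raise `TypeError` (the exact message varies by version):

```python
import pandas as pd

df = pd.DataFrame({"val": [1, 2, 3], "txt": ["x", "y", "z"]})

# numeric_only=False means "really try every column", so the string
# column makes the reduction raise instead of being silently dropped.
try:
    df.mean(numeric_only=False)
    raised = False
except TypeError:
    raised = True
print(raised)
```
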

@jbrockmendel
Member Author

@jreback on the call today you seemed positive on this idea and asked how many tests rely on the current behavior. Looks like 50 total: roughly a third in DataFrame reductions (mostly trying to do math on strings), a third in groupby.apply/agg (some strings, some Categorical), and the rest in window ops (mostly dt64).

I've spent a big part of the afternoon trying to get the deprecation for #36076 right and it's a PITA; it would be a lot easier after this change.

@jorisvandenbossche
Member

jorisvandenbossche commented Sep 5, 2020

I've spent a big part of the afternoon trying to get the deprecation for #36076 right and its a PITA, would be a lot easier following this.

Note that datetime/timedelta are numeric dtypes (correct?) but still don't support any/all, so I don't think it's fully related? (for categorical it's of course different)

@jbrockmendel jbrockmendel reopened this Sep 17, 2020
@jbrockmendel jbrockmendel added the Nuisance Columns Identifying/Dropping nuisance columns in reductions, groupby.add, DataFrame.apply label Sep 21, 2020
@jbrockmendel
Member Author

Thinking about how to actually implement this deprecation, it would look something like:

if numeric_only is not False:
    warnings.warn(
        "numeric_only=whatever is deprecated and will raise in a future "
        "version. Instead, select the columns you want to operate on and "
        "use `df[columns].whatever(numeric_only=False)`",
        FutureWarning,
    )

and then I guess, after that is enforced in 2.0, we do another deprecation to get rid of the kwarg altogether?

@jreback
Contributor

jreback commented Sep 25, 2020

don't we normally just change the default to None? Then we can easily tell if the user changed it

and in 2.0 just drop it entirely

@jbrockmendel
Member Author

numeric_only=None is the default, and it's the most problem-causing of the three options (True, False, None)

@jreback
Contributor

jreback commented Sep 25, 2020

Ah, then can we use lib.no_default?

we want to warn if it's explicitly passed

@jbrockmendel
Member Author

we want to warn if it's explicitly passed

Yes, but the behavior is going to change even if it is not explicitly passed, so we need to warn in that case too.
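
The sentinel idea under discussion might look roughly like the following pure-Python sketch; `no_default` here is a stand-in object, not pandas' actual `lib.no_default`, and `mean` is a toy function, not the real method:

```python
import warnings

no_default = object()  # stand-in for pandas._libs.lib.no_default

def mean(numeric_only=no_default):
    explicitly_passed = numeric_only is not no_default
    if not explicitly_passed:
        numeric_only = None  # the current behavioural default
    # Warn whenever the future behaviour would differ, whether or not
    # the kwarg was spelled out explicitly.
    if numeric_only is not False:
        warnings.warn(
            "numeric_only=None/True is deprecated; select columns "
            "explicitly and pass numeric_only=False instead.",
            FutureWarning,
        )

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    mean()                    # default still warns: behaviour will change
    mean(numeric_only=False)  # opted in to future behaviour: no warning

print(len(caught))  # 1
```
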

@jorisvandenbossche
Member

Personally, I am not yet fully convinced we actually want to deprecate this. Automatically dropping columns that are not relevant for the aggregation is also quite convenient, and I am sure a ton of code relies on this behaviour.
(and I have the impression others also are not yet convinced in the comments above)

But that doesn't mean we can't try to improve the complexity of the situation. For example echoing @TomAugspurger's comment above (#28900 (comment)), would it be possible to more deterministically know which columns will be included?
And if it is only the object dtype columns that are the problem / introduce the complexities, we could maybe also think about only changing something for object dtype?

Thinking about how to actually implement this deprecation, it would look something like:

If we deprecate, I think we should ensure we only raise a warning when something would actually change for the user. So e.g. when a frame has only numerical columns, the exact value of numeric_only has no impact, and you shouldn't see any warning?
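
A sketch of that "only warn when it matters" check, using a hypothetical `maybe_warn` helper (not a real pandas function):

```python
import warnings

import pandas as pd
from pandas.api.types import is_numeric_dtype

def maybe_warn(df):
    """Warn about numeric_only only when a column would actually be dropped."""
    if not all(is_numeric_dtype(df[col]) for col in df.columns):
        warnings.warn(
            "numeric_only=None will silently drop non-numeric columns; "
            "this behaviour is deprecated.",
            FutureWarning,
        )

all_numeric = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})
mixed = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    maybe_warn(all_numeric)  # nothing would change: no warning
    maybe_warn(mixed)        # "b" would be dropped: warn

print(len(caught))  # 1
```
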

@jbrockmendel
Member Author

AFAICT there are two options to achieve internally consistent behavior:

  1. Remove numeric_only altogether (this issue)
  2. Keep it, but make DataFrame._reduce either:
    a) always operate block-wise (column-wise would be fine too), or
    b) operate column-wise for object-dtype;
    c) even then, axis=1 is a PITA

My main preference is to fix as much as possible of these without waiting for 2.0.

If we deprecate, I think we should ensure we only raise a warning when something would actually change for the user. So e.g. when a frame has only numerical columns, the exact value of numeric_only has no impact, and you shouldn't see any warning?

For sufficiently simple cases this is doable, but things get pretty FUBAR pretty quickly.

@jbrockmendel
Member Author

Closed by #43743, #43741, #42834, #41480, #41475
