Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cythonized GroupBy mad #20024

Closed
wants to merge 9 commits into from
Closed

Conversation

WillAyd
Copy link
Member

@WillAyd WillAyd commented Mar 7, 2018

       before           after         ratio
     [01b91c26]       [9d7f0ac9]
+         678±3μs         807±10μs     1.19  groupby.GroupByMethods.time_method('object', 'value_counts')
+       116±0.8μs          135±1μs     1.16  groupby.GroupByMethods.time_method('object', 'shift')
+        777±10μs         888±10μs     1.14  groupby.GroupByMethods.time_method('object', 'unique')
+      71.6±0.9μs       81.4±0.4μs     1.14  groupby.GroupByMethods.time_method('object', 'size')
-           730ms       2.87±0.4ms     0.00  groupby.GroupByMethods.time_method('int', 'mad')
-           1.23s       3.05±0.5ms     0.00  groupby.GroupByMethods.time_method('float', 'mad')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@@ -12,7 +12,7 @@

AGG_FUNCTIONS = ['sum', 'prod', 'min', 'max', 'median', 'mean', 'skew',
'mad', 'std', 'var', 'sem']
AGG_FUNCTIONS_WITH_SKIPNA = ['skew', 'mad']
AGG_FUNCTIONS_WITH_SKIPNA = ['skew']
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Speaking to why I did this - since mad uses the mean behind the scenes I figured it made sense to rely on mean to do as much of the heavy lifting as possible. Unfortunately mean doesn't currently handle the skipna parameter but I think it's worth addressing within that function and allowing that to pass through to mad rather than implementing specifically within mad.

As always glad to open an issue for that if you agree on approach


# Wrap in a try..except to catch a TypeError with bool data
# Ideally this would be implemented in `mean` instead of here
try:
Copy link
Member Author

@WillAyd WillAyd Mar 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As touched on in the comments I ideally would not want this try...except and think instead that the mean application should be throwing the error. While mean does raise for object types, something like pd.Series([True, False, True]).mean() is entirely valid and therefore this ends up throwing a TypeError in subsequent operation

I think mean should raise on boolean data and have that pass through. Will open separate issue if you agree

@@ -1300,17 +1301,6 @@ def test_non_cython_api(self):
g = df.groupby('A')
gni = df.groupby('A', as_index=False)

# mad
Copy link
Member Author

@WillAyd WillAyd Mar 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test was failing with this change because it allowed for mad operations on object types, returning NaN rather than raising. I figured the explicit tests added elsewhere made more sense and that raising is preferable to NaN

There's also some code that tests any further down within this test. Didn't check in detail but can probably be removed on account of changes done in #19722. Assuming I have updates I'll include in the next batch or alternately open up a separate issue

@jreback
Copy link
Contributor

jreback commented Mar 7, 2018

i think we r going to remove mad (there is an issue about this); so not sure we should add this

@WillAyd
Copy link
Member Author

WillAyd commented Mar 7, 2018

I assume you are referencing #11787. FWIW it would be a pretty trivial change here to support 'median' as well.

OK with deprecating, but I will say the one thing that we may want to consider is how users would roll this on their own. Operating on a series is for sure straightforward:

abs(df['val'] - df['val'].mean()).mean()

But attempting the same on a GroupBy object will Raise

abs(df.groupby('key') - df.groupby('key').mean()).mean()
ValueError: Unable to coerce to Series, length must be 1: given 2

So the user would be responsible for some heavier lifting on that side, assuming there's no built-in way for the plain GroupBy object to handle subtraction of its aggregated result.

@jreback
Copy link
Contributor

jreback commented Mar 7, 2018

try with a group by and then transform
this is a typical pattern

@WillAyd
Copy link
Member Author

WillAyd commented Mar 7, 2018

Assuming you mean

abs(df - df.groupby('key').transform('mean')).mean()

It's possible but then requires the user to explicitly drop the grouped field(s) somewhere in the operation.

A similar implementation (which I used here) doesn't add the grouped fields but is definitely more verbose than what would be required for the Series / DataFrame counterparts

abs(df.groupby('key').shift(0) - df.groupby('key').transform('mean')).mean()

@jreback
Copy link
Contributor

jreback commented Mar 7, 2018

not sure why u need .shift
also you don’t need the final mean
not sure what that is for

@WillAyd
Copy link
Member Author

WillAyd commented Mar 7, 2018

If you remove the shift you get

ValueError: Unable to coerce to Series, length must be 1: given 2

The inner mean is the "centering" op and the outer is the resulting op. I think this comment sums up the possible combinations pretty well

@shoyer
Copy link
Member

shoyer commented Mar 7, 2018

In xarray we support arithmetic with GroupBy objects so your example would actually work:

abs(df.groupby('key') - df.groupby('key').mean()).mean()

It would be interesting to explore porting this syntax to pandas. Xarray users find it pretty useful for doing these sorts of grouped normalizations (which are common in climate science).

@gfyoung gfyoung added Groupby Performance Memory or execution speed performance labels Mar 8, 2018
@WillAyd
Copy link
Member Author

WillAyd commented Mar 8, 2018

@shoyer that could be a viable option here, and certainly make the mad operation across DataFrame / Series / GroupBy consistent while opening up some other possibilities (ex: easy demeaning).

Do you know where that is implemented in xarray? Dug through the source but nothing was immediately apparent to me

@shoyer
Copy link
Member

shoyer commented Mar 8, 2018

We use some awkward machinery for defining binary ops in xarray (you'd need to figure out how to do this for pandas), but here's where the core groupby arithmetic logic is defined:
https://github.com/pydata/xarray/blob/870e4eaf1895cfeffdc27dab61ad739e67133777/xarray/core/groupby.py#L301-L332

I think you could do something pretty similar for pandas, though obviously the implementation would be pretty different (e.g., you could use .loc instead of .sel()).

@jreback
Copy link
Contributor

jreback commented Jul 8, 2018

@WillAyd ideally like to move the groupby cython routines to pandas/core/groupby/cython before we attempt this

@jbrockmendel jbrockmendel mentioned this pull request Aug 11, 2018
4 tasks
@WillAyd
Copy link
Member Author

WillAyd commented Nov 13, 2018

Closing due to potential deprecation of method cited in #11787

@WillAyd WillAyd closed this Nov 13, 2018
@WillAyd WillAyd deleted the grp-mad-perf branch January 16, 2020 00:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Groupby Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants