PERF: axis=1 reductions with EA dtypes #54341

Merged
merged 13 commits into pandas-dev:main on Aug 13, 2023

Conversation

@lukemanley (Member) commented Aug 1, 2023

cc @rhshadrach - this includes the test you wrote in #51923 (with a few edits) as it looks like axis=1 reductions with EA dtypes are not well tested.

This PR avoids special-casing the internal EA types (although special-casing might allow for further optimization).

xref: #51474

Timings on a slow laptop:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10000, 4), dtype="float64[pyarrow]")
%timeit df.sum(axis=1)

# 3.79 s ± 137 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)      -> main
# 1.35 s ± 40.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)     -> 2.0.3
# 250 ms ± 53 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)       -> 1.5.3
# 5.42 ms ± 306 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  -> PR

df = pd.DataFrame(np.random.randn(10000, 4), dtype="Float64")
%timeit df.sum(axis=1)

# 1.85 s ± 44.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)     -> main
# 1.2 s ± 58.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)      -> 2.0.3
# 16.4 ms ± 745 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  -> 1.5.3
# 5.15 ms ± 286 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  -> PR
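For context, here is a simplified sketch of the idea (not the exact code in this PR): concatenate the EA-backed columns into one long 1D array and express the axis=1 reduction as a groupby over the repeated row positions.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10000, 4), dtype="Float64")
nrows, ncols = df.shape

# one long 1D extension array holding the columns back-to-back (one copy)
arr = pd.concat([df[col] for col in df.columns], ignore_index=True).array

# element i of column j sits at position j * nrows + i, so tiling the row
# positions once per column labels every value with its row
row_index = np.tile(np.arange(nrows), ncols)

# grouping by row position and reducing gives the axis=1 result via the fast
# 1D groupby machinery instead of a slow per-row path
result = pd.Series(arr, copy=False).groupby(row_index).sum()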

@lukemanley added the Performance, ExtensionArray, and Reduction Operations labels on Aug 1, 2023
@rhshadrach (Member) commented Aug 1, 2023

Wow - this is quite clever! On my AMD Ryzen 9 5950X machine, I get

df = pd.DataFrame(np.random.randn(10000, 4), dtype="float64[pyarrow]")
%timeit df.sum(axis=1)
698 ms ± 2.82 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  <-- main
929 µs ± 4.33 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)  <-- PR

df = pd.DataFrame(np.random.randn(10000, 4), dtype="Float64")
%timeit df.sum(axis=1)
401 ms ± 1.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  <-- main
1.16 ms ± 2.52 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)  <-- PR

Also, for wide inputs there appears to be no slowdown:

df = pd.DataFrame(np.random.randn(4, 10000), dtype="float64[pyarrow]")
%timeit df.sum(axis=1)
29 ms ± 157 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  <-- main
28.2 ms ± 440 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  <-- PR

df = pd.DataFrame(np.random.randn(4, 10000), dtype="Float64")
%timeit df.sum(axis=1)
22.9 ms ± 48.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) <-- main
22.4 ms ± 273 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) <-- PR

Edit:

I also ran ASVs for frame_methods and stat_ops. frame_methods showed no change; stat_ops:

       before           after         ratio
     [c93e8034]       [8335a189]
     <main>           <perf-axis1-ea-reductions>
-      4.25±0.03s       10.2±0.3ms     0.00  stat_ops.FrameOps.time_op('sum', 'Int64', 1)
-         4.24±0s       10.0±0.3ms     0.00  stat_ops.FrameOps.time_op('prod', 'Int64', 1)
-      6.85±0.06s       13.0±0.3ms     0.00  stat_ops.FrameOps.time_op('skew', 'Int64', 1)
-      7.66±0.03s       13.7±0.3ms     0.00  stat_ops.FrameOps.time_op('median', 'Int64', 1)
-      7.46±0.04s       10.6±0.3ms     0.00  stat_ops.FrameOps.time_op('var', 'Int64', 1)
-      7.61±0.04s       10.8±0.3ms     0.00  stat_ops.FrameOps.time_op('std', 'Int64', 1)

@lukemanley (Member, Author) commented Aug 1, 2023

> I also ran ASVs for frame_methods and stat_ops. frame_methods showed no change

Isn't that showing a huge change, seconds -> milliseconds?

Edit: never mind, I see the "no change" comment was for frame_methods. I'm not sure whether any of the benchmarks in frame_methods would be affected by this, but it also looks like frame_methods generally does not cover EA dtypes.

@rhshadrach (Member)

Right - I wasn't sure where axis=1 ops might show up (haven't thoroughly looked through the benchmarks); these seemed to be the most likely places.

pandas/core/arrays/masked.py (review thread: outdated, resolved)
Comment on lines +1909 to +1910
if how in ["any", "all"] and isinstance(values, SparseArray):
pass
Member:

It looks like this is a general issue with SparseArray any/all, so really independent of this PR - is that right? I'm thinking this should be fixed in SparseArray itself rather than in the groupby code. Would it be okay to xfail any relevant tests?

Member:

Ah, I see the issue with my request above; this would make axis=1 fail for SparseArray whereas it didn't before. I would be okay opening up an issue.

Member Author:

Sure - by opening an issue, do you mean xfail for now as part of this PR, or open an issue and address that first before this?

Member:

Leave this PR as-is; open an issue to clean up after this is merged (before is okay too).

Member Author:

ah - got it. thanks

pandas/tests/frame/test_reductions.py (review thread, resolved)
nrows, ncols = df.shape
row_index = np.tile(np.arange(nrows), ncols)
col_index = np.repeat(np.arange(ncols), nrows)
ser = Series(arr, index=col_index)
Member:

Can you add comments on what is going on here? I think I get it, but it could be non-obvious to someone who doesn't know e.g. groupby.agg.

Member Author:

updated with comments

@jbrockmendel (Member)

How would this compare to using the _reduce_axis1 approach? I think that would avoid copies/allocations. Agreed this is clever, but if there's a way to avoid depending on the groupby code I'd prefer it.

@lukemanley (Member, Author)

> How would this compare to using the _reduce_axis1 approach? I think that would avoid copies/allocations. Agreed this is clever, but if there's a way to avoid depending on the groupby code I'd prefer it.

I might be missing something but my read of _reduce_axis1 is that it would only provide a benefit for 2D arrays whereas this specifically targets EA (1D) arrays.

@rhshadrach (Member)

@lukemanley - the rough idea of _reduce_axis1 is to do df['a'] + df['b'] instead of transposing and summing along axis 0. This will help EAs but not 2D blocks (which have a fast transpose).

@jbrockmendel - I think looking into the _reduce_axis1 approach is a good idea, but it won't work for reductions that don't have an online algorithm, e.g. median.
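For illustration, a simplified sketch of that column-by-column idea for sum (ignoring skipna/NA handling, which a real implementation would need):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10000, 4), dtype="Float64")

# accumulate across columns one at a time; each step is a fast 1D EA op,
# and no transpose or long 1D copy is needed
result = df.iloc[:, 0].copy()
for i in range(1, df.shape[1]):
    result = result + df.iloc[:, i]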

@lukemanley (Member, Author)

The additional copy required to construct the 1D array would be nice to avoid. I took a quick look at using _reduce_axis1 for sum and it looks promising - comparable performance to this PR and avoids the copy. As @rhshadrach points out it would only be an option for a subset of reduction types. Any thoughts on what to do with the other reductions?

Somewhat surprisingly, with this PR axis=1 reductions seem to be faster than axis=0 reductions for EA dtypes:

import pandas as pd
import numpy as np

data = np.random.randn(1000, 1000)

df = pd.DataFrame(data, dtype="Float64")
%timeit df.sum(axis=0)
%timeit df.sum(axis=1)
# 150 ms ± 5.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# 62.7 ms ± 3.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

df = pd.DataFrame(data, dtype="float64[pyarrow]")
%timeit df.sum(axis=0)
%timeit df.sum(axis=1)
# 301 ms ± 38.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# 63.6 ms ± 4.49 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

@jbrockmendel (Member)

> Any thoughts on what to do with the other reductions?

median comes to mind. Are there others?

At some point, for #53261, I'm hoping to add chunked-friendly implementations of the groupby reductions. I think that, combined with the trick in this PR, could get the best of both worlds by avoiding the copy.

@@ -11083,6 +11083,25 @@ def _get_data() -> DataFrame:
).iloc[:0]
result.index = df.index
return result

if df.shape[1] and name != "kurt":
Member:

IIUC kurt is excluded here because GroupBy doesn't support it. Can you add a comment to that effect?

Member Author:

added a comment here

@rhshadrach (Member)

> Any thoughts on what to do with the other reductions?

> median comes to mind. Are there others?

It looks like just nunique and quantile.

@rhshadrach (Member)

> Somewhat surprisingly, with this PR axis=1 reductions seem to be faster than axis=0 reductions for EA dtypes:

It appears as if axis=0 has more overhead; I don't believe the results hold up at larger sizes, e.g. (10000, 10000).

@jbrockmendel (Member)

> It appears as if axis=0 has more overhead; I don't believe the results hold up at larger sizes, e.g. (10000, 10000).

That seems reasonable. I'd be interested in seeing profiling results though!

@rhshadrach (Member) commented Aug 5, 2023

> It appears as if axis=0 has more overhead; I don't believe the results hold up at larger sizes, e.g. (10000, 10000).

> That seems reasonable. I'd be interested in seeing profiling results though!

On my slower laptop now:

data = np.random.randn(10000, 10000)

df = pd.DataFrame(data, dtype="Float64")
%timeit df.sum(axis=0)
%timeit df.sum(axis=1)
# 1.02 s ± 5.07 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# 2.62 s ± 46.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

df = pd.DataFrame(data, dtype="float64[pyarrow]")
%timeit df.sum(axis=0)
%timeit df.sum(axis=1)
# 1.24 s ± 7.72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# 2.55 s ± 5.26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@rhshadrach (Member) left a comment

lgtm; cc @jbrockmendel

@rhshadrach added this to the 2.1 milestone on Aug 5, 2023
@rhshadrach (Member) commented Aug 5, 2023

I would be okay with saying this closes the last two of the three issues mentioned in the OP, leaving the _reduce_axis1 idea open for further improvements.

nrows, ncols = df.shape
row_index = np.tile(np.arange(nrows), ncols)
col_index = np.repeat(np.arange(ncols), nrows)
ser = Series(arr, index=col_index)
Member:

Can you do copy=False here for CoW?

Member Author:

updated. thanks
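(For reference, the change presumably amounts to constructing the Series without a defensive copy, continuing the snippet above:)

ser = Series(arr, index=col_index, copy=False)  # avoid copying the freshly concatenated array under CoW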

@rhshadrach (Member)

@lukemanley - there's a whatsnew conflict here

@rhshadrach merged commit 19c8a4a into pandas-dev:main on Aug 13, 2023
37 checks passed
@rhshadrach (Member)

Thanks @lukemanley - very nice!

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Aug 13, 2023
rhshadrach pushed a commit that referenced this pull request Aug 13, 2023
…types) (#54528)

Backport PR #54341: PERF: axis=1 reductions with EA dtypes

Co-authored-by: Luke Manley <lukemanley@gmail.com>
@topper-123 (Contributor)

Yeah, very nice indeed 👍 .

@jbrockmendel (Member)

Might this also close #28487? And, less likely but related, #36123?

@lukemanley (Member, Author)

> Might this also close #28487? And, less likely but related, #36123?

doesn't look like it closes either of these

Labels: ExtensionArray, Performance, Reduction Operations
Projects: None yet
5 participants