PERF: DataFrame.where for EA dtype mask #51574

Merged
merged 6 commits into pandas-dev:main from the where-putmask-ea-boolean branch on Feb 25, 2023

Conversation

lukemanley
Member

@lukemanley lukemanley commented Feb 23, 2023

Perf improvement for DataFrame.where when the mask/cond has the "boolean" extension array (EA) dtype.

Ending up with an EA mask is common when working with EA-backed data:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 2), dtype="Float64")
mask = df > 0

print(mask.dtypes) 

# 0    boolean
# 1    boolean
# dtype: object
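
To make the mask dtype concrete, here is a minimal sketch (an illustration, not code from this PR) that converts such a mask to a plain NumPy bool array through the public `DataFrame.to_numpy` API. The fixed seed, the variable names, and the choice of `na_value=False` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed: an assumption, for repeatability
df = pd.DataFrame(rng.standard_normal((10, 2)), dtype="Float64")
mask = df > 0  # every column has the "boolean" EA dtype

# Default conversion; for EA columns this is typically an object array.
default_arr = mask.to_numpy()
print(default_arr.dtype)

# Requesting bool explicitly yields a NumPy bool array;
# na_value decides what a missing mask entry becomes.
bool_arr = mask.to_numpy(dtype=bool, na_value=False)
print(bool_arr.dtype)
```

The explicit `dtype=bool` request is the point of interest here: it sidesteps whatever generic conversion would otherwise be applied to the EA mask.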

ASV added:

       before           after         ratio
     [ddae6946]       [48684e5c]
     <main>           <where-putmask-ea-boolean>
-      51.0±0.7ms       18.4±0.4ms     0.36  frame_methods.Where.time_where(False, 'Float64')
-      45.3±0.6ms       13.5±0.4ms     0.30  frame_methods.Where.time_where(True, 'Float64')
-      44.4±0.8ms       13.1±0.4ms     0.29  frame_methods.Where.time_where(False, 'float64[pyarrow]')
-      47.2±0.9ms       13.5±0.7ms     0.29  frame_methods.Where.time_where(True, 'float64[pyarrow]')
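
For readers unfamiliar with ASV, a benchmark producing names like `frame_methods.Where.time_where(inplace, dtype)` could look roughly like the sketch below. This is an assumed reconstruction, not the benchmark added in this PR: the frame shape, the `other=0.0` fill, and the reduced dtype list (pyarrow omitted so the sketch runs without it) are all assumptions.

```python
import numpy as np
import pandas as pd


class Where:
    # Hypothetical ASV-style benchmark; params and sizes are assumptions.
    params = ([True, False], ["float64", "Float64"])
    param_names = ["inplace", "dtype"]
    shape = (100_000, 10)

    def setup(self, inplace, dtype):
        # Build the frame and its mask outside the timed section.
        self.df = pd.DataFrame(np.random.randn(*self.shape), dtype=dtype)
        self.mask = self.df < 0  # "boolean" EA mask when dtype is "Float64"

    def time_where(self, inplace, dtype):
        # Only this call is timed by ASV.
        self.df.where(self.mask, other=0.0, inplace=inplace)
```

ASV calls `setup` before each timed run, so mask construction is not included in the reported time.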

EA-backed frames are not widely covered by ASVs, but there are other methods it may help with, e.g.:

       before           after         ratio
     [ddae6946]       [48684e5c]
     <main>           <where-putmask-ea-boolean>
+      11.0±0.7ms       13.6±0.4ms     1.24  frame_methods.Clip.time_clip('float64')
-        95.9±4ms         41.0±2ms     0.43  frame_methods.Clip.time_clip('Float64')
-        82.0±2ms       30.8±0.2ms     0.38  frame_methods.Clip.time_clip('float64[pyarrow]')

Note the slight slowdown in the non-EA clip ASV. I still think the simplification is worth it: it puts clip back to where it was a few days ago in terms of perf (prior to #51472). We might also be able to improve perf via the discussion in #51547.

@lukemanley lukemanley added the Performance (memory or execution speed performance) and ExtensionArray (extending pandas with custom dtypes or arrays) labels Feb 23, 2023
@@ -390,6 +398,14 @@ def putmask(self, mask, new, align: bool = True):
align_keys = ["mask"]
new = extract_array(new, extract_numpy=True)

if isinstance(mask, ABCDataFrame) and mask._mgr.any_extension_types:
Member

is there a viable option to avoid getting here with DataFrames?

Member Author

I suspect something could be done in NDFrame._where to avoid most cases. And then maybe .clip could just call _where?

If you're already thinking about changes to ._where in #51547 I can close this if it makes sense to combine the two.
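
The idea of expressing clip through where can be sketched with public API only. The helper below is hypothetical (its name and structure are assumptions, and it is not the pandas implementation); it handles scalar bounds and keeps missing values untouched by OR-ing the comparison with `isna()`.

```python
import pandas as pd


def clip_via_where(df, lower=None, upper=None):
    """Hypothetical helper: scalar clip built from DataFrame.where."""
    result = df
    isna = df.isna()  # missing entries must survive clipping
    if lower is not None:
        # Keep values at or above the bound, plus missing entries.
        keep = (result >= lower) | isna
        result = result.where(keep, lower)
    if upper is not None:
        # Keep values at or below the bound, plus missing entries.
        keep = (result <= upper) | isna
        result = result.where(keep, upper)
    return result


df = pd.DataFrame(
    {"a": [-1.5, 0.2, None, 3.0], "b": [0.5, -2.0, 1.5, None]},
    dtype="Float64",
)
clipped = clip_via_where(df, lower=0.0, upper=1.0)
print(clipped)
```

With an EA-backed frame, each `keep` mask is itself a "boolean" EA frame, which is the case this PR speeds up.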

Member Author

@lukemanley lukemanley Feb 24, 2023

I've moved this to NDFrame._where and updated the timings in the OP.
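
Handling the mask at the frame level, before it reaches the manager, might look like the sketch below. `normalize_cond` is a hypothetical helper written against public API, not the code that landed in `NDFrame._where`; mapping missing mask entries to `False` is an assumption, and the comparison at the end checks that it matches `where` on the original mask for this example.

```python
import pandas as pd
from pandas.api.extensions import ExtensionDtype


def normalize_cond(cond: pd.DataFrame, fill_value: bool = False) -> pd.DataFrame:
    """Hypothetical helper: return `cond` backed by NumPy bool columns."""
    if not any(isinstance(dt, ExtensionDtype) for dt in cond.dtypes):
        return cond  # already NumPy-backed, nothing to do
    # One explicit conversion to bool; missing entries become fill_value.
    values = cond.to_numpy(dtype=bool, na_value=fill_value)
    return pd.DataFrame(values, index=cond.index, columns=cond.columns)


df = pd.DataFrame({"a": [1.0, -2.0, None], "b": [-0.5, None, 4.0]}, dtype="Float64")
cond = df > 0                   # "boolean" dtype, NA where df is NA
np_cond = normalize_cond(cond)  # NumPy bool, NA mapped to False

result_ea = df.where(cond, -1.0)
result_np = df.where(np_cond, -1.0)
print(result_ea.equals(result_np))
```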

@phofl
Member

phofl commented Feb 23, 2023

I think we can do them separately, but I agree that this should be handled before getting to the manager level

@lukemanley lukemanley changed the title from "PERF: BlockManager.where/putmask for EA dtype mask" to "PERF: DataFrame.where for EA dtype mask" Feb 24, 2023
return self._update_inplace(result)
else:
return result.__finalize__(self)
if lower is not None:
Member

This basically reverts the initial clip pr?

Member Author

> This basically reverts the initial clip pr?

Partially, but it still retains the new inplace behavior and the perf improvements for EAs.

Member

Thx, forgot about the inplace fix

@phofl
Member

phofl commented Feb 25, 2023

Can you fix conflicts?

@phofl phofl added this to the 2.1 milestone Feb 25, 2023
@phofl phofl merged commit d05c0b9 into pandas-dev:main Feb 25, 2023
@phofl
Member

phofl commented Feb 25, 2023

thx @lukemanley

@lukemanley lukemanley deleted the where-putmask-ea-boolean branch March 17, 2023 22:04