PERF: pd.concat EA-backed indexes and sort=True #49178

Merged
merged 9 commits into from Oct 22, 2022

Member

@lukemanley lukemanley commented Oct 19, 2022

algos.safe_sort currently converts EA-backed indexes to ndarrays, which can cause a performance hit. The hit can be significant if the index contains pd.NA, because np.argsort raises and the exception is then caught in the try/except. This PR avoids the NumPy conversion for EA-backed indexes.

Note: One test which relied on the numpy conversion was updated.
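
To illustrate the problem described above, here is a minimal sketch (not the code changed in this PR) that contrasts sorting an object ndarray containing pd.NA with sorting the ExtensionArray directly. The variable names are chosen for illustration only.

```python
import numpy as np
import pandas as pd

arr = pd.array([pd.NA, 3, 1], dtype="Int64")

# Object-ndarray path: comparing pd.NA with an integer yields pd.NA, whose
# truth value is ambiguous, so NumPy's sort raises TypeError.
obj = arr.to_numpy(dtype=object)
try:
    np.argsort(obj)
    numpy_sort_raised = False
except TypeError:
    numpy_sort_raised = True

# ExtensionArray path: argsort is mask-aware and places missing values last
# by default, so no exception handling is needed.
order = arr.argsort()
sorted_arr = arr.take(order)
```

Staying on the ExtensionArray avoids both the conversion cost and the raise-and-recover round trip.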

import numpy as np
import pandas as pd
from pandas.core.indexes.api import safe_sort_index

vals = [pd.NA] + list(np.arange(100_000))
idx = pd.Index(vals, dtype="Int64")

%timeit safe_sort_index(idx)

# 81.6 ms ± 5.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)   <- main
# 2.73 ms ± 24.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  <- PR

I updated an existing ASV that was just added yesterday to cover this case as well:

       before           after         ratio
     [8b503a8c]       [2e2a02fa]
                      <safe-sort-index>
-        27.1±1ms       22.2±0.4ms     0.82  join_merge.ConcatIndexDtype.time_concat_series('string[pyarrow]', 'non_monotonic', 1, True)
-      15.9±0.2ms       12.3±0.1ms     0.78  join_merge.ConcatIndexDtype.time_concat_series('string[python]', 'has_na', 1, True)
-      31.4±0.7ms       24.1±0.7ms     0.77  join_merge.ConcatIndexDtype.time_concat_series('string[pyarrow]', 'has_na', 1, True)
-      22.8±0.2ms       16.4±0.3ms     0.72  join_merge.ConcatIndexDtype.time_concat_series('Int64', 'has_na', 1, True)
-      13.8±0.4ms       9.26±0.1ms     0.67  join_merge.ConcatIndexDtype.time_concat_series('Int64', 'non_monotonic', 1, True)
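
For readers unfamiliar with the benchmarked path, the following is a small sketch of the kind of call that ends up sorting the combined index: concatenating Series with different EA-backed indexes along axis=1 with sort=True. The data is made up for illustration and is far smaller than what the ASV benchmark uses.

```python
import pandas as pd

# Two Series whose Int64 indexes differ, so concat has to build a union index.
idx1 = pd.Index([3, 1, pd.NA], dtype="Int64")
idx2 = pd.Index([2, 1], dtype="Int64")
s1 = pd.Series([10, 20, 30], index=idx1)
s2 = pd.Series([1, 2], index=idx2)

# sort=True asks pandas to sort the combined (non-concatenation) axis.
result = pd.concat([s1, s2], axis=1, sort=True)
print(result)
```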

@lukemanley lukemanley added labels Performance (Memory or execution speed performance), Reshaping (Concat, Merge/Join, Stack/Unstack, Explode), ExtensionArray (Extending pandas with custom dtypes or arrays), Index (Related to the Index class or subclasses) Oct 19, 2022
num_pos = ~str_pos & ~null_pos
str_argsort = np.argsort(values[str_pos])
num_argsort = np.argsort(values[num_pos])
str_locs = str_pos.nonzero()[0].take(str_argsort)
Member

What is nonzero doing here?

Member Author

It converts the boolean mask to positional indices within the larger array. We then reorder those positional indices using the argsort of the string subset. This is all so that we can call take on the original input and avoid converting everything to an ndarray.

Member

Ah, so the catch here is that False is 0 and True is nonzero? If so, could you add a short comment?

Member Author

Yes, exactly. I added a comment explaining the operation.
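
The mask-to-positions step discussed in this thread can be sketched on a small object array as follows. This is an illustrative reconstruction under assumptions, not the merged code: the way str_pos and null_pos are built here and the final ordering (numbers, then strings, then nulls) are assumed for the example.

```python
import numpy as np
import pandas as pd

values = np.array(["b", 3, None, "a", 1], dtype=object)

# Boolean masks over the full array (construction assumed for illustration).
str_pos = np.array([isinstance(v, str) for v in values], dtype=bool)
null_pos = pd.isna(values)
num_pos = ~str_pos & ~null_pos

# argsort of each subset gives an ordering *within* that subset.
str_argsort = np.argsort(values[str_pos])
num_argsort = np.argsort(values[num_pos])

# mask.nonzero()[0] lists the positions in `values` where the mask is True;
# take() reorders those positions by the subset's sort order.
str_locs = str_pos.nonzero()[0].take(str_argsort)
num_locs = num_pos.nonzero()[0].take(num_argsort)
null_locs = null_pos.nonzero()[0]

# The combined positions can be passed to take() on the original input.
locs = np.concatenate([num_locs, str_locs, null_locs])
sorted_values = values.take(locs)
```

Because the result is a set of positions rather than sorted copies of the data, the same take call works on any input that supports it, which is what lets the original array type be preserved.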

Member

@phofl phofl left a comment

@mroeschke
Member

Just a merge conflict; otherwise LGTM.

@phofl phofl added this to the 2.0 milestone Oct 22, 2022
@phofl phofl merged commit c667fc4 into pandas-dev:main Oct 22, 2022
@lukemanley lukemanley deleted the safe-sort-index branch October 26, 2022 10:18
noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022

3 participants