Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: Categorical indexing performance regression #30744

Closed
jorisvandenbossche opened this issue Jan 6, 2020 · 4 comments · Fixed by #30747
Closed

PERF: Categorical indexing performance regression #30744

jorisvandenbossche opened this issue Jan 6, 2020 · 4 comments · Fixed by #30747
Labels
Performance Memory or execution speed performance
Milestone

Comments

@jorisvandenbossche
Copy link
Member

Recent regression in the categoricals.CategoricalSlicing.time_getitem_list benchmark: https://pandas.pydata.org/speed/pandas/#categoricals.CategoricalSlicing.time_getitem_list?commits=6efc2379-b9de33e3

Reproducible example for this benchmark:

N = 10 ** 6
categories = ["a", "b", "c"]
values = [0] * N + [1] * N + [2] * N
data = pd.Categorical.from_codes(values, categories=categories)

list_ = list(range(10000))

%timeit data[list_]

Now, this slowdown is due to the changes in #30308. Categorical __getitem__ now checks if the key is a boolean indexer: https://github.com/pandas-dev/pandas/pull/30308/files#diff-f3b2ea15ba728b55cab4a1acd97d996d

So this slowdown is of course expected, and also only for Categorical itself (eg pd.Series indexing already handles this boolean checking). So in that light, we can certainly ignore this regression.
But, this led me think: maybe the ExtensionArrays are a good place to start not supporting object dtype as boolean indexer? (and so not add support for it now, which also avoids this performance regression)

@jorisvandenbossche jorisvandenbossche added the Performance Memory or execution speed performance label Jan 6, 2020
@jorisvandenbossche
Copy link
Member Author

Actually, I think I am wrong in the fact that it is the object dtype support that is the additional overhead in this case. It rather is the conversion of the list to a numpy array, to then only see it is an integer array and not a boolean array, and throw the converted array away (and later have to do the conversion again, when actually indexing with the integer list)

@TomAugspurger
Copy link
Contributor

So IIUC, the best thing to do is convert list inputs into array inputs as early as possible? And then re-use that (hopefully well-typed) input later on?

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Jan 6, 2020
Convert to an array earlier on.
Closes pandas-dev#30744
@TomAugspurger TomAugspurger added this to the 1.0 milestone Jan 6, 2020
@jorisvandenbossche
Copy link
Member Author

Yes, that's correct. But seeing that basically every internal ExtensionArray and also external ExtensionArray would want to do this, I am wondering if we rather want to expose something like check_array_indexer instead of the boolean specific one we exposed now. That could also avoid those avoidable extra conversions.

@jorisvandenbossche
Copy link
Member Author

Such a common function might also help with #30738

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants