PERF: Categorical indexing performance regression #30744

jorisvandenbossche · 2020-01-06T16:28:56Z

Recent regression in the categoricals.CategoricalSlicing.time_getitem_list benchmark: https://pandas.pydata.org/speed/pandas/#categoricals.CategoricalSlicing.time_getitem_list?commits=6efc2379-b9de33e3

Reproducible example for this benchmark:

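import pandas as pd

N = 10 ** 6
categories = ["a", "b", "c"]
values = [0] * N + [1] * N + [2] * N
data = pd.Categorical.from_codes(values, categories=categories)

# Index with a plain Python list of 10,000 integer positions
list_ = list(range(10000))

# IPython magic; run this in IPython or Jupyter
%timeit data[list_]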

Now, this slowdown is due to the changes in #30308. Categorical __getitem__ now checks if the key is a boolean indexer: https://github.com/pandas-dev/pandas/pull/30308/files#diff-f3b2ea15ba728b55cab4a1acd97d996d

So this slowdown is of course expected, and it only affects Categorical itself (e.g. pd.Series indexing already handles this boolean checking). In that light, we can certainly ignore this regression.
But this led me to think: maybe the ExtensionArrays are a good place to start not supporting object dtype as a boolean indexer? (That would mean not adding support for it now, which also avoids this performance regression.)

jorisvandenbossche · 2020-01-06T16:39:40Z

Actually, I think I was wrong that the object dtype support is the additional overhead in this case. Rather, it is the conversion of the list to a numpy array, only to then see that it is an integer array and not a boolean array, and throw the converted array away (and later have to do the conversion again when actually indexing with the integer list).
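To make the double conversion concrete, here is a minimal sketch. It is not the pandas implementation: `NaiveToyArray`, `_to_array`, and the `conversions` counter are hypothetical names used only for illustration, and the class wraps a plain numpy array.

```python
import numpy as np


class NaiveToyArray:
    """Hypothetical stand-in for an ExtensionArray (not pandas code)."""

    def __init__(self, codes):
        self._codes = np.asarray(codes)
        self.conversions = 0  # counts list -> ndarray conversions

    def _to_array(self, key):
        self.conversions += 1
        return np.asarray(key)

    def __getitem__(self, key):
        # Step 1: convert the list only to learn whether it is a boolean mask.
        is_mask = self._to_array(key).dtype == bool
        # The converted array is thrown away here.
        if is_mask and len(key) != len(self._codes):
            raise IndexError("boolean mask has the wrong length")
        # Step 2: the original list is converted a second time for indexing.
        return self._codes[self._to_array(key)]


arr = NaiveToyArray([0, 1, 2, 0])
result = arr[[0, 2]]
print(result, arr.conversions)
```

For a list of integer positions, the counter ends at two even though only one conversion is needed.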

TomAugspurger · 2020-01-06T16:56:17Z

So IIUC, the best thing to do is convert list inputs into array inputs as early as possible? And then re-use that (hopefully well-typed) input later on?
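A rough sketch of that idea, under the assumption of a toy class wrapping a numpy array (`EarlyConvertToyArray` and its `conversions` counter are hypothetical and not taken from the pandas code base): the list is converted once at the top of `__getitem__`, and the resulting array is used both for the boolean check and for the indexing itself.

```python
import numpy as np


class EarlyConvertToyArray:
    """Hypothetical stand-in for an ExtensionArray (not pandas code)."""

    def __init__(self, codes):
        self._codes = np.asarray(codes)
        self.conversions = 0  # counts list -> ndarray conversions

    def __getitem__(self, key):
        if isinstance(key, list):
            # Convert once, as early as possible.
            key = np.asarray(key)
            self.conversions += 1
        if isinstance(key, np.ndarray) and key.dtype == bool:
            # The boolean check reuses the already-converted array.
            if len(key) != len(self._codes):
                raise IndexError("boolean mask has the wrong length")
        # Indexing reuses the same array; no second conversion.
        return self._codes[key]


arr = EarlyConvertToyArray([0, 1, 2, 0])
result = arr[[0, 2]]
print(result, arr.conversions)
```

The sketch leaves out edge cases such as an empty list, which a real implementation would need to handle.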

jorisvandenbossche · 2020-01-06T19:25:17Z

Yes, that's correct. But seeing that basically every ExtensionArray, internal or external, would want to do this, I am wondering if we would rather expose something like check_array_indexer instead of the boolean-specific one we expose now. That could also avoid those extra conversions.
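For readers on pandas 1.0 or later, a helper with this name is available as `pandas.api.indexers.check_array_indexer`. The snippet below is only a usage sketch under that version assumption. It shows the helper returning a validated numpy array for list keys and rejecting a boolean mask of the wrong length.

```python
import numpy as np
import pandas as pd
from pandas.api.indexers import check_array_indexer

data = pd.Categorical.from_codes([0, 1, 2, 0], categories=["a", "b", "c"])

# A list of integer positions comes back as an integer numpy array.
int_key = check_array_indexer(data, [0, 2])

# A boolean list of matching length comes back as a boolean numpy array.
bool_key = check_array_indexer(data, [True, False, True, False])

# A boolean mask whose length does not match the array is rejected.
try:
    check_array_indexer(data, [True, False])
    wrong_length_rejected = False
except IndexError:
    wrong_length_rejected = True

print(int_key.dtype, bool_key.dtype, wrong_length_rejected)
```

An ExtensionArray `__getitem__` can then index with the returned array directly instead of converting the original list again.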

jorisvandenbossche · 2020-01-06T19:26:01Z

Such a common function might also help with #30738

jorisvandenbossche added the Performance (Memory or execution speed performance) label Jan 6, 2020

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Jan 6, 2020

PERF: Categorical getitem perf

7b05c0d

Convert to an array earlier on. Closes pandas-dev#30744

TomAugspurger mentioned this issue Jan 6, 2020

PERF: Categorical getitem perf #30747

Merged

TomAugspurger added this to the 1.0 milestone Jan 6, 2020

TomAugspurger closed this as completed in #30747 Jan 6, 2020

TomAugspurger added a commit that referenced this issue Jan 6, 2020

PERF: Categorical getitem perf (#30747)

d3f94a4

Convert to an array earlier on. Closes #30744

jorisvandenbossche mentioned this issue Jan 7, 2020

REF: Implement BaseMaskedArray class for integer/boolean ExtensionArrays #30789

Merged

jorisvandenbossche mentioned this issue Jan 20, 2020

API: generalized check_array_indexer for validating array-like getitem indexers #31150

Merged
