
PERF: Speed up boolean masking on Series / index #30349

Closed
TomAugspurger opened this issue Dec 19, 2019 · 7 comments
Labels: Indexing (Related to indexing on series/frames, not to indexes themselves), Performance (Memory or execution speed performance)

Comments

@TomAugspurger (Contributor)

On master, applying a boolean mask to a 10,000-element Series takes ~300 µs:

In [1]: import numpy as np; import pandas as pd

In [2]: s = pd.Series(np.arange(10000))
   ...: m1 = np.zeros(len(s), dtype="bool")
   ...: m2 = pd.array(m1, dtype="boolean")

In [3]: %timeit s[m1]
275 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Much of that time is spent in a try / except inside Index.get_value:

%load_ext line_profiler
%lprun -f pd.Index.get_value s[m1]

  4481         1          1.0      1.0      0.1          try:
  4482         1        626.0    626.0     63.9              return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
  4483         1          1.0      1.0      0.1          except KeyError as e1:

That lookup can never succeed for a boolean mask. By skipping that path entirely, we improve performance on this example by ~2x:

In [3]: %timeit s[m1]
155 µs ± 4.33 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

We might be able to restructure Index.get_value or Series.__getitem__ a bit to not go down this path when we have a boolean ndarray as a mask.
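A minimal sketch of that restructuring, using plain NumPy (the function name and structure here are illustrative assumptions, not pandas internals): check for a boolean ndarray up front and dispatch straight to positional selection, so the key-lookup try/except is never entered for a mask.

```python
import numpy as np

# Hypothetical sketch (not pandas code): dispatch a boolean ndarray
# mask straight to positional selection, skipping the label-lookup
# path that can never succeed for a mask.
def getitem_sketch(values, key):
    if isinstance(key, np.ndarray) and key.dtype == np.bool_:
        # Fast path: boolean masking is pure positional selection.
        return values[key]
    # Slow path placeholder: scalar/label lookup would go here.
    raise KeyError(key)

values = np.arange(10_000)
mask = np.zeros(len(values), dtype=bool)
mask[:5] = True
print(getitem_sketch(values, mask))  # → [0 1 2 3 4]
```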

@TomAugspurger TomAugspurger added Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance labels Dec 19, 2019
@TomAugspurger TomAugspurger added this to the Contributions Welcome milestone Dec 19, 2019
@jorisvandenbossche (Member)

Another 30% of the time is spent applying the boolean mask to the RangeIndex (for the case above). This also seems to be a non-optimized case: it is not handled directly in RangeIndex.__getitem__, so it falls back to the base Index class, which infers the result class. But a boolean filter on a RangeIndex should always give an Int64Index, and using that assumption should speed this part up by a few factors.
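The dtype claim above can be illustrated with a plain NumPy stand-in (an assumption for illustration, not the pandas classes themselves): masking an integer range with a boolean array always yields integer positions, so the result dtype is knowable up front without inference.

```python
import numpy as np

# Stand-in for a RangeIndex: a materialized int64 range. Masking it
# with a boolean array can only ever produce int64 values, so the
# result dtype never needs to be inferred.
rng = np.arange(10, dtype=np.int64)
mask = rng % 2 == 0
result = rng[mask]
print(result.dtype)  # int64
```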

@jbrockmendel (Member)

@TomAugspurger I now get 157 µs for the s[m1] example you posted, which is about the same as the optimized target you suggested. I expect this is due to the recent improvements in FooIndex.get_value. Skipping the Index.get_value call gets this down to 148 µs, so there might not be much more room for optimization (notwithstanding the RangeIndex suggestion from @jorisvandenbossche).

@jbrockmendel (Member)

> and therefore it is handled in the base Index class with inferring of the result class

I don't think that's accurate anymore. RangeIndex.__getitem__ defers to Index.__getitem__, which in turn operates on an ndarray and wraps in _shallow_copy. RangeIndex._shallow_copy goes directly to Int64Index.
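The dispatch chain described above can be sketched with a minimal class hierarchy (the class and method names mirror pandas, but the implementations here are illustrative, not the real ones):

```python
import numpy as np

class Index:
    def __init__(self, values):
        self._values = np.asarray(values)

    def __getitem__(self, key):
        # Operate on the underlying ndarray, then re-wrap the result
        # via _shallow_copy.
        return self._shallow_copy(self._values[key])

    def _shallow_copy(self, values):
        return Index(values)

class Int64Index(Index):
    pass

class RangeIndex(Index):
    # No __getitem__ of its own: defers to Index.__getitem__.
    def _shallow_copy(self, values):
        # Masking a range no longer yields a range, so go directly
        # to Int64Index without dtype inference.
        return Int64Index(values)

idx = RangeIndex(range(10))
mask = np.zeros(10, dtype=bool)
mask[2:5] = True
print(type(idx[mask]).__name__)  # Int64Index
```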

@jorisvandenbossche (Member)

And a quick timing confirmed that getitem with a RangeIndex is not slower than with an Int64Index for the case above, so the slowdown is certainly not due to RangeIndex specifically.

Now, the reason I actually saw "inference of index dtype" in the profile, as mentioned above: although Index.__getitem__ calls _shallow_copy to avoid inference, Int64Index._shallow_copy then actually calls _shallow_copy_with_infer (the comment says this is to avoid storing floats in an Int64Index, so it is not a fully proper fastpath).
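A hedged sketch of why that inference step exists (the function name mirrors the comment above, but the body is an illustrative assumption, not pandas internals): an inferring copy must re-examine the values on every call to decide whether an integer result is safe, which is exactly the per-getitem cost a true fastpath avoids.

```python
import numpy as np

# Illustrative only: guard against storing floats under an integer
# index by re-inspecting the values, keeping int64 only when the
# values really are integral.
def shallow_copy_with_infer(values):
    values = np.asarray(values)
    if values.dtype.kind == "f":
        as_int = values.astype("int64")
        if np.array_equal(as_int, values):
            return as_int
        return values  # genuinely float: keep float64, don't coerce
    return values

print(shallow_copy_with_infer(np.array([1.0, 2.0])).dtype)  # int64
print(shallow_copy_with_infer(np.array([1.5])).dtype)       # float64
```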

@jorisvandenbossche (Member)

(I didn't check again how significant this part is of the total, though)

@jbrockmendel (Member)

it sounds like this issue is closeable? (there are likely still perf gains to be had, but they aren't really specific to boolean masking)

@jorisvandenbossche (Member)

Yep. I opened #31903 for the observation about dtype inference in Index.__getitem__ I mentioned above.

@jorisvandenbossche jorisvandenbossche modified the milestones: Contributions Welcome, No action Feb 11, 2020