
PERF: Speed up boolean masking on Series / index #30349

Closed
TomAugspurger opened this issue Dec 19, 2019 · 7 comments
Labels: Indexing (Related to indexing on series/frames, not to indexes themselves), Performance (Memory or execution speed performance)

Comments

@TomAugspurger (Contributor)

On master, applying a boolean mask to a 10,000-element Series takes ~300 µs:

In [1]: import numpy as np; import pandas as pd

In [2]: s = pd.Series(np.arange(10000))
   ...: m1 = np.zeros(len(s), dtype="bool")
   ...: m2 = pd.array(m1, dtype="boolean")

In [3]: %timeit s[m1]
275 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Much of that time is spent in a try / except inside Index.get_value:

%load_ext line_profiler
%lprun -f pd.Index.get_value s[m1]

  4481         1          1.0      1.0      0.1          try:
  4482         1        626.0    626.0     63.9              return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
  4483         1          1.0      1.0      0.1          except KeyError as e1:

That lookup can never succeed for a boolean mask. By skipping that path entirely, we improve performance on this example by ~2x:

In [3]: %timeit s[m1]
155 µs ± 4.33 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

We might be able to restructure Index.get_value or Series.__getitem__ a bit to not go down this path when we have a boolean ndarray as a mask.
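A minimal sketch of that restructuring, using plain NumPy (the function name and structure here are illustrative assumptions, not pandas internals): check for a boolean ndarray up front and dispatch straight to positional selection, so the key-lookup try/except is never entered for a mask.

```python
import numpy as np

# Hypothetical sketch (not pandas code): dispatch a boolean ndarray
# mask straight to positional selection, skipping the label-lookup
# path that can never succeed for a mask.
def getitem_sketch(values, key):
    if isinstance(key, np.ndarray) and key.dtype == np.bool_:
        # Fast path: boolean masking is pure positional selection.
        return values[key]
    # Slow path placeholder: scalar/label lookup would go here.
    raise KeyError(key)

values = np.arange(10_000)
mask = np.zeros(len(values), dtype=bool)
mask[:5] = True
print(getitem_sketch(values, mask))  # → [0 1 2 3 4]
```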

@TomAugspurger TomAugspurger added Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance labels Dec 19, 2019
@TomAugspurger TomAugspurger added this to the Contributions Welcome milestone Dec 19, 2019
@jorisvandenbossche (Member)

Another 30% of the time is spent applying the boolean mask to the RangeIndex (for the case above). This also seems to be a non-optimized case: it is not handled directly in RangeIndex.__getitem__, so it falls back to the base Index class, which infers the result class. But a boolean filter on a RangeIndex should always give an Int64Index, and using that assumption should speed this part up by a few factors.
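The dtype claim above can be illustrated with a plain NumPy stand-in (an assumption for illustration, not the pandas classes themselves): masking an integer range with a boolean array always yields integer positions, so the result dtype is knowable up front without inference.

```python
import numpy as np

# Stand-in for a RangeIndex: a materialized int64 range. Masking it
# with a boolean array can only ever produce int64 values, so the
# result dtype never needs to be inferred.
rng = np.arange(10, dtype=np.int64)
mask = rng % 2 == 0
result = rng[mask]
print(result.dtype)  # int64
```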

@jbrockmendel (Member)

@TomAugspurger I now get 157 µs for the s[m1] example you posted, which is about the same as the optimized target you suggested. I expect this is due to the recent improvements in FooIndex.get_value. Skipping the Index.get_value call gets this down to 148 µs, so there might not be much more room for optimization (notwithstanding the RangeIndex suggestion from @jorisvandenbossche).

@jbrockmendel (Member)

> and therefore it is handled in the base Index class with inferring of the result class

I don't think that's accurate anymore. RangeIndex.__getitem__ defers to Index.__getitem__, which in turn operates on an ndarray and wraps in _shallow_copy. RangeIndex._shallow_copy goes directly to Int64Index.
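The dispatch chain described above can be sketched with a minimal class hierarchy (the class and method names mirror pandas, but the implementations here are illustrative, not the real ones):

```python
import numpy as np

class Index:
    def __init__(self, values):
        self._values = np.asarray(values)

    def __getitem__(self, key):
        # Operate on the underlying ndarray, then re-wrap the result
        # via _shallow_copy.
        return self._shallow_copy(self._values[key])

    def _shallow_copy(self, values):
        return Index(values)

class Int64Index(Index):
    pass

class RangeIndex(Index):
    # No __getitem__ of its own: defers to Index.__getitem__.
    def _shallow_copy(self, values):
        # Masking a range no longer yields a range, so go directly
        # to Int64Index without dtype inference.
        return Int64Index(values)

idx = RangeIndex(range(10))
mask = np.zeros(10, dtype=bool)
mask[2:5] = True
print(type(idx[mask]).__name__)  # Int64Index
```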

@jorisvandenbossche (Member)

And a quick timing confirmed that getitem with a RangeIndex is not slower than with an Int64Index for the case above, so the slowdown is certainly not due to RangeIndex specifically.

Now, the reason I actually saw "inference of index dtype" in the profile, as mentioned above: although Index.__getitem__ calls _shallow_copy to avoid inference, Int64Index._shallow_copy then actually calls _shallow_copy_with_infer (the comment says this is to avoid storing floats in an Int64Index, so it is not a fully proper fastpath).
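A hedged sketch of why that inference step exists (the function name mirrors the comment above, but the body is an illustrative assumption, not pandas internals): an inferring copy must re-examine the values on every call to decide whether an integer result is safe, which is exactly the per-getitem cost a true fastpath avoids.

```python
import numpy as np

# Illustrative only: guard against storing floats under an integer
# index by re-inspecting the values, keeping int64 only when the
# values really are integral.
def shallow_copy_with_infer(values):
    values = np.asarray(values)
    if values.dtype.kind == "f":
        as_int = values.astype("int64")
        if np.array_equal(as_int, values):
            return as_int
        return values  # genuinely float: keep float64, don't coerce
    return values

print(shallow_copy_with_infer(np.array([1.0, 2.0])).dtype)  # int64
print(shallow_copy_with_infer(np.array([1.5])).dtype)       # float64
```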

@jorisvandenbossche (Member)

(I didn't check again how significant this part is of the total, though)

@jbrockmendel (Member)

it sounds like this issue is closeable? (there are likely still perf gains to be had, but they aren't really specific to boolean masking)

@jorisvandenbossche (Member)

Yep. I opened #31903 for the observation about dtype inference in Index.__getitem__ I mentioned above.

@jorisvandenbossche jorisvandenbossche modified the milestones: Contributions Welcome, No action Feb 11, 2020