PERF: accessing sliced indexes with populated indexing engines #51738

Merged
mroeschke merged 6 commits into pandas-dev:main from the index_slice_perf branch on Mar 8, 2023

@topper-123 (Contributor) commented Mar 2, 2023

Improves performance of indexes that are sliced from indexes with already-built indexing engines by copying the relevant data from the existing indexing engine, thereby avoiding recomputation.

Performance example:

>>> import numpy as np
>>> import pandas as pd
>>>
>>> idx = pd.Index(np.arange(1_000_000))
>>> idx.is_unique, idx.is_monotonic_increasing  # building the engine
(True, True)
>>> %timeit idx[:].is_unique
13.9 ms ± 78.8 µs per loop  # main
2.76 µs ± 9.74 ns per loop  # this PR
>>> %timeit idx[:].is_monotonic_increasing
4.26 ms ± 1.21 µs per loop  # main
2.7 µs ± 3.9 ns per loop  # this PR
>>> %timeit idx[:].get_loc(999_999)
4.26 ms ± 1.49 µs per loop  # main
3.77 µs ± 41.7 ns per loop  # this PR

Not sure how to test this, as the relevant attributes live in Cython code, and I don't think we currently have tests for the indexing engines?
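
The approach described in this PR (deriving a slice's cached properties from the parent's engine instead of recomputing them) can be illustrated with a toy model. This is only a sketch: `ToyEngine`, `build_engine`, and `engine_for_slice` are hypothetical names invented for illustration and are not the pandas Cython engine or its API. The sketch deliberately propagates only facts that are guaranteed to survive slicing (a slice of unique values is unique; a slice of monotonic values is monotonic, with the direction flipped for a negative step) and leaves everything else as "not computed yet".

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ToyEngine:
    """Hypothetical stand-in for an indexing engine (not pandas' Cython class)."""

    unique: Optional[bool] = None  # None means "not computed yet"
    monotonic_inc: Optional[bool] = None
    monotonic_dec: Optional[bool] = None


def build_engine(values: Sequence[int]) -> ToyEngine:
    """Compute every cached property from scratch: O(n) work."""
    pairs = list(zip(values, values[1:]))
    return ToyEngine(
        unique=len(set(values)) == len(values),
        monotonic_inc=all(a <= b for a, b in pairs),
        monotonic_dec=all(a >= b for a, b in pairs),
    )


def engine_for_slice(parent: ToyEngine, slobj: slice) -> ToyEngine:
    """Derive a slice's engine from its parent's cached flags: O(1) work."""
    reverse = slobj.step is not None and slobj.step < 0
    inc, dec = parent.monotonic_inc, parent.monotonic_dec
    if reverse:
        # A negative step walks the values backwards, so the directions swap.
        inc, dec = dec, inc
    # Only positive facts are carried over; anything else stays unknown.
    return ToyEngine(
        unique=True if parent.unique else None,
        monotonic_inc=True if inc else None,
        monotonic_dec=True if dec else None,
    )


values = list(range(10))
parent = build_engine(values)
forward = engine_for_slice(parent, slice(2, 8))
backward = engine_for_slice(parent, slice(None, None, -1))
```

In this model, `forward` inherits "unique" and "monotonic increasing" without touching the values, while `backward` inherits "unique" and "monotonic decreasing". Properties the parent could not vouch for would be recomputed lazily, as before.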

@jbrockmendel (Member) commented

nice! this has been on my todo list for ages but was always intimidating

@topper-123 topper-123 force-pushed the index_slice_perf branch 2 times, most recently from 30d8e4e to 7b50a89 Compare March 4, 2023 21:40
@mroeschke mroeschke added the Indexing (related to indexing on series/frames, not to indexes themselves) and Performance (memory or execution speed performance) labels Mar 6, 2023
@jbrockmendel (Member) left a comment

LGTM

-        return type(self)._simple_new(res, name=self._name)
+        result = type(self)._simple_new(res, name=self._name)
+        if "_engine" in self._cache:
+            reverse = slobj.step is not None and slobj.step < 0
Member

Just confirming this is still valid if slobj is empty, e.g. slice(0, 0)?

Contributor Author

Yeah, this is valid. With e.g. slice(None), slobj.step is None, and for slice(0, 0) the .step attribute is also None, so reverse is simply False and there is no problem there.
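
The point about `.step` can be checked directly with built-in `slice` objects. The helper name `is_reversed` below is hypothetical and only mirrors the expression from the diff; it is not a pandas function.

```python
def is_reversed(slobj: slice) -> bool:
    """Mirror of the `reverse` expression in the diff above."""
    return slobj.step is not None and slobj.step < 0


# A slice built without an explicit step always reports step=None,
# whether it is empty or spans everything.
empty = slice(0, 0)
everything = slice(None)
backwards = slice(None, None, -1)

checks = {
    "empty": is_reversed(empty),
    "everything": is_reversed(everything),
    "backwards": is_reversed(backwards),
}
```

Because the `is not None` test short-circuits, the `< 0` comparison is never evaluated against `None`, so an empty slice such as `slice(0, 0)` is treated like any other forward slice.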

@mroeschke mroeschke added this to the 2.1 milestone Mar 8, 2023
@mroeschke mroeschke merged commit 9b4cffc into pandas-dev:main Mar 8, 2023
@mroeschke (Member) commented

Thanks @topper-123

Labels
Indexing (related to indexing on series/frames, not to indexes themselves), Performance (memory or execution speed performance)

3 participants