PERF: Index.__getitem__ performance issue #6370

Closed
immerrr opened this issue Feb 16, 2014 · 2 comments · Fixed by #6440
Labels
Indexing (Related to indexing on series/frames, not to indexes themselves) · Performance (Memory or execution speed performance)
Milestone
0.14.0
Comments

immerrr (Contributor) commented Feb 16, 2014

Once again, this came up during the #6328 investigation.

There's something very strange with how Index objects handle slices:

```
In [1]: import pandas.util.testing as tm

In [2]: idx = tm.makeStringIndex(1000000)

In [3]: timeit idx[:-1]
100000 loops, best of 3: 2 µs per loop

In [4]: timeit idx[slice(None,-1)]
100 loops, best of 3: 6.5 ms per loop
```

Obviously, this happens because Index doesn't override the __getslice__ provided by ndarray: idx[:-1] is executed via ndarray.__getslice__ -> Index.__array_finalize__, while idx[slice(None, -1)] goes via Index.__getitem__ -> Index.__new__.

__getitem__ is ~1000x slower because it tries to infer the data type of the sliced result and convert it to a different Index subclass. The problem is that the interactive invocation idx[:-1], which is where the milliseconds-vs-microseconds difference doesn't matter, is likely to bypass this feature entirely because it's dispatched via __getslice__, while the programmatic invocation idx[slice(None, -1)] is the one that hits this soft spot. I'd argue that this type-conversion magic is not necessary there at all.
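For reference, here's a minimal Python 2 sketch of the dispatch difference (a toy class, not pandas code; it requires Python 2 since __getslice__ was removed in Python 3): the syntax obj[i:j] prefers __getslice__ when it's defined, while an explicit slice object always goes through __getitem__.

```python
# Toy Python 2 example (not pandas code) illustrating the two dispatch paths.
class Demo(object):
    def __len__(self):
        return 10

    def __getslice__(self, i, j):
        # hit by d[:-1] -- Python 2 translates the negative bound using len()
        print "__getslice__(%d, %d)" % (i, j)

    def __getitem__(self, key):
        # hit by d[slice(None, -1)] -- the slice object is passed through untouched
        print "__getitem__(%r)" % (key,)

d = Demo()
d[:-1]              # -> __getslice__(0, 9)
d[slice(None, -1)]  # -> __getitem__(slice(None, -1, None))
```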

Is there a rationale behind this?

jreback (Contributor) commented Feb 16, 2014

I think this originally had to do with some compat with ndarray (e.g. trying to preserve the interface), but since __getslice__ is deprecated even in ndarray, it seems silly to call it. Why don't you change it and see if the tests pass and you get the perf gain (you may need to add a vbench for this).
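For reference, a rough sketch of what such a vbench entry could look like, following the vb_suite Benchmark pattern (the benchmark name and setup below are placeholders, not an existing benchmark):

```python
from vbench.benchmark import Benchmark

setup = """
import pandas.util.testing as tm
idx = tm.makeStringIndex(1000000)
"""

# time the __getitem__ path that currently pays the type-inference cost
index_getitem_slice = Benchmark("idx[slice(None, -1)]", setup,
                                name='index_getitem_slice')
```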

@jreback jreback added this to the 0.14.0 milestone Feb 16, 2014
immerrr (Contributor, Author) commented Feb 17, 2014

As for __getslice__, the problem is how Python 2 is implemented: when you do obj[i:j] it first looks for obj.__getslice__(i, j), falling back to obj.__getitem__(slice(i, j)). Since ndarray provides __getslice__ (at least in 1.7), the only thing we can do to make our custom __getitem__ apply is to override __getslice__ and reroute those "basic slicing" requests.
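Something along these lines, as a rough sketch of that rerouting (a toy ndarray subclass, not the actual Index patch):

```python
import numpy as np

class MyIndex(np.ndarray):
    """Toy ndarray subclass showing the rerouting idea (not the real Index)."""

    def __getitem__(self, key):
        # single place for any subclass-specific slicing/inference logic
        return np.ndarray.__getitem__(self, key)

    def __getslice__(self, i, j):
        # Python 2 routes obj[i:j] here first; forward it to __getitem__
        # so both spellings take the same code path
        return self.__getitem__(slice(i, j))
```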

I've rewritten this a bit, but it caused some test breakage. Will look into it later.
