Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: Increase threashold for using binary search in IndexEngine #54746

Merged
merged 4 commits into from Aug 26, 2023

Conversation

l3robot
Copy link
Contributor

@l3robot l3robot commented Aug 25, 2023

Checklist

Explanation

So after a bit of research for #54550, I found that the source of the behavior is coming from PR #22826 at this line and particularly this comment here.

In the comment, @rtlee9 makes an excellent explanation for the reason behind the threshold. The default behavior is a linear search (O(n)) in the pandas index and #22826 added a binary search optimization for monotonically increasing index which is in O(m*log(n)) where m is the number of target indexes.

Proposed Solution

I'm proposing we increase the limit, which was set conservatively to 5. We could at least increase it to n / (2 * log2[n]) which is derivable from m*log2[n]<n and dividing per 2 for keeping a certain margin. I'll open a PR for that.

New results of benchmarking + homemade script

Here is a screenshot of asv continuous -f 1.1 -E virtualenv upstream/main HEAD -b ^indexing.*Index* output (done on Macbook Air m1, inside the docker container provided by pandas):

Capture d’écran, le 2023-08-23 à 09 05 41

Here is also a new output for the script use to demonstrate the problem in #54700:

With all at once: num_indexes=1 => 0.00048s
With all at once: num_indexes=2 => 0.00045s
With all at once: num_indexes=3 => 0.00038s
With all at once: num_indexes=4 => 0.00035s
With all at once: num_indexes=5 => 0.00035s
With all at once: num_indexes=6 => 0.00045s
With all at once: num_indexes=7 => 0.00042s
With all at once: num_indexes=8 => 0.00066s
With all at once: num_indexes=9 => 0.00049s
With all at once: num_indexes=10 => 0.00048s

Thanks in advance for the review!

@l3robot l3robot requested a review from WillAyd as a code owner August 25, 2023 02:31
@@ -375,7 +375,7 @@ cdef class IndexEngine:
# map each starget to its position in the index
if (
stargets and
len(stargets) < 5 and
len(stargets) < (n / (2 * n.bit_length())) and
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bit_lenght is just a fast way of computing the ceiling of log2.

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice! I think this is good

@@ -105,7 +105,7 @@ Deprecations

Performance improvements
~~~~~~~~~~~~~~~~~~~~~~~~
-
- Increase the threshold to use binary search instead of a linear search in IndexEngine :issue:`54550`
Copy link
Member

@mroeschke mroeschke Aug 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- Increase the threshold to use binary search instead of a linear search in IndexEngine :issue:`54550`
- Performance improvement when indexing with more than 4 keys (:issue:`54550`)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks a lot for the review.
done!

@mroeschke mroeschke added Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance labels Aug 25, 2023
@l3robot l3robot changed the title Increase threashold for using binary search in IndexEngine PERF: Increase threashold for using binary search in IndexEngine Aug 25, 2023
@mroeschke mroeschke added this to the 2.2 milestone Aug 26, 2023
@mroeschke mroeschke merged commit 48f5a96 into pandas-dev:main Aug 26, 2023
38 checks passed
@mroeschke
Copy link
Member

Thanks @l3robot

@l3robot
Copy link
Contributor Author

l3robot commented Aug 26, 2023

Thanks @l3robot

Glad to contribute!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PERF: Unreliable performance of .loc with big non-unique index dataframe
4 participants