perf(rust, python): Improve explodes: offsets_to_indexes
performance
#8964
Conversation
```rust
-while value_count < *offset {
-    value_count += 1;
-    idx.push(last_idx)
+for (offset_start, offset_end) in offsets.iter().zip(offsets.iter().skip(1)) {
```
Can you replace offsets.iter().skip(1) with offsets[1..].iter()? That should be faster, as it removes a branch in the hot loop.
I can go ahead and make the change, though it doesn't appear to significantly impact performance. If I'm not wrong, offsets.iter().skip(1)
only gets called once in this function at the start of the loop, so the difference should be small, if there is one. Re-running the benchmark with this change suggests the performance difference is ±1%, which I would guess is mostly due to noise.
No, the skip will be evaluated on every iteration and introduces an unlikely branch. Branch prediction will predict it correctly, but the extra branch can prevent the compiler from applying further optimizations around it.
Looks great! Thank you @kpberry
No problem, happy to help!
I refactored offsets_to_indexes to remove the manual pointer arithmetic and the value_count counter, which yields a typical 8-15% speedup for large arrays (1,000 to 10,000,000 elements) with reasonable numbers of randomly generated offsets (between 1% and 50% of the capacity). For small arrays and small numbers of offsets, performance is not significantly impacted. I also added an early-exit check to the main loop that stops once the indexes array is already larger than the capacity, which gives an arbitrarily large speedup when the capacity is exceeded and does not affect performance in typical cases.
I have uploaded my benchmark script here, in case you want to verify the performance improvement.
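The refactor described above can be sketched roughly as follows. This is a hedged reconstruction, not the crate's actual implementation: the function name matches the PR, but the signature, index type, and exact semantics are assumptions. Each consecutive offset pair (start, end) contributes the current row index (end − start) times, and the loop exits early once the capacity has been filled.

```rust
// Sketch of the windowed-pair refactor: no manual pointer arithmetic,
// no running value_count; consecutive offsets are paired directly.
fn offsets_to_indexes(offsets: &[i64], capacity: usize) -> Vec<u32> {
    let mut idx = Vec::with_capacity(capacity);
    // Pair each offset with its successor via a slice, avoiding skip(1)'s branch.
    for (row, (start, end)) in offsets.iter().zip(offsets[1..].iter()).enumerate() {
        // Early exit: once `capacity` indexes have been produced, stop iterating
        // instead of walking the remaining offsets.
        if idx.len() >= capacity {
            break;
        }
        // A row covering (end - start) values repeats its index that many times;
        // empty rows (start == end) contribute nothing.
        for _ in 0..(end - start) as usize {
            idx.push(row as u32);
        }
    }
    idx
}

fn main() {
    // offsets [0, 2, 2, 5]: row 0 has 2 elements, row 1 is empty, row 2 has 3.
    assert_eq!(offsets_to_indexes(&[0, 2, 2, 5], 5), vec![0, 0, 2, 2, 2]);
    // Early exit: capacity reached after row 0, so row 1 is never visited.
    assert_eq!(offsets_to_indexes(&[0, 3, 6], 2), vec![0, 0, 0]);
}
```

Note that the early-exit check runs between rows, so the last row pushed may overshoot the capacity slightly; it bounds the work without truncating mid-row.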