-
-
Notifications
You must be signed in to change notification settings - Fork 49
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimization on the Repeat operation #43
Conversation
If possible, avoid the use of the copyDenseSlice function to be able to reuse the source slice.
This looks amazing. That's a crazy diff |
|
I'll fix up the |
Added tests for slow path (by creating masked copies)
@owulveryck I have pushed new code to separate out
|
@owulveryck if you like the results you see from the benchmarks please squash and merge. Thank you so much |
defaultengine_matop_misc.go
Outdated
|
||
for k := 0; k < tmp; k++ { | ||
if srcStart >= t.len() || destStart+stride > d.len() { | ||
break | ||
} | ||
copyDenseSliced(d, destStart, d.len(), t, srcStart, t.len()) | ||
if fastCopy { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking on this. Wondering if we pull fastCopy
into a separate method would it be cleaner and not have if
s in loops which makes it hard for reasoning?
thinknig
I like this new interface ! Here is the result of my bench (with
|
The
denseRepeat
method of the standard engine is callingcopyDenseSlice
method in two embedded for loops.This method is slicing the source and destination at each pass, even if the start and end indices of the source did not change.
This PR is optimizing this by avoiding the call to copyDenseSlice if possible and avoiding to create a slice array if it's not needed.
The
Memcpy
method is called directly from thedenseRepeat
method to avoid any time loss by passing values or references to the function.On top of being less greedy with the memory, this change also reduces the object creation and by consequence the pressure on the GC.
This commit includes a couple of benchmarks made from the tests.
The comparison gives the following results: