sort: improve performance of heapSort for worst case inputs #39512
Change https://golang.org/cl/237437 mentions this issue.
As a datapoint for the complexity/maintenance/speedup tradeoff: for the production profiling I have available, the time spent in and below heapSort is tiny. I think it is therefore fine to improve the speed if the actual added code is very small and obvious, but adding another function that would rarely be run, and that needs additional care to be tested because it is only exercised on adversarial inputs, might not be a good tradeoff.
@martisch -- I understand that the amount of time spent in heapSort is small. @Windsooon -- The two PRs change different things. PR1 changes …
@danielf You are right that not all cases are typical, but the non-typical cases are rare. The cost for the typical case isn't free, because the new code takes up binary size and icache lines that might previously have held more typically used code, which gets flushed when the new code runs. When we have 0.001% usage vs normal sort, then an 11% improvement is about a 0.0001% speedup overall. The code should therefore not regress the typical case by ~0.0001%, otherwise it is close to a net loss in performance. Of course this is hard to measure, and microbenchmarks won't show it. The tradeoff might still be ok here; I just want to point out that adding code also has negative performance side effects that microbenchmarks won't show. On the other side, more code also needs more maintenance.
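The arithmetic behind that estimate can be sketched as a quick back-of-the-envelope (Amdahl's-law style) calculation, using the rough figures from this comment rather than measured data:

```go
package main

import "fmt"

func main() {
	// Rough numbers from the discussion above: if heapSort accounts for
	// ~0.001% of overall sort time, then an 11% speedup of heapSort
	// improves overall sort time by only ~0.0001%.
	usage := 0.001 / 100 // fraction of sort time spent in heapSort
	speedup := 0.11      // relative improvement of heapSort itself
	fmt.Printf("overall improvement: ~%.4f%%\n", usage*speedup*100)
	// prints: overall improvement: ~0.0001%
}
```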
Hi @martisch ,
If it is true that in ALL applications we get 0.001% of … In any case, thanks for your comments!
I can't share the raw data. It would be too large and would reveal call stacks. For context: this is the relative difference of all CPU time within a fixed time window, sampled from Google's machines running Google's own services, for the sort vs heapsort portion of time (Google-wide Profiling paper: https://research.google/pubs/pub36575/). The absolute value is also tiny (I had to make the normal search more granular and exhaustive to even find it). If we improve the …
Agreed, but the same rationale can be applied to over-optimize any function. It's not unthinkable that there are applications that would benefit from not having the heapsort fallback at all, to reduce binary size. It's not unthinkable that there are applications that would benefit from a memmove implementation tuned for 48-byte copies. But adding more ifs to all the functions for specialised cases also has a negative cost for the other cases. I checked the data, and the time spent in heapsort was actually mostly caused by one binary with a spike in the last days I looked over. If there are specific applications that would benefit from a more tuned heapsort, maybe they should use a tuned heapsort directly instead of sort.Sort?
Being locked into an implementation because it has been over-optimized for edge cases is, I think, something to avoid, because it blocks other implementations that give better performance overall from being introduced, due to regressing some application benchmark. I would think that the general approach for the Go standard library should be to optimize for the common case while allowing more specialized implementations to live outside the standard library. It's fair to say that the datapoint I provided above is biased in many ways, but it is also quite a large sample over many uses of Go. If others can share data too, maybe it would show that heapsort is called much more than my data shows. This is not to say there might not exist an application in the world that would benefit from this. But then there might be better ways to apply an application-specific optimization, such as its own heap sort implementation that does not go through an interface.
@martisch, thanks for this very comprehensive reply! I agree with most (if not all) of the points raised, but I think that most of them don't apply to this CL, so I will comment below. Of course there are terms like "code simplicity" that are subjective. The general idea of the comments below is: the change will help a few, harm no one, and is simple enough to be easily maintainable.
I expected this, but I wouldn't be doing my job if I hadn't asked :)
I thought about doing the same changes in both …
I agree with the general idea of what you said, but:
My code doesn't have an "if" to decide on the algorithm. It ALWAYS uses the new function to sift down after a pop. It's NOT a specialized solution for a particular scenario.
I agree with this sentence; I just don't see it as an argument to not make …
I agree with this sentence, but I really don't think that it is a fair representation of what my CL proposes.
I agree with the sentence "the common case should not suffer in favor of the edge cases" (this is a little mantra that people in my field have). I also agree with your specialized-implementations comment. That said, there is already an implementation of … If the idea is just maintainability, then using …
The solution is simple AND performant. :D
There will be applications that benefit from it, there will be no applications that suffer from it, and it is simple, small, self-contained code.
That is where I disagree: there can be applications that suffer. It might not be a net negative here, but any change that increases instruction size has the potential to decrease performance overall (when those instructions are executed) by evicting more useful instructions from the CPU (i)caches or by needing to load more cache lines first (in this case this might be offset by the data being touched less). That is why I mentioned memmove before. It can be seen from profiling across large pools of machines that an over-optimized memmove (lots of special cases) performs better in microbenchmarks but then performs worse overall due to more icache misses (branch mispredicts can be kept lower with jump tables).
I might have reached for extremes because we talked about the extreme of unthinkables :)
True, the "if" was meant in general. For a specialisation that makes the instruction footprint larger to handle the uncommon case better, there can still be negative effects.
Mmm, I think that you are nitpicking a little bit... This code (when activated, which is rare) will be called at least 12 times (because of the insertionSort optimization), and likely considerably more (the adversarial case calls it for basically the full array). Code that is called at least 12 times (and likely way more) deserves to replace something else temporarily in the L1i cache. :) Note that all of these calls happen before any other code runs, so it's not like the code is going to be inserted into L1i and bumped out of it repeatedly. It will go in (bumping something else out), be executed a bunch of times, and then be bumped out again. The code that was bumped out will very likely still reside in L2 for the duration of the heapSort execution, so returning it to L1i shouldn't make the CPU pipeline suffer too much.
It may very well be nit-picky, but then the argument was absolute, that this cannot harm any program :) I think we largely agree on the effects that could potentially happen but disagree on where to draw the line for improving the uncommon cases in the algorithm. And I think that's fine. It would be boring if we all thought alike. Thanks for the constructive discussion.
I am good with this conclusion! Thanks also, this was great. :)
@danicat @martisch Do you think we should rewrite the quicksort function in an iterative version? In most cases, it should be faster.
As long as it does not cause any more heap allocations, you could write a prototype and post the numbers, and then we can start a discussion. If it's a very small win, it might not be worth breaking people's tests if the order of the sort changes, even though this is not a stable sort.
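As a rough illustration of what such a prototype could look like, here is a hypothetical sketch of an iterative quicksort on `[]int` using an explicit stack of index ranges instead of recursion. It is not the standard library's algorithm; names and pivot choice are illustrative, and note that the slice-backed stack below does allocate, which the comment above asks to avoid, so a real prototype would need a preallocated or fixed-size stack:

```go
package main

import "fmt"

// quicksortIter sorts a in place without recursion by keeping the pending
// index ranges on an explicit stack. Illustrative sketch only.
func quicksortIter(a []int) {
	type span struct{ lo, hi int }
	stack := []span{{0, len(a) - 1}}
	for len(stack) > 0 {
		s := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if s.lo >= s.hi {
			continue
		}
		// Lomuto partition with the last element as pivot.
		pivot, i := a[s.hi], s.lo
		for j := s.lo; j < s.hi; j++ {
			if a[j] < pivot {
				a[i], a[j] = a[j], a[i]
				i++
			}
		}
		a[i], a[s.hi] = a[s.hi], a[i]
		// Push both halves; a more careful version would push the larger
		// half first to bound the stack depth at O(log n).
		stack = append(stack, span{s.lo, i - 1}, span{i + 1, s.hi})
	}
}

func main() {
	xs := []int{5, 2, 9, 1, 5, 6}
	quicksortIter(xs)
	fmt.Println(xs) // [1 2 5 5 6 9]
}
```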
What version of Go are you using (`go version`)?

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (`go env`)?

darwin/amd64

What did you do?
I've implemented a faster version of siftDown for when it's expected that the root will end up in one of the lower levels. This is the likely outcome right after a "pop", since we move a former leaf to the root.
Benchmarks using an adversarial input (the one in sort_test.go) show that the new code is ~11.8% faster when using a struct with a complex Less, and ~5.5% faster when sorting ints.
My changes make the code slower when a lot of elements are equivalent (!Less(a, b) && !Less(b, a)), but this case is uncommon when the heapsort part of sort.go is activated.
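The idea can be sketched roughly like this. This is my reconstruction of the classic bottom-up (Floyd-style) sift-down on an `[]int` max-heap, not the actual CL (which operates on sort.Interface): the moved element is first walked all the way down along the larger child, then sifted back up, so the common post-pop case costs roughly one comparison per level instead of two:

```go
package main

import "fmt"

// siftDownToLeaf restores the max-heap property for a[root:n] after a[root]
// was replaced (e.g. by a former leaf during a pop). Phase 1 shifts the
// larger child up at every level, descending to a leaf; phase 2 sifts the
// saved value back up to its final position.
func siftDownToLeaf(a []int, root, n int) {
	v := a[root]
	i := root
	// Phase 1: descend to a leaf, always following the larger child.
	for {
		child := 2*i + 1
		if child >= n {
			break
		}
		if child+1 < n && a[child+1] > a[child] {
			child++
		}
		a[i] = a[child]
		i = child
	}
	// Phase 2: climb back up until v fits under its parent.
	for i > root {
		parent := (i - 1) / 2
		if a[parent] >= v {
			break
		}
		a[i] = a[parent]
		i = parent
	}
	a[i] = v
}

func main() {
	// a is a valid max-heap; pop the max by swapping it with the last
	// element, then sift the former leaf down -- the case being optimized.
	a := []int{9, 7, 8, 3, 4, 5, 6}
	n := len(a)
	a[0], a[n-1] = a[n-1], a[0]
	siftDownToLeaf(a, 0, n-1)
	fmt.Println(a) // [8 7 6 3 4 5 9]
}
```

Repeating the swap-and-sift step until the heap is exhausted yields a full heapsort built on this sift-down.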
Notes
The same changes can be applied to container/heap and to runtime, assuming benchmarks support them.
benchstat results