runtime: enhance map cacheline efficiency #48687
This is not an API change so taking it out of the proposal process. CC @randall77
I'm not sure you are considering the cache line advantages when the first key comparison fails but the second or third one succeeds. The current layout makes it more likely that the second and third key comparisons will be in the same cacheline.
The trouble with interleaving key/value in the bucket is it may require more alignment padding.
The hope is that only one tophash comparison matches, so you only have to look at a single key. |
@ianlancetaylor, you mean because the `tophash` values may collide? If so, I assume non-unique `tophash` values are rare. The experimental results show that non-unique values in the `tophash` array are uncommon. Whereas, with the interleaving of keys and elements, presumably most if not all (assuming good alignment to cacheline boundaries) associated keys and elements would end up in the same cacheline, but not necessarily the same cacheline as the `tophash` array.
@randall77, the experiment above seems to validate that ~ 98.65% of the time only one `tophash` comparison matches. Interestingly, the same experiment shows that if
Perhaps more interesting would be to extend
And just for completeness, extending
Did you check to see if NOT packing / aligning / padding results in performance issues, e.g. "On recent Intel processors (Sandy Bridge and Nehalem), there is no performance penalty for reading or writing misaligned memory operands" [1] (admittedly a little old for these days) ? Or, maybe opportunistically only use the simple interleaving (i.e. where
Hmmm... I might have a crack myself at demonstrating an improvement. Have never compiled Golang from scratch (in order to change `map.go`) before.

[1] https://lemire.me/blog/2012/05/31/data-alignment-for-speed-myth-or-reality/
No, I didn't. It is kind of a moot point because there are architectures we run on that do not allow misaligned accesses at all. And we do not want to have different implementations on different architectures, if we can avoid it.
We could simply interleave only for types which require no padding, but then we need to somehow fork the code based on what packing would happen. Again, not ideal code-size- and maintenance-wise. We could interleave always, with padding if needed, which would waste space but allow a single implementation. Maybe the space wasted wouldn't be too bad.
Also the garbage collector requires aligned pointers.
I'm going to have a go at such a demonstration this weekend :-) As a pre-step, today, I built Golang from scratch for the first time... yay! :-) Quick question though: If I modify only the runtime (e.g. `map.go`), do I have to run make.bash again each time to test the change?
If you modify only the runtime (not the rest of the toolchain), you don't need to run make.bash again. You can do a plain `go build` or `go test` and the go command will rebuild the modified runtime automatically.
And if you do modify the compiler, you can run `go install cmd/compile` to rebuild just the compiler.

Also, that build sounds really slow. Are you running all.bash or make.bash? (EDIT: I see you are running make.bash.) The former runs all the tests, but the latter is all you need to be able to use your installation. If you're hacking on maps I'd guess that you'll find most bugs pretty quickly, without the full test suite.
`gotip download`, which does a full build, imho takes 150s on a Mac M1. It might help to start with 1.17 as the base Go, since it has the register-based ABI enabled.
@andig I see darwin/amd64 in the snippet above. The M1 is really fast. :) Using 1.17 as a bootstrap toolchain is a fine idea, but probably unlikely to speed things up more than 5-10%. Fortunately, a full rebuild isn't necessary to play with the runtime in isolation.
Thanks for the tips about iterating faster.
But if I only run my new test and
And @andig: The M1 is fast :-)
Is it? I get this on my 2014 laptop with a Haswell Mobile processor and a spinning hard disk:
Are you sure you don't have a nosy antivirus meddling with the build?
@ALTree admittedly the laptop is full of corporate security software, and also standard Mac software doing stuff in the background... all chewing CPU :-) I guess the M1 would still have the standard Mac software in the background doing its thing too. But the Mac... Maybe a 'spinning hard disk' is not such a big issue if there's plenty of RAM?
Yeah, then it's probably OS related. In my experience |
As a side note, I switched to a GCP Linux box to get more accurate performance test runs, due to there being few background processes running on Linux versus macOS. Linux also ran
This might be caused by #48496. I'm poking at a simple fix for that now. |
@randall77 below are the results of my weekend performance experiments and how I conducted them :-) Looks like a performance improvement is demonstrated, but only for larger map keys. Unfortunately, in this experiment the interleaved keys slightly decrease performance for smaller map keys. So future work might be to repeat the experiment after introducing dynamic interleaving of map keys and values depending upon the map key size. If this can be achieved, then this experiment hints at faster map access times for maps with bigger keys: slightly faster map read times, and ~ +15% faster map write times.

Build Golang from scratch on MacOS and hack in experimental changes
[1] https://golang.org/doc/install/source

Hack
That's the tricky part, because then there's a cost of dynamically choosing which interleaving you're going to use. My experiment a while back added some fields to `maptype`.
So you could find a key given its index |
@randall77 thanks for the comment and info. That's interesting! Yes, I guess even if But I'm wondering if I can implement an opportunistic conditional interleave (without an extra data structure) purely based on the size of the key... which in theory wouldn't change the memory footprint of the data structures? I was thinking generic code like this: Create a few simple variables dependent upon having a suitably sized
Then later, there's a single expression (very similar to the current code) without an `if` branch:
In theory this mechanism would work for the maps I performance tested before. The
@simonhf seems like a reasonable thing to try. Thanks.
Follow-up: I was wrong. Using 1.17.1 as a bootstrap toolchain on an M1 is much faster. |
@randall77 so I tried out the opportunistic conditional interleave idea. Made the following changes:
Running tests on Linux

Test
Interesting. What do your benchmarks look like? Sounds good enough to pursue some more, but we always have to be wary of overfitting the benchmark. So what does the
Are there now branches when computing each of these, or are you doing something smarter? I'm imagining some x + i*y formula for each, with the coefficients x,y stored in the maptype. |
@randall77 the code for the original experiments was put in a GitHub branch, with the steps needed to make and run it detailed above. In this latest set of experiments I just updated that branch here [1]. In theory you can just grab the code and replicate my experiments on your different system. The benchmarks for the 3 different types of hash implementation can be found here [2], [3], and [4]. Sorry, I was thinking about trying to combine all 3 benchmark files into a single file. They are mostly duplicated code and very simple.
Yes, I agree. And if I am guilty of doing that then it is only subconsciously so far :-) But it would be good to have more benchmarks, for example perhaps benchmarking with different sizes of values too?
In the end I just kept it simple stupid for this experiment, as can be seen here [5]: Instead of I'm guessing that shorter It may be interesting to experiment with shorter 8 byte keys with longer values? Perhaps
There are blocks of code like this [5] in the various map functions:
This allows us to have very similar element-locating calculations to before, but without any associated
Yep, so the Looking forward to your further comments and suggestions.

[1] simonhf@9f0ed4b
Looking at the Golang source code for how `map` is implemented [1], there seems to be a possible opportunity to enhance cacheline efficiency.

The current algorithm appears to work as follows for looking up a key in a map:

1. Scan the `.tophash` array [4] for a key candidate <-- cacheline fetch 1
2. Compare the candidate's key in the key array <-- cacheline fetch 2
3. Read the associated item in the element array <-- cacheline fetch 3

So there are 3 arrays in a bucket: the `.tophash` array, the key array, and the element (value) array. Each array has the same hard-coded `bucketCnt` length of 8 items [7]. A key array item size is typically 16 bytes. This means a typical total key array size is 8 items * 16 bytes = 128 bytes, or 2x (typically sized) 64-byte cachelines. The total key array size means the associated element array item is usually located in a different cacheline than its key.

This can also be seen (padding and comments added to make it easier to read) from the way the key [5] and element [6] addresses are calculated in the source code:
Possible optimization: if the key and element arrays were interleaved (instead of kept as separate arrays), this would not guarantee that the associated `key[n]` and `element[n]` are always in the same cacheline, but often they would be :-)

As an example, the interleaved version of the above `k` and `e` assignments would look something like this:

In theory, performance would more often be better due to fewer CPU cacheline fetches, and otherwise be the same as before?
[1] https://github.com/golang/go/blob/go1.17.1/src/runtime/map.go#L532
[2] https://github.com/golang/go/blob/go1.17.1/src/runtime/map.go#L515
[3] https://github.com/golang/go/blob/go1.17.1/src/runtime/map.go#L517
[4] https://github.com/golang/go/blob/go1.17.1/src/runtime/map.go#L532
[5] https://github.com/golang/go/blob/go1.17.1/src/runtime/map.go#L542
[6] https://github.com/golang/go/blob/go1.17.1/src/runtime/map.go#L543
[7] https://github.com/golang/go/blob/go1.17.1/src/runtime/map.go#L66
[8] https://github.com/golang/go/blob/go1.17.1/src/runtime/map.go#L538
[9] https://github.com/golang/go/blob/go1.17.1/src/runtime/map.go#L543