
# runtime: enhance map cacheline efficiency via prefetch #52902

Open · opened this issue May 13, 2022 · 4 comments
Labels: NeedsInvestigation, Performance
### simonhf commented May 13, 2022

I was watching this video on the Golang map design. It says the 6.5 load factor is used and that this can cause `overflow` buckets, which made me wonder: how many overflow buckets are there in a typical map? So I wrote this one-liner to simulate the number of overflow buckets for maps with increasing numbers of fixed, non-overflow buckets. For each map size it adds "keys" until the 6.5 load factor is reached (usually resulting in the map being 81% full, at which point a real map would expand and double in size). Upon reaching the 6.5 load factor, the number of 1st-level overflow buckets is shown, as well as the 2nd and 3rd levels:

```perl
# originally run as a shell one-liner: perl -e '...'
$| ++;
use Digest::SHA;
foreach $bits (4..26) {
    $buckets = 1 << $bits;
    $slots   = $buckets * 7;
    printf qq[- %2d=bits %10d=buckets %10d=slots ], $bits, $buckets, $slots;
    foreach $seed (1..1) {
        $a = chr(0) x $buckets;
        $i = 1;
        while (1) {
            $h = hex(substr(Digest::SHA::sha256_hex($i . $seed . q[foobar]), 0, 8));
            $b = ($h & ($buckets - 1));
            last if ($i >= $slots * 81.25 / 100);
            substr($a, $b, 1) = chr(1 + ord(substr($a, $b, 1)));
            $i ++;
        }
        printf qq[.];
        push @p, int($i / $slots * 100);
    }
    foreach (sort { $a <=> $b } @p) { printf qq[ %2d%%], $_; }
    undef @p;
    $over1 = 0; $over2 = 0; $over3 = 0;
    foreach $b (0..($buckets - 1)) {
        if (ord(substr($a, $b, 1)) >  8) { $over1 ++; }
        if (ord(substr($a, $b, 1)) > 16) { $over2 ++; }
        if (ord(substr($a, $b, 1)) > 24) { $over3 ++; }
    }
    printf qq[ e.g. %10d=i %10d=over1 or %4.1f%% %4d=over2 %d=over3], $i, $over1, $over1 / $buckets * 100, $over2, $over3;
    printf qq[\n];
}
```

```
-  4=bits         16=buckets        112=slots .  81% e.g.         91=i          2=over1 or 12.5%    0=over2 0=over3
-  5=bits         32=buckets        224=slots .  81% e.g.        182=i          4=over1 or 12.5%    0=over2 0=over3
-  6=bits         64=buckets        448=slots .  81% e.g.        364=i          7=over1 or 10.9%    0=over2 0=over3
-  7=bits        128=buckets        896=slots .  81% e.g.        728=i         12=over1 or  9.4%    0=over2 0=over3
-  8=bits        256=buckets       1792=slots .  81% e.g.       1456=i         33=over1 or 12.9%    0=over2 0=over3
-  9=bits        512=buckets       3584=slots .  81% e.g.       2912=i         62=over1 or 12.1%    0=over2 0=over3
- 10=bits       1024=buckets       7168=slots .  81% e.g.       5824=i        130=over1 or 12.7%    0=over2 0=over3
- 11=bits       2048=buckets      14336=slots .  81% e.g.      11648=i        259=over1 or 12.6%    1=over2 0=over3
- 12=bits       4096=buckets      28672=slots .  81% e.g.      23296=i        514=over1 or 12.5%    2=over2 0=over3
- 13=bits       8192=buckets      57344=slots .  81% e.g.      46592=i        982=over1 or 12.0%    0=over2 0=over3
- 14=bits      16384=buckets     114688=slots .  81% e.g.      93184=i       2003=over1 or 12.2%    2=over2 0=over3
- 15=bits      32768=buckets     229376=slots .  81% e.g.     186368=i       4021=over1 or 12.3%    5=over2 0=over3
- 16=bits      65536=buckets     458752=slots .  81% e.g.     372736=i       7948=over1 or 12.1%    5=over2 0=over3
- 17=bits     131072=buckets     917504=slots .  81% e.g.     745472=i      16071=over1 or 12.3%   12=over2 0=over3
- 18=bits     262144=buckets    1835008=slots .  81% e.g.    1490944=i      32169=over1 or 12.3%   19=over2 0=over3
- 19=bits     524288=buckets    3670016=slots .  81% e.g.    2981888=i      63855=over1 or 12.2%   50=over2 0=over3
- 20=bits    1048576=buckets    7340032=slots .  81% e.g.    5963776=i     127938=over1 or 12.2%  103=over2 0=over3
- 21=bits    2097152=buckets   14680064=slots .  81% e.g.   11927552=i     256563=over1 or 12.2%  162=over2 0=over3
- 22=bits    4194304=buckets   29360128=slots .  81% e.g.   23855104=i     511457=over1 or 12.2%  414=over2 0=over3
- 23=bits    8388608=buckets   58720256=slots .  81% e.g.   47710208=i    1025173=over1 or 12.2%  795=over2 0=over3
- 24=bits   16777216=buckets  117440512=slots .  81% e.g.   95420416=i    2051391=over1 or 12.2% 1578=over2 0=over3
- 25=bits   33554432=buckets  234881024=slots .  81% e.g.  190840832=i    4103432=over1 or 12.2% 3172=over2 0=over3
- 26=bits   67108864=buckets  469762048=slots .  81% e.g.  381681664=i    8201492=over1 or 12.2% 6258=over2 0=over3
```

At the point of reaching the 6.5 load factor there are generally about 12% 1st-level overflow buckets in the simulation.
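For Go readers, the same simulation can be sketched in Go. This is an illustrative port, not the runtime's real hashing: `simulate` is a hypothetical helper, the key format and SHA-256 bucket selection closely mirror the one-liner, and the >8 / >16 thresholds stand in for 1st- and 2nd-level overflow:

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// simulate fills 2^bits buckets with as many keys as the one-liner does
// (about 5.69 keys per bucket) and counts buckets whose occupancy would
// spill into 1st-level (>8 entries) and 2nd-level (>16) overflow buckets.
func simulate(bits uint) (over1, over2 int) {
	n := 1 << bits
	counts := make([]int, n)
	keys := int(float64(n*7) * 0.8125) // same key count as the Perl simulation
	for i := 0; i < keys; i++ {
		// key text "11foobar", "21foobar", ... like the one-liner's $i.$seed."foobar"
		sum := sha256.Sum256([]byte(fmt.Sprintf("%d1foobar", i+1)))
		// first 8 hex digits of the digest, masked down to a bucket index
		b := int(binary.BigEndian.Uint32(sum[:4])) & (n - 1)
		counts[b]++
	}
	for _, c := range counts {
		if c > 8 {
			over1++
		}
		if c > 16 {
			over2++
		}
	}
	return over1, over2
}

func main() {
	over1, over2 := simulate(12) // 4096 buckets
	fmt.Printf("4096 buckets: %d over1 (%.1f%%), %d over2\n",
		over1, float64(over1)/4096*100, over2)
}
```

With uniform hashing this lands near the ~12% 1st-level overflow figure the Perl run reports.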
Which presumably means about 1 in 8 bucket accesses might have to read an overflow bucket to find the key? Worse, as the number of buckets increases, the number of 2nd-level overflow buckets also increases, though it stays minor compared to the number of 1st-level overflow buckets. So if the programmer is interested in absolute worst-case key lookup times, there will be a very small percentage of keys whose lookup takes longer than the others, due to having to scan not only a 1st-level overflow bucket but a 2nd-level one too?

The key lookup code is still very much like in the presentation and has these two loops:

```go
hash := t.hasher(key, uintptr(h.hash0))
m := bucketMask(h.B)
b := (*bmap)(add(h.buckets, (hash&m)*uintptr(t.bucketsize))) // regular bucket
...
for ; b != nil; b = b.overflow(t) {
	for i := uintptr(0); i < bucketCnt; i++ {
		if b.tophash[i] != top { // delay point without prefetch
			...
		}
	}
	// come here if key not found in bucket; loop to nth overflow bucket
}
```

I was thinking that this would be the perfect opportunity to use a CPU prefetch instruction -- if it exists in Golang... does it? -- and the pseudo-code would look something like:

```go
for ; b != nil; b = b.overflow(t) {
	b_next := b.overflow(t)
	if b_next != nil {
		__builtin_prefetch(b_next, 0, 0) // prefetch the next overflow bucket's cacheline
	}
	for i := uintptr(0); i < bucketCnt; i++ {
		if b.tophash[i] != top { // delay point without prefetch
			...
		}
	}
	// come here if key not found in bucket; loop to nth overflow bucket
}
```

Why would it be the perfect opportunity? Because generally `__builtin_prefetch()` (from GCC in this case) is difficult to use in code: depending on how fast the RAM is, there is a sizeable and hard-to-predict async delay after executing the instruction before the desired cacheline is cached and ready to be used. Most code does not know which memory address it will need to access that far in the future, and thus it is notoriously difficult to use the prefetch instruction in real code.

In this case -- for the regular code in use today, without prefetching -- this async delay becomes a sync delay the first time the cacheline containing address `b` is accessed, e.g. at `b.tophash[i]`. However, this code is always going to loop over 8 bucket items before accessing the next overflow `b`. So if the time to fetch the overflow `b` cacheline is x, the time to execute the prefetch instruction is y, and the time to loop over the 8 bucket items is z, then using the prefetch instruction we stand to save roughly min(x, z) - y time per overflow bucket, since the scan hides that much of the fetch :-)

I would love to try this out, but does Golang already implement a prefetch instruction? I have searched and not found one. If not, how would one go about implementing it? Presumably, if it existed, this could speed up roughly 1 in 8 bucket reads at the current load factor for a fuller map? Thoughts?

*changed the title ~~untime: enhance map cacheline efficiency via prefetch~~ runtime: enhance map cacheline efficiency via prefetch May 13, 2022*
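To make the pseudo-code concrete outside the runtime, here is a self-contained sketch of the same chain-walk. The `bucket` type only mimics the runtime's `bmap` in spirit, and `prefetch` is a no-op stand-in for a real prefetch intrinsic; both names are illustrative, not part of the runtime:

```go
package main

import "fmt"

const bucketCnt = 8 // slots per bucket, as in the runtime

// bucket loosely mimics the runtime's bmap: 8 tophash bytes,
// 8 keys, and a pointer to an overflow bucket.
type bucket struct {
	tophash  [bucketCnt]uint8
	keys     [bucketCnt]uint64
	overflow *bucket
}

// prefetch is a placeholder for a real prefetch intrinsic;
// a real implementation would issue a non-blocking cacheline fetch.
func prefetch(p *bucket) {}

// lookup walks the bucket chain, issuing the (stub) prefetch for the
// next overflow bucket before scanning the current one, so the fetch
// latency would overlap the 8-slot scan.
func lookup(b *bucket, top uint8, key uint64) bool {
	for ; b != nil; b = b.overflow {
		if next := b.overflow; next != nil {
			prefetch(next) // start fetching while we scan this bucket
		}
		for i := 0; i < bucketCnt; i++ {
			if b.tophash[i] != top {
				continue // delay point without prefetch
			}
			if b.keys[i] == key {
				return true
			}
		}
	}
	return false
}

func main() {
	over := &bucket{}
	over.tophash[0], over.keys[0] = 42, 12345
	root := &bucket{overflow: over}
	fmt.Println(lookup(root, 42, 12345)) // true: found in the overflow bucket
}
```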

*ianlancetaylor added the NeedsInvestigation label May 13, 2022*

### randall77 commented May 13, 2022

We do have a prefetch you can try: it is `runtime/internal/sys.Prefetch`.

I'm not entirely convinced this would be worth it. A given map is probably operating below the max load factor, making overflows less likely; running at max load factor is kind of a worst case. Even if a bucket has an overflow, it probably holds only a few elements. Say it holds 2: then there is an 8/10 chance on an existing-key lookup that we won't have to look in the overflow, despite its existence. (On a missing-key lookup, or an insert, we would, though.) Every access would need to execute the instructions needed to compute and conditionally issue the prefetch, even if the overflow doesn't exist or is not needed. And successful prefetches which aren't needed just pollute the cache; in cases where prefetching might help, cache space is likely a scarce resource.

That said, when the prefetch is useful it can save lots of cycles. Definitely worth an experiment.
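The below-max-load point can be roughly quantified: with uniform hashing, a bucket's occupancy is approximately Poisson-distributed with mean equal to the load factor, so the chance of spilling past 8 slots drops off sharply at lower loads. This is a back-of-the-envelope model, not measured runtime behavior, and `pOverflow` is a hypothetical helper:

```go
package main

import (
	"fmt"
	"math"
)

// pOverflow returns the Poisson-model probability that a bucket holds
// more than 8 entries when the average load is lambda entries/bucket,
// i.e. 1 - P(X <= 8) for X ~ Poisson(lambda).
func pOverflow(lambda float64) float64 {
	p := math.Exp(-lambda) // P(X = 0)
	cum := p
	for k := 1; k <= 8; k++ {
		p *= lambda / float64(k) // P(X = k) from P(X = k-1)
		cum += p
	}
	return 1 - cum
}

func main() {
	// 5.6875 is the effective keys-per-bucket of the simulation above;
	// 6.5 is the runtime's max load factor.
	for _, lf := range []float64{2, 4, 5.6875, 6.5} {
		fmt.Printf("load %.4f: %5.1f%% of buckets overflow\n", lf, 100*pOverflow(lf))
	}
}
```

At a load of 5.6875 the model gives roughly the ~12% overflow rate seen in the simulation, while at a load of 2 overflows are essentially nonexistent, supporting the point that maps operating below max load factor rarely need the overflow chain.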

### simonhf commented May 13, 2022

@randall77 good points, and thanks for the tip about `runtime/internal/sys.Prefetch` :-)

> Running at max load factor is kind of a worst case

Presumably it wouldn't have to run at max load factor, though? The overflow buckets already increase in number long before the max load factor is reached. Isn't it also the normal case for any map that grows bigger, i.e. a growing map will reach the max load factor perhaps many times while growing?

> Every access would need to execute the instructions needed to compute and conditionally issue the prefetch, even if the overflow doesn't exist, or is not needed.

This gave me the thought that maybe there could be an extra conditional so that this business logic only kicks in once the map reaches a certain load factor? Presumably the current load factor is relatively cheap to compute? That way, the overhead for access to a not heavily loaded map would be a single simple `if` statement, with no "instructions needed to compute and conditionally issue the prefetch" :-)

> Definitely worth an experiment

Maybe the experiment could be as follows:

Step 1: Time adding n more keys to a map, then time reading all map keys via key lookup.
Step 2: Loop back to step 1 until a map key cap is reached.

Run this with and without the prefetch mechanism, and compare both the overall times and the times for each step 1 operation. Presumably this experiment would let us compare performance at different load levels? What do you think?
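The proposed experiment, minus the prefetch side (which would need a patched runtime to compare against), could be sketched like this; `step` and `maxKeys` are arbitrary choices for illustration:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const step = 1 << 16    // keys added per iteration (step 1)
	const maxKeys = 1 << 20 // map key cap (step 2's stop condition)

	m := make(map[uint64]uint64)
	for n := uint64(step); n <= maxKeys; n += step {
		// Step 1a: time inserting the next batch of keys.
		start := time.Now()
		for k := n - step; k < n; k++ {
			m[k] = k
		}
		insert := time.Since(start)

		// Step 1b: time a full pass of existing-key lookups.
		start = time.Now()
		var sum uint64
		for k := uint64(0); k < n; k++ {
			sum += m[k]
		}
		lookup := time.Since(start)

		fmt.Printf("n=%8d insert=%10v lookup=%10v (checksum %d)\n",
			n, insert, lookup, sum)
	}
}
```

Running the same binary against a prefetching and a non-prefetching runtime build would then give per-load-level insert and lookup timings to compare.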