As this is my first foray into optimizing memory accesses and whatnot in Go, I'll document my findings here (also to flush my mental cache).
It seems that the problem is with memory and the memory caches. Using perf and the perf-test.sh tool, cay has a better TLB rate (the TLB metric group) and a higher instructions-per-cycle count (the default perf summary), but worse memory-cache usage.
Thus, I need to look into memory locality and whether the cay map is way too sparse. Maybe look into the fill rate?
(No metrics or metric group given to perf.) Check the insn per cycle (instructions per cycle) figure; higher is better. Except for the 8-entry case, it is higher for cay than for the builtin map. Furthermore, branch-misses are significantly lower (up to ~10x) for cay in all cases (again except 8).
$ ./perf-test.sh
# Benchmark size 8
Performance counter stats for './cay.test -test.run=^$ -test.cpu 1 -test.count 1 -test.bench read_id/8/cay':
2,783.53 msec task-clock # 1.026 CPUs utilized
1,584 context-switches # 0.569 K/sec
302 cpu-migrations # 0.108 K/sec
8,032 page-faults # 0.003 M/sec
8,830,482,141 cycles # 3.172 GHz
19,640,426,332 instructions # 2.22 insn per cycle
3,136,398,887 branches # 1126.770 M/sec
10,685,521 branch-misses # 0.34% of all branches
Performance counter stats for './cay.test -test.run=^$ -test.cpu 1 -test.count 1 -test.bench read_id/8/built':
2,530.04 msec task-clock # 1.031 CPUs utilized
1,518 context-switches # 0.600 K/sec
271 cpu-migrations # 0.107 K/sec
7,503 page-faults # 0.003 M/sec
8,041,054,439 cycles # 3.178 GHz
23,121,725,391 instructions # 2.88 insn per cycle
4,859,536,122 branches # 1920.733 M/sec
2,037,432 branch-misses # 0.04% of all branches
# Benchmark size 1k
Performance counter stats for './cay.test -test.run=^$ -test.cpu 1 -test.count 1 -test.bench read_id/1k/cay':
2,688.34 msec task-clock # 1.033 CPUs utilized
1,513 context-switches # 0.563 K/sec
292 cpu-migrations # 0.109 K/sec
8,183 page-faults # 0.003 M/sec
8,549,802,679 cycles # 3.180 GHz
19,139,506,643 instructions # 2.24 insn per cycle
3,029,410,736 branches # 1126.869 M/sec
4,031,192 branch-misses # 0.13% of all branches
Performance counter stats for './cay.test -test.run=^$ -test.cpu 1 -test.count 1 -test.bench read_id/1k/built':
2,885.53 msec task-clock # 1.029 CPUs utilized
1,942 context-switches # 0.673 K/sec
272 cpu-migrations # 0.094 K/sec
7,385 page-faults # 0.003 M/sec
9,178,403,154 cycles # 3.181 GHz
16,996,576,330 instructions # 1.85 insn per cycle
2,965,568,826 branches # 1027.738 M/sec
33,831,083 branch-misses # 1.14% of all branches
# Benchmark size 32k
Performance counter stats for './cay.test -test.run=^$ -test.cpu 1 -test.count 1 -test.bench read_id/32k/cay':
2,727.82 msec task-clock # 1.033 CPUs utilized
1,463 context-switches # 0.536 K/sec
177 cpu-migrations # 0.065 K/sec
8,134 page-faults # 0.003 M/sec
8,726,841,458 cycles # 3.199 GHz
16,212,499,857 instructions # 1.86 insn per cycle
2,561,185,225 branches # 938.913 M/sec
3,515,472 branch-misses # 0.14% of all branches
Performance counter stats for './cay.test -test.run=^$ -test.cpu 1 -test.count 1 -test.bench read_id/32k/built':
2,680.82 msec task-clock # 1.033 CPUs utilized
1,656 context-switches # 0.618 K/sec
193 cpu-migrations # 0.072 K/sec
8,738 page-faults # 0.003 M/sec
8,522,265,149 cycles # 3.179 GHz
14,409,612,866 instructions # 1.69 insn per cycle
2,459,666,444 branches # 917.506 M/sec
31,985,482 branch-misses # 1.30% of all branches
# Benchmark size 512k
Performance counter stats for './cay.test -test.run=^$ -test.cpu 1 -test.count 1 -test.bench read_id/512k/cay':
3,502.46 msec task-clock # 1.024 CPUs utilized
1,844 context-switches # 0.526 K/sec
342 cpu-migrations # 0.098 K/sec
19,913 page-faults # 0.006 M/sec
10,549,826,964 cycles # 3.012 GHz
15,326,648,502 instructions # 1.45 insn per cycle
2,645,720,701 branches # 755.388 M/sec
6,797,453 branch-misses # 0.26% of all branches
Performance counter stats for './cay.test -test.run=^$ -test.cpu 1 -test.count 1 -test.bench read_id/512k/built':
3,206.88 msec task-clock # 1.026 CPUs utilized
1,511 context-switches # 0.471 K/sec
249 cpu-migrations # 0.078 K/sec
20,389 page-faults # 0.006 M/sec
10,039,149,266 cycles # 3.131 GHz
14,010,827,099 instructions # 1.40 insn per cycle
2,426,420,176 branches # 756.630 M/sec
19,769,838 branch-misses # 0.81% of all branches
# Benchmark size 1m
Performance counter stats for './cay.test -test.run=^$ -test.cpu 1 -test.count 1 -test.bench read_id/1m/cay':
4,092.63 msec task-clock # 1.018 CPUs utilized
1,839 context-switches # 0.449 K/sec
367 cpu-migrations # 0.090 K/sec
31,515 page-faults # 0.008 M/sec
12,407,871,126 cycles # 3.032 GHz
18,415,418,260 instructions # 1.48 insn per cycle
3,352,476,604 branches # 819.151 M/sec
10,492,666 branch-misses # 0.31% of all branches
Performance counter stats for './cay.test -test.run=^$ -test.cpu 1 -test.count 1 -test.bench read_id/1m/built':
3,665.66 msec task-clock # 1.022 CPUs utilized
1,901 context-switches # 0.519 K/sec
380 cpu-migrations # 0.104 K/sec
30,913 page-faults # 0.008 M/sec
11,654,925,508 cycles # 3.179 GHz
15,773,511,313 instructions # 1.35 insn per cycle
2,810,254,603 branches # 766.643 M/sec
21,821,482 branch-misses # 0.78% of all branches
Here we run with the TLB perf metric group. Looking below, the dtlb_load_misses.walk_pending metric is lower for cay, especially for the big maps. E.g., for the 1M case cay is at 88,098,107, whereas the builtin is at 1,009,407,450.
Thus, it does not seem that the TLB is the culprit. The iTLB does not look very different, and neither does the dTLB for store misses.
$ ./perf-test.sh TLB
# Benchmark size 8
Performance counter stats for './cay.test -test.run=^$ -test.cpu 1 -test.count 1 -test.bench read_id/8/cay':
24,531,832 itlb_misses.walk_pending # 0.00 Page_Walks_Utilization
2,710,554 dtlb_store_misses.walk_pending
8,672,852,677 cycles
8,993,497 dtlb_load_misses.walk_pending
1 ept.walk_pending
Performance counter stats for './cay.test -test.run=^$ -test.cpu 1 -test.count 1 -test.bench read_id/8/built':
24,878,419 itlb_misses.walk_pending # 0.00 Page_Walks_Utilization
2,825,332 dtlb_store_misses.walk_pending
11,432,187,042 cycles
10,390,090 dtlb_load_misses.walk_pending
1 ept.walk_pending
# Benchmark size 1k
Performance counter stats for './cay.test -test.run=^$ -test.cpu 1 -test.count 1 -test.bench read_id/1k/cay':
23,910,628 itlb_misses.walk_pending # 0.00 Page_Walks_Utilization
2,647,062 dtlb_store_misses.walk_pending
8,806,311,421 cycles
8,881,615 dtlb_load_misses.walk_pending
2 ept.walk_pending
Performance counter stats for './cay.test -test.run=^$ -test.cpu 1 -test.count 1 -test.bench read_id/1k/built':
23,542,237 itlb_misses.walk_pending # 0.00 Page_Walks_Utilization
2,460,400 dtlb_store_misses.walk_pending
8,636,039,813 cycles
9,985,277 dtlb_load_misses.walk_pending
0 ept.walk_pending
# Benchmark size 32k
Performance counter stats for './cay.test -test.run=^$ -test.cpu 1 -test.count 1 -test.bench read_id/32k/cay':
24,763,302 itlb_misses.walk_pending # 0.00 Page_Walks_Utilization
2,866,027 dtlb_store_misses.walk_pending
11,041,198,654 cycles
10,393,644 dtlb_load_misses.walk_pending
2 ept.walk_pending
Performance counter stats for './cay.test -test.run=^$ -test.cpu 1 -test.count 1 -test.bench read_id/32k/built':
24,418,387 itlb_misses.walk_pending # 0.00 Page_Walks_Utilization
3,238,301 dtlb_store_misses.walk_pending
8,609,729,426 cycles
18,096,090 dtlb_load_misses.walk_pending
0 ept.walk_pending
# Benchmark size 512k
Performance counter stats for './cay.test -test.run=^$ -test.cpu 1 -test.count 1 -test.bench read_id/512k/cay':
25,283,132 itlb_misses.walk_pending # 0.00 Page_Walks_Utilization
5,555,387 dtlb_store_misses.walk_pending
10,472,566,848 cycles
45,438,830 dtlb_load_misses.walk_pending
0 ept.walk_pending
Performance counter stats for './cay.test -test.run=^$ -test.cpu 1 -test.count 1 -test.bench read_id/512k/built':
26,503,918 itlb_misses.walk_pending # 0.04 Page_Walks_Utilization
6,113,194 dtlb_store_misses.walk_pending
10,058,469,569 cycles
830,822,775 dtlb_load_misses.walk_pending
1 ept.walk_pending
# Benchmark size 1m
Performance counter stats for './cay.test -test.run=^$ -test.cpu 1 -test.count 1 -test.bench read_id/1m/cay':
26,309,652 itlb_misses.walk_pending # 0.00 Page_Walks_Utilization
9,421,192 dtlb_store_misses.walk_pending
12,705,813,840 cycles
88,098,107 dtlb_load_misses.walk_pending
1 ept.walk_pending
Performance counter stats for './cay.test -test.run=^$ -test.cpu 1 -test.count 1 -test.bench read_id/1m/built':
26,283,712 itlb_misses.walk_pending # 0.05 Page_Walks_Utilization
11,332,328 dtlb_store_misses.walk_pending
11,453,552,268 cycles
1,009,407,450 dtlb_load_misses.walk_pending
2 ept.walk_pending
The cache misses for cay look worse than for the builtin, and that might be the reason for the latency differences. Check out the L*MPKI values, which mean (taken from perf list):
- L1MPKI: [L1 cache true misses per kilo instruction for retired demand loads]
- L2MPKI: [L2 cache true misses per kilo instruction for retired demand loads]
- L3MPKI: [L3 cache true misses per kilo instruction for retired demand loads]
- L2HPKI_All: [L2 cache hits per kilo instruction for all request types (including speculative)]
- L2MPKI_All: [L2 cache misses per kilo instruction for all request types (including speculative)]
For the demand-load MPKI values (L1/L2/L3), cay is worse essentially across the board. For the 1m case:
- L1MPKI is 44% higher/worse for cay
- L2MPKI is 50% higher/worse for cay
- L3MPKI is 13% higher/worse for cay.
$ ./perf-test.sh Cache_Misses
# Benchmark size 8
Performance counter stats for './cay.test -test.run=^$ -test.cpu 1 -test.count 1 -test.bench read_id/8/cay':
20,985,177,035 inst_retired.any # 0.98 L2MPKI_All
# 0.88 L2HPKI_All (53.94%)
20,592,361 l2_rqsts.miss (54.28%)
39,163,824 l2_rqsts.references (54.95%)
21,007,535,636 inst_retired.any # 0.20 L1MPKI (55.59%)
4,234,160 mem_load_retired.l1_miss (56.61%)
20,820,691,960 inst_retired.any # 0.12 L2MPKI (57.11%)
2,412,020 mem_load_retired.l2_miss (56.71%)
20,812,248,529 inst_retired.any # 0.09 L3MPKI (55.86%)
1,899,423 mem_load_retired.l3_miss (54.95%)
Performance counter stats for './cay.test -test.run=^$ -test.cpu 1 -test.count 1 -test.bench read_id/8/built':
23,164,212,366 inst_retired.any # 0.84 L2MPKI_All
# 0.78 L2HPKI_All (54.26%)
19,550,786 l2_rqsts.miss (54.75%)
37,516,125 l2_rqsts.references (55.41%)
23,060,758,516 inst_retired.any # 0.21 L1MPKI (55.96%)
4,875,197 mem_load_retired.l1_miss (56.63%)
23,011,803,612 inst_retired.any # 0.11 L2MPKI (56.86%)
2,507,324 mem_load_retired.l2_miss (56.19%)
23,161,190,056 inst_retired.any # 0.08 L3MPKI (55.32%)
1,864,263 mem_load_retired.l3_miss (54.61%)
# Benchmark size 1k
Performance counter stats for './cay.test -test.run=^$ -test.cpu 1 -test.count 1 -test.bench read_id/1k/cay':
19,559,433,538 inst_retired.any # 2.13 L2MPKI_All
# 16.68 L2HPKI_All (55.08%)
41,616,379 l2_rqsts.miss (55.48%)
367,807,837 l2_rqsts.references (55.76%)
19,551,157,782 inst_retired.any # 5.32 L1MPKI (56.02%)
103,936,345 mem_load_retired.l1_miss (56.31%)
19,713,489,843 inst_retired.any # 0.22 L2MPKI (55.96%)
4,267,553 mem_load_retired.l2_miss (55.53%)
19,667,423,864 inst_retired.any # 0.09 L3MPKI (55.13%)
1,836,495 mem_load_retired.l3_miss (54.73%)
Performance counter stats for './cay.test -test.run=^$ -test.cpu 1 -test.count 1 -test.bench read_id/1k/built':
16,875,829,573 inst_retired.any # 2.06 L2MPKI_All
# 19.84 L2HPKI_All (54.59%)
34,758,222 l2_rqsts.miss (55.46%)
369,643,159 l2_rqsts.references (55.93%)
16,834,950,967 inst_retired.any # 4.49 L1MPKI (56.15%)
75,571,486 mem_load_retired.l1_miss (56.36%)
16,834,082,172 inst_retired.any # 0.23 L2MPKI (56.40%)
3,898,075 mem_load_retired.l2_miss (55.46%)
16,898,834,396 inst_retired.any # 0.11 L3MPKI (54.92%)
1,830,326 mem_load_retired.l3_miss (54.73%)
# Benchmark size 32k
Performance counter stats for './cay.test -test.run=^$ -test.cpu 1 -test.count 1 -test.bench read_id/32k/cay':
15,723,857,870 inst_retired.any # 22.51 L2MPKI_All
# 5.88 L2HPKI_All (54.17%)
353,966,152 l2_rqsts.miss (54.50%)
446,495,417 l2_rqsts.references (55.16%)
15,768,811,226 inst_retired.any # 6.05 L1MPKI (55.80%)
95,393,069 mem_load_retired.l1_miss (56.42%)
15,807,752,488 inst_retired.any # 4.41 L2MPKI (57.01%)
69,643,967 mem_load_retired.l2_miss (56.40%)
15,790,548,309 inst_retired.any # 0.25 L3MPKI (55.68%)
3,940,655 mem_load_retired.l3_miss (54.87%)
Performance counter stats for './cay.test -test.run=^$ -test.cpu 1 -test.count 1 -test.bench read_id/32k/built':
14,367,825,777 inst_retired.any # 23.20 L2MPKI_All
# 5.44 L2HPKI_All (54.97%)
333,336,155 l2_rqsts.miss (55.40%)
411,510,326 l2_rqsts.references (55.96%)
14,397,381,463 inst_retired.any # 5.36 L1MPKI (56.18%)
77,111,961 mem_load_retired.l1_miss (56.47%)
14,473,547,929 inst_retired.any # 4.15 L2MPKI (56.04%)
60,061,565 mem_load_retired.l2_miss (55.48%)
14,415,442,764 inst_retired.any # 0.19 L3MPKI (54.86%)
2,790,478 mem_load_retired.l3_miss (54.64%)
# Benchmark size 512k
Performance counter stats for './cay.test -test.run=^$ -test.cpu 1 -test.count 1 -test.bench read_id/512k/cay':
15,631,316,763 inst_retired.any # 16.93 L2MPKI_All
# 5.14 L2HPKI_All (55.23%)
264,593,214 l2_rqsts.miss (55.83%)
344,995,786 l2_rqsts.references (56.15%)
15,654,062,627 inst_retired.any # 6.02 L1MPKI (56.29%)
94,255,979 mem_load_retired.l1_miss (56.36%)
15,605,701,802 inst_retired.any # 4.26 L2MPKI (55.65%)
66,489,043 mem_load_retired.l2_miss (55.14%)
15,547,654,451 inst_retired.any # 2.24 L3MPKI (54.74%)
34,758,323 mem_load_retired.l3_miss (54.60%)
Performance counter stats for './cay.test -test.run=^$ -test.cpu 1 -test.count 1 -test.bench read_id/512k/built':
14,072,355,231 inst_retired.any # 18.83 L2MPKI_All
# 5.49 L2HPKI_All (55.16%)
264,999,055 l2_rqsts.miss (55.27%)
342,252,958 l2_rqsts.references (55.55%)
13,892,173,635 inst_retired.any # 4.55 L1MPKI (55.72%)
63,269,904 mem_load_retired.l1_miss (56.14%)
14,016,947,465 inst_retired.any # 3.00 L2MPKI (56.00%)
41,990,744 mem_load_retired.l2_miss (55.67%)
14,057,421,320 inst_retired.any # 1.77 L3MPKI (55.51%)
24,931,461 mem_load_retired.l3_miss (54.98%)
# Benchmark size 1m
Performance counter stats for './cay.test -test.run=^$ -test.cpu 1 -test.count 1 -test.bench read_id/1m/cay':
18,198,975,943 inst_retired.any # 19.11 L2MPKI_All
# 5.08 L2HPKI_All (55.15%)
347,763,047 l2_rqsts.miss (55.46%)
440,222,073 l2_rqsts.references (55.78%)
18,713,800,265 inst_retired.any # 7.14 L1MPKI (56.00%)
133,637,509 mem_load_retired.l1_miss (56.31%)
18,731,691,340 inst_retired.any # 5.20 L2MPKI (55.90%)
97,471,680 mem_load_retired.l2_miss (55.49%)
18,294,134,138 inst_retired.any # 1.97 L3MPKI (55.08%)
36,078,463 mem_load_retired.l3_miss (54.84%)
Performance counter stats for './cay.test -test.run=^$ -test.cpu 1 -test.count 1 -test.bench read_id/1m/built':
15,527,809,018 inst_retired.any # 20.74 L2MPKI_All
# 5.44 L2HPKI_All (54.94%)
322,099,956 l2_rqsts.miss (55.41%)
406,510,919 l2_rqsts.references (55.75%)
16,020,021,219 inst_retired.any # 4.95 L1MPKI (56.18%)
79,351,682 mem_load_retired.l1_miss (56.51%)
16,109,739,903 inst_retired.any # 3.45 L2MPKI (56.07%)
55,573,102 mem_load_retired.l2_miss (55.47%)
15,846,374,536 inst_retired.any # 1.74 L3MPKI (55.09%)
27,608,897 mem_load_retired.l3_miss (54.59%)
Honestly, it does not seem to be related to TLB misses, as cay has a much lower miss rate.
On the other hand, mem-stores are a lot higher than for the builtin: for 1M, cay is at 2Bn mem-stores, whereas the builtin is at 1.6M. Maybe I should try toplev, which can visualize what the CPU is doing.
Using perf on a modified version of cay, where we don't do a __CompareNMask function call but instead do a few bit operations, "idx := uint16(ctrl[0])&uint16(hash>>57&3) + (1)", emulating that we fetch the control bytes and do some byte operations on them. It does seem that changing the (1) factor has an impact on the latency, which could indicate that we hit some cache-line boundary somewhere.
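For context, a scalar Go sketch of what I take __CompareNMask's semantics to be (the real one is SIMD assembly; this emulation is my assumption):

// compareNMask compares each of the 16 control bytes in a group against
// the top 7 bits of the hash and returns a bitmask with one bit set per
// matching slot.
func compareNMask(ctrl *[16]byte, hash uint64) uint16 {
	h7 := byte(hash >> 57) // top 7 bits of the 64-bit hash
	var mask uint16
	for i, c := range ctrl {
		if c == h7 {
			mask |= 1 << uint(i)
		}
	}
	return mask
}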
Before (having __CompareNMask):
$ for TV in 8 1k 32k 512k 1m; do sudo perf stat -e dTLB-load-misses,cache-misses -g ./cay.test -test.run=^\$ -test.cpu 1 -test.count 1 -test.bench "read_id/$TV/cay"; done
cpu: Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz
Performance counter stats for './cay.test -test.run=^$ -test.cpu 1 -test.count 1 -test.bench read_id/8/cay':
131,117 dTLB-load-misses
14,852,562 cache-misses
Performance counter stats for './cay.test -test.run=^$ -test.cpu 1 -test.count 1 -test.bench read_id/1k/cay':
121,029 dTLB-load-misses
14,873,464 cache-misses
Performance counter stats for './cay.test -test.run=^$ -test.cpu 1 -test.count 1 -test.bench read_id/32k/cay':
113,144 dTLB-load-misses
90,630,164 cache-misses
Performance counter stats for './cay.test -test.run=^$ -test.cpu 1 -test.count 1 -test.bench read_id/512k/cay':
704,105 dTLB-load-misses
276,259,860 cache-misses
Performance counter stats for './cay.test -test.run=^$ -test.cpu 1 -test.count 1 -test.bench read_id/1m/cay':
1,324,025 dTLB-load-misses
316,184,011 cache-misses
After (removing __CompareNMask):
$ for TV in 8 1k 32k 512k 1m; do sudo perf stat -e dTLB-load-misses,cache-misses -g ./cay.test -test.run=^\$ -test.cpu 1 -test.count 1 -test.bench "read_id/$TV/cay"; done
Performance counter stats for './cay.test -test.run=^$ -test.cpu 1 -test.count 1 -test.bench read_id/8/cay':
116,947 dTLB-load-misses
14,151,668 cache-misses
Performance counter stats for './cay.test -test.run=^$ -test.cpu 1 -test.count 1 -test.bench read_id/1k/cay':
124,258 dTLB-load-misses
15,277,182 cache-misses
Performance counter stats for './cay.test -test.run=^$ -test.cpu 1 -test.count 1 -test.bench read_id/32k/cay':
120,314 dTLB-load-misses
54,505,594 cache-misses
Performance counter stats for './cay.test -test.run=^$ -test.cpu 1 -test.count 1 -test.bench read_id/512k/cay':
738,956 dTLB-load-misses
258,818,526 cache-misses
Performance counter stats for './cay.test -test.run=^$ -test.cpu 1 -test.count 1 -test.bench read_id/1m/cay':
1,470,155 dTLB-load-misses
465,317,721 cache-misses
As a baseline, these are the results from the builtin:
$ for TV in 8 1k 32k 512k 1m; do sudo perf stat -e dTLB-load-misses,cache-misses -g ./cay.test -test.run=^\$ -test.cpu 1 -test.count 1 -test.bench "read_id/$TV/built"; done
...
Performance counter stats for './cay.test -test.run=^$ -test.cpu 1 -test.count 1 -test.bench read_id/8/built':
118,019 dTLB-load-misses
14,633,452 cache-misses
Performance counter stats for './cay.test -test.run=^$ -test.cpu 1 -test.count 1 -test.bench read_id/1k/built':
125,581 dTLB-load-misses
14,527,774 cache-misses
Performance counter stats for './cay.test -test.run=^$ -test.cpu 1 -test.count 1 -test.bench read_id/32k/built':
133,197 dTLB-load-misses
53,978,241 cache-misses
Performance counter stats for './cay.test -test.run=^$ -test.cpu 1 -test.count 1 -test.bench read_id/512k/built':
15,324,337 dTLB-load-misses
311,375,476 cache-misses
Performance counter stats for './cay.test -test.run=^$ -test.cpu 1 -test.count 1 -test.bench read_id/1m/built':
16,594,415 dTLB-load-misses
344,985,919 cache-misses
For the small sizes, cay has higher TLB misses and cache misses, and for the 1M size, cay has higher cache misses (though far lower dTLB misses).
The __CompareNMask function call uses the old stack-based calling convention, where arguments are pushed to and popped from the stack. Given these memory stores and loads, we might hit the TLB, so I want to compare the TLB-miss rate for (see the sketch after this list):
- Stack-based function calls
- Register-based function calls
- No function call.
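A minimal sketch of the comparison's shape (my assumption of how the repo's Benchmark_function_call_tlb_miss_rate is structured; on go1.15 every ordinary Go call passes arguments on the stack, so the register- and stack-based variants would both be hand-written assembly stubs, stubbed out as a plain Go call here):

var sink uint16

//go:noinline
func viaCall(ctrl *[16]byte, h byte) uint16 { // stand-in for the asm stubs
	return uint16(ctrl[0]) & uint16(h)
}

func Benchmark_function_call_tlb_miss_rate(b *testing.B) {
	var ctrl [16]byte
	b.Run("no_function_call", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink = uint16(ctrl[0]) & 0x7f // bit ops inlined at the call site
		}
	})
	b.Run("register-based_function_call", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink = viaCall(&ctrl, 0x7f) // swap in the register-ABI asm stub
		}
	})
	b.Run("stack-based_ASM_function_call", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink = viaCall(&ctrl, 0x7f) // swap in the stack-ABI asm stub
		}
	})
}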
So far no conclusions on these function calls, except that in my dev environment the perf numbers look reasonable:
$ go test -run=^\$ -cpu 1 -count 1 -bench 'function_call' -cpuprofile cpu.profile 10:57 12/08
goos: linux
goarch: amd64
pkg: github.com/jakobgt/cay
cpu: AMD EPYC 7B13
Benchmark_function_call_tlb_miss_rate/no_function_call 52123150 23.07 ns/op
Benchmark_function_call_tlb_miss_rate/register-based_function_call 40125406 30.02 ns/op
Benchmark_function_call_tlb_miss_rate/stack-based_ASM_function_call 29837817 41.30 ns/op
but on my Linux env (a laptop), the no-function-call variant is completely off (twice as slow as the register-based one):
$ ./cay.test -test.run=^\$ -test.cpu 1 -test.count 1 -test.bench 'function_call'
goos: linux
goarch: amd64
pkg: github.com/jakobgt/cay
cpu: Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz
Benchmark_function_call_tlb_miss_rate/no_function_call 35837146 115.0 ns/op
Benchmark_function_call_tlb_miss_rate/register-based_function_call 34069237 52.23 ns/op
Benchmark_function_call_tlb_miss_rate/stack-based_ASM_function_call 32606010 47.48 ns/op
PASS
I'm unsure why that is.
Further diving into the TLB misses for the 1M case, we do see that after the __CompareNMask call, when accessing the key of an entry in the bucket, 31% of the cases have a TLB miss:
Sorted summary for file /home/cinnamon/jakob/cay.test
----------------------------------------------
61.51 map_get.go:49 (# Is `ctrl := bucket.controls`, which makes sense)
31.95 map_get.go:69 (# accesses the key of the entry)
1.39 map_get.go:91
1.32 map_get.go:68
1.27 map_get.go:56
Percent | Source code & Disassembly of cay.test for dTLB-load-misses (1833 samples, percent: local period)
----------------------------------------------------------------------------------------------------------------
: Disassembly of section .text:
:
: 000000000059b920 <github.com/jakobgt/cay.(*Map[go.shape.[]uint8_0]).findGet>:
: github.com/jakobgt/cay.(*Map[go.shape.[]uint8_0]).findGet():
Looking into the builtin map access, we see only one place where the TLB misses occur:
Sorted summary for file /home/cinnamon/jakob/cay.test
----------------------------------------------
95.42 map_faststr.go:192
0.78 map_faststr.go:195
Percent | Source code & Disassembly of cay.test for dTLB-load-misses (4901 samples, percent: local period)
----------------------------------------------------------------------------------------------------------------
: Disassembly of section .text:
:
: 0000000000413ce0 <runtime.mapaccess2_faststr>:
: runtime.mapaccess2_faststr():
What is interesting is that if I just run perf and record the TLB load misses, there's a factor of 10 in favor of caymap. So maybe it is something other than the TLB:
cinnamon@cinnamon-testpod:~/jakob$ sudo perf stat -e dTLB-load-misses -g ./cay.test -test.run=^\$ -test.cpu 1 -test.count 1 -test.bench 'read_id/1m/built'
goos: linux
goarch: amd64
pkg: github.com/jakobgt/cay
cpu: Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz
Benchmark_read_identical_string_keys/1m/builtin 14916814 89.78 ns/op
PASS
Performance counter stats for './cay.test -test.run=^$ -test.cpu 1 -test.count 1 -test.bench read_id/1m/built':
16,202,946 dTLB-load-misses #Builtin has 16M tlb-misses
3.195433605 seconds time elapsed
2.721868000 seconds user
0.545967000 seconds sys
cinnamon@cinnamon-testpod:~/jakob$ sudo perf stat -e dTLB-load-misses -g ./cay.test -test.run=^\$ -test.cpu 1 -test.count 1 -test.bench 'read_id/1m/cay'
goos: linux
goarch: amd64
pkg: github.com/jakobgt/cay
cpu: Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz
Benchmark_read_identical_string_keys/1m/caymap 14394189 95.27 ns/op
PASS
Performance counter stats for './cay.test -test.run=^$ -test.cpu 1 -test.count 1 -test.bench read_id/1m/cay':
1,236,577 dTLB-load-misses #caymap has 1.2M tlb-misses
3.541897468 seconds time elapsed
3.074113000 seconds user
0.529052000 seconds sys
TODO: Check up on
- TLB misses for the 1M case.
- the fill-ratio
Looked into:
- TLB misses around the __CompareNMask call (more verification there)
- Whether the Go tracing could be helpful (it was); I made the test cases more sane.
By fetching bucket.entries before and after the __CompareNMask call, we can see where fetching the memory costs something. And it does seem that fetching bucket.entries after the call does cost something. This is from findGet, via go test -run=^\$ -cpu 1 -count 1 -bench 'read_id/8/caymap' -cpuprofile cpu.profile:
53 . . bEntries := &bucket.entries
54 50ms 290ms idx := __CompareNMask(grpCtrlPointer, unsafe.Pointer(hash>>57))
55 70ms 70ms bEntries = &bucket.entries
In the above we can see that the first bEntries := &bucket.entries does not cost anything, whereas the second bEntries = &bucket.entries costs 70ms (out of roughly 690ms inside this function, or 10%).
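(Such line-level listings come from the standard pprof source listing, something like:)

$ go tool pprof -list findGet cay.test cpu.profile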
TODO: Look into what perf says about the above
Looking at the Go traces, it was obvious that the setup of the tests, where each subtest would create random keys every time it was run, was not beneficial. So I refactored the tests to create a random key slice for each required size once and reuse it (a sketch follows the trace commands below).
These were the commands:
$ go test -trace trace_cay.out -run=^\$ -cpu 1 -count 1 -bench 'read_not_found_dyn/1m/caymap'
# Start trace:
$ go tool trace --http :1235 trace_cay.out
(Do similarly for builtin)
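The refactor boils down to something like this (illustrative names, not the exact test code; imports: math/rand and strconv):

// Build the random key slices once per size, instead of regenerating
// them inside every subtest run.
var keysBySize = map[int][]string{}

func init() {
	rng := rand.New(rand.NewSource(42))
	for _, n := range []int{8, 1 << 10, 32 << 10, 512 << 10, 1 << 20} {
		keys := make([]string, n)
		for i := range keys {
			keys[i] = strconv.FormatUint(rng.Uint64(), 16)
		}
		keysBySize[n] = keys
	}
}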
Running the benchmarks again we get:
# Not found cases:
$ go test -run=^\$ -cpu 1 -count 10 -bench 'read_not_found' | tee ~/cay_not_found_results.txt
$ benchstat ~/cay_not_found_results.txt
name time/op
_read_not_found_dynamic/8/caymap 14.1ns ± 2%
_read_not_found_dynamic/8/builtin 12.1ns ± 3%
_read_not_found_dynamic/1k/caymap 17.1ns ± 1%
_read_not_found_dynamic/1k/builtin 18.1ns ± 3%
_read_not_found_dynamic/32k/caymap 24.3ns ± 9%
_read_not_found_dynamic/32k/builtin 26.8ns ± 5%
_read_not_found_dynamic/512k/caymap 65.3ns ± 9%
_read_not_found_dynamic/512k/builtin 78.5ns ± 9%
_read_not_found_dynamic/1m/caymap 80.7ns ± 3%
_read_not_found_dynamic/1m/builtin 86.8ns ±13%
_read_not_found_static_key/8/caymap 13.5ns ± 3%
_read_not_found_static_key/8/builtin 12.3ns ± 4%
_read_not_found_static_key/1k/caymap 13.2ns ± 2%
_read_not_found_static_key/1k/builtin 14.7ns ± 3%
_read_not_found_static_key/32k/caymap 13.3ns ± 2%
_read_not_found_static_key/32k/builtin 14.7ns ± 2%
_read_not_found_static_key/512k/caymap 13.6ns ± 3%
_read_not_found_static_key/512k/builtin 14.9ns ± 4%
_read_not_found_static_key/1m/caymap 13.3ns ± 2%
_read_not_found_static_key/1m/builtin 14.7ns ± 5%
Caymap is faster in all cases except the 8-entry ones.
For the found case, builtin is faster in all cases except 32k keys:
$ go test -run=^\$ -cpu 1 -count 10 -bench 'read_iden' | tee ~/cay_identical_found_results.txt
$ benchstat ~/cay_identical_found_results.txt
name time/op
_read_identical_string_keys/8/caymap 18.0ns ± 4%
_read_identical_string_keys/8/builtin 10.6ns ± 1%
_read_identical_string_keys/1k/caymap 21.4ns ± 3%
_read_identical_string_keys/1k/builtin 14.9ns ± 2%
_read_identical_string_keys/32k/caymap 31.8ns ± 3%
_read_identical_string_keys/32k/builtin 39.3ns ± 1%
_read_identical_string_keys/512k/caymap 100ns ± 6%
_read_identical_string_keys/512k/builtin 94.7ns ± 4%
_read_identical_string_keys/1m/caymap 118ns ± 6%
_read_identical_string_keys/1m/builtin 100ns ± 6%
It could be interesting to look into the fill ratio, and more generally the amount of memory used by caymap vs. the builtin.
Adding some memory ballast in the form of
var (
_ballast = make([]byte, int64(1)<<int64(33)) // 8GB
)
does not change anything unfortunately.
Changing the number of entries in the map makes a big difference: for all maps smaller than 1<<15 entries, simdmap is consistently faster, on both Mac and Linux.
Maybe the type should be changed to map[string]int, which is what Matt Kulukundis from Google uses to benchmark his tables. He has two set-ups, one with a 4-byte key/value and one with 64 bytes (he only looks at sets, where the key and the value are the same).
Generally, where the code spends the most CPU cycles is also where the cache misses are, so I want to know what type of cache miss it is, e.g., TLB/L1/L2, etc. On the host, the L1 is 32K.
Running
./perf record -e L1-dcache-load-misses -g ./simdmap.test -test.run=^\$ -test.cpu 1 -test.benchmem -test.benchtime 10000000x -test.bench '.*simd.*get/same_keys.*'
shows that if keyP.Len != ekeyP.Len { (map_get:75) has:
- 80% of the L1 cache misses in find_get (2381 samples)
- 70% of the LLC cache misses (2353 samples). (LLC = Last Level Cache; not sure whether that is L2 or L3; the event is LLC-load-misses.)
__CompareNMask has a similar level of L1 cache misses, but not of LLC misses (only 600 samples).
This is really weird, as the host has a 32KB L1 cache, a 256KB L2 cache, and a 20MB L3 cache, and yet that specific line gets flushed. Could it be the GC that flushes the caches?
TODO: Try to disable the GC during the test runs (see the snippet below). Look at the deltablue benchmarks that were used in V8/Dart, etc.
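One standard knob to rule the GC out (this is the stock runtime/debug API, not repo code):

import "runtime/debug"

func init() {
	// Disable the collector entirely for the benchmark run;
	// equivalent to running with GOGC=off.
	debug.SetGCPercent(-1)
}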
It seems that we can't optimize find much more, and we have to accept the second round of TLB misses when inspecting the length of the keys. Maybe we can optimize the case where the key is found, making sure that, once we return the value pointer, we don't hit a third TLB miss there?
Inspecting the code with
./perf record -e cache-misses -g ./simdmap.test -test.run=^\$ -test.cpu 1 -test.benchmem -test.benchtime 10000000x -test.bench '.*simd.*get/same_keys.*'
it seems that all the cache misses happen where there are also TLB misses, including the return call in Get().
Two ideas: a) move key and value closer together, and/or b) reduce function calls and LOC between returning from find and using the []byte.
With
type entry struct {
key string
value []byte
}
for each entry, it seems that we get a lot fewer TLB misses in the return statement (~248 samples, whereas ekeyP.Len generates ~1K TLB misses). This could be interesting.
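For context (amd64 layout facts, not repo specifics): this entry is 40 bytes, a 16-byte string header followed by a 24-byte slice header, so the key and value headers sit adjacent in memory and a single line fill covers both. Quick check:

fmt.Println(unsafe.Sizeof(entry{})) // 40 on amd64: 16 (string header) + 24 ([]byte header)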
As a side note, SIMD get/same_keys is oftentimes faster than the builtin:
$ ./simdmap.test -test.run=^\$ -test.cpu 1 -test.benchmem -test.benchtime 30s -test.bench '.*get/same_keys.*'
goos: linux
goarch: amd64
pkg: github.com/jakobgt/cay
Benchmark_builtin_map_get/same_keys 199967498 194 ns/op 0 B/op 0 allocs/op
Benchmark_simdmap_get/same_keys 202134694 184 ns/op 0 B/op 0 allocs/op
Buckets: 131072, of which 1043 are full
PASS
$ ./simdmap.test -test.run=^\$ -test.cpu 1 -test.benchmem -test.benchtime 10s -test.bench '.*get/same_keys.*'
goos: linux
goarch: amd64
pkg: github.com/jakobgt/cay
Benchmark_builtin_map_get/same_keys 86886939 143 ns/op 0 B/op 0 allocs/op
Benchmark_simdmap_get/same_keys 100000000 125 ns/op 0 B/op 0 allocs/op
Buckets: 131072, of which 1081 are full
PASS
$ ./simdmap.test -test.run=^\$ -test.cpu 1 -test.benchmem -test.benchtime 10s -test.bench '.*get/same_keys.*'
goos: linux
goarch: amd64
pkg: github.com/jakobgt/cay
Benchmark_builtin_map_get/same_keys 84544454 149 ns/op 0 B/op 0 allocs/op
Benchmark_simdmap_get/same_keys 100000000 124 ns/op 0 B/op 0 allocs/op
Buckets: 131072, of which 1120 are full
PASS
$ ./simdmap.test -test.run=^\$ -test.cpu 1 -test.benchmem -test.benchtime 10s -test.bench '.*get/same_keys.*'
goos: linux
goarch: amd64
pkg: github.com/jakobgt/cay
Benchmark_builtin_map_get/same_keys 94161273 154 ns/op 0 B/op 0 allocs/op
Benchmark_simdmap_get/same_keys 100000000 172 ns/op 0 B/op 0 allocs/op
Buckets: 131072, of which 1109 are full
PASS
$ ./simdmap.test -test.run=^\$ -test.cpu 1 -test.benchmem -test.benchtime 10s -test.bench '.*get/same_keys.*'
goos: linux
goarch: amd64
pkg: github.com/jakobgt/cay
Benchmark_builtin_map_get/same_keys 93228235 145 ns/op 0 B/op 0 allocs/op
Benchmark_simdmap_get/same_keys 98207223 122 ns/op 0 B/op 0 allocs/op
Buckets: 131072, of which 1106 are full
PASS
If using grp := m.buckets[cGroup], I would have hoped that the bucket would be allocated on the stack. From the generated code, that does not seem to be the case, as runtime.newobject is called along with runtime.duffcopy and runtime.typedmemmove; presumably the Go compiler thinks that the bucket escapes and thus allocates it on the heap. (newobject is essentially malloc.)
155 . 2.17s grp := m.buckets[cGroup]
. . 11e58b0: LEAQ runtime.rodata+297792(SB), AX map.go:155
. . 11e58b7: MOVQ AX, 0(SP) map.go:155
. . 11e58bb: NOPL 0(AX)(AX*1) map.go:155
. 1.13s 11e58c0: CALL runtime.newobject(SB) map.go:155
...
. . 11e58e0: CMPQ CX, BX map.go:155
. . 11e58e3: JAE 0x11e5995 map.go:155
. . 11e58e9: MOVQ DI, 0x48(SP) map.go:155
. . 11e58ee: SHLQ $0xa, BX map.go:155
. . 11e58f2: LEAQ 0(DX)(BX*1), SI map.go:155
. . 11e58f6: CMPL $0x0, runtime.writeBarrier(SB) map.go:155
. . 11e58fd: JNE 0x11e596f map.go:155
. . 11e58ff: NOPL map.go:155
. . 11e5900: MOVQ BP, -0x10(SP) map.go:155
. . 11e5905: LEAQ -0x10(SP), BP map.go:155
. 280ms 11e590a: CALL runtime.duffcopy(SB) map.go:155
. . 11e590f: MOVQ 0(BP), BP map.go:155
...
. 760ms 11e5984: CALL runtime.typedmemmove(SB) map.go:155
Btw., go build -gcflags '-m' . is good for seeing what escapes to the heap. By ensuring that the grp variable does not escape, I reduced the ns/op from 1100ns to 200ns, and with that, no mallocs and no typedmemmove; only a duffcopy call remains.
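The fix, as I read it, is to index by pointer instead of copying the bucket value (hedged: the exact field names are assumed):

grp := &m.buckets[cGroup] // pointer into the backing array: no bucket copy, no newobject/typedmemmove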
Making sure that grp does not escape, we see a different picture, where the TLB misses are located in (percentages):
47.10 map.go:138
14.73 map.go:231
14.66 map.go:155
10.13 map.go:153
5.17 map.go:236
3.97 map.go:151
2.34 map.go:150
where
- map.go:138 is hash := z.Memhash(keyP.Str, 0, uintptr(keyP.Len))
- map.go:231 is uintptr(1) << (b & (PtrSize*8 - 1)), where b is m.logSize, so I guess a lookup of the m variable
- map.go:155 is grp := m.buckets[cGroup]
So maybe the duffcopy is destroying the TLB cache? Looking at the duffcopy code, it has 14610 samples, whereas the find method has 1412 samples. Thus duffcopy seems to kill the TLB. Can we load the TLB entry for the full page of grp without a duffcopy call? It would be interesting to see whether grp is page-aligned. Or rather, can we just copy the control byte array, which costs a copy of the array but fully loads the page into the TLB?
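A sketch of that control-bytes-only idea (field names assumed): a 16-byte fixed-size array copy compiles to a couple of plain MOV instructions rather than a duffcopy call, and the load still touches the bucket's page, warming its TLB entry:

// Copy only the 16 control bytes onto the stack instead of the whole
// bucket; no duffcopy, but the bucket's page is still loaded.
ctrl := m.buckets[cGroup].controls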
By forcing it to copy the control byte array onto the stack, I've seen some MOVUPS instructions, so the compiler can't prove that the data is aligned (MOVUPS is Move Unaligned Packed Single-Precision). Furthermore, copying a byte array is not the same as copying a string array (the latter leads to a duffcopy call).
Confirmed with the test case Test__FirstBucketIsPageAligned that the buckets are page-aligned (as in, the first bucket is). Thus the TLB misses are not coming from misaligned loads, and I need to figure out why there is a TLB miss when accessing unsafe.Pointer(&grp.keys[i]).
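For reference, a sketch of what such an alignment test can look like (the constructor name and field layout are assumptions, not the repo's actual code):

func Test__FirstBucketIsPageAligned(t *testing.T) {
	m := New(1 << 20) // hypothetical constructor
	addr := uintptr(unsafe.Pointer(&m.buckets[0]))
	// A page-aligned address is an exact multiple of the OS page size.
	if addr%uintptr(os.Getpagesize()) != 0 {
		t.Fatalf("first bucket at %#x is not page-aligned", addr)
	}
}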
Maybe the second TLB miss, which accounts for 13% of the misses in find(), is okay? Can I find optimizations somewhere else?
For the not-found tests, TLB misses are the same for both versions:
$ ./linux-perf stat -e dTLB-loads,dTLB-load-misses,dTLB-prefetch-misses ./simdmap.test -test.run=^\$ -test.cpu 1 -test.benchmem -test.benchtime 10000000x -test.bench '.*builtin.*not_found/dynamic_keys.*'
goos: linux
goarch: amd64
pkg: github.com/jakobgt/cay
Benchmark_builtin_map_get_not_found/dynamic_keys 10000000 168 ns/op 0 B/op 0 allocs/op
PASS
Performance counter stats for './simdmap.test -test.run=^$ -test.cpu 1 -test.benchmem -test.benchtime 10000000x -test.bench .*builtin.*not_found/dynamic_keys.*':
2,278,803,379 dTLB-loads:u
12,915,360 dTLB-load-misses:u # 0.57% of all dTLB cache hits # <- check this number and compare below, both are at 0.5% TLB misses
<not supported> dTLB-prefetch-misses:u
6.397644517 seconds time elapsed
$ ./linux-perf stat -e dTLB-loads,dTLB-load-misses,dTLB-prefetch-misses ./simdmap.test -test.run=^\$ -test.cpu 1 -test.benchmem -test.benchtime 10000000x -test.bench '.*simd.*not_found/dynamic_keys.*'
Buckets: 131072, of which 1060 are full
goos: linux
goarch: amd64
pkg: github.com/jakobgt/cay
Benchmark_simdmap_get_not_found/dynamic_keys 10000000 117 ns/op 0 B/op 0 allocs/op
PASS
Performance counter stats for './simdmap.test -test.run=^$ -test.cpu 1 -test.benchmem -test.benchtime 10000000x -test.bench .*simd.*not_found/dynamic_keys.*':
2,474,095,181 dTLB-loads:u
13,366,113 dTLB-load-misses:u # 0.54% of all dTLB cache hits # <- check this number and compare above, both are at 0.5% TLB misses
<not supported> dTLB-prefetch-misses:u
5.389249197 seconds time elapsed
And it seems that found tests are also better in the simdmap case (at least once):
$ ./linux-perf stat -e dTLB-loads,dTLB-load-misses,dTLB-prefetch-misses ./simdmap.test -test.run=^$ -test.cpu 1 -test.benchmem -test.benchtime 10000000x -test.bench '.get/same_keys.'
goos: linux
goarch: amd64
pkg: github.com/jakobgt/cay
Benchmark_builtin_map_get/same_keys 10000000 173 ns/op 0 B/op 0 allocs/op
Benchmark_simdmap_get/same_keys 10000000 131 ns/op 0 B/op 0 allocs/op # <- Simdmap is faster!!!!!
Buckets: 131072, of which 1083 are full
PASS
Performance counter stats for './simdmap.test -test.run=^$ -test.cpu 1 -test.benchmem -test.benchtime 10000000x -test.bench .get/same_keys.':
8,018,199,273 dTLB-loads:u
30,158,881 dTLB-load-misses:u # 0.38% of all dTLB cache hits
<not supported> dTLB-prefetch-misses:u
16.724325756 seconds time elapsed
$ ./linux-perf stat -e dTLB-loads,dTLB-load-misses,dTLB-prefetch-misses ./simdmap.test -test.run=^$ -test.cpu 1 -test.benchmem -test.benchtime 10000000x -test.bench '.*builtin.get/same_keys.'
goos: linux
goarch: amd64
pkg: github.com/jakobgt/cay
Benchmark_builtin_map_get/same_keys 10000000 132 ns/op 0 B/op 0 allocs/op
PASS
Performance counter stats for './simdmap.test -test.run=^$ -test.cpu 1 -test.benchmem -test.benchtime 10000000x -test.bench .*builtin.get/same_keys.':
3,547,374,496 dTLB-loads:u
16,341,849 dTLB-load-misses:u # 0.46% of all dTLB cache hits
<not supported> dTLB-prefetch-misses:u
7.503344017 seconds time elapsed
$ ./linux-perf stat -e dTLB-loads,dTLB-load-misses,dTLB-prefetch-misses ./simdmap.test -test.run=^$ -test.cpu 1 -test.benchmem -test.benchtime 10000000x -test.bench '.*simd.get/same_keys.'
goos: linux
goarch: amd64
pkg: github.com/jakobgt/cay
Benchmark_simdmap_get/same_keys 10000000 133 ns/op 0 B/op 0 allocs/op
Buckets: 131072, of which 1065 are full
PASS
Performance counter stats for './simdmap.test -test.run=^$ -test.cpu 1 -test.benchmem -test.benchtime 10000000x -test.bench .*simd.get/same_keys.':
3,870,569,488 dTLB-loads:u
14,451,216 dTLB-load-misses:u # 0.37% of all dTLB cache hits
<not supported> dTLB-prefetch-misses:u
8.990690540 seconds time elapsed
Furthermore, line 179 in map.go (if keyP == ekeyP || keyP.Str == ekeyP.Str || z.Memequal(ekeyP.Str, keyP.Str, uintptr(keyP.Len))) is causing a relatively high number of TLB misses: it accounts for 31% of the dTLB-loads but for 68% of the dTLB-load-misses. Similarly, line 155 (idx := __CompareNMask(grpCtrlPointer, unsafe.Pointer(hash>>57))) accounts for 1.6% of the loads but 16% of the dTLB-load-misses. I'll try to add
if keyP.Len != ekeyP.Len {
continue
}
before line 179 to see if that changes the load-misses.
With that change, the line now has 30% of the dTLB-loads and 70%(!) of the dTLB-load-misses. Thus, referencing keyP.Len or ekeyP.Len results in TLB misses. Based on my reading of the generated code, the culprit is ekeyP := (*z.StringStruct)(unsafe.Pointer(&grp.keys[i])), which generates a bunch of extra instructions, and the question is whether these instructions mess up the TLB cache. Should I try to write this as an unsafe array computation?
Interesting: replacing ekeyP := (*z.StringStruct)(unsafe.Pointer(&grp.keys[i])) with ekeyP := (*z.StringStruct)(add(unsafe.Pointer(&grp.keys), uintptr(i)*stringSize)) keeps the TLB misses at a LEA (load effective address) instruction (see below). Thus, it seems that &grp.keys causes the TLB misses...
map.go:175 0.99 : 5b0d54: lea (%rcx,%rdx,1),%rdi
0.00 : 5b0d58: lea 0x10(%rdi),%rdi
0.05 : 5b0d5c: mov %rsi,%rax
0.24 : 5b0d5f: shl $0x4,%rsi
0.24 : 5b0d63: mov 0x78(%rsp),%r8
0.00 : 5b0d68: mov 0x8(%rdi,%rsi,1),%r9
: github.com/jakobgt/cay.add():
map.go:242 68.16 : 5b0d6d: lea (%rdi,%rsi,1),%r10 # <- this one in the add function
: github.com/jakobgt/cay.(*Map).find():
0.00 : 5b0d71: cmp %r8,%r9
0.00 : 5b0d74: jne 5b0d39 <github.com/jakobgt/cay.(*Map).find+0x79>
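For reference, the add helper seen at map.go:242 is presumably the usual unsafe pointer-arithmetic shim (this is the shape the runtime uses internally; I assume the repo's version matches):

// add returns p advanced by x bytes.
func add(p unsafe.Pointer, x uintptr) unsafe.Pointer {
	return unsafe.Pointer(uintptr(p) + x)
}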
Hence the next step is to figure out whether I really need grp := &m.buckets[cGroup], or whether it should be grp := m.buckets[cGroup].
$ go version 20:32 30/11
go version go1.15.2 darwin/amd64
Local Mac (all tests are green)
Run (simdmap wins the not-found cases):
$ go test -run=^\$ -cpu 1 -benchmem -bench '.*' 20:38 30/11
goos: darwin
goarch: amd64
pkg: github.com/jakobgt/cay
Benchmark_string_len 1000000000 0.532 ns/op 0 B/op 0 allocs/op
Benchmark_string_len_reverse 772742073 1.56 ns/op 0 B/op 0 allocs/op
Benchmark_string_struct_len_reverse 761126870 1.55 ns/op 0 B/op 0 allocs/op
Benchmark_builtin_map_get/same_keys 11334487 113 ns/op 0 B/op 0 allocs/op
Benchmark_builtin_map_get/same_but_fresh_keys 7499438 166 ns/op 0 B/op 0 allocs/op
Benchmark_builtin_map_get/same,_fresh_and_random_order_keys 4416318 286 ns/op 0 B/op 0 allocs/op
Benchmark_simdmap_get/same_keys 9499071 154 ns/op 0 B/op 0 allocs/op
Benchmark_simdmap_get/same_but_fresh_keys 7277200 174 ns/op 0 B/op 0 allocs/op
Benchmark_simdmap_get/same,_fresh_and_random_order_keys 4206589 313 ns/op 0 B/op 0 allocs/op
Benchmark_builtin_map_get_not_found/static_key 92947183 13.4 ns/op 0 B/op 0 allocs/op
Benchmark_builtin_map_get_not_found/dynamic_keys 9785622 128 ns/op 0 B/op 0 allocs/op
Buckets: 131072, of which 1060 are full
Benchmark_simdmap_get_not_found/static_key 100000000 11.8 ns/op 0 B/op 0 allocs/op
Benchmark_simdmap_get_not_found/dynamic_keys 14061108 94.3 ns/op 0 B/op 0 allocs/op
For ./simdmap.test -test.run=^\$ -test.cpu 1 -test.benchmem -test.benchtime 10000000x -test.bench '.*not_found/dynamic_keys.*'
Instruments shows:
L2 Hit: ~8mm for simdmap (vs. 3mm for builtin)
DTLB walk completed: ~10.2mm for simdmap (vs. 10.5mm for builtin)
DTLB STLB hit: ~3.3mm for simdmap (vs. 9.9mm for builtin)
Mispredicted branches: ~0.9mm for simdmap (vs. 12mm for builtin)
It seems like builtin just requests more data for the not-found case.
For -test.run=^$ -test.cpu 1 -test.benchmem -test.benchtime 10000000x -test.bench .*get/same_but_fresh_keys.*
Instruments reports:
L2 miss: ~38.6mm for simdmap (vs. 35.4mm for builtin)
Cycles: ~4bn for simdmap (vs. 3.9bn for builtin)
Stalled cycles: ~11.5bn for simdmap (vs. 9.9bn for builtin)
Mispredicted branches: ~0.4mm for simdmap (vs. 9.5mm for builtin)
Thus, simdmap is much more predictable for branch prediction, but the memory fetches seem to be worse.
Making the bucket byte-aligned (by adding a filler byte array) seems to improve the performance of simdmap.
Instruments command:
-test.run=^$ -test.cpu 1 -test.benchmem -test.benchtime 30000000x -test.cpuprofile cpu.profile -test.bench .*get/same_keys.*
Non-byte aligned:
testing: open cpu.profile: read-only file system
goos: darwin
goarch: amd64
pkg: github.com/jakobgt/cay
Benchmark_builtin_map_get/same_keys 30000000 114 ns/op 0 B/op 0 allocs/op
Benchmark_simdmap_get/same_keys 30000000 143 ns/op 0 B/op 0 allocs/op
Buckets: 131072, of which 1112 are full
PASS
Byte-aligned:
testing: open cpu.profile: read-only file system
goos: darwin
goarch: amd64
pkg: github.com/jakobgt/cay
Benchmark_builtin_map_get/same_keys 30000000 110 ns/op 0 B/op 0 allocs/op
Benchmark_simdmap_get/same_keys 30000000 134 ns/op 0 B/op 0 allocs/op
Buckets: 131072, of which 1115 are full
PASS
<End of Run>
Comparing Instruments counters: Non-byte-aligned
Cpu cycles stalled: 33.867.820.180 for simdmap (vs. 24.266.816.843 for builtin)
L2 Cache misses: 112.878.397 for simdmap (vs. 98.156.923 for builtin)
Byte-aligned
Cpu cycles stalled: 29.347.750.948 for simdmap (vs. 25.248.233.827 for builtin)
L2 Cache misses: 95.792.491 for simdmap (vs. 98.296.924 for builtin)
So byte alignment does matter (see the sketch below).
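A sketch of the filler trick, reusing the entry layout from earlier (sizes assume amd64; the 1 KiB total matches the SHLQ $0xa bucket stride in the disassembly above; the field names are assumptions):

const bucketCnt = 16

type bucket struct {
	controls [bucketCnt]byte  // 16 bytes of control metadata
	entries  [bucketCnt]entry // 16 * 40 = 640 bytes on amd64
	_        [368]byte        // filler: rounds the bucket up to exactly 1 KiB
}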
Next up?
With a byte-aligned map, the simdmap implementation for get/same_keys has fewer L2D misses and fewer TLB load misses causing a walk, but a higher number of instructions that cause a stall:
Cycles with a stall: 25.474.354.712 (simdmap) vs. 20.477.061.167 (builtin)
L2 misses: 96.441.083 (simd) vs. 96.441.083 (builtin)
DTLB load misses: 31.015.416 (simd) vs. 34.998.766 (builtin)
So, I need to figure out what is causing these load stalls.
Turns out that the DTLB is the first-level TLB cache, and the STLB is the second level. Looking at the STLB, we see that simdmap does 40% more STLB misses than the builtin; the MEM_INST_RETIRED.STLB_MISS_LOADS event counts this.
Args:
-test.run=^$ -test.cpu 1 -test.benchmem -test.benchtime 30000000x -test.bench .*get/same_keys.*
See the Intel CPU events list: https://download.01.org/perfmon/index/skylake.html
TLB numbers:
Load instructions: 1.463.458.848 for SIMD (vs. 1.881.617.539 for builtin)
Loads with an STLB miss: 28.976.282 for SIMD (2% of loads - vs. 21.631.208 for builtin - 1.1%)
DTLB load misses: 31.020.833 for SIMD (vs. 35.153.593 for builtin)
DTLB misses that hit in the STLB: 2.789.990 for SIMD (vs. 30.226.279 for builtin)
It thus seems like SIMD has more TLB misses. With perf you can see the exact instructions where those TLB misses happen, so I should see if I can get a Linux box to test perf on.
My current thesis is that the extra bounds checks in the SIMD code cause more TLB cache invalidations; the snippet below shows how to enumerate them.
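To list the bounds checks the compiler keeps, the gc debug flag for bounds-check elimination is handy; it prints a "Found IsInBounds" line for every retained check:

$ go build -gcflags='-d=ssa/check_bce/debug=1' .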
- Not inlining a function can add up to 4ns
- Adding -cpuprofile cpu.profile adds a few ns.
- Using groupMask(...) to generate the group mask over (hash & m.Mask / 16) shaves off a ns/op (and the mask generation drops by 4.5x, from 70ms to 20ms). Go then uses the MOVZX operation over the MOVQ. The code went from:
148 70ms 70ms sGroup := hash & m.slotMask / 16
40ms 40ms 129b7db: MOVQ 0x90(SP), CX map.go:148
20ms 20ms 129b7e3: MOVQ 0x18(CX), DX map.go:148
. . 129b7e7: ANDQ AX, DX map.go:148
10ms 10ms 129b7ea: SHRQ $0x4, DX map.go:148
To:
150 20ms 20ms groupMask := bucketMask(m.logSize)
20ms 20ms 129b7fb: MOVQ 0x90(SP), CX map.go:150
. . 129b803: MOVZX 0x18(CX), DX map.go:150
151 . . sGroup := hash & groupMask // Equal to hash & m.slotMask / 16
. . 129b818: ANDQ AX, CX
- The native hashmap does a lot to avoid bounds checks (e.g., for bitmasks it uses uintptr).
- The max key size for the builtin map is 128 bytes before keys are stored indirectly (pointed to from the bucket rather than stored inline). Similarly for the elem size.
- Letting __CompareNMask return an int instead of returning via an argument does not change a lot (maybe 1 ns/op or so).
- Having different-sized keys results in a drop of 12ns/op (from 100ns to 88ns), since we avoid comparing keys.
- Wow, the caching nature of an open-addressing hashmap is crazy (20ns/op for varying keys with SIMD vs. 150 for builtin). The differentiating factor is that it costs 7x more to fetch the buckets from memory (specifically the MOVZX 0(BX)(CX*1), R8 operation on map_faststr.go:191).
~/g/s/c/i/p/l/simdmap (simdmap) $ go test -run=^\$ -benchmem -cpu 1 -bench '.*not_found.*' 17:35 17/03
goos: darwin
goarch: amd64
pkg: github.com/jakobgt/cay
Benchmark_simdmap_get_not_found/static_key 85533632 14.3 ns/op 0 B/op 0 allocs/op
Benchmark_simdmap_get_not_found/dynamic_keys 59230143 19.4 ns/op 0 B/op 0 allocs/op
Benchmark_builtin_map_get_not_found/static_key 87805620 13.8 ns/op 0 B/op 0 allocs/op
Benchmark_builtin_map_get_not_found/dynamic_key 8325208 149 ns/op 0 B/op 0 allocs/op
PASS
ok github.com/jakobgt/cay 8.869s