We measure the time it takes for a CPU to send a message to another CPU via its cache coherence protocol.
By pinning two threads to two different CPU cores, we can have them perform a series of compare-exchange operations and measure the latency.
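The ping-pong idea described above can be sketched with the standard library alone. This is an illustrative sketch, not the tool's actual source: the function name `ping_pong_latency_ns` is hypothetical, and the threads are not pinned to cores (the standard library has no affinity API), so the OS decides where they run and the numbers will be noisier than the tool's.

```rust
use std::hint::spin_loop;
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;
use std::time::Instant;

const PING: bool = false;
const PONG: bool = true;

/// Bounces a shared flag between two threads and returns the estimated
/// one-way latency in nanoseconds (round-trip time / 2).
/// Hypothetical helper; threads are NOT pinned to specific cores here.
fn ping_pong_latency_ns(round_trips: u32) -> f64 {
    let flag = AtomicBool::new(PING);

    thread::scope(|s| {
        // Responder: waits until the flag is PING, then flips it to PONG.
        s.spawn(|| {
            for _ in 0..round_trips {
                while flag
                    .compare_exchange(PING, PONG, Ordering::Relaxed, Ordering::Relaxed)
                    .is_err()
                {
                    spin_loop();
                }
            }
        });

        // Initiator: waits until the flag is PONG, flips it back to PING,
        // and times the whole loop. Each iteration is one round trip.
        let initiator = s.spawn(|| {
            let start = Instant::now();
            for _ in 0..round_trips {
                while flag
                    .compare_exchange(PONG, PING, Ordering::Relaxed, Ordering::Relaxed)
                    .is_err()
                {
                    spin_loop();
                }
            }
            start.elapsed().as_nanos() as f64 / round_trips as f64 / 2.0
        });

        initiator.join().unwrap()
    })
}

fn main() {
    let latency = ping_pong_latency_ns(100_000);
    println!("estimated one-way latency: {latency:.1} ns");
}
```

Every successful compare-exchange forces the cache line holding the flag to move to the other core, which is why the loop time reflects coherence traffic rather than computation.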
How to run:
$ cargo install core-to-core-latency
$ core-to-core-latency
CPU | Median Latency |
---|---|
AMD Ryzen 9 7950X, 16 Cores, Zen4, 2022-Q3 | 68ns |
AMD EPYC 7773X, 64 Cores, Milan-X, 2022-Q1 | 115ns |
Intel Xeon Gold 6242, 16 Cores, Cascade Lake, 2019-Q2 | 48ns |
Intel Xeon Phi 7210, 64 Cores, Knights Landing, 2016-Q2 | 91ns |
HiSilicon Kunpeng 920-6426, 64 cores, ARMv8.2-A, 2019-Q1 | 72ns |
Intel Core i9-12900K, 8P+8E Cores, Alder Lake, 12th gen, 2021-Q4 | 35ns, 44ns, 50ns |
Intel Core i9-9900K, 3.60GHz, 8 Cores, Coffee Lake, 9th gen, 2018-Q4 | 21ns |
Intel Core i7-1165G7, 2.80GHz, 4 Cores, Tiger Lake, 11th gen, 2020-Q3 | 27ns |
Intel Core i7-6700K, 4.00GHz, 4 Cores, Skylake, 6th gen, 2015-Q3 | 20ns |
Intel Core i5-10310U, 4 Cores, Comet Lake, 10th gen, 2020-Q2 | 21ns |
Intel Core i5-4590, 3.30GHz 4 Cores, Haswell, 4th gen, 2014-Q2 | 21ns |
Apple M1 Pro, 6P+2E Cores, 2021-Q4 | 40ns, 53ns, 145ns |
Intel Xeon Platinum 8375C, 2.90GHz, 32 Cores, Ice Lake, 3rd gen, 2021-Q2 | 51ns |
Intel Xeon Platinum 8275CL, 3.00GHz, 24 Cores, Cascade Lake, 2nd gen, 2019-Q2 | 47ns |
Intel Xeon E5-2695 v4, 2.10GHz, 18 Cores, Broadwell, 5th gen, 2016-Q1 | 44ns |
AMD EPYC 7R13, 48 Cores, Milan, 3rd gen, 2021-Q1 | 23ns, 107ns |
AMD Ryzen Threadripper 3960X, 3.80GHz, 24 Cores, Zen 2, 3rd Gen, 2019-Q4 | 24ns, 94ns |
AMD Ryzen Threadripper 1950X, 3.40GHz, 16 Cores, Zen, 1st Gen, 2017-Q3 | 25ns, 154ns |
AMD Ryzen 9 5950X, 3.40GHz, 16 Cores, Zen3, 4th gen, 2020-Q4 | 17ns, 85ns |
AMD Ryzen 9 5900X, 3.40GHz, 12 Cores, Zen3, 4th gen, 2020-Q4 | 16ns, 84ns |
AMD Ryzen 7 5800U, 1.9GHz up to 4.4GHz, 8 Cores, Zen3, 4th gen, 2021-Q4 | 19ns |
AMD Ryzen 7 5700X, 3.40GHz, 8 Cores, Zen3, 4th gen, 2022-Q2 | 18ns |
AMD Ryzen 7 2700X, 3.70GHz, 8 Cores, Zen+, 2nd gen, 2018-Q3 | 24ns, 92ns |
AMD Ryzen 9 5900HX, 3.3GHz, 8 Cores, Zen3, 4th gen, 2021-Q1 | 8ns, 17ns, 18ns |
AWS Graviton3, 64 Cores, Arm Neoverse, 3rd gen, 2021-Q4 | 46ns |
AWS Graviton2, 64 Cores, Arm Neoverse, 2nd gen, 2020-Q1 | 47ns |
Sun/Oracle SPARC T4, 2.85GHz, 8 cores, 2011-Q3 | 98ns |
IBM Power7, 3.3GHz, 8 Cores, 2010-Q1 | 173ns |
IBM PowerPC 970, 1.8GHz, 2 Cores, 2003-Q2 | 576ns |
Data provided by Concyclics.
Data provided by Concyclics.
Data provided by Concyclics.
Data provided by zamadatix.
Data provided by zamadatix.
Data provided by SchrodingerZhu.
Data provided by Glavo.
Data provided by bizude.
This CPU has 8 performance cores and 2 groups of 4 efficiency cores. We see CPU=8 with fast access to all other cores.
Data provided by nviennot.
Data provided by Jonas Wunderlich.
Data provided by CanIGetaPR.
Data provided by Ashley Sommer.
Data provided by Felipe Lube de Bragança.
Data provided by Aditya Sharma.
We see the two efficiency cores clustered together with a latency of 53ns, then two groups of 3 performance cores, with a latency of 40ns. Cross-group communication is slow at ~145ns, a latency typically seen in multi-socket configurations.
From an AWS c6i.metal machine.
From an AWS c5.metal machine.
From a machine provided by GTHost.
From an AWS c6a.metal machine.
We can see the cores arranged in 6 groups of 8, with excellent latency within each group (23ns). When data crosses groups, the latency jumps to around 110ns. Note that the last 3 groups have a better cross-group latency than the first 3 (~90ns).
Data provided by Mathias Siegel.
We see the CPUs in 8 groups of 3, and better performance for CPUs in the group [13,24].
Data provided by Jakub Okoński.
We see the CPUs in 4 groups of 4, and better performance for CPUs in the group [9,16].
Data provided by John Schoenick.
We can see two groups of 8 cores with latencies of 17ns intra-group, and 85ns inter-group.
Data provided by Scott Markwell.
We see two groups of 6 cores with latencies of 16ns intra-group and 84ns inter-group.
Data provided by George Melikov.
Data provided by Ashley Sommer.
Data provided by David Hoppenbrouwers.
We can see 2 groups of 4 cores with latencies of 24ns intra-group, and 92ns inter-group.
Data provided by r4nd0m1z3r.
From an AWS c7g.16xlarge machine.
From an AWS c6gd.metal machine.
Data provided by Kokoa van Houten.
Data provided by Kokoa van Houten.
The following shows dual-socket configuration latency, where one CPU on the first socket sends a message to another CPU on the second socket. The number in parentheses next to the latency denotes the slowdown compared to single socket.
CPU | Median Latency |
---|---|
Intel Xeon Gold 6242, 16 Cores, Cascade Lake, 2019-Q2 | 136ns (2.8x) |
Intel Xeon Platinum 8375C, 2.90GHz, 32 Cores, Ice Lake, 3rd gen, 2021-Q2 | 108ns (2.1x) |
Intel Xeon Platinum 8275CL, 3.00GHz, 24 Cores, Cascade Lake, 2nd gen, 2019-Q2 | 134ns (2.8x) |
Intel Xeon E5-2695 v4, 2.10GHz, 18 Cores, Broadwell, 5th gen, 2016-Q1 | 118ns (2.7x) |
AMD EPYC 7R13, 48 Cores, Milan, 3rd gen, 2021-Q1 | 197ns |
Sun/Oracle SPARC T4, 2.85GHz, 8 cores, 2011-Q3 | 356ns (3.6x) |
IBM Power7, 3.3GHz, 8 Cores, 2010-Q1 | 443ns (2.5x) |
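As a worked example of the slowdown figure, the ratio can be recomputed by dividing the dual-socket median by the single-socket median for the same CPU. The sketch below does this for three rows using the rounded medians from the tables; the function name `slowdown` is only illustrative.

```rust
/// Slowdown of cross-socket latency relative to single-socket latency.
/// Illustrative helper, not part of the tool.
fn slowdown(single_socket_ns: f64, dual_socket_ns: f64) -> f64 {
    dual_socket_ns / single_socket_ns
}

fn main() {
    // Median latencies (ns) copied from the tables:
    // (CPU, single socket, dual socket).
    let rows = [
        ("Intel Xeon Gold 6242", 48.0, 136.0),
        ("Intel Xeon Platinum 8375C", 51.0, 108.0),
        ("Intel Xeon E5-2695 v4", 44.0, 118.0),
    ];
    for (cpu, single, dual) in rows {
        println!("{cpu}: {:.1}x", slowdown(single, dual));
    }
}
```

For these rows the output is 2.8x, 2.1x and 2.7x, matching the table. Recomputing from the rounded medians can differ in the last digit for some other rows (for example 134/47 ≈ 2.85), presumably because the published ratios were derived from unrounded values.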
Data provided by Concyclics.
From an AWS c6i.metal machine.
From an AWS c5.metal machine.
From a machine provided by GTHost.
From an AWS c6a.metal machine.
This one is a bit odd. The single-socket test for Socket 1 shows median cross-group latencies of 107ns, but Socket 2 shows 200ns, which is 2x slower. The other platforms don't behave this way. In fact, the socket-to-socket latencies are lower than the core-to-core latencies within Socket 2.
AnandTech has measured similar results on dual-socket AMD EPYC 7763 and 7742 systems.
Socket 2 does not behave like Socket 1: it is twice as slow.
Data provided by Kokoa van Houten.
Data provided by Kokoa van Houten.
We measure the latency between two hyper-threads of the same core.
CPU | Median Latency |
---|---|
AMD Ryzen 9 7950X, 16 Cores, Zen4, 2022-Q3 | 5.3ns |
AMD EPYC 7773X, 64 Cores, Milan-X, 2022-Q1 | 10ns |
Intel Xeon Gold 6242, 16 Cores, Cascade Lake, 2019-Q2 | 7.4ns |
Intel Core i9-12900K, 8+8 Cores, Alder Lake, 12th gen, 2021-Q4 | 4.3ns |
Intel Core i9-9900K, 3.60GHz, 8 Cores, Coffee Lake, 9th gen, 2018-Q4 | 6.2ns |
Intel Core i7-1165G7, 2.80GHz, 4 Cores, Tiger Lake, 11th gen, 2020-Q3 | 5.9ns |
Intel Core i7-6700K, 4.00GHz, 4 Cores, Skylake, 6th gen, 2015-Q3 | 6.9ns |
Intel Core i5-10310U, 4 Cores, Comet Lake, 10th gen, 2020-Q2 | 7.3ns |
Intel Xeon Platinum 8375C, 2.90GHz, 32 Cores, Ice Lake, 3rd gen, 2021-Q2 | 8.1ns |
Intel Xeon Platinum 8275CL, 3.00GHz, 24 Cores, Cascade Lake, 2nd gen, 2019-Q2 | 7.6ns |
Intel Xeon E5-2695 v4, 2.10GHz, 18 Cores, Broadwell, 5th gen, 2016-Q1 | 7.6ns |
AMD EPYC 7R13, 48 Cores, Milan, 3rd gen, 2021-Q1 | 9.8ns |
AMD Ryzen Threadripper 3960X, 3.80GHz, 24 Cores, Zen 2, 3rd Gen, 2019-Q4 | 6.5ns |
AMD Ryzen Threadripper 1950X, 3.40GHz, 16 Cores, Zen, 1st Gen, 2017-Q3 | 10ns |
AMD Ryzen 9 5950X, 3.40GHz, 16 Cores, Zen3, 4th gen, 2020-Q4 | 7.8ns |
AMD Ryzen 9 5900X, 3.40GHz, 12 Cores, Zen3, 4th gen, 2020-Q4 | 7.6ns |
AMD Ryzen 7 5700X, 3.40GHz, 8 Cores, Zen3, 4th gen, 2022-Q2 | 7.8ns |
AMD Ryzen 7 2700X, 3.70GHz, 8 Cores, Zen+, 2nd gen, 2018-Q3 | 9.7ns |
Sun/Oracle SPARC T4, 2.85GHz, 8 cores, 2011-Q3 | 24ns |
IBM Power7, 3.3GHz, 8 Cores, 2010-Q1 | 70ns |
The notebook results/results.ipynb contains the code to generate these graphs.
First install Rust and gcc on Linux, then:
$ cargo install core-to-core-latency
$ core-to-core-latency
Num cores: 10
Using RDTSC to measure time: false
Num round trips per samples: 1000
Num samples: 300
Showing latency=round-trip-time/2 in nanoseconds:
0 1 2 3 4 5 6 7 8 9
0
1 52±6
2 38±6 39±4
3 39±5 39±6 38±6
4 34±6 38±4 37±6 36±5
5 38±5 38±6 38±6 38±6 37±6
6 38±5 37±6 39±6 36±4 49±6 38±6
7 36±6 39±5 39±6 37±6 35±6 36±6 38±6
8 37±5 38±6 35±5 39±5 38±6 38±5 37±6 37±6
9 48±6 39±6 36±6 39±6 38±6 36±6 41±6 38±6 39±6
Min latency: 34.5ns ±6.1 cores: (4,0)
Max latency: 52.1ns ±9.4 cores: (1,0)
Mean latency: 38.4ns
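The per-pair figures above follow from the stated formula latency = round-trip-time / 2, applied to each sample and then aggregated. A minimal sketch of that aggregation is below; it assumes the ± value is a (population) standard deviation across samples, which the output itself does not state, and the name `summarize` and the sample durations are hypothetical.

```rust
use std::time::Duration;

/// Converts per-sample totals (time for `round_trips_per_sample` round trips)
/// into one-way latencies in ns and returns (mean, standard deviation).
/// Assumption: the tool's "±" is a standard deviation over samples.
fn summarize(sample_totals: &[Duration], round_trips_per_sample: u32) -> (f64, f64) {
    let latencies: Vec<f64> = sample_totals
        .iter()
        .map(|d| d.as_nanos() as f64 / round_trips_per_sample as f64 / 2.0)
        .collect();
    let n = latencies.len() as f64;
    let mean = latencies.iter().sum::<f64>() / n;
    let variance = latencies.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    (mean, variance.sqrt())
}

fn main() {
    // Hypothetical totals for three samples of 1000 round trips each.
    let totals = [
        Duration::from_nanos(76_000),
        Duration::from_nanos(80_000),
        Duration::from_nanos(72_000),
    ];
    let (mean, stddev) = summarize(&totals, 1000);
    println!("{mean:.1}±{stddev:.1} ns"); // prints 38.0±1.6 ns
}
```

Raising the number of round trips per sample averages out more scheduling and timer noise within each sample, which tightens the spread across samples.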
To reduce the noise, instruct the program to use 5000 iterations per sample, and save the results:
$ core-to-core-latency 5000 --csv > output.csv
The resulting file can be used in the Jupyter notebook results/results.ipynb for rendering graphs.
Create a GitHub issue with the generated output.csv file and I'll add your results.
This software is licensed under the MIT license.