understand performance of C versus Rust implementations #42
These tests were run on SmartOS on a single-socket Haswell server (Xeon E3-1270 v3) running at 3.50GHz. All of the tests were run bound to a processor set containing a single core; all were bound to one logical CPU within that core, with the other logical CPU forced to be idle. cpustat was used to gather CPU performance counter data, with one number denoting one run with
The input file (~30MB compressed) contains 3.9M state changes, and in the default config will generate a ~6MB SVG.
First, and to get it out of the way, here is the GCC-compiled C version relative to the Clang-compiled C version:
There is a significant delta here (a 5% improvement in run-time), which appears to be due to fewer instructions (1.4B fewer), but also to better memory behavior, with CPI dropping from 0.65 to 0.60.
Now here is the C version relative to the Rust version:
The Rust version is issuing a remarkably similar number of instructions (within less than one percent!), but with a decidedly different mix: just three quarters of the loads of the C version and (interestingly) many more stores. The CPI drops from 0.65 to 0.47, indicating much better memory behavior -- and indeed the L1 misses, L2 misses and L3 misses are all way down. The L1 hits as an absolute number are actually quite high relative to the loads, giving Rust a 96.9% L1 hit rate versus the C version's 77.9% hit rate. Rust also lives much better in the L2, where it has half the L2 misses of the C version.
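For anyone wanting to reproduce the hit-rate comparison from their own counter data, here is a minimal sketch. It assumes the common definition of hit rate as hits / (hits + misses); whether the percentages above were derived exactly this way is an assumption. The counts in `main` are made-up placeholders, not measurements from these runs.

```rust
/// Fraction of cache accesses that hit, or `None` when there were no accesses.
/// Assumes hit rate = hits / (hits + misses).
fn hit_rate(hits: u64, misses: u64) -> Option<f64> {
    let total = hits + misses;
    if total == 0 {
        None
    } else {
        Some(hits as f64 / total as f64)
    }
}

fn main() {
    // HYPOTHETICAL counts for illustration only; substitute the hit and
    // miss counter values gathered from your own runs.
    let (hits, misses) = (9_690u64, 310u64);

    match hit_rate(hits, misses) {
        Some(rate) => println!("L1 hit rate: {:.1}%", rate * 100.0),
        None => println!("no accesses recorded"),
    }
}
```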
Now for dtolnay's deserializing improvement (that is, leaning on
This is roughly what we would expect, though the L3 misses have been reduced by more than we would expect. Note, too, that while the loads have dropped by ~298M, the total retired instruction count has dropped by 5.3B. This is very significant! And the drop of 1.7B clock cycles shows that the instructions that we took away were (as we expected) well behaved in terms of memory -- they had a CPI of 0.32 (1.7B cycles over 5.3B instructions), which caused our overall CPI to actually rise (from 0.47 to 0.49). All of this is a long way of saying: the double lexing was well behaved with respect to memory, but it was a bunch of unnecessary work, and dtolnay's fix is a really nice win!
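The arithmetic here can be checked with a short sketch. The deltas (5.3B instructions, 1.7B cycles) and the 0.47 baseline CPI come from the text above; the absolute baseline instruction count is not stated here, so `baseline_instructions` below is a hypothetical placeholder used only to show why removing low-CPI work pushes the overall CPI up.

```rust
/// Cycles per instruction for a given cycle and instruction count.
fn cpi(cycles: f64, instructions: f64) -> f64 {
    cycles / instructions
}

/// Overall CPI after removing `instructions_saved` instructions and
/// `cycles_saved` cycles from a baseline run.
fn cpi_after_removal(
    baseline_instructions: f64,
    baseline_cpi: f64,
    instructions_saved: f64,
    cycles_saved: f64,
) -> f64 {
    let baseline_cycles = baseline_instructions * baseline_cpi;
    cpi(
        baseline_cycles - cycles_saved,
        baseline_instructions - instructions_saved,
    )
}

fn main() {
    // Deltas quoted in the text above.
    let instructions_saved = 5.3e9;
    let cycles_saved = 1.7e9;

    // CPI of the work that was removed.
    let removed = cpi(cycles_saved, instructions_saved);
    println!("CPI of the removed work: {:.2}", removed);

    // HYPOTHETICAL baseline instruction count: not taken from the data.
    let baseline_instructions = 50.0e9;
    let baseline_cpi = 0.47;

    let after = cpi_after_removal(
        baseline_instructions,
        baseline_cpi,
        instructions_saved,
        cycles_saved,
    );
    println!("overall CPI: {:.2} -> {:.2}", baseline_cpi, after);
}
```

The point is general: whenever the removed work has a lower CPI than the run as a whole, the remaining instructions have a higher average CPI, even though total cycles (and so run-time) go down.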
The second fix was to avoid
This is also a win -- 515M fewer instructions, including 144M fewer loads and 180M fewer stores. This win doesn't show up quite as much in overall wall clock time (where the delta will be in the noise, at least for this load), but it's clearly a win nonetheless and should be integrated.
For those that are curious, here is much more raw data, albeit from a different run. (That is, exact numbers will vary, but trends should be the same.)