understand performance of C versus Rust implementations #42

Open
bcantrill opened this Issue Sep 22, 2018 · 3 comments

bcantrill commented Sep 22, 2018

This issue is to capture some performance analysis of the C implementation versus the Rust implementation, as well as to explore the effect of David Tolnay's proposed improvements.

These tests were run on SmartOS on a single-socket Haswell server (Xeon E3-1270 v3) running at 3.50GHz. All of the tests were run bound to a processor set containing a single core; all were bound to one logical CPU within that core, with the other logical CPU forced to be idle. cpustat was used to gather CPU performance counter data; each number below denotes a separate run with pic0 programmed to the indicated CPU performance counter.

The input file (~30MB compressed) contains 3.9M state changes, and in the default config will generate a ~6MB SVG.

- statemap-gcc is the node.js/C hybrid compiled with GCC 7.3.0.
- The percentage after each value is the delta relative to statemap-gcc.
- statemap-clang is the node.js/C hybrid compiled with clang 6.0.1.
- statemap-rust is the Rust implementation compiled with rustc 1.29.0 (--release) but without any additional optimizations.
- statemap-dt-deser is dtolnay's deserializing optimization, as outlined in his pull request, also compiled with rustc 1.29.0.
- statemap-dt-deser-nostring is dtolnay's deserializing optimization plus his elimination of the transient String allocation used to parse the time field of StatemapInputDatum, also compiled with rustc 1.29.0.

First, and to get it out of the way, here is the GCC-compiled C version relative to the Clang-compiled C version:

| Counter | statemap-gcc | statemap-clang | Δ |
| --- | ---: | ---: | ---: |
| cpu_clk_unhalted.thread_p | 32166437125 | 30479632976 | -5.2% |
| inst_retired.any_p | 49110875829 | 50526767701 | 2.9% |
| cpu_clk_unhalted.ref_p | 918870673 | 870925736 | -5.2% |
| mem_uops_retired.stlb_miss_loads | 8651386 | 8461734 | -2.2% |
| mem_uops_retired.stlb_miss_stores | 268802 | 240486 | -10.5% |
| mem_uops_retired.lock_loads | 7791528 | 6192277 | -20.5% |
| mem_uops_retired.split_loads | 107969 | 121187 | 12.2% |
| mem_uops_retired.split_stores | 196934 | 157650 | -19.9% |
| mem_uops_retired.all_loads | 11977544999 | 11862421792 | -1.0% |
| mem_uops_retired.all_stores | 3911589945 | 3717098293 | -5.0% |
| mem_load_uops_retired.l1_hit | 9337365435 | 9285834500 | -0.6% |
| mem_load_uops_retired.l2_hit | 1205703362 | 1214733389 | 0.7% |
| mem_load_uops_retired.l3_hit | 66771301 | 67133832 | 0.5% |
| mem_load_uops_retired.l1_miss | 1276311911 | 1290571092 | 1.1% |
| mem_load_uops_retired.l2_miss | 69671774 | 70013995 | 0.5% |
| mem_load_uops_retired.l3_miss | 2544750 | 2123972 | -16.5% |
| mem_load_uops_retired.hit_lfb | 1393631815 | 1376606034 | -1.2% |
| mem_load_uops_l3_hit_retired.xsnp_miss | 435 | 547 | 25.7% |
| mem_load_uops_l3_hit_retired.xsnp_hit | 1269 | 1140 | -10.2% |
| mem_load_uops_l3_hit_retired.xsnp_hitm | 820 | 1100 | 34.1% |
| mem_load_uops_l3_hit_retired.xsnp_none | 67846758 | 67620920 | -0.3% |
| mem_load_uops_l3_miss_retired.local_dram | 2543699 | 2082507 | -18.1% |

There is a significant delta here: the Clang-compiled version is 5% faster in run-time despite retiring more instructions (1.4B more), thanks to better memory behavior, with CPI dropping from 0.65 (GCC) to 0.60 (Clang).
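The CPI figures quoted here and below fall straight out of the two clock/instruction counters; a quick sketch of the arithmetic, using the values from the table above:

```rust
// CPI = cpu_clk_unhalted.thread_p / inst_retired.any_p
fn cpi(unhalted_cycles: u64, retired_instructions: u64) -> f64 {
    unhalted_cycles as f64 / retired_instructions as f64
}

fn main() {
    let gcc = cpi(32_166_437_125, 49_110_875_829);
    let clang = cpi(30_479_632_976, 50_526_767_701);
    println!("gcc CPI = {:.2}, clang CPI = {:.2}", gcc, clang); // 0.65 vs 0.60
}
```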

Now here is the C version relative to the Rust version:

| Counter | statemap-gcc | statemap-rust | Δ |
| --- | ---: | ---: | ---: |
| cpu_clk_unhalted.thread_p | 32166437125 | 23127271226 | -28.1% |
| inst_retired.any_p | 49110875829 | 48752136699 | -0.7% |
| cpu_clk_unhalted.ref_p | 918870673 | 660493684 | -28.1% |
| mem_uops_retired.stlb_miss_loads | 8651386 | 2353178 | -72.8% |
| mem_uops_retired.stlb_miss_stores | 268802 | 1000684 | 272.3% |
| mem_uops_retired.lock_loads | 7791528 | 51737 | -99.3% |
| mem_uops_retired.split_loads | 107969 | 52745125 | 48752.1% |
| mem_uops_retired.split_stores | 196934 | 41814301 | 21132.6% |
| mem_uops_retired.all_loads | 11977544999 | 9035048050 | -24.6% |
| mem_uops_retired.all_stores | 3911589945 | 6627038769 | 69.4% |
| mem_load_uops_retired.l1_hit | 9337365435 | 8756546174 | -6.2% |
| mem_load_uops_retired.l2_hit | 1205703362 | 70967580 | -94.1% |
| mem_load_uops_retired.l3_hit | 66771301 | 33323740 | -50.1% |
| mem_load_uops_retired.l1_miss | 1276311911 | 105524579 | -91.7% |
| mem_load_uops_retired.l2_miss | 69671774 | 34616966 | -50.3% |
| mem_load_uops_retired.l3_miss | 2544750 | 1364435 | -46.4% |
| mem_load_uops_retired.hit_lfb | 1393631815 | 157897686 | -88.7% |
| mem_load_uops_l3_hit_retired.xsnp_miss | 435 | 526 | 20.9% |
| mem_load_uops_l3_hit_retired.xsnp_hit | 1269 | 740 | -41.7% |
| mem_load_uops_l3_hit_retired.xsnp_hitm | 820 | 517 | -37.0% |
| mem_load_uops_l3_hit_retired.xsnp_none | 67846758 | 33376449 | -50.8% |
| mem_load_uops_l3_miss_retired.local_dram | 2543699 | 1301381 | -48.8% |

The Rust version is issuing a remarkably similar number of instructions (within less than one percent!), but with a decidedly different mix: just three quarters of the loads of the C version and (interestingly) many more stores. The CPI drops from 0.65 to 0.47, indicating much better memory behavior -- and indeed the L1 misses, L2 misses and L3 misses are all way down. The L1 hits as an absolute number are actually quite high relative to the loads, giving Rust a 96.9% L1 hit rate versus the C version's 77.9% hit rate. Rust also lives much better in the L2, where it has half the L2 misses of the C version.
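The hit-rate comparison is just the ratio of L1 hits to all retired loads; a sketch of that arithmetic, using the counter values from the table above:

```rust
// L1 hit rate = mem_load_uops_retired.l1_hit / mem_uops_retired.all_loads
fn l1_hit_rate(l1_hits: u64, all_loads: u64) -> f64 {
    100.0 * l1_hits as f64 / all_loads as f64
}

fn main() {
    let c = l1_hit_rate(9_337_365_435, 11_977_544_999);
    let rust = l1_hit_rate(8_756_546_174, 9_035_048_050);
    // Roughly 78% for the C version versus 97% for the Rust version.
    println!("C L1 hit rate: {:.1}%, Rust L1 hit rate: {:.1}%", c, rust);
}
```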

Now for dtolnay's deserializing improvement (that is, leaning on serde_json::StreamDeserializer instead of scanning ahead for the end of the current JSON blob). This improvement ditches the double lexing. We would expect the number of loads to be reduced by roughly the size of the input, but because these happen within the current JSON blob being parsed (and because these blobs by and large fit in the L1), we would expect these to mainly be L1/L2 hits. Here's the data:

| Counter | statemap-rust | Δ vs gcc | statemap-dt-deser | Δ vs gcc |
| --- | ---: | ---: | ---: | ---: |
| cpu_clk_unhalted.thread_p | 23127271226 | -28.1% | 21390540509 | -33.5% |
| inst_retired.any_p | 48752136699 | -0.7% | 43365716806 | -11.7% |
| cpu_clk_unhalted.ref_p | 660493684 | -28.1% | 610826326 | -33.5% |
| mem_uops_retired.stlb_miss_loads | 2353178 | -72.8% | 2344946 | -72.9% |
| mem_uops_retired.stlb_miss_stores | 1000684 | 272.3% | 1017404 | 278.5% |
| mem_uops_retired.lock_loads | 51737 | -99.3% | 35968 | -99.5% |
| mem_uops_retired.split_loads | 52745125 | 48752.1% | 54414644 | 50298.4% |
| mem_uops_retired.split_stores | 41814301 | 21132.6% | 43441378 | 21958.9% |
| mem_uops_retired.all_loads | 9035048050 | -24.6% | 8736708559 | -27.1% |
| mem_uops_retired.all_stores | 6627038769 | 69.4% | 6576131085 | 68.1% |
| mem_load_uops_retired.l1_hit | 8756546174 | -6.2% | 8469374960 | -9.3% |
| mem_load_uops_retired.l2_hit | 70967580 | -94.1% | 71849934 | -94.0% |
| mem_load_uops_retired.l3_hit | 33323740 | -50.1% | 33118819 | -50.4% |
| mem_load_uops_retired.l1_miss | 105524579 | -91.7% | 106343931 | -91.7% |
| mem_load_uops_retired.l2_miss | 34616966 | -50.3% | 34514347 | -50.5% |
| mem_load_uops_retired.l3_miss | 1364435 | -46.4% | 1222417 | -52.0% |
| mem_load_uops_retired.hit_lfb | 157897686 | -88.7% | 159649530 | -88.5% |
| mem_load_uops_l3_hit_retired.xsnp_miss | 526 | 20.9% | 408 | -6.2% |
| mem_load_uops_l3_hit_retired.xsnp_hit | 740 | -41.7% | 731 | -42.4% |
| mem_load_uops_l3_hit_retired.xsnp_hitm | 517 | -37.0% | 538 | -34.4% |
| mem_load_uops_l3_hit_retired.xsnp_none | 33376449 | -50.8% | 33313705 | -50.9% |
| mem_load_uops_l3_miss_retired.local_dram | 1301381 | -48.8% | 1241504 | -51.2% |
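From the deltas in this table, the instructions eliminated by the change had a CPI of about 0.32; a sketch of the arithmetic:

```rust
fn main() {
    // Deltas between statemap-rust and statemap-dt-deser, from the table above.
    let cycles_removed: u64 = 23_127_271_226 - 21_390_540_509;
    let instructions_removed: u64 = 48_752_136_699 - 43_365_716_806;
    let cpi = cycles_removed as f64 / instructions_removed as f64;
    println!("CPI of eliminated instructions: {:.2}", cpi); // 0.32
}
```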

This is roughly what we would expect, though the L3 misses have been reduced by more than we would expect. Note, too, that while the loads have dropped by ~298M, the total retired instruction count has dropped by 5.3B -- very significant! And the accompanying drop of 1.7B cycles shows that the instructions we took away were (as we expected) well behaved in terms of memory: they had a CPI of just 0.32, which is why our overall CPI actually rose (from 0.47 to 0.49). All of which is a long way of saying: the double lexing was well behaved with respect to memory, but it was a bunch of unnecessary work, and dtolnay's fix is a really nice win!

The second fix was to avoid String creation for StatemapInputDatum's time member, as it is after all going to be turned directly into a u64. We would expect this to eliminate an allocation followed by a deallocation:

| Counter | statemap-dt-deser | Δ vs gcc | statemap-dt-deser-nostring | Δ vs gcc |
| --- | ---: | ---: | ---: | ---: |
| cpu_clk_unhalted.thread_p | 21390540509 | -33.5% | 21377152084 | -33.5% |
| inst_retired.any_p | 43365716806 | -11.7% | 42850414294 | -12.7% |
| cpu_clk_unhalted.ref_p | 610826326 | -33.5% | 610429113 | -33.6% |
| mem_uops_retired.stlb_miss_loads | 2344946 | -72.9% | 2310256 | -73.3% |
| mem_uops_retired.stlb_miss_stores | 1017404 | 278.5% | 1005312 | 274.0% |
| mem_uops_retired.lock_loads | 35968 | -99.5% | 52584 | -99.3% |
| mem_uops_retired.split_loads | 54414644 | 50298.4% | 40418209 | 37335.0% |
| mem_uops_retired.split_stores | 43441378 | 21958.9% | 36115193 | 18238.7% |
| mem_uops_retired.all_loads | 8736708559 | -27.1% | 8592065288 | -28.3% |
| mem_uops_retired.all_stores | 6576131085 | 68.1% | 6395531505 | 63.5% |
| mem_load_uops_retired.l1_hit | 8469374960 | -9.3% | 8320089097 | -10.9% |
| mem_load_uops_retired.l2_hit | 71849934 | -94.0% | 73450488 | -93.9% |
| mem_load_uops_retired.l3_hit | 33118819 | -50.4% | 34009737 | -49.1% |
| mem_load_uops_retired.l1_miss | 106343931 | -91.7% | 108820312 | -91.5% |
| mem_load_uops_retired.l2_miss | 34514347 | -50.5% | 35343590 | -49.3% |
| mem_load_uops_retired.l3_miss | 1222417 | -52.0% | 1276716 | -49.8% |
| mem_load_uops_retired.hit_lfb | 159649530 | -88.5% | 158752995 | -88.6% |
| mem_load_uops_l3_hit_retired.xsnp_miss | 408 | -6.2% | 247 | -43.2% |
| mem_load_uops_l3_hit_retired.xsnp_hit | 731 | -42.4% | 629 | -50.4% |
| mem_load_uops_l3_hit_retired.xsnp_hitm | 538 | -34.4% | 634 | -22.7% |
| mem_load_uops_l3_hit_retired.xsnp_none | 33313705 | -50.9% | 34020114 | -49.9% |
| mem_load_uops_l3_miss_retired.local_dram | 1241504 | -51.2% | 1368773 | -46.2% |

This is also a win -- 515M fewer instructions, including 144M fewer loads and 180M fewer stores. The win doesn't show up quite as much in overall wall-clock time (where the delta will be in the noise, at least for this load), but it's clearly a win nonetheless and should be integrated.

For those that are curious, here is much more raw data, albeit from a different run. (That is, exact numbers will vary, but trends should be the same.)

@bcantrill bcantrill changed the title from parsing performance could be improved significantly to understand performance of C versus Rust implementations Sep 22, 2018


danmcd commented Sep 24, 2018

What optimization levels are you using with gcc & clang?

bcantrill commented Sep 24, 2018

@danmcd: All with their "optimal" optimization levels (i.e., -O2 or logical equivalent). I experimented with higher levels of optimization but saw no difference in execution time.

jclulow commented Sep 25, 2018

Was the C version a 64-bit binary or 32-bit binary?
