
py/map.c: Add a cache of bytecode location to map position. #7680

Closed · wants to merge 7 commits

Conversation

@jimmo (Member) commented Aug 18, 2021

Background: on Unix (which only ever executes bytecode from RAM) we have a feature (MICROPY_OPT_CACHE_MAP_LOOKUP_IN_BYTECODE) where the bytecode reserves an extra byte after the load-attr/load-global opcodes; at runtime that byte stores a map index giving the likely location of the field, so the next lookup can start there. This isn't enabled on any other ports because they execute bytecode from ROM (as well as RAM), and we can't have a different bytecode format for the two cases.

See also #7653 (comment) (reduce the number of options for mpy-cross).

This is a different approach: it adds a global cache, keyed by the bytecode instruction pointer, that maps to an offset within a map, and all mp_map_lookup operations then consult it. It's much less precise than the bytecode caching, but it allows the cache to be external to the bytecode.
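To make the mechanism concrete, here's a minimal sketch of the lookup fast path in py/map.c, assuming a 128-entry byte table (map_lookup_cache and MP_STATE_VM(lookup_hint) are names from this PR; the helper function, the hash shift and the elided fallback are illustrative, not the actual diff):

#define MAP_LOOKUP_CACHE_SIZE (128)
static uint8_t map_lookup_cache[MAP_LOOKUP_CACHE_SIZE]; // likely position within a map's table

static mp_map_elem_t *map_lookup_with_hint(mp_map_t *map, mp_obj_t index) {
    // The VM stores the current bytecode instruction pointer in MP_STATE_VM(lookup_hint),
    // so the slot effectively identifies "the lookup made by this instruction".
    size_t slot = ((uintptr_t)MP_STATE_VM(lookup_hint) >> 2) % MAP_LOOKUP_CACHE_SIZE;
    size_t pos = map_lookup_cache[slot];
    if (pos < map->alloc && map->table[pos].key == index) {
        return &map->table[pos]; // hit: skip hashing and linear probing entirely
    }
    mp_map_elem_t *elem = NULL; // ... fall back to the normal hash/linear lookup ...
    if (elem != NULL) {
        map_lookup_cache[slot] = (uint8_t)(elem - map->table); // remember where it was found
    }
    return elem;
}

A stale or colliding slot only costs one failed comparison, because the hinted element is always verified against the key before it is trusted.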

TODO / discuss:

  • I'm sure there's a bunch of tuning that can be done, and collecting some stats would be useful (e.g. do we see a lot of collisions due to the power-of-two cache size rather than a prime-sized one?).
  • This doesn't work for native code. Need a way to set MP_STATE_VM(lookup_hint) in the emitted code.
  • Thread safety? (Might actually be fine)

On PYBV11 it's +156 bytes (+128 bytes RAM). Performance tests (in score mode) for PYBV11 and PYBD_SF6 below:

$ ./run-perfbench.py -s ~/cache-baseline ~/cache-test             (PYBV11)
diff of scores (higher is better)
N=100 M=100              /home/jimmo/cache-baseline -> /home/jimmo/cache-test         diff      diff% (error%)
bm_chaos.py                  313.53 ->     335.39 :     +21.86 =  +6.972% (+/-0.01%)
bm_fannkuch.py                78.17 ->      77.31 :      -0.86 =  -1.100% (+/-0.01%)
bm_fft.py                   2551.16 ->    2536.08 :     -15.08 =  -0.591% (+/-0.00%)
bm_float.py                 4930.98 ->    5258.34 :    +327.36 =  +6.639% (+/-0.02%)
bm_hexiom.py                  35.62 ->      42.22 :      +6.60 = +18.529% (+/-0.01%)
bm_nqueens.py               4219.69 ->    4416.12 :    +196.43 =  +4.655% (+/-0.01%)
bm_pidigits.py               647.28 ->     650.16 :      +2.88 =  +0.445% (+/-0.13%)
misc_aes.py                  384.48 ->     409.02 :     +24.54 =  +6.383% (+/-0.02%)
misc_mandel.py              3121.44 ->    3532.50 :    +411.06 = +13.169% (+/-0.00%)
misc_pystone.py             1954.67 ->    2154.75 :    +200.08 = +10.236% (+/-0.01%)
misc_raytrace.py             314.00 ->     338.71 :     +24.71 =  +7.869% (+/-0.01%)

$ ./run-perfbench.py -s ~/cache-baseline-sf6 ~/cache-test-sf6        (PYBD_SF6)
diff of scores (higher is better)
N=100 M=100              /home/jimmo/cache-baseline-sf6 -> /home/jimmo/cache-test-sf6         diff      diff% (error%)
bm_chaos.py                  382.55 ->     398.78 :     +16.23 =  +4.243% (+/-0.03%)
bm_fannkuch.py               101.40 ->      97.95 :      -3.45 =  -3.402% (+/-0.04%)
bm_fft.py                   1815.18 ->    1760.44 :     -54.74 =  -3.016% (+/-0.02%)
bm_float.py                 5939.99 ->    6470.11 :    +530.12 =  +8.925% (+/-0.05%)
bm_hexiom.py                  49.78 ->      53.66 :      +3.88 =  +7.794% (+/-0.02%)
bm_nqueens.py               5706.74 ->    5732.61 :     +25.87 =  +0.453% (+/-0.02%)
bm_pidigits.py               959.77 ->     958.86 :      -0.91 =  -0.095% (+/-0.03%)
misc_aes.py                  530.32 ->     566.37 :     +36.05 =  +6.798% (+/-0.06%)
misc_mandel.py              4037.38 ->    4197.41 :    +160.03 =  +3.964% (+/-0.17%)
misc_pystone.py             2485.09 ->    2646.59 :    +161.50 =  +6.499% (+/-0.05%)
misc_raytrace.py             401.04 ->     431.79 :     +30.75 =  +7.668% (+/-0.04%)

@jimmo (Member Author) commented Aug 18, 2021

For comparison, here's how the existing MICROPY_OPT_CACHE_MAP_LOOKUP_IN_BYTECODE option performs on PYBV11 and PYBD_SF6:

$ ./run-perfbench.py -s ~/cache-baseline ~/cache-bytecode-pybv11 
diff of scores (higher is better)
N=100 M=100              /home/jimmo/cache-baseline -> /home/jimmo/cache-bytecode-pybv11         diff      diff% (error%)
bm_chaos.py                  313.53 ->     358.90 :     +45.37 = +14.471% (+/-0.01%)
bm_fannkuch.py                78.17 ->      78.48 :      +0.31 =  +0.397% (+/-0.01%)
bm_fft.py                   2551.16 ->    2576.19 :     +25.03 =  +0.981% (+/-0.00%)
bm_float.py                 4930.98 ->    6296.83 :   +1365.85 = +27.699% (+/-0.02%)
bm_hexiom.py                  35.62 ->      34.45 :      -1.17 =  -3.285% (+/-0.00%)
bm_nqueens.py               4219.69 ->    4069.66 :    -150.03 =  -3.555% (+/-0.00%)
bm_pidigits.py               647.28 ->     652.31 :      +5.03 =  +0.777% (+/-0.21%)
misc_aes.py                  384.48 ->     453.34 :     +68.86 = +17.910% (+/-0.01%)
misc_mandel.py              3121.44 ->    2960.68 :    -160.76 =  -5.150% (+/-0.01%)
misc_pystone.py             1954.67 ->    2513.17 :    +558.50 = +28.573% (+/-0.00%)
misc_raytrace.py             314.00 ->     372.69 :     +58.69 = +18.691% (+/-0.00%)
$ ./run-perfbench.py -s ~/cache-baseline-sf6 ~/cache-bytecode-sf6 
diff of scores (higher is better)
N=100 M=100              /home/jimmo/cache-baseline-sf6 -> /home/jimmo/cache-bytecode-sf6         diff      diff% (error%)
bm_chaos.py                  382.55 ->     453.09 :     +70.54 = +18.439% (+/-0.03%)
bm_fannkuch.py               101.40 ->     101.85 :      +0.45 =  +0.444% (+/-0.04%)
bm_fft.py                   1815.18 ->    1800.96 :     -14.22 =  -0.783% (+/-0.02%)
bm_float.py                 5939.99 ->    8103.30 :   +2163.31 = +36.419% (+/-0.05%)
bm_hexiom.py                  49.78 ->      49.53 :      -0.25 =  -0.502% (+/-0.02%)
bm_nqueens.py               5706.74 ->    5439.30 :    -267.44 =  -4.686% (+/-0.02%)
bm_pidigits.py               959.77 ->     969.55 :      +9.78 =  +1.019% (+/-0.06%)
misc_aes.py                  530.32 ->     627.16 :     +96.84 = +18.261% (+/-0.06%)
misc_mandel.py              4037.38 ->    3722.01 :    -315.37 =  -7.811% (+/-0.17%)
misc_pystone.py             2485.09 ->    3317.66 :    +832.57 = +33.503% (+/-0.05%)
misc_raytrace.py             401.04 ->     507.98 :    +106.94 = +26.666% (+/-0.03%)

@jimmo (Member Author) commented Aug 18, 2021

I've updated the branch. It's now +176 bytes, but marginally faster.

bm_float and bm_fannkuch see no difference because they do zero map lookups (!!!) during the main loop, so the cost of updating lookup_hint can't be recouped. Possibly consider only updating lookup_hint for certain ops. misc_pystone sees a lot of cache collisions (which is why bytecode caching does so well on it).

By optimising for code size instead, you can change the table to 128 entries (instead of 127) and save about 28 bytes, and put MAP_CACHE_SET into a function to save about 32 bytes. This gives you about 80% of the performance gain.
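Presumably the ~28 bytes come from how the slot is computed: with a power-of-two table the modulo folds into a bit-mask, while a prime-sized table needs a real division but spreads colliding keys better. A sketch with assumed macro names, illustrative only:

#define MAP_CACHE_SLOT_POW2(index)  ((((uintptr_t)(index)) >> 2) & (128 - 1)) // 128 entries: AND, smaller code
#define MAP_CACHE_SLOT_PRIME(index) ((((uintptr_t)(index)) >> 2) % 127)       // 127 entries: division, fewer collisions

Similarly, making MAP_CACHE_SET an out-of-line function keeps one copy of the update code instead of a copy inlined at every call site, which is presumably where the other ~32 bytes come from.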

$ ./run-perfbench.py -s ~/cache-baseline ~/cache-test-pybv11-127-inline              (PYBV11)
diff of scores (higher is better)
N=100 M=100              /home/jimmo/cache-baseline -> /home/jimmo/cache-test-pybv11-127-inline         diff      diff% (error%)
bm_chaos.py                  313.53 ->     342.64 :     +29.11 =  +9.285% (+/-0.02%)
bm_fannkuch.py                78.17 ->      78.01 :      -0.16 =  -0.205% (+/-0.01%)
bm_fft.py                   2551.16 ->    2548.89 :      -2.27 =  -0.089% (+/-0.00%)
bm_float.py                 4930.98 ->    5380.83 :    +449.85 =  +9.123% (+/-0.02%)
bm_hexiom.py                  35.62 ->      42.11 :      +6.49 = +18.220% (+/-0.00%)
bm_nqueens.py               4219.69 ->    4426.41 :    +206.72 =  +4.899% (+/-0.01%)
bm_pidigits.py               647.28 ->     650.82 :      +3.54 =  +0.547% (+/-0.12%)
misc_aes.py                  384.48 ->     415.54 :     +31.06 =  +8.078% (+/-0.02%)
misc_mandel.py              3121.44 ->    3544.05 :    +422.61 = +13.539% (+/-0.01%)
misc_pystone.py             1954.67 ->    2086.51 :    +131.84 =  +6.745% (+/-0.01%)
misc_raytrace.py             314.00 ->     344.56 :     +30.56 =  +9.732% (+/-0.01%)
$ ./run-perfbench.py -s ~/cache-baseline-sf6 ~/cache-test-sf6-127-inline             (SF6)
diff of scores (higher is better)
N=100 M=100              /home/jimmo/cache-baseline-sf6 -> /home/jimmo/cache-test-sf6-127-inline         diff      diff% (error%)
bm_chaos.py                  382.55 ->     398.09 :     +15.54 =  +4.062% (+/-0.03%)
bm_fannkuch.py               101.40 ->      98.12 :      -3.28 =  -3.235% (+/-0.04%)
bm_fft.py                   1815.18 ->    1764.46 :     -50.72 =  -2.794% (+/-0.01%)
bm_float.py                 5939.99 ->    6428.43 :    +488.44 =  +8.223% (+/-0.07%)
bm_hexiom.py                  49.78 ->      53.75 :      +3.97 =  +7.975% (+/-0.03%)
bm_nqueens.py               5706.74 ->    5762.30 :     +55.56 =  +0.974% (+/-0.02%)
bm_pidigits.py               959.77 ->     961.42 :      +1.65 =  +0.172% (+/-0.02%)
misc_aes.py                  530.32 ->     565.72 :     +35.40 =  +6.675% (+/-0.07%)
misc_mandel.py              4037.38 ->    4198.66 :    +161.28 =  +3.995% (+/-0.17%)
misc_pystone.py             2485.09 ->    2545.54 :     +60.45 =  +2.433% (+/-0.05%)
misc_raytrace.py             401.04 ->     433.08 :     +32.04 =  +7.989% (+/-0.03%)

@jimmo (Member Author) commented Aug 18, 2021

Possibly consider only updating lookup_hint for certain ops.

This definitely helps. Now +124 bytes on PYBV11.

$ ./run-perfbench.py -s ~/cache-baseline ~/cache-vmsel-pybv11-ro
diff of scores (higher is better)
N=100 M=100              /home/jimmo/cache-baseline -> /home/jimmo/cache-vmsel-pybv11-ro         diff      diff% (error%)
bm_chaos.py                  313.53 ->     339.58 :     +26.05 =  +8.309% (+/-0.01%)
bm_fannkuch.py                78.17 ->      78.18 :      +0.01 =  +0.013% (+/-0.01%)
bm_fft.py                   2551.16 ->    2561.48 :     +10.32 =  +0.405% (+/-0.00%)
bm_float.py                 4930.98 ->    5205.65 :    +274.67 =  +5.570% (+/-0.02%)
bm_hexiom.py                  35.62 ->      41.94 :      +6.32 = +17.743% (+/-0.01%)
bm_nqueens.py               4219.69 ->    4432.77 :    +213.08 =  +5.050% (+/-0.01%)
bm_pidigits.py               647.28 ->     652.33 :      +5.05 =  +0.780% (+/-0.18%)
misc_aes.py                  384.48 ->     417.87 :     +33.39 =  +8.684% (+/-0.02%)
misc_mandel.py              3121.44 ->    3560.35 :    +438.91 = +14.061% (+/-0.01%)
misc_pystone.py             1954.67 ->    2118.40 :    +163.73 =  +8.376% (+/-0.01%)
misc_raytrace.py             314.00 ->     337.16 :     +23.16 =  +7.376% (+/-0.01%)

@jimmo (Member Author) commented Aug 18, 2021

Just as a random side quest, something I've been meaning to investigate for a while is the computed goto. On PYBV11, disabling it saves 1276 bytes.
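For context, MICROPY_OPT_COMPUTED_GOTO makes the VM dispatch each opcode by jumping through a table of label addresses (a GCC/Clang extension) instead of going back around a switch; the dispatch sequence gets duplicated into every handler, which is largely why disabling it shrinks the binary. A toy illustration of the two dispatch styles (not MicroPython's actual vm.c; USE_COMPUTED_GOTO is just a local toggle for this example):

#include <stdint.h>
#include <stdio.h>

enum { OP_INC, OP_DEC, OP_HALT };

static int run(const uint8_t *ip) {
    int acc = 0;
    #if defined(__GNUC__) && USE_COMPUTED_GOTO
    // Computed-goto dispatch: each handler ends by jumping straight to the next one.
    static const void *entry[] = { &&op_inc, &&op_dec, &&op_halt };
    #define DISPATCH() goto *entry[*ip++]
    DISPATCH();
    op_inc: acc++; DISPATCH();
    op_dec: acc--; DISPATCH();
    op_halt: return acc;
    #else
    // Switch dispatch: one shared dispatch point, smaller code but usually a bit slower.
    for (;;) {
        switch (*ip++) {
            case OP_INC: acc++; break;
            case OP_DEC: acc--; break;
            default: return acc; // OP_HALT
        }
    }
    #endif
}

int main(void) {
    const uint8_t prog[] = { OP_INC, OP_INC, OP_DEC, OP_HALT };
    printf("%d\n", run(prog)); // prints 1
    return 0;
}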

The effect on performance varies a bit:

$ ./run-perfbench.py -s ~/cache-baseline ~/cache-nogoto-pybv11 
diff of scores (higher is better)
N=100 M=100              /home/jimmo/cache-baseline -> /home/jimmo/cache-nogoto-pybv11         diff      diff% (error%)
bm_chaos.py                  313.53 ->     310.23 :      -3.30 =  -1.053% (+/-0.01%)
bm_fannkuch.py                78.17 ->      77.89 :      -0.28 =  -0.358% (+/-0.01%)
bm_fft.py                   2551.16 ->    2501.66 :     -49.50 =  -1.940% (+/-0.00%)
bm_float.py                 4930.98 ->    4906.96 :     -24.02 =  -0.487% (+/-0.01%)
bm_hexiom.py                  35.62 ->      35.63 :      +0.01 =  +0.028% (+/-0.00%)
bm_nqueens.py               4219.69 ->    4241.08 :     +21.39 =  +0.507% (+/-0.00%)
bm_pidigits.py               647.28 ->     643.02 :      -4.26 =  -0.658% (+/-0.19%)
misc_aes.py                  384.48 ->     383.75 :      -0.73 =  -0.190% (+/-0.01%)
misc_mandel.py              3121.44 ->    3047.03 :     -74.41 =  -2.384% (+/-0.01%)
misc_pystone.py             1954.67 ->    1952.34 :      -2.33 =  -0.119% (+/-0.01%)
misc_raytrace.py             314.00 ->     310.67 :      -3.33 =  -1.061% (+/-0.00%)

Combined with this PR (i.e. map caching without computed goto), it's a net -1156 bytes for an overall performance benefit (except for fft).

$ ./run-perfbench.py -s ~/cache-baseline ~/cache-vmsel-pybv11-ro-nogoto 
diff of scores (higher is better)
N=100 M=100              /home/jimmo/cache-baseline -> /home/jimmo/cache-vmsel-pybv11-ro-nogoto         diff      diff% (error%)
bm_chaos.py                  313.53 ->     337.03 :     +23.50 =  +7.495% (+/-0.01%)
bm_fannkuch.py                78.17 ->      78.29 :      +0.12 =  +0.154% (+/-0.01%)
bm_fft.py                   2551.16 ->    2513.10 :     -38.06 =  -1.492% (+/-0.00%)
bm_float.py                 4930.98 ->    5154.90 :    +223.92 =  +4.541% (+/-0.02%)
bm_hexiom.py                  35.62 ->      41.99 :      +6.37 = +17.883% (+/-0.01%)
bm_nqueens.py               4219.69 ->    4468.64 :    +248.95 =  +5.900% (+/-0.01%)
bm_pidigits.py               647.28 ->     650.73 :      +3.45 =  +0.533% (+/-0.10%)
misc_aes.py                  384.48 ->     413.51 :     +29.03 =  +7.550% (+/-0.02%)
misc_mandel.py              3121.44 ->    3496.68 :    +375.24 = +12.021% (+/-0.00%)
misc_pystone.py             1954.67 ->    2106.96 :    +152.29 =  +7.791% (+/-0.01%)
misc_raytrace.py             314.00 ->     335.69 :     +21.69 =  +6.908% (+/-0.01%)

@jimmo (Member Author) commented Aug 19, 2021

Based on a suggestion from @dpgeorge I have changed this to use (map pointer, key) as the cache index, and this is a significant win for both code size (now only +48 bytes) and performance.
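The revised macros look roughly like this (MAP_CACHE_OFFSET, MAP_CACHE_SET and map_lookup_cache appear in the diff hunks below; MAP_CACHE_GET here is an assumed companion accessor, and the exact hash input — the key alone vs. the map pointer plus the key — was itself benchmarked later in this thread, so treat the slot computation as illustrative):

#define MAP_CACHE_OFFSET(index)   ((((uintptr_t)(index)) >> 2) % sizeof(map_lookup_cache))
#define MAP_CACHE_GET(index)      (map_lookup_cache[MAP_CACHE_OFFSET(index)])
#define MAP_CACHE_SET(index, pos) (map_lookup_cache[MAP_CACHE_OFFSET(index)] = (uint8_t)(pos))

Because the hint now lives entirely inside mp_map_lookup (bounds-checked against map->alloc and verified against the key before use, and refreshed via MAP_CACHE_SET after a successful full lookup), it no longer depends on the VM setting a per-instruction hint.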

$ ./run-perfbench.py -s ~/cache-baseline ~/cache-mapindex-pybv11-128-shift
diff of scores (higher is better)
N=100 M=100              /home/jimmo/cache-baseline -> /home/jimmo/cache-mapindex-pybv11-128-shift         diff      diff% (error%)
bm_chaos.py                  313.53 ->     350.94 :     +37.41 = +11.932% (+/-0.01%)
bm_fannkuch.py                78.17 ->      78.49 :      +0.32 =  +0.409% (+/-0.01%)
bm_fft.py                   2551.16 ->    2559.18 :      +8.02 =  +0.314% (+/-0.00%)
bm_float.py                 4930.98 ->    5507.68 :    +576.70 = +11.695% (+/-0.02%)
bm_hexiom.py                  35.62 ->      42.98 :      +7.36 = +20.663% (+/-0.00%)
bm_nqueens.py               4219.69 ->    4434.57 :    +214.88 =  +5.092% (+/-0.00%)
bm_pidigits.py               647.28 ->     651.15 :      +3.87 =  +0.598% (+/-0.13%)
misc_aes.py                  384.48 ->     426.28 :     +41.80 = +10.872% (+/-0.01%)
misc_mandel.py              3121.44 ->    3614.86 :    +493.42 = +15.807% (+/-0.00%)
misc_pystone.py             1954.67 ->    2288.38 :    +333.71 = +17.072% (+/-0.01%)
misc_raytrace.py             314.00 ->     358.68 :     +44.68 = +14.229% (+/-0.01%)

It also works with the native emitter, adding a similar perf boost on top of the gain the native emitter already provides.

$ ./run-perfbench.py -s ~/cache-baseline-native-pybv11 ~/cache-mapindex-native-pybv11-128-shift 
diff of scores (higher is better)
N=100 M=100              /home/jimmo/cache-baseline-native-pybv11 -> /home/jimmo/cache-mapindex-native-pybv11-128-shift         diff      diff% (error%)
bm_asyncio.py                107.39 ->     109.17 :      +1.78 =  +1.658% (+/-0.01%)
bm_chaos.py                  378.90 ->     420.88 :     +41.98 = +11.079% (+/-0.01%)
bm_fannkuch.py               103.20 ->     102.18 :      -1.02 =  -0.988% (+/-0.01%)
bm_fft.py                   3589.92 ->    3612.56 :     +22.64 =  +0.631% (+/-0.02%)
bm_float.py                 5895.84 ->    6564.72 :    +668.88 = +11.345% (+/-0.00%)
bm_hexiom.py                  41.64 ->      51.66 :     +10.02 = +24.063% (+/-0.00%)
bm_nqueens.py               5043.33 ->    5354.31 :    +310.98 =  +6.166% (+/-0.01%)
bm_pidigits.py               708.65 ->     704.44 :      -4.21 =  -0.594% (+/-0.30%)
misc_mandel.py              3770.39 ->    4535.38 :    +764.99 = +20.289% (+/-0.01%)
misc_pystone.py             2430.32 ->    2913.25 :    +482.93 = +19.871% (+/-0.01%)
misc_raytrace.py             364.62 ->     416.84 :     +52.22 = +14.322% (+/-0.02%)

py/map.c Outdated
@@ -126,6 +126,15 @@ STATIC void mp_map_rehash(mp_map_t *map) {
m_del(mp_map_elem_t, old_table, old_alloc);
}

#if MICROPY_OPT_MAP_CACHING
uint8_t map_lookup_cache[128];
#define MAP_CACHE_OFFSET ((((uintptr_t) + (uintptr_t)index) >> 2) % sizeof(map_lookup_cache))
Member:

I think this first uintptr_t is missing something that it casts, maybe map or map->table?

Member Author:

Originally I had this using (map+index)>>2, but changed it to just use index>>2 because it's overall faster. Accidentally left behind the no-op cast. I've updated the branch, and addressed the other comment about the macro args.

FWIW, here's the diff of adding map back into the expression. It makes some things marginally better, but there's a big hit on pystone and raytrace.

$ ./run-perfbench.py -s ~/2.cache-index-pybv11 ~/2.cache-mapindex-pybv11
diff of scores (higher is better)
N=100 M=100              /home/jimmo/2.cache-index-pybv11 -> /home/jimmo/2.cache-mapindex-pybv11         diff      diff% (error%)
bm_asyncio.py                113.73 ->     113.43 :      -0.30 =  -0.264% (+/-0.01%)
bm_chaos.py                  348.59 ->     340.54 :      -8.05 =  -2.309% (+/-0.01%)
bm_fannkuch.py                78.53 ->      78.20 :      -0.33 =  -0.420% (+/-0.00%)
bm_fft.py                   2563.02 ->    2597.15 :     +34.13 =  +1.332% (+/-0.00%)
bm_float.py                 5544.62 ->    5440.36 :    -104.26 =  -1.880% (+/-0.02%)
bm_hexiom.py                  42.73 ->      42.74 :      +0.01 =  +0.023% (+/-0.00%)
bm_nqueens.py               4429.89 ->    4454.66 :     +24.77 =  +0.559% (+/-0.00%)
bm_pidigits.py               651.39 ->     649.83 :      -1.56 =  -0.239% (+/-0.34%)
misc_aes.py                  423.37 ->     428.64 :      +5.27 =  +1.245% (+/-0.01%)
misc_mandel.py              3610.82 ->    3624.53 :     +13.71 =  +0.380% (+/-0.01%)
misc_pystone.py             2287.95 ->    2185.46 :    -102.49 =  -4.480% (+/-0.01%)
misc_raytrace.py             362.17 ->     341.62 :     -20.55 =  -5.674% (+/-0.01%)

And while we're at it, here's the latest baseline->PR after rebasing (similar to the first table in #7680 (comment))

$ ./run-perfbench.py -s ~/2.cache-baseline-pybv11 ~/2.cache-index-pybv11
diff of scores (higher is better)
N=100 M=100              /home/jimmo/2.cache-baseline-pybv11 -> /home/jimmo/2.cache-index-pybv11         diff      diff% (error%)
bm_asyncio.py                111.34 ->     113.73 :      +2.39 =  +2.147% (+/-0.01%)
bm_chaos.py                  311.32 ->     348.59 :     +37.27 = +11.972% (+/-0.02%)
bm_fannkuch.py                77.82 ->      78.53 :      +0.71 =  +0.912% (+/-0.00%)
bm_fft.py                   2539.05 ->    2563.02 :     +23.97 =  +0.944% (+/-0.00%)
bm_float.py                 4955.56 ->    5544.62 :    +589.06 = +11.887% (+/-0.00%)
bm_hexiom.py                  35.52 ->      42.73 :      +7.21 = +20.298% (+/-0.00%)
bm_nqueens.py               4201.40 ->    4429.89 :    +228.49 =  +5.438% (+/-0.00%)
bm_pidigits.py               646.91 ->     651.39 :      +4.48 =  +0.693% (+/-0.17%)
misc_aes.py                  382.97 ->     423.37 :     +40.40 = +10.549% (+/-0.01%)
misc_mandel.py              3091.74 ->    3610.82 :    +519.08 = +16.789% (+/-0.01%)
misc_pystone.py             1948.00 ->    2287.95 :    +339.95 = +17.451% (+/-0.01%)
misc_raytrace.py             312.45 ->     362.17 :     +49.72 = +15.913% (+/-0.01%)

jimmo added a commit to jimmo/micropython that referenced this pull request Sep 6, 2021
This feature provided a significant performance boost for Unix, but wasn't
able to be enabled for MCU targets, and added significant extra complexity
to generating .mpy files. Some of the performance gain can be achieved now
with MICROPY_OPT_MAP_LOOKUP_CACHE instead.

See micropython#7680 for discussion.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
@jimmo changed the title from "RFC: py/map.c: Add a cache of bytecode location to map position." to "py/map.c: Add a cache of bytecode location to map position." on Sep 6, 2021
@jimmo (Member Author) commented Sep 6, 2021

I've added a commit to remove bytecode caching, though I'd happily make that a separate PR. The main reason is so that CI will run the tests with map caching (which doesn't do anything when bytecode caching is enabled).

@codecov-commenter

Codecov Report

Merging #7680 (6123bbf) into master (bbbdef4) will decrease coverage by 0.01%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master    #7680      +/-   ##
==========================================
- Coverage   98.26%   98.24%   -0.02%     
==========================================
  Files         154      154              
  Lines       20076    20032      -44     
==========================================
- Hits        19728    19681      -47     
- Misses        348      351       +3     
Impacted Files Coverage Δ
py/bc.c 89.13% <ø> (+0.47%) ⬆️
py/emitbc.c 100.00% <ø> (ø)
py/showbc.c 98.51% <ø> (-0.02%) ⬇️
py/vm.c 98.97% <ø> (-0.07%) ⬇️
py/map.c 99.49% <100.00%> (-0.51%) ⬇️
py/smallint.c 92.00% <0.00%> (-4.00%) ⬇️
py/objlist.c 99.23% <0.00%> (-0.39%) ⬇️
py/runtime.c 99.39% <0.00%> (-0.16%) ⬇️
... and 1 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@jimmo (Member Author) commented Sep 11, 2021

I set up a "quiet" Linux env to try to get some useful comparisons for Linux x64 (i3-5010U @ 2.10GHz). Normally the tests have +/- 30% uncertainty in my regular env; here it's more like +/- 3%.

baseline -> bytecode caching (enabled by default):

$ ./run-perfbench.py -s ~/unix-baseline ~/unix-bccache 
diff of scores (higher is better)
N=2000 M=2000            /home/jimmo/unix-baseline -> /home/jimmo/unix-bccache         diff      diff% (error%)
bm_chaos.py                12049.86 ->   13742.56 :   +1692.70 = +14.047% (+/-3.11%)
bm_fannkuch.py                61.59 ->      60.13 :      -1.46 =  -2.371% (+/-1.58%)
bm_fft.py                 112941.24 ->  113083.20 :    +141.96 =  +0.126% (+/-2.08%)
bm_float.py               202443.35 ->  256552.80 :  +54109.45 = +26.728% (+/-2.22%)
bm_hexiom.py                 531.50 ->     521.93 :      -9.57 =  -1.801% (+/-0.62%)
bm_nqueens.py             208519.45 ->  197544.25 :  -10975.20 =  -5.263% (+/-2.96%)
bm_pidigits.py              8090.77 ->    8072.98 :     -17.79 =  -0.220% (+/-3.27%)
misc_aes.py                13891.95 ->   17283.45 :   +3391.50 = +24.413% (+/-0.90%)
misc_mandel.py            120134.35 ->   99083.99 :  -21050.36 = -17.522% (+/-0.70%)
misc_pystone.py            67767.06 ->   83860.10 :  +16093.04 = +23.748% (+/-1.52%)
misc_raytrace.py           17480.07 ->   21490.40 :   +4010.33 = +22.942% (+/-1.78%)

baseline -> map caching

$ ./run-perfbench.py -s ~/unix-baseline ~/unix-mapcache
diff of scores (higher is better)
N=2000 M=2000            /home/jimmo/unix-baseline -> /home/jimmo/unix-mapcache         diff      diff% (error%)
bm_chaos.py                12049.86 ->   13171.22 :   +1121.36 =  +9.306% (+/-3.06%)
bm_fannkuch.py                61.59 ->      61.31 :      -0.28 =  -0.455% (+/-2.35%)
bm_fft.py                 112941.24 ->  113928.80 :    +987.56 =  +0.874% (+/-3.26%)
bm_float.py               202443.35 ->  229413.54 :  +26970.19 = +13.322% (+/-1.59%)
bm_hexiom.py                 531.50 ->     600.29 :     +68.79 = +12.943% (+/-3.34%)
bm_nqueens.py             208519.45 ->  216516.42 :   +7996.97 =  +3.835% (+/-2.51%)
bm_pidigits.py              8090.77 ->    8127.10 :     +36.33 =  +0.449% (+/-2.77%)
misc_aes.py                13891.95 ->   16270.39 :   +2378.44 = +17.121% (+/-0.72%)
misc_mandel.py            120134.35 ->  130021.80 :   +9887.45 =  +8.230% (+/-1.99%)
misc_pystone.py            67767.06 ->   80591.13 :  +12824.07 = +18.924% (+/-1.25%)
misc_raytrace.py           17480.07 ->   20544.16 :   +3064.09 = +17.529% (+/-1.79%)

map caching --> attr fast path (#7688)

$ ./run-perfbench.py -s ~/unix-mapcache ~/unix-attrmapcache
diff of scores (higher is better)
N=2000 M=2000            /home/jimmo/unix-mapcache -> /home/jimmo/unix-attrmapcache         diff      diff% (error%)
bm_chaos.py                13171.22 ->   13905.67 :    +734.45 =  +5.576% (+/-3.87%)
bm_fannkuch.py                61.31 ->      61.34 :      +0.03 =  +0.049% (+/-2.71%)
bm_fft.py                 113928.80 ->  114793.68 :    +864.88 =  +0.759% (+/-2.94%)
bm_float.py               229413.54 ->  243908.29 :  +14494.75 =  +6.318% (+/-1.63%)
bm_hexiom.py                 600.29 ->     625.41 :     +25.12 =  +4.185% (+/-2.93%)
bm_nqueens.py             216516.42 ->  217713.12 :   +1196.70 =  +0.553% (+/-2.28%)
bm_pidigits.py              8127.10 ->    8198.75 :     +71.65 =  +0.882% (+/-2.69%)
misc_aes.py                16270.39 ->   16480.52 :    +210.13 =  +1.291% (+/-0.74%)
misc_mandel.py            130021.80 ->  128939.84 :   -1081.96 =  -0.832% (+/-4.80%)
misc_pystone.py            80591.13 ->   82592.56 :   +2001.43 =  +2.483% (+/-2.24%)
misc_raytrace.py           20544.16 ->   22227.23 :   +1683.07 =  +8.192% (+/-1.97%)

baseline -> map caching & attr fast path

$ ./run-perfbench.py -s ~/unix-baseline ~/unix-attrmapcache
diff of scores (higher is better)
N=2000 M=2000            /home/jimmo/unix-baseline -> /home/jimmo/unix-attrmapcache         diff      diff% (error%)
bm_chaos.py                12049.86 ->   13905.67 :   +1855.81 = +15.401% (+/-3.66%)
bm_fannkuch.py                61.59 ->      61.34 :      -0.25 =  -0.406% (+/-2.04%)
bm_fft.py                 112941.24 ->  114793.68 :   +1852.44 =  +1.640% (+/-1.98%)
bm_float.py               202443.35 ->  243908.29 :  +41464.94 = +20.482% (+/-1.14%)
bm_hexiom.py                 531.50 ->     625.41 :     +93.91 = +17.669% (+/-0.61%)
bm_nqueens.py             208519.45 ->  217713.12 :   +9193.67 =  +4.409% (+/-2.13%)
bm_pidigits.py              8090.77 ->    8198.75 :    +107.98 =  +1.335% (+/-2.70%)
misc_aes.py                13891.95 ->   16480.52 :   +2588.57 = +18.634% (+/-0.91%)
misc_mandel.py            120134.35 ->  128939.84 :   +8805.49 =  +7.330% (+/-4.84%)
misc_pystone.py            67767.06 ->   82592.56 :  +14825.50 = +21.877% (+/-2.60%)
misc_raytrace.py           17480.07 ->   22227.23 :   +4747.16 = +27.158% (+/-1.96%)

bytecode caching -> map caching (this is probably the most important one because it's the diff created by this PR)

$ ./run-perfbench.py -s ~/unix-bccache ~/unix-mapcache
diff of scores (higher is better)
N=2000 M=2000            /home/jimmo/unix-bccache -> /home/jimmo/unix-mapcache         diff      diff% (error%)
bm_chaos.py                13742.56 ->   13171.22 :    -571.34 =  -4.157% (+/-3.31%)
bm_fannkuch.py                60.13 ->      61.31 :      +1.18 =  +1.962% (+/-2.43%)
bm_fft.py                 113083.20 ->  113928.80 :    +845.60 =  +0.748% (+/-3.02%)
bm_float.py               256552.80 ->  229413.54 :  -27139.26 = -10.578% (+/-2.09%)
bm_hexiom.py                 521.93 ->     600.29 :     +78.36 = +15.014% (+/-3.37%)
bm_nqueens.py             197544.25 ->  216516.42 :  +18972.17 =  +9.604% (+/-3.32%)
bm_pidigits.py              8072.98 ->    8127.10 :     +54.12 =  +0.670% (+/-3.27%)
misc_aes.py                17283.45 ->   16270.39 :   -1013.06 =  -5.861% (+/-0.69%)
misc_mandel.py             99083.99 ->  130021.80 :  +30937.81 = +31.224% (+/-2.46%)
misc_pystone.py            83860.10 ->   80591.13 :   -3268.97 =  -3.898% (+/-1.32%)
misc_raytrace.py           21490.40 ->   20544.16 :    -946.24 =  -4.403% (+/-1.77%)

bytecode caching -> map caching & attr fast path (diff of this PR plus #7688) (@stinos)

$ ./run-perfbench.py -s ~/unix-bccache ~/unix-attrmapcache
diff of scores (higher is better)
N=2000 M=2000            /home/jimmo/unix-bccache -> /home/jimmo/unix-attrmapcache         diff      diff% (error%)
bm_chaos.py                13742.56 ->   13905.67 :    +163.11 =  +1.187% (+/-3.75%)
bm_fannkuch.py                60.13 ->      61.34 :      +1.21 =  +2.012% (+/-2.11%)
bm_fft.py                 113083.20 ->  114793.68 :   +1710.48 =  +1.513% (+/-1.57%)
bm_float.py               256552.80 ->  243908.29 :  -12644.51 =  -4.929% (+/-1.90%)
bm_hexiom.py                 521.93 ->     625.41 :    +103.48 = +19.826% (+/-0.40%)
bm_nqueens.py             197544.25 ->  217713.12 :  +20168.87 = +10.210% (+/-3.01%)
bm_pidigits.py              8072.98 ->    8198.75 :    +125.77 =  +1.558% (+/-3.22%)
misc_aes.py                17283.45 ->   16480.52 :    -802.93 =  -4.646% (+/-0.82%)
misc_mandel.py             99083.99 ->  128939.84 :  +29855.85 = +30.132% (+/-5.88%)
misc_pystone.py            83860.10 ->   82592.56 :   -1267.54 =  -1.511% (+/-2.27%)
misc_raytrace.py           21490.40 ->   22227.23 :    +736.83 =  +3.429% (+/-1.88%)

baseline -> native

$ ./run-perfbench.py -s ~/unix-baseline ~/unix-baseline-native
diff of scores (higher is better)
N=2000 M=2000            /home/jimmo/unix-baseline -> /home/jimmo/unix-baseline-native         diff      diff% (error%)
bm_chaos.py                12049.86 ->   14130.62 :   +2080.76 = +17.268% (+/-4.90%)
bm_fannkuch.py                61.59 ->      74.96 :     +13.37 = +21.708% (+/-2.02%)
bm_fft.py                 112941.24 ->  166682.99 :  +53741.75 = +47.584% (+/-3.90%)
bm_float.py               202443.35 ->  233415.23 :  +30971.88 = +15.299% (+/-2.82%)
bm_hexiom.py                 531.50 ->     628.59 :     +97.09 = +18.267% (+/-1.59%)
bm_nqueens.py             208519.45 ->  225418.44 :  +16898.99 =  +8.104% (+/-2.99%)
bm_pidigits.py              8090.77 ->    6322.00 :   -1768.77 = -21.862% (+/-3.74%)
misc_aes.py                13891.95 ->   20670.10 :   +6778.15 = +48.792% (+/-0.91%)
misc_mandel.py            120134.35 ->  138221.11 :  +18086.76 = +15.055% (+/-2.32%)
misc_pystone.py            67767.06 ->   85032.14 :  +17265.08 = +25.477% (+/-2.43%)
misc_raytrace.py           17480.07 ->   19800.01 :   +2319.94 = +13.272% (+/-2.50%)

native -> native & map caching & attr fast path

$ ./run-perfbench.py -s ~/unix-baseline-native ~/unix-attrmapcache-native
diff of scores (higher is better)
N=2000 M=2000            /home/jimmo/unix-baseline-native -> /home/jimmo/unix-attrmapcache-native         diff      diff% (error%)
bm_chaos.py                14130.62 ->   15464.68 :   +1334.06 =  +9.441% (+/-7.11%)
bm_fannkuch.py                74.96 ->      76.16 :      +1.20 =  +1.601% (+/-1.80%)
bm_fft.py                 166682.99 ->  168221.86 :   +1538.87 =  +0.923% (+/-4.20%)
bm_float.py               233415.23 ->  265524.90 :  +32109.67 = +13.756% (+/-2.57%)
bm_hexiom.py                 628.59 ->     734.17 :    +105.58 = +16.796% (+/-1.39%)
bm_nqueens.py             225418.44 ->  232926.45 :   +7508.01 =  +3.331% (+/-3.10%)
bm_pidigits.py              6322.00 ->    6379.52 :     +57.52 =  +0.910% (+/-5.62%)
misc_aes.py                20670.10 ->   27223.18 :   +6553.08 = +31.703% (+/-1.56%)
misc_mandel.py            138221.11 ->  152014.01 :  +13792.90 =  +9.979% (+/-2.46%)
misc_pystone.py            85032.14 ->  105681.44 :  +20649.30 = +24.284% (+/-2.25%)
misc_raytrace.py           19800.01 ->   23350.73 :   +3550.72 = +17.933% (+/-2.79%)

baseline -> native & map caching & attr fast path (combination of previous two)

$ ./run-perfbench.py -s ~/unix-baseline ~/unix-attrmapcache-native
diff of scores (higher is better)
N=2000 M=2000            /home/jimmo/unix-baseline -> /home/jimmo/unix-attrmapcache-native         diff      diff% (error%)
bm_chaos.py                12049.86 ->   15464.68 :   +3414.82 = +28.339% (+/-7.09%)
bm_fannkuch.py                61.59 ->      76.16 :     +14.57 = +23.656% (+/-1.77%)
bm_fft.py                 112941.24 ->  168221.86 :  +55280.62 = +48.946% (+/-5.38%)
bm_float.py               202443.35 ->  265524.90 :  +63081.55 = +31.160% (+/-1.13%)
bm_hexiom.py                 531.50 ->     734.17 :    +202.67 = +38.132% (+/-0.89%)
bm_nqueens.py             208519.45 ->  232926.45 :  +24407.00 = +11.705% (+/-2.73%)
bm_pidigits.py              8090.77 ->    6379.52 :   -1711.25 = -21.151% (+/-3.60%)
misc_aes.py                13891.95 ->   27223.18 :  +13331.23 = +95.964% (+/-2.27%)
misc_mandel.py            120134.35 ->  152014.01 :  +31879.66 = +26.537% (+/-1.71%)
misc_pystone.py            67767.06 ->  105681.44 :  +37914.38 = +55.948% (+/-1.81%)
misc_raytrace.py           17480.07 ->   23350.73 :   +5870.66 = +33.585% (+/-2.32%)

@dpgeorge (Member) commented Sep 12, 2021

I did some benchmarking on a Raspberry Pi 3, running Linux in 64-bit mode. This is a relatively quiet environment for running tests, with at most 2% error. The main two results are:

Baseline (no caching optimisations) -> inline bytecode caching (current master setting for unix port):

diff of scores (higher is better)
N=1000 M=1000        perf-rpi3-base -> perf-rpi3-bccache   diff     diff% (error%)
bm_chaos.py                 3523.25 ->    4041.71 :    +518.46 = +14.715% (+/-1.41%)
bm_fannkuch.py                15.65 ->      15.86 :      +0.21 =  +1.342% (+/-0.93%)
bm_fft.py                  27219.17 ->   28516.10 :   +1296.93 =  +4.765% (+/-1.63%)
bm_float.py                53617.62 ->   66355.25 :  +12737.63 = +23.756% (+/-1.80%)
bm_hexiom.py                 149.13 ->     156.60 :      +7.47 =  +5.009% (+/-1.43%)
bm_nqueens.py              56551.95 ->   57100.81 :    +548.86 =  +0.971% (+/-0.66%)
bm_pidigits.py              2527.64 ->    2653.38 :    +125.74 =  +4.975% (+/-0.49%)
misc_aes.py                 3944.87 ->    4612.45 :    +667.58 = +16.923% (+/-0.44%)
misc_mandel.py             34184.39 ->   31959.46 :   -2224.93 =  -6.509% (+/-1.03%)
misc_pystone.py            21559.29 ->   26260.99 :   +4701.70 = +21.808% (+/-0.76%)
misc_raytrace.py            5472.38 ->    6515.27 :   +1042.89 = +19.057% (+/-1.82%)

Baseline -> map-caching (this PR) + load-attr opt (#7688):

diff of scores (higher is better)
N=1000 M=1000        perf-rpi3-base -> perf-rpi3-mapattr   diff     diff% (error%)
bm_chaos.py                 3523.25 ->    4254.51 :    +731.26 = +20.755% (+/-1.32%)
bm_fannkuch.py                15.65 ->      16.55 :      +0.90 =  +5.751% (+/-1.08%)
bm_fft.py                  27219.17 ->   28609.96 :   +1390.79 =  +5.110% (+/-1.58%)
bm_float.py                53617.62 ->   63446.47 :   +9828.85 = +18.331% (+/-1.66%)
bm_hexiom.py                 149.13 ->     188.85 :     +39.72 = +26.634% (+/-1.66%)
bm_nqueens.py              56551.95 ->   60637.03 :   +4085.08 =  +7.224% (+/-0.60%)
bm_pidigits.py              2527.64 ->    2656.99 :    +129.35 =  +5.117% (+/-0.45%)
misc_aes.py                 3944.87 ->    4512.75 :    +567.88 = +14.395% (+/-0.47%)
misc_mandel.py             34184.39 ->   39288.92 :   +5104.53 = +14.932% (+/-0.92%)
misc_pystone.py            21559.29 ->   25539.84 :   +3980.55 = +18.463% (+/-0.70%)
misc_raytrace.py            5472.38 ->    6839.04 :   +1366.66 = +24.974% (+/-0.88%)

Then, the diff of the above, i.e. the change in performance switching from inline bytecode caching to map caching plus the load-attr optimisation:

diff of scores (higher is better)
N=1000 M=1000     perf-rpi3-bccache -> perf-rpi3-mapattr   diff     diff% (error%)
bm_chaos.py                 4041.71 ->    4254.51 :    +212.80 =  +5.265% (+/-1.05%)
bm_fannkuch.py                15.86 ->      16.55 :      +0.69 =  +4.351% (+/-0.81%)
bm_fft.py                  28516.10 ->   28609.96 :     +93.86 =  +0.329% (+/-0.37%)
bm_float.py                66355.25 ->   63446.47 :   -2908.78 =  -4.384% (+/-1.49%)
bm_hexiom.py                 156.60 ->     188.85 :     +32.25 = +20.594% (+/-0.92%)
bm_nqueens.py              57100.81 ->   60637.03 :   +3536.22 =  +6.193% (+/-0.54%)
bm_pidigits.py              2653.38 ->    2656.99 :      +3.61 =  +0.136% (+/-0.33%)
misc_aes.py                 4612.45 ->    4512.75 :     -99.70 =  -2.162% (+/-0.31%)
misc_mandel.py             31959.46 ->   39288.92 :   +7329.46 = +22.934% (+/-1.35%)
misc_pystone.py            26260.99 ->   25539.84 :    -721.15 =  -2.746% (+/-0.71%)
misc_raytrace.py            6515.27 ->    6839.04 :    +323.77 =  +4.969% (+/-1.35%)

py/map.c Outdated
@@ -136,6 +156,17 @@ mp_map_elem_t *mp_map_lookup(mp_map_t *map, mp_obj_t index, mp_map_lookup_kind_t
// If the map is a fixed array then we must only be called for a lookup
assert(!map->is_fixed || lookup_kind == MP_MAP_LOOKUP);

#if MICROPY_OPT_MAP_LOOKUP_CACHE
if (lookup_kind == MP_MAP_LOOKUP && map->alloc) {
Member:

I think this can apply just as well to MP_MAP_LOOKUP_ADD_IF_NOT_FOUND.

Member:

Making it lookup_kind != MP_MAP_LOOKUP_REMOVE_IF_FOUND gives a nice little boost.

Member Author:

Yep, good thinking. On PYBV11:

$ ./run-perfbench.py -s ~/3.cache-map-pybv11 ~/3.cache-map-add-pybv11 
diff of scores (higher is better)
N=100 M=100              /home/jimmo/3.cache-map-pybv11 -> /home/jimmo/3.cache-map-add-pybv11         diff      diff% (error%)
bm_asyncio.py                113.73 ->     113.74 :      +0.01 =  +0.009% (+/-0.01%)
bm_chaos.py                  344.82 ->     343.50 :      -1.32 =  -0.383% (+/-0.00%)
bm_fannkuch.py                78.62 ->      78.62 :      +0.00 =  +0.000% (+/-0.00%)
bm_fft.py                   2593.26 ->    2593.13 :      -0.13 =  -0.005% (+/-0.00%)
bm_float.py                 5460.45 ->    5667.22 :    +206.77 =  +3.787% (+/-0.01%)
bm_hexiom.py                  42.78 ->      42.58 :      -0.20 =  -0.468% (+/-0.00%)
bm_nqueens.py               4467.91 ->    4464.29 :      -3.62 =  -0.081% (+/-0.00%)
bm_pidigits.py               648.37 ->     649.28 :      +0.91 =  +0.140% (+/-0.41%)
misc_aes.py                  426.08 ->     425.96 :      -0.12 =  -0.028% (+/-0.01%)
misc_mandel.py              3621.79 ->    3620.72 :      -1.07 =  -0.030% (+/-0.01%)
misc_pystone.py             2289.22 ->    2368.03 :     +78.81 =  +3.443% (+/-0.01%)
misc_raytrace.py             357.00 ->     354.92 :      -2.08 =  -0.583% (+/-0.01%)

@@ -172,6 +203,7 @@ mp_map_elem_t *mp_map_lookup(mp_map_t *map, mp_obj_t index, mp_map_lookup_kind_t
elem->value = value;
}
#endif
MAP_CACHE_SET(index, elem - map->table);
Member:

Maybe this should be wrapped in else { ... }, because if it was a REMOVE_IF_FOUND then the cache should not be updated (with a now-removed element).

Member Author:

This doesn't seem to help... I guess the relatively uncommon removes are outweighed by the extra branch on the common path.

$ ./run-perfbench.py -s ~/3.cache-map-add-pybv11 ~/3.cache-map-add-else-pybv11 
diff of scores (higher is better)
N=100 M=100              /home/jimmo/3.cache-map-add-pybv11 -> /home/jimmo/3.cache-map-add-else-pybv11         diff      diff% (error%)
bm_asyncio.py                113.74 ->     113.72 :      -0.02 =  -0.018% (+/-0.01%)
bm_chaos.py                  343.50 ->     336.44 :      -7.06 =  -2.055% (+/-0.00%)
bm_fannkuch.py                78.62 ->      78.26 :      -0.36 =  -0.458% (+/-0.01%)
bm_fft.py                   2593.13 ->    2552.65 :     -40.48 =  -1.561% (+/-0.00%)
bm_float.py                 5667.22 ->    5569.22 :     -98.00 =  -1.729% (+/-0.00%)
bm_hexiom.py                  42.58 ->      42.75 :      +0.17 =  +0.399% (+/-0.00%)
bm_nqueens.py               4464.29 ->    4455.36 :      -8.93 =  -0.200% (+/-0.00%)
bm_pidigits.py               649.28 ->     650.14 :      +0.86 =  +0.132% (+/-0.37%)
misc_aes.py                  425.96 ->     420.22 :      -5.74 =  -1.348% (+/-0.01%)
misc_mandel.py              3620.72 ->    3562.43 :     -58.29 =  -1.610% (+/-0.00%)
misc_pystone.py             2368.03 ->    2342.79 :     -25.24 =  -1.066% (+/-0.01%)
misc_raytrace.py             354.92 ->     351.44 :      -3.48 =  -0.981% (+/-0.01%)

jimmo added a commit to jimmo/micropython that referenced this pull request Sep 13, 2021
This feature originally provided a significant performance boost for Unix,
but wasn't able to be enabled for MCU targets, and added significant extra
complexity to generating .mpy files. The equivalent performance gain is
now provided by MICROPY_OPT_LOAD_ATTR_FAST_PATH and
MICROPY_OPT_MAP_LOOKUP_CACHE.

See micropython#7680 for discussion.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
@jimmo (Member Author) commented Sep 13, 2021

@dpgeorge I have cherry-picked #7688 into this PR and re-arranged the order.

The final diff on pybv11 (compared to master):

$ ./run-perfbench.py -s ~/3.cache-baseline-pybv11 ~/3.cache-final-pybv11 
diff of scores (higher is better)
N=100 M=100              /home/jimmo/3.cache-baseline-pybv11 -> /home/jimmo/3.cache-final-pybv11         diff      diff% (error%)
bm_asyncio.py                111.79 ->     113.30 :      +1.51 =  +1.351% (+/-0.01%)
bm_chaos.py                  320.05 ->     361.49 :     +41.44 = +12.948% (+/-0.02%)
bm_fannkuch.py                78.96 ->      78.19 :      -0.77 =  -0.975% (+/-0.01%)
bm_fft.py                   2545.01 ->    2571.63 :     +26.62 =  +1.046% (+/-0.00%)
bm_float.py                 5088.98 ->    5990.95 :    +901.97 = +17.724% (+/-0.01%)
bm_hexiom.py                  35.64 ->      43.92 :      +8.28 = +23.232% (+/-0.01%)
bm_nqueens.py               4194.71 ->    4439.69 :    +244.98 =  +5.840% (+/-0.00%)
bm_pidigits.py               649.25 ->     646.32 :      -2.93 =  -0.451% (+/-0.07%)
misc_aes.py                  384.51 ->     422.65 :     +38.14 =  +9.919% (+/-0.01%)
misc_mandel.py              3112.82 ->    3559.87 :    +447.05 = +14.362% (+/-0.01%)
misc_pystone.py             1968.33 ->    2400.13 :    +431.80 = +21.937% (+/-0.02%)
misc_raytrace.py             318.19 ->     381.46 :     +63.27 = +19.884% (+/-0.01%)

And on unix (also compared to master, so the baseline includes bytecode caching)

$ ./run-perfbench.py -s ~/2.cache-unix-baseline ~/2.cache-unix-final 
diff of scores (higher is better)
N=2000 M=2000            /home/jimmo/2.cache-unix-baseline -> /home/jimmo/2.cache-unix-final         diff      diff% (error%)
bm_chaos.py                13578.84 ->   13771.01 :    +192.17 =  +1.415% (+/-4.33%)
bm_fannkuch.py                60.17 ->      60.72 :      +0.55 =  +0.914% (+/-2.23%)
bm_fft.py                 113053.96 ->  110575.17 :   -2478.79 =  -2.193% (+/-3.82%)
bm_float.py               253351.63 ->  246882.33 :   -6469.30 =  -2.553% (+/-2.64%)
bm_hexiom.py                 522.28 ->     624.67 :    +102.39 = +19.604% (+/-0.97%)
bm_nqueens.py             199638.88 ->  215727.17 :  +16088.29 =  +8.059% (+/-2.99%)
bm_pidigits.py              8067.60 ->    8002.27 :     -65.33 =  -0.810% (+/-4.04%)
misc_aes.py                17193.13 ->   16372.50 :    -820.63 =  -4.773% (+/-1.82%)
misc_mandel.py             98764.31 ->  130358.21 :  +31593.90 = +31.989% (+/-2.50%)
misc_pystone.py            83903.00 ->   84380.51 :    +477.51 =  +0.569% (+/-1.54%)
misc_raytrace.py           21399.20 ->   21855.17 :    +455.97 =  +2.131% (+/-2.76%)

@jimmo (Member Author) commented Sep 13, 2021

Forgot to add -- this PR is +104 bytes on PYBV11.

py/map.c Outdated
// objects' locals dicts), and computation of the hash (and potentially some
// linear probing) in the case of a regular map. Note the same cache is
// shared across all maps.
uint8_t map_lookup_cache[MICROPY_OPT_MAP_LOOKUP_CACHE_SIZE];
Member:

IMO this should go in py/mpstate.h, in mp_state_vm_t, outside the root-pointer section of that struct. Reasoning:

  1. keeps all state localised to that file
  2. if there were multiple VM instances in the one address space, then I'd argue that each VM should have its own table because they run totally independent code

It shouldn't impact code size or runtime (it should compile to the same machine instructions...).

BTW, would there be any issue with GIL-less multithreading (eg rp2) accessing this table? My guess is no.
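A minimal sketch of that placement, assuming the MICROPY_OPT_MAP_LOOKUP_CACHE_SIZE config name already used in this PR (surrounding fields elided, section comment illustrative):

// py/mpstate.h
typedef struct _mp_state_vm_t {
    // ... root pointers and other GC-scanned VM state ...

    // END ROOT POINTER SECTION

    #if MICROPY_OPT_MAP_LOOKUP_CACHE
    // Not a root pointer: the table holds small indices, never mp_obj_t values.
    uint8_t map_lookup_cache[MICROPY_OPT_MAP_LOOKUP_CACHE_SIZE];
    #endif
} mp_state_vm_t;

Lookups then go through MP_STATE_VM(map_lookup_cache)[...], which should compile to the same instructions as a file-static array while keeping each VM instance's cache independent.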

Member Author:

Done -- branch updated.

Interestingly, this gained a small performance improvement; possibly the new location is friendlier to the data cache?

$ ./run-perfbench.py -s ~/lookup-global-pybv11 ~/lookup-state-pybv11
diff of scores (higher is better)
N=100 M=100              /home/jimmo/lookup-global-pybv11 -> /home/jimmo/lookup-state-pybv11         diff      diff% (error%)
bm_chaos.py                  361.52 ->     372.92 :     +11.40 =  +3.153% (+/-0.00%)
bm_fannkuch.py                78.19 ->      78.97 :      +0.78 =  +0.998% (+/-0.01%)
bm_fft.py                   2571.66 ->    2569.91 :      -1.75 =  -0.068% (+/-0.00%)
bm_float.py                 5991.20 ->    6168.25 :    +177.05 =  +2.955% (+/-0.01%)
bm_hexiom.py                  43.92 ->      44.22 :      +0.30 =  +0.683% (+/-0.00%)
bm_nqueens.py               4439.64 ->    4444.68 :      +5.04 =  +0.114% (+/-0.00%)
bm_pidigits.py               647.20 ->     651.21 :      +4.01 =  +0.620% (+/-0.38%)
misc_aes.py                  422.65 ->     425.99 :      +3.34 =  +0.790% (+/-0.01%)
misc_mandel.py              3559.81 ->    3598.03 :     +38.22 =  +1.074% (+/-0.01%)
misc_pystone.py             2400.13 ->    2429.40 :     +29.27 =  +1.220% (+/-0.01%)
misc_raytrace.py             381.46 ->     391.24 :      +9.78 =  +2.564% (+/-0.01%)

This is nice because the net effect of this PR is now positive on all tests on PYBV11.

$ ./run-perfbench.py -s ~/3.cache-baseline-pybv11 ~/lookup-state-pybv11
diff of scores (higher is better)
N=100 M=100              /home/jimmo/3.cache-baseline-pybv11 -> /home/jimmo/lookup-state-pybv11         diff      diff% (error%)
bm_chaos.py                  320.05 ->     372.92 :     +52.87 = +16.519% (+/-0.02%)
bm_fannkuch.py                78.96 ->      78.97 :      +0.01 =  +0.013% (+/-0.01%)
bm_fft.py                   2545.01 ->    2569.91 :     +24.90 =  +0.978% (+/-0.00%)
bm_float.py                 5088.98 ->    6168.25 :   +1079.27 = +21.208% (+/-0.01%)
bm_hexiom.py                  35.64 ->      44.22 :      +8.58 = +24.074% (+/-0.00%)
bm_nqueens.py               4194.71 ->    4444.68 :    +249.97 =  +5.959% (+/-0.00%)
bm_pidigits.py               649.25 ->     651.21 :      +1.96 =  +0.302% (+/-0.31%)
misc_aes.py                  384.51 ->     425.99 :     +41.48 = +10.788% (+/-0.01%)
misc_mandel.py              3112.82 ->    3598.03 :    +485.21 = +15.587% (+/-0.01%)
misc_pystone.py             1968.33 ->    2429.40 :    +461.07 = +23.424% (+/-0.01%)
misc_raytrace.py             318.19 ->     391.24 :     +73.05 = +22.958% (+/-0.01%)

And overall a pretty good result compared to bytecode caching

$ ./run-perfbench.py -s ~/cache-bytecode-pybv11 ~/lookup-state-pybv11
diff of scores (higher is better)
N=100 M=100              /home/jimmo/cache-bytecode-pybv11 -> /home/jimmo/lookup-state-pybv11         diff      diff% (error%)
bm_chaos.py                  358.90 ->     372.92 :     +14.02 =  +3.906% (+/-0.01%)
bm_fannkuch.py                78.48 ->      78.97 :      +0.49 =  +0.624% (+/-0.01%)
bm_fft.py                   2576.19 ->    2569.91 :      -6.28 =  -0.244% (+/-0.00%)
bm_float.py                 6296.83 ->    6168.25 :    -128.58 =  -2.042% (+/-0.01%)
bm_hexiom.py                  34.45 ->      44.22 :      +9.77 = +28.360% (+/-0.00%)
bm_nqueens.py               4069.66 ->    4444.68 :    +375.02 =  +9.215% (+/-0.00%)
bm_pidigits.py               652.31 ->     651.21 :      -1.10 =  -0.169% (+/-0.36%)
misc_aes.py                  453.34 ->     425.99 :     -27.35 =  -6.033% (+/-0.01%)
misc_mandel.py              2960.68 ->    3598.03 :    +637.35 = +21.527% (+/-0.01%)
misc_pystone.py             2513.17 ->    2429.40 :     -83.77 =  -3.333% (+/-0.01%)
misc_raytrace.py             372.69 ->     391.24 :     +18.55 =  +4.977% (+/-0.01%)

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
This is an alternative to the bytecode caching (which is incompatible
with bytecode-in-ROM).

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
Enabled for all variants except minimal.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
This feature originally provided a significant performance boost for Unix,
but wasn't able to be enabled for MCU targets, and added significant extra
complexity to generating .mpy files. The equivalent performance gain is
now provided by MICROPY_OPT_LOAD_ATTR_FAST_PATH and
MICROPY_OPT_MAP_LOOKUP_CACHE.

See micropython#7680 for discussion.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
dpgeorge pushed a commit that referenced this pull request Sep 16, 2021
This commit removes all parts of code associated with the existing
MICROPY_OPT_CACHE_MAP_LOOKUP_IN_BYTECODE optimisation option, including the
-mcache-lookup-bc option to mpy-cross.

This feature originally provided a significant performance boost for Unix,
but wasn't able to be enabled for MCU targets (due to frozen bytecode), and
added significant extra complexity to generating and distributing .mpy
files.

The equivalent performance gain is now provided by the combination of
MICROPY_OPT_LOAD_ATTR_FAST_PATH and MICROPY_OPT_MAP_LOOKUP_CACHE (which has
been enabled on the unix port in the previous commit).

It's hard to provide precise performance numbers, but tests have been run
on a wide variety of architectures (x86-64, ARM Cortex, Aarch64, RISC-V,
xtensa) and they all generally agree on the qualitative improvements seen
by the combination of MICROPY_OPT_LOAD_ATTR_FAST_PATH and
MICROPY_OPT_MAP_LOOKUP_CACHE.

For example, on a "quiet" Linux x64 environment (i3-5010U @ 2.10GHz) the
change from CACHE_MAP_LOOKUP_IN_BYTECODE, to LOAD_ATTR_FAST_PATH combined
with MAP_LOOKUP_CACHE is:

diff of scores (higher is better)
N=2000 M=2000       bccache -> attrmapcache      diff      diff% (error%)
bm_chaos.py        13742.56 ->   13905.67 :   +163.11 =  +1.187% (+/-3.75%)
bm_fannkuch.py        60.13 ->      61.34 :     +1.21 =  +2.012% (+/-2.11%)
bm_fft.py         113083.20 ->  114793.68 :  +1710.48 =  +1.513% (+/-1.57%)
bm_float.py       256552.80 ->  243908.29 : -12644.51 =  -4.929% (+/-1.90%)
bm_hexiom.py         521.93 ->     625.41 :   +103.48 = +19.826% (+/-0.40%)
bm_nqueens.py     197544.25 ->  217713.12 : +20168.87 = +10.210% (+/-3.01%)
bm_pidigits.py      8072.98 ->    8198.75 :   +125.77 =  +1.558% (+/-3.22%)
misc_aes.py        17283.45 ->   16480.52 :   -802.93 =  -4.646% (+/-0.82%)
misc_mandel.py     99083.99 ->  128939.84 : +29855.85 = +30.132% (+/-5.88%)
misc_pystone.py    83860.10 ->   82592.56 :  -1267.54 =  -1.511% (+/-2.27%)
misc_raytrace.py   21490.40 ->   22227.23 :   +736.83 =  +3.429% (+/-1.88%)

This shows that the new optimisations are at least as good as the existing
inline-bytecode-caching, and are sometimes much better (because the new
ones apply caching to a wider variety of map lookups).

The new optimisations can also benefit code generated by the native
emitter, because they apply to the runtime rather than the generated code.
The improvement for the native emitter when LOAD_ATTR_FAST_PATH and
MAP_LOOKUP_CACHE are enabled is (same Linux environment as above):

diff of scores (higher is better)
N=2000 M=2000        native -> nat-attrmapcache  diff      diff% (error%)
bm_chaos.py        14130.62 ->   15464.68 :  +1334.06 =  +9.441% (+/-7.11%)
bm_fannkuch.py        74.96 ->      76.16 :     +1.20 =  +1.601% (+/-1.80%)
bm_fft.py         166682.99 ->  168221.86 :  +1538.87 =  +0.923% (+/-4.20%)
bm_float.py       233415.23 ->  265524.90 : +32109.67 = +13.756% (+/-2.57%)
bm_hexiom.py         628.59 ->     734.17 :   +105.58 = +16.796% (+/-1.39%)
bm_nqueens.py     225418.44 ->  232926.45 :  +7508.01 =  +3.331% (+/-3.10%)
bm_pidigits.py      6322.00 ->    6379.52 :    +57.52 =  +0.910% (+/-5.62%)
misc_aes.py        20670.10 ->   27223.18 :  +6553.08 = +31.703% (+/-1.56%)
misc_mandel.py    138221.11 ->  152014.01 : +13792.90 =  +9.979% (+/-2.46%)
misc_pystone.py    85032.14 ->  105681.44 : +20649.30 = +24.284% (+/-2.25%)
misc_raytrace.py   19800.01 ->   23350.73 :  +3550.72 = +17.933% (+/-2.79%)

In summary, compared to MICROPY_OPT_CACHE_MAP_LOOKUP_IN_BYTECODE, the new
MICROPY_OPT_LOAD_ATTR_FAST_PATH and MICROPY_OPT_MAP_LOOKUP_CACHE options:
- are simpler;
- take less code size;
- are faster (generally);
- work with code generated by the native emitter;
- can be used on embedded targets with a small and constant RAM overhead;
- allow the same .mpy bytecode to run on all targets.

See #7680 for further discussion.  And see also #7653 for a discussion
about simplifying mpy-cross options.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
@dpgeorge (Member) commented:

Merged in 7b89ad8 through b326edf

Thanks @jimmo for the fantastic work on this, it's a really significant improvement! As I wrote in the commit message:

In summary, compared to MICROPY_OPT_CACHE_MAP_LOOKUP_IN_BYTECODE, the new
MICROPY_OPT_LOAD_ATTR_FAST_PATH and MICROPY_OPT_MAP_LOOKUP_CACHE options:
- are simpler;
- take less code size;
- are faster (generally);
- work with code generated by the native emitter;
- can be used on embedded targets with a small and constant RAM overhead;
- allow the same .mpy bytecode to run on all targets.

@dpgeorge dpgeorge closed this Sep 16, 2021
jepler added a commit to jepler/circuitpython that referenced this pull request Oct 15, 2021
These two optional optimizations take a little code space but improve
bytecode performance:
 * micropython#7680
 * micropython#7688