
py/map.c: Add a cache of bytecode location to map position. #7680

Closed · wants to merge 7 commits

Conversation

@jimmo (Member) commented Aug 18, 2021

Background: on Unix (which only ever executes bytecode from RAM) we have a feature (MICROPY_OPT_CACHE_MAP_LOOKUP_IN_BYTECODE) where the bytecode reserves an extra byte after the load-attr/load-global opcodes; at runtime that byte stores a map index giving the likely location of the field, so the next lookup can start there. This isn't enabled on any other ports because they execute bytecode from ROM (as well as RAM), and we can't have a different bytecode format for the two cases.

See also #7653 (comment) (reduce the number of options for mpy-cross).

This is a different approach: it adds a global cache, keyed by the bytecode instruction pointer, that maps to an offset within a map, and all mp_map_lookup operations then consult it. It's much less precise than the bytecode caching, but it allows the cache to be external to the bytecode.
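To make the mechanism concrete, here's a minimal sketch of the lookup fast path in py/map.c, assuming a 128-entry byte table (map_lookup_cache and MP_STATE_VM(lookup_hint) are names from this PR; the helper function, the hash shift and the elided fallback are illustrative, not the actual diff):

#define MAP_LOOKUP_CACHE_SIZE (128)
static uint8_t map_lookup_cache[MAP_LOOKUP_CACHE_SIZE]; // likely position within a map's table

static mp_map_elem_t *map_lookup_with_hint(mp_map_t *map, mp_obj_t index) {
    // The VM stores the current bytecode instruction pointer in MP_STATE_VM(lookup_hint),
    // so the slot effectively identifies "the lookup made by this instruction".
    size_t slot = ((uintptr_t)MP_STATE_VM(lookup_hint) >> 2) % MAP_LOOKUP_CACHE_SIZE;
    size_t pos = map_lookup_cache[slot];
    if (pos < map->alloc && map->table[pos].key == index) {
        return &map->table[pos]; // hit: skip hashing and linear probing entirely
    }
    mp_map_elem_t *elem = NULL; // ... fall back to the normal hash/linear lookup ...
    if (elem != NULL) {
        map_lookup_cache[slot] = (uint8_t)(elem - map->table); // remember where it was found
    }
    return elem;
}

A stale or colliding slot only costs one failed comparison, because the hinted element is always verified against the key before it is trusted.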

TODO / discuss:

  • I'm sure there's a bunch of tuning that can be done, and collecting some stats would be useful (e.g. do we see a lot of collisions due to the power-of-two cache size rather than a prime-sized one?).
  • This doesn't work for native code. Need a way to set MP_STATE_VM(lookup_hint) in the emitted code.
  • Thread safety? (Might actually be fine)

On PYBV11 it's +156 bytes (+128 bytes RAM). Performance tests (in score mode) for PYBV11 and PYBD_SF6 below:

$ ./run-perfbench.py -s ~/cache-baseline ~/cache-test             (PYBV11)
diff of scores (higher is better)
N=100 M=100              /home/jimmo/cache-baseline -> /home/jimmo/cache-test         diff      diff% (error%)
bm_chaos.py                  313.53 ->     335.39 :     +21.86 =  +6.972% (+/-0.01%)
bm_fannkuch.py                78.17 ->      77.31 :      -0.86 =  -1.100% (+/-0.01%)
bm_fft.py                   2551.16 ->    2536.08 :     -15.08 =  -0.591% (+/-0.00%)
bm_float.py                 4930.98 ->    5258.34 :    +327.36 =  +6.639% (+/-0.02%)
bm_hexiom.py                  35.62 ->      42.22 :      +6.60 = +18.529% (+/-0.01%)
bm_nqueens.py               4219.69 ->    4416.12 :    +196.43 =  +4.655% (+/-0.01%)
bm_pidigits.py               647.28 ->     650.16 :      +2.88 =  +0.445% (+/-0.13%)
misc_aes.py                  384.48 ->     409.02 :     +24.54 =  +6.383% (+/-0.02%)
misc_mandel.py              3121.44 ->    3532.50 :    +411.06 = +13.169% (+/-0.00%)
misc_pystone.py             1954.67 ->    2154.75 :    +200.08 = +10.236% (+/-0.01%)
misc_raytrace.py             314.00 ->     338.71 :     +24.71 =  +7.869% (+/-0.01%)

$ ./run-perfbench.py -s ~/cache-baseline-sf6 ~/cache-test-sf6        (PYBD_SF6)
diff of scores (higher is better)
N=100 M=100              /home/jimmo/cache-baseline-sf6 -> /home/jimmo/cache-test-sf6         diff      diff% (error%)
bm_chaos.py                  382.55 ->     398.78 :     +16.23 =  +4.243% (+/-0.03%)
bm_fannkuch.py               101.40 ->      97.95 :      -3.45 =  -3.402% (+/-0.04%)
bm_fft.py                   1815.18 ->    1760.44 :     -54.74 =  -3.016% (+/-0.02%)
bm_float.py                 5939.99 ->    6470.11 :    +530.12 =  +8.925% (+/-0.05%)
bm_hexiom.py                  49.78 ->      53.66 :      +3.88 =  +7.794% (+/-0.02%)
bm_nqueens.py               5706.74 ->    5732.61 :     +25.87 =  +0.453% (+/-0.02%)
bm_pidigits.py               959.77 ->     958.86 :      -0.91 =  -0.095% (+/-0.03%)
misc_aes.py                  530.32 ->     566.37 :     +36.05 =  +6.798% (+/-0.06%)
misc_mandel.py              4037.38 ->    4197.41 :    +160.03 =  +3.964% (+/-0.17%)
misc_pystone.py             2485.09 ->    2646.59 :    +161.50 =  +6.499% (+/-0.05%)
misc_raytrace.py             401.04 ->     431.79 :     +30.75 =  +7.668% (+/-0.04%)

@jimmo (Member Author) commented Aug 18, 2021

For comparison, here's how the existing MICROPY_OPT_CACHE_MAP_LOOKUP_IN_BYTECODE option performs on PYBV11 and PYBD_SF6:

$ ./run-perfbench.py -s ~/cache-baseline ~/cache-bytecode-pybv11 
diff of scores (higher is better)
N=100 M=100              /home/jimmo/cache-baseline -> /home/jimmo/cache-bytecode-pybv11         diff      diff% (error%)
bm_chaos.py                  313.53 ->     358.90 :     +45.37 = +14.471% (+/-0.01%)
bm_fannkuch.py                78.17 ->      78.48 :      +0.31 =  +0.397% (+/-0.01%)
bm_fft.py                   2551.16 ->    2576.19 :     +25.03 =  +0.981% (+/-0.00%)
bm_float.py                 4930.98 ->    6296.83 :   +1365.85 = +27.699% (+/-0.02%)
bm_hexiom.py                  35.62 ->      34.45 :      -1.17 =  -3.285% (+/-0.00%)
bm_nqueens.py               4219.69 ->    4069.66 :    -150.03 =  -3.555% (+/-0.00%)
bm_pidigits.py               647.28 ->     652.31 :      +5.03 =  +0.777% (+/-0.21%)
misc_aes.py                  384.48 ->     453.34 :     +68.86 = +17.910% (+/-0.01%)
misc_mandel.py              3121.44 ->    2960.68 :    -160.76 =  -5.150% (+/-0.01%)
misc_pystone.py             1954.67 ->    2513.17 :    +558.50 = +28.573% (+/-0.00%)
misc_raytrace.py             314.00 ->     372.69 :     +58.69 = +18.691% (+/-0.00%)
$ ./run-perfbench.py -s ~/cache-baseline-sf6 ~/cache-bytecode-sf6 
diff of scores (higher is better)
N=100 M=100              /home/jimmo/cache-baseline-sf6 -> /home/jimmo/cache-bytecode-sf6         diff      diff% (error%)
bm_chaos.py                  382.55 ->     453.09 :     +70.54 = +18.439% (+/-0.03%)
bm_fannkuch.py               101.40 ->     101.85 :      +0.45 =  +0.444% (+/-0.04%)
bm_fft.py                   1815.18 ->    1800.96 :     -14.22 =  -0.783% (+/-0.02%)
bm_float.py                 5939.99 ->    8103.30 :   +2163.31 = +36.419% (+/-0.05%)
bm_hexiom.py                  49.78 ->      49.53 :      -0.25 =  -0.502% (+/-0.02%)
bm_nqueens.py               5706.74 ->    5439.30 :    -267.44 =  -4.686% (+/-0.02%)
bm_pidigits.py               959.77 ->     969.55 :      +9.78 =  +1.019% (+/-0.06%)
misc_aes.py                  530.32 ->     627.16 :     +96.84 = +18.261% (+/-0.06%)
misc_mandel.py              4037.38 ->    3722.01 :    -315.37 =  -7.811% (+/-0.17%)
misc_pystone.py             2485.09 ->    3317.66 :    +832.57 = +33.503% (+/-0.05%)
misc_raytrace.py             401.04 ->     507.98 :    +106.94 = +26.666% (+/-0.03%)

@jimmo (Member Author) commented Aug 18, 2021

I've updated the branch. It's now +176 bytes, but marginally faster.

bm_float and bm_fannkuch see no difference because they do zero map lookups (!!!) during the main loop, so the cost of updating lookup_hint can't be recouped. Possibly consider only updating lookup_hint for certain ops. misc_pystone sees a lot of cache collisions (which is why bytecode caching does so well on it).

By optimising for code size instead, you can change the table to 128 entries (instead of 127) and save about 28 bytes, and put MAP_CACHE_SET into a function to save about 32 bytes. This gives you about 80% of the performance gain.
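Presumably the ~28 bytes come from how the slot is computed: with a power-of-two table the modulo folds into a bit-mask, while a prime-sized table needs a real division but spreads colliding keys better. A sketch with assumed macro names, illustrative only:

#define MAP_CACHE_SLOT_POW2(index)  ((((uintptr_t)(index)) >> 2) & (128 - 1)) // 128 entries: AND, smaller code
#define MAP_CACHE_SLOT_PRIME(index) ((((uintptr_t)(index)) >> 2) % 127)       // 127 entries: division, fewer collisions

Similarly, making MAP_CACHE_SET an out-of-line function keeps one copy of the update code instead of a copy inlined at every call site, which is presumably where the other ~32 bytes come from.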

$ ./run-perfbench.py -s ~/cache-baseline ~/cache-test-pybv11-127-inline              (PYBV11)
diff of scores (higher is better)
N=100 M=100              /home/jimmo/cache-baseline -> /home/jimmo/cache-test-pybv11-127-inline         diff      diff% (error%)
bm_chaos.py                  313.53 ->     342.64 :     +29.11 =  +9.285% (+/-0.02%)
bm_fannkuch.py                78.17 ->      78.01 :      -0.16 =  -0.205% (+/-0.01%)
bm_fft.py                   2551.16 ->    2548.89 :      -2.27 =  -0.089% (+/-0.00%)
bm_float.py                 4930.98 ->    5380.83 :    +449.85 =  +9.123% (+/-0.02%)
bm_hexiom.py                  35.62 ->      42.11 :      +6.49 = +18.220% (+/-0.00%)
bm_nqueens.py               4219.69 ->    4426.41 :    +206.72 =  +4.899% (+/-0.01%)
bm_pidigits.py               647.28 ->     650.82 :      +3.54 =  +0.547% (+/-0.12%)
misc_aes.py                  384.48 ->     415.54 :     +31.06 =  +8.078% (+/-0.02%)
misc_mandel.py              3121.44 ->    3544.05 :    +422.61 = +13.539% (+/-0.01%)
misc_pystone.py             1954.67 ->    2086.51 :    +131.84 =  +6.745% (+/-0.01%)
misc_raytrace.py             314.00 ->     344.56 :     +30.56 =  +9.732% (+/-0.01%)
$ ./run-perfbench.py -s ~/cache-baseline-sf6 ~/cache-test-sf6-127-inline             (SF6)
diff of scores (higher is better)
N=100 M=100              /home/jimmo/cache-baseline-sf6 -> /home/jimmo/cache-test-sf6-127-inline         diff      diff% (error%)
bm_chaos.py                  382.55 ->     398.09 :     +15.54 =  +4.062% (+/-0.03%)
bm_fannkuch.py               101.40 ->      98.12 :      -3.28 =  -3.235% (+/-0.04%)
bm_fft.py                   1815.18 ->    1764.46 :     -50.72 =  -2.794% (+/-0.01%)
bm_float.py                 5939.99 ->    6428.43 :    +488.44 =  +8.223% (+/-0.07%)
bm_hexiom.py                  49.78 ->      53.75 :      +3.97 =  +7.975% (+/-0.03%)
bm_nqueens.py               5706.74 ->    5762.30 :     +55.56 =  +0.974% (+/-0.02%)
bm_pidigits.py               959.77 ->     961.42 :      +1.65 =  +0.172% (+/-0.02%)
misc_aes.py                  530.32 ->     565.72 :     +35.40 =  +6.675% (+/-0.07%)
misc_mandel.py              4037.38 ->    4198.66 :    +161.28 =  +3.995% (+/-0.17%)
misc_pystone.py             2485.09 ->    2545.54 :     +60.45 =  +2.433% (+/-0.05%)
misc_raytrace.py             401.04 ->     433.08 :     +32.04 =  +7.989% (+/-0.03%)

@jimmo (Member Author) commented Aug 18, 2021

Possibly consider only updating lookup_hint for certain ops.

This definitely helps. Now +124 bytes on PYBV11.

$ ./run-perfbench.py -s ~/cache-baseline ~/cache-vmsel-pybv11-ro
diff of scores (higher is better)
N=100 M=100              /home/jimmo/cache-baseline -> /home/jimmo/cache-vmsel-pybv11-ro         diff      diff% (error%)
bm_chaos.py                  313.53 ->     339.58 :     +26.05 =  +8.309% (+/-0.01%)
bm_fannkuch.py                78.17 ->      78.18 :      +0.01 =  +0.013% (+/-0.01%)
bm_fft.py                   2551.16 ->    2561.48 :     +10.32 =  +0.405% (+/-0.00%)
bm_float.py                 4930.98 ->    5205.65 :    +274.67 =  +5.570% (+/-0.02%)
bm_hexiom.py                  35.62 ->      41.94 :      +6.32 = +17.743% (+/-0.01%)
bm_nqueens.py               4219.69 ->    4432.77 :    +213.08 =  +5.050% (+/-0.01%)
bm_pidigits.py               647.28 ->     652.33 :      +5.05 =  +0.780% (+/-0.18%)
misc_aes.py                  384.48 ->     417.87 :     +33.39 =  +8.684% (+/-0.02%)
misc_mandel.py              3121.44 ->    3560.35 :    +438.91 = +14.061% (+/-0.01%)
misc_pystone.py             1954.67 ->    2118.40 :    +163.73 =  +8.376% (+/-0.01%)
misc_raytrace.py             314.00 ->     337.16 :     +23.16 =  +7.376% (+/-0.01%)

@jimmo (Member Author) commented Aug 18, 2021

Just as a random side quest, something I've been meaning to investigate for a while is the computed goto. On PYBV11, disabling it saves 1276 bytes.
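For context, MICROPY_OPT_COMPUTED_GOTO makes the VM dispatch each opcode by jumping through a table of label addresses (a GCC/Clang extension) instead of going back around a switch; the dispatch sequence gets duplicated into every handler, which is largely why disabling it shrinks the binary. A toy illustration of the two dispatch styles (not MicroPython's actual vm.c; USE_COMPUTED_GOTO is just a local toggle for this example):

#include <stdint.h>
#include <stdio.h>

enum { OP_INC, OP_DEC, OP_HALT };

static int run(const uint8_t *ip) {
    int acc = 0;
    #if defined(__GNUC__) && USE_COMPUTED_GOTO
    // Computed-goto dispatch: each handler ends by jumping straight to the next one.
    static const void *entry[] = { &&op_inc, &&op_dec, &&op_halt };
    #define DISPATCH() goto *entry[*ip++]
    DISPATCH();
    op_inc: acc++; DISPATCH();
    op_dec: acc--; DISPATCH();
    op_halt: return acc;
    #else
    // Switch dispatch: one shared dispatch point, smaller code but usually a bit slower.
    for (;;) {
        switch (*ip++) {
            case OP_INC: acc++; break;
            case OP_DEC: acc--; break;
            default: return acc; // OP_HALT
        }
    }
    #endif
}

int main(void) {
    const uint8_t prog[] = { OP_INC, OP_INC, OP_DEC, OP_HALT };
    printf("%d\n", run(prog)); // prints 1
    return 0;
}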

The effect on performance varies a bit:

$ ./run-perfbench.py -s ~/cache-baseline ~/cache-nogoto-pybv11 
diff of scores (higher is better)
N=100 M=100              /home/jimmo/cache-baseline -> /home/jimmo/cache-nogoto-pybv11         diff      diff% (error%)
bm_chaos.py                  313.53 ->     310.23 :      -3.30 =  -1.053% (+/-0.01%)
bm_fannkuch.py                78.17 ->      77.89 :      -0.28 =  -0.358% (+/-0.01%)
bm_fft.py                   2551.16 ->    2501.66 :     -49.50 =  -1.940% (+/-0.00%)
bm_float.py                 4930.98 ->    4906.96 :     -24.02 =  -0.487% (+/-0.01%)
bm_hexiom.py                  35.62 ->      35.63 :      +0.01 =  +0.028% (+/-0.00%)
bm_nqueens.py               4219.69 ->    4241.08 :     +21.39 =  +0.507% (+/-0.00%)
bm_pidigits.py               647.28 ->     643.02 :      -4.26 =  -0.658% (+/-0.19%)
misc_aes.py                  384.48 ->     383.75 :      -0.73 =  -0.190% (+/-0.01%)
misc_mandel.py              3121.44 ->    3047.03 :     -74.41 =  -2.384% (+/-0.01%)
misc_pystone.py             1954.67 ->    1952.34 :      -2.33 =  -0.119% (+/-0.01%)
misc_raytrace.py             314.00 ->     310.67 :      -3.33 =  -1.061% (+/-0.00%)

Combined with this PR (i.e. map caching without computed goto), it's a net -1156 bytes for an overall performance benefit (except for fft).

$ ./run-perfbench.py -s ~/cache-baseline ~/cache-vmsel-pybv11-ro-nogoto 
diff of scores (higher is better)
N=100 M=100              /home/jimmo/cache-baseline -> /home/jimmo/cache-vmsel-pybv11-ro-nogoto         diff      diff% (error%)
bm_chaos.py                  313.53 ->     337.03 :     +23.50 =  +7.495% (+/-0.01%)
bm_fannkuch.py                78.17 ->      78.29 :      +0.12 =  +0.154% (+/-0.01%)
bm_fft.py                   2551.16 ->    2513.10 :     -38.06 =  -1.492% (+/-0.00%)
bm_float.py                 4930.98 ->    5154.90 :    +223.92 =  +4.541% (+/-0.02%)
bm_hexiom.py                  35.62 ->      41.99 :      +6.37 = +17.883% (+/-0.01%)
bm_nqueens.py               4219.69 ->    4468.64 :    +248.95 =  +5.900% (+/-0.01%)
bm_pidigits.py               647.28 ->     650.73 :      +3.45 =  +0.533% (+/-0.10%)
misc_aes.py                  384.48 ->     413.51 :     +29.03 =  +7.550% (+/-0.02%)
misc_mandel.py              3121.44 ->    3496.68 :    +375.24 = +12.021% (+/-0.00%)
misc_pystone.py             1954.67 ->    2106.96 :    +152.29 =  +7.791% (+/-0.01%)
misc_raytrace.py             314.00 ->     335.69 :     +21.69 =  +6.908% (+/-0.01%)

@jimmo (Member Author) commented Aug 19, 2021

Based on a suggestion from @dpgeorge I have changed this to use (map pointer, key) as the cache index, and this is a significant win for both code size (now only +48 bytes) and performance.
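The revised macros look roughly like this (MAP_CACHE_OFFSET, MAP_CACHE_SET and map_lookup_cache appear in the diff hunks below; MAP_CACHE_GET here is an assumed companion accessor, and the exact hash input — the key alone vs. the map pointer plus the key — was itself benchmarked later in this thread, so treat the slot computation as illustrative):

#define MAP_CACHE_OFFSET(index)   ((((uintptr_t)(index)) >> 2) % sizeof(map_lookup_cache))
#define MAP_CACHE_GET(index)      (map_lookup_cache[MAP_CACHE_OFFSET(index)])
#define MAP_CACHE_SET(index, pos) (map_lookup_cache[MAP_CACHE_OFFSET(index)] = (uint8_t)(pos))

Because the hint now lives entirely inside mp_map_lookup (bounds-checked against map->alloc and verified against the key before use, and refreshed via MAP_CACHE_SET after a successful full lookup), it no longer depends on the VM setting a per-instruction hint.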

$ ./run-perfbench.py -s ~/cache-baseline ~/cache-mapindex-pybv11-128-shift
diff of scores (higher is better)
N=100 M=100              /home/jimmo/cache-baseline -> /home/jimmo/cache-mapindex-pybv11-128-shift         diff      diff% (error%)
bm_chaos.py                  313.53 ->     350.94 :     +37.41 = +11.932% (+/-0.01%)
bm_fannkuch.py                78.17 ->      78.49 :      +0.32 =  +0.409% (+/-0.01%)
bm_fft.py                   2551.16 ->    2559.18 :      +8.02 =  +0.314% (+/-0.00%)
bm_float.py                 4930.98 ->    5507.68 :    +576.70 = +11.695% (+/-0.02%)
bm_hexiom.py                  35.62 ->      42.98 :      +7.36 = +20.663% (+/-0.00%)
bm_nqueens.py               4219.69 ->    4434.57 :    +214.88 =  +5.092% (+/-0.00%)
bm_pidigits.py               647.28 ->     651.15 :      +3.87 =  +0.598% (+/-0.13%)
misc_aes.py                  384.48 ->     426.28 :     +41.80 = +10.872% (+/-0.01%)
misc_mandel.py              3121.44 ->    3614.86 :    +493.42 = +15.807% (+/-0.00%)
misc_pystone.py             1954.67 ->    2288.38 :    +333.71 = +17.072% (+/-0.01%)
misc_raytrace.py             314.00 ->     358.68 :     +44.68 = +14.229% (+/-0.01%)

It also works with the native emitter, adding a similar perf boost on top of the gain the native emitter already provides.

$ ./run-perfbench.py -s ~/cache-baseline-native-pybv11 ~/cache-mapindex-native-pybv11-128-shift 
diff of scores (higher is better)
N=100 M=100              /home/jimmo/cache-baseline-native-pybv11 -> /home/jimmo/cache-mapindex-native-pybv11-128-shift         diff      diff% (error%)
bm_asyncio.py                107.39 ->     109.17 :      +1.78 =  +1.658% (+/-0.01%)
bm_chaos.py                  378.90 ->     420.88 :     +41.98 = +11.079% (+/-0.01%)
bm_fannkuch.py               103.20 ->     102.18 :      -1.02 =  -0.988% (+/-0.01%)
bm_fft.py                   3589.92 ->    3612.56 :     +22.64 =  +0.631% (+/-0.02%)
bm_float.py                 5895.84 ->    6564.72 :    +668.88 = +11.345% (+/-0.00%)
bm_hexiom.py                  41.64 ->      51.66 :     +10.02 = +24.063% (+/-0.00%)
bm_nqueens.py               5043.33 ->    5354.31 :    +310.98 =  +6.166% (+/-0.01%)
bm_pidigits.py               708.65 ->     704.44 :      -4.21 =  -0.594% (+/-0.30%)
misc_mandel.py              3770.39 ->    4535.38 :    +764.99 = +20.289% (+/-0.01%)
misc_pystone.py             2430.32 ->    2913.25 :    +482.93 = +19.871% (+/-0.01%)
misc_raytrace.py             364.62 ->     416.84 :     +52.22 = +14.322% (+/-0.02%)

py/map.c Outdated
@@ -126,6 +126,15 @@ STATIC void mp_map_rehash(mp_map_t *map) {
m_del(mp_map_elem_t, old_table, old_alloc);
}

#if MICROPY_OPT_MAP_CACHING
uint8_t map_lookup_cache[128];
#define MAP_CACHE_OFFSET ((((uintptr_t) + (uintptr_t)index) >> 2) % sizeof(map_lookup_cache))
Member:

I think this first uintptr_t is missing something that it casts, maybe map or map->table?

Member Author:

Originally I had this using (map+index)>>2, but changed it to just use index>>2 because it's overall faster. Accidentally left behind the no-op cast. I've updated the branch, and addressed the other comment about the macro args.

FWIW, here's the diff of adding map back into the expression. It makes some things marginally better, but there's a big hit on pystone and raytrace.

$ ./run-perfbench.py -s ~/2.cache-index-pybv11 ~/2.cache-mapindex-pybv11
diff of scores (higher is better)
N=100 M=100              /home/jimmo/2.cache-index-pybv11 -> /home/jimmo/2.cache-mapindex-pybv11         diff      diff% (error%)
bm_asyncio.py                113.73 ->     113.43 :      -0.30 =  -0.264% (+/-0.01%)
bm_chaos.py                  348.59 ->     340.54 :      -8.05 =  -2.309% (+/-0.01%)
bm_fannkuch.py                78.53 ->      78.20 :      -0.33 =  -0.420% (+/-0.00%)
bm_fft.py                   2563.02 ->    2597.15 :     +34.13 =  +1.332% (+/-0.00%)
bm_float.py                 5544.62 ->    5440.36 :    -104.26 =  -1.880% (+/-0.02%)
bm_hexiom.py                  42.73 ->      42.74 :      +0.01 =  +0.023% (+/-0.00%)
bm_nqueens.py               4429.89 ->    4454.66 :     +24.77 =  +0.559% (+/-0.00%)
bm_pidigits.py               651.39 ->     649.83 :      -1.56 =  -0.239% (+/-0.34%)
misc_aes.py                  423.37 ->     428.64 :      +5.27 =  +1.245% (+/-0.01%)
misc_mandel.py              3610.82 ->    3624.53 :     +13.71 =  +0.380% (+/-0.01%)
misc_pystone.py             2287.95 ->    2185.46 :    -102.49 =  -4.480% (+/-0.01%)
misc_raytrace.py             362.17 ->     341.62 :     -20.55 =  -5.674% (+/-0.01%)

And while we're at it, here's the latest baseline->PR after rebasing (similar to the first table in #7680 (comment))

$ ./run-perfbench.py -s ~/2.cache-baseline-pybv11 ~/2.cache-index-pybv11
diff of scores (higher is better)
N=100 M=100              /home/jimmo/2.cache-baseline-pybv11 -> /home/jimmo/2.cache-index-pybv11         diff      diff% (error%)
bm_asyncio.py                111.34 ->     113.73 :      +2.39 =  +2.147% (+/-0.01%)
bm_chaos.py                  311.32 ->     348.59 :     +37.27 = +11.972% (+/-0.02%)
bm_fannkuch.py                77.82 ->      78.53 :      +0.71 =  +0.912% (+/-0.00%)
bm_fft.py                   2539.05 ->    2563.02 :     +23.97 =  +0.944% (+/-0.00%)
bm_float.py                 4955.56 ->    5544.62 :    +589.06 = +11.887% (+/-0.00%)
bm_hexiom.py                  35.52 ->      42.73 :      +7.21 = +20.298% (+/-0.00%)
bm_nqueens.py               4201.40 ->    4429.89 :    +228.49 =  +5.438% (+/-0.00%)
bm_pidigits.py               646.91 ->     651.39 :      +4.48 =  +0.693% (+/-0.17%)
misc_aes.py                  382.97 ->     423.37 :     +40.40 = +10.549% (+/-0.01%)
misc_mandel.py              3091.74 ->    3610.82 :    +519.08 = +16.789% (+/-0.01%)
misc_pystone.py             1948.00 ->    2287.95 :    +339.95 = +17.451% (+/-0.01%)
misc_raytrace.py             312.45 ->     362.17 :     +49.72 = +15.913% (+/-0.01%)

jimmo added a commit to jimmo/micropython that referenced this pull request Sep 6, 2021
This feature provided a significant performance boost for Unix, but wasn't
able to be enabled for MCU targets, and added significant extra complexity
to generating .mpy files. Some of the performance gain can be achieved now
with MICROPY_OPT_MAP_LOOKUP_CACHE instead.

See micropython#7680 for discussion.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
@jimmo changed the title from "RFC: py/map.c: Add a cache of bytecode location to map position." to "py/map.c: Add a cache of bytecode location to map position." on Sep 6, 2021
@jimmo (Member Author) commented Sep 6, 2021

I've added a commit to remove bytecode caching, though I'd happily make that a separate PR. The main reason is so that CI will run the tests with map caching (which doesn't do anything when bytecode caching is enabled).

@codecov-commenter

Codecov Report

Merging #7680 (6123bbf) into master (bbbdef4) will decrease coverage by 0.01%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master    #7680      +/-   ##
==========================================
- Coverage   98.26%   98.24%   -0.02%     
==========================================
  Files         154      154              
  Lines       20076    20032      -44     
==========================================
- Hits        19728    19681      -47     
- Misses        348      351       +3     
Impacted Files Coverage Δ
py/bc.c 89.13% <ø> (+0.47%) ⬆️
py/emitbc.c 100.00% <ø> (ø)
py/showbc.c 98.51% <ø> (-0.02%) ⬇️
py/vm.c 98.97% <ø> (-0.07%) ⬇️
py/map.c 99.49% <100.00%> (-0.51%) ⬇️
py/smallint.c 92.00% <0.00%> (-4.00%) ⬇️
py/objlist.c 99.23% <0.00%> (-0.39%) ⬇️
py/runtime.c 99.39% <0.00%> (-0.16%) ⬇️
... and 1 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@jimmo (Member Author) commented Sep 11, 2021

I set up a "quiet" Linux env to try to get some useful comparisons for Linux x64 (i3-5010U @ 2.10GHz). Normally the tests have +/- 30% uncertainty in my regular env; here it's more like +/- 3%.

baseline -> bytecode caching (enabled by default):

$ ./run-perfbench.py -s ~/unix-baseline ~/unix-bccache 
diff of scores (higher is better)
N=2000 M=2000            /home/jimmo/unix-baseline -> /home/jimmo/unix-bccache         diff      diff% (error%)
bm_chaos.py                12049.86 ->   13742.56 :   +1692.70 = +14.047% (+/-3.11%)
bm_fannkuch.py                61.59 ->      60.13 :      -1.46 =  -2.371% (+/-1.58%)
bm_fft.py                 112941.24 ->  113083.20 :    +141.96 =  +0.126% (+/-2.08%)
bm_float.py               202443.35 ->  256552.80 :  +54109.45 = +26.728% (+/-2.22%)
bm_hexiom.py                 531.50 ->     521.93 :      -9.57 =  -1.801% (+/-0.62%)
bm_nqueens.py             208519.45 ->  197544.25 :  -10975.20 =  -5.263% (+/-2.96%)
bm_pidigits.py              8090.77 ->    8072.98 :     -17.79 =  -0.220% (+/-3.27%)
misc_aes.py                13891.95 ->   17283.45 :   +3391.50 = +24.413% (+/-0.90%)
misc_mandel.py            120134.35 ->   99083.99 :  -21050.36 = -17.522% (+/-0.70%)
misc_pystone.py            67767.06 ->   83860.10 :  +16093.04 = +23.748% (+/-1.52%)
misc_raytrace.py           17480.07 ->   21490.40 :   +4010.33 = +22.942% (+/-1.78%)

baseline -> map caching

$ ./run-perfbench.py -s ~/unix-baseline ~/unix-mapcache
diff of scores (higher is better)
N=2000 M=2000            /home/jimmo/unix-baseline -> /home/jimmo/unix-mapcache         diff      diff% (error%)
bm_chaos.py                12049.86 ->   13171.22 :   +1121.36 =  +9.306% (+/-3.06%)
bm_fannkuch.py                61.59 ->      61.31 :      -0.28 =  -0.455% (+/-2.35%)
bm_fft.py                 112941.24 ->  113928.80 :    +987.56 =  +0.874% (+/-3.26%)
bm_float.py               202443.35 ->  229413.54 :  +26970.19 = +13.322% (+/-1.59%)
bm_hexiom.py                 531.50 ->     600.29 :     +68.79 = +12.943% (+/-3.34%)
bm_nqueens.py             208519.45 ->  216516.42 :   +7996.97 =  +3.835% (+/-2.51%)
bm_pidigits.py              8090.77 ->    8127.10 :     +36.33 =  +0.449% (+/-2.77%)
misc_aes.py                13891.95 ->   16270.39 :   +2378.44 = +17.121% (+/-0.72%)
misc_mandel.py            120134.35 ->  130021.80 :   +9887.45 =  +8.230% (+/-1.99%)
misc_pystone.py            67767.06 ->   80591.13 :  +12824.07 = +18.924% (+/-1.25%)
misc_raytrace.py           17480.07 ->   20544.16 :   +3064.09 = +17.529% (+/-1.79%)

map caching --> attr fast path (#7688)

$ ./run-perfbench.py -s ~/unix-mapcache ~/unix-attrmapcache
diff of scores (higher is better)
N=2000 M=2000            /home/jimmo/unix-mapcache -> /home/jimmo/unix-attrmapcache         diff      diff% (error%)
bm_chaos.py                13171.22 ->   13905.67 :    +734.45 =  +5.576% (+/-3.87%)
bm_fannkuch.py                61.31 ->      61.34 :      +0.03 =  +0.049% (+/-2.71%)
bm_fft.py                 113928.80 ->  114793.68 :    +864.88 =  +0.759% (+/-2.94%)
bm_float.py               229413.54 ->  243908.29 :  +14494.75 =  +6.318% (+/-1.63%)
bm_hexiom.py                 600.29 ->     625.41 :     +25.12 =  +4.185% (+/-2.93%)
bm_nqueens.py             216516.42 ->  217713.12 :   +1196.70 =  +0.553% (+/-2.28%)
bm_pidigits.py              8127.10 ->    8198.75 :     +71.65 =  +0.882% (+/-2.69%)
misc_aes.py                16270.39 ->   16480.52 :    +210.13 =  +1.291% (+/-0.74%)
misc_mandel.py            130021.80 ->  128939.84 :   -1081.96 =  -0.832% (+/-4.80%)
misc_pystone.py            80591.13 ->   82592.56 :   +2001.43 =  +2.483% (+/-2.24%)
misc_raytrace.py           20544.16 ->   22227.23 :   +1683.07 =  +8.192% (+/-1.97%)

baseline -> map caching & attr fast path

$ ./run-perfbench.py -s ~/unix-baseline ~/unix-attrmapcache
diff of scores (higher is better)
N=2000 M=2000            /home/jimmo/unix-baseline -> /home/jimmo/unix-attrmapcache         diff      diff% (error%)
bm_chaos.py                12049.86 ->   13905.67 :   +1855.81 = +15.401% (+/-3.66%)
bm_fannkuch.py                61.59 ->      61.34 :      -0.25 =  -0.406% (+/-2.04%)
bm_fft.py                 112941.24 ->  114793.68 :   +1852.44 =  +1.640% (+/-1.98%)
bm_float.py               202443.35 ->  243908.29 :  +41464.94 = +20.482% (+/-1.14%)
bm_hexiom.py                 531.50 ->     625.41 :     +93.91 = +17.669% (+/-0.61%)
bm_nqueens.py             208519.45 ->  217713.12 :   +9193.67 =  +4.409% (+/-2.13%)
bm_pidigits.py              8090.77 ->    8198.75 :    +107.98 =  +1.335% (+/-2.70%)
misc_aes.py                13891.95 ->   16480.52 :   +2588.57 = +18.634% (+/-0.91%)
misc_mandel.py            120134.35 ->  128939.84 :   +8805.49 =  +7.330% (+/-4.84%)
misc_pystone.py            67767.06 ->   82592.56 :  +14825.50 = +21.877% (+/-2.60%)
misc_raytrace.py           17480.07 ->   22227.23 :   +4747.16 = +27.158% (+/-1.96%)

bytecode caching -> map caching (this is probably the most important one because it's the diff created by this PR)

$ ./run-perfbench.py -s ~/unix-bccache ~/unix-mapcache
diff of scores (higher is better)
N=2000 M=2000            /home/jimmo/unix-bccache -> /home/jimmo/unix-mapcache         diff      diff% (error%)
bm_chaos.py                13742.56 ->   13171.22 :    -571.34 =  -4.157% (+/-3.31%)
bm_fannkuch.py                60.13 ->      61.31 :      +1.18 =  +1.962% (+/-2.43%)
bm_fft.py                 113083.20 ->  113928.80 :    +845.60 =  +0.748% (+/-3.02%)
bm_float.py               256552.80 ->  229413.54 :  -27139.26 = -10.578% (+/-2.09%)
bm_hexiom.py                 521.93 ->     600.29 :     +78.36 = +15.014% (+/-3.37%)
bm_nqueens.py             197544.25 ->  216516.42 :  +18972.17 =  +9.604% (+/-3.32%)
bm_pidigits.py              8072.98 ->    8127.10 :     +54.12 =  +0.670% (+/-3.27%)
misc_aes.py                17283.45 ->   16270.39 :   -1013.06 =  -5.861% (+/-0.69%)
misc_mandel.py             99083.99 ->  130021.80 :  +30937.81 = +31.224% (+/-2.46%)
misc_pystone.py            83860.10 ->   80591.13 :   -3268.97 =  -3.898% (+/-1.32%)
misc_raytrace.py           21490.40 ->   20544.16 :    -946.24 =  -4.403% (+/-1.77%)

bytecode caching -> map caching & attr fast path (diff of this PR plus #7688) (@stinos)

$ ./run-perfbench.py -s ~/unix-bccache ~/unix-attrmapcache
diff of scores (higher is better)
N=2000 M=2000            /home/jimmo/unix-bccache -> /home/jimmo/unix-attrmapcache         diff      diff% (error%)
bm_chaos.py                13742.56 ->   13905.67 :    +163.11 =  +1.187% (+/-3.75%)
bm_fannkuch.py                60.13 ->      61.34 :      +1.21 =  +2.012% (+/-2.11%)
bm_fft.py                 113083.20 ->  114793.68 :   +1710.48 =  +1.513% (+/-1.57%)
bm_float.py               256552.80 ->  243908.29 :  -12644.51 =  -4.929% (+/-1.90%)
bm_hexiom.py                 521.93 ->     625.41 :    +103.48 = +19.826% (+/-0.40%)
bm_nqueens.py             197544.25 ->  217713.12 :  +20168.87 = +10.210% (+/-3.01%)
bm_pidigits.py              8072.98 ->    8198.75 :    +125.77 =  +1.558% (+/-3.22%)
misc_aes.py                17283.45 ->   16480.52 :    -802.93 =  -4.646% (+/-0.82%)
misc_mandel.py             99083.99 ->  128939.84 :  +29855.85 = +30.132% (+/-5.88%)
misc_pystone.py            83860.10 ->   82592.56 :   -1267.54 =  -1.511% (+/-2.27%)
misc_raytrace.py           21490.40 ->   22227.23 :    +736.83 =  +3.429% (+/-1.88%)

baseline -> native

$ ./run-perfbench.py -s ~/unix-baseline ~/unix-baseline-native
diff of scores (higher is better)
N=2000 M=2000            /home/jimmo/unix-baseline -> /home/jimmo/unix-baseline-native         diff      diff% (error%)
bm_chaos.py                12049.86 ->   14130.62 :   +2080.76 = +17.268% (+/-4.90%)
bm_fannkuch.py                61.59 ->      74.96 :     +13.37 = +21.708% (+/-2.02%)
bm_fft.py                 112941.24 ->  166682.99 :  +53741.75 = +47.584% (+/-3.90%)
bm_float.py               202443.35 ->  233415.23 :  +30971.88 = +15.299% (+/-2.82%)
bm_hexiom.py                 531.50 ->     628.59 :     +97.09 = +18.267% (+/-1.59%)
bm_nqueens.py             208519.45 ->  225418.44 :  +16898.99 =  +8.104% (+/-2.99%)
bm_pidigits.py              8090.77 ->    6322.00 :   -1768.77 = -21.862% (+/-3.74%)
misc_aes.py                13891.95 ->   20670.10 :   +6778.15 = +48.792% (+/-0.91%)
misc_mandel.py            120134.35 ->  138221.11 :  +18086.76 = +15.055% (+/-2.32%)
misc_pystone.py            67767.06 ->   85032.14 :  +17265.08 = +25.477% (+/-2.43%)
misc_raytrace.py           17480.07 ->   19800.01 :   +2319.94 = +13.272% (+/-2.50%)

native -> native & map caching & attr fast path

$ ./run-perfbench.py -s ~/unix-baseline-native ~/unix-attrmapcache-native
diff of scores (higher is better)
N=2000 M=2000            /home/jimmo/unix-baseline-native -> /home/jimmo/unix-attrmapcache-native         diff      diff% (error%)
bm_chaos.py                14130.62 ->   15464.68 :   +1334.06 =  +9.441% (+/-7.11%)
bm_fannkuch.py                74.96 ->      76.16 :      +1.20 =  +1.601% (+/-1.80%)
bm_fft.py                 166682.99 ->  168221.86 :   +1538.87 =  +0.923% (+/-4.20%)
bm_float.py               233415.23 ->  265524.90 :  +32109.67 = +13.756% (+/-2.57%)
bm_hexiom.py                 628.59 ->     734.17 :    +105.58 = +16.796% (+/-1.39%)
bm_nqueens.py             225418.44 ->  232926.45 :   +7508.01 =  +3.331% (+/-3.10%)
bm_pidigits.py              6322.00 ->    6379.52 :     +57.52 =  +0.910% (+/-5.62%)
misc_aes.py                20670.10 ->   27223.18 :   +6553.08 = +31.703% (+/-1.56%)
misc_mandel.py            138221.11 ->  152014.01 :  +13792.90 =  +9.979% (+/-2.46%)
misc_pystone.py            85032.14 ->  105681.44 :  +20649.30 = +24.284% (+/-2.25%)
misc_raytrace.py           19800.01 ->   23350.73 :   +3550.72 = +17.933% (+/-2.79%)

baseline -> native & map caching & attr fast path (combination of previous two)

$ ./run-perfbench.py -s ~/unix-baseline ~/unix-attrmapcache-native
diff of scores (higher is better)
N=2000 M=2000            /home/jimmo/unix-baseline -> /home/jimmo/unix-attrmapcache-native         diff      diff% (error%)
bm_chaos.py                12049.86 ->   15464.68 :   +3414.82 = +28.339% (+/-7.09%)
bm_fannkuch.py                61.59 ->      76.16 :     +14.57 = +23.656% (+/-1.77%)
bm_fft.py                 112941.24 ->  168221.86 :  +55280.62 = +48.946% (+/-5.38%)
bm_float.py               202443.35 ->  265524.90 :  +63081.55 = +31.160% (+/-1.13%)
bm_hexiom.py                 531.50 ->     734.17 :    +202.67 = +38.132% (+/-0.89%)
bm_nqueens.py             208519.45 ->  232926.45 :  +24407.00 = +11.705% (+/-2.73%)
bm_pidigits.py              8090.77 ->    6379.52 :   -1711.25 = -21.151% (+/-3.60%)
misc_aes.py                13891.95 ->   27223.18 :  +13331.23 = +95.964% (+/-2.27%)
misc_mandel.py            120134.35 ->  152014.01 :  +31879.66 = +26.537% (+/-1.71%)
misc_pystone.py            67767.06 ->  105681.44 :  +37914.38 = +55.948% (+/-1.81%)
misc_raytrace.py           17480.07 ->   23350.73 :   +5870.66 = +33.585% (+/-2.32%)

@dpgeorge (Member) commented Sep 12, 2021

I did some benchmarking on a Raspberry Pi 3, running Linux in 64-bit mode. This is a relatively quiet environment for running tests, with at most 2% error. The main two results are:

Baseline (no caching optimisations) -> inline bytecode caching (current master setting for unix port):

diff of scores (higher is better)
N=1000 M=1000        perf-rpi3-base -> perf-rpi3-bccache   diff     diff% (error%)
bm_chaos.py                 3523.25 ->    4041.71 :    +518.46 = +14.715% (+/-1.41%)
bm_fannkuch.py                15.65 ->      15.86 :      +0.21 =  +1.342% (+/-0.93%)
bm_fft.py                  27219.17 ->   28516.10 :   +1296.93 =  +4.765% (+/-1.63%)
bm_float.py                53617.62 ->   66355.25 :  +12737.63 = +23.756% (+/-1.80%)
bm_hexiom.py                 149.13 ->     156.60 :      +7.47 =  +5.009% (+/-1.43%)
bm_nqueens.py              56551.95 ->   57100.81 :    +548.86 =  +0.971% (+/-0.66%)
bm_pidigits.py              2527.64 ->    2653.38 :    +125.74 =  +4.975% (+/-0.49%)
misc_aes.py                 3944.87 ->    4612.45 :    +667.58 = +16.923% (+/-0.44%)
misc_mandel.py             34184.39 ->   31959.46 :   -2224.93 =  -6.509% (+/-1.03%)
misc_pystone.py            21559.29 ->   26260.99 :   +4701.70 = +21.808% (+/-0.76%)
misc_raytrace.py            5472.38 ->    6515.27 :   +1042.89 = +19.057% (+/-1.82%)

Baseline -> map-caching (this PR) + load-attr opt (#7688):

diff of scores (higher is better)
N=1000 M=1000        perf-rpi3-base -> perf-rpi3-mapattr   diff     diff% (error%)
bm_chaos.py                 3523.25 ->    4254.51 :    +731.26 = +20.755% (+/-1.32%)
bm_fannkuch.py                15.65 ->      16.55 :      +0.90 =  +5.751% (+/-1.08%)
bm_fft.py                  27219.17 ->   28609.96 :   +1390.79 =  +5.110% (+/-1.58%)
bm_float.py                53617.62 ->   63446.47 :   +9828.85 = +18.331% (+/-1.66%)
bm_hexiom.py                 149.13 ->     188.85 :     +39.72 = +26.634% (+/-1.66%)
bm_nqueens.py              56551.95 ->   60637.03 :   +4085.08 =  +7.224% (+/-0.60%)
bm_pidigits.py              2527.64 ->    2656.99 :    +129.35 =  +5.117% (+/-0.45%)
misc_aes.py                 3944.87 ->    4512.75 :    +567.88 = +14.395% (+/-0.47%)
misc_mandel.py             34184.39 ->   39288.92 :   +5104.53 = +14.932% (+/-0.92%)
misc_pystone.py            21559.29 ->   25539.84 :   +3980.55 = +18.463% (+/-0.70%)
misc_raytrace.py            5472.38 ->    6839.04 :   +1366.66 = +24.974% (+/-0.88%)

Then, the diff of the above, i.e. the change in performance switching from inline bytecode caching to map caching plus the load-attr optimisation:

diff of scores (higher is better)
N=1000 M=1000     perf-rpi3-bccache -> perf-rpi3-mapattr   diff     diff% (error%)
bm_chaos.py                 4041.71 ->    4254.51 :    +212.80 =  +5.265% (+/-1.05%)
bm_fannkuch.py                15.86 ->      16.55 :      +0.69 =  +4.351% (+/-0.81%)
bm_fft.py                  28516.10 ->   28609.96 :     +93.86 =  +0.329% (+/-0.37%)
bm_float.py                66355.25 ->   63446.47 :   -2908.78 =  -4.384% (+/-1.49%)
bm_hexiom.py                 156.60 ->     188.85 :     +32.25 = +20.594% (+/-0.92%)
bm_nqueens.py              57100.81 ->   60637.03 :   +3536.22 =  +6.193% (+/-0.54%)
bm_pidigits.py              2653.38 ->    2656.99 :      +3.61 =  +0.136% (+/-0.33%)
misc_aes.py                 4612.45 ->    4512.75 :     -99.70 =  -2.162% (+/-0.31%)
misc_mandel.py             31959.46 ->   39288.92 :   +7329.46 = +22.934% (+/-1.35%)
misc_pystone.py            26260.99 ->   25539.84 :    -721.15 =  -2.746% (+/-0.71%)
misc_raytrace.py            6515.27 ->    6839.04 :    +323.77 =  +4.969% (+/-1.35%)

py/map.c Outdated
@@ -136,6 +156,17 @@ mp_map_elem_t *mp_map_lookup(mp_map_t *map, mp_obj_t index, mp_map_lookup_kind_t
// If the map is a fixed array then we must only be called for a lookup
assert(!map->is_fixed || lookup_kind == MP_MAP_LOOKUP);

#if MICROPY_OPT_MAP_LOOKUP_CACHE
if (lookup_kind == MP_MAP_LOOKUP && map->alloc) {
Member:

I think this can apply just as well to MP_MAP_LOOKUP_ADD_IF_NOT_FOUND.

Member:

Making it lookup_kind != MP_MAP_LOOKUP_REMOVE_IF_FOUND gives a nice little boost.

Member Author:

Yep, good thinking. On PYBV11:

$ ./run-perfbench.py -s ~/3.cache-map-pybv11 ~/3.cache-map-add-pybv11 
diff of scores (higher is better)
N=100 M=100              /home/jimmo/3.cache-map-pybv11 -> /home/jimmo/3.cache-map-add-pybv11         diff      diff% (error%)
bm_asyncio.py                113.73 ->     113.74 :      +0.01 =  +0.009% (+/-0.01%)
bm_chaos.py                  344.82 ->     343.50 :      -1.32 =  -0.383% (+/-0.00%)
bm_fannkuch.py                78.62 ->      78.62 :      +0.00 =  +0.000% (+/-0.00%)
bm_fft.py                   2593.26 ->    2593.13 :      -0.13 =  -0.005% (+/-0.00%)
bm_float.py                 5460.45 ->    5667.22 :    +206.77 =  +3.787% (+/-0.01%)
bm_hexiom.py                  42.78 ->      42.58 :      -0.20 =  -0.468% (+/-0.00%)
bm_nqueens.py               4467.91 ->    4464.29 :      -3.62 =  -0.081% (+/-0.00%)
bm_pidigits.py               648.37 ->     649.28 :      +0.91 =  +0.140% (+/-0.41%)
misc_aes.py                  426.08 ->     425.96 :      -0.12 =  -0.028% (+/-0.01%)
misc_mandel.py              3621.79 ->    3620.72 :      -1.07 =  -0.030% (+/-0.01%)
misc_pystone.py             2289.22 ->    2368.03 :     +78.81 =  +3.443% (+/-0.01%)
misc_raytrace.py             357.00 ->     354.92 :      -2.08 =  -0.583% (+/-0.01%)

@@ -172,6 +203,7 @@ mp_map_elem_t *mp_map_lookup(mp_map_t *map, mp_obj_t index, mp_map_lookup_kind_t
elem->value = value;
}
#endif
MAP_CACHE_SET(index, elem - map->table);
Member:

Maybe this should be wrapped in else { ... }, because if it was a REMOVE_IF_FOUND then the cache should not be updated (with a now-removed element).

Member Author:

This doesn't seem to help... I guess the relatively uncommon removes are outweighed by the extra branch on the common path.

$ ./run-perfbench.py -s ~/3.cache-map-add-pybv11 ~/3.cache-map-add-else-pybv11 
diff of scores (higher is better)
N=100 M=100              /home/jimmo/3.cache-map-add-pybv11 -> /home/jimmo/3.cache-map-add-else-pybv11         diff      diff% (error%)
bm_asyncio.py                113.74 ->     113.72 :      -0.02 =  -0.018% (+/-0.01%)
bm_chaos.py                  343.50 ->     336.44 :      -7.06 =  -2.055% (+/-0.00%)
bm_fannkuch.py                78.62 ->      78.26 :      -0.36 =  -0.458% (+/-0.01%)
bm_fft.py                   2593.13 ->    2552.65 :     -40.48 =  -1.561% (+/-0.00%)
bm_float.py                 5667.22 ->    5569.22 :     -98.00 =  -1.729% (+/-0.00%)
bm_hexiom.py                  42.58 ->      42.75 :      +0.17 =  +0.399% (+/-0.00%)
bm_nqueens.py               4464.29 ->    4455.36 :      -8.93 =  -0.200% (+/-0.00%)
bm_pidigits.py               649.28 ->     650.14 :      +0.86 =  +0.132% (+/-0.37%)
misc_aes.py                  425.96 ->     420.22 :      -5.74 =  -1.348% (+/-0.01%)
misc_mandel.py              3620.72 ->    3562.43 :     -58.29 =  -1.610% (+/-0.00%)
misc_pystone.py             2368.03 ->    2342.79 :     -25.24 =  -1.066% (+/-0.01%)
misc_raytrace.py             354.92 ->     351.44 :      -3.48 =  -0.981% (+/-0.01%)

jimmo added a commit to jimmo/micropython that referenced this pull request Sep 13, 2021
This feature originally provided a significant performance boost for Unix,
but wasn't able to be enabled for MCU targets, and added significant extra
complexity to generating .mpy files. The equivalent performance gain is
now provided by MICROPY_OPT_LOAD_ATTR_FAST_PATH and
MICROPY_OPT_MAP_LOOKUP_CACHE.

See micropython#7680 for discussion.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
@jimmo (Member Author) commented Sep 13, 2021

@dpgeorge I have cherry-picked #7688 into this PR and re-arranged the order.

The final diff on pybv11 (compared to master):

$ ./run-perfbench.py -s ~/3.cache-baseline-pybv11 ~/3.cache-final-pybv11 
diff of scores (higher is better)
N=100 M=100              /home/jimmo/3.cache-baseline-pybv11 -> /home/jimmo/3.cache-final-pybv11         diff      diff% (error%)
bm_asyncio.py                111.79 ->     113.30 :      +1.51 =  +1.351% (+/-0.01%)
bm_chaos.py                  320.05 ->     361.49 :     +41.44 = +12.948% (+/-0.02%)
bm_fannkuch.py                78.96 ->      78.19 :      -0.77 =  -0.975% (+/-0.01%)
bm_fft.py                   2545.01 ->    2571.63 :     +26.62 =  +1.046% (+/-0.00%)
bm_float.py                 5088.98 ->    5990.95 :    +901.97 = +17.724% (+/-0.01%)
bm_hexiom.py                  35.64 ->      43.92 :      +8.28 = +23.232% (+/-0.01%)
bm_nqueens.py               4194.71 ->    4439.69 :    +244.98 =  +5.840% (+/-0.00%)
bm_pidigits.py               649.25 ->     646.32 :      -2.93 =  -0.451% (+/-0.07%)
misc_aes.py                  384.51 ->     422.65 :     +38.14 =  +9.919% (+/-0.01%)
misc_mandel.py              3112.82 ->    3559.87 :    +447.05 = +14.362% (+/-0.01%)
misc_pystone.py             1968.33 ->    2400.13 :    +431.80 = +21.937% (+/-0.02%)
misc_raytrace.py             318.19 ->     381.46 :     +63.27 = +19.884% (+/-0.01%)

And on unix (also compared to master, so the baseline includes bytecode caching)

$ ./run-perfbench.py -s ~/2.cache-unix-baseline ~/2.cache-unix-final 
diff of scores (higher is better)
N=2000 M=2000            /home/jimmo/2.cache-unix-baseline -> /home/jimmo/2.cache-unix-final         diff      diff% (error%)
bm_chaos.py                13578.84 ->   13771.01 :    +192.17 =  +1.415% (+/-4.33%)
bm_fannkuch.py                60.17 ->      60.72 :      +0.55 =  +0.914% (+/-2.23%)
bm_fft.py                 113053.96 ->  110575.17 :   -2478.79 =  -2.193% (+/-3.82%)
bm_float.py               253351.63 ->  246882.33 :   -6469.30 =  -2.553% (+/-2.64%)
bm_hexiom.py                 522.28 ->     624.67 :    +102.39 = +19.604% (+/-0.97%)
bm_nqueens.py             199638.88 ->  215727.17 :  +16088.29 =  +8.059% (+/-2.99%)
bm_pidigits.py              8067.60 ->    8002.27 :     -65.33 =  -0.810% (+/-4.04%)
misc_aes.py                17193.13 ->   16372.50 :    -820.63 =  -4.773% (+/-1.82%)
misc_mandel.py             98764.31 ->  130358.21 :  +31593.90 = +31.989% (+/-2.50%)
misc_pystone.py            83903.00 ->   84380.51 :    +477.51 =  +0.569% (+/-1.54%)
misc_raytrace.py           21399.20 ->   21855.17 :    +455.97 =  +2.131% (+/-2.76%)

@jimmo (Member Author) commented Sep 13, 2021

Forgot to add -- this PR is +104 bytes on PYBV11.

py/map.c Outdated
// objects' locals dicts), and computation of the hash (and potentially some
// linear probing) in the case of a regular map. Note the same cache is
// shared across all maps.
uint8_t map_lookup_cache[MICROPY_OPT_MAP_LOOKUP_CACHE_SIZE];
Member:

IMO this should go in py/mpstate.h, in mp_state_vm_t, outside the root-pointer section of that struct. Reasoning:

  1. keeps all state localised to that file
  2. if there were multiple VM instances in the one address space, then I'd argue that each VM should have its own table because they run totally independent code

It shouldn't impact code size or runtime (it should compile to the same machine instructions...).

BTW, would there be any issue with GIL-less multithreading (eg rp2) accessing this table? My guess is no.
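A minimal sketch of that placement, assuming the MICROPY_OPT_MAP_LOOKUP_CACHE_SIZE config name already used in this PR (surrounding fields elided, section comment illustrative):

// py/mpstate.h
typedef struct _mp_state_vm_t {
    // ... root pointers and other GC-scanned VM state ...

    // END ROOT POINTER SECTION

    #if MICROPY_OPT_MAP_LOOKUP_CACHE
    // Not a root pointer: the table holds small indices, never mp_obj_t values.
    uint8_t map_lookup_cache[MICROPY_OPT_MAP_LOOKUP_CACHE_SIZE];
    #endif
} mp_state_vm_t;

Lookups then go through MP_STATE_VM(map_lookup_cache)[...], which should compile to the same instructions as a file-static array while keeping each VM instance's cache independent.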

Member Author:

Done -- branch updated.

Interestingly, this gained a small performance improvement; possibly the new location is friendlier to the data cache?

$ ./run-perfbench.py -s ~/lookup-global-pybv11 ~/lookup-state-pybv11
diff of scores (higher is better)
N=100 M=100              /home/jimmo/lookup-global-pybv11 -> /home/jimmo/lookup-state-pybv11         diff      diff% (error%)
bm_chaos.py                  361.52 ->     372.92 :     +11.40 =  +3.153% (+/-0.00%)
bm_fannkuch.py                78.19 ->      78.97 :      +0.78 =  +0.998% (+/-0.01%)
bm_fft.py                   2571.66 ->    2569.91 :      -1.75 =  -0.068% (+/-0.00%)
bm_float.py                 5991.20 ->    6168.25 :    +177.05 =  +2.955% (+/-0.01%)
bm_hexiom.py                  43.92 ->      44.22 :      +0.30 =  +0.683% (+/-0.00%)
bm_nqueens.py               4439.64 ->    4444.68 :      +5.04 =  +0.114% (+/-0.00%)
bm_pidigits.py               647.20 ->     651.21 :      +4.01 =  +0.620% (+/-0.38%)
misc_aes.py                  422.65 ->     425.99 :      +3.34 =  +0.790% (+/-0.01%)
misc_mandel.py              3559.81 ->    3598.03 :     +38.22 =  +1.074% (+/-0.01%)
misc_pystone.py             2400.13 ->    2429.40 :     +29.27 =  +1.220% (+/-0.01%)
misc_raytrace.py             381.46 ->     391.24 :      +9.78 =  +2.564% (+/-0.01%)

This is nice because the net effect of this PR is now positive on all tests on PYBV11.

$ ./run-perfbench.py -s ~/3.cache-baseline-pybv11 ~/lookup-state-pybv11
diff of scores (higher is better)
N=100 M=100              /home/jimmo/3.cache-baseline-pybv11 -> /home/jimmo/lookup-state-pybv11         diff      diff% (error%)
bm_chaos.py                  320.05 ->     372.92 :     +52.87 = +16.519% (+/-0.02%)
bm_fannkuch.py                78.96 ->      78.97 :      +0.01 =  +0.013% (+/-0.01%)
bm_fft.py                   2545.01 ->    2569.91 :     +24.90 =  +0.978% (+/-0.00%)
bm_float.py                 5088.98 ->    6168.25 :   +1079.27 = +21.208% (+/-0.01%)
bm_hexiom.py                  35.64 ->      44.22 :      +8.58 = +24.074% (+/-0.00%)
bm_nqueens.py               4194.71 ->    4444.68 :    +249.97 =  +5.959% (+/-0.00%)
bm_pidigits.py               649.25 ->     651.21 :      +1.96 =  +0.302% (+/-0.31%)
misc_aes.py                  384.51 ->     425.99 :     +41.48 = +10.788% (+/-0.01%)
misc_mandel.py              3112.82 ->    3598.03 :    +485.21 = +15.587% (+/-0.01%)
misc_pystone.py             1968.33 ->    2429.40 :    +461.07 = +23.424% (+/-0.01%)
misc_raytrace.py             318.19 ->     391.24 :     +73.05 = +22.958% (+/-0.01%)

And overall a pretty good result compared to bytecode caching

$ ./run-perfbench.py -s ~/cache-bytecode-pybv11 ~/lookup-state-pybv11
diff of scores (higher is better)
N=100 M=100              /home/jimmo/cache-bytecode-pybv11 -> /home/jimmo/lookup-state-pybv11         diff      diff% (error%)
bm_chaos.py                  358.90 ->     372.92 :     +14.02 =  +3.906% (+/-0.01%)
bm_fannkuch.py                78.48 ->      78.97 :      +0.49 =  +0.624% (+/-0.01%)
bm_fft.py                   2576.19 ->    2569.91 :      -6.28 =  -0.244% (+/-0.00%)
bm_float.py                 6296.83 ->    6168.25 :    -128.58 =  -2.042% (+/-0.01%)
bm_hexiom.py                  34.45 ->      44.22 :      +9.77 = +28.360% (+/-0.00%)
bm_nqueens.py               4069.66 ->    4444.68 :    +375.02 =  +9.215% (+/-0.00%)
bm_pidigits.py               652.31 ->     651.21 :      -1.10 =  -0.169% (+/-0.36%)
misc_aes.py                  453.34 ->     425.99 :     -27.35 =  -6.033% (+/-0.01%)
misc_mandel.py              2960.68 ->    3598.03 :    +637.35 = +21.527% (+/-0.01%)
misc_pystone.py             2513.17 ->    2429.40 :     -83.77 =  -3.333% (+/-0.01%)
misc_raytrace.py             372.69 ->     391.24 :     +18.55 =  +4.977% (+/-0.01%)

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
This is an alternative to the bytecode caching (which is incompatible
with bytecode-in-ROM).

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
Enabled for all variants except minimal.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
This feature originally provided a significant performance boost for Unix,
but wasn't able to be enabled for MCU targets, and added significant extra
complexity to generating .mpy files. The equivalent performance gain is
now provided by MICROPY_OPT_LOAD_ATTR_FAST_PATH and
MICROPY_OPT_MAP_LOOKUP_CACHE.

See micropython#7680 for discussion.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
dpgeorge pushed a commit that referenced this pull request Sep 16, 2021
This commit removes all parts of code associated with the existing
MICROPY_OPT_CACHE_MAP_LOOKUP_IN_BYTECODE optimisation option, including the
-mcache-lookup-bc option to mpy-cross.

This feature originally provided a significant performance boost for Unix,
but wasn't able to be enabled for MCU targets (due to frozen bytecode), and
added significant extra complexity to generating and distributing .mpy
files.

The equivalent performance gain is now provided by the combination of
MICROPY_OPT_LOAD_ATTR_FAST_PATH and MICROPY_OPT_MAP_LOOKUP_CACHE (which has
been enabled on the unix port in the previous commit).

It's hard to provide precise performance numbers, but tests have been run
on a wide variety of architectures (x86-64, ARM Cortex, Aarch64, RISC-V,
xtensa) and they all generally agree on the qualitative improvements seen
by the combination of MICROPY_OPT_LOAD_ATTR_FAST_PATH and
MICROPY_OPT_MAP_LOOKUP_CACHE.

For example, on a "quiet" Linux x64 environment (i3-5010U @ 2.10GHz) the
change from CACHE_MAP_LOOKUP_IN_BYTECODE, to LOAD_ATTR_FAST_PATH combined
with MAP_LOOKUP_CACHE is:

diff of scores (higher is better)
N=2000 M=2000       bccache -> attrmapcache      diff      diff% (error%)
bm_chaos.py        13742.56 ->   13905.67 :   +163.11 =  +1.187% (+/-3.75%)
bm_fannkuch.py        60.13 ->      61.34 :     +1.21 =  +2.012% (+/-2.11%)
bm_fft.py         113083.20 ->  114793.68 :  +1710.48 =  +1.513% (+/-1.57%)
bm_float.py       256552.80 ->  243908.29 : -12644.51 =  -4.929% (+/-1.90%)
bm_hexiom.py         521.93 ->     625.41 :   +103.48 = +19.826% (+/-0.40%)
bm_nqueens.py     197544.25 ->  217713.12 : +20168.87 = +10.210% (+/-3.01%)
bm_pidigits.py      8072.98 ->    8198.75 :   +125.77 =  +1.558% (+/-3.22%)
misc_aes.py        17283.45 ->   16480.52 :   -802.93 =  -4.646% (+/-0.82%)
misc_mandel.py     99083.99 ->  128939.84 : +29855.85 = +30.132% (+/-5.88%)
misc_pystone.py    83860.10 ->   82592.56 :  -1267.54 =  -1.511% (+/-2.27%)
misc_raytrace.py   21490.40 ->   22227.23 :   +736.83 =  +3.429% (+/-1.88%)

This shows that the new optimisations are at least as good as the existing
inline-bytecode-caching, and are sometimes much better (because the new
ones apply caching to a wider variety of map lookups).

The new optimisations can also benefit code generated by the native
emitter, because they apply to the runtime rather than the generated code.
The improvement for the native emitter when LOAD_ATTR_FAST_PATH and
MAP_LOOKUP_CACHE are enabled is (same Linux environment as above):

diff of scores (higher is better)
N=2000 M=2000        native -> nat-attrmapcache  diff      diff% (error%)
bm_chaos.py        14130.62 ->   15464.68 :  +1334.06 =  +9.441% (+/-7.11%)
bm_fannkuch.py        74.96 ->      76.16 :     +1.20 =  +1.601% (+/-1.80%)
bm_fft.py         166682.99 ->  168221.86 :  +1538.87 =  +0.923% (+/-4.20%)
bm_float.py       233415.23 ->  265524.90 : +32109.67 = +13.756% (+/-2.57%)
bm_hexiom.py         628.59 ->     734.17 :   +105.58 = +16.796% (+/-1.39%)
bm_nqueens.py     225418.44 ->  232926.45 :  +7508.01 =  +3.331% (+/-3.10%)
bm_pidigits.py      6322.00 ->    6379.52 :    +57.52 =  +0.910% (+/-5.62%)
misc_aes.py        20670.10 ->   27223.18 :  +6553.08 = +31.703% (+/-1.56%)
misc_mandel.py    138221.11 ->  152014.01 : +13792.90 =  +9.979% (+/-2.46%)
misc_pystone.py    85032.14 ->  105681.44 : +20649.30 = +24.284% (+/-2.25%)
misc_raytrace.py   19800.01 ->   23350.73 :  +3550.72 = +17.933% (+/-2.79%)

In summary, compared to MICROPY_OPT_CACHE_MAP_LOOKUP_IN_BYTECODE, the new
MICROPY_OPT_LOAD_ATTR_FAST_PATH and MICROPY_OPT_MAP_LOOKUP_CACHE options:
- are simpler;
- take less code size;
- are faster (generally);
- work with code generated by the native emitter;
- can be used on embedded targets with a small and constant RAM overhead;
- allow the same .mpy bytecode to run on all targets.

See #7680 for further discussion.  And see also #7653 for a discussion
about simplifying mpy-cross options.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
@dpgeorge (Member) commented:

Merged in 7b89ad8 through b326edf

Thanks @jimmo for the fantastic work on this, it's a really significant improvement! As I wrote in the commit message:

In summary, compared to MICROPY_OPT_CACHE_MAP_LOOKUP_IN_BYTECODE, the new
MICROPY_OPT_LOAD_ATTR_FAST_PATH and MICROPY_OPT_MAP_LOOKUP_CACHE options:
- are simpler;
- take less code size;
- are faster (generally);
- work with code generated by the native emitter;
- can be used on embedded targets with a small and constant RAM overhead;
- allow the same .mpy bytecode to run on all targets.

@dpgeorge dpgeorge closed this Sep 16, 2021
jepler added a commit to jepler/circuitpython that referenced this pull request Oct 15, 2021
These two optional optimizations take a little code space but improve
bytecode performance:
 * micropython#7680
 * micropython#7688