py/map.c: Add a cache of bytecode location to map position. #7680
Conversation
For comparison, here's how the existing MICROPY_OPT_CACHE_MAP_LOOKUP_IN_BYTECODE goes on PYBV11 and PYBD_SF6
I've updated the branch. It's now +176 bytes, but marginally faster.
By optimising for code size you can change the table to 128 entries (instead of 127), and save about 28 bytes, and put
This definitely helps. Now +124 bytes on PYBV11.
Just as a random side quest, something I've been meaning to investigate for a while is the computed goto. On PYBV11, disabling it saves 1276 bytes. The effect on performance varies a bit:
Combined with this PR (i.e. map caching without computed goto) is a net -1156 bytes for an overall performance benefit (except for fft).
Based on a suggestion from @dpgeorge I have changed this to use (map-pointer, key) as the cache index, and this is a significant win for code size (now only +48 bytes) and performance.
It also works for the native emitter and adds a similar perf boost on top of the gain already provided by the native emitter.
py/map.c
Outdated
@@ -126,6 +126,15 @@ STATIC void mp_map_rehash(mp_map_t *map) {
    m_del(mp_map_elem_t, old_table, old_alloc);
}

#if MICROPY_OPT_MAP_CACHING
uint8_t map_lookup_cache[128];
#define MAP_CACHE_OFFSET ((((uintptr_t) + (uintptr_t)index) >> 2) % sizeof(map_lookup_cache))
I think this first `uintptr_t` is missing something that it casts, maybe `map` or `map->table`?
Originally I had this using `(map+index)>>2`, but changed it to just use `index>>2` because it's overall faster. Accidentally left behind the no-op cast. I've updated the branch, and addressed the other comment about the macro args.
Here's the diff FWIW of adding back `map` to the expression. Makes some things marginally better, but a big hit on pystone and raytrace.
$ ./run-perfbench.py -s ~/2.cache-index-pybv11 ~/2.cache-mapindex-pybv11
diff of scores (higher is better)
N=100 M=100 /home/jimmo/2.cache-index-pybv11 -> /home/jimmo/2.cache-mapindex-pybv11 diff diff% (error%)
bm_asyncio.py 113.73 -> 113.43 : -0.30 = -0.264% (+/-0.01%)
bm_chaos.py 348.59 -> 340.54 : -8.05 = -2.309% (+/-0.01%)
bm_fannkuch.py 78.53 -> 78.20 : -0.33 = -0.420% (+/-0.00%)
bm_fft.py 2563.02 -> 2597.15 : +34.13 = +1.332% (+/-0.00%)
bm_float.py 5544.62 -> 5440.36 : -104.26 = -1.880% (+/-0.02%)
bm_hexiom.py 42.73 -> 42.74 : +0.01 = +0.023% (+/-0.00%)
bm_nqueens.py 4429.89 -> 4454.66 : +24.77 = +0.559% (+/-0.00%)
bm_pidigits.py 651.39 -> 649.83 : -1.56 = -0.239% (+/-0.34%)
misc_aes.py 423.37 -> 428.64 : +5.27 = +1.245% (+/-0.01%)
misc_mandel.py 3610.82 -> 3624.53 : +13.71 = +0.380% (+/-0.01%)
misc_pystone.py 2287.95 -> 2185.46 : -102.49 = -4.480% (+/-0.01%)
misc_raytrace.py 362.17 -> 341.62 : -20.55 = -5.674% (+/-0.01%)
And while we're at it, here's the latest baseline->PR after rebasing (similar to the first table in #7680 (comment))
$ ./run-perfbench.py -s ~/2.cache-baseline-pybv11 ~/2.cache-index-pybv11
diff of scores (higher is better)
N=100 M=100 /home/jimmo/2.cache-baseline-pybv11 -> /home/jimmo/2.cache-index-pybv11 diff diff% (error%)
bm_asyncio.py 111.34 -> 113.73 : +2.39 = +2.147% (+/-0.01%)
bm_chaos.py 311.32 -> 348.59 : +37.27 = +11.972% (+/-0.02%)
bm_fannkuch.py 77.82 -> 78.53 : +0.71 = +0.912% (+/-0.00%)
bm_fft.py 2539.05 -> 2563.02 : +23.97 = +0.944% (+/-0.00%)
bm_float.py 4955.56 -> 5544.62 : +589.06 = +11.887% (+/-0.00%)
bm_hexiom.py 35.52 -> 42.73 : +7.21 = +20.298% (+/-0.00%)
bm_nqueens.py 4201.40 -> 4429.89 : +228.49 = +5.438% (+/-0.00%)
bm_pidigits.py 646.91 -> 651.39 : +4.48 = +0.693% (+/-0.17%)
misc_aes.py 382.97 -> 423.37 : +40.40 = +10.549% (+/-0.01%)
misc_mandel.py 3091.74 -> 3610.82 : +519.08 = +16.789% (+/-0.01%)
misc_pystone.py 1948.00 -> 2287.95 : +339.95 = +17.451% (+/-0.01%)
misc_raytrace.py 312.45 -> 362.17 : +49.72 = +15.913% (+/-0.01%)
This feature provided a significant performance boost for Unix, but wasn't able to be enabled for MCU targets, and added significant extra complexity to generating .mpy files. Some of the performance gain can be achieved now with MICROPY_OPT_MAP_LOOKUP_CACHE instead. See micropython#7680 for discussion. Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
I've added a commit to remove bytecode caching, would happily make this a different PR though. The main reason is so that CI will run tests with map caching (which does not do anything when bytecode caching is enabled).
Codecov Report
@@            Coverage Diff             @@
##           master    #7680      +/-   ##
==========================================
- Coverage   98.26%   98.24%    -0.02%
==========================================
  Files         154      154
  Lines       20076    20032       -44
==========================================
- Hits        19728    19681       -47
- Misses        348      351        +3
Continue to review full report at Codecov.
I set up a "quiet" Linux env to try to get some useful comparisons for Linux x64 (i3-5010U @ 2.10GHz). Normally the tests have +/- 30% uncertainty on my regular env; here they are more like +/- 3%. The comparisons run were:
- baseline -> bytecode caching (enabled by default)
- baseline -> map caching
- map caching -> attr fast path (#7688)
- baseline -> map caching & attr fast path
- bytecode caching -> map caching (this is probably the most important one because it's the diff created by this PR)
- bytecode caching -> map caching & attr fast path (diff of this PR plus #7688) (@stinos)
- baseline -> native
- native -> native & map caching & attr fast path
- baseline -> native & map caching & attr fast path (combination of previous two)
I did some benchmarking on a Raspberry Pi 3, running Linux in 64-bit mode. This is relatively quiet for running tests, max 2% error. The main two results are:
- Baseline (no caching optimisations) -> inline bytecode caching (current master setting for unix port)
- Baseline -> map caching (this PR) + load-attr opt (#7688)

Then, the diff of the above, i.e. the change in performance switching from inline bytecode caching to map caching plus the load-attribute optimisation:
py/map.c
Outdated
@@ -136,6 +156,17 @@ mp_map_elem_t *mp_map_lookup(mp_map_t *map, mp_obj_t index, mp_map_lookup_kind_t
    // If the map is a fixed array then we must only be called for a lookup
    assert(!map->is_fixed || lookup_kind == MP_MAP_LOOKUP);

#if MICROPY_OPT_MAP_LOOKUP_CACHE
    if (lookup_kind == MP_MAP_LOOKUP && map->alloc) {
I think this can apply just as well to `MP_MAP_LOOKUP_ADD_IF_NOT_FOUND`.
Making it `lookup_kind != MP_MAP_LOOKUP_REMOVE_IF_FOUND` gives a nice little boost.
Yep, good thinking. On PYBV11:
$ ./run-perfbench.py -s ~/3.cache-map-pybv11 ~/3.cache-map-add-pybv11
diff of scores (higher is better)
N=100 M=100 /home/jimmo/3.cache-map-pybv11 -> /home/jimmo/3.cache-map-add-pybv11 diff diff% (error%)
bm_asyncio.py 113.73 -> 113.74 : +0.01 = +0.009% (+/-0.01%)
bm_chaos.py 344.82 -> 343.50 : -1.32 = -0.383% (+/-0.00%)
bm_fannkuch.py 78.62 -> 78.62 : +0.00 = +0.000% (+/-0.00%)
bm_fft.py 2593.26 -> 2593.13 : -0.13 = -0.005% (+/-0.00%)
bm_float.py 5460.45 -> 5667.22 : +206.77 = +3.787% (+/-0.01%)
bm_hexiom.py 42.78 -> 42.58 : -0.20 = -0.468% (+/-0.00%)
bm_nqueens.py 4467.91 -> 4464.29 : -3.62 = -0.081% (+/-0.00%)
bm_pidigits.py 648.37 -> 649.28 : +0.91 = +0.140% (+/-0.41%)
misc_aes.py 426.08 -> 425.96 : -0.12 = -0.028% (+/-0.01%)
misc_mandel.py 3621.79 -> 3620.72 : -1.07 = -0.030% (+/-0.01%)
misc_pystone.py 2289.22 -> 2368.03 : +78.81 = +3.443% (+/-0.01%)
misc_raytrace.py 357.00 -> 354.92 : -2.08 = -0.583% (+/-0.01%)
@@ -172,6 +203,7 @@ mp_map_elem_t *mp_map_lookup(mp_map_t *map, mp_obj_t index, mp_map_lookup_kind_t
            elem->value = value;
        }
        #endif
        MAP_CACHE_SET(index, elem - map->table);
Maybe this should be wrapped in `else { ... }`, because if it was a `REMOVE_IF_FOUND` then the cache should not be updated (with a now-removed element).
This doesn't seem to help... I guess the relatively uncommon removes are outweighed by the extra branch on the common path.
$ ./run-perfbench.py -s ~/3.cache-map-add-pybv11 ~/3.cache-map-add-else-pybv11
diff of scores (higher is better)
N=100 M=100 /home/jimmo/3.cache-map-add-pybv11 -> /home/jimmo/3.cache-map-add-else-pybv11 diff diff% (error%)
bm_asyncio.py 113.74 -> 113.72 : -0.02 = -0.018% (+/-0.01%)
bm_chaos.py 343.50 -> 336.44 : -7.06 = -2.055% (+/-0.00%)
bm_fannkuch.py 78.62 -> 78.26 : -0.36 = -0.458% (+/-0.01%)
bm_fft.py 2593.13 -> 2552.65 : -40.48 = -1.561% (+/-0.00%)
bm_float.py 5667.22 -> 5569.22 : -98.00 = -1.729% (+/-0.00%)
bm_hexiom.py 42.58 -> 42.75 : +0.17 = +0.399% (+/-0.00%)
bm_nqueens.py 4464.29 -> 4455.36 : -8.93 = -0.200% (+/-0.00%)
bm_pidigits.py 649.28 -> 650.14 : +0.86 = +0.132% (+/-0.37%)
misc_aes.py 425.96 -> 420.22 : -5.74 = -1.348% (+/-0.01%)
misc_mandel.py 3620.72 -> 3562.43 : -58.29 = -1.610% (+/-0.00%)
misc_pystone.py 2368.03 -> 2342.79 : -25.24 = -1.066% (+/-0.01%)
misc_raytrace.py 354.92 -> 351.44 : -3.48 = -0.981% (+/-0.01%)
This feature originally provided a significant performance boost for Unix, but wasn't able to be enabled for MCU targets, and added significant extra complexity to generating .mpy files. The equivalent performance gain is now provided by MICROPY_OPT_LOAD_ATTR_FAST_PATH and MICROPY_OPT_MAP_LOOKUP_CACHE. See micropython#7680 for discussion. Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
@dpgeorge I have cherry-picked #7688 into this PR and re-arranged the order. The final diff on pybv11 (compared to master):
And on unix (also compared to master, so the baseline includes bytecode caching)
Forgot to add -- this PR is +104 bytes on PYBV11.
py/map.c
Outdated
// objects' locals dicts), and computation of the hash (and potentially some
// linear probing) in the case of a regular map. Note the same cache is
// shared across all maps.
uint8_t map_lookup_cache[MICROPY_OPT_MAP_LOOKUP_CACHE_SIZE];
IMO this should go in `py/mpstate.h`, in `mp_state_vm_t`, outside the root-pointer section of that struct. Reasoning:
- keeps all state localised to that file
- if there were multiple VM instances in the one address space, then I'd argue that each VM should have its own table because they run totally independent code
It shouldn't impact code size or runtime (it should compile to the same machine instructions...).
BTW, would there be any issue with GIL-less multithreading (eg rp2) accessing this table? My guess is no.
Done -- branch updated.
Interestingly this did gain a small performance improvement. Possibly the new location makes it better for the data cache?
$ ./run-perfbench.py -s ~/lookup-global-pybv11 ~/lookup-state-pybv11
diff of scores (higher is better)
N=100 M=100 /home/jimmo/lookup-global-pybv11 -> /home/jimmo/lookup-state-pybv11 diff diff% (error%)
bm_chaos.py 361.52 -> 372.92 : +11.40 = +3.153% (+/-0.00%)
bm_fannkuch.py 78.19 -> 78.97 : +0.78 = +0.998% (+/-0.01%)
bm_fft.py 2571.66 -> 2569.91 : -1.75 = -0.068% (+/-0.00%)
bm_float.py 5991.20 -> 6168.25 : +177.05 = +2.955% (+/-0.01%)
bm_hexiom.py 43.92 -> 44.22 : +0.30 = +0.683% (+/-0.00%)
bm_nqueens.py 4439.64 -> 4444.68 : +5.04 = +0.114% (+/-0.00%)
bm_pidigits.py 647.20 -> 651.21 : +4.01 = +0.620% (+/-0.38%)
misc_aes.py 422.65 -> 425.99 : +3.34 = +0.790% (+/-0.01%)
misc_mandel.py 3559.81 -> 3598.03 : +38.22 = +1.074% (+/-0.01%)
misc_pystone.py 2400.13 -> 2429.40 : +29.27 = +1.220% (+/-0.01%)
misc_raytrace.py 381.46 -> 391.24 : +9.78 = +2.564% (+/-0.01%)
This is nice because the net effect of this PR is now +ve on all tests on PYBV11.
$ ./run-perfbench.py -s ~/3.cache-baseline-pybv11 ~/lookup-state-pybv11
diff of scores (higher is better)
N=100 M=100 /home/jimmo/3.cache-baseline-pybv11 -> /home/jimmo/lookup-state-pybv11 diff diff% (error%)
bm_chaos.py 320.05 -> 372.92 : +52.87 = +16.519% (+/-0.02%)
bm_fannkuch.py 78.96 -> 78.97 : +0.01 = +0.013% (+/-0.01%)
bm_fft.py 2545.01 -> 2569.91 : +24.90 = +0.978% (+/-0.00%)
bm_float.py 5088.98 -> 6168.25 : +1079.27 = +21.208% (+/-0.01%)
bm_hexiom.py 35.64 -> 44.22 : +8.58 = +24.074% (+/-0.00%)
bm_nqueens.py 4194.71 -> 4444.68 : +249.97 = +5.959% (+/-0.00%)
bm_pidigits.py 649.25 -> 651.21 : +1.96 = +0.302% (+/-0.31%)
misc_aes.py 384.51 -> 425.99 : +41.48 = +10.788% (+/-0.01%)
misc_mandel.py 3112.82 -> 3598.03 : +485.21 = +15.587% (+/-0.01%)
misc_pystone.py 1968.33 -> 2429.40 : +461.07 = +23.424% (+/-0.01%)
misc_raytrace.py 318.19 -> 391.24 : +73.05 = +22.958% (+/-0.01%)
And overall a pretty good result compared to bytecode caching
$ ./run-perfbench.py -s ~/cache-bytecode-pybv11 ~/lookup-state-pybv11
diff of scores (higher is better)
N=100 M=100 /home/jimmo/cache-bytecode-pybv11 -> /home/jimmo/lookup-state-pybv11 diff diff% (error%)
bm_chaos.py 358.90 -> 372.92 : +14.02 = +3.906% (+/-0.01%)
bm_fannkuch.py 78.48 -> 78.97 : +0.49 = +0.624% (+/-0.01%)
bm_fft.py 2576.19 -> 2569.91 : -6.28 = -0.244% (+/-0.00%)
bm_float.py 6296.83 -> 6168.25 : -128.58 = -2.042% (+/-0.01%)
bm_hexiom.py 34.45 -> 44.22 : +9.77 = +28.360% (+/-0.00%)
bm_nqueens.py 4069.66 -> 4444.68 : +375.02 = +9.215% (+/-0.00%)
bm_pidigits.py 652.31 -> 651.21 : -1.10 = -0.169% (+/-0.36%)
misc_aes.py 453.34 -> 425.99 : -27.35 = -6.033% (+/-0.01%)
misc_mandel.py 2960.68 -> 3598.03 : +637.35 = +21.527% (+/-0.01%)
misc_pystone.py 2513.17 -> 2429.40 : -83.77 = -3.333% (+/-0.01%)
misc_raytrace.py 372.69 -> 391.24 : +18.55 = +4.977% (+/-0.01%)
Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
This is an alternative to the bytecode caching (which is incompatible with bytecode-in-ROM). Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
Enabled for all variants except minimal. Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
This feature originally provided a significant performance boost for Unix, but wasn't able to be enabled for MCU targets, and added significant extra complexity to generating .mpy files. The equivalent performance gain is now provided by MICROPY_OPT_LOAD_ATTR_FAST_PATH and MICROPY_OPT_MAP_LOOKUP_CACHE. See micropython#7680 for discussion. Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
This commit removes all parts of code associated with the existing MICROPY_OPT_CACHE_MAP_LOOKUP_IN_BYTECODE optimisation option, including the -mcache-lookup-bc option to mpy-cross.

This feature originally provided a significant performance boost for Unix, but wasn't able to be enabled for MCU targets (due to frozen bytecode), and added significant extra complexity to generating and distributing .mpy files.

The equivalent performance gain is now provided by the combination of MICROPY_OPT_LOAD_ATTR_FAST_PATH and MICROPY_OPT_MAP_LOOKUP_CACHE (which has been enabled on the unix port in the previous commit).

It's hard to provide precise performance numbers, but tests have been run on a wide variety of architectures (x86-64, ARM Cortex, Aarch64, RISC-V, xtensa) and they all generally agree on the qualitative improvements seen by the combination of MICROPY_OPT_LOAD_ATTR_FAST_PATH and MICROPY_OPT_MAP_LOOKUP_CACHE.

For example, on a "quiet" Linux x64 environment (i3-5010U @ 2.10GHz) the change from CACHE_MAP_LOOKUP_IN_BYTECODE, to LOAD_ATTR_FAST_PATH combined with MAP_LOOKUP_CACHE is:

diff of scores (higher is better)
N=2000 M=2000 bccache -> attrmapcache diff diff% (error%)
bm_chaos.py 13742.56 -> 13905.67 : +163.11 = +1.187% (+/-3.75%)
bm_fannkuch.py 60.13 -> 61.34 : +1.21 = +2.012% (+/-2.11%)
bm_fft.py 113083.20 -> 114793.68 : +1710.48 = +1.513% (+/-1.57%)
bm_float.py 256552.80 -> 243908.29 : -12644.51 = -4.929% (+/-1.90%)
bm_hexiom.py 521.93 -> 625.41 : +103.48 = +19.826% (+/-0.40%)
bm_nqueens.py 197544.25 -> 217713.12 : +20168.87 = +10.210% (+/-3.01%)
bm_pidigits.py 8072.98 -> 8198.75 : +125.77 = +1.558% (+/-3.22%)
misc_aes.py 17283.45 -> 16480.52 : -802.93 = -4.646% (+/-0.82%)
misc_mandel.py 99083.99 -> 128939.84 : +29855.85 = +30.132% (+/-5.88%)
misc_pystone.py 83860.10 -> 82592.56 : -1267.54 = -1.511% (+/-2.27%)
misc_raytrace.py 21490.40 -> 22227.23 : +736.83 = +3.429% (+/-1.88%)

This shows that the new optimisations are at least as good as the existing inline-bytecode-caching, and are sometimes much better (because the new ones apply caching to a wider variety of map lookups).

The new optimisations can also benefit code generated by the native emitter, because they apply to the runtime rather than the generated code. The improvement for the native emitter when LOAD_ATTR_FAST_PATH and MAP_LOOKUP_CACHE are enabled is (same Linux environment as above):

diff of scores (higher is better)
N=2000 M=2000 native -> nat-attrmapcache diff diff% (error%)
bm_chaos.py 14130.62 -> 15464.68 : +1334.06 = +9.441% (+/-7.11%)
bm_fannkuch.py 74.96 -> 76.16 : +1.20 = +1.601% (+/-1.80%)
bm_fft.py 166682.99 -> 168221.86 : +1538.87 = +0.923% (+/-4.20%)
bm_float.py 233415.23 -> 265524.90 : +32109.67 = +13.756% (+/-2.57%)
bm_hexiom.py 628.59 -> 734.17 : +105.58 = +16.796% (+/-1.39%)
bm_nqueens.py 225418.44 -> 232926.45 : +7508.01 = +3.331% (+/-3.10%)
bm_pidigits.py 6322.00 -> 6379.52 : +57.52 = +0.910% (+/-5.62%)
misc_aes.py 20670.10 -> 27223.18 : +6553.08 = +31.703% (+/-1.56%)
misc_mandel.py 138221.11 -> 152014.01 : +13792.90 = +9.979% (+/-2.46%)
misc_pystone.py 85032.14 -> 105681.44 : +20649.30 = +24.284% (+/-2.25%)
misc_raytrace.py 19800.01 -> 23350.73 : +3550.72 = +17.933% (+/-2.79%)

In summary, compared to MICROPY_OPT_CACHE_MAP_LOOKUP_IN_BYTECODE, the new MICROPY_OPT_LOAD_ATTR_FAST_PATH and MICROPY_OPT_MAP_LOOKUP_CACHE options:
- are simpler;
- take less code size;
- are faster (generally);
- work with code generated by the native emitter;
- can be used on embedded targets with a small and constant RAM overhead;
- allow the same .mpy bytecode to run on all targets.

See #7680 for further discussion. And see also #7653 for a discussion about simplifying mpy-cross options.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
Merged in 7b89ad8 through b326edf. Thanks @jimmo for the fantastic work on this, it's a really significant improvement! As I wrote in the commit message:
These two optional optimizations take a little code space but improve bytecode performance:
* micropython#7680
* micropython#7688
Background: On Unix (which only ever executes bytecode from RAM) we have a feature where the bytecode reserves an extra byte after load attr/global opcodes, which at runtime stores a map index of the likely location of this field. This isn't enabled on any other ports because they execute bytecode from ROM (as well as RAM), and we can't have a different bytecode format for both.
See also #7653 (comment) (reduce the number of options for mpy-cross).
This is a different approach that adds a global cache (keyed by bytecode instruction pointer) to a map offset, then all mp_map_lookup operations use it. It's much less precise than bytecode caching, but allows the cache to be external to the bytecode.
TODO / discuss:
On PYBV11 it's +156 bytes (+128 bytes RAM). Performance tests (in score mode) for PYBV11 and PYBD_SF6 below: