[proof of concept] Implement parts of the core in Python bytecode #5025

Open
wants to merge 3 commits into
base: master

dpgeorge
Member

This is a proof of concept to show how it's possible to reimplement parts of the MicroPython core in Python bytecode. So far in this PR the builtin sum(iterable, start=0) function is changed to a pure Python implementation and "hand frozen" into the firmware.
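For reference, the Python source behind such a frozen function could look like the sketch below. The name `py_sum` is hypothetical (chosen here so the builtin is not shadowed), and the exact source used to generate the bytecode in this PR may differ.

```python
def py_sum(iterable, start=0):
    """Pure-Python stand-in for the builtin sum(iterable, start=0)."""
    total = start
    for item in iterable:
        # Use + rather than += so a mutable start value is never modified in place.
        total = total + item
    return total


print(py_sum(range(1000)))    # same argument as the benchmark loop later in this thread
print(py_sum([1, 2, 3], 10))  # positional start value
```

Keyword arguments are deliberately not exercised here, since the description notes that the frozen version does not yet handle them.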

The main reason to do this is to reduce code size while retaining the same functionality. The code size change with this PR is:

   bare-arm:   -16 -0.024% 
minimal x86:   +32 +0.021% 
   unix x64:   -96 -0.019% [incl +32(data)]
unix nanbox:  -172 -0.039% 
      stm32:    -8 -0.002% PYBV10
     cc3200:   -16 -0.009% 
    esp8266:    -4 -0.001% 
      esp32:   -24 -0.002% GENERIC [incl +32(data)]
        nrf:    -4 -0.003% pca10040
       samd:   -12 -0.012% ADAFRUIT_ITSYBITSY_M4_EXPRESS

That's only a small decrease, but this principle would scale to reimplementing a lot of built-in functionality (eg string methods) to get a decent reduction in code size.

Some points to note:

  • eventually there'd be some preprocessing so the functions can be written in Python rather than bytecode by hand
  • in general I'd expect a bytecode implementation to be smaller than the corresponding C implementation, although it might not always be the case
  • in this PR there's some extra code added to the VM to handle traceback (or lack thereof) in builtin bytecode, a cost that only needs to be paid once
  • passing in keyword args to sum() will crash the VM and needs a small check added to fix this (would add a bit of code but only once)
  • the bytecode for sum() was generated with mpy-cross and can actually be optimised down by one byte
  • the prelude for the bytecode is relatively large, 10 bytes compared with 15 bytes for the actual function; it would be possible to optimise the prelude for size which would reduce the size for all bytecode functions, builtin and user defined
  • note that the bytecode for sum() includes a default argument
  • of course performance would be reduced, so ultimately there could be an option for "functions in C for speed" vs "functions in Python for size"
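To put the prelude point in numbers, the sketch below works out what share of the sum() bytecode is prelude, using only the two sizes quoted above (10 bytes of prelude, 15 bytes of function body). The helper name `prelude_share` is made up for this illustration.

```python
def prelude_share(prelude_bytes, body_bytes):
    """Fraction of a bytecode function's total size taken up by its prelude."""
    return prelude_bytes / (prelude_bytes + body_bytes)


PRELUDE_BYTES = 10  # prelude size quoted in the PR description
BODY_BYTES = 15     # size of the actual function bytecode quoted in the PR description

total = PRELUDE_BYTES + BODY_BYTES
share = prelude_share(PRELUDE_BYTES, BODY_BYTES)
print("total: %d bytes, prelude share: %.0f%%" % (total, share * 100))
```

With these figures the prelude accounts for 10 of 25 bytes, which is why shrinking it would pay off for every small bytecode function.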

@stinos
Contributor

stinos commented Aug 20, 2019

This is an interesting concept. Did you happen to measure performance differences?

@dpgeorge
Member Author

> Did you happen to measure performance differences?

No, didn't get to that yet. I guess it wouldn't be too hard to do it for sum, just passing it in various things.

@dpgeorge
Member Author

I did some benchmarking, running a simple loop like this:

def test():
    x = range(1000)
    for _ in range(300):
        sum(x)

Running on a PYBD-SF2 @ 120MHz I get:

diff of scores (higher is better)
N=100 M=100                   sum_c -> sum_bytecode         diff      diff% (error%)
builtin_sum_range.py        1023.28 ->     512.16 :    -511.12 = -49.949% (+/-0.14%)
builtin_sum_list.py         1106.45 ->     519.57 :    -586.88 = -53.042% (+/-0.20%)

So bytecode is half the speed of the existing C implementation.
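A rough way to time the same loop by hand is sketched below. It is not the harness that produced the table above: the `score` helper and its "runs per second" definition are assumptions for illustration. It uses `time.perf_counter()`, which is available on CPython; on a MicroPython board the usual counterparts are `time.ticks_us()` and `time.ticks_diff()`.

```python
import time


def test(repeats=300, n=1000):
    # Same shape as the benchmark loop above: sum a range object repeatedly.
    x = range(n)
    for _ in range(repeats):
        sum(x)


def score(func, runs=5):
    """Return calls per second based on the fastest of `runs` timed calls."""
    best = float("inf")
    for _ in range(runs):
        t0 = time.perf_counter()
        func()
        elapsed = time.perf_counter() - t0
        best = min(best, elapsed)
    # Guard against a zero reading on very coarse timers.
    return 1.0 / max(best, 1e-9)


print("score (higher is better): %.2f" % score(test))
```

Taking the fastest of several runs reduces noise from unrelated activity; the absolute number is only meaningful when comparing two builds on the same hardware.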

Then I used mpy-cross to compile the Python implementation of sum to native machine code, and put that in instead of the frozen bytecode. The results for the benchmark of this native code vs C were:

diff of scores (higher is better)
N=100 M=100                   sum_c -> sum_native         diff      diff% (error%)
builtin_sum_range.py        1023.28 ->     942.19 :     -81.09 =  -7.925% (+/-0.29%)
builtin_sum_list.py         1106.45 ->     998.18 :    -108.27 =  -9.785% (+/-0.22%)

That's not too bad, the native code generator is close to the C version!
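The diff% columns in both tables follow directly from the two scores on each row. The sketch below recomputes them from the numbers posted above; the function name `diff_percent` is chosen for this illustration only.

```python
def diff_percent(baseline, candidate):
    """Relative change of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# (benchmark, variant, C score, variant score), copied from the tables above.
rows = [
    ("builtin_sum_range.py", "bytecode", 1023.28, 512.16),
    ("builtin_sum_list.py", "bytecode", 1106.45, 519.57),
    ("builtin_sum_range.py", "native", 1023.28, 942.19),
    ("builtin_sum_list.py", "native", 1106.45, 998.18),
]

for name, variant, c_score, other in rows:
    print("%-22s %-8s %8.3f%%" % (name, variant, diff_percent(c_score, other)))
```

Since higher scores are better, a diff of about -50% means the bytecode version takes roughly twice as long per call, while the native version is within about 10% of C.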

See latest commit for the native code blob, for both x86-64 and Thumb2. The size of these blobs is a bit bigger than the C version, but with a bit of optimisation in the native emitter these blobs could become a bit smaller and faster, possibly getting close to C.
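For experimenting on a board without touching the firmware, a related but different route to native code is MicroPython's `@micropython.native` decorator, which asks the on-device compiler to emit machine code for a single function. This is not what the PR does (it embeds a blob pre-compiled by mpy-cross); the sketch below is only a way to try the same idea. The fallback decorator and the name `native_sum` are assumptions so that the snippet also runs on CPython.

```python
try:
    import micropython
    native = micropython.native
except (ImportError, AttributeError):
    # Not running on MicroPython (or native emitter unavailable): use a no-op decorator.
    def native(func):
        return func


@native
def native_sum(iterable, start=0):
    total = start
    for item in iterable:
        total = total + item
    return total


print(native_sum(range(1000)))
```

On CPython the decorator does nothing, so any speed comparison only makes sense when run on a MicroPython port with the native emitter enabled.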


There is a lot of scope here for further work. As shown, it is possible to reimplement parts of the core in Python, which is compiled to either bytecode (small but slow) or native machine code (fast but bigger, and potentially on par with the existing C implementation). A benefit of writing in Python rather than C is that the underlying architecture of the system (of MicroPython, the VM, etc) is hidden. For example, whether exceptions are implemented using NLR or simple return codes is irrelevant to the Python implementation of a function (eg sum()), and the underlying architecture can be more easily changed.
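The exception point can be illustrated with a small sketch. The Python version of the function contains no error-handling code at all, yet an exception raised by the iterator still propagates to the caller; how the runtime implements that propagation is invisible at this level. The names `py_sum` and `failing_items` are hypothetical.

```python
def py_sum(iterable, start=0):
    # No explicit error handling: exception propagation is left to the runtime.
    total = start
    for item in iterable:
        total = total + item
    return total


def failing_items():
    yield 1
    yield 2
    raise ValueError("iterator failed")


try:
    py_sum(failing_items())
    caught = None
except ValueError as exc:
    caught = str(exc)

print("caught:", caught)
```

A C implementation, by contrast, has to be written against whichever mechanism the core uses, so changing that mechanism means revisiting such code.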

This is very meta and gets into the territory of PyPy (MicroPyPy!), where the interpreter is written in RPython, a restricted subset of Python itself.

@nevercast
Contributor

This is a very cool demonstration Damien. I'd like to see arguments, eventually, to both the make process for building MicroPython, and to mpy-cross to favour speed vs. size, that would set a MACRO that we can use throughout the code base to prefer for example Python vs C, like you have here.

Perhaps an option to favour less heap or more heap too?

@dpgeorge
Member Author

> I'd like to see arguments, eventually, to both the make process for building MicroPython, and to mpy-cross to favour speed vs. size, that would set a MACRO that we can use throughout the code base to prefer for example Python vs C, like you have here.

Yes, that makes sense, speed vs size (like -Os vs -O2).

> Perhaps an option to favour less heap or more heap too?

It might be tricky to have that option orthogonal to speed-vs-size. In general the code always attempts to reduce heap usage.

@projectgus
Contributor

This is an automated heads-up that we've just merged a Pull Request
that removes the STATIC macro from MicroPython's C API.

See #13763

A search suggests this PR might apply the STATIC macro to some C code. If it
does, then the next time you rebase the PR (or merge from master), please
replace all the STATIC keywords with static.

Although this is an automated message, feel free to @-reply to me directly if
you have any questions about this.
