
py/modmicropython: Add micropython.memmove() and micropython.memset(). #12487

Draft: wants to merge 1 commit into master
Conversation

@projectgus (Contributor) commented Sep 20, 2023

This was based on a discussion about providing a more optimal way to copy data between buffers. However, based on benchmarks so far, it seems like it might not be worth it compared to optimising the "copy to/from slice" code paths written in idiomatic Python.

Summary

Adds two functions to the micropython module, gated behind a new config option:

  • micropython.memmove(dest, dest_idx, src, src_idx, [len]) - an optimised equivalent of dest[dest_idx:dest_idx+len] = src[src_idx:src_idx+len]. Copies memory contents with the semantics of C memmove, hence the name. The len argument is optional; it defaults to the minimum of the lengths of the source and destination regions.
  • micropython.memset(dest, dest_idx=0, c=0, len=len(dest)-dest_idx) - an optimised equivalent of dest[dest_idx:] = bytes([c]*len). Modelled on C's memset.

Unlike assigning to a slice, the destination buffer size never changes as a result of calling either of these functions. Out of bounds assignment raises an exception.
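For reference, the proposed semantics can be sketched as a pure-Python reference implementation (hypothetical; the real functions are implemented in C, and the parameter name length stands in for len to avoid shadowing the builtin):

```python
def memmove(dest, dest_idx, src, src_idx, length=None):
    # Default length: minimum of the remaining source and destination regions.
    if length is None:
        length = min(len(dest) - dest_idx, len(src) - src_idx)
    # Out-of-bounds access raises instead of resizing the destination.
    if dest_idx + length > len(dest) or src_idx + length > len(src):
        raise ValueError("copy out of bounds")
    # bytes() snapshots the source first, giving memmove-style overlap safety.
    dest[dest_idx:dest_idx + length] = bytes(src[src_idx:src_idx + length])

def memset(dest, dest_idx=0, c=0, length=None):
    # Default length: from dest_idx to the end of the destination buffer.
    if length is None:
        length = len(dest) - dest_idx
    if dest_idx + length > len(dest):
        raise ValueError("fill out of bounds")
    dest[dest_idx:dest_idx + length] = bytes([c]) * length

buf = bytearray(8)
memmove(buf, 2, b"abcdef", 1, 3)   # buf[2:5] now holds b"bcd"
memset(buf, 6, 0xFF)               # buf[6:] now filled with 0xFF
```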

Benchmarks - memmove

Comparing memmove to current MicroPython "best practices" (unix port, i5-1248P CPU):

❯ ./run-internalbench.py --user-time internal_bench/slice_copy*.py
internal_bench/slice_copy:
    1.143s (+00.00%) internal_bench/slice_copy-1-lvalue_start.py
    1.248s (+09.19%) internal_bench/slice_copy-2-lvalue_start_end.py
    2.607s (+128.08%) internal_bench/slice_copy-3-lvalue_rvalue_start.py
    1.144s (+00.09%) internal_bench/slice_copy-4-lvalue_memoryview.py
    2.047s (+79.09%) internal_bench/slice_copy-5-lvalue_rvalue_memoryview.py
    1.072s (-06.21%) internal_bench/slice_copy-6-memmove.py
1 tests performed (6 individual testcases)

Honestly I found this a little underwhelming! Admittedly, slice_copy-6-memmove.py can do the equivalent of slice_copy-5-lvalue_rvalue_memoryview.py (slices on both sides of the assignment) and is almost twice as fast. But it's only twice as fast, in a tight loop that does nothing else, working with pretty short buffers.

Maybe the C implementation of memmove() needs some tweaks to streamline the error checking 🤷 .
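For context, the two fastest variants being compared look roughly like this (a sketch of the patterns, not the exact benchmark bodies in internal_bench/):

```python
src = bytearray(range(32))
dest = bytearray(32)

# slice_copy-5 style: memoryview slices on both sides of the assignment,
# avoiding intermediate bytes copies but still allocating slice objects
# on every copy.
mv_dest = memoryview(dest)
mv_src = memoryview(src)
mv_dest[4:12] = mv_src[8:16]

# slice_copy-6 style (proposed) would express the same copy as:
#   micropython.memmove(dest, 4, src, 8, 8)
# with no per-call slice allocations.
```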

When rebased against PR #10160 things get even closer:

❯ ./run-internalbench.py --user-time internal_bench/slice_copy*.py
internal_bench/slice_copy:
    0.969s (+00.00%) internal_bench/slice_copy-1-lvalue_start.py
    1.062s (+09.60%) internal_bench/slice_copy-2-lvalue_start_end.py
    2.218s (+128.90%) internal_bench/slice_copy-3-lvalue_rvalue_start.py
    0.968s (-00.10%) internal_bench/slice_copy-4-lvalue_memoryview.py
    1.743s (+79.88%) internal_bench/slice_copy-5-lvalue_rvalue_memoryview.py
    1.092s (+12.69%) internal_bench/slice_copy-6-memmove.py
1 tests performed (6 individual testcases)

Now slice_copy-6-memmove.py is only 1.6x faster than slice_copy-5-lvalue_rvalue_memoryview.py, and no faster than assigning a buffer to an lvalue slice...

Benchmarks - memset

❯ ./run-internalbench.py internal_bench/set_bytearray*
internal_bench/set_bytearray:
    10.591s (+00.00%) internal_bench/set_bytearray-1-naive.py
    9.405s (-11.20%) internal_bench/set_bytearray-2-naive-while.py
    1.110s (-89.52%) internal_bench/set_bytearray-3-copy_bytes.py
    0.935s (-91.17%) internal_bench/set_bytearray-4-memset.py
1 tests performed (4 individual testcases)

It's much the same story with memset: writing out a bytes object (which can be frozen to flash) is basically as fast as using the memset() function. The naive versions are a lot slower, though!
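The patterns compared above are, roughly (a sketch; the exact benchmark bodies live in internal_bench/):

```python
buf = bytearray(64)

# set_bytearray-1 style: naive per-byte loop (slowest by far).
for i in range(len(buf)):
    buf[i] = 0x55

# set_bytearray-3 style: assign a pre-built bytes object to a slice.
# The bytes object can be frozen to flash, and this is nearly as fast
# as the proposed memset().
fill = b"\x55" * 64
buf[:] = fill

# set_bytearray-4 style (proposed) would be:
#   micropython.memset(buf, 0, 0x55)
```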

Disclaimer: the new test file names take some liberties with the meaning of lvalue and rvalue; happy to take suggestions for more accurate terms to use.

This work was funded through GitHub Sponsors.

github-actions bot commented Sep 20, 2023

Code size report:

   bare-arm:    +0 +0.000% 
minimal x86:    +0 +0.000% 
   unix x64:  +968 +0.121% standard[incl +96(data)]
      stm32:  +348 +0.089% PYBV10
     mimxrt:  +328 +0.091% TEENSY40
        rp2:  +376 +0.115% RPI_PICO
       samd:  +348 +0.134% ADAFRUIT_ITSYBITSY_M4_EXPRESS

codecov bot commented Sep 20, 2023

Codecov Report

Merging #12487 (b55ac53) into master (00930b2) will decrease coverage by 0.20%.
Report is 2 commits behind head on master.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##           master   #12487      +/-   ##
==========================================
- Coverage   98.38%   98.18%   -0.20%     
==========================================
  Files         158      158              
  Lines       20940    20981      +41     
==========================================
- Hits        20602    20601       -1     
- Misses        338      380      +42     
Files Changed Coverage Δ
py/modmicropython.c 51.68% <0.00%> (-48.32%) ⬇️

... and 1 file with indirect coverage changes

This was based on a discussion about providing a more optimal way to
copy data between buffers. However, based on the benchmarking so far,
it seems like it might not be worth the overhead.

Signed-off-by: Angus Gratton <angus@redyak.com.au>
projectgus commented Sep 20, 2023

Now slice_copy-6-memmove.py is only 1.6x faster than slice_copy-5-lvalue_rvalue_memoryview.py, and no faster than assigning a buffer to an lvalue slice...

After applying the thread-local slice optimisation and poking around with flamegraph and gdb, all of the time is spent allocating the "load" side slice when the LOAD_SUBSCR opcode is executed. This ends up at objarray.c:531, which needs to heap-allocate another mp_obj_array_t (memoryview copy) in order to hold the slice parameters wrapped around the underlying memoryview.

This is another very short-lived heap allocation, but it looks like it would be much harder to optimise than the thread-local slice case.
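The allocation is easy to see from Python: each load-side slice of a memoryview produces a fresh memoryview object just to carry the offset and length. (CPython behaves the same way here, which makes it a convenient illustration.)

```python
data = bytearray(16)
mv = memoryview(data)

# Slicing on the load side allocates a new memoryview wrapper each time;
# in MicroPython this is the mp_obj_array_t allocated at objarray.c:531.
s1 = mv[2:6]
s2 = mv[2:6]
assert s1 is not s2   # two distinct short-lived objects for the same view
```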

@projectgus

This is an automated heads-up that we've just merged a Pull Request
that removes the STATIC macro from MicroPython's C API.

See #13763

A search suggests this PR might apply the STATIC macro to some C code. If it
does, then next time you rebase the PR (or merge from master), please
replace all the STATIC keywords with static.

Although this is an automated message, feel free to @-reply to me directly if
you have any questions about this.
