
rp2: Fix recursive atomic sections when core1 is active. #15264

Merged: 1 commit merged into micropython:master on Jun 25, 2024

Conversation

@projectgus (Contributor) commented on Jun 12, 2024

Summary

mp_thread_begin_atomic_section() is expected to be recursive (e.g. for nested machine.disable_irq() calls, or if Python code calls disable_irq() and then the Python runtime calls mp_handle_pending(), which also enters an atomic section to check the scheduler state).

On rp2, when core1 is not in use, atomic sections are already recursive.
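For reference, the single-core path nests naturally because each call simply returns its own saved IRQ state for the matching end call to restore; a simplified sketch of that shape (illustrative names, not the exact port code):

```c
#include "hardware/sync.h"  // pico-sdk: save_and_disable_interrupts(), restore_interrupts()

// Illustrative only: each nested call saves its own IRQ state, so
// begin/end pairs can nest to any depth without extra bookkeeping.
static inline uint32_t example_begin_atomic_section(void) {
    return save_and_disable_interrupts();
}

static inline void example_end_atomic_section(uint32_t state) {
    restore_interrupts(state);
}
```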

However, when core1 was active (i.e. when _thread was in use), a bug caused the core to live-lock if an atomic section recursed.

This change adds a test case specifically for mutual exclusion and recursive atomic sections when using two threads. Without the fix, the test hangs immediately on rp2.

This was found while testing a fix for micropython/micropython-lib#874 (but it's only a partial fix for that issue).

Testing

  • Re-ran the test suite, including the new unit test, on rp2 with this change.
  • Re-ran the test code from the linked issue; it no longer randomly hangs in a live-lock.
  • Also ran the new thread/disable_irq.py test on the esp32 port (via mpremote run, since thread tests are currently disabled on that port) and verified correct output.

Trade-offs and Alternatives

  • recursive_mutex_enter_blocking is also compiled into the firmware, so it might be possible to call save_and_disable_interrupts and then recursive_mutex_enter_blocking in order to save a little code size. Not sure, though (see the sketch after this list).
  • Possibly the scheduler hook shouldn't be called at all when interrupts are disabled. However, this fix would still be needed for the recursive disable_irq() case.
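A rough sketch of what that first alternative could look like, using only stock pico-sdk calls (hypothetical names and structure, not code from this PR):

```c
#include "pico/mutex.h"     // pico-sdk: recursive_mutex_t, recursive_mutex_enter_blocking()
#include "hardware/sync.h"  // pico-sdk: save_and_disable_interrupts(), restore_interrupts()

// Hypothetical alternative: disable IRQs on this core first, then take a
// shared recursive mutex. Nesting still works because the mutex is
// recursive and each begin/end pair keeps its own saved IRQ state.
// (example_atomic_mutex would need recursive_mutex_init() at startup.)
static recursive_mutex_t example_atomic_mutex;

uint32_t example_begin_atomic_section(void) {
    uint32_t state = save_and_disable_interrupts();
    recursive_mutex_enter_blocking(&example_atomic_mutex);
    return state;
}

void example_end_atomic_section(uint32_t state) {
    recursive_mutex_exit(&example_atomic_mutex);
    restore_interrupts(state);
}
```

The trade-off, as discussed further down in this conversation, is that a core blocking on the mutex here keeps its own interrupts disabled for the whole wait.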

This work was funded through GitHub Sponsors.

codecov bot commented Jun 12, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.42%. Comparing base (908ab1c) to head (cfa55b4).

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #15264   +/-   ##
=======================================
  Coverage   98.42%   98.42%           
=======================================
  Files         161      161           
  Lines       21248    21248           
=======================================
  Hits        20914    20914           
  Misses        334      334           


Code size report:

   bare-arm:    +0 +0.000% 
minimal x86:    +0 +0.000% 
   unix x64:    +0 +0.000% standard
      stm32:    +0 +0.000% PYBV10
     mimxrt:    +0 +0.000% TEENSY40
        rp2:   +16 +0.002% RPI_PICO_W
       samd:    +0 +0.000% ADAFRUIT_ITSYBITSY_M4_EXPRESS

@projectgus (Contributor, Author)

Updated so the test is a bit more aggressive about testing the nesting of disable_irq.

@dpgeorge (Member)

Thanks, this looks like a necessary fix. The machine.disable_irq/enable_irq functions should be able to be nested.

> recursive_mutex_enter_blocking is also compiled into the firmware, so it might be possible to call save_and_disable_interrupts and then recursive_mutex_enter_blocking in order to save a little code size. Not sure, though.

I'm pretty sure the reason we needed custom mutex+irq functions is still valid, so we can't separate them again. See dc2a4e3.

@dpgeorge dpgeorge merged commit cfa55b4 into micropython:master Jun 25, 2024
28 checks passed
@projectgus (Contributor, Author)

> I'm pretty sure the reason we needed custom mutex+irq functions is still valid, so we can't separate them again. See dc2a4e3.

That deadlock happens from taking the mutex before disabling interrupts, is that right? I don't think there's a similar deadlock from disabling interrupts before trying to take the mutex, because it's no longer possible for that core to be interrupted at the same point.

I think the main limitation of changing it to two calls is that, if a core is waiting for the mutex, the current code restores interrupts each time around the loop so it doesn't starve interrupts on that core. If we disable interrupts before trying to take the mutex, then interrupts will remain disabled on that core until the mutex is taken. That's probably a small window most of the time, but it could be longer in some cases.
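A rough sketch of the loop being described, for anyone following along (hypothetical names, assuming the pico-sdk recursive_mutex_try_enter API; not the exact code in this PR):

```c
#include "pico/mutex.h"     // pico-sdk: recursive_mutex_t, recursive_mutex_try_enter()
#include "hardware/sync.h"  // pico-sdk: save_and_disable_interrupts(), restore_interrupts()

// Hypothetical sketch: disable IRQs, then try to take the mutex. If another
// core holds it, briefly restore IRQs on this core between attempts so
// pending interrupts aren't starved while waiting.
uint32_t example_begin_atomic_section(recursive_mutex_t *mtx) {
    uint32_t state = save_and_disable_interrupts();
    while (!recursive_mutex_try_enter(mtx, NULL)) {
        restore_interrupts(state);  // let any pending IRQs run on this core
        state = save_and_disable_interrupts();
    }
    return state;  // IRQs disabled and mutex held until the matching end call
}
```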

However, it's a small enough piece of code that keeping it as it is seems like the best course of action. 👍
