
rp2: Fix recursive atomic sections when core1 is active. #15264

Merged: 1 commit merged into micropython:master on Jun 25, 2024

Conversation

@projectgus (Contributor) commented on Jun 12, 2024

Summary

mp_thread_begin_atomic_section() is expected to be recursive (e.g. for nested machine.disable_irq() calls, or if Python code calls disable_irq() and then the Python runtime calls mp_handle_pending(), which also enters an atomic section to check the scheduler state).

On rp2, when core1 is not in use, atomic sections are already recursive.
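For reference, the single-core path nests naturally because each call simply returns its own saved IRQ state for the matching end call to restore; a simplified sketch of that shape (illustrative names, not the exact port code):

```c
#include "hardware/sync.h"  // pico-sdk: save_and_disable_interrupts(), restore_interrupts()

// Illustrative only: each nested call saves its own IRQ state, so
// begin/end pairs can nest to any depth without extra bookkeeping.
static inline uint32_t example_begin_atomic_section(void) {
    return save_and_disable_interrupts();
}

static inline void example_end_atomic_section(uint32_t state) {
    restore_interrupts(state);
}
```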

However, when core1 was active (i.e. when _thread was in use), a bug caused the core to live-lock if an atomic section recursed.

This change adds a test case specifically for mutual exclusion and recursive atomic sections when using two threads. Without the fix, the test hangs immediately on rp2.

This was found while testing a fix for micropython/micropython-lib#874 (but it's only a partial fix for that issue).

Testing

  • Re-ran the test suite, including the new unit test, on rp2 with this change.
  • Re-ran the test code from the linked issue; it no longer randomly hangs in a live-lock.
  • Also ran the new thread/disable_irq.py test on the esp32 port (via mpremote run, since thread tests are currently disabled on that port) and verified correct output.

Trade-offs and Alternatives

  • recursive_mutex_enter_blocking is also compiled into the firmware, so it might be possible to call save_and_disable_interrupts and then recursive_mutex_enter_blocking in order to save a little code size. Not sure, though (see the sketch after this list).
  • Possibly the scheduler hook shouldn't be called at all when interrupts are disabled. However, this fix would still be needed for the recursive disable_irq() case.
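A rough sketch of what that first alternative could look like, using only stock pico-sdk calls (hypothetical names and structure, not code from this PR):

```c
#include "pico/mutex.h"     // pico-sdk: recursive_mutex_t, recursive_mutex_enter_blocking()
#include "hardware/sync.h"  // pico-sdk: save_and_disable_interrupts(), restore_interrupts()

// Hypothetical alternative: disable IRQs on this core first, then take a
// shared recursive mutex. Nesting still works because the mutex is
// recursive and each begin/end pair keeps its own saved IRQ state.
// (example_atomic_mutex would need recursive_mutex_init() at startup.)
static recursive_mutex_t example_atomic_mutex;

uint32_t example_begin_atomic_section(void) {
    uint32_t state = save_and_disable_interrupts();
    recursive_mutex_enter_blocking(&example_atomic_mutex);
    return state;
}

void example_end_atomic_section(uint32_t state) {
    recursive_mutex_exit(&example_atomic_mutex);
    restore_interrupts(state);
}
```

The trade-off, as discussed further down in this conversation, is that a core blocking on the mutex here keeps its own interrupts disabled for the whole wait.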

This work was funded through GitHub Sponsors.

codecov bot commented Jun 12, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.42%. Comparing base (908ab1c) to head (cfa55b4).

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #15264   +/-   ##
=======================================
  Coverage   98.42%   98.42%           
=======================================
  Files         161      161           
  Lines       21248    21248           
=======================================
  Hits        20914    20914           
  Misses        334      334           


Code size report:

   bare-arm:    +0 +0.000% 
minimal x86:    +0 +0.000% 
   unix x64:    +0 +0.000% standard
      stm32:    +0 +0.000% PYBV10
     mimxrt:    +0 +0.000% TEENSY40
        rp2:   +16 +0.002% RPI_PICO_W
       samd:    +0 +0.000% ADAFRUIT_ITSYBITSY_M4_EXPRESS

@projectgus (Contributor, Author)

Updated so the test is a bit more aggressive about testing the nesting of disable_irq.

@dpgeorge (Member)

Thanks, this looks like a necessary fix. The machine.disable_irq/enable_irq functions should be able to be nested.

> recursive_mutex_enter_blocking is also compiled into the firmware, so it might be possible to call save_and_disable_interrupts and then recursive_mutex_enter_blocking in order to save a little code size. Not sure, though.

I'm pretty sure the reason we needed custom mutex+irq functions is still valid, so we can't separate them again. See dc2a4e3.

@dpgeorge dpgeorge merged commit cfa55b4 into micropython:master Jun 25, 2024
28 checks passed
@projectgus (Contributor, Author)

> I'm pretty sure the reason we needed custom mutex+irq functions is still valid, so we can't separate them again. See dc2a4e3.

That deadlock happens from taking the mutex before disabling interrupts, is that right? I don't think there's a similar deadlock from disabling interrupts before trying to take the mutex, because it's no longer possible for that core to be interrupted at the same point.

I think the main limitation of changing it to two calls is that, if a core is waiting for the mutex, the current code restores interrupts each time around the loop so it doesn't starve interrupts on that core. If we disable interrupts before trying to take the mutex, then interrupts will remain disabled on that core until the mutex is taken. That's probably a small window most of the time, but it could be longer in some cases.
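A rough sketch of the loop being described, for anyone following along (hypothetical names, assuming the pico-sdk recursive_mutex_try_enter API; not the exact code in this PR):

```c
#include "pico/mutex.h"     // pico-sdk: recursive_mutex_t, recursive_mutex_try_enter()
#include "hardware/sync.h"  // pico-sdk: save_and_disable_interrupts(), restore_interrupts()

// Hypothetical sketch: disable IRQs, then try to take the mutex. If another
// core holds it, briefly restore IRQs on this core between attempts so
// pending interrupts aren't starved while waiting.
uint32_t example_begin_atomic_section(recursive_mutex_t *mtx) {
    uint32_t state = save_and_disable_interrupts();
    while (!recursive_mutex_try_enter(mtx, NULL)) {
        restore_interrupts(state);  // let any pending IRQs run on this core
        state = save_and_disable_interrupts();
    }
    return state;  // IRQs disabled and mutex held until the matching end call
}
```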

However, it's a small enough piece of code that keeping it as it is seems like the best course of action. 👍
