[fiber] Add more concurrency classes #1026

Draft · wants to merge 8 commits into develop from feature/fiber_concurrency

Conversation

@salkinium (Member) commented May 20, 2023

Fibers need a standard set of efficient concurrency constructs.
I modelled these classes after the C++ thread concurrency interfaces. Not sure if that is the most useful design, but there is probably not much room for improvement anyway.

  • Implement <atomics> for AVR
  • Bring back <mutex> and <shared_mutex> for AVR
  • Add atomic, mutex and shared_mutex headers back avr-libstdcpp#34
  • modm::fiber::sleep -> modm::this_fiber::sleep_for
  • modm::this_fiber::sleep_until
  • modm::this_fiber::get_id
  • Align modm::fiber::Task to the std::jthread interface, incl. stop_token (see the std::jthread sketch after this list).
  • stop_token and stop_source (simplified)
  • mutex
  • timed_mutex
  • recursive_mutex
  • timed_recursive_mutex
  • counting_semaphore (=binary_semaphore)
  • shared_mutex
  • latch
  • barrier
  • condition_variable
  • unit tests
  • More documentation
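
For reference, a minimal sketch of the std::jthread interface that modm::fiber::Task is being aligned to (standard C++20, not modm code):

```cpp
#include <chrono>
#include <stop_token>
#include <thread>

int main()
{
    // jthread passes a stop_token to its callable and automatically
    // requests a stop and joins in its destructor.
    std::jthread worker([](std::stop_token token)
    {
        while (not token.stop_requested())
            std::this_thread::sleep_for(std::chrono::milliseconds(10));
    });

    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    worker.request_stop();  // cooperative cancellation, also implicit in ~jthread()
}
```

The fiber versions would substitute modm::this_fiber::sleep_for and the simplified stop_token/stop_source from this PR.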

@salkinium added this to the 2023q2 milestone on May 20, 2023
@salkinium removed this from the 2023q2 milestone on Jul 10, 2023
@salkinium force-pushed the feature/fiber_concurrency branch 2 times, most recently from 97c4cb4 to d9e13d0 on April 14, 2024
@salkinium added this to the 2024q2 milestone on Apr 14, 2024
@salkinium force-pushed the feature/fiber_concurrency branch 3 times, most recently from 3526783 to 3f372b6 on April 14, 2024
@salkinium (Member, Author) commented Apr 14, 2024

It's still missing unit tests for latch, barrier, and condition_variable and a bunch of documentation; however, it would be amazing if this could get reviewed with a special focus on C++ "conformity".
This implementation should be interrupt-safe (where that makes sense for the functions), so that we can use them in the HAL as signalling primitives for operation completion in the future.

I also noticed that avr-libstdcpp does not have <atomic>; I assume it was really annoying to port, @chris-durand? If the issue is the builtin GCC atomics not being implemented for avr-gcc, we can reuse the Cortex-M0 ones now.

@chris-durand (Member) commented:

> I also noticed that avr-libstdcpp does not have <atomic>; I assume it was really annoying to port, @chris-durand? If the issue is the builtin GCC atomics not being implemented for avr-gcc, we can reuse the Cortex-M0 ones now.

I haven't tried, to be honest, but I expect that the same solution as for Cortex-M0 will work. The header is mostly implemented using compiler builtins. I'd expect it to fall back to calling a library function in case the operation is not atomic on AVR. If you are lucky, just dropping in the headers and compiling atomics_c11_cortex.cpp.in for AVR could be sufficient.

C++20 atomic wait, which requires OS support, shouldn't be an issue because it is only enabled when _GLIBCXX_HAS_GTHREADS is set.
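
For illustration, a minimal sketch of what such a library fallback could look like on AVR, assuming interrupts are the only source of concurrency (the function name follows the libatomic ABI; the body is an assumption, not the actual atomics_c11_cortex.cpp.in implementation):

```cpp
#include <avr/interrupt.h>
#include <stdint.h>

// 32-bit fetch-add fallback: AVR cannot do this atomically in one
// instruction, so the read-modify-write runs with interrupts disabled.
extern "C" uint32_t
__atomic_fetch_add_4(volatile void* ptr, uint32_t value, int /*memorder*/)
{
    auto* object = static_cast<volatile uint32_t*>(ptr);
    const uint8_t sreg = SREG;  // save the global interrupt flag
    cli();                      // enter the critical section
    const uint32_t previous = *object;
    *object = previous + value;
    SREG = sreg;                // restore the interrupt flag
    return previous;
}
```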

@chris-durand (Member) commented:

I'll take a look at the PR over the weekend.

@salkinium force-pushed the feature/fiber_concurrency branch 2 times, most recently from c8575fb to 377553c on April 20, 2024
@salkinium (Member, Author) commented:

> If you are lucky, just dropping in the headers and compiling atomics_c11_cortex.cpp.in for AVR could be sufficient.

Yup, that just works, thanks!

@salkinium force-pushed the feature/fiber_concurrency branch 2 times, most recently from 0d3fa53 to 6803ce0 on April 21, 2024
@chris-durand (Member) left a review:

Very nice!

I haven't managed to look at all the synchronization primitives, but will do tomorrow.

Review threads (outdated, resolved) on: src/modm/processing/fiber/functions.hpp (×3), barrier.hpp, mutex.hpp, shared_mutex.hpp (×5)
@salkinium force-pushed the feature/fiber_concurrency branch 2 times, most recently from b3d3dee to 48ff96d on May 1, 2024
- modm::fiber -> modm::this_fiber.
- modm::fiber::sleep() -> modm::this_fiber::sleep_for().
@salkinium force-pushed the feature/fiber_concurrency branch 5 times, most recently from d9b837a to 72ab66d on May 6, 2024
Review threads (outdated, resolved) on: src/modm/processing/fiber/barrier.hpp (×2)
```cpp
void inline
notify_one()
{
    sequence.fetch_add(1, std::memory_order_acquire);
}
```
@chris-durand (Member) commented:


Shouldn't this be a release operation? Some store happens before the notify and it shouldn't be reordered behind it. On the consumer side, which is the wait, we need an acquire load or barrier such that no memory access is reordered before the wait is complete.

An acquire operation always has to be paired with a matching release. Otherwise it doesn't provide any ordering guarantees.

Actually, is there any use case where this code would ever have to deal with more than one thread on a multi-core CPU or interrupts running on different cores? I assume not, because the reason for fibers is to have a single-threaded execution model.
In that case we don't even need the actual memory barrier instructions to be emitted. You could use relaxed order on the atomic operation plus a compiler fence (std::atomic_signal_fence). The only thing that can reorder operations within a single thread is the compiler. Hardware barriers on atomics are only ever needed if you have multithreading and multiple cores with their own caches and view on memory.

I don't really know how expensive a few additional dmb instructions are on Cortex-M and whether it would be worth caring about. Anyway, I'm not sure any of the fiber code would work in the presence of multithreading.
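
A minimal sketch of the suggested pairing, modelled on the snippet above (the wait-loop body is a placeholder for whatever the fiber scheduler provides):

```cpp
#include <atomic>
#include <cstdint>

std::atomic<uint32_t> sequence{0};

void notify_one()
{
    // Release: all writes made before the notify become visible to a
    // consumer that observes the incremented sequence value.
    sequence.fetch_add(1, std::memory_order_release);
}

void wait(uint32_t old)
{
    // Acquire: reads after the wait cannot be reordered before it, so
    // the waiter sees everything the notifier published.
    while (sequence.load(std::memory_order_acquire) == old)
    {
        // yield to other fibers here
    }
}
```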

@salkinium (Member, Author) replied:


> Actually, is there any use case where this code would ever have to deal with more than one thread on a multi-core CPU or interrupts running on different cores? I assume not, because the reason for fibers is to have a single-threaded execution model.

Well… my secret plan is to allow running one fiber scheduler per (FreeRTOS) thread and then do some kind of message passing between threads like Qt does with their event loops. I also want to support the dual-core H745 in the future.

According to the Programming Guide to Memory Barrier Instructions on Cortex-M0 to Cortex-M4, all (internal) memory accesses are always in-order, so DMBs are not required. However, this is only an implementation choice; the architecture explicitly allows reordering of memory accesses!

And indeed, there seem to be several caveats for single-core devices:

  1. There is no guarantee of ordering on external memories, e.g. SDRAM connected over FMC. It's up to the implementation to correctly implement DMB; they even go as far as saying that you should do a dummy read of something you wrote to enforce the write order. For example, on STM32F4, the FMC has several FIFOs:
  • Write Data FIFO with 16 × 33-bit depth

  • Write Address FIFO with 16 × 30-bit depth

  • Cacheable Read FIFO with 6 × 32-bit depth (6 × 14-bit address tag) for the SDRAM controller.

    However, it's unclear to me whether these FIFOs are in-order relative to each other. (I would expect it, but who knows?)

  2. The document does not talk about the Cortex-M7, which has a 64-bit internal AXI bus that seems to allow reordering of memory accesses. Specifically, it can do speculative reads at any time. So I think you still need DMBs on a single-core Cortex-M7.

I would just leave the DMBs in the code for now. For Cortex-M0(+), I implemented atomics without DMBs anyway, so the optimization of leaving out the DMBs would only apply to CM3/CM4 without external memories.

> I don't really know how expensive a few additional dmb instructions are on Cortex-M and whether it would be worth caring about.

I can do a comparison of DWT->LSUCNT (cycles spent waiting for loads and stores to complete) with and without DMBs once I have a representative example, to get an idea of the overhead of "unnecessary" DMBs in the code.
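
A sketch of that measurement, assuming a CMSIS device header is included (note that LSUCNT is only 8 bits wide, so only short sections can be measured):

```cpp
#include <stdint.h>
// assumes a CMSIS device header (e.g. for an STM32) is included

void lsu_counter_start()
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  // enable the DWT unit
    DWT->CTRL |= DWT_CTRL_LSUEVTENA_Msk;             // count load/store wait cycles
    DWT->LSUCNT = 0;
}

uint8_t lsu_counter_read()
{
    // cycles spent waiting on loads/stores since start, modulo 256
    return DWT->LSUCNT & 0xFF;
}
```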

@chris-durand (Member) replied:


> And indeed, there seem to be several caveats for single-core devices:
>
>   1. There is no guarantee of ordering on external memories
>   2. The document does not talk about the Cortex-M7, which has a 64-bit internal AXI bus that seems to allow reordering of memory accesses

And none of those should matter when there is only one core. The problem with multi-core CPUs is that you can observe modifications to multiple memory locations in a different order across cores.

Let's say you have a single-producer single-consumer atomic queue: you write some data to the end and increment the last-element pointer. What can happen is that the pointer modification becomes visible before the actual data change, so you would read invalid values on the consumer thread because they are not visible yet. Acquire-release ordering on the pointer solves that issue.
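
For illustration, a minimal single-producer single-consumer ring buffer sketch (not modm code) showing that acquire/release pairing on the index:

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

template<typename T, std::size_t N>
class SpscQueue
{
    std::array<T, N> buffer{};
    std::atomic<std::size_t> head{0};  // written only by the producer
    std::atomic<std::size_t> tail{0};  // written only by the consumer

public:
    bool push(const T& value)
    {
        const auto h = head.load(std::memory_order_relaxed);
        if ((h + 1) % N == tail.load(std::memory_order_acquire))
            return false;   // queue is full
        buffer[h] = value;  // write the data first...
        // ...then publish it: the release guarantees the data write is
        // visible before the new head value can be observed.
        head.store((h + 1) % N, std::memory_order_release);
        return true;
    }

    std::optional<T> pop()
    {
        const auto t = tail.load(std::memory_order_relaxed);
        if (t == head.load(std::memory_order_acquire))
            return std::nullopt;  // empty: nothing published yet
        T value = buffer[t];      // safe: paired with the release above
        tail.store((t + 1) % N, std::memory_order_release);
        return value;
    }
};
```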

If you are on a single-core machine, this is impossible by definition. Writing a variable and reading back the value must give the same result; there is nothing to be out of order with. This is just single-threaded programming, and I can't see what the dmbs could accomplish unless you are doing DMA or other similar things where peripherals interact with the memory directly. But that's the peripheral driver's job and out of scope for atomics.

The only things we need from the atomic are atomicity and a compiler optimization barrier stopping it from doing the same reorderings that a multi-core system could do.

As a side note, that is why the volatile-based modm interrupt "atomic" queue is broken. The compiler is free to reorder non-volatile accesses around volatile ones, and gcc is known to do that (https://blog.regehr.org/archives/28, item 5). We should fix it.
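
A minimal sketch of the hazard (hypothetical names):

```cpp
int data;             // plain, non-volatile payload
volatile bool ready;  // volatile "flag"

void producer_irq()
{
    data = 42;     // plain store: the compiler may sink it below...
    ready = true;  // ...this volatile store, publishing before the data exists
}
```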

> For Cortex-M0(+), I implemented atomics without DMBs anyway

For plain atomic loads and stores the compiler will still emit ldr/str plus dmb (https://gcc.godbolt.org/z/Enb6179Pb) instead of turning off interrupts. The fallback implementation is only used for read-modify-write operations the M0 can't do atomically due to the lack of ldrex and strex.

> I can do a comparison of DWT->LSUCNT

Nice, I didn't even know that existed. That would be a good benchmark. Most likely we can just accept having the dmb instructions.

> I also want to support the dual-core H745 in the future

On the H7 dual-core you have to be careful with the caches if you are sharing data between both cores. If a value in SRAM is cached in the M7 DCACHE and has not been written back, an access from the M4 will read the old value from RAM. There is no automatic cache coherency.
ST's own programming model is to write data to a shared SRAM, trigger a cross-core interrupt, and memcpy the data over. I'm not sure what to expect if we share atomics across both CPUs, but it should work somehow if there is no caching involved.

@salkinium (Member, Author) replied:


> I can't see what the dmbs could accomplish

Yes, that makes sense to me. The CM7 AXI is not free to reorder memory accesses if that would modify the single-threaded view of memory (i.e. swapping the write and read of the same memory address). And I assume that ST didn't screw up the access order for external memory either.

> The only things we need from the atomic are atomicity and a compiler optimization barrier stopping it from doing the same reorderings that a multi-core system could do.

I thought about defining a bunch of modm::memory_order_acquire = std::memory_order_relaxed aliases for single-core devices and then using those everywhere. However, I looked into how GCC actually implements the builtins, and it seems to me that no barrier is generated for std::memory_order_relaxed, not even a compiler barrier:

  1. std::atomic<T>::load() calls the __atomic_load() builtin.
  2. I guess this then somehow calls expand_builtin_atomic_load(), which calls
  3. expand_atomic_load(), which then generates the actual instructions.
  4. This generates a memory barrier via expand_memory_blockage(), either as dmb or as asm volatile("" ::: "memory").
  5. It only emits a (compiler) memory barrier if NOT relaxed.

So I think we should still just use it the way it is intended even if it costs a few cycles more.

> I'm not sure what to expect if we share atomics across both CPUs, but it should work somehow if there is no caching involved.

We should add an MPU driver and mark that SRAM as non-cacheable. We also need to do this for memories that are DMA-able on the CM7; then we could enable the D-Cache by default.
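
A sketch of such a region setup with the CMSIS ARMv7-M MPU helpers; region number, base address, and size are placeholders, not verified H745 values:

```cpp
#include "mpu_armv7.h"  // CMSIS ARMv7-M MPU helper header

// Mark one SRAM region as normal, shareable, non-cacheable memory so
// that both cores and DMA observe coherent data there.
void mpu_mark_shared_sram_uncached()
{
    ARM_MPU_Disable();
    ARM_MPU_SetRegionEx(0u,                 // region number (placeholder)
        ARM_MPU_RBAR(0u, 0x30000000u),      // base address (placeholder)
        ARM_MPU_RASR(1u,                    // XN: no instruction fetches
                     ARM_MPU_AP_FULL,       // full read/write access
                     1u, 1u, 0u, 0u,        // TEX=1, shareable, non-cacheable, non-bufferable
                     0u,                    // no subregions disabled
                     ARM_MPU_REGION_SIZE_256KB));
    ARM_MPU_Enable(MPU_CTRL_PRIVDEFENA_Msk);  // keep the default map as background
}
```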

@chris-durand (Member) replied:


> it seems to me that no barrier is generated for std::memory_order_relaxed, not even a compiler barrier

Yes, if you want the compiler barrier only, you'll have to do a relaxed access on the atomic plus std::atomic_signal_fence.
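
A sketch of that pattern under the single-core assumption (the signal fence constrains only the compiler; no dmb is emitted):

```cpp
#include <atomic>
#include <cstdint>

std::atomic<uint32_t> sequence{0};

void notify_one_single_core()
{
    // Compiler-only release: earlier writes cannot be moved past this
    // point by the compiler, but no hardware barrier is generated.
    std::atomic_signal_fence(std::memory_order_release);
    sequence.fetch_add(1, std::memory_order_relaxed);
}

uint32_t wait_single_core(uint32_t old)
{
    uint32_t current;
    while ((current = sequence.load(std::memory_order_relaxed)) == old)
    {
        // yield to other fibers here
    }
    // Compiler-only acquire: later reads cannot be hoisted above the loop.
    std::atomic_signal_fence(std::memory_order_acquire);
    return current;
}
```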

@salkinium (Member, Author) replied:


I think the correct way to approach this is to modify the atomic builtin implementation and not fix it from the outside. Since I'm not motivated to shave that yak, I'll just keep using DMBs for now.
