[fiber] Add more concurrency classes #1026

Draft · wants to merge 8 commits into develop from feature/fiber_concurrency

Conversation

@salkinium (Member) commented May 20, 2023

Fibers need a standard set of efficient concurrency constructs.
I modelled these classes after the C++ thread concurrency interfaces. Not sure if that is the most useful design, but there is probably not much room for improvement anyway.

  • Implement <atomics> for AVR
  • Bring back <mutex> and <shared_mutex> for AVR
  • Add atomic, mutex and shared_mutex headers back avr-libstdcpp#34
  • modm::fiber::sleep -> modm::this_fiber::sleep_for
  • modm::this_fiber::sleep_until
  • modm::this_fiber::get_id
  • Align modm::fiber::Task to the std::jthread interface, incl. stop_token (see the std::jthread sketch after this list).
  • stop_token and stop_source (simplified)
  • mutex
  • timed_mutex
  • recursive_mutex
  • timed_recursive_mutex
  • counting_semaphore (=binary_semaphore)
  • shared_mutex
  • latch
  • barrier
  • condition_variable
  • unit tests
  • More documentation
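
For reference, a minimal sketch of the std::jthread interface that modm::fiber::Task is being aligned to (standard C++20, not modm code):

```cpp
#include <chrono>
#include <stop_token>
#include <thread>

int main()
{
    // jthread passes a stop_token to its callable and automatically
    // requests a stop and joins in its destructor.
    std::jthread worker([](std::stop_token token)
    {
        while (not token.stop_requested())
            std::this_thread::sleep_for(std::chrono::milliseconds(10));
    });

    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    worker.request_stop();  // cooperative cancellation, also implicit in ~jthread()
}
```

The fiber versions would substitute modm::this_fiber::sleep_for and the simplified stop_token/stop_source from this PR.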

@salkinium added this to the 2023q2 milestone on May 20, 2023
@salkinium removed this from the 2023q2 milestone on Jul 10, 2023
@salkinium force-pushed the feature/fiber_concurrency branch 2 times, most recently from 97c4cb4 to d9e13d0 on April 14, 2024
@salkinium added this to the 2024q2 milestone on Apr 14, 2024
@salkinium force-pushed the feature/fiber_concurrency branch 3 times, most recently from 3526783 to 3f372b6 on April 14, 2024
@salkinium (Member, Author) commented Apr 14, 2024

It's still missing unit tests for latch, barrier, and condition_variable and a bunch of documentation; however, it would be amazing if this could get reviewed with a special focus on C++ "conformity".
This implementation should be interrupt-safe (where that makes sense for the functions), so that we can use them in the HAL as signalling primitives for operation completion in the future.

I also noticed that avr-libstdcpp does not have <atomic>; I assume it was really annoying to port, @chris-durand? If the issue is the builtin GCC atomics not being implemented for avr-gcc, we can reuse the Cortex-M0 ones now.

@chris-durand (Member) commented:

> I also noticed that avr-libstdcpp does not have <atomic>; I assume it was really annoying to port, @chris-durand? If the issue is the builtin GCC atomics not being implemented for avr-gcc, we can reuse the Cortex-M0 ones now.

I haven't tried, to be honest, but I expect that the same solution as for Cortex-M0 will work. The header is mostly implemented using compiler builtins. I'd expect it to fall back to calling a library function in case the operation is not atomic on AVR. If you are lucky, just dropping in the headers and compiling atomics_c11_cortex.cpp.in for AVR could be sufficient.

C++20 atomic wait, which requires OS support, shouldn't be an issue because it is only enabled when _GLIBCXX_HAS_GTHREADS is set.
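
For illustration, a minimal sketch of what such a library fallback could look like on AVR, assuming interrupts are the only source of concurrency (the function name follows the libatomic ABI; the body is an assumption, not the actual atomics_c11_cortex.cpp.in implementation):

```cpp
#include <avr/interrupt.h>
#include <stdint.h>

// 32-bit fetch-add fallback: AVR cannot do this atomically in one
// instruction, so the read-modify-write runs with interrupts disabled.
extern "C" uint32_t
__atomic_fetch_add_4(volatile void* ptr, uint32_t value, int /*memorder*/)
{
    auto* object = static_cast<volatile uint32_t*>(ptr);
    const uint8_t sreg = SREG;  // save the global interrupt flag
    cli();                      // enter the critical section
    const uint32_t previous = *object;
    *object = previous + value;
    SREG = sreg;                // restore the interrupt flag
    return previous;
}
```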

@chris-durand (Member) commented:

I'll take a look at the PR over the weekend.

@salkinium force-pushed the feature/fiber_concurrency branch 2 times, most recently from c8575fb to 377553c on April 20, 2024
@salkinium (Member, Author) commented:

> If you are lucky, just dropping in the headers and compiling atomics_c11_cortex.cpp.in for AVR could be sufficient.

Yup, that just works, thanks!

@salkinium force-pushed the feature/fiber_concurrency branch 2 times, most recently from 0d3fa53 to 6803ce0 on April 21, 2024
@chris-durand (Member) left a review:

Very nice!

I haven't managed to look at all the synchronization primitives, but will do tomorrow.

Review threads (outdated, resolved) on: src/modm/processing/fiber/functions.hpp (×3), barrier.hpp, mutex.hpp, shared_mutex.hpp (×5)
@salkinium force-pushed the feature/fiber_concurrency branch 2 times, most recently from b3d3dee to 48ff96d on May 1, 2024
- modm::fiber -> modm::this_fiber.
- modm::fiber::sleep() -> modm::this_fiber::sleep_for().
@salkinium force-pushed the feature/fiber_concurrency branch 5 times, most recently from d9b837a to 72ab66d on May 6, 2024
Review threads (outdated, resolved) on: src/modm/processing/fiber/barrier.hpp (×2)
```cpp
void inline
notify_one()
{
    sequence.fetch_add(1, std::memory_order_acquire);
}
```
@chris-durand (Member) commented:


Shouldn't this be a release operation? Some store happens before the notify and it shouldn't be reordered behind it. On the consumer side, which is the wait, we need an acquire load or barrier such that no memory access is reordered before the wait is complete.

An acquire operation always has to be paired with a matching release. Otherwise it doesn't provide any ordering guarantees.

Actually, is there any use case where this code would ever have to deal with more than one thread on a multi-core CPU or interrupts running on different cores? I assume not, because the reason for fibers is to have a single-threaded execution model.
In that case we don't even need the actual memory barrier instructions to be emitted. You could use relaxed order on the atomic operation plus a compiler fence (std::atomic_signal_fence). The only thing that can reorder operations within a single thread is the compiler. Hardware barriers on atomics are only ever needed if you have multithreading and multiple cores with their own caches and view on memory.

I don't really know how expensive a few additional dmb instructions are on Cortex-M and whether it would be worth caring about. Anyway, I'm not sure any of the fiber code would work in the presence of multithreading.
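
A minimal sketch of the suggested pairing, modelled on the snippet above (the wait-loop body is a placeholder for whatever the fiber scheduler provides):

```cpp
#include <atomic>
#include <cstdint>

std::atomic<uint32_t> sequence{0};

void notify_one()
{
    // Release: all writes made before the notify become visible to a
    // consumer that observes the incremented sequence value.
    sequence.fetch_add(1, std::memory_order_release);
}

void wait(uint32_t old)
{
    // Acquire: reads after the wait cannot be reordered before it, so
    // the waiter sees everything the notifier published.
    while (sequence.load(std::memory_order_acquire) == old)
    {
        // yield to other fibers here
    }
}
```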

@salkinium (Member, Author) replied:


> Actually, is there any use case where this code would ever have to deal with more than one thread on a multi-core CPU or interrupts running on different cores? I assume not, because the reason for fibers is to have a single-threaded execution model.

Well… my secret plan is to allow running one fiber scheduler per (FreeRTOS) thread and then do some kind of message passing between threads like Qt does with their event loops. I also want to support the dual-core H745 in the future.

According to the Programming Guide to Memory Barrier Instructions on Cortex-M0 to Cortex-M4, all (internal) memory accesses are always in-order, so DMBs are not required. However, this is only an implementation choice; the architecture explicitly allows reordering of memory accesses!

And indeed, there seem to be several caveats for single-core devices:

  1. There is no guarantee of ordering on external memories, e.g. SDRAM connected over FMC. It's up to the implementation to correctly implement DMB; they even go as far as saying that you should do a dummy read of something you wrote to enforce the write order. For example, on STM32F4, the FMC has several FIFOs:
  • Write Data FIFO with 16 × 33-bit depth

  • Write Address FIFO with 16 × 30-bit depth

  • Cacheable Read FIFO with 6 × 32-bit depth (6 × 14-bit address tag) for the SDRAM controller.

    However, it's unclear to me whether these FIFOs are in-order relative to each other. (I would expect it, but who knows?)

  2. The document does not talk about the Cortex-M7, which has a 64-bit internal AXI bus that seems to allow reordering of memory accesses. Specifically, it can do speculative reads at any time. So I think you still need DMBs on a single-core Cortex-M7.

I would just leave the DMBs in the code for now. For Cortex-M0(+), I implemented atomics without DMBs anyway, so the optimization of leaving out the DMBs would only apply to CM3/CM4 without external memories.

> I don't really know how expensive a few additional dmb instructions are on Cortex-M and whether it would be worth caring about.

I can do a comparison of DWT->LSUCNT (cycles spent waiting for loads and stores to complete) with and without DMBs once I have a representative example, to get an idea of the overhead of "unnecessary" DMBs in the code.
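
A sketch of that measurement, assuming a CMSIS device header is included (note that LSUCNT is only 8 bits wide, so only short sections can be measured):

```cpp
#include <stdint.h>
// assumes a CMSIS device header (e.g. for an STM32) is included

void lsu_counter_start()
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  // enable the DWT unit
    DWT->CTRL |= DWT_CTRL_LSUEVTENA_Msk;             // count load/store wait cycles
    DWT->LSUCNT = 0;
}

uint8_t lsu_counter_read()
{
    // cycles spent waiting on loads/stores since start, modulo 256
    return DWT->LSUCNT & 0xFF;
}
```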

@chris-durand (Member) replied:


> And indeed, there seem to be several caveats for single-core devices:
>
>   1. There is no guarantee of ordering on external memories
>   2. The document does not talk about the Cortex-M7, which has a 64-bit internal AXI bus that seems to allow reordering of memory accesses

And none of those should matter when there is only one core. The problem with multi-core CPUs is that you can observe modifications to multiple memory locations in a different order across cores.

Let's say you have a single-producer single-consumer atomic queue: you write some data to the end and increment the last-element pointer. What can happen is that the pointer modification becomes visible before the actual data change, so you would read invalid values on the consumer thread because they are not visible yet. Acquire-release ordering on the pointer solves that issue.
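
For illustration, a minimal single-producer single-consumer ring buffer sketch (not modm code) showing that acquire/release pairing on the index:

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

template<typename T, std::size_t N>
class SpscQueue
{
    std::array<T, N> buffer{};
    std::atomic<std::size_t> head{0};  // written only by the producer
    std::atomic<std::size_t> tail{0};  // written only by the consumer

public:
    bool push(const T& value)
    {
        const auto h = head.load(std::memory_order_relaxed);
        if ((h + 1) % N == tail.load(std::memory_order_acquire))
            return false;   // queue is full
        buffer[h] = value;  // write the data first...
        // ...then publish it: the release guarantees the data write is
        // visible before the new head value can be observed.
        head.store((h + 1) % N, std::memory_order_release);
        return true;
    }

    std::optional<T> pop()
    {
        const auto t = tail.load(std::memory_order_relaxed);
        if (t == head.load(std::memory_order_acquire))
            return std::nullopt;  // empty: nothing published yet
        T value = buffer[t];      // safe: paired with the release above
        tail.store((t + 1) % N, std::memory_order_release);
        return value;
    }
};
```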

If you are on a single-core machine, this is impossible by definition. Writing a variable and reading back the value must give the same result; there is nothing to be out of order with. This is just single-threaded programming, and I can't see what the dmbs could accomplish unless you are doing DMA or other similar things where peripherals interact with the memory directly. But that's the peripheral driver's job and out of scope for atomics.

The only things we need from the atomic are atomicity and a compiler optimization barrier stopping it from doing the same reorderings that a multi-core system could do.

As a side note, that is why the volatile-based modm interrupt "atomic" queue is broken. The compiler is free to reorder non-volatile accesses around volatile ones, and gcc is known to do that (https://blog.regehr.org/archives/28, item 5). We should fix it.
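
A minimal sketch of the hazard (hypothetical names):

```cpp
int data;             // plain, non-volatile payload
volatile bool ready;  // volatile "flag"

void producer_irq()
{
    data = 42;     // plain store: the compiler may sink it below...
    ready = true;  // ...this volatile store, publishing before the data exists
}
```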

> For Cortex-M0(+), I implemented atomics without DMBs anyway

For plain atomic loads and stores the compiler will still emit ldr/str plus dmb (https://gcc.godbolt.org/z/Enb6179Pb) instead of turning off interrupts. The fallback implementation is only used for read-modify-write operations the M0 can't do atomically due to the lack of ldrex and strex.

> I can do a comparison of DWT->LSUCNT

Nice, I didn't even know that existed. That would be a good benchmark. Most likely we can just accept having the dmb instructions.

> I also want to support the dual-core H745 in the future

On the H7 dual-core you have to be careful with the caches if you are sharing data between both cores. If a value in SRAM is cached in the M7 DCACHE and has not been written back, an access from the M4 will read the old value from RAM. There is no automatic cache coherency.
ST's own programming model is to write data to a shared SRAM, trigger a cross-core interrupt, and memcpy the data over. I'm not sure what to expect if we share atomics across both CPUs, but it should work somehow if there is no caching involved.

@salkinium (Member, Author) replied:


> I can't see what the dmbs could accomplish

Yes, that makes sense to me. The CM7 AXI is not free to reorder memory accesses if that would modify the single-threaded view of memory (i.e. swapping the write and read of the same memory address). And I assume that ST didn't screw up the access order for external memory either.

> The only things we need from the atomic are atomicity and a compiler optimization barrier stopping it from doing the same reorderings that a multi-core system could do.

I thought about defining a bunch of modm::memory_order_acquire = std::memory_order_relaxed aliases for single-core devices and then using those everywhere. However, I looked into how GCC actually implements the builtins, and it seems to me that no barrier is generated for std::memory_order_relaxed, not even a compiler barrier:

  1. std::atomic<T>::load() calls the __atomic_load() builtin.
  2. I guess this then somehow calls expand_builtin_atomic_load(), which calls
  3. expand_atomic_load(), which then generates the actual instructions.
  4. This generates a memory barrier via expand_memory_blockage(), either as dmb or as asm volatile("" ::: "memory").
  5. It only emits a (compiler) memory barrier if NOT relaxed.

So I think we should still just use it the way it is intended even if it costs a few cycles more.

> I'm not sure what to expect if we share atomics across both CPUs, but it should work somehow if there is no caching involved.

We should add an MPU driver and mark that SRAM as non-cacheable. We also need to do this for memories that are DMA-able on the CM7; then we could enable the D-Cache by default.
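
A sketch of such a region setup with the CMSIS ARMv7-M MPU helpers; region number, base address, and size are placeholders, not verified H745 values:

```cpp
#include "mpu_armv7.h"  // CMSIS ARMv7-M MPU helper header

// Mark one SRAM region as normal, shareable, non-cacheable memory so
// that both cores and DMA observe coherent data there.
void mpu_mark_shared_sram_uncached()
{
    ARM_MPU_Disable();
    ARM_MPU_SetRegionEx(0u,                 // region number (placeholder)
        ARM_MPU_RBAR(0u, 0x30000000u),      // base address (placeholder)
        ARM_MPU_RASR(1u,                    // XN: no instruction fetches
                     ARM_MPU_AP_FULL,       // full read/write access
                     1u, 1u, 0u, 0u,        // TEX=1, shareable, non-cacheable, non-bufferable
                     0u,                    // no subregions disabled
                     ARM_MPU_REGION_SIZE_256KB));
    ARM_MPU_Enable(MPU_CTRL_PRIVDEFENA_Msk);  // keep the default map as background
}
```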

@chris-durand (Member) replied:


> it seems to me that no barrier is generated for std::memory_order_relaxed, not even a compiler barrier

Yes, if you want the compiler barrier only, you'll have to do a relaxed access on the atomic plus std::atomic_signal_fence.
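
A sketch of that pattern under the single-core assumption (the signal fence constrains only the compiler; no dmb is emitted):

```cpp
#include <atomic>
#include <cstdint>

std::atomic<uint32_t> sequence{0};

void notify_one_single_core()
{
    // Compiler-only release: earlier writes cannot be moved past this
    // point by the compiler, but no hardware barrier is generated.
    std::atomic_signal_fence(std::memory_order_release);
    sequence.fetch_add(1, std::memory_order_relaxed);
}

uint32_t wait_single_core(uint32_t old)
{
    uint32_t current;
    while ((current = sequence.load(std::memory_order_relaxed)) == old)
    {
        // yield to other fibers here
    }
    // Compiler-only acquire: later reads cannot be hoisted above the loop.
    std::atomic_signal_fence(std::memory_order_acquire);
    return current;
}
```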

@salkinium (Member, Author) replied:


I think the correct way to approach this is to modify the atomic builtin implementation and not fix it from the outside. Since I'm not motivated to shave that yak, I'll just keep using DMBs for now.
