[fiber] Add more concurrency classes #1026
base: develop
Conversation
It's still missing unit tests for latch, barrier, and condition_variable, as well as a bunch of documentation. However, it would be amazing if this could get reviewed with a special focus on C++ "conformity". I also noticed that avrlibstd++ does not have …
I haven't tried, to be honest, but I expect that the same solution as for Cortex-M0 will work. The header is mostly implemented using compiler builtins; I'd expect it to fall back to calling a library function in case the operation is not atomic on AVR. If you are lucky, just dropping in the headers and compiling shouldn't be an issue. C++20 atomic wait, which requires OS support, shouldn't be a problem because it is only enabled when …
I'll take a look at the PR over the weekend.
Yup, that just works, thanks!
Very nice!
I haven't managed to look at all the synchronization primitives, but will do tomorrow.
- modm::fiber -> modm::this_fiber
- modm::fiber::sleep() -> modm::this_fiber::sleep_for()
```cpp
void inline
notify_one()
{
	sequence.fetch_add(1, std::memory_order_acquire);
}
```
Shouldn't this be a release operation? Some store happens before the notify and it shouldn't be reordered behind it. On the consumer side, which is the wait, we need an acquire load or barrier such that no memory access is reordered before the wait is complete.
An acquire operation always has to be paired with a matching release. Otherwise it doesn't provide any ordering guarantees.
Actually, is there any use case where this code would ever have to deal with more than one thread on a multi-core CPU or interrupts running on different cores? I assume not, because the reason for fibers is to have a single-threaded execution model.
In that case we don't even need the actual memory barrier instructions to be emitted. You could use relaxed order on the atomic operation plus a compiler fence (std::atomic_signal_fence). The only thing that can reorder operations within a single thread is the compiler. Hardware barriers on atomics are only ever needed if you have multithreading and multiple cores with their own caches and view on memory.
I don't really know how expensive a few additional dmb instructions are on Cortex-M and whether it would be worth caring about. Anyway, I'm not sure if any of the fibers would work in the presence of multithreading.
Actually, is there any use case where this code would ever have to deal with more than one thread on a multi-core CPU or interrupts running on different cores? I assume not, because the reason for fibers is to have a single-threaded execution model.
Well… my secret plan is to allow running one fiber scheduler per (FreeRTOS) thread and then do some kind of message passing between threads like Qt does with their event loops. I also want to support the dual-core H745 in the future.
According to the Programming Guide to Memory Barrier Instructions on Cortex-M0 to Cortex-M4, all (internal) memory accesses are always in-order, thus DMBs are not required. However, this is only an implementation choice, the architecture explicitly allows for reordering of memory accesses!
And indeed, there seem to be several caveats for single-core devices:
- There is no guarantee of ordering on external memories, e.g. SDRAM connected over FMC. It's up to the implementation to correctly implement DMB, they even go as far as saying that you should do a dummy read on something you wrote to enforce the write order. For example, on STM32F4, the FMC has several FIFOs:
  - Write Data FIFO with 16 x 33-bit depth
  - Write Address FIFO with 16 x 30-bit depth
  - Cacheable Read FIFO with 6 x 32-bit depth (6 x 14-bit address tag) for the SDRAM controller

  However, it's unclear to me if these FIFOs are in-order relative to each other. (I would expect it, but who knows?)
- The document does not talk about Cortex-M7, which has this 64-bit internal AXI bus, that seems to allow for reordering of memory accesses. Specifically, it can do speculative reads at any time. So I think you still need DMBs on single-core Cortex-M7.
I would just leave the DMBs in the code for now. For Cortex-M0(+), I implemented atomics without DMBs anyways, so the optimization of leaving out the DMBs would only apply to CM3/CM4 without external memories.
I don't really know how expensive a few additional dmb instructions are on Cortex-M and if it would be worth caring about it
I can do a comparison of the DWT->LSUCNT (cycles spent waiting for loads and stores to complete) with and without DMBs once I have a representative example later on to get an idea of the overhead of having "unnecessary" DMBs in the code.
And indeed, there seem to be several caveats for single-core devices:
- There is no guarantee of ordering on external memories
- The document does not talk about Cortex-M7, which has this 64-bit internal AXI bus, that seems to allow for reordering of memory accesses
And none of those should matter when there is only one core. The problem with multi-core CPUs is that you can observe modifications to multiple memory locations in a different order across cores.
Let's say you have a single-producer single-consumer atomic queue, you write some data to the end and increment the last element pointer. What can happen is that the pointer modification is visible before the actual data change. So you would read invalid values on the consumer thread because they are not visible yet. Acquire-release ordering on the pointer solves that issue.
If you are on a single-core machine this is impossible by definition. Writing a variable and reading back the value must give the same result. There is nothing to be out of order with. This is just single-threaded programming; I can't see what the dmbs could accomplish unless you are doing DMA or other similar things where peripherals interact with the memory directly. But that's the peripheral driver's job and out of scope for atomics.
The only things we need from the atomic is the atomicity and the compiler optimization barrier stopping it from doing the same reorderings as the multi-core system could do.
As a side note, that is why the volatile-based modm interrupt "atomic" queue is broken. The compiler is free to reorder non-volatile accesses around volatile ones, and GCC is known to do that (https://blog.regehr.org/archives/28, item 5). We should fix it.
For Cortex-M0(+), I implemented atomics without DMBs anyways
For plain atomic loads and stores the compiler will still emit ldr/str plus dmb (https://gcc.godbolt.org/z/Enb6179Pb) instead of turning off interrupts. The fallback implementation is only used for read-modify-write operations the M0 can't do atomically due to the lack of ldrex and strex.
I can do a comparison of the DWT->LSUCNT
Nice, I didn't even know that existed. That would be a good benchmark. Most likely we can just accept having the dmb instructions.
I also want to support the dual-core H745 in the future
On the H7 dual-core you have to be careful with the caches if you are sharing data between both cores. If a value in SRAM is cached in the M7 DCACHE and has not been written back, an access from the M4 will read the old value from RAM. There is no automatic cache coherency.
ST's own programming model is to write data to a shared SRAM, trigger a cross-core interrupt and memcpy the data over. Not sure what to expect if we share atomics across both CPUs, but it should somehow work if there is no caching involved.
I can't see what the dmbs could accomplish
Yes, that makes sense to me. The CM7 AXI is not free to reorder memory accesses if that would modify the single-threaded view of memory (i.e. swapping the write and read of the same memory address). And I assume that ST didn't screw up the access order for external memory either.
The only things we need from the atomic is the atomicity and the compiler optimization barrier stopping it from doing the same reorderings as the multi-core system could do.
I thought about defining a bunch of modm::memory_order_acquire = std::memory_order_relaxed aliases for single-core devices and then using those everywhere. However, I looked into how GCC actually implements the builtins, and it seems to me like no barrier is generated for std::memory_order_relaxed, not even a compiler barrier:
- std::atomic<T>::load() calls the __atomic_load() builtins.
- I guess this then somehow calls expand_builtin_atomic_load(), which calls expand_atomic_load(), and this then generates the actual instructions.
- This generates a memory barrier via expand_memory_blockage(), either as dmb or as asm volatile("" ::: "memory").
- It only emits a (compiler) memory barrier if NOT relaxed.
So I think we should still just use it the way it is intended even if it costs a few cycles more.
Not sure what to expect if we share atomics across both CPUs, but it should somehow work if there is no caching involved.
We should add a MPU driver and mark that SRAM as non-cachable. We need to do this also for memories that are DMA-able on CM7, then we could enable the D-Cache by default.
it seems to me like no barrier is generated for std::memory_order_relaxed, not even a compiler barrier
Yes, if you want the compiler barrier only, you'll have to do a relaxed access on the atomic plus std::atomic_signal_fence.
I think the correct way to approach this is to modify the atomic builtin implementation and not fix it from the outside. Since I'm not motivated to shave that yak, I'll just keep using DMBs for now.
Fibers need a standard set of efficient constructs.
I modelled these classes after the C++ thread concurrency interfaces. Not sure how useful all of them are, but there is probably not much room for improvement over the standard interfaces anyway.
- <atomic> for AVR
- <mutex> and <shared_mutex> for AVR
- modm::fiber::sleep -> modm::this_fiber::sleep_for
- modm::this_fiber::sleep_until
- modm::this_fiber::get_id
- modm::fiber::Task to std::jthread interface, incl. stop_token