spurious wake-ups in static_thread_pool cause deadlock #114
OK, this actually doesn't help :(. Suppose a different thread attempts to schedule itself and ends up stuck spinning, waiting for the thread that isn't asleep but is claiming to be asleep to actually go to sleep (a state it was never supposed to have entered: the operation in question wasn't supposed to block, which is what made the pattern I'm about to describe safe). If, while it is spinning, something in the running pool thread in turn blocks back on it (which again should have been safe, as the operations the other thread is doing should not have blocked: if m_sleepingThreadCount were 0 it would return immediately, and it should only be >= 1 if there is a sleeping thread for it to wake up), then the thread pool is now deadlocked and the other thread that wanted to schedule something is stuck in the livelock.
Thanks for the detailed analysis. So to summarise, the problem occurs in the following situation:
I can see two possible solutions here.

This has the downside of incurring an extra atomic-exchange operation to set the
This also has the effect of having an active thread look like it's asleep. This can mean that another idle thread might not be woken up next time an item is queued, as the thread that enqueued a task might "wake up" the thread that is already awake.

The second is to have the
This would have the downside that it might introduce an extra context switch and OS call in the (hopefully rare) case where another thread that is waking up this thread is context-switched out between setting
So, I ended up not having to care about this bug, because I simply wrote my own scheduler (and in fact later wrote my own task, so my usage of cppcoro now mostly just provides me some support for generators)... but I have friends who want to be able to start using all of these fun features of C++ co_await, and I'm at a loss as to what to tell them. They all know my stories about this static_thread_pool issue, which, as far as I can tell, was never worked on, and so cppcoro can't really be trusted under any kind of heavy load :(. Is there a different library I should be referring people to in 2021? (It is extra disappointing, as I was just hoping to use multi_producer_sequencer for something I was writing... but I see that it for some reason cares about the scheduler, and all the examples use static_thread_pool, which I know doesn't work, so I'm guessing I'll need to write my own mechanism.)
In static_thread_pool.cpp, there is a comment about how "the auto_reset_event used to wake up this thread may already be in the 'set' state" and a defense that this "won't really hurt".
cppcoro/lib/static_thread_pool.cpp, lines 469 to 473 at commit 18e02bd
However, what this means is that m_sleepingThreadCount won't get updated correctly: for a thread pool with a single thread (an easy test), m_sleepingThreadCount should be 1 as we enter localState.sleep_until_woken and 0 when we exit it, but if the auto_reset_event used to wake up the thread is already in the set state, we will exit the wait with m_sleepingThreadCount still equal to 1.
m_isSleeping is also true, which is awkward, but that will resolve itself (as in, will become the right value) when we next go to sleep... but that will also bump m_sleepingThreadCount up to 2 :(. When this thread next gets woken up, m_isSleeping will get set to false; but the decrement of m_sleepingThreadCount will only bring it back to 1: so we are now falsely counting a sleeping thread.
At this point, if another thread attempts to schedule into this thread pool while this thread is running, wake_one_thread will see m_sleepingThreadCount as 1. It will decrement that to 0, and think it is in charge of waking up a thread. However, there are no threads to wake up, as the m_isSleeping of all threads is false. This thread will then spin loop inside of wake_one_thread until we go back to sleep.
This is at least sort of OK? That spin loop might be going on for arbitrarily long (which sucks), but at least this will decrement m_sleepingThreadCount back to 0, and the situation will resolve itself... BUT, if the thread attempting to schedule itself to this pool is the current thread (which is normally an idempotent operation), it will livelock in that spin loop: the current thread will never sleep.
Put simply, if the auto_reset_event used to wake up a thread is already in the 'set' state, leaving it in that state actually hurts quite a bit ;P, and is leading to lots of wasted CPU in spinloops and even livelocks. I can add checks in my code "don't re-schedule to the current thread" to work around the latter, but that won't help with the former: I'd argue the tracking variables should always just be correct.
(I have no clue what an obvious fix is for this issue, and so will not attempt to propose one... the solution I would have come up with would have almost certainly been much slower than your library, as I have not yet attempted to learn how to do lock-free concurrency ;P.)