Let's try new suspend again. #1600

Closed · wants to merge 3 commits
Conversation

@kumpera (Contributor) commented Feb 26, 2015

No description provided.

…iminate known races in the process.

The new suspend machinery is based around a single-word state machine that is manipulated using CAS.

This solves the first set of problems with the current approach: the lack of atomicity and the impossibility of
race-free manipulation of protected state. This is especially acute when doing a self suspend.
Suspend data must be atomically protected while we prepare to self suspend so a resume request won't race with the current
thread putting itself to sleep. This is a classic problem that calls for a mutex/condition variable pair. Except that all
locks in the suspend path must be suspend safe. Wait, WHAT?

A Suspend Safe Lock is a primitive that remains obstruction free even when the kernel suspends a thread that is either
performing a lock operation or waiting for the lock to become available. This, unfortunately, is not possible with
pthread_mutex on OSX. The only safe primitives are kernel semaphores, and there are no condition variables for those.

So back to locking and self-suspend. We need a lock because the existing thread state is not atomic, but we can't use a
mutex/condvar pair, so we're left with racy code. Yay! The fix is using CAS over a single variable.
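As a sketch of what "CAS over a single variable" can look like, here is a hypothetical encoding of the state word (state in the low bits, suspend count above it) with one transition implemented. The names, bit layout, and helper macros are illustrative, not the actual Mono identifiers:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical single-word encoding: low 8 bits hold the state,
 * the remaining bits hold the suspend count. */
enum {
    STATE_RUNNING                 = 0,
    STATE_ASYNC_SUSPEND_REQUESTED = 1,
    STATE_ASYNC_SUSPENDED         = 2,
};

#define STATE(word)        ((word) & 0xFF)
#define COUNT(word)        ((word) >> 8)
#define PACK(state, count) (((count) << 8) | (state))

/* Request an async suspend. Returns true if the caller must also send
 * the platform-specific wakeup (e.g. a signal); false if the thread was
 * already suspended and only the count needed bumping. */
bool request_async_suspend(_Atomic int *word)
{
    for (;;) {
        int old = atomic_load(word);
        switch (STATE(old)) {
        case STATE_RUNNING:
            if (atomic_compare_exchange_weak(word, &old,
                    PACK(STATE_ASYNC_SUSPEND_REQUESTED, COUNT(old) + 1)))
                return true;  /* we must now poke the target thread */
            break;            /* lost a race, retry */
        case STATE_ASYNC_SUSPENDED:
            if (atomic_compare_exchange_weak(word, &old,
                    PACK(STATE_ASYNC_SUSPENDED, COUNT(old) + 1)))
                return false; /* already suspended, just bumped the count */
            break;
        default:
            return false;     /* other transitions elided in this sketch */
        }
    }
}
```

Because every transition is a single CAS, a racing resume or a second suspend request can never observe a half-updated state.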

Another change was the hardening of the suspend/resume code on POSIX targets. First, we replace a semaphore wait with
sigsuspend for async suspend. This is needed since sem_wait is not async-signal safe (though sem_post is). Second, we now
respect interruption requests by having a pair of suspend signals, one installed with SA_RESTART and one without.

Onto the design of the suspension system; a few concepts that are useful to define up front:

* Self suspend: This is when the thread decides to suspend itself.

* Async suspend: This is when a thread decides to suspend another thread.

* Suspend Initiator: This is the name of a thread that decided to suspend one or more other threads. This is important
since all suspending threads will notify it when they are suspended. There can be only one initiator in the system
at a given time.

* Suspension count: Number of resume calls before a thread is back to running state.

There are 7 states a thread can be in while runnable:

* Running: just running...

* (Async|Self) Suspended: Suspended, the difference comes into play when resuming.

* (Async|Self) Suspend Requested: Suspend requested, but not completed. See more below for the discussion on why the request state is needed.

* Suspend In Progress: A self suspend has started saving its state, so an async suspend must not modify it.

* Suspend Promoted to Async: An async suspend caught a thread in the middle of a self suspend and wants to be notified when it finishes.

Now to the suspension protocol and how it works, at a high level.

Suspension starts with a suspend request, which bumps the suspend count by one. This suspend request can be fulfilled either
by an async or a self suspend action. This can be confusing: there's only one initiator, but any number of threads can be self suspending.

Async suspend then performs a platform-specific action, such as a POSIX signal, that forces a transition on the target. Self
suspend, in contrast, depends on the thread polling its own state and triggering the transition itself.
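The polling side of self suspend could look roughly like this; the intermediate "in progress" state described above is what keeps async suspenders out while the thread saves its own state. State names and values are illustrative, not the actual Mono ones:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative states; not the actual Mono names. */
enum {
    STATE_RUNNING                = 0,
    STATE_SELF_SUSPEND_REQUESTED = 1,
    STATE_SUSPEND_IN_PROGRESS    = 2,
    STATE_SELF_SUSPENDED         = 3,
};

/* Called by the thread itself at a safepoint. Returns true if the thread
 * should now save its state and go to sleep. Entering SUSPEND_IN_PROGRESS
 * via CAS tells async suspenders to keep out while the thread writes its
 * own suspend data. */
bool poll_suspend(_Atomic int *state)
{
    int expected = STATE_SELF_SUSPEND_REQUESTED;
    return atomic_compare_exchange_strong(state, &expected,
                                          STATE_SUSPEND_IN_PROGRESS);
}

/* After saving its state, the thread publishes that it is suspended. */
void finish_self_suspend(_Atomic int *state)
{
    atomic_store(state, STATE_SELF_SUSPENDED);
}
```

If poll_suspend returns false the thread simply keeps running; nothing was requested, or an async suspend already owns the transition.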

This works fine except that the suspend state is one huge struct with tons of fields. To control concurrent access to it we use a
single initiator in the case of async suspend, and we put the thread in a transitional "saving my state" state in the case of self suspend.

In the case of async suspend, the initiator must wait for all (one or more) threads to notify back that they have suspended. Only
after that is it possible to know whether the async suspend request actually worked (we might have hit a dying thread).
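The wait for those notifications can be reduced to a counter plus a semaphore, as in this sketch. The real pending-ops set tracks individual threads, and on OSX the runtime would need kernel (Mach) semaphores rather than sem_init; sem_post being async-signal safe is what makes a semaphore usable here at all:

```c
#include <semaphore.h>

/* Hypothetical pending-operations set, reduced to a counter + semaphore.
 * Each suspending thread posts when it reaches a suspended state; the
 * initiator waits for all of them before inspecting any thread state. */
typedef struct {
    int pending;  /* touched only by the single initiator */
    sem_t done;
} suspend_ops;

void ops_init(suspend_ops *ops)
{
    ops->pending = 0;
    sem_init(&ops->done, 0, 0); /* unnamed semaphore; not available on OSX */
}

void ops_add(suspend_ops *ops)    { ops->pending++; }        /* initiator only */
void ops_notify(suspend_ops *ops) { sem_post(&ops->done); }  /* suspending thread */

/* Initiator: block until every pending suspend has been acknowledged. */
void ops_wait_all(suspend_ops *ops)
{
    while (ops->pending > 0) {
        sem_wait(&ops->done);
        ops->pending--;
    }
}
```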

Together with this, there's an optional implementation of stop-the-world (STW) in sgen that can use this new machinery. To enable it, set the
MONO_ENABLE_UNIFIED_SUSPEND env var; it's opt-in for now while it gets more testing.
…f one to account for async racing to the middle of a self suspend.

When begin async suspend lands in STATE_SUSPEND_IN_PROGRESS, the initiator doesn't need to perform an async suspend; it just needs to wait.

If a self suspend is promoted to an async suspend, it must be added to the pending-ops set. This ensures that it will be waited on; otherwise
the suspend initiator might witness an unfinished self suspend and assert when fetching the thread state.

The alternative would be to change finish async suspend to account for the possibility of witnessing STATE_SUSPEND_PROMOTED_TO_ASYNC.

Solving it this way requires fewer async suspends and doesn't increase the valid state space.
Having a single unwind state means writes to it must be synchronized between self and async suspend. We can't
let the state be broken, as that could lead to bad unwinding or GC marking.

The original solution was to have an additional state in the self suspend path that signals it's writing to the
thread state, so any async suspend must give up and let it finish. This ended up being overly complicated,
as it requires two additional states.

By having a pair of unwind states, both sides can write concurrently without fear of clashing. This makes fetching the state
a little trickier, but it's worth the trouble.
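A sketch of the two-slot idea: each suspend flavor owns its own slot, so writes never clash, and fetching picks the valid slot based on the current state. The types, field names, and state values here are hypothetical placeholders:

```c
#include <stddef.h>

typedef struct {
    void *stack_pointer; /* placeholder for the real saved unwind data */
} unwind_state;

enum { STATE_ASYNC_SUSPENDED = 2, STATE_SELF_SUSPENDED = 3 };

typedef struct {
    unwind_state self_state;  /* written only by the thread itself */
    unwind_state async_state; /* written only by the async machinery */
} suspend_info;

/* Pick the valid unwind state given how the thread ended up suspended.
 * This is the "trickier fetch": the reader must consult the state word. */
unwind_state *fetch_unwind_state(suspend_info *info, int state)
{
    if (state == STATE_SELF_SUSPENDED)
        return &info->self_state;
    if (state == STATE_ASYNC_SUSPENDED)
        return &info->async_state;
    return NULL; /* not suspended: there is no valid unwind state */
}
```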

The simplified design has a much smaller state space, which is easier to reason about.
@alexrp (Contributor) commented May 4, 2015

I think this is merged now, so closing this.

@alexrp alexrp closed this May 4, 2015