Let's try new suspend again. #1600

Closed · wants to merge 3 commits
Conversation

@kumpera (Contributor) commented Feb 26, 2015

No description provided.

…iminate known races in the process.

The new suspend machinery is based around a single-word state machine that is manipulated using CAS.

This solves the first set of problems with the current approach: the lack of atomicity and the impossibility of
race-free manipulation of protected state. This is especially acute when doing a self suspend.
Suspend data must be atomically protected while we prepare to self suspend so a resume request won't race with the current
thread putting itself to sleep. This is a classic problem that calls for a mutex/condition variable pair. Except that all
locks in the suspend path must be suspend safe. Wait, WHAT?

A Suspend Safe Lock is a primitive that remains obstruction free even when the kernel suspends a thread that is either
performing a lock operation or waiting for the lock to become available. This, unfortunately, is not possible with
pthread_mutex on OSX. The only safe primitives are kernel semaphores, and there are no condition variables for those.

So back to locking and self-suspend. We need a lock because the existing thread state is not atomic, but we can't use a
mutex/condvar pair, so we're left with racy code. Yay! The fix is using CAS over a single variable.
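As a sketch of what "CAS over a single variable" can look like, here is a hypothetical encoding of the state word (state in the low bits, suspend count above it) with one transition implemented. The names, bit layout, and helper macros are illustrative, not the actual Mono identifiers:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical single-word encoding: low 8 bits hold the state,
 * the remaining bits hold the suspend count. */
enum {
    STATE_RUNNING                 = 0,
    STATE_ASYNC_SUSPEND_REQUESTED = 1,
    STATE_ASYNC_SUSPENDED         = 2,
};

#define STATE(word)        ((word) & 0xFF)
#define COUNT(word)        ((word) >> 8)
#define PACK(state, count) (((count) << 8) | (state))

/* Request an async suspend. Returns true if the caller must also send
 * the platform-specific wakeup (e.g. a signal); false if the thread was
 * already suspended and only the count needed bumping. */
bool request_async_suspend(_Atomic int *word)
{
    for (;;) {
        int old = atomic_load(word);
        switch (STATE(old)) {
        case STATE_RUNNING:
            if (atomic_compare_exchange_weak(word, &old,
                    PACK(STATE_ASYNC_SUSPEND_REQUESTED, COUNT(old) + 1)))
                return true;  /* we must now poke the target thread */
            break;            /* lost a race, retry */
        case STATE_ASYNC_SUSPENDED:
            if (atomic_compare_exchange_weak(word, &old,
                    PACK(STATE_ASYNC_SUSPENDED, COUNT(old) + 1)))
                return false; /* already suspended, just bumped the count */
            break;
        default:
            return false;     /* other transitions elided in this sketch */
        }
    }
}
```

Because every transition is a single CAS, a racing resume or a second suspend request can never observe a half-updated state.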

Another change was the hardening of the suspend/resume code on POSIX targets. First, we replace a semaphore wait with
sigsuspend for async suspend. This is needed since sem_wait is not async-signal safe (though sem_post is). Second, we now
respect interruption requests by having a pair of suspend signals, one installed with SA_RESTART and one without.

Onto the design of the suspension system; a few concepts that are useful to define up front:

* Self suspend: This is when the thread decides to suspend itself.

* Async suspend: This is when a thread decides to suspend another thread.

* Suspend Initiator: This is the name of a thread that decided to suspend one or more other threads. This is important
since all suspending threads will notify it when they are suspended. There can be only one initiator in the system
at a given time.

* Suspension count: Number of resume calls before a thread is back to running state.

There are 7 states a thread can be in while runnable:

* Running: just running...

* (Async|Self) Suspended: Suspended, the difference comes into play when resuming.

* (Async|Self) Suspend Requested: Suspend requested, but not completed. See more below for the discussion on why the request state is needed.

* Suspend In Progress: A self suspend has started saving its state, so an async suspend must not modify it.

* Suspend Promoted to Async: An async suspend caught a thread in the middle of a self suspend and wants to be notified when it finishes.

Now to the suspension protocol and how it works, at a high level.

Suspension starts with a suspend request, which bumps the suspend count by one. This suspend request can be fulfilled either
by an async or a self suspend action. This can be confusing: there's only one initiator, but any number of threads can be self suspending.

Async suspend then performs a platform-specific action, such as a POSIX signal, that forces a transition on the target. Self
suspend, in contrast, depends on the thread polling its own state and triggering the transition itself.
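The polling side of self suspend could look roughly like this; the intermediate "in progress" state described above is what keeps async suspenders out while the thread saves its own state. State names and values are illustrative, not the actual Mono ones:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative states; not the actual Mono names. */
enum {
    STATE_RUNNING                = 0,
    STATE_SELF_SUSPEND_REQUESTED = 1,
    STATE_SUSPEND_IN_PROGRESS    = 2,
    STATE_SELF_SUSPENDED         = 3,
};

/* Called by the thread itself at a safepoint. Returns true if the thread
 * should now save its state and go to sleep. Entering SUSPEND_IN_PROGRESS
 * via CAS tells async suspenders to keep out while the thread writes its
 * own suspend data. */
bool poll_suspend(_Atomic int *state)
{
    int expected = STATE_SELF_SUSPEND_REQUESTED;
    return atomic_compare_exchange_strong(state, &expected,
                                          STATE_SUSPEND_IN_PROGRESS);
}

/* After saving its state, the thread publishes that it is suspended. */
void finish_self_suspend(_Atomic int *state)
{
    atomic_store(state, STATE_SELF_SUSPENDED);
}
```

If poll_suspend returns false the thread simply keeps running; nothing was requested, or an async suspend already owns the transition.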

This works fine except that the suspend state is one huge struct with tons of fields. To control concurrent access to it we use a
single initiator in the case of async suspend, and we put the thread in a transitional "saving my state" state in the case of self suspend.

In the case of async suspend, the initiator must wait for all (one or more) threads to notify back that they have suspended. Only
after that is it possible to know whether the async suspend request actually worked (we might have hit a dying thread).
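The wait for those notifications can be reduced to a counter plus a semaphore, as in this sketch. The real pending-ops set tracks individual threads, and on OSX the runtime would need kernel (Mach) semaphores rather than sem_init; sem_post being async-signal safe is what makes a semaphore usable here at all:

```c
#include <semaphore.h>

/* Hypothetical pending-operations set, reduced to a counter + semaphore.
 * Each suspending thread posts when it reaches a suspended state; the
 * initiator waits for all of them before inspecting any thread state. */
typedef struct {
    int pending;  /* touched only by the single initiator */
    sem_t done;
} suspend_ops;

void ops_init(suspend_ops *ops)
{
    ops->pending = 0;
    sem_init(&ops->done, 0, 0); /* unnamed semaphore; not available on OSX */
}

void ops_add(suspend_ops *ops)    { ops->pending++; }        /* initiator only */
void ops_notify(suspend_ops *ops) { sem_post(&ops->done); }  /* suspending thread */

/* Initiator: block until every pending suspend has been acknowledged. */
void ops_wait_all(suspend_ops *ops)
{
    while (ops->pending > 0) {
        sem_wait(&ops->done);
        ops->pending--;
    }
}
```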

Together with this, there's an optional implementation of stop-the-world (STW) in sgen that can use this new machinery. To enable it, set the
MONO_ENABLE_UNIFIED_SUSPEND env var; it's opt-in for now while it gets more testing.
…f one to account for async racing to the middle of a self suspend.

When begin async suspend lands in STATE_SUSPEND_IN_PROGRESS, the initiator doesn't need to perform an async suspend; it just needs to wait.

If a self suspend is promoted to an async suspend, it must be added to the pending-ops set. This ensures that it will be waited on; otherwise
the suspend initiator might witness an unfinished self suspend and assert when fetching the thread state.

The alternative would be to change finish async suspend to account for the possibility of witnessing STATE_SUSPEND_PROMOTED_TO_ASYNC.

Solving it this way requires fewer async suspends and doesn't increase the valid state space.
Having a single unwind state means writes to it must be synchronized between self and async suspend. We can't
let the state be broken, as that could lead to bad unwinding or GC marking.

The original solution was to have an additional state in the self suspend path that signals it's writing to the
thread state, so any async suspend must give up and let it finish. This ended up being overly complicated,
as it requires two additional states.

By having a pair of unwind states, both sides can write concurrently without fear of clashing. This makes fetching the state
a little trickier, but it's worth the trouble.
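A sketch of the two-slot idea: each suspend flavor owns its own slot, so writes never clash, and fetching picks the valid slot based on the current state. The types, field names, and state values here are hypothetical placeholders:

```c
#include <stddef.h>

typedef struct {
    void *stack_pointer; /* placeholder for the real saved unwind data */
} unwind_state;

enum { STATE_ASYNC_SUSPENDED = 2, STATE_SELF_SUSPENDED = 3 };

typedef struct {
    unwind_state self_state;  /* written only by the thread itself */
    unwind_state async_state; /* written only by the async machinery */
} suspend_info;

/* Pick the valid unwind state given how the thread ended up suspended.
 * This is the "trickier fetch": the reader must consult the state word. */
unwind_state *fetch_unwind_state(suspend_info *info, int state)
{
    if (state == STATE_SELF_SUSPENDED)
        return &info->self_state;
    if (state == STATE_ASYNC_SUSPENDED)
        return &info->async_state;
    return NULL; /* not suspended: there is no valid unwind state */
}
```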

The simplified design has a much smaller state space, which is easier to reason about.
@alexrp (Contributor) commented May 4, 2015

I think this is merged now, so closing this.

@alexrp alexrp closed this May 4, 2015