[threads] Use refcounts for coordinating finalization and detaching #12391

lambdageek · 2019-01-11T22:12:17Z

Reverts a29ad08 (#9914)

The basic problem we want to solve is the following:

All access to InternalThread:state must be protected by the
InternalThread:synch_cs mutex
We must destroy the mutex when we are done with the thread.
We don't know which happens later - detaching the machine thread or
finalizing its InternalThread managed object.

The solution is to replace InternalThread:synch_cs by InternalThread:longlived
which is a refcounted struct that holds the synch_cs. The refcount starts out
at 2 when the thread is attached to the runtime and when we create the managed
InternalThread object that represents it.
Both detaching and finalizing the managed object will decrement the refounct,
and whichever one happens last will be responsible for destroying the mutex.

This addresses #11956 which was a race
condition due to the previous attempt to fix this lifetime problem. The
previous attempt incorrectly used CAS in mono_thread_detach_internal while
continuing to use locking of synch_cs elsewhere. In particular
mono_thread_suspend_all_other_threads could race with
mono_thread_detach_internal: it expects to take the thread lock and test
thread->state and use the thread->suspended event, while detaching deletes
thread->suspended without taking a lock.

As a result we had a concurrency bug: in suspend_all_other_threads it's
possible to see both the old (non-Stopped) value of thread->state and the
new (NULL) value of thread->suspended. Which leads to crashes.

Background - why we don't know if detaching or finalization happens first.

InternalThread normally outlives the machine thread. This can happen because
when one thread starts another it can hold a reference to the fresh thread's
Thread object which holds a reference to the InternalThread. So after the
machine thread is done, the older thread can query the state of the younger
Thread object. This is the normal situation.
During shutdown we can have the opposite situation: the InternalThread
objects are finalized first (this happens during root domain finalization), but
the machine threads are still running, and they may still return to
start_wrapper_internal and call detach_internal. So in this case we have an
InternalThread whose finalizer ran first and detach will run second.

Reverts a29ad08 (mono#9914) The basic problem we want to solve is the following: 1. All access to InternalThread:state must be protected by the InternalThread:synch_cs mutex 2. We must destroy the mutex when we are done with the thread. 3. We don't know which happens later - detaching the machine thread or finalizing its InternalThread managed object. The solution is to replace InternalThread:synch_cs by InternalThread:longlived which is a refcounted struct that holds the synch_cs. The refcount starts out at 2 when the thread is attached to the runtime and when we create the managed InternalThread object that represents it. Both detaching and finalizing the managed object will decrement the refounct, and whichever one happens last will be responsible for destroying the mutex. This addresses mono#11956 which was a race condition due to the previous attempt to fix this lifetime problem. The previous attempt incorrectly used CAS in mono_thread_detach_internal while continuing to use locking of synch_cs elsewhere. In particular mono_thread_suspend_all_other_threads could race with mono_thread_detach_internal: it expects to take the thread lock and test thread->state and use the thread->suspended event, while detaching deletes thread->suspended without taking a lock. As a result we had a concurrency bug: in suspend_all_other_threads it's possible to see both the old (non-Stopped) value of thread->state and the new (NULL) value of thread->suspended. Which leads to crashes. --- Background - why we don't know if detaching or finalization happens first. 1. InternalThread normally outlives the machine thread. This can happen because when one thread starts another it can hold a reference to the fresh thread's Thread object which holds a reference to the InternalThread. So after the machine thread is done, the older thread can query the state of the younger Thread object. This is the normal situation. 2. During shutdown we can have the opposite situation: the InternalThread objects are finalized first (this happens during root domain finalization), but the machine threads are still running, and they may still return to start_wrapper_internal and call detach_internal. So in this case we have an InternalThread whose finalizer ran first and detach will run second.

BrzVlad

nice

lambdageek · 2019-01-14T14:26:33Z

@monojenkins backport to 2018-12

lambdageek · 2019-01-14T14:26:43Z

@monojenkins backport to 2018-10

monojenkins · 2019-01-14T14:27:46Z

@lambdageek backporting to 2018-10 failed, the patch results in conflicts:

Applying: [threads] Use refcounts for coordinating finalization and detaching
Using index info to reconstruct a base tree...
M	mcs/class/corlib/System.Threading/Thread.cs
M	mono/metadata/object-internals.h
M	mono/metadata/threads-types.h
M	mono/metadata/threads.c
Falling back to patching base and 3-way merge...
Auto-merging mono/metadata/threads.c
CONFLICT (content): Merge conflict in mono/metadata/threads.c
Auto-merging mono/metadata/threads-types.h
Auto-merging mono/metadata/object-internals.h
Auto-merging mcs/class/corlib/System.Threading/Thread.cs
error: Failed to merge in the changes.
Patch failed at 0001 [threads] Use refcounts for coordinating finalization and detaching

Please backport manually!

lambdageek requested review from akoeplinger, lateralusX, luhenry, marek-safar and vargaz as code owners January 11, 2019 22:12

lambdageek mentioned this pull request Jan 11, 2019

[threading] Try to fix the os-event-unix.c:82 assert(event) failure #12345

Closed

BrzVlad approved these changes Jan 14, 2019

View reviewed changes

monojenkins mentioned this pull request Jan 14, 2019

[2018-12] [threads] Use refcounts for coordinating finalization and detaching #12415

Merged

lambdageek mentioned this pull request Jan 15, 2019

[2018-10] [threads] Use refcounts for coordinating finalization and detaching #12426

Merged

lambdageek merged commit 67c5a87 into mono:master Jan 15, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[threads] Use refcounts for coordinating finalization and detaching #12391

[threads] Use refcounts for coordinating finalization and detaching #12391

lambdageek commented Jan 11, 2019 •

edited

BrzVlad left a comment

lambdageek commented Jan 14, 2019

lambdageek commented Jan 14, 2019

monojenkins commented Jan 14, 2019

[threads] Use refcounts for coordinating finalization and detaching #12391

[threads] Use refcounts for coordinating finalization and detaching #12391

Conversation

lambdageek commented Jan 11, 2019 • edited

BrzVlad left a comment

Choose a reason for hiding this comment

lambdageek commented Jan 14, 2019

lambdageek commented Jan 14, 2019

monojenkins commented Jan 14, 2019

lambdageek commented Jan 11, 2019 •

edited