os-event-unix.c:82 assert(event) failure #11956

Closed
jaykrell opened this issue Dec 6, 2018 · 11 comments

Comments


@jaykrell jaykrell commented Dec 6, 2018

Steps to Reproduce

Build mono and run some tests from mono/tests.
Observed in:

  • test-unhandled-exception-2-255-with-managed-handler
  • delegate-delegate-exit
  • threadpool_exceptions4
  • threadpool_exceptions2

Current Behavior

https://jenkins.mono-project.com/job/test-mono-pull-request-coop/10660/testReport/junit/MonoTests/test-unhandled-exception-2-255-with-managed-handler/unhandled_exception_2_exe/

Thread 1 (Thread 0x7fc3f6123740 (LWP 21274)):
#0  0x00007fc3f55f1b3a in waitpid () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x000055d64e68c540 in dump_native_stacktrace (ctx=ctx@entry=0x7fffebab31c0, signal=0x55d64e8a2656 "SIGABRT") at mini-posix.c:1051
#2  0x000055d64e68c710 in mono_dump_native_crash_info (signal=signal@entry=0x55d64e8a2656 "SIGABRT", ctx=ctx@entry=0x7fffebab31c0, info=info@entry=0x7fffebab32f0) at mini-posix.c:1080
#3  0x000055d64e61e005 in mono_handle_native_crash (signal=0x55d64e8a2656 "SIGABRT", ctx=0x7fffebab31c0, info=0x7fffebab32f0) at mini-exceptions.c:3222
#4  <signal handler called>
#5  0x00007fc3f505dfff in raise () from /lib/x86_64-linux-gnu/libc.so.6
#6  0x00007fc3f505f42a in abort () from /lib/x86_64-linux-gnu/libc.so.6
#7  0x000055d64e865214 in mono_log_write_logfile (log_domain=<optimized out>, level=<optimized out>, hdr=<optimized out>, message=0x55d64ffa4cb0 "* Assertion at os-event-unix.c:82, condition `event' not met\n") at mono-log-common.c:136
#8  0x000055d64e87df8e in monoeg_g_logstr (msg=<optimized out>, log_level=G_LOG_LEVEL_ERROR, log_domain=0x0) at goutput.c:117
#9  monoeg_g_logv_nofree (log_domain=log_domain@entry=0x0, log_level=log_level@entry=G_LOG_LEVEL_ERROR, format=format@entry=0x55d64e887070 "* Assertion at %s:%d, condition `%s' not met\n", args=args@entry=0x7fffebab3a18) at goutput.c:128
#10 0x000055d64e87e396 in monoeg_assertion_message (format=format@entry=0x55d64e887070 "* Assertion at %s:%d, condition `%s' not met\n") at goutput.c:163
#11 0x000055d64e85d991 in mono_os_event_reset (event=0x0) at os-event-unix.c:82
#12 0x000055d64e7aed02 in mono_thread_suspend_all_other_threads () at threads.c:3746
#13 0x000055d64e73c986 in ves_icall_System_Environment_Exit (result=255) at icall.c:7003
#

Expected Behavior

Success.

On which platforms did you notice this

Linux, macOS

@marek-safar marek-safar commented Dec 17, 2018

@akoeplinger is this a new failure?

@jaykrell jaykrell commented Dec 20, 2018

https://jenkins.mono-project.com/job/test-mono-pull-request-coop-arm64/7112/testReport/junit/MonoTests/runtime/threadpool_exceptions4_exe/


Jenkins: Mono - pull requests - Linux AArch64 - Cooperative Suspend #7112
Failed
MonoTests.runtime.threadpool-exceptions4.exe

Failing for the past 1 build (Since Unstable#7112 )
Took 5.2 sec.
Error Message
* Assertion at os-event-unix.c:82, condition `event' not met

Pkilling 0xffffab65e8c0 from 0xffffa84eb1d0
Pkilling 0xffffab2561d0 from 0xffffa84eb1d0
Pkilling 0xffffa86eb1d0 from 0xffffa84eb1d0
Entering thread summarizer pause from 0xffffa84eb1d0
Finished thread summarizer pause from 0xffffa84eb1d0.
Stacktrace:

  at <unknown> <0xffffffff>
  at (wrapper managed-to-native) System.Environment.Exit (int) <0x00007>
  at Test.OnUnhandledException (object,System.UnhandledExceptionEventArgs) [0x00001] in <adc3936fa2644b15965c5a31ec3e75ed>:0
  at (wrapper runtime-invoke) <Module>.runtime_invoke_void__this___object_object (object,intptr,intptr,intptr) [0x0002a] in <dd09314782394617a504d11d7e538959>:0
/proc/self/maps:
aaaab6e24000-aaaab71cd000 r-xp 00000000 08:02 48771802                   /home/builder/jenkins/workspace/test-mono-pull-request-coop-arm64/mono/mini/mono-sgen
aaaab71dd000-aaaab71e4000 r--p 003a9000 08:02 48771802                   /home/builder/jenkins/workspace/test-mono-pull-request-coop-arm64/mono/mini/mono-sgen
aaaab71e4000-aaaab71e9000 rw-p 003b0000 08:02 48771802                   /home/builder/jenkins/workspace/test-mono-pull-request-coop-arm64/mono/mini/mono-sgen
aaaab71e9000-aaaab727a000 rw-p 00000000 00:00 0 
aaaad2450000-aaaad2504000 rw-p 00000000 00:00 0                          [heap]
ffff90000000-ffff90022000 rw-p 00000000 00:00 0 
ffff90022000-ffff94000000 ---p 00000000 00:00 0 
ffff94000000-ffff94022000 rw-p 00000000 00:00 0 
ffff94022000-ffff98000000 ---p 00000000 00:00 0 
ffff9be00000-ffff9be01000 ---p 00000000 00:00 0 
ffff9be01000-ffff9c000000 rw-p 00000000 00:00 0 
ffff9c000000-ffff9c022000 rw-p 00000000 00:00 0 
ffff9c022000-ffffa0000000 ---p 00000000 00:00 0 
ffffa0000000-ffffa0022000 rw-p 00000000 00:00 0 
ffffa0022000-ffffa4000000 ---p 00000000 00:00 0 
ffffa4000000-ffffa4022000 rw-p 00000000 00:00 0 
ffffa4022000-ffffa8000000 ---p 00000000 00:00 0 
ffffa80ec000-ffffa80ed000 ---p 00000000 00:00 0 
ffffa80ed000-ffffa82ec000 rw-p 00000000 00:00 0 
ffffa82ec000-ffffa82ed000 ---p 00000000 00:00 0 
ffffa82ed000-ffffa84ec000 rw-p 00000000 00:00 0 
ffffa84ec000-ffffa84ed000 ---p 00000000 00:00 0 
ffffa84ed000-ffffa86ec000 rw-p 00000000 00:00 0 
ffffa86ec000-ffffa8f6f000 r-xp 00000000 08:02 49811405                   /home/builder/jenkins/workspace/test-mono-pull-request-coop-arm64/mcs/class/lib/net_4_x-linux/mscorlib.dll.so
ffffa8f6f000-ffffa8f7e000 ---p 00883000 08:02 49811405                   /home/builder/jenkins/workspace/test-mono-pull-request-coop-arm64/mcs/class/lib/net_4_x-linux/mscorlib.dll.so
Memory around native instruction pointer (0xffffab4249fc):
0xffffab4249ec  02 00 80 d2 03 01 80 d2 e8 10 80 d2 01 00 00 d4  ................
0xffffab4249fc  f3 0b 40 f9 e0 03 04 2a fd 7b d2 a8 c0 03 5f d6  ..@....*.{...._.
0xffffab424a0c  1f 20 03 d5 81 08 00 f0 21 20 47 f9 42 d0 3b d5  . ......! G.B.;.
0xffffab424a1c  e0 03 00 4b 04 00 80 12 40 68 21 b8 ef ff ff 17  ...K....@h!.....

Native stacktrace:

	/home/builder/jenkins/workspace/test-mono-pull-request-coop-arm64/mono/mini/mono(+0xeb324) [0xaaaab6f0f324]

Waiting for dumping threads to resume


Debug info from gdb:

[New LWP 7888]
[New LWP 7894]
[New LWP 7905]
[New LWP 7906]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
0x0000ffffab54b2c8 in pthread_cond_wait@@GLIBC_2.17 () from /lib/aarch64-linux-gnu/libpthread.so.0
  Id   Target Id         Frame 
* 1    Thread 0xffffab65e8c0 (LWP 7697) "mono" 0x0000ffffab54b2c8 in pthread_cond_wait@@GLIBC_2.17 () from /lib/aarch64-linux-gnu/libpthread.so.0
  2    Thread 0xffffaabff1d0 (LWP 7888) "SGen worker" 0x0000ffffab54b2c8 in pthread_cond_wait@@GLIBC_2.17 () from /lib/aarch64-linux-gnu/libpthread.so.0
  3    Thread 0xffffab2561d0 (LWP 7894) "Finalizer" 0x0000ffffab54d870 in do_futex_wait.constprop () from /lib/aarch64-linux-gnu/libpthread.so.0
  4    Thread 0xffffa86eb1d0 (LWP 7905) "mono" 0x0000ffffab54b2c8 in pthread_cond_wait@@GLIBC_2.17 () from /lib/aarch64-linux-gnu/libpthread.so.0
  5    Thread 0xffffa84eb1d0 (LWP 7906) "Thread Pool Wor" 0x0000ffffab54fa38 in waitpid () from /lib/aarch64-linux-gnu/libpthread.so.0

Thread 5 (Thread 0xffffa84eb1d0 (LWP 7906)):
#0  0x0000ffffab54fa38 in waitpid () from /lib/aarch64-linux-gnu/libpthread.so.0
#1  0x0000aaaab6f0f430 in dump_native_stacktrace (ctx=ctx@entry=0xffffa84e8690, signal=0xaaaab711a150 "SIGABRT") at mini-posix.c:1053
#2  0x0000aaaab6f0f5f0 in mono_dump_native_crash_info (signal=signal@entry=0xaaaab711a150 "SIGABRT", ctx=ctx@entry=0xffffa84e8690, info=info@entry=0xffffa84e8610) at mini-posix.c:1082
#3  0x0000aaaab6eccd0c in mono_handle_native_crash (signal=0xaaaab711a150 "SIGABRT", ctx=0xffffa84e8690, info=0xffffa84e8610) at mini-exceptions.c:3221
#4  <signal handler called>
#5  0x0000ffffab4249fc in raise () from /lib/aarch64-linux-gnu/libc.so.6
#6  0x0000ffffab425df4 in abort () from /lib/aarch64-linux-gnu/libc.so.6
#7  0x0000aaaab70dfcbc in mono_log_write_logfile (log_domain=<optimized out>, level=G_LOG_LEVEL_ERROR, hdr=<optimized out>, message=0xffffa00060f0 "* Assertion at os-event-unix.c:82, condition `event' not met\n") at mono-log-common.c:136
#8  0x0000aaaab70f9548 in monoeg_g_logstr (msg=<optimized out>, log_level=G_LOG_LEVEL_ERROR, log_domain=0x0) at goutput.c:117
#9  monoeg_g_logv_nofree (log_domain=log_domain@entry=0x0, log_level=log_level@entry=G_LOG_LEVEL_ERROR, format=format@entry=0xaaaab7102418 "* Assertion at %s:%d, condition `%s' not met\n", args=<error reading variable: Cannot access memory at address 0x8>) at goutput.c:128
#10 0x0000aaaab70f9910 in monoeg_assertion_message (format=format@entry=0xaaaab7102418 "* Assertion at %s:%d, condition `%s' not met\n") at goutput.c:163
#11 0x0000aaaab70d8710 in mono_os_event_reset (event=0x0) at os-event-unix.c:82
#12 0x0000aaaab7027ffc in mono_thread_suspend_all_other_threads () at threads.c:3747
#13 0x0000aaaab6fb5b50 in ves_icall_System_Environment_Exit (result=0) at icall.c:7010
#14 0x0000ffffab31a890 in ?? ()
#15 0x0000ffffa0002140 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 4 (Thread 0xffffa86eb1d0 (LWP 7905)):
#0  0x0000ffffab54b2c8 in pthread_cond_wait@@GLIBC_2.17 () from /lib/aarch64-linux-gnu/libpthread.so.0
#1  0x0000aaaab70d8abc in mono_os_cond_wait (mutex=0xaaaab71fd760 <signal_mutex>, cond=0xffffa86ea6d0) at mono-os-mutex.h:168
#2  mono_os_event_wait_multiple (events=events@entry=0xffffa86ea718, nevents=nevents@entry=1, waitall=waitall@entry=1, timeout=timeout@entry=4294967295, alertable=alertable@entry=1) at os-event-unix.c:190
#3  0x0000aaaab70d8ce8 in mono_os_event_wait_one (event=<optimized out>, event@entry=0xaaaad24d4da0, timeout=timeout@entry=4294967295, alertable=alertable@entry=1) at os-event-unix.c:94
#4  0x0000aaaab70249f8 in self_suspend_internal () at threads.c:5518
#5  0x0000aaaab7025638 in mono_thread_execute_interruption (pexc=0xffffa86ea858, pexc@entry=0xffffa86ea868) at threads.c:4912
#6  0x0000aaaab70257c8 in mono_thread_execute_interruption_ptr () at threads.c:4945
#7  0x0000aaaab70258d0 in mono_thread_interruption_checkpoint_request (bypass_abort_protection=bypass_abort_protection@entry=0) at threads.c:5052
#8  0x0000aaaab7029ac0 in mono_thread_interruption_checkpoint () at threads.c:5074
#9  mono_thread_interruption_checkpoint_void () at threads.c:5075
#10 0x0000aaaab708bd20 in monitor_thread (unused=<optimized out>) at threadpool-worker-default.c:712
#11 0x0000aaaab702cd5c in start_wrapper_internal (stack_ptr=<optimized out>, start_info=0x0) at threads.c:1178
#12 start_wrapper (data=0xaaaad247a480) at threads.c:1238
#13 0x0000ffffab5450a0 in start_thread () from /lib/aarch64-linux-gnu/libpthread.so.0
#14 0x0000ffffab4bbeac in ?? () from /lib/aarch64-linux-gnu/libc.so.6

Thread 3 (Thread 0xffffab2561d0 (LWP 7894)):
#0  0x0000ffffab54d870 in do_futex_wait.constprop () from /lib/aarch64-linux-gnu/libpthread.so.0
#1  0x0000ffffab54d928 in __new_sem_wait_slow.constprop.0 () from /lib/aarch64-linux-gnu/libpthread.so.0
#2  0x0000aaaab7075350 in mono_os_sem_wait (flags=MONO_SEM_FLAGS_ALERTABLE, sem=0xaaaab71eeac0 <finalizer_sem>) at ../../mono/utils/mono-os-semaphore.h:203
#3  mono_coop_sem_wait (flags=MONO_SEM_FLAGS_ALERTABLE, sem=0xaaaab71eeac0 <finalizer_sem>) at ../../mono/utils/mono-coop-semaphore.h:41
#4  finalizer_thread (unused=<optimized out>) at gc.c:920
#5  0x0000aaaab702cd5c in start_wrapper_internal (stack_ptr=<optimized out>, start_info=0x0) at threads.c:1178
#6  start_wrapper (data=0xaaaad24ac430) at threads.c:1238
#7  0x0000ffffab5450a0 in start_thread () from /lib/aarch64-linux-gnu/libpthread.so.0
#8  0x0000ffffab4bbeac in ?? () from /lib/aarch64-linux-gnu/libc.so.6

Thread 2 (Thread 0xffffaabff1d0 (LWP 7888)):
#0  0x0000ffffab54b2c8 in pthread_cond_wait@@GLIBC_2.17 () from /lib/aarch64-linux-gnu/libpthread.so.0
#1  0x0000aaaab70d5944 in mono_os_cond_wait (mutex=0xaaaab71fd4d8 <lock>, cond=0xaaaab71fd528 <work_cond>) at ../../mono/utils/mono-os-mutex.h:168
#2  get_work (job=<synthetic pointer>, do_idle=<synthetic pointer>, work_context=<synthetic pointer>, worker_index=0) at sgen-thread-pool.c:165
#3  thread_func (data=<optimized out>) at sgen-thread-pool.c:196
#4  0x0000ffffab5450a0 in start_thread () from /lib/aarch64-linux-gnu/libpthread.so.0
#5  0x0000ffffab4bbeac in ?? () from /lib/aarch64-linux-gnu/libc.so.6

Thread 1 (Thread 0xffffab65e8c0 (LWP 7697)):
#0  0x0000ffffab54b2c8 in pthread_cond_wait@@GLIBC_2.17 () from /lib/aarch64-linux-gnu/libpthread.so.0
#1  0x0000aaaab70d8abc in mono_os_cond_wait (mutex=0xaaaab71fd760 <signal_mutex>, cond=0xffffe2001ec0) at mono-os-mutex.h:168
#2  mono_os_event_wait_multiple (events=events@entry=0xffffe2001f08, nevents=nevents@entry=1, waitall=waitall@entry=1, timeout=timeout@entry=4294967295, alertable=alertable@entry=1) at os-event-unix.c:190
#3  0x0000aaaab70d8ce8 in mono_os_event_wait_one (event=<optimized out>, event@entry=0xaaaad2482370, timeout=timeout@entry=4294967295, alertable=alertable@entry=1) at os-event-unix.c:94
#4  0x0000aaaab70249f8 in self_suspend_internal () at threads.c:5518
#5  0x0000aaaab7025638 in mono_thread_execute_interruption (pexc=pexc@entry=0xffffe2002058) at threads.c:4912
#6  0x0000aaaab7029dc0 in ves_icall_System_Threading_Thread_Sleep_internal (ms=ms@entry=10000, error=error@entry=0xffffe2002128) at threads.c:1703
#7  0x0000aaaab6fc9098 in ves_icall_System_Threading_Thread_Sleep_internal_raw (a0=10000, error=0xffffe2002128) at ../../mono/metadata/icall-def.h:1062
#8  0x0000ffffab31918c in ?? ()
#9  0x0000aaaad24bc480 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

=================================================================
Got a SIGABRT while executing native code. This usually indicates
a fatal error in the mono runtime or one of the native libraries 
used by your application.
=================================================================

@akoeplinger akoeplinger commented Dec 20, 2018

@marek-safar it doesn't seem to be new; #10889 is pretty much the same as far as I can see. It seems to be coop-specific.


@lambdageek lambdageek commented Jan 3, 2019

This is a race condition between mono_thread_suspend_all_other_threads and mono_thread_detach_internal during shutdown.

The issue is that mono_thread_detach_internal calls mono_os_event_destroy (thread->suspended) and sets thread->suspended = NULL, while mono_thread_suspend_all_other_threads later calls mono_os_event_reset (thread->suspended) on the now-NULL event.

This happens because mono_thread_suspend_all_other_threads first collects an array of candidate threads, then checks the MonoInternalThread:state field without an atomic operation to decide whether each thread really needs to be shut down, and if so locks the thread object. mono_thread_detach_internal, on the other hand, atomically updates MonoInternalThread:state but never locks the thread.

I think it's possible for this set of operations to show the shutdown thread a stale MonoInternalThread:state value before it sees the NULL MonoInternalThread:suspended event.
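
A compile-only model of that interleaving may help. This is a hedged sketch with simplified, hypothetical names (ModelThread, suspend_side, detach_side), not the actual Mono code; the essential shape is that the detach path never takes the per-thread lock, so it can run between the suspend path's state check and its use of the suspended event:

```c
/* Deliberately racy model of the bug described above -- not Mono source.
 * Build with: cc -pthread race_model.c */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

typedef struct {
    pthread_mutex_t synch_cs;   /* stands in for the lock behind LOCK_THREAD */
    _Atomic int     state;      /* 0 = running, 1 = Stopped                  */
    void           *suspended;  /* stands in for the MonoOSEvent pointer     */
} ModelThread;

/* Shutdown side: test the state and use the suspended event under the
 * per-thread lock, as mono_thread_suspend_all_other_threads does. */
static void *suspend_side (void *arg)
{
    ModelThread *t = arg;
    pthread_mutex_lock (&t->synch_cs);
    if (atomic_load (&t->state) != 1) {   /* thread does not look Stopped    */
        /* detach_side can run right here, because it never takes the lock. */
        assert (t->suspended != NULL);    /* the assert that fires in Mono   */
        /* ... reset / wait on the suspended event ... */
    }
    pthread_mutex_unlock (&t->synch_cs);
    return NULL;
}

/* Detach side: CAS the state and tear down the event without ever taking
 * the per-thread lock, mirroring the post-#9914 detach path. */
static void *detach_side (void *arg)
{
    ModelThread *t = arg;
    int expected = 0;
    atomic_compare_exchange_strong (&t->state, &expected, 1);
    free (t->suspended);                  /* mono_os_event_destroy           */
    t->suspended = NULL;
    return NULL;
}

int main (void)
{
    ModelThread t;
    pthread_mutex_init (&t.synch_cs, NULL);
    atomic_init (&t.state, 0);
    t.suspended = malloc (1);

    pthread_t a, b;
    pthread_create (&a, NULL, suspend_side, &t);
    pthread_create (&b, NULL, detach_side, &t);
    pthread_join (a, NULL);
    pthread_join (b, NULL);

    pthread_mutex_destroy (&t.synch_cs);
    free (t.suspended);   /* NULL after detach_side; free (NULL) is a no-op  */
    return 0;
}
```

Under this model, if detach_side runs after the state check but before the event is used, the suspend side observes a not-yet-Stopped state together with a NULL event, which is the same shape as the mono_os_event_reset (event=0x0) frame in the backtraces above.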

@lambdageek lambdageek commented Jan 9, 2019

I think this is due to #9914 switching mono_thread_detach_internal from using LOCK_THREAD (thread) for synchronization when updating thread->state to using a CAS.

Previously both mono_thread_detach_internal and mono_thread_suspend_all_other_threads would lock the thread object before changing the state, so if mono_thread_suspend_all_other_threads held the lock it could be certain that the thread->suspended event still existed, because mono_thread_detach_internal would have blocked at the state change.

I think we need to revert part of #9914: the reference queue for destroying thread->synch_cs is good, but the CAS is bad and we should have kept the locking.

lambdageek added a commit to lambdageek/mono that referenced this issue Jan 10, 2019
Partly revert a29ad08 (mono#9914)

The problem is a race between mono_thread_detach_internal and
mono_thread_suspend_all_other_threads:

In mono_thread_suspend_all_other_threads, we use LOCK_THREAD (thread) (which
locks MonoInternalThread:synch_cs), then test and update thread->state and
call mono_os_event_reset (thread->suspended), all while holding the lock.

In mono_thread_detach_internal, we do not lock the thread: we CAS
thread->state and then go on to set thread->suspended to NULL.

As a result we have a concurrency bug: in suspend_all_other_threads it is
possible to see both the old (non-Stopped) value of thread->state and the
new (NULL) value of thread->suspended. Hence the crashes seen in
mono#11956.

This switches back to using LOCK_THREAD in mono_thread_detach_internal when
manipulating the thread state.

As a result, mono_thread_suspend_all_other_threads will see a consistent
thread->state and thread->suspended: when it takes the lock, either the thread
is not stopped and the suspended event is non-NULL, or the thread is Stopped
and it skips trying to suspend it.
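
In terms of the ModelThread sketch earlier in this thread, the partial revert amounts to making the detach side take the same per-thread lock before changing the state and tearing down the event. This is a hypothetical sketch of that ordering, not the actual patch:

```c
/* Hypothetical fix in terms of the ModelThread sketch above: the detach side
 * now changes the state and destroys the event under the same lock the
 * suspend side holds, so the suspend side can only ever observe either
 * (not Stopped, suspended != NULL) or (Stopped, skip the thread). */
static void *detach_side_locked (void *arg)
{
    ModelThread *t = arg;
    pthread_mutex_lock (&t->synch_cs);    /* LOCK_THREAD (thread)    */
    atomic_store (&t->state, 1);          /* mark the thread Stopped */
    free (t->suspended);                  /* mono_os_event_destroy   */
    t->suspended = NULL;
    pthread_mutex_unlock (&t->synch_cs);  /* UNLOCK_THREAD (thread)  */
    return NULL;
}
```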
lambdageek added a commit to lambdageek/mono that referenced this issue Jan 11, 2019
Reverts a29ad08 (mono#9914)

The basic problem we want to solve is the following:
  1. All access to InternalThread:state must be protected by the
     InternalThread:synch_cs mutex
  2. We must destroy the mutex when we are done with the thread.
  3. We don't know which happens later - detaching the machine thread or
     finalizing its InternalThread managed object.

The solution is to replace InternalThread:synch_cs with InternalThread:longlived,
a refcounted struct that holds the synch_cs. The refcount starts at 2: one
reference for attaching the thread to the runtime, and one for the managed
InternalThread object that represents it. Both detaching the machine thread and
finalizing the managed object decrement the refcount, and whichever happens last
is responsible for destroying the mutex.

This addresses mono#11956, which was a race
condition introduced by the previous attempt to fix this lifetime problem. That
attempt incorrectly used a CAS in mono_thread_detach_internal while continuing
to use locking of synch_cs elsewhere. In particular,
mono_thread_suspend_all_other_threads could race with
mono_thread_detach_internal: it expects to take the thread lock, test
thread->state and use the thread->suspended event, while detaching deletes
thread->suspended without taking a lock.

As a result we had a concurrency bug: in suspend_all_other_threads it is
possible to see both the old (non-Stopped) value of thread->state and the
new (NULL) value of thread->suspended, which leads to crashes.

---

Background - why we don't know if detaching or finalization happens first.

1. InternalThread normally outlives the machine thread. This can happen because
when one thread starts another it can hold a reference to the fresh thread's
Thread object which holds a reference to the InternalThread. So after the
machine thread is done, the older thread can query the state of the younger
Thread object. This is the normal situation.

2. During shutdown we can have the opposite situation: the InternalThread
objects are finalized first (this happens during root domain finalization), but
the machine threads are still running, and they may still return to
start_wrapper_internal and call detach_internal. So in this case we have an
InternalThread whose finalizer ran first and detach will run second.
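
A minimal sketch of the refcounted holder this commit message describes; the names (LongLived, longlived_unref) are illustrative assumptions, not the actual Mono declarations:

```c
/* Hypothetical sketch of a refcounted holder for the per-thread mutex, so
 * that whichever of detach or finalization happens last destroys it.
 * Names are illustrative, not the actual Mono declarations. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

typedef struct {
    _Atomic int     ref;        /* starts at 2: one reference for the machine
                                 * thread (released on detach), one for the
                                 * managed object (released by its finalizer) */
    pthread_mutex_t synch_cs;   /* the lock both code paths still need        */
} LongLived;

static LongLived *longlived_new (void)
{
    LongLived *ll = malloc (sizeof (*ll));
    atomic_init (&ll->ref, 2);
    pthread_mutex_init (&ll->synch_cs, NULL);
    return ll;
}

/* Called once by the detach path and once by the InternalThread finalizer,
 * in whichever order shutdown happens to produce. */
static void longlived_unref (LongLived *ll)
{
    if (atomic_fetch_sub (&ll->ref, 1) == 1) {   /* we were the last owner */
        pthread_mutex_destroy (&ll->synch_cs);
        free (ll);
    }
}
```

Either order of the two longlived_unref calls is safe: the mutex stays alive as long as either the detach path or the finalizer might still need it, and only the last owner tears it down.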
lambdageek added a commit to lambdageek/mono that referenced this issue Jan 11, 2019
Reverts a29ad08 (mono#9914)
monojenkins added a commit to monojenkins/mono that referenced this issue Jan 14, 2019
Reverts a29ad08 (mono#9914)
lambdageek added a commit to lambdageek/mono that referenced this issue Jan 15, 2019
Reverts a29ad08 (mono#9914)
lambdageek added a commit that referenced this issue Jan 15, 2019
Reverts a29ad08 (#9914)
lambdageek added a commit that referenced this issue Jan 15, 2019
Reverts a29ad08 (#9914)
lambdageek added a commit that referenced this issue Jan 15, 2019
Reverts a29ad08 (#9914)
@lambdageek lambdageek commented Jan 15, 2019

I believe this is fixed on master, 2018-12 and 2018-10.

@lambdageek lambdageek closed this Jan 15, 2019
@marek-safar marek-safar added this to the 2018-10 (5.20.xx) milestone Jan 28, 2019