-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix shutdown race between mono_thread_manage and mono_thread_detach_internal. #5599
Fix shutdown race between mono_thread_manage and mono_thread_detach_internal. #5599
Conversation
mono/metadata/threads.c
Outdated
g_hash_table_insert (joinable_threads, tid, tid); | ||
UnlockedIncrement (&joinable_thread_count); | ||
if (!g_hash_table_lookup (joinable_threads, tid)) { | ||
threads_add_joinable_thread_nolock (tid); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the lookup needed ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On Windows we need to store a handle to the thread in the joinable_threads table (as value) to make sure the tid still represents the same thread when waited upon. As long as there is an outstanding handle to a thread the tid won't be reused by the OS. This check prevents that we replace a tid handle, since the handle will then be leaked on Windows platforms. An alternative could be to implement a custom value_destroy_func on Windows to capture this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looked a little more into this, current implementations call of g_hash_table_insert is problematic in cases where the tid is already in the table, since the joinable_thread_count will then not be correct, and there is no way to check if g_hash_table_insert added a new item or just changing previous item in the table. Doing the value_destroy_func on Windows won't solve this, and the problem with the increment of joinable_thread_count exists on other platforms. In previous usage of joinable threads this use case probably didn't happen (same tid added multiple times to table), but it can happen with this fix, so doing the check prevents duplicates, makes sure the joinable threads in the list corresponds to the joinable_thread_count. It also makes the implementation of mono_threads_add_joinable_thread more robust for future use. OK to keep this change as is?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use g_hash_table_lookup_extended
so a NULL
value isn't considered false.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will do
build |
Looks like just looking at the shutdown flag wont get all use case. Will update with additional condition on when to add threads to the joinable threads list. |
mono/metadata/threads.c
Outdated
#else | ||
static void | ||
threads_native_thread_join_lock (gpointer key, gpointer value) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't we use mono_native_thread_join
on both platforms?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wanted to avoid additional additional openthread since we already have a valid thread handle in the table. mono_native_thread_join only accepts tid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to a handle base join in the windows threading backend in complement to the already existing tid version.
mono/metadata/threads.c
Outdated
{ | ||
HANDLE thread_handle = (HANDLE)value; | ||
MONO_ENTER_GC_SAFE; | ||
WaitForSingleObject (thread_handle, INFINITE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't we use mono_native_thread_join
on both platforms?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see above
mono/metadata/threads.c
Outdated
|
||
joinable_threads_lock (); | ||
if (!joinable_threads) | ||
joinable_threads = g_hash_table_new (NULL, NULL); | ||
if (g_hash_table_lookup (joinable_threads, tid)) { | ||
|
||
value = g_hash_table_lookup (joinable_threads, tid); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use g_hash_table_lookup_extended
so a NULL
value wouldn't be considered false.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will do
mono/metadata/threads.c
Outdated
g_hash_table_insert (joinable_threads, tid, tid); | ||
UnlockedIncrement (&joinable_thread_count); | ||
if (!g_hash_table_lookup (joinable_threads, tid)) { | ||
threads_add_joinable_thread_nolock (tid); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me
mono/metadata/threads.c
Outdated
g_hash_table_insert (joinable_threads, tid, tid); | ||
UnlockedIncrement (&joinable_thread_count); | ||
if (!g_hash_table_lookup (joinable_threads, tid)) { | ||
threads_add_joinable_thread_nolock (tid); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use g_hash_table_lookup_extended
so a NULL
value isn't considered false.
Also, like you pointed out earlier, this will also wait for threads which have been created outside of the runtime, and for which there is no guarantee that they will ever exit. For the example of the platform threadpool, which can be used by user code to call into managed code, there is 0 guarantee that these threads will ever exit. |
I will remodell the condition on when we add to the joinable list since shutdown flag wont capture all cases anyways. I will most likely move back to a previous solution just addin threads that we control. |
We should now only add threads to the joinable thread list if its a thread that we know will terminate shortly after call to detach. External threads should never get onto the joinable threads list. This should not change much to current dependency on the threads list. If a attached thread is still on the threads list when thread_manage runs remove_and_abort_threads it will wait infinite for threads to finish. |
mono/metadata/threads.c
Outdated
@@ -739,7 +742,7 @@ mono_thread_attach_internal (MonoThread *thread, gboolean force_attach, gboolean | |||
} | |||
|
|||
static void | |||
mono_thread_detach_internal (MonoInternalThread *thread) | |||
mono_thread_detach_is_joinable_internal (MonoInternalThread *thread, gboolean is_joinable) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would definitely prefer to keep the same function mono_thread_detach_internal
, even if you add a parameter, to avoid having a proliferation of functions that do very little.
mono/metadata/threads.c
Outdated
inline static void | ||
mono_thread_detach_internal (MonoInternalThread *thread) | ||
{ | ||
mono_thread_detach_is_joinable_internal (thread, FALSE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A non-joinable thread is a thread that have been externally attached, not a thread that is externally detached. And you already have this information via MonoThreadInfo.runtime_thread which is set to TRUE iff it's been created via start_wrapper
which is only accessible when creating a thread from the runtime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then I can switch to just looking at runtime_thread in condition and remove the additional parameter.
4b818c0
to
ffa9cf0
Compare
Investigating a new scenario triggering similar assert when moving over to just looking at runtime_thread. Looks like there are runtime threads not included in the threads list during shutdown. Current solution won't add them to the joinable list, maybe they should be, but will investigate when/why this happens. There is also a potential race between timeout of finalizer thread and joinable threads. The current implementation of mono_threads_join_threads can cause issues if finalizer thread is waiting on a joinable thread when aborted by main thread. This is probably a less likely scenario since the default finalizer timeout is 40 seconds, so won't focus on this scenario for now. |
Found the additional race condition in this area, when a thread is detached it reset the threads background state, threads.c:773 and this will make thread_manage to not wait upon the tid, but remove it from the threads list. That will in turn cause mono_thread_detach_internal not hitting the code path adding to the joinable thread list. In order to resolve all races during shutdown between finalizing the call to mono_thread_detach_internal and shutdown of GC (and runtime) adding to joinable list needs to take place before changing the background state of the thread. NOTE, only runtime threads guaranteed to exit will be added to list. |
Joinable threads are added in this code path:
|
ffa9cf0
to
075b439
Compare
There are races when threads won't reach that code path before the runtime already have terminated the GC, and that will cause the assert/uncontrolled termination seen on Jenkins. Above will make sure runtime threads gets onto the joinable thread list earlier, just before the runtime thread could vanish from thread_manage's radar and not waited upon by anyone. |
075b439
to
517bca1
Compare
Will investigate what's triggering the Linux failures. Will add some temporary asserts and rerun Jenkins builds. |
Believe that I found the cause of Linux failures, it boils down to the same issue as described above with race between finalizer thread and joinable threads. There is a small chance that a thread will be joined multiple times. This won't cause any problems on Windows, but on Posix's platforms this is undefined behavior. |
mono/metadata/threads.c
Outdated
#else | ||
static void | ||
threads_native_thread_join_lock (gpointer key, gpointer value) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do the key/value parameters mean here ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Key/Value will be different on Windows where key is the thread ID and value is the OS handle to the thread (preventing thread id to be recycled). On other platforms key will be the same as value (same as previous implementation where tid was both key and value), but same function signature is used on all platforms. So for Windows key == tid and value == thread handle, on other platforms key == tid and value == tid (the same as previous implementation). I can give the parameters different names if that makes it clearer.
mono/metadata/threads.c
Outdated
} | ||
} | ||
inline static void | ||
threads_native_thread_join_nolock (gpointer key, gpointer value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are not perf critical, they shouldn't be inline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
6466c46
to
5d6a7bf
Compare
Current Windows implementation doesn't have support for the joinable thread implementation in threads.c. Joinable threads are used by the runtime during shutdown to coordinate with exit of platform threads. Since threads detaching and exiting during shutdown could still use or provide important resources, like GC memory, lacking support for this feature could potentially expose the implementation to various shutdown race conditions. Commit also change requested permission on a thread handle when opening it for synchronization purposes. No need to request all thread access when handle is only used for synchronization, like in WaitForXXX API calls.
There is a race during shutdown when main tread is running mono_thread_manage and attached threads are running mono_thread_detach_internal. The problem is related to the removal from the registered threads list since mono_thread_manage will pick threads to wait upon from that list. If a thread removes itself from the list using mono_thread_detach_internal it will then race against runtime shutdown and the removal of used resources still in used by mono_thread_detach_intenral, like MonoInternalThread. This race could trigger the assert seen in bugzilla: https://bugzilla.xamarin.com/show_bug.cgi?id=58770 or other undefined behavior depending on where the detached thread resume execution after main thread has shutdown the runtime, but not yet terminated the process. The fix makes sure threads that are detaching during shutdown ends up on the joinable threads list. Threads on that list will be waited upon by the main thread during shutdown, mono_thread_cleanup. This will make sure the detached thread finishes the remaining of mono_thread_detach_internal and potential other code still using GC memory during shutdown. Threads already on the threads list will already be waited upon by mono_thread_manage during shutdown (if not removed by mono_thread_detach_internal triggering this race), so this change won't add additional threads to be waited up by the runtime. It just makes the shutdown more reliable using already available concepts (like the joinable threads list).
5d6a7bf
to
fffe6e2
Compare
…ltiple times. Checking against joinable_threads list in mono_threads_add_joinable_thread is not enough. There is a small chance that the same runtime thread could get on the list twice since mono_threads_add_joinable_thread could be called multiple times for the the same tid. This will work under Windows but is undefined behavior on POSIX's platforms. This could happen if the finalizer thread calls mono_threads_join_threads when a thread is detaching and have added itself to the joinable_threads, but before unregister_thread has been called. Finalizer thread will then take tid from list and wait, this will cause the second call to mono_threads_add_joinable_thread to not find itself in the joinable_threads and add tid a second time. NOTE, this race only applies to runtime threads added to the joinable_threads list. Fix adds an additional method to add runtime threads to the list. This method will only add to list if its indeed a runtime thread and doesn’t already have an outstanding pending native join request. This method is called by code previously checking runtime_thread flag before adding to joinable_threads list. This will prevent the scenario when the same MonoThreadInfo add its tid to the joinable_threads list multiple times due to above race condition.
fffe6e2
to
7d92655
Compare
mono#5599 fixed a race condition during shutdown when runtime threads have come parts of their way through detach, but still depend on runtime resources, like GC memory. The fix added runtime threads to the joinable thread list just before they vanished from mono_thread_manage radar making sure shutdown waited upon the thread before cleaning up. The above fix slightly changed the behavior of the finalizer thread since it waits on joinable threads and will now potential block on threads still executing code (that involves runtime resources). There’s was an assumption around the threads on the joinable thread list that they should be very close to complete when added, so join calls coming from the finalizer thread should almost never block and if it does, the code that remains to execute should not involve runtime operations risking deadlock situations. Adding the thread to the list earlier than previously done expose the shutdown to some potential theoretical problems. To mitigate the risk and still solve the race condition this commit adds a mechanism to keep track of active runtime threads until they park on joinable thread list. The pending counter will be waited upon by the shutdown thread, just before it does its regular wait on all joinable threads (after finalizer thread has stopped) to make sure all runtime threads have been added to the joinable thread list before waiting upon them. Threads are added to the joinable thread as late as possible, exactly how it’s been done in the past by sgen_client_thread_detach_with_lock. Shutdown thread will wait on runtime threads to appear on the list for a short time and if timeout (pending runtime thread count not reaching 0 before timeout), it will just print a warning and continue shutdown. Getting into a wait state during shutdown due to runtime threads not yet added to joinable threads list should be very rare (hitting previous race condition that was rare), triggering the timeout should be even more rare, and if that ever happens, we are exposed to shutdown race condition as we have had in the past, but now we at least get a warning in the log making it simpler to analyze further. This commit also fixes a problem with the debugger thread hitting the same race condition as above. The shutdown thread stopping the debugger thread didn't completely wait for it to stop using runtime resources before continue shutdown sequence. This triggers the same race condition as when shutting down regular runtime threads. This commit makes sure stop_debugger_thread waits on the debugger thread handle to become signaled (happens at the very end of thread lifetime) before continuing the shutdown logic.
mono#5599 fixed a race condition during shutdown when runtime threads have come parts of their way through detach, but still depend on runtime resources, like GC memory. The fix added runtime threads to the joinable thread list just before they vanished from mono_thread_manage radar making sure shutdown waited upon the thread before cleaning up. The above fix slightly changed the behavior of the finalizer thread since it waits on joinable threads and will now potential block on threads still executing code (that involves runtime resources). There’s was an assumption around the threads on the joinable thread list that they should be very close to complete when added, so join calls coming from the finalizer thread should almost never block and if it does, the code that remains to execute should not involve runtime operations risking deadlock situations. Adding the thread to the list earlier than previously done expose the shutdown to some potential theoretical problems. To mitigate the risk and still solve the race condition this commit adds a mechanism to keep track of active runtime threads until they park on joinable thread list. The pending counter will be waited upon by the shutdown thread, just before it does its regular wait on all joinable threads (after finalizer thread has stopped) to make sure all runtime threads have been added to the joinable thread list before waiting upon them. Threads are added to the joinable thread as late as possible, exactly how it’s been done in the past by sgen_client_thread_detach_with_lock. Shutdown thread will wait on runtime threads to appear on the list for a short time and if timeout (pending runtime thread count not reaching 0 before timeout), it will just print a warning and continue shutdown. Getting into a wait state during shutdown due to runtime threads not yet added to joinable threads list should be very rare (hitting previous race condition that was rare), triggering the timeout should be even more rare, and if that ever happens, we are exposed to shutdown race condition as we have had in the past, but now we at least get a warning in the log making it simpler to analyze further. This commit also fixes a problem with the debugger thread hitting the same race condition as above. The shutdown thread stopping the debugger thread didn't completely wait for it to stop using runtime resources before continue shutdown sequence. This triggers the same race condition as when shutting down regular runtime threads. This commit makes sure stop_debugger_thread waits on the debugger thread handle to become signaled (happens at the very end of thread lifetime) before continuing the shutdown logic.
mono#5599 fixed a race condition during shutdown when runtime threads have come parts of their way through detach, but still depend on runtime resources, like GC memory. The fix added runtime threads to the joinable thread list just before they vanished from mono_thread_manage radar making sure shutdown waited upon the thread before cleaning up. The above fix slightly changed the behavior of the finalizer thread since it waits on joinable threads and will now potential block on threads still executing code (that involves runtime resources). There’s was an assumption around the threads on the joinable thread list that they should be very close to complete when added, so join calls coming from the finalizer thread should almost never block and if it does, the code that remains to execute should not involve runtime operations risking deadlock situations. Adding the thread to the list earlier than previously done expose the shutdown to some potential theoretical problems. To mitigate the risk and still solve the race condition this commit adds a mechanism to keep track of active runtime threads until they park on joinable thread list. The pending counter will be waited upon by the shutdown thread, just before it does its regular wait on all joinable threads (after finalizer thread has stopped) to make sure all runtime threads have been added to the joinable thread list before waiting upon them. Threads are added to the joinable thread as late as possible, exactly how it’s been done in the past by sgen_client_thread_detach_with_lock. Shutdown thread will wait on runtime threads to appear on the list for a short time and if timeout (pending runtime thread count not reaching 0 before timeout), it will just print a warning and continue shutdown. Getting into a wait state during shutdown due to runtime threads not yet added to joinable threads list should be very rare (hitting previous race condition that was rare), triggering the timeout should be even more rare, and if that ever happens, we are exposed to shutdown race condition as we have had in the past, but now we at least get a warning in the log making it simpler to analyze further. This commit also fixes a problem with the debugger thread hitting the same race condition as above. The shutdown thread stopping the debugger thread didn't completely wait for it to stop using runtime resources before continue shutdown sequence. This triggers the same race condition as when shutting down regular runtime threads. This commit makes sure stop_debugger_thread waits on the debugger thread handle to become signaled (happens at the very end of thread lifetime) before continuing the shutdown logic.
mono#5599 fixed a race condition during shutdown when runtime threads have come parts of their way through detach, but still depend on runtime resources, like GC memory. The fix added runtime threads to the joinable thread list just before they vanished from mono_thread_manage radar making sure shutdown waited upon the thread before cleaning up. The above fix slightly changed the behavior of the finalizer thread since it waits on joinable threads and will now potential block on threads still executing code (that involves runtime resources). There’s was an assumption around the threads on the joinable thread list that they should be very close to complete when added, so join calls coming from the finalizer thread should almost never block and if it does, the code that remains to execute should not involve runtime operations risking deadlock situations. Adding the thread to the list earlier than previously done expose the shutdown to some potential theoretical problems. To mitigate the risk and still solve the race condition this commit adds a mechanism to keep track of active runtime threads until they park on joinable thread list. The pending counter will be waited upon by the shutdown thread, just before it does its regular wait on all joinable threads (after finalizer thread has stopped) to make sure all runtime threads have been added to the joinable thread list before waiting upon them. Threads are added to the joinable thread as late as possible, exactly how it’s been done in the past by sgen_client_thread_detach_with_lock. Shutdown thread will wait on runtime threads to appear on the list for a short time and if timeout (pending runtime thread count not reaching 0 before timeout), it will just print a warning and continue shutdown. Getting into a wait state during shutdown due to runtime threads not yet added to joinable threads list should be very rare (hitting previous race condition that was rare), triggering the timeout should be even more rare, and if that ever happens, we are exposed to shutdown race condition as we have had in the past, but now we at least get a warning in the log making it simpler to analyze further. This commit also fixes a problem with the debugger thread hitting the same race condition as above. The shutdown thread stopping the debugger thread didn't completely wait for it to stop using runtime resources before continue shutdown sequence. This triggers the same race condition as when shutting down regular runtime threads. This commit makes sure stop_debugger_thread waits on the debugger thread handle to become signaled (happens at the very end of thread lifetime) before continuing the shutdown logic.
mono#5599 fixed a race condition during shutdown when runtime threads have come parts of their way through detach, but still depend on runtime resources, like GC memory. The fix added runtime threads to the joinable thread list just before they vanished from mono_thread_manage radar making sure shutdown waited upon the thread before cleaning up. The above fix slightly changed the behavior of the finalizer thread since it waits on joinable threads and will now potential block on threads still executing code (that involves runtime resources). There’s was an assumption around the threads on the joinable thread list that they should be very close to complete when added, so join calls coming from the finalizer thread should almost never block and if it does, the code that remains to execute should not involve runtime operations risking deadlock situations. Adding the thread to the list earlier than previously done expose the shutdown to some potential theoretical problems. To mitigate the risk and still solve the race condition this commit adds a mechanism to keep track of active runtime threads until they park on joinable thread list. The pending counter will be waited upon by the shutdown thread, just before it does its regular wait on all joinable threads (after finalizer thread has stopped) to make sure all runtime threads have been added to the joinable thread list before waiting upon them. Threads are added to the joinable thread as late as possible, exactly how it’s been done in the past by sgen_client_thread_detach_with_lock. Shutdown thread will wait on runtime threads to appear on the list for a short time and if timeout (pending runtime thread count not reaching 0 before timeout), it will just print a warning and continue shutdown. Getting into a wait state during shutdown due to runtime threads not yet added to joinable threads list should be very rare (hitting previous race condition that was rare), triggering the timeout should be even more rare, and if that ever happens, we are exposed to shutdown race condition as we have had in the past, but now we at least get a warning in the log making it simpler to analyze further. This commit also fixes a problem with the debugger thread hitting the same race condition as above. The shutdown thread stopping the debugger thread didn't completely wait for it to stop using runtime resources before continue shutdown sequence. This triggers the same race condition as when shutting down regular runtime threads. This commit makes sure stop_debugger_thread waits on the debugger thread handle to become signaled (happens at the very end of thread lifetime) before continuing the shutdown logic.
mono#5599 fixed a race condition during shutdown when runtime threads have come parts of their way through detach, but still depend on runtime resources, like GC memory. The fix added runtime threads to the joinable thread list just before they vanished from mono_thread_manage radar making sure shutdown waited upon the thread before cleaning up. The above fix slightly changed the behavior of the finalizer thread since it waits on joinable threads and will now potential block on threads still executing code (that involves runtime resources). There’s was an assumption around the threads on the joinable thread list that they should be very close to complete when added, so join calls coming from the finalizer thread should almost never block and if it does, the code that remains to execute should not involve runtime operations risking deadlock situations. Adding the thread to the list earlier than previously done expose the shutdown to some potential theoretical problems. To mitigate the risk and still solve the race condition this commit adds a mechanism to keep track of active runtime threads until they park on joinable thread list. The pending counter will be waited upon by the shutdown thread, just before it does its regular wait on all joinable threads (after finalizer thread has stopped) to make sure all runtime threads have been added to the joinable thread list before waiting upon them. Threads are added to the joinable thread as late as possible, exactly how it’s been done in the past by sgen_client_thread_detach_with_lock. Shutdown thread will wait on runtime threads to appear on the list for a short time and if timeout (pending runtime thread count not reaching 0 before timeout), it will just print a warning and continue shutdown. Getting into a wait state during shutdown due to runtime threads not yet added to joinable threads list should be very rare (hitting previous race condition that was rare), triggering the timeout should be even more rare, and if that ever happens, we are exposed to shutdown race condition as we have had in the past, but now we at least get a warning in the log making it simpler to analyze further. This commit also fixes a problem with the debugger thread hitting the same race condition as above. The shutdown thread stopping the debugger thread didn't completely wait for it to stop using runtime resources before continue shutdown sequence. This triggers the same race condition as when shutting down regular runtime threads. This commit makes sure stop_debugger_thread waits on the debugger thread handle to become signaled (happens at the very end of thread lifetime) before continuing the shutdown logic.
mono#5599 fixed a race condition during shutdown when runtime threads have come parts of their way through detach, but still depend on runtime resources, like GC memory. The fix added runtime threads to the joinable thread list just before they vanished from mono_thread_manage radar making sure shutdown waited upon the thread before cleaning up. The above fix slightly changed the behavior of the finalizer thread since it waits on joinable threads and will now potential block on threads still executing code (that involves runtime resources). There’s was an assumption around the threads on the joinable thread list that they should be very close to complete when added, so join calls coming from the finalizer thread should almost never block and if it does, the code that remains to execute should not involve runtime operations risking deadlock situations. Adding the thread to the list earlier than previously done expose the shutdown to some potential theoretical problems. To mitigate the risk and still solve the race condition this commit adds a mechanism to keep track of active runtime threads until they park on joinable thread list. The pending counter will be waited upon by the shutdown thread, just before it does its regular wait on all joinable threads (after finalizer thread has stopped) to make sure all runtime threads have been added to the joinable thread list before waiting upon them. Threads are added to the joinable thread as late as possible, exactly how it’s been done in the past by sgen_client_thread_detach_with_lock. Shutdown thread will wait on runtime threads to appear on the list for a short time and if timeout (pending runtime thread count not reaching 0 before timeout), it will just print a warning and continue shutdown. Getting into a wait state during shutdown due to runtime threads not yet added to joinable threads list should be very rare (hitting previous race condition that was rare), triggering the timeout should be even more rare, and if that ever happens, we are exposed to shutdown race condition as we have had in the past, but now we at least get a warning in the log making it simpler to analyze further. This commit also fixes a problem with the debugger thread hitting the same race condition as above. The shutdown thread stopping the debugger thread didn't completely wait for it to stop using runtime resources before continue shutdown sequence. This triggers the same race condition as when shutting down regular runtime threads. This commit makes sure stop_debugger_thread waits on the debugger thread handle to become signaled (happens at the very end of thread lifetime) before continuing the shutdown logic.
mono#5599 fixed a race condition during shutdown when runtime threads have come parts of their way through detach, but still depend on runtime resources, like GC memory. The fix added runtime threads to the joinable thread list just before they vanished from mono_thread_manage radar making sure shutdown waited upon the thread before cleaning up. The above fix slightly changed the behavior of the finalizer thread since it waits on joinable threads and will now potential block on threads still executing code (that involves runtime resources). There’s was an assumption around the threads on the joinable thread list that they should be very close to complete when added, so join calls coming from the finalizer thread should almost never block and if it does, the code that remains to execute should not involve runtime operations risking deadlock situations. Adding the thread to the list earlier than previously done expose the shutdown to some potential theoretical problems. To mitigate the risk and still solve the race condition this commit adds a mechanism to keep track of active runtime threads until they park on joinable thread list. The pending counter will be waited upon by the shutdown thread, just before it does its regular wait on all joinable threads (after finalizer thread has stopped) to make sure all runtime threads have been added to the joinable thread list before waiting upon them. Threads are added to the joinable thread as late as possible, exactly how it’s been done in the past by sgen_client_thread_detach_with_lock. Shutdown thread will wait on runtime threads to appear on the list for a short time and if timeout (pending runtime thread count not reaching 0 before timeout), it will just print a warning and continue shutdown. Getting into a wait state during shutdown due to runtime threads not yet added to joinable threads list should be very rare (hitting previous race condition that was rare), triggering the timeout should be even more rare, and if that ever happens, we are exposed to shutdown race condition as we have had in the past, but now we at least get a warning in the log making it simpler to analyze further. This commit also fixes a problem with the debugger thread hitting the same race condition as above. The shutdown thread stopping the debugger thread didn't completely wait for it to stop using runtime resources before continue shutdown sequence. This triggers the same race condition as when shutting down regular runtime threads. This commit makes sure stop_debugger_thread waits on the debugger thread handle to become signaled (happens at the very end of thread lifetime) before continuing the shutdown logic.
…nternal. (mono/mono#5599) * Add support for mono_threads_add_joinable_thread and friends on Windows. Current Windows implementation doesn't have support for the joinable thread implementation in threads.c. Joinable threads are used by the runtime during shutdown to coordinate with exit of platform threads. Since threads detaching and exiting during shutdown could still use or provide important resources, like GC memory, lacking support for this feature could potentially expose the implementation to various shutdown race conditions. Commit also change requested permission on a thread handle when opening it for synchronization purposes. No need to request all thread access when handle is only used for synchronization, like in WaitForXXX API calls. * Fix race between mono_thread_detach_internal and mono_thread_manage. There is a race during shutdown when main tread is running mono_thread_manage and attached threads are running mono_thread_detach_internal. The problem is related to the removal from the registered threads list since mono_thread_manage will pick threads to wait upon from that list. If a thread removes itself from the list using mono_thread_detach_internal it will then race against runtime shutdown and the removal of used resources still in used by mono_thread_detach_intenral, like MonoInternalThread. This race could trigger the assert seen in bugzilla: https://bugzilla.xamarin.com/show_bug.cgi?id=58770 or other undefined behavior depending on where the detached thread resume execution after main thread has shutdown the runtime, but not yet terminated the process. The fix makes sure threads that are detaching during shutdown ends up on the joinable threads list. Threads on that list will be waited upon by the main thread during shutdown, mono_thread_cleanup. This will make sure the detached thread finishes the remaining of mono_thread_detach_internal and potential other code still using GC memory during shutdown. Threads already on the threads list will already be waited upon by mono_thread_manage during shutdown (if not removed by mono_thread_detach_internal triggering this race), so this change won't add additional threads to be waited up by the runtime. It just makes the shutdown more reliable using already available concepts (like the joinable threads list). * Fix warning C4141 on Windows MSVC compiler. * Updated with review feedback. * Prevent race between finalizer and main thread to join same thread multiple times. Checking against joinable_threads list in mono_threads_add_joinable_thread is not enough. There is a small chance that the same runtime thread could get on the list twice since mono_threads_add_joinable_thread could be called multiple times for the the same tid. This will work under Windows but is undefined behavior on POSIX's platforms. This could happen if the finalizer thread calls mono_threads_join_threads when a thread is detaching and have added itself to the joinable_threads, but before unregister_thread has been called. Finalizer thread will then take tid from list and wait, this will cause the second call to mono_threads_add_joinable_thread to not find itself in the joinable_threads and add tid a second time. NOTE, this race only applies to runtime threads added to the joinable_threads list. Fix adds an additional method to add runtime threads to the list. This method will only add to list if its indeed a runtime thread and doesn’t already have an outstanding pending native join request. This method is called by code previously checking runtime_thread flag before adding to joinable_threads list. This will prevent the scenario when the same MonoThreadInfo add its tid to the joinable_threads list multiple times due to above race condition. Commit migrated from mono/mono@ed1884b
mono/mono#5599 fixed a race condition during shutdown when runtime threads have come parts of their way through detach, but still depend on runtime resources, like GC memory. The fix added runtime threads to the joinable thread list just before they vanished from mono_thread_manage radar making sure shutdown waited upon the thread before cleaning up. The above fix slightly changed the behavior of the finalizer thread since it waits on joinable threads and will now potential block on threads still executing code (that involves runtime resources). There’s was an assumption around the threads on the joinable thread list that they should be very close to complete when added, so join calls coming from the finalizer thread should almost never block and if it does, the code that remains to execute should not involve runtime operations risking deadlock situations. Adding the thread to the list earlier than previously done expose the shutdown to some potential theoretical problems. To mitigate the risk and still solve the race condition this commit adds a mechanism to keep track of active runtime threads until they park on joinable thread list. The pending counter will be waited upon by the shutdown thread, just before it does its regular wait on all joinable threads (after finalizer thread has stopped) to make sure all runtime threads have been added to the joinable thread list before waiting upon them. Threads are added to the joinable thread as late as possible, exactly how it’s been done in the past by sgen_client_thread_detach_with_lock. Shutdown thread will wait on runtime threads to appear on the list for a short time and if timeout (pending runtime thread count not reaching 0 before timeout), it will just print a warning and continue shutdown. Getting into a wait state during shutdown due to runtime threads not yet added to joinable threads list should be very rare (hitting previous race condition that was rare), triggering the timeout should be even more rare, and if that ever happens, we are exposed to shutdown race condition as we have had in the past, but now we at least get a warning in the log making it simpler to analyze further. This commit also fixes a problem with the debugger thread hitting the same race condition as above. The shutdown thread stopping the debugger thread didn't completely wait for it to stop using runtime resources before continue shutdown sequence. This triggers the same race condition as when shutting down regular runtime threads. This commit makes sure stop_debugger_thread waits on the debugger thread handle to become signaled (happens at the very end of thread lifetime) before continuing the shutdown logic. Commit migrated from mono/mono@be683c1
There is a race during shutdown when main tread is running mono_thread_manage and attached threads are running mono_thread_detach_internal. The problem is related to the removal from the registered threads list since mono_thread_manage will pick threads to wait upon from that list. If a thread removes itself from the list using mono_thread_detach_internal it will then race against runtime shutdown
and the removal of used resources still in used by mono_thread_detach_intenral, like MonoInternalThread.
This race could trigger the assert seen in Bugzilla (also includes repro steps):
https://bugzilla.xamarin.com/show_bug.cgi?id=58770
or other undefined behavior depending on where the detached thread resume execution after main thread has shutdown the runtime, but not yet terminated the process.
The fix makes sure threads that are detaching during shutdown ends up on the joinable threads list. Threads on that list will be waited upon by the main thread during shutdown, mono_thread_cleanup. This will make sure the detached thread finishes the remaining of mono_thread_detach_internal and potential other code still using GC memory during shutdown. Threads already on the threads list will already be waited upon by mono_thread_manage during shutdown (if not removed by mono_thread_detach_internal triggering this race), so this change won't add additional threads to be waited up by the runtime. It just makes the shutdown more reliable using already available concepts (like the joinable threads list).
Since above fix uses the joinable thread implementation in threads.c, not implemented on Windows platforms, PR also adds implementation for that missing feature.