GC denies service to all threads if any attached thread is in disk sleep. #7099
Comments
#7099 With cooperative GC, user code will either have to indicate it is entering a syscall, or we have to detect it? (I believe NT makes it easy to detect).
/cc @BrzVlad any suggested workaround?
@blucz You can try running with coop (MONO_ENABLE_COOP=1) to see if it fixes the issue. For a more recent mono (e.g. 5.8), we seem to support MONO_ENABLE_BLOCKING_TRANSITION, which should avoid the need to suspend these types of threads without the full usage of coop.
Thanks, that's very helpful. I can confirm that both MONO_ENABLE_COOP and MONO_ENABLE_BLOCKING_TRANSITION successfully work around the issue in the test program. We've been trying to turn on MONO_ENABLE_COOP for our application since Miguel's original post in Dec 2015, but it has been associated with stability issues for us (as recently as 5.4 on mac), so we have left it off. We can re-open that option and look again, of course. It seems more likely that
An update: In the real application, I tracked down at one place in mono so far that is missing the GC guards around an Pull request for that: #7126 Since making this change, I have not seen another long stop-the-world time. I am seeing a new crash or two that I think is related to enabling
@blucz - can we close out this issue?
Yes
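For reference, the two workarounds discussed in the comments above can be tried from a shell roughly as follows. This is only a sketch: the wrapper function names are hypothetical, `MONO_ENABLE_COOP=1` is taken from the comment above, and the value `1` for `MONO_ENABLE_BLOCKING_TRANSITION` is an assumption (the thread only names the variable).

```shell
#!/bin/sh
# Hypothetical wrappers that run a command with one of the Mono
# environment variables from this thread set for that command only.

# Full cooperative suspend (value 1 as given in the comment above).
with_coop() {
    env MONO_ENABLE_COOP=1 "$@"
}

# Blocking transitions only; the value 1 is an assumption.
with_blocking_transition() {
    env MONO_ENABLE_BLOCKING_TRANSITION=1 "$@"
}
```

Usage would look like `with_blocking_transition mono uninterruptible_sleep_dos.exe`, where the executable name is assumed to be the compiled form of the sample program.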
Overview
On Linux, the Mono GC propagates the effects of uninterruptible sleep to threads that are not involved in file I/O. As a result, it is unsafe to use Mono to access files on any network filesystem on Linux, because uninterruptible sleep of various durations happens constantly.
This is manifesting frequently in our media server software when music libraries are located on SMB shares. If the SMB server is ever less than perfectly responsive, we start seeing massive "stop the world" times that hang every thread in the process--sometimes for a minute or longer--even if only a single thread is actually in the uninterruptible state.
We have observed this in "real world" situations many times. The steps in this ticket, while synthetic, are meant to make it easy to produce instantly and repeatably. We have done sufficient investigation to be confident that if the bug demonstrated by this synthetic procedure is fixed, it will also fix the "real world" manifestations of the issue.
The problem here is not the I/O slowness itself--that is not atypical for networked filesystems on home networks--it is that slow I/O happening in one thread causes denial-of-service to other threads not involved with the slow filesystem.
To make this easier for you guys to look at:
Steps to Reproduce
For this procedure, you'll need two machines: one Mac and one running Linux.
The Mac's role is to be the "slow network share", since it's pretty easy to set that up on OS X. You could replace it with a different platform, provided a similar network slowdown mechanism is in place.
The linux machine is where you will actually run Mono, mount the slow network share, and see the bug.
1. On the mac, share a folder. In my case, I am sharing a folder with about 30,000 music files.
2. On the mac, install Apple's "Network Link Conditioner" and make a profile like this:
See here for information: https://stackoverflow.com/questions/9659382/installing-apples-network-link-conditioner-tool
3. On the linux machine, mount the network share
In this example, `//192.168.1.32/music` is the share on my mac and `brian` is my username. This will differ for you. (If you are missing `mount.cifs`, install `cifs-utils`.)
4. On the linux machine, run the sample program
The code is here (had to add a `.txt` extension to satisfy github..sorry): uninterruptible_sleep_dos.cs.txt
Run like this:
It should print out stuff like this:
The thread printing those messages does no I/O other than to the console:
Thus, there should be an "alive last timediff" print roughly every ~500ms.
In our actual process there are usually 50+ threads and they invoke the GC "naturally", and never via `GC.Collect()`, but since this is a small program that is not generating much garbage, this is an easier way to show the bug.
Leave the program running.
5. On the mac, select the profile that you created earlier and enable the network link conditioner:
6. On the linux machine, check the trace
...and imagine what this does to a busy multi-threaded application when it happens in production.
Current Behavior
If any thread ends up in uninterruptible sleep, all runtime-attached threads hang as soon as the GC is invoked.
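To correlate the hangs with uninterruptible sleep, it can help to watch how many threads of the process are in state `D` at a given moment. The following is a minimal sketch, assuming a Linux `/proc` filesystem; `count_dstate` is a hypothetical helper name, not something shipped with Mono.

```shell
#!/bin/sh
# Hypothetical helper: print how many threads of the given PID are
# currently in uninterruptible sleep (state "D" in /proc/.../status).
count_dstate() {
    pid="$1"
    count=0
    for status in /proc/"$pid"/task/*/status; do
        # Skip if the glob did not match or the thread already exited.
        [ -r "$status" ] || continue
        state="$(awk '/^State:/ { print $2 }' "$status" 2>/dev/null)"
        if [ "$state" = "D" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}
```

Calling it once a second, for example `while sleep 1; do count_dstate "$(pgrep -n mono)"; done` (assuming the newest `mono` process is the one under test), should show a non-zero count around the time the "alive" messages stall.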
This animation demonstrates how quickly things go to sh*t when the network performance is decreased:
Expected Behavior
Slow I/O only hangs the threads where slow I/O is happening, and does not deny service to other threads.
On which platforms did you notice this
[ ] macOS
[ x ] Linux
[ ] Windows
Version Used:
Additional Information
When the I/O thread is hung, it can be difficult to use a debugger, since `ptrace` doesn't work. It can be useful to confirm the situation by looking at the kernel stacks. This brief script helps with that.
Example Stacks
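As a rough sketch of how per-thread kernel stacks can be collected on Linux, the function below walks `/proc/<pid>/task` and prints each thread's state, name, and (when readable) kernel stack. It is not the script from the original report; `dump_threads` is a hypothetical name, and reading `/proc/<pid>/task/<tid>/stack` is assumed to require root and a kernel built with stack tracing.

```shell
#!/bin/sh
# Hypothetical helper: list every thread of a process with its scheduler
# state and thread name, followed by its kernel stack when readable.
dump_threads() {
    pid="$1"
    for task in /proc/"$pid"/task/*; do
        [ -d "$task" ] || continue
        tid="${task##*/}"
        # State is a single letter; "D" means uninterruptible (disk) sleep.
        state="$(awk '/^State:/ { print $2 }' "$task/status" 2>/dev/null)"
        name="$(cat "$task/comm" 2>/dev/null)"
        printf 'tid=%s state=%s name=%s\n' "$tid" "$state" "$name"
        # The kernel stack is usually readable by root only.
        if [ -r "$task/stack" ]; then
            sed 's/^/    /' "$task/stack" 2>/dev/null
        fi
    done
}

# Run directly as: sh script.sh <pid>
if [ -n "${1:-}" ]; then
    dump_threads "$1"
fi
```

Run as root against the hung process, the thread stuck on the network share should show `state=D`; its stack would be expected to contain filesystem or network wait frames.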
Conclusion
This is obviously a pretty heavy problem. Uninterruptible sleep (as an OS behavior) is pretty rude. The GC has a real need to interact with all attached threads. I know that we could fix this at the application level by creating a new I/O layer that makes syscalls totally outside of mono-attached threads, but that is a costly approach. Maybe there is some hope in the cooperative GC stuff.
Open to any thoughts about other workarounds, too, and happy to help in any way that I can..