
GC denies service to all threads if any attached thread is in disk sleep. #7099

Closed
blucz opened this issue Feb 17, 2018 · 8 comments

@blucz (Contributor) commented Feb 17, 2018

Overview

On Linux, the Mono GC is propagating the effects of uninterruptible sleep to threads that are not involved in file I/O. As a result, it is unsafe to use Mono to access files on any network filesystem on Linux--because uninterruptible sleep of various durations is happening constantly.

This is manifesting frequently in our media server software when music libraries are located on SMB shares. If the SMB server is ever less than perfectly responsive, we start seeing massive "stop the world" times that hang every thread in the process--sometimes for a minute or longer--even if only a single thread is actually in the uninterruptible state.

We have observed this in "real world" situations many times. The steps in this ticket, while synthetic, are meant to make it easy to produce instantly and repeatably. We have done sufficient investigation to be confident that if the bug demonstrated by this synthetic procedure is fixed, it will also fix the "real world" manifestations of the issue.

The problem here is not the I/O slowness itself--that is not atypical for networked filesystems on home networks--it is that slow I/O happening in one thread causes denial-of-service to other threads not involved with the slow filesystem.

To make this easier for you guys to look at:

  • I've included a procedure for synthetically creating this situation
  • I've created a very short test program which demonstrates the issue.

Steps to Reproduce

For this procedure, you'll need two machines: one Mac and one running Linux.

The Mac's role is to be the "slow network share", since it's pretty easy to set that up on OS X. You could replace it with a different platform, provided a similar network slowdown mechanism is in place.

The Linux machine is where you will actually run Mono, mount the slow network share, and observe the bug.

1. On the Mac, share a folder. In my case, I am sharing a folder with about 30,000 music files.

[screenshot: macOS sharing settings for the music folder]

2. On the Mac, install Apple's "Network Link Conditioner" and make a profile like this:

See here for information: https://stackoverflow.com/questions/9659382/installing-apples-network-link-conditioner-tool

[screenshot: Network Link Conditioner profile]

3. On the Linux machine, mount the network share

In this example, //192.168.1.32/music is the share on my mac and brian is my username. This will differ for you.

$ sudo mkdir /mnt/test
$ PASSWD="*****"  sudo /sbin/mount.cifs "//192.168.1.32/music" "/mnt/test" -o guest,nounix,iocharset=utf8,user="brian",domain="WORKGROUP",vers=2.1

(if you are missing mount.cifs, install cifs-utils)

4. On the Linux machine, run the sample program

The code is here (I had to add a .txt extension to satisfy GitHub, sorry):

uninterruptible_sleep_dos.cs.txt

Run like this:

$ export MONO_LOG_LEVEL=debug
$ export MONO_LOG_MASK=gc
$ mcs uninterruptible_sleep_dos.cs  && mono uninterruptible_sleep_dos.exe /mnt/test

It should print out stuff like this:

Mono: GC_MAJOR: (user request) time 0.20ms, stw 1.98ms los size: 1024K in use: 64K
Mono: GC_MAJOR_SWEEP: major size: 864K in use: 166K
[2/17/2018 1:25:15 PM] alive last timediff=502ms
Mono: GC_MAJOR: (user request) time 0.20ms, stw 1.74ms los size: 1024K in use: 64K
Mono: GC_MAJOR_SWEEP: major size: 864K in use: 166K
[2/17/2018 1:25:15 PM] alive last timediff=501ms
Mono: GC_MAJOR: (user request) time 0.20ms, stw 0.93ms los size: 1024K in use: 64K
Mono: GC_MAJOR_SWEEP: major size: 864K in use: 166K
[2/17/2018 1:25:16 PM] alive last timediff=501ms
Mono: GC_MAJOR: (user request) time 0.20ms, stw 2.76ms los size: 1024K in use: 64K
Mono: GC_MAJOR_SWEEP: major size: 864K in use: 164K
[2/17/2018 1:25:16 PM] alive last timediff=502ms
Mono: GC_MAJOR: (user request) time 0.43ms, stw 1.33ms los size: 1024K in use: 64K
Mono: GC_MAJOR_SWEEP: major size: 864K in use: 164K
[2/17/2018 1:25:17 PM] alive last timediff=501ms

The thread printing those messages does no I/O other than to the console:

            var sw = System.Diagnostics.Stopwatch.StartNew();
            while (true) {
                Console.WriteLine($"[{DateTime.Now}] alive last timediff={sw.ElapsedMilliseconds}ms");
                sw.Restart();
                Thread.Sleep(500);
                GC.Collect();           // trigger GCs often.
            }

Thus, there should be an "alive last timediff" print roughly every 500ms.

In our actual process there are usually 50+ threads, and they trigger the GC naturally rather than via GC.Collect(); but since this small program generates little garbage, forcing collections is an easier way to show the bug.
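
For reference, here is a minimal self-contained sketch of a program along these lines. This is a hypothetical reconstruction for readers who skip the attachment; uninterruptible_sleep_dos.cs.txt above is authoritative, and the class name here is made up:

    using System;
    using System.IO;
    using System.Threading;

    class UninterruptibleSleepDos
    {
        static void Main(string[] args)
        {
            string root = args[0];  // e.g. /mnt/test, the CIFS mount

            // I/O thread: continuously walks and reads files on the share.
            // When the share stalls, reads on this thread enter
            // uninterruptible sleep inside the cifs module.
            new Thread(() => {
                while (true) {
                    foreach (var f in Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories)) {
                        try { using (var s = File.OpenRead(f)) s.ReadByte(); }
                        catch (IOException) { /* ignore transient share errors */ }
                    }
                }
            }) { IsBackground = true }.Start();

            // Heartbeat thread (the loop shown above): no file I/O, but it
            // forces a collection every ~500ms.
            var sw = System.Diagnostics.Stopwatch.StartNew();
            while (true) {
                Console.WriteLine($"[{DateTime.Now}] alive last timediff={sw.ElapsedMilliseconds}ms");
                sw.Restart();
                Thread.Sleep(500);
                GC.Collect();           // trigger GCs often.
            }
        }
    }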

Leave the program running

5. On the Mac, select the profile that you created earlier and enable the Network Link Conditioner:

[screenshot: Network Link Conditioner enabled with the new profile]

6. On the Linux machine, check the trace

Mono: GC_MAJOR: (user request) time 0.39ms, stw 21437.06ms los size: 1024K in use: 64K
Mono: GC_MAJOR_SWEEP: major size: 784K in use: 85K
[2/17/2018 1:30:42 PM] alive last timediff=21937ms
Mono: GC_MAJOR: (user request) time 0.38ms, stw 4550.65ms los size: 1024K in use: 64K
Mono: GC_MAJOR_SWEEP: major size: 784K in use: 85K
[2/17/2018 1:30:48 PM] alive last timediff=5050ms
Mono: GC_MAJOR: (user request) time 0.37ms, stw 2533.86ms los size: 1024K in use: 64K
Mono: GC_MAJOR_SWEEP: major size: 784K in use: 85K
[2/17/2018 1:30:51 PM] alive last timediff=3034ms
Mono: GC_MAJOR: (user request) time 0.17ms, stw 2540.17ms los size: 1024K in use: 64K
Mono: GC_MAJOR_SWEEP: major size: 784K in use: 85K
[2/17/2018 1:30:54 PM] alive last timediff=3040ms
Mono: GC_MAJOR: (user request) time 0.37ms, stw 2527.87ms los size: 1024K in use: 64K
Mono: GC_MAJOR_SWEEP: major size: 784K in use: 85K
[2/17/2018 1:30:57 PM] alive last timediff=3028ms
Mono: GC_MAJOR: (user request) time 0.37ms, stw 11317.64ms los size: 1024K in use: 64K
Mono: GC_MAJOR_SWEEP: major size: 784K in use: 85K
[2/17/2018 1:31:08 PM] alive last timediff=11817ms
Mono: GC_MAJOR: (user request) time 0.36ms, stw 6535.88ms los size: 1024K in use: 64K
Mono: GC_MAJOR_SWEEP: major size: 784K in use: 85K
[2/17/2018 1:31:15 PM] alive last timediff=7036ms

...and imagine what this does to a busy multi-threaded application when it happens in production.

Current Behavior

If any thread ends up in uninterruptible sleep, all runtime-attached threads hang as soon as the GC is invoked.

This animation demonstrates how quickly things go to sh*t when the network performance is decreased:

[animation: every thread in the process hangs once the link conditioner is enabled]

Expected Behavior

Slow I/O only hangs the threads where slow I/O is happening, and does not deny service to other threads.

On which platforms did you notice this

[ ] macOS
[x] Linux
[ ] Windows

Version Used:

Mono JIT compiler version 5.4.1.7 (tarball Wed Jan 17 20:26:34 UTC 2018)
Copyright (C) 2002-2014 Novell, Inc, Xamarin Inc and Contributors. www.mono-project.com
TLS:           __thread
SIGSEGV:       altstack
Notifications: epoll
Architecture:  amd64
Disabled:      none
Misc:          softdebug
LLVM:          supported, not enabled.
GC:            sgen (concurrent by default)

Additional Information

When the I/O thread is hung, it can be difficult to use a debugger, since ptrace cannot interrupt a thread in uninterruptible sleep. It can be useful to confirm the situation by looking at the kernel stacks. This brief script helps with that.

#!/bin/sh
# For each thread of the test process, check /proc/<pid>/task/<tid>/status
# for the uninterruptible ("disk sleep") state and dump its kernel stack.
cd /proc/`pgrep -f uninterruptible_sleep_dos.exe`/task
for i in *; do
    if grep 'disk sleep' "$i/status" >/dev/null ; then
        echo "---[ Thread $i is in Disk Sleep @ ]-------------"
        sudo cat "$i/stack"
    fi
done

Example Stacks

brian@nucbuntu ~ $ sh t.sh
---[ Thread 6682 is in Disk Sleep @ ]-------------
[<ffffffffc0c998c6>] cifs_user_readv+0x186/0x370 [cifs]
[<ffffffffc0c99b9b>] cifs_strict_readv+0xeb/0x100 [cifs]
[<ffffffff81211134>] new_sync_read+0x94/0xd0
[<ffffffff81211196>] __vfs_read+0x26/0x40
[<ffffffff81211756>] vfs_read+0x86/0x130
[<ffffffff812124a5>] SyS_read+0x55/0xc0
[<ffffffff818457ad>] entry_SYSCALL_64_fastpath+0x2b/0xe7
[<ffffffffffffffff>] 0xffffffffffffffff
brian@nucbuntu ~ $ sh t.sh
---[ Thread 6682 is in Disk Sleep @ ]-------------
[<ffffffffc0ca2d7f>] wait_for_response.isra.7+0xbf/0x110 [cifs]
[<ffffffffc0ca3ccb>] cifs_send_recv+0xfb/0x370 [cifs]
[<ffffffffc0ca4014>] SendReceive2+0xd4/0x110 [cifs]
[<ffffffffc0cbb653>] query_info+0x103/0x200 [cifs]
[<ffffffffc0cbcf5b>] SMB2_query_info+0x2b/0x30 [cifs]
[<ffffffffc0cbf33a>] smb2_open_op_close+0x1ba/0x210 [cifs]
[<ffffffffc0cbf493>] smb2_query_path_info+0x83/0x180 [cifs]
[<ffffffffc0c9c220>] cifs_get_inode_info+0x390/0x8f0 [cifs]
[<ffffffffc0c9e7c5>] cifs_revalidate_dentry_attr+0x1d5/0x250 [cifs]
[<ffffffffc0c9e883>] cifs_revalidate_dentry+0x13/0x30 [cifs]
[<ffffffffc0c8ef37>] cifs_d_revalidate+0x27/0xb0 [cifs]
[<ffffffff8121c72f>] lookup_fast+0x24f/0x330
[<ffffffff8121ddf9>] walk_component+0x49/0x300
[<ffffffff8121f421>] link_path_walk+0x191/0x5b0
[<ffffffff8121f93c>] path_lookupat+0x7c/0x110
[<ffffffff81221561>] filename_lookup+0xb1/0x180
[<ffffffff81221706>] user_path_at_empty+0x36/0x40
[<ffffffff81216696>] vfs_fstatat+0x66/0xc0
[<ffffffff81216bee>] SYSC_newstat+0x2e/0x60
[<ffffffff81216d7e>] SyS_newstat+0xe/0x10
[<ffffffff818457ad>] entry_SYSCALL_64_fastpath+0x2b/0xe7
[<ffffffffffffffff>] 0xffffffffffffffff

Conclusion

This is obviously a pretty heavy problem. Uninterruptible sleep (as an OS behavior) is pretty rude. The GC has a real need to interact with all attached threads. I know that we could fix this at the application level by creating a new I/O layer that makes syscalls totally outside of mono-attached threads, but that is a costly approach. Maybe there is some hope in the cooperative GC stuff.
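
To illustrate that application-level option (a hypothetical sketch, not something we have built: proxy reads through a short-lived helper process so the blocking syscall happens outside any mono-attached thread):

    using System;
    using System.Diagnostics;
    using System.IO;

    static class OutOfProcessIo
    {
        // Read a file via `cat` in a child process. If the share hangs,
        // the *child* blocks in uninterruptible sleep; the mono-attached
        // parent thread only blocks on a pipe read, which is interruptible
        // and therefore suspendable by the GC.
        public static byte[] ReadAllBytes(string path)
        {
            var psi = new ProcessStartInfo("cat", "\"" + path + "\"") {
                RedirectStandardOutput = true,
                UseShellExecute = false
            };
            using (var p = Process.Start(psi))
            using (var ms = new MemoryStream()) {
                p.StandardOutput.BaseStream.CopyTo(ms);
                p.WaitForExit();
                return ms.ToArray();
            }
        }
    }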

Open to any thoughts about other workarounds, too, and happy to help in any way that I can.

@jaykrell (Contributor) commented

Do threads in syscalls not suspend?
A degenerate test case would just be sleep(infinite).

With cooperative GC, user code will either have to indicate it is entering a syscall, or we have to detect it? (I believe NT makes it easy to detect).

@marek-safar (Member) commented

/cc @BrzVlad any suggested workaround?

@BrzVlad (Member) commented Feb 19, 2018

@blucz You can try running with coop (MONO_ENABLE_COOP=1) to see if it fixes this.

On a more recent mono (e.g. 5.8), we also support MONO_ENABLE_BLOCKING_TRANSITION, which should avoid the need to suspend these kinds of threads without fully enabling coop.
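
For example, with the repro above (assuming the variables just need to be set; the exact accepted values may differ):

$ MONO_ENABLE_COOP=1 mono uninterruptible_sleep_dos.exe /mnt/test
$ MONO_ENABLE_BLOCKING_TRANSITION=1 mono uninterruptible_sleep_dos.exe /mnt/test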

@blucz (Contributor, Author) commented Feb 19, 2018

Thanks, that's very helpful. I can confirm that both MONO_ENABLE_COOP and MONO_ENABLE_BLOCKING_TRANSITION successfully work around the issue in the test program.

We've been trying to turn on MONO_ENABLE_COOP for our application since Miguel's original post in Dec 2015, but it has been associated with stability issues for us (as recently as 5.4 on mac), so we have left it off. We can re-open that option and look again, of course.

It seems more likely that MONO_ENABLE_BLOCKING_TRANSITION is something that we can roll out rapidly. Do you have a gut feeling about how risky it is? I read the patch, and I understand what it's doing. What I don't have a feel for is how reliant it is on lots of different people successfully getting subtle details right at transition points.

@BrzVlad (Member) commented Feb 19, 2018

@blucz For a start, it's not something that I'm aware of being tested anywhere. Still, I don't see any risk to it from the GC's perspective. I am not quite sure about all the coop pieces that come into play. Maybe @luhenry can also share his opinion.

@blucz (Contributor, Author) commented Feb 19, 2018

An update: in the real application, MONO_ENABLE_BLOCKING_TRANSITION makes a night-and-day difference, but it does not completely solve the problem.

So far I have tracked down one place in mono that is missing the GC guards around an access(2) call; without the guards, the thread stays GC-unsafe while it blocks in the kernel, so the collector still waits on it.

Pull request for that: #7126

Since making this change, I have not seen another long stop-the-world time.

I am seeing a new crash or two that I think are related to enabling MONO_ENABLE_BLOCKING_TRANSITION, or possibly just a new regression between 5.4 and 5.8/master. They are intermittent, and I am still tracking them down. Aside from that, this seems like a promising workaround.

@SamMonoRT (Contributor) commented

@blucz - can we close out this issue?

@blucz (Contributor, Author) commented Jun 6, 2019

Yes

blucz closed this as completed Jun 6, 2019