Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: background scavenger can delay deadlock detection significantly #31966

Closed
bcmills opened this issue May 10, 2019 · 14 comments

Comments

Projects
None yet
6 participants
@bcmills
Copy link
Member

commented May 10, 2019

First failure on the dashboard is https://build.golang.org/log/b38cacf4803d9fa1c15e29e74600cf974369d26b, at CL 142960 (#30333).

The failure mode seems consistent.

CC @aclements @mwhudson @agnivade

--- FAIL: TestGoexitCrash (60.03s)
    crash_test.go:95: testprog GoexitExit exit status: exit status 2
    crash_test.go:258: output:
        SIGQUIT: quit
        PC=0x900000000216b38 m=1 sigcode=0
        
        goroutine 0 [idle]:
        runtime.sem_timedwait(...)
        	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/runtime/os2_aix.go:560
        runtime.semasleep(0xdf8475800, 0x0)
        	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/runtime/os_aix.go:65 +0x148
        runtime.notetsleep_internal(0x11001e430, 0xdf8475800, 0x0, 0x159d2af229f3dad8, 0xa00010000000480)
        	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/runtime/lock_sema.go:218 +0x198
        runtime.notetsleep(0x11001e430, 0xdf8475800, 0x1000343a0)
        	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/runtime/lock_sema.go:269 +0x64
        runtime.sysmon()
        	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/runtime/proc.go:4317 +0x4e4
        runtime.mstart1()
        	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/runtime/proc.go:1201 +0xf4
        runtime.mstart()
        	/ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/src/runtime/proc.go:1167 +0x5c
        
        r0   0xffffffffffffffff	r1   0x1101602d0
        r2   0xffffffffffffffff	r3   0xffffffffffffffff
        r4   0xffffffffffffffff	r5   0xffffffffffffffff
        r6   0xffffffffffffffff	r7   0xffffffffffffffff
        r8   0xffffffffffffffff	r9   0xffffffffffffffff
        r10  0xffffffffffffffff	r11  0xffffffffffffffff
        r12  0xffffffffffffffff	r13  0x110168800
        r14  0x0	r15  0x0
        r16  0x0	r17  0x0
        r18  0x0	r19  0x0
        r20  0x0	r21  0x0
        r22  0x0	r23  0x0
        r24  0x0	r25  0x0
        r26  0x110160340	r27  0x0
        r28  0x0	r29  0x1101605f0
        r30  0x0	r31  0x110581410
        pc   0x900000000216b38	ctr  0xffffffff00000000
        link 0xffffffffffffffff	xer  0xffffffff
        ccr  0x0	trap 0x0
        
        
        want output containing: no goroutines (main called runtime.Goexit) - deadlock!
FAIL
FAIL	runtime	160.362s
@mknyszek

This comment has been minimized.

Copy link
Contributor

commented May 10, 2019

git bisect points to CL 174657 (CC @cherrymui @aclements).

I don't know if that's related, since @cherrymui pointed out to me that it should produce identical binaries (though different object files).

I think it is my change: I introduced another runtime goroutine in the offending change (on the dashboard) and I suspect that it never quite parks, and instead falls into its exponential back-off loop (it thinks there's work to do, but when it goes to do it none is there). This can happen if heap_sys-heap_released > 1.1 * next_gc but there are no free and unscavenged spans left.

I think the right thing to do is to bound the exponential backoff loop at a reasonable time (e.g. 10 ms) and if there's no work to do after waiting all that time just park. Because new spans only get freed when a goroutine exits (a stack is freed) or when spans are getting swept, if we approach the end of a GC cycle with nothing to do, it's likely it'll stay that way.

But, this is assuming that's what the problem is. If that really is the issue it's surprising to me that this has only ever been observed on AIX and not elsewhere, but I'll dig into it now on a gomote and check my hypothesis.

@bcmills

This comment has been minimized.

Copy link
Member Author

commented May 10, 2019

I don't know if that's related

Yeah, I had commented on the wrong issue. CL 142960 is almost certainly the culprit for this one.

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented May 10, 2019

@aclements

This comment has been minimized.

Copy link
Member

commented May 10, 2019

The deadlock detector isn't supposed to consider system goroutines (those that were started by the runtime), but I don't think any others have this back-off behavior, so maybe that doesn't work quite right, or maybe the logic for detecting system goroutines isn't quite right on AIX.

Shouldn't the current logic eventually always park? Or are there cases where it can't achieve the scavenge goal and will just keep backing off (until the next GC)?

@mknyszek

This comment has been minimized.

Copy link
Contributor

commented May 10, 2019

OK running this test with GODEBUG=gctrace=1 in a gomote makes it very clear as to what's going on.

The arena size on AIX is 256 MiB, and since the heap size is ~0 for this test the scavenger sees aaaaaall this work to do. But, AIX (as far as the runtime knows) doesn't have huge pages, so it's doing its work at the slow pace of 1 physical page per ms, and it takes it some time to do so.

This test, however, is looking for a deadlock, but the scavenger doesn't park until its done with its work. It does eventually park, given enough time, and the runtime finally notices the deadlock.

I'm not sure what the right solution is here. Perhaps we just... have the scavenger notice when its the only goroutine left and park at the first opportunity? If it goes into a long sleep though we kind of a have a problem. Something needs to wake it up.

Maybe the solution is much simpler. Maybe when it sleeps it shouldn't park with a "sleep" wait reason and instead should park with something else, so the runtime notices the deadlock.

WDYT? @aclements

@mknyszek

This comment has been minimized.

Copy link
Contributor

commented May 10, 2019

For the record, the wait reason has nothing do with deadlock detection.

So turns out this is a bigger problem with how the background scavenger and the deadlock detection interact. Changing deadlock detection to handle a special case isn't really on the table because of how subtle that code is.

The fundamental problem (thanks @aclements for getting me here) is that sleeping on the regular event-driven timer means its really difficult for deadlock detection to notice that the runtime thinks there's no more work to do.

We can fix this by just changing how the scavenger sleeps.

@rsc suggested just having sysmon kick the scavenger awake, and this works. My concern is that it could wake up and do up to 10x more work than it needs (1 page / ms, max sleep time of sysmon is 10 ms). For other situations this wouldn't be so bad, except the scavenger must grab the heap lock, which means it could really hurt the tail latency of anything that grabs the heap lock (notably allocation in the slow path). Furthermore (and this isn't really a con, just a note) it's slightly more complex than just sysmon kicking it awake; if sysmon goes into its "P's are idle" long sleep then it needs to make sure it can always wake the scavenger if it goes to sleep. This means that the scavenger must wake sysmon before it goes to sleep.

@aclements suggested to instead use notetsleepg and make that the primary method through which the background scavenger parks and sleeps. The only difference (that would allow the deadlock detection mechanism to function correctly) would be to increment/decrement sched.nmsys before it goes to sleep and after it wakes up, so that the background scavenger's sleeping m isn't counted against deadlock detection. The downside to this is now we could have one more m in any given application than we otherwise would have. The additional state is not great, but the lack of interaction with sysmon and the ability to sleep for exact time periods is a boon.

I've implemented the former and I'll now be implementing the latter. I'll update this bug with what we end up going with. I'll also update the bug's title to better reflect what's really going on.

@mknyszek mknyszek changed the title runtime: TestGoexitCrash broken on AIX runtime: background scavenger can delay deadlock detection significantly May 10, 2019

@mknyszek mknyszek added release-blocker and removed OS-AIX labels May 10, 2019

@Helflym

This comment has been minimized.

Copy link
Contributor

commented May 13, 2019

The arena size on AIX is 256 MiB, and since the heap size is ~0 for this test the scavenger sees aaaaaall this work to do. But, AIX (as far as the runtime knows) doesn't have huge pages, so it's doing its work at the slow pace of 1 physical page per ms, and it takes it some time to do so.

Does all other OSes have huge pages ? Or the problem mainly comes from the arena size ? AIX isn't yet fully optimized. So it might be possible to hasten this part of the runtime.

@mknyszek

This comment has been minimized.

Copy link
Contributor

commented May 13, 2019

The arena size on AIX is 256 MiB, and since the heap size is ~0 for this test the scavenger sees aaaaaall this work to do. But, AIX (as far as the runtime knows) doesn't have huge pages, so it's doing its work at the slow pace of 1 physical page per ms, and it takes it some time to do so.

Does all other OSes have huge pages ? Or the problem mainly comes from the arena size ? AIX isn't yet fully optimized. So it might be possible to hasten this part of the runtime.

The more fundamental problem is that this goroutine can block deadlock detection for a (probably bounded?) large amount of time, so it's definitely not specific to AIX. It just so happens to manifest as a failure on the trybots on AIX because arena sizes are big and there isn't any huge page awareness built into the runtime for that platform. If AIX has an equivalent of transparent huge pages (like Linux or FreeBSD) then we can at least fix the trybots in the super-short term, but the more fundamental problem still needs to be fixed ASAP.

@mknyszek

This comment has been minimized.

Copy link
Contributor

commented May 13, 2019

The more fundamental problem is that this goroutine can block deadlock detection for a (probably bounded?) large amount of time, so it's definitely not specific to AIX. It just so happens to manifest as a failure on the trybots on AIX because arena sizes are big and there isn't any huge page awareness built into the runtime for that platform. If AIX has an equivalent of transparent huge pages (like Linux or FreeBSD) then we can at least fix the trybots in the super-short term, but the more fundamental problem still needs to be fixed ASAP.

After discussing with @aclements this issue is more broad. Technically the background sweeper and the GC could also cause delays in deadlock detection, but in most cases it's not a problem. The background scavenger just really brings this to the spotlight because of how slowly it is paced by default.

Also, the issue appearing on AIX (specifically because of the runtime's large arena size on the platform) brings to mind an accounting problem: the background scavenger technically shouldn't even need to scavenge that 256 MiB arena because those pages haven't been faulted in yet. This is a result of the fact that newly-mapped heap memory is not marked as scavenged. We can change it such that it is marked as scavenged, which should make the problem go away on the trybots specifically (and should avoid any start-up time penalties caused by the scavenger doing unnecessary work). I opened #32012 to track this.

But, we should still make it so that the scavenger may be considered "dead" at the next sleep opportunity to significantly reduce the time that the scavenger could delay deadlock detection. I have a working implementation based on @aclements idea with notes (I ended up not using nmsys because of accounting issues, but with lockOSThread we can make this work) and I'll put it up shortly.

@gopherbot

This comment has been minimized.

Copy link

commented May 13, 2019

Change https://golang.org/cl/176957 mentions this issue: runtime: count the scavenger's M as idle when sleeping

@bcmills

This comment has been minimized.

Copy link
Member Author

commented May 14, 2019

it's definitely not specific to AIX

Just to confirm that empirically, here's a failure on the linux-ppc64le-power9osu builder:
https://build.golang.org/log/5b237da5c942cb3db8240d94cb73cf4e8adc10c9

@mknyszek

This comment has been minimized.

Copy link
Contributor

commented May 14, 2019

So it's probably best to not fix this with my hack in golang.org/cl/176957. @ianlancetaylor noted offline that the deadlock detector already doesn't work for many people in many situations already, so it doesn't make as much sense to bend over backwards to make it not get delayed by things like sweeping, GC, or the scavenger.

With that said, we can solve the problem on the trybots by just fixing #32012, which should also fix the small heap use-case since the scavenger should no longer be doing much work for short-lived programs. So I'm going to fix that instead (which is a much simpler fix, and also a good one to have overall) and close up this bug. I've confirmed the fix lowers the time for the AIX builder (among other builders).

@gopherbot

This comment has been minimized.

Copy link

commented May 14, 2019

Change https://golang.org/cl/177097 mentions this issue: runtime: mark newly-mapped memory as scavenged

@gopherbot

This comment has been minimized.

Copy link

commented May 16, 2019

Change https://golang.org/cl/177557 mentions this issue: runtime: disable TestGoexitCrash

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.