runtime: background scavenger can delay deadlock detection significantly #31966
First failure on the dashboard is https://build.golang.org/log/b38cacf4803d9fa1c15e29e74600cf974369d26b, at CL 142960 (#30333).
The failure mode seems consistent.
The text was updated successfully, but these errors were encountered:
I don't know if that's related, since @cherrymui pointed out to me that it should produce identical binaries (though different object files).
I think it is my change: I introduced another runtime goroutine in the offending change (on the dashboard) and I suspect that it never quite parks, and instead falls into its exponential back-off loop (it thinks there's work to do, but when it goes to do it none is there). This can happen if
I think the right thing to do is to bound the exponential backoff loop at a reasonable time (e.g. 10 ms) and if there's no work to do after waiting all that time just park. Because new spans only get freed when a goroutine exits (a stack is freed) or when spans are getting swept, if we approach the end of a GC cycle with nothing to do, it's likely it'll stay that way.
But, this is assuming that's what the problem is. If that really is the issue it's surprising to me that this has only ever been observed on AIX and not elsewhere, but I'll dig into it now on a gomote and check my hypothesis.
The deadlock detector isn't supposed to consider system goroutines (those that were started by the runtime), but I don't think any others have this back-off behavior, so maybe that doesn't work quite right, or maybe the logic for detecting system goroutines isn't quite right on AIX.
Shouldn't the current logic eventually always park? Or are there cases where it can't achieve the scavenge goal and will just keep backing off (until the next GC)?
OK running this test with GODEBUG=gctrace=1 in a gomote makes it very clear as to what's going on.
The arena size on AIX is 256 MiB, and since the heap size is ~0 for this test the scavenger sees aaaaaall this work to do. But, AIX (as far as the runtime knows) doesn't have huge pages, so it's doing its work at the slow pace of 1 physical page per ms, and it takes it some time to do so.
This test, however, is looking for a deadlock, but the scavenger doesn't park until its done with its work. It does eventually park, given enough time, and the runtime finally notices the deadlock.
I'm not sure what the right solution is here. Perhaps we just... have the scavenger notice when its the only goroutine left and park at the first opportunity? If it goes into a long sleep though we kind of a have a problem. Something needs to wake it up.
Maybe the solution is much simpler. Maybe when it sleeps it shouldn't park with a "sleep" wait reason and instead should park with something else, so the runtime notices the deadlock.
For the record, the wait reason has nothing do with deadlock detection.
So turns out this is a bigger problem with how the background scavenger and the deadlock detection interact. Changing deadlock detection to handle a special case isn't really on the table because of how subtle that code is.
The fundamental problem (thanks @aclements for getting me here) is that sleeping on the regular event-driven timer means its really difficult for deadlock detection to notice that the runtime thinks there's no more work to do.
We can fix this by just changing how the scavenger sleeps.
@rsc suggested just having sysmon kick the scavenger awake, and this works. My concern is that it could wake up and do up to 10x more work than it needs (1 page / ms, max sleep time of sysmon is 10 ms). For other situations this wouldn't be so bad, except the scavenger must grab the heap lock, which means it could really hurt the tail latency of anything that grabs the heap lock (notably allocation in the slow path). Furthermore (and this isn't really a con, just a note) it's slightly more complex than just sysmon kicking it awake; if sysmon goes into its "P's are idle" long sleep then it needs to make sure it can always wake the scavenger if it goes to sleep. This means that the scavenger must wake sysmon before it goes to sleep.
@aclements suggested to instead use notetsleepg and make that the primary method through which the background scavenger parks and sleeps. The only difference (that would allow the deadlock detection mechanism to function correctly) would be to increment/decrement sched.nmsys before it goes to sleep and after it wakes up, so that the background scavenger's sleeping
I've implemented the former and I'll now be implementing the latter. I'll update this bug with what we end up going with. I'll also update the bug's title to better reflect what's really going on.
Does all other OSes have huge pages ? Or the problem mainly comes from the arena size ? AIX isn't yet fully optimized. So it might be possible to hasten this part of the runtime.
The more fundamental problem is that this goroutine can block deadlock detection for a (probably bounded?) large amount of time, so it's definitely not specific to AIX. It just so happens to manifest as a failure on the trybots on AIX because arena sizes are big and there isn't any huge page awareness built into the runtime for that platform. If AIX has an equivalent of transparent huge pages (like Linux or FreeBSD) then we can at least fix the trybots in the super-short term, but the more fundamental problem still needs to be fixed ASAP.
After discussing with @aclements this issue is more broad. Technically the background sweeper and the GC could also cause delays in deadlock detection, but in most cases it's not a problem. The background scavenger just really brings this to the spotlight because of how slowly it is paced by default.
Also, the issue appearing on AIX (specifically because of the runtime's large arena size on the platform) brings to mind an accounting problem: the background scavenger technically shouldn't even need to scavenge that 256 MiB arena because those pages haven't been faulted in yet. This is a result of the fact that newly-mapped heap memory is not marked as scavenged. We can change it such that it is marked as scavenged, which should make the problem go away on the trybots specifically (and should avoid any start-up time penalties caused by the scavenger doing unnecessary work). I opened #32012 to track this.
But, we should still make it so that the scavenger may be considered "dead" at the next sleep opportunity to significantly reduce the time that the scavenger could delay deadlock detection. I have a working implementation based on @aclements idea with notes (I ended up not using
Just to confirm that empirically, here's a failure on the
So it's probably best to not fix this with my hack in golang.org/cl/176957. @ianlancetaylor noted offline that the deadlock detector already doesn't work for many people in many situations already, so it doesn't make as much sense to bend over backwards to make it not get delayed by things like sweeping, GC, or the scavenger.
With that said, we can solve the problem on the trybots by just fixing #32012, which should also fix the small heap use-case since the scavenger should no longer be doing much work for short-lived programs. So I'm going to fix that instead (which is a much simpler fix, and also a good one to have overall) and close up this bug. I've confirmed the fix lowers the time for the AIX builder (among other builders).