In tests that have goroutines which heap-allocate and nothing else, I've observed these goroutines allocate beyond the heap goal by up to tens of MiB. This happens because of an inopportune scheduling delay, either by the OS scheduler or the Go scheduler, of a thread trying to terminate the GC's mark phase, where simply assisting the GC with scan work is not enough.
This is unlikely to affect real Go programs, but when it does, it is more likely to affect those with smaller heaps.
While writing a test for the implementation of #48409, I noticed that the memory limit might be passed.
The test in question has a single goroutine allocating heap memory in a loop, and a goroutine checking stats to ensure the memory limit is maintained. Occasionally, with a target limit of 64 MiB, the runtime could exceed the memory limit. Raising the limit to 512 MiB or more effectively mitigated the problem.
Digging deeper, I discovered the cause: the runtime was allowing the allocating goroutine to exceed the heap goal. GC assists weren't kicking in, because there was no mark work left to do anyway. With more logging I further discovered that the GC mark worker, which was in mark termination, would disappear entirely for 1-3ms when dropping into a sleep that has a target of 100µs. Replacing that sleep with an osyield could reproduce the same issue. I was also able to reproduce it by removing any kind of sleep and having it spin, though the point at which the thread could disappear was now more variable, when it did happen. It was also easier to reproduce with multiple instances running, but in all cases I ensured the CPU hardware still had plenty of breathing room (50% headroom in # of cores).
When investigating the spinning case, I also noticed that if a GC mark worker running mark termination algorithm notices that there's still work to do, it will by default park its goroutine. By the time that worker would wake back up and run mark termination again, the allocating goroutine would have blown past the heap goal. This was another common source of overrun.
Outside of this parking issue, however, I suspected an OS scheduler latency issue (this was on Linux). To confirm this, I traced OS scheduler events (perf sched record -k CLOCK_MONOTONIC) and lined them up with my logging to discover that indeed, the thread was getting context switched away for a couple milliseconds in at least the osyield and spinning cases (I haven't confirmed the original, but the symptoms are identical). It appears that due to kernel workers and background processes, the system could become transiently oversubscribed, and tasks could get kicked across CPUs, incurring this degree of latency.
But this isn't necessarily an issue on the OS's side. OS schedulers are always going to have some latency. More fundamentally, the runtime isn't always robust to this kind of external latency.
Aside from this test I wrote, the Go team has observed two other tests occasionally failing with mysterious heap overrun issues: #37331 and #49564. I suspect, but have not confirmed yet, that these flakes arise from precisely the same issue. Both involve tests that perform heap allocations in a loop.
Note that in practice this kind of heap overrun is rarely, if ever, an issue. In order to allocate at this rate, the application would have to do nothing with that memory, not even write to it or read from it, which is pointless. The overrun is also proportional to these external latency factors, which are effectively constant, so this is typically a lot less noticeable for larger heaps (i.e. going over by a few MiB is a much bigger fraction of the heap goal for 64 MiB than for 1 GiB).
The short-term mitigation for tests is to just allocate at a slower rate, reducing the amount of possible overrun. Long-term, the runtime should be more robust to these situations. A few ideas on how to do this:
For the situation where a mark worker parks after a failure to terminate the mark phase, forcing allocating goroutines to assist more often would actually help here, because assists can trigger mark termination as well. They can take over.
@aclements notes that typically, wait-free algorithms are used to combat these sorts of situations. A wait-free distributed mark termination algorithm is a tall order, but that would tackle this particular situation directly, as several goroutines could be helping each other through mark termination, as opposed to one goroutine coordinating it at a time.
As far as I can tell, this test suffers from #52433. For some reason,
this seems to become more of a problem on the windows/386 than anywhere
else. This CL is an attempt at a mitigation by slowing down the
allocation rate by inserting runtime.Gosched call in the inner loop. It
also cuts the iteration count which should help too (as less memory is
allocated in total), but the main motivation is to make sure the test
doesn't take too long to run.
Reviewed-by: Bryan Mills <email@example.com>
Run-TryBot: Michael Knyszek <firstname.lastname@example.org>
TryBot-Result: Gopher Robot <email@example.com>