runtime/pprof: regression in TestMemoryProfiler/debug=1 starting in April 2021 #46500
All of these failures are because we expect an entry like:
but get one like:
The only mismatches are the first two numbers: want 0, got 1 and 2097152.
The first number is
Since these are "transient" allocations and the test runs GC, we are expecting to see the allocations freed. The frees are recorded during sweep, so at first glance this would look like another case of #45315. However, this started failing after that was fixed, so I suspect that something in http://golang.org/cl/307915 or http://golang.org/cl/307916 is subtly broken and triggering this.
I ran the full suite of
Specifically, I ran:
On the hunch that
I was able to reproduce again. I added a check for some potential issues with how reflect changed in Go 1.17, and I think I've successfully ruled that out.
I captured the output, so I can say now that it took about 2 hours of continuously running the full
FWIW, I'm not 100% sure if this should be an RC blocker. What this test failure means is that there's a very rare chance that a heap profile ends up stale, specifically in the case of calling
Got another reproducer, while running only the first 3 tests in the package in a loop. 206267 executions at ~0.077s per execution... about 4 hours to reproduce.
But hey, this time, I got a GC trace! And there's something very peculiar about this. The GC trace for the failing test is the only one that actually has a forced GC! You'd think that every single execution would have a forced GC, but that's not true at all, as it turns out.
I wonder if I'll be able to reproduce this more easily by adding a sleep, to make sure another GC cycle doesn't stomp on the forced GC.
I think I've confirmed this is a subtle bug in the
I've got the following output, annotated for clarity:
I think I might know what the problem is.
Lo and behold, we've missed a free in the published heap profile.
Assuming this is actually the problem (I will continue to try to confirm this), I think that this does not indicate a larger potential issue. This issue of
When Austin was working on fixing this, we discussed why this condition isn't actually a problem for GC correctness: the GC must stop the world to begin the next mark phase, which guarantees that all outstanding sweeps have completed by then. Sweeping always disables preemption -- this is necessary for correctness in a much broader sense -- so a new GC can only start once all outstanding sweeps are done.
@dmitshur As a result, I don't think this should block the RC, but I think this should be fixed prior to release. Ultimately, the worst it can do is make some tests (particularly ones that rely on
I'm currently testing my theory by adding an extra
The good news is that it's been 30 minutes and nothing has failed yet.
This ensures that
Unfortunately, I think this is a hacky fix. I think even with
Basically, what we need to guarantee is that:
I think that these might be two contradictory conditions. I need to think about this more, though I'm certain there's a clean resolution to all this.
I've been thinking about this more. I think the right fix is to make the process of updating
Then it gets manipulated in the following way:
There's one more caveat here with reclaimers that don't pop from the list but do acquire spans for sweeping. They need only be accounted for in
@aclements does this sound right to you?
My one concern here is contention due to CAS-looping. I think it should be relatively OK because this happens on the allocation slow path (the first slow path, refilling spans), though there are a number of other potential sweepers (reclaimers, proportional sweepers, or the background sweeper). I guess we'll just have to benchmark it.
I've sketched out a fix at https://golang.org/cl/333389 (be warned: it may not even compile, I didn't try) and that seems way too big for this release.
I think we should just add an extra
Currently, there is a chance that the sweep termination condition could flap, causing e.g. runtime.GC to return before all sweep work has not only been drained, but also completed. CL 307915 and CL 307916 attempted to fix this problem, but it is still possible that mheap_.sweepDrained is marked before any outstanding sweepers are accounted for in mheap_.sweepers, leaving a window in which a thread could observe isSweepDone as true before it actually was (and after some time it would revert to false, then true again, depending on the number of outstanding sweepers at that point).

This change fixes the sweep termination condition by merging mheap_.sweepers and mheap_.sweepDrained into a single atomic value. This value is updated such that a new potential sweeper will increment the outstanding sweeper count iff there are still outstanding spans to be swept without an outstanding sweeper to pick them up. This design simplifies the sweep termination condition into a single atomic load and comparison and ensures the condition never flaps.

Updates #46500.
Fixes #45315.

Change-Id: I6d69aff156b8d48428c4cc8cfdbf28be346dbf04
Reviewed-on: https://go-review.googlesource.com/c/go/+/333389
Trust: Michael Knyszek <firstname.lastname@example.org>
Run-TryBot: Michael Knyszek <email@example.com>
TryBot-Result: Go Bot <firstname.lastname@example.org>
Reviewed-by: Austin Clements <email@example.com>