
runtime/pprof: regression in TestMemoryProfiler/debug=1 starting in April 2021 #46500

Closed
bcmills opened this issue Jun 1, 2021 · 25 comments
Labels: NeedsInvestigation (Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.), release-blocker
Milestone: Go1.17

Comments

@bcmills (Member) commented Jun 1, 2021

2021-05-30T02:37:38-1607c28/linux-amd64-sid
2021-04-27T21:55:07-214c8dd/linux-amd64-nocgo
2021-04-27T21:44:16-645cb62/linux-amd64-nocgo

This test has otherwise been passing consistently since it was last fixed in November 2019, so this looks like a 1.17 regression (CC @golang/release).

2019 failures

2019-11-11T12:40:04-f07059d/linux-amd64-noopt
2019-11-10T20:36:44-47bc240/linux-amd64-noopt
2019-11-10T17:13:25-298be61/linux-amd64-noopt
2019-11-10T17:12:35-46c9fd0/linux-amd64-noopt
2019-11-10T17:12:15-40ebcfa/linux-amd64-noopt
2019-11-10T17:11:34-cd53fdd/linux-amd64-noopt
2019-11-10T13:41:45-4d4ddd8/linux-amd64-noopt
2019-11-10T12:12:46-9eb9c7b/linux-amd64-noopt
2019-11-10T04:23:22-e6fb39a/linux-amd64-noopt
2019-11-09T22:14:01-78d4560/linux-amd64-noopt
2019-11-09T20:08:06-29cfb4d/linux-amd64-noopt
2019-11-09T19:31:32-7148478/linux-amd64-noopt
2019-11-09T19:25:46-6e11195/linux-amd64-noopt
2019-11-09T05:51:04-a0262b2/linux-amd64-noopt
2019-11-09T00:36:15-bde1968/linux-amd64-noopt
2019-11-08T23:22:06-11da2b2/linux-amd64-noopt
2019-11-08T22:44:29-42db1da/linux-amd64-noopt
2019-11-08T21:32:23-0bbcce9/linux-amd64-noopt
2019-11-08T21:27:51-b7d097a/linux-amd64-noopt
2019-11-08T21:05:17-9ee6ba0/linux-amd64-noopt
2019-11-08T20:50:17-9e914f5/linux-amd64-noopt
2019-11-08T20:24:43-bababde/linux-amd64-noopt
2019-11-08T20:05:25-7a5e0fe/linux-amd64-noopt
2019-11-08T19:28:57-904e113/linux-amd64-noopt
2019-11-08T19:28:49-f6ff806/linux-amd64-noopt
2019-11-08T19:24:30-e6c12c3/linux-amd64-noopt
2019-11-08T18:55:44-b2b0992/linux-amd64-noopt
2019-11-08T18:00:31-c444ec3/linux-amd64-noopt
2019-11-08T17:56:35-a84ac18/linux-amd64-noopt
2019-11-08T17:01:32-4517c02/linux-amd64-noopt
2019-11-08T17:01:05-a5a6f61/linux-amd64-noopt
2019-11-08T17:00:57-dac936a/linux-amd64-noopt
2019-11-08T16:46:24-47232f0/linux-amd64-noopt
2019-11-08T16:44:48-374c284/linux-amd64-noopt
2019-11-08T16:35:48-ffb5646/linux-amd64-noopt
2019-11-08T15:10:39-52aebe8/linux-amd64-noopt
2019-09-19T20:26:22-1c50fcf/netbsd-arm-bsiegert

@bcmills added the NeedsInvestigation, release-blocker, and okay-after-beta1 labels on Jun 1, 2021
@bcmills bcmills added this to the Go1.17 milestone Jun 1, 2021
@bcmills (Member, Author) commented Jun 2, 2021

CC @cherrymui for runtime/pprof.

@heschi removed the okay-after-beta1 label on Jun 10, 2021
@toothrot (Contributor)

@cherrymui Following up, as we're approaching RC1 and this is a release-blocker.

@prattmic (Member)

All of these failures are because we expect an entry like:

0: 0 [1: 2097152] @ 0x52cb88 0x49ed54 0x49e485 0x52cca7 0x52cec5 0x4c5e22 0x466121
            #	0x52cb87	runtime/pprof.allocateReflectTransient+0x27	/workdir/go/src/runtime/pprof/mprof_test.go:56

but get one like

1: 2097152 [1: 2097152] @ 0x52cb88 0x49ed54 0x49e485 0x52cca7 0x52cec5 0x4c5e22 0x466121
            #	0x52cb87	runtime/pprof.allocateReflectTransient+0x27	/workdir/go/src/runtime/pprof/mprof_test.go:56

The only mismatches are the first two numbers: want 0, got 1 and 2097152.

The first number is InUseObjects() => AllocObjects - FreeObjects. The second is InUseBytes() => AllocBytes - FreeBytes.

Since these are "transient" allocations and the test runs GC, we are expecting to see the allocations freed. The frees are recorded during sweep, so at first glance this would look like another case of #45315. However, this started failing after that was fixed, so I suspect that something in http://golang.org/cl/307915 or http://golang.org/cl/307916 is subtly broken and triggering this.

cc @mknyszek

@prattmic (Member)

FWIW, I've been unable to reproduce this locally so far.

@mknyszek (Contributor)

I think @prattmic is right and that this is probably a regression. Looking into it.

@mknyszek (Contributor)

I ran the full suite of runtime/pprof tests overnight and I reproduced it. Unfortunately because of limited terminal scrollback I don't actually know how many executions it took...

Specifically, I ran:

CGO_ENABLED=0 go test -short -count=1 runtime/pprof

On the hunch that CGO_ENABLED=0 somehow makes the regression reproduce more easily.

@mknyszek (Contributor)

OK, I can probably shorten the time to reproduce: TestMemoryProfiler is the third test executed, so the only tests that could be causally influencing the failure are the two before it. Together they take much less time to run.

@toothrot (Contributor)

@mknyszek We're close to the RC date for Go 1.17. Just a friendly ping.

@mknyszek (Contributor)

I was able to reproduce again. I added a check for some potential issues with how reflect changed in Go 1.17, and I think I've successfully ruled that out.

I captured the output, so I can say now that it took about 2 hours of continuously running the full runtime/pprof test suite to hit it. I think I can actually stress test this into something useful, but now I need to figure out what I want to learn.

@mknyszek (Contributor)

FWIW, I'm not 100% sure if this should be an RC blocker. What this test failure means is that there's a very rare chance that a heap profile ends up stale, specifically in the case of calling runtime.GC. That's not great, and I will continue to try to stress test it and fix it (builders should be green!), but it does not critically impact users of Go.

@mknyszek (Contributor)

Got another reproducer, while running only the first 3 tests in the package in a loop. 206267 executions at ~0.077s per execution... about 4 hours to reproduce.

But hey, this time, I got a GC trace! And there's something very peculiar about this. The GC trace for the failing test is the only one that actually has a forced GC! You'd think that every single execution would have a forced GC, but that's not true at all, as it turns out.

I wonder if I'll be able to reproduce this more easily by adding a sleep, to make sure another GC cycle doesn't stomp on the forced GC.

@mknyszek (Contributor)

Added a 1 second sleep before runtime.GC and now I have an instant reproducer. :)

@mknyszek (Contributor)

Oh, wait. It occurs to me that because time.Sleep allocates, it might just be breaking the test.

@mknyszek (Contributor)

Yeah, the time import changes the line numbers. Sigh.

@mknyszek (Contributor)

False alarm. STDERR output for the test itself was hidden, so while I have a GC trace, the forced GC is not unique.

@dmitshur (Contributor) commented Jul 1, 2021

Thanks for investigating this Michael. In a release meeting we discussed we're primarily looking to understand the failure mode before RC 1, if possible. Then we can make a better decision about it.

@mknyszek (Contributor) commented Jul 1, 2021

I'm slowly gathering information, but given that it takes hours to reproduce, this is going to take a while.

@mknyszek (Contributor) commented Jul 1, 2021

Alrighty, new update: I've got a ~3 minute feedback loop now. I'm using debuglog to slowly whittle down the possibilities. Hopefully should have something soon. If I start floundering, I'll start bisecting.

@mknyszek (Contributor) commented Jul 1, 2021

I think I've confirmed this is a subtle bug in the isSweepDone condition.

I've got the following output, annotated for clarity:

/tmp/go-stress-20210701T205000-498548926
runtime/pprof.allocateReflectTransient <--- the profile stack entry that was incorrect.
>> begin log 1 <<
[0.032741835 P 2] PostSweep 6 3 3 2 <--- heap profile published in runtime.GC. GC cycle is #3, forced GC was triggered during GC cycle #2.
>> begin log 0 <<
[0.033512719 P 0] freed 6 3 <--- the allocation created by runtime/pprof.allocateReflectTransient is actually freed
[0.033651934 P 2] MemProfile 6 3 <--- the test grabs proflock and copies the profile out

I think I might know what the problem is.

Looking at sweepone as an example of sweepLocker usage, it appears that sweepDrained can be set by sweepone before another, concurrent, instance of sweepone (or similar) calls tryAcquire. That tryAcquire is what actually blocks completion. Consider the following events:

  1. Some sweeper pulls the last span for sweeping out of the list of spans that need sweeping. It has not yet acquired ownership of the span for sweeping.
  2. Some other sweeper notices there's no more work. Hooray! They mark sweepDrained. However note that at this very point, mheap_.sweepers == 0, because the last span hasn't actually been acquired for sweeping.
  3. runtime.GC is looping, checking isSweepDone. It happens to fire right at this moment, so it continues on to publish the heap profile.
  4. Then the last span actually gets swept.

Lo and behold, we've missed a free in the published heap profile.

Assuming this is actually the problem (I will continue to try to confirm it), I think it does not indicate a larger issue. isSweepDone has conflated "no more spans to sweep" with "no more outstanding sweeps" since earlier releases; Austin's CLs were trying to fix exactly this, and I believe they succeeded in making the race less likely, eliminating the flakiness of another test Austin wrote in the 1.17 cycle. However, it appears that this test manages to take advantage of the remaining race window.

When Austin was working on fixing this, we discussed how this condition isn't actually problematic for GC correctness: the GC ensures all outstanding sweeps are complete because it must stop the world to begin the next mark phase, and sweeping always disables preemption (which is necessary for a much broader sense of correctness), so a new GC cycle will only start once all outstanding sweeps have completed.

@dmitshur As a result, I don't think this should block the RC, but I think this should be fixed prior to release. Ultimately, the worst it can do is make some tests (particularly ones that rely on runtime.GC) flaky.

@mknyszek (Contributor) commented Jul 1, 2021

I'm currently testing my theory by adding an extra blockCompletion call in sweepone, just after the sweepLocker is created by newSweepLocker. There's a risk that this could cause the isSweepDone condition to flap, but because sweepone bails out early if isSweepDone is already true, that can't actually happen, for this case anyway. AFAICT, this is the only path that matters for this test; a real fix may need more (more on that later)...

The good news is that it's been 30 minutes and nothing has failed yet.

This ensures that mheap_.sweepers is always >0 before we grab that last span out of the list (i.e. before mheap_.nextSpanForSweep()). If someone else then marks sweepDrained, the sweeper holding that last span will still be accounted for until it actually finishes. As a result, anyone checking isSweepDone will never observe it switch to true prematurely.

Unfortunately, I think this is a hacky fix. Even with the early isSweepDone check, I think there's still a small window where the condition could flap. Consider this sequence of events:

  1. A sweeper notices sweepDrained == 0 so it says that isSweepDone is false. There are no other outstanding sweepers.
  2. Another sweeper notices that there's nothing left to sweep, so it sets sweepDrained. isSweepDone is now observable as true.
  3. The first sweeper increments mheap_.sweepers because it passed the check in step 1. isSweepDone is now temporarily observable as false.
  4. The first sweeper decrements mheap_.sweepers and now isSweepDone is true again.

Basically, what we need to guarantee is that:

  1. mheap_.sweepers is incremented before any pop operation from the unswept lists.
  2. mheap_.sweepers is incremented only if there are more spans to be swept.

I think that these might be two contradictory conditions. I need to think about this more, though I'm certain there's a clean resolution to all this.

@mknyszek (Contributor) commented Jul 2, 2021

2.5 hours later and this narrower window has prevented a test failure. I think this is our culprit.

UPDATE: 20.5 hours later, and still no failure.

@mknyszek (Contributor) commented Jul 8, 2021

I've been thinking about this more. I think the right fix is to ensure that mheap_.sweepers and sweepDrained are updated together. Consider a design where they're packed into the same uint32, with the top bit reserved as the sweepDrained boolean and the rest of the bits for mheap_.sweepers (2 billion concurrent sweepers should be enough for anyone :P).

Then it gets manipulated in the following way:

  1. A potential sweeper is considering doing some sweeping. They read the combined value, and check sweepDrained. If it's set, then don't sweep. There's nothing to do.
  2. Otherwise, they enter a CAS loop in which they increment sweepers, backing out if they ever observe sweepDrained set.
  3. If they successfully CAS, they can go on and pop a span and sweep. If they notice that there's nothing to sweep, they CAS-loop on setting sweepDrained and decrementing sweepers, then return.
  4. Otherwise, acquire the span they popped for sweeping, sweep the span and whatever else, then CAS-loop to decrement sweepers.

isSweepDone is then just an atomic load, then a comparison with 0x80000000. The condition will never flap, because a sweeper is guaranteed to never increment sweepers once sweepDrained is set. Existing sweepers or CAS-loopers will simply decrement and drain sweepers as intended.

There's one more caveat here with reclaimers that don't pop from the list but do acquire spans for sweeping. They need only be accounted for in sweepers, so I think they can just do the same as above.

@aclements does this sound right to you?

My one concern here is contention due to CAS-looping. I think it should be relatively OK because this happens on the allocation slow path (the first slow path, refilling spans), though there are a number of other potential sweepers (reclaimers, proportional sweepers, or the background sweeper). I guess we'll just have to benchmark it.

@mknyszek (Contributor) commented Jul 8, 2021

I've sketched out a fix at https://golang.org/cl/333389 (be warned: it may not even compile, I didn't try) and that seems way too big for this release.

I think we should just add an extra runtime.GC call and a TODO to prevent this test from flaking for the release.

@gopherbot

Change https://golang.org/cl/333389 mentions this issue: runtime: fix sweep termination condition

@gopherbot

Change https://golang.org/cl/333549 mentions this issue: runtime/pprof: call runtime.GC twice in memory profile test

gopherbot pushed a commit that referenced this issue Oct 29, 2021
Currently, there is a chance that the sweep termination condition could
flap, causing e.g. runtime.GC to return before all sweep work has been
not only drained but also completed. CL 307915 and CL 307916 attempted
to fix this problem, but it is still possible that mheap_.sweepDrained is
marked before any outstanding sweepers are accounted for in
mheap_.sweepers, leaving a window in which a thread could observe
isSweepDone as true before it actually was (and after some time it would
revert to false, then true again, depending on the number of outstanding
sweepers at that point).

This change fixes the sweep termination condition by merging
mheap_.sweepers and mheap_.sweepDrained into a single atomic value.

This value is updated such that a new potential sweeper will increment
the outstanding sweeper count iff there are still outstanding spans to be
swept without an outstanding sweeper to pick them up. This design
simplifies the sweep termination condition into a single atomic load and
comparison and ensures the condition never flaps.

Updates #46500.
Fixes #45315.

Change-Id: I6d69aff156b8d48428c4cc8cfdbf28be346dbf04
Reviewed-on: https://go-review.googlesource.com/c/go/+/333389
Trust: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>