runtime: TestGcSys is still flaky #37331

Open · bcmills opened this issue Feb 20, 2020 · 31 comments

@bcmills added this to the Backlog milestone Feb 20, 2020
@josharian (Contributor) commented Mar 3, 2020

Shall we disable the test for now?

@bcmills (Member, author) commented Mar 16, 2020

2020-03-15T08:14:24-dc32553/freebsd-amd64-race
2020-03-04T20:52:43-c55a50e/solaris-amd64-oraclerel

@aclements, @mknyszek: what do you want to do about this test? (Do we understand the cause of these flakes?)

@mknyszek (Contributor) commented Mar 17, 2020

@bcmills I'll take a look.

This is usually due to some GC pacing heuristic doing something weird. A GC trace should get us part of the way there.

@mknyszek self-assigned this Mar 17, 2020
@mknyszek (Contributor) commented Mar 26, 2020

OK sorry for the delay, finally looking into this now.

@mknyszek (Contributor) commented Mar 26, 2020

Ugh, OK. So this definitely looks like another GOMAXPROCS=1 GC pacing issue. Looking at the gctrace for a bad run on freebsd-amd64-race (which is pretty easily reproducible):

gc 1 @0.000s 1%: 0.010+0.25+0.011 ms clock, 0.010+0/0.055/0.17+0.011 ms cpu, 0->0->0 MB, 4 MB goal, 1 P (forced)
gc 2 @0.001s 3%: 0.011+0.34+0.015 ms clock, 0.011+0.15/0/0+0.015 ms cpu, 4->5->1 MB, 5 MB goal, 1 P
gc 3 @0.003s 4%: 0.011+0.91+0.014 ms clock, 0.011+0.15/0/0+0.014 ms cpu, 4->5->1 MB, 5 MB goal, 1 P
gc 4 @0.005s 4%: 0.013+0.50+0.014 ms clock, 0.013+0.15/0/0+0.014 ms cpu, 4->5->1 MB, 5 MB goal, 1 P
gc 5 @0.007s 5%: 0.010+0.45+0.013 ms clock, 0.010+0.15/0/0+0.013 ms cpu, 4->5->1 MB, 5 MB goal, 1 P
gc 6 @0.008s 6%: 0.011+0.41+0.013 ms clock, 0.011+0.14/0/0+0.013 ms cpu, 4->5->1 MB, 5 MB goal, 1 P
gc 7 @0.009s 7%: 0.012+0.38+0.013 ms clock, 0.012+0.14/0/0+0.013 ms cpu, 4->5->1 MB, 5 MB goal, 1 P
gc 8 @0.010s 7%: 0.012+0.39+0.015 ms clock, 0.012+0.16/0/0+0.015 ms cpu, 4->5->1 MB, 5 MB goal, 1 P
gc 9 @0.011s 8%: 0.012+0.39+0.018 ms clock, 0.012+0.14/0/0+0.018 ms cpu, 4->5->1 MB, 5 MB goal, 1 P
gc 10 @0.012s 5%: 0.012+10+0.014 ms clock, 0.012+0.059/0.13/0+0.014 ms cpu, 4->80->75 MB, 5 MB goal, 1 P
using too much memory: 70813704 bytes

You'll notice that in gc 10, we trigger the GC at the right time, but while it's happening we blow right past the hard goal.

The last time I debugged this, the problem was that we didn't fall back to the hard goal even when we were doing more scan work than expected. It's hard for me to see how that could be the case again, so there's likely something else going on. A thought: heap_scan is updated less frequently from local_scan in Go 1.14, and most of the GC work in this test comes from assists (because GOMAXPROCS=1). What if the runtime falls into a state where it's consistently behind on the assist ratio? It finally catches up and the assist work gets done, but too late. If that's the case, I'm not sure why this is basically impossible to reproduce on Linux, and easily reproducible on the freebsd-amd64-race builders.
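
For readers unfamiliar with the test: the gctrace above was collected with GODEBUG=gctrace=1, and the child program behind TestGcSys is, roughly, a single goroutine allocating small, short-lived objects at GOMAXPROCS=1, after which the test checks that the runtime's memory footprint grew by no more than a few MB. A simplified sketch of that shape (the constants and structure here are illustrative, not the exact runtime/testdata source):

```go
// Simplified sketch of the GCSys child program's shape; the loop count and
// threshold are illustrative, not the exact values used by the runtime test.
package main

import (
	"fmt"
	"runtime"
)

var sink []byte

func main() {
	runtime.GOMAXPROCS(1)

	var ms runtime.MemStats
	runtime.GC()
	runtime.ReadMemStats(&ms)
	before := ms.Sys

	// A single goroutine allocating small, short-lived objects: almost
	// nothing stays live, so the GC should keep the heap near its minimum.
	for i := 0; i < 1000000; i++ {
		sink = make([]byte, 1024)
	}

	runtime.ReadMemStats(&ms)
	if growth := ms.Sys - before; growth > 16<<20 {
		fmt.Printf("using too much memory: %d bytes\n", growth)
		return
	}
	fmt.Println("OK")
}
```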

@nmeum commented May 31, 2020

> I'm not sure why this is basically impossible to reproduce on Linux, and easily reproducible on the freebsd-amd64-race builders.

I think we are also running into this on armv7 and armhf on Alpine Linux edge when building go 1.14.3:

algitbot pushed a commit to alpinelinux/aports that referenced this issue Jun 1, 2020
@mknyszek (Contributor) commented Nov 10, 2020

Having dug into this before, I suspect this is related to #42430, but I'm not sure in what way.

Going back to the thought I had in March (#37331 (comment)): I changed a bunch of details in this release regarding how heap_scan is updated (specifically, it's updated more often), so that may be why the failure rate has gone down. @bcmills, are those all the recent failures?

@bcmills (Member, author) commented Nov 10, 2020

@mknyszek, those are all of the failures I could find using greplogs with the regexp FAIL: TestGcSys, yes.

@mknyszek (Contributor) commented Nov 11, 2020

Oh, I also want to note #40460, which is probably related. I could likely prove it with an execution trace. Will look into this soon.

@ksshannon (Contributor) commented Mar 31, 2021

I am getting this fairly consistently with:

go version devel +87c6fa4f47 Wed Mar 31 20:28:39 2021 +0000 darwin/amd64

and

go version devel +87c6fa4f47 Wed Mar 31 20:28:39 2021 +0000 linux/amd64

Running go test -run=GcSys -count=100 runtime, I get up to a dozen failures. Occasionally the darwin system passes; the linux system hasn't passed yet.

@mknyszek (Contributor) commented Apr 12, 2021

@bcmills (Member, author) commented Apr 23, 2021

2021-04-23T00:40:48-050b408/linux-s390x-ibm
2021-04-22T22:01:47-7405968/linux-s390x-ibm
2021-04-22T20:45:37-ecfce58/linux-s390x-ibm
2021-04-22T03:03:41-f0a8101/aix-ppc64
2021-04-21T21:25:26-7e97e4e/linux-s390x-ibm
2021-04-21T20:24:34-2550563/linux-amd64-wsl
2021-04-21T09:07:02-7735ec9/linux-amd64-wsl
2021-04-20T20:58:52-dbade77/linux-s390x-ibm
2021-04-20T15:13:47-4ce49b4/linux-s390x-ibm
2021-04-20T00:14:27-9f87943/linux-s390x-ibm
2021-04-19T21:27:43-e97d8eb/linux-s390x-ibm
2021-04-19T18:37:15-f889214/linux-s390x-ibm
2021-04-16T22:45:02-14dbd6e/linux-amd64-wsl
2021-04-16T14:15:49-0613c74/linux-arm64-packet
2021-04-14T19:32:31-bcbde83/linux-s390x-ibm
2021-04-14T13:21:14-e224787/linux-s390x-ibm
2021-04-12T17:30:21-2fa7163/linux-s390x-ibm
2021-04-10T19:02:06-a6d95b4/linux-s390x-ibm
2021-04-10T19:02:03-4638545/linux-amd64-wsl
2021-04-09T18:19:42-952187a/solaris-amd64-oraclerel
2021-04-09T15:01:13-6951da5/linux-amd64-wsl
2021-04-09T12:56:04-519f223/linux-amd64-wsl
2021-04-08T19:51:32-98dd205/aix-ppc64
2021-04-08T19:30:34-ecca94a/linux-s390x-ibm
2021-04-08T15:03:31-283b020/aix-ppc64
2021-04-08T15:03:31-283b020/solaris-amd64-oraclerel
2021-04-08T15:02:51-1be8be4/linux-s390x-ibm
2021-04-08T02:17:15-0c4a08c/aix-ppc64
2021-04-07T20:23:47-fca51ba/linux-s390x-ibm
2021-04-07T06:53:34-5d5f779/linux-arm64-packet
2021-04-07T06:53:34-5d5f779/linux-s390x-ibm
2021-04-06T18:59:08-3a30381/aix-ppc64
2021-04-05T19:15:53-e985245/linux-arm64-packet
2021-04-05T19:15:53-e985245/linux-s390x-ibm
2021-04-05T17:22:26-e617b2b/linux-s390x-ibm
2021-04-03T10:58:19-fe587ce/linux-s390x-ibm
2021-04-02T14:40:43-3651eff/linux-s390x-ibm
2021-04-02T12:44:37-a78b12a/linux-s390x-ibm
2021-04-02T05:24:14-aebc0b4/linux-s390x-ibm
2021-04-01T00:51:24-1f29e69/linux-s390x-ibm
2021-03-30T17:51:37-c40dc67/linux-s390x-ibm
2021-03-30T01:17:14-a95454b/linux-amd64-wsl
2021-03-30T00:47:22-bd6628e/linux-s390x-ibm
2021-03-29T16:48:08-2abf280/linux-s390x-ibm
2021-03-28T03:27:04-23ffb5b/linux-s390x-ibm
2021-03-25T21:35:05-374b190/linux-s390x-ibm
2021-03-25T19:21:34-2c8692d/linux-s390x-ibm
2021-03-25T14:46:50-4d66d77/linux-s390x-ibm
2021-03-23T23:08:19-769d4b6/solaris-amd64-oraclerel
2021-03-23T11:14:58-53dd0d7/linux-s390x-ibm
2021-03-23T03:49:17-b8371d4/linux-s390x-ibm
2021-03-22T17:50:42-78afca2/linux-s390x-ibm
2021-03-22T03:52:31-d8394bf/linux-s390x-ibm
2021-03-18T13:31:52-9de49ae/linux-arm64-packet
2021-03-18T03:52:02-42c25e6/linux-s390x-ibm
2021-03-17T21:24:05-7e00049/linux-amd64-wsl
2021-03-17T19:48:52-a5df883/linux-s390x-ibm
2021-03-17T03:18:12-119d76d/darwin-arm64-11_0-toothrot

@laboger (Contributor) commented Apr 23, 2021

I see this error intermittently when doing test builds on our ppc64le machines, both power8 and power9.

@mknyszek (Contributor) commented Apr 23, 2021

Looking at this again. I was able to reproduce once on linux/amd64.

@mknyszek (Contributor) commented Apr 26, 2021

Of course, I've been unable to reproduce since that last failure. I will try again on the s390x builder since it seems to be more common there.

@mknyszek (Contributor) commented May 10, 2021

OK, finally coming back to this.

What's nice is that this reproduces very readily on s390x. Hopefully I'll get somewhere.

@mknyszek (Contributor) commented May 10, 2021

This is definitely #40460.

General observations:

  • GOMAXPROCS=1 in this test, so we always have a 0.25 fractional worker and zero dedicated workers.
  • This test has a single heavily-allocating goroutine. I'll call it G1.
  • Most of the time, the mark work is satisfied 100% by assists. This leads to wildly incorrect pacing parameters (known issue).
  • Every time the failure happens, the heap erroneously grows beyond the goal during the GC cycle.
  • Every time the failure happens, the 0.25 fractional worker kicks in.

Every failure looks like the following sequence:

  1. G1 starts a GC cycle after allocating a bunch. It immediately drops into a mark assist.
  2. It begins to assist, but it is unable to either (1) steal background credit or (2) perform enough assist work to pay off its debt.
  3. G1, however, was preempted during all this, so it calls Gosched.
  4. The fractional worker at this point wakes up and does some work.
  5. G1 is scheduled once more, not long after, and rechecks to see if it can steal work. Turns out it can.
  6. G1 "over-steals" credit (much like how it "over-assists" to avoid calling into the assist path too much).
  7. The credit it steals turns out to amount to something like 13 MiB, far more than the current heap goal.
  8. At this point, the fractional worker doesn't get scheduled because its scheduling policy is extremely loose, and G1 keeps allocating (13 MiB) until it runs out of credit before assisting once more and ending the GC cycle.
  9. The next GC cycle then has a much higher heap goal, but upon realizing that most of this memory is dead and the heap is actually small, the heap goal drops back to its original point.

So, what's really going wrong here? A couple things:

  1. If the fractional worker was scheduled more regularly, it would pick up the slack and the GC wouldn't last quite so long.
  2. The assist ratio is large enough that stealing 65536 units of work translates into 13 MiB of credit.

Next, I'm going to try to figure out whether there's anything specifically wrong with the assist ratio, or whether it's working as intended and something more fundamental is wrong. The former issue (the fractional worker's loose scheduling) is much harder to fix, and fixing it wouldn't resolve this 100%.
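
To make the "over-steal" concrete, here is a heavily simplified sketch of the credit-stealing math being described. The names approximate the runtime's internals (the over-assist constant really is on the order of 64<<10 units of scan work), but the function itself is illustrative, not the actual mgcmark.go code:

```go
// Heavily simplified sketch of how stealing background scan credit can hand
// a goroutine far more allocation credit than the remaining heap runway.
// Names approximate the runtime's internals; this is not the real assist code.
package main

import "fmt"

// The assist path asks for extra work beyond its debt ("over-assist") so it
// doesn't have to re-enter the assist path too often.
const gcOverAssistWork = 64 << 10

// stealCredit converts an allocation debt into scan work, steals as much of
// that work as possible from the background credit pool, and returns the
// resulting allocation credit in bytes.
func stealCredit(debtBytes, bgScanCredit int64, assistWorkPerByte, assistBytesPerWork float64) int64 {
	scanWork := int64(assistWorkPerByte*float64(debtBytes)) + gcOverAssistWork
	stolen := scanWork
	if bgScanCredit < stolen {
		stolen = bgScanCredit
	}
	// With very little scan work left but ~1 MiB of runway, each unit of
	// work is "worth" many bytes, so ~65536 units can become many MiB of
	// allocation credit.
	return int64(assistBytesPerWork * float64(stolen))
}

func main() {
	// Example numbers: a tiny debt, plenty of background credit, and an
	// assist ratio where one unit of scan work buys ~200 bytes.
	fmt.Println(stealCredit(4<<10, 1<<20, 1.0/200, 200)) // ≈ 13 MB of credit
}
```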

@mknyszek (Contributor) commented May 10, 2021

One other thing to notice: the failure always has the mark assist plus fractional GC help happening at the start of a GC cycle.

So what if the situation is something like this: the fractional GC does a whole bunch of work, so that there's very little work left. Then the assist ratio (computed as "heap remaining / scan work remaining") ends up very large, since we still have ~1 MiB of runway but very little scan work left. This is working as intended (WAI) and would just be a coincidence, but unfortunately it does mean that with GOMAXPROCS=1, GC progress basically hinges on G1. That's really not ideal.

Here's a quick fix idea: what if a goroutine could never have more credit than the amount of heap runway left (basically the difference between the heap goal and the heap size at the point the assist finishes)? Then by construction a goroutine could never allocate past the heap goal without going into assist first and finishing off the GC cycle. The downside is you could have many goroutines try to end GC at the same time (go into gcMarkDone, I mean, and start trying to acquire markDoneSema), stalling the whole program a little bit as everything waits for GC to actually end. This situation should be exceedingly rare, though, and only potentially common when GOMAXPROCS=1 in which case there will only ever be 1 goroutine in gcMarkDone.

@prattmic @randall77 what do you two think?
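
A minimal sketch of the cap being proposed, with illustrative names (this is not the actual pacer code): clamp a goroutine's assist credit to the heap runway remaining when the assist finishes, so it can never allocate past the goal without assisting again.

```go
package main

import "fmt"

// capAssistCredit limits a goroutine's allocation credit to the runway
// between the current heap size and the heap goal, so by construction it
// must assist again (and potentially finish the GC cycle) before the heap
// can pass the goal. Illustrative names, not the real pacer code.
func capAssistCredit(creditBytes, heapLive, heapGoal int64) int64 {
	runway := heapGoal - heapLive
	if runway < 0 {
		runway = 0 // already past the goal; no free credit at all
	}
	if creditBytes > runway {
		creditBytes = runway
	}
	return creditBytes
}

func main() {
	// 13 MB of stolen credit gets clamped to the ~1 MB of runway left.
	fmt.Println(capAssistCredit(13<<20, 4<<20, 5<<20)) // 1048576
}
```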

@mknyszek (Contributor) commented May 11, 2021

By the way, my previous suggestion probably can't land during the freeze, even though it fixes a bug. I don't know what to do about the flaky test in the meantime. I think the issues causing this test to be flaky are somewhat fundamental. Do we change the test to be less flaky, and then write a new one that consistently fails because of this behavior (but leave it turned off)?

@bcmills any ideas?

@bcmills (Member, author) commented May 11, 2021

@mknyszek, it's probably fine to add a call to testenv.SkipFlaky to the existing test for the duration of the freeze.

If you have an idea for a non-flaky test that checks similar behavior, you could add the new test as a backstop against more severe regressions, alongside the SkipFlaky for the current test.
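
testenv.SkipFlaky takes the tracking issue number, so the stopgap would look roughly like the sketch below. The existing test body is paraphrased from memory (it assumes the runtime test package's runTestProg helper and an internal/testenv import); only the SkipFlaky call is the suggested change.

```go
// Sketch of the suggested stopgap inside the runtime test package.
func TestGcSys(t *testing.T) {
	testenv.SkipFlaky(t, 37331) // GC pacing issues make this flaky; see this issue.
	got := runTestProg(t, "testprog", "GCSys")
	want := "OK\n"
	if got != want {
		t.Fatalf("expected %q, but got %q", want, got)
	}
}
```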

@mknyszek (Contributor) commented May 11, 2021

I think what virtually eliminates issues with this test is having GOMAXPROCS >= 4, since the GC then always has one dedicated worker running and can never go too deep into the red. Technically we don't even need the platform and machine to support having 4 CPUs; just having that dedicated worker time-sliced in is enough. I'll see if I can reproduce with GOMAXPROCS set to 4 in various circumstances, but in the past I wasn't able to.

@prattmic (Member) commented May 11, 2021

> Here's a quick fix idea: what if a goroutine could never have more credit than the amount of heap runway left (basically the difference between the heap goal and the heap size at the point the assist finishes)?

This seems reasonable to me, though I don't think I've fully grokked the problem. Given the rarity of the situation, I don't think having a run on markDoneSema would be problematic.

@mknyszek (Contributor) commented May 14, 2021

Unfortunately, I can actually reproduce this issue even when we have a dedicated mark worker (and GOMAXPROCS=4). It's the same basic issue, AFAICT, but I don't fully understand why the allocating goroutine(s) so heavily out-pace(s) the dedicated worker (and one is in fact running the whole time).

I believe my quick fix would resolve this case too, though.

@mknyszek (Contributor) commented May 14, 2021

Looking at the execution trace for the GOMAXPROCS=4 version of the test, I'm noticing several issues (this is on linux/amd64).

For one, the background sweeper gets scheduled surprisingly often. The way it's written, it should basically always be in the global runqueue, and there are 4 goroutines ready to do work (they just allocate heavily). Then, there are a bunch of cases where the dedicated mark worker takes a long time to do things; in one case it even stays scheduled long after a GC has completed. The allocating goroutines also frequently get blocked up in mark assist, finding no way to satisfy their debt. Finally, there are fairly large gaps in the trace where nothing is scheduled on a P at all. It just sits idle.

FWIW, I have 16 cores of available parallelism on my machine (apparently; it's a VM).

I have a bunch of hunches, but I'm not sure what's actually wrong. It could be a few of them all at once.

  • The big delays with the dedicated workers smell a lot like the actual Linux thread getting descheduled (it would explain why the worker apparently "stays scheduled" beyond the GC cycle). Then again, there's plenty of available parallelism, and I'm not sure why Linux would do that.
  • This test directly exploits an issue with the current pacer in that it doesn't account for mark work outside of scannable heap objects. In this case, there are basically zero scannable heap objects and most of the mark work comes from scanning goroutine stacks and scanning global variables (see #42430).
  • Because of all the mark assist descheduling, the goroutines that consistently have work get bounced around and can end up stuck in some local runqueue. The GC cycles are very short (100s of µs) and the usual goroutine quantum is much higher than that, so I imagine that work can go completely unnoticed in the context of this test.

Anyway, I think this test has value in that, yes, the runtime should be able to keep the heap size down if the live heap actually stays small, but I do think it's constructed in such a way that it manages to touch a lot of runtime issues that don't necessarily affect real-world applications.

So, yeah, the 4 P test isn't going to work as a way to side-step this. I'm not really sure what to do aside from marking the test as flaky at this point. I could try to construct a long, big test that actually gives the GC a lot of work (like, a sizeable, scannable heap) that stays at a steady size, and ensure that the heap doesn't grow "too big" for some reasonable definition of "too big."
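
A rough sketch of the kind of replacement test described above (thresholds and sizes are illustrative, not a proposal for the actual runtime test): hold a sizeable, scannable live heap at a steady size, churn transient garbage on the side, and check that the total heap stays within a loose bound of the live set.

```go
// Sketch of a steady-state GC test: a large, pointer-ful live heap plus
// transient garbage, with a loose bound on heap growth. Illustrative only.
package main

import (
	"fmt"
	"runtime"
)

type node struct {
	next *node
	pad  [64]byte
}

var (
	live *node  // tens of MB of live, scannable heap for the GC to mark
	sink []byte // transient garbage
)

func main() {
	// Build the steady-state live set (~half a million nodes).
	for i := 0; i < 1<<19; i++ {
		live = &node{next: live}
	}

	var ms runtime.MemStats
	runtime.GC()
	runtime.ReadMemStats(&ms)
	baseline := ms.HeapSys

	// Churn garbage while the live set stays constant.
	for i := 0; i < 1<<20; i++ {
		sink = make([]byte, 1024)
	}

	runtime.ReadMemStats(&ms)
	if ms.HeapSys > baseline*2 {
		fmt.Printf("heap grew too much: %d -> %d bytes\n", baseline, ms.HeapSys)
		return
	}
	fmt.Println("OK")
}
```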

@prattmic (Member) commented May 17, 2021

This is interesting. We could investigate the hypothesis of an M getting descheduled by looking at Linux scheduler events, either by collecting them with perf record -e 'sched:*' and analyzing manually, or more fancily with something like https://github.com/google/schedviz.

@mknyszek (Contributor) commented May 18, 2021

I did some digging into the history of this test and found that TestGcSys broke at some point before Go 1.12 (the allocation was optimized away; this went unnoticed since the test still passed, and it's hard to say without bisecting exactly when it happened), and once it was fixed it has basically been flaky ever since (starting in Go 1.12). See #23343.

I think this test has helped illuminate a few issues, and as I said earlier something like it is useful, but only once we've fixed the actual issues causing it to fail. Until then, it might make sense to just mark it as flaky so that it doesn't mask other, more pressing flakes on the dashboard. Since I'm planning to do pacer work in 1.18 anyway, that will be a good time to resolve the issues and re-enable the test. Anyway, just a thought. For now I'll keep trying to understand what's happening.

@prattmic I'll give that a try and see what's up with scheduling. I'll try to correlate that with the execution trace, too.
