
runtime: TestGcSys is still flaky #37331

Open
bcmills opened this issue Feb 20, 2020 · 12 comments

@bcmills added this to the Backlog milestone Feb 20, 2020

@josharian (Contributor) commented Mar 3, 2020

Shall we disable the test for now?
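
(A minimal sketch of what a temporary skip could look like, assuming the usual testing.T.Skip pattern in the runtime package's gc_test.go; this is illustrative, not an actual CL:)

package runtime_test

import "testing"

func TestGcSys(t *testing.T) {
	// Temporarily skip: flaky due to GC pacing, see golang.org/issue/37331.
	t.Skip("skipping flaky test; see golang.org/issue/37331")
	// ... existing test body ...
}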

@bcmills (Member, Author) commented Mar 16, 2020

2020-03-15T08:14:24-dc32553/freebsd-amd64-race
2020-03-04T20:52:43-c55a50e/solaris-amd64-oraclerel

@aclements, @mknyszek: what do you want to do about this test? (Do we understand the cause of these flakes?)

@mknyszek (Contributor) commented Mar 17, 2020

@bcmills I'll take a look.

This is usually due to some GC pacing heuristic doing something weird. A GC trace should get us part of the way there.
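
(For reference, a gctrace for just this test can be collected with something along these lines; the exact test-selection flags here are illustrative:

GODEBUG=gctrace=1 go test runtime -run 'TestGcSys$' -count=10

Each resulting "gc N ..." line reports, among other things, the heap size at GC start and end, the live heap, and the heap goal for that cycle.)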

@mknyszek self-assigned this Mar 17, 2020

@mknyszek (Contributor) commented Mar 26, 2020

OK sorry for the delay, finally looking into this now.

@mknyszek (Contributor) commented Mar 26, 2020

Ugh, OK. So this definitely looks like another GOMAXPROCS=1 GC pacing issue. Looking at the gctrace for a bad run on freebsd-amd64-race (which is pretty easily reproducible):

gc 1 @0.000s 1%: 0.010+0.25+0.011 ms clock, 0.010+0/0.055/0.17+0.011 ms cpu, 0->0->0 MB, 4 MB goal, 1 P (forced)
gc 2 @0.001s 3%: 0.011+0.34+0.015 ms clock, 0.011+0.15/0/0+0.015 ms cpu, 4->5->1 MB, 5 MB goal, 1 P
gc 3 @0.003s 4%: 0.011+0.91+0.014 ms clock, 0.011+0.15/0/0+0.014 ms cpu, 4->5->1 MB, 5 MB goal, 1 P
gc 4 @0.005s 4%: 0.013+0.50+0.014 ms clock, 0.013+0.15/0/0+0.014 ms cpu, 4->5->1 MB, 5 MB goal, 1 P
gc 5 @0.007s 5%: 0.010+0.45+0.013 ms clock, 0.010+0.15/0/0+0.013 ms cpu, 4->5->1 MB, 5 MB goal, 1 P
gc 6 @0.008s 6%: 0.011+0.41+0.013 ms clock, 0.011+0.14/0/0+0.013 ms cpu, 4->5->1 MB, 5 MB goal, 1 P
gc 7 @0.009s 7%: 0.012+0.38+0.013 ms clock, 0.012+0.14/0/0+0.013 ms cpu, 4->5->1 MB, 5 MB goal, 1 P
gc 8 @0.010s 7%: 0.012+0.39+0.015 ms clock, 0.012+0.16/0/0+0.015 ms cpu, 4->5->1 MB, 5 MB goal, 1 P
gc 9 @0.011s 8%: 0.012+0.39+0.018 ms clock, 0.012+0.14/0/0+0.018 ms cpu, 4->5->1 MB, 5 MB goal, 1 P
gc 10 @0.012s 5%: 0.012+10+0.014 ms clock, 0.012+0.059/0.13/0+0.014 ms cpu, 4->80->75 MB, 5 MB goal, 1 P
using too much memory: 70813704 bytes

You'll notice that in gc 10, we trigger the GC at the right time, but while it's happening we blow right past the hard goal.

The last time I debugged this, the problem was that we didn't fall back to the hard goal even when we were doing more scan work than expected. It's hard to see how that could be happening again, so something else is likely going on.

One thought: heap_scan is updated less frequently from local_scan in Go 1.14, and most of the GC work in this test comes from assists (because GOMAXPROCS=1). What if the runtime gets into a state where it's consistently behind on the assist ratio? The assist work eventually gets done, but too late to stay under the goal. If that's the case, though, I'm not sure why this is basically impossible to reproduce on Linux, and easily reproducible on the freebsd-amd64-race builders.
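
(For context on what the test exercises: roughly, it allocates a large amount of short-lived garbage while keeping the live heap tiny, then checks that memory obtained from the OS stays bounded; with GOMAXPROCS=1, nearly all of that GC work happens via mutator assists. Below is a minimal standalone sketch of that shape with illustrative constants, not the actual runtime/gc_test.go code; running it with GODEBUG=gctrace=1 produces traces like the one above:)

package main

import (
	"fmt"
	"runtime"
)

// sink keeps the compiler from optimizing the allocations away.
var sink []byte

func main() {
	// With a single P, dedicated GC workers get little CPU time, so the
	// allocating goroutine must pay for marking via assists.
	runtime.GOMAXPROCS(1)

	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	sysBefore := ms.Sys

	// Lots of short-lived garbage, tiny live heap: with well-behaved
	// pacing the heap goal stays small and Sys barely grows.
	for i := 0; i < 1<<20; i++ {
		sink = make([]byte, 1024)
	}

	runtime.ReadMemStats(&ms)
	growth := ms.Sys - sysBefore
	fmt.Printf("Sys grew by %d bytes\n", growth)
	if growth > 16<<20 { // illustrative threshold, not the test's
		fmt.Println("using too much memory:", growth, "bytes")
	}
}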

@nmeum commented May 31, 2020

> I'm not sure why this is basically impossible to reproduce on Linux, and easily reproducible on the freebsd-amd64-race builders.

I think we are also running into this on armv7 and armhf on Alpine Linux edge when building go 1.14.3.

algitbot pushed a commit to alpinelinux/aports that referenced this issue Jun 1, 2020

@mknyszek (Contributor) commented Nov 10, 2020

Having dug into this before, I suspect this is related to #42430 but I'm not sure in what way.

Going back to the thought I had in March (#37331 (comment)): I did change a number of details in this release regarding how heap_scan is updated (specifically, it's updated more often!), so that may be why the failure rate has gone down. @bcmills, are those all the recent failures?

@bcmills (Member, Author) commented Nov 10, 2020

@mknyszek, those are all of the failures I could find using greplogs with the regexp FAIL: TestGcSys, yes.
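
(For anyone unfamiliar with the tooling: greplogs searches the build dashboard's failure logs. The invocation was along these lines; the flags are recalled from memory, so treat them as approximate:

greplogs --dashboard -md -l -e 'FAIL: TestGcSys')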

@mknyszek (Contributor) commented Nov 11, 2020

Oh, I also want to note #40460, which is probably related. I could likely prove it with an execution trace; will look into this soon.
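
(For reference, one way to capture such an execution trace while running the test, using the standard go test -trace flag and trace viewer; if the interesting work happens in a child process the test spawns, the trace would need to be collected there instead:

go test runtime -run 'TestGcSys$' -trace=trace.out
go tool trace trace.out

The viewer shows per-goroutine activity and GC events, which should make it clear whether assists or mark workers are doing the work when the goal is exceeded.)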
