
runtime: TestGcSys is still flaky #37331

Open · bcmills opened this issue Feb 20, 2020 · 6 comments
bcmills added this to the Backlog milestone Feb 20, 2020
bcmills (Member, Author) commented Mar 2, 2020

(This comment has been minimized.)

josharian (Contributor) commented Mar 3, 2020

Shall we disable the test for now?
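[Editor's note: a minimal sketch of what disabling would look like, assuming the internal/testenv helper the Go tree already uses for known flakes; the surrounding test body is elided.]

```go
package runtime_test

import (
	"internal/testenv"
	"testing"
)

func TestGcSys(t *testing.T) {
	// Skips with a pointer to this issue so the flake stays tracked.
	testenv.SkipFlaky(t, 37331)
	// ... existing test body ...
}
```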

bcmills (Member, Author) commented Mar 16, 2020

2020-03-15T08:14:24-dc32553/freebsd-amd64-race
2020-03-04T20:52:43-c55a50e/solaris-amd64-oraclerel

@aclements, @mknyszek: what do you want to do about this test? (Do we understand the cause of these flakes?)

mknyszek (Contributor) commented Mar 17, 2020

@bcmills I'll take a look.

This is usually due to some GC pacing heuristic doing something weird. A GC trace should get us part of the way there.
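[Editor's note: the trace in question is the one the runtime emits with GODEBUG=gctrace=1, which prints one line per GC cycle. A minimal way to collect it from this test, assuming a checked-out Go tree (the -count value is only to give the flake more chances to show up):]

```
GODEBUG=gctrace=1 go test -run TestGcSys -count=10 runtime
```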

mknyszek self-assigned this Mar 17, 2020
mknyszek (Contributor) commented Mar 26, 2020

OK, sorry for the delay; finally looking into this now.

mknyszek (Contributor) commented Mar 26, 2020

Ugh, OK. So this definitely looks like another GOMAXPROCS=1 GC pacing issue. Looking at the gctrace for a bad run on freebsd-amd64-race (which is pretty easily reproducible):

```
gc 1 @0.000s 1%: 0.010+0.25+0.011 ms clock, 0.010+0/0.055/0.17+0.011 ms cpu, 0->0->0 MB, 4 MB goal, 1 P (forced)
gc 2 @0.001s 3%: 0.011+0.34+0.015 ms clock, 0.011+0.15/0/0+0.015 ms cpu, 4->5->1 MB, 5 MB goal, 1 P
gc 3 @0.003s 4%: 0.011+0.91+0.014 ms clock, 0.011+0.15/0/0+0.014 ms cpu, 4->5->1 MB, 5 MB goal, 1 P
gc 4 @0.005s 4%: 0.013+0.50+0.014 ms clock, 0.013+0.15/0/0+0.014 ms cpu, 4->5->1 MB, 5 MB goal, 1 P
gc 5 @0.007s 5%: 0.010+0.45+0.013 ms clock, 0.010+0.15/0/0+0.013 ms cpu, 4->5->1 MB, 5 MB goal, 1 P
gc 6 @0.008s 6%: 0.011+0.41+0.013 ms clock, 0.011+0.14/0/0+0.013 ms cpu, 4->5->1 MB, 5 MB goal, 1 P
gc 7 @0.009s 7%: 0.012+0.38+0.013 ms clock, 0.012+0.14/0/0+0.013 ms cpu, 4->5->1 MB, 5 MB goal, 1 P
gc 8 @0.010s 7%: 0.012+0.39+0.015 ms clock, 0.012+0.16/0/0+0.015 ms cpu, 4->5->1 MB, 5 MB goal, 1 P
gc 9 @0.011s 8%: 0.012+0.39+0.018 ms clock, 0.012+0.14/0/0+0.018 ms cpu, 4->5->1 MB, 5 MB goal, 1 P
gc 10 @0.012s 5%: 0.012+10+0.014 ms clock, 0.012+0.059/0.13/0+0.014 ms cpu, 4->80->75 MB, 5 MB goal, 1 P
using too much memory: 70813704 bytes
```
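[Editor's note: a quick key for reading those lines, per the standard GODEBUG=gctrace=1 format documented in the runtime package: the `A->B->C MB` triple is heap size at GC start, heap size at GC end, and live heap, and the goal is the pacer's heap target for the cycle. For the bad cycle:]

```
gc 10 ... 4->80->75 MB, 5 MB goal, 1 P
          |   |   \__ live heap once the cycle finished (75 MB)
          |   \______ heap size when marking ended (80 MB)
          \__________ heap size when the cycle was triggered (4 MB)
```

So gc 10 began on schedule at 4 MB, against a 5 MB goal, but the heap grew to 80 MB before marking finished.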

You'll notice that in gc 10, we trigger the GC at the right time, but while it's happening we blow right past the hard goal.

The last time I debugged this, the problem was that we didn't fall back to the hard goal even when we were doing more scan work than expected. It's hard for me to see how that could be the case again, so there's likely something else going on.

A thought: heap_scan is updated less frequently from local_scan in Go 1.14, and most of the GC work in this test comes from assists (because GOMAXPROCS=1). What if the runtime falls into a state where it's consistently behind on the assist ratio? It finally catches up and the assist work gets done, but too late.

If that's the case, I'm not sure why this is basically impossible to reproduce on Linux, yet easily reproducible on the freebsd-amd64-race builders.
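[Editor's note: to make the assist-ratio hypothesis concrete, here is a minimal sketch of the kind of computation involved. The names and shape are illustrative, loosely modeled on the Go 1.14 pacer's revise step, not the actual runtime code.]

```go
package main

import "fmt"

// assistWorkPerByte returns how much scan work each allocating goroutine
// must pay for per byte allocated. scanExpected is derived from heap_scan;
// if heap_scan is stale because per-P stats (local_scan) haven't been
// flushed, scanExpected comes out too small, the ratio is too low, and
// allocation can outrun marking until the accounting catches up, which is
// the failure mode hypothesized above.
func assistWorkPerByte(heapLive, heapGoal, scanExpected, scanDone int64) float64 {
	scanRemaining := scanExpected - scanDone
	if scanRemaining < 1000 {
		// Keep the ratio well-defined near the end of a cycle.
		scanRemaining = 1000
	}
	headroom := heapGoal - heapLive
	if headroom <= 0 {
		// At or past the goal: demand maximal assists.
		headroom = 1
	}
	return float64(scanRemaining) / float64(headroom)
}

func main() {
	// Stale scan accounting: expected work looks nearly done, so the ratio
	// is tiny even though the heap is almost at its goal.
	fmt.Println(assistWorkPerByte(4<<20, 5<<20, 1<<20, 1<<20-4096))
}
```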
