
internal/trace/v2: OOMs in TestGCStress on 386 platforms #66624

Closed
mknyszek opened this issue Mar 31, 2024 · 4 comments
Assignees: mknyszek
Labels:
  • NeedsFix — The path to resolution is known, but the work has not been done.
  • release-blocker
  • Soon — This needs to be done soon. (regressions, serious bugs, outages)
Milestone: Go1.23

Comments

mknyszek (Contributor) commented:

#!watchflakes
post <- pkg == "internal/trace/v2" && `out of memory` && goarch == "386"

Example failure: https://logs.chromium.org/logs/golang/buildbucket/cr-buildbucket/8752088278110881105/+/u/step/11/log/2.

I suspect this is just bad luck, and the uptick in failures might be correlated with an uptick in noise on our builders. The heap for this test isn't that big, but it does allocate extremely rapidly. It's within the realm of possibility that the GC is transiently failing to keep up, leading to enough memory use that we see failures on 386. I think we can avoid this by setting a memory limit because the actual live heap should be relatively small (last I checked, 50-60 MiB max).

mknyszek self-assigned this Mar 31, 2024
mknyszek added this to the Go1.23 milestone Mar 31, 2024
mknyszek added the NeedsFix, release-blocker, and Soon labels Mar 31, 2024
gopherbot commented:
Change https://go.dev/cl/577475 mentions this issue: internal/trace/v2: halve the memory footprint of TestGCStress

gopherbot pushed a commit that referenced this issue Apr 10, 2024
This test has been OOMing on 32-bit platforms for a bit. I suspect the
very high allocation rate is causing the program to outrun the GC in
some corner-case scenarios, especially on 32-bit Windows.

I don't have a strong grasp of what's going on yet, but lowering the
memory footprint should help with the flakiness. This shouldn't
represent a loss in test coverage, since we're still allocating and
assisting plenty (tracing the latter is a strong reason this test
exists).

For #66624.

Change-Id: Idd832cfc5cde04701386919df4490f201c71130a
Reviewed-on: https://go-review.googlesource.com/c/go/+/577475
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: David Chase <drchase@google.com>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
mknyszek (Contributor, Author) commented:
So the above CL didn't fix the issue. However, after trying this out on a gomote, I can get much more predictable behavior from the GC by not allocating a random size slice (surprise, surprise).

It appears we can fall into an unlucky combination of factors:

  • The GC is struggling to finish in time because the test is only allocating and the pacer hasn't caught on to the absolutely crazy high allocation rate quite yet.
  • All the memory allocated black while the GC is active inflates the next cycle's live heap size.
  • This happens for more than one cycle.
  • The random sizes allocated end up on the bigger side.

I can definitely reproduce big heap inflation early on in the run (not quite 2 GiB, but 100s of MiB) and it always levels out. If I use a consistent slice size, I don't see the same overruns anymore.

I do not understand yet why the flakes here only started to kick up on March 25th. There was only one flake before that in a CL I had been trying out. The only tracing-related CL that could possibly be the culprit is https://go.dev/cl/565937, but that CL shouldn't really affect the code executing in this test much, if at all.

gopherbot commented:
Change https://go.dev/cl/578319 mentions this issue: internal/trace/v2: make TestGCStress less random

mknyszek (Contributor, Author) commented:
Confirmed that this issue is fixed. There have been no new failures on any builders, except for one in a CL which was based on a commit that didn't include the fix.
