
runtime: "out of memory" in heapsampling.go on 386 and wasm since 2021-11-05 #49564

Status: Closed

bcmills opened this issue Nov 12, 2021 · 19 comments

Labels: arch-386 (Issues solely affecting the 32-bit x86 architecture), arch-wasm (WebAssembly issues), NeedsInvestigation, okay-after-beta1, release-blocker
Milestone: Go1.19
bcmills (Member) commented Nov 12, 2021

bcmills added the OS-Windows, NeedsInvestigation, release-blocker, and arch-386 labels on Nov 12, 2021
bcmills added this to the Go1.18 milestone on Nov 12, 2021
toothrot added the okay-after-beta1 label on Nov 19, 2021
aclements (Member) commented Dec 1, 2021

We haven't seen any more of these failures.

Our working hypothesis is that these were related to the new pacer. That landed 2021-11-04. In all of these failures, the heap is around 1.7 GB when it fails, which is way bigger than we'd expect from this test. Based on printing MemStats.Sys at the end of this test, the largest memory footprint I've seen on my linux/amd64 laptop is 120 MB.

mknyszek (Contributor) commented Dec 1, 2021

I want to amend @aclements' comment a little bit just to say that we've seen these large memory footprints (I think around 113 MiB?) with the old pacer too (GOEXPERIMENT=nopacerredesign). It's still possible this 1.7 GiB uptick was somehow related to the new pacer, but it's less likely to be solely related to that, I think.

What's strange is that the failures showed up twice within a period of about 8 days, then disappeared.

toothrot (Contributor) commented Dec 8, 2021

Checking on this as a release blocker. Is this still occurring @bufflig?

bcmills (Member, Author) commented Dec 8, 2021

Still no more failures in the logs. Maybe this was a bad interaction with some other runtime test that has since been fixed?

greplogs --dashboard -md -l -e FAIL\\s+heapsampling\\.go --since=2021-11-13

cherrymui removed the okay-after-beta1 label on Dec 14, 2021
cherrymui (Member) commented Dec 15, 2021

There still seem to be no new failures since 2021-11-12.

aclements (Member) commented Dec 15, 2021

It's unfortunate that we don't know what caused this, but at this point I think we can close this issue. We'll reopen it if the issue returns.

bcmills (Member, Author) commented Feb 16, 2022

> We'll reopen it if the issue returns.

greplogs --dashboard -md -l -e FAIL\\s+heapsampling\\.go --since=2021-12-15

2022-02-15T21:18:59-08ed488/windows-386-2008

Given the timing (and the relative rarity since the last failure), retargeting to Go 1.19.

bcmills reopened this on Feb 16, 2022
bcmills modified the milestones: Go1.18, Go1.19 on Feb 16, 2022
bcmills (Member, Author) commented Mar 2, 2022

Another one this week:

greplogs --dashboard -md -l -e FAIL\\s+heapsampling\\.go --since=2022-02-16

2022-03-02T20:02:12-fc5b64e/windows-386-2012

jeremyfaller removed their assignment on Mar 23, 2022
bcmills (Member, Author) commented Apr 18, 2022

@golang/runtime, this release-blocking issue needs an owner now that it is no longer assigned to Jeremy.

(This is a recurring test regression on a first class port.)

mknyszek (Contributor) commented Apr 18, 2022

I've been investigating heap overrun issues in a test I wrote for #48409 that appear to be similar to #37331.

So far, I've discovered that any test that allocates in a loop can somewhat easily overrun the heap in an unbounded manner due to an OS scheduler preemption (even on undersubscribed systems) during mark termination. I believe I have sufficient evidence to narrow it down to an OS scheduler preemption at this point, too. I've been trying to obtain a scheduler trace and prove it for certain. It reproduces more readily on VMs, but those traces seem to be busted. I can reproduce on a dedicated machine as well, but not while tracing is active. (I ran it continuously over the weekend trying to obtain a trace -- no dice... on Linux, anyway.)

This bug frankly looks a lot like those cases, and is consistent with previous behavior I've observed when investigating this bug (with smaller spikes).
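The allocate-in-a-loop failure mode described above can be sketched minimally in Go. The object size and counts here are hypothetical and not taken from the #48409 test; the point is only that the live set stays tiny while the total allocated can run far ahead of it:

```go
package main

import "fmt"

// A tight allocation loop like this keeps only a small window of objects
// live at any moment, but if a GC worker is descheduled by the OS during
// mark termination, the heap can briefly grow far beyond the live set.
func main() {
	const live = 64 // objects kept live at a time (hypothetical)
	window := make([][]byte, live)
	total := 0
	for i := 0; i < 1<<14; i++ {
		window[i%live] = make([]byte, 16<<10) // 16 KiB each (hypothetical)
		total += 16 << 10
	}
	// prints: allocated 256 MiB total, ~1024 KiB live
	fmt.Printf("allocated %d MiB total, ~%d KiB live\n",
		total>>20, (live*(16<<10))>>10)
}
```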

mknyszek (Contributor) commented Apr 18, 2022

FWIW, the best mitigation I have come up with so far is to slow down how fast the program allocates by inserting Gosched calls and writing to the allocated memory. I could apply this mitigation to heapsampling.go, but it means a slower test. I could also reduce the number of iterations it does to make up for that.
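That mitigation could look roughly like the sketch below. This is an illustration only, not the actual change that later landed; the allocation size and yield interval are made up:

```go
package main

import (
	"fmt"
	"runtime"
)

// allocate returns a slice after writing to every byte, so the
// allocation is not dead code and the memory is actually touched.
func allocate(n int) []byte {
	b := make([]byte, n)
	for i := range b {
		b[i] = byte(i)
	}
	return b
}

func main() {
	var sink []byte
	for i := 0; i < 4096; i++ {
		sink = allocate(16 << 10)
		// Yield periodically so background GC workers get a chance
		// to run, keeping the allocator from outpacing the GC.
		if i%64 == 0 {
			runtime.Gosched()
		}
	}
	// prints: done, last allocation: 16384
	fmt.Println("done, last allocation:", len(sink))
}
```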

prattmic (Member) commented Apr 18, 2022

Would it be OK to reduce the size of allocations in that test (maybe just on 32-bit systems)?

Based on my back-of-the-envelope calculation, each allocInterleaved call in testInterleavedAllocations allocates ~1GB. Only ~50KB is live at a time, but it seems like a slow GC has a pretty good chance of blowing the address space limit.
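A rough way to restate that back-of-the-envelope check, using the ~1 GB and ~50 KB figures from the comment above (the constants are just those estimates, not measurements):

```go
package main

import "fmt"

func main() {
	const (
		perCall   = 1 << 30  // ~1 GiB allocated per allocInterleaved call (estimate)
		liveBytes = 50 << 10 // ~50 KiB live at any moment (estimate)
		addrSpace = 2 << 30  // rough usable address space for a 32-bit process
	)
	// If a GC cycle stalls for even a fraction of one call's allocations,
	// the heap can grow by hundreds of MiB beyond the tiny live set.
	fmt.Printf("allocated/live ratio: ~%dx\n", perCall/liveBytes)
	fmt.Printf("calls to exhaust address space if nothing is freed: %d\n",
		addrSpace/perCall)
}
```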

alexbrainman (Member) commented Apr 18, 2022

@mknyszek

> It reproduces more readily on VMs, but those traces seem to be busted. I can reproduce on a dedicated machine as well, but not while tracing is active.

There is new code that allows you to create Windows dumps when you set GOTRACEBACK=crash -- see the CL 307372 commit message for how to use the new feature. Maybe you could use delve or windbg to see the state of the crashed test process.

Alex

bcmills (Member, Author) commented May 11, 2022

Curiously, also on js-wasm of all places‽ 🤯

# go run run.go -- heapsampling.go
exit status 1
runtime: out of memory: cannot allocate 4194304-byte block (1052639232 in use)
fatal error: out of memory

greplogs -l -e FAIL\\s+heapsampling\\.go --since=2022-05-06
2022-05-10T20:52:34-e0f9977/js-wasm
2022-05-06T17:41:30-091e913/windows-386-2008

bcmills changed the title on May 11, 2022: runtime: "out of memory" in heapsampling.go on windows-386 since 2021-11-05 → runtime: "out of memory" in heapsampling.go on windows-386 and js-wasm since 2021-11-05
mknyszek (Contributor) commented May 11, 2022

It's not so surprising to me. :) I encountered this failure on js/wasm when landing the SetMemoryLimit changes. Basically, js/wasm is very slow and single-threaded, so if it gets particularly unlucky it can trigger the GC CPU limiter; that makes it especially sensitive to bugs in the GC CPU limiter, at least when a test allocates in a loop like this one.

I suspect https://go.dev/cl/404304 will help on that front, since there are still bugs. :)

bcmills changed the title on May 12, 2022: runtime: "out of memory" in heapsampling.go on windows-386 and js-wasm since 2021-11-05 → runtime: "out of memory" in heapsampling.go on 386 and wasm since 2021-11-05
bcmills added the arch-wasm label and removed the OS-Windows label on May 12, 2022
bcmills (Member, Author) commented May 12, 2022

A different 386 platform this time!

greplogs -l -e FAIL\\s+heapsampling\\.go --since=2022-05-11
2022-05-11T23:39:27-ec5bdef/freebsd-386-12_3

mknyszek (Contributor) commented May 23, 2022

Looking back over this, I think the only thing to do here is to slow down the allocation rate of this test. If it remains an issue after that, we should consider just skipping the test on some platforms. I hope (🤞 ) that the js/wasm issues are resolved by the fixes to the GC CPU limiter.

mknyszek added the okay-after-beta1 label on May 25, 2022
gopherbot commented May 27, 2022

Change https://go.dev/cl/408825 mentions this issue: test/heapsampling.go: slow down allocation rate and reduce iterations

Project status: Done
No branches or pull requests · 9 participants