Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: sysMemStat overflow on openbsd-mips64-jsing #52682

Closed
mknyszek opened this issue May 3, 2022 · 11 comments
Closed

runtime: sysMemStat overflow on openbsd-mips64-jsing #52682

mknyszek opened this issue May 3, 2022 · 11 comments
Assignees
Labels
arch-mips help wanted NeedsInvestigation OS-OpenBSD
Milestone

Comments

@mknyszek
Copy link
Contributor

@mknyszek mknyszek commented May 3, 2022

Example failure: https://build.golang.org/log/94e9c9c820d29aa509ffe49c14cb5c8784954f99

This started happening as of the SetMemoryLimit CLs. Notably, linux-mips64le looks OK.

@mknyszek mknyszek added NeedsInvestigation release-blocker Soon labels May 3, 2022
@mknyszek mknyszek added this to the Go1.19 milestone May 3, 2022
@mknyszek mknyszek self-assigned this May 3, 2022
@bcmills
Copy link
Member

@bcmills bcmills commented May 3, 2022

@mengzhuo
Copy link
Contributor

@mengzhuo mengzhuo commented May 5, 2022

Bisect shows https://go-review.googlesource.com/c/go/+/399474 is the bad commit

@mknyszek
Copy link
Contributor Author

@mknyszek mknyszek commented May 5, 2022

The overflow is real: the scavenger tries to subtract 16 KiB from heapFree when it's zero, hence the overflow.

Looking at this, it either doesn't seem possible, or something horribly wrong is happening.

heapFree is increased atomically before the pages are made available in the page heap, under a lock. The scavenger is unable to notice that any of those pages are available before that happens, because it grabs that same lock to check for those pages. Only after it finds the free pages, and scavenges them, does it actually decrease heapFree.

What could be happening is an allocation comes in and concurrently grabs those pages with the scavenger, decreasing heapFree twice. Again, this is protected by a lock, so it should be impossible, but if that's the case, that's really bad, because you can end up with spurious zeroing of memory.

What changed in https://go-review.googlesource.com/c/go/+/399474 is that there's a lock-free way of determining which 4 MiB chunks of memory have free pages. The thing is, that index is only updated with the lock held. And even if the scavenger notices that there could be pages in some 4 MiB chunk, it again always grabs the lock to check for certain and acquire those pages before doing anything.

@mknyszek
Copy link
Contributor Author

@mknyszek mknyszek commented May 5, 2022

I will note that the other mips64 builders appear to be fine, at least when it comes to this issue. I have no idea how being on OpenBSD specifically causes this to happen.

@mknyszek
Copy link
Contributor Author

@mknyszek mknyszek commented May 5, 2022

I've reassigned this to @4a6f656c. I don't have the time at this moment to look into this, and it's low priority for me because it's a second class port. I'm willing to take a look later, but I can't make any promises on when.

@bcmills bcmills removed release-blocker Soon labels May 5, 2022
@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented May 18, 2022

This happens in schedinit, before gcinit, so the gcController hasn't been initialized. The bad value is -16384 (which equals to n=-16384), so the original value is indeed 0, i.e. uninitialized.

The allocation is allocating the gsignal stack for m0. On all other platforms but OpenBSD/MIPS64, that stack size is 32K. But it is 64K on OpenBSD/MIPS64 https://cs.opensource.google/go/go/+/master:src/runtime/os_openbsd.go;l=157
This makes the stack allocator to go with the mheap_.allocManual path as it is above _StackCacheSize (=32K).

Would it be possible to initialize the gcController earlier? Or we could allocate that particular stack in some special way (maybe sysAlloc)?

@mknyszek
Copy link
Contributor Author

@mknyszek mknyszek commented May 18, 2022

Oh, huh. Thanks for digging into this.

I think we should just initialize the gcController earlier. I suspect there are other (minor) issues with how late it's initialized. For instance, that means the memory limit is initialized pretty late. I think this will fix the problem because then allocSpan won't ever trigger anything to scavenge that early.

But on another note, how can it find anything to scavenge at all? On other platforms, I believe that nothing is found at all. But if something is found, that suggests that heapFree must be > 0. Unless there's something else I'm missing with the 64K stacks.

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented May 20, 2022

This code https://cs.opensource.google/go/go/+/master:src/runtime/mheap.go;l=1185 calls h.pages.free in allocSpan. I think this may cause the scavenger to find something, as it marks bits in the scavengeIndex, without changing heapFree (?). The condition needPhysPageAlign is OpenBSD specific.

That allocSpan call may be from a stack allocation before the allocation of the 64K stack. What allocation that is? I don't know...

@mknyszek
Copy link
Contributor Author

@mknyszek mknyszek commented May 20, 2022

Oh, wow. OK. Thanks for figuring that out!!

Ugh. This case is super annoying. When that memory gets allocated, its scavenged bits get cleared. So, when the edges of the allocation are subsequently freed, they're always considered unscavenged, even though we didn't actually use them. If we want to maintain that behavior, we need to figure out exactly how much of those "edge" pages were scavenged and make the appropriate update. This looks like it was pretty much always wrong; we just got lucky.

We can avoid this entirely by just finding a big enough region that we can slice up, and then allocate only the aligned part in the first place, so we don't have to free the edges. I'll send a patch.

@gopherbot
Copy link

@gopherbot gopherbot commented May 20, 2022

Change https://go.dev/cl/407502 mentions this issue: runtime: allocate physical-page-aligned memory differently

@mknyszek
Copy link
Contributor Author

@mknyszek mknyszek commented May 20, 2022

Another, potentially simpler solution, would be to just always overallocate and have the stack code do the alignment. It wastes memory instead of CPU time, though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arch-mips help wanted NeedsInvestigation OS-OpenBSD
Projects
None yet
Development

No branches or pull requests

6 participants