
runtime/debug: TestFreeOSMemory failure with too few bytes released #52158

Closed
bcmills opened this issue Apr 5, 2022 · 2 comments

@bcmills (Member) commented Apr 5, 2022

--- FAIL: TestFreeOSMemory (0.01s)
    garbage_test.go:149: less than 33554432 released: 3260416 -> 3686400
allocating objects
starting gc
starting dump
done dump
FAIL
FAIL	runtime/debug	0.080s

This is the first such failure since #49478 was closed (attn: @mknyszek; CC @golang/runtime).
Marking as release-blocker because this failure was observed on linux/amd64, which is a first-class port.

greplogs --dashboard -md -l -e 'FAIL: TestFreeOSMemory' --since=2021-11-12

2022-04-05T14:18:19-5210a71/linux-amd64-unified
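For context, the failing check can be sketched roughly like this (a simplified illustration, not the actual garbage_test.go code; the 32 MiB constant matches the 33554432 in the failure message):

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

const bigBytes = 32 << 20 // 32 MiB == 33554432, the amount the failure message mentions

var big *[bigBytes]byte

func main() {
	// Allocate a large object, then drop the reference so it becomes garbage.
	big = new([bigBytes]byte)
	big = nil

	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)

	// FreeOSMemory forces a GC and then returns as much memory
	// to the OS as possible.
	debug.FreeOSMemory()

	runtime.ReadMemStats(&after)
	fmt.Printf("released: %d -> %d\n", before.HeapReleased, after.HeapReleased)
}
```

The real test fails when the growth in HeapReleased falls short of the expected threshold, which is the "less than 33554432 released" line above.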

@bcmills bcmills added NeedsInvestigation release-blocker labels Apr 5, 2022
@bcmills bcmills added this to the Go1.19 milestone Apr 5, 2022
@mknyszek mknyszek self-assigned this Apr 7, 2022
@heschi (Contributor) commented May 11, 2022

ping: this is a release blocker that hasn't been updated in a while.

@mknyszek (Contributor) commented May 18, 2022

Looks like that was the only failure since Nov. 2021 (around when I think I last improved the test).

Looking back at the scavenger code from around that time, there was a very small chance that the scavenger (running from any context, including debug.FreeOSMemory) could miss scavengable memory. The scavenger used to iterate over the heap optimistically, performing racy reads on a pair of bitmaps that are normally protected by a lock.

Because 4 MiB of address space is 64 bytes in the bitmap (roughly a cache line, on purpose), and this test looks for around 16 MiB released (out of 32 MiB freed), only 3 cache lines of the bitmap need to be stale to actually fail the test. It's unlikely, but possible: it can happen if the big allocation is swept not by whatever CPU calls runtime.GC in the test, but by some other CPU running the background sweeper. (It also means that something must have previously loaded those same bitmap words on the first CPU, but given that many runtime/debug tests run on the builder, that seems plausible.)
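The "4 MiB of address space is 64 bytes in the bitmap" arithmetic works out like this (a sketch assuming the runtime's 8 KiB page size and one bit per page in the bitmap):

```go
package main

import "fmt"

func main() {
	const (
		pageSize  = 8 << 10 // 8 KiB runtime page (assumption: the runtime's page size)
		chunk     = 4 << 20 // 4 MiB of address space
		cacheLine = 64      // typical cache line size in bytes
	)
	pages := chunk / pageSize // 512 pages per 4 MiB chunk
	bitmapBytes := pages / 8  // one bit per page -> 64 bytes of bitmap
	fmt.Println(pages, bitmapBytes, bitmapBytes == cacheLine)
}
```

So each stale cache line hides a full 4 MiB of potentially scavengable address space from the optimistic iteration.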

The new implementation instead uses an atomically-managed index on top of the bitmap, so there should be no more racy reads. Given that this is a relatively rare failure, and the old implementation made that failure possible, the racy reads being the problem is not an unreasonable conclusion. I also don't have a better explanation... :P

As a result, I'm inclined to close this issue. Any objections? If this were to happen again under the new implementation, that would be more concerning, and we should definitely re-open.
