
runtime: TestPhysicalMemoryUtilization test failures on various builders #49411

Closed
ALTree opened this issue Nov 6, 2021 · 19 comments

Labels: NeedsInvestigation, release-blocker
Milestone: Go1.18

Comments

@ALTree (Member) commented Nov 6, 2021

A trybot run:
https://storage.googleapis.com/go-build-log/25f64550/windows-386-2008_982c8967.log

```
--- FAIL: TestPhysicalMemoryUtilization (0.03s)
    malloc_test.go:246: expected "OK\n", but got "exceeded physical memory overuse threshold of 30.40%: 32.03%\n(alloc: 8581176, goal: 17384516, sys: 20840448, rel: 9510912, objs: 3)\n"
FAIL
FAIL	runtime	48.356s
FAIL
```

But I also see a few of these failures on https://build.golang.org (windows/386 builder).

cc @mknyszek

@ALTree ALTree added the NeedsInvestigation label Nov 6, 2021
@ALTree ALTree added this to the Go1.18 milestone Nov 6, 2021
@ALTree ALTree changed the title runtime: TestPhysicalMemoryUtilization trybot failure on windows/386 runtime: TestPhysicalMemoryUtilization test failure on windows/386 Nov 6, 2021
@mknyszek mknyszek self-assigned this Nov 8, 2021
@mknyszek (Contributor) commented Nov 8, 2021

This immediately makes me remember that timer granularity is really coarse on Windows. I bet that's the problem. Sigh.

@mknyszek (Contributor) commented Nov 8, 2021

Actually, for this test, that shouldn't matter. Hm. Anyway, I'll look into it.

@bcmills (Member) commented Nov 8, 2021

This seems not to be platform-specific, but it reproduces frequently on the windows-386 and openbsd-386 builders.

greplogs --dashboard -md -l -e 'FAIL: TestPhysicalMemoryUtilization'

2021-11-08T18:06:16-5e64755/freebsd-386-12_2
2021-11-06T19:41:12-7ca772a/windows-386-2008
2021-11-06T10:24:44-f19e400/openbsd-386-68
2021-11-05T22:56:55-2f71c86/openbsd-386-68
2021-11-05T22:30:17-b07c41d/openbsd-386-68
2021-11-05T22:28:56-87a3fc5/openbsd-386-68
2021-11-05T22:26:07-d3a80c7/openbsd-386-68
2021-11-05T22:23:52-35a5881/openbsd-386-68
2021-11-05T22:03:24-035963c/windows-386-2008
2021-11-05T21:48:25-6b223e8/openbsd-386-68
2021-11-05T21:28:50-fb8b176/freebsd-386-11_4
2021-11-05T21:27:34-7aed6dd/windows-386-2008
2021-11-05T21:26:54-3e9e024/freebsd-386-12_2
2021-11-05T21:18:28-4c7cafd/openbsd-386-68
2021-11-05T21:13:38-091948a/openbsd-386-68
2021-11-05T17:52:30-df18377/windows-386-2008
2021-11-05T17:46:41-f063e0d/darwin-amd64-nocgo
2021-05-28T03:34:02-3de3440/freebsd-amd64-race
2019-11-22T15:32:23-c02f3b8/linux-mipsle-rtrk
2019-11-22T03:06:22-0e02cfb/linux-mips-rtrk
2019-11-21T16:01:14-c7e73ef/linux-mips-rtrk
2019-11-19T19:21:57-e0306c1/linux-mips-rtrk
2019-11-18T19:04:04-a23f9af/linux-mips64-rtrk
2019-11-13T18:03:37-e762378/linux-mips-rtrk
2019-11-13T14:18:45-c075115/plan9-arm
2019-11-13T08:07:51-bf5f641/windows-386-2008
2019-11-12T18:52:52-1fe33e3/illumos-amd64
2019-11-12T01:09:31-0ee22d9/plan9-arm
2019-11-11T23:28:26-eb68c4a/plan9-arm
2019-11-11T20:17:03-eb55a0c/windows-386-2008
2019-11-11T17:53:51-70be481/linux-mips64-rtrk
2019-11-08T23:22:06-11da2b2/linux-mips64le-rtrk
2019-11-08T18:00:54-a2cd2bd/linux-mips-rtrk
2019-11-04T15:18:34-d3660e8/plan9-386-0intro

@mknyszek (Contributor) commented Nov 8, 2021

Fascinating! Thanks for the additional context. So it looks like a 32-bit thing. I wonder if I can get this to fire on Linux with enough runs.

@bcmills (Member) commented Nov 8, 2021

It's not clear to me whether it's strictly 32-bit — there was one weird darwin-amd64-nocgo outlier at the start of the current run of failures.

@jeremyfaller (Contributor) commented Nov 8, 2021

CC-ing @bufflig

@ALTree ALTree changed the title runtime: TestPhysicalMemoryUtilization test failure on windows/386 runtime: TestPhysicalMemoryUtilization test failures on various builders Nov 8, 2021
@bufflig (Contributor) commented Nov 8, 2021

@mknyszek do you need some Windows-specific work here, like reproducing the failure?

@mknyszek (Contributor) commented Nov 9, 2021

A reproducer would be super helpful. If you want to take a stab at the problem, also, here's some context:

This test checks that when the heap grows, the runtime returns memory to the OS to make up for it. It constructs a heap with a bunch of multiple-of-8-KiB-page-sized holes, reads memstats, then triggers an allocation that grows the heap. It then checks memstats again to see whether the runtime returned at least that much memory to the OS.

This is most likely related to changes to the background scavenger in this cycle. I'm going to take a wild guess and say it'll bisect down to https://golang.org/cl/358675, but the problem is actually in https://golang.org/cl/353975. The former increases the background scavenger's rate, while the latter modifies it to try to do 1 ms of work before going to sleep.

My earlier comment was alluding to poor time granularity on Windows, so that in fact the background scavenger could be doing 15 ms of work instead of 1 ms. An important detail in all this is that the scavenger runs over the heap linearly each "cycle" and maintains a pointer that represents its progress. Maybe that pointer gets stale? I'm not really sure where it goes from there.

@bufflig (Contributor) commented Nov 9, 2021

I'll try to see if I can reproduce it. It appears to be more likely to happen on Windows, so that's promising (for reproducing it, I mean :)).

As for the timer, I'm not sure I see how 15 ms of work would give this effect; I would have thought the scavenger doing less work would cause this, not more?

@bufflig (Contributor) commented Nov 9, 2021

Ah, right: 32-bit. Well, I wanted to set up a 32-bit VM anyway. It might take a little time before I can attempt a reproduction, but I'll give it a shot.

@mknyszek (Contributor) commented Nov 9, 2021

This bug is also reproducing on FreeBSD 32-bit builders. It seems more strongly correlated with a smaller address space, despite that darwin-amd64-nocgo failure.

https://build.golang.org/log/30765367d142efba9a1fdeaaedfdfdf3e57d1638
https://build.golang.org/log/d651f0b4d41d9bb05ea9667d100ce69fc6a6da34

@mknyszek (Contributor) commented Nov 10, 2021

So I've just been doing some debugging on this test to see what it could be doing differently at tip, and... well. It has definitely ceased to be a useful test.

I think it was already fragile as-is, but now it's just not useful. Because of timing, it basically relies on the background scavenger to do the work (AFAICT), which would explain why the target gets missed on these 32-bit systems, especially if they don't have many CPUs.

@mknyszek (Contributor) commented Nov 10, 2021

I think we should strongly consider just marking it flaky for the moment while I try to figure something out here, again.

@mknyszek (Contributor) commented Nov 10, 2021

OK! I have a fix: the Nth iteration of this test. The good news is that it's simpler and closer to the original intent. I was able to reproduce the issue on openbsd/386 and show that the new version doesn't fail.

To follow up on my previous analysis, what seems to have happened here is that this test was fragile with respect to when GCs would run and whether there were extra GC cycles. The new iteration is no longer fragile to that (AFAICT) and also has a tighter bound.

@gopherbot commented Nov 10, 2021

Change https://golang.org/cl/362978 mentions this issue: runtime: rewrite TestPhysicalMemoryUtilization

@bcmills (Member) commented Nov 11, 2021

The good news is, the rewritten test seems to have fixed the windows-386 and openbsd-386 builders! 🎉

The bad news is, it is now failing intermittently on the linux-ppc64* and darwin-arm64 builders. 😞

greplogs --dashboard -md -l -e 'FAIL: TestPhysicalMemoryUtilization' --since=2021-11-10T20:46:00

2021-11-11T13:58:28-d76b1ac/linux-ppc64le-power9osu
2021-11-11T11:47:33-c49627e/linux-ppc64le-buildlet
2021-11-11T11:47:33-c49627e/linux-ppc64le-power9osu
2021-11-11T11:47:18-79e03a9/linux-ppc64le-buildlet
2021-11-11T11:47:18-79e03a9/linux-ppc64le-power9osu
2021-11-11T11:47:08-e9ef931/linux-ppc64le-buildlet
2021-11-11T11:00:33-a01a6d6/linux-mips64le-mengzhuo
2021-11-11T11:00:33-a01a6d6/linux-ppc64le-power9osu
2021-11-11T07:03:15-1ec5108/linux-ppc64le-power9osu
2021-11-11T05:16:39-99fa49e/darwin-arm64-11_0-toothrot
2021-11-11T05:16:39-99fa49e/linux-ppc64le-power9osu
2021-11-11T04:54:05-4b27d40/linux-ppc64le-buildlet
2021-11-11T04:41:56-d5a5a13/linux-ppc64le-buildlet
2021-11-11T04:41:56-d5a5a13/linux-ppc64le-power9osu
2021-11-11T04:02:33-3949faf/linux-ppc64le-buildlet
2021-11-11T04:02:33-3949faf/linux-ppc64le-power9osu
2021-11-10T22:16:25-4d06839/darwin-arm64-11_0-toothrot
2021-11-10T22:16:25-4d06839/linux-ppc64le-power9osu
2021-11-10T21:59:24-23f653d/linux-mips64le-mengzhuo
2021-11-10T21:59:24-23f653d/linux-ppc64le-buildlet
2021-11-10T21:59:23-96c94c2/darwin-arm64-11_0-toothrot
2021-11-10T21:59:23-96c94c2/linux-ppc64le-buildlet
2021-11-10T21:59:23-96c94c2/linux-ppc64le-power9osu
2021-11-10T21:53:03-229b909/darwin-arm64-11_0-toothrot
2021-11-10T21:53:03-229b909/linux-ppc64le-buildlet
2021-11-10T21:32:50-f410786/linux-ppc64le-buildlet

@bcmills bcmills reopened this Nov 11, 2021
@mknyszek (Contributor) commented Nov 11, 2021

Sigh. On it. At least it's the same builders as the FreeOSMemory failure. Probably something to do with these platforms having 64 KiB pages.

@mknyszek (Contributor) commented Nov 11, 2021

Ugh, OK, the problem is clear to me now. The 64 KiB chunks that are supposed to be scavenged are all unaligned, so the scavenger skips over them (runtime pages are 8 KiB, so the chunks need not actually line up with 64 KiB physical pages).

I don't think this test can be engineered such that all the allocations are aligned to 64 KiB. I can make the chunks really big, I guess, so that missing a 64 KiB page here and there isn't a huge deal. I'll experiment quickly.

@gopherbot commented Nov 11, 2021

Change https://golang.org/cl/363415 mentions this issue: runtime: adjust TestPhysicalMemoryUtilization to handle large page sizes
