Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: TestReadMetrics failures #60607

Closed
gopherbot opened this issue Jun 5, 2023 · 30 comments
Closed

runtime: TestReadMetrics failures #60607

gopherbot opened this issue Jun 5, 2023 · 30 comments
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@gopherbot
Copy link
Contributor

#!watchflakes
post <- pkg == "runtime" && test == "TestReadMetrics"

Issue created automatically to collect these failures.

Example (log):

--- FAIL: TestReadMetrics (0.00s)
    metrics_test.go:147: live bytes: 5894368 > heap alloc: 5882216

watchflakes

@gopherbot gopherbot added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jun 5, 2023
@gopherbot
Copy link
Contributor Author

Found new dashboard test flakes for:

#!watchflakes
post <- pkg == "runtime" && test == "TestReadMetrics"
2023-06-05 16:41 windows-arm64-11 go@587c1c19 runtime.TestReadMetrics (log)
--- FAIL: TestReadMetrics (0.00s)
    metrics_test.go:147: live bytes: 5894368 > heap alloc: 5882216

watchflakes

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jun 5, 2023
@RuinanSun
Copy link
Contributor

I've been having the issue a lot, and I think it's caused by CL 497315

@mknyszek
Copy link
Contributor

mknyszek commented Jun 7, 2023

This is a test-only issue. I believe there are a few legitimate ways this can happen. I'll fix it.

@mknyszek mknyszek self-assigned this Jun 7, 2023
@mknyszek mknyszek added this to the Go1.21 milestone Jun 7, 2023
@mknyszek
Copy link
Contributor

mknyszek commented Jun 8, 2023

On second thought, I don't see how this could happen. The test is fairly strong in terms of it's consistency. (World is stopped, caches and stats are flushed.) There should be no skew that makes this possible. I can't reproduce on linux/amd64, so the fact that this happened on windows/arm64 is interesting. Maybe there's something odd happening on weak memory architectures.

@mknyszek
Copy link
Contributor

mknyszek commented Jun 8, 2023

@RuinanSun Do you happen to have more details as to where you encountered this error and what you did to reproduce it?

@mknyszek
Copy link
Contributor

mknyszek commented Jun 8, 2023

Ah, nope. Sorry, I had this in my head before, but when I came back to this I forgot. 😅

The issue is that the GC can double-count objects as live if two GC workers race to mark the same object. This is intentional and inconsequential; it would make the GC significantly slower for one worker to have to take ownership of marking an object. Marking is idempotent by design.

The race is rare, and the rest of the runtime is already robust to it. This is indeed a test-only issue. That being said, I'm not sure what to change the test to. That will take some thought.

@gopherbot
Copy link
Contributor Author

Change https://go.dev/cl/501858 mentions this issue: runtime: apply looser bound to /gc/heap/live:bytes in test

@gopherbot
Copy link
Contributor Author

Found new dashboard test flakes for:

#!watchflakes
post <- pkg == "runtime" && test == "TestReadMetrics"
2023-10-31 20:47 linux-amd64-longtest go@b11defea runtime.TestReadMetrics (log)
--- FAIL: TestReadMetrics (0.01s)
    metrics_test.go:67: metric "/memory/classes/heap/free:bytes": got 6496256, want 6397952
    metrics_test.go:69: metric "/memory/classes/heap/released:bytes": got 127950848, want 128049152

watchflakes

@gopherbot gopherbot reopened this Oct 31, 2023
@gopherbot
Copy link
Contributor Author

Found new dashboard test flakes for:

#!watchflakes
post <- pkg == "runtime" && test == "TestReadMetrics"
2023-10-31 20:38 darwin-amd64-longtest go@66b8107a runtime.TestReadMetrics (log)
--- FAIL: TestReadMetrics (0.01s)
    metrics_test.go:67: metric "/memory/classes/heap/free:bytes": got 7143424, want 6750208
    metrics_test.go:69: metric "/memory/classes/heap/released:bytes": got 119906304, want 120299520

watchflakes

@gopherbot
Copy link
Contributor Author

Found new dashboard test flakes for:

#!watchflakes
post <- pkg == "runtime" && test == "TestReadMetrics"
2023-10-31 20:50 darwin-amd64-longtest go@d2f3a68b runtime.TestReadMetrics (log)
--- FAIL: TestReadMetrics (0.01s)
    metrics_test.go:67: metric "/memory/classes/heap/free:bytes": got 3104768, want 2908160
    metrics_test.go:69: metric "/memory/classes/heap/released:bytes": got 119635968, want 119832576

watchflakes

@mknyszek mknyszek modified the milestones: Go1.21, Go1.22 Nov 1, 2023
@gopherbot
Copy link
Contributor Author

Found new dashboard test flakes for:

#!watchflakes
post <- pkg == "runtime" && test == "TestReadMetrics"
2023-11-01 17:49 linux-amd64-longtest go@b7a695bd runtime.TestReadMetrics (log)
--- FAIL: TestReadMetrics (0.01s)
    metrics_test.go:67: metric "/memory/classes/heap/free:bytes": got 2809856, want 2646016
    metrics_test.go:69: metric "/memory/classes/heap/released:bytes": got 106536960, want 106700800

watchflakes

@mknyszek
Copy link
Contributor

mknyszek commented Nov 1, 2023

I suspect a recent change got some accounting wrong. I'll try and bisect.

@mknyszek
Copy link
Contributor

mknyszek commented Nov 1, 2023

I believe I've bisected it down to e293c4b.

CC @cherrymui @4a6f656c

@mknyszek
Copy link
Contributor

mknyszek commented Nov 1, 2023

Reproducer:

GOFLAGS='-gcflags=runtime=-d=maymorestack=runtime.mayMoreStackMove' go test -run="^TestReadMetrics\$" -count=1000 -failfast runtime

@mknyszek
Copy link
Contributor

mknyszek commented Nov 1, 2023

I have a suspicion this is a latent issue that's just cropping up now. My best guess is that stack movement happens between readmemstats_m and readMetrics in ReadMetricsSlow, which causes memory to be allocated from the heap. That memory shifts from "released" to "free" and voila, the stats don't match.

@mknyszek
Copy link
Contributor

mknyszek commented Nov 1, 2023

Oof, yeah, this is going to be very difficult to avoid. We can't stay on the system stack for readMetrics because it acquires a semaphore. That means we can't in general prevent stack movement between the two metrics reads. This can in theory become a problem for the non-mayMoreStackMove tests but it's just so unlikely to happen in practice that it's not a problem.

I'm inclined to just disable the test for mayMoreStackMove somehow, and leave a comment explaining the possible (but exceedingly unlikely) issue for non-mayMoreStackMove

@gopherbot
Copy link
Contributor Author

Found new dashboard test flakes for:

#!watchflakes
post <- pkg == "runtime" && test == "TestReadMetrics"
2023-11-01 17:49 darwin-amd64-longtest go@b7a695bd runtime.TestReadMetrics (log)
--- FAIL: TestReadMetrics (0.01s)
    metrics_test.go:67: metric "/memory/classes/heap/free:bytes": got 6545408, want 6479872
    metrics_test.go:69: metric "/memory/classes/heap/released:bytes": got 120430592, want 120496128

watchflakes

@gopherbot
Copy link
Contributor Author

Found new dashboard test flakes for:

#!watchflakes
post <- pkg == "runtime" && test == "TestReadMetrics"
2023-11-02 03:57 linux-amd64-longtest go@e5ef4846 runtime.TestReadMetrics (log)
--- FAIL: TestReadMetrics (0.01s)
    metrics_test.go:67: metric "/memory/classes/heap/free:bytes": got 3465216, want 3203072
    metrics_test.go:69: metric "/memory/classes/heap/released:bytes": got 89186304, want 89448448
2023-11-02 04:17 linux-amd64-longtest go@11677d98 runtime.TestReadMetrics (log)
--- FAIL: TestReadMetrics (0.01s)
    metrics_test.go:67: metric "/memory/classes/heap/free:bytes": got 3047424, want 2916352
    metrics_test.go:69: metric "/memory/classes/heap/released:bytes": got 89391104, want 89522176

watchflakes

@mknyszek
Copy link
Contributor

mknyszek commented Nov 2, 2023

I had a realization as to how we could work around the constraints of readMetrics. Turns out, not that hard.

@gopherbot
Copy link
Contributor Author

Change https://go.dev/cl/539117 mentions this issue: runtime: eliminate possible stack movements in ReadMetricsSlow

@gopherbot
Copy link
Contributor Author

Found new dashboard test flakes for:

#!watchflakes
post <- pkg == "runtime" && test == "TestReadMetrics"
2023-11-02 08:05 linux-amd64-longtest go@4e896d17 runtime.TestReadMetrics (log)
--- FAIL: TestReadMetrics (0.01s)
    metrics_test.go:67: metric "/memory/classes/heap/free:bytes": got 4112384, want 4046848
    metrics_test.go:69: metric "/memory/classes/heap/released:bytes": got 92413952, want 92479488

watchflakes

@gopherbot
Copy link
Contributor Author

Found new dashboard test flakes for:

#!watchflakes
post <- pkg == "runtime" && test == "TestReadMetrics"
2023-11-02 19:15 darwin-amd64-longtest go@2ffe600d runtime.TestReadMetrics (log)
--- FAIL: TestReadMetrics (0.01s)
    metrics_test.go:67: metric "/memory/classes/heap/free:bytes": got 6422528, want 6160384
    metrics_test.go:69: metric "/memory/classes/heap/released:bytes": got 120569856, want 120832000

watchflakes

@gopherbot
Copy link
Contributor Author

Found new dashboard test flakes for:

#!watchflakes
post <- pkg == "runtime" && test == "TestReadMetrics"
2023-11-02 19:37 darwin-amd64-longtest go@f31a030e runtime.TestReadMetrics (log)
--- FAIL: TestReadMetrics (0.01s)
    metrics_test.go:67: metric "/memory/classes/heap/free:bytes": got 3342336, want 3014656
    metrics_test.go:69: metric "/memory/classes/heap/released:bytes": got 123650048, want 123977728

watchflakes

@gopherbot
Copy link
Contributor Author

Found new dashboard test flakes for:

#!watchflakes
post <- pkg == "runtime" && test == "TestReadMetrics"
2023-11-03 15:06 linux-amd64-longtest go@1764da77 runtime.TestReadMetrics (log)
--- FAIL: TestReadMetrics (0.01s)
    metrics_test.go:67: metric "/memory/classes/heap/free:bytes": got 2891776, want 2760704
    metrics_test.go:69: metric "/memory/classes/heap/released:bytes": got 131645440, want 131776512

watchflakes

@gopherbot
Copy link
Contributor Author

Found new dashboard test flakes for:

#!watchflakes
post <- pkg == "runtime" && test == "TestReadMetrics"
2023-11-03 15:20 windows-amd64-longtest go@e2d9574b runtime.TestReadMetrics (log)
--- FAIL: TestReadMetrics (0.01s)
    metrics_test.go:67: metric "/memory/classes/heap/free:bytes": got 2138112, want 2039808
    metrics_test.go:69: metric "/memory/classes/heap/released:bytes": got 90374144, want 90472448

watchflakes

@gopherbot
Copy link
Contributor Author

Change https://go.dev/cl/539695 mentions this issue: runtime: donate racectx to g0 in ReadMetricsSlow

gopherbot pushed a commit that referenced this issue Nov 3, 2023
ReadMetricsSlow was updated to call the core of readMetrics on the
systemstack to prevent issues with stat skew if the stack gets moved
between readmemstats_m and readMetrics. However, readMetrics calls into
the map implementation, which has race instrumentation. The system stack
typically has no racectx set, resulting in crashes.

Donate racectx to g0 like the tracer does, so that these accesses don't
crash.

For #60607.

Cq-Include-Trybots: luci.golang.try:gotip-linux-amd64-race
Change-Id: Ic0251af2d9b60361f071fe97084508223109480c
Reviewed-on: https://go-review.googlesource.com/c/go/+/539695
Reviewed-by: Cherry Mui <cherryyz@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
@gopherbot
Copy link
Contributor Author

Found new dashboard test flakes for:

#!watchflakes
post <- pkg == "runtime" && test == "TestReadMetrics"
2023-11-03 16:11 linux-amd64-race go@6a32ecc0 runtime.TestReadMetrics (log)
SIGSEGV: segmentation violation
PC=0x4265eb m=18 sigcode=1 addr=0x10

rax    0x4265d0
rbx    0x454668
rcx    0x44d761
rdx    0x454668
rdi    0x0
rsi    0xc00088e000
rbp    0x44d761
...
fs     0x0
gs     0x0

runtime.systemstack_switch()
	/workdir/go/src/runtime/asm_amd64.s:474 +0x8 fp=0xc0002d4348 sp=0xc0002d4338 pc=0x4b4608
runtime.ReadMetricsSlow(0xc0002d43f8, 0xc0000c0000, 0x48, 0x48)
	/workdir/go/src/runtime/export_test.go:427 +0x8e fp=0xc0002d4388 sp=0xc0002d4348 pc=0x4a66ce
runtime_test.TestReadMetrics(0xc0000d7d40)
	/workdir/go/src/runtime/metrics_test.go:49 +0x468 fp=0xc0002d5ee8 sp=0xc0002d4388 pc=0x94a848
testing.tRunner(0xc0000d7d40, 0xaea010)

watchflakes

@dmitshur dmitshur added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Nov 3, 2023
@gopherbot
Copy link
Contributor Author

Found new dashboard test flakes for:

#!watchflakes
post <- pkg == "runtime" && test == "TestReadMetrics"
2023-11-03 16:11 freebsd-amd64-race go@6a32ecc0 runtime.TestReadMetrics (log)
SIGSEGV: segmentation violation
PC=0x404018 m=4 sigcode=1 addr=0x10

rax    0x403ff0
rbx    0x472f48
rcx    0x46c3c1
rdx    0x472f48
rdi    0x0
rsi    0xc001e60000
rbp    0x7fffdfbfbe78
...
fs     0x13
gs     0x1b

runtime.systemstack_switch()
	/tmp/workdir/go/src/runtime/asm_amd64.s:474 +0x8 fp=0xc000036348 sp=0xc000036338 pc=0x4d1ba8
runtime.ReadMetricsSlow(0xc0000363f8, 0xc00005a000, 0x48, 0x48)
	/tmp/workdir/go/src/runtime/export_test.go:427 +0x8e fp=0xc000036388 sp=0xc000036348 pc=0x4c3d4e
runtime_test.TestReadMetrics(0xc00040da00)
	/tmp/workdir/go/src/runtime/metrics_test.go:49 +0x468 fp=0xc000037ee8 sp=0xc000036388 pc=0x9676a8
testing.tRunner(0xc00040da00, 0xb06248)
2023-11-03 16:11 linux-arm64-race go@6a32ecc0 runtime.TestReadMetrics (log)
SIGSEGV: segmentation violation
PC=0x2f0f8 m=5 sigcode=1 addr=0x10

r0      0x0
r1      0xc0008e8000
r2      0x59e38
r3      0x52f80
r4      0xc000064000
r5      0x0
r6      0xc000064000
...
pc      0x2f0f8
fault   0x10

runtime.systemstack_switch()
	/tmp/workdir/go/src/runtime/asm_arm64.s:200 +0x8 fp=0xc0002a2350 sp=0xc0002a2340 pc=0xba9b8
runtime.ReadMetricsSlow(0xc0002a2408, 0xc000064000, 0x48, 0x48)
	/tmp/workdir/go/src/runtime/export_test.go:427 +0x7c fp=0xc0002a2390 sp=0xc0002a2350 pc=0xacaec
runtime_test.TestReadMetrics(0xc0002df520)
	/tmp/workdir/go/src/runtime/metrics_test.go:49 +0x3a8 fp=0xc0002a3ed0 sp=0xc0002a2390 pc=0x476868
testing.tRunner(0xc0002df520, 0x5f8da0)

watchflakes

@gopherbot
Copy link
Contributor Author

Found new dashboard test flakes for:

#!watchflakes
post <- pkg == "runtime" && test == "TestReadMetrics"
2023-11-03 16:11 darwin-amd64-race go@6a32ecc0 runtime.TestReadMetrics (log)
SIGSEGV: segmentation violation
PC=0xc9415c8 m=4 sigcode=1 addr=0x10

rax    0xc9415a0
rbx    0xc9acba8
rcx    0xc9a60a1
rdx    0xc9acba8
rdi    0x0
rsi    0xc00003e000
rbp    0x70000d339e70
...
fs     0x0
gs     0x0

runtime.systemstack_switch()
	/tmp/buildlet/go/src/runtime/asm_amd64.s:474 +0x8 fp=0xc000038348 sp=0xc000038338 pc=0xca0cb68
runtime.ReadMetricsSlow(0xc0000383f8, 0xc00004e000, 0x48, 0x48)
	/tmp/buildlet/go/src/runtime/export_test.go:427 +0x8e fp=0xc000038388 sp=0xc000038348 pc=0xc9fe6ce
runtime_test.TestReadMetrics(0xc000980340)
	/tmp/buildlet/go/src/runtime/metrics_test.go:49 +0x468 fp=0xc000039ee8 sp=0xc000038388 pc=0xce9f348
testing.tRunner(0xc000980340, 0xd103678)
2023-11-03 16:11 linux-amd64-longtest-race go@6a32ecc0 runtime.TestReadMetrics (log)
SIGSEGV: segmentation violation
PC=0x4265eb m=16 sigcode=1 addr=0x10

rax    0x4265d0
rbx    0x454668
rcx    0x44d761
rdx    0x454668
rdi    0x0
rsi    0xc00021e000
rbp    0x44d761
...
fs     0x0
gs     0x0

runtime.systemstack_switch()
	/workdir/go/src/runtime/asm_amd64.s:474 +0x8 fp=0xc0002c8348 sp=0xc0002c8338 pc=0x4b4608
runtime.ReadMetricsSlow(0xc0002c83f8, 0xc0003b0000, 0x48, 0x48)
	/workdir/go/src/runtime/export_test.go:427 +0x8e fp=0xc0002c8388 sp=0xc0002c8348 pc=0x4a66ce
runtime_test.TestReadMetrics(0xc0005031e0)
	/workdir/go/src/runtime/metrics_test.go:49 +0x468 fp=0xc0002c9ee8 sp=0xc0002c8388 pc=0x94a848
testing.tRunner(0xc0005031e0, 0xaea010)
2023-11-03 16:11 linux-s390x-ibm-race go@6a32ecc0 runtime.TestReadMetrics (log)
SIGSEGV: segmentation violation
PC=0x35582 m=4 sigcode=1 addr=0x0

r0   0xc014000000	r1   0x35560
r2   0x0	r3   0xc0020d8000
r4   0x6c11e	r5   0x62460
r6   0xc000058000	r7   0x3ff429e7da0
r8   0x62460	r9   0xc0020d8000
r10  0x6c11e	r11  0x0
r12  0xc0001bc380	r13  0xc000003520
r14  0xe517a	r15  0x3ff429e7c60
pc   0x35582	link 0xe517a

runtime.systemstack_switch()
	/data/golang/workdir/go/src/runtime/asm_s390x.s:238 +0x10 fp=0xc0001bc370 sp=0xc0001bc368 pc=0xe2430
runtime.ReadMetricsSlow(0xc0001bc420, 0xc000058000, 0x48, 0x48)
	/data/golang/workdir/go/src/runtime/export_test.go:427 +0x78 fp=0xc0001bc3a8 sp=0xc0001bc370 pc=0xd1d98
runtime_test.TestReadMetrics(0xc0003db520)
	/data/golang/workdir/go/src/runtime/metrics_test.go:49 +0x474 fp=0xc0001bdee8 sp=0xc0001bc3a8 pc=0x61cf44
testing.tRunner(0xc0003db520, 0x7ea330)
2023-11-03 16:11 windows-amd64-race go@6a32ecc0 runtime.TestReadMetrics (log)
Exception 0xc0000005 0x0 0x10 0x1405f236a
PC=0x1405f236a

runtime: g 0: unknown pc 0x1405f236a
stack: frame={sp:0x46cffd10, fp:0x0} stack=[0x0,0x46cffea0)
0x0000000046cffc10:  0x0000000140048ea5 <runtime.execute+0x0000000000000125>  0x000000c000086518 
0x0000000046cffc20:  0x0000000200000001  0x000000c00001b080 
0x0000000046cffc30:  0x0000000046cffc68  0x000000014004ab5c <runtime.schedule+0x000000000000023c> 
0x0000000046cffc40:  0x0000000046cffc70  0x0000000140051717 <runtime.runqsteal+0x0000000000000037> 
0x0000000046cffc50:  0x000000c000042a00  0x000000c00003e198 
...
0x0000000046cffdf0:  0x000000014071426e  0x0000000000000018 
0x0000000046cffe00:  0x0000000000000000  0x0000000000000000 

runtime.systemstack_switch()
	C:/workdir/go/src/runtime/asm_amd64.s:474 +0x8 fp=0xc0002e4348 sp=0xc0002e4338 pc=0x14007f868
runtime.ReadMetricsSlow(0xc0002e43f8, 0xc000072000, 0x48, 0x48)
	C:/workdir/go/src/runtime/export_test.go:427 +0x8e fp=0xc0002e4388 sp=0xc0002e4348 pc=0x140070a4e
runtime_test.TestReadMetrics(0xc0005f7a00)
	C:/workdir/go/src/runtime/metrics_test.go:49 +0x468 fp=0xc0002e5ee8 sp=0xc0002e4388 pc=0x14052e688
testing.tRunner(0xc0005f7a00, 0x1407351c0)

watchflakes

@mknyszek
Copy link
Contributor

mknyszek commented Nov 3, 2023

These are race mode failures from before https://go.dev/cl/539695. I believe these are all the failures, so they should stop trickling in now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsFix The path to resolution is known, but the work has not been done.
Projects
Archived in project
Development

No branches or pull requests

4 participants