Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime/trace: TestTraceCPUProfile failures #55317

Open
gopherbot opened this issue Sep 21, 2022 · 36 comments
Open

runtime/trace: TestTraceCPUProfile failures #55317

gopherbot opened this issue Sep 21, 2022 · 36 comments
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@gopherbot
Copy link

gopherbot commented Sep 21, 2022

#!watchflakes
post <- (pkg == "runtime/trace" && test == "TestTraceCPUProfile") || log ~ `--- FAIL: TestTraceCPUProfile`

Issue created automatically to collect these failures.

Example (log):

SIGQUIT: quit
PC=0x86d24 m=7 sigcode=0

trap    0x0
error   0x0
oldmask 0x0
r0      0x2fbadc
r1      0x80
r2      0x0
r3      0x0
...
	/workdir/go/src/runtime/proc.go:1090 +0x98 fp=0x183acfc sp=0x183acdc pc=0x53e88
runtime.stopTheWorldGC({0x1c6ecd, 0xc})
	/workdir/go/src/runtime/proc.go:1137 +0x48 fp=0x183ad10 sp=0x183acfc pc=0x54054
runtime.StopTrace()
	/workdir/go/src/runtime/trace.go:320 +0x24 fp=0x183ad28 sp=0x183ad10 pc=0x74ab8
runtime/trace.Stop()
	/workdir/go/src/runtime/trace/trace.go:148 +0x78 fp=0x183ad44 sp=0x183ad28 pc=0xe697c
runtime/trace_test.TestTraceCPUProfile(0x1c04e10)
	/workdir/go/src/runtime/trace/trace_test.go:621 +0x1a4 fp=0x183af98 sp=0x183ad44 pc=0x1885ec
testing.tRunner(0x1c04e10, 0x1d3994)

watchflakes

@gopherbot gopherbot added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. compiler/runtime Issues related to the Go compiler and/or runtime. labels Sep 21, 2022
@gopherbot
Copy link
Author

Found new dashboard test flakes for:

#!watchflakes
post <- pkg == "runtime/trace" && test == "TestTraceCPUProfile"
2022-08-18 13:16 linux-arm-aws go@38edd9bd runtime/trace.TestTraceCPUProfile (log)
SIGQUIT: quit
PC=0x86d24 m=7 sigcode=0

trap    0x0
error   0x0
oldmask 0x0
r0      0x2fbadc
r1      0x80
r2      0x0
r3      0x0
...
	/workdir/go/src/runtime/proc.go:1090 +0x98 fp=0x183acfc sp=0x183acdc pc=0x53e88
runtime.stopTheWorldGC({0x1c6ecd, 0xc})
	/workdir/go/src/runtime/proc.go:1137 +0x48 fp=0x183ad10 sp=0x183acfc pc=0x54054
runtime.StopTrace()
	/workdir/go/src/runtime/trace.go:320 +0x24 fp=0x183ad28 sp=0x183ad10 pc=0x74ab8
runtime/trace.Stop()
	/workdir/go/src/runtime/trace/trace.go:148 +0x78 fp=0x183ad44 sp=0x183ad28 pc=0xe697c
runtime/trace_test.TestTraceCPUProfile(0x1c04e10)
	/workdir/go/src/runtime/trace/trace_test.go:621 +0x1a4 fp=0x183af98 sp=0x183ad44 pc=0x1885ec
testing.tRunner(0x1c04e10, 0x1d3994)
2022-09-02 19:09 windows-arm64-10 go@55ca6a20 runtime/trace.TestTraceCPUProfile (log)
fatal error: unexpected signal during runtime execution
[signal 0xc0000005 code=0x0 addr=0x0 pc=0x7ff6f49eb8e4]

runtime stack:
runtime.throw({0x7ff6f4b4c65b?, 0x0?})
	C:/workdir/go/src/runtime/panic.go:1047 +0x40 fp=0xea793ff1b0 sp=0xea793ff180 pc=0x7ff6f49c73a0
runtime.sigpanic()
	C:/workdir/go/src/runtime/signal_windows.go:251 +0x220 fp=0xea793ff200 sp=0xea793ff1b0 pc=0x7ff6f49dcc90
runtime.gentraceback(0x4000041520?, 0x4000024000?, 0xea793ff5a8?, 0x40003c4000, 0x0, 0xea793ff5a8, 0x40, 0x0, 0x0?, 0x6)
	C:/workdir/go/src/runtime/traceback.go:162 +0x304 fp=0xea793ff540 sp=0xea793ff210 pc=0x7ff6f49eb8e4
...
	C:/workdir/go/src/runtime/trace/trace_test.go:774 +0x74 fp=0x40002759a0 sp=0x4000275970 pc=0x7ff6f4af3934
runtime/trace_test.cpuHog1(0x0?)
	C:/workdir/go/src/runtime/trace/trace_test.go:766 +0x24 fp=0x40002759c0 sp=0x40002759a0 pc=0x7ff6f4af3894
runtime/trace_test.cpuHogger(0x7ff6f4b51038, 0x7ff6f4cdfec8, 0x5f5e100)
	C:/workdir/go/src/runtime/trace/trace_test.go:753 +0x54 fp=0x4000275a10 sp=0x40002759c0 pc=0x7ff6f4af37e4
runtime/trace_test.TestTraceCPUProfile.func1(0x5f5e100)
	C:/workdir/go/src/runtime/trace/trace_test.go:618 +0x1f8 fp=0x4000275af0 sp=0x4000275a10 pc=0x7ff6f4af3688
runtime/trace_test.TestTraceCPUProfile(0x400040d380)
	C:/workdir/go/src/runtime/trace/trace_test.go:619 +0x12c fp=0x4000275f60 sp=0x4000275af0 pc=0x7ff6f4af283c
testing.tRunner(0x400040d380, 0x7ff6f4b50fc0)

watchflakes

@rhysh
Copy link
Contributor

rhysh commented Sep 22, 2022

As a first pass:

The failure on linux-arm-aws is a timeout, and looks like a conflict between these two goroutines:

goroutine 32 [waiting]:
runtime.systemstack_switch()
	/workdir/go/src/runtime/asm_arm.s:274 +0x4 fp=0x183acdc sp=0x183acd8 pc=0x84088
runtime.stopTheWorld({0x1c6ecd, 0xc})
	/workdir/go/src/runtime/proc.go:1090 +0x98 fp=0x183acfc sp=0x183acdc pc=0x53e88
runtime.stopTheWorldGC({0x1c6ecd, 0xc})
	/workdir/go/src/runtime/proc.go:1137 +0x48 fp=0x183ad10 sp=0x183acfc pc=0x54054
runtime.StopTrace()
	/workdir/go/src/runtime/trace.go:320 +0x24 fp=0x183ad28 sp=0x183ad10 pc=0x74ab8
runtime/trace.Stop()
	/workdir/go/src/runtime/trace/trace.go:148 +0x78 fp=0x183ad44 sp=0x183ad28 pc=0xe697c
runtime/trace_test.TestTraceCPUProfile(0x1c04e10)
	/workdir/go/src/runtime/trace/trace_test.go:621 +0x1a4 fp=0x183af98 sp=0x183ad44 pc=0x1885ec
testing.tRunner(0x1c04e10, 0x1d3994)
	/workdir/go/src/testing/testing.go:1446 +0x118 fp=0x183afe0 sp=0x183af98 pc=0xf3654
testing.(*T).Run.func1()
	/workdir/go/src/testing/testing.go:1493 +0x30 fp=0x183afec sp=0x183afe0 pc=0xf44ac
runtime.goexit()
	/workdir/go/src/runtime/asm_arm.s:824 +0x4 fp=0x183afec sp=0x183afec pc=0x85d14
created by testing.(*T).Run
	/workdir/go/src/testing/testing.go:1493 +0x3a0
goroutine 82 [running]:
	goroutine running on other thread; stack unavailable
created by runtime/trace.Start
	/workdir/go/src/runtime/trace/trace.go:128 +0xd4

The goroutine created on src/runtime/trace/trace.go:128 is the one that calls runtime.ReadTrace:

go func() {
for {
data := runtime.ReadTrace()
if data == nil {
break
}
w.Write(data)
}
}()


The failure on windows-arm64-10 is a crash, fatal error: unexpected signal during runtime execution / [signal 0xc0000005 code=0x0 addr=0x0 pc=0x7ff6f49eb8e4], coming from runtime.gentraceback as called by runtime.sigprof:

if flags&_TraceJumpStack != 0 && gp == gp.m.g0 && gp.m.curg != nil {

@prattmic prattmic self-assigned this Oct 5, 2022
@seankhliao seankhliao added this to the Backlog milestone Nov 19, 2022
@gopherbot
Copy link
Author

Found new dashboard test flakes for:

#!watchflakes
post <- pkg == "runtime/trace" && test == "TestTraceCPUProfile"
2023-10-14 22:30 linux-arm-aws go@bc9dc8d4 runtime/trace.TestTraceCPUProfile (log)
SIGQUIT: quit
PC=0x8eefc m=0 sigcode=0

trap    0x0
error   0x0
oldmask 0x0
r0      0x329170
r1      0x80
r2      0x0
r3      0x0
...
	/workdir/go/src/runtime/trace/trace_test.go:755 +0x40 fp=0x1c3bc88 sp=0x1c3bc4c pc=0x1a5538
runtime/trace_test.TestTraceCPUProfile.func1.1({0x22a338, 0x1c520d8})
	/workdir/go/src/runtime/trace/trace_test.go:615 +0x34 fp=0x1c3bc9c sp=0x1c3bc88 pc=0x1a5444
runtime/pprof.Do({0x22a2f8, 0x339860}, {{0x1f74f50, 0x1, 0x1}}, 0x1c3bd10)
	/workdir/go/src/runtime/pprof/runtime.go:51 +0xb8 fp=0x1c3bcd8 sp=0x1c3bc9c pc=0x14e21c
runtime/trace_test.TestTraceCPUProfile.func1(0x5f5e100)
	/workdir/go/src/runtime/trace/trace_test.go:614 +0x21c fp=0x1c3bd44 sp=0x1c3bcd8 pc=0x1a5380
runtime/trace_test.TestTraceCPUProfile(0x1eae0f0)
	/workdir/go/src/runtime/trace/trace_test.go:621 +0x190 fp=0x1c3bf94 sp=0x1c3bd44 pc=0x1a4260
testing.tRunner(0x1eae0f0, 0x1f9968)

watchflakes

@gopherbot
Copy link
Author

Found new dashboard test flakes for:

#!watchflakes
post <- pkg == "runtime/trace" && test == "TestTraceCPUProfile"
2023-10-30 17:08 linux-arm-aws go@ecda959b runtime/trace.TestTraceCPUProfile (log)
SIGQUIT: quit
PC=0x8ef88 m=2 sigcode=0

trap    0x0
error   0x0
oldmask 0x0
r0      0x3294f0
r1      0x80
r2      0x2
r3      0x0
...
	/workdir/go/src/runtime/trace/trace_test.go:776 +0x60 fp=0x23b9c8c sp=0x23b9c7c pc=0x1a5bdc
runtime/trace_test.cpuHog1(0x0)
	/workdir/go/src/runtime/trace/trace_test.go:768 +0x24 fp=0x23b9c9c sp=0x23b9c8c pc=0x1a5b5c
runtime/trace_test.cpuHogger(0x1f9bbc, 0x339990, 0x5f5e100)
	/workdir/go/src/runtime/trace/trace_test.go:755 +0x40 fp=0x23b9cd8 sp=0x23b9c9c pc=0x1a5aa4
runtime/trace_test.TestTraceCPUProfile.func1(0x5f5e100)
	/workdir/go/src/runtime/trace/trace_test.go:620 +0x240 fp=0x23b9d44 sp=0x23b9cd8 pc=0x1a5910
runtime/trace_test.TestTraceCPUProfile(0x2542c00)
	/workdir/go/src/runtime/trace/trace_test.go:621 +0x190 fp=0x23b9f94 sp=0x23b9d44 pc=0x1a47cc
testing.tRunner(0x2542c00, 0x1f9af0)

watchflakes

@gopherbot

This comment was marked as resolved.

@gopherbot

This comment was marked as resolved.

@gopherbot
Copy link
Author

Found new dashboard test flakes for:

#!watchflakes
post <- (pkg == "runtime/trace" && test == "TestTraceCPUProfile") || log ~ `--- FAIL: TestTraceCPUProfile`
2023-11-10 15:50 windows-arm64-11 go@9d914015 TestTraceCPUProfile (log)
--- FAIL: TestTraceCPUProfile (2.64s)
    --- FAIL: TestTraceCPUProfile/Default (2.62s)
        reader_test.go:65: unexpected error while reading the trace: expected a proc but didn't have one
        trace_test.go:567: found bad trace; dumping to test log...
        trace_test.go:568: Trace Go1.22
            EventBatch gen=1 m=2144 time=23479457578 size=65458
            ProcStart dt=1 p=2 p_seq=2
            ProcStop dt=1
            ProcStart dt=1 p=2 p_seq=3
            ProcStop dt=1
...
            GoStart dt=1 g=1 g_seq=3938
            GoStop dt=1 reason_string=20 stack=14
            GoStart dt=1 g=1 g_seq=3939
            GoStop dt=1 reason_string=20 stack=14
            GoStart dt=1 g=1 g_seq=3940
            GoStop dt=1 reason_string=20 stack=14
            GoStart dt=1 g=1 g_seq=3941
            GoStop dt=1 reason_st

... log truncated ...
2023-11-10 15:51 windows-arm64-11 go@3b303fa9 TestTraceCPUProfile (log)
--- FAIL: TestTraceCPUProfile (2.70s)
    --- FAIL: TestTraceCPUProfile/Default (2.68s)
        reader_test.go:65: unexpected error while reading the trace: per-M streams are out-of-order
        trace_test.go:567: found bad trace; dumping to test log...
        trace_test.go:568: Trace Go1.22
            EventBatch gen=1 m=9936 time=23496448275 size=65455
            ProcStart dt=1 p=0 p_seq=3
            ProcStop dt=1
            ProcStart dt=1 p=0 p_seq=4
            ProcStop dt=1
...
            GoStop dt=1 reason_string=20 stack=16
            GoStart dt=1 g=1 g_seq=19041
            GoStop dt=1 reason_string=20 stack=16
            GoStart dt=1 g=1 g_seq=19042
            GoStop dt=1 reason_string=20 stack=16
            GoStart dt=1 g=1 g_seq=19043
            GoStop dt=1 reason_string=20 stack=16


... log truncated ...
2023-11-10 16:18 windows-arm64-11 go@abf84221 TestTraceCPUProfile (log)
--- FAIL: TestTraceCPUProfile (2.63s)
    --- FAIL: TestTraceCPUProfile/Default (2.62s)
        reader_test.go:65: unexpected error while reading the trace: per-M streams are out-of-order
        trace_test.go:569: found bad trace; dumping to test log...
        trace_test.go:570: Trace Go1.22
            EventBatch gen=1 m=1420 time=27049777477 size=65457
            ProcStart dt=1 p=0 p_seq=1
            ProcStop dt=1
            ProcStart dt=1 p=0 p_seq=2
            ProcStop dt=1
...
            ProcStop dt=1
            ProcStart dt=1 p=1 p_seq=2698
            ProcStop dt=1
            ProcStart dt=1 p=1 p_seq=2699
            ProcStop dt=1
            ProcStart dt=1 p=1 p_seq=2700
            ProcStop dt=1
            ProcStart dt=1 p=

... log truncated ...
2023-11-13 22:12 freebsd-arm64-dmgk go@773039ed TestTraceCPUProfile (log)
--- FAIL: TestTraceCPUProfile (4.69s)
    --- FAIL: TestTraceCPUProfile/Default (4.65s)
        trace_test.go:177: execution trace did not include all CPU profile samples; 29 in profile, 21 in trace
        trace_test.go:187: execution trace did not include all CPU profile samples for stack "main.cpuHog0:130|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271"; 18 in profile, 12 in trace
        trace_test.go:187: execution trace did not include all CPU profile samples for stack "main.cpuHog0:125|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271"; 4 in profile, 3 in trace
        trace_test.go:195: CPU profile included 1 samples at stack "runtime/internal/atomic.(*Uint32).CompareAndSwap:236|runtime.casgstatus:1135|runtime.execute:3051|runtime.schedule:3862|runtime.goschedImpl:4009|runtime.gosched_m:4014|runtime.mcall:192" not present in execution trace
        trace_test.go:202: execution trace CPU samples:
        trace_test.go:204: 1: "runtime.traceLocker.GoStop:425|runtime.traceLocker.GoSched:415|runtime.goschedImpl:3995|runtime.gosched_m:4014|runtime.mcall:192"
        trace_test.go:204: 1: "runtime.duffzero:8|runtime.traceStack:40|runtime.traceLocker.stack:166|runtime.traceLocker.GoStop:425|runtime.traceLocker.GoSched:415|runtime.goschedImpl:3995|runtime.gosched_m:4014|runtime.mcall:192"
        trace_test.go:204: 1: "runtime/internal/atomic.(*Uint32).Load:194|runtime.(*traceSchedResourceState).statusWasTraced:191|runtime.traceLocker.eventWriter:113|runtime.traceLocker.GoStop:425|runtime.traceLocker.GoSched:415|runtime.goschedImpl:3995|runtime.gosched_m:4014|runtime.mcall:192"
...
            GoStop dt=89 reason_string=20 stack=13
            GoStart dt=47 g=1 g_seq=13463
            GoStop dt=87 reason_string=20 stack=13
            GoStart dt=48 g=1 g_seq=13464
            GoStop dt=87 reason_string=20 stack=13
            GoStart dt=45 g=1 g_seq=13465
            GoStop dt=88 reason_string=20 stack=13
            GoStart dt=48 g=

... log truncated ...
2023-11-14 16:45 linux-386-longtest go@e14b96cb internal/trace/v2.TestTraceCPUProfile (log)
--- FAIL: TestTraceCPUProfile (2.64s)
    --- FAIL: TestTraceCPUProfile/Stress (0.72s)
        trace_test.go:550: stderr: panic: runtime.plainError("send on closed channel")
            fatal error: panic on system stack

            runtime stack:
            runtime.throw({0x8155fbb, 0x15})
            	/workdir/go/src/runtime/panic.go:1016 +0x4d fp=0xffed8918 sp=0xffed8904 pc=0x807d26d
            panic({0x81405c0, 0x817cf80})
            	/workdir/go/src/runtime/panic.go:727 +0x30c fp=0xffed896c sp=0xffed8918 pc=0x807ca8c
...
            runtime.findRunnable()
            	/workdir/go/src/runtime/proc.go:3198 +0x3e0 fp=0xffed8bc8 sp=0xffed8b1c pc=0x80853a0
            runtime.schedule()
            	/workdir/go/src/runtime/proc.go:3812 +0xbd fp=0xffed8be4 sp=0xffed8bc8 pc=0x808709d
            runtime.goschedImpl(0xa408120, 0x0)
            	/workdir/go/src/runtime/proc.go:4009 +0x14d fp=0xffed8bfc sp=0xffed8be4 pc=0x808794d
            runtime.gosched_m(0xa408120)
            	/workdir/go/src/runtime/proc.go:4014 +0x26 fp=0xffed8c08 sp=0xffed8bfc pc=0x80879a6
            runtime.mcall(0x225220)
            	/workdir/go/src/runtime/asm_386.s:329 +0x44 fp=0xffed8c10 sp=0xffed8c08 pc=0x80af9f4
2023-11-15 20:08 freebsd-arm64-dmgk go@2380862c TestTraceCPUProfile (log)
--- FAIL: TestTraceCPUProfile (4.27s)
    --- FAIL: TestTraceCPUProfile/Default (4.23s)
        trace_test.go:177: execution trace did not include all CPU profile samples; 28 in profile, 14 in trace
        trace_test.go:187: execution trace did not include all CPU profile samples for stack "main.cpuHog0:126|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271"; 2 in profile, 1 in trace
        trace_test.go:187: execution trace did not include all CPU profile samples for stack "main.cpuHog0:130|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271"; 20 in profile, 9 in trace
        trace_test.go:195: CPU profile included 1 samples at stack "aeshashbody:711|runtime.(*traceMap).put:62|runtime.(*traceStackTable).put:95|runtime.traceStack:79|runtime.traceLocker.stack:166|runtime.traceLocker.GoStop:425|runtime.traceLocker.GoSched:415|runtime.goschedImpl:4051|runtime.gosched_m:4070|runtime.mcall:192" not present in execution trace
        trace_test.go:195: CPU profile included 1 samples at stack "runtime.traceStack:70|runtime.traceLocker.stack:166|runtime.traceLocker.GoStop:425|runtime.traceLocker.GoSched:415|runtime.goschedImpl:4051|runtime.gosched_m:4070|runtime.mcall:192" not present in execution trace
        trace_test.go:202: execution trace CPU samples:
        trace_test.go:204: 9: "main.cpuHog0:130|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271"
        trace_test.go:204: 1: "main.cpuHog0:126|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271"
...
            GoStart dt=50 g=1 g_seq=13593
            GoStop dt=89 reason_string=20 stack=12
            GoStart dt=50 g=1 g_seq=13594
            GoStop dt=89 reason_string=20 stack=12
            GoStart dt=49 g=1 g_seq=13595
            GoStop dt=92 reason_string=20 stack=12
            GoStart dt=50 g=1 g_seq=13596
            GoStop

... log truncated ...
2023-11-16 19:26 openbsd-386-72 go@d573a8a8 runtime/trace.TestTraceCPUProfile (log)
fatal error: runtime: out of memory

runtime stack:
runtime.throw({0x81ffb4e, 0x16})
	/tmp/workdir/go/src/runtime/panic.go:1023 +0x4d fp=0x12e44704 sp=0x12e446f0 pc=0x8082ddd
runtime.sysMapOS(0xad000000, 0x800000)
	/tmp/workdir/go/src/runtime/mem_bsd.go:81 +0x142 fp=0x12e44730 sp=0x12e44704 pc=0x805ef32
runtime.sysMap(0xad000000, 0x800000, 0x834b118)
	/tmp/workdir/go/src/runtime/mem.go:155 +0x3f fp=0x12e44740 sp=0x12e44730 pc=0x805ed0f
runtime.(*mheap).grow(0x833dde0, 0x375)
...
	/tmp/workdir/go/src/internal/trace/parser.go:269 +0x152b fp=0x6dcd8ba0 sp=0x6dcd8a6c pc=0x816775b
internal/trace.parse({0x8239b70, 0x6dde00f0}, {0x0, 0x0})
	/tmp/workdir/go/src/internal/trace/parser.go:104 +0x27 fp=0x6dcd8bec sp=0x6dcd8ba0 pc=0x8165f87
internal/trace.Parse({0x8239b70, 0x6dde00f0}, {0x0, 0x0})
	/tmp/workdir/go/src/internal/trace/parser.go:91 +0x37 fp=0x6dcd8c1c sp=0x6dcd8bec pc=0x8165e67
runtime/trace_test.parseTrace(0x6dca8300, {0x8239b70, 0x6dde00f0})
	/tmp/workdir/go/src/runtime/trace/trace_test.go:154 +0x47 fp=0x6dcd8d54 sp=0x6dcd8c1c pc=0x81b11c7
runtime/trace_test.TestTraceCPUProfile(0x6dca8300)
	/tmp/workdir/go/src/runtime/trace/trace_test.go:699 +0x461 fp=0x6dcd8f98 sp=0x6dcd8d54 pc=0x81b4361
testing.tRunner(0x6dca8300, 0x820a06c)

watchflakes

@rhysh
Copy link
Contributor

rhysh commented Nov 17, 2023

This latest set of failures seem related to the v2 tracer; those on arm64 are in the internal/trace/v2 package (right after the ok internal/trace line).

The linux-386-longtest failure looks like a logical race in the v2 tracer's wakeableSleep mechanism, near

s.(*wakeableSleep).wake()

CC @mknyszek for those.


The openbsd-386-72 failure is for the v1 runtime/trace.TestTraceCPUProfile test, running out of memory while parsing the trace.

events = append(events, ev)
is events = append(events, ev), where ev is a type rawEvent struct {int; byte; []uint64; []string}, which on GOARCH=386 would be 8 words of 4 bytes each, 32 bytes in total. The failed allocation is runtime.growslice(0xac500000, 0x2c301, 0x2c300, 0x1, 0x81e4340), growing that slice beyond 180992 entries. That's 5.8 MB (growing to 0x375 pages of 8 kiB each, 7.3 MB) which doesn't seem large by itself, but maybe 180k+ entries in the test trace is unusually large? I don't have a baseline for that. There are nice log lines in mheap.grow when the call to mheap.sysAlloc fails, but the crash here is in sysMap (which throws directly in sysMapOS) so the log doesn't include context on the total heap size.

@bcmills
Copy link
Contributor

bcmills commented Nov 17, 2023

the crash here is in sysMap (which throws directly in sysMapOS) so the log doesn't include context on the total heap size.

Compare:

@gopherbot
Copy link
Author

Found new dashboard test flakes for:

#!watchflakes
post <- (pkg == "runtime/trace" && test == "TestTraceCPUProfile") || log ~ `--- FAIL: TestTraceCPUProfile`
2023-11-20 15:10 windows-amd64-longtest go@0709f1bb internal/trace/v2.TestTraceCPUProfile (log)
--- FAIL: TestTraceCPUProfile (4.67s)
    --- FAIL: TestTraceCPUProfile/Stress (2.38s)
        trace_test.go:109: failed to parse CPU profile summary in stderr: runtime: failing to flush
            	full:
            runtime: failing to flush
            runtime: m=3
            main.cpuHog0:125|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271	3
            main.cpuHog0:126|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271	16
            runtime.mcall:436	1
            runtime.stdcall2:1000|runtime.semasleep:696|runtime.lock2:88|runtime.lockWithRank:24|runtime.lock:35|runtime.(*traceMap).put.func1:72|runtime.(*traceMap).put:71|runtime.(*traceStackTable).put:95|runtime.traceStack:79|runtime.traceLocker.stack:166|runtime.traceLocker.GoStop:425|runtime.traceLocker.GoSched:415|runtime.goschedImpl:4051|runtime.gosched_m:4070|runtime.mcall:458	1
            runtime.execute:3108|runtime.schedule:3918|runtime.goschedImpl:4065|runtime.gosched_m:4070|runtime.mcall:458	1
            main.cpuHog0:130|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271	1

watchflakes

@mknyszek
Copy link
Contributor

mknyszek commented Nov 20, 2023

Sorry for not replying sooner. The linux-386-longtest failure was fixed by https://go.dev/cl/542335.

The new failure is fun -- that may be an actual failure at runtime from the new tracer. I'll try to reproduce.

As a side-note, the runtime/trace tests barely use the new tracer at all (TestTraceCPUProfile doesn't). If there's a failure there, it's not related to that specifically.

@mknyszek
Copy link
Contributor

Oh! No, that new failure is not real actually. I put a low-quality optimistic "deadlock detector" but it turns out there was no deadlock in this case at all. The test just didn't expect the additional output.

I'll put the "deadlock detector" behind a constant. It's for debugging anyway.

@gopherbot
Copy link
Author

Change https://go.dev/cl/544235 mentions this issue: runtime: disable trace v2 optimistic deadlock debugger

gopherbot pushed a commit that referenced this issue Nov 21, 2023
The v2 execution tracer has a rudimentary deadlock detector, but it's
based on an arbitrary threshold that an actually get hit even if there's
no deadlock. This ends up breaking tests sometimes, and it would be bad
if this just appeared in production logs.

Put this 'deadlock detector' behind a flag.

For #55317.

Change-Id: I286f0c05b3ac9600f4f2f9696065cac8bbd25f00
Reviewed-on: https://go-review.googlesource.com/c/go/+/544235
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
@gopherbot
Copy link
Author

Found new dashboard test flakes for:

#!watchflakes
post <- (pkg == "runtime/trace" && test == "TestTraceCPUProfile") || log ~ `--- FAIL: TestTraceCPUProfile`
2023-11-21 22:23 freebsd-arm64-dmgk go@ff05cdbd TestTraceCPUProfile (log)
--- FAIL: TestTraceCPUProfile (2.16s)
    --- FAIL: TestTraceCPUProfile/Default (2.13s)
        trace_test.go:177: execution trace did not include all CPU profile samples; 25 in profile, 4 in trace
        trace_test.go:187: execution trace did not include all CPU profile samples for stack "main.cpuHog0:130|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271"; 19 in profile, 1 in trace
        trace_test.go:187: execution trace did not include all CPU profile samples for stack "main.cpuHog0:125|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271"; 4 in profile, 1 in trace
        trace_test.go:202: execution trace CPU samples:
        trace_test.go:204: 1: "main.cpuHog0:130|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271"
        trace_test.go:204: 1: "main.cpuHog0:126|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271"
        trace_test.go:204: 1: "main.cpuHog0:125|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271"
        trace_test.go:204: 1: "runtime.releasem:586|runtime.traceRelease:238|runtime.execute:3133|runtime.schedule:3922|runtime.goschedImpl:4069|runtime.gosched_m:4074|runtime.mcall:192"
...
            GoStop dt=73 reason_string=20 stack=14
            GoStart dt=7 g=1 g_seq=13726
            GoStop dt=73 reason_string=20 stack=14
            GoStart dt=7 g=1 g_seq=13727
            GoStop dt=74 reason_string=20 stack=14
            GoStart dt=8 g=1 g_seq=13728
            GoStop dt=74 reason_string=20 stack=14
            GoStart dt=6 g=1 g_seq=13729

... log truncated ...

watchflakes

@gopherbot
Copy link
Author

Found new dashboard test flakes for:

#!watchflakes
post <- (pkg == "runtime/trace" && test == "TestTraceCPUProfile") || log ~ `--- FAIL: TestTraceCPUProfile`
2023-11-21 17:24 freebsd-arm64-dmgk go@0cb45bac TestTraceCPUProfile (log)
--- FAIL: TestTraceCPUProfile (4.36s)
    --- FAIL: TestTraceCPUProfile/Default (4.31s)
        trace_test.go:177: execution trace did not include all CPU profile samples; 29 in profile, 14 in trace
        trace_test.go:187: execution trace did not include all CPU profile samples for stack "main.cpuHog0:130|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271"; 20 in profile, 9 in trace
        trace_test.go:195: CPU profile included 1 samples at stack "runtime.acquirem:577|runtime.traceAcquireEnabled:193|runtime.traceAcquire:176|runtime.execute:3124|runtime.schedule:3921|runtime.goschedImpl:4068|runtime.gosched_m:4073|runtime.mcall:192" not present in execution trace
        trace_test.go:195: CPU profile included 1 samples at stack "runtime.releasem:584|runtime.traceRelease:238|runtime.execute:3132|runtime.schedule:3921|runtime.goschedImpl:4068|runtime.gosched_m:4073|runtime.mcall:192" not present in execution trace
        trace_test.go:195: CPU profile included 2 samples at stack "main.cpuHog0:125|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271" not present in execution trace
        trace_test.go:202: execution trace CPU samples:
        trace_test.go:204: 1: "runtime/internal/atomic.(*Uint32).Load:194|runtime.(*traceSchedResourceState).statusWasTraced:191|runtime.traceLocker.eventWriter:113|runtime.traceLocker.GoStart:398|runtime.execute:3131|runtime.schedule:3921|runtime.goschedImpl:4068|runtime.gosched_m:4073|runtime.mcall:192"
        trace_test.go:204: 1: "runtime.vdsoClockGettime:81|runtime.nanotime1:101|runtime.nanotime:19|runtime.traceClockNow:56|runtime.traceWriter.event:146|runtime.traceEventWriter.write:131|runtime.traceLocker.GoStart:399|runtime.execute:3131|runtime.schedule:3921|runtime.goschedImpl:4068|runtime.gosched_m:4073|runtime.mcall:192"
...
            GoStop dt=81 reason_string=20 stack=12
            GoStart dt=53 g=1 g_seq=12759
            GoStop dt=78 reason_string=20 stack=12
            GoStart dt=51 g=1 g_seq=12760
            GoStop dt=83 reason_string=20 stack=12
            GoStart dt=53 g=1 g_seq=12761
            GoStop dt=78 reason_string=20 stack=12
            GoStart dt=49 g=1 g_seq=12

... log truncated ...

watchflakes

@gopherbot
Copy link
Author

Found new dashboard test flakes for:

#!watchflakes
post <- (pkg == "runtime/trace" && test == "TestTraceCPUProfile") || log ~ `--- FAIL: TestTraceCPUProfile`
2023-11-23 17:33 freebsd-arm64-dmgk go@0c7e5d3b internal/trace/v2.TestTraceCPUProfile (log)
--- FAIL: TestTraceCPUProfile (2.13s)
    --- FAIL: TestTraceCPUProfile/Default (2.13s)
        trace_test.go:177: execution trace did not include all CPU profile samples; 26 in profile, 16 in trace
        trace_test.go:187: execution trace did not include all CPU profile samples for stack "main.cpuHog0:126|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271"; 3 in profile, 2 in trace
        trace_test.go:187: execution trace did not include all CPU profile samples for stack "main.cpuHog0:130|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271"; 16 in profile, 8 in trace
        trace_test.go:187: execution trace did not include all CPU profile samples for stack "main.cpuHog0:125|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271"; 4 in profile, 3 in trace
        trace_test.go:202: execution trace CPU samples:
        trace_test.go:204: 1: "runtime/internal/atomic.(*Uint32).CompareAndSwap:236|runtime.casgstatus:1138|runtime.goschedImpl:4062|runtime.gosched_m:4086|runtime.mcall:192"
        trace_test.go:204: 2: "main.cpuHog0:126|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271"
        trace_test.go:204: 1: "runtime.mcall:179"
        trace_test.go:204: 8: "main.cpuHog0:130|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271"
        trace_test.go:204: 1: "runtime.casgstatus:1121|runtime.goschedImpl:4062|runtime.gosched_m:4086|runtime.mcall:192"
        trace_test.go:204: 3: "main.cpuHog0:125|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271"
        trace_test.go:206: CPU profile:
            main.cpuHog0:130|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271	16
            runtime.casgstatus:1121|runtime.goschedImpl:4062|runtime.gosched_m:4086|runtime.mcall:192	1
            main.cpuHog0:125|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271	4
            runtime/internal/atomic.(*Uint32).CompareAndSwap:236|runtime.casgstatus:1138|runtime.goschedImpl:4062|runtime.gosched_m:4086|runtime.mcall:192	1
            main.cpuHog0:126|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271	3
            runtime.mcall:179	1
        trace_test.go:570: found bad trace; dumping to test log...
        trace_test.go:579: text trace too large to dump (7221666 bytes)

watchflakes

@gopherbot
Copy link
Author

Found new dashboard test flakes for:

#!watchflakes
post <- (pkg == "runtime/trace" && test == "TestTraceCPUProfile") || log ~ `--- FAIL: TestTraceCPUProfile`
2023-12-01 19:58 netbsd-386-9_3 go@40f6fbf1 internal/trace/v2.TestTraceCPUProfile (log)
--- FAIL: TestTraceCPUProfile (2.77s)
    --- FAIL: TestTraceCPUProfile/Default (2.77s)
        trace_test.go:551: stderr: SIGSEGV: segmentation violation
            PC=0x80a2971 m=3 sigcode=1 addr=0x0

            goroutine 0 gp=0x8406b48 m=3 mp=0x8440808 [idle]:
            runtime.readgstatus(...)
            	/tmp/workdir/go/src/runtime/proc.go:1056
            runtime.traceWriter.writeProcStatusForP({{0x8440808, 0x3}, 0x0}, 0x8432008, 0x0)
            	/tmp/workdir/go/src/runtime/trace2status.go:91 +0x61 fp=0x8455ea0 sp=0x8455e78 pc=0x80a2971
...
            esi    0xa3b0
            ebp    0x82212e0
            esp    0x8455e78
            eip    0x80a2971
            eflags 0x10246
            cs     0x17
            fs     0xab
            gs     0xb3
            exit status 2
        trace_test.go:553: exit status 1

watchflakes

@gopherbot
Copy link
Author

Found new dashboard test flakes for:

#!watchflakes
post <- (pkg == "runtime/trace" && test == "TestTraceCPUProfile") || log ~ `--- FAIL: TestTraceCPUProfile`
2023-12-01 21:24 linux-ppc64-sid-buildlet go@ecb9d9b9 internal/trace/v2.TestTraceCPUProfile (log)
--- FAIL: TestTraceCPUProfile (4.04s)
    --- FAIL: TestTraceCPUProfile/Default (4.04s)
        trace_test.go:551: stderr: SIGSEGV: segmentation violation
            PC=0x729a0 m=3 sigcode=1 addr=0x0

            goroutine 0 gp=0xc000003040 m=3 mp=0xc000040e08 [idle]:
            runtime.readgstatus(...)
            	/workdir/go/src/runtime/proc.go:1056
            runtime.traceWriter.writeProcStatusForP({{0xc000040e08?, 0x0?}, 0x0?}, 0x0?, 0x0?)
            	/workdir/go/src/runtime/trace2status.go:91 +0x110 fp=0xc000057cb0 sp=0xc000057c68 pc=0x729a0
...
            r22  0xc0000543a0	r23  0xc000057da8
            r24  0x0	r25  0x2016a0
            r26  0x22	r27  0x0
            r28  0x0	r29  0x0
            r30  0xc000003040	r31  0x0
            pc   0x729a0	ctr  0x0
            link 0x7a1b4	xer  0x0
            ccr  0x22420080	trap 0x300
            exit status 2
        trace_test.go:553: exit status 1

watchflakes

@gopherbot
Copy link
Author

Found new dashboard test flakes for:

#!watchflakes
post <- (pkg == "runtime/trace" && test == "TestTraceCPUProfile") || log ~ `--- FAIL: TestTraceCPUProfile`
2023-12-01 19:30 netbsd-arm64-bsiegert go@70c7fb75 internal/trace/v2.TestTraceCPUProfile (log)
--- FAIL: TestTraceCPUProfile (7.58s)
    --- FAIL: TestTraceCPUProfile/Default (7.57s)
        trace_test.go:551: stderr: SIGSEGV: segmentation violation
            PC=0x6b440 m=5 sigcode=1 addr=0x0

            goroutine 0 gp=0x40000844e0 m=5 mp=0x4000080708 [idle]:
            runtime.readgstatus(...)
            	/var/gobuilder/buildlet/go/src/runtime/proc.go:1056
            runtime.traceWriter.writeProcStatusForP({{0x4000080708?, 0x0?}, 0x0?}, 0x0?, 0x0?)
            	/var/gobuilder/buildlet/go/src/runtime/trace2status.go:91 +0xf0 fp=0x4000091ce0 sp=0x4000091ca0 pc=0x6b440
...
            r26     0x136df8
            r27     0x0
            r28     0x40000844e0
            r29     0x4000091c98
            lr      0x71888
            sp      0x4000091ca0
            pc      0x6b440
            fault   0x0
            exit status 2
        trace_test.go:553: exit status 1

watchflakes

@gopherbot
Copy link
Author

Found new dashboard test flakes for:

#!watchflakes
post <- (pkg == "runtime/trace" && test == "TestTraceCPUProfile") || log ~ `--- FAIL: TestTraceCPUProfile`
2023-12-08 00:50 freebsd-arm64-dmgk go@4bf1ca4b internal/trace/v2.TestTraceCPUProfile (log)
--- FAIL: TestTraceCPUProfile (2.19s)
    --- FAIL: TestTraceCPUProfile/Default (2.19s)
        trace_test.go:177: execution trace did not include all CPU profile samples; 25 in profile, 17 in trace
        trace_test.go:187: execution trace did not include all CPU profile samples for stack "main.cpuHog0:125|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271"; 3 in profile, 2 in trace
        trace_test.go:187: execution trace did not include all CPU profile samples for stack "main.cpuHog0:130|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271"; 21 in profile, 14 in trace
        trace_test.go:202: execution trace CPU samples:
        trace_test.go:204: 14: "main.cpuHog0:130|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271"
        trace_test.go:204: 1: "main.cpuHog0:126|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271"
        trace_test.go:204: 2: "main.cpuHog0:125|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271"
        trace_test.go:206: CPU profile:
            main.cpuHog0:130|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271	21
            main.cpuHog0:126|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271	1
            main.cpuHog0:125|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271	3
        trace_test.go:570: found bad trace; dumping to test log...
        trace_test.go:579: text trace too large to dump (7168963 bytes)

watchflakes

@prattmic
Copy link
Member

Note that #55317 (comment) seems to have dropped CPU samples.

@gopherbot
Copy link
Author

Found new dashboard test flakes for:

#!watchflakes
post <- (pkg == "runtime/trace" && test == "TestTraceCPUProfile") || log ~ `--- FAIL: TestTraceCPUProfile`
2023-12-13 00:46 netbsd-arm-bsiegert go@7cac742f internal/trace/v2.TestTraceCPUProfile (log)
--- FAIL: TestTraceCPUProfile (17.92s)
    --- FAIL: TestTraceCPUProfile/Default (17.92s)
        reader_test.go:65: unexpected error while reading the trace: inconsistent thread for syscalling goroutine 6: thread has goroutine -1
        trace_test.go:570: found bad trace; dumping to test log...
        trace_test.go:579: text trace too large to dump (10855978 bytes)

watchflakes

@gopherbot
Copy link
Author

Found new dashboard test flakes for:

#!watchflakes
post <- (pkg == "runtime/trace" && test == "TestTraceCPUProfile") || log ~ `--- FAIL: TestTraceCPUProfile`
2023-12-18 19:20 netbsd-386-9_3 go@a7097243 internal/trace/v2.TestTraceCPUProfile (log)
--- FAIL: TestTraceCPUProfile (3.85s)
    --- FAIL: TestTraceCPUProfile/Default (3.85s)
        reader_test.go:65: unexpected error while reading the trace: inconsistent thread for syscalling goroutine 6: thread has goroutine 1
        trace_test.go:570: found bad trace; dumping to test log...
        trace_test.go:579: text trace too large to dump (9072023 bytes)

watchflakes

@gopherbot
Copy link
Author

Found new dashboard test flakes for:

#!watchflakes
post <- (pkg == "runtime/trace" && test == "TestTraceCPUProfile") || log ~ `--- FAIL: TestTraceCPUProfile`
2023-12-18 20:06 linux-s390x-ibm go@450f5d90 internal/trace/v2.TestTraceCPUProfile (log)
--- FAIL: TestTraceCPUProfile (6.80s)
    --- FAIL: TestTraceCPUProfile/Default (6.80s)
        trace_test.go:177: execution trace did not include all CPU profile samples; 25 in profile, 10 in trace
        trace_test.go:187: execution trace did not include all CPU profile samples for stack "main.cpuHog0:126|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271"; 13 in profile, 4 in trace
        trace_test.go:187: execution trace did not include all CPU profile samples for stack "runtime.nanotime1:380"; 2 in profile, 1 in trace
        trace_test.go:195: CPU profile included 1 samples at stack "runtime.(*traceSchedResourceState).statusWasTraced:200|runtime.traceLocker.eventWriter:113|runtime.traceLocker.GoStart:398|runtime.execute:3128|runtime.schedule:3918|runtime.goschedImpl:4065|runtime.gosched_m:4070|runtime.mcall:230" not present in execution trace
        trace_test.go:195: CPU profile included 1 samples at stack "runtime.pcvalue:877|runtime.pcdatavalue1:1073|runtime.(*inlineUnwinder).resolveInternal:69|runtime.(*inlineUnwinder).next:84|runtime.tracebackPCs:627|runtime.gcallers:1112|runtime.traceStack:56|runtime.traceLocker.stack:166|runtime.traceLocker.GoStop:425|runtime.traceLocker.GoSched:415|runtime.goschedImpl:4051|runtime.gosched_m:4070|runtime.mcall:230" not present in execution trace
        trace_test.go:195: CPU profile included 1 samples at stack "runtime.(*unwinder).resolveInternal:252|runtime.(*unwinder).next:512|runtime.tracebackPCs:622|runtime.gcallers:1112|runtime.traceStack:56|runtime.traceLocker.stack:166|runtime.traceLocker.GoStop:425|runtime.traceLocker.GoSched:415|runtime.goschedImpl:4051|runtime.gosched_m:4070|runtime.mcall:230" not present in execution trace
        trace_test.go:195: CPU profile included 1 samples at stack "runtime.(*unwinder).resolveInternal:252|runtime.(*unwinder).initAt:223|runtime.(*unwinder).init:128|runtime.gcallers:1111|runtime.traceStack:56|runtime.traceLocker.stack:166|runtime.traceLocker.GoStop:425|runtime.traceLocker.GoSched:415|runtime.goschedImpl:4051|runtime.gosched_m:4070|runtime.mcall:230" not present in execution trace
        trace_test.go:195: CPU profile included 1 samples at stack "runtime.releasem:589|runtime.pcvalue:895|runtime.funcspdelta:1032|runtime.(*unwinder).resolveInternal:325|runtime.(*unwinder).next:512|runtime.tracebackPCs:622|runtime.gcallers:1112|runtime.traceStack:56|runtime.traceLocker.stack:166|runtime.traceLocker.GoStop:425|runtime.traceLocker.GoSched:415|runtime.goschedImpl:4051|runtime.gosched_m:4070|runtime.mcall:230" not present in execution trace
...
            main.cpuHog0:126|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271	13
            runtime.nanotime1:380	2
            runtime.(*moduledata).textAddr:606|runtime.funcInfo.entry:762|runtime.pcvalue:907|runtime.funcspdelta:1032|runtime.(*unwinder).resolveInternal:325|runtime.(*unwinder).next:512|runtime.tracebackPCs:622|runtime.gcallers:1112|runtime.traceStack:56|runtime.traceLocker.stack:166|runtime.traceLocker.GoStop:425|runtime.traceLocker.GoSched:415|runtime.goschedImpl:4051|runtime.gosched_m:4070|runtime.mcall:230	1
            runtime.releasem:589|runtime.pcvalue:895|runtime.funcspdelta:1032|runtime.(*unwinder).resolveInternal:325|runtime.(*unwinder).next:512|runtime.tracebackPCs:622|runtime.gcallers:1112|runtime.traceStack:56|runtime.traceLocker.stack:166|runtime.traceLocker.GoStop:425|runtime.traceLocker.GoSched:415|runtime.goschedImpl:4051|runtime.gosched_m:4070|runtime.mcall:230	1
            runtime.pcvalue:877|runtime.pcdatavalue1:1073|runtime.(*inlineUnwinder).resolveInternal:69|runtime.(*inlineUnwinder).next:84|runtime.tracebackPCs:627|runtime.gcallers:1112|runtime.traceStack:56|runtime.traceLocker.stack:166|runtime.traceLocker.GoStop:425|runtime.traceLocker.GoSched:415|runtime.goschedImpl:4051|runtime.gosched_m:4070|runtime.mcall:230	1
            runtime.(*unwinder).valid:227|runtime.tracebackPCs:622|runtime.gcallers:1112|runtime.traceStack:56|runtime.traceLocker.stack:166|runtime.traceLocker.GoStop:425|runtime.traceLocker.GoSched:415|runtime.goschedImpl:4051|runtime.gosched_m:4070|runtime.mcall:230	1
            runtime.(*moduledata).textAddr:592|runtime.funcInfo.entry:762|runtime.pcvalue:912|runtime.pcdatavalue1:1073|runtime.(*inlineUnwinder).resolveInternal:69|runtime.newInlineUnwinder:61|runtime.tracebackPCs:627|runtime.gcallers:1112|runtime.traceStack:56|runtime.traceLocker.stack:166|runtime.traceLocker.GoStop:425|runtime.traceLocker.GoSched:415|runtime.goschedImpl:4051|runtime.gosched_m:4070|runtime.mcall:230	1
            runtime.findmoduledatap:735|runtime.findfunc:772|runtime.(*unwinder).next:449|runtime.tracebackPCs:622|runtime.gcallers:1112|runtime.traceStack:56|runtime.traceLocker.stack:166|runtime.traceLocker.GoStop:425|runtime.traceLocker.GoSched:415|runtime.goschedImpl:4051|runtime.gosched_m:4070|runtime.mcall:230	1
        trace_test.go:570: found bad trace; dumping to test log...
        trace_test.go:579: text trace too large to dump (6827534 bytes)

watchflakes

@gopherbot
Copy link
Author

Found new dashboard test flakes for:

#!watchflakes
post <- (pkg == "runtime/trace" && test == "TestTraceCPUProfile") || log ~ `--- FAIL: TestTraceCPUProfile`
2023-12-19 04:39 freebsd-arm64-dmgk go@6fe0d375 internal/trace/v2.TestTraceCPUProfile (log)
--- FAIL: TestTraceCPUProfile (2.10s)
    --- FAIL: TestTraceCPUProfile/Default (2.10s)
        trace_test.go:177: execution trace did not include all CPU profile samples; 26 in profile, 17 in trace
        trace_test.go:187: execution trace did not include all CPU profile samples for stack "main.cpuHog0:125|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271"; 4 in profile, 2 in trace
        trace_test.go:187: execution trace did not include all CPU profile samples for stack "main.cpuHog0:130|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271"; 15 in profile, 10 in trace
        trace_test.go:195: CPU profile included 1 samples at stack "runtime.(*traceBuf).varint:189|runtime.traceWriter.event:157|runtime.traceEventWriter.write:131|runtime.traceEventWriter.commit:125|runtime.traceLocker.GoStop:425|runtime.traceLocker.GoSched:415|runtime.goschedImpl:4051|runtime.gosched_m:4070|runtime.mcall:192" not present in execution trace
        trace_test.go:195: CPU profile included 1 samples at stack "main.cpuHog0:126|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271" not present in execution trace
        trace_test.go:202: execution trace CPU samples:
        trace_test.go:204: 1: "runtime/internal/atomic.(*Uint32).CompareAndSwap:236|runtime.casgstatus:1122|runtime.execute:3107|runtime.schedule:3918|runtime.goschedImpl:4065|runtime.gosched_m:4070|runtime.mcall:192"
        trace_test.go:204: 1: "runtime.duffzero:10|runtime.traceStack:40|runtime.traceLocker.stack:166|runtime.traceLocker.GoStop:425|runtime.traceLocker.GoSched:415|runtime.goschedImpl:4051|runtime.gosched_m:4070|runtime.mcall:192"
        trace_test.go:204: 2: "main.cpuHog0:125|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271"
        trace_test.go:204: 10: "main.cpuHog0:130|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271"
        trace_test.go:204: 1: "runtime.binuptime:64|runtime.vdsoClockGettime:93|runtime.nanotime1:101|runtime.nanotime:19|runtime.traceClockNow:56|runtime.traceWriter.event:146|runtime.traceEventWriter.write:131|runtime.traceLocker.GoStart:399|runtime.execute:3128|runtime.schedule:3918|runtime.goschedImpl:4065|runtime.gosched_m:4070|runtime.mcall:192"
        trace_test.go:204: 1: "runtime.traceAcquireEnabled:209|runtime.traceAcquire:176|runtime.execute:3121|runtime.schedule:3918|runtime.goschedImpl:4065|runtime.gosched_m:4070|runtime.mcall:192"
        trace_test.go:204: 1: "runtime/internal/atomic.(*Uint32).CompareAndSwap:236|runtime.casgstatus:1122|runtime.goschedImpl:4046|runtime.gosched_m:4070|runtime.mcall:192"
        trace_test.go:206: CPU profile:
            main.cpuHog0:125|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271	4
            runtime.binuptime:64|runtime.vdsoClockGettime:93|runtime.nanotime1:101|runtime.nanotime:19|runtime.traceClockNow:56|runtime.traceWriter.event:146|runtime.traceEventWriter.write:131|runtime.traceLocker.GoStart:399|runtime.execute:3128|runtime.schedule:3918|runtime.goschedImpl:4065|runtime.gosched_m:4070|runtime.mcall:192	1
            runtime.traceAcquireEnabled:209|runtime.traceAcquire:176|runtime.execute:3121|runtime.schedule:3918|runtime.goschedImpl:4065|runtime.gosched_m:4070|runtime.mcall:192	1
            runtime/internal/atomic.(*Uint32).CompareAndSwap:236|runtime.casgstatus:1122|runtime.goschedImpl:4046|runtime.gosched_m:4070|runtime.mcall:192	1
            runtime/internal/atomic.(*Uint32).CompareAndSwap:236|runtime.casgstatus:1122|runtime.execute:3107|runtime.schedule:3918|runtime.goschedImpl:4065|runtime.gosched_m:4070|runtime.mcall:192	1
            runtime.duffzero:10|runtime.traceStack:40|runtime.traceLocker.stack:166|runtime.traceLocker.GoStop:425|runtime.traceLocker.GoSched:415|runtime.goschedImpl:4051|runtime.gosched_m:4070|runtime.mcall:192	1
            main.cpuHog0:130|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271	15
            runtime.(*traceBuf).varint:189|runtime.traceWriter.event:157|runtime.traceEventWriter.write:131|runtime.traceEventWriter.commit:125|runtime.traceLocker.GoStop:425|runtime.traceLocker.GoSched:415|runtime.goschedImpl:4051|runtime.gosched_m:4070|runtime.mcall:192	1
            main.cpuHog0:126|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271	1
        trace_test.go:570: found bad trace; dumping to test log...
        trace_test.go:579: text trace too large to dump (7258728 bytes)

watchflakes

@gopherbot
Copy link
Author

Found new dashboard test flakes for:

#!watchflakes
post <- (pkg == "runtime/trace" && test == "TestTraceCPUProfile") || log ~ `--- FAIL: TestTraceCPUProfile`
2024-01-08 20:15 freebsd-arm64-dmgk go@b702e043 internal/trace/v2.TestTraceCPUProfile (log)
--- FAIL: TestTraceCPUProfile (2.14s)
    --- FAIL: TestTraceCPUProfile/Default (2.14s)
        trace_test.go:177: execution trace did not include all CPU profile samples; 27 in profile, 17 in trace
        trace_test.go:187: execution trace did not include all CPU profile samples for stack "main.cpuHog0:130|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271"; 17 in profile, 11 in trace
        trace_test.go:187: execution trace did not include all CPU profile samples for stack "main.cpuHog0:126|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271"; 2 in profile, 1 in trace
        trace_test.go:187: execution trace did not include all CPU profile samples for stack "main.cpuHog0:125|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271"; 2 in profile, 1 in trace
        trace_test.go:195: CPU profile included 1 samples at stack "runtime/internal/atomic.(*Uint32).Load:194|runtime.readgstatus:1040|runtime.goschedImpl:4041|runtime.gosched_m:4070|runtime.mcall:192" not present in execution trace
        trace_test.go:195: CPU profile included 1 samples at stack "memeqbody:58|runtime.(*traceMap).find:105|runtime.(*traceMap).put:64|runtime.(*traceStackTable).put:95|runtime.traceStack:79|runtime.traceLocker.stack:166|runtime.traceLocker.GoStop:425|runtime.traceLocker.GoSched:415|runtime.goschedImpl:4051|runtime.gosched_m:4070|runtime.mcall:192" not present in execution trace
        trace_test.go:202: execution trace CPU samples:
        trace_test.go:204: 1: "main.cpuHog0:126|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271"
        trace_test.go:204: 1: "main.cpuHog0:125|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271"
        trace_test.go:204: 1: "runtime.duffzero:54|runtime.traceStack:40|runtime.traceLocker.stack:166|runtime.traceLocker.GoStop:425|runtime.traceLocker.GoSched:415|runtime.goschedImpl:4051|runtime.gosched_m:4070|runtime.mcall:192"
        trace_test.go:204: 1: "runtime.vdsoClockGettime:81|runtime.nanotime1:101|runtime.nanotime:19|runtime.traceClockNow:56|runtime.traceWriter.event:146|runtime.traceEventWriter.write:131|runtime.traceEventWriter.commit:125|runtime.traceLocker.GoStop:425|runtime.traceLocker.GoSched:415|runtime.goschedImpl:4051|runtime.gosched_m:4070|runtime.mcall:192"
        trace_test.go:204: 11: "main.cpuHog0:130|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271"
        trace_test.go:204: 1: "runtime.mcall:191"
        trace_test.go:204: 1: "runtime.execute:3108|runtime.schedule:3918|runtime.goschedImpl:4065|runtime.gosched_m:4070|runtime.mcall:192"
        trace_test.go:206: CPU profile:
            main.cpuHog0:125|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271	2
            memeqbody:58|runtime.(*traceMap).find:105|runtime.(*traceMap).put:64|runtime.(*traceStackTable).put:95|runtime.traceStack:79|runtime.traceLocker.stack:166|runtime.traceLocker.GoStop:425|runtime.traceLocker.GoSched:415|runtime.goschedImpl:4051|runtime.gosched_m:4070|runtime.mcall:192	1
            runtime.vdsoClockGettime:81|runtime.nanotime1:101|runtime.nanotime:19|runtime.traceClockNow:56|runtime.traceWriter.event:146|runtime.traceEventWriter.write:131|runtime.traceEventWriter.commit:125|runtime.traceLocker.GoStop:425|runtime.traceLocker.GoSched:415|runtime.goschedImpl:4051|runtime.gosched_m:4070|runtime.mcall:192	1
            main.cpuHog0:130|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271	17
            runtime.mcall:191	1
            runtime.execute:3108|runtime.schedule:3918|runtime.goschedImpl:4065|runtime.gosched_m:4070|runtime.mcall:192	1
            main.cpuHog0:126|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271	2
            runtime.duffzero:54|runtime.traceStack:40|runtime.traceLocker.stack:166|runtime.traceLocker.GoStop:425|runtime.traceLocker.GoSched:415|runtime.goschedImpl:4051|runtime.gosched_m:4070|runtime.mcall:192	1
            runtime/internal/atomic.(*Uint32).Load:194|runtime.readgstatus:1040|runtime.goschedImpl:4041|runtime.gosched_m:4070|runtime.mcall:192	1
        trace_test.go:570: found bad trace; dumping to test log...
        trace_test.go:579: text trace too large to dump (7305444 bytes)

watchflakes

@bcmills
Copy link
Contributor

bcmills commented Jan 9, 2024

@prattmic, @mknyszek: this still has a fairly high failure rate. Is it still being investigated? (Would it make sense to add a testenv.SkipFlaky for this failure mode for the 1.22 release?)

@mknyszek
Copy link
Contributor

mknyszek commented Jan 9, 2024

Apologies, I got deeply side-tracked.

I think SkipFlaky on some platforms is correct here. It occurs to me that this test is inherently flaky. There's nothing guaranteeing that the goroutine that transfers CPU samples from the profile buffer to the trace will run before tracing ends. It can definitely happen that the goroutine just gets delayed for some reason and samples are dropped. More specifically, the goroutine that does the transfer awakens for the first time to see tracing is off, so it ignores the rest of the profile buffer's contents and drops them on the floor. It can't really do anything else because tracing is off; there's nowhere to put the information! This is fine -- samples are always allowed to drop.

I'll think about how to make this test less flaky, but given that it's not failing with any regularly on first-class platforms, I think doing platform-specific skips is fine. There's nothing really platform-specific about this code.

@bcmills
Copy link
Contributor

bcmills commented Jan 9, 2024

If there is nothing platform-specific about the code, why would a platform-specific skip be even remotely appropriate‽

One of the interesting properties of the reverse builders for secondary ports is that they run on more diverse hardware — with more diverse scheduling — than our GCE-based builders. It seems gravely mistaken to assume that external users on first-class ports are using hardware and kernels that match our GCE builders, so why should we assume that they will not also run into this flakiness when running go test all or similar on their own CI systems?

@mknyszek
Copy link
Contributor

mknyszek commented Jan 9, 2024

Yeah, you're right. I frequently forget that other users might run go test all; I was mainly just thinking about test coverage loss on our side.

I think I can just make the test more resilient to lost samples. It seems like in all these failures at least some of the samples show up in the trace. I figure we can just make the test pass if at least one of these samples shows up in the trace.

This only verifies that functionality works at a baseline, not how well it works, but given that the functionality is best-effort anyway, I think that's about as good as we can make it for now.

@mknyszek
Copy link
Contributor

Taking a step back, I think this actually represents a regression in behavior vs. the old tracer, but it was really hard to tell because of how best-effort this functionality is.

Specifically, the old tracer was subtly ensuring that there was a CPU profile buffer flush sometime close to when a trace would end, because the trace reader would get woken up a whole bunch. In the new tracer the CPU profile buffer is translated into the execution trace via a separate goroutine. This has in general been positive for understanding the best-effort-ness of this behavior, but what we lost was this property that a flush was likely to happen close in time to when the trace ended.

My proposal to restore this behavior is just to wake the trace CPU profile reader goroutine before we're about to change generations. It does not ensure that a flush happens, but makes it far more likely. It's still best-effort, but a bit more predictable and will provide more understandable results.

Overall, I'm not really that happy with the fact that this is so best-effort still. Early on in the trace design I tried to write this trace CPU profile reader goroutine to actually block on the CPU profile buffer getting filled, making it much more precise. But because of the problems on Darwin with CPU profiles we encountered earlier this year, I had to work around that.

In the future it would be good to consider blocking in traceAdvance until we can ensure a recent flush of the CPU profile buffer happened. I'm reluctant to introduce that now just because that's how you introduce deadlocks in a new release, but I'll file an issue to track that.

I'll send a CL with the better-effort wakeup which should hopefully resolve most, if not all of the failures seen in this issue.

@gopherbot
Copy link
Author

Change https://go.dev/cl/555495 mentions this issue: runtime: make a much better effort to emit CPU profile in a trace

@mknyszek
Copy link
Contributor

@prattmic convinced me to try to do the right thing here, and eliminate all parts of the tracer making this CPU profile -> trace process best-effort. https://go.dev/cl/555495 implements this. I was originally skeptical given that we're pretty late in the release cycle, but I figure at the very least we can have this on hand and decide whether it's worth submitting now. If it's not, I can land it for next cycle and I'll send the wakeup patch (which is like, 5 lines of code) which will certainly not make the situation any worse.

@mknyszek mknyszek assigned mknyszek and unassigned prattmic Jan 12, 2024
@gopherbot
Copy link
Author

Found new dashboard test flakes for:

#!watchflakes
post <- (pkg == "runtime/trace" && test == "TestTraceCPUProfile") || log ~ `--- FAIL: TestTraceCPUProfile`
2024-01-12 16:19 freebsd-arm64-dmgk go@e58e8139 internal/trace/v2.TestTraceCPUProfile (log)
--- FAIL: TestTraceCPUProfile (2.09s)
    --- FAIL: TestTraceCPUProfile/Default (2.09s)
        trace_test.go:177: execution trace did not include all CPU profile samples; 27 in profile, 9 in trace
        trace_test.go:187: execution trace did not include all CPU profile samples for stack "main.cpuHog0:130|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271"; 22 in profile, 8 in trace
        trace_test.go:187: execution trace did not include all CPU profile samples for stack "main.cpuHog0:126|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271"; 2 in profile, 1 in trace
        trace_test.go:195: CPU profile included 1 samples at stack "aeshashbody:597|runtime.(*traceMap).put:62|runtime.(*traceStackTable).put:95|runtime.traceStack:79|runtime.traceLocker.stack:166|runtime.traceLocker.GoStop:425|runtime.traceLocker.GoSched:415|runtime.goschedImpl:4051|runtime.gosched_m:4070|runtime.mcall:192" not present in execution trace
        trace_test.go:195: CPU profile included 1 samples at stack "main.cpuHog0:125|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271" not present in execution trace
        trace_test.go:195: CPU profile included 1 samples at stack "runtime.releasem:588|runtime.traceRelease:238|runtime.execute:3129|runtime.schedule:3918|runtime.goschedImpl:4065|runtime.gosched_m:4070|runtime.mcall:192" not present in execution trace
        trace_test.go:202: execution trace CPU samples:
        trace_test.go:204: 8: "main.cpuHog0:130|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271"
        trace_test.go:204: 1: "main.cpuHog0:126|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271"
        trace_test.go:206: CPU profile:
            main.cpuHog0:130|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271	22
            main.cpuHog0:126|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271	2
            main.cpuHog0:125|main.cpuHog1:120|main.cpuHogger:107|main.main.func1.1:44|runtime/pprof.Do:51|main.main.func1:43|main.main:50|runtime.main:271	1
            runtime.releasem:588|runtime.traceRelease:238|runtime.execute:3129|runtime.schedule:3918|runtime.goschedImpl:4065|runtime.gosched_m:4070|runtime.mcall:192	1
            aeshashbody:597|runtime.(*traceMap).put:62|runtime.(*traceStackTable).put:95|runtime.traceStack:79|runtime.traceLocker.stack:166|runtime.traceLocker.GoStop:425|runtime.traceLocker.GoSched:415|runtime.goschedImpl:4051|runtime.gosched_m:4070|runtime.mcall:192	1
        trace_test.go:570: found bad trace; dumping to test log...
        trace_test.go:579: text trace too large to dump (7190038 bytes)

watchflakes

@dmitshur dmitshur added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Jan 22, 2024
@dmitshur dmitshur modified the milestones: Backlog, Go1.23 Jan 22, 2024
@gopherbot
Copy link
Author

Change https://go.dev/cl/557838 mentions this issue: [release-branch.go1.22] runtime: make a much better effort to emit CPU profile in a trace

gopherbot pushed a commit that referenced this issue Jan 23, 2024
…U profile in a trace

Currently the new execution tracer's handling of CPU profile samples is
very best-effort. The same CPU profile buffer is used across
generations, leading to a high probability that CPU samples will bleed
across generations. Also, because the CPU profile buffer (not the trace
buffer the samples get written into) isn't guaranteed to be flushed when
we close out a generation, nor when tracing stops. This has led to test
failures, but can more generally just lead to lost samples.

In general, lost samples are considered OK. The CPU profile buffer is
only read from every 100 ms, so if it fills up too much before then, old
samples will get overwritten. The tests already account for this, and in
that sense the CPU profile samples are already best-effort. But with
actual CPU profiles, this is really the only condition under which
samples are dropped.

This CL aims to align CPU profiles better with traces by eliminating
all best-effort parts of the implementation aside from the possibility
of dropped samples from a full buffer.

To achieve this, this CL adds a second CPU profile buffer and has the
SIGPROF handler pick which CPU profile buffer to use based on the
generation, much like every other part of the tracer. The SIGPROF
handler then reads the trace generation, but not before ensuring it
can't change: it grabs its own thread's trace seqlock. It's possible
that a SIGPROF signal lands while this seqlock is already held by the
thread. Luckily this is detectable and the SIGPROF handler can simply
elide the locking if this happens (the tracer will already wait until
all threads exit their seqlock critical section).

Now that there are two CPU profile buffers written to, the read side
needs to change. Instead of calling traceAcquire/traceRelease for every
single CPU sample event, the trace CPU profile reader goroutine holds
this conceptual lock over the entirety of flushing a buffer. This means
it can pick the CPU profile buffer for the current generation to flush.

With all this machinery in place, we're now at a point where all CPU
profile samples get divided into either the previous generation or the
current generation. This is good, since it means that we're able to
emit profile samples into the correct generation, avoiding surprises in
the final trace. All that's missing is to flush the CPU profile buffer
from the previous generation, once the runtime has moved on from that
generation. That is, when the generation counter updates, there may yet
be CPU profile samples sitting in the last generation's buffer. So,
traceCPUFlush now first flushes the CPU profile buffer, followed by any
trace buffers containing CPU profile samples.

The end result of all this is that no sample gets left behind unless it
gets overwritten in the CPU profile buffer in the first place. CPU
profile samples in the trace will now also get attributed to the right
generation, since the SIGPROF handler now participates in the tracer's
synchronization across trace generations.

Fixes #55317.

Change-Id: I47719fad164c544eef0bb12f99c8f3c15358e344
Reviewed-on: https://go-review.googlesource.com/c/go/+/555495
Auto-Submit: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
(cherry picked from commit f5e475e)
Reviewed-on: https://go-review.googlesource.com/c/go/+/557838
Reviewed-by: Cherry Mui <cherryyz@google.com>
@gopherbot gopherbot reopened this Feb 6, 2024
@gopherbot
Copy link
Author

Found new dashboard test flakes for:

#!watchflakes
post <- (pkg == "runtime/trace" && test == "TestTraceCPUProfile") || log ~ `--- FAIL: TestTraceCPUProfile`
2024-02-02 16:26 openbsd-386-72 go@2fdad8af runtime/trace.TestTraceCPUProfile (log)
fatal error: runtime: out of memory

runtime stack:
runtime.throw({0x81fbb69, 0x16})
	/tmp/workdir/go/src/runtime/panic.go:1077 +0x4d fp=0x493334cc sp=0x493334b8 pc=0x80826ed
runtime.sysMapOS(0x97c00000, 0x400000)
	/tmp/workdir/go/src/runtime/mem_bsd.go:81 +0x144 fp=0x493334f8 sp=0x493334cc pc=0x805e894
runtime.sysMap(0x97c00000, 0x400000, 0x8340958)
	/tmp/workdir/go/src/runtime/mem.go:155 +0x3f fp=0x49333508 sp=0x493334f8 pc=0x805e66f
runtime.(*mheap).grow(0x8333640, 0x42)
...
	/tmp/workdir/go/src/internal/trace/parser.go:269 +0x156f fp=0x5a841b90 sp=0x5a841a5c pc=0x816434f
internal/trace.parse({0x82348e0, 0x5a886588}, {0x0, 0x0})
	/tmp/workdir/go/src/internal/trace/parser.go:104 +0x27 fp=0x5a841bdc sp=0x5a841b90 pc=0x8162b27
internal/trace.Parse({0x82348e0, 0x5a886588}, {0x0, 0x0})
	/tmp/workdir/go/src/internal/trace/parser.go:91 +0x37 fp=0x5a841c0c sp=0x5a841bdc pc=0x8162a07
runtime/trace_test.parseTrace(0x5a8b03c0, {0x82348e0, 0x5a886588})
	/tmp/workdir/go/src/runtime/trace/trace_test.go:146 +0x47 fp=0x5a841d48 sp=0x5a841c0c pc=0x81ade97
runtime/trace_test.TestTraceCPUProfile(0x5a8b03c0)
	/tmp/workdir/go/src/runtime/trace/trace_test.go:676 +0x473 fp=0x5a841f98 sp=0x5a841d48 pc=0x81b10a3
testing.tRunner(0x5a8b03c0, 0x82060b8)

watchflakes

ezz-no pushed a commit to ezz-no/go-ezzno that referenced this issue Feb 18, 2024
Currently the new execution tracer's handling of CPU profile samples is
very best-effort. The same CPU profile buffer is used across
generations, leading to a high probability that CPU samples will bleed
across generations. Also, because the CPU profile buffer (not the trace
buffer the samples get written into) isn't guaranteed to be flushed when
we close out a generation, nor when tracing stops. This has led to test
failures, but can more generally just lead to lost samples.

In general, lost samples are considered OK. The CPU profile buffer is
only read from every 100 ms, so if it fills up too much before then, old
samples will get overwritten. The tests already account for this, and in
that sense the CPU profile samples are already best-effort. But with
actual CPU profiles, this is really the only condition under which
samples are dropped.

This CL aims to align CPU profiles better with traces by eliminating
all best-effort parts of the implementation aside from the possibility
of dropped samples from a full buffer.

To achieve this, this CL adds a second CPU profile buffer and has the
SIGPROF handler pick which CPU profile buffer to use based on the
generation, much like every other part of the tracer. The SIGPROF
handler then reads the trace generation, but not before ensuring it
can't change: it grabs its own thread's trace seqlock. It's possible
that a SIGPROF signal lands while this seqlock is already held by the
thread. Luckily this is detectable and the SIGPROF handler can simply
elide the locking if this happens (the tracer will already wait until
all threads exit their seqlock critical section).

Now that there are two CPU profile buffers written to, the read side
needs to change. Instead of calling traceAcquire/traceRelease for every
single CPU sample event, the trace CPU profile reader goroutine holds
this conceptual lock over the entirety of flushing a buffer. This means
it can pick the CPU profile buffer for the current generation to flush.

With all this machinery in place, we're now at a point where all CPU
profile samples get divided into either the previous generation or the
current generation. This is good, since it means that we're able to
emit profile samples into the correct generation, avoiding surprises in
the final trace. All that's missing is to flush the CPU profile buffer
from the previous generation, once the runtime has moved on from that
generation. That is, when the generation counter updates, there may yet
be CPU profile samples sitting in the last generation's buffer. So,
traceCPUFlush now first flushes the CPU profile buffer, followed by any
trace buffers containing CPU profile samples.

The end result of all this is that no sample gets left behind unless it
gets overwritten in the CPU profile buffer in the first place. CPU
profile samples in the trace will now also get attributed to the right
generation, since the SIGPROF handler now participates in the tracer's
synchronization across trace generations.

Fixes golang#55317.

Change-Id: I47719fad164c544eef0bb12f99c8f3c15358e344
Reviewed-on: https://go-review.googlesource.com/c/go/+/555495
Auto-Submit: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsFix The path to resolution is known, but the work has not been done.
Projects
Status: In Progress
Status: Done
Development

No branches or pull requests

7 participants