
runtime/pprof: TestCPUProfileMultithreadMagnitude flaky due to Linux kernel bug #49065

Open

prattmic opened this issue Oct 19, 2021 · 43 comments
@prattmic prattmic added this to the Go1.18 milestone Oct 19, 2021
@prattmic (Member, Author) commented Oct 19, 2021

Here are the other tests I see failing in these runs, with a note on whether each was failing prior to the https://golang.org/cl/342054 patch stack (looking back to April):

  • TestCPUProfileInlining: 1 failure in August, and 1 in May, but for a different reason.
  • TestCPUProfileRecursion: no earlier failures.
  • TestMathBigDivide: no earlier failures.
  • TestMorestack: 4 failures in June on Darwin.
  • TestCPUProfileLabel: no earlier failures.
  • TestTimeVDSO: Added on Sept 22. Several earlier failures on Android, later marked as flaky on Android.

IMO, this is pretty strong evidence that these CLs actually broke something w.r.t. profiling. I wouldn't be surprised if RISC-V or WSL had fundamental platform issues with profiling, but 386 and amd64 should certainly be OK.

@rhysh (Contributor) commented Oct 19, 2021

Thank you for the links. Here's what I've found so far in the 386 and amd64 (not WSL, not RISC-V) logs.

The failing test results for TestCPUProfileMultithreadMagnitude on those platforms all look like "got full profiling info, but only from a subset of threads", followed by some number of tests that get a very small number of profile samples—often 0, sometimes 1, sometimes dozens—when the tests expect several hundred. The samples that do appear come from the runtime, in the scheduler or scavenger, in functions that might run on otherwise-idle threads.


386


2021-10-19T07:45:46-ee92daa/linux-386-longtest

HOSTNAME=buildlet-linux-stretch-morecpu-rn873c60c, which per https://farmer.golang.org/builders is "Debian Stretch, but on e2-highcpu-16", which per https://cloud.google.com/compute/docs/general-purpose-machines#e2-high-cpu is 16 vCPUs and 16 GB memory.

    pprof_test.go:541: runtime/pprof.cpuHog1: 498
    pprof_test.go:541: runtime/pprof.cpuHog3: 3471
    --- FAIL: TestCPUProfileMultithreadMagnitude/parallel_execution_OS_vs_pprof (0.00s)
        pprof_test.go:179: compare 1m19.558967s vs 34.71s
        pprof_test.go:191: CPU usage reports are too different (limit -10.0%, got -56.4%)

The single-goroutine portion of TestCPUProfileMultithreadMagnitude saw 4.98 seconds of samples (in a 5-second test). The parallel portion (using GOMAXPROCS goroutines) also ran for 5 seconds, during which the OS reported 79.56 seconds of user+system time in rusage but the runtime only saw 34.71 seconds worth of samples. That looks like "the cliff at 2.5 that we got with setitimer is now present at 7 vCPU, sometimes".

The failure in that test run of TestCPUProfileMultithreadMagnitude is followed by failures in TestCPUProfileInlining, TestCPUProfileRecursion, TestMathBigDivide, TestMorestack, TestCPUProfileLabel, and TestTimeVDSO. All of those have "total 0 CPU profile samples collected" except for TestCPUProfileInlining which got a single sample (below the minimum of 125 and target of 500).


2021-10-18T21:21:54-6c0daa7/linux-386-longtest

HOSTNAME=buildlet-linux-stretch-morecpu-rnce27d47, so GOMAXPROCS=16 again.

    pprof_test.go:541: runtime/pprof.cpuHog1: 499
    pprof_test.go:541: runtime/pprof.cpuHog3: 2994
    --- FAIL: TestCPUProfileMultithreadMagnitude/parallel_execution_OS_vs_pprof (0.00s)
        pprof_test.go:179: compare 1m19.983449s vs 29.94s
        pprof_test.go:191: CPU usage reports are too different (limit -10.0%, got -62.6%)

The single-goroutine portion saw 4.99 seconds of samples, and the parallel portion saw 29.94 seconds of samples. This looks like "the cliff at 2.5 that we got with setitimer is now present at 6 vCPU, sometimes".

The failure in that test run of TestCPUProfileMultithreadMagnitude is followed by failures in TestCPUProfileRecursion, TestMathBigDivide, TestMorestack, TestCPUProfileLabel, and TestTimeVDSO. Again all of those failures were with "total 0 CPU profile samples collected" except for TestCPUProfileRecursion and TestMorestack which got a single sample each (vs min of 125 and target of 500).


amd64


2021-10-18T19:46:27-394a1ad/linux-amd64-longtest

No environment header at the top of this log.

    pprof_test.go:541: runtime/pprof.cpuHog1: 500
    pprof_test.go:541: runtime/pprof.cpuHog3: 4987
    --- FAIL: TestCPUProfileMultithreadMagnitude/parallel_execution_OS_vs_pprof (0.00s)
        pprof_test.go:179: compare 1m19.876446s vs 49.87s
        pprof_test.go:191: CPU usage reports are too different (limit -10.0%, got -37.6%)

The single-goroutine portion saw 5.00 seconds of samples, and the parallel portion saw 49.87 seconds of samples. This looks like "the cliff at 2.5 that we got with setitimer is now present at 10 vCPU, sometimes".

Followed by failure of TestMorestack (10 samples, 0 where expected), TestCPUProfileLabel (22 samples, 0 where expected), and TestTimeVDSO (9 samples, 0 where expected).


2021-09-27T18:58:36-5b90958/linux-amd64-longtest

HOSTNAME=buildlet-linux-stretch-morecpu-rn83cdb7c, so GOMAXPROCS=16.

    pprof_test.go:541: runtime/pprof.cpuHog1: 502
    pprof_test.go:541: runtime/pprof.cpuHog3: 5050
    --- FAIL: TestCPUProfileMultithreadMagnitude/parallel_execution_OS_vs_pprof (0.00s)
        pprof_test.go:179: compare 1m19.992473s vs 50.5s
        pprof_test.go:191: CPU usage reports are too different (limit -10.0%, got -36.9%)

The single-goroutine portion saw 5.02 seconds of samples (ran a little long!), and the parallel portion saw 50.50 seconds of samples (also ran a little long, but on fewer than GOMAXPROCS threads?). This could look like "the cliff at 2.5 that we got with setitimer is now present at 10 vCPU, sometimes" ... at least, 50.50 seconds is only slightly above 10 x 5.00 seconds.

There were no other failures in this run.


2021-09-27T18:58:29-8cfd8c3/linux-amd64-longtest

HOSTNAME=buildlet-linux-stretch-morecpu-rn6ff4f0f, so GOMAXPROCS=16.

    pprof_test.go:541: runtime/pprof.cpuHog1: 500
    pprof_test.go:541: runtime/pprof.cpuHog3: 5083
    --- FAIL: TestCPUProfileMultithreadMagnitude/parallel_execution_OS_vs_pprof (0.00s)
        pprof_test.go:179: compare 1m19.99259s vs 50.83s
        pprof_test.go:191: CPU usage reports are too different (limit -10.0%, got -36.5%)

The single-goroutine portion saw 5.00 seconds of samples, and the parallel portion saw 50.83 seconds of samples (ran a little long, but on fewer than GOMAXPROCS threads?). This could look like "the cliff at 2.5 that we got with setitimer is now present at 10 vCPU, sometimes" ... at least, 50.83 seconds is only slightly above 10 x 5.00 seconds.

There were no other failures in this run.

@gopherbot commented Oct 21, 2021

Change https://golang.org/cl/357650 mentions this issue: runtime: debug pprof test failures

@rhysh (Contributor) commented Oct 21, 2021

These failures look like three problems:

  1. The failures on linux/amd64 for 8cfd8c3 and 5b90958 (both on 2021-09-27) are because I put the test too early in the CL chain. We discussed at #35057 (comment). The first WSL failures are on those two commits as well.
  2. Failures in TestCPUProfileMultithreadMagnitude/serial_execution_OS_vs_pprof. These appear on linux-riscv64, linux-amd64-wsl, and linux-ppc64. The first of these was 2021-09-30T16:43:32-9112d29, and they've occurred regularly since.
  3. Failures in TestCPUProfileMultithreadMagnitude/parallel_execution_OS_vs_pprof, coupled with failures in other CPU-profiling-related test cases. These appear on linux-amd64-longtest and linux-386-longtest. The first of these was 2021-10-18T19:46:27-394a1ad.

Problem 1 is a couple bad commits while the CL stack was landing. I'll filter those out of the results.

$ grep -r -l 'FAIL\truntime/pprof' . | egrep '2021-(09|10)' | grep '/linux-' | egrep '(8cfd8c3|5b90958)' | xargs grep -H 'FAIL:' | grep -v TestBlockProfile | sed -n -e 's,:[ ]*--- FAIL:,,p' | awk '{print $2,$1}' | sort | column -t
TestCPUProfileMultithreadMagnitude                                 ./2021-09-27T18:58:29-8cfd8c3/linux-amd64-longtest
TestCPUProfileMultithreadMagnitude                                 ./2021-09-27T18:58:29-8cfd8c3/linux-amd64-wsl
TestCPUProfileMultithreadMagnitude                                 ./2021-09-27T18:58:36-5b90958/linux-amd64-longtest
TestCPUProfileMultithreadMagnitude                                 ./2021-09-27T18:58:36-5b90958/linux-amd64-wsl
TestCPUProfileMultithreadMagnitude/parallel_execution_OS_vs_pprof  ./2021-09-27T18:58:29-8cfd8c3/linux-amd64-longtest
TestCPUProfileMultithreadMagnitude/parallel_execution_OS_vs_pprof  ./2021-09-27T18:58:29-8cfd8c3/linux-amd64-wsl
TestCPUProfileMultithreadMagnitude/parallel_execution_OS_vs_pprof  ./2021-09-27T18:58:36-5b90958/linux-amd64-longtest
TestCPUProfileMultithreadMagnitude/parallel_execution_OS_vs_pprof  ./2021-09-27T18:58:36-5b90958/linux-amd64-wsl

Problem 2 often shows garbage-collection-related functions in the profile. If the GC is running during the test, it would increase the CPU time for the whole process as reported by getrusage(2) without increasing the amount of time in the cpuHog functions that the test examines. I plan to update the test to disable the GC while it runs.

$ grep -r -l 'FAIL\truntime/pprof' . | egrep '2021-(09|10)' | grep '/linux-' | egrep -v '(8cfd8c3|5b90958)' | xargs grep -H 'FAIL:' | grep -v TestBlockProfile | sed -n -e 's,:[ ]*--- FAIL:,,p' | awk '{print $2,$1}' | sort | column -t | grep -v longtest
TestCPUProfileMultithreadMagnitude                                 ./2021-09-30T16:43:32-9112d29/linux-riscv64-unmatched
TestCPUProfileMultithreadMagnitude                                 ./2021-10-01T04:30:22-822f349/linux-amd64-wsl
TestCPUProfileMultithreadMagnitude                                 ./2021-10-01T17:27:24-952df98/linux-riscv64-jsing
TestCPUProfileMultithreadMagnitude                                 ./2021-10-01T17:27:29-3357624/linux-riscv64-unmatched
TestCPUProfileMultithreadMagnitude                                 ./2021-10-02T13:52:59-07a7c6f/linux-riscv64-unmatched
TestCPUProfileMultithreadMagnitude                                 ./2021-10-05T16:06:49-060cd73/linux-riscv64-unmatched
TestCPUProfileMultithreadMagnitude                                 ./2021-10-06T16:11:20-aecf4b1/linux-riscv64-unmatched
TestCPUProfileMultithreadMagnitude                                 ./2021-10-08T14:08:12-5b9206f/linux-riscv64-unmatched
TestCPUProfileMultithreadMagnitude                                 ./2021-10-08T17:32:31-d7ba1d2/linux-riscv64-unmatched
TestCPUProfileMultithreadMagnitude                                 ./2021-10-11T12:53:11-34f7b1f/linux-riscv64-unmatched
TestCPUProfileMultithreadMagnitude                                 ./2021-10-12T20:20:41-d032b2b/linux-amd64-wsl
TestCPUProfileMultithreadMagnitude                                 ./2021-10-15T20:29:48-85cbdda/linux-riscv64-jsing
TestCPUProfileMultithreadMagnitude                                 ./2021-10-19T18:56:08-07e5527/linux-riscv64-unleashed
TestCPUProfileMultithreadMagnitude                                 ./2021-10-19T21:05:11-1b24c9e/linux-riscv64-unleashed
TestCPUProfileMultithreadMagnitude                                 ./2021-10-19T22:47:48-2be5b84/linux-riscv64-unleashed
TestCPUProfileMultithreadMagnitude                                 ./2021-10-20T05:28:45-b091189/linux-riscv64-unleashed
TestCPUProfileMultithreadMagnitude                                 ./2021-10-20T20:58:15-73652af/linux-ppc64-buildlet
TestCPUProfileMultithreadMagnitude/serial_execution_OS_vs_pprof    ./2021-09-30T16:43:32-9112d29/linux-riscv64-unmatched
TestCPUProfileMultithreadMagnitude/serial_execution_OS_vs_pprof    ./2021-10-01T04:30:22-822f349/linux-amd64-wsl
TestCPUProfileMultithreadMagnitude/serial_execution_OS_vs_pprof    ./2021-10-01T17:27:24-952df98/linux-riscv64-jsing
TestCPUProfileMultithreadMagnitude/serial_execution_OS_vs_pprof    ./2021-10-01T17:27:29-3357624/linux-riscv64-unmatched
TestCPUProfileMultithreadMagnitude/serial_execution_OS_vs_pprof    ./2021-10-02T13:52:59-07a7c6f/linux-riscv64-unmatched
TestCPUProfileMultithreadMagnitude/serial_execution_OS_vs_pprof    ./2021-10-05T16:06:49-060cd73/linux-riscv64-unmatched
TestCPUProfileMultithreadMagnitude/serial_execution_OS_vs_pprof    ./2021-10-06T16:11:20-aecf4b1/linux-riscv64-unmatched
TestCPUProfileMultithreadMagnitude/serial_execution_OS_vs_pprof    ./2021-10-08T14:08:12-5b9206f/linux-riscv64-unmatched
TestCPUProfileMultithreadMagnitude/serial_execution_OS_vs_pprof    ./2021-10-08T17:32:31-d7ba1d2/linux-riscv64-unmatched
TestCPUProfileMultithreadMagnitude/serial_execution_OS_vs_pprof    ./2021-10-11T12:53:11-34f7b1f/linux-riscv64-unmatched
TestCPUProfileMultithreadMagnitude/serial_execution_OS_vs_pprof    ./2021-10-12T20:20:41-d032b2b/linux-amd64-wsl
TestCPUProfileMultithreadMagnitude/serial_execution_OS_vs_pprof    ./2021-10-15T20:29:48-85cbdda/linux-riscv64-jsing
TestCPUProfileMultithreadMagnitude/serial_execution_OS_vs_pprof    ./2021-10-19T18:56:08-07e5527/linux-riscv64-unleashed
TestCPUProfileMultithreadMagnitude/serial_execution_OS_vs_pprof    ./2021-10-19T21:05:11-1b24c9e/linux-riscv64-unleashed
TestCPUProfileMultithreadMagnitude/serial_execution_OS_vs_pprof    ./2021-10-19T22:47:48-2be5b84/linux-riscv64-unleashed
TestCPUProfileMultithreadMagnitude/serial_execution_OS_vs_pprof    ./2021-10-20T05:28:45-b091189/linux-riscv64-unleashed
TestCPUProfileMultithreadMagnitude/serial_execution_OS_vs_pprof    ./2021-10-20T20:58:15-73652af/linux-ppc64-buildlet

Problem 3 is new in the last few days. When it shows up, it affects the single-goroutine CPU profiling tests too. It's only appeared on the longtest builders. Those have 16 hyperthreads / GOMAXPROCS=16.

The failures in single-goroutine tests take the form of zero or near-zero samples in the desired functions, and the failures in TestCPUProfileMultithreadMagnitude/parallel_execution_OS_vs_pprof take the form of the CPU profile reporting only 6 to 14 threads' worth of CPU time. (The test expects 16 threads' worth of reporting, and its 10% error margin means a report covering 15 threads' worth of work would still count as a success.)

It's interesting that the amount of CPU time reported in the profile is always very close to an integer number of threads (each running for 5 seconds). It's also interesting that the smallest report is 6 out of the 16 threads. I disabled the new timer_create-based profiler and ran the test on the linux-386-longtest builder, and found that setitimer is able to deliver enough SIGPROF events to the process to cover the work of 10 threads (https://storage.googleapis.com/go-build-log/d3ff14b2/linux-386-longtest_a3d8929c.log). My working theory on this failure mode is that on some fraction of the process's threads, the kernel delivers the signals from setitimer first, so the deliveries from timer_create are crowded out up to setitimer's maximum signal delivery rate.

I also find it suspicious that this failure mode is new in the last few days, but none of the nearby commits look related and I also haven't been able to reproduce the problem myself.

$ grep -r -l 'FAIL\truntime/pprof' . | egrep '2021-(09|10)' | grep '/linux-' | egrep -v '(8cfd8c3|5b90958)' | xargs grep -H 'FAIL:' | grep -v TestBlockProfile | sed -n -e 's,:[ ]*--- FAIL:,,p' | awk '{print $2,$1}' | sort | column -t | grep longtest
TestCPUProfileInlining                                             ./2021-10-19T07:45:46-ee92daa/linux-386-longtest
TestCPUProfileInlining                                             ./2021-10-19T11:52:34-bde0463/linux-386-longtest
TestCPUProfileInlining                                             ./2021-10-19T18:42:57-7999fd4/linux-386-longtest
TestCPUProfileInlining                                             ./2021-10-19T22:47:48-2be5b84/linux-386-longtest
TestCPUProfileInlining                                             ./2021-10-20T07:49:28-d2ec73c/linux-amd64-longtest
TestCPUProfileInlining                                             ./2021-10-20T19:29:12-4320949/linux-386-longtest
TestCPUProfileInlining                                             ./2021-10-20T20:58:15-73652af/linux-386-longtest
TestCPUProfileLabel                                                ./2021-10-18T19:46:27-394a1ad/linux-amd64-longtest
TestCPUProfileLabel                                                ./2021-10-18T21:21:54-6c0daa7/linux-386-longtest
TestCPUProfileLabel                                                ./2021-10-19T07:45:46-ee92daa/linux-386-longtest
TestCPUProfileLabel                                                ./2021-10-19T11:52:34-bde0463/linux-386-longtest
TestCPUProfileLabel                                                ./2021-10-19T22:47:48-2be5b84/linux-386-longtest
TestCPUProfileLabel                                                ./2021-10-20T00:05:22-4cc6a91/linux-386-longtest
TestCPUProfileLabel                                                ./2021-10-20T20:39:19-d419a80/linux-386-longtest
TestCPUProfileLabel                                                ./2021-10-20T20:58:15-73652af/linux-386-longtest
TestCPUProfileMultithreadMagnitude                                 ./2021-10-18T19:46:27-394a1ad/linux-amd64-longtest
TestCPUProfileMultithreadMagnitude                                 ./2021-10-18T21:21:54-6c0daa7/linux-386-longtest
TestCPUProfileMultithreadMagnitude                                 ./2021-10-19T07:45:46-ee92daa/linux-386-longtest
TestCPUProfileMultithreadMagnitude                                 ./2021-10-19T11:52:34-bde0463/linux-386-longtest
TestCPUProfileMultithreadMagnitude                                 ./2021-10-19T16:19:27-f92a358/linux-386-longtest
TestCPUProfileMultithreadMagnitude                                 ./2021-10-19T17:01:35-254c497/linux-386-longtest
TestCPUProfileMultithreadMagnitude                                 ./2021-10-19T18:42:57-7999fd4/linux-386-longtest
TestCPUProfileMultithreadMagnitude                                 ./2021-10-19T22:47:48-2be5b84/linux-386-longtest
TestCPUProfileMultithreadMagnitude                                 ./2021-10-20T00:05:22-4cc6a91/linux-386-longtest
TestCPUProfileMultithreadMagnitude                                 ./2021-10-20T07:49:28-d2ec73c/linux-amd64-longtest
TestCPUProfileMultithreadMagnitude                                 ./2021-10-20T19:29:12-4320949/linux-386-longtest
TestCPUProfileMultithreadMagnitude                                 ./2021-10-20T20:39:19-d419a80/linux-386-longtest
TestCPUProfileMultithreadMagnitude                                 ./2021-10-20T20:39:25-e904928/linux-386-longtest
TestCPUProfileMultithreadMagnitude                                 ./2021-10-20T20:58:15-73652af/linux-386-longtest
TestCPUProfileMultithreadMagnitude                                 ./2021-10-21T15:42:41-8b0bea9/linux-386-longtest
TestCPUProfileMultithreadMagnitude/parallel_execution_OS_vs_pprof  ./2021-10-18T19:46:27-394a1ad/linux-amd64-longtest
TestCPUProfileMultithreadMagnitude/parallel_execution_OS_vs_pprof  ./2021-10-18T21:21:54-6c0daa7/linux-386-longtest
TestCPUProfileMultithreadMagnitude/parallel_execution_OS_vs_pprof  ./2021-10-19T07:45:46-ee92daa/linux-386-longtest
TestCPUProfileMultithreadMagnitude/parallel_execution_OS_vs_pprof  ./2021-10-19T11:52:34-bde0463/linux-386-longtest
TestCPUProfileMultithreadMagnitude/parallel_execution_OS_vs_pprof  ./2021-10-19T16:19:27-f92a358/linux-386-longtest
TestCPUProfileMultithreadMagnitude/parallel_execution_OS_vs_pprof  ./2021-10-19T17:01:35-254c497/linux-386-longtest
TestCPUProfileMultithreadMagnitude/parallel_execution_OS_vs_pprof  ./2021-10-19T18:42:57-7999fd4/linux-386-longtest
TestCPUProfileMultithreadMagnitude/parallel_execution_OS_vs_pprof  ./2021-10-19T22:47:48-2be5b84/linux-386-longtest
TestCPUProfileMultithreadMagnitude/parallel_execution_OS_vs_pprof  ./2021-10-20T00:05:22-4cc6a91/linux-386-longtest
TestCPUProfileMultithreadMagnitude/parallel_execution_OS_vs_pprof  ./2021-10-20T07:49:28-d2ec73c/linux-amd64-longtest
TestCPUProfileMultithreadMagnitude/parallel_execution_OS_vs_pprof  ./2021-10-20T19:29:12-4320949/linux-386-longtest
TestCPUProfileMultithreadMagnitude/parallel_execution_OS_vs_pprof  ./2021-10-20T20:39:19-d419a80/linux-386-longtest
TestCPUProfileMultithreadMagnitude/parallel_execution_OS_vs_pprof  ./2021-10-20T20:39:25-e904928/linux-386-longtest
TestCPUProfileMultithreadMagnitude/parallel_execution_OS_vs_pprof  ./2021-10-20T20:58:15-73652af/linux-386-longtest
TestCPUProfileMultithreadMagnitude/parallel_execution_OS_vs_pprof  ./2021-10-21T15:42:41-8b0bea9/linux-386-longtest
TestCPUProfileRecursion                                            ./2021-10-18T21:21:54-6c0daa7/linux-386-longtest
TestCPUProfileRecursion                                            ./2021-10-19T07:45:46-ee92daa/linux-386-longtest
TestCPUProfileRecursion                                            ./2021-10-19T11:52:34-bde0463/linux-386-longtest
TestCPUProfileRecursion                                            ./2021-10-19T22:47:48-2be5b84/linux-386-longtest
TestCPUProfileRecursion                                            ./2021-10-20T00:05:22-4cc6a91/linux-386-longtest
TestCPUProfileRecursion                                            ./2021-10-20T20:58:15-73652af/linux-386-longtest
TestMathBigDivide                                                  ./2021-10-18T21:21:54-6c0daa7/linux-386-longtest
TestMathBigDivide                                                  ./2021-10-19T07:45:46-ee92daa/linux-386-longtest
TestMathBigDivide                                                  ./2021-10-19T16:19:27-f92a358/linux-386-longtest
TestMathBigDivide                                                  ./2021-10-20T20:58:15-73652af/linux-386-longtest
TestMorestack                                                      ./2021-10-18T19:46:27-394a1ad/linux-amd64-longtest
TestMorestack                                                      ./2021-10-18T21:21:54-6c0daa7/linux-386-longtest
TestMorestack                                                      ./2021-10-19T07:45:46-ee92daa/linux-386-longtest
TestMorestack                                                      ./2021-10-19T11:52:34-bde0463/linux-386-longtest
TestMorestack                                                      ./2021-10-20T20:39:19-d419a80/linux-386-longtest
TestTimeVDSO                                                       ./2021-10-18T19:46:27-394a1ad/linux-amd64-longtest
TestTimeVDSO                                                       ./2021-10-18T21:21:54-6c0daa7/linux-386-longtest
TestTimeVDSO                                                       ./2021-10-19T07:45:46-ee92daa/linux-386-longtest
TestTimeVDSO                                                       ./2021-10-19T11:52:34-bde0463/linux-386-longtest
TestTimeVDSO                                                       ./2021-10-19T16:19:27-f92a358/linux-386-longtest
TestTimeVDSO                                                       ./2021-10-20T00:05:22-4cc6a91/linux-386-longtest
TestTimeVDSO                                                       ./2021-10-21T15:42:41-8b0bea9/linux-386-longtest
$ find . -name linux-386-longtest | egrep -v '(8cfd8c3|5b90958)' | egrep '2021-(09|10)' | sort | xargs egrep 'pprof_test.go.*compare'
./2021-10-18T21:21:54-6c0daa7/linux-386-longtest:        pprof_test.go:179: compare 1m19.983449s vs 29.94s
./2021-10-19T07:45:46-ee92daa/linux-386-longtest:        pprof_test.go:179: compare 1m19.558967s vs 34.71s
./2021-10-19T11:52:34-bde0463/linux-386-longtest:        pprof_test.go:179: compare 1m19.932465s vs 34.89s
./2021-10-19T16:19:27-f92a358/linux-386-longtest:        pprof_test.go:179: compare 1m19.987855s vs 59.95s
./2021-10-19T17:01:35-254c497/linux-386-longtest:        pprof_test.go:179: compare 1m19.980093s vs 1m4.9s
./2021-10-19T18:42:57-7999fd4/linux-386-longtest:        pprof_test.go:179: compare 1m19.950427s vs 1m4.86s
./2021-10-19T22:47:48-2be5b84/linux-386-longtest:        pprof_test.go:179: compare 1m19.920228s vs 1m4.86s
./2021-10-20T00:05:22-4cc6a91/linux-386-longtest:        pprof_test.go:179: compare 1m19.976284s vs 54.89s
./2021-10-20T19:29:12-4320949/linux-386-longtest:        pprof_test.go:179: compare 1m19.86348s vs 59.83s
./2021-10-20T20:39:19-d419a80/linux-386-longtest:        pprof_test.go:179: compare 1m19.975459s vs 1m9.87s
./2021-10-20T20:39:25-e904928/linux-386-longtest:        pprof_test.go:179: compare 1m18.992738s vs 44.33s
./2021-10-20T20:58:15-73652af/linux-386-longtest:        pprof_test.go:179: compare 1m19.982008s vs 1m4.89s
./2021-10-21T15:42:41-8b0bea9/linux-386-longtest:        pprof_test.go:179: compare 1m19.972142s vs 1m4.91s

@rhysh (Contributor) commented Oct 21, 2021

I got an interesting trybot failure on https://go-review.googlesource.com/c/go/+/357650/8: https://storage.googleapis.com/go-build-log/c3d0e67e/linux-386-longtest_222a0fb6.log

The result in TestCPUProfileMultithreadMagnitude/parallel_execution_OS_vs_pprof shows that the profile has samples for 1m14.77s, versus the 1m19.882056s of time reported by getrusage(2). (That would usually be within the acceptable bounds for the long-form test, but I've changed it to always fail and dump out the results for inspection.) That test ran with GOMAXPROCS=16, and the number of samples lines up with only 15 of those 16 threads being profiled. The labels in the profile confirm: there are no samples from worker:parallel-7. The parallel portion of the test (for the moment, in that debugging CL) uses runtime.LockOSThread.

It would be more interesting if the test run showed more than a single thread not reporting in at all, but this is still pretty good.

$ pbpaste | grep worker | awk '{print $2,$NF,$1}' | sort -k 2
0x8171ffc {worker:parallel-0} 109:
0x80ab580 {worker:parallel-0} 1:
0x8171fff {worker:parallel-0} 24:
0x8172002 {worker:parallel-0} 365:
0x8171ffc {worker:parallel-10} 122:
0x8171fff {worker:parallel-10} 16:
0x8171fea {worker:parallel-10} 2:
0x8172002 {worker:parallel-10} 360:
0x8171ffc {worker:parallel-11} 123:
0x8171fff {worker:parallel-11} 29:
0x8172002 {worker:parallel-11} 346:
0x8171ffc {worker:parallel-12} 107:
0x8171fff {worker:parallel-12} 21:
0x8172002 {worker:parallel-12} 372:
0x8171ffc {worker:parallel-13} 123:
0x8171fff {worker:parallel-13} 12:
0x8171fea {worker:parallel-13} 1:
0x8172002 {worker:parallel-13} 363:
0x8171ffc {worker:parallel-14} 124:
0x80ab580 {worker:parallel-14} 1:
0x80bc65d {worker:parallel-14} 1:
0x8171fff {worker:parallel-14} 24:
0x8172002 {worker:parallel-14} 349:
0x8171ffc {worker:parallel-15} 108:
0x8171fff {worker:parallel-15} 18:
0x80ab1e9 {worker:parallel-15} 1:
0x8171fea {worker:parallel-15} 1:
0x8172002 {worker:parallel-15} 362:
0x80ab580 {worker:parallel-15} 3:
0x80ab580 {worker:parallel-15} 3:
0x8171ffc {worker:parallel-1} 110:
0x80ab580 {worker:parallel-1} 1:
0x80ab580 {worker:parallel-1} 1:
0x8171fea {worker:parallel-1} 1:
0x8171fff {worker:parallel-1} 25:
0x8172002 {worker:parallel-1} 362:
0x8171ffc {worker:parallel-2} 137:
0x8171fff {worker:parallel-2} 23:
0x8172002 {worker:parallel-2} 339:
0x8171ffc {worker:parallel-3} 133:
0x8171fff {worker:parallel-3} 21:
0x8172002 {worker:parallel-3} 344:
0x8171ffc {worker:parallel-4} 125:
0x8171fff {worker:parallel-4} 17:
0x8172002 {worker:parallel-4} 356:
0x8171ffc {worker:parallel-5} 128:
0x8171fff {worker:parallel-5} 16:
0x8172002 {worker:parallel-5} 354:
0x8171ffc {worker:parallel-6} 126:
0x8171f09 {worker:parallel-6} 1:
0x8171fff {worker:parallel-6} 27:
0x8172002 {worker:parallel-6} 344:
0x8171ffc {worker:parallel-8} 128:
0x8171fff {worker:parallel-8} 17:
0x8172002 {worker:parallel-8} 353:
0x8171ffc {worker:parallel-9} 124:
0x8171fff {worker:parallel-9} 23:
0x8172002 {worker:parallel-9} 353:
0x8171f7c {worker:serial} 118:
0x80ab580 {worker:serial} 1:
0x80ab580 {worker:serial} 1:
0x8171f7f {worker:serial} 2:
0x8171f82 {worker:serial} 377:

@rhysh (Contributor) commented Oct 23, 2021

I'd like to see the output of perf trace --no-syscalls -e 'signal:*' when the parallel portion of the test fails (as seen on linux-{386,amd64}-longtest), but I haven't been able to reproduce the failure on a system I control.

Gomote access might help, but the instructions at https://github.com/golang/go/wiki/Gomote#access-token say that new accounts are on hold.

I've built the toolchain with GOARCH=386 and GOHOSTARCH=386 on a 16-hyperthread machine (a c5.4xlarge Intel EC2 instance) running Debian Stretch and have run go test runtime/pprof -test.count=1 many times (198 so far, versus the 16 failures of linux-386-longtest across 66 recent commits), but have not seen a single failure in that package's tests on my test machine.

In the linux-386-longtest failures, the first failing test is TestCPUProfileMultithreadMagnitude. The part of that test that fails is the one that exercises GOMAXPROCS threads/goroutines in parallel. If that test fails, subsequent tests of the CPU profiler may also fail, including ones that use a single thread/goroutine. However, 1/ the two tests of the CPU profiler preceding TestCPUProfileMultithreadMagnitude (TestCPUProfile using 1 thread/goroutine and TestCPUProfileMultithreaded using 2 threads/goroutines) do not fail, and 2/ the single-threaded / serial portion (the first half) of TestCPUProfileMultithreadMagnitude also does not fail.

It looks like CPU profiling is working for the process, then the process uses lots of threads in parallel, and then CPU profiling no longer consistently works for the process. And, this is new as of October 18.

I don't know how to move forward with the "longtest" half of the problem. Do you have advice, @prattmic ?

  • Are gomote access requests being processed, and am I the sort of person (have "may-start-trybots", haven't pursued gerrit "approvers") who would get access? I see recent requests being denied.
  • Is there someone with gomote access who is able to reproduce the failure, and could share the perf-trace results (plus the test log output) from a failing run?
  • Is there something I'm missing in my repro attempts—especially something that lets me stay on EC2, where I have easy access through my employer?

My rough non-repro steps:

% ssh admin@vm-ip
Linux ip-172-31-53-15 4.9.0-16-amd64 #1 SMP Debian 4.9.272-2 (2021-07-19) x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.

$ sudo apt-get install git build-essential libc6-dev-i386
$ git clone https://go.googlesource.com/go ~/go-bisect
$ git -C ~/go-bisect worktree add --detach ~/go1.4
$ git -C ~/go1.4 checkout origin/release-branch.go1.4
$ (cd ~/go1.4/src && ./make.bash)
$ git -C ~/go-bisect checkout 6c0daa733192031eab23d09ed6515c4cd959aa92
$ (cd ~/go-bisect/src && env "GOARCH=386" "GOHOSTARCH=386" ./make.bash)
$ ~/go-bisect/bin/go test runtime/pprof -test.count=1
$ # Though in practice I copied the stress tool to the machine and am running 
$ # $ nohup ./stress.linux_amd64.exe -p 1 ~/go-bisect/bin/go test runtime/pprof -test.count=1 &

@bcmills (Member) commented Oct 25, 2021

The failure rate of runtime/pprof has become quite high on the linux-386-longtest builder since around Oct. 18.

That builder is especially important for this release cycle, because it is the only longtest builder running non-short fuzz tests with non-instrumented binaries, and it has been catching a lot of fuzz testing issues. We need to do something to get it to produce useful coverage again: perhaps rolling back the runtime/pprof changes, or adding testenv.SkipFlaky calls to the specific new tests that are failing. (CC @golang/release)

greplogs --dashboard -md -l -e FAIL\\s+runtime/pprof --since=2021-09-01

2021-10-25T16:07:11-c6e82e5/linux-amd64-longtest
2021-10-25T03:00:02-7b55457/linux-386-longtest
2021-10-25T02:59:35-f686f6a/linux-amd64-wsl
2021-10-23T20:23:35-8dbf3e9/linux-386-longtest
2021-10-23T16:18:19-16318df/linux-386-longtest
2021-10-23T16:18:19-16318df/linux-amd64-longtest
2021-10-23T16:18:15-12e4404/linux-amd64-longtest
2021-10-23T16:17:58-3cd28ba/linux-386-longtest
2021-10-22T21:12:07-76dd01f/linux-386-longtest
2021-10-21T18:20:19-f99e40a/linux-386-longtest
2021-10-21T17:31:42-392bb06/linux-386-longtest
2021-10-21T15:45:05-6ec9a1d/linux-386-longtest
2021-10-21T15:42:41-8b0bea9/linux-386-longtest
2021-10-20T20:58:15-73652af/linux-386-longtest
2021-10-20T20:58:15-73652af/linux-ppc64-buildlet
2021-10-20T20:39:25-e904928/linux-386-longtest
2021-10-20T20:39:19-d419a80/linux-386-longtest
2021-10-20T19:29:12-4320949/linux-386-longtest
2021-10-20T07:49:28-d2ec73c/linux-amd64-longtest
2021-10-20T05:28:45-b091189/linux-riscv64-unleashed
2021-10-20T00:05:22-4cc6a91/linux-386-longtest
2021-10-19T22:47:48-2be5b84/linux-386-longtest
2021-10-19T22:47:48-2be5b84/linux-riscv64-unleashed
2021-10-19T21:05:11-1b24c9e/linux-riscv64-unleashed
2021-10-19T18:56:08-07e5527/linux-riscv64-unleashed
2021-10-19T18:42:57-7999fd4/linux-386-longtest
2021-10-19T17:01:35-254c497/linux-386-longtest
2021-10-19T16:19:27-f92a358/linux-386-longtest
2021-10-19T11:52:34-bde0463/linux-386-longtest
2021-10-19T07:45:46-ee92daa/linux-386-longtest
2021-10-18T21:21:54-6c0daa7/linux-386-longtest
2021-10-18T19:46:27-394a1ad/linux-amd64-longtest
2021-10-15T20:29:48-85cbdda/linux-riscv64-jsing
2021-10-12T20:20:41-d032b2b/linux-amd64-wsl
2021-10-11T20:46:14-7023535/linux-ppc64-buildlet
2021-10-11T12:53:11-34f7b1f/linux-riscv64-unmatched
2021-10-08T17:32:31-d7ba1d2/linux-riscv64-unmatched
2021-10-08T14:08:12-5b9206f/linux-riscv64-unmatched
2021-10-08T02:09:58-e74db46/openbsd-arm64-jsing
2021-10-06T16:11:20-aecf4b1/linux-riscv64-unmatched
2021-10-05T16:06:49-060cd73/linux-riscv64-unmatched
2021-10-02T13:52:59-07a7c6f/linux-riscv64-unmatched
2021-10-02T00:44:24-afe43f1/linux-riscv64-unmatched
2021-10-02T00:44:24-afe43f1/openbsd-arm64-jsing
2021-10-01T17:27:29-3357624/linux-riscv64-unmatched
2021-10-01T17:27:24-952df98/linux-riscv64-jsing
2021-10-01T04:30:22-822f349/linux-amd64-wsl
2021-09-30T16:43:32-9112d29/linux-riscv64-unmatched
2021-09-27T21:31:14-04f7521/android-amd64-emu
2021-09-27T20:58:06-e54843f/android-arm-corellium
2021-09-27T18:58:36-5b90958/android-arm-corellium
2021-09-27T18:58:36-5b90958/linux-amd64-longtest
2021-09-27T18:58:36-5b90958/linux-amd64-wsl
2021-09-27T18:58:29-8cfd8c3/linux-amd64-longtest
2021-09-27T18:58:29-8cfd8c3/linux-amd64-wsl
2021-09-27T18:57:20-3d795ea/android-amd64-emu
2021-09-27T18:14:10-ecac351/android-arm-corellium
2021-09-27T16:16:57-d5cfba0/android-arm-corellium
2021-09-27T11:52:07-6e5dd0b/android-arm-corellium
2021-09-27T10:10:16-dac89a9/android-arm-corellium
2021-09-27T05:10:56-54079df/android-amd64-emu
2021-09-27T03:07:49-ff8a7e5/android-arm-corellium
2021-09-24T21:42:31-d60ad1e/android-amd64-emu
2021-09-24T21:38:56-1ce6fd0/android-arm-corellium
2021-09-24T19:39:49-c94543b/android-arm-corellium
2021-09-24T18:21:14-f6b5ffb/android-arm-corellium
2021-09-24T17:53:04-b00222f/android-arm-corellium
2021-09-24T17:52:54-584afc2/android-arm-corellium
2021-09-24T17:23:19-d5d3f80/android-arm-corellium
2021-09-24T14:52:47-217507e/android-amd64-emu
2021-09-24T14:52:47-217507e/android-arm-corellium
2021-09-24T01:25:24-483533d/openbsd-arm64-jsing

@gopherbot

@gopherbot gopherbot commented Oct 25, 2021

Change https://golang.org/cl/358574 mentions this issue: runtime: disable timer_create profiler on 386

@rhysh
Contributor

@rhysh rhysh commented Oct 26, 2021

I think the longtest failures are related to the Linux Kernel version in use.

From one side, the linux-386-longtest failures started abruptly on 2021-10-18, and none of the Go commits close to the first failure look like they'd be related. It looks like the linux-386-longtest builder runs as a container on GCE, does not require nested virtualization, and so runs on the latest stable version of GCE's Container-Optimized OS.

https://github.com/golang/build/blob/a72dd9f27a813c2d30db1b6d0f058f0fe2b32751/dashboard/builders.go#L1830
https://github.com/golang/build/blob/a72dd9f27a813c2d30db1b6d0f058f0fe2b32751/buildlet/gce.go#L84
https://github.com/golang/build/blob/a72dd9f27a813c2d30db1b6d0f058f0fe2b32751/buildlet/gce.go#L346

Following on to that, it looks like COS m93 with Kernel 5.10.68 became a stable release on 2021-10-18, replacing m87 with Kernel 5.4.144. (I only have a date, not a timestamp ... I haven't figured out how to list VM images without a billing-enabled account.) https://cloud.google.com/container-optimized-os/docs/release-notes/m93 https://cloud.google.com/container-optimized-os/docs/release-notes/m89

From the other side, I've seen the following on 16-thread machines I control:

Debian Stretch
Successes: 4270
Failures: 0
uname -a: Linux ip-172-31-53-15 4.9.0-16-amd64 #1 SMP Debian 4.9.272-2 (2021-07-19) x86_64 GNU/Linux

Ubuntu 21.04
Successes: 75
Failures: 1
uname -a: Linux ip-172-31-57-140 5.11.0-1007-aws #7-Ubuntu SMP Thu May 6 20:17:21 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Ubuntu 21.10
Successes: 47
Failures: 0
uname -a: Linux ip-172-31-51-72 5.13.0-1005-aws #6-Ubuntu SMP Fri Oct 8 07:48:02 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

I've only gotten one failure so far, and I don't have the perf trace output from it, but it's a start.

@golang/release , does the "kernel / COS image version might be related to these failures" idea sound plausible to you? Did the container-based builders pick up COS m93, and does the timeline of that match this set of linux-386-longtest failures? (I don't know if I'm reading and interpreting the code correctly.) Thanks.

@rhysh
Contributor

@rhysh rhysh commented Oct 26, 2021

I continue to see about a 1% failure rate on my test machines that use very new kernels.

Debian Stretch, Linux 4.9: 4492 runs, 0 failures
Ubuntu 21.04, Linux 5.11: 302 runs, 5 failures
Ubuntu 21.10, Linux 5.13: 271 runs, 3 failures

They're c5.4xlarge machines in us-west-2, running images ami-025f3bcb64ebe6e83, ami-02684f1c7f36cdd5b, ami-0d4a468c8fcc4b5f0.

I've started an additional Ubuntu 21.10 machine to collect perf trace output.

$ sudo apt-get install linux-tools-common linux-tools-aws
$ ~/go-bisect/bin/go test -c -o ~/pprof.test runtime/pprof
$ (cd /home/ubuntu/go-bisect/src/runtime/pprof/ && \
sudo nohup /home/ubuntu/stress.linux_amd64.exe -p=1 -failure='FAIL' \
bash -c 'perf trace --no-syscalls -e '\''signal:*'\'' /home/ubuntu/pprof.test -test.count=1 ; false' \
&>>/home/ubuntu/nohup.out &)

@heschi
Contributor

@heschi heschi commented Oct 26, 2021

I haven't looked at OS versions, but October 18 is also the day we switched to N2 machines in https://golang.org/cl/356630. @toothrot

@rhysh
Contributor

@rhysh rhysh commented Oct 26, 2021

I think we should not disable profiling-related features, and should instead disable the tests when running on versions of the kernel where user-space profiling is flaky. The data I've collected point to the problem:

  1. being related to a change in the kernel
  2. affecting both the new per-thread (timer_create) profiler and the previous process-wide (setitimer) profiler, and
  3. also being present in the latest stable release of Go, go1.17.2.

I've reproduced the failure with go1.17.2 and kernel 5.13.0. That predates the per-thread profiler, so also predates its many-threads test, and so has a very low failure rate. But I've seen it fail twice in the same way that it's failed on the linux-386-longtest builder since 2021-10-18: once on TestCPUProfileInlining ("too few samples; got 0, want at least 125, ideally 500") and once on TestCPUProfileLabel ("too few samples; got 3, want at least 125, ideally 500").

Back to tip at go1.18-6c0daa7331, when the per-thread profiler reports a smaller number of samples than expected in the parallel portion of the test, the data from perf record show:

  1. the right number of threads (16 on my test machines) get the right number of cpu-clock events,
  2. a subset of threads generate signals with timer_create, and each generates the right number of those signal_generate events,
  3. that same subset of threads generate signals with setitimer (at what appears to be the maximum possible rate, one every 4ms),
  4. that same subset of threads receive signals from timer_create, with each getting its correct number of signal_deliver events,
  5. the signals from setitimer end up going mostly to that same subset of threads, totaling the number of deliveries that would be right if the prior subset of threads were the only ones doing work (and with an unpredictable distribution of signal_deliver events across those threads, as I've come to expect from setitimer and its coarse process-targeted signals).

It looks like user-space CPU profiling is broken (or at least somehow changed) on recent Linux Kernel versions, and that the new per-thread profiler doesn't have much to do with that aside from adding a test that makes that more obvious. I don't have root cause yet, but I don't think that disabling or reverting the new profiler is the right way to address that.


Here's the data from one of the most extreme examples I have (with only 25 seconds of work reported for the 16 workers * 5 seconds = 80 second parallel portion of the test). It's from go1.18-6c0daa7331 and kernel 5.13.0. The six-digit numbers starting with "351" are process/thread IDs, the columns that are mostly three-digit numbers are counts of lines/events.

$ uname -a
Linux ip-172-31-62-240 5.13.0-1005-aws #6-Ubuntu SMP Fri Oct 8 07:48:02 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

$ ./go-bisect/bin/go version
go version devel go1.18-6c0daa7331 Mon Oct 18 21:21:54 2021 +0000 linux/386

$ sudo nohup ./stress.linux_amd64.exe -p=1 -failure='FAIL' bash -c 'export id="perf.$$-$RANDOM.data" ; echo "/tmp/$id" ; perf record -F 97 -e cpu-clock -e '\''signal:*'\'' -e '\''raw_syscalls:*'\'' -o "/tmp/$id" -- bash -c '\''/home/ubuntu/pprof.test -test.count=1 -test.run=TestCPUProfileMultithreadMagnitude && rm "/tmp/$id" '\'' ; false' &

$ sudo cat /tmp/go-stress-20211026T191407-1309648738
/tmp/perf.351581-542.data
--- FAIL: TestCPUProfileMultithreadMagnitude (10.14s)
    pprof_test.go:506: total 2997 CPU profile samples collected:
        364: 0x81713a2 (runtime/pprof.cpuHog0:64 runtime/pprof.cpuHog1:55) 0x81712ee (runtime/pprof.cpuHogger:41) 0x81724ba (runtime/pprof.TestCPUProfileMultithreadMagnitude.func1.1:141) 0x81829a0 (runtime/pprof.diffCPUTimeRUsage:28) 0x817337d (runtime/pprof.diffCPUTime:448) 0x817217c (runtime/pprof.TestCPUProfileMultithreadMagnitude.func1:139) 0x8173147 (runtime/pprof.testCPUProfile:416) 0x8171ce2 (runtime/pprof.TestCPUProfileMultithreadMagnitude:138) 0x8109ef2 (testing.tRunner:1377)
        
        136: 0x817139c (runtime/pprof.cpuHog0:64 runtime/pprof.cpuHog1:55) 0x81712ee (runtime/pprof.cpuHogger:41) 0x81724ba (runtime/pprof.TestCPUProfileMultithreadMagnitude.func1.1:141) 0x81829a0 (runtime/pprof.diffCPUTimeRUsage:28) 0x817337d (runtime/pprof.diffCPUTime:448) 0x817217c (runtime/pprof.TestCPUProfileMultithreadMagnitude.func1:139) 0x8173147 (runtime/pprof.testCPUProfile:416) 0x8171ce2 (runtime/pprof.TestCPUProfileMultithreadMagnitude:138) 0x8109ef2 (testing.tRunner:1377)
        
        951: 0x8171422 (runtime/pprof.cpuHog0:64 runtime/pprof.cpuHog3:83) 0x81712ee (runtime/pprof.cpuHogger:41) 0x81723a7 (runtime/pprof.TestCPUProfileMultithreadMagnitude.func1.2.1:153)
        
        2: 0x817140a (runtime/pprof.cpuHog0:60 runtime/pprof.cpuHog3:83) 0x81712ee (runtime/pprof.cpuHogger:41) 0x81723a7 (runtime/pprof.TestCPUProfileMultithreadMagnitude.func1.2.1:153)
        
        1540: 0x817141c (runtime/pprof.cpuHog0:64 runtime/pprof.cpuHog3:83) 0x81712ee (runtime/pprof.cpuHogger:41) 0x81723a7 (runtime/pprof.TestCPUProfileMultithreadMagnitude.func1.2.1:153)
        
        1: 0x80ab3d0 (runtime.asyncPreempt:7) 0x817141b (runtime/pprof.cpuHog0:62 runtime/pprof.cpuHog3:83) 0x81712ee (runtime/pprof.cpuHogger:41) 0x81723a7 (runtime/pprof.TestCPUProfileMultithreadMagnitude.func1.2.1:153)
        
        1: 0x80aab3f (runtime.duffzero:135) 0x80bc4c3 (time.Since:875) 0x817131e (runtime/pprof.cpuHogger:40) 0x81723a7 (runtime/pprof.TestCPUProfileMultithreadMagnitude.func1.2.1:153)
        
        1: 0x80ab3d0 (runtime.asyncPreempt:7) 0x8171421 (runtime/pprof.cpuHog0:64 runtime/pprof.cpuHog3:83) 0x81712ee (runtime/pprof.cpuHogger:41) 0x81723a7 (runtime/pprof.TestCPUProfileMultithreadMagnitude.func1.2.1:153)
        
        1: 0x809b588 (runtime.nanotime:20) 0x80bc4d4 (time.Since:878) 0x817131e (runtime/pprof.cpuHogger:40) 0x81723a7 (runtime/pprof.TestCPUProfileMultithreadMagnitude.func1.2.1:153)
        
    pprof_test.go:541: runtime/pprof.cpuHog1: 500
    pprof_test.go:541: runtime/pprof.cpuHog3: 2495
    --- FAIL: TestCPUProfileMultithreadMagnitude/parallel_execution_OS_vs_pprof (0.00s)
        pprof_test.go:179: compare 1m19.997571s vs 24.95s
        pprof_test.go:191: CPU usage reports are too different (limit -10.0%, got -68.8%)
FAIL
[ perf record: Woken up 23 times to write data ]
[ perf record: Captured and wrote 7.471 MB /tmp/perf.351581-542.data (76891 samples) ]


ERROR: exit status 1

$ # point 1, all 16 worker threads (cpuHog3) get cpu-clock events. Working for 5 seconds with a sample rate of 97 Hz should earn 485 events.
$ sudo perf script -i /tmp/perf.351581-542.data | grep cpu-clock | grep pprof.test | awk '{print $2,$7}' | sed -e 's/+0x[0-9a-f]*$//' | sort | uniq -c | sort -k 3 -k 1
      1 351587 [unknown]
      1 351595 [unknown]
      1 351587 __lock_text_start
      1 351588 __lock_text_start
      1 351593 __lock_text_start
      3 351591 __lock_text_start
      1 351588 __perf_event_overflow
      1 351602 __put_user_nocheck_4
      1 351587 exit_to_user_mode_loop
      1 351589 exit_to_user_mode_loop
      1 351593 exit_to_user_mode_loop
      1 351602 exit_to_user_mode_loop
      1 351587 get_signal
      1 351600 get_signal
      1 351593 memcpy_erms
      1 351588 perf_event_output_forward
      1 351587 runtime.adjustSignalStack
      1 351604 runtime.duffcopy
      1 351589 runtime.epollwait
      1 351592 runtime.nanotime1
      1 351604 runtime.nanotime1
      1 351588 runtime.tgkill
    483 351587 runtime/pprof.cpuHog1
    477 351587 runtime/pprof.cpuHog3
    481 351591 runtime/pprof.cpuHog3
    482 351589 runtime/pprof.cpuHog3
    482 351592 runtime/pprof.cpuHog3
    482 351593 runtime/pprof.cpuHog3
    482 351602 runtime/pprof.cpuHog3
    482 351604 runtime/pprof.cpuHog3
    483 351595 runtime/pprof.cpuHog3
    483 351600 runtime/pprof.cpuHog3
    484 351594 runtime/pprof.cpuHog3
    484 351603 runtime/pprof.cpuHog3
    485 351596 runtime/pprof.cpuHog3
    485 351597 runtime/pprof.cpuHog3
    485 351598 runtime/pprof.cpuHog3
    485 351599 runtime/pprof.cpuHog3
    485 351601 runtime/pprof.cpuHog3
      1 351589 signal_setup_done
      1 351592 time.Since
      1 351600 time.Time.Sub


$ # point 2, the per-thread timers generate the right number of signals (with code!=128), but only to a particular subset of the threads
$ sudo perf script -i /tmp/perf.351581-542.data | grep signal_generate | grep sig=27 | grep -v code=128 | awk '{print $2}' | sort | uniq -c | cat -n
     1	    997 351587
     2	    500 351589
     3	    500 351591
     4	    500 351592
     5	    500 351593

$ # point 3, the process-wide timer generates a bunch of signals (with code=128), but only for that same subset of threads
$ sudo perf script -i /tmp/perf.351581-542.data | grep signal_generate | grep sig=27 | grep code=128 | awk '{print $2}' | sort | uniq -c | cat -n
     1	   1748 351587
     2	   1245 351589
     3	   1247 351591
     4	   1249 351592
     5	   1245 351593

$ # point 4, the per-thread timers deliver the right number of signals, but only to that same subset of threads
$ sudo perf script -i /tmp/perf.351581-542.data | grep signal_deliver | grep sig=27 | grep -v code=128 | awk '{print $2}' | sort | uniq -c | cat -n
     1	    997 351587
     2	    500 351589
     3	    500 351591
     4	    500 351592
     5	    500 351593

$ # point 5, the process-wide timer delivers signals -- with particular limitations on both volume and fairness -- primarily to that same subset of threads.
$ sudo perf script -i /tmp/perf.351581-542.data | grep signal_deliver | grep sig=27 | grep code=128 | awk '{print $2}' | sort | uniq -c | cat -n
     1	    839 351587
     2	    156 351588
     3	    185 351589
     4	     93 351590
     5	    267 351591
     6	    831 351592
     7	    436 351593
     8	    172 351594
     9	      1 351595
    10	      1 351604

@rhysh
Contributor

@rhysh rhysh commented Oct 27, 2021

I'm making progress on bisecting the kernel. I've only semi-automated the process; when my other work is interruptible, I can check about one commit per hour. There are "roughly 6 steps" left. During the bisect, the only thing I'm changing is the version of the linux kernel (created via make deb-pkg, installed with dpkg -i). I've been copying the same statically-linked linux/386 pprof.test binary to each test machine. Some kernel versions yield about a 1% failure rate. Some yield 0 failures across many hundreds of runs, and I've kept those tests running to build more confidence in that over time. I don't know what to call the cause of the linux-386-longtest failures other than a "kernel bug".

What I hope to get out of the bisect is a better understanding of what caused the kernel bug, to inform either a/ a change to the profiling features to not trigger the kernel bug, or b/ a change to the profiling tests to detect when their failure would be "unfortunate but not surprising" and to skip them in those cases.

Here's the bisect log:

git bisect start '--first-parent'
# bad: [4d8524048a35c03ff04c25ca1bcdf569f9aeb14a] Linux 5.10.68
git bisect bad 4d8524048a35c03ff04c25ca1bcdf569f9aeb14a
# good: [219d54332a09e8d8741c1e1982f5eae56099de85] Linux 5.4
git bisect good 219d54332a09e8d8741c1e1982f5eae56099de85
# bad: [2c85ebc57b3e1817b6ce1a6b703928e113a90442] Linux 5.10
git bisect bad 2c85ebc57b3e1817b6ce1a6b703928e113a90442
# good: [435faf5c218a47fd6258187f62d9bb1009717896] Merge tag 'riscv-for-linus-5.8-mw0' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
git bisect good 435faf5c218a47fd6258187f62d9bb1009717896
# bad: [fc3abb53250a90ba2150eebd182137c136f4d25a] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid
git bisect bad fc3abb53250a90ba2150eebd182137c136f4d25a
# good: [23ee3e4e5bd27bdbc0f1785eef7209ce872794c7] Merge tag 'pci-v5.8-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci into master
git bisect good 23ee3e4e5bd27bdbc0f1785eef7209ce872794c7
# good: [f6235eb189706bf38c82b5fa5f2db0d21bc1dcef] Merge tag 'pm-5.9-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
git bisect good f6235eb189706bf38c82b5fa5f2db0d21bc1dcef
# bad: [4b6c093e21d36bede0fd88fd0aeb3b03647260e4] Merge tag 'block-5.9-2020-08-14' of git://git.kernel.dk/linux-block
git bisect bad 4b6c093e21d36bede0fd88fd0aeb3b03647260e4

From the machines with "bad" kernels, uname -a ; cat ./nohup.out | grep failures | tail -n 1 ; echo gives me:

Linux ip-172-31-55-234 5.10.68-custom #1 SMP Wed Oct 27 05:16:07 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
14h38m35s: 5192 runs so far, 46 failures (0.89%)

Linux ip-172-31-57-30 5.10.0-custom-2c85ebc #1 SMP Wed Oct 27 15:53:07 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
5h5m35s: 1806 runs so far, 21 failures (1.16%)

Linux ip-172-31-61-45 5.9.0-rc3-custom-fc3abb5 #1 SMP Wed Oct 27 17:50:47 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
3h2m55s: 1081 runs so far, 15 failures (1.39%)

Linux ip-172-31-61-230 5.8.0-custom-4b6c093 #1 SMP Wed Oct 27 20:15:57 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
40m45s: 240 runs so far, 2 failures (0.83%)

From the machines with "good" kernels, that gives me:

Linux ip-172-31-56-249 5.4.0-custom-219d543 #1 SMP Wed Oct 27 14:41:34 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
6h4m30s: 2150 runs so far, 0 failures

Linux ip-172-31-62-33 5.7.0-custom-435faf5 #1 SMP Wed Oct 27 16:50:55 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
4h1m45s: 1428 runs so far, 0 failures

Linux ip-172-31-61-26 5.8.0-rc6-custom-23ee3e4 #1 SMP Wed Oct 27 18:22:29 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
2h34m25s: 912 runs so far, 0 failures

Linux ip-172-31-63-93 5.8.0-custom-f6235eb #1 SMP Wed Oct 27 19:25:31 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
1h28m10s: 521 runs so far, 0 failures

I'm building the kernels on Ubuntu Xenial 16.04 (ami-0dd273d94ed0540c0) because I ran into problems with very-new versions of gcc complaining about code patterns in less-new kernel source checkouts. The kernel build commands look like:

$ ( cd ~/linux/src && make distclean && git checkout 4b6c093e21d36bede0fd88fd0aeb3b03647260e4 && make distclean && cp ~/config-5.15 ./.config && make olddefconfig && ./scripts/config --disable CONFIG_DEBUG_INFO && make clean )
$ ( cd ~/linux/src && time make -j $(getconf _NPROCESSORS_ONLN) deb-pkg LOCALVERSION=-custom-$(git rev-parse --short HEAD) )

I'm running the kernels on Ubuntu Bionic 18.04 (ami-090717c950a5c34d3), where the kernels I install are more recent than what the machine had installed already and so get selected for use at boot without extra trouble. I copy pprof.test, stress, and the kernel image package to the test machine and then run commands like these:

$ sudo dpkg -i ./linux-image-5.8.0-custom-4b6c093_5.8.0-custom-4b6c093-1_amd64.deb && sudo reboot
$ nohup ./stress.linux_amd64.exe -p 1 ./pprof.test -test.run=TestCPUProfileMultithreadMagnitude -test.count=1 &

@rhysh
Contributor

@rhysh rhysh commented Oct 28, 2021

git bisect start '--first-parent'
# bad: [4d8524048a35c03ff04c25ca1bcdf569f9aeb14a] Linux 5.10.68
git bisect bad 4d8524048a35c03ff04c25ca1bcdf569f9aeb14a
# good: [219d54332a09e8d8741c1e1982f5eae56099de85] Linux 5.4
git bisect good 219d54332a09e8d8741c1e1982f5eae56099de85
# bad: [2c85ebc57b3e1817b6ce1a6b703928e113a90442] Linux 5.10
git bisect bad 2c85ebc57b3e1817b6ce1a6b703928e113a90442
# good: [435faf5c218a47fd6258187f62d9bb1009717896] Merge tag 'riscv-for-linus-5.8-mw0' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
git bisect good 435faf5c218a47fd6258187f62d9bb1009717896
# bad: [fc3abb53250a90ba2150eebd182137c136f4d25a] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid
git bisect bad fc3abb53250a90ba2150eebd182137c136f4d25a
# good: [23ee3e4e5bd27bdbc0f1785eef7209ce872794c7] Merge tag 'pci-v5.8-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci into master
git bisect good 23ee3e4e5bd27bdbc0f1785eef7209ce872794c7
# good: [f6235eb189706bf38c82b5fa5f2db0d21bc1dcef] Merge tag 'pm-5.9-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
git bisect good f6235eb189706bf38c82b5fa5f2db0d21bc1dcef
# bad: [4b6c093e21d36bede0fd88fd0aeb3b03647260e4] Merge tag 'block-5.9-2020-08-14' of git://git.kernel.dk/linux-block
git bisect bad 4b6c093e21d36bede0fd88fd0aeb3b03647260e4
# good: [05a5b5d8a2cd82e2bf5a01ad064efa396ec7fbef] Merge tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
git bisect good 05a5b5d8a2cd82e2bf5a01ad064efa396ec7fbef
# bad: [b6b178e38f40f34842b719a8786d346d4cfec5dc] Merge tag 'timers-core-2020-08-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect bad b6b178e38f40f34842b719a8786d346d4cfec5dc
# good: [e764a1e32337aaf325fc5b14a5bbd06eabba4699] Merge branch 'i2c/for-5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
git bisect good e764a1e32337aaf325fc5b14a5bbd06eabba4699
# good: [8ab49526b53d3172d1d8dd03a75c7d1f5bd21239] x86/fsgsbase/64: Fix NULL deref in 86_fsgsbase_read_task
git bisect good 8ab49526b53d3172d1d8dd03a75c7d1f5bd21239
# good: [7fca4dee610dffbe119714231cac0d59496bc193] Merge tag 'powerpc-5.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
git bisect good 7fca4dee610dffbe119714231cac0d59496bc193
# good: [1d229a65b419cf51a9921d73907f1998a0e14daa] Merge tag 'irq-urgent-2020-08-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect good 1d229a65b419cf51a9921d73907f1998a0e14daa
# first bad commit: [b6b178e38f40f34842b719a8786d346d4cfec5dc] Merge tag 'timers-core-2020-08-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

@rhysh
Contributor

@rhysh rhysh commented Oct 28, 2021

The bisect completed and the results are holding steady: the versions I marked "good" still have zero failures, now across at least 6700 total runs (compared with an error rate close to 1% on the versions I marked "bad"). It points to torvalds/linux@b6b178e, which changed how interval timers are processed, deferring the work out of the interrupt handler and into the thread's own context.

It looks like the new state that tracks whether that work has been enqueued needs to be cleared when a thread forks/clones.


I was able to get some kprobes into run_posix_cpu_timers: one on entry to run_posix_cpu_timers (https://github.com/torvalds/linux/blob/b6b178e38f40f34842b719a8786d346d4cfec5dc/kernel/time/posix-cpu-timers.c#L1296-L1317), and one on entry to task_work_add (https://github.com/torvalds/linux/blob/b6b178e38f40f34842b719a8786d346d4cfec5dc/kernel/time/posix-cpu-timers.c#L1131)

I'm collecting data with this monstrosity:

$ uname -a
Linux ip-172-31-62-240 5.13.0-1005-aws #6-Ubuntu SMP Fri Oct 8 07:48:02 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

$ sudo nohup ./stress.linux_amd64.exe -p=1 -failure='FAIL' bash -c 'export id="perf.$$-$RANDOM.data" ; echo "/tmp/$id" ; perf record -F 97 -e cpu-clock -e '\''signal:*'\'' -e '\''raw_syscalls:*'\'' -e probe:run_posix_cpu_timers -e probe:task_work_add -o "/tmp/$id" -- bash -c '\''/home/ubuntu/pprof.test -test.count=1 -test.run=TestCPUProfileMultithreadMagnitude && rm "/tmp/$id" '\'' ; false' &

The problem appears at a high level as particular threads where timers never trigger. When that happens, the kprobes show threads where calls to run_posix_cpu_timers aren't paired with calls to task_work_add. It appears that the kernel has determined that the task of checking timers has already been enqueued for this thread and so skips enqueuing it again.
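
The logic those kprobes point at looks roughly like this (a paraphrase of the deferred-expiry code added in torvalds/linux@b6b178e, simplified for illustration; not the literal source):

```c
/* Paraphrase of the 5.9+ deferred timer expiry in
 * kernel/time/posix-cpu-timers.c. run_posix_cpu_timers() fires from the
 * timer interrupt and defers the real work to task context via
 * task_work_add(), using a per-task flag to avoid double-enqueueing.
 * If a child inherits scheduled == true from its parent over clone, this
 * returns early on every tick and the thread's timers never fire. */
static void __run_posix_cpu_timers(struct task_struct *tsk)
{
	/* Work already scheduled for this task? Then nothing to do. */
	if (tsk->posix_cputimers_work.scheduled)
		return;

	tsk->posix_cputimers_work.scheduled = true;
	task_work_add(tsk, &tsk->posix_cputimers_work.work, TWA_RESUME);
}
```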

Here's how it looks for a process with 10 broken threads (reporting 6 * 5 = 30 seconds of CPU time rather than 16 * 5 = 80 seconds):

$ sudo cat /tmp/go-stress-20211028T195233-2058828530 | sed -n -e '1p' -e '/compare/p'
/tmp/perf.716624-32638.data
        pprof_test.go:179: compare 1m19.991692s vs 29.96s
$ sudo perf script -i /tmp/perf.716624-32638.data | grep 'pprof.test' | grep probe | awk '{print $4,$5}' | sed -e 's,[0-9]\{3\}:,,' | sort | uniq -c | grep -C 3 ' 16 ' | sed -n -e '10,19 p'
     16 234777.506 probe:run_posix_cpu_timers:
      6 234777.506 probe:task_work_add:
     16 234777.510 probe:run_posix_cpu_timers:
      6 234777.510 probe:task_work_add:
     16 234777.514 probe:run_posix_cpu_timers:
      6 234777.514 probe:task_work_add:
     16 234777.518 probe:run_posix_cpu_timers:
      6 234777.518 probe:task_work_add:
     16 234777.522 probe:run_posix_cpu_timers:
      6 234777.522 probe:task_work_add:

Looking by thread ID to see which have only one of probe:run_posix_cpu_timers or probe:task_work_add, we get these 10 threads. They're in sequence, starting with 716639.

$ sudo perf script -i /tmp/perf.716624-32638.data | grep 'pprof.test' | grep probe | awk '{print $2,$5}' | sort | uniq -c | awk '{print $2}' | uniq -u
716639
716640
716641
716642
716643
716644
716645
716646
716647
716648

The syscall ID for clone on linux/386 is 120 (https://github.com/golang/go/blob/go1.17.2/src/runtime/sys_linux_386.s#L41). Checking how the timer-related kprobes interact with the timing of thread creation, we see that thread 716638 was in the middle of a call to clone when the time came for that thread to process a timer. The result of that call, thread 716639, was unable to process timers, consistent with its tsk->posix_cputimers_work.scheduled flag being stuck at true with no task actually enqueued to clear it.

$ sudo perf script -i /tmp/perf.716624-32638.data | grep 'pprof.test' | sort -k 2 -k 4 | grep -C 2 'NR 120' | grep -C 2 probe
      pprof.test 716631 [013] 234772.482964:     raw_syscalls:sys_enter: NR 224 (50f00, 944fff0, 0, 80a9440, 0, 4d2)
--
      pprof.test 716632 [012] 234782.608595:        probe:task_work_add: (ffffffffb6ccaa30)
      pprof.test 716632 [012] 234782.608596:        probe:task_work_add: (ffffffffb6ccaa30)
      pprof.test 716632 [014] 234772.483467:      raw_syscalls:sys_exit: NR 120 = 0
      pprof.test 716632 [014] 234772.483469:     raw_syscalls:sys_enter: NR 224 (50f00, 9451ff0, 0, 80a9440, 0, 4d2)
--
      pprof.test 716638 [011] 234777.486944:     raw_syscalls:sys_enter: NR 175 (2, 82a71b0, 96b5f00, 8, 9790000, 1)
      pprof.test 716638 [011] 234777.486947:     raw_syscalls:sys_enter: NR 120 (50f00, 9791ff0, 0, 80a9440, 0, 1)
      pprof.test 716638 [011] 234777.486962: probe:run_posix_cpu_timers: (ffffffffb6d507f0)
      pprof.test 716638 [011] 234777.486963:        probe:task_work_add: (ffffffffb6ccaa30)
      pprof.test 716638 [011] 234777.487418:      raw_syscalls:sys_exit: NR 120 = 716639
      pprof.test 716638 [011] 234777.487420:      raw_syscalls:sys_exit: NR 175 = 0
--
      pprof.test 716645 [005] 234777.490702:     raw_syscalls:sys_enter: NR 175 (2, 82a71b0, 9a91f00, 8, 9a8c000, 1)
      pprof.test 716645 [005] 234777.490705:     raw_syscalls:sys_enter: NR 120 (50f00, 9a8dff0, 0, 80a9440, 0, 1)
      pprof.test 716645 [005] 234777.490960: probe:run_posix_cpu_timers: (ffffffffb6d507f0)
      pprof.test 716645 [005] 234777.491173:      raw_syscalls:sys_exit: NR 120 = 716646
      pprof.test 716645 [005] 234777.491174:     raw_syscalls:sys_enter: NR 175 (2, 9a91f00, 0, 8, 0, 1)

That's how 716639 was created with broken timers. And it's a direct ancestor of every other thread that came into being with broken timers.

$ sudo perf script -i /tmp/perf.716624-32638.data | grep 'pprof.test' | grep 'NR 120 = [1-9]'
      pprof.test 716630 [012] 234772.482947:      raw_syscalls:sys_exit: NR 120 = 716631
      pprof.test 716630 [012] 234772.483455:      raw_syscalls:sys_exit: NR 120 = 716632
      pprof.test 716630 [012] 234772.484211:      raw_syscalls:sys_exit: NR 120 = 716633
      pprof.test 716632 [014] 234772.484293:      raw_syscalls:sys_exit: NR 120 = 716634
      pprof.test 716634 [000] 234777.485803:      raw_syscalls:sys_exit: NR 120 = 716636
      pprof.test 716636 [009] 234777.486367:      raw_syscalls:sys_exit: NR 120 = 716637
      pprof.test 716637 [010] 234777.486885:      raw_syscalls:sys_exit: NR 120 = 716638
      pprof.test 716638 [011] 234777.487418:      raw_syscalls:sys_exit: NR 120 = 716639
      pprof.test 716639 [013] 234777.487970:      raw_syscalls:sys_exit: NR 120 = 716640
      pprof.test 716640 [015] 234777.488474:      raw_syscalls:sys_exit: NR 120 = 716641
      pprof.test 716641 [007] 234777.488998:      raw_syscalls:sys_exit: NR 120 = 716642
      pprof.test 716642 [006] 234777.489565:      raw_syscalls:sys_exit: NR 120 = 716643
      pprof.test 716643 [000] 234777.490099:      raw_syscalls:sys_exit: NR 120 = 716644
      pprof.test 716644 [003] 234777.490630:      raw_syscalls:sys_exit: NR 120 = 716645
      pprof.test 716645 [005] 234777.491173:      raw_syscalls:sys_exit: NR 120 = 716646
      pprof.test 716646 [002] 234777.491706:      raw_syscalls:sys_exit: NR 120 = 716647
      pprof.test 716647 [009] 234777.492298:      raw_syscalls:sys_exit: NR 120 = 716648

@prattmic commented Oct 29, 2021

Thanks Rhys for this great investigation! I was planning to comment this morning about how that commit could cause issues, but it seems you've already figured it out!

If you haven't already, I recommend trying a patch to clear tsk->posix_cputimers_work.scheduled on clone, which I believe is the correct fix. The enqueued task work is already cleared here, and I think this should clear tsk->posix_cputimers_work.scheduled as well.

You have quite strong evidence that this is a kernel bug, so I think we can start an LKML thread to report this.

@prattmic commented Oct 29, 2021

In addition to that, it is unclear to me if tsk->posix_cputimers_work.work should be cleared/cancelled on clone. Conceptually it seems it should so that the new thread can't end up with the old thread's work on its linked list. Yet I don't see other users of task work doing this.

@rhysh commented Oct 29, 2021

Thanks @prattmic . I've got 1171 runs and 0 failures now running torvalds/linux@b6b178e plus a patch to clear tsk->posix_cputimers_work.scheduled in copy_process.

What is the right way to move forward on fixing the kernel bug? I don't lurk on LKML, so don't know the norms to follow or to expect there. I'm open to learning, but handing off to someone with more facility there would be fine too.


It looks like tsk->posix_cputimers_work.work gets initialized only once, via posix_cputimers_init_work called by init_timers called by start_kernel. (My test run of b6b178e plus a patch to posix_cputimers_init_work, to set scheduled at the same time as work, continues to see the ~1% failure rate.) So I don't think it's correct to clear the work field at any time. It looks like it's only a function pointer, and that the job itself is an entry in tsk->task_works which as you said gets cleared during copy_process.

More questions I'd have re an LKML conversation are "which function should do the clearing", "or if it's copy_process, where", "or is there a way to keep all the code close to posix-cpu-timers.c, such as recording the thread id where the job is enqueued instead of true/false, so no clearing is required", and "does it need a reproducer written in C".


There's also the question of what this means for Go. Should we disable the tests unconditionally on linux/386 and linux/amd64 (because the kernel feature that introduced the bug is implemented for x86), or is there a good mechanism and precedent for disabling tests on particular kernel versions? Should we disable profiling when we clone threads? Is this worth an update to #13841?

Or: do we expect most distros will pick up the kernel bugfix in the next three months, so this will all shake out before the Go 1.18 release.

I also haven't addressed the flakiness on RISC-V and WSL. Those look related to CPU time spent on GC (which leads to an imbalance between rusage and the expected profile samples for the application code), so I plan to change the test to disable GC.

@gopherbot commented Oct 30, 2021

Change https://golang.org/cl/359934 mentions this issue: runtime/pprof: disable GC during calibration test

@prattmic commented Nov 1, 2021

What is the right way to move forward on fixing the kernel bug? I don't lurk on LKML, so don't know the norms to follow or to expect there. I'm open to learning, but handing off to someone with more facility there would be fine too.

No problem, I will engage with them and CC you on the thread.

Or: do we expect most distros will pick up the kernel bugfix in the next three months, so this will all shake out before the Go 1.18 release.

As you note earlier, this problem affects setitimer as well, so it isn't just a 1.18 problem. If we add a workaround, then it should likely be backported as well. I need to think a bit more about whether this is severe enough for a workaround like disabling profiling during clone.

@rhysh commented Nov 1, 2021

No problem, I will engage with them and CC you on the thread.

Thanks @prattmic . And yes, the problem continues to reproduce in yesterday's release of 5.15.

@prattmic commented Nov 1, 2021

@prattmic commented Nov 2, 2021

That patch has been merged into the -tip tree, timers/urgent branch, so it should be making its way in.

https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?id=ca7752caeaa70bd31d1714af566c9809688544af

@bcmills commented Nov 2, 2021

so... do we think this patch will hit the Go builders soon, or do we need some kind of workaround in the interim?

@prattmic commented Nov 2, 2021

From discussion today, I'm planning to write a general workaround for tip (but not backport), as the user impact seems fairly minimal.

In addition, we can mark the tests flaky until the workaround is available.

@rhysh commented Nov 5, 2021

@bcmills this particular kernel bug wouldn't explain the failure on linux-amd64-longtest on October 4 (well before October 18).

Yes, those failures look like what's described in #37201. That issue closed when a particular source of signals was eliminated; there may be another (less frequent) source of signals.

The October 4 failure shows Duration: 200.88ms, Total samples = 10ms ( 4.98%). Stopping the CPU profiler seems to involve 200ms worth of sleeps, so I take that to mean that it got stopped almost immediately, and due to the "traceback function called 2 times" condition rather than "test ran for one second" (`for C.getCpuHogCount() < 2 && time.Since(t0) < time.Second {`). Samples covering 10ms, at 100Hz, means 1 sample. That's consistent with the process receiving signals in addition to SIGPROF, causing the test to exit early.

@rhysh commented Nov 5, 2021

Turning the GC off during the test didn't resolve all sources of flakes on riscv64: https://build.golang.org/log/088b21dcceb4774c2f5ed27a7929611730fc0192 is a recent failure that shows 39 samples in 0x77188 (runtime.sigtramp:429).

That seems like a strange function to show up in CPU profiles (it is declared `TEXT runtime·sigtramp(SB),NOSPLIT,$64`). How would it be able to get a signal while already handling a signal?

I also see that it's a single PC. Maybe it's in the profile via sigprofNonGoPC (`sigprofNonGoPC(c.sigpc())`), which would mean g==nil. FWIW, that also means (from validSIGPROF) that it's from setitimer rather than timer_create.

If that's the case, what would get in the way of finding g in signal handlers on riscv64? The PC isn't in the VDSO: we see the result of c.sigpc() (0x77188), and if inVDSOPage works correctly that should return false when sigFetchG calls it. So the g should come from getg(). It looks like that's X27 (https://github.com/golang/go/blob/go1.17.3/src/cmd/compile/internal/ssa/gen/RISCV64Ops.go#L27), which is supposed to get preserved across calls / not clobbered (if I've found the right doc and am reading it correctly: https://github.com/riscv-non-isa/riscv-elf-psabi-doc/blob/master/riscv-cc.adoc#integer-register-convention).

I'm not sure where to go from here. Are there RISCV or ABI experts who might have a theory, or is there a way I can access hardware to debug?

@prattmic self-assigned this Nov 8, 2021

@prattmic commented Nov 10, 2021

Are there parts of it that you're waiting on from me?

No, though I'm working on a workaround to disable the timers during clone and would like your review of that once it is ready.

I tried this out, but unfortunately it does not actually work. Just looking at clearing ITIMER_PROF, the behavior is:

This is normally fine, as handle_posix_cpu_timers won't actually send a signal since tsk->signal->it[CPUCLOCK_PROF] is cleared, but simply scheduling the work is sufficient for this bug.

As an alternative to disabling the timers around clone, one wonders if perhaps we can set them to an extremely high value so that fastpath_timer_check will determine that none have expired? No, unfortunately not, as setting the timer only changes nextevt if the new value is less than the old value (to avoid walking the entire linked list to search for the next expiration time). Instead, the timer work is scheduled whenever the next expiration would occur, at which point the list is walked and next expiration updated.

Thus, I don't currently have a workaround idea beyond setting the timers and spinning until the next expiration before clone (which is obviously unacceptable). I'm all ears to other ideas, otherwise I'll just disable this test for now.

@gopherbot commented Nov 10, 2021

Change https://golang.org/cl/363214 mentions this issue: runtime/pprof: disable TestCPUProfileMultithreadMagnitude

@rhysh commented Nov 11, 2021

I see some options to hit the bug less often, but not to work around it entirely.

First of all, the initial timer expiration in the new profiler is chosen from a uniform random distribution in the range [1, 10_000_000] ns. In TestCPUProfileMultithreadMagnitude, many goroutines are ready to work right away, so we expect the runtime to create new threads, for each new thread to create a profiling timer for itself, and then to immediately create an additional new thread (until there's an M for each P). In that scenario, pushing back the earliest possible timer_create expiration (e.g. uniformly distributed over 10–90% of the 10ms target instead of 0–100%, keeping the mean the same) could make the expiration less likely to coincide with a call to clone for that type of workload.

Another option, we could use the template thread every time the runtime needs a new M (and not enable per-thread profiling for that thread). That thread would still be at risk of doing the buggy bookkeeping for the process-wide setitimer. And there's probably a good performance reason to not do this all the time already.

gopherbot pushed a commit that referenced this issue Nov 12, 2021
The Linux kernel starting in 5.9 and fixed in 5.16 has a bug that can
break CPU timer signal delivery on new threads if the timer
interrupt fires during handling of the clone system call.

Broken CPU timer signal delivery will skew CPU profile results and cause
this test to fail.

There is currently no known workaround, so mark the test as flaky on
builders with known broken kernels.

For #49065

Change-Id: I37ceb9ea244869b0aab5cd9a36b27ca2f7e5d315
Reviewed-on: https://go-review.googlesource.com/c/go/+/363214
Trust: Michael Pratt <mpratt@google.com>
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Bryan C. Mills <bcmills@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
@prattmic commented Nov 12, 2021

re @rhysh

Similar to your option (1), we do have a few tests that use runtime_test.warmupScheduler to get the thread count up before starting tests. We could do that for TestCPUProfileMultithreadMagnitude, but I don't think it is worth it vs just waiting.

Re: option (2), again since that is only a partial mitigation, I'd rather avoid making such a change unless this becomes really problematic.

@prattmic commented Nov 12, 2021

With http://golang.org/cl/363214 submitted, I believe we have addressed the flakiness on builders due to the Linux issue.

Additionally, all failures not related to the Linux kernel bug have been split into other issues (please correct me if I forgot some, in which case they should be moved).

Thus, this issue is done for now, but I'll leave open to track the upstream Linux fix rollout and ultimately dropping/adjusting the kernel version checks.

@prattmic changed the title from "runtime/pprof: TestCPUProfileMultithreadMagnitude flaky since addition" to "runtime/pprof: TestCPUProfileMultithreadMagnitude flaky due to Linux kernel bug" Nov 12, 2021
@prattmic removed this from the Go1.18 milestone Nov 12, 2021
@prattmic added this to the Backlog milestone Nov 12, 2021
@prattmic commented Nov 15, 2021

Patch update: my upstream patch has been merged into Linus' tree and queued for backport to 5.10, 5.14, and 5.15.

@prattmic commented Nov 18, 2021

This fix is now released in:

For reference, I took a look at some distros that I expect off the top of my head are widely used. Reminder this bug affects Linux 5.9 through 5.15 inclusive.

| Distro | Kernel version | Notes |
| --- | --- | --- |
| RHEL 8 | Linux 4.18 | unaffected |
| Debian oldstable (buster) | Linux 4.19 | unaffected |
| Debian stable (bullseye) | Linux 5.10.70 currently | Appears to update patch release every ~3 months. |
| Debian testing (bookworm) | Linux 5.14.16 currently | Appears to update patch release monthly. |
| Ubuntu 20.04 LTS | Linux 5.4 | unaffected |
| Ubuntu 21.10 | Linux 5.13 | 5.13 is no longer supported upstream, so it was not patched. It is unclear what their backport policy is, but the release notes do list all the changes they integrate. |
| Amazon Linux 2 | Linux 4.14 | unaffected |
| Google Container-Optimized OS M93 LTS | Linux 5.10.68 currently | Updates patch release in "LTS Refresh" releases every 3 months. |
| Google Container-Optimized OS M89 LTS | Linux 5.4 | unaffected |

Overall this looks pretty promising to me that most of the ecosystem should get the patches fairly quickly.

@bcmills commented Nov 19, 2021

Do we also need to apply the test-skip to some of the other runtime and/or runtime/pprof tests mentioned in #49065 (comment)?

For example, I still see some recent TestMorestack failures, and there was a large uptick in those starting Oct. 18 as well:

greplogs --dashboard -md -l -e 'FAIL: TestMorestack' --since=2021-07-01

2021-11-18T02:16:39-353cb71/linux-386-longtest
2021-11-11T20:25:07-eb68e33/linux-amd64-longtest
2021-11-10T20:45:04-a881409/linux-386-longtest
2021-11-10T18:06:32-097aaa9/linux-386-longtest
2021-11-09T22:11:33-4aa0746/linux-386-longtest
2021-11-09T20:10:44-74b9939/linux-386-longtest
2021-11-09T18:03:59-36dbf7f/linux-386-longtest
2021-11-08T21:45:43-cc49178/linux-amd64-longtest
2021-11-08T18:54:21-830b393/linux-386-longtest
2021-11-07T04:57:22-9e6ad46/linux-amd64-longtest
2021-11-06T19:41:10-565f457/linux-amd64-longtest
2021-11-05T22:56:55-2f71c86/linux-amd64-longtest
2021-11-05T22:30:17-b07c41d/linux-386-longtest
2021-11-05T19:14:22-fa16efb/linux-386-longtest
2021-11-05T17:46:59-e48e4b4/linux-amd64-longtest
2021-11-05T00:52:04-1c4cfd8/linux-amd64-longtest
2021-11-04T21:40:51-2c32f29/linux-amd64-longtest
2021-11-04T15:43:59-6ba68a0/linux-386-longtest
2021-11-04T02:57:37-9e4e23c/linux-386-longtest
2021-11-04T00:46:41-e72d715/linux-386-longtest
2021-11-04T00:15:44-91962df/linux-386-longtest
2021-11-03T05:00:00-519c0a2/linux-386-longtest
2021-11-02T18:35:29-433ba58/linux-386-longtest
2021-11-02T15:15:09-da7173a/linux-386-longtest
2021-11-02T01:28:01-a59e332/linux-386-longtest
2021-11-01T16:31:02-2bcf1c0/linux-386-longtest
2021-11-01T16:30:49-80bedb8/linux-386-longtest
2021-10-31T02:24:29-0bef30d/linux-386-longtest
2021-10-30T18:30:05-f582778/linux-386-longtest
2021-10-29T23:15:00-8dfb447/linux-386-longtest
2021-10-29T18:34:24-903f313/linux-386-longtest
2021-10-26T22:05:53-80be4a4/linux-386-longtest
2021-10-25T17:02:16-52b10ab/linux-386-longtest
2021-10-23T20:23:35-8dbf3e9/linux-386-longtest
2021-10-23T16:18:19-16318df/linux-386-longtest
2021-10-22T21:12:07-76dd01f/linux-386-longtest
2021-10-20T20:39:19-d419a80/linux-386-longtest
2021-10-19T11:52:34-bde0463/linux-386-longtest
2021-10-19T07:45:46-ee92daa/linux-386-longtest
2021-10-18T21:21:54-6c0daa7/linux-386-longtest
2021-10-18T19:46:27-394a1ad/linux-amd64-longtest

@bcmills removed this from the Backlog milestone Nov 19, 2021
@bcmills added this to the Go1.18 milestone Nov 19, 2021

@prattmic commented Nov 19, 2021

Yes, we probably should. TestMorestack for example probably doesn't create as many threads as TestCPUProfileMultithreadMagnitude, but it could still be affected if unlucky.

Most of the pprof tests are probably at least theoretically affected, but I think we only want to add skips to tests that are particularly affected, lest we lose all pprof coverage.

Perhaps better would be to pin the builders to COS M89, which is unaffected, until M93 is updated to include the patch.

@bcmills commented Nov 19, 2021

Perhaps better would be to pin the builders to COS M89, which is unaffected, until M93 is updated to include the patch.

@golang/release, is that feasible to do in the short term?

@prattmic commented Nov 19, 2021

IIUC, that could be achieved by changing https://cs.opensource.google/go/x/build/+/master:buildlet/gce.go;l=461;drc=05d22632dd9ae9f87a48a12960617e223cecdf9c?ss=go from cos-stable to cos-89-lts (https://cloud.google.com/container-optimized-os/docs/release-notes#lts_image_families), though that would affect every single builder using COS, which may or may not be desirable.

@gopherbot commented Nov 19, 2021

Change https://golang.org/cl/365777 mentions this issue: buildlet,dashboard: tweaks to Darwin, OpenBSD, and Linux builders

gopherbot pushed a commit to golang/build that referenced this issue Nov 19, 2021
- Add more Monterey builders and remove the known issue: it's stable.
  Also use it for race and nocgo. Update slowbot aliases to point to it.
- Don't test 1.16 on OpenBSD 7.0, it segfaults.
- Pin the Linux (ContainerOS) builders to an older version to avoid the
  pprof kernel bug.

For golang/go#49065, golang/go#49149, and golang/go#48977.

Change-Id: Ibec2fa735183ec65e5066c7c752ac356b7360550
Reviewed-on: https://go-review.googlesource.com/c/build/+/365777
Trust: Heschi Kreinick <heschi@google.com>
Run-TryBot: Heschi Kreinick <heschi@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Alexander Rakoczy <alex@golang.org>
@heschi commented Nov 24, 2021

Is there anything left to do on this now that the builders are updated? @dmitshur points out it might deserve a release note.
