runtime/pprof: TestCPUProfileMultithreadMagnitude flaky due to Linux kernel bug #49065
Here are some of the other tests I see failing in these logs, noting whether they were failing prior to the https://golang.org/cl/342054 patch stack (looking back to April):
IMO, this is pretty strong evidence that these CLs have actually broken something w.r.t. profiling. I wouldn't be surprised if riscv or wsl have fundamental platform issues with profiling, but 386 and amd64 should certainly be OK.
Thank you for the links. Here's what I've found so far in the 386 and amd64 (not WSL, not RISC-V) logs.
The failing test results for TestCPUProfileMultithreadMagnitude on those platforms all look like "got full profiling info, but only from a subset of threads", followed by some number of tests that get a very small number of profile samples—often 0, sometimes 1, sometimes dozens—when the tests expect several hundred. The samples that do appear come from the runtime, in the scheduler or scavenger, in functions that might run on otherwise-idle threads.
HOSTNAME=buildlet-linux-stretch-morecpu-rn873c60c, which per https://farmer.golang.org/builders is "Debian Stretch, but on e2-highcpu-16", which per https://cloud.google.com/compute/docs/general-purpose-machines#e2-high-cpu is 16 vCPUs and 16 GB memory.
The single-goroutine portion of TestCPUProfileMultithreadMagnitude saw 4.98 seconds of samples (in a 5-second test). The parallel portion (using GOMAXPROCS goroutines) also ran for 5 seconds, during which the OS reported 79.56 seconds of user+system time in rusage but the runtime only saw 34.71 seconds worth of samples. That looks like "the cliff at 2.5 that we got with setitimer is now present at 7 vCPU, sometimes".
The failure in that test run of TestCPUProfileMultithreadMagnitude is followed by failures in TestCPUProfileInlining, TestCPUProfileRecursion, TestMathBigDivide, TestMorestack, TestCPUProfileLabel, and TestTimeVDSO. All of those have "total 0 CPU profile samples collected" except for TestCPUProfileInlining which got a single sample (below the minimum of 125 and target of 500).
HOSTNAME=buildlet-linux-stretch-morecpu-rnce27d47, so GOMAXPROCS=16 again.
The single-goroutine portion saw 4.99 seconds of samples, and the parallel portion saw 29.94 seconds of samples. This looks like "the cliff at 2.5 that we got with setitimer is now present at 6 vCPU, sometimes".
The failure in that test run of TestCPUProfileMultithreadMagnitude is followed by failures in TestCPUProfileRecursion, TestMathBigDivide, TestMorestack, TestCPUProfileLabel, and TestTimeVDSO. Again all of those failures were with "total 0 CPU profile samples collected" except for TestCPUProfileRecursion and TestMorestack which got a single sample each (vs min of 125 and target of 500).
No environment header at the top of this log.
The single-goroutine portion saw 5.00 seconds of samples, and the parallel portion saw 49.87 seconds of samples. This looks like "the cliff at 2.5 that we got with setitimer is now present at 10 vCPU, sometimes".
Followed by failure of TestMorestack (10 samples, 0 where expected), TestCPUProfileLabel (22 samples, 0 where expected), and TestTimeVDSO (9 samples, 0 where expected).
HOSTNAME=buildlet-linux-stretch-morecpu-rn83cdb7c, so GOMAXPROCS=16.
The single-goroutine portion saw 5.02 seconds of samples (ran a little long!), and the parallel portion saw 50.50 seconds of samples (also ran a little long, but on fewer than GOMAXPROCS threads?). This could look like "the cliff at 2.5 that we got with setitimer is now present at 10 vCPU, sometimes" ... at least, 50.50 seconds is only slightly above 10 x 5.00 seconds.
There were no other failures in this run.
HOSTNAME=buildlet-linux-stretch-morecpu-rn6ff4f0f, so GOMAXPROCS=16.
The single-goroutine portion saw 5.00 seconds of samples, and the parallel portion saw 50.83 seconds of samples (ran a little long, but on fewer than GOMAXPROCS threads?). This could look like "the cliff at 2.5 that we got with setitimer is now present at 10 vCPU, sometimes" ... at least, 50.83 seconds is only slightly above 10 x 5.00 seconds.
There were no other failures in this run.
These failures look like three problems:
Problem 1 is a couple of bad commits while the CL stack was landing. I'll filter those out of the results.
Problem 2 often shows garbage-collection-related functions in the profile. If the GC is running during the test, it would increase the CPU time for the whole process as reported by getrusage(2) without increasing the amount of time in the cpuHog functions that the test examines. I plan to update the test to disable the GC while it runs.
Problem 3 is new in the last few days. When it shows up, it affects the single-goroutine CPU profiling tests too. It's only appeared on the longtest builders. Those have 16 hyperthreads / GOMAXPROCS=16.
The failures in single-goroutine tests take the form of zero or near-zero samples in the desired functions, and the failures in
It's interesting that the amount of CPU time reported in the profile is always very close to an integer number of threads (each running for 5 seconds). It's also interesting that the smallest report is 6 out of the 16 threads. I disabled the new timer_create-based profiler and ran the test on the linux-386-longtest builder, and found that setitimer is able to deliver enough SIGPROF events to the process to cover the work of 10 threads (https://storage.googleapis.com/go-build-log/d3ff14b2/linux-386-longtest_a3d8929c.log). My working theory on this failure mode is that on some fraction of the process's threads, the kernel delivers the signals from setitimer first, so the deliveries from timer_create are crowded out up to setitimer's maximum signal delivery rate.
I also find it suspicious that this failure mode is new in the last few days, but none of the nearby commits look related and I also haven't been able to reproduce the problem myself.
I got an interesting trybot failure on https://go-review.googlesource.com/c/go/+/357650/8: https://storage.googleapis.com/go-build-log/c3d0e67e/linux-386-longtest_222a0fb6.log
The result in
It would be more interesting if the test run showed more than a single thread not reporting in at all, but this is still pretty good.
I'd like to see the output of
Gomote access might help, but the instructions at https://github.com/golang/go/wiki/Gomote#access-token say that new accounts are on hold.
I've built the toolchain with GOARCH=386 and GOHOSTARCH=386 on a 16-hyperthread machine (a c5.4xlarge Intel EC2 instance) running Debian Stretch and have run
In the linux-386-longtest failures, the first failing test is TestCPUProfileMultithreadMagnitude. The part of that test that fails is the one that exercises GOMAXPROCS threads/goroutines in parallel. If that test fails, subsequent tests of the CPU profiler may also fail, including ones that use a single thread/goroutine. However, 1/ the two tests of the CPU profiler preceding TestCPUProfileMultithreadMagnitude (TestCPUProfile using 1 thread/goroutine and TestCPUProfileMultithreaded using 2 threads/goroutines) do not fail, and 2/ the single-threaded / serial portion (the first half) of TestCPUProfileMultithreadMagnitude also does not fail.
It looks like CPU profiling is working for the process, then the process uses lots of threads in parallel, and then CPU profiling no longer consistently works for the process. And, this is new as of October 18.
I don't know how to move forward with the "longtest" half of the problem. Do you have advice, @prattmic ?
My rough non-repro steps:
I think the longtest failures are related to the Linux Kernel version in use.
From one side, the linux-386-longtest failures started abruptly on 2021-10-18 and none of the Go commits close to the first failure look like they'd be related. It looks like the linux-386-longtest runs as a container on GCE and does not require nested virtualization, and so runs on the latest stable version of GCE's Container Optimized OS.
Following on to that, it looks like COS m93 with Kernel 5.10.68 became a stable release on 2021-10-18, replacing m87 with Kernel 5.4.144. (I only have a date, not a timestamp ... I haven't figured out how to list VM images without a billing-enabled account.) https://cloud.google.com/container-optimized-os/docs/release-notes/m93 https://cloud.google.com/container-optimized-os/docs/release-notes/m89
From the other side, I've seen the following on 16-thread machines I control:
I've only gotten one failure so far, and I don't have the
@golang/release , does the "kernel / COS image version might be related to these failures" idea sound plausible to you? Did the container-based builders pick up COS m93, and does the timeline of that match this set of linux-386-longtest failures? (I don't know if I'm reading and interpreting the code correctly.) Thanks.
I continue to see about a 1% failure rate on my test machines that use very new kernels.
Debian Stretch, Linux 4.9: 4492 runs, 0 failures
They're c5.4xlarge machines in us-west-2, running images ami-025f3bcb64ebe6e83, ami-02684f1c7f36cdd5b, ami-0d4a468c8fcc4b5f0.
I've started an additional Ubuntu 21.10 machine to collect
I think we should not disable profiling-related features, and should instead disable the tests when running on versions of the kernel where user-space profiling is flaky. The data I've collected point to the problem:
I've reproduced the failure with go1.17.2 and kernel 5.13.0. That predates the per-thread profiler, so also predates its many-threads test, and so has a very low failure rate. But I've seen it fail twice in the same way that it's failed on the linux-386-longtest builder since 2021-10-18: once on TestCPUProfileInlining ("too few samples; got 0, want at least 125, ideally 500") and once on TestCPUProfileLabel ("too few samples; got 3, want at least 125, ideally 500").
Back to tip at
It looks like user-space CPU profiling is broken (or at least somehow changed) on recent Linux Kernel versions, and that the new per-thread profiler doesn't have much to do with that aside from adding a test that makes that more obvious. I don't have root cause yet, but I don't think that disabling or reverting the new profiler is the right way to address that.
Here's the data from one of the most extreme examples I have (with only 25 seconds of work reported for the 16 workers * 5 seconds = 80 second parallel portion of the test). It's from
I'm making progress on bisecting the kernel. I've only semi-automated the process; when my other work is interruptible, I can check about one commit per hour. There are "roughly 6 steps" left. During the bisect, the only thing I'm changing is the version of the linux kernel (created via
What I hope to get out of the bisect is a better understanding of what caused the kernel bug, to inform either a/ a change to the profiling features to not trigger the kernel bug, or b/ a change to the profiling tests to detect when their failure would be "unfortunate but not surprising" and to skip them in those cases.
Here's the bisect log:
From the machines with "bad" kernels,
From the machines with "good" kernels, that gives me:
I'm building the kernels on Ubuntu Xenial 16.04 (ami-0dd273d94ed0540c0) because I ran into problems with very-new versions of gcc complaining about code patterns in less-new kernel source checkouts. The kernel build commands look like:
I'm running the kernels on Ubuntu Bionic 18.04 (ami-090717c950a5c34d3), where the kernels I install are more recent than what the machine had installed already and so get selected for use at boot without extra trouble. I copy pprof.test, stress, and the kernel image package to the test machine and then run commands like these:
The bisect completed and the results are holding steady: the versions I marked "good" still have zero failures, and now have at least 6700 total runs (compared with an error rate close to 1% on the versions I marked "bad"). It points to torvalds/linux@b6b178e which changed how interval timers are processed, deferring the work out of the interrupt handler and into the thread's own context.
It looks like the new state to track whether that work has been enqueued needs to get cleared when a thread forks / clones.
I was able to get some kprobes into
I'm collecting data with this monstrosity:
The problem appears at a high level as particular threads where timers never trigger. When that happens, the kprobes show threads where calls to
Here's how it looks for a process with 10 broken threads (reporting 6 * 5 = 30 seconds of CPU time rather than 16 * 5 = 80 seconds):
Looking by thread ID to see which have only one of
The syscall ID for clone on linux/386 is 120 (https://github.com/golang/go/blob/go1.17.2/src/runtime/sys_linux_386.s#L41). Checking in on how the timer-related kprobes interact with the timing of thread creation, we see that thread 716638 was in the middle of a call to
That's how 716639 was created with broken timers. And it's a direct ancestor of every other thread that came into being with broken timers.
Thanks Rhys for this great investigation! I was planning to comment this morning about possibilities of how that commit could cause issues, but it seems you've already figured it out!
If you haven't already, I recommend trying a patch to clear
You have quite strong evidence that this is a kernel bug, so I think we can start an LKML thread to report this.
In addition to that, it is unclear to me if
What is the right way to move forward on fixing the kernel bug? I don't lurk on LKML, so don't know the norms to follow or to expect there. I'm open to learning, but handing off to someone with more facility there would be fine too.
It looks like
More questions I'd have re an LKML conversation are "which function should do the clearing", "or if it's copy_process, where", "or is there a way to keep all the code close to posix-cpu-timers.c, such as recording the thread id where the job is enqueued instead of true/false, so no clearing is required", and "does it need a reproducer written in C".
There's also the question of what this means for Go. Should we disable the tests unconditionally on linux/386 and linux/amd64 (because the kernel feature that introduced the bug is implemented for x86), or is there a good mechanism and precedent for disabling tests on particular kernel versions? Should we disable profiling when we clone threads? Is this worth an update to #13841?
Or: do we expect that most distros will pick up the kernel bugfix in the next three months, so this will all shake out before the Go 1.18 release?
I also haven't addressed the flakiness on RISC-V and WSL. Those look related to CPU time spent on GC (which leads to an imbalance between rusage and the expected profile samples for the application code), so I plan to change the test to disable GC.
No problem, I will engage with them and CC you on the thread.
As you note earlier, this problem affects setitimer as well, so it isn't just a 1.18 problem. If we add a workaround, then it should likely be backported as well. I need to think a bit more about whether this is severe enough for a workaround like disabling profiling during clone.
That patch has been merged into the -tip tree, timers/urgent branch, so it should be making its way in.
This fix is now released in:
For reference, I took a look at some distros that, off the top of my head, I expect are widely used. As a reminder, this bug affects Linux 5.9 through 5.15 inclusive.
Overall this looks pretty promising to me that most of the ecosystem should get the patches fairly quickly.
Yes, we probably should.
Most of the pprof tests are probably at least theoretically affected, but I think we only want to add skips to tests that are particularly affected, lest we lose all pprof coverage.
Perhaps better would be to pin the builders to COS M89, which is unaffected, until M93 is updated to include the patch.
IIUC, that could be achieved by changing https://cs.opensource.google/go/x/build/+/master:buildlet/gce.go;l=461;drc=05d22632dd9ae9f87a48a12960617e223cecdf9c?ss=go from
- Add more Monterey builders and remove the known issue: it's stable. Also use it for race and nocgo. Update slowbot aliases to point to it.
- Don't test 1.16 on OpenBSD 7.0, it segfaults.
- Pin the Linux (ContainerOS) builders to an older version to avoid the pprof kernel bug.

For golang/go#49065, golang/go#49149, and golang/go#48977.

Change-Id: Ibec2fa735183ec65e5066c7c752ac356b7360550
Reviewed-on: https://go-review.googlesource.com/c/build/+/365777
Trust: Heschi Kreinick <firstname.lastname@example.org>
Run-TryBot: Heschi Kreinick <email@example.com>
TryBot-Result: Go Bot <firstname.lastname@example.org>
Reviewed-by: Alexander Rakoczy <email@example.com>
There is no remaining immediate work to be done beyond tracking rollout of upstream fixes.
Perhaps adding a release note would be a good place to note this bug, though I want to emphasize that this is not a new issue with Go 1.18. It affects old releases as well.
As of 2022-01-18 all of the distros I tracked in #49065 (comment) have released a fixed kernel.
As a summary, this bug was:
The latest COS version (m93) has released an updated kernel containing the fix to golang/go#49065, so we no longer need to pin away from it.

For golang/go#49065.

Change-Id: Ie74a432e9229eec5613df7e23ead3f252390ae5f
Reviewed-on: https://go-review.googlesource.com/c/build/+/381454
Trust: Michael Pratt <firstname.lastname@example.org>
Run-TryBot: Michael Pratt <email@example.com>
TryBot-Result: Gopher Robot <firstname.lastname@example.org>
Reviewed-by: Alex Rakoczy <email@example.com>