Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: TestCgoPprof{,PIE} is flaky #37201

Closed
bcmills opened this issue Feb 12, 2020 · 21 comments
Closed

runtime: TestCgoPprof{,PIE} is flaky #37201

bcmills opened this issue Feb 12, 2020 · 21 comments

Comments

@bcmills
Copy link
Member

@bcmills bcmills commented Feb 12, 2020

2020-02-12T18:22:50-363bcd0/linux-ppc64le-power9osu
2020-02-07T18:08:01-b806182/linux-amd64-jessie

--- FAIL: TestCgoPprofPIE (8.17s)
    crash_cgo_test.go:311: [/workdir/go/bin/go tool pprof -traces /workdir/tmp/go-build499031835/testprogcgo_-buildmode=pie.exe /workdir/tmp/prof767840548]:
        File: testprogcgo_-buildmode=pie.exe
        Build ID: 8fb7f17909e63decefe761e095ce9dbd1c97e738
        Type: cpu
        Time: Feb 7, 2020 at 6:35pm (UTC)
        Duration: 200.51ms, Total samples = 0 
        -----------+-------------------------------------------------------
    crash_cgo_test.go:319: cpuHog traceback missing.
    crash_cgo_test.go:311: [/workdir/go/bin/go tool pprof -traces /workdir/tmp/prof767840548]:
        File: testprogcgo_-buildmode=pie.exe
        Build ID: 8fb7f17909e63decefe761e095ce9dbd1c97e738
        Type: cpu
        Time: Feb 7, 2020 at 6:35pm (UTC)
        Duration: 200.51ms, Total samples = 0 
        -----------+-------------------------------------------------------
    crash_cgo_test.go:319: cpuHog traceback missing.
FAIL
FAIL	runtime	33.607s

Given the timing of the logs, it's not clear to me whether this is a regression in Go 1.14 or simply an already-flaky test coming to the surface.

CC @mknyszek @aclements @hyangah @ianlancetaylor @mpx

@bcmills bcmills added this to the Backlog milestone Feb 12, 2020
@gopherbot
Copy link

@gopherbot gopherbot commented Feb 13, 2020

Change https://golang.org/cl/219417 mentions this issue: runtime: deflake CGO traceback tests

@mpx
Copy link
Contributor

@mpx mpx commented Feb 13, 2020

I suspect the test is flaky since it only collects 2+ cpuHog frames. I'm not familiar with how Pprof decides which samples to include in the profile, but it looks like in some rare cases cpuHog doesn't make the cut. The above log shows 0 samples reported in 200ms. That indicates the test loop exited early with at least 2 samples (since 200ms < 1 second loop timeout), but they weren't reported.

If so, collecting more samples should significantly reduce the likelihood of a flake.

I'm currently running a large number of tests to see if I can catch the flake. If that works, I should be able to confirm the above patch fixes the problem.

@mpx
Copy link
Contributor

@mpx mpx commented Feb 15, 2020

I couldn't reproduce the flake on an unloaded machine, but I managed to reproduce it after ~13000 on a reasonably heavily loaded machine (lots of IO, CPU load and scheduling keeping the cores 100% busy).

--- FAIL: TestCgoPprofPIE (0.22s)
    crash_cgo_test.go:311: [/home/mark/go-dev/bin/go tool pprof -traces /tmp/go-build052039209/testprogcgo_-buildmode=pie.exe /tmp/prof021439459]:
        File: testprogcgo_-buildmode=pie.exe
        Build ID: 57c37c1d4a6f0a6bb608c00c63c35c050438b0c3
        Type: cpu
        Time: Feb 15, 2020 at 4:02am (AEDT)
        Duration: 200.27ms, Total samples = 0 
        -----------+-------------------------------------------------------
    crash_cgo_test.go:319: cpuHog traceback missing.

Retrying with the patch now.

@mpx
Copy link
Contributor

@mpx mpx commented Feb 16, 2020

I was unable to reproduce the flake with the patch after ~200000 tests. Without the patch it typically fails within 5000-20000 tests (both tested from +123f7dd3e1).

I had a quick look through the profiling code and I can't see any good reason why frames should be dropped at all for this test. It's not clear to me why "2" frames were originally chosen.

Going in the other direction, waiting for only a single cpuHog frame appears reliable on go1.13.7, but fails easily on go1.14rc1 (usually <100 tests).

Running a bisect now to see where this test started failing intermittently.

@mpx
Copy link
Contributor

@mpx mpx commented Feb 16, 2020

The test is flaky from 177a36a (runtime: implement async scheduler preemption) onwards. The test is reliable with GODEBUG=asyncpreemptoff=1. @aclements

The CGO traceback is definitely called (as noted above, I also confirmed with a trace write separately). There is some bad interaction between signal pre-emption and profiling which causes the sample to be lost.

Perhaps this test should only use a single frame? That would have surfaced the issue much earlier.
Updated the earlier change to make it more sensitive rather than less.

@mpx
Copy link
Contributor

@mpx mpx commented Feb 21, 2020

@bcmills @aclements @ianlancetaylor

It might be worth investigating this issue further before Go 1.14 is released to confirm the damage is limited to loss of some profile samples (and not something more fundamental).

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Feb 21, 2020

The test doesn't count the number of profiling samples that are taken. It counts the number of times the pprofCgoTraceback function is called. That function is called any time a signal occurs while executing in non-Go code. Before we started using signal based preemption, the only signal that would occur in non-Go code was SIGPROF. Now we can also see SIGURG. That is not a bug. But it means that this test is flaky: if we see two SIGURG signals in non-Go code before we see any SIGPROF signals in non-Go code, the test will fail.

So I don't think there is a fundamental problem here. It's a problem with the way that the test is written. It's not a problem that would happen with real code.

Unfortunately I don't see a simple fix off hand, other than disabling signal based preemption for the test.

@mpx
Copy link
Contributor

@mpx mpx commented Feb 21, 2020

Ah, great. Thanks for confirming it is expected that pprofCgoTraceback is called outside of profile samples, I missed that.

I've reverted my change so it waits for more pprofCgoTraceback calls since it is extremely unlikely that every one of them will not result in a sample. That should effectively deflake the test until a better test is desired/written.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Feb 21, 2020

Maybe we should simply disable signal-based preemption when running the test. The point of the test is just to see that we pick up a stack trace for SIGPROF, disabling signal-based preemption doesn't invalidate what we are testing for.

@mpx
Copy link
Contributor

@mpx mpx commented Mar 1, 2020

Sorry, missed that suggestion earlier - much better idea. I've updated the deflake change. Given this flake was only noticed after signal preemption, it should more or less make it disappear again. I haven't been able to reproduce it after disabling preemption.

gopherbot pushed a commit that referenced this issue Mar 1, 2020
The CGO traceback function is called whenever CGO code is executing and
a signal is received. This occurs much more frequently now SIGURG
is used for preemption.

Disable signal preemption to significantly increase the likelihood that
a signal results in a profile sample during the test.

Updates #37201

Change-Id: Icb1a33ab0754d1a74882a4ee265b4026abe30bdc
Reviewed-on: https://go-review.googlesource.com/c/go/+/219417
Run-TryBot: Emmanuel Odeke <emm.odeke@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
@mpx
Copy link
Contributor

@mpx mpx commented Mar 8, 2020

FYI, I left this issue open in case this issue should track fixing the test approach and/or runtime.

@bcmills
Copy link
Member Author

@bcmills bcmills commented May 12, 2020

@danscales
Copy link

@danscales danscales commented May 12, 2020

I also got the failure with a recently rebased changed:

https://storage.googleapis.com/go-build-log/3d4facca/linux-amd64-race_bb62fb2a.log

@bcmills
Copy link
Member Author

@bcmills bcmills commented May 21, 2020

Including both TestCgoPprofPIE and TestCgoPprof, it looks like this got quite a bit worse around 2020-05-11:
2020-05-20T21:52:34-567556d/linux-amd64-fedora
2020-05-13T11:59:26-810c27e/linux-amd64-race
2020-05-12T15:01:56-a0698a6/linux-amd64-jessie
2020-05-11T22:38:32-8c1db77/linux-amd64-race
2020-02-12T18:22:50-363bcd0/linux-ppc64le-power9osu
2020-02-07T18:08:01-b806182/linux-amd64-jessie

I've also seen it in quite a few TryBot results recently.

Marking as release-blocker for Go 1.15 due to the likelihood of a regression here.

@bcmills bcmills modified the milestones: Backlog, Go1.15 May 21, 2020
@bcmills bcmills changed the title runtime: TestCgoPprofPIE is flaky runtime: TestCgoPprof{,PIE} is flaky May 21, 2020
@cagedmantis
Copy link
Contributor

@cagedmantis cagedmantis commented May 21, 2020

@bcmills Can I make this a non-beta1 blocking issue?

@bcmills
Copy link
Member Author

@bcmills bcmills commented May 21, 2020

Yep, that seems fine.

@odeke-em
Copy link
Member

@odeke-em odeke-em commented May 29, 2020

I just got a failure at https://storage.googleapis.com/go-build-log/9b94d0b5/linux-amd64-race_d785929a.log

--- FAIL: TestCgoPprof (3.47s)
    crash_cgo_test.go:317: [/workdir/go/bin/go tool pprof -traces /workdir/tmp/go-build787198486/testprogcgo_.exe /workdir/tmp/prof847643684]:
        File: testprogcgo_.exe
        Build ID: a25028f2fff69a0864f81c4cca9180f2f354c52b
        Type: cpu
        Time: May 29, 2020 at 10:00am (UTC)
        Duration: 200.66ms, Total samples = 0 
        -----------+-------------------------------------------------------
    crash_cgo_test.go:325: cpuHog traceback missing.
    crash_cgo_test.go:317: [/workdir/go/bin/go tool pprof -traces /workdir/tmp/prof847643684]:
        File: testprogcgo_.exe
        Build ID: a25028f2fff69a0864f81c4cca9180f2f354c52b
        Type: cpu
        Time: May 29, 2020 at 10:00am (UTC)
        Duration: 200.66ms, Total samples = 0 
        -----------+-------------------------------------------------------
    crash_cgo_test.go:325: cpuHog traceback missing.
FAIL
FAIL	runtime	54.956s
FAIL
go tool dist: Failed: exit status 1
@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented May 29, 2020

Could this be similar to #39021 , in that the program gets signaled with SIGWINCH that sent to the the process group, and triggered pprofCgoTraceback to run too early?

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented May 29, 2020

I think CL https://golang.org/cl/235557 also fixes this. On my machine it was ~10% failure rate (running runtime test in a loop), after applying that CL it doesn't fail with 100+ iterations.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Jun 16, 2020

This hasn't happened since CL 235557 was committed, and findflakes estimates a 0.00% probability that this is still broken. Closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
9 participants
You can’t perform that action at this time.