Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: TestCgoPprofPIE is flaky #37201

Open
bcmills opened this issue Feb 12, 2020 · 9 comments
Open

runtime: TestCgoPprofPIE is flaky #37201

bcmills opened this issue Feb 12, 2020 · 9 comments
Milestone

Comments

@bcmills
Copy link
Member

@bcmills bcmills commented Feb 12, 2020

2020-02-12T18:22:50-363bcd0/linux-ppc64le-power9osu
2020-02-07T18:08:01-b806182/linux-amd64-jessie

--- FAIL: TestCgoPprofPIE (8.17s)
    crash_cgo_test.go:311: [/workdir/go/bin/go tool pprof -traces /workdir/tmp/go-build499031835/testprogcgo_-buildmode=pie.exe /workdir/tmp/prof767840548]:
        File: testprogcgo_-buildmode=pie.exe
        Build ID: 8fb7f17909e63decefe761e095ce9dbd1c97e738
        Type: cpu
        Time: Feb 7, 2020 at 6:35pm (UTC)
        Duration: 200.51ms, Total samples = 0 
        -----------+-------------------------------------------------------
    crash_cgo_test.go:319: cpuHog traceback missing.
    crash_cgo_test.go:311: [/workdir/go/bin/go tool pprof -traces /workdir/tmp/prof767840548]:
        File: testprogcgo_-buildmode=pie.exe
        Build ID: 8fb7f17909e63decefe761e095ce9dbd1c97e738
        Type: cpu
        Time: Feb 7, 2020 at 6:35pm (UTC)
        Duration: 200.51ms, Total samples = 0 
        -----------+-------------------------------------------------------
    crash_cgo_test.go:319: cpuHog traceback missing.
FAIL
FAIL	runtime	33.607s

Given the timing of the logs, it's not clear to me whether this is a regression in Go 1.14 or simply an already-flaky test coming to the surface.

CC @mknyszek @aclements @hyangah @ianlancetaylor @mpx

@bcmills bcmills added this to the Backlog milestone Feb 12, 2020
@gopherbot

This comment has been minimized.

Copy link

@gopherbot gopherbot commented Feb 13, 2020

Change https://golang.org/cl/219417 mentions this issue: runtime: deflake CGO traceback tests

@mpx

This comment has been minimized.

Copy link
Contributor

@mpx mpx commented Feb 13, 2020

I suspect the test is flaky since it only collects 2+ cpuHog frames. I'm not familiar with how Pprof decides which samples to include in the profile, but it looks like in some rare cases cpuHog doesn't make the cut. The above log shows 0 samples reported in 200ms. That indicates the test loop exited early with at least 2 samples (since 200ms < 1 second loop timeout), but they weren't reported.

If so, collecting more samples should significantly reduce the likelihood of a flake.

I'm currently running a large number of tests to see if I can catch the flake. If that works, I should be able to confirm the above patch fixes the problem.

@mpx

This comment has been minimized.

Copy link
Contributor

@mpx mpx commented Feb 15, 2020

I couldn't reproduce the flake on an unloaded machine, but I managed to reproduce it after ~13000 on a reasonably heavily loaded machine (lots of IO, CPU load and scheduling keeping the cores 100% busy).

--- FAIL: TestCgoPprofPIE (0.22s)
    crash_cgo_test.go:311: [/home/mark/go-dev/bin/go tool pprof -traces /tmp/go-build052039209/testprogcgo_-buildmode=pie.exe /tmp/prof021439459]:
        File: testprogcgo_-buildmode=pie.exe
        Build ID: 57c37c1d4a6f0a6bb608c00c63c35c050438b0c3
        Type: cpu
        Time: Feb 15, 2020 at 4:02am (AEDT)
        Duration: 200.27ms, Total samples = 0 
        -----------+-------------------------------------------------------
    crash_cgo_test.go:319: cpuHog traceback missing.

Retrying with the patch now.

@mpx

This comment has been minimized.

Copy link
Contributor

@mpx mpx commented Feb 16, 2020

I was unable to reproduce the flake with the patch after ~200000 tests. Without the patch it typically fails within 5000-20000 tests (both tested from +123f7dd3e1).

I had a quick look through the profiling code and I can't see any good reason why frames should be dropped at all for this test. It's not clear to me why "2" frames were originally chosen.

Going in the other direction, waiting for only a single cpuHog frame appears reliable on go1.13.7, but fails easily on go1.14rc1 (usually <100 tests).

Running a bisect now to see where this test started failing intermittently.

@mpx

This comment has been minimized.

Copy link
Contributor

@mpx mpx commented Feb 16, 2020

The test is flaky from 177a36a (runtime: implement async scheduler preemption) onwards. The test is reliable with GODEBUG=asyncpreemptoff=1. @aclements

The CGO traceback is definitely called (as noted above, I also confirmed with a trace write separately). There is some bad interaction between signal pre-emption and profiling which causes the sample to be lost.

Perhaps this test should only use a single frame? That would have surfaced the issue much earlier.
Updated the earlier change to make it more sensitive rather than less.

@mpx

This comment has been minimized.

Copy link
Contributor

@mpx mpx commented Feb 21, 2020

@bcmills @aclements @ianlancetaylor

It might be worth investigating this issue further before Go 1.14 is released to confirm the damage is limited to loss of some profile samples (and not something more fundamental).

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Feb 21, 2020

The test doesn't count the number of profiling samples that are taken. It counts the number of times the pprofCgoTraceback function is called. That function is called any time a signal occurs while executing in non-Go code. Before we started using signal based preemption, the only signal that would occur in non-Go code was SIGPROF. Now we can also see SIGURG. That is not a bug. But it means that this test is flaky: if we see two SIGURG signals in non-Go code before we see any SIGPROF signals in non-Go code, the test will fail.

So I don't think there is a fundamental problem here. It's a problem with the way that the test is written. It's not a problem that would happen with real code.

Unfortunately I don't see a simple fix off hand, other than disabling signal based preemption for the test.

@mpx

This comment has been minimized.

Copy link
Contributor

@mpx mpx commented Feb 21, 2020

Ah, great. Thanks for confirming it is expected that pprofCgoTraceback is called outside of profile samples, I missed that.

I've reverted my change so it waits for more pprofCgoTraceback calls since it is extremely unlikely that every one of them will not result in a sample. That should effectively deflake the test until a better test is desired/written.

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Feb 21, 2020

Maybe we should simply disable signal-based preemption when running the test. The point of the test is just to see that we pick up a stack trace for SIGPROF, disabling signal-based preemption doesn't invalidate what we are testing for.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
4 participants
You can’t perform that action at this time.