Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

os: TestStdPipe flake on darwin-amd64 #20315

Closed
mvdan opened this issue May 10, 2017 · 8 comments
Closed

os: TestStdPipe flake on darwin-amd64 #20315

mvdan opened this issue May 10, 2017 · 8 comments

Comments

@mvdan
Copy link
Member

@mvdan mvdan commented May 10, 2017

From the build dashboard: https://build.golang.org/log/aaf3d220b3228b07acbdecd55dd878b265595623

--- FAIL: TestStdPipe (0.10s)
	pipe_test.go:90: unexpected exit status exit status 2 for descriptor 1s sig false
FAIL
FAIL	os	1.195s
@mvdan mvdan added this to the Go1.9 milestone May 10, 2017
@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

@ianlancetaylor ianlancetaylor commented May 10, 2017

Obviously I'm missing something but I really don't see how that is possible. I don't see any code path at all that leads to an exit status of 2.

@mvdan

This comment has been minimized.

Copy link
Member Author

@mvdan mvdan commented May 11, 2017

@ianlancetaylor seems like it - could it be that it's the test binary that returns the exit status 2 somehow? Our usage seems to be simple, but if something weird happened there it could be possible that the exit code comes from somewhere other than the test we run via -test.run.

@bradfitz bradfitz added the Testing label May 24, 2017
@broady broady modified the milestones: Go1.9Maybe, Go1.9 Jul 17, 2017
@bradfitz bradfitz modified the milestones: Go1.9Maybe, Go1.10 Jul 20, 2017
@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Dec 6, 2017

@aclements Can you let us know how often this test failure occurs? Or tell us how we can run your tool to find that out ourselves? Thanks.

@bradfitz

This comment has been minimized.

Copy link
Contributor

@bradfitz bradfitz commented Dec 6, 2017

Oh, I forgot he did check in his tool: x/build/cmd/fetchlogs

@aclements

This comment has been minimized.

Copy link
Member

@aclements aclements commented Dec 6, 2017

They're all technically checked in, but some are in personal repos because they're a mess (go get github.com/aclements/go-misc/greplogs; go get github.com/aclements/go-misc/findflakes)

$ fetchlogs
$ greplogs -dashboard -E 'pipe_test.*exit status 2' -md -l

2016-05-11T16:48:46-80423f1/darwin-amd64-10_9
2016-05-11T17:07:24-20e362d/darwin-amd64-10_9
2016-05-12T16:48:09-a8a2b38/darwin-amd64-10_9
2016-05-13T17:44:48-0cc710d/darwin-amd64-10_9
2016-05-14T13:49:51-d8b08c3/darwin-amd64-10_9
2016-05-14T13:49:57-6181db5/darwin-amd64-10_9
2016-05-14T17:27:44-5f83312/darwin-amd64-10_9
2016-05-14T17:30:16-2699da1/darwin-amd64-10_9
2016-05-14T23:06:17-aba7b3e/darwin-amd64-10_9
2016-05-15T06:24:02-b4bf066/darwin-amd64-10_9
2016-05-15T20:56:39-a101b85/darwin-amd64-10_9
2016-08-22T22:59:20-550caa1/darwin-amd64-10_8
2016-09-08T21:36:37-58de268/darwin-amd64-10_8
2016-09-10T17:04:21-3d562de/darwin-amd64-10_8
2017-01-24T19:56:23-98842ca/darwin-amd64-10_11
2017-03-31T14:05:12-4a11404/darwin-amd64-10_10
2017-05-10T14:47:41-41d0bbd/darwin-amd64-10_10
2017-07-10T20:09:00-6f83b75/darwin-amd64-10_12
2017-07-11T14:01:04-58ae050/darwin-amd64-10_12
2017-08-30T19:23:48-81b9d73/darwin-amd64-10_11
2017-10-18T19:57:28-9a84e52/darwin-amd64-10_12

$ greplogs -dashboard -E 'pipe_test.*exit status 2' -l | findflakes -paths
First observed 98842ca 24 Jan 19:56 2017 (3839 commits ago)
Last observed  9a84e52 18 Oct 19:57 2017 (656 commits ago)
36% chance failure is still happening
0.16% failure probability (6 of 3184 commits)
Likely culprits:
    0% 98842ca net/http: don't send body on redirects for 301, 302, 303 when GetBody is set
    0% 314180e net/http: fix a nit
    0% aad06da cmd/link: mark DWARF function symbols as reachable
    0% be9dcfe doc: mention testing.MainStart signature change
    0% a96e117 runtime: amd64, use 4-byte ops for memmove of 4 bytes
    0% 4cce27a cmd/compile: fix constant propagation through s390x MOVDNE instructions
    0% 1be957d misc/cgo/test: pass current environment to syscall.Exec
    0% ec654e2 misc/cgo/test: fix test when using GCC 7
    0% 256a605 cmd/compile: don't use nilcheck information until the next block
    0% e8d5989 cmd/compile: fix compilebench -alloc
Past failures:
  550caa1 22 Aug 22:59 2016 to 3d562de 10 Sep 17:04 2016
    0.65% failure probability (3 of 310 commits)
  80423f1 11 May 16:48 2016 to a101b85 15 May 20:56 2016
    25% failure probability (11 of 41 commits)
@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Dec 7, 2017

Thanks, @aclements .

I do now see a code path that can lead to an exit status of 2. The failure cases are when signal.Notify is not called. The program will get a SIGPIPE. The signal handler will catch the signal and return. The Write will see an EPIPE error, call os.epipecheck, then os.sigpipe (in runtime/signal_unix.go), then dieFromSignal. If dieFromSignal fails to actually die from the signal, it will call exit(2), which is what we are seeing.

Darwin is unusual in that the runtime.raise function used by dieFromSignal does not simply send the signal to the current thread, but instead sends it to the entire process using the kill system call. So Darwin is more likely to encounter an unexpected delay in signal delivery here, and that may be what is causing these very occasional failures. Also the Darwin osyield function just sleeps for 1 microsecond, rather than calling sched_yield which generally lets all other threads run before the current thread resumes.

(Darwin does actually have pthread_kill and sched_yield functions. I don't know why we don't use them.)

@aclements

This comment has been minimized.

Copy link
Member

@aclements aclements commented Dec 7, 2017

See https://golang.org/cl/37890 for some of why it doesn't use sched_yield. Apparently we're waiting on loop preemption (sigh).

@gopherbot

This comment has been minimized.

Copy link

@gopherbot gopherbot commented Dec 7, 2017

Change https://golang.org/cl/82375 mentions this issue: runtime: sleep longer in dieFromSignal on Darwin

@gopherbot gopherbot closed this in 0ec59e4 Dec 7, 2017
@golang golang locked and limited conversation to collaborators Dec 7, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
6 participants
You can’t perform that action at this time.