x/build: frequent "communication error to buildlet" failures on plan9-arm #52677

Open
bcmills opened this issue May 3, 2022 · 7 comments
Labels
arch-arm NeedsFix OS-Plan9
Milestone

Comments

@bcmills (Member) commented May 3, 2022

plan9-arm at 349cc83389f71c459b7820b0deecdf81221ba46c
…
communication error to buildlet (promoted to terminal error): network error promoted to terminal error: runTests: dist test failed: all buildlets had network errors or timeouts, yet tests remain

greplogs --dashboard -md -l -e '\Aplan9-arm.*(\n.*)*communication error to buildlet' --since=2022-01-01
2022-05-02T14:54:05-349cc83/plan9-arm
2022-04-27T14:23:28-f0c0e0f/plan9-arm
2022-04-26T02:28:58-17d7983/plan9-arm
2022-04-11T16:31:53-0179331/plan9-arm
2022-04-07T23:06:24-c451a02/plan9-arm
2022-04-05T14:15:59-62bceae/plan9-arm
2022-03-31T05:34:15-2b8178c/plan9-arm
2022-03-31T00:27:01-0a6ddcc/plan9-arm
2022-03-31T00:26:58-0775730/plan9-arm
2022-03-30T01:12:57-8fefeab/plan9-arm
2022-03-21T19:10:16-efbff6e/plan9-arm
2022-03-07T18:17:40-dcb6547/plan9-arm
2022-03-03T21:19:37-87a345c/plan9-arm
2022-03-01T19:32:51-44e92e1/plan9-arm
2022-02-25T00:25:34-b8b3196/plan9-arm
2022-02-01T18:15:07-125c5a3/plan9-arm
2022-01-27T21:25:18-ad345c2/plan9-arm
2022-01-19T16:33:11-985d97e/plan9-arm
2022-01-10T22:49:07-4ceb5a9/plan9-arm

@millerresearch, can something be done to prevent this builder from getting wedged?

(Compare #49756.)

@gopherbot added the Builders label May 3, 2022
@gopherbot added this to the Unreleased milestone May 3, 2022
@bcmills added OS-Plan9 arch-arm and removed Builders labels May 3, 2022
@bcmills removed this from the Unreleased milestone May 3, 2022
@bcmills added this to the Backlog milestone May 3, 2022
@millerresearch (Contributor) commented May 4, 2022

> @millerresearch, can something be done to prevent this builder from getting wedged?

I've reconfigured the plan9-arm cluster with a filesystem version which I hope is more stable.

Looking at the local log for the most recent of these failures, I see it ends this way:

2022/05/03 17:40:28 [0x10e9e4d0] Running /boot/workdir/go/bin/go with args ["/boot/workdir/go/bin/go" "tool" "dist" "test" "--no-rebuild" "--banner=XXXBANNERXXX:" "go_test:errors" "go_test:expvar" "go_test:flag"] and env ["home=/usr/glenda" "path=/boot/workdir/go/bin\x00.\x00/bin" "type=host-plan9-arm-0intro" "GOARM=7" "GO_BUILD_KEY_DELETE_AFTER_READ=false" "status=" "GO_TEST_TIMEOUT_SCALE=6" "fs=aoe" "GOCACHE=/boot/cache" "GOROOT_BOOTSTRAP=/boot/workdir/go1.4" "sysname=pi4h" "workdir=/boot/workdir" "objtype=arm" "*=aoe" "WORKDIR=/boot/workdir" "GO_BUILDER_NAME=plan9-arm" "GOROOT=/boot/workdir/go" "GOPATH=/boot/workdir/gopath" "GOPROXY=http://gk3-services-nap-1pnbo1ui-1f1ba50a-97gm.c.symbolic-datum-552.internal:30157" "GOPROXY=off" "GOROOT_BOOTSTRAP=/sys/lib/go1.17"] in dir /boot/workdir
2022/05/03 17:40:54 [0x10e9e4d0] Run = ok, after 25.937931117s
2022/05/03 17:40:54 [0x10e68d10] Running /boot/workdir/go/bin/go with args ["/boot/workdir/go/bin/go" "tool" "dist" "test" "--no-rebuild" "--banner=XXXBANNERXXX:" "go_test:fmt" "go_test:go/ast" "go_test:go/build"] and env ["home=/usr/glenda" "path=/boot/workdir/go/bin\x00.\x00/bin" "type=host-plan9-arm-0intro" "GOARM=7" "GO_BUILD_KEY_DELETE_AFTER_READ=false" "status=" "GO_TEST_TIMEOUT_SCALE=6" "fs=aoe" "GOCACHE=/boot/cache" "GOROOT_BOOTSTRAP=/boot/workdir/go1.4" "sysname=pi4h" "workdir=/boot/workdir" "objtype=arm" "*=aoe" "WORKDIR=/boot/workdir" "GO_BUILDER_NAME=plan9-arm" "GOROOT=/boot/workdir/go" "GOPATH=/boot/workdir/gopath" "GOPROXY=http://gk3-services-nap-1pnbo1ui-1f1ba50a-97gm.c.symbolic-datum-552.internal:30157" "GOPROXY=off" "GOROOT_BOOTSTRAP=/sys/lib/go1.17"] in dir /boot/workdir
2022/05/03 18:00:54 Halting in 1 second.
2022/05/03 18:00:54 buildlet reverse mode exiting.

It seems that a dist test has timed out after 20 minutes, and the buildlet has then exited without sending an error status back to the coordinator. The builder machine will then have rebooted, breaking the TCP connection, and the coordinator sees a "communication error".

Wouldn't it be better if the buildlet sent back an explicit error to the coordinator in such cases, so the test run could be retried?

@bcmills (Member, Author) commented May 4, 2022

> It seems that a dist test has timed out after 20 minutes, and the buildlet has then exited without sending an error status back to the coordinator. The builder machine will then have rebooted, breaking the TCP connection, and the coordinator sees a "communication error".

Hrm. Is there a way to set that timeout to something longer?

> Wouldn't it be better if the buildlet sent back an explicit error to the coordinator in such cases, so the test run could be retried?

IMO it would not be appropriate to retry the test: if it times out on one run, what's to stop it from timing out on the next one? Moreover, if we do 20 minutes of work and then time out and retry, we've just wasted 20 minutes of builder time that could have been put to more productive use. (#42699 is closely related.)

@millerresearch (Contributor) commented May 5, 2022

> Hrm. Is there a way to set that timeout to something longer?

Bad idea, I think. The test of {fmt, go/ast, go/build} on plan9-arm normally takes about 50 seconds. If it's timing out after 20 minutes, that's not just slow, it's stalled. Lengthening the timeout would just waste more time.

@millerresearch (Contributor) commented May 5, 2022

> IMO it would not be appropriate to retry the test: if it times out on one run, what's to stop it from timing out on the next one?

Whenever I do a manual retry using the retrybuilds command after a "communication error" failure, the next attempt always succeeds. My strong hunch is that something in the underlying platform is stalling non-deterministically, rather than something within the Go code.

I will set up a process on my local builders to monitor progress on the log output file. If nothing is emitted for, say, 15 minutes, it will send an alert so I can go in with the debugger and try to find out what's stalled.

@heschi added the NeedsFix label May 9, 2022
@bcmills (Member, Author) commented May 11, 2022

greplogs -l -e '\Aplan9-arm.*(\n.*)*communication error to buildlet' --since=2022-05-04
2022-05-10T18:37:58-508cb32/plan9-arm

@millerresearch (Contributor) commented May 12, 2022

I've found a likely cause: an assertion failure in the Plan 9 filesystem. While working on diagnosing that, I can try making it retry instead of halting, so it won't stall the builder.

@bcmills (Member, Author) commented May 26, 2022

Is the retry in place? The builder seems a little more stable, but there's still a recent one of these.

greplogs -l -e '\Aplan9-arm.*(\n.*)*communication error to buildlet' --since=2022-05-11
2022-05-17T19:57:24-88c5324/plan9-arm
