x/build: frequent "communication error to buildlet" failures on plan9-arm #52677

Closed
bcmills opened this issue May 3, 2022 · 21 comments
Labels: arch-arm, NeedsFix, OS-Plan9, WaitingForInfo

@bcmills
Member

bcmills commented May 3, 2022

#!watchflakes
post <- builder == "plan9-arm" && `communication error to buildlet`
plan9-arm at 349cc83389f71c459b7820b0deecdf81221ba46c
…
communication error to buildlet (promoted to terminal error): network error promoted to terminal error: runTests: dist test failed: all buildlets had network errors or timeouts, yet tests remain

greplogs --dashboard -md -l -e '\Aplan9-arm.*(\n.*)*communication error to buildlet' --since=2022-01-01
2022-05-02T14:54:05-349cc83/plan9-arm
2022-04-27T14:23:28-f0c0e0f/plan9-arm
2022-04-26T02:28:58-17d7983/plan9-arm
2022-04-11T16:31:53-0179331/plan9-arm
2022-04-07T23:06:24-c451a02/plan9-arm
2022-04-05T14:15:59-62bceae/plan9-arm
2022-03-31T05:34:15-2b8178c/plan9-arm
2022-03-31T00:27:01-0a6ddcc/plan9-arm
2022-03-31T00:26:58-0775730/plan9-arm
2022-03-30T01:12:57-8fefeab/plan9-arm
2022-03-21T19:10:16-efbff6e/plan9-arm
2022-03-07T18:17:40-dcb6547/plan9-arm
2022-03-03T21:19:37-87a345c/plan9-arm
2022-03-01T19:32:51-44e92e1/plan9-arm
2022-02-25T00:25:34-b8b3196/plan9-arm
2022-02-01T18:15:07-125c5a3/plan9-arm
2022-01-27T21:25:18-ad345c2/plan9-arm
2022-01-19T16:33:11-985d97e/plan9-arm
2022-01-10T22:49:07-4ceb5a9/plan9-arm

@millerresearch, can something be done to prevent this builder from getting wedged?

(Compare #49756.)

@gopherbot added the Builders label May 3, 2022
@gopherbot added this to the Unreleased milestone May 3, 2022
@bcmills added the OS-Plan9 and arch-arm labels and removed the Builders label May 3, 2022
@bcmills changed the milestone from Unreleased to Backlog May 3, 2022
@millerresearch
Contributor

> @millerresearch, can something be done to prevent this builder from getting wedged?

I've reconfigured the plan9-arm cluster with a filesystem version which I hope is more stable.

Looking at the local log for the most recent of these failures, I see it ends this way:

2022/05/03 17:40:28 [0x10e9e4d0] Running /boot/workdir/go/bin/go with args ["/boot/workdir/go/bin/go" "tool" "dist" "test" "--no-rebuild" "--banner=XXXBANNERXXX:" "go_test:errors" "go_test:expvar" "go_test:flag"] and env ["home=/usr/glenda" "path=/boot/workdir/go/bin\x00.\x00/bin" "type=host-plan9-arm-0intro" "GOARM=7" "GO_BUILD_KEY_DELETE_AFTER_READ=false" "status=" "GO_TEST_TIMEOUT_SCALE=6" "fs=aoe" "GOCACHE=/boot/cache" "GOROOT_BOOTSTRAP=/boot/workdir/go1.4" "sysname=pi4h" "workdir=/boot/workdir" "objtype=arm" "*=aoe" "WORKDIR=/boot/workdir" "GO_BUILDER_NAME=plan9-arm" "GOROOT=/boot/workdir/go" "GOPATH=/boot/workdir/gopath" "GOPROXY=http://gk3-services-nap-1pnbo1ui-1f1ba50a-97gm.c.symbolic-datum-552.internal:30157" "GOPROXY=off" "GOROOT_BOOTSTRAP=/sys/lib/go1.17"] in dir /boot/workdir
2022/05/03 17:40:54 [0x10e9e4d0] Run = ok, after 25.937931117s
2022/05/03 17:40:54 [0x10e68d10] Running /boot/workdir/go/bin/go with args ["/boot/workdir/go/bin/go" "tool" "dist" "test" "--no-rebuild" "--banner=XXXBANNERXXX:" "go_test:fmt" "go_test:go/ast" "go_test:go/build"] and env ["home=/usr/glenda" "path=/boot/workdir/go/bin\x00.\x00/bin" "type=host-plan9-arm-0intro" "GOARM=7" "GO_BUILD_KEY_DELETE_AFTER_READ=false" "status=" "GO_TEST_TIMEOUT_SCALE=6" "fs=aoe" "GOCACHE=/boot/cache" "GOROOT_BOOTSTRAP=/boot/workdir/go1.4" "sysname=pi4h" "workdir=/boot/workdir" "objtype=arm" "*=aoe" "WORKDIR=/boot/workdir" "GO_BUILDER_NAME=plan9-arm" "GOROOT=/boot/workdir/go" "GOPATH=/boot/workdir/gopath" "GOPROXY=http://gk3-services-nap-1pnbo1ui-1f1ba50a-97gm.c.symbolic-datum-552.internal:30157" "GOPROXY=off" "GOROOT_BOOTSTRAP=/sys/lib/go1.17"] in dir /boot/workdir
2022/05/03 18:00:54 Halting in 1 second.
2022/05/03 18:00:54 buildlet reverse mode exiting.

It seems that a dist test has timed out after 20 minutes, and the buildlet has then exited without sending an error status back to the coordinator. The builder machine will then have rebooted, breaking the TCP connection, and the coordinator sees a "communication error".

Wouldn't it be better if the buildlet sent back an explicit error to the coordinator in such cases, so the test run could be retried?
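
To make the suggestion concrete, here is a minimal sketch of the idea, not the actual buildlet code. It runs a step under a deadline and always returns a status value, so a stalled step is reported as an explicit timeout instead of showing up as a dropped connection. The names `stepStatus`, `runStep`, `stalledStep`, `quickStep` and `demoTimeout` are all hypothetical, and the short timeout only stands in for the 20-minute limit seen in the log.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// stepStatus is a hypothetical message a buildlet could report to the
// coordinator; it is not part of the real x/build protocol.
type stepStatus struct {
	OK       bool
	TimedOut bool
	Detail   string
}

// runStep runs step under a deadline and always returns a status, so a
// stalled step surfaces as an explicit timeout.
func runStep(timeout time.Duration, step func(context.Context) error) stepStatus {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	err := step(ctx)
	switch {
	case err == nil:
		return stepStatus{OK: true}
	case errors.Is(ctx.Err(), context.DeadlineExceeded):
		return stepStatus{TimedOut: true, Detail: fmt.Sprintf("step exceeded %v", timeout)}
	default:
		return stepStatus{Detail: err.Error()}
	}
}

// demoTimeout stands in for the 20-minute limit seen in the log above.
const demoTimeout = 50 * time.Millisecond

// stalledStep simulates a test that never finishes on its own.
func stalledStep(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// quickStep simulates a test that completes normally.
func quickStep(ctx context.Context) error { return nil }

func main() {
	fmt.Printf("%+v\n", runStep(demoTimeout, quickStep))
	fmt.Printf("%+v\n", runStep(demoTimeout, stalledStep))
}
```

This only helps when the stalled step observes cancellation. If the stall is in the underlying platform, the status may never be sent, and the coordinator would still have to rely on the broken connection.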

@bcmills
Member Author

bcmills commented May 4, 2022

> It seems that a dist test has timed out after 20 minutes, and the buildlet has then exited without sending an error status back to the coordinator. The builder machine will then have rebooted, breaking the TCP connection, and the coordinator sees a "communication error".

Hrm. Is there a way to set that timeout to something longer?

> Wouldn't it be better if the buildlet sent back an explicit error to the coordinator in such cases, so the test run could be retried?

IMO it would not be appropriate to retry the test: if it times out on one run, what's to stop it from timing out on the next one? Moreover, if we do 20 minutes of work and then time out and retry, we've just wasted 20 minutes of builder time that could have been put to more productive use. (#42699 is closely related.)

@millerresearch
Contributor

> Hrm. Is there a way to set that timeout to something longer?

Bad idea, I think. The test of {fmt, go/ast, go/build} on plan9-arm normally takes about 50 seconds. If it's timing out after 20 minutes, that's not just slow, it's stalled. Lengthening the timeout would just waste more time.

@millerresearch
Contributor

> IMO it would not be appropriate to retry the test: if it times out on one run, what's to stop it from timing out on the next one?

Whenever I do a manual retry using the retrybuilds command after a communication error failure, the next attempt always succeeds. My strong hunch is that something in the underlying platform is stalling non-deterministically, rather than anything in the Go code.

I will set up a process on my local builders to monitor progress on the log output file. If nothing is emitted for, say, 15 minutes, it will send an alert so I can go in with the debugger and try to find out what's stalled.
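
A watchdog along these lines could be sketched as follows. This is an illustration under stated assumptions, not the monitor actually deployed on the builders: it assumes the log file's modification time advances whenever output is written, and it prints to standard output in place of a real alert. The names `stallLimit`, `isStalled` and `logStalled` are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// stallLimit is the quiet period after which the log counts as stalled.
const stallLimit = 15 * time.Minute

// isStalled reports whether the idle time has reached the limit.
func isStalled(idle, limit time.Duration) bool {
	return idle >= limit
}

// logStalled compares the log file's modification time against limit.
func logStalled(path string, limit time.Duration) (bool, error) {
	info, err := os.Stat(path)
	if err != nil {
		return false, err
	}
	return isStalled(time.Since(info.ModTime()), limit), nil
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: logwatch <logfile>")
		os.Exit(2)
	}
	// Poll once a minute; printing stands in for a real alert.
	for range time.Tick(time.Minute) {
		stalled, err := logStalled(os.Args[1], stallLimit)
		switch {
		case err != nil:
			fmt.Fprintln(os.Stderr, "logwatch:", err)
		case stalled:
			fmt.Printf("ALERT: %s unchanged for at least %v\n", os.Args[1], stallLimit)
		}
	}
}
```

Checking the modification time keeps the watchdog independent of the log format, at the cost of relying on the filesystem to report timestamps accurately.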

@heschi added the NeedsFix label May 9, 2022
@bcmills
Member Author

bcmills commented May 11, 2022

greplogs -l -e '\Aplan9-arm.*(\n.*)*communication error to buildlet' --since=2022-05-04
2022-05-10T18:37:58-508cb32/plan9-arm

@millerresearch
Contributor

I've found a likely cause: an assertion failure in the Plan 9 filesystem. While working on diagnosing that, I can try making it retry instead of halting, so it won't stall the builder.

@bcmills
Member Author

bcmills commented May 26, 2022

Is the retry in place? The builder seems a little more stable, but there's still a recent one of these.

greplogs -l -e '\Aplan9-arm.*(\n.*)*communication error to buildlet' --since=2022-05-11
2022-05-17T19:57:24-88c5324/plan9-arm

@millerresearch
Contributor

There was another failure mode: one of the Raspberry Pi builders had only 1 GB of RAM and no swap configured. I've added some swap space, so it should be more stable now.

@bcmills
Member Author

bcmills commented Jun 28, 2022

One more of these after the swap change:
greplogs --dashboard -md -l -e '(?ms)\Aplan9-arm.*communication error to buildlet' --since=2022-05-18
2022-06-06T18:37:38-fc97075/plan9-arm
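
The query above uses `(?ms)` where the earlier ones used an explicit `(\n.*)*` group. Assuming greplogs interprets patterns with Go's `regexp` syntax, the two forms should select the same logs. The sketch below checks that on three sample logs. The first is abbreviated from the report at the top of this issue, with its elided middle kept as-is. The other two are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Earlier form: match the first line, then any number of whole lines.
	oldPattern = regexp.MustCompile(`\Aplan9-arm.*(\n.*)*communication error to buildlet`)
	// Later form: the s flag lets . match newlines directly.
	newPattern = regexp.MustCompile(`(?ms)\Aplan9-arm.*communication error to buildlet`)
)

const (
	// failingLog is abbreviated from the report in this issue.
	failingLog = "plan9-arm at 349cc83389f71c459b7820b0deecdf81221ba46c\n…\ncommunication error to buildlet (promoted to terminal error)\n"
	// otherBuilderLog and passingLog are hypothetical samples.
	otherBuilderLog = "linux-amd64 at 349cc83389f71c459b7820b0deecdf81221ba46c\ncommunication error to buildlet\n"
	passingLog      = "plan9-arm at 349cc83389f71c459b7820b0deecdf81221ba46c\nALL TESTS PASSED\n"
)

// matchesBoth reports how each pattern classifies the log.
func matchesBoth(log string) (oldMatch, newMatch bool) {
	return oldPattern.MatchString(log), newPattern.MatchString(log)
}

func main() {
	for _, log := range []string{failingLog, otherBuilderLog, passingLog} {
		o, n := matchesBoth(log)
		fmt.Printf("old=%v new=%v\n", o, n)
	}
}
```

In both forms `\A` anchors the builder name to the very start of the log, so a communication error on another builder is not counted.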

@millerresearch
Contributor

I'd like to propose excusing the plan9-arm builders from the repeatedCommunicationError check. The commit introducing this check was labelled "don't let plan9 buildlets retry forever", to solve an issue where repeated retries on plan9-386 virtual machines would "waste our resources" (issue #31261).

The plan9-arm builders, however, run on real hardware rather than on virtual machines. That hardware is not particularly enterprise-quality: a cluster of Raspberry Pi boards sharing a power supply. Sometimes a glitch in the hardware or the Plan 9 filesystem causes a builder to crash or reboot, leading to this "communication error" failure. In my experience, retrying a test after such a failure invariably succeeds. The check is therefore not saving resources by preventing a "retry forever"; it is just a nuisance that prevents a successful automatic retry.

If this is acceptable, I'll submit a CL to x/build/cmd/coordinator to remove plan9-arm from this check.
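
The shape of the proposed exemption can be sketched as below. This is a hypothetical illustration, not the coordinator's code: `retryExempt` and `promoteToTerminal` are invented names, and the sketch assumes the existing check applies only to builders whose names begin with `plan9-`.

```go
package main

import (
	"fmt"
	"strings"
)

// retryExempt lists builders on real hardware, where a crash or reboot is
// assumed to be transient. Hypothetical; not the coordinator's own data.
var retryExempt = map[string]bool{
	"plan9-arm": true,
}

// promoteToTerminal reports whether a repeated communication error on the
// named builder should fail the build instead of being retried.
// Assumption: the check covers only builders named "plan9-...".
func promoteToTerminal(builder string) bool {
	if !strings.HasPrefix(builder, "plan9-") {
		return false
	}
	return !retryExempt[builder]
}

func main() {
	for _, b := range []string{"plan9-386", "plan9-arm", "linux-amd64"} {
		fmt.Printf("%-12s promote=%v\n", b, promoteToTerminal(b))
	}
}
```

Keeping the exemption as a small lookup leaves the behavior for plan9-386 unchanged, which is the case the original check was meant to address.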

@gopherbot

Found new dashboard test flakes for:

#!watchflakes
post <- builder == "plan9-arm" && `communication error to buildlet`
2022-12-01 21:00 plan9-arm go@93587d35 (log)
communication error to buildlet (promoted to terminal error): network error promoted to terminal error: runTests: dist test failed: all buildlets had network errors or timeouts, yet tests remain
2022-12-08 18:29 plan9-arm go@7973b0e5 (log)
communication error to buildlet (promoted to terminal error): network error promoted to terminal error: runTests: dist test failed: all buildlets had network errors or timeouts, yet tests remain

watchflakes

@gopherbot

Found new dashboard test flakes for:

#!watchflakes
post <- builder == "plan9-arm" && `communication error to buildlet`
2023-01-17 18:21 plan9-arm go@9088c691 (log)
---
Installed Go for plan9/arm in /boot/workdir/go
Installed commands in /boot/workdir/go/bin
*** You need to bind /boot/workdir/go/bin before /bin.

communication error to buildlet (promoted to terminal error): network error promoted to terminal error: runTests: dist test failed: all buildlets had network errors or timeouts, yet tests remain
2023-01-17 19:53 plan9-arm go@526b8956 (log)
communication error to buildlet (promoted to terminal error): network error promoted to terminal error: runTests: dist test failed: all buildlets had network errors or timeouts, yet tests remain

watchflakes

@gopherbot

Found new dashboard test flakes for:

#!watchflakes
post <- builder == "plan9-arm" && `communication error to buildlet`
2023-02-01 21:30 plan9-arm go@cda461bb (log)
communication error to buildlet (promoted to terminal error): network error promoted to terminal error: runTests: dist test failed: all buildlets had network errors or timeouts, yet tests remain

watchflakes

@gopherbot

Change https://go.dev/cl/470355 mentions this issue: x/build/cmd/coordinator: exempt plan9-arm from repeatedCommunicationError

@gopherbot

Found new dashboard test flakes for:

#!watchflakes
post <- builder == "plan9-arm" && `communication error to buildlet`
2023-02-22 21:40 plan9-arm go@06b67591 (log)
communication error to buildlet (promoted to terminal error): network error promoted to terminal error: runTests: dist test failed: all buildlets had network errors or timeouts, yet tests remain
2023-02-22 23:19 plan9-arm go@e7cfcda6 (log)
communication error to buildlet (promoted to terminal error): network error promoted to terminal error: runTests: dist test failed: all buildlets had network errors or timeouts, yet tests remain

watchflakes

@gopherbot reopened this Feb 24, 2023
@bcmills
Member Author

bcmills commented Feb 24, 2023

This is probably pending a redeploy of cmd/coordinator.

@bcmills closed this as completed Feb 24, 2023
@gopherbot reopened this Mar 8, 2023
@gopherbot

Found new dashboard test flakes for:

#!watchflakes
post <- builder == "plan9-arm" && `communication error to buildlet`
2023-02-28 01:11 plan9-arm go@7a0799b2 (log)
---
Installed Go for plan9/arm in /boot/workdir/go
Installed commands in /boot/workdir/go/bin
*** You need to bind /boot/workdir/go/bin before /bin.

communication error to buildlet (promoted to terminal error): network error promoted to terminal error: runTests: dist test failed: all buildlets had network errors or timeouts, yet tests remain

watchflakes

@millerresearch
Contributor

Has the cmd/coordinator been updated yet?

@gopherbot

Found new dashboard test flakes for:

#!watchflakes
post <- builder == "plan9-arm" && `communication error to buildlet`
2023-01-31 19:45 plan9-arm go@780db9a6 (log)
communication error to buildlet (promoted to terminal error): network error promoted to terminal error: runTests: dist test failed: all buildlets had network errors or timeouts, yet tests remain

watchflakes

@bcmills
Member Author

bcmills commented Mar 16, 2023

> Has the cmd/coordinator been updated yet?

I think so, yes.

@bcmills added the WaitingForInfo label Mar 16, 2023
@gopherbot

Timed out in state WaitingForInfo. Closing.

(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)
