x/build: frequent "communication error to buildlet" failures on plan9-arm
#52677
Comments
I've reconfigured the plan9-arm cluster with a filesystem version which I hope is more stable. Looking at the local log for the most recent of these failures, I see it ends this way:
It seems that a [...]
Wouldn't it be better if the buildlet sent back an explicit error to the coordinator in such cases, so the test run could be retried?
Hrm. Is there a way to set that timeout to something longer?
IMO it would not be appropriate to retry the test: if it times out on one run, what's to stop it from timing out on the next one? And, moreover, if we do 20 minutes of work and then time out and retry, we've just wasted 20 minutes of builder time that could have been put to more productive use. (#42699 is closely related.)
Bad idea, I think. The test of {fmt, go/ast, go/build} on plan9-arm normally takes about 50 seconds. If it's timing out after 20 minutes, that's not just slow, it's stalled. Lengthening the timeout would just waste more time.
Whenever I do a manual retry using the [...]
I will set up a process on my local builders to monitor progress on the log output file. If nothing is emitted for, say, 15 minutes, it will send an alert so I can go in with the debugger and try to find out what's stalled.
I've found a likely cause: an assertion failure in the Plan 9 filesystem. While working on diagnosing that, I can try making it retry instead of halting, so it won't stall the builder.
Is the retry in place? The builder seems a little more stable, but there's still a recent one of these.
greplogs --dashboard -md -l -e '\Aplan9-arm.*(\n.*)*communication error to buildlet' --since=2022-01-01
2022-05-02T14:54:05-349cc83/plan9-arm
2022-04-27T14:23:28-f0c0e0f/plan9-arm
2022-04-26T02:28:58-17d7983/plan9-arm
2022-04-11T16:31:53-0179331/plan9-arm
2022-04-07T23:06:24-c451a02/plan9-arm
2022-04-05T14:15:59-62bceae/plan9-arm
2022-03-31T05:34:15-2b8178c/plan9-arm
2022-03-31T00:27:01-0a6ddcc/plan9-arm
2022-03-31T00:26:58-0775730/plan9-arm
2022-03-30T01:12:57-8fefeab/plan9-arm
2022-03-21T19:10:16-efbff6e/plan9-arm
2022-03-07T18:17:40-dcb6547/plan9-arm
2022-03-03T21:19:37-87a345c/plan9-arm
2022-03-01T19:32:51-44e92e1/plan9-arm
2022-02-25T00:25:34-b8b3196/plan9-arm
2022-02-01T18:15:07-125c5a3/plan9-arm
2022-01-27T21:25:18-ad345c2/plan9-arm
2022-01-19T16:33:11-985d97e/plan9-arm
2022-01-10T22:49:07-4ceb5a9/plan9-arm
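For readers unfamiliar with the pattern in the greplogs query above, here is a small sketch of how it behaves under Go's `regexp` package. `\A` anchors at the start of the text, so the log must begin with `plan9-arm`, and `(\n.*)*` skips any number of following lines before the error message. The two sample logs and the helper name `isWedgedLog` are made up for illustration; they are not real builder output.

```go
package main

import (
	"fmt"
	"regexp"
)

// The same pattern as in the greplogs query: a log whose first line starts
// with "plan9-arm" and that contains the error message on any later line
// (or later on the first line).
var wedged = regexp.MustCompile(`\Aplan9-arm.*(\n.*)*communication error to buildlet`)

// isWedgedLog reports whether a whole log text matches the pattern.
func isWedgedLog(log string) bool {
	return wedged.MatchString(log)
}

func main() {
	// Hypothetical log texts, invented for this example.
	hit := "plan9-arm at 349cc83\n\nbuilding...\nError: communication error to buildlet\n"
	miss := "linux-amd64 at 349cc83\ncommunication error to buildlet\n"

	fmt.Println(isWedgedLog(hit), isWedgedLog(miss)) // prints: true false
}
```

Because `.` does not match a newline by default in Go regular expressions, the explicit `(\n.*)*` group is what lets the match span multiple lines.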
@millerresearch, can something be done to prevent this builder from getting wedged?
(Compare #49756.)