x/build/cmd/coordinator: bound the number of times coordinator can reschedule a failed build due to buildlet disappearing #31261

Open
bradfitz opened this issue Apr 4, 2019 · 10 comments

Comments

@bradfitz
Member

commented Apr 4, 2019

The plan9-386 builders are dying halfway through (buildlet crashing?) and restarting the build forever.

We should bound the number of retries in the coordinator so plan9's flakiness doesn't waste our resources forever.

/cc @0intro @dmitshur

bradfitz self-assigned this Apr 4, 2019

gopherbot added this to the Unreleased milestone Apr 4, 2019

gopherbot added the Builders label Apr 4, 2019

@0intro

Member

commented Apr 4, 2019

Yes, it seems so. Do you have more information about what is happening? I couldn't reproduce this issue on my 386 Plan 9 servers at home. Was the buildlet recently updated?

@bradfitz

Member Author

commented Apr 4, 2019

> Do you have more information about what is happening?

I don't. I haven't caught it in the act yet.

> Was the buildlet recently updated?

Yes. There have been a few updates lately.

Does master Go not work on plan9? I might've built the buildlets with master instead of Go 1.12.

@bradfitz

Member Author

commented Apr 4, 2019

I've rebuilt the buildlet with Go 1.12.1.

@0intro

Member

commented Apr 4, 2019

It might be worth trying to build the buildlet with Go 1.12.

There is a plan9/386 buildlet binary built with Go 1.12.1 available here:

http://9legacy.org/download/go/buildlet/386/buildlet

@bradfitz

Member Author

commented Apr 4, 2019

Some logs:

2019/04/04 17:41:22 {plan9-386 1abf3aa55bb8b346bb1575ac8db5022f215df65a  } failed: runTests: dist test failed: all buildlets had network errors or timeouts, yet tests remain
2019/04/04 17:53:24 Buildlet http://10.240.0.127 GCE VM: buildlet-plan9-386-gce-rnd773f75 failed three heartbeats; final error: timeout waiting for headers
2019/04/04 17:53:24 [build plan9-386 cf8cc7f63c7ddefb666a6e8d99a4843d3277db9f]: Execution error running go_test:cmd/go/internal/module on http://10.240.0.127 GCE VM: buildlet-plan9-386-gce-rnd773f75: Buildlet http://10.240.0.127 GCE VM: buildlet-plan9-386-gce-rnd773f75 failed heartbeat after 10.000332173s; marking dead; err=timeout waiting for headers (numFails = 1)
2019/04/04 17:54:25 Buildlet http://10.240.0.54 GCE VM: buildlet-plan9-386-gce-rn4fcb9cf failed three heartbeats; final error: timeout waiting for headers
2019/04/04 17:54:25 [build plan9-386 cf8cc7f63c7ddefb666a6e8d99a4843d3277db9f]: Execution error running go_test:cmd/link/internal/sym on http://10.240.0.54 GCE VM: buildlet-plan9-386-gce-rn4fcb9cf: Buildlet http://10.240.0.54 GCE VM: buildlet-plan9-386-gce-rn4fcb9cf failed heartbeat after 10.000348415s; marking dead; err=timeout waiting for headers (numFails = 1)
2019/04/04 17:54:25 {plan9-386 cf8cc7f63c7ddefb666a6e8d99a4843d3277db9f  } failed: runTests: dist test failed: all buildlets had network errors or timeouts, yet tests remain
2019/04/04 18:01:24 Failed to create VM for host-plan9-386-gce: buildlet didn't come up at http://10.240.0.133 in 5m0s
2019/04/04 18:01:25 failed to get a host-plan9-386-gce buildlet: buildlet didn't come up at http://10.240.0.133 in 5m0s
2019/04/04 18:30:38 Buildlet http://10.240.0.97 GCE VM: buildlet-plan9-386-gce-rnd7d2bff failed three heartbeats; final error: timeout waiting for headers
2019/04/04 18:30:38 [build plan9-386 1abf3aa55bb8b346bb1575ac8db5022f215df65a]: Execution error running go_test:cmd/go/internal/modfetch/codehost on http://10.240.0.97 GCE VM: buildlet-plan9-386-gce-rnd7d2bff: Buildlet http://10.240.0.97 GCE VM: buildlet-plan9-386-gce-rnd7d2bff failed heartbeat after 10.000405754s; marking dead; err=timeout waiting for headers (numFails = 1)
2019/04/04 18:31:36 Buildlet http://10.240.0.101 GCE VM: buildlet-plan9-386-gce-rneed6676 failed three heartbeats; final error: timeout waiting for headers
2019/04/04 18:31:36 [build plan9-386 1abf3aa55bb8b346bb1575ac8db5022f215df65a]: Execution error running go_test:cmd/trace on http://10.240.0.101 GCE VM: buildlet-plan9-386-gce-rneed6676: Buildlet http://10.240.0.101 GCE VM: buildlet-plan9-386-gce-rneed6676 failed heartbeat after 10.000325062s; marking dead; err=timeout waiting for headers (numFails = 1)
2019/04/04 18:31:36 {plan9-386 1abf3aa55bb8b346bb1575ac8db5022f215df65a  } failed: runTests: dist test failed: all buildlets had network errors or timeouts, yet tests remain
2019/04/04 18:42:49 Buildlet http://10.240.0.112 GCE VM: buildlet-plan9-386-gce-rn51f9016 failed three heartbeats; final error: timeout waiting for headers
2019/04/04 18:42:49 [build plan9-386 cf8cc7f63c7ddefb666a6e8d99a4843d3277db9f]: Execution error running go_test:cmd/addr2line on http://10.240.0.112 GCE VM: buildlet-plan9-386-gce-rn51f9016: Buildlet http://10.240.0.112 GCE VM: buildlet-plan9-386-gce-rn51f9016 failed heartbeat after 10.000413718s; marking dead; err=timeout waiting for headers (numFails = 1)
2019/04/04 18:42:49 {plan9-386 cf8cc7f63c7ddefb666a6e8d99a4843d3277db9f  } failed: runTests: dist test failed: all buildlets had network errors or timeouts, yet tests remain
2019/04/04 19:21:06 Buildlet http://10.240.0.10 GCE VM: buildlet-plan9-386-gce-rnaabef4d failed three heartbeats; final error: timeout waiting for headers
2019/04/04 19:21:06 [build plan9-386 bead358611e36fe0991c171a8a4a4924f4f0e584]: Execution error running go_test:runtime/trace on http://10.240.0.10 GCE VM: buildlet-plan9-386-gce-rnaabef4d: Buildlet http://10.240.0.10 GCE VM: buildlet-plan9-386-gce-rnaabef4d failed heartbeat after 10.000431191s; marking dead; err=timeout waiting for headers (numFails = 1)
2019/04/04 19:22:06 Buildlet http://10.240.0.77 GCE VM: buildlet-plan9-386-gce-rn0835b4c failed three heartbeats; final error: timeout waiting for headers
2019/04/04 19:22:06 [build plan9-386 bead358611e36fe0991c171a8a4a4924f4f0e584]: Execution error running test:6_10 on http://10.240.0.77 GCE VM: buildlet-plan9-386-gce-rn0835b4c: Buildlet http://10.240.0.77 GCE VM: buildlet-plan9-386-gce-rn0835b4c failed heartbeat after 10.000437184s; marking dead; err=timeout waiting for headers (numFails = 1)
2019/04/04 19:22:06 {plan9-386 bead358611e36fe0991c171a8a4a4924f4f0e584  } failed: runTests: dist test failed: all buildlets had network errors or timeouts, yet tests remain
2019/04/04 19:31:17 Buildlet http://10.240.0.46 GCE VM: buildlet-plan9-386-gce-rnb28d877 failed three heartbeats; final error: timeout waiting for headers
2019/04/04 19:31:17 [build plan9-386 cf8cc7f63c7ddefb666a6e8d99a4843d3277db9f]: Execution error running go_test:cmd/go/internal/generate on http://10.240.0.46 GCE VM: buildlet-plan9-386-gce-rnb28d877: Buildlet http://10.240.0.46 GCE VM: buildlet-plan9-386-gce-rnb28d877 failed heartbeat after 10.000503557s; marking dead; err=timeout waiting for headers (numFails = 1)
2019/04/04 19:33:18 Buildlet http://10.240.0.101 GCE VM: buildlet-plan9-386-gce-rn1ba77dc failed three heartbeats; final error: timeout waiting for headers
2019/04/04 19:33:18 [build plan9-386 cf8cc7f63c7ddefb666a6e8d99a4843d3277db9f]: Execution error running go_test:cmd/link on http://10.240.0.101 GCE VM: buildlet-plan9-386-gce-rn1ba77dc: Buildlet http://10.240.0.101 GCE VM: buildlet-plan9-386-gce-rn1ba77dc failed heartbeat after 10.000471533s; marking dead; err=timeout waiting for headers (numFails = 1)
2019/04/04 19:33:18 {plan9-386 cf8cc7f63c7ddefb666a6e8d99a4843d3277db9f  } failed: runTests: dist test failed: all buildlets had network errors or timeouts, yet tests remain
2019/04/04 20:10:40 Buildlet http://10.240.0.125 GCE VM: buildlet-plan9-386-gce-rnb804613 failed three heartbeats; final error: timeout waiting for headers
2019/04/04 20:10:40 [build plan9-386 bead358611e36fe0991c171a8a4a4924f4f0e584]: Execution error running go_test:cmd/go/internal/load on http://10.240.0.125 GCE VM: buildlet-plan9-386-gce-rnb804613: Buildlet http://10.240.0.125 GCE VM: buildlet-plan9-386-gce-rnb804613 failed heartbeat after 10.000458s; marking dead; err=timeout waiting for headers (numFails = 1)
2019/04/04 20:12:59 Buildlet http://10.240.0.95 GCE VM: buildlet-plan9-386-gce-rn1dcf17d failed three heartbeats; final error: timeout waiting for headers
2019/04/04 20:12:59 [build plan9-386 bead358611e36fe0991c171a8a4a4924f4f0e584]: Execution error running go_test:cmd/trace on http://10.240.0.95 GCE VM: buildlet-plan9-386-gce-rn1dcf17d: Buildlet http://10.240.0.95 GCE VM: buildlet-plan9-386-gce-rn1dcf17d failed heartbeat after 10.000334803s; marking dead; err=timeout waiting for headers (numFails = 1)
2019/04/04 20:12:59 {plan9-386 bead358611e36fe0991c171a8a4a4924f4f0e584  } failed: runTests: dist test failed: all buildlets had network errors or timeouts, yet tests remain
2019/04/04 20:22:05 Buildlet http://10.240.0.29 GCE VM: buildlet-plan9-386-gce-rnb58c6de failed three heartbeats; final error: timeout waiting for headers
2019/04/04 20:22:06 [build plan9-386 cf8cc7f63c7ddefb666a6e8d99a4843d3277db9f]: Execution error running go_test:cmd/go/internal/modconv on http://10.240.0.29 GCE VM: buildlet-plan9-386-gce-rnb58c6de: Buildlet http://10.240.0.29 GCE VM: buildlet-plan9-386-gce-rnb58c6de failed heartbeat after 10.000421105s; marking dead; err=timeout waiting for headers (numFails = 1)
2019/04/04 20:23:05 Buildlet http://10.240.0.31 GCE VM: buildlet-plan9-386-gce-rn11ca9fe failed three heartbeats; final error: timeout waiting for headers
2019/04/04 20:23:05 [build plan9-386 cf8cc7f63c7ddefb666a6e8d99a4843d3277db9f]: Execution error running go_test:cmd/link on http://10.240.0.31 GCE VM: buildlet-plan9-386-gce-rn11ca9fe: Buildlet http://10.240.0.31 GCE VM: buildlet-plan9-386-gce-rn11ca9fe failed heartbeat after 10.000445149s; marking dead; err=timeout waiting for headers (numFails = 1)
2019/04/04 20:23:05 {plan9-386 cf8cc7f63c7ddefb666a6e8d99a4843d3277db9f  } failed: runTests: dist test failed: all buildlets had network errors or timeouts, yet tests remain
@bradfitz

Member Author

commented Apr 4, 2019

> There is a plan9/386 buildlet binary built with Go 1.12.1 available here:
>
> http://9legacy.org/download/go/buildlet/386/buildlet

All of the buildlets are stored in a GCS bucket next to each other and updated at the same time.

See the cmd/buildlet/Makefile.

@bradfitz

Member Author

commented Apr 4, 2019

The looping failures end with:

Error: runTests: dist test failed: all buildlets had network errors or timeouts, yet tests remain
@gopherbot


commented Apr 4, 2019

Change https://golang.org/cl/170781 mentions this issue: cmd/coordinator: don't let plan9 buildlets retry forever on heartbeat timeout

@bradfitz

Member Author

commented Apr 16, 2019

I'm going to have to stop the plan9 builders. It's a waste of resources & distracting to have them always building and resulting in crashes, loops, timeouts, or other failures.

@gopherbot


commented Apr 18, 2019

Change https://golang.org/cl/172797 mentions this issue: dashboard: disable plan9-386 builder

gopherbot pushed a commit to golang/build that referenced this issue Apr 18, 2019
dashboard: disable plan9-386 builder
It hasn't passed in months and now spins, wasting resources.

Updates golang/go#31261
Updates golang/go#29801

Change-Id: Idcf13ae915bad4febb156c5c5d49f07f76cf9d49
Reviewed-on: https://go-review.googlesource.com/c/build/+/172797
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
gopherbot pushed a commit to golang/build that referenced this issue Apr 19, 2019
cmd/coordinator: don't let plan9 buildlets retry forever on heartbeat timeout

Quick fix to stop wasting resources for now.

Updates golang/go#31261

Change-Id: Ia95988197313e5a8c7db3d8c557a8c7dd24b93ef
Reviewed-on: https://go-review.googlesource.com/c/build/+/170781
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>

bradfitz changed the title from "x/build/cmd/coordinator: plan9 builders looping, never failing" to "x/build/cmd/coordinator: bound the number of times coordinator can reschedule a failed build due to buildlet disappearing" May 29, 2019
