x/build: gotip-windows-arm64 builders stops working #66962

Closed
cherrymui opened this issue Apr 22, 2024 · 2 comments
Labels
Builders x/build issues (builders, bots, dashboards) NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@cherrymui
Member

https://ci.chromium.org/ui/p/golang/builders/luci.golang.ci/gotip-windows-arm64
It seems all builders are offline.

cc @golang/release @thanm

@gopherbot gopherbot added the Builders x/build issues (builders, bots, dashboards) label Apr 22, 2024
@gopherbot gopherbot added this to the Unreleased milestone Apr 22, 2024
@cherrymui cherrymui added NeedsFix The path to resolution is known, but the work has not been done. Builders x/build issues (builders, bots, dashboards) and removed Builders x/build issues (builders, bots, dashboards) labels Apr 22, 2024
@thanm
Contributor

thanm commented Apr 23, 2024

I got an access grant and logged into the VMs to inspect them. Both were up (not hung or dead), but the "swarming" user was completely inactive, which is not supposed to happen if the systems are healthy. I inspected the system event logs but didn't see any red flags: the last log entry for anything useful done by swarming is on Apr 7th, and after that the user just vanishes.

I see this in the Apr 7th swarming bot log ("C:\Users\swarming.swarming\logs\bot_stdout.log.1"):

Found a previous bot, 11832 rebooting as a workaround for https://crbug.com/1061531
Sleeping for 300 secs

We have SWARMING_NEVER_REBOOT set to true for these VMs, but the code in question doesn't seem to respect that.
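To illustrate the kind of guard one would expect every reboot path to go through, here is a minimal sketch. It is not the actual swarming bot code: the function names `reboot_allowed` and `request_reboot`, the `do_reboot` callback, and the set of values treated as "true" are all assumptions made for illustration. Only the variable name SWARMING_NEVER_REBOOT comes from our setup.

```python
import os


def reboot_allowed(environ=None):
    """Return False when SWARMING_NEVER_REBOOT is set to a truthy value.

    The accepted spellings ("1", "true", "yes") are an assumption for this
    sketch, not the real bot's parsing rules.
    """
    env = os.environ if environ is None else environ
    value = env.get("SWARMING_NEVER_REBOOT", "")
    return value.strip().lower() not in ("1", "true", "yes")


def request_reboot(reason, environ=None, do_reboot=None):
    """Hypothetical single entry point that every reboot path would call.

    Returns True if the reboot was allowed (and do_reboot was invoked, if
    one was given), False if it was suppressed by the environment variable.
    """
    if not reboot_allowed(environ):
        print(f"Skipping reboot ({reason}): SWARMING_NEVER_REBOOT is set")
        return False
    if do_reboot is not None:
        do_reboot(reason)
    return True
```

If the "previous bot" workaround went through a check like this, the VM would log the reason and keep running instead of restarting.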

Of course that doesn't explain why we would have two copies of the swarming bot running at the same time in the first place. It's also a mystery why we don't get a proper auto-logon of the swarming user after this happens (when I restart manually, we don't seem to have this issue). If anyone has ideas on how to debug this, let me know.

I restarted both VMs and they seem to be processing jobs again.

@dmitshur
Contributor

From what I can tell, SWARMING_NEVER_REBOOT takes effect for the most frequent reasons that would otherwise cause a reboot, but it doesn't catch all of them. The swarming bot still seems to occasionally trigger a reboot in some edge cases.

We can try to catch and report those edge cases, and aim to get them fixed so the variable does what its name implies in all situations. Even then, some future cases may be missed and an unintended restart could still happen.
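One low-effort way to catch such cases might be to scan the bot's stdout logs for reboot messages. The sketch below is only an illustration: the helper name `find_reboot_lines` and the idea of matching on the word "rebooting" are assumptions based on the single log excerpt quoted earlier in this thread, so other reboot paths may log different wording.

```python
import re

# Matches the wording seen in the Apr 7th excerpt; other reboot paths may
# log something different, so treat this pattern as a starting point.
REBOOT_PATTERN = re.compile(r"\brebooting\b", re.IGNORECASE)


def find_reboot_lines(lines):
    """Return (line_number, text) pairs for log lines that mention a reboot."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if REBOOT_PATTERN.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


# The two lines quoted from bot_stdout.log.1 above.
sample = [
    "Found a previous bot, 11832 rebooting as a workaround for https://crbug.com/1061531\n",
    "Sleeping for 300 secs\n",
]

for number, text in find_reboot_lines(sample):
    print(f"{number}: {text}")
```

Any hit on a VM that has SWARMING_NEVER_REBOOT set would be a candidate edge case to report.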

Other options include making this builder come back automatically after a restart (i.e., removing the need to set the variable), or just handling the occasional restart manually when it happens.

Since the builders are now back online and working, let's close this particular issue. Thanks.
