x/build/cmd/coordinator: make logic for cleaning up TryBot runs that reach "0 builds remaining" more robust #43323

Open
dmitshur opened this issue Dec 22, 2020 · 2 comments

@dmitshur dmitshur commented Dec 22, 2020

During investigation of #43312 we observed that some TryBot runs would reach 0 builds remaining, yet wouldn't complete. We should try to understand the root cause and fix it.

CC @golang/release.

@dmitshur dmitshur commented Mar 3, 2021

Observed this again now. There were 6 trybot runs at https://farmer.golang.org/#trybots just now, and they were stuck at "Builds remaining: 0" for at least 5-10 minutes. Afterwards, most of them completed.

I was tailing the logs during that time and did not see anything that stood out as a problem.

The trybot run at https://farmer.golang.org/try?commit=f1347265 is still active at this moment despite having "Builds remaining: 0" for well over 10 minutes.

@dmitshur dmitshur commented Mar 3, 2021

I think I understand this now.

TryBot completion happens as part of a loop inside findTryWork. This is why TryBot runs take at least a few seconds to complete after reaching 0 builds left, and why runs weren't completing while findTryWork was broken in #43312.

In the case of https://farmer.golang.org/try?commit=f1347265, it was an unfortunate race condition involving these steps:

  1. the trybot run reached 0 builds left,
  2. it posted a trybot-result vote (but wasn't removed as an active trybot yet),
  3. I removed the trybot-result vote (coordinator still hadn't removed it as an active trybot),
  4. by the time it would be removed in the findTryWork loop, it stopped meeting the condition for a "finished run" because the trybot-result vote was missing.

Relevant code is coordinator.go#L1093-L1098.

In conclusion:

  1. The original problem observed on Dec 22 was caused entirely by #43312; it's not a problem otherwise.
  2. The problem observed on Mar 2 is due to me removing TryBot-Result vote "too quickly", without giving coordinator a chance to mark the trybot run as complete. A workaround is to also remove the TryBot-Run vote, wait a minute for the run to get cancelled, then restart it.

The fix is to improve the trybot completion logic, either by factoring it out of the findTryWork loop, or at least by adding a check that cancels the run even if ts.wantedAsOf == now when the number of builds remaining is non-positive.

@dmitshur changed the title on Mar 3, 2021, from "x/build/cmd/coordinator: understand why TryBot runs sometimes fail to complete after reaching 0 builds remaining" to "x/build/cmd/coordinator: make logic for cleaning up TryBot runs that reach "0 builds remaining" more robust"