x/build/cmd/coordinator: make logic for cleaning up TryBot runs that reach "0 builds remaining" more robust #43323

dmitshur opened this issue Dec 22, 2020 · 2 comments
Builders NeedsFix



dmitshur commented Dec 22, 2020

During investigation of #43312 we observed that some TryBot runs would reach 0 builds remaining, yet wouldn't complete. We should try to understand the root cause and fix it.

Edit: The root cause is well understood now, see comment from Mar 3.

CC @golang/release.

@dmitshur dmitshur added Builders NeedsInvestigation labels Dec 22, 2020
@dmitshur dmitshur added this to the Backlog milestone Dec 22, 2020

dmitshur commented Mar 3, 2021

Observed this again just now. There were 6 trybot runs active, and they were stuck at "Builds remaining: 0" for at least 5-10 minutes. Afterwards, most of them completed.

I was tailing the logs during that time and did not see anything that stood out as a problem.

One trybot run is still active at this moment, despite having shown "Builds remaining: 0" for well over 10 minutes.


dmitshur commented Mar 3, 2021

I think I understand this now.

TryBot completion happens as part of a loop inside findTryWork. This is why TryBot runs take at least a few seconds to complete after reaching 0 builds remaining, and why TryBot runs weren't completing when findTryWork was broken in #43312.

In this case, it was an unfortunate race condition involving these steps:

  1. the trybot run reached 0 builds remaining,
  2. it posted a TryBot-Result vote (but wasn't yet removed as an active trybot run),
  3. I removed the TryBot-Result vote (while the coordinator still hadn't removed the run as an active trybot run),
  4. by the time the run would have been removed in the findTryWork loop, it no longer met the condition for a "finished run" because the TryBot-Result vote was missing.
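
The steps above can be modeled with a small sketch. This is not the coordinator's actual code: the `trySet` type here is a pared-down, hypothetical stand-in, and `poll` and `hasResultVote` are names invented for illustration. It assumes, per the description above, that a run is only dropped from the active set when a polling pass no longer treats it as wanted, and that a missing TryBot-Result vote makes the run look wanted again.

```go
package main

import "fmt"

// trySet is a hypothetical, simplified stand-in for the coordinator's
// per-run state. Only the fields needed for this illustration exist.
type trySet struct {
	remaining  int // builds still running
	wantedAsOf int // last polling tick at which the run looked "wanted"
}

// poll models one pass of a findTryWork-style loop (an assumption, not the
// real implementation). A run whose change has no TryBot-Result vote still
// looks like pending work, so its wantedAsOf is refreshed. Runs that were
// not refreshed in this pass are removed from the active set.
func poll(active map[string]*trySet, now int, hasResultVote func(id string) bool) {
	for id, ts := range active {
		if !hasResultVote(id) {
			ts.wantedAsOf = now
		}
	}
	for id, ts := range active {
		if ts.wantedAsOf != now {
			delete(active, id) // normal cleanup path
		}
	}
}

func main() {
	// The run has finished all builds and already posted its vote,
	// but the vote is removed by hand before the next polling pass.
	active := map[string]*trySet{"cl": {remaining: 0}}

	poll(active, 1, func(string) bool { return false }) // vote already removed
	fmt.Println("active after vote removal:", len(active))

	poll(active, 2, func(string) bool { return true }) // vote left in place
	fmt.Println("active with vote present:", len(active))
}
```

In this model the first pass leaves the finished run in the active set, because nothing other than the vote tells the loop that the run is done; only the second pass, with the vote in place, cleans it up.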

Relevant code is coordinator.go#L1093-L1098.

In conclusion:

  1. The original problem observed on Dec 22 is entirely explained by #43312 (x/build/cmd/coordinator: failing to find TryBot work because it cannot handle multiple CLs with same Change-Id); it's not a problem otherwise.
  2. The problem observed on Mar 2 is due to me removing the TryBot-Result vote "too quickly", without giving the coordinator a chance to mark the trybot run as complete. A workaround is to also remove the Run-TryBot vote, wait a minute for the run to get cancelled, then restart it.

The fix is to make the trybot completion logic more robust, either by factoring it out of the findTryWork loop or, at minimum, by checking for and cancelling a run whose number of builds remaining is non-positive even if ts.wantedAsOf == now.
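
A minimal sketch of the second option, under stated assumptions: `trySet` is again a simplified, hypothetical type, and `sweep` is an invented helper rather than a function in the coordinator. The only idea it illustrates is that the cleanup condition also considers the remaining-build count, so a finished run is removed even when `wantedAsOf == now`.

```go
package main

import (
	"fmt"
	"sort"
)

// trySet is a hypothetical, simplified stand-in for the coordinator's
// per-run state.
type trySet struct {
	remaining  int // builds still running
	wantedAsOf int // last polling tick at which the run looked "wanted"
}

// sweep removes runs that are either no longer wanted or have no builds
// remaining, and returns the removed IDs in sorted order. The real
// coordinator would also cancel builds here; that part is omitted.
func sweep(active map[string]*trySet, now int) []string {
	var removed []string
	for id, ts := range active {
		stale := ts.wantedAsOf != now
		done := ts.remaining <= 0 // extra check: finished even if still "wanted"
		if stale || done {
			delete(active, id)
			removed = append(removed, id)
		}
	}
	sort.Strings(removed)
	return removed
}

func main() {
	active := map[string]*trySet{
		"done":      {remaining: 0, wantedAsOf: 5}, // finished, but still looks wanted
		"running":   {remaining: 2, wantedAsOf: 5}, // still in progress
		"abandoned": {remaining: 1, wantedAsOf: 4}, // no longer wanted
	}
	fmt.Println(sweep(active, 5)) // prints [abandoned done]
	fmt.Println(len(active))      // prints 1
}
```

With the extra `done` check, removing the vote early no longer leaves the run stranded in the active set. Factoring completion out of the loop entirely would avoid tying cleanup to the polling cadence at all, at the cost of a larger change.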

@dmitshur dmitshur changed the title x/build/cmd/coordinator: understand why TryBot runs sometimes fail to complete after reaching 0 builds remaining x/build/cmd/coordinator: make logic for cleaning up TryBot runs that reach "0 builds remaining" more robust Mar 3, 2021
@dmitshur dmitshur added NeedsFix and removed NeedsInvestigation labels Mar 3, 2021