Observed this again just now. There were 6 trybot runs at https://farmer.golang.org/#trybots, and they were stuck at "Builds remaining: 0" for at least 5-10 minutes. Afterwards, most of them completed.
I was tailing the logs during that time and did not see anything that stood out as a problem.
TryBot completion happens as part of a loop inside findTryWork. This is why TryBots take at least a few seconds to complete after reaching 0 builds remaining, and why TryBots weren't completing when findTryWork was broken in #43312.
The problem observed on Mar 2 was due to me removing the TryBot-Result vote "too quickly", without giving the coordinator a chance to mark the trybot run as complete. A workaround is to also remove the TryBot-Run vote, wait a minute for the run to get cancelled, then restart it.
The fix is to improve the TryBot completion logic, either by factoring it out of the findTryWork loop, or at least by adding a check that cancels the run when the number of builds remaining is non-positive, even if ts.wantedAsOf == now.
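A minimal sketch of the second option, to make the proposed check concrete. This is not the coordinator's actual code: the trySet type here is a simplified stand-in, the remaining and done fields and the shouldFinish helper are hypothetical names, and only wantedAsOf comes from the discussion above. It assumes the loop currently finishes a run only when its wantedAsOf differs from the current poll time.

```go
package main

import (
	"fmt"
	"time"
)

// trySet is a hypothetical, simplified stand-in for the coordinator's
// per-run state. Only wantedAsOf is named in the discussion; the other
// fields are assumptions made for this sketch.
type trySet struct {
	wantedAsOf time.Time // poll time at which the run was last seen as wanted
	remaining  int       // builds that have not finished yet (assumed field)
	done       bool      // run already completed or cancelled (assumed field)
}

// shouldFinish reports whether the poll at time now should complete or
// cancel the run.
func shouldFinish(ts *trySet, now time.Time) bool {
	if ts.done {
		return false // nothing left to do
	}
	if ts.remaining <= 0 {
		// Proposed extra check: finish the run even if it is still
		// reported as wanted (ts.wantedAsOf == now).
		return true
	}
	// Assumed existing behavior: finish only runs that the current
	// poll no longer reports as wanted.
	return !ts.wantedAsOf.Equal(now)
}

func main() {
	now := time.Now()

	// Still wanted but with no builds left: the stuck case described above.
	stuck := &trySet{wantedAsOf: now, remaining: 0}
	fmt.Println("stuck run should finish:", shouldFinish(stuck, now))

	// Still wanted and still building: must be left alone.
	active := &trySet{wantedAsOf: now, remaining: 3}
	fmt.Println("active run should finish:", shouldFinish(active, now))
}
```

Keeping the decision in a small pure function like this would also make it easier to factor completion out of the findTryWork loop later, which is the first option mentioned above.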
changed the title to "x/build/cmd/coordinator: understand why TryBot runs sometimes fail to complete after reaching 0 builds remaining" (Mar 3, 2021)