x/build/cmd/gitmirror: 10 minute timeout on git fetch is incorrectly implemented, so git fetch may still block forever #38887
Every now and then,
This problem happens rarely (roughly a few times a year). There are two instances of gitmirror in production, so as long as at least one is in a good state, mirroring continues to operate without a problem. When it happens, it's easy to spot via https://farmer.golang.org/#health and to fix by deleting the problematic pod, which restarts the instance (the replication controller automatically spins up a new pod).
This is a tracking issue for this problem to see how often it happens, investigate as needed and fix it.
It happened again today:
I've captured some logs and restarted one of the instances (having two bad instances meant that the Go repo was no longer being mirrored). I'll restart the other one in a bit, after doing some more investigation.
This time, I was able to gather enough information and figure it out!
CL 203057 added a 10 minute timeout to the
There's an accepted proposal #23019 to try to change
"attempt 2" meant attempt 1 failed, and this was the corresponding error message:
It seems the Gerrit server was having a temporary issue at that moment. It's likely attempt 2 failed for a similar reason, except it got stuck.
The goroutine dump (via the
Line 515 of src/os/exec/exec.go in Go 1.14.2 is:
Note that it's stuck in
Until #23019 is resolved, the fix in
It seems like the best thing to do is to read the output of the command through a pipe, rather than with CombinedOutput, and to close our end of the pipe after the timeout.
Is it possible that sending a sigint before the CommandContext's sigkill could help? It may give
@bcmills introduced a nice function in Playground for a related issue here: https://github.com/golang/playground/blob/master/internal/internal.go#L14-L18
This seems like another case where the behavior of
That process-only behavior is generally ok for
I just noticed it happened again, only 6 days since the last time:
Because both instances were stuck on the Go repo, it meant that commits at https://go.googlesource.com/go weren't made available on the https://github.com/golang/go mirror during that time. Those commits were still being tested at https://build.golang.org though.
I restarted the two gitmirror instances to fix the problem now:
~ $ kubectl get pods | grep gitmirror
gitmirror-rc-7bsh4   1/1   Running   0   6d20h
gitmirror-rc-l5cxp   1/1   Running   0   6d23h
~ $ kubectl delete pod gitmirror-rc-7bsh4 && sleep 60 && kubectl delete pod gitmirror-rc-l5cxp
pod "gitmirror-rc-7bsh4" deleted
pod "gitmirror-rc-l5cxp" deleted
It's not hard, but if this continues to happen this frequently without us noticing, automating the fix will be more worthwhile.
February 2021 edition:
2 months 17 days since last. Fixed with
Doing some xkcd.com/1205 math here: assuming this task is done monthly and takes a couple of minutes, the five-year horizon that chart uses works out to 60 runs × 2 minutes, a budget of about two hours to automate this (that is, to fix the diagnosed bug). So doing this by hand and automating it are quite close in cost.