Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/build/cmd/gitmirror: 10 minute timeout on git fetch is incorrectly implemented, so git fetch may still block forever #38887

Open
dmitshur opened this issue May 5, 2020 · 12 comments

Comments

@dmitshur
Copy link
Contributor

@dmitshur dmitshur commented May 5, 2020

Every now and then, gitmirror gets into a bad state where one of the repositories (usually, the main "go" repo) gets stuck and reports a "repo go: hung? no activity since last success ago" problem on https://farmer.golang.org/#health.

This problem happens rarely (some number of times a year~). There are two instances of gitmirror in production, so as long as at least one is in a good state, mirroring continues to operate without a problem. When it happens, it's easy to spot via https://farmer.golang.org/#health and restart an instance by deleting the problematic pod (a new pod will automatically spin up by the replication controller).

This is a tracking issue for this problem to see how often it happens, investigate as needed and fix it.

Previously:

Relevant CLs:

It happened again today:

image

I've captured some logs and restarted one of the instances (having two bad instances meant that the Go repo was no longer being mirrored). I'll restart the other one in a bit, after doing some more investigation.

/cc @cagedmantis @toothrot @bcmills

@dmitshur
Copy link
Contributor Author

@dmitshur dmitshur commented May 5, 2020

This time, I was able to gather enough information and figure it out! 🎉

CL 203057 added a 10 minute timeout to the git fetch command execution by using exec.CommandContext. My intention there was to not let it run for more than 10 minutes, ever. I was not aware that Wait may still continue to block if a the command runs a subprocess. git fetch is a high level git command, it runs a lower level one (e.g., git-fetch-pack or so). As a result, it was still possible for git fetch to block forever in some cases.

Also see #21922, #22485, and other similar issues.

There's an accepted proposal #23019 to try to change Wait behavior.

Details

The /debug/watcher/go endpoint of gitmirror reported:

2020-05-05T20:40:22Z   2h44m17s ago         running git fetch origin, attempt 2
2020-05-05T20:37:29Z   2h47m11s ago         running git fetch origin
2020-05-05T20:37:29Z   2h47m11s ago         got update tickle
2020-05-05T20:37:29Z   2h47m11s ago         waiting
2020-05-05T20:37:29Z   2h47m11s ago         did sync to github
2020-05-05T20:37:28Z   2h47m12s ago         syncing to github
2020-05-05T20:37:28Z   2h47m12s ago         ran git fetch
2020-05-05T20:34:08Z   2h50m31s ago         running git fetch origin

"attempt 2" meant attempt 1 failed, and this was the corresponding error message:

2020/05/05 20:40:17 go: git fetch: exit status 128

error: RPC failed; HTTP 502 curl 22 The requested URL returned error: 502 
fatal: the remote end hung up unexpectedly

It seems the Gerrit server was having a temporary issue at that moment. It's likely attempt 2 failed for a similar reason, except it got stuck.

The goroutine dump (via the /debug/goroutines endpoint) was very insightful:

goroutine 151 [chan receive, 160 minutes]:
os/exec.(*Cmd).Wait(0xc00040ab00, 0x0, 0x0)
	/usr/local/go/src/os/exec/exec.go:515 +0x125
os/exec.(*Cmd).Run(0xc00040ab00, 0xc00009ce70, 0xc00040ab00)
	/usr/local/go/src/os/exec/exec.go:341 +0x5c
os/exec.(*Cmd).CombinedOutput(0xc00040ab00, 0x3, 0xc00039fd68, 0x3, 0x3, 0xc00040ab00)
	/usr/local/go/src/os/exec/exec.go:567 +0x91
main.(*Repo).fetch.func2(0x0, 0x0)
	/go/src/golang.org/x/build/cmd/gitmirror/gitmirror.go:508 +0x1c0
main.try(0x3, 0xc00039fe08, 0x18, 0xef32a0)
	/go/src/golang.org/x/build/cmd/gitmirror/gitmirror.go:642 +0x63
main.(*Repo).fetch(0xc000360000, 0x0, 0x0)
	/go/src/golang.org/x/build/cmd/gitmirror/gitmirror.go:499 +0xd6
main.(*Repo).Loop(0xc000360000)
	/go/src/golang.org/x/build/cmd/gitmirror/gitmirror.go:448 +0x7f
main.runGitMirror.func1(0xc00002e7d6, 0x2)
	/go/src/golang.org/x/build/cmd/gitmirror/gitmirror.go:153 +0x2bb
created by main.runGitMirror
	/go/src/golang.org/x/build/cmd/gitmirror/gitmirror.go:162 +0x342

Line 515 of src/os/exec/exec.go in Go 1.14.2 is:

if err := <-c.errch; err != nil && copyError == nil {

Note that it's stuck in cmd.Wait for 160 minutes.

Until #23019 is resolved, the fix in cmd/gitmirror is to ensure git fetch isn't allowed to run for more than 10 minutes in a way that is functional.

@dmitshur dmitshur changed the title x/build/cmd/gitmirror: track, investigate, and resolve rare cases of repo hanging x/build/cmd/gitmirror: 10 minute timeout on git fetch is incorrectly implemented, so git fetch may still block forever May 5, 2020
@toothrot
Copy link
Contributor

@toothrot toothrot commented May 6, 2020

It seems like the best thing to do is to read the output of the command with a pipe, rather than CombinedOutput, and closing our pipe after the timeout.

Is it possible that sending a sigint before the CommandContext's sigkill could help? It may give git fetch an opportunity to kill it's subprocesses, assuming git is well behaved.

@bcmills introduced a nice function in Playground for a related issue here: https://github.com/golang/playground/blob/master/internal/internal.go#L14-L18

@bcmills
Copy link
Member

@bcmills bcmills commented May 6, 2020

This seems like another case where the behavior of exec.CommandContext is very suboptimal.

Per https://pkg.go.dev/os/exec?tab=doc#CommandContext:

The provided context is used to kill the process (by calling os.Process.Kill) …

and https://pkg.go.dev/os?tab=doc#Process.Kill:

This only kills the Process itself, not any other processes it may have started.


That process-only behavior is generally ok for SIGINT, because the process has the opportunity to do its own signal-propagation to its subprocesses.

However, with SIGKILL the process has no such opportunity to forward the signal, because the signal cannot be caught. (Instead, the alternative there is for the Go program to use the less-portable syscall.Kill to send the signal to the whole group.)

@dmitshur
Copy link
Contributor Author

@dmitshur dmitshur commented May 12, 2020

I just noticed it happened again, only 6 days since the last time:

image

Because both instances were stuck on the Go repo, it meant that commits at https://go.googlesource.com/go weren't made available on the https://github.com/golang/go mirror during that time. Those commits were still being tested at https://build.golang.org though.

I restarted the two gitmirror instances to fix the problem now:

~ $ kubectl get pods | grep gitmirror
gitmirror-rc-7bsh4                         1/1     Running   0          6d20h
gitmirror-rc-l5cxp                         1/1     Running   0          6d23h
~ $ kubectl delete pod gitmirror-rc-7bsh4 && sleep 60 && kubectl delete pod gitmirror-rc-l5cxp
pod "gitmirror-rc-7bsh4" deleted
pod "gitmirror-rc-l5cxp" deleted

It's not hard, but if this continues to happen this frequently without us noticing, automating the fix will be more worthwhile.

@dmitshur
Copy link
Contributor Author

@dmitshur dmitshur commented Oct 29, 2020

Again, today:

image

That's 5 months 17 days since last reported occurrence.

Fixed using same procedure as in #38887 (comment).

CC @golang/release.

@dmitshur
Copy link
Contributor Author

@dmitshur dmitshur commented Dec 9, 2020

December edition:

image

That's 1 months 9 days since last reported occurrence.

Fixed using same procedure as in #38887 (comment).

@dmitshur
Copy link
Contributor Author

@dmitshur dmitshur commented Feb 26, 2021

February, 2021 edition:

image

2 months 17 days since last. Fixed with kubectl delete pod gitmirror-rc-6p6lf.

Doing some xkcd.com/1205 math here, assuming this task is done monthly and takes a couple of minutes, that gives a budget of a couple hours to automate this (that is, to fix the diagnosed bug). So both doing this by hand and automating are quite close in cost.

@dmitshur
Copy link
Contributor Author

@dmitshur dmitshur commented Apr 7, 2021

April 7, 2021 edition:

image

1 month 13 days since last. Fixed with kubectl delete pod gitmirror-rc-cbhf4.

@cagedmantis
Copy link
Contributor

@cagedmantis cagedmantis commented Apr 14, 2021

instance gitmirror-rc-fqrxd: repo exp: hung? no activity since last success 1h8m32.429381987s ago
#45554 was filed.

@dmitshur
Copy link
Contributor Author

@dmitshur dmitshur commented May 28, 2021

May 27, 2021 edition:

image

Fixed as before with kubectl delete pod gitmirror-rc-423aca9388b272046d4bc507a0414e1f-v27xm. (CC @heschi FYI; also see last paragraph in #38887 (comment).)

@gopherbot
Copy link

@gopherbot gopherbot commented Jun 7, 2021

Change https://golang.org/cl/325771 mentions this issue: cmd/gitmirror: kill subprocesses with SIGINT before SIGKILL

@heschi
Copy link
Contributor

@heschi heschi commented Jun 11, 2021

And again today.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
6 participants