New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/build: trybots not running due to maintner resource temporarily unavailable errors #23705

Closed
bradfitz opened this Issue Feb 5, 2018 · 16 comments

Comments

Projects
None yet
4 participants
@bradfitz
Member

bradfitz commented Feb 5, 2018

Trybots aren't running because maintner isn't noticing that there's new stuff on Gerrit, due to:

2018/02/02 20:07:17 Temporary error from gerrit go.googlesource.com/example: running git remote -v in /cache/go.googlesource.com%2Fexample: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:18 gerrit go.googlesource.com/time: init: running git remote -v in /cache/go.googlesource.com%2Ftime: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:18 IS TEMP ERROR? *errors.errorString running git remote -v in /cache/go.googlesource.com%2Ftime: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:18 Temporary error from gerrit go.googlesource.com/time: running git remote -v in /cache/go.googlesource.com%2Ftime: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:18 gerrit go.googlesource.com/build: init: running git remote -v in /cache/go.googlesource.com%2Fbuild: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:18 IS TEMP ERROR? *errors.errorString running git remote -v in /cache/go.googlesource.com%2Fbuild: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:18 Temporary error from gerrit go.googlesource.com/build: running git remote -v in /cache/go.googlesource.com%2Fbuild: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:18 gerrit code.googlesource.com/gocloud: init: running git remote -v in /cache/code.googlesource.com%2Fgocloud: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:18 IS TEMP ERROR? *errors.errorString running git remote -v in /cache/code.googlesource.com%2Fgocloud: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:18 Temporary error from gerrit code.googlesource.com/gocloud: running git remote -v in /cache/code.googlesource.com%2Fgocloud: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:19 gerrit go.googlesource.com/gddo: init: running git remote -v in /cache/go.googlesource.com%2Fgddo: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:19 IS TEMP ERROR? *errors.errorString running git remote -v in /cache/go.googlesource.com%2Fgddo: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:19 Temporary error from gerrit go.googlesource.com/gddo: running git remote -v in /cache/go.googlesource.com%2Fgddo: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:20 gerrit go.googlesource.com/scratch: init: running git remote -v in /cache/go.googlesource.com%2Fscratch: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:20 IS TEMP ERROR? *errors.errorString running git remote -v in /cache/go.googlesource.com%2Fscratch: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:20 Temporary error from gerrit go.googlesource.com/scratch: running git remote -v in /cache/go.googlesource.com%2Fscratch: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:20 gerrit go.googlesource.com/proposal: init: running git remote -v in /cache/go.googlesource.com%2Fproposal: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:20 IS TEMP ERROR? *errors.errorString running git remote -v in /cache/go.googlesource.com%2Fproposal: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:20 Temporary error from gerrit go.googlesource.com/proposal: running git remote -v in /cache/go.googlesource.com%2Fproposal: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:20 gerrit go.googlesource.com/tour: init: running git remote -v in /cache/go.googlesource.com%2Ftour: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:20 IS TEMP ERROR? *errors.errorString running git remote -v in /cache/go.googlesource.com%2Ftour: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:20 Temporary error from gerrit go.googlesource.com/tour: running git remote -v in /cache/go.googlesource.com%2Ftour: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:21 gerrit go.googlesource.com/oauth2: init: running git remote -v in /cache/go.googlesource.com%2Foauth2: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:21 IS TEMP ERROR? *errors.errorString running git remote -v in /cache/go.googlesource.com%2Foauth2: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:21 Temporary error from gerrit go.googlesource.com/oauth2: running git remote -v in /cache/go.googlesource.com%2Foauth2: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:21 gerrit go.googlesource.com/exp: init: running git remote -v in /cache/go.googlesource.com%2Fexp: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:21 IS TEMP ERROR? *errors.errorString running git remote -v in /cache/go.googlesource.com%2Fexp: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:21 Temporary error from gerrit go.googlesource.com/exp: running git remote -v in /cache/go.googlesource.com%2Fexp: fork/exec /usr/bin/git: resource temporarily unavailable^C

This smells just like #23686 in a different binary.

Both of these were just updated for the LetsEncrypt ACME changes.

So what changed in the meantime?

What resource isn't available?

/cc @andybons @ianlancetaylor

@gopherbot gopherbot added this to the Unreleased milestone Feb 5, 2018

@gopherbot gopherbot added the Builders label Feb 5, 2018

@bradfitz

This comment has been minimized.

Member

bradfitz commented Feb 5, 2018

Can't even bash in to debug:

$ kubectl exec -ti maintnerd-deployment-2926465623-tjsxh /bin/bash
rpc error: code = 2 desc = "fork/exec /usr/bin/containerd-shim: resource temporarily unavailable"command terminated with exit code 126

Process limit? FD leak?

@bradfitz

This comment has been minimized.

Member

bradfitz commented Feb 5, 2018

Killed container and let it get restarted. Then ran bash in it and:

root@maintnerd-deployment-2926465623-7trps:/# ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 60077
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1048576
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1048576
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
@ianlancetaylor

This comment has been minimized.

Contributor

ianlancetaylor commented Feb 6, 2018

This may have happened again, as https://golang.org/cl/92155 has had no trybot initiation after ten minutes.

@ianlancetaylor

This comment has been minimized.

Contributor

ianlancetaylor commented Feb 6, 2018

The trybots did start running on CL 92155 after 32 minutes.

@bradfitz

This comment has been minimized.

Member

bradfitz commented Feb 8, 2018

Well, @andybons updated our GKE cluster from 1.5.6 to 1.8.whatever the latests GKE option is, and the errors persist, but they've changed. Now I see:

2018/02/08 17:00:25 Pubsub event on topic "gerrit:go.googlesource.com/go" discarded; already syncing?
2018/02/08 17:00:37 error fetching 9 hashes from gerrit project go.googlesource.com/go: 
2018/02/08 17:00:37 gerrit go.googlesource.com/go: sync: signal: killed; stderr=""
2018/02/08 17:00:37 IS TEMP ERROR? *errors.errorString signal: killed; stderr=""
2018/02/08 17:00:37 Temporary error from gerrit go.googlesource.com/go: signal: killed; stderr=""
2018/02/08 17:01:07 gerrit go.googlesource.com/go: git directory exists.
2018/02/08 17:01:10 gerrit go.googlesource.com/go: 9 new refs
2018/02/08 17:01:10 gerrit go.googlesource.com/go: fetching 9 hashes...
2018/02/08 17:01:11 error fetching 9 hashes from gerrit project go.googlesource.com/go: 
2018/02/08 17:01:11 gerrit go.googlesource.com/go: sync: signal: killed; stderr=""
2018/02/08 17:01:11 IS TEMP ERROR? *errors.errorString signal: killed; stderr=""
2018/02/08 17:01:11 Temporary error from gerrit go.googlesource.com/go: signal: killed; stderr=""
2018/02/08 17:01:14 Got pubsubhelper event: {
        "Time": "2018-02-08T17:01:14.022043045Z",
        "Gerrit": {
                "URL": "https://go-review.googlesource.com/92815",
                "Project": "go",
                "CommitHash": "d069c928b4e5108c10621d3190761abf6a5ece9a",
                "ChangeNumber": 92815
        }       
}       
2018/02/08 17:01:14 Pubsub event on topic "gerrit:go.googlesource.com/go" discarded; already syncing?
2018/02/08 17:01:41 gerrit go.googlesource.com/go: git directory exists.
2018/02/08 17:01:42 gerrit go.googlesource.com/go: sync: git fetch origin: signal: killed, 
2018/02/08 17:01:42 IS TEMP ERROR? *errors.errorString git fetch origin: signal: killed, 
2018/02/08 17:01:42 Temporary error from gerrit go.googlesource.com/go: git fetch origin: signal: killed, 

2018/02/08 17:02:12 gerrit go.googlesource.com/go: git directory exists.
2018/02/08 17:02:13 gerrit go.googlesource.com/go: sync: git fetch origin: signal: killed, 
2018/02/08 17:02:13 IS TEMP ERROR? *errors.errorString git fetch origin: signal: killed, 
2018/02/08 17:02:13 Temporary error from gerrit go.googlesource.com/go: git fetch origin: signal: killed, 

Not sure what Temporary error from gerrit go.googlesource.com/go: signal: killed; stderr="" means.

@bradfitz

This comment has been minimized.

Member

bradfitz commented Feb 8, 2018

Looks like something's leaking. Bunch of defunct processes:

root@maintnerd-deployment-2926465623-g8q6z:/# ps ax | wc -l
145
root@maintnerd-deployment-2926465623-g8q6z:/# ps ax | head -20
  PID TTY      STAT   TIME COMMAND
    1 ?        Ssl   40:02 /maintnerd --config=go --bucket=golang-maintner-log --verbose --data-dir=/cache --listen=:80 --autocert=maintner.golang.org --autocert-bucket=golang-maintner-autocert
 2611 ?        Z      0:00 [git-remote-http] <defunct>
 3986 ?        Z      0:00 [git-remote-http] <defunct>
 3990 ?        Z      0:00 [git-remote-http] <defunct>
 4012 ?        Z      0:00 [git-remote-http] <defunct>
 4038 ?        Z      0:00 [git-remote-http] <defunct>
 4042 ?        Z      0:00 [git-remote-http] <defunct>
 4093 ?        Z      0:00 [git-remote-http] <defunct>
 4250 ?        Z      0:00 [git-remote-http] <defunct>
 4258 ?        Z      0:00 [git-remote-http] <defunct>
 4265 ?        Z      0:00 [git] <defunct>
 4283 ?        Z      0:00 [git-remote-http] <defunct>
 4363 ?        Z      0:00 [git-remote-http] <defunct>
 4385 ?        Z      0:00 [git-remote-http] <defunct>
 4407 ?        Z      0:00 [git-remote-http] <defunct>
 4420 ?        Z      0:00 [git-remote-http] <defunct>
 4423 ?        Z      0:04 [git] <defunct>
 4464 ?        Z      0:00 [git-remote-http] <defunct>
 4468 ?        Z      0:00 [git-remote-http] <defunct>
@bradfitz

This comment has been minimized.

Member

bradfitz commented Feb 8, 2018

I have no idea where these defunct processes are coming from. I audited all the os/exec callers in golang.org/x/build/maintner/... and don't see anything suspicious.

@ianlancetaylor, any debugging tips?

@ianlancetaylor

This comment has been minimized.

Contributor

ianlancetaylor commented Feb 9, 2018

Sorry, no good ideas. Just the obvious one of double-checking that the parent process is what you think it is.

@bradfitz

This comment has been minimized.

Member

bradfitz commented Feb 9, 2018

Updates from research & chats with others.

This is a GKE issue. Apparently GKE 1.7 automatically reaped zombie processes like an init would, using a "pause container", but that was reverted in GKE 1.8 and hasn't been fixed in GKE 1.8 yet.

Some links:

https://news.ycombinator.com/item?id=14524253
kubernetes/kubernetes#1615
https://github.com/krallin/tini
https://stackoverflow.com/questions/43122080/how-to-use-init-parameter-in-docker-run

@gopherbot

This comment has been minimized.

gopherbot commented Feb 9, 2018

Change https://golang.org/cl/93077 mentions this issue: maintner/maintnerd: wrap maintnerd in tini init process to reap zombies

gopherbot pushed a commit to golang/build that referenced this issue Feb 9, 2018

maintner/maintnerd: wrap maintnerd in tini init process to reap zombies
Updates golang/go#23705

Change-Id: If0e7826ab75aae46dc7b79306d36d6cd3f07a041
Reviewed-on: https://go-review.googlesource.com/93077
Reviewed-by: Andrew Bonventre <andybons@golang.org>
@bradfitz

This comment has been minimized.

Member

bradfitz commented Feb 9, 2018

Deployed:

root@maintnerd-deployment-7fdd7bdfff-77dcp:/# ps afx
  PID TTY      STAT   TIME COMMAND
   18 ?        Ss     0:00 /bin/bash
   26 ?        R+     0:00  \_ ps afx
    1 ?        Ss     0:00 /tini -- /maintnerd --config=go --bucket=golang-maintner-log --verbose --data-dir=/cache --listen=:80 --autocert=maintner.golang
    7 ?        Sl     0:22 /maintnerd --config=go --bucket=golang-maintner-log --verbose --data-dir=/cache --listen=:80 --autocert=maintner.golang.org --au
@gopherbot

This comment has been minimized.

gopherbot commented Feb 9, 2018

Change https://golang.org/cl/93082 mentions this issue: maintner/maintnerd: fix use of tini

gopherbot pushed a commit to golang/build that referenced this issue Feb 9, 2018

maintner/maintnerd: fix use of tini
Updates golang/go#23705

Change-Id: Ia66e3c16bda3357daf4796f3eafb50b2eb019223
Reviewed-on: https://go-review.googlesource.com/93082
Reviewed-by: Andrew Bonventre <andybons@golang.org>
@bradfitz

This comment has been minimized.

Member

bradfitz commented Feb 12, 2018

gitmirror is next to need love:

root@gitmirror-rc-dlxkn:/# ps ax
  PID TTY      STAT   TIME COMMAND
    1 ?        Ssl    1:57 /gitmirror -http=:8585 -mirror=true -report=true -network=true -cachedir=/cache/gitmirror
 1233 ?        Z      0:00 [sh] <defunct>
 1238 ?        Z      0:00 [sh] <defunct>
 1247 ?        Z      0:00 [sh] <defunct>
 1253 ?        Z      0:00 [sh] <defunct>
 1272 ?        Z      0:00 [sh] <defunct>
 1275 ?        Z      0:00 [sh] <defunct>
 1278 ?        Z      0:00 [sh] <defunct>
 1282 ?        Z      0:00 [sh] <defunct>
 1285 ?        Z      0:00 [sh] <defunct>
 1287 ?        Z      0:00 [sh] <defunct>
 1290 ?        Z      0:00 [sh] <defunct>
 1293 ?        Z      0:00 [sh] <defunct>
 1296 ?        Z      0:00 [sh] <defunct>
 1298 ?        Z      0:00 [sh] <defunct>
 1335 ?        Z      0:00 [sh] <defunct>
 1339 ?        Z      0:00 [sh] <defunct>
 1340 ?        Z      0:00 [sh] <defunct>
 1347 ?        Z      0:00 [sh] <defunct>
 1368 ?        Z      0:00 [sh] <defunct>
 1375 ?        Z      0:00 [sh] <defunct>
 1392 ?        Z      0:00 [sh] <defunct>
 1395 ?        Z      0:00 [sh] <defunct>
 1397 ?        Z      0:00 [sh] <defunct>
 1403 ?        Z      0:00 [sh] <defunct>
 1406 ?        Z      0:00 [sh] <defunct>
 1414 ?        Z      0:00 [sh] <defunct>
 1424 ?        Z      0:00 [sh] <defunct>
 1427 ?        Z      0:00 [sh] <defunct>
 1430 ?        Z      0:00 [sh] <defunct>
 1485 ?        S      0:00 ssh git@github.com git-upload-pack 'golang/example.git'
 1524 ?        Z      0:00 [ssh] <defunct>
 1572 ?        Z      0:00 [ssh] <defunct>
 1575 ?        S      0:00 ssh git@github.com git-upload-pack 'golang/scratch.git'
 1586 ?        Z      0:00 [ssh] <defunct>
 1798 ?        Ss     0:00 /bin/bash
 1804 ?        R+     0:00 ps ax
@gopherbot

This comment has been minimized.

gopherbot commented Feb 13, 2018

Change https://golang.org/cl/93755 mentions this issue: cmd/gitmirror: add the tini init wrapper to the Dockerfile

gopherbot pushed a commit to golang/build that referenced this issue Feb 14, 2018

cmd/gitmirror: add the tini init wrapper to the Dockerfile
Updates golang/go#23705

Change-Id: I1fa42b2767d1a9780672458fd06d64ed998b9e04
Reviewed-on: https://go-review.googlesource.com/93755
Reviewed-by: Andrew Bonventre <andybons@golang.org>
@gopherbot

This comment has been minimized.

gopherbot commented Feb 14, 2018

Change https://golang.org/cl/94075 mentions this issue: cmd/gerritbot: add the tini init wrapper to the Dockerfile

gopherbot pushed a commit to golang/build that referenced this issue Feb 14, 2018

cmd/gerritbot: add the tini init wrapper to the Dockerfile
Updates golang/go#23705

Change-Id: I9c483efa491e2d9f705850d81ea94feda9f9d5a4
Reviewed-on: https://go-review.googlesource.com/94075
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
@dmitshur

This comment has been minimized.

Member

dmitshur commented Nov 2, 2018

I believe this particular issue is resolved, so closing. Please re-open if not, or if there's something actionable to do here.

@dmitshur dmitshur closed this Nov 2, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment