Production deployment is down #48

Closed

jhamman opened this issue May 1, 2019 · 1 comment

jhamman commented May 1, 2019

After merging #47, the production deployment has gone down. The current pod listing is:

 ~  kubectl get pods --namespace prod --output wide                                                                            ✔  10480  08:35:18
NAME                                                           READY     STATUS             RESTARTS   AGE       IP            NODE                                    NOMINATED NODE
binder-d6c8bbf58-59s9j                                         1/1       Running            0          9h        10.50.43.35   gke-binder-default-pool-109da6b7-387f   <none>
hub-bd6fd466c-qjkzh                                            1/1       Running            0          9h        10.50.43.37   gke-binder-default-pool-109da6b7-387f   <none>
jupyter-pangeo-2ddata-2dpan-2d-5focean-5fexamples-2dpg26eu8d   1/1       Running            0          54m       10.49.48.36   gke-binder-jupyter-pool-6f85590c-b7wr   <none>
jupyter-xgcm-2dxgcm-5fexamples-2dnz5o013j                      1/1       Running            0          1h        10.49.48.35   gke-binder-jupyter-pool-6f85590c-b7wr   <none>
prod-dind-96vw6                                                0/1       CrashLoopBackOff   116        9h        10.49.48.26   gke-binder-jupyter-pool-6f85590c-b7wr   <none>
prod-image-cleaner-8xw9d                                       0/1       CrashLoopBackOff   120        9h        10.49.48.27   gke-binder-jupyter-pool-6f85590c-b7wr   <none>
prod-kube-lego-574f9948c6-vxjt8                                1/1       Running            0          23d       10.50.43.4    gke-binder-default-pool-109da6b7-387f   <none>
prod-nginx-ingress-controller-589f47d979-6rscr                 1/1       Running            0          9h        10.49.48.29   gke-binder-jupyter-pool-6f85590c-b7wr   <none>
prod-nginx-ingress-controller-589f47d979-xszqv                 1/1       Running            0          9h        10.49.48.28   gke-binder-jupyter-pool-6f85590c-b7wr   <none>
prod-nginx-ingress-default-backend-66b8f956d6-f4g56            1/1       Running            0          22d       10.50.44.12   gke-binder-default-pool-109da6b7-3gzs   <none>
proxy-5487465f49-97xpp                                         1/1       Running            0          9h        10.50.43.36   gke-binder-default-pool-109da6b7-387f   <none>
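
For the two crash-looping pods, the restart reason and last container state can also be pulled with kubectl describe, or from the previous container's logs (pod names taken from the listing above; this is just a sketch of the checks, not output I've captured here):

kubectl describe pod prod-dind-96vw6 --namespace prod
kubectl describe pod prod-image-cleaner-8xw9d --namespace prod
kubectl logs --namespace prod prod-dind-96vw6 --previous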

Details from the dind logs:

 ~  kubectl logs --namespace prod prod-dind-96vw6                                                                              ✔  10481  08:40:36
time="2019-05-01T15:39:33.431943017Z" level=warning msg="could not change group /var/run/dind/prod/docker.sock/docker.sock to docker: group docker not found"
time="2019-05-01T15:39:33.433283517Z" level=info msg="libcontainerd: started new containerd process" pid=25
time="2019-05-01T15:39:33.433433078Z" level=info msg="parsed scheme: \"unix\"" module=grpc
time="2019-05-01T15:39:33.433488005Z" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
time="2019-05-01T15:39:33.433593327Z" level=info msg="ccResolverWrapper: sending new addresses to cc: [{unix:///var/run/docker/containerd/containerd.sock 0  }]" module=grpc
time="2019-05-01T15:39:33.433649071Z" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
time="2019-05-01T15:39:33.433742959Z" level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc4201956e0, CONNECTING" module=grpc
time="2019-05-01T15:39:33.455220961Z" level=info msg="starting containerd" revision=9754871865f7fe2f4e74d43e2fc7ccd237edcbce version=v1.2.2
time="2019-05-01T15:39:33.455703260Z" level=info msg="loading plugin "io.containerd.content.v1.content"..." type=io.containerd.content.v1
time="2019-05-01T15:39:33.455760366Z" level=info msg="loading plugin "io.containerd.snapshotter.v1.btrfs"..." type=io.containerd.snapshotter.v1
time="2019-05-01T15:39:33.456024013Z" level=warning msg="failed to load plugin io.containerd.snapshotter.v1.btrfs" error="path /var/lib/docker/containerd/daemon/io.containerd.snapshotter.v1.btrfs must be a btrfs filesystem to be used with the btrfs snapshotter"
time="2019-05-01T15:39:33.456084268Z" level=info msg="loading plugin "io.containerd.snapshotter.v1.aufs"..." type=io.containerd.snapshotter.v1
time="2019-05-01T15:39:33.462416022Z" level=warning msg="failed to load plugin io.containerd.snapshotter.v1.aufs" error="modprobe aufs failed: "ip: can't find device 'aufs'\nmodprobe: can't change directory to '/lib/modules': No such file or directory\n": exit status 1"
time="2019-05-01T15:39:33.462469424Z" level=info msg="loading plugin "io.containerd.snapshotter.v1.native"..." type=io.containerd.snapshotter.v1
time="2019-05-01T15:39:33.462544329Z" level=info msg="loading plugin "io.containerd.snapshotter.v1.overlayfs"..." type=io.containerd.snapshotter.v1
time="2019-05-01T15:39:33.462642643Z" level=info msg="loading plugin "io.containerd.snapshotter.v1.zfs"..." type=io.containerd.snapshotter.v1
time="2019-05-01T15:39:33.462881282Z" level=warning msg="failed to load plugin io.containerd.snapshotter.v1.zfs" error="path /var/lib/docker/containerd/daemon/io.containerd.snapshotter.v1.zfs must be a zfs filesystem to be used with the zfs snapshotter"
time="2019-05-01T15:39:33.462913517Z" level=info msg="loading plugin "io.containerd.metadata.v1.bolt"..." type=io.containerd.metadata.v1
time="2019-05-01T15:39:33.462927137Z" level=warning msg="could not use snapshotter btrfs in metadata plugin" error="path /var/lib/docker/containerd/daemon/io.containerd.snapshotter.v1.btrfs must be a btrfs filesystem to be used with the btrfs snapshotter"
time="2019-05-01T15:39:33.462952533Z" level=warning msg="could not use snapshotter aufs in metadata plugin" error="modprobe aufs failed: "ip: can't find device 'aufs'\nmodprobe: can't change directory to '/lib/modules': No such file or directory\n": exit status 1"
time="2019-05-01T15:39:33.462972241Z" level=warning msg="could not use snapshotter zfs in metadata plugin" error="path /var/lib/docker/containerd/daemon/io.containerd.snapshotter.v1.zfs must be a zfs filesystem to be used with the zfs snapshotter"
time="2019-05-01T15:39:43.433833795Z" level=error msg="failed connecting to containerd" error="failed to dial \"/var/run/docker/containerd/containerd.sock\": context deadline exceeded" module=libcontainerd
time="2019-05-01T15:39:43.534327107Z" level=info msg="killing and restarting containerd" module=libcontainerd pid=25
time="2019-05-01T15:39:43.535051235Z" level=info msg="=== BEGIN goroutine stack dump ===
goroutine 25 [running]:
github.com/containerd/containerd/cmd/containerd/command.dumpStacks()
	/tmp/tmp.y5M0SKgnNM/src/github.com/containerd/containerd/cmd/containerd/command/main_unix.go:78 +0x8c
github.com/containerd/containerd/cmd/containerd/command.handleSignals.func1(0xc4202e4180, 0xc4202e4120, 0x190aba0, 0xc4200ca010, 0xc420090180)
	/tmp/tmp.y5M0SKgnNM/src/github.com/containerd/containerd/cmd/containerd/command/main_unix.go:53 +0x274
created by github.com/containerd/containerd/cmd/containerd/command.handleSignals
	/tmp/tmp.y5M0SKgnNM/src/github.com/containerd/containerd/cmd/containerd/command/main_unix.go:43 +0x8b

goroutine 1 [sleep]:
time.Sleep(0x2faf080)
/usr/local/go/src/runtime/time.go:102 +0x16c
github.com/containerd/containerd/vendor/go.etcd.io/bbolt.flock(0xc4203f4f00, 0x1, 0x0, 0x1a4, 0xc4203b6330)
/tmp/tmp.y5M0SKgnNM/src/github.com/containerd/containerd/vendor/go.etcd.io/bbolt/bolt_unix.go:40 +0x88
github.com/containerd/containerd/vendor/go.etcd.io/bbolt.Open(0xc420121400, 0x48, 0x1a4, 0x20cf680, 0xc4200f1048, 0x1, 0x0)
/tmp/tmp.y5M0SKgnNM/src/github.com/containerd/containerd/vendor/go.etcd.io/bbolt/db.go:199 +0x165
github.com/containerd/containerd/services/server.LoadPlugins.func2(0xc420370c40, 0xc42025b4a0, 0x21, 0xc4200422a0, 0x1e)
/tmp/tmp.y5M0SKgnNM/src/github.com/containerd/containerd/services/server/server.go:264 +0x49d
github.com/containerd/containerd/plugin.(*Registration).Init(0xc4200d14f0, 0xc420370c40, 0xc4200d14f0)
/tmp/tmp.y5M0SKgnNM/src/github.com/containerd/containerd/plugin/plugin.go:100 +0x3a
github.com/containerd/containerd/services/server.New(0x190aba0, 0xc4200ca010, 0xc420354000, 0x1, 0xc4204e1c80, 0x0)
/tmp/tmp.y5M0SKgnNM/src/github.com/containerd/containerd/services/server/server.go:120 +0x557
github.com/containerd/containerd/cmd/containerd/command.App.func1(0xc420352000, 0xc420352000, 0xc4204e1d07)
/tmp/tmp.y5M0SKgnNM/src/github.com/containerd/containerd/cmd/containerd/command/main.go:141 +0x67e
github.com/containerd/containerd/vendor/github.com/urfave/cli.HandleAction(0x16f0900, 0x18e7090, 0xc420352000, 0xc4202e40c0, 0x0)
/tmp/tmp.y5M0SKgnNM/src/github.com/containerd/containerd/vendor/github.com/urfave/cli/app.go:502 +0xca
github.com/containerd/containerd/vendor/github.com/urfave/cli.(*App).Run(0xc42034a000, 0xc4200d00f0, 0x5, 0x5, 0x0, 0x0)
/tmp/tmp.y5M0SKgnNM/src/github.com/containerd/containerd/vendor/github.com/urfave/cli/app.go:268 +0x60e
main.main()
github.com/containerd/containerd/cmd/containerd/main.go:33 +0x51

goroutine 19 [syscall]:
os/signal.signal_recv(0x18fa520)
/usr/local/go/src/runtime/sigqueue.go:139 +0xa8
os/signal.loop()
/usr/local/go/src/os/signal/signal_unix.go:22 +0x24
created by os/signal.init.0
/usr/local/go/src/os/signal/signal_unix.go:28 +0x43

goroutine 20 [chan receive]:
github.com/containerd/containerd/vendor/github.com/golang/glog.(*loggingT).flushDaemon(0x20aa9a0)
/tmp/tmp.y5M0SKgnNM/src/github.com/containerd/containerd/vendor/github.com/golang/glog/glog.go:879 +0x8d
created by github.com/containerd/containerd/vendor/github.com/golang/glog.init.0
/tmp/tmp.y5M0SKgnNM/src/github.com/containerd/containerd/vendor/github.com/golang/glog/glog.go:410 +0x205

goroutine 26 [select, locked to thread]:
runtime.gopark(0x18e9ed0, 0x0, 0x115f5ff, 0x6, 0x18, 0x1)
/usr/local/go/src/runtime/proc.go:291 +0x120
runtime.selectgo(0xc4203bbf50, 0xc420090240)
/usr/local/go/src/runtime/select.go:392 +0xe56
runtime.ensureSigM.func1()
/usr/local/go/src/runtime/signal_unix.go:549 +0x1f6
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:2361 +0x1

goroutine 11 [select]:
github.com/containerd/containerd/vendor/github.com/docker/go-events.(*Broadcaster).run(0xc4200d1540)
/tmp/tmp.y5M0SKgnNM/src/github.com/containerd/containerd/vendor/github.com/docker/go-events/broadcast.go:117 +0x3c4
created by github.com/containerd/containerd/vendor/github.com/docker/go-events.NewBroadcaster
/tmp/tmp.y5M0SKgnNM/src/github.com/containerd/containerd/vendor/github.com/docker/go-events/broadcast.go:39 +0x1b1

=== END goroutine stack dump ==="
time="2019-05-01T15:39:43.636338550Z" level=error msg="containerd did not exit successfully" error="signal: killed" module=libcontainerd
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1525f50]

goroutine 26 [running]:
github.com/docker/docker/vendor/github.com/containerd/containerd.(*Client).Close(0x0, 0x0, 0x0)
/go/src/github.com/docker/docker/vendor/github.com/containerd/containerd/client.go:536 +0x30
github.com/docker/docker/libcontainerd/supervisor.(*remote).monitorDaemon(0xc4207081a0, 0x26a5600, 0xc420682fc0)
/go/src/github.com/docker/docker/libcontainerd/supervisor/remote_daemon.go:321 +0x262
created by github.com/docker/docker/libcontainerd/supervisor.Start
/go/src/github.com/docker/docker/libcontainerd/supervisor/remote_daemon.go:90 +0x3fa

Details from the image cleaner pod:

~  kubectl logs --namespace prod prod-image-cleaner-8xw9d                                                                      ✔  10482  08:42:23
2019-05-01 15:40:05,798 Pruning docker images when /var/lib/docker has 80.0% inodes or blocks used
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 600, in urlopen
chunked=chunked)
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 354, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/usr/local/lib/python3.6/http/client.py", line 1239, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/local/lib/python3.6/http/client.py", line 1285, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/local/lib/python3.6/http/client.py", line 1234, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/local/lib/python3.6/http/client.py", line 1026, in _send_output
self.send(msg)
File "/usr/local/lib/python3.6/http/client.py", line 964, in send
self.connect()
File "/usr/local/lib/python3.6/site-packages/docker/transport/unixconn.py", line 46, in connect
sock.connect(self.unix_socket)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
timeout=timeout
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 638, in urlopen
_stacktrace=sys.exc_info()[2])
File "/usr/local/lib/python3.6/site-packages/urllib3/util/retry.py", line 367, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/local/lib/python3.6/site-packages/urllib3/packages/six.py", line 685, in reraise
raise value.with_traceback(tb)
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 600, in urlopen
chunked=chunked)
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 354, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/usr/local/lib/python3.6/http/client.py", line 1239, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/local/lib/python3.6/http/client.py", line 1285, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/local/lib/python3.6/http/client.py", line 1234, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/local/lib/python3.6/http/client.py", line 1026, in _send_output
self.send(msg)
File "/usr/local/lib/python3.6/http/client.py", line 964, in send
self.connect()
File "/usr/local/lib/python3.6/site-packages/docker/transport/unixconn.py", line 46, in connect
sock.connect(self.unix_socket)
urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionRefusedError(111, 'Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/docker/api/client.py", line 171, in _retrieve_server_version
return self.version(api_version=False)["ApiVersion"]
File "/usr/local/lib/python3.6/site-packages/docker/api/daemon.py", line 179, in version
return self._result(self._get(url), json=True)
File "/usr/local/lib/python3.6/site-packages/docker/utils/decorators.py", line 46, in inner
return f(self, *args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/docker/api/client.py", line 194, in _get
return self.get(url, **self._set_request_timeout(kwargs))
File "/usr/local/lib/python3.6/site-packages/requests/sessions.py", line 537, in get
return self.request('GET', url, **kwargs)
File "/usr/local/lib/python3.6/site-packages/requests/sessions.py", line 524, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python3.6/site-packages/requests/sessions.py", line 637, in send
r = adapter.send(request, **kwargs)
File "/usr/local/lib/python3.6/site-packages/requests/adapters.py", line 498, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionRefusedError(111, 'Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/bin/image-cleaner.py", line 198, in
main()
File "/usr/local/bin/image-cleaner.py", line 121, in main
client = docker.from_env(version='auto')
File "/usr/local/lib/python3.6/site-packages/docker/client.py", line 81, in from_env
**kwargs_from_env(**kwargs))
File "/usr/local/lib/python3.6/site-packages/docker/client.py", line 38, in init
self.api = APIClient(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/docker/api/client.py", line 154, in init
self._version = self._retrieve_server_version()
File "/usr/local/lib/python3.6/site-packages/docker/api/client.py", line 179, in _retrieve_server_version
'Error while fetching server API version: {0}'.format(e)
docker.errors.DockerException: Error while fetching server API version: ('Connection aborted.', ConnectionRefusedError(111, 'Connection refused'))

@yuvipanda - does this still look like staging/prod are racing for the dind socket?
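
One quick way to check (a sketch only; the staging namespace name is my assumption, and I haven't run these here) is to compare the hostPath volumes the prod and staging dind pods mount and see whether they resolve to the same host directory/socket:

kubectl get pod prod-dind-96vw6 --namespace prod -o jsonpath='{.spec.volumes[*].hostPath.path}'
kubectl get pods --namespace staging -o name | grep dind
# then run the same jsonpath query against the staging dind pod and compare the paths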

jhamman commented May 1, 2019

Btw, I think this is the most relevant bit from the dind pod logs:

time="2019-05-01T15:39:33.462881282Z" level=warning msg="failed to load plugin io.containerd.snapshotter.v1.zfs" error="path /var/lib/docker/containerd/daemon/io.containerd.snapshotter.v1.zfs must be a zfs filesystem to be used with the zfs snapshotter"
time="2019-05-01T15:39:33.462913517Z" level=info msg="loading plugin "io.containerd.metadata.v1.bolt"..." type=io.containerd.metadata.v1
time="2019-05-01T15:39:33.462927137Z" level=warning msg="could not use snapshotter btrfs in metadata plugin" error="path /var/lib/docker/containerd/daemon/io.containerd.snapshotter.v1.btrfs must be a btrfs filesystem to be used with the btrfs snapshotter"
time="2019-05-01T15:39:33.462952533Z" level=warning msg="could not use snapshotter aufs in metadata plugin" error="modprobe aufs failed: "ip: can't find device 'aufs'\nmodprobe: can't change directory to '/lib/modules': No such file or directory\n": exit status 1"
time="2019-05-01T15:39:33.462972241Z" level=warning msg="could not use snapshotter zfs in metadata plugin" error="path /var/lib/docker/containerd/daemon/io.containerd.snapshotter.v1.zfs must be a zfs filesystem to be used with the zfs snapshotter"
time="2019-05-01T15:39:43.433833795Z" level=error msg="failed connecting to containerd" error="failed to dial \"/var/run/docker/containerd/containerd.sock\": context deadline exceeded" module=libcontainerd
time="2019-05-01T15:39:43.534327107Z" level=info msg="killing and restarting containerd" module=libcontainerd pid=25

jhamman closed this as completed May 6, 2019