Sandbox already exists after node reboot #1742

Open
jnummelin opened this Issue Aug 16, 2018 · 20 comments

jnummelin commented Aug 16, 2018

Description

Steps to reproduce the issue:

  1. Set up Kubernetes with CRI-O 1.11.1
  2. Reboot a node

Describe the results you received:

After the reboot, no pods can be created. The kubelet always gets an error from CRI-O:

Aug 16 10:49:35 pharos-worker-0 kubelet[1503]: W0816 10:49:35.119467    1503 status_manager.go:482] Failed to get status for pod "pharos-proxy-pharos-worker-0_kube-system(f2ed2b7c8d4f0463b262f46b7a3856b9)": Get https://localhost:6443/api/v1/namespaces/kube-system/pods/pharos-proxy-pharos-worker-0: dial tcp 127.0.0.1:6443: connect: connection refused
Aug 16 10:49:35 pharos-worker-0 kubelet[1503]: E0816 10:49:35.134458    1503 remote_runtime.go:92] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = pod sandbox with name "k8s_pharos-proxy-pharos-worker-0_kube-system_f2ed2b7c8d4f0463b262f46b7a3856b9_0" already exists
Aug 16 10:49:35 pharos-worker-0 kubelet[1503]: E0816 10:49:35.134538    1503 kuberuntime_sandbox.go:56] CreatePodSandbox for pod "pharos-proxy-pharos-worker-0_kube-system(f2ed2b7c8d4f0463b262f46b7a3856b9)" failed: rpc error: code = Unknown desc = pod sandbox with name "k8s_pharos-proxy-pharos-worker-0_kube-system_f2ed2b7c8d4f0463b262f46b7a3856b9_0" already exists
Aug 16 10:49:35 pharos-worker-0 kubelet[1503]: E0816 10:49:35.134552    1503 kuberuntime_manager.go:646] createPodSandbox for pod "pharos-proxy-pharos-worker-0_kube-system(f2ed2b7c8d4f0463b262f46b7a3856b9)" failed: rpc error: code = Unknown desc = pod sandbox with name "k8s_pharos-proxy-pharos-worker-0_kube-system_f2ed2b7c8d4f0463b262f46b7a3856b9_0" already exists

CRI-O logs also show some errors after reboot:

-- Logs begin at Thu 2018-08-16 10:49:11 UTC, end at Thu 2018-08-16 11:59:52 UTC. --
Aug 16 10:49:14 pharos-worker-0 systemd[1]: Starting Open Container Initiative Daemon...
Aug 16 10:49:14 pharos-worker-0 sysctl[1490]: net.ipv4.ip_forward = 1
Aug 16 10:49:14 pharos-worker-0 crio[1507]: time="2018-08-16 10:49:14.809774896Z" level=info msg="[graphdriver] using prior storage driver: overlay"
Aug 16 10:49:14 pharos-worker-0 crio[1507]: time="2018-08-16 10:49:14.819928928Z" level=info msg="CNI network pharos (type=weave-net) is used from /etc/cni/net.d/00-pharos.conflist"
Aug 16 10:49:14 pharos-worker-0 crio[1507]: time="2018-08-16 10:49:14.819951403Z" level=info msg="Initial CNI setting succeeded"
Aug 16 10:49:14 pharos-worker-0 crio[1507]: time="2018-08-16 10:49:14.874817377Z" level=warning msg="could not restore sandbox 563be37a1089ed4e57ccb41fcad923262514189111aee88fd4b03e82af22c3b9 container 563be37a1089ed4e57ccb41fcad923262514189111aee88fd4b03e82af22c3b9: open /var/run/containers/storage/overlay-containers/563be37a1089ed4e57ccb41fcad923262514189111aee88fd4b03e82af22c3b9/userdata/config.json: no such file or directory"
Aug 16 10:49:14 pharos-worker-0 crio[1507]: time="2018-08-16 10:49:14.874882490Z" level=warning msg="could not restore sandbox 45f7b1b7037e9f0c89de29bae6e9851ac4ce94aff11e7ff65ca2f998445d6439 container 45f7b1b7037e9f0c89de29bae6e9851ac4ce94aff11e7ff65ca2f998445d6439: open /var/run/containers/storage/overlay-containers/45f7b1b7037e9f0c89de29bae6e9851ac4ce94aff11e7ff65ca2f998445d6439/userdata/config.json: no such file or directory"
Aug 16 10:49:14 pharos-worker-0 crio[1507]: time="2018-08-16 10:49:14.875014642Z" level=warning msg="could not restore sandbox ab131c0e092e9e8347b56910f612bf50b89c6ead0c90bbec09ee95f076f461d2 container ab131c0e092e9e8347b56910f612bf50b89c6ead0c90bbec09ee95f076f461d2: open /var/run/containers/storage/overlay-containers/ab131c0e092e9e8347b56910f612bf50b89c6ead0c90bbec09ee95f076f461d2/userdata/config.json: no such file or directory"
Aug 16 10:49:14 pharos-worker-0 crio[1507]: time="2018-08-16 10:49:14.875069151Z" level=warning msg="could not restore sandbox 5a87873e9591cc308c4cad490469f8ddcf54fc71b85d64056c1f873d75f59e58 container 5a87873e9591cc308c4cad490469f8ddcf54fc71b85d64056c1f873d75f59e58: open /var/run/containers/storage/overlay-containers/5a87873e9591cc308c4cad490469f8ddcf54fc71b85d64056c1f873d75f59e58/userdata/config.json: no such file or directory"
Aug 16 10:49:14 pharos-worker-0 crio[1507]: time="2018-08-16 10:49:14.875107871Z" level=warning msg="could not restore container d22575619707850de9fc1be3afb2ebf4a21980a82a8c139daf30a3b3cfc90a50: open /var/run/containers/storage/overlay-containers/d22575619707850de9fc1be3afb2ebf4a21980a82a8c139daf30a3b3cfc90a50/userdata/config.json: no such file or directory"
Aug 16 10:49:14 pharos-worker-0 crio[1507]: time="2018-08-16 10:49:14.875221658Z" level=warning msg="could not restore container 50255cb7b734aa10719fcc0712e5e6e65d3bc8bbe5b8e4e637304a431c483645: open /var/run/containers/storage/overlay-containers/50255cb7b734aa10719fcc0712e5e6e65d3bc8bbe5b8e4e637304a431c483645/userdata/config.json: no such file or directory"
Aug 16 10:49:14 pharos-worker-0 crio[1507]: time="2018-08-16 10:49:14.875336372Z" level=warning msg="could not restore container fb3d0ea85b6ac97a23becefab610a76108be1086c551efb576c9dd468924b706: open /var/run/containers/storage/overlay-containers/fb3d0ea85b6ac97a23becefab610a76108be1086c551efb576c9dd468924b706/userdata/config.json: no such file or directory"
Aug 16 10:49:14 pharos-worker-0 crio[1507]: time="2018-08-16 10:49:14.875433800Z" level=warning msg="could not restore container a942ea6b2b97c4528e4d43d8fa9278776e3fc2672838dea16c09678b790f9d6f: open /var/run/containers/storage/overlay-containers/a942ea6b2b97c4528e4d43d8fa9278776e3fc2672838dea16c09678b790f9d6f/userdata/config.json: no such file or directory"
Aug 16 10:49:14 pharos-worker-0 crio[1507]: time="2018-08-16 10:49:14.875471488Z" level=warning msg="could not restore container 6bff7a83a7d6d91bbc4fb745ff3649eb1f13ec189dbe387ca8cfda0d30b58a4e: open /var/run/containers/storage/overlay-containers/6bff7a83a7d6d91bbc4fb745ff3649eb1f13ec189dbe387ca8cfda0d30b58a4e/userdata/config.json: no such file or directory"
Aug 16 10:49:14 pharos-worker-0 crio[1507]: time="2018-08-16 10:49:14.875559127Z" level=warning msg="could not restore container 65c92394ff515dd385a83874c0faf86450c99ed8c5bd16d36ef9e38e95abfd7c: open /var/run/containers/storage/overlay-containers/65c92394ff515dd385a83874c0faf86450c99ed8c5bd16d36ef9e38e95abfd7c/userdata/config.json: no such file or directory"
Aug 16 10:49:14 pharos-worker-0 crio[1507]: time="2018-08-16 10:49:14.875684359Z" level=warning msg="could not restore container ab2aefc0221c18153bc1ada4096baee42b918d8fe554b68e746a23731d938e2f: open /var/run/containers/storage/overlay-containers/ab2aefc0221c18153bc1ada4096baee42b918d8fe554b68e746a23731d938e2f/userdata/config.json: no such file or directory"
Aug 16 10:49:14 pharos-worker-0 systemd[1]: Started Open Container Initiative Daemon.

In this case, the pod that fails to be created/started was running fine before the reboot. It's a static pod in /etc/kubernetes/manifest/...

We've tried many variations of the shutdown process, but cannot seem to get CRI-O to start up properly after a reboot.

The only way to "recover" seems to be with:

systemctl stop crio
rm -rf /var/lib/containers
rm -rf /var/run/containers
systemctl start crio

Describe the results you expected:

Kubelet & CRI-O should be able to create/start the needed pods after a reboot.

Additional information you deem important (e.g. issue happens only occasionally):

Based on our testing, reboots seem to work properly with the 1.11.0 release.

Output of crio --version:

root@pharos-worker-0:~# crio --version
crio version 1.11.1

Additional environment details (AWS, VirtualBox, physical, etc.):

Tested on DigitalOcean with Ubuntu Xenial, Ubuntu Bionic & CentOS 7, with the same results.

mrunalp (Member) commented Aug 16, 2018

#1744 would probably fix this for you.

mrunalp (Member) commented Aug 16, 2018

We will port it to the release-1.11 branch and cut a new release.

jnummelin (Author) commented Aug 17, 2018

@mrunalp thanks for the speedy turnaround time :)

mrunalp (Member) commented Aug 21, 2018

@jnummelin Sure :) Can you verify that 1.11.2 fixes the issue for you? If so, then we can close this.

jnummelin (Author) commented Aug 23, 2018

@mrunalp 1.11.2 confirmed to fix the reboot issue.

jnummelin closed this Aug 23, 2018

schu added a commit to schu/kubedee that referenced this issue Aug 25, 2018

schu added a commit to kinvolk/kubernetes-the-hard-way-vagrant that referenced this issue Aug 26, 2018

Update crio v1.11.1 -> v1.11.2

v1.11.2 includes the following fix:

kubernetes-sigs/cri-o#1742

jnummelin (Author) commented Aug 30, 2018

@mrunalp I was maybe too hasty to close this. A fresh 1.11.2 install survives reboots, but an upgrade does not: I had 1.11.1 running with some pods, updated it to 1.11.2, and rebooted. Now the pods cannot be started.

I see the kubelet is able to create the pod sandbox, but none of the containers get running. This is what I see:

root@pharos-worker-1:~# crictl -r unix:///var/run/crio/crio.sock pods
POD ID              CREATED             STATE               NAME                                NAMESPACE           ATTEMPT
61872bd141228       4 minutes ago       Ready               pharos-proxy-pharos-worker-1        kube-system         0
d6daf3adc5775       About an hour ago   NotReady            pharos-telemetry-1535612400-zqwlz   kube-system         0
2f6be646aa4fb       2 hours ago         NotReady            pharos-telemetry-1535608800-424bw   kube-system         1
c888f4e0ae86c       2 hours ago         NotReady            pharos-telemetry-1535608800-424bw   kube-system         0
a2637cbaf14d5       4 hours ago         NotReady            pharos-telemetry-1535601600-p6x24   kube-system         0

root@pharos-worker-1:~# crictl -r unix:///var/run/crio/crio.sock ps -a
CONTAINER ID        IMAGE                                                                                                            CREATED             STATE               NAME                ATTEMPT
84eb67dd4a869       quay.io/kontena/pharos-telemetry-amd64@sha256:b95612002da1cb1519c557db5d0ade987c28897d975d623cfda3adad53e5324c   About an hour ago   Exited              agent               0

root@pharos-worker-1:~# crictl -r unix:///var/run/crio/crio.sock logs 61872bd141228
E0830 08:07:01.960325    3521 remote_runtime.go:278] ContainerStatus "61872bd141228" from runtime service failed: rpc error: code = Unknown desc = specified container not found: 61872bd141228dba313a2faf3ffb8a984356dd5a783f6b24259de36ebd29aa3a
FATA[0000] rpc error: code = Unknown desc = specified container not found: 61872bd141228dba313a2faf3ffb8a984356dd5a783f6b24259de36ebd29aa3a 

Here's what I see in the CRI-O logs:

-- Logs begin at Thu 2018-08-30 07:59:56 UTC, end at Thu 2018-08-30 08:07:39 UTC. --
Aug 30 07:59:59 pharos-worker-1 systemd[1]: Starting Open Container Initiative Daemon...
Aug 30 07:59:59 pharos-worker-1 sysctl[1458]: net.ipv4.ip_forward = 1
Aug 30 08:00:33 pharos-worker-1 systemd[1]: Started Open Container Initiative Daemon.
Aug 30 08:00:33 pharos-worker-1 crio[1513]: weave-cni: error removing interface "eth0": no such file or directory
Aug 30 08:00:33 pharos-worker-1 crio[1513]: weave-cni: unable to release IP address: Delete http://127.0.0.1:6784/ip/c888f4e0ae86ce6c53266709c491f1f6d72f7b9c3e3ed7ed02b3ff1ddfbe0424: dial tcp 127.0.0.1:6784: c
Aug 30 08:00:33 pharos-worker-1 crio[1513]: container "c888f4e0ae86ce6c53266709c491f1f6d72f7b9c3e3ed7ed02b3ff1ddfbe0424" does not exist
Aug 30 08:00:33 pharos-worker-1 crio[1513]: weave-cni: error removing interface "eth0": no such file or directory
Aug 30 08:00:33 pharos-worker-1 crio[1513]: weave-cni: unable to release IP address: Delete http://127.0.0.1:6784/ip/2f6be646aa4fbaa80b098268e2c632c62ca362859540b22711a6dfe02343b75a: dial tcp 127.0.0.1:6784: c
Aug 30 08:00:33 pharos-worker-1 crio[1513]: container "2f6be646aa4fbaa80b098268e2c632c62ca362859540b22711a6dfe02343b75a" does not exist
Aug 30 08:00:33 pharos-worker-1 crio[1513]: weave-cni: error removing interface "eth0": no such file or directory
Aug 30 08:00:33 pharos-worker-1 crio[1513]: weave-cni: unable to release IP address: Delete http://127.0.0.1:6784/ip/a2637cbaf14d53dddd8bac488a736398a2113f061a034fd80906e9271ed7e746: dial tcp 127.0.0.1:6784: c
Aug 30 08:00:33 pharos-worker-1 crio[1513]: container "a2637cbaf14d53dddd8bac488a736398a2113f061a034fd80906e9271ed7e746" does not exist
Aug 30 08:00:33 pharos-worker-1 crio[1513]: weave-cni: error removing interface "eth0": no such file or directory
Aug 30 08:00:33 pharos-worker-1 crio[1513]: weave-cni: unable to release IP address: Delete http://127.0.0.1:6784/ip/7c7ccdeda98f847ba40f360744b764f130f5bdf920fe63b8b97b9897947b8e40: dial tcp 127.0.0.1:6784: c
Aug 30 08:00:33 pharos-worker-1 crio[1513]: weave-cni: error removing interface "eth0": no such file or directory
Aug 30 08:00:33 pharos-worker-1 crio[1513]: weave-cni: unable to release IP address: Delete http://127.0.0.1:6784/ip/d383151f0e50388b54002ef5d72d959feb6cbfc2c1d927b70b5d95c7798e106d: dial tcp 127.0.0.1:6784: c
Aug 30 08:01:33 pharos-worker-1 crio[1513]: weave-cni: error removing interface "eth0": no such file or directory
Aug 30 08:01:33 pharos-worker-1 crio[1513]: weave-cni: unable to release IP address: Delete http://127.0.0.1:6784/ip/a2637cbaf14d53dddd8bac488a736398a2113f061a034fd80906e9271ed7e746: dial tcp 127.0.0.1:6784: c
Aug 30 08:01:33 pharos-worker-1 crio[1513]: container "a2637cbaf14d53dddd8bac488a736398a2113f061a034fd80906e9271ed7e746" does not exist
Aug 30 08:01:33 pharos-worker-1 crio[1513]: weave-cni: error removing interface "eth0": no such file or directory
Aug 30 08:01:33 pharos-worker-1 crio[1513]: weave-cni: unable to release IP address: Delete http://127.0.0.1:6784/ip/c888f4e0ae86ce6c53266709c491f1f6d72f7b9c3e3ed7ed02b3ff1ddfbe0424: dial tcp 127.0.0.1:6784: c
Aug 30 08:01:33 pharos-worker-1 crio[1513]: container "c888f4e0ae86ce6c53266709c491f1f6d72f7b9c3e3ed7ed02b3ff1ddfbe0424" does not exist

That repeats a lot.

Kubelet logs show pretty much the same issue as before:

Aug 30 08:08:48 pharos-worker-1 kubelet[1772]: E0830 08:08:48.128812    1772 remote_runtime.go:187] CreateContainer in sandbox "61872bd141228dba313a2faf3ffb8a984356dd5a783f6b24259de36ebd29aa3a" from runtime service failed: rpc error: code = Unknown desc = that name is already in use
Aug 30 08:08:48 pharos-worker-1 kubelet[1772]: E0830 08:08:48.128892    1772 kuberuntime_manager.go:733] container start failed: CreateContainerError: that name is already in use

So far I haven't been able to figure out a workaround to safely migrate from 1.11.1 --> 1.11.2. Any pointers are highly appreciated.

jnummelin reopened this Aug 30, 2018

mythi (Contributor) commented Aug 30, 2018

That repeats a lot.

I saw this too. It's actually pretty bad, because it fires up new /pause containers all the time and eventually your worker node is on its knees (see ps ax).
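
For reference, one quick way to watch the pile-up described above is to count the pause processes; this is just a sketch, and the [/] in the pattern only keeps grep from matching itself:

# count the /pause infra-container processes accumulating on the worker
ps ax | grep -c '[/]pause'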

mythi (Contributor) commented Aug 31, 2018

So far I haven't been able to figure out a workaround to safely migrate from 1.11.1 --> 1.11.2.

One thing to try is to copy the relevant pieces of /var/run/containers under /var/lib/containers before the reboot.
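
A minimal sketch of that idea, assuming the paths seen in the logs above; the backup destination is made up here, and whether CRI-O actually picks the restored state back up is an open question, so treat it as an experiment rather than a fix:

# before the reboot: /var/run is a tmpfs, so the per-container userdata
# (config.json etc.) under it is lost on power-off -- keep a copy on disk
systemctl stop crio
cp -a /var/run/containers /var/lib/containers-run-backup
reboot
# after the reboot (with crio still stopped), put the copy back and start it
systemctl stop crio
cp -a /var/lib/containers-run-backup/. /var/run/containers/
systemctl start crio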

runcom (Member) commented Sep 3, 2018

One thing to try is to copy the relevant pieces of /var/run/containers under /var/lib/containers before the reboot.

That probably means we're not storing something important to disk during a power-off/on cycle of the system 🤔

mjudeikis commented Nov 9, 2018

We observed something similar, but this was a fresh start of the node. I mean none of the containers were running before, and nothing had started apart from the docker and cri-o processes (OCP 3.11).

Nov 09 10:11:43 master-000002 atomic-openshift-node[6003]: E1109 10:11:43.477794    6003 pod_workers.go:186] Error syncing pod 64da4761f4f96eac4d37d59897c078e8 ("master-api-master-000002_kube-system(64da4761f4f96eac4d37d59897c078e8)"), skipping: failed to "CreatePodSandbox" for "master-api-master-000002_kube-system(64da4761f4f96eac4d37d59897c078e8)" with CreatePodSandboxError: "CreatePodSandbox for pod \"master-api-master-000002_kube-system(64da4761f4f96eac4d37d59897c078e8)\" failed: rpc error: code = Unknown desc = pod sandbox with name \"k8s_master-api-master-000002_kube-system_64da4761f4f96eac4d37d59897c078e8_0\" already exists"
Nov 09 10:11:43 master-000002 atomic-openshift-node[6003]: I1109 10:11:43.477820    6003 server.go:470] Event(v1.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"master-api-master-000002", UID:"64da4761f4f96eac4d37d59897c078e8", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'FailedCreatePodSandBox' Failed create pod sandbox: rpc error: code = Unknown desc = pod sandbox with name "k8s_master-api-master-000002_kube-system_64da4761f4f96eac4d37d59897c078e8_0" already exists

Somehow it got this on the API server static pod startup :/

runcom (Member) commented Nov 9, 2018

@mjudeikis if the node was really fresh, it seems unlikely that nothing happened before. Do you have any additional logs?

runcom (Member) commented Nov 9, 2018

@jnummelin do you still have issues upgrading from 1.11.1 to 1.11.2? Maybe this commit, which went into 1.11.2, should help: 2b3add5

@mjudeikis are you using CRI-O >= 1.11.2?

jnummelin (Author) commented Nov 9, 2018

@runcom I think I did try it successfully, but I cannot remember for sure. :) We're already running and shipping 1.11.6 with Pharos.

cyphar (Contributor) commented Nov 28, 2018

We have seen this quite a lot recently with the 1.9.x -> 1.10.x upgrade. After looking at this for a while, it definitely looks like there's something wrong with the on-disk state and some aspect of cri-o really doesn't like it. I couldn't find any logs that explained why cri-o was continually trying to start new pods without cleaning up after the failure.

From what I've seen, the problem manifests with DaemonSets (if you have pods defined in a manifest and started on boot, they seem to start fine), but that might just be because our upgrade testing does a kubectl drain (which won't kill DaemonSets).
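
For reference, a drain that leaves DaemonSet pods in place typically looks like the following; the node name is hypothetical:

kubectl drain worker-1 --ignore-daemonsets
# ... upgrade cri-o / reboot the node ...
kubectl uncordon worker-1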

It should be noted that the way the Ansible folks worked around this is to nuke /var/lib/containers from orbit.

pohly commented Nov 28, 2018

I'm also seeing this with cri-o 1.11.6.

In my case, I am running a cluster on QEMU virtual machines and, while everything is running, I kill the QEMU processes with SIGKILL. That's the same as a sudden power loss of the machine. My expectation is that a cluster should survive that. Is that perhaps too optimistic?
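
Roughly what that test amounts to, assuming the usual qemu-system-* process names (adjust the pattern to your VM setup):

# hard-kill every VM at once -- equivalent to pulling the power cord
pkill -9 -f qemu-system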

cyphar (Contributor) commented Dec 1, 2018

My expectation is that a cluster should survive that. Is that perhaps too optimistic?

I would argue that it should definitely survive this. One of the primary goals of cri-o was to have a container management process that could consistently survive death -- something that Docker will always struggle with.

pohly commented Dec 1, 2018

avalluri pushed a commit to intel/oim that referenced this issue Jan 30, 2019

test: make container runtime configurable, Docker as default

For testing purposes it is useful to switch back and forth between
Docker and cri-o without having to revert commits.

Docker is now again the default for testing because it handles
killed VMs better (cri-o is not crash-resistant, see
kubernetes-sigs/cri-o#1742 (comment)) and because only Docker gives
pmem-csi a read/write /sys (intel/pmem-CSI#112).

pmem-csi uses the same test setup and therefore it makes sense to use
the same defaults also in OIM.

jasonbrooks (Contributor) commented Mar 2, 2019

I'm seeing this with cri-o-1.12.0-12.dev.gitc4f232a.fc29.x86_64 -- it's got to be able to survive an unexpected reboot, I'd say.

Update: I upgraded to cri-o-1.13.0-1.gite8a2525.module_f29+3066+eba77a73.x86_64 and now my single-node cluster is coming back up as expected following a reboot...

mrunalp (Member) commented Mar 5, 2019

We fixed this in all the branches recently.

mrunalp (Member) commented Mar 5, 2019

I will cut new releases for it.
