Sandbox already exists after node reboot #1742
#1744 would probably fix this for you.
We will port it to the release-1.11 branch and cut a new release.
@mrunalp thanks for the speedy turn-around time :)
@jnummelin Sure :) Can you verify that 1.11.2 fixes the issue for you? If so, then we can close this.
@mrunalp 1.11.2 is confirmed to fix the reboot issue.
v1.11.2 includes the following fix: cri-o/cri-o#1742
@mrunalp I was maybe too hasty in closing this. A fresh 1.11.2 install survives reboots, but an upgrade does not. I mean: I had 1.11.1 running with some pods, updated it to 1.11.2, and rebooted. Pods cannot be started; the kubelet is able to create the pod, but no containers actually start. This is what I see:
Here's what I see in the CRI-O logs:
That repeats a lot. The kubelet logs show pretty much the same issue as before:
So far I haven't been able to figure out a workaround to safely migrate from 1.11.1 to 1.11.2. Any pointers highly appreciated.
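For what it's worth, here's a minimal Go sketch, purely my own illustration and not CRI-O's actual code, of the kind of startup reconciliation that seems to be missing here: if a name registry's on-disk reservations outlive the sandboxes they describe, every create after a reboot returns "already exists" until stale entries are released. The names, file, and liveness callback below are all made up.

```go
// sandboxnames.go: illustrative only -- NOT CRI-O's real implementation.
// Sketch of a name registry that must reconcile its on-disk state with
// reality after an unclean reboot; "sandbox already exists" is what you
// get when the reservation survives but the sandbox process did not.
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"os"
)

var ErrNameReserved = errors.New("sandbox name is already reserved")

type NameStore struct {
	Path  string
	Names map[string]bool
}

func Load(path string) (*NameStore, error) {
	s := &NameStore{Path: path, Names: map[string]bool{}}
	data, err := os.ReadFile(path)
	if errors.Is(err, os.ErrNotExist) {
		return s, nil // first run: nothing persisted yet
	}
	if err != nil {
		return nil, err
	}
	return s, json.Unmarshal(data, &s.Names)
}

// Reserve fails if the name is still recorded from a previous run.
func (s *NameStore) Reserve(name string) error {
	if s.Names[name] {
		return ErrNameReserved
	}
	s.Names[name] = true
	return s.save()
}

// ReleaseStale drops reservations whose backing sandbox no longer
// exists; isAlive stands in for whatever liveness check the runtime has.
func (s *NameStore) ReleaseStale(isAlive func(name string) bool) error {
	for name := range s.Names {
		if !isAlive(name) {
			delete(s.Names, name)
		}
	}
	return s.save()
}

func (s *NameStore) save() error {
	data, err := json.Marshal(s.Names)
	if err != nil {
		return err
	}
	return os.WriteFile(s.Path, data, 0o600)
}

func main() {
	s, _ := Load("names.json")
	// After a hard reboot nothing is running, so every name is stale.
	_ = s.ReleaseStale(func(string) bool { return false })
	fmt.Println(s.Reserve("k8s_POD_mypod")) // now succeeds: <nil>
}
```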
I saw this too. It's actually pretty bad, because it keeps firing up new /pause containers and eventually your worker is on its knees (see …)
One thing to try is to copy the relevant pieces in …
That probably means we're not persisting something important to disk across a power cycle of the system 🤔
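If that's what's happening, the usual defence is the write-to-temp, fsync, rename pattern, so a crash can never leave a half-written state file behind. A minimal sketch in Go, assuming a single JSON state file; the function name and file layout are mine, not CRI-O's:

```go
// atomicwrite.go: a sketch (not CRI-O code) of how a daemon persists
// state so it survives a hard power-off: write to a temp file, fsync it,
// then atomically rename over the old file. A plain overwrite can leave
// a truncated or empty file after a crash.
package main

import (
	"os"
	"path/filepath"
)

func WriteFileAtomic(path string, data []byte) error {
	// Create the temp file in the same directory so the rename below
	// stays on one filesystem (rename is not atomic across mounts).
	tmp, err := os.CreateTemp(filepath.Dir(path), ".tmp-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // harmless no-op once the rename succeeds

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	// fsync before rename, so the bytes are durable before they become visible.
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	// rename(2) is atomic on POSIX filesystems: readers see either the
	// old state file or the complete new one, never a partial write.
	return os.Rename(tmp.Name(), path)
}

func main() {
	if err := WriteFileAtomic("state.json", []byte(`{"sandboxes":[]}`)); err != nil {
		panic(err)
	}
}
```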
We observed something similar, but this was a fresh start of the node. I mean none of the containers were running before, and nothing had started apart from the docker and cri-o processes (OCP 3.11).
Somehow it got this on the api-server static pod startup :/
@mjudeikis if the node was fresh, it seems unlikely that nothing happened beforehand. Do you have any additional logs?
@jnummelin do you still have issues with upgrading from 1.11.1 to 1.11.2? Maybe this commit, which went into 1.11.2, helps: 2b3add5
@mjudeikis are you using CRI-O >= 1.11.2?
@runcom I think I did try it successfully, but I can't remember for sure. :) We're running and shipping with Pharos 1.11.6 already.
We have seen this quite a lot recently with the …

From what I've seen, the problem manifests with DaemonSets (if you have pods defined in a manifest and started on boot, they seem to start fine), but that might just be because our upgrade testing does a …

It should be noted that the way the Ansible folks worked around this is to nuke …
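For reference, here's roughly what that "nuke the storage" workaround amounts to, as a Go sketch. The two paths are containers/storage's usual defaults, but they are my assumption rather than something confirmed in this thread; stop kubelet and crio first, and be aware this deletes every image and container on the node:

```go
// wipe.go: sketch of the drastic workaround mentioned above. The paths
// below are assumed containers/storage defaults -- verify them against
// your storage.conf / crio.conf before running anything like this.
package main

import (
	"fmt"
	"os"
)

func main() {
	for _, dir := range []string{
		"/var/lib/containers/storage", // assumed graph root: images + layers
		"/var/run/containers/storage", // assumed run root: runtime state
	} {
		if err := os.RemoveAll(dir); err != nil {
			fmt.Fprintln(os.Stderr, "failed to remove", dir+":", err)
			os.Exit(1)
		}
		fmt.Println("removed", dir)
	}
}
```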
I'm also seeing this with cri-o 1.11.6. In my case, I am running a cluster on QEMU virtual machines and, while everything is running, kill the QEMU processes with SIGKILL. That's the same as a sudden power loss of the machine. My expectation is that a cluster should survive that. Is that perhaps too optimistic?
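That kind of hard-kill test can also be approximated at the process level, without QEMU. A rough Go sketch (the file name, writer loop, and timing are all made up for illustration) that SIGKILLs a writer mid-write and then checks whether a fresh "rebooted" process can still parse the state:

```go
// crashtest.go: rough sketch of a crash-consistency check -- the
// process-level analogue of yanking power from a VM. All names are
// hypothetical; this is not a CRI-O test.
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"os/exec"
	"time"
)

func main() {
	if len(os.Args) > 1 && os.Args[1] == "writer" {
		// Child: rewrite the state file in a tight loop,
		// deliberately NOT atomically, to expose the window.
		for i := 0; ; i++ {
			data, _ := json.Marshal(map[string]int{"generation": i})
			_ = os.WriteFile("state.json", data, 0o600)
		}
	}

	// Parent: start the writer, then kill it with no chance to clean up.
	cmd := exec.Command(os.Args[0], "writer")
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	time.Sleep(100 * time.Millisecond)
	_ = cmd.Process.Kill() // SIGKILL on Unix
	_ = cmd.Wait()

	// "Reboot": a fresh process must be able to read the state back.
	data, err := os.ReadFile("state.json")
	var state map[string]int
	if err != nil || json.Unmarshal(data, &state) != nil {
		fmt.Println("state corrupted after hard kill:", err)
		return
	}
	fmt.Println("state survived:", state)
}
```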
I would argue that it should definitely survive this. One of the primary goals of cri-o was to have a container management process that could consistently survive death -- something that Docker will always struggle with.
Thanks for the confirmation. Then this bug needs to remain open, because right now it doesn't seem to be working. FWIW, Docker worked fine for me in the same situation (same OS = Clear Linux, same VM setup, hard poweroff).
For testing purposes it is useful to switch back and forth between Docker and cri-o without having to revert commits. Docker is now again the default for testing because it handles killed VMs better (cri-o is not crash-resistant, see cri-o/cri-o#1742 (comment)) and because only Docker gives pmem-csi a read/write /sys (intel/pmem-csi#112). pmem-csi uses the same test setup, and therefore it makes sense to use the same defaults in OIM as well.
I'm seeing this with cri-o-1.12.0-12.dev.gitc4f232a.fc29.x86_64 -- it's got to be able to survive an unexpected reboot, I'd say.

Update: I upgraded to cri-o-1.13.0-1.gite8a2525.module_f29+3066+eba77a73.x86_64 and now my single-node cluster is coming back up as expected following a reboot...
We fixed this in all the branches recently. |
I will cut new releases for it. |
Closing per @mrunalp's comments above; please reopen if you disagree.
Still reproducing on: …
Description
Steps to reproduce the issue:
Describe the results you received:
After reboot, no pods can be created. The kubelet always gets an error from CRI-O:
CRI-O logs also show some errors after reboot:
In this case the pod to be created/started was running fine before the reboot. It's a static pod in /etc/kubernetes/manifest/...
We've tried many variations of the shutdown process, but cannot seem to get CRI-O to start up properly after reboot.
The only way to "recover" seems to be with:
Describe the results you expected:
Kubelet & CRI-O to be able to start/create needed pods after reboot.
Additional information you deem important (e.g. issue happens only occasionally):
Based on our testing, reboots seem to work properly with the 1.11.0 release.
Output of crio --version:

Additional environment details (AWS, VirtualBox, physical, etc.):
Tested on DO with Ubuntu Xenial, Ubuntu Bionic & CentOS 7 with the same results.