
Kubelet goes into a cyclic restart loop when an inconsistent container list is received from the runtime service #21

Closed
mikkosest opened this issue Dec 22, 2020 · 14 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@mikkosest

mikkosest commented Dec 22, 2020

Kubelet 1.19.3
When a node join is started with kubeadm, a PodSandbox is created for Multus and then dies for an unknown reason. The problem is that removal of this PodSandbox container is skipped because its container ID is not found in the pod's container list (ContainerStatus[]) in kubelet.

Later, when the container manager queries runtimeService.ListContainers(nil) and runtimeService.ListPodSandbox(nil) and loops over the containers, one of them still references the dead PodSandbox, which is no longer in the sandbox list returned by the runtime service. This leads to a fatal kubelet crash. Because there is no logic to clean up a container whose PodSandbox no longer exists from the list returned by runtimeService.ListContainers(nil), kubelet crashes in a loop:
kubelet[5992]: I1217 11:30:42.639790 5992 kubelet.go:1898] SyncLoop (PLEG): ignore irrelevant event: &pleg.PodLifecycleEvent{ID:"68224015-de33-4879-a229-b8eee8538b89", Type:"ContainerDied", Data:"894f35dca3eda57adef28b69acd0607efdeb34e8814e87e196bc163305576028"}
2020-12-17T09:30:42.640070+00:00 base-image-2 kubelet[5992]: W1217 11:30:42.639799 5992 pod_container_deletor.go:79] Container "894f35dca3eda57adef28b69acd0607efdeb34e8814e87e196bc163305576028" not found in pod's containers
2020-12-17T09:30:43.234857+00:00 base-image-2 kubelet[5992]: I1217 11:30:43.232179 5992 generic.go:155] GenericPLEG: 68224015-de33-4879-a229-b8eee8538b89/894f35dca3eda57adef28b69acd0607efdeb34e8814e87e196bc163305576028: exited -> non-existent
kubelet.go:1325] Failed to start ContainerManager failed to build map of initial containers from runtime: no PodsandBox found with Id '894f35dca3eda57adef28b69acd0607efdeb34e8814e87e196bc163305576028'

Workaround: add a runtimeService.RemoveContainer call for this orphaned PodSandbox container in the container manager:

func buildContainerMapFromRuntime(runtimeService internalapi.RuntimeService) (containermap.ContainerMap, error) {
	podSandboxMap := make(map[string]string)
	podSandboxList, _ := runtimeService.ListPodSandbox(nil)
	for _, p := range podSandboxList {
		podSandboxMap[p.Id] = p.Metadata.Uid
	}

	containerMap := containermap.NewContainerMap()
	containerList, _ := runtimeService.ListContainers(nil)
	for _, c := range containerList {
		if _, exists := podSandboxMap[c.PodSandboxId]; !exists {
			// Workaround (added line): remove the orphaned container so the
			// dangling sandbox reference is gone on the next kubelet restart.
			runtimeService.RemoveContainer(c.Id)
			return nil, fmt.Errorf("no PodsandBox found with Id '%s'", c.PodSandboxId)
		}
		containerMap.Add(podSandboxMap[c.PodSandboxId], c.Metadata.Name, c.Id)
	}

	return containerMap, nil
} 
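
For comparison, a minimal sketch of an alternative approach (not part of the patch above, and not necessarily what upstream does): skip the orphaned container and log a warning instead of removing it or failing, so that ContainerManager can still start. The packages used (internalapi, containermap, klog) are assumed to match kubelet 1.19's cm package, as in the snippet above.

func buildContainerMapFromRuntimeSkippingOrphans(runtimeService internalapi.RuntimeService) (containermap.ContainerMap, error) {
	podSandboxMap := make(map[string]string)
	podSandboxList, _ := runtimeService.ListPodSandbox(nil)
	for _, p := range podSandboxList {
		podSandboxMap[p.Id] = p.Metadata.Uid
	}

	containerMap := containermap.NewContainerMap()
	containerList, _ := runtimeService.ListContainers(nil)
	for _, c := range containerList {
		if _, exists := podSandboxMap[c.PodSandboxId]; !exists {
			// The sandbox is gone; skip this container instead of failing
			// ContainerManager startup for the whole node.
			klog.Warningf("no PodSandbox found with Id '%s', skipping container '%s'", c.PodSandboxId, c.Id)
			continue
		}
		containerMap.Add(podSandboxMap[c.PodSandboxId], c.Metadata.Name, c.Id)
	}

	return containerMap, nil
}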
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 23, 2021
@tmmorin

tmmorin commented Apr 12, 2021

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 12, 2021
@ansilh

ansilh commented Jun 9, 2021

Same issue observed on v1.20.4.

@amitsingla

We are facing a similar issue on v1.18.19. Kubelet on some nodes goes into a restart loop with the same error. I am going to try the above solution and hope it works for us as well. This issue is blocking our whole release.

@ialidzhikov
Contributor

/kind bug
/sig node

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. sig/node Categorizes an issue or PR as relevant to SIG Node. labels Jun 22, 2021
@artur9010

Find the broken container with docker ps -a --filter "label=io.kubernetes.sandbox.id=894f35dca3eda57adef28b69acd0607efdeb34e8814e87e196bc163305576028" (the ID from the error message), remove it with docker rm <ID>, then restart kubelet.
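
Not part of this thread, but for anyone scripting that cleanup: a sketch of the same steps via the Docker Engine API Go client, assuming docker is the container runtime and using the sandbox ID from the error message above.

package main

import (
	"context"
	"fmt"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/api/types/filters"
	"github.com/docker/docker/client"
)

func main() {
	ctx := context.Background()

	// Talk to the local docker daemon using the usual environment settings.
	cli, err := client.NewClientWithOpts(client.FromEnv)
	if err != nil {
		panic(err)
	}

	// Sandbox ID taken from the kubelet error message.
	sandboxID := "894f35dca3eda57adef28b69acd0607efdeb34e8814e87e196bc163305576028"

	// Equivalent of: docker ps -a --filter "label=io.kubernetes.sandbox.id=<id>"
	containers, err := cli.ContainerList(ctx, types.ContainerListOptions{
		All:     true,
		Filters: filters.NewArgs(filters.Arg("label", "io.kubernetes.sandbox.id="+sandboxID)),
	})
	if err != nil {
		panic(err)
	}

	// Equivalent of: docker rm -f <id>; restart kubelet afterwards.
	for _, c := range containers {
		if err := cli.ContainerRemove(ctx, c.ID, types.ContainerRemoveOptions{Force: true}); err != nil {
			panic(err)
		}
		fmt.Println("removed", c.ID)
	}
}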

@YanzhaoLi
Member

Hit the same issue on Kubernetes 1.21.2. I wonder:

  • Is it expected that the PodSandbox died for an unknown reason?
  • Why is kubelet not aware of the dead PodSandbox?

@hjkatz

hjkatz commented Sep 21, 2021

We are also experiencing this issue on 1.19.10. The resolution given by @artur9010 works for us.

@amitsingla

amitsingla commented Sep 21, 2021

We created a customized kubelet image with the patch mentioned by @mikkosest, were able to upgrade our clusters to 1.18.19, and have not faced this issue in the last 3 months.

@fungusakafungus

Experiencing this issue with Kubernetes 1.21.5 and containerd; the exited container without a corresponding pod is always a calico-node container.

@TobiasDeBruijn

> Experiencing this issue with kubernetes 1.21.5 and containerd, the exited container without a corresponding pod is always a calico-node container

Experiencing the same with Cilium.

@ADustyOldMuffin

ADustyOldMuffin commented Nov 12, 2021

We got the same with a lot of containers. In our case, a node had been moved from one cluster to another and then back. I had to stop and remove essentially every container previously started on the node before kubelet finally booted.

Version 1.21.4, and the CNI is Calico.

@ehashman
Member

Hello,

This is a mirror repo and is not monitored: https://github.com/kubernetes/kubelet#where-does-it-come-from

Please file issues against https://github.com/kubernetes/kubernetes/issues

/close

@k8s-ci-robot
Contributor

@ehashman: Closing this issue.

In response to this:

Hello,

This is a mirror repo and is not monitored: https://github.com/kubernetes/kubelet#where-does-it-come-from

Please file issues against https://github.com/kubernetes/kubernetes/issues

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
