
Kubelet tries to get ContainerStatus of non-existent containers when initializing a cluster #124407

Open · SoroushTaheri opened this issue Apr 19, 2024 · 5 comments
Labels
kind/bug · kind/support · needs-triage · sig/node

Comments


SoroushTaheri commented Apr 19, 2024

What happened?

When initializing a cluster using kubeadm init --pod-network-cidr 10.112.0.0/12 --service-cidr 10.16.0.0/12 --apiserver-advertise-address 172.X.X.X --v=5, during the wait-control-plane phase kubelet is launched and expected to launch essential pods for the control plane. However, kubeadm times out waiting for kubelet to be healthy:

[kubelet-start] Starting the kubelet
I0419 13:28:21.800518   83681 waitcontrolplane.go:83] [wait-control-plane] Waiting for the API server to be healthy
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
[kubelet-check] Initial timeout of 40s passed.

Unfortunately, an error has occurred:
        timed out waiting for the condition

This error is likely caused by:
        - The kubelet is not running
        - The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)

If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
        - 'systemctl status kubelet'
        - 'journalctl -xeu kubelet'

Additionally, a control plane component may have crashed or exited when started by the container runtime.
To troubleshoot, list all containers using your preferred container runtimes CLI.
Here is one example how you may list all running Kubernetes containers by using crictl:
        - 'crictl --runtime-endpoint unix:///var/run/containerd/containerd.sock ps -a | grep kube | grep -v pause'
        Once you have found the failing container, you can inspect its logs with:
        - 'crictl --runtime-endpoint unix:///var/run/containerd/containerd.sock logs CONTAINERID'
couldn't initialize a Kubernetes cluster
...

Taking a look at the kubelet journal logs (the entries below are from after the node was registered):

Apr 19 13:28:28 avije-master kubelet[83794]: I0419 13:28:28.049050   83794 kubelet_node_status.go:76] "Successfully registered node" node="avijeh-master"
Apr 19 13:28:28 avije-master kubelet[83794]: I0419 13:28:28.267975   83794 apiserver.go:52] "Watching apiserver" 
Apr 19 13:28:28 avije-master kubelet[83794]: I0419 13:28:28.290394   83794 desired_state_of_world_populator.go:159] "Finished populating initial desired state of world"
Apr 19 13:28:28 avije-master kubelet[83794]: E0419 13:28:28.400654   83794 kubelet.go:1921] "Failed creating a mirror pod for" err="pods \"kube-apiserver-avijeh-master\" is forbidden: no PriorityClass with name system-node-critical was found" pod="kube-system/kube-apiserver-avijeh-master"
Apr 19 13:28:28 avije-master kubelet[83794]: E0419 13:28:28.527008   83794 kubelet.go:1921] "Failed creating a mirror pod for" err="pods \"kube-controller-manager-avijeh-master\" is forbidden: no PriorityClass with name system-node-critical was found" pod="kube-system/kube-controller-manager-avijeh-master"
Apr 19 13:28:28 avije-master kubelet[83794]: E0419 13:28:28.551525   83794 kubelet.go:1921] "Failed creating a mirror pod for" err="pods \"kube-scheduler-avijeh-master\" is forbidden: no PriorityClass with name system-node-critical was found" pod="kube-system/kube-scheduler-avijeh-master"
Apr 19 13:28:36 avije-master kubelet[83794]: I0419 13:28:36.666828   83794 pod_startup_latency_tracker.go:102] "Observed pod startup duration" pod="kube-system/etcd-avijeh-master" podStartSLOduration=2.6666824719999997 podStartE2EDuration="2.666682472s" podCreationTimestamp="2024-04-19 13:28:34 +0000 UTC" firstStartedPulling="0001-01-01 00:00:00 +0000 UTC" lastFinishedPulling="0001-01-01 00:00:00 +0000 UTC" observedRunningTime="2024-04-19 13:28:34.456833141 +0000 UTC m=+12.642373986" watchObservedRunningTime="2024-04-19 13:28:36.666682472 +0000 UTC m=+14.852223314"
Apr 19 13:28:38 avije-master kubelet[83794]: I0419 13:28:38.545904   83794 pod_startup_latency_tracker.go:102] "Observed pod startup duration" pod="kube-system/kube-apiserver-avijeh-master" podStartSLOduration=2.545820659 podStartE2EDuration="2.545820659s" podCreationTimestamp="2024-04-19 13:28:36 +0000 UTC" firstStartedPulling="0001-01-01 00:00:00 +0000 UTC" lastFinishedPulling="0001-01-01 00:00:00 +0000 UTC" observedRunningTime="2024-04-19 13:28:36.694198023 +0000 UTC m=+14.879738885" watchObservedRunningTime="2024-04-19 13:28:38.545820659 +0000 UTC m=+16.731361485"
Apr 19 13:28:43 avije-master kubelet[83794]: I0419 13:28:43.359153   83794 pod_startup_latency_tracker.go:102] "Observed pod startup duration" pod="kube-system/kube-controller-manager-avijeh-master" podStartSLOduration=5.358999714 podStartE2EDuration="5.358999714s" podCreationTimestamp="2024-04-19 13:28:38 +0000 UTC" firstStartedPulling="0001-01-01 00:00:00 +0000 UTC" lastFinishedPulling="0001-01-01 00:00:00 +0000 UTC" observedRunningTime="2024-04-19 13:28:43.358794059 +0000 UTC m=+21.544334918" watchObservedRunningTime="2024-04-19 13:28:43.358999714 +0000 UTC m=+21.544540543"
Apr 19 13:29:23 avije-master kubelet[83794]: E0419 13:29:23.304870   83794 remote_runtime.go:432] "ContainerStatus from runtime service failed" err="rpc error: code = NotFound desc = an error occurred when try to find container \"95685c1b9abef2e5cb5d7fc82874b1f5879244d16f12d4773ddc63b7d9e57dc2\": not found" containerID="95685c1b9abef2e5cb5d7fc82874b1f5879244d16f12d4773ddc63b7d9e57dc2"
Apr 19 13:29:23 avije-master kubelet[83794]: I0419 13:29:23.304969   83794 kuberuntime_gc.go:360] "Error getting ContainerStatus for containerID" containerID="95685c1b9abef2e5cb5d7fc82874b1f5879244d16f12d4773ddc63b7d9e57dc2" err="rpc error: code = NotFound desc = an error occurred when try to find container \"95685c1b9abef2e5cb5d7fc82874b1f5879244d16f12d4773ddc63b7d9e57dc2\": not found"
Apr 19 13:29:23 avije-master kubelet[83794]: E0419 13:29:23.306097   83794 remote_runtime.go:432] "ContainerStatus from runtime service failed" err="rpc error: code = NotFound desc = an error occurred when try to find container \"eb5090a7b6984cea307e3b89205512fc4df427d459ba77d3f4ad3c9609f710c5\": not found" containerID="eb5090a7b6984cea307e3b89205512fc4df427d459ba77d3f4ad3c9609f710c5"
Apr 19 13:29:23 avije-master kubelet[83794]: I0419 13:29:23.306185   83794 kuberuntime_gc.go:360] "Error getting ContainerStatus for containerID" containerID="eb5090a7b6984cea307e3b89205512fc4df427d459ba77d3f4ad3c9609f710c5" err="rpc error: code = NotFound desc = an error occurred when try to find container \"eb5090a7b6984cea307e3b89205512fc4df427d459ba77d3f4ad3c9609f710c5\": not found"
Apr 19 13:30:23 avije-master kubelet[83794]: E0419 13:30:23.361448   83794 kubelet_node_status.go:456] "Node not becoming ready in time after startup"
Apr 19 13:30:23 avije-master kubelet[83794]: E0419 13:30:23.420288   83794 kubelet.go:2892] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
Apr 19 13:30:28 avije-master kubelet[83794]: E0419 13:30:28.422517   83794 kubelet.go:2892] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
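For context, the "Error getting ContainerStatus for containerID" entries come from the kubelet's container garbage collector (kuberuntime_gc.go in the log lines above), which asks the runtime for the status of containers it still tracks before evicting them. The following is only a rough sketch of that pattern, not the actual kubelet source; the runtimeService interface and evictableIDs list are hypothetical stand-ins:

```go
// Rough illustration (NOT the kubelet's real code) of the GC pattern
// behind the log lines above: the GC asks the runtime for the status of
// each container it still tracks; a NotFound answer is logged but
// normally treated as "already removed", so these lines are not fatal
// on their own.
package gcsketch

import (
	"context"
	"log"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// runtimeService is a hypothetical stand-in for the kubelet's CRI client.
type runtimeService interface {
	ContainerStatus(ctx context.Context, containerID string) error
}

func evictContainers(ctx context.Context, rs runtimeService, evictableIDs []string) {
	for _, id := range evictableIDs {
		if err := rs.ContainerStatus(ctx, id); err != nil {
			if status.Code(err) == codes.NotFound {
				// The runtime has already deleted this container; the
				// kubelet's bookkeeping is just stale. Log and skip.
				log.Printf("Error getting ContainerStatus for containerID %s: %v", id, err)
				continue
			}
			log.Printf("unexpected error for container %s: %v", id, err)
		}
		// ...status available: decide whether the container can be evicted...
	}
}
```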

At this point the required containers have been started according to containerd:

$ crictl --runtime-endpoint unix:///var/run/containerd/containerd.sock ps -a
CONTAINER           IMAGE               CREATED              STATE               NAME                      ATTEMPT             POD ID              POD
cd75ded425c7e       3861cfcd7c04c       About a minute ago   Running             etcd                      82                  5d5eff6d06dac       etcd-master
251638dc7ae54       e444022412717       About a minute ago   Running             kube-scheduler            0                   463a0e09c6d7d       kube-scheduler-master
c232a00ad060c       48ad18e13fb4f       About a minute ago   Running             kube-controller-manager   0                   1da3b8f00b55b       kube-controller-manager-master
de2982e4c6a16       7ae3494460614       About a minute ago   Running             kube-apiserver            1                   42fbbede44751       kube-apiserver-master

However, the containers whose ContainerStatus kubelet is trying to get are not among the containers that containerd has created! The containerd logs confirm this as well:

Apr 19 13:28:24 avije-master containerd[80846]: time="2024-04-19T13:28:24.205106219Z" level=info msg="CreateContainer within sandbox \"5d5eff6d06dacfd8b07d5f441cf5e76697a896f8d9f111ff07e716ffe1179c5d\" for container &ContainerMetadata{Name:etcd,Attempt:82,}"
Apr 19 13:28:24 avije-master containerd[80846]: time="2024-04-19T13:28:24.242328418Z" level=info msg="CreateContainer within sandbox \"5d5eff6d06dacfd8b07d5f441cf5e76697a896f8d9f111ff07e716ffe1179c5d\" for &ContainerMetadata{Name:etcd,Attempt:82,} returns container id \"cd75ded425c7e18e40958266e8e0abb551a8a9864f2d4c57b3a4547c26255ef1\""
Apr 19 13:28:24 avije-master containerd[80846]: time="2024-04-19T13:28:24.243302417Z" level=info msg="StartContainer for \"cd75ded425c7e18e40958266e8e0abb551a8a9864f2d4c57b3a4547c26255ef1\""
Apr 19 13:28:24 avije-master containerd[80846]: time="2024-04-19T13:28:24.395299613Z" level=info msg="StartContainer for \"de2982e4c6a161b4328e1f8545b51a9e4ee8b194a88bc8854beb4d1f21ff1ff5\" returns successfully"
Apr 19 13:28:24 avije-master containerd[80846]: time="2024-04-19T13:28:24.404490886Z" level=info msg="StartContainer for \"c232a00ad060cc4ba8837a51b0a3e70bdbbcc7e8066caaba01e1943204666e47\" returns successfully"
Apr 19 13:28:24 avije-master containerd[80846]: time="2024-04-19T13:28:24.459529956Z" level=info msg="StartContainer for \"251638dc7ae54cf85a671493cfb9f28eeb724c86a44480e1da8032295e8547fc\" returns successfully"
Apr 19 13:28:24 avije-master containerd[80846]: time="2024-04-19T13:28:24.522611185Z" level=info msg="StartContainer for \"cd75ded425c7e18e40958266e8e0abb551a8a9864f2d4c57b3a4547c26255ef1\" returns successfully"
Apr 19 13:29:23 avije-master containerd[80846]: time="2024-04-19T13:29:23.304289064Z" level=error msg="ContainerStatus for \"95685c1b9abef2e5cb5d7fc82874b1f5879244d16f12d4773ddc63b7d9e57dc2\" failed" error="rpc error: code = NotFound desc = an error occurred when try to find container \"95685c1b9abef2e5cb5d7fc82874b1f5879244d16f12d4773ddc63b7d9e57dc2\": not found"
Apr 19 13:29:23 avije-master containerd[80846]: time="2024-04-19T13:29:23.305689987Z" level=error msg="ContainerStatus for \"eb5090a7b6984cea307e3b89205512fc4df427d459ba77d3f4ad3c9609f710c5\" failed" error="rpc error: code = NotFound desc = an error occurred when try to find container \"eb5090a7b6984cea307e3b89205512fc4df427d459ba77d3f4ad3c9609f710c5\": not found"
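The NotFound can also be reproduced directly against the CRI socket, outside of kubelet. A minimal Go client (a sketch assuming the same containerd socket path used in the crictl commands above and the k8s.io/cri-api v1 API; the container ID is taken from the logs):

```go
// Query containerd's CRI endpoint for one of the container IDs from the
// logs above; it should return a gRPC NotFound, matching what kubelet saw.
package main

import (
	"context"
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

func main() {
	conn, err := grpc.Dial("unix:///var/run/containerd/containerd.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	client := runtimeapi.NewRuntimeServiceClient(conn)
	_, err = client.ContainerStatus(context.Background(), &runtimeapi.ContainerStatusRequest{
		ContainerId: "95685c1b9abef2e5cb5d7fc82874b1f5879244d16f12d4773ddc63b7d9e57dc2",
	})
	// Expected: rpc error: code = NotFound ... the same error kubelet logged.
	log.Printf("ContainerStatus error: %v", err)
}
```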

This behavior results in failure when initializing the cluster.

What did you expect to happen?

For kubelet to verify the readiness of the created containers properly, and for kubeadm to see the kubelet as healthy and carry on with the rest of the initialization process.

How can we reproduce it (as minimally and precisely as possible)?

Run kubeadm init with the appropriate flags and wait for it to reach the wait-control-plane phase.
Then follow the kubelet and containerd logs:

$ journalctl -xeu kubelet -f
...
$ journalctl -xeu containerd -f
...

Also keep track of the containers containerd has created:

$ crictl --runtime-endpoint unix:///var/run/containerd/containerd.sock ps -a
...
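To watch which container IDs appear and disappear without re-running crictl, a small CRI client can poll ListContainers on the same socket (a sketch under the same assumptions as the snippet above):

```go
// Poll the CRI ListContainers call and print ID/name/state, roughly what
// `crictl ps -a` shows, so container IDs that vanish are easy to spot.
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

func main() {
	conn, err := grpc.Dial("unix:///var/run/containerd/containerd.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	client := runtimeapi.NewRuntimeServiceClient(conn)

	for {
		resp, err := client.ListContainers(context.Background(), &runtimeapi.ListContainersRequest{})
		if err != nil {
			log.Fatal(err)
		}
		for _, c := range resp.Containers {
			// Container IDs are 64 hex chars; print the short form like crictl.
			log.Printf("%s  %s  %s", c.Id[:13], c.Metadata.Name, c.State)
		}
		time.Sleep(5 * time.Second)
	}
}
```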

Anything else we need to know?

In case of concerns:

  • swap is off.
  • SystemdCgroup is set to true in /etc/containerd/config.toml (see the fragment after this list).
  • SELinux is turned off using setenforce 0.
  • runc version: 1.1.12
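For reference, the SystemdCgroup setting mentioned above lives under the runc runtime options of containerd's CRI plugin. A fragment of what that section typically looks like in /etc/containerd/config.toml for containerd 1.x (verify the plugin key against your containerd version):

```toml
# /etc/containerd/config.toml (fragment)
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
    SystemdCgroup = true
```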

Kubernetes version

$ kubectl version
Client Version: v1.29.4
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
The connection to the server localhost:8080 was refused - did you specify the right host or port?

Cloud provider

Self-hosted

OS version

# On Linux:
$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.2 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.2 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
$ uname -a
Linux master 5.15.0-102-generic #112-Ubuntu SMP Tue Mar 5 16:50:32 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Install tools

Container runtime (CRI) and version (if applicable)

containerd
$ containerd -v
containerd github.com/containerd/containerd v1.7.15 926c9586fe4a6236699318391cd44976a98e31f1

Related plugins (CNI, CSI, ...) and versions (if applicable)

No CSI/CNI is installed.
@SoroushTaheri added the kind/bug label on Apr 19, 2024
@k8s-ci-robot added the needs-sig and needs-triage labels on Apr 19, 2024
@k8s-ci-robot (Contributor)

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@SoroushTaheri (Author)

/sig node

@k8s-ci-robot added the sig/node label and removed the needs-sig label on Apr 19, 2024
@SergeyKanzhelev added this to Triage in SIG Node Bugs on Apr 24, 2024
@AnishShah (Contributor)

/sig cluster-lifecycle
/remove-sig node

Someone from sig-cluster-lifecycle can triage this first.

@k8s-ci-robot added the sig/cluster-lifecycle label and removed the sig/node label on Apr 24, 2024
@AnishShah removed this from Triage in SIG Node Bugs on May 1, 2024
@neolit123 (Member) commented May 3, 2024

However, the containers whose ContainerStatus kubelet is trying to get are not among the containers that containerd has created!
containerd logs confirm this as well:

Seems like something is not right in the kubelet / containerd.
At that point kubeadm just waits for the kubelet to report 200 on its /healthz endpoint. If it fails to report 200, kubeadm will exit with an error.
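For illustration, that wait is essentially an HTTP poll. A minimal sketch of the idea (not kubeadm's actual source; 10248 is the kubelet's default healthz port, and the timeout mirrors the 4m0s mentioned in the output above):

```go
// Sketch of the kind of wait loop kubeadm performs: poll the kubelet's
// local healthz endpoint until it returns HTTP 200 or the timeout expires.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func waitForKubeletHealthy(url string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		resp, err := http.Get(url)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil // kubelet reported healthy
			}
		}
		time.Sleep(time.Second)
	}
	return fmt.Errorf("timed out waiting for %s to return 200", url)
}

func main() {
	// 10248 is the kubelet's default healthz port.
	if err := waitForKubeletHealthy("http://127.0.0.1:10248/healthz", 4*time.Minute); err != nil {
		fmt.Println(err)
	}
}
```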

/remove-sig cluster-lifecycle
/sig node
/kind support

@k8s-ci-robot added the sig/node and kind/support labels and removed the sig/cluster-lifecycle label on May 3, 2024
@SergeyKanzhelev added this to Triage in SIG Node Bugs on May 22, 2024
@haircommander moved this from Triage to Needs Information in SIG Node Bugs on May 29, 2024
@earimon commented Jun 4, 2024

I am using Kubernetes 1.26.1 with CRI-O 1.26.4 and see the same behavior. It does not seem to depend on the Kubernetes version; something else is causing this issue.
