App container is started before initContainers could complete on node reboot #120247

Closed
shivamerla opened this issue Aug 30, 2023 · 18 comments · Fixed by #120281
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/regression Categorizes issue or PR as related to a regression from a prior release. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on.


@shivamerla

What happened?

On node reboot, we see that the app container is started before the initContainers have completed; essentially the app and init containers are running in parallel. We are using initContainers to ensure that dependent drivers are loaded by another pod before the app container starts, but because the app container starts in parallel, it hits a failure. This was not seen with k8s 1.27 and is only happening with 1.28. Also, this happens only on node reboot, not during a fresh install of our application (gpu-operator)

root@co-node-2-126:~# kubectl  describe pod nvidia-driver-daemonset-svkhz -n gpu-operator
Name:                 nvidia-driver-daemonset-svkhz
Namespace:            gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      nvidia-driver
Node:                 co-node-8/10.213.6.120
Start Time:           Tue, 29 Aug 2023 14:18:59 +0000
Labels:               app=nvidia-driver-daemonset
                      app.kubernetes.io/managed-by=gpu-operator
                      controller-revision-hash=7c4c76cdf8
                      helm.sh/chart=gpu-operator-v23.6.0
                      nvidia.com/precompiled=false
                      pod-template-generation=1
Annotations:          cni.projectcalico.org/containerID: ef20538089b848b06a67dcc865444b487644a2945cd07a9908becfe16b163d3e
                      cni.projectcalico.org/podIP: 192.168.177.14/32
                      cni.projectcalico.org/podIPs: 192.168.177.14/32
                      k8s.v1.cni.cncf.io/network-status:
                        [{
                            "name": "k8s-pod-network",
                            "ips": [
                                "192.168.177.14"
                            ],
                            "default": true,
                            "dns": {}
                        }]
                      k8s.v1.cni.cncf.io/networks-status:
                        [{
                            "name": "k8s-pod-network",
                            "ips": [
                                "192.168.177.14"
                            ],
                            "default": true,
                            "dns": {}
                        }]
                      kubectl.kubernetes.io/default-container: nvidia-driver-ctr
Status:               Running
IP:                   192.168.177.14
IPs:
  IP:           192.168.177.14
Controlled By:  DaemonSet/nvidia-driver-daemonset
Init Containers:
  mofed-validation:
    Container ID:  containerd://ae68718e709138b1737ed01ca82c8489c13990870545c90919dbc08044bbfef9
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.6.0
    Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:b65eb649188193f39e169af5650acfc7aa3cc32d2328630118702f04fdc4afc1
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Running
      Started:      Tue, 29 Aug 2023 19:54:23 +0000
    Ready:          False
    Restart Count:  1
    Environment:
      WITH_WAIT:                true
      COMPONENT:                mofed
      NODE_NAME:                 (v1:spec.nodeName)
      NVIDIA_VISIBLE_DEVICES:   void
      GPU_DIRECT_RDMA_ENABLED:  true
    Mounts:
      /run/mellanox/drivers from run-mellanox-drivers (rw)
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qdtbk (ro)
  k8s-driver-manager:
    Container ID:  containerd://931cee0081f15ff4a1b5d7db9b6d669ad9304f0975b4c48406c6f9ea3c1b9f54
    Image:         nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.2
    Image ID:      nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:bb845160b32fd12eb3fae3e830d2e6a7780bc7405e0d8c5b816242d48be9daa8
    Port:          <none>
    Host Port:     <none>
    Command:
      driver-manager
    Args:
      uninstall_driver
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 29 Aug 2023 14:26:56 +0000
      Finished:     Tue, 29 Aug 2023 14:27:00 +0000
    Ready:          True
    Restart Count:  0
    Environment:
      NODE_NAME:                    (v1:spec.nodeName)
      NVIDIA_VISIBLE_DEVICES:      void
      ENABLE_GPU_POD_EVICTION:     true
      ENABLE_AUTO_DRAIN:           false
      DRAIN_USE_FORCE:             false
      DRAIN_POD_SELECTOR_LABEL:    
      DRAIN_TIMEOUT_SECONDS:       0s
      DRAIN_DELETE_EMPTYDIR_DATA:  false
      OPERATOR_NAMESPACE:          gpu-operator (v1:metadata.namespace)
    Mounts:
      /host from host-root (ro)
      /run/nvidia from run-nvidia (rw)
      /sys from host-sys (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qdtbk (ro)
Containers:
  nvidia-driver-ctr:
    Container ID:  containerd://bfbfbbe496f5691030d74ec4c3730adbc54e031b6ea47b87811923a8933e2786
    Image:         nvcr.io/nvidia/driver:535.86.10-ubuntu22.04
    Image ID:      nvcr.io/nvidia/driver@sha256:9a4e9b8b87e9f39450d73de07c37d6789700f4ac58fedf1f5d83f6219e5c028d
    Port:          <none>
    Host Port:     <none>
    Command:
      nvidia-driver
    Args:
      init
    State:          Running
      Started:      Tue, 29 Aug 2023 19:54:24 +0000
    Last State:     Terminated
      Reason:       Unknown
      Exit Code:    255
      Started:      Tue, 29 Aug 2023 14:27:00 +0000
      Finished:     Tue, 29 Aug 2023 19:53:40 +0000
    Ready:          True
    Restart Count:  1
    Startup:        exec [sh -c nvidia-smi && touch /run/nvidia/validations/.driver-ctr-ready] delay=60s timeout=60s period=10s #success=1 #failure=120
    Environment:
      GPU_DIRECT_RDMA_ENABLED:  true
    Mounts:
      /dev/log from dev-log (rw)
      /host-etc/os-release from host-os-release (ro)
      /run/mellanox/drivers from run-mellanox-drivers (rw)
      /run/mellanox/drivers/usr/src from mlnx-ofed-usr-src (rw)
      /run/nvidia from run-nvidia (rw)
      /run/nvidia-topologyd from run-nvidia-topologyd (rw)
      /var/log from var-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qdtbk (ro)
  nvidia-peermem-ctr:
    Container ID:  containerd://38e913bb4c93285b173c03e6594a57e9b78da2f5b67a16474ecc0b902179d876
    Image:         nvcr.io/nvidia/driver:535.86.10-ubuntu22.04
    Image ID:      nvcr.io/nvidia/driver@sha256:9a4e9b8b87e9f39450d73de07c37d6789700f4ac58fedf1f5d83f6219e5c028d
    Port:          <none>
    Host Port:     <none>
    Command:
      nvidia-driver
    Args:
      reload_nvidia_peermem
    State:          Running
      Started:      Tue, 29 Aug 2023 19:54:24 +0000
    Last State:     Terminated
      Reason:       Unknown
      Exit Code:    255
      Started:      Tue, 29 Aug 2023 14:27:00 +0000
      Finished:     Tue, 29 Aug 2023 19:53:40 +0000
    Ready:          True
    Restart Count:  1
    Liveness:       exec [sh -c nvidia-driver probe_nvidia_peermem] delay=30s timeout=10s period=30s #success=1 #failure=1
    Startup:        exec [sh -c nvidia-driver probe_nvidia_peermem] delay=10s timeout=10s period=10s #success=1 #failure=120
    Environment:    <none>
    Mounts:
      /dev/log from dev-log (ro)
      /run/mellanox/drivers from run-mellanox-drivers (rw)
      /run/nvidia from run-nvidia (rw)
      /var/log from var-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qdtbk (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  run-nvidia:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia
    HostPathType:  DirectoryOrCreate
  var-log:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log
    HostPathType:  
  dev-log:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/log
    HostPathType:  
  host-os-release:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/os-release
    HostPathType:  
  run-nvidia-topologyd:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia-topologyd
    HostPathType:  DirectoryOrCreate
  mlnx-ofed-usr-src:
    Type:          HostPath (bare host directory volume)
    Path:          /run/mellanox/drivers/usr/src
    HostPathType:  DirectoryOrCreate
  run-mellanox-drivers:
    Type:          HostPath (bare host directory volume)
    Path:          /run/mellanox/drivers
    HostPathType:  DirectoryOrCreate
  run-nvidia-validations:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/validations
    HostPathType:  DirectoryOrCreate
  host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  
  host-sys:
    Type:          HostPath (bare host directory volume)
    Path:          /sys
    HostPathType:  Directory
  kube-api-access-qdtbk:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              nvidia.com/gpu.deploy.driver=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason          Age                From             Message
  ----     ------          ----               ----             -------
  Warning  NodeNotReady    20m                node-controller  Node is not ready
  Normal   SandboxChanged  17m                kubelet          Pod sandbox changed, it will be killed and re-created.
  Normal   AddedInterface  16m                multus           Add eth0 [192.168.177.14/32] from k8s-pod-network
  Normal   Pulled          16m                kubelet          Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.6.0" already present on machine
  Normal   Created         16m                kubelet          Created container mofed-validation
  Normal   Started         16m                kubelet          Started container mofed-validation
  Normal   Pulled          16m                kubelet          Container image "nvcr.io/nvidia/driver:535.86.10-ubuntu22.04" already present on machine
  Normal   Created         16m                kubelet          Created container nvidia-driver-ctr
  Normal   Started         16m                kubelet          Started container nvidia-driver-ctr
  Normal   Pulled          16m                kubelet          Container image "nvcr.io/nvidia/driver:535.86.10-ubuntu22.04" already present on machine
  Normal   Created         16m                kubelet          Created container nvidia-peermem-ctr
  Normal   Started         16m                kubelet          Started container nvidia-peermem-ctr
  Warning  Unhealthy       15m (x4 over 16m)  kubelet          Startup probe failed: DRIVER_ARCH is x86_64

Above, we can see that even though the init container mofed-validation is still running, the main container nvidia-driver-ctr has already started. The pod status also shows 2/2 main containers ready while the initContainer is still running.

root@co-node-2-126:~# kubectl  get pods -n gpu-operator --field-selector=spec.nodeName=co-node-8 
NAME                                               READY   STATUS     RESTARTS      AGE
gpu-feature-discovery-t5njq                        0/1     Init:0/1   0             9m6s
gpu-operator-node-feature-discovery-worker-4tkdj   1/1     Running    1 (11m ago)   5h46m
nvidia-container-toolkit-daemonset-sflgw           1/1     Running    1 (11m ago)   5h38m
nvidia-dcgm-exporter-2mnxt                         0/1     Init:0/1   0             9m6s
nvidia-device-plugin-daemonset-s6pw9               0/1     Init:0/1   0             9m6s
nvidia-driver-daemonset-svkhz                      2/2     Init:0/2   2 (11m ago)   5h46m
nvidia-mig-manager-4mbsv                           1/1     Init:0/1   1 (11m ago)   5h38m
nvidia-operator-validator-ktcxj                    1/1     Running    1 (11m ago)   5h38m

Another issue we are seeing is with pod nvidia-operator-validator-ktcxj, which has 4 init containers: only 1 of them ran (driver-validation) and the rest were skipped, while the app container again started in parallel. The start/finish timestamps of these initContainers below show this clearly.

root@co-node-2-126:~# kubectl  describe pod nvidia-operator-validator-ktcxj -n gpu-operator
Name:                 nvidia-operator-validator-ktcxj
Namespace:            gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Runtime Class Name:   nvidia
Service Account:      nvidia-operator-validator
Node:                 co-node-8/10.213.6.120
Start Time:           Tue, 29 Aug 2023 14:27:00 +0000
Labels:               app=nvidia-operator-validator
                      app.kubernetes.io/managed-by=gpu-operator
                      app.kubernetes.io/part-of=gpu-operator
                      controller-revision-hash=55899446bb
                      helm.sh/chart=gpu-operator-v23.6.0
                      pod-template-generation=1
Annotations:          cni.projectcalico.org/containerID: 02a1f15958fdbf1a97b0b5e6d2c7bd9b595cf721585b4bb10e63363e33a9772b
                      cni.projectcalico.org/podIP: 192.168.177.13/32
                      cni.projectcalico.org/podIPs: 192.168.177.13/32
                      k8s.v1.cni.cncf.io/network-status:
                        [{
                            "name": "k8s-pod-network",
                            "ips": [
                                "192.168.177.13"
                            ],
                            "default": true,
                            "dns": {}
                        }]
                      k8s.v1.cni.cncf.io/networks-status:
                        [{
                            "name": "k8s-pod-network",
                            "ips": [
                                "192.168.177.13"
                            ],
                            "default": true,
                            "dns": {}
                        }]
Status:               Running
IP:                   192.168.177.13
IPs:
  IP:           192.168.177.13
Controlled By:  DaemonSet/nvidia-operator-validator
Init Containers:
  driver-validation:
    Container ID:  containerd://064dc1a47d796959b355435da27924de9f1b41ed42b65046fbe4dc29cad8af5a
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.6.0
    Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:b65eb649188193f39e169af5650acfc7aa3cc32d2328630118702f04fdc4afc1
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 29 Aug 2023 19:54:37 +0000
      Finished:     Tue, 29 Aug 2023 19:55:32 +0000
    Ready:          True
    Restart Count:  0
    Environment:
      WITH_WAIT:  true
      COMPONENT:  driver
    Mounts:
      /host from host-root (ro)
      /host-dev-char from host-dev-char (rw)
      /run/nvidia/driver from driver-install-path (rw)
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ljstf (ro)
  toolkit-validation:
    Container ID:  containerd://02c2eca7b6fd1f9f28e0d02d41db122aa267f1cc4ccbebfe1d9293caf3869aa7
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.6.0
    Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:b65eb649188193f39e169af5650acfc7aa3cc32d2328630118702f04fdc4afc1
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 29 Aug 2023 14:28:22 +0000
      Finished:     Tue, 29 Aug 2023 14:28:22 +0000
    Ready:          True
    Restart Count:  0
    Environment:
      NVIDIA_VISIBLE_DEVICES:  all
      WITH_WAIT:               false
      COMPONENT:               toolkit
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ljstf (ro)
  cuda-validation:
    Container ID:  containerd://cc01020476d5cef14ec21b8310f09b4d13acf0f2bd2ee0f59d4db8243a515605
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.6.0
    Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:b65eb649188193f39e169af5650acfc7aa3cc32d2328630118702f04fdc4afc1
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 29 Aug 2023 14:28:23 +0000
      Finished:     Tue, 29 Aug 2023 14:28:28 +0000
    Ready:          True
    Restart Count:  0
    Environment:
      WITH_WAIT:                    false
      COMPONENT:                    cuda
      NODE_NAME:                     (v1:spec.nodeName)
      OPERATOR_NAMESPACE:           gpu-operator (v1:metadata.namespace)
      VALIDATOR_IMAGE:              nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.6.0
      VALIDATOR_IMAGE_PULL_POLICY:  IfNotPresent
      VALIDATOR_RUNTIME_CLASS:      nvidia
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ljstf (ro)
  plugin-validation:
    Container ID:  containerd://61aad57b2a07f9a5acd7ed80beba9ea8be384fb45ec2e4ac6bf57107f7048609
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.6.0
    Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:b65eb649188193f39e169af5650acfc7aa3cc32d2328630118702f04fdc4afc1
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 29 Aug 2023 14:28:29 +0000
      Finished:     Tue, 29 Aug 2023 14:28:34 +0000
    Ready:          True
    Restart Count:  0
    Environment:
      COMPONENT:                    plugin
      WITH_WAIT:                    false
      WITH_WORKLOAD:                false
      MIG_STRATEGY:                 single
      NODE_NAME:                     (v1:spec.nodeName)
      OPERATOR_NAMESPACE:           gpu-operator (v1:metadata.namespace)
      VALIDATOR_IMAGE:              nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.6.0
      VALIDATOR_IMAGE_PULL_POLICY:  IfNotPresent
      VALIDATOR_RUNTIME_CLASS:      nvidia
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ljstf (ro)
Containers:
  nvidia-operator-validator:
    Container ID:  containerd://9cf87ec82ff7a8876864e4dceaada2f023655176c556f1a8a99d4ee86d17c229
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.6.0
    Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:b65eb649188193f39e169af5650acfc7aa3cc32d2328630118702f04fdc4afc1
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      echo all validations are successful; sleep infinity
    State:          Running
      Started:      Tue, 29 Aug 2023 19:54:37 +0000
    Last State:     Terminated
      Reason:       Unknown
      Exit Code:    255
      Started:      Tue, 29 Aug 2023 14:28:35 +0000
      Finished:     Tue, 29 Aug 2023 19:53:40 +0000
    Ready:          True
    Restart Count:  1
    Environment:    <none>
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ljstf (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  run-nvidia-validations:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/validations
    HostPathType:  DirectoryOrCreate
  driver-install-path:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/driver
    HostPathType:  
  host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  
  host-dev-char:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/char
    HostPathType:  
  kube-api-access-ljstf:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              nvidia.com/gpu.deploy.operator-validator=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:                      <none>

What did you expect to happen?

On node reboot, initContainers should complete in order before the app containers start. Also, all initContainers should run, rather than being skipped, before the app container starts.

How can we reproduce it (as minimally and precisely as possible)?

Create a pod with multiple initContainers (which wait for about 2-3 minutes) and an app container. On node reboot, the app container is started in parallel with the initContainers.
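
For example, a minimal pod along these lines (names and sleep durations here are just illustrative; full manifests are in the comments below):

apiVersion: v1
kind: Pod
metadata:
  name: init-order-test
spec:
  containers:
  - name: app
    image: busybox:1.28
    command: ['sh', '-c', 'echo the app is running && date && sleep 600']
  initContainers:
  - name: init-1
    image: busybox:1.28
    command: ['sh', '-c', 'echo init-1 starting && date; sleep 150; echo init-1 done && date']
  - name: init-2
    image: busybox:1.28
    command: ['sh', '-c', 'echo init-2 starting && date; sleep 150; echo init-2 done && date']

After rebooting the node hosting the pod, compare the container start timestamps (kubectl describe pod init-order-test): with 1.28 the app container starts while init-1 is still sleeping, and init-2 may be skipped entirely.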

Anything else we need to know?

No response

Kubernetes version

root@co-node-2-126:~# kubectl  version
Client Version: v1.28.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.0

Cloud provider

OS version

root@co-node-8:~# cat /etc/issue
Ubuntu 22.04.2 LTS \n \l

root@co-node-8:~# uname -a
Linux co-node-8 5.15.0-79-generic #86-Ubuntu SMP Mon Jul 10 16:07:21 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Install tools

Container runtime (CRI) and version (if applicable)

containerd

root@co-node-8:~# ctr version
Client:
Version: v1.7.2
Revision: 0cae528dd6cb557f7201036e9f43420650207b58
Go version: go1.20.4

Server:
Version: v1.7.2
Revision: 0cae528dd6cb557f7201036e9f43420650207b58
UUID: f120b710-d067-4968-bb0a-6532b6918467

Related plugins (CNI, CSI, ...) and versions (if applicable)

@shivamerla shivamerla added the kind/bug Categorizes issue or PR as related to a bug. label Aug 30, 2023
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 30, 2023
@klueska
Contributor

klueska commented Aug 30, 2023

/sig node
/triage accepted

NOTE: This occurs without enabling the new SidecarFeatureGate

Glancing through the code that was added for the sidecar feature, I wouldn't be surprised if something got broken in the normal initContainer flow. Very little of the modified code is guarded by the SidecarFeatureGate. Normally all code changes should be guarded by the feature gate, not just the code related to the new API being exposed. In my opinion, the code should be refactored with the proper guardrails in place to avoid going down any new code paths unless the SidecarFeatureGate is enabled.

Here is the PR that introduced these changes:
#116429

/cc @mrunalp @gjkim42

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 30, 2023
@gjkim42
Member

gjkim42 commented Aug 30, 2023

Glancing through the code that was added for the sidecar feature, I wouldn't be surprised if something got broken in the normal initContainer flow. Very little of the modified code is guarded by the SidecarFeatureGate. Normally all code changes should be guarded by the feature gate, not just the code related to the new API being exposed. In my opinion, the code should be refactored with the proper guardrails in place to avoid going down any new code paths unless the SidecarFeatureGate is enabled.

#116429 (comment)

There have been some discussions about this, and we realized that it is practically difficult to guard all of the new code paths.
I'll investigate it later today and find a better way to guard the new code path.

/assign
cc @SergeyKanzhelev

@pacoxu
Member

pacoxu commented Aug 30, 2023

Does this sound like a regression? If so, the fix should be backported to v1.28.x.
/kind regression

@k8s-ci-robot k8s-ci-robot added the kind/regression Categorizes issue or PR as related to a regression from a prior release. label Aug 30, 2023
@aojea
Member

aojea commented Aug 30, 2023

I cannot reproduce this behavior at head; the init containers correctly run before the app container.

apiVersion: v1
kind: Pod
metadata:
  name: myapp
  labels:
    app.kubernetes.io/name: MyApp
spec:
  restartPolicy: Never
  containers:
  - name: myapp
    image: busybox:1.28
    command: ['sh', '-c', 'echo The app is running! && date && exit 0']
  initContainers:
  - name: init
    image: busybox:1.28
    command: ['sh', '-c', "echo starting && date ; sleep 15; echo finishing starting after 15 sec && date"]
$ kubectl logs myapp -c init
starting
Wed Aug 30 10:22:02 UTC 2023
finishing starting after 15 sec
Wed Aug 30 10:22:17 UTC 2023
$ kubectl logs myapp -c myapp
The app is running!
Wed Aug 30 10:22:18 UTC 2023

@aojea
Member

aojea commented Aug 30, 2023

I cannot reproduce this behavior at head; the init containers correctly run before the app container.

Also, this happens only on node reboot, not during a fresh install of our application (gpu-operator)

@gjkim42 @shivamerla can we try to find the root cause first?

@shivamerla
Author

shivamerla commented Aug 30, 2023

@aojea I am hitting this with pod restartPolicy: Always, which is the default.

apiVersion: v1
kind: Pod
metadata:
  name: myapp
  labels:
    app.kubernetes.io/name: MyApp
spec:
  containers:
  - name: myapp
    image: busybox:1.28
    command: ['sh', '-c', 'echo The app is running! && date && sleep 600']
  initContainers:
  - name: init
    image: busybox:1.28
    command: ['sh', '-c', "echo starting && date ; sleep 150; echo finishing starting after 150 sec && date"]

On node reboot (we can see the app container started before the init container could finish):

root@co-node-2-126:~# kubectl logs myapp -c init
starting
Wed Aug 30 12:04:42 UTC 2023
finishing starting after 150 sec
Wed Aug 30 12:07:12 UTC 2023

root@co-node-2-126:~# kubectl logs myapp
Defaulted container "myapp" out of: myapp, init (init)
The app is running!
Wed Aug 30 12:04:43 UTC 2023

root@co-node-2-126:~# kubectl describe pod myapp
Name:             myapp
Namespace:        default
Priority:         0
Service Account:  default
Node:             co-node-8/10.213.6.120
Start Time:       Wed, 30 Aug 2023 11:56:10 +0000
Labels:           app.kubernetes.io/name=MyApp
Annotations:      cni.projectcalico.org/containerID: f874f3103c1d09cbf0cc9039f684caf20b00468946228d9477d835a1634652e8
                  cni.projectcalico.org/podIP: 192.168.177.30/32
                  cni.projectcalico.org/podIPs: 192.168.177.30/32
                  k8s.v1.cni.cncf.io/network-status:
                    [{
                        "name": "k8s-pod-network",
                        "ips": [
                            "192.168.177.30"
                        ],
                        "default": true,
                        "dns": {}
                    }]
                  k8s.v1.cni.cncf.io/networks-status:
                    [{
                        "name": "k8s-pod-network",
                        "ips": [
                            "192.168.177.30"
                        ],
                        "default": true,
                        "dns": {}
                    }]
Status:           Running
IP:               192.168.177.30
IPs:
  IP:  192.168.177.30
Init Containers:
  init:
    Container ID:  containerd://f62f5e2508df7084290c45f75cf02d9c5f8de15ea95f815e7bf67782eee4c06d
    Image:         busybox:1.28
    Image ID:      docker.io/library/busybox@sha256:141c253bc4c3fd0a201d32dc1f493bcf3fff003b6df416dea4f41046e0f37d47
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
      echo starting && date ; sleep 150; echo finishing starting after 150 sec && date
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 30 Aug 2023 12:04:42 +0000
      Finished:     Wed, 30 Aug 2023 12:07:12 +0000
    Ready:          True
    Restart Count:  1
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jg8w7 (ro)
Containers:
  myapp:
    Container ID:  containerd://8ce9fcf8ce7134dc4cfca131101a17e2fb9cc4c2ff576b49a24a5892d23ae46b
    Image:         busybox:1.28
    Image ID:      docker.io/library/busybox@sha256:141c253bc4c3fd0a201d32dc1f493bcf3fff003b6df416dea4f41046e0f37d47
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
      echo The app is running! && date && sleep 600
    State:          Running
      Started:      Wed, 30 Aug 2023 12:04:43 +0000
    Last State:     Terminated
      Reason:       Unknown
      Exit Code:    255
      Started:      Wed, 30 Aug 2023 11:58:47 +0000
      Finished:     Wed, 30 Aug 2023 12:03:13 +0000
    Ready:          True
    Restart Count:  1
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jg8w7 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  kube-api-access-jg8w7:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason          Age    From               Message
  ----    ------          ----   ----               -------
  Normal  Scheduled       13m    default-scheduler  Successfully assigned default/myapp to co-node-8
  Normal  AddedInterface  12m    multus             Add eth0 [192.168.177.28/32] from k8s-pod-network
  Normal  Pulled          12m    kubelet            Container image "busybox:1.28" already present on machine
  Normal  Created         12m    kubelet            Created container init
  Normal  Started         12m    kubelet            Started container init
  Normal  Pulled          10m    kubelet            Container image "busybox:1.28" already present on machine
  Normal  Created         10m    kubelet            Created container myapp
  Normal  Started         10m    kubelet            Started container myapp
  Normal  SandboxChanged  5m43s  kubelet            Pod sandbox changed, it will be killed and re-created.
  Normal  AddedInterface  4m33s  multus             Add eth0 [192.168.177.30/32] from k8s-pod-network
  Normal  Pulled          4m31s  kubelet            Container image "busybox:1.28" already present on machine
  Normal  Created         4m31s  kubelet            Created container init
  Normal  Started         4m31s  kubelet            Started container init
  Normal  Pulled          4m30s  kubelet            Container image "busybox:1.28" already present on machine
  Normal  Created         4m30s  kubelet            Created container myapp
  Normal  Started         4m30s  kubelet            Started container myapp

@gjkim42
Member

gjkim42 commented Aug 30, 2023

I reproduced it with your scenario.

/priority important-soon

@k8s-ci-robot k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Aug 30, 2023
@liggitt
Member

liggitt commented Aug 30, 2023

cc @bobbypage @smarterclayton for visibility to kubelet correctness

@gjkim42
Member

gjkim42 commented Aug 30, 2023

I guess this happens when the pod sandbox is changed.

We can reproduce it easily with:

crictl stopp POD_ID
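
e.g. on the node (the pod name here is illustrative):

# find the sandbox ID of the pod to restart
POD_ID=$(crictl pods --name myapp -q)
# stop the sandbox; the kubelet will recreate it, just as it does after a node reboot
crictl stopp "$POD_ID"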

I'll propose a PR containing the e2e scenario and the fix for it.

@gjkim42
Member

gjkim42 commented Aug 30, 2023

Another issue we are seeing is with pod nvidia-operator-validator-ktcxj, which has 4 init containers: only 1 of them ran (driver-validation) and the rest were skipped, while the app container again started in parallel. The start/finish timestamps of these initContainers below show this clearly.

I'll cover this as well.

@liggitt
Member

liggitt commented Aug 30, 2023

Once we have a solid reproducer, check whether it reproduces with 1.27 or not, so we know if this is a regression in 1.28 or just a long-standing issue.

@gjkim42
Member

gjkim42 commented Aug 30, 2023

Confirmed that this is NOT reproducible in 1.27. It is a regression.

@gjkim42
Member

gjkim42 commented Aug 30, 2023

// If any of the main containers have status and are Running, then all init containers must
// have been executed at some point in the past. However, they could have been removed
// from the container runtime now, and if we proceed, it would appear as if they
// never ran and will re-execute improperly except for the restartable init containers.
podHasInitialized := hasAnyRegularContainerCreated(pod, podStatus)

This happens because the kubelet uses the old container statuses to check whether the pod has already been initialized.

Before 1.28, the kubelet considers the pod initialized if any regular container is running.
With 1.28, however, the kubelet considers the pod initialized if any regular container is running OR has EXITED.

I think the latter is theoretically right, but currently we cannot know whether the container statuses are from the current pod sandbox or from an old one.

The fix would be either:

  1. check whether the container statuses are outdated (not sure how to do this yet...), or
  2. (workaround) revert the pod initialization check: "any regular container running or exited" => "any regular container running" (see the sketch below)
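
To illustrate the difference between the two checks, a simplified sketch (not the actual kubelet code; the real check operates on the runtime-reported statuses in pkg/kubelet/kuberuntime):

// Simplified stand-in for the runtime-reported state of a regular
// (non-init) container.
type containerStatus struct {
	name  string
	state string // "created", "running", "exited", ...
}

// Pre-1.28 behavior: the pod only counts as initialized if a regular
// container is currently running.
func hasAnyRegularContainerRunning(statuses []containerStatus) bool {
	for _, s := range statuses {
		if s.state == "running" {
			return true
		}
	}
	return false
}

// 1.28 behavior: a regular container that has ever been created (running
// or already exited) also counts. After a sandbox change (node reboot,
// crictl stopp), the exited containers from the old sandbox make the pod
// look initialized, so the remaining init containers are skipped and the
// app containers are started immediately.
func hasAnyRegularContainerCreated(statuses []containerStatus) bool {
	for _, s := range statuses {
		switch s.state {
		case "created", "running", "exited":
			return true
		}
	}
	return false
}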

@liggitt
Member

liggitt commented Aug 30, 2023

That looks to have changed as part of the sidecar work - there were a lot of changes to pkg/kubelet/kuberuntime/kuberuntime_container.go in https://github.com/kubernetes/kubernetes/pull/116429/files#diff-d86ffc24d8e4c9deafd3e39e8f4ee3546ee46a31dcea8ea6526443b3f4bffa28 that are not obviously feature gated... were all of those supposed to be no-op refactors?

@liggitt
Member

liggitt commented Aug 30, 2023

cc @thockin for visibility to kubelet regressions in 1.28 related to sidecar refactors

@liggitt
Member

liggitt commented Aug 30, 2023

The fix would be,

  1. check if the container statuses are outdated (not sure how to do this yet...)
  2. (workaround) revert pod initialization check: there is any regular container running or exited => there is any regular container running

A starting point that would build confidence would be to restore the kubelet logic here to the 1.27 state before the sidecar merge (reviewing diff(release-1.27, Fix PR) should make it obvious that 1.27 behavior has been restored).

The refactored / new methods could be kept in a different file and only called when the feature gate was enabled. That would restore 1.28 to match 1.27 behavior by default as quickly as possible.
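
i.e. something with this shape at the call sites (hasAnyRegularContainerRunning is a hypothetical name for the restored 1.27 check; the gate check uses the existing utilfeature.DefaultFeatureGate and features.SidecarContainers helpers, which is the gate referred to above as the SidecarFeatureGate):

if utilfeature.DefaultFeatureGate.Enabled(features.SidecarContainers) {
	// new sidecar-aware check (current 1.28 behavior)
	podHasInitialized = hasAnyRegularContainerCreated(pod, podStatus)
} else {
	// restored release-1.27 behavior, kept in its own file
	podHasInitialized = hasAnyRegularContainerRunning(pod, podStatus)
}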

@liggitt liggitt added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Aug 30, 2023
@bobbypage
Member

/cc

@ffromani
Contributor

ffromani commented Sep 5, 2023

/cc

SIG Node Bugs automation moved this from Triaged to Done Sep 6, 2023