App container is started before initContainers could complete on node reboot #120247

Closed
shivamerla opened this issue Aug 30, 2023 · 18 comments · Fixed by #120281
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/regression Categorizes issue or PR as related to a regression from a prior release. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on.


@shivamerla

What happened?

On node reboot, we see that the app container is started before the initContainers have completed; essentially the app and init containers are running in parallel. We are using initContainers to ensure that dependent drivers are loaded by another pod before the app container starts, but because the app container starts in parallel, it hits a failure. This was not seen with k8s 1.27 and is only happening with 1.28. Also, this happens only on node reboot, not during a fresh install of our application (gpu-operator)

root@co-node-2-126:~# kubectl  describe pod nvidia-driver-daemonset-svkhz -n gpu-operator
Name:                 nvidia-driver-daemonset-svkhz
Namespace:            gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      nvidia-driver
Node:                 co-node-8/10.213.6.120
Start Time:           Tue, 29 Aug 2023 14:18:59 +0000
Labels:               app=nvidia-driver-daemonset
                      app.kubernetes.io/managed-by=gpu-operator
                      controller-revision-hash=7c4c76cdf8
                      helm.sh/chart=gpu-operator-v23.6.0
                      nvidia.com/precompiled=false
                      pod-template-generation=1
Annotations:          cni.projectcalico.org/containerID: ef20538089b848b06a67dcc865444b487644a2945cd07a9908becfe16b163d3e
                      cni.projectcalico.org/podIP: 192.168.177.14/32
                      cni.projectcalico.org/podIPs: 192.168.177.14/32
                      k8s.v1.cni.cncf.io/network-status:
                        [{
                            "name": "k8s-pod-network",
                            "ips": [
                                "192.168.177.14"
                            ],
                            "default": true,
                            "dns": {}
                        }]
                      k8s.v1.cni.cncf.io/networks-status:
                        [{
                            "name": "k8s-pod-network",
                            "ips": [
                                "192.168.177.14"
                            ],
                            "default": true,
                            "dns": {}
                        }]
                      kubectl.kubernetes.io/default-container: nvidia-driver-ctr
Status:               Running
IP:                   192.168.177.14
IPs:
  IP:           192.168.177.14
Controlled By:  DaemonSet/nvidia-driver-daemonset
Init Containers:
  mofed-validation:
    Container ID:  containerd://ae68718e709138b1737ed01ca82c8489c13990870545c90919dbc08044bbfef9
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.6.0
    Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:b65eb649188193f39e169af5650acfc7aa3cc32d2328630118702f04fdc4afc1
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Running
      Started:      Tue, 29 Aug 2023 19:54:23 +0000
    Ready:          False
    Restart Count:  1
    Environment:
      WITH_WAIT:                true
      COMPONENT:                mofed
      NODE_NAME:                 (v1:spec.nodeName)
      NVIDIA_VISIBLE_DEVICES:   void
      GPU_DIRECT_RDMA_ENABLED:  true
    Mounts:
      /run/mellanox/drivers from run-mellanox-drivers (rw)
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qdtbk (ro)
  k8s-driver-manager:
    Container ID:  containerd://931cee0081f15ff4a1b5d7db9b6d669ad9304f0975b4c48406c6f9ea3c1b9f54
    Image:         nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.2
    Image ID:      nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:bb845160b32fd12eb3fae3e830d2e6a7780bc7405e0d8c5b816242d48be9daa8
    Port:          <none>
    Host Port:     <none>
    Command:
      driver-manager
    Args:
      uninstall_driver
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 29 Aug 2023 14:26:56 +0000
      Finished:     Tue, 29 Aug 2023 14:27:00 +0000
    Ready:          True
    Restart Count:  0
    Environment:
      NODE_NAME:                    (v1:spec.nodeName)
      NVIDIA_VISIBLE_DEVICES:      void
      ENABLE_GPU_POD_EVICTION:     true
      ENABLE_AUTO_DRAIN:           false
      DRAIN_USE_FORCE:             false
      DRAIN_POD_SELECTOR_LABEL:    
      DRAIN_TIMEOUT_SECONDS:       0s
      DRAIN_DELETE_EMPTYDIR_DATA:  false
      OPERATOR_NAMESPACE:          gpu-operator (v1:metadata.namespace)
    Mounts:
      /host from host-root (ro)
      /run/nvidia from run-nvidia (rw)
      /sys from host-sys (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qdtbk (ro)
Containers:
  nvidia-driver-ctr:
    Container ID:  containerd://bfbfbbe496f5691030d74ec4c3730adbc54e031b6ea47b87811923a8933e2786
    Image:         nvcr.io/nvidia/driver:535.86.10-ubuntu22.04
    Image ID:      nvcr.io/nvidia/driver@sha256:9a4e9b8b87e9f39450d73de07c37d6789700f4ac58fedf1f5d83f6219e5c028d
    Port:          <none>
    Host Port:     <none>
    Command:
      nvidia-driver
    Args:
      init
    State:          Running
      Started:      Tue, 29 Aug 2023 19:54:24 +0000
    Last State:     Terminated
      Reason:       Unknown
      Exit Code:    255
      Started:      Tue, 29 Aug 2023 14:27:00 +0000
      Finished:     Tue, 29 Aug 2023 19:53:40 +0000
    Ready:          True
    Restart Count:  1
    Startup:        exec [sh -c nvidia-smi && touch /run/nvidia/validations/.driver-ctr-ready] delay=60s timeout=60s period=10s #success=1 #failure=120
    Environment:
      GPU_DIRECT_RDMA_ENABLED:  true
    Mounts:
      /dev/log from dev-log (rw)
      /host-etc/os-release from host-os-release (ro)
      /run/mellanox/drivers from run-mellanox-drivers (rw)
      /run/mellanox/drivers/usr/src from mlnx-ofed-usr-src (rw)
      /run/nvidia from run-nvidia (rw)
      /run/nvidia-topologyd from run-nvidia-topologyd (rw)
      /var/log from var-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qdtbk (ro)
  nvidia-peermem-ctr:
    Container ID:  containerd://38e913bb4c93285b173c03e6594a57e9b78da2f5b67a16474ecc0b902179d876
    Image:         nvcr.io/nvidia/driver:535.86.10-ubuntu22.04
    Image ID:      nvcr.io/nvidia/driver@sha256:9a4e9b8b87e9f39450d73de07c37d6789700f4ac58fedf1f5d83f6219e5c028d
    Port:          <none>
    Host Port:     <none>
    Command:
      nvidia-driver
    Args:
      reload_nvidia_peermem
    State:          Running
      Started:      Tue, 29 Aug 2023 19:54:24 +0000
    Last State:     Terminated
      Reason:       Unknown
      Exit Code:    255
      Started:      Tue, 29 Aug 2023 14:27:00 +0000
      Finished:     Tue, 29 Aug 2023 19:53:40 +0000
    Ready:          True
    Restart Count:  1
    Liveness:       exec [sh -c nvidia-driver probe_nvidia_peermem] delay=30s timeout=10s period=30s #success=1 #failure=1
    Startup:        exec [sh -c nvidia-driver probe_nvidia_peermem] delay=10s timeout=10s period=10s #success=1 #failure=120
    Environment:    <none>
    Mounts:
      /dev/log from dev-log (ro)
      /run/mellanox/drivers from run-mellanox-drivers (rw)
      /run/nvidia from run-nvidia (rw)
      /var/log from var-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qdtbk (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  run-nvidia:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia
    HostPathType:  DirectoryOrCreate
  var-log:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log
    HostPathType:  
  dev-log:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/log
    HostPathType:  
  host-os-release:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/os-release
    HostPathType:  
  run-nvidia-topologyd:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia-topologyd
    HostPathType:  DirectoryOrCreate
  mlnx-ofed-usr-src:
    Type:          HostPath (bare host directory volume)
    Path:          /run/mellanox/drivers/usr/src
    HostPathType:  DirectoryOrCreate
  run-mellanox-drivers:
    Type:          HostPath (bare host directory volume)
    Path:          /run/mellanox/drivers
    HostPathType:  DirectoryOrCreate
  run-nvidia-validations:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/validations
    HostPathType:  DirectoryOrCreate
  host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  
  host-sys:
    Type:          HostPath (bare host directory volume)
    Path:          /sys
    HostPathType:  Directory
  kube-api-access-qdtbk:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              nvidia.com/gpu.deploy.driver=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason          Age                From             Message
  ----     ------          ----               ----             -------
  Warning  NodeNotReady    20m                node-controller  Node is not ready
  Normal   SandboxChanged  17m                kubelet          Pod sandbox changed, it will be killed and re-created.
  Normal   AddedInterface  16m                multus           Add eth0 [192.168.177.14/32] from k8s-pod-network
  Normal   Pulled          16m                kubelet          Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.6.0" already present on machine
  Normal   Created         16m                kubelet          Created container mofed-validation
  Normal   Started         16m                kubelet          Started container mofed-validation
  Normal   Pulled          16m                kubelet          Container image "nvcr.io/nvidia/driver:535.86.10-ubuntu22.04" already present on machine
  Normal   Created         16m                kubelet          Created container nvidia-driver-ctr
  Normal   Started         16m                kubelet          Started container nvidia-driver-ctr
  Normal   Pulled          16m                kubelet          Container image "nvcr.io/nvidia/driver:535.86.10-ubuntu22.04" already present on machine
  Normal   Created         16m                kubelet          Created container nvidia-peermem-ctr
  Normal   Started         16m                kubelet          Started container nvidia-peermem-ctr
  Warning  Unhealthy       15m (x4 over 16m)  kubelet          Startup probe failed: DRIVER_ARCH is x86_64

Above, we can see that even though the init container mofed-validation is still running, the main container nvidia-driver-ctr has already started. The pod status also shows 2/2 main containers ready while the initContainer is still running.

root@co-node-2-126:~# kubectl  get pods -n gpu-operator --field-selector=spec.nodeName=co-node-8 
NAME                                               READY   STATUS     RESTARTS      AGE
gpu-feature-discovery-t5njq                        0/1     Init:0/1   0             9m6s
gpu-operator-node-feature-discovery-worker-4tkdj   1/1     Running    1 (11m ago)   5h46m
nvidia-container-toolkit-daemonset-sflgw           1/1     Running    1 (11m ago)   5h38m
nvidia-dcgm-exporter-2mnxt                         0/1     Init:0/1   0             9m6s
nvidia-device-plugin-daemonset-s6pw9               0/1     Init:0/1   0             9m6s
nvidia-driver-daemonset-svkhz                      2/2     Init:0/2   2 (11m ago)   5h46m
nvidia-mig-manager-4mbsv                           1/1     Init:0/1   1 (11m ago)   5h38m
nvidia-operator-validator-ktcxj                    1/1     Running    1 (11m ago)   5h38m

Another issue we are seeing is with pod nvidia-operator-validator-ktcxj, which has 4 init containers: only 1 of them ran (driver-validation) and the rest were skipped, while the app container again started in parallel. The start/finish timestamps of these initContainers below show this clearly.

root@co-node-2-126:~# kubectl  describe pod nvidia-operator-validator-ktcxj -n gpu-operator
Name:                 nvidia-operator-validator-ktcxj
Namespace:            gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Runtime Class Name:   nvidia
Service Account:      nvidia-operator-validator
Node:                 co-node-8/10.213.6.120
Start Time:           Tue, 29 Aug 2023 14:27:00 +0000
Labels:               app=nvidia-operator-validator
                      app.kubernetes.io/managed-by=gpu-operator
                      app.kubernetes.io/part-of=gpu-operator
                      controller-revision-hash=55899446bb
                      helm.sh/chart=gpu-operator-v23.6.0
                      pod-template-generation=1
Annotations:          cni.projectcalico.org/containerID: 02a1f15958fdbf1a97b0b5e6d2c7bd9b595cf721585b4bb10e63363e33a9772b
                      cni.projectcalico.org/podIP: 192.168.177.13/32
                      cni.projectcalico.org/podIPs: 192.168.177.13/32
                      k8s.v1.cni.cncf.io/network-status:
                        [{
                            "name": "k8s-pod-network",
                            "ips": [
                                "192.168.177.13"
                            ],
                            "default": true,
                            "dns": {}
                        }]
                      k8s.v1.cni.cncf.io/networks-status:
                        [{
                            "name": "k8s-pod-network",
                            "ips": [
                                "192.168.177.13"
                            ],
                            "default": true,
                            "dns": {}
                        }]
Status:               Running
IP:                   192.168.177.13
IPs:
  IP:           192.168.177.13
Controlled By:  DaemonSet/nvidia-operator-validator
Init Containers:
  driver-validation:
    Container ID:  containerd://064dc1a47d796959b355435da27924de9f1b41ed42b65046fbe4dc29cad8af5a
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.6.0
    Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:b65eb649188193f39e169af5650acfc7aa3cc32d2328630118702f04fdc4afc1
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 29 Aug 2023 19:54:37 +0000
      Finished:     Tue, 29 Aug 2023 19:55:32 +0000
    Ready:          True
    Restart Count:  0
    Environment:
      WITH_WAIT:  true
      COMPONENT:  driver
    Mounts:
      /host from host-root (ro)
      /host-dev-char from host-dev-char (rw)
      /run/nvidia/driver from driver-install-path (rw)
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ljstf (ro)
  toolkit-validation:
    Container ID:  containerd://02c2eca7b6fd1f9f28e0d02d41db122aa267f1cc4ccbebfe1d9293caf3869aa7
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.6.0
    Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:b65eb649188193f39e169af5650acfc7aa3cc32d2328630118702f04fdc4afc1
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 29 Aug 2023 14:28:22 +0000
      Finished:     Tue, 29 Aug 2023 14:28:22 +0000
    Ready:          True
    Restart Count:  0
    Environment:
      NVIDIA_VISIBLE_DEVICES:  all
      WITH_WAIT:               false
      COMPONENT:               toolkit
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ljstf (ro)
  cuda-validation:
    Container ID:  containerd://cc01020476d5cef14ec21b8310f09b4d13acf0f2bd2ee0f59d4db8243a515605
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.6.0
    Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:b65eb649188193f39e169af5650acfc7aa3cc32d2328630118702f04fdc4afc1
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 29 Aug 2023 14:28:23 +0000
      Finished:     Tue, 29 Aug 2023 14:28:28 +0000
    Ready:          True
    Restart Count:  0
    Environment:
      WITH_WAIT:                    false
      COMPONENT:                    cuda
      NODE_NAME:                     (v1:spec.nodeName)
      OPERATOR_NAMESPACE:           gpu-operator (v1:metadata.namespace)
      VALIDATOR_IMAGE:              nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.6.0
      VALIDATOR_IMAGE_PULL_POLICY:  IfNotPresent
      VALIDATOR_RUNTIME_CLASS:      nvidia
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ljstf (ro)
  plugin-validation:
    Container ID:  containerd://61aad57b2a07f9a5acd7ed80beba9ea8be384fb45ec2e4ac6bf57107f7048609
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.6.0
    Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:b65eb649188193f39e169af5650acfc7aa3cc32d2328630118702f04fdc4afc1
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 29 Aug 2023 14:28:29 +0000
      Finished:     Tue, 29 Aug 2023 14:28:34 +0000
    Ready:          True
    Restart Count:  0
    Environment:
      COMPONENT:                    plugin
      WITH_WAIT:                    false
      WITH_WORKLOAD:                false
      MIG_STRATEGY:                 single
      NODE_NAME:                     (v1:spec.nodeName)
      OPERATOR_NAMESPACE:           gpu-operator (v1:metadata.namespace)
      VALIDATOR_IMAGE:              nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.6.0
      VALIDATOR_IMAGE_PULL_POLICY:  IfNotPresent
      VALIDATOR_RUNTIME_CLASS:      nvidia
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ljstf (ro)
Containers:
  nvidia-operator-validator:
    Container ID:  containerd://9cf87ec82ff7a8876864e4dceaada2f023655176c556f1a8a99d4ee86d17c229
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.6.0
    Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:b65eb649188193f39e169af5650acfc7aa3cc32d2328630118702f04fdc4afc1
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      echo all validations are successful; sleep infinity
    State:          Running
      Started:      Tue, 29 Aug 2023 19:54:37 +0000
    Last State:     Terminated
      Reason:       Unknown
      Exit Code:    255
      Started:      Tue, 29 Aug 2023 14:28:35 +0000
      Finished:     Tue, 29 Aug 2023 19:53:40 +0000
    Ready:          True
    Restart Count:  1
    Environment:    <none>
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ljstf (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  run-nvidia-validations:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/validations
    HostPathType:  DirectoryOrCreate
  driver-install-path:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/driver
    HostPathType:  
  host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  
  host-dev-char:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/char
    HostPathType:  
  kube-api-access-ljstf:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              nvidia.com/gpu.deploy.operator-validator=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:                      <none>

What did you expect to happen?

On node reboot, initContainers should complete in order before the app containers start. Also, all initContainers should run, rather than being skipped, before the app container starts.

How can we reproduce it (as minimally and precisely as possible)?

Create a pod with multiple initContainers (which wait for about 2-3 minutes) and an app container. On node reboot, the app container is started in parallel with the initContainers.
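
For example, a minimal pod along these lines (names and sleep durations here are just illustrative; full manifests are in the comments below):

apiVersion: v1
kind: Pod
metadata:
  name: init-order-test
spec:
  containers:
  - name: app
    image: busybox:1.28
    command: ['sh', '-c', 'echo the app is running && date && sleep 600']
  initContainers:
  - name: init-1
    image: busybox:1.28
    command: ['sh', '-c', 'echo init-1 starting && date; sleep 150; echo init-1 done && date']
  - name: init-2
    image: busybox:1.28
    command: ['sh', '-c', 'echo init-2 starting && date; sleep 150; echo init-2 done && date']

After rebooting the node hosting the pod, compare the container start timestamps (kubectl describe pod init-order-test): with 1.28 the app container starts while init-1 is still sleeping, and init-2 may be skipped entirely.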

Anything else we need to know?

No response

Kubernetes version

root@co-node-2-126:~# kubectl  version
Client Version: v1.28.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.0

Cloud provider

OS version

root@co-node-8:~# cat /etc/issue
Ubuntu 22.04.2 LTS \n \l

root@co-node-8:~# uname -a
Linux co-node-8 5.15.0-79-generic #86-Ubuntu SMP Mon Jul 10 16:07:21 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Install tools

Container runtime (CRI) and version (if applicable)

containerd

root@co-node-8:~# ctr version
Client:
Version: v1.7.2
Revision: 0cae528dd6cb557f7201036e9f43420650207b58
Go version: go1.20.4

Server:
Version: v1.7.2
Revision: 0cae528dd6cb557f7201036e9f43420650207b58
UUID: f120b710-d067-4968-bb0a-6532b6918467

Related plugins (CNI, CSI, ...) and versions (if applicable)

@shivamerla shivamerla added the kind/bug Categorizes issue or PR as related to a bug. label Aug 30, 2023
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 30, 2023
@klueska
Contributor

klueska commented Aug 30, 2023

/sig node
/triage accepted

NOTE: This occurs without enabling the new SidecarFeatureGate

Glancing through the code that was added for the sidecar feature, I wouldn't be surprised if something got broken in the normal initContainer flow. Very little of the modified code is guarded by the SidecarFeatureGate. Normally all code changes should be guarded by the feature gate, not just the code related to the new API being exposed. In my opinion, the code should be refactored with the proper guardrails in place to avoid going down any new code paths unless the SidecarFeatureGate is enabled.

Here is the PR that introduced these changes:
#116429

/cc @mrunalp @gjkim42

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 30, 2023
@gjkim42
Member

gjkim42 commented Aug 30, 2023

Glancing through the code that was added for the sidecar feature, I wouldn't be surprised if something got broken in the normal initContainer flow. Very little of the modified code is guarded by the SidecarFeatureGate. Normally all code changes should be guarded by the feature gate, not just the code related to the new API being exposed. In my opinion, the code should be refactored with the proper guardrails in place to avoid going down any new code paths unless the SidecarFeatureGate is enabled.

#116429 (comment)

There have been some discussions about this, and we realized that it is practically difficult to guard all of the new code paths.
I'll investigate it later today and find a better way to guard the new code path.

/assign
cc @SergeyKanzhelev

@pacoxu
Member

pacoxu commented Aug 30, 2023

Does this sound like a regression? If so, the fix should be backported to v1.28.x.
/kind regression

@k8s-ci-robot k8s-ci-robot added the kind/regression Categorizes issue or PR as related to a regression from a prior release. label Aug 30, 2023
@aojea
Member

aojea commented Aug 30, 2023

I cannot reproduce this behavior at head; the init containers correctly run before the app container.

apiVersion: v1
kind: Pod
metadata:
  name: myapp
  labels:
    app.kubernetes.io/name: MyApp
spec:
  restartPolicy: Never
  containers:
  - name: myapp
    image: busybox:1.28
    command: ['sh', '-c', 'echo The app is running! && date && exit 0']
  initContainers:
  - name: init
    image: busybox:1.28
    command: ['sh', '-c', "echo starting && date ; sleep 15; echo finishing starting after 15 sec && date"]
$ kubectl logs myapp -c init
starting
Wed Aug 30 10:22:02 UTC 2023
finishing starting after 15 sec
Wed Aug 30 10:22:17 UTC 2023
$ kubectl logs myapp -c myapp
The app is running!
Wed Aug 30 10:22:18 UTC 2023

@aojea
Member

aojea commented Aug 30, 2023

I cannot reproduce this behavior at head; the init containers correctly run before the app container.

Also, this happens only on node reboot, not during a fresh install of our application (gpu-operator)

@gjkim42 @shivamerla can we try to find the root cause first?

@shivamerla
Author

shivamerla commented Aug 30, 2023

@aojea I am hitting this with pod restartPolicy: Always, which is the default.

apiVersion: v1
kind: Pod
metadata:
  name: myapp
  labels:
    app.kubernetes.io/name: MyApp
spec:
  containers:
  - name: myapp
    image: busybox:1.28
    command: ['sh', '-c', 'echo The app is running! && date && sleep 600']
  initContainers:
  - name: init
    image: busybox:1.28
    command: ['sh', '-c', "echo starting && date ; sleep 150; echo finishing starting after 150 sec && date"]

On node reboot (we can see the app container started before the init container could finish):

root@co-node-2-126:~# kubectl logs myapp -c init
starting
Wed Aug 30 12:04:42 UTC 2023
finishing starting after 150 sec
Wed Aug 30 12:07:12 UTC 2023

root@co-node-2-126:~# kubectl logs myapp
Defaulted container "myapp" out of: myapp, init (init)
The app is running!
Wed Aug 30 12:04:43 UTC 2023

root@co-node-2-126:~# kubectl describe pod myapp
Name:             myapp
Namespace:        default
Priority:         0
Service Account:  default
Node:             co-node-8/10.213.6.120
Start Time:       Wed, 30 Aug 2023 11:56:10 +0000
Labels:           app.kubernetes.io/name=MyApp
Annotations:      cni.projectcalico.org/containerID: f874f3103c1d09cbf0cc9039f684caf20b00468946228d9477d835a1634652e8
                  cni.projectcalico.org/podIP: 192.168.177.30/32
                  cni.projectcalico.org/podIPs: 192.168.177.30/32
                  k8s.v1.cni.cncf.io/network-status:
                    [{
                        "name": "k8s-pod-network",
                        "ips": [
                            "192.168.177.30"
                        ],
                        "default": true,
                        "dns": {}
                    }]
                  k8s.v1.cni.cncf.io/networks-status:
                    [{
                        "name": "k8s-pod-network",
                        "ips": [
                            "192.168.177.30"
                        ],
                        "default": true,
                        "dns": {}
                    }]
Status:           Running
IP:               192.168.177.30
IPs:
  IP:  192.168.177.30
Init Containers:
  init:
    Container ID:  containerd://f62f5e2508df7084290c45f75cf02d9c5f8de15ea95f815e7bf67782eee4c06d
    Image:         busybox:1.28
    Image ID:      docker.io/library/busybox@sha256:141c253bc4c3fd0a201d32dc1f493bcf3fff003b6df416dea4f41046e0f37d47
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
      echo starting && date ; sleep 150; echo finishing starting after 150 sec && date
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 30 Aug 2023 12:04:42 +0000
      Finished:     Wed, 30 Aug 2023 12:07:12 +0000
    Ready:          True
    Restart Count:  1
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jg8w7 (ro)
Containers:
  myapp:
    Container ID:  containerd://8ce9fcf8ce7134dc4cfca131101a17e2fb9cc4c2ff576b49a24a5892d23ae46b
    Image:         busybox:1.28
    Image ID:      docker.io/library/busybox@sha256:141c253bc4c3fd0a201d32dc1f493bcf3fff003b6df416dea4f41046e0f37d47
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
      echo The app is running! && date && sleep 600
    State:          Running
      Started:      Wed, 30 Aug 2023 12:04:43 +0000
    Last State:     Terminated
      Reason:       Unknown
      Exit Code:    255
      Started:      Wed, 30 Aug 2023 11:58:47 +0000
      Finished:     Wed, 30 Aug 2023 12:03:13 +0000
    Ready:          True
    Restart Count:  1
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jg8w7 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  kube-api-access-jg8w7:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason          Age    From               Message
  ----    ------          ----   ----               -------
  Normal  Scheduled       13m    default-scheduler  Successfully assigned default/myapp to co-node-8
  Normal  AddedInterface  12m    multus             Add eth0 [192.168.177.28/32] from k8s-pod-network
  Normal  Pulled          12m    kubelet            Container image "busybox:1.28" already present on machine
  Normal  Created         12m    kubelet            Created container init
  Normal  Started         12m    kubelet            Started container init
  Normal  Pulled          10m    kubelet            Container image "busybox:1.28" already present on machine
  Normal  Created         10m    kubelet            Created container myapp
  Normal  Started         10m    kubelet            Started container myapp
  Normal  SandboxChanged  5m43s  kubelet            Pod sandbox changed, it will be killed and re-created.
  Normal  AddedInterface  4m33s  multus             Add eth0 [192.168.177.30/32] from k8s-pod-network
  Normal  Pulled          4m31s  kubelet            Container image "busybox:1.28" already present on machine
  Normal  Created         4m31s  kubelet            Created container init
  Normal  Started         4m31s  kubelet            Started container init
  Normal  Pulled          4m30s  kubelet            Container image "busybox:1.28" already present on machine
  Normal  Created         4m30s  kubelet            Created container myapp
  Normal  Started         4m30s  kubelet            Started container myapp

@gjkim42
Member

gjkim42 commented Aug 30, 2023

I reproduced it with your scenario.

/priority important-soon

@k8s-ci-robot k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Aug 30, 2023
@liggitt
Member

liggitt commented Aug 30, 2023

cc @bobbypage @smarterclayton for visibility to kubelet correctness

@gjkim42
Member

gjkim42 commented Aug 30, 2023

I guess this happens when the pod sandbox is changed.

We can reproduce it easily with:

crictl stopp POD_ID
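
e.g. on the node (the pod name here is illustrative):

# find the sandbox ID of the pod to restart
POD_ID=$(crictl pods --name myapp -q)
# stop the sandbox; the kubelet will recreate it, just as it does after a node reboot
crictl stopp "$POD_ID"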

I'll propose a PR containing the e2e scenario and the fix for it.

@gjkim42
Member

gjkim42 commented Aug 30, 2023

Another issue we are seeing is with pod nvidia-operator-validator-ktcxj, which has 4 init containers: only 1 of them ran (driver-validation) and the rest were skipped, while the app container again started in parallel. The start/finish timestamps of these initContainers below show this clearly.

I'll cover this as well.

@liggitt
Member

liggitt commented Aug 30, 2023

Once we have a solid reproducer, check whether it reproduces with 1.27 or not, so we know if this is a regression in 1.28 or just a long-standing issue.

@gjkim42
Member

gjkim42 commented Aug 30, 2023

Confirmed that this is NOT reproducible in 1.27. It is a regression.

@gjkim42
Member

gjkim42 commented Aug 30, 2023

// If any of the main containers have status and are Running, then all init containers must
// have been executed at some point in the past. However, they could have been removed
// from the container runtime now, and if we proceed, it would appear as if they
// never ran and will re-execute improperly except for the restartable init containers.
podHasInitialized := hasAnyRegularContainerCreated(pod, podStatus)

This happens because the kubelet uses the old container statuses to check whether the pod has already been initialized.

Before 1.28, the kubelet considers the pod initialized if any regular container is running.
With 1.28, however, the kubelet considers the pod initialized if any regular container is running OR has EXITED.

I think the latter is theoretically right, but currently we cannot know whether the container statuses are from the current pod sandbox or from an old one.

The fix would be either:

  1. check whether the container statuses are outdated (not sure how to do this yet...), or
  2. (workaround) revert the pod initialization check: "any regular container running or exited" => "any regular container running" (see the sketch below)
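
To illustrate the difference between the two checks, a simplified sketch (not the actual kubelet code; the real check operates on the runtime-reported statuses in pkg/kubelet/kuberuntime):

// Simplified stand-in for the runtime-reported state of a regular
// (non-init) container.
type containerStatus struct {
	name  string
	state string // "created", "running", "exited", ...
}

// Pre-1.28 behavior: the pod only counts as initialized if a regular
// container is currently running.
func hasAnyRegularContainerRunning(statuses []containerStatus) bool {
	for _, s := range statuses {
		if s.state == "running" {
			return true
		}
	}
	return false
}

// 1.28 behavior: a regular container that has ever been created (running
// or already exited) also counts. After a sandbox change (node reboot,
// crictl stopp), the exited containers from the old sandbox make the pod
// look initialized, so the remaining init containers are skipped and the
// app containers are started immediately.
func hasAnyRegularContainerCreated(statuses []containerStatus) bool {
	for _, s := range statuses {
		switch s.state {
		case "created", "running", "exited":
			return true
		}
	}
	return false
}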

@liggitt
Member

liggitt commented Aug 30, 2023

That looks to have changed as part of the sidecar work - there were a lot of changes to pkg/kubelet/kuberuntime/kuberuntime_container.go in https://github.com/kubernetes/kubernetes/pull/116429/files#diff-d86ffc24d8e4c9deafd3e39e8f4ee3546ee46a31dcea8ea6526443b3f4bffa28 that are not obviously feature gated... were all of those supposed to be no-op refactors?

@liggitt
Member

liggitt commented Aug 30, 2023

cc @thockin for visibility to kubelet regressions in 1.28 related to sidecar refactors

@liggitt
Member

liggitt commented Aug 30, 2023

The fix would be,

  1. check if the container statuses are outdated (not sure how to do this yet...)
  2. (workaround) revert pod initialization check: there is any regular container running or exited => there is any regular container running

A starting point that would build confidence would be to restore the kubelet logic here to the 1.27 state before the sidecar merge (reviewing diff(release-1.27, Fix PR) should make it obvious that 1.27 behavior has been restored).

The refactored / new methods could be kept in a different file and only called when the feature gate was enabled. That would restore 1.28 to match 1.27 behavior by default as quickly as possible.
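
i.e. something with this shape at the call sites (hasAnyRegularContainerRunning is a hypothetical name for the restored 1.27 check; the gate check uses the existing utilfeature.DefaultFeatureGate and features.SidecarContainers helpers, which is the gate referred to above as the SidecarFeatureGate):

if utilfeature.DefaultFeatureGate.Enabled(features.SidecarContainers) {
	// new sidecar-aware check (current 1.28 behavior)
	podHasInitialized = hasAnyRegularContainerCreated(pod, podStatus)
} else {
	// restored release-1.27 behavior, kept in its own file
	podHasInitialized = hasAnyRegularContainerRunning(pod, podStatus)
}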

@liggitt liggitt added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Aug 30, 2023
@bobbypage
Member

/cc

@ffromani
Contributor

ffromani commented Sep 5, 2023

/cc

SIG Node Bugs automation moved this from Triaged to Done Sep 6, 2023