Kubelet restart cause running pod restart with UnexpectedAdmissionError when pods have initContainers and external devices like GPU #124345

zwpaper · 2024-04-17T09:28:54Z

What happened?

When kubelet is restarted, running pods are restarted with UnexpectedAdmissionError if both their initContainers and their containers use external devices such as GPUs.

What did you expect to happen?

Restarting kubelet should not cause running pods to restart.

How can we reproduce it (as minimally and precisely as possible)?

1. Create some pods that request nvidia.com/gpu and whose initContainers also use GPUs.
2. Wait until the initContainers exit.
3. Restart kubelet.
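
The reproduction steps above could be exercised with a manifest along these lines. This is a minimal sketch, not the reporter's actual pod spec: the pod name, the `busybox` image, and the commands are hypothetical placeholders, and it assumes a node where a device plugin advertises `nvidia.com/gpu`. The program only prints a JSON Pod manifest in which both the init container and the main container request one GPU.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// gpuContainer builds a container entry that requests one nvidia.com/gpu.
// The image and command are placeholders chosen for illustration only.
func gpuContainer(name string, command []string) map[string]any {
	return map[string]any{
		"name":    name,
		"image":   "busybox", // hypothetical image, any image with a shell works
		"command": command,
		"resources": map[string]any{
			"limits": map[string]any{"nvidia.com/gpu": "1"},
		},
	}
}

// reproPod returns a Pod manifest whose init container and main container
// both request a GPU, matching the scenario described in the report.
func reproPod(name string) map[string]any {
	return map[string]any{
		"apiVersion": "v1",
		"kind":       "Pod",
		"metadata":   map[string]any{"name": name},
		"spec": map[string]any{
			// The init container exits successfully right away.
			"initContainers": []map[string]any{
				gpuContainer("init-gpu", []string{"sh", "-c", "exit 0"}),
			},
			// The main container keeps the pod in the Running phase.
			"containers": []map[string]any{
				gpuContainer("main-gpu", []string{"sleep", "3600"}),
			},
		},
	}
}

func main() {
	out, err := json.MarshalIndent(reproPod("gpu-init-repro"), "", "  ")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(string(out))
}
```

Assuming the file is saved as `main.go`, the output can be piped into `kubectl apply -f -`. Once the pod is Running, restarting kubelet on that node (how to do so depends on the installation) should show whether the pod is rejected with UnexpectedAdmissionError.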

Anything else we need to know?

There was an earlier issue and a fix for running pods with devices, but it looks like initContainers are not counted among the containers for which allocation should be skipped.

The fix was cherry-picked to the v1.25 branch, so it is already present in v1.25.16.

issue: Running pods with devices are terminated if kubelet is restarted #118559
fix: kubelet: devices: skip allocation for running pods #118635
cherry-pick to v1.25 (included since v1.25.14): [1.25] kubelet: devices: skip allocation for running pods #118635 #119707

Related Issues:

Completed pods are also affected: A completed status pod which request 1 nvidia.com/gpu resource updated to UnexpectedAdmissionError when restart kubelet service #117955
A memory-manager-related kubelet restart also causes UnexpectedAdmissionError: pod with initcontainer failed with UnexpectedAdmissionError when restart kubelet #123971

Kubernetes version

$ kubectl version
# 1.25.16


k8s-ci-robot · 2024-04-17T09:29:03Z

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

zwpaper · 2024-04-17T09:29:38Z

/sig node

ffromani · 2024-04-17T16:16:23Z

/cc

ffromani · 2024-04-17T16:19:47Z

my 2c: completed containers, i.e. containers that terminated successfully and were not deleted, such as init containers or containers belonging to Jobs, should NOT be restarted, or even retried for admission, when kubelet is restarted. The reason is, well, these containers already completed successfully.

ffromani · 2024-04-17T16:23:08Z

quoting https://kubernetes.io/docs/concepts/workloads/pods/init-containers/#detailed-behavior:

A Pod cannot be Ready until all init containers have succeeded. The ports on an init container are not aggregated under a Service. A Pod that is initializing is in the Pending state but should have a condition Initialized set to false.

If the Pod [restarts](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/#pod-restart-reasons), or is restarted, all init containers must execute again.

and

Because init containers can be restarted, retried, or re-executed, init container code should be idempotent. In particular, code that writes to files on EmptyDirs should be prepared for the possibility that an output file already exists.

which ties to #123980

ffromani · 2024-04-18T14:27:23Z

xref: #117955

zwpaper · 2024-04-20T08:18:46Z

> my 2c: completed containers, i.e. containers that terminated successfully and were not deleted, such as init containers or containers belonging to Jobs, should NOT be restarted, or even retried for admission, when kubelet is restarted. The reason is, well, these containers already completed successfully.

Yes! That's why I created this issue. This issue focuses more on init containers: the completed init containers cause running pods to be restarted.

This fix doesn't help because the pod is still running: a2ca66d

I was wondering whether we could add a check here to skip allocateDevice if the container is expected to be stopped. It looks like my pods were killed here:

https://github.com/ffromani/kubernetes/blob/7e3638982acebe901b34f3d9bab4f4e4c5d703c9/pkg/kubelet/cm/devicemanager/manager.go#L811
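
The check proposed above might look roughly like the following sketch. It is not the device manager's code: the `pod`, `container`, and `containerState` types and the function `shouldSkipAllocation` are hypothetical simplifications made up for illustration. The assumption is that, after a kubelet restart, allocation can be skipped not only for containers that are already running but also for init containers that exited successfully in a pod whose regular containers are running.

```go
package main

import "fmt"

// containerState is a hypothetical, simplified view of what the kubelet
// could learn about a container from the runtime after restarting.
type containerState int

const (
	stateUnknown  containerState = iota // not started yet, or no status available
	stateRunning                        // currently running
	stateExitedOK                       // terminated with exit code 0
)

// container and pod are illustrative types, not the real kubelet API.
type container struct {
	name   string
	isInit bool
	state  containerState
}

type pod struct {
	name       string
	containers []container
}

// podIsRunning reports whether at least one regular (non-init) container
// of the pod is running.
func podIsRunning(p pod) bool {
	for _, c := range p.containers {
		if !c.isInit && c.state == stateRunning {
			return true
		}
	}
	return false
}

// shouldSkipAllocation returns true when device allocation for c can be
// skipped during re-admission after a kubelet restart:
//   - the container is already running, or
//   - it is an init container that completed successfully and the pod's
//     regular containers are running, so it is not expected to run again.
func shouldSkipAllocation(p pod, c container) bool {
	if c.state == stateRunning {
		return true
	}
	return c.isInit && c.state == stateExitedOK && podIsRunning(p)
}

func main() {
	p := pod{
		name: "gpu-pod",
		containers: []container{
			{name: "init-gpu", isInit: true, state: stateExitedOK},
			{name: "main-gpu", state: stateRunning},
		},
	}
	for _, c := range p.containers {
		fmt.Printf("%s/%s: skip allocation = %v\n", p.name, c.name, shouldSkipAllocation(p, c))
	}
}
```

The sketch deliberately does not skip allocation for a completed init container when no regular container is running, because in that case the pod may be restarting and its init containers must execute again.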

swatisehgal · 2024-04-22T14:29:04Z

/cc

AnishShah · 2024-04-24T17:45:33Z

@zwpaper can you try this on a newer k8s version? 1.25 is out of support. Can you also share a pod spec to reproduce this error? Why is the pod failing with an admission error if the device is still available?

SergeyKanzhelev · 2024-04-24T17:45:38Z

/cc

SergeyKanzhelev · 2024-04-24T17:46:16Z

/triage needs-information

zwpaper added the kind/bug label Apr 17, 2024

k8s-ci-robot added the needs-sig and needs-triage labels Apr 17, 2024

k8s-ci-robot added the sig/node label and removed the needs-sig label Apr 17, 2024

SergeyKanzhelev added this to Triage in SIG Node Bugs Apr 17, 2024

AnishShah moved this from Triage to Needs Information in SIG Node Bugs Apr 24, 2024

k8s-ci-robot added the triage/needs-information label Apr 24, 2024
