[1.27] kubelet: devices: skip allocation for running pods #118635 #119432
Conversation
/sig node
/test
@ffromani: The following commands are available to trigger optional jobs:
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/test all
/test pull-kubernetes-e2e-gce-device-plugin-gpu
/test pull-kubernetes-unit-go-compatibility (unrelated failure)
/test pull-kubernetes-e2e-gce-device-plugin-gpu (setup failure?)
/test pull-kubernetes-unit-go-compatibility
/test pull-kubernetes-e2e-capz-windows-1-27 (unrelated failure)
/retest-required (apiserver is not affected by this change)
No problem at all! It took me a while to get a clue about what was broken, but happy it was resolved and we have better Windows jobs.
/lgtm
Re-applying as the label was removed due to rebase.
LGTM label has been added. Git tree hash: 01900654b0c6f23d2ddc1ff343f0eb6fe97a9fcd
regarding #119432 (comment)
+1, it looks like the test is failing on presubmits on your empty PR (#119590), so I agree we should not block this PR on that. I believe I found the issue there with the config (see kubernetes/test-infra#30450), which we can address separately.
/lgtm
/test pull-kubernetes-e2e-gce-device-plugin-gpu |
Thanks @bobbypage! It seems your fix was merged and worked like a charm, awesome!
@ffromani any idea when this will make it into 1.27? We have been hitting this for a while now, and encounter UnexpectedAdmissionError.
Hi, please note that we (or, at the very least, I :) ) want to keep the existing behavior added in #109595, whose partial fix led to the bug I'm fixing here. In other words, some pods WILL still hit UnexpectedAdmissionError in some cases.
/cc @kubernetes/release-managers Release managers, can you please take a look at this cherrypick? Thank you! |
For RelEng:
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: ffromani, mrunalp, xmudrii. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/test pull-kubernetes-node-e2e-containerd (failure unrelated to this PR)
/test pull-kubernetes-unit-go-compatibility (unrelated failure)
What type of PR is this?
/kind bug
/kind regression
What this PR does / why we need it:
Cherry-pick of #118635 to branch release-1.27. Cherry pick per se done using hack/cherry_pick_pull.sh.
Original description
When the kubelet initializes, it runs admission for pods and possibly allocates the requested resources. We need to distinguish between a node reboot (no containers running) and a kubelet restart (containers potentially running).
Running pods should always survive a kubelet restart. This means that device allocation on admission should not be attempted: if a container requires devices and is still running when the kubelet restarts, that container already has its devices allocated and working.
Thus, we need to properly detect this scenario in the allocation step and handle it explicitly, informing the devicemanager about which pods are already running.
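To make the intent concrete, here is a minimal sketch of that allocation decision. It is illustrative only: runningSet and allocate are hypothetical stand-ins, not the devicemanager's actual types or API.

```go
// Illustrative sketch only: hypothetical names, not the kubelet's real API.
package main

import "fmt"

// runningSet records containers known to be running at kubelet startup,
// keyed by "podUID/containerName". (Hypothetical type.)
type runningSet map[string]bool

// allocate sketches the admission-time allocation decision.
func allocate(podUID, container string, running runningSet) error {
	if running[podUID+"/"+container] {
		// Kubelet restart: the container survived and already has its
		// devices allocated and working, so skip re-allocation.
		fmt.Printf("skipping allocation for running container %s/%s\n", podUID, container)
		return nil
	}
	// Node reboot or genuinely new container: allocate as usual.
	fmt.Printf("allocating devices for %s/%s\n", podUID, container)
	return nil
}

func main() {
	running := runningSet{"pod-1/app": true}
	_ = allocate("pod-1", "app", running) // skipped: already running
	_ = allocate("pod-2", "app", running) // allocated: not running
}
```

The key point is the early return: re-running allocation for a container that survived the restart can spuriously fail (its devices look taken, by that very container) and surface as an UnexpectedAdmissionError.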
Which issue(s) this PR fixes:
Fixes #118559
Special notes for your reviewer:
Implements the first approach proposed in the thread, so we make the devicemanager treat running pods differently.
This approach was chosen because it seems simpler to make self-contained and easier to backport.
The devicemanager already tracks (with the help of the checkpoint files) which containers got devices assigned to them, which by definition means these containers passed its admission. The missing bit is safely learning which containers are already running when the kubelet initializes, and for that we extend the existing buildContainerMapFromRuntime helper (see the sketch below).
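A simplified sketch of what such an extension could look like, assuming the runtime reports per-container state; the types and the buildContainerMapAndRunningSet name below are illustrative stand-ins, not the real kubelet/CRI API.

```go
// Illustrative sketch only: simplified stand-ins for the runtime types.
package main

import "fmt"

type containerState int

const (
	stateCreated containerState = iota
	stateRunning
	stateExited
)

// runtimeContainer is a stand-in for what the container runtime reports.
type runtimeContainer struct {
	ID     string
	PodUID string
	Name   string
	State  containerState
}

// buildContainerMapAndRunningSet walks the runtime's containers and returns
// both the container map (ID -> pod/container) and the set of container IDs
// that are currently running, so the devicemanager can skip allocation for
// the latter on kubelet restart.
func buildContainerMapAndRunningSet(containers []runtimeContainer) (map[string]string, map[string]bool) {
	containerMap := make(map[string]string)
	running := make(map[string]bool)
	for _, c := range containers {
		containerMap[c.ID] = c.PodUID + "/" + c.Name
		if c.State == stateRunning {
			running[c.ID] = true
		}
	}
	return containerMap, running
}

func main() {
	containers := []runtimeContainer{
		{ID: "c1", PodUID: "pod-1", Name: "app", State: stateRunning},
		{ID: "c2", PodUID: "pod-2", Name: "sidecar", State: stateExited},
	}
	containerMap, running := buildContainerMapAndRunningSet(containers)
	fmt.Println(containerMap) // map[c1:pod-1/app c2:pod-2/sidecar]
	fmt.Println(running)      // map[c1:true]
}
```

Combined with the checkpointed device assignments, this gives the devicemanager enough information to tell "restarted kubelet, container still running" apart from "fresh boot, nothing running".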
Does this PR introduce a user-facing change?