
WIP: alternative take: kubelet: devices: skip allocation for running pods #119151

Closed

Conversation


@ffromani ffromani commented Jul 7, 2023

What type of PR is this?

/kind bug
/kind regression

What this PR does / why we need it:

When the kubelet initializes, it runs admission for pods and possibly allocates requested resources. We need to distinguish between a node reboot (no containers running) and a kubelet restart (containers potentially running).

Running pods should always survive a kubelet restart. This means that device allocation on admission should not be attempted, because if a container requires devices and is still running when the kubelet restarts, that container already has devices allocated and working.

Thus, we need to properly detect this scenario in the allocation step and handle it explicitly. We need to inform the devicemanager about which pods are already running.

Which issue(s) this PR fixes:

Fixes #118559

Special notes for your reviewer:

Alternative take, as discussed in #118635 (comment).
Once we reach agreement I'll either close this draft and make changes in #118635 or just drop this draft.

Implements the first approach proposed in the thread: we make the devicemanager treat running pods differently.

This approach was chosen because it seems simpler to make self-contained and easier to backport.

The devicemanager already tracks (with the help of the checkpoint files) which containers got devices assigned to them, which by definition means these containers passed its admission. The missing bit is safely learning which containers are already running when initializing, and for that we extend the existing buildContainerMapFromRuntime.

Does this PR introduce a user-facing change?

NONE

When the kubelet initializes, it runs admission for pods and possibly
allocates requested resources. We need to distinguish between
node reboot (no containers running) versus kubelet restart (containers
potentially running).

Running pods should always survive a kubelet restart.
This means that device allocation on admission should not be attempted,
because if a container requires devices and is still running when the
kubelet restarts, that container already has devices allocated and working.

Thus, we need to properly detect this scenario in the allocation step
and handle it explicitly. We need to inform
the devicemanager about which pods are already running.

Note that if the container runtime is down when the kubelet restarts,
the approach implemented here won't work: on kubelet restart,
containers will again fail admission, hitting
kubernetes#118559 again.
This scenario should, however, be pretty rare.

Signed-off-by: Francesco Romani <fromani@redhat.com>
Fix e2e device manager tests.
Most notably, the workload pods need to survive a kubelet
restart. Update tests to reflect that.

Signed-off-by: Francesco Romani <fromani@redhat.com>
The recently added e2e device plugin test covering node reboot
works fine when run each time on a fresh environment (e.g. CI) but
doesn't correctly handle a partial setup when run repeatedly on
the same instance (developer setup).

To accommodate both flows, we extend the error management, checking
more error conditions in the flow.

Signed-off-by: Francesco Romani <fromani@redhat.com>
Make sure orphaned pods (pods deleted while the kubelet is down) are
handled correctly.
Outline:
1. create a pod (not a static pod)
2. stop the kubelet
3. while the kubelet is down, force delete the pod on the API server
4. restart the kubelet
The pod becomes an orphaned pod and is expected to be killed by HandlePodCleanups.

There is a similar test already, but here we want to check device
assignment.

Signed-off-by: Francesco Romani <fromani@redhat.com>
Signed-off-by: Francesco Romani <fromani@redhat.com>
@k8s-ci-robot

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/regression Categorizes issue or PR as related to a regression from a prior release. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 7, 2023
@k8s-ci-robot

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Jul 7, 2023
@ffromani ffromani changed the title Devmgr check pod running cntmap WIP: alternative take: kubelet: devices: skip allocation for running pods Jul 7, 2023
@ffromani

ffromani commented Jul 7, 2023

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. area/kubelet area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jul 7, 2023
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ffromani
Once this PR has been reviewed and has the lgtm label, please assign derekwaynecarr for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


@elezar elezar left a comment


Thanks @ffromani. I think the new containerMap struct looks good here.

There are still the open questions about needed == 0, but I don't think they're blockers.

LGTM.

Comment on lines +39 to +43
cm[containerID] = containerMapElem{
podUID: podUID,
containerName: containerName,
running: false,
}

Nit: could call AddWithRunningState()

Suggested change
cm[containerID] = containerMapElem{
podUID: podUID,
containerName: containerName,
running: false,
}
AddWithRunningState(podUID, containerName, containerID, false)

to only update the elements in one place.

}

// ContainerMap maps (containerID)->(*v1.Pod, *v1.Container)

Maybe add as well as the running state?

@@ -1040,3 +1056,13 @@ func (m *ManagerImpl) setPodPendingAdmission(pod *v1.Pod) {

m.pendingAdmissionPod = pod
}

func (m *ManagerImpl) checkContainerAlreadyRunning(podUID, cntName string) bool {

Nit: should this be named isContainerAlreadyRunning()?

// kubelet restart flow.
if !m.sourcesReady.AllReady() {
// if the container is reported running and it had a device assigned, things are already fine and we can just bail out
if needed == 0 && m.checkContainerAlreadyRunning(podUID, contName) {

This PR doesn't address the questions about needed == 0 in #118635. My question there was whether we could use:

	if needed == 0 {
		klog.V(3).InfoS("no devices needed, nothing to do", "deviceNumber", needed, "resourceName", resource, "podUID", string(podUID), "containerName", contName)
		// No change, no work.
		return nil, nil
	}

As the first check in this function.

@ffromani (Contributor, Author) replied:

You are totally right, I should have mentioned this more explicitly. The reason why I didn't address your comment is that I'm still evaluating the flow here. I tend to agree we can simplify, but I want to run a few more tests before changing the code (or explain why I believe it's better not to change it).

@bart0sh bart0sh added this to Triage in SIG Node PR Triage Jul 12, 2023
@bart0sh bart0sh moved this from Triage to WIP in SIG Node PR Triage Jul 12, 2023
@ffromani

This approach looks nicer, but after the conversation in https://github.com/kubernetes/kubernetes/pull/118635/files/71fdf75dc74d2705f687d75320233b8a5553b62b#r1261023770 I think we should aim for a deeper refactoring/code reorganization. So I'm keeping the fix minimal and deferring the proper redesign to follow-up work.

@ffromani ffromani closed this Jul 12, 2023
SIG Node PR Triage automation moved this from WIP to Done Jul 12, 2023
@ffromani ffromani deleted the devmgr-check-pod-running-cntmap branch August 28, 2023 09:36

Successfully merging this pull request may close these issues.

Running pods with devices are terminated if kubelet is restarted