
kubelet: Handle UID reuse in pod worker #104847

Merged
merged 2 commits into kubernetes:master on Sep 16, 2021

Conversation

@smarterclayton (Contributor) commented Sep 8, 2021

If a pod is killed (no longer wanted) and a subsequent create/add/update event is then seen in the pod worker, assume that the pod UID was reused (as it can be for static pods). In that case, the next SyncKnownPods after the pod terminates removes the worker history so that the config loop can restart the static pod, and the termination is reported to the caller as not final.

  1. If the pod worker sees a KILL update followed by a CREATE update for the same UID, it sets a restartRequested = true flag on the worker status
  2. When SyncKnownPods runs, it reports any terminated worker with that flag as TemporarilyTerminatedWork
  3. The pod housekeeping loop (which reconciles the pod worker with the config state) checks which pods were terminated during the sync but may need a restart (TemporarilyTerminatedWork), and starts them if they are not terminal (i.e. still desired and admitted)

A pod that restarts this way waits at most one housekeeping loop period (2s) between being terminated and starting again.
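
The flag-setting step can be pictured with a heavily simplified sketch. Everything below (package name, type names, the updatePod signature) is a stand-in for illustration, not the kubelet's actual pod worker code; only the restartRequested idea comes from this PR.

    package podworker

    import "time"

    // Simplified stand-ins for the kubelet's pod worker state (illustration only).
    type UID string

    type workType int

    const (
        workCreate workType = iota
        workKill
    )

    type podSyncStatus struct {
        terminatingAt    time.Time // non-zero once a kill has been requested
        terminatedAt     time.Time // non-zero once the pod has fully stopped
        restartRequested bool      // set when a CREATE arrives after a KILL (UID reuse)
    }

    type podWorkers struct {
        statuses map[UID]*podSyncStatus
    }

    // updatePod sketches step 1: a CREATE observed for a UID that is already being
    // killed does not start new work; it only flags the status so that a later
    // SyncKnownPods can report the worker as restartable once termination finishes.
    func (p *podWorkers) updatePod(uid UID, t workType) {
        status, ok := p.statuses[uid]
        if !ok {
            status = &podSyncStatus{}
            p.statuses[uid] = status
        }
        switch t {
        case workKill:
            if status.terminatingAt.IsZero() {
                status.terminatingAt = time.Now()
            }
        case workCreate:
            if !status.terminatingAt.IsZero() {
                status.restartRequested = true
                return // the restart happens later, via SyncKnownPods + housekeeping
            }
        }
        // ... dispatch to the per-pod sync goroutine elided ...
    }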

/kind bug
/sig node

Fixes #104648

TODO:

  • verifying this fixes the race

Release note: When a static pod file is deleted and recreated while using a fixed UID, the pod was not properly restarted.

@smarterclayton (Contributor, Author) commented Sep 8, 2021

In theory this fixes the problem but you have to wait for a reconcile loop of ~90s. A slightly more complex impl might queue the next pod. I'll take a look at that.

Ideally SyncKnownPods would be able to restart workers, but SyncKnownPods doesn't get passed "the pods the pod worker should know about" today. It needs to be "the admitted pods that should be running" which is a subset of what is in the pod manager.

@smarterclayton (Contributor, Author) commented Sep 9, 2021

Ok, after thinking through this some more I'm fairly convinced this is safe. Builds on top of #104817 (which renames a bit of the admission logic):

  1. If the pod worker sees a KILL update followed by a CREATE update for the same UID, it sets a restartRequested = true flag on the worker status
  2. When SyncKnownPods runs, it reports any terminated worker with that flag as TemporarilyTerminatedWork
  3. The pod housekeeping loop (which reconciles the pod worker with the config state) checks which pods were terminated during the sync but may need a restart, and starts them if they are not terminal (i.e. still desired and admitted)

A pod that restarts this way waits at most one housekeeping loop period (2s) between being terminated and starting again.
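
Steps 2 and 3 can be sketched the same way, continuing the simplified stand-ins from the earlier sketch (again, not the real SyncKnownPods or housekeeping code, whose signatures and bookkeeping differ):

    // syncKnownPods sketches step 2: terminated workers whose UID is no longer
    // desired are forgotten; terminated workers with restartRequested are also
    // forgotten, but reported back so the caller knows the termination was not final.
    func (p *podWorkers) syncKnownPods(desired map[UID]bool) (restartable []UID) {
        for uid, status := range p.statuses {
            if status.terminatedAt.IsZero() {
                continue // still running or still terminating; leave it alone
            }
            if !desired[uid] {
                delete(p.statuses, uid)
                continue
            }
            if status.restartRequested {
                // Dropping the entry means the next CREATE builds a brand-new status.
                delete(p.statuses, uid)
                restartable = append(restartable, uid)
            }
        }
        return restartable
    }

    // handlePodCleanups sketches step 3: the housekeeping loop (run every ~2s)
    // re-dispatches a CREATE for each pod that was only temporarily terminated.
    func (p *podWorkers) handlePodCleanups(desired map[UID]bool) {
        for _, uid := range p.syncKnownPods(desired) {
            if desired[uid] {
                p.updatePod(uid, workCreate) // restart the static pod under the reused UID
            }
        }
    }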

@249043822 (Member) commented Sep 10, 2021

/priority critical-urgent
/triage accepted


/*
Testname: Mirror Pod, recreate
Description: When a static pod's manifest is removed and readded, the mirror pod MUST successfully recreate. Create the static pod, verify it is running, remove its manifest and then add it back, and verify the static pod runs again.
*/
ginkgo.It("should successfully recreate when file is removed and recreated [NodeConformance]", func() {
@gjkim42 (Member) Sep 16, 2021

What happens if the config file is removed and recreated multiple times in rapid succession?
Do we need to test that?

@rphillips (Member) Sep 16, 2021

I put a loop around the create/delete of the static pod locally and it works. We can follow up with a separate PR to increase the test coverage.

@derekwaynecarr (Member) left a comment

I have a question about the workingPods -> allPodsByUID case, but will investigate that myself a little more deeply. I do wonder if an alternative approach of letting the config source peek into the list of pods under management by UID would be simpler. If we had an existing pod under management with a conflicting UID observed during HandlePodAddition, we could possibly send back a conflict that would effectively ask it to requeue at a later time?

pkg/kubelet/kubelet_pods.go (resolved review thread)

start := kl.clock.Now()
mirrorPod, _ := kl.podManager.GetMirrorPodByPod(pod)
klog.V(3).InfoS("Pod is restartable after termination due to UID reuse", "pod", klog.KObj(pod), "podUID", pod.UID)
kl.dispatchWork(pod, kubetypes.SyncPodCreate, mirrorPod, start)
@derekwaynecarr (Member) Sep 16, 2021

This basically recreates the HandlePodAdditions work; can we have the shared piece call a common internal helper?

@derekwaynecarr (Member) Sep 16, 2021

Spoke with @smarterclayton about the alternative option of having HandlePodAdditions send back a conflict response when it gets an Add event for a pod that exists under management with the same UID. There are other complications around how to best handle the requeue in that flow, and I don't want to block this PR on them versus getting back to a functional state. I think this is my only nit that I would like to address, so that HandlePodAdditions and this area delegate to a common code path that couples dispatchWork and probes together.

@smarterclayton (Contributor, Author) Sep 16, 2021

I would say that the pod worker owns "what is the kubelet doing" and HandlePodAdditions is delivering incremental changes to the pod worker, so I was going to try to minimize what HandlePodAdditions does (for better or worse, pod housekeeping is the resync between desired state and actual state). I think that any code that would need to be shared by handle pod addition (which controls admission) and housekeeping (which controls resync of the various loops) should instead live in the pod worker, since the pod worker owns pod lifecycle. The probe manager was the only such code today, and I wanted to move probe manager addition under the pod worker, not outside it. Having a shared function there would actually confuse things a bit more, because I want there to be no shared "make the stuff aware of the start of a pod" (that's the pod worker's job).

@derekwaynecarr (Member) Sep 16, 2021

fair enough.

pkg/kubelet/pod_workers.go (resolved review thread)
pkg/kubelet/pod_workers.go (resolved review thread)

@@ -977,12 +999,16 @@ func (p *podWorkers) SyncKnownPods(desiredPods []*v1.Pod) map[types.UID]PodWorkType

 	p.podsSynced = true
 	for uid, status := range p.podSyncStatuses {
-		if _, exists := known[uid]; !exists {
+		if _, exists := known[uid]; !exists || status.restartRequested {
@derekwaynecarr (Member) Sep 16, 2021

So the restartRequested knob on status will ensure that the worker is terminated, and the status mapping is then cleared for that pod UID. restartRequested is therefore a one-way flag and we don't ever need to reset it. Ignore my earlier comment/question.
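
A tiny demo, continuing the simplified sketches above rather than the real kubelet code, that walks one reused UID through the flow and shows why the flag never needs an explicit reset:

    // demoUIDReuse walks the simplified types above through the UID-reuse flow.
    func demoUIDReuse() {
        p := &podWorkers{statuses: map[UID]*podSyncStatus{}}
        uid := UID("static-pod-uid")

        p.updatePod(uid, workKill)                // pod is no longer wanted
        p.updatePod(uid, workCreate)              // manifest recreated with the same UID
        p.statuses[uid].terminatedAt = time.Now() // pretend the worker finished stopping

        restartable := p.syncKnownPods(map[UID]bool{uid: true})
        // restartable == [uid]; the old status entry is gone, so the CREATE that
        // housekeeping now dispatches builds a fresh status with restartRequested == false.
        _ = restartable
    }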

@derekwaynecarr (Member) left a comment

/lgtm
/approve

@k8s-ci-robot (Contributor) commented Sep 16, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: derekwaynecarr, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@derekwaynecarr (Member) commented Sep 16, 2021

/test pull-kubernetes-node-kubelet-serial-containerd
/test pull-kubernetes-node-kubelet-serial

@rphillips (Member) commented Sep 16, 2021

/hold cancel

@k8s-ci-robot (Contributor) commented Sep 16, 2021

@smarterclayton: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                                       Commit   Details  Rerun command
pull-kubernetes-node-kubelet-serial-containerd  d571980  link     /test pull-kubernetes-node-kubelet-serial-containerd
pull-kubernetes-node-kubelet-serial             d571980  link     /test pull-kubernetes-node-kubelet-serial
pull-kubernetes-integration                     d571980  link     /test pull-kubernetes-integration

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@rphillips (Member) commented Sep 16, 2021

TestCustomResourceCascadingDeletion flake in pull-kubernetes-integration

/test pull-kubernetes-integration


@k8s-ci-robot k8s-ci-robot merged commit 51384aa into kubernetes:master Sep 16, 2021
13 of 17 checks passed
SIG Node CI/Test Board automation moved this from Archive-it to Done Sep 16, 2021
SIG Node PR Triage automation moved this from Waiting on Author to Done Sep 16, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.23 milestone Sep 16, 2021
k8s-ci-robot added a commit that referenced this issue Sep 17, 2021
…04847-upstream-release-1.22

Automated cherry pick of #104847: kubelet: Handle UID reuse in pod worker
ehashman added a commit to ehashman/kubernetes that referenced this issue Nov 24, 2021
A pod that has been rejected by admission will have status manager
set the phase to Failed locally, which may take some time to
propagate to the apiserver. The rejected pod will be included in
admission until the apiserver propagates the change back, which
was an unintended regression when checking pod worker state as
authoritative.

A pod that is terminal in the API may still be consuming resources
on the system, so it should still be included in admission.

[ehashman] Rebased on top of kubernetes#104847.