Ensure Kubelet always reports terminating pod container status #88440
The kubelet must not allow a container that was reported failed in a restartPolicy=Never pod to be reported to the apiserver as success. If a client deletes a restartPolicy=Never pod, dispatchWork and the status manager race to update the container status. The shortcut detection for pods in the Succeeded or Failed phases did not ensure all containers had reported status, and combined with the TerminatePod method treating unrecognized containers as terminated with exit code 0, the Kubelet could report to the apiserver that every container status was "success". On the next sync to the apiserver the status of the pod could then be reset to Succeeded, which is a violation of the guarantees around terminal phase. A higher-level controller watching this could see "pod succeeded" instead of "pod failed" and incorrectly take action based on the perceived success.
We should not invoke TerminatePod until the Kubelet has cleaned up and reported all container status, regardless of phase. We should guard against TerminatePod accidentally being invoked when no container status is available by reporting 137 as the exit code instead of exit code 0 (never report success when we don't know something succeeded). The kubelet will consider any container without a reported status to have failed rather than succeeded.
Adds an e2e test that detects the kubelet reporting success when the container actually failed.
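As a rough illustration of the invariant at stake (and of what the new e2e test effectively checks), here is a minimal sketch using the core/v1 types; the helper name is hypothetical and this is not the code in this PR:

```go
package invariant

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// verifyTerminalPhase is a hypothetical helper encoding the guarantee this PR
// protects: a pod must not be reported as Succeeded unless every container in
// its spec has a terminated status with exit code 0.
func verifyTerminalPhase(pod *v1.Pod) error {
	if pod.Status.Phase != v1.PodSucceeded {
		return nil
	}
	if len(pod.Status.ContainerStatuses) != len(pod.Spec.Containers) {
		return fmt.Errorf("pod %s is Succeeded before all containers reported status", pod.Name)
	}
	for _, cs := range pod.Status.ContainerStatuses {
		t := cs.State.Terminated
		if t == nil {
			return fmt.Errorf("pod %s is Succeeded but container %s has no terminated state", pod.Name, cs.Name)
		}
		if t.ExitCode != 0 {
			return fmt.Errorf("pod %s is Succeeded but container %s exited with code %d", pod.Name, cs.Name, t.ExitCode)
		}
	}
	return nil
}
```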
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: smarterclayton
The full list of commands accepted by this bot can be found here.
The pull request process is described here
That's a legitimate flake error, although it should never happen either. Probably will be a separate bug:
This new test will have to tolerate that failure type
The kubelet can race when a pod is deleted and report that a container succeeded when it instead failed, and thus the pod is reported as succeeded. Create an e2e test that demonstrates this failure.
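A sketch of the kind of fixture such a test could use (the image, command, and names are illustrative, not the actual test added here): a restartPolicy=Never pod whose container fails shortly after start, which the test deletes while the container is still running and then watches for a terminal phase.

```go
package e2esketch

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// failingPod returns an illustrative restartPolicy=Never pod whose only
// container exits non-zero after a short delay; deleting it while the
// container is still running must never result in a Succeeded phase.
func failingPod(name string) *v1.Pod {
	return &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Spec: v1.PodSpec{
			RestartPolicy: v1.RestartPolicyNever,
			Containers: []v1.Container{{
				Name:    "fail",
				Image:   "busybox",
				Command: []string{"/bin/sh", "-c", "sleep 5; exit 1"},
			}},
		},
	}
}
```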
The kubelet must not allow a container that was reported failed in a restartPolicy=Never pod to be reported to the apiserver as success. If a client deletes a restartPolicy=Never pod, dispatchWork and the status manager race to update the container status. When dispatchWork (specifically podIsTerminated) returns true, it means all containers are stopped and the reported container status is accurate. However, the TerminatePod method then clears this status. This results in a pod that has been reported with status.phase=Failed getting reset to status.phase=Succeeded, which is a violation of the guarantees around terminal phase. Ensure the Kubelet never reports that a container succeeded when it hasn't run or been executed by guarding the terminate pod loop from ever reporting 0 in the absence of container status.
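A minimal sketch of that guard, assuming a hypothetical helper name (the reason and message strings are also illustrative): any container the kubelet cannot account for is reported as terminated with exit code 137, never 0.

```go
package sketch

import v1 "k8s.io/api/core/v1"

// fillMissingTerminalStatuses is a hypothetical sketch of the guard: when a
// pod is terminated, any spec container without a recorded status is reported
// as failed (exit code 137) rather than defaulting to success (exit code 0).
func fillMissingTerminalStatuses(pod *v1.Pod, statuses []v1.ContainerStatus) []v1.ContainerStatus {
	seen := make(map[string]bool, len(statuses))
	for _, s := range statuses {
		seen[s.Name] = true
	}
	for _, c := range pod.Spec.Containers {
		if seen[c.Name] {
			continue
		}
		statuses = append(statuses, v1.ContainerStatus{
			Name: c.Name,
			State: v1.ContainerState{
				Terminated: &v1.ContainerStateTerminated{
					ExitCode: 137,
					Reason:   "ContainerStatusUnknown",
					Message:  "The container could not be located when the pod was terminated",
				},
			},
		})
	}
	return statuses
}
```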
After a pod reaches a terminal state and all containers are complete we can delete the pod from the API server. The dispatchWork method needs to wait for all container status to be available before invoking delete. Even after the worker stops, status updates will continue to be delivered and the sync handler will continue to sync the pods, so dispatchWork gets multiple opportunities to see status. The previous code assumed that a pod in the Failed or Succeeded phase had no running containers, but an evicted or deleted pod could still have running containers whose status needed to be reported. This modifies the earlier test to guarantee that the "fallback" exit code 137 is never reported, matching the expectation that all pods exit with valid status for all containers (unless an exceptional failure like eviction occurs while the test is running).
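For illustration, a hedged sketch of the condition dispatchWork needs before treating the pod as done (the helper name is an assumption, not the kubelet's podIsTerminated):

```go
package sketch

import v1 "k8s.io/api/core/v1"

// allContainersReported only returns true once every container declared in
// the pod spec has a terminated status, so deletion is not triggered while
// status for a still-running container is outstanding.
func allContainersReported(pod *v1.Pod) bool {
	byName := make(map[string]v1.ContainerStatus, len(pod.Status.ContainerStatuses))
	for _, s := range pod.Status.ContainerStatuses {
		byName[s.Name] = s
	}
	for _, c := range pod.Spec.Containers {
		s, ok := byName[c.Name]
		if !ok || s.State.Terminated == nil {
			return false
		}
	}
	return true
}
```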
When constructing the API status of a pod, if the pod is marked for deletion no containers should be started. Previously, if a container inside a terminating pod failed to start due to a container runtime error (which populates the reasonCache), the reasonCache would remain populated (it is only updated by syncPod for non-terminating pods) and the delete action on the pod would be delayed until the reasonCache entry expired due to activity from other pods. This dramatically reduces the amount of time the Kubelet waits to delete pods that are terminating and encountered a container runtime error.
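The signal involved is just the deletion timestamp; a trivial sketch of the rule (helper name hypothetical):

```go
package sketch

import v1 "k8s.io/api/core/v1"

// shouldStartContainers is a hypothetical sketch of the rule above: once a
// pod is marked for deletion the kubelet should not start (or report an
// intent to start) any of its containers, so stale reasonCache entries for
// failed starts no longer need to hold up status construction.
func shouldStartContainers(pod *v1.Pod) bool {
	// DeletionTimestamp is set by the apiserver when a delete is requested.
	return pod.DeletionTimestamp == nil
}
```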
The status manager syncBatch() method processes the current state of the cache, which should include all entries in the channel. Flush the channel before invoking syncBatch to avoid unnecessary work and to unblock pod workers when the node is congested. Discovered while investigating long shutdown intervals on the node, where the status channel stayed full for tens of seconds. Also add a for loop around the select statement to avoid unnecessary invocations of the wait.Forever closure each time a message is received.
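A generic Go sketch of the two changes described (the channel, ticker, and sync functions are stand-ins, not the status manager's actual fields):

```go
package sketch

import "time"

// statusLoop illustrates the pattern: an outer for around the select so the
// closure handed to wait.Forever is not re-entered for every message, and a
// drain of the channel before syncBatch, since the batch reads the cache,
// which already reflects the queued updates.
func statusLoop(updates <-chan struct{}, batchTicker <-chan time.Time, syncPod func(), syncBatch func()) {
	for {
		select {
		case <-updates:
			syncPod()
		case <-batchTicker:
			// Drop queued updates: syncBatch covers them, and emptying the
			// channel unblocks pod workers on a congested node.
			for drained := false; !drained; {
				select {
				case <-updates:
				default:
					drained = true
				}
			}
			syncBatch()
		}
	}
}
```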
Found one bug in this PR (fixed earlier) where the wrong default status was being used for init containers (the defaulting loop was iterating over containers but tried to set initContainer statuses).
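For what that bug class looks like, a sketch with hypothetical names: defaults for regular containers and init containers each built from their own spec list.

```go
package sketch

import v1 "k8s.io/api/core/v1"

// defaultStatuses builds placeholder statuses from a single container list;
// calling it separately for Containers and InitContainers avoids iterating
// one list while populating the other.
func defaultStatuses(containers []v1.Container) []v1.ContainerStatus {
	out := make([]v1.ContainerStatus, 0, len(containers))
	for _, c := range containers {
		out = append(out, v1.ContainerStatus{Name: c.Name})
	}
	return out
}

// Usage sketch:
//   status.ContainerStatuses = defaultStatuses(pod.Spec.Containers)
//   status.InitContainerStatuses = defaultStatuses(pod.Spec.InitContainers)
```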
Still on hold while I investigate the panic flake (which is not happening anymore after a rebase, so I'm not sure what could have caused it).
The related panic stopped occurring (it was consistent until I rebased and then cleared). The flake for the 128 exit code is handled in the test and is no longer flaking.
Would appreciate a final pass (will keep rerunning for other flakes, but has been clean beyond those).