
Ensure Kubelet always reports terminating pod container status #88440

Merged

Conversation

@smarterclayton
Contributor

smarterclayton commented Feb 23, 2020

Fixes #58711

The kubelet must not allow a container that was reported failed in a restartPolicy=Never pod to be reported to the apiserver as a success. If a client deletes a restartPolicy=Never pod, dispatchWork and the status manager race to update the container status. The shortcut detection for pods in the Succeeded or Failed phases did not ensure that all containers had reported status, and combined with the TerminatePod method treating unrecognized containers as terminated with exit code 0, the Kubelet could report to the apiserver that all container status was "success". On the next sync to the apiserver the status of the pod could then be reset to Succeeded, which is a violation of the guarantees around terminal phase. A higher-level controller watching this could see "pod succeeded" instead of "pod failed", causing it to incorrectly take action based on perceived success.

We should not invoke TerminatePod until the Kubelet has cleaned up and reported all container status, regardless of phase. We should also guard against TerminatePod accidentally being invoked when no container status is available by reporting exit code 137 instead of exit code 0 (never report success when we don't know that something succeeded). The kubelet will consider any container without a waiting or terminated state to be still running, as it does for other operations.
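
The guard described above can be pictured with a minimal Go sketch. The types and names here (containerState, terminalExitCode) are simplified stand-ins for illustration only, not the kubelet's real structures:

```go
package main

import "fmt"

// containerState is a simplified stand-in for a container status; only the
// fields needed to illustrate the guard are modeled.
type containerState struct {
	Name       string
	Terminated bool  // a terminated state was actually observed
	ExitCode   int32 // only meaningful when Terminated is true
}

// terminalExitCode sketches the fallback: if no terminated state was ever
// observed for a container, assume failure (137) rather than success (0).
func terminalExitCode(c containerState) int32 {
	if c.Terminated {
		return c.ExitCode // trust the status we actually observed
	}
	return 137 // never report success for something we did not see succeed
}

func main() {
	fmt.Println(terminalExitCode(containerState{Name: "app", Terminated: true, ExitCode: 1})) // 1
	fmt.Println(terminalExitCode(containerState{Name: "app"}))                                // 137, not 0
}
```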

Adds an e2e test that detects reporting success when the container failed.

/kind bug

Terminating a restartPolicy=Never pod can no longer cause the pod to be reported as succeeded when it actually failed.
@k8s-ci-robot

Contributor

k8s-ci-robot commented Feb 23, 2020

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@smarterclayton smarterclayton force-pushed the smarterclayton:container_success_fix branch 2 times, most recently from 18c446f to 443b1bc Feb 23, 2020
@smarterclayton

Contributor Author

smarterclayton commented Feb 23, 2020

Still need to check the log output of the running cluster under test to verify that dispatchWork is being invoked to sync pods that are on the apiserver but terminal (from reading the code I expect it is).

@smarterclayton

Contributor Author

smarterclayton commented Feb 23, 2020

After thinking about this more, I need to change this up: we have to invoke the worker as long as any container is running, so bypassing on podTerminal is too aggressive.

@smarterclayton smarterclayton force-pushed the smarterclayton:container_success_fix branch from 443b1bc to 6fdde50 Feb 23, 2020
@smarterclayton smarterclayton force-pushed the smarterclayton:container_success_fix branch 4 times, most recently from 782bf33 to 5719d31 Feb 23, 2020
@sftim

Contributor

sftim commented Feb 24, 2020

I hope it's OK to
/retest

@smarterclayton

Contributor Author

smarterclayton commented Feb 24, 2020

That's a legitimate flake error, although it should never happen either. It will probably be a separate bug:

pod pod-submit-status-2-21 on node e2e-1c8de79c3b-674b9-minion-group-9q19 container unexpected exit code 128: start=2020-02-23 23:52:44 +0000 UTC end=2020-02-23 23:52:44 +0000 UTC reason=ContainerCannotRun message=OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"rootfs_linux.go:58: mounting \\\"/var/lib/kubelet/pods/f727fb86-18c0-406b-96b3-b6a950000f88/volumes/kubernetes.io~secret/default-token-kj42r\\\" to rootfs \\\"/var/lib/docker/overlay2/4f2e4e9bfe2e62719d04b85e7b24f0603ad1d6c2238b3cccdf8bb052db6dc28f/merged\\\" at \\\"/var/run/secrets/kubernetes.io/serviceaccount\\\" caused \\\"stat /var/lib/kubelet/pods/f727fb86-18c0-406b-96b3-b6a950000f88/volumes/kubernetes.io~secret/default-token-kj42r: no such file or directory\\\"\"": unknown

from https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/88440/pull-kubernetes-e2e-gce/1231718114933084162/

This new test will have to tolerate that failure type.

@smarterclayton smarterclayton force-pushed the smarterclayton:container_success_fix branch from 5719d31 to 2d8e8bd Feb 24, 2020
@smarterclayton

Contributor Author

smarterclayton commented Mar 3, 2020

/retest

@smarterclayton

Contributor Author

smarterclayton commented Mar 3, 2020

/retest

(unrelated)

@smarterclayton

Contributor Author

smarterclayton commented Mar 3, 2020

/test pull-kubernetes-e2e-gce

@smarterclayton

Contributor Author

smarterclayton commented Mar 3, 2020

The kind issues are interesting; investigating.

/hold

@smarterclayton smarterclayton force-pushed the smarterclayton:container_success_fix branch from 09b0f90 to faa8cf5 Mar 3, 2020
@k8s-ci-robot k8s-ci-robot removed the lgtm label Mar 3, 2020
@smarterclayton

Contributor Author

smarterclayton commented Mar 4, 2020

/retest

@smarterclayton

Contributor Author

smarterclayton commented Mar 4, 2020

/test pull-kubernetes-e2e-gce
/test pull-kubernetes-e2e-kind

@smarterclayton smarterclayton force-pushed the smarterclayton:container_success_fix branch from faa8cf5 to 7810e99 Mar 4, 2020
@smarterclayton

Contributor Author

smarterclayton commented Mar 4, 2020

/retest

smarterclayton and others added 5 commits Oct 19, 2019
The kubelet can race when a pod is deleted and report that a container succeeded
when it instead failed, and thus the pod is reported as succeeded. Create an e2e
test that demonstrates this failure.
The kubelet must not allow a container that was reported failed in a
restartPolicy=Never pod to be reported to the apiserver as success.
If a client deletes a restartPolicy=Never pod, the dispatchWork and
status manager race to update the container status. When dispatchWork
(specifically podIsTerminated) returns true, it means all containers
are stopped, which means the container status is accurate. However,
the TerminatePod method then clears this status. This results in a
pod that has been reported with status.phase=Failed getting reset to
status.phase=Succeeded, which is a violation of the guarantees around
terminal phase.

Ensure the Kubelet never reports that a container succeeded when it
hasn't run or been executed by guarding the terminate pod loop from
ever reporting 0 in the absence of container status.
After a pod reaches a terminal state and all containers are complete
we can delete the pod from the API server. The dispatchWork method
needs to wait for all container status to be available before invoking
delete. Even after the worker stops, status updates will continue to
be delivered and the sync handler will continue to sync the pods, so
dispatchWork gets multiple opportunities to see status.

The previous code assumed that a pod in Failed or Succeeded had no
running containers, but eviction or deletion of running pods could
still have running containers whose status needed to be reported.

This modifies the earlier test to guarantee that the "fallback" exit
code 137 is never reported to match the expectation that all pods
exit with valid status for all containers (unless some exceptional
failure like eviction were to occur while the test is running).
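
The precondition described in the commit message above can be sketched roughly in Go. The containerStatus type and okToDeleteFromAPIServer helper are simplified, hypothetical stand-ins for illustration, not the kubelet's actual code:

```go
package main

import "fmt"

// containerStatus is a simplified stand-in for a container status; only the
// terminated flag matters for this sketch.
type containerStatus struct {
	Name       string
	Terminated bool
}

// okToDeleteFromAPIServer sketches the precondition: only delete the pod from
// the API server once every container has reported a terminated status. A
// container without one is treated as still running, so deletion must wait.
func okToDeleteFromAPIServer(statuses []containerStatus) bool {
	if len(statuses) == 0 {
		return false // no status reported yet, keep waiting
	}
	for _, s := range statuses {
		if !s.Terminated {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(okToDeleteFromAPIServer([]containerStatus{{Name: "app", Terminated: true}}))  // true
	fmt.Println(okToDeleteFromAPIServer([]containerStatus{{Name: "app", Terminated: false}})) // false
}
```
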
When constructing the API status of a pod, if the pod is marked for
deletion, no containers should be started. Previously, if a container
inside of a terminating pod failed to start due to a container
runtime error (that populates reasonCache), the reasonCache would
remain populated (it is only updated by syncPod for non-terminating
pods) and the delete action on the pod would be delayed until the
reasonCache entry was evicted by entries from other pods.

This dramatically reduces the amount of time the Kubelet waits to
delete pods that are terminating and encountered a container runtime
error.
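
A minimal sketch of the rule described in the commit message above, under the assumption that "marked for deletion" simply means a non-nil deletion timestamp; the pod type and shouldStartContainers helper are hypothetical stand-ins:

```go
package main

import (
	"fmt"
	"time"
)

// pod is a simplified stand-in; only the deletion timestamp is modeled here.
type pod struct {
	Name              string
	DeletionTimestamp *time.Time
}

// shouldStartContainers sketches the rule: once a pod is marked for deletion
// (non-nil deletion timestamp), no further containers should be started while
// constructing or syncing its status.
func shouldStartContainers(p pod) bool {
	return p.DeletionTimestamp == nil
}

func main() {
	now := time.Now()
	fmt.Println(shouldStartContainers(pod{Name: "running-pod"}))                              // true
	fmt.Println(shouldStartContainers(pod{Name: "terminating-pod", DeletionTimestamp: &now})) // false
}
```
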
The status manager syncBatch() method processes the current state
of the cache, which should include all entries in the channel. Flush
the channel before we call syncBatch to avoid unnecessary work and
to unblock pod workers when the node is congested.

Discovered while investigating long shutdown intervals on the node
where the status channel stayed full for tens of seconds.

Add a for loop around the select statement to avoid unnecessary
invocations of the wait.Forever closure each time.
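
The pattern described in this commit message (a for loop around the select, draining queued updates before each batch sync) might look roughly like the following Go sketch; statusUpdate, run, and the syncBatch callback are illustrative stand-ins, not the status manager's real types:

```go
package main

import (
	"fmt"
	"time"
)

// statusUpdate is a placeholder for a per-pod status update message.
type statusUpdate struct{ podUID string }

// run sketches the pattern: a for loop wraps the select, and any updates
// already queued in the channel are drained before each batch sync, so a full
// channel does not keep pod workers blocked while updates are handled one by one.
func run(updates <-chan statusUpdate, stop <-chan struct{}, syncBatch func()) {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-updates:
			// Drain whatever else is queued; syncBatch reads the
			// authoritative cache, so the drained items themselves
			// can be discarded here.
		drain:
			for {
				select {
				case <-updates:
				default:
					break drain
				}
			}
			syncBatch()
		case <-ticker.C:
			syncBatch()
		}
	}
}

func main() {
	updates := make(chan statusUpdate, 8)
	stop := make(chan struct{})
	go run(updates, stop, func() { fmt.Println("syncBatch") })

	updates <- statusUpdate{podUID: "pod-a"}
	updates <- statusUpdate{podUID: "pod-b"} // typically drained into a single batch
	time.Sleep(100 * time.Millisecond)
	close(stop)
}
```
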
@smarterclayton smarterclayton force-pushed the smarterclayton:container_success_fix branch from 7810e99 to 8bc5cb0 Mar 4, 2020
@smarterclayton

Contributor Author

smarterclayton commented Mar 4, 2020

Found one bug in this PR (fixed earlier) where the wrong default status was being used for init containers (the for loop in defaulting iterated over containers but set initContainer status).

Still on hold while I investigate the panic flake (which is not happening anymore after a rebase, so I'm not sure what could have caused it).

@smarterclayton

Contributor Author

smarterclayton commented Mar 4, 2020

/hold cancel

The related panic stopped occurring (it was consistent until I rebased and then cleared). The flake for the 128 exit code is handled in the test and is no longer flaking.

Would appreciate a final pass (will keep rerunning for other flakes, but has been clean beyond those).

@smarterclayton

Contributor Author

smarterclayton commented Mar 4, 2020

/retest
/test pull-kubernetes-e2e-gcp

@sjenning

Contributor

sjenning commented Mar 4, 2020

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Mar 4, 2020
@k8s-ci-robot k8s-ci-robot merged commit 7a513b5 into kubernetes:master Mar 5, 2020
17 checks passed
cla/linuxfoundation smarterclayton authorized
pull-kubernetes-bazel-build Job succeeded.
pull-kubernetes-bazel-test Job succeeded.
pull-kubernetes-conformance-kind-ipv6-parallel Job succeeded.
pull-kubernetes-dependencies Job succeeded.
pull-kubernetes-e2e-gce Job succeeded.
pull-kubernetes-e2e-gce-100-performance Job succeeded.
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
pull-kubernetes-e2e-kind Job succeeded.
pull-kubernetes-e2e-kind-ipv6 Job succeeded.
pull-kubernetes-integration Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
pull-kubernetes-node-e2e Job succeeded.
pull-kubernetes-node-e2e-containerd Job succeeded.
pull-kubernetes-typecheck Job succeeded.
pull-kubernetes-verify Job succeeded.
tide In merge pool.
@k8s-ci-robot k8s-ci-robot added this to the v1.18 milestone Mar 5, 2020