Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TEST ONLY - Test a fix for race condition in sync #855

Closed
wants to merge 3 commits into from

Conversation

smarterclayton
Copy link

No description provided.

smarterclayton and others added 3 commits July 9, 2021 11:54
A number of race conditions exist when pods are terminated early in
their lifecycle because components in the kubelet need to know "no
running containers" or "containers can't be started from now on" but
were relying on outdated state.

Only the pod worker knows whether containers are being started for
a given pod, which is required to know when a pod is "terminated"
(no running containers, none coming). Move that responsibility and
podKiller function into the pod workers, and have everything that
was killing the pod go into the UpdatePod loop. Split syncPod into
three phases - setup, terminate containers, and cleanup pod - and
have transitions between those methods be visible to other
components. After this change, to kill a pod you tell the pod worker
to UpdatePod({UpdateType: SyncPodKill, Pod: pod}).

Several places in the kubelet were incorrect about whether they
were handling terminating (should stop running, might have
containers) or terminated (no running containers) pods. The pod worker
exposes methods that allow other loops to know when to set up or tear
down resources based on the state of the pod - these methods remove
the possibility of race conditions by ensuring a single component is
responsible for knowing each pod's allowed state and other components
simply delegate to checking whether they are in the window by UID.

Removing containers now no longer blocks final pod deletion in the
API server and are handled as background cleanup. Node shutdown
no longer marks pods as failed as they can be restarted in the
next step.

See https://docs.google.com/document/d/1Pic5TPntdJnYfIpBeZndDelM-AbS4FN9H2GTLFhoJ04/edit# for details
@openshift-ci-robot openshift-ci-robot added the backports/unvalidated-commits Indicates that not all commits come to merged upstream PRs. label Jul 12, 2021
@openshift-ci-robot
Copy link

@smarterclayton: the contents of this pull request could not be automatically validated.

The following commits are valid:

The following commits could not be validated and must be approved by a top-level approver:

@openshift-ci
Copy link

openshift-ci bot commented Jul 12, 2021

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: smarterclayton
To complete the pull request process, please assign marun after the PR has been reviewed.
You can assign the PR to them by writing /assign @marun in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the vendor-update Touching vendor dir or related files label Jul 12, 2021
@openshift-ci openshift-ci bot requested review from marun and rphillips July 12, 2021 20:24
@smarterclayton
Copy link
Author

/retest

@openshift-ci
Copy link

openshift-ci bot commented Jul 13, 2021

@smarterclayton: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Rerun command
ci/prow/verify-commits 21a37b5 link /test verify-commits
ci/prow/unit 21a37b5 link /test unit
ci/prow/verify 21a37b5 link /test verify
ci/prow/e2e-gcp 21a37b5 link /test e2e-gcp
ci/prow/e2e-aws-fips 21a37b5 link /test e2e-aws-fips
ci/prow/e2e-gcp-upgrade 21a37b5 link /test e2e-gcp-upgrade

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@ehashman
Copy link

Not seeing the panic in the AWS FIPS job anymore 👀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backports/unvalidated-commits Indicates that not all commits come to merged upstream PRs. vendor-update Touching vendor dir or related files
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants