TEST ONLY - Test a fix for race condition in sync #855

smarterclayton · 2021-07-12T20:24:08Z

No description provided.

A number of race conditions exist when pods are terminated early in their lifecycle, because components in the kubelet need to know "no running containers" or "containers can't be started from now on" but were relying on outdated state. Only the pod worker knows whether containers are being started for a given pod, which is required to know when a pod is "terminated" (no running containers, none coming).

Move that responsibility and the podKiller function into the pod workers, and have everything that was killing the pod go through the UpdatePod loop. Split syncPod into three phases (setup, terminate containers, and cleanup pod) and make transitions between those phases visible to other components. After this change, to kill a pod you tell the pod worker to UpdatePod({UpdateType: SyncPodKill, Pod: pod}).

Several places in the kubelet were incorrect about whether they were handling terminating (should stop running, might have containers) or terminated (no running containers) pods. The pod worker exposes methods that let other loops know when to set up or tear down resources based on the state of the pod. These methods remove the possibility of race conditions by ensuring a single component is responsible for knowing each pod's allowed state; other components simply check by UID whether they are within the allowed window.

Removing containers no longer blocks final pod deletion in the API server and is handled as background cleanup. Node shutdown no longer marks pods as failed, as they can be restarted in the next step.

See https://docs.google.com/document/d/1Pic5TPntdJnYfIpBeZndDelM-AbS4FN9H2GTLFhoJ04/edit# for details
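For readers who want a concrete picture of the flow described above, here is a minimal, self-contained sketch of a single component that owns each pod's lifecycle state and answers queries by UID. It is not the kubelet code from this PR: apart from `UpdatePod`, `UpdateType`, `SyncPodKill`, and `Pod`, which appear in the description, every name here (`phase`, `CanStartContainers`, `CouldHaveRunningContainers`, `ContainersStopped`, and the simplified `Pod` struct) is a hypothetical stand-in chosen for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// UID and Pod are simplified stand-ins for the Kubernetes types.
type UID string

type Pod struct{ UID UID }

// SyncPodType is a reduced version of the update types; only two are modeled.
type SyncPodType int

const (
	SyncPodUpdate SyncPodType = iota // ordinary sync
	SyncPodKill                      // request termination
)

type UpdatePodOptions struct {
	UpdateType SyncPodType
	Pod        *Pod
}

// phase is a hypothetical model of the three stages named in the description.
type phase int

const (
	phaseSetup       phase = iota // containers may be started
	phaseTerminating              // must stop; containers may still be running
	phaseTerminated               // no running containers, none coming
)

// podWorkers is the single owner of per-pod lifecycle state in this sketch.
type podWorkers struct {
	mu     sync.Mutex
	phases map[UID]phase
}

func newPodWorkers() *podWorkers {
	return &podWorkers{phases: make(map[UID]phase)}
}

// UpdatePod is the only entry point that changes state. A kill request moves
// the pod to terminating; later ordinary updates never move it back to setup.
func (w *podWorkers) UpdatePod(o UpdatePodOptions) {
	w.mu.Lock()
	defer w.mu.Unlock()
	cur, known := w.phases[o.Pod.UID]
	switch {
	case o.UpdateType == SyncPodKill:
		if !known || cur == phaseSetup {
			w.phases[o.Pod.UID] = phaseTerminating
		}
	case !known:
		w.phases[o.Pod.UID] = phaseSetup
	}
}

// ContainersStopped records that the terminate-containers phase finished.
// It reports whether the pod actually moved from terminating to terminated.
func (w *podWorkers) ContainersStopped(uid UID) bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	if p, ok := w.phases[uid]; ok && p == phaseTerminating {
		w.phases[uid] = phaseTerminated
		return true
	}
	return false
}

// CanStartContainers reports whether the pod is still in its setup window.
func (w *podWorkers) CanStartContainers(uid UID) bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	p, ok := w.phases[uid]
	return ok && p == phaseSetup
}

// CouldHaveRunningContainers reports whether containers might still exist.
func (w *podWorkers) CouldHaveRunningContainers(uid UID) bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	p, ok := w.phases[uid]
	return ok && p != phaseTerminated
}

func main() {
	w := newPodWorkers()
	pod := &Pod{UID: "pod-a"}

	w.UpdatePod(UpdatePodOptions{UpdateType: SyncPodUpdate, Pod: pod})
	fmt.Println("after sync, can start:", w.CanStartContainers(pod.UID))

	w.UpdatePod(UpdatePodOptions{UpdateType: SyncPodKill, Pod: pod})
	fmt.Println("after kill, can start:", w.CanStartContainers(pod.UID))
	fmt.Println("after kill, may be running:", w.CouldHaveRunningContainers(pod.UID))

	w.ContainersStopped(pod.UID)
	fmt.Println("after stop, may be running:", w.CouldHaveRunningContainers(pod.UID))
}
```

The point of the design is that other loops (volume setup, status reporting, cleanup) never track termination themselves; they only ask the owner, so a stale local view cannot restart or tear down resources at the wrong time.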

openshift-ci-robot · 2021-07-12T20:24:16Z

@smarterclayton: the contents of this pull request could not be automatically validated.

The following commits are valid:

bb28add|UPSTREAM: 102344: Keep pod worker running until pod is truly complete: the upstream PR kubernetes/kubernetes#102344 has merged

The following commits could not be validated and must be approved by a top-level approver:

21a37b5|DO NOT MERGE: Test fix for race: does not specify an upstream backport in the commit message
c0cbfad|set verbosity to 4: does not specify an upstream backport in the commit message

openshift-ci · 2021-07-12T20:24:22Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: smarterclayton
To complete the pull request process, please assign marun after the PR has been reviewed.
You can assign the PR to them by writing /assign @marun in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

DOWNSTREAM_OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

smarterclayton · 2021-07-13T15:17:09Z

/retest

openshift-ci · 2021-07-13T17:50:19Z

@smarterclayton: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/verify-commits | `21a37b5` | link | `/test verify-commits` |
| ci/prow/unit | `21a37b5` | link | `/test unit` |
| ci/prow/verify | `21a37b5` | link | `/test verify` |
| ci/prow/e2e-gcp | `21a37b5` | link | `/test e2e-gcp` |
| ci/prow/e2e-aws-fips | `21a37b5` | link | `/test e2e-aws-fips` |
| ci/prow/e2e-gcp-upgrade | `21a37b5` | link | `/test e2e-gcp-upgrade` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

ehashman · 2021-07-13T18:27:38Z

Not seeing the panic in the AWS FIPS job anymore 👀

smarterclayton and others added 3 commits July 9, 2021 11:54

set verbosity to 4

c0cbfad

DO NOT MERGE: Test fix for race

21a37b5

openshift-ci-robot added the backports/unvalidated-commits Indicates that not all commits come to merged upstream PRs. label Jul 12, 2021

smarterclayton mentioned this pull request Jul 12, 2021

Bug 1952224: UPSTREAM: 102344: Keep pod worker running until pod is truly complete #851

Closed

openshift-ci bot added the vendor-update Touching vendor dir or related files label Jul 12, 2021

openshift-ci bot requested review from marun and rphillips July 12, 2021 20:24

smarterclayton mentioned this pull request Jul 13, 2021

Kubelet panic in syncTerminatedPod kubernetes/kubernetes#103625

Closed

smarterclayton closed this Jul 14, 2021
