kubelet: fix create sandbox delete pod race #98933

Merged

rphillips
Member

What type of PR is this?
/kind bug
/sig node
/kind flake

What this PR does / why we need it:
This PR addresses a race condition where createPodSandbox can return an error from various external modules (CNI, CRI, CSI) when the pod has been deleted while its sandbox was being created. This is not an error case, so the kubelet should not log an error or record an event for it. I have added a log message at verbosity level 4 for anyone who would like to trace this code path.

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. sig/node Categorizes an issue or PR as relevant to SIG Node. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. kind/flake Categorizes issue or PR as related to a flaky test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Feb 9, 2021
Member

@ehashman ehashman left a comment

/triage accepted
/priority important-soon
/lgtm

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Feb 9, 2021
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 9, 2021
@ehashman ehashman added this to Needs Approver in SIG Node PR Triage Feb 9, 2021
@rphillips rphillips force-pushed the fixes/create_sandbox_delete_pod_race branch from 84eca53 to 27f6f83 Compare February 9, 2021 23:00
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Feb 9, 2021
@rphillips rphillips changed the title from "kubelet: fix create create sandbox delete pod race" to "kubelet: fix create sandbox delete pod race" Feb 9, 2021
@rphillips
Member Author

/retest

@rphillips rphillips force-pushed the fixes/create_sandbox_delete_pod_race branch from 27f6f83 to 4940c71 Compare February 10, 2021 20:39
@rphillips
Member Author

rebased with current master

Member

@ehashman ehashman left a comment

I wonder if we've broken the assumptions here, causing this test to fail:

    for {
        if time.Now().Sub(start).Seconds() > 19 {
            break
        }
        pod, err := f.ClientSet.CoreV1().Pods(ns).Get(context.TODO(), mirrorPodName, metav1.GetOptions{})
        framework.ExpectNoError(err)
        if pod.Status.Phase != v1.PodRunning {
            framework.Failf("expected the mirror pod %q to be running, got %q", mirrorPodName, pod.Status.Phase)
        }
        // have some pause in between the API server queries to avoid throttling
        time.Sleep(time.Duration(200) * time.Millisecond)
    }

The mirror pod has been deleted by the time we hit that code block, so I'm not sure why the test is asserting it will still be there.

The other failures look like possible flakes.
/retest


// GetPodStatus and the following SyncPod will not return errors in the
// case where the pod has been deleted. We are not adding any pods into
// the fakePodProvider so they are 'deleted'.
Member

👍

@kikisdeliveryservice kikisdeliveryservice moved this from Needs Approver to Needs Reviewer in SIG Node PR Triage Feb 11, 2021
@rphillips rphillips force-pushed the fixes/create_sandbox_delete_pod_race branch from 4940c71 to f989ada Compare February 18, 2021 17:22
@rphillips
Member Author

/retest

@rphillips rphillips closed this Feb 18, 2021
SIG Node PR Triage automation moved this from Needs Reviewer to Done Feb 18, 2021
@rphillips rphillips reopened this Feb 18, 2021
@rphillips
Member Author

... pushed the wrong button on the retest

@rphillips
Member Author

/retest

Member

@ehashman ehashman left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 18, 2021
@ehashman ehashman moved this from Done to Needs Approver in SIG Node PR Triage Feb 18, 2021
@mrunalp
Contributor

mrunalp commented Feb 18, 2021

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ehashman, mrunalp, rphillips

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 18, 2021
@rphillips
Member Author

/retest

@rphillips
Member Author

/retest

@rphillips
Member Author

/retest

@fejta-bot
/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. kind/flake Categorizes issue or PR as related to a flaky test. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note-none Denotes a PR that doesn't merit a release note. sig/node Categorizes an issue or PR as relevant to SIG Node. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
5 participants