
Fix pod worker deadlock. #65987

Merged
merged 1 commit into from Jul 9, 2018

Conversation

@Random-Liu
Member

Random-Liu commented Jul 9, 2018

Preemption will be stuck forever if killPodNow times out once. The sequence is:

  • killPodNow creates the response channel (size 0) and hands it to the pod worker.
  • killPodNow times out and returns.
  • The pod worker finishes killing the pod and tries to send the response back via the channel.

However, because the channel is unbuffered and the receiver has exited, the pod worker blocks forever.
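The failure mode above is the classic unbuffered-channel-with-timeout pattern in Go. Here is a minimal standalone sketch (not the kubelet code itself; names are illustrative): with capacity 1 the worker's single send always completes even after the caller has timed out, while with `make(chan error)` the worker goroutine would block forever.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Buffered with capacity 1: the fix. Change to make(chan error)
	// to reproduce the deadlock (the worker goroutine leaks).
	ch := make(chan error, 1)

	// Worker: finishes killing the pod after the caller has given up.
	go func() {
		time.Sleep(200 * time.Millisecond) // killing takes longer than the timeout
		ch <- nil                          // completes even though nobody is receiving
		fmt.Println("worker: response sent, worker exits")
	}()

	// Caller (killPodNow-style): waits for a response, but with a timeout.
	select {
	case err := <-ch:
		fmt.Println("caller: got response:", err)
	case <-time.After(50 * time.Millisecond):
		fmt.Println("caller: timed out")
	}

	time.Sleep(300 * time.Millisecond) // let the worker finish for the demo
}
```

With the buffered channel the program prints "caller: timed out" and then "worker: response sent, worker exits"; with an unbuffered channel the second line never appears because the send blocks forever.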

In @jingxu97's case, this caused a critical system pod (the apiserver) to be unable to come up, because the csi pod couldn't be preempted.

I checked the history; the bug was introduced two years ago in 6fefb42.

I think we should at least cherrypick this to 1.11 since preemption is beta and enabled by default in 1.11.

@kubernetes/sig-node-bugs @derekwaynecarr @dashpole @yujuhong
Signed-off-by: Lantao Liu lantaol@google.com

Fix a bug that preempting a pod may block forever.
Fix pod worker deadlock.
Signed-off-by: Lantao Liu <lantaol@google.com>
@k8s-github-robot

Contributor

k8s-github-robot commented Jul 9, 2018

[MILESTONENOTIFIER] Milestone Pull Request Needs Approval

@Random-Liu @derekwaynecarr @yujuhong @kubernetes/sig-node-misc

Action required: This pull request must have the status/approved-for-milestone label applied by a SIG maintainer.

Pull Request Labels
  • sig/node: Pull Request will be escalated to these SIGs if needed.
  • priority/critical-urgent: Never automatically move pull request out of a release milestone; continually escalate to contributor and SIG through all available channels.
  • kind/bug: Fixes a bug discovered during the current release.
Help
@dashpole

Contributor

dashpole commented Jul 9, 2018

Kubelet preemption is not related to the priority and preemption beta feature. It is a feature that was introduced with the critical pod annotation, prior to priority and preemption. I would suggest cherry-picking this back to older branches since it is quite small.

@yujuhong

Member

yujuhong commented Jul 9, 2018

Looks good. Ideally, we should also have unit tests for this, but that would be a bigger change. Given that the bug is trivial and the fix needs to be cherry-picked to older branches (1.11, 1.10, and 1.9), this is fine.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Jul 9, 2018

@k8s-ci-robot

Contributor

k8s-ci-robot commented Jul 9, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Random-Liu, yujuhong

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@yujuhong

Member

yujuhong commented Jul 9, 2018

/retest

@k8s-ci-robot

Contributor

k8s-ci-robot commented Jul 9, 2018

@Random-Liu: The following test failed, say /retest to rerun them all:

Test name | Commit | Details | Rerun command
pull-kubernetes-local-e2e-containerized | 0f4c739 | link | /test pull-kubernetes-local-e2e-containerized

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@klausenbusk

Contributor

klausenbusk commented Jul 9, 2018

I think I'm affected by this issue (just upgraded 1.10.0 -> 1.11.0). I have two pods stuck in Terminating and Unknown (I deleted the node and rebooted it a few times):

kube-apiserver-b24c2                        0/1       Terminating         4          1h
kube-proxy-l28rx                            0/1       Unknown             1          3h

kubectl -n kube-system delete --force --grace-period=0 pod kube-apiserver-b24c2 just hangs forever. Is there any way I can recover from this?

@klausenbusk

Contributor

klausenbusk commented Jul 9, 2018

kubectl -n kube-system delete --force --grace-period=0 pod kube-apiserver-b24c2 just hangs forever. Is there any way I can recover from this?

I managed to solve it by following this comment: #65936 (comment)
I'm wondering if #65569 and #65936 are affected by this issue, or if they are different issues?

@k8s-github-robot

Contributor

k8s-github-robot commented Jul 9, 2018

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot

Contributor

k8s-github-robot commented Jul 9, 2018

Automatic merge from submit-queue (batch tested with PRs 65987, 65962). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit 55620e2 into kubernetes:master Jul 9, 2018

11 of 17 checks passed

  • pull-kubernetes-local-e2e-containerized: Job failed.
  • Submit Queue: Required Github CI test is not green: pull-kubernetes-e2e-gce
  • pull-kubernetes-e2e-gce: Job triggered.
  • pull-kubernetes-e2e-gce-100-performance: Job triggered.
  • pull-kubernetes-e2e-kops-aws: Job triggered.
  • pull-kubernetes-kubemark-e2e-gce-big: Job triggered.
  • cla/linuxfoundation: Random-Liu authorized
  • pull-kubernetes-bazel-build: Job succeeded.
  • pull-kubernetes-bazel-test: Job succeeded.
  • pull-kubernetes-cross: Skipped
  • pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded.
  • pull-kubernetes-e2e-gke: Skipped
  • pull-kubernetes-integration: Job succeeded.
  • pull-kubernetes-local-e2e: Skipped
  • pull-kubernetes-node-e2e: Job succeeded.
  • pull-kubernetes-typecheck: Job succeeded.
  • pull-kubernetes-verify: Job succeeded.

@Random-Liu Random-Liu deleted the Random-Liu:fix-pod-worker-deadlock branch Jul 9, 2018


@Random-Liu Random-Liu removed this from the v1.11 milestone Jul 10, 2018

k8s-github-robot pushed a commit that referenced this pull request Jul 11, 2018

Kubernetes Submit Queue
Merge pull request #66004 from Random-Liu/automated-cherry-pick-of-#65987-upstream-release-1.11

Automatic merge from submit-queue.

Automated cherry pick of #65987: Fix pod worker deadlock.

Cherry pick of #65987 on release-1.11.

#65987: Fix pod worker deadlock.

k8s-github-robot pushed a commit that referenced this pull request Jul 11, 2018

Kubernetes Submit Queue
Merge pull request #66006 from Random-Liu/automated-cherry-pick-of-#65987-upstream-release-1.9

Automatic merge from submit-queue.

Automated cherry pick of #65987: Fix pod worker deadlock.

Cherry pick of #65987 on release-1.9.

#65987: Fix pod worker deadlock.

k8s-github-robot pushed a commit that referenced this pull request Jul 12, 2018

Kubernetes Submit Queue
Merge pull request #66005 from Random-Liu/automated-cherry-pick-of-#65987-upstream-release-1.10

Automatic merge from submit-queue.

Automated cherry pick of #65987: Fix pod worker deadlock.

Cherry pick of #65987 on release-1.10.

#65987: Fix pod worker deadlock.