Fix kubelet deadlock #10182

Merged
merged 4 commits into kubernetes:master from kubelet_status on Jun 23, 2015

Conversation

bprashanth
Contributor

  1. User deletes all multi-container pods on a node at once, after the containers have started
  2. All pods are synced for each delete notification
  3. Each pod sends a status update because of the bug in my first commit
  4. Status manager's podStatusChannel fills up
  5. An update fails because a pod in the channel has already been deleted from the apiserver
  6. Status manager deadlocks
  7. SyncPods enters the deadlock the next time it tries to talk to the status manager (sketched below)

@dchen1107
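A minimal Go sketch of the failure mode, assuming a simplified status manager; the names podStatusChannel, SetPodStatus, and syncBatch approximate the structure described in the steps above and are not the PR's actual code:

```go
package main

import (
	"errors"
	"fmt"
)

// podStatus is a stand-in for the status payload the kubelet reports.
type podStatus struct {
	podName string
	phase   string
}

// statusManager is a simplified stand-in for the kubelet's status manager.
type statusManager struct {
	// Buffered channel of pending status updates destined for the apiserver.
	podStatusChannel chan podStatus
}

// SetPodStatus is called from the pod sync path. If the channel is full and
// nothing is draining it, this send blocks, which blocks the caller (step 7).
func (s *statusManager) SetPodStatus(status podStatus) {
	s.podStatusChannel <- status
}

// syncBatch drains one queued update and pushes it to the apiserver. If the
// update fails because the pod was already deleted (step 5) and the loop
// gives up, the channel is never drained again (step 6).
func (s *statusManager) syncBatch(updateAPIServer func(podStatus) error) error {
	status := <-s.podStatusChannel
	if err := updateAPIServer(status); err != nil {
		return fmt.Errorf("status update for %q failed: %v", status.podName, err)
	}
	return nil
}

func main() {
	s := &statusManager{podStatusChannel: make(chan podStatus, 2)} // tiny buffer for the demo

	// Simulate an apiserver that rejects updates for pods that were already deleted.
	deleted := map[string]bool{"pod-a": true}
	update := func(st podStatus) error {
		if deleted[st.podName] {
			return errors.New("pod not found")
		}
		return nil
	}

	// Redundant updates from repeated syncs fill the buffer (step 4).
	s.SetPodStatus(podStatus{"pod-a", "Running"})
	s.SetPodStatus(podStatus{"pod-b", "Running"})
	// A third SetPodStatus here would block forever once syncBatch stops draining.

	if err := s.syncBatch(update); err != nil {
		fmt.Println("sync failed; channel is no longer drained:", err)
	}
}
```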

@k8s-bot

k8s-bot commented Jun 22, 2015

GCE e2e build/test failed for commit 9e1ab43c2fb279a7f9cc28885d69ed03edbdf611.

@dchen1107
Member

Ahh, this is the deadlock issue I suspected and planned to look into, but I got distracted by all those P0 issues. Thanks, but I don't have time to review it today. Assigned it to @yujuhong. YuJu, here is a little bit of background:
we found that sometimes, when concurrently deleting multiple pods, each with multiple containers, the kubelet might run into a deadlock. The issue was originally reported in #9833. We currently have a proposed workaround, #10004, to detect such a deadlock and restart the kubelet. But that workaround should be the last resort. Thanks!

cc/ @saad-ali and @ArtfulCoder too.

@k8s-bot

k8s-bot commented Jun 22, 2015

GCE e2e build/test passed for commit 5f61392.

@yujuhong
Contributor

LGTM overall with a nit/question.

@bprashanth
Contributor Author

Added an equality check in the status manager and a unit test. PTAL.
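For context, a rough sketch of what an equality check in the status manager could look like; the status cache, the locking, and the reflect.DeepEqual comparison are assumptions for illustration, not the PR's exact change:

```go
package statusmanager

import (
	"reflect"
	"sync"
)

// podStatus is a simplified stand-in for the status the kubelet reports.
type podStatus struct {
	Phase string
}

type statusManager struct {
	mu               sync.Mutex
	podStatuses      map[string]podStatus
	podStatusChannel chan podStatus
}

// SetPodStatus enqueues an apiserver update only when the status actually
// changed, so repeated syncs of an unchanged pod no longer flood
// podStatusChannel (step 4 in the scenario above).
func (s *statusManager) SetPodStatus(podName string, status podStatus) {
	s.mu.Lock()
	old, known := s.podStatuses[podName]
	if known && reflect.DeepEqual(old, status) {
		s.mu.Unlock()
		return // no change; skip the redundant update
	}
	s.podStatuses[podName] = status
	s.mu.Unlock()
	s.podStatusChannel <- status
}
```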

@k8s-bot

k8s-bot commented Jun 22, 2015

GCE e2e build/test failed for commit 9680a1293207b4f42c1e2429473ddf00c85953f3.

@yujuhong
Contributor

LGTM overall. e2e failed though.

Also, @dchen1107 for the second LGTM.

@bprashanth
Contributor Author

Rerunning the failed test on a local cluster; also pushed anew to get a normal run, since it passed the previous time, so I suspect the failure is unrelated.

@yujuhong
Contributor

It's probably not related, as I just saw a similar failure on another PR.

@k8s-bot

k8s-bot commented Jun 22, 2015

GCE e2e build/test passed for commit af175cc.

@bprashanth
Contributor Author

Risk assessment: medium-low. Code sorts a list of containers and shoves a Delete into a goroutine.
Value: the kubelet doesn't deadlock or flood the apiserver with updates when it has multi-container pods.
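To make that description concrete, a hedged sketch of the two pieces it mentions (a stable sort of the container list, and the Delete call moved into a goroutine); sortedContainers and deleteStatusAsync are illustrative names, not the PR's identifiers:

```go
package kubeletsketch

import "sort"

// container is a minimal stand-in for a container status entry.
type container struct {
	Name string
}

// sortedContainers returns the containers in a stable order, so repeated
// syncs of the same pod produce identical statuses that an equality check
// can then skip.
func sortedContainers(containers []container) []container {
	out := append([]container(nil), containers...)
	sort.Slice(out, func(i, j int) bool { return out[i].Name < out[j].Name })
	return out
}

// deleteStatusAsync issues the apiserver delete from a goroutine so the
// caller (the status sync path) never blocks on it.
func deleteStatusAsync(deleteFromAPIServer func() error) {
	go func() {
		// Best effort in this sketch; the real code would handle the error.
		_ = deleteFromAPIServer()
	}()
}
```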

@dchen1107 PTAL, I ran the previous run's failing container probe e2e locally and it passed 4/4 times.

@dchen1107
Member

LGTM. @davidopp Can we have your LGTM on this one? Thanks!

@dchen1107 dchen1107 added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 23, 2015
@j3ffml j3ffml assigned davidopp and unassigned yujuhong Jun 23, 2015
@davidopp
Member

LGTM

j3ffml added a commit that referenced this pull request Jun 23, 2015
@j3ffml j3ffml merged commit d212ea1 into kubernetes:master Jun 23, 2015
@bprashanth bprashanth deleted the kubelet_status branch October 26, 2015 00:41