
Slow CNI cmdDel processing causes infra container to be deleted prematurely #89440

Closed
rajatchopra opened this issue Mar 24, 2020 · 13 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@rajatchopra
Contributor

What happened: When a pod is deleted, cmdDel is called on multus, which in turn calls the sriov-cni plugin multiple times, once per interface. This process takes time, and in the middle of it the infra container gets killed and the network namespace gets deleted.
The result of removing the network namespace before cmdDel has finished is that proper cleanup does not happen (in sriov's case, the interface renaming is not done and addresses are leaked).

What you expected to happen: The network namespace should continue to exist until the CNI plugin has finished with the cmdDel command, i.e. the infra container should not be killed out of band.

How to reproduce it (as minimally and precisely as possible):
Take a CNI plugin (sriov-cni in our case), modify its DEL command to insert a time.Sleep call with a long duration (say 10 seconds), and observe that the infra container gets killed before cmdDel finishes.
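For illustration, a minimal sketch of that reproduction step, assuming the standard CNI plugin skeleton from github.com/containernetworking/cni/pkg/skel (this is not the actual sriov-cni source; only the added sleep matters):

```go
package main

import (
	"fmt"
	"time"

	"github.com/containernetworking/cni/pkg/skel"
)

// cmdDel is a hypothetical DEL handler with an artificial delay, used only to
// reproduce the race; the real sriov-cni handler does much more.
func cmdDel(args *skel.CmdArgs) error {
	// Simulate slow teardown (e.g. detaching SR-IOV VFs) so the kubelet's
	// pod cleanup GC fires while DEL is still in progress.
	time.Sleep(10 * time.Second)

	// ... normal DEL handling would follow: parse the config, restore the
	// interface name, release IPAM allocations, etc.
	return nil
}

func main() {
	// Stand-alone demonstration; in a real plugin cmdDel is dispatched by the
	// CNI skeleton rather than called directly.
	if err := cmdDel(&skel.CmdArgs{ContainerID: "demo"}); err != nil {
		fmt.Println("DEL failed:", err)
	}
}
```

With a delay like this in place, deleting a pod shows the infra container being removed before the sleep elapses, which is the behavior reported above.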

Anything else we need to know?:
This likely belongs to this code block: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet_pods.go#L1041
where, for pods whose containers are dead but whose infra container is still alive because CNI is still working on it, the pod cleanup GC kicks in and kills the infra container too.

Suggestion: the filtering out of pods based on termination status should also apply a grace period, likely here: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet_pods.go#L987
The grace period should be at least longer than the interval at which the GC kicks in.
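As a rough sketch of that suggestion (all names below are hypothetical, not taken from the kubelet code), the termination filter could apply a grace period like this:

```go
package main

import (
	"fmt"
	"time"
)

// cleanupGracePeriod is the suggested allowance; it should be longer than the
// pod cleanup GC's periodic interval so an in-flight CNI DEL can finish.
const cleanupGracePeriod = 2 * time.Minute

// examplePod is a stand-in for the kubelet's internal pod bookkeeping.
type examplePod struct {
	terminated   bool
	terminatedAt time.Time
}

// shouldCleanUpPod classifies a pod as cleanable only after the grace period
// since termination has elapsed, instead of immediately on termination.
func shouldCleanUpPod(p *examplePod, now time.Time) bool {
	if !p.terminated {
		return false
	}
	return now.Sub(p.terminatedAt) >= cleanupGracePeriod
}

func main() {
	p := &examplePod{terminated: true, terminatedAt: time.Now()}
	fmt.Println(shouldCleanUpPod(p, time.Now()))                    // false: still within the grace period
	fmt.Println(shouldCleanUpPod(p, time.Now().Add(3*time.Minute))) // true: grace period elapsed
}
```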

Environment: Master and previous versions

@rajatchopra rajatchopra added the kind/bug Categorizes issue or PR as related to a bug. label Mar 24, 2020
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Mar 24, 2020
@rajatchopra
Contributor Author

/sig network

Attention: @dcbw @pmorie (from git blame and general guidance)
cc @blackgold (for investigating the issue)
cc @kubernetes/sig-network-bugs (for help in triaging)

@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 24, 2020
@tedyu
Contributor

tedyu commented Mar 24, 2020

/sig node

@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Mar 24, 2020
@uablrek
Contributor

uablrek commented Mar 25, 2020

Duplicate? #88543

@thockin
Member

thockin commented Apr 2, 2020

It sounds like this is a dup of #88543. Closing in favor of that one.

@thockin thockin closed this as completed Apr 2, 2020
@blackgold

@thockin
I think issue #88543 addresses "make sure that all the networking resources are deleted before removing the pod from the apiserver."

This issue addresses "make sure the pause container is alive while CNI is detaching devices from the pause container."

I tested PR #89667 and observed that the pause container is deleted before CNI can detach all the devices, so it does not fix this issue.

@rajatchopra
Contributor Author

@thockin This is a different issue (it sounds similar, I agree).
The other one deals with: containers are deleted and volumes are cleaned up, but CNI may still be working; the desire there is not to remove the pod sandbox from the runtime until CNI is done.

This one is critical: the pod spec's containers are deleted and the infra container gets deleted, but CNI may still be working; the desire here is not to remove the infra container until CNI is done.

This addresses a different problem, but fixing it will automatically address issue #88543 as well.

@kmala Your opinion will be useful here.

@kmala
Member

kmala commented Apr 8, 2020

The issue I fixed in #89667 is different from this one. I also looked into PR #89541, and it is not the correct fix for my issue either: it only introduces a grace period before the pod's status is reported to the status manager, and after that grace period the pod will be deleted regardless of whether the network resources have been removed.
From the issue description, my understanding is that your problem is caused by the deletion of the network namespace, not the deletion of the pod sandbox. Based on this comment in dockershim, https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/dockershim/docker_sandbox.go#L252-L257, I assume a CNI plugin is expected to do its best to clean up even when the network namespace is empty, but I am not sure that expectation generalizes to all CNI/CRI plugins.
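For context, a minimal sketch of the tolerant-DEL pattern that dockershim comment implies, assuming the standard CNI skeleton (hypothetical code, not taken from sriov-cni or any other real plugin):

```go
package main

import (
	"os"

	"github.com/containernetworking/cni/pkg/skel"
)

// cmdDel sketches a DEL handler that tolerates a missing network namespace.
func cmdDel(args *skel.CmdArgs) error {
	if args.Netns == "" {
		// The namespace is already gone: release host-side state (IPAM
		// allocations, VF bookkeeping) and return success rather than an error.
		return nil
	}
	if _, err := os.Stat(args.Netns); os.IsNotExist(err) {
		// Same situation: the namespace path no longer exists on disk.
		return nil
	}
	// ... full teardown inside the namespace would go here (move the device
	// back to the host, restore its name, release addresses, etc.) ...
	return nil
}

func main() {
	// Demonstrate the tolerant path: DEL succeeds even with no namespace.
	_ = cmdDel(&skel.CmdArgs{ContainerID: "demo", Netns: ""})
}
```

The catch reported in this issue is that for SR-IOV devices part of that teardown (renaming the interface) can only happen while the namespace still exists, which is why best-effort cleanup is not sufficient here.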

@rajatchopra
Contributor Author

@kmala You are right. PR #89541 only provides a grace period; it does not guarantee that CNI has finished.
The removal of the network namespace does not work well with InfiniBand devices, so we need this allowance.

@thockin Can we re-open this issue? Thanks.

@BSWANG
Contributor

BSWANG commented Jun 29, 2020

/reopen
#88543 does not fix this issue.

@k8s-ci-robot
Contributor

@BSWANG: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen
#88543 does not fix this issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mars1024
Member

/reopen
ping @BSWANG, any details?

@BSWANG
Contributor

BSWANG commented Jul 23, 2020

@mars1024

  • using the embedded dockershim CRI runtime
  • kubectl delete --force --grace-period=0, OR the pod ignores the SIGTERM signal and exits ungracefully

@chendotjs
Contributor

/cc
