
Slow CNI cmdDel processing causes infra container to be deleted prematurely #89440

Closed
rajatchopra opened this issue Mar 24, 2020 · 13 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@rajatchopra
Contributor

What happened: When a pod is deleted, cmdDel is called on multus, which in turn calls the sriov-cni plugin multiple times, once per interface. This process takes time, and in the middle of it the infra container gets killed and the network namespace gets deleted.
The result of removing the network namespace before cmdDel has finished is that proper cleanup does not happen (in sriov's case, the interface renaming is not done and addresses are leaked).

What you expected to happen: The network namespace should continue to exist until the CNI plugin has finished with the cmdDel command, i.e. the infra container should not be killed out of band.

How to reproduce it (as minimally and precisely as possible):
Take a CNI plugin (sriov-cni in our case), modify its DEL command to insert a time.Sleep call with a long duration (say 10 seconds), and observe that the infra container gets killed before cmdDel finishes.
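For illustration, a minimal sketch of that reproduction step, assuming the standard CNI plugin skeleton from github.com/containernetworking/cni/pkg/skel (this is not the actual sriov-cni source; only the added sleep matters):

```go
package main

import (
	"fmt"
	"time"

	"github.com/containernetworking/cni/pkg/skel"
)

// cmdDel is a hypothetical DEL handler with an artificial delay, used only to
// reproduce the race; the real sriov-cni handler does much more.
func cmdDel(args *skel.CmdArgs) error {
	// Simulate slow teardown (e.g. detaching SR-IOV VFs) so the kubelet's
	// pod cleanup GC fires while DEL is still in progress.
	time.Sleep(10 * time.Second)

	// ... normal DEL handling would follow: parse the config, restore the
	// interface name, release IPAM allocations, etc.
	return nil
}

func main() {
	// Stand-alone demonstration; in a real plugin cmdDel is dispatched by the
	// CNI skeleton rather than called directly.
	if err := cmdDel(&skel.CmdArgs{ContainerID: "demo"}); err != nil {
		fmt.Println("DEL failed:", err)
	}
}
```

With a delay like this in place, deleting a pod shows the infra container being removed before the sleep elapses, which is the behavior reported above.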

Anything else we need to know?:
This likely belongs to this code block: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet_pods.go#L1041
where, for pods whose containers are dead but whose infra container is still alive because CNI is still working on it, the pod cleanup GC kicks in and kills the infra container too.

Suggestion: the filtering out of pods based on termination status should also apply a grace period, likely here: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet_pods.go#L987
The grace period should be at least longer than the interval at which the GC kicks in.
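As a rough sketch of that suggestion (all names below are hypothetical, not taken from the kubelet code), the termination filter could apply a grace period like this:

```go
package main

import (
	"fmt"
	"time"
)

// cleanupGracePeriod is the suggested allowance; it should be longer than the
// pod cleanup GC's periodic interval so an in-flight CNI DEL can finish.
const cleanupGracePeriod = 2 * time.Minute

// examplePod is a stand-in for the kubelet's internal pod bookkeeping.
type examplePod struct {
	terminated   bool
	terminatedAt time.Time
}

// shouldCleanUpPod classifies a pod as cleanable only after the grace period
// since termination has elapsed, instead of immediately on termination.
func shouldCleanUpPod(p *examplePod, now time.Time) bool {
	if !p.terminated {
		return false
	}
	return now.Sub(p.terminatedAt) >= cleanupGracePeriod
}

func main() {
	p := &examplePod{terminated: true, terminatedAt: time.Now()}
	fmt.Println(shouldCleanUpPod(p, time.Now()))                    // false: still within the grace period
	fmt.Println(shouldCleanUpPod(p, time.Now().Add(3*time.Minute))) // true: grace period elapsed
}
```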

Environment: Master and previous versions

@rajatchopra rajatchopra added the kind/bug Categorizes issue or PR as related to a bug. label Mar 24, 2020
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Mar 24, 2020
@rajatchopra
Contributor Author

/sig network

Attention: @dcbw @pmorie (from git blame and general guidance)
cc @blackgold (for investigating the issue)
cc @kubernetes/sig-network-bugs (for help in triaging)

@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 24, 2020
@tedyu
Contributor

tedyu commented Mar 24, 2020

/sig node

@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Mar 24, 2020
@uablrek
Contributor

uablrek commented Mar 25, 2020

Duplicate? #88543

@thockin
Member

thockin commented Apr 2, 2020

It sounds like this is a dup of #88543. Closing in favor of that one.

@thockin thockin closed this as completed Apr 2, 2020
@blackgold

@thockin
I think issue #88543 addresses "make sure that all the networking resources are deleted before removing the pod from the apiserver."

This issue addresses "make sure the pause container is alive while CNI is detaching devices from the pause container."

I tested PR #89667 and observed that the pause container is deleted before CNI can detach all the devices, so it does not fix this issue.

@rajatchopra
Contributor Author

@thockin This is a different issue (it sounds similar, I agree).
The other one deals with: containers are deleted and volumes are cleaned up, but CNI may still be working; the desire there is not to remove the pod sandbox from the runtime until CNI is done.

This one is critical: the pod spec's containers are deleted and the infra container gets deleted, but CNI may still be working; the desire here is not to remove the infra container until CNI is done.

This addresses a different problem, but fixing it will automatically address issue #88543 as well.

@kmala Your opinion will be useful here.

@kmala
Member

kmala commented Apr 8, 2020

The issue I fixed in #89667 is different from this one. I also looked into PR #89541, and it is not the correct fix for my issue either: it only introduces a grace period before the pod's status is reported to the status manager, and after that grace period the pod will be deleted regardless of whether the network resources have been removed.
From the issue description, my understanding is that your problem is caused by the deletion of the network namespace, not the deletion of the pod sandbox. Based on this comment in dockershim, https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/dockershim/docker_sandbox.go#L252-L257, I assume a CNI plugin is expected to do its best to clean up even when the network namespace is empty, but I am not sure that expectation generalizes to all CNI/CRI plugins.
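For context, a minimal sketch of the tolerant-DEL pattern that dockershim comment implies, assuming the standard CNI skeleton (hypothetical code, not taken from sriov-cni or any other real plugin):

```go
package main

import (
	"os"

	"github.com/containernetworking/cni/pkg/skel"
)

// cmdDel sketches a DEL handler that tolerates a missing network namespace.
func cmdDel(args *skel.CmdArgs) error {
	if args.Netns == "" {
		// The namespace is already gone: release host-side state (IPAM
		// allocations, VF bookkeeping) and return success rather than an error.
		return nil
	}
	if _, err := os.Stat(args.Netns); os.IsNotExist(err) {
		// Same situation: the namespace path no longer exists on disk.
		return nil
	}
	// ... full teardown inside the namespace would go here (move the device
	// back to the host, restore its name, release addresses, etc.) ...
	return nil
}

func main() {
	// Demonstrate the tolerant path: DEL succeeds even with no namespace.
	_ = cmdDel(&skel.CmdArgs{ContainerID: "demo", Netns: ""})
}
```

The catch reported in this issue is that for SR-IOV devices part of that teardown (renaming the interface) can only happen while the namespace still exists, which is why best-effort cleanup is not sufficient here.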

@rajatchopra
Contributor Author

@kmala You are right. PR #89541 only provides a grace period; it does not guarantee that CNI has finished.
The removal of the network namespace does not work well with InfiniBand devices, so we need this allowance.

@thockin Can we re-open this issue? Thanks.

@BSWANG
Contributor

BSWANG commented Jun 29, 2020

/reopen
#88543 does not fix this issue.

@k8s-ci-robot
Contributor

@BSWANG: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen
#88543 does not fix this issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mars1024
Member

/reopen
ping @BSWANG, any details?

@BSWANG
Contributor

BSWANG commented Jul 23, 2020

@mars1024

  • using the embedded dockershim CRI runtime
  • kubectl delete --force --grace-period=0, OR the pod ignores the SIGTERM signal and exits ungracefully

@chendotjs
Contributor

/cc
