dockertools: call TearDownPod when GC-ing infra pods #37036

Merged
merged 7 commits on Feb 17, 2017

Conversation

@dcbw (Member) commented Nov 17, 2016

The docker runtime doesn't tear down networking when GC-ing pods. rkt already does, so make docker do it too. To ensure this happens, infra pods are now always GC-ed rather than gating them by containersToKeep.

This prevents IPAM from leaking when the pod gets killed for
some reason outside kubelet (like docker restart) or when pods
are killed while kubelet isn't running.

Fixes: #14940
Related: #35572
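
Roughly, the idea is sketched below. This is a minimal, self-contained illustration; the type, field, and function names are assumptions for the sketch, not the actual kubelet/dockertools code.

package main

import "fmt"

// Illustrative stand-ins for the dockertools GC types; not the real API.
type containerGCInfo struct {
	id, name, podName, podNamespace string
	isHostNetwork                   bool
}

type networkPlugin interface {
	TearDownPod(namespace, name, containerID string) error
}

type containerGC struct {
	plugin networkPlugin
	remove func(id string) error // stand-in for the docker client's remove call
}

const podInfraContainerName = "POD"

// removeContainer sketches the PR's idea: tear down pod networking for dead
// infra containers before removing them, so IPAM is released even when the
// pod died outside kubelet (e.g. across a docker restart).
func (gc *containerGC) removeContainer(info containerGCInfo) {
	if info.name == podInfraContainerName && !info.isHostNetwork {
		// Best effort: the netns may already be gone, but the plugin can
		// still release IPAM and other per-pod resources.
		if err := gc.plugin.TearDownPod(info.podNamespace, info.podName, info.id); err != nil {
			fmt.Printf("network teardown for %s/%s failed: %v\n", info.podNamespace, info.podName, err)
		}
	}
	if err := gc.remove(info.id); err != nil {
		fmt.Printf("failed to remove container %q: %v\n", info.id, err)
	}
}

func main() {}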



@dcbw (Member, Author) commented Nov 17, 2016

@kubernetes/sig-network @bprashanth @thockin @caseydavenport @knobunc @danwinship

@bprashanth (Contributor)

I'll need to page this in; @kubernetes/sig-node might be able to review this right away.
FYI, we fixed the IPAM leak for kubenet with a hack for the point release: #34278 (comment). I think we discussed on that issue that teardown is called, but we can't find the IP of a dead container to look up its netns.

@k8s-github-robot k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. release-note-label-needed labels Nov 17, 2016
@dcbw (Member, Author) commented Nov 17, 2016

I think we discussed on that issue that teardown is called, but we can't find the IP of a dead container to look up its netns.

@bprashanth in the case this PR fixes, teardown is not called, because this is container GC and the docker runtime doesn't do anything network-related for GC-ed containers...

@dcbw (Member, Author) commented Nov 18, 2016

@k8s-bot verify test this issue #36843

@dcbw (Member, Author) commented Nov 18, 2016

Jenkins verify is failing on update-bazel.sh, which doesn't appear to have anything to do with this PR...

@mikedanese (Member)

@dcbw can you please run hack/update-bazel.sh?

@dcbw (Member, Author) commented Nov 18, 2016

@mikedanese

So I git fetch upstream master, rebase my changes on top, and run it:

package github.com/mikedanese/gazel: cannot download, $GOPATH not set. For more details see: go help gopath

So I add "kube::golang::setup_env" to hack/update-bazel.sh and get a ton of:

extract err: cannot find package "k8s.io/gengo/types" in any of:
    /usr/lib/golang/src/k8s.io/gengo/types (from $GOROOT)
    /home/dcbw/Development/containers/kubernetes/_output/local/go/src/k8s.io/gengo/types (from $GOPATH)

whereupon I have tons of changes that are completely unrelated to anything in this PR. Shouldn't somebody else run update-bazel.sh and submit that as a PR, and then I rebase on top?

diff --git a/cmd/genman/BUILD b/cmd/genman/BUILD
index 0ff8bde..d149f5c 100644
--- a/cmd/genman/BUILD
+++ b/cmd/genman/BUILD
@@ -12,7 +12,10 @@ load(

 go_binary(
     name = "genman",
-    srcs = ["gen_kube_man.go"],
+    srcs = [
+        "gen_kube_man.go",
+        "gen_kubectl_man.go",
+    ],
     tags = ["automanaged"],
     deps = [
         "//cmd/genutils:go_default_library",
@@ -23,8 +26,5 @@ go_binary(
         "//pkg/kubectl/cmd:go_default_library",
         "//pkg/kubectl/cmd/util:go_default_library",
         "//plugin/cmd/kube-scheduler/app:go_default_library",
-        "//vendor:github.com/cpuguy83/go-md2man/md2man",
-        "//vendor:github.com/spf13/cobra",
-        "//vendor:github.com/spf13/pflag",
     ],
 )
diff --git a/cmd/genswaggertypedocs/BUILD b/cmd/genswaggertypedocs/BUILD
index 21347db..c266af9 100644
--- a/cmd/genswaggertypedocs/BUILD
+++ b/cmd/genswaggertypedocs/BUILD
@@ -14,9 +14,5 @@ go_binary(
     name = "genswaggertypedocs",
     srcs = ["swagger_type_docs.go"],
     tags = ["automanaged"],
-    deps = [
-        "//pkg/runtime:go_default_library",
-        "//vendor:github.com/golang/glog",
-        "//vendor:github.com/spf13/pflag",
-    ],
+    deps = ["//pkg/runtime:go_default_library"],
 )
<etc>

@mikedanese (Member)

@dcbw hmm, it looks like you have a somewhat non-standard GOPATH setup. gazel expects the kubernetes source to be checked out in $GOPATH/src/k8s.io/kubernetes. In #36843 we are adding a hard check for this. See:

https://github.com/kubernetes/kubernetes/blob/master/docs/devel/development.md#clone-your-fork

In the meantime you can apply this diff to fix CI:

diff --git a/pkg/kubelet/BUILD b/pkg/kubelet/BUILD
index eaade63..73d6ebe 100644
--- a/pkg/kubelet/BUILD
+++ b/pkg/kubelet/BUILD
@@ -65,6 +65,7 @@ go_library(
         "//pkg/kubelet/eviction:go_default_library",
         "//pkg/kubelet/images:go_default_library",
         "//pkg/kubelet/kuberuntime:go_default_library",
+        "//pkg/kubelet/leaky:go_default_library",
         "//pkg/kubelet/lifecycle:go_default_library",
         "//pkg/kubelet/metrics:go_default_library",

@dcbw (Member, Author) commented Nov 18, 2016

@dcbw hmm, it looks like you have a somewhat non-standard GOPATH setup. gazel expects the kubernetes source to be checked out in $GOPATH/src/k8s.io/kubernetes. In #36843 we are adding a hard check for this. See:

Ah, thanks! So yeah, it was something in the patch. Is the BUILD stuff new? I guess I've just been lucky before in that all imports were already in BUILD.

Re the GOPATH, I don't have a particularly non-standard setup for Linux developers that I know of. Literally 'git clone kubernetes', cd kubernetes, then make. That has always worked, and appears supported, since the purpose of "kube::golang::setup_env" and hack/lib/golang.sh is to set up the Go environment without having GOPATH set. Is that not the case?

@dcbw (Member, Author) commented Nov 18, 2016

@k8s-bot cvm gke e2e test this issue #35913

@yujuhong (Contributor)

@dcbw from reading #14940 (comment), since kubelet cannot recover the netns information when the docker container has terminated, this fix doesn't always work? Another question: will the CNI plugin return an error if 1) the netns is empty, or 2) teardown is called multiple times?

@dcbw (Member, Author) commented Nov 18, 2016

@dcbw from reading #14940 (comment), since kubelet cannot recover the netns information when the docker container has terminated, this fix doesn't always work? Another question: will the CNI plugin return an error if 1) the netns is empty, or 2) teardown is called multiple times?

@yujuhong At least for Docker, if the container is "dead" or "exited" the netns doesn't exist anymore, so any interfaces it had will have already been torn down. But IPAM may not have been torn down, and that still needs to happen, which this PR allows the CNI plugin to do.

For rkt + --network-plugin=cni, the netns is managed by kubelet itself, so it will still be around even if the container has exited. We don't really have this problem there, and that's one reason the rkt runtime already calls TearDownPod() during GC.

CNI does not require a CNI_NETNS for CNI_DEL actions, and it's up to the plugin to do best-effort cleanup when it gets the CNI_DEL call. That should probably include any IPAM, to ensure that even if the netns and container are gone, the IPAM gets released.

In any case, this patch doesn't regress the DEL behavior from what it is now. If the CNI plugin requires a netns, it'll still return an error just like it does today. And teardown is already called multiple times due to kubelet races :) So that's not a change in behavior either.
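
To make "best-effort cleanup" concrete, here is a small self-contained sketch of a DEL handler from a plugin's point of view. The cmdArgs type and the releaseIPAM helper are illustrative assumptions for the sketch, not the real CNI library API.

package main

import "fmt"

// cmdArgs is an illustrative stand-in for the arguments a CNI plugin sees on
// DEL; real plugins get these from the CNI skeleton/library.
type cmdArgs struct {
	ContainerID string
	Netns       string // may be empty: CNI does not require CNI_NETNS on DEL
}

// releaseIPAM is a hypothetical helper standing in for the plugin's IPAM
// release logic (e.g. deleting a host-local allocation for this container).
func releaseIPAM(containerID string) error {
	fmt.Printf("releasing IPAM allocation for container %s\n", containerID)
	return nil
}

// cmdDel does best-effort teardown: even if the netns is gone (dead or
// already-removed container), it still releases the IPAM allocation so
// addresses don't leak.
func cmdDel(args cmdArgs) error {
	if err := releaseIPAM(args.ContainerID); err != nil {
		return err
	}
	if args.Netns == "" {
		// Nothing else to clean up; the interfaces died with the netns.
		return nil
	}
	// With a netns present, the plugin would also remove its veth/bridge
	// attachments here.
	return nil
}

func main() {
	_ = cmdDel(cmdArgs{ContainerID: "abc123", Netns: ""})
}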

}
}

if len(candidates) <= containersToKeep {
return containerStatusbyCreatedList{}
// Always destroy infra containers to ensure networking gets torn down
Review comment (Contributor):

kubelet still relies on one dead container as a checkpoint. I don't think we should change that.

Overall, I'd prefer not making changes to this file (pod_container_deletor.go). The podContainerDeletor simply does a more aggressive cleanup, and the regular container GC will kick in later.

It's also a plus to not pollute more files with leaky.PodInfraContainerName

Review comment (Member, Author):

Ok, then we need somewhere else that actually inspects the dead docker containers and compares against the running ones, which the GC code already does.

@yujuhong would you be OK with changing the code to tear down networking for all the dead infra containers, but not remove them completely from docker?

Review comment (Member):

There are actually two different moments which may trigger network teardown:

(1) when the infra container is down
(2) when the infra container is removed

I think the code in removeContainer() handles (2), and here is a fix for (1).

To reduce IPAM leaking ASAP, I suggest we trigger network teardown as soon as the infra container is stopped, e.g. perform the infra container network teardown after getting evictableContainers (and before removing any containers).

Review comment (Contributor):

@dcbw I'd like to try making the change small and localized since it's so close to the release (unless this is not targeted for 1.5).

The change in this file is very docker-specific even though the file is used by all implementations (rkt, CRI, etc.). On top of that, kubenet has its own fix that handles the case where the infra containers get removed underneath kubelet. By leaving everything to the container GC, you have already fixed the leaks in most situations (e.g., docker restart). Would that be sufficient?

glog.V(4).Infof("Removing container %q name %q", id, containerName)
err := cgc.client.RemoveContainer(id, dockertypes.ContainerRemoveOptions{RemoveVolumes: true})
func (cgc *containerGC) removeContainer(containerInfo containerGCInfo) {
if !containerInfo.isHostNetwork && containerInfo.containerName == leaky.PodInfraContainerName {
Review comment (Contributor):

s/leaky.PodInfraContainerName/PodInfraContainerName

It's already defined in dockertools/docker.go

@yujuhong yujuhong self-assigned this Nov 18, 2016
@dcbw (Member, Author) commented Nov 22, 2016

On top of that, kubenet has its own fix that handles the case where the infra containers get removed underneath kubelet.

@yujuhong yeah, kubenet has that workaround, but it was a workaround for this specific problem :) And it's only for kubenet, which doesn't help CNI plugins at all. For OpenShift we just ported your kubenet workaround to our CNI plugin, but I thought I'd try to fix it for real too.

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 24, 2016
@chenchun (Contributor) commented Nov 24, 2016

@dcbw This PR is useful, but I think it can't resolve every situation. Here is an exception:

Suppose we are using host-local IPAM in a CNI plugin:

  1. kubelet starts the cni plugin to set up the network
  2. cni plugin process cni-1 starts; it is supposed to set up the network
  3. kubectl deletes the pod
  4. kubelet starts the cni plugin to tear down the network
  5. cni plugin process cni-2 starts; it is supposed to tear down the network
  6. cni plugin process cni-2 exits
  7. kubelet deletes the infrastructure container
  8. cni-1 touches the IP file and exits

This is because kubelet doesn't hold a lock before starting the CNI plugin to guarantee that cni-1 has exited before cni-2 is launched. I'm not sure if kubelet needs to hold a lock; IMO it may not be kubelet's responsibility to hold a lock or to delete these leaked IP files. Anyone who uses a CNI plugin with host-local IPAM will have to launch a daemon process to clean up these leaked IP files.
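
For illustration, a cleanup daemon of the kind described above could look roughly like this. It assumes, purely as a sketch, that host-local keeps one file per allocated IP under a data directory with the owning container ID on the first line; the directory path and the containerExists check are made-up stand-ins for the example, not documented behavior.

package main

import (
	"fmt"
	"net"
	"os"
	"path/filepath"
	"strings"
)

// Assumed host-local data directory for an example network named "mynet".
const ipamDir = "/var/lib/cni/networks/mynet"

// containerExists is a hypothetical stand-in; a real daemon would ask the
// container runtime whether this container ID is still known.
func containerExists(id string) bool {
	return false
}

// cleanupLeakedIPs removes allocation files whose owning container is gone.
func cleanupLeakedIPs() error {
	entries, err := os.ReadDir(ipamDir)
	if err != nil {
		return err
	}
	for _, e := range entries {
		// Only consider files whose names parse as IP addresses, skipping
		// any bookkeeping files the IPAM plugin keeps in the same directory.
		if net.ParseIP(e.Name()) == nil {
			continue
		}
		path := filepath.Join(ipamDir, e.Name())
		data, err := os.ReadFile(path)
		if err != nil {
			continue
		}
		containerID := strings.TrimSpace(strings.SplitN(string(data), "\n", 2)[0])
		if containerID != "" && !containerExists(containerID) {
			fmt.Printf("removing leaked allocation %s (container %s is gone)\n", e.Name(), containerID)
			_ = os.Remove(path)
		}
	}
	return nil
}

func main() {
	if err := cleanupLeakedIPs(); err != nil {
		fmt.Println("cleanup failed:", err)
	}
}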

@dcbw (Member, Author) commented Dec 7, 2016

@yujuhong @chenchun @bprashanth I believe I've addressed all your comments. First, everything is now specific to the docker runtime, no more general kubelet changes. Second, we now serialize all network plugin operations in the docker runtime to specifically guard against concurrent operations from kubelet's GC goroutine and kubelet's SyncPod() goroutine. This was always a problem before too, and we've run into instances of concurrent teardown in the past.

@alexbrand I've made some changes to the Windows parts of dockertools too. Could you verify them? From the comments I understand that every container in the pod gets its own IP address on Windows, but that the Infra container (while started) is basically ignored. So I attempted to ensure that all containers on Windows will get their networking cleaned up when GC happens, while on Linux only the Infra container needs this.
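
For reference, the per-pod serialization idea can be sketched as below. This is a simplified illustration; the pluginManager name, its lock map, and withPodLock are assumptions for the sketch, not the actual PluginManager code.

package main

import "sync"

// pluginManager sketches per-pod serialization of network operations: setup,
// teardown, and status calls for the same pod are serialized, while different
// pods can proceed in parallel.
type pluginManager struct {
	mu       sync.Mutex
	podLocks map[string]*sync.Mutex
}

func newPluginManager() *pluginManager {
	return &pluginManager{podLocks: make(map[string]*sync.Mutex)}
}

// podLock returns (creating if needed) the lock for a given pod.
func (pm *pluginManager) podLock(fullPodName string) *sync.Mutex {
	pm.mu.Lock()
	defer pm.mu.Unlock()
	l, ok := pm.podLocks[fullPodName]
	if !ok {
		l = &sync.Mutex{}
		pm.podLocks[fullPodName] = l
	}
	return l
}

// withPodLock runs op while holding the pod's lock, so a GC teardown and a
// SyncPod setup for the same pod cannot run concurrently.
func (pm *pluginManager) withPodLock(fullPodName string, op func() error) error {
	l := pm.podLock(fullPodName)
	l.Lock()
	defer l.Unlock()
	return op()
}

func main() {
	pm := newPluginManager()
	_ = pm.withPodLock("default/nginx", func() error {
		// e.g. call the network plugin's SetUpPod or TearDownPod here.
		return nil
	})
}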

@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 7, 2016
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Dec 7, 2016
@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 16, 2017
@freehan freehan added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Feb 16, 2017
@freehan (Contributor) commented Feb 16, 2017

Added a priority since this PR has been rebased more than 2 times.

The PluginManager almost duplicates the network plugin interface, but
not quite since the Init() function should be called by whatever
actually finds and creates the network plugin instance.  Only then
does it get passed off to the PluginManager.

The Manager synchronizes pod-specific network operations like setup,
teardown, and pod network status.  It passes through all other
operations so that runtimes don't have to cache the network plugin
directly, but can use the PluginManager as a wrapper.
…rations

We need to tear down networking when garbage collecting containers too,
and GC is run from a different goroutine in kubelet.  We don't want
container network operations running for the same pod concurrently.
The docker runtime doesn't tear down networking when GC-ing pods.
rkt already does, so make docker do it too. To ensure this happens,
networking is always torn down for the container even if the
container itself is not deleted.

This prevents IPAM from leaking when the pod gets killed for
some reason outside kubelet (like docker restart) or when pods
are killed while kubelet isn't running.

Fixes: kubernetes#14940
Related: kubernetes#35572
Dead infra containers may still have network resources allocated to
them and may not be GC-ed for a long time.  But allowing SyncPod()
to restart an infra container before the old one is destroyed
prevents network plugins from carrying the old network details
(eg IPAM) over to the new infra container.
@dcbw (Member, Author) commented Feb 16, 2017

@freehan rebased, can you re-LGTM? Thanks!

@k8s-github-robot k8s-github-robot removed lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Feb 16, 2017
@freehan freehan added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 16, 2017
@freehan (Contributor) commented Feb 16, 2017

@k8s-bot cross build this
@k8s-bot cvm gce e2e test this
@k8s-bot gce etcd3 e2e test this

@yujuhong (Contributor)

/approve

@k8s-github-robot

[APPROVALNOTIFIER] This PR is APPROVED

The following people have approved this PR: dcbw, yujuhong

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 16, 2017
@k8s-github-robot

Automatic merge from submit-queue (batch tested with PRs 40505, 34664, 37036, 40726, 41595)

@k8s-github-robot k8s-github-robot merged commit 98d1cff into kubernetes:master Feb 17, 2017
@saiwl commented Apr 12, 2017

When will this feature be released? We really need it.

@yujuhong (Contributor)

@saiwl this is already included in v1.6.

@saiwl commented Apr 13, 2017

Is this feature turned on by default? I restarted the docker daemon, but the IP wasn't released. @yujuhong

@yujuhong (Contributor)

Is this feature turned on by default? I restarted the docker daemon, but the IP wasn't released. @yujuhong

This is turned on by default (or simply put, there is no way to disable it).

If you want to report a bug, please open a new issue with more information (e.g., environment, network setup, steps to reproduce, etc), thanks.

dcbw added a commit to dcbw/ocicni that referenced this pull request Sep 5, 2017
The runtime calling CRI-O may not synchronize pod network operations,
but to ensure that they don't overlap, ocicni must do so.  Operations
are synchronized on a per-pod basis, such that all operations for a
given pod are synchronized, but individual pods' operations can
run in parallel.

This is essentially a port of the Kubernetes PluginManager code from
kubernetes/kubernetes#37036.

Signed-off-by: Dan Williams <dcbw@redhat.com>