dockertools: call TearDownPod when GC-ing infra pods #37036

Merged
merged 7 commits on Feb 17, 2017

Conversation

@dcbw (Member) commented Nov 17, 2016

The docker runtime doesn't tear down networking when GC-ing pods. rkt already does, so make docker do it too. To ensure this happens, infra pods are now always GC-ed rather than gating them by containersToKeep.

This prevents IPAM from leaking when the pod gets killed for
some reason outside kubelet (like docker restart) or when pods
are killed while kubelet isn't running.

Fixes: #14940
Related: #35572
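
Roughly, the idea is sketched below. This is a minimal, self-contained illustration; the type, field, and function names are assumptions for the sketch, not the actual kubelet/dockertools code.

package main

import "fmt"

// Illustrative stand-ins for the dockertools GC types; not the real API.
type containerGCInfo struct {
	id, name, podName, podNamespace string
	isHostNetwork                   bool
}

type networkPlugin interface {
	TearDownPod(namespace, name, containerID string) error
}

type containerGC struct {
	plugin networkPlugin
	remove func(id string) error // stand-in for the docker client's remove call
}

const podInfraContainerName = "POD"

// removeContainer sketches the PR's idea: tear down pod networking for dead
// infra containers before removing them, so IPAM is released even when the
// pod died outside kubelet (e.g. across a docker restart).
func (gc *containerGC) removeContainer(info containerGCInfo) {
	if info.name == podInfraContainerName && !info.isHostNetwork {
		// Best effort: the netns may already be gone, but the plugin can
		// still release IPAM and other per-pod resources.
		if err := gc.plugin.TearDownPod(info.podNamespace, info.podName, info.id); err != nil {
			fmt.Printf("network teardown for %s/%s failed: %v\n", info.podNamespace, info.podName, err)
		}
	}
	if err := gc.remove(info.id); err != nil {
		fmt.Printf("failed to remove container %q: %v\n", info.id, err)
	}
}

func main() {}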



@dcbw (Member, Author) commented Nov 17, 2016

@kubernetes/sig-network @bprashanth @thockin @caseydavenport @knobunc @danwinship

@bprashanth (Contributor)

I'll need to page this in; @kubernetes/sig-node might be able to review this right away.
FYI, we fixed the IPAM leak for kubenet with a hack for the point release: #34278 (comment). I think we discussed on that issue that teardown is called, but we can't find the IP of a dead container to look up its netns.

@k8s-github-robot k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. release-note-label-needed labels Nov 17, 2016
@dcbw (Member, Author) commented Nov 17, 2016

I think we discussed on that issue that teardown is called, but we can't find the IP of a dead container to look up its netns.

@bprashanth in the case this PR fixes, teardown is not called, because this is container GC and the docker runtime doesn't do anything network-related for GC-ed containers...

@dcbw (Member, Author) commented Nov 18, 2016

@k8s-bot verify test this issue #36843

@dcbw (Member, Author) commented Nov 18, 2016

Jenkins verify is failing on update-bazel.sh, which doesn't appear to have anything to do with this PR...

@mikedanese (Member)

@dcbw can you please run hack/update-bazel.sh?

@dcbw (Member, Author) commented Nov 18, 2016

@mikedanese

So I git fetch upstream master, rebase my changes on top, and run it:

package github.com/mikedanese/gazel: cannot download, $GOPATH not set. For more details see: go help gopath

So I add "kube::golang::setup_env" to hack/update-bazel.sh and get a ton of:

extract err: cannot find package "k8s.io/gengo/types" in any of:
    /usr/lib/golang/src/k8s.io/gengo/types (from $GOROOT)
    /home/dcbw/Development/containers/kubernetes/_output/local/go/src/k8s.io/gengo/types (from $GOPATH)

whereupon I have tons of changes that are completely unrelated to anything in this PR. Shouldn't somebody else run update-bazel.sh and submit that as a PR, and then I rebase on top?

diff --git a/cmd/genman/BUILD b/cmd/genman/BUILD
index 0ff8bde..d149f5c 100644
--- a/cmd/genman/BUILD
+++ b/cmd/genman/BUILD
@@ -12,7 +12,10 @@ load(

 go_binary(
     name = "genman",
-    srcs = ["gen_kube_man.go"],
+    srcs = [
+        "gen_kube_man.go",
+        "gen_kubectl_man.go",
+    ],
     tags = ["automanaged"],
     deps = [
         "//cmd/genutils:go_default_library",
@@ -23,8 +26,5 @@ go_binary(
         "//pkg/kubectl/cmd:go_default_library",
         "//pkg/kubectl/cmd/util:go_default_library",
         "//plugin/cmd/kube-scheduler/app:go_default_library",
-        "//vendor:github.com/cpuguy83/go-md2man/md2man",
-        "//vendor:github.com/spf13/cobra",
-        "//vendor:github.com/spf13/pflag",
     ],
 )
diff --git a/cmd/genswaggertypedocs/BUILD b/cmd/genswaggertypedocs/BUILD
index 21347db..c266af9 100644
--- a/cmd/genswaggertypedocs/BUILD
+++ b/cmd/genswaggertypedocs/BUILD
@@ -14,9 +14,5 @@ go_binary(
     name = "genswaggertypedocs",
     srcs = ["swagger_type_docs.go"],
     tags = ["automanaged"],
-    deps = [
-        "//pkg/runtime:go_default_library",
-        "//vendor:github.com/golang/glog",
-        "//vendor:github.com/spf13/pflag",
-    ],
+    deps = ["//pkg/runtime:go_default_library"],
 )
<etc>

@mikedanese (Member)

@dcbw hmm, it looks like you have a somewhat non-standard GOPATH setup. gazel expects the kubernetes source to be checked out in $GOPATH/src/k8s.io/kubernetes. In #36843 we are adding a hard check for this. See:

https://github.com/kubernetes/kubernetes/blob/master/docs/devel/development.md#clone-your-fork

In the meantime you can apply this diff to fix CI:

diff --git a/pkg/kubelet/BUILD b/pkg/kubelet/BUILD
index eaade63..73d6ebe 100644
--- a/pkg/kubelet/BUILD
+++ b/pkg/kubelet/BUILD
@@ -65,6 +65,7 @@ go_library(
         "//pkg/kubelet/eviction:go_default_library",
         "//pkg/kubelet/images:go_default_library",
         "//pkg/kubelet/kuberuntime:go_default_library",
+        "//pkg/kubelet/leaky:go_default_library",
         "//pkg/kubelet/lifecycle:go_default_library",
         "//pkg/kubelet/metrics:go_default_library",

@dcbw (Member, Author) commented Nov 18, 2016

@dcbw hmm, it looks like you have a somewhat non-standard GOPATH setup. gazel expects the kubernetes source to be checked out in $GOPATH/src/k8s.io/kubernetes. In #36843 we are adding a hard check for this. See:

Ah, thanks! So yeah, it was something in the patch. Is the BUILD stuff new? I guess I've just been lucky before in that all imports were already in BUILD.

Re the GOPATH, I don't have a particularly non-standard setup for Linux developers that I know of. Literally 'git clone kubernetes', cd kubernetes, then make. That has always worked, and appears supported, since the purpose of "kube::golang::setup_env" and hack/lib/golang.sh is to set up the Go environment without having GOPATH set. Is that not the case?

@dcbw (Member, Author) commented Nov 18, 2016

@k8s-bot cvm gke e2e test this issue #35913

@yujuhong (Contributor)

@dcbw from reading #14940 (comment), since kubelet cannot recover the netns information when the docker container has terminated, this fix doesn't always work? Another question: will the CNI plugin return an error if 1) the netns is empty, or 2) teardown is called multiple times?

@dcbw (Member, Author) commented Nov 18, 2016

@dcbw from reading #14940 (comment), since kubelet cannot recover the netns information when the docker container has terminated, this fix doesn't always work? Another question: will the CNI plugin return an error if 1) the netns is empty, or 2) teardown is called multiple times?

@yujuhong At least for Docker, if the container is "dead" or "exited" the netns doesn't exist anymore, so any interfaces it had will have already been torn down. But IPAM may not have been torn down, and that still needs to happen, which this PR allows the CNI plugin to do.

For rkt + --network-plugin=cni, the netns is managed by kubelet itself, so it will still be around even if the container has exited. We don't really have this problem there, and that's one reason the rkt runtime already calls TearDownPod() during GC.

CNI does not require a CNI_NETNS for CNI_DEL actions, and it's up to the plugin to do best-effort cleanup when it gets the CNI_DEL call. That should probably include any IPAM, to ensure that even if the netns and container are gone, the IPAM gets released.

In any case, this patch doesn't regress the DEL behavior from what it is now. If the CNI plugin requires a netns, it'll still return an error just like it does today. And teardown is already called multiple times due to kubelet races :) So that's not a change in behavior either.
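
To make "best-effort cleanup" concrete, here is a small self-contained sketch of a DEL handler from a plugin's point of view. The cmdArgs type and the releaseIPAM helper are illustrative assumptions for the sketch, not the real CNI library API.

package main

import "fmt"

// cmdArgs is an illustrative stand-in for the arguments a CNI plugin sees on
// DEL; real plugins get these from the CNI skeleton/library.
type cmdArgs struct {
	ContainerID string
	Netns       string // may be empty: CNI does not require CNI_NETNS on DEL
}

// releaseIPAM is a hypothetical helper standing in for the plugin's IPAM
// release logic (e.g. deleting a host-local allocation for this container).
func releaseIPAM(containerID string) error {
	fmt.Printf("releasing IPAM allocation for container %s\n", containerID)
	return nil
}

// cmdDel does best-effort teardown: even if the netns is gone (dead or
// already-removed container), it still releases the IPAM allocation so
// addresses don't leak.
func cmdDel(args cmdArgs) error {
	if err := releaseIPAM(args.ContainerID); err != nil {
		return err
	}
	if args.Netns == "" {
		// Nothing else to clean up; the interfaces died with the netns.
		return nil
	}
	// With a netns present, the plugin would also remove its veth/bridge
	// attachments here.
	return nil
}

func main() {
	_ = cmdDel(cmdArgs{ContainerID: "abc123", Netns: ""})
}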

}
}

if len(candidates) <= containersToKeep {
return containerStatusbyCreatedList{}
// Always destroy infra containers to ensure networking gets torn down
Review comment (Contributor):

kubelet still relies on one dead container as a checkpoint. I don't think we should change that.

Overall, I'd prefer not making changes to this file (pod_container_deletor.go). The podContainerDeletor simply does a more aggressive cleanup, and the regular container GC will kick in later.

It's also a plus to not pollute more files with leaky.PodInfraContainerName

Review comment (Member, Author):

Ok, then we need somewhere else that actually inspects the dead docker containers and compares against the running ones, which the GC code already does.

@yujuhong would you be OK with changing the code to tear down networking for all the dead infra containers, but not remove them completely from docker?

Review comment (Member):

There are actually two different moments which may trigger network teardown:

(1) when the infra container is down
(2) when the infra container is removed

I think the code in removeContainer() handles (2), and here is a fix for (1).

To reduce IPAM leaking ASAP, I suggest we trigger network teardown as soon as the infra container is stopped, e.g. perform the infra container network teardown after getting evictableContainers (and before removing any containers).

Review comment (Contributor):

@dcbw I'd like to try making the change small and localized since it's so close to the release (unless this is not targeted for 1.5).

The change in this file is very docker-specific even though the file is used by all implementations (rkt, CRI, etc.). On top of that, kubenet has its own fix that handles the case where the infra containers get removed underneath kubelet. By leaving everything to the container GC, you have already fixed the leaks in most situations (e.g., docker restart). Would that be sufficient?

glog.V(4).Infof("Removing container %q name %q", id, containerName)
err := cgc.client.RemoveContainer(id, dockertypes.ContainerRemoveOptions{RemoveVolumes: true})
func (cgc *containerGC) removeContainer(containerInfo containerGCInfo) {
if !containerInfo.isHostNetwork && containerInfo.containerName == leaky.PodInfraContainerName {
Review comment (Contributor):

s/leaky.PodInfraContainerName/PodInfraContainerName

It's already defined in dockertools/docker.go

@yujuhong yujuhong self-assigned this Nov 18, 2016
@dcbw (Member, Author) commented Nov 22, 2016

On top of that, kubenet has its own fix that handles the case where the infra containers get removed underneath kubelet.

@yujuhong yeah, kubenet has that workaround, but it was a workaround for this specific problem :) And it's only for kubenet, which doesn't help CNI plugins at all. For OpenShift we just ported your kubenet workaround to our CNI plugin, but I thought I'd try to fix it for real too.

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 24, 2016
@chenchun (Contributor) commented Nov 24, 2016

@dcbw This PR is useful, but I think it can't resolve every situation. Here is an exception:

Suppose we are using host-local IPAM in a CNI plugin:

  1. kubelet starts the cni plugin to set up the network
  2. cni plugin process cni-1 starts; it is supposed to set up the network
  3. kubectl deletes the pod
  4. kubelet starts the cni plugin to tear down the network
  5. cni plugin process cni-2 starts; it is supposed to tear down the network
  6. cni plugin process cni-2 exits
  7. kubelet deletes the infrastructure container
  8. cni-1 touches the IP file and exits

This is because kubelet doesn't hold a lock before starting the CNI plugin to guarantee that cni-1 has exited before cni-2 is launched. I'm not sure if kubelet needs to hold a lock; IMO it may not be kubelet's responsibility to hold a lock or to delete these leaked IP files. Anyone who uses a CNI plugin with host-local IPAM will have to launch a daemon process to clean up these leaked IP files.
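
For illustration, a cleanup daemon of the kind described above could look roughly like this. It assumes, purely as a sketch, that host-local keeps one file per allocated IP under a data directory with the owning container ID on the first line; the directory path and the containerExists check are made-up stand-ins for the example, not documented behavior.

package main

import (
	"fmt"
	"net"
	"os"
	"path/filepath"
	"strings"
)

// Assumed host-local data directory for an example network named "mynet".
const ipamDir = "/var/lib/cni/networks/mynet"

// containerExists is a hypothetical stand-in; a real daemon would ask the
// container runtime whether this container ID is still known.
func containerExists(id string) bool {
	return false
}

// cleanupLeakedIPs removes allocation files whose owning container is gone.
func cleanupLeakedIPs() error {
	entries, err := os.ReadDir(ipamDir)
	if err != nil {
		return err
	}
	for _, e := range entries {
		// Only consider files whose names parse as IP addresses, skipping
		// any bookkeeping files the IPAM plugin keeps in the same directory.
		if net.ParseIP(e.Name()) == nil {
			continue
		}
		path := filepath.Join(ipamDir, e.Name())
		data, err := os.ReadFile(path)
		if err != nil {
			continue
		}
		containerID := strings.TrimSpace(strings.SplitN(string(data), "\n", 2)[0])
		if containerID != "" && !containerExists(containerID) {
			fmt.Printf("removing leaked allocation %s (container %s is gone)\n", e.Name(), containerID)
			_ = os.Remove(path)
		}
	}
	return nil
}

func main() {
	if err := cleanupLeakedIPs(); err != nil {
		fmt.Println("cleanup failed:", err)
	}
}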

@dcbw (Member, Author) commented Dec 7, 2016

@yujuhong @chenchun @bprashanth I believe I've addressed all your comments. First, everything is now specific to the docker runtime, no more general kubelet changes. Second, we now serialize all network plugin operations in the docker runtime to specifically guard against concurrent operations from kubelet's GC goroutine and kubelet's SyncPod() goroutine. This was always a problem before too, and we've run into instances of concurrent teardown in the past.

@alexbrand I've made some changes to the Windows parts of dockertools too. Could you verify them? From the comments I understand that every container in the pod gets its own IP address on Windows, but that the Infra container (while started) is basically ignored. So I attempted to ensure that all containers on Windows will get their networking cleaned up when GC happens, while on Linux only the Infra container needs this.
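
For reference, the per-pod serialization idea can be sketched as below. This is a simplified illustration; the pluginManager name, its lock map, and withPodLock are assumptions for the sketch, not the actual PluginManager code.

package main

import "sync"

// pluginManager sketches per-pod serialization of network operations: setup,
// teardown, and status calls for the same pod are serialized, while different
// pods can proceed in parallel.
type pluginManager struct {
	mu       sync.Mutex
	podLocks map[string]*sync.Mutex
}

func newPluginManager() *pluginManager {
	return &pluginManager{podLocks: make(map[string]*sync.Mutex)}
}

// podLock returns (creating if needed) the lock for a given pod.
func (pm *pluginManager) podLock(fullPodName string) *sync.Mutex {
	pm.mu.Lock()
	defer pm.mu.Unlock()
	l, ok := pm.podLocks[fullPodName]
	if !ok {
		l = &sync.Mutex{}
		pm.podLocks[fullPodName] = l
	}
	return l
}

// withPodLock runs op while holding the pod's lock, so a GC teardown and a
// SyncPod setup for the same pod cannot run concurrently.
func (pm *pluginManager) withPodLock(fullPodName string, op func() error) error {
	l := pm.podLock(fullPodName)
	l.Lock()
	defer l.Unlock()
	return op()
}

func main() {
	pm := newPluginManager()
	_ = pm.withPodLock("default/nginx", func() error {
		// e.g. call the network plugin's SetUpPod or TearDownPod here.
		return nil
	})
}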

@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 7, 2016
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Dec 7, 2016
@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 16, 2017
@freehan freehan added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Feb 16, 2017
@freehan (Contributor) commented Feb 16, 2017

Added a priority since this PR has been rebased more than 2 times.

The PluginManager almost duplicates the network plugin interface, but
not quite since the Init() function should be called by whatever
actually finds and creates the network plugin instance.  Only then
does it get passed off to the PluginManager.

The Manager synchronizes pod-specific network operations like setup,
teardown, and pod network status.  It passes through all other
operations so that runtimes don't have to cache the network plugin
directly, but can use the PluginManager as a wrapper.
…rations

We need to tear down networking when garbage collecting containers too,
and GC is run from a different goroutine in kubelet.  We don't want
container network operations running for the same pod concurrently.
The docker runtime doesn't tear down networking when GC-ing pods.
rkt already does, so make docker do it too. To ensure this happens,
networking is always torn down for the container even if the
container itself is not deleted.

This prevents IPAM from leaking when the pod gets killed for
some reason outside kubelet (like docker restart) or when pods
are killed while kubelet isn't running.

Fixes: kubernetes#14940
Related: kubernetes#35572
Dead infra containers may still have network resources allocated to
them and may not be GC-ed for a long time.  But allowing SyncPod()
to restart an infra container before the old one is destroyed
prevents network plugins from carrying the old network details
(eg IPAM) over to the new infra container.
@dcbw (Member, Author) commented Feb 16, 2017

@freehan rebased, can you re-LGTM? Thanks!

@k8s-github-robot k8s-github-robot removed lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Feb 16, 2017
@freehan freehan added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 16, 2017
@freehan (Contributor) commented Feb 16, 2017

@k8s-bot cross build this
@k8s-bot cvm gce e2e test this
@k8s-bot gce etcd3 e2e test this

@yujuhong (Contributor)

/approve

@k8s-github-robot

[APPROVALNOTIFIER] This PR is APPROVED

The following people have approved this PR: dcbw, yujuhong

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 16, 2017
@k8s-github-robot

Automatic merge from submit-queue (batch tested with PRs 40505, 34664, 37036, 40726, 41595)

@k8s-github-robot k8s-github-robot merged commit 98d1cff into kubernetes:master Feb 17, 2017
@saiwl commented Apr 12, 2017

When will this feature be released? We really need it.

@yujuhong (Contributor)

@saiwl this is already included in v1.6.

@saiwl commented Apr 13, 2017

Is this feature turned on by default? I restarted the docker daemon, but the IP wasn't released. @yujuhong

@yujuhong (Contributor)

Is this feature turned on by default? I restarted the docker daemon, but the IP wasn't released. @yujuhong

This is turned on by default (or simply put, there is no way to disable it).

If you want to report a bug, please open a new issue with more information (e.g., environment, network setup, steps to reproduce, etc), thanks.

dcbw added a commit to dcbw/ocicni that referenced this pull request Sep 5, 2017
The runtime calling CRI-O may not synchronize pod network operations,
but to ensure that they don't overlap, ocicni must do so.  Operations
are synchronized on a per-pod basis, such that all operations for a
given pod are synchronized, but individual pods' operations can
run in parallel.

This is essentially a port of the Kubernetes PluginManager code from
kubernetes/kubernetes#37036.

Signed-off-by: Dan Williams <dcbw@redhat.com>