dockertools: call TearDownPod when GC-ing infra pods #37036
Conversation
@kubernetes/sig-network @bprashanth @thockin @caseydavenport @knobunc @danwinship
I'll need to page this in. @kubernetes/sig-node might be able to review this right away.
@bprashanth in the case this PR fixes, teardown is not called, because this is container GC and the docker runtime doesn't do anything network-related for GC-ed containers...
Force-pushed from 34db29e to c7383fa
Jenkins verify is failing on update-bazel.sh, which doesn't appear to have anything to do with this PR...
@dcbw can you please run hack/update-bazel.sh?
So I git fetch upstream master, rebase my changes on top, and run it:
So I add "kube::golang::setup_env" to hack/update-bazel.sh and get a ton of:
whereupon I have tons of changes that are completely unrelated to anything in this PR. Shouldn't somebody else run update-bazel.sh and submit that as a PR, and then I rebase on top?
@dcbw hmm, it looks like you have a somewhat non-standard gopath setup. gazel expects kubernetes source to be checked out in $GOPATH/src/k8s.io/kubernetes. In #36843 we are adding a hard check for this. See: https://github.com/kubernetes/kubernetes/blob/master/docs/devel/development.md#clone-your-fork In the meantime you can apply this diff to fix ci:

```diff
diff --git a/pkg/kubelet/BUILD b/pkg/kubelet/BUILD
index eaade63..73d6ebe 100644
--- a/pkg/kubelet/BUILD
+++ b/pkg/kubelet/BUILD
@@ -65,6 +65,7 @@ go_library(
         "//pkg/kubelet/eviction:go_default_library",
         "//pkg/kubelet/images:go_default_library",
         "//pkg/kubelet/kuberuntime:go_default_library",
+        "//pkg/kubelet/leaky:go_default_library",
         "//pkg/kubelet/lifecycle:go_default_library",
         "//pkg/kubelet/metrics:go_default_library",
```
Ah, thanks! So yeah, it was something in the patch. Is the BUILD stuff new? I guess I've just been lucky before in that all imports were already in BUILD. RE the gopath, I don't have a particularly non-standard setup for a Linux developer, as far as I know. Literally 'git clone kubernetes', cd kubernetes, then make. That has always worked, and appears supported, since the purpose of "kube::golang::setup_env" and hack/lib/golang.sh is to set up the Go environment without having GOPATH set. Is that not the case?
Force-pushed from c7383fa to 6d0b221
@dcbw from reading #14940 (comment), since kubelet cannot recover the netns information once the docker container has terminated, this fix doesn't always work? Another question: will the cni plugin return an error if 1) the netns is empty, or 2) teardown is called multiple times?
@yujuhong At least for Docker, if the container is "dead" or "exited" the netns doesn't exist anymore, so any interfaces it had will have already been torn down. But IPAM may not have been torn down, and that still needs to happen, which this PR allows the CNI plugin to do. For rkt + --network-plugin=cni the netns is managed by kubelet itself, so it will still be around even if the container has exited; we don't really have this problem there, and that's one reason the rkt runtime already calls TearDownPod() during GC. CNI does not require a CNI_NETNS for CNI_DEL actions, and it's up to the plugin to do best-effort cleanup when it gets the CNI_DEL call. That should probably include releasing any IPAM to ensure that even if the netns and container are gone, the IPAM gets released. In any case, this patch doesn't regress the DEL behavior from what it is now. If the CNI plugin requires a netns, it'll still return an error just like it does today. And teardown is already called multiple times due to kubelet races :) So that's not a change in behavior either.
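The best-effort DEL behavior described above can be sketched in Go. This is an illustrative stand-in, not the real CNI plugin API: `ipamStore`, `Release`, and `cmdDel` are hypothetical names modeling the idea that IPAM must be released even when the netns is already gone, and that repeated DELs must succeed.

```go
package main

import "fmt"

// ipamStore stands in for a CNI IPAM backend (e.g. host-local's on-disk
// store). It is a hypothetical stub, not the real plugin API.
type ipamStore struct {
	allocated map[string]bool // containerID -> has an IP reserved
}

// Release frees any address reserved for the container; releasing an
// already-released container is a no-op, so repeated DELs are safe.
func (s *ipamStore) Release(containerID string) {
	delete(s.allocated, containerID)
}

// cmdDel sketches best-effort teardown: even when the netns is gone (empty
// string), the plugin still releases IPAM so addresses are not leaked when
// containers die outside kubelet's control (e.g. a docker restart).
func cmdDel(store *ipamStore, containerID, netns string) error {
	// Always release IPAM first; the container may already be gone.
	store.Release(containerID)

	if netns == "" {
		// No netns: its interfaces were destroyed with it, nothing left to do.
		return nil
	}
	// With a live netns we would also remove the veth, routes, etc.
	fmt.Printf("cleaning interfaces in %s\n", netns)
	return nil
}

func main() {
	store := &ipamStore{allocated: map[string]bool{"abc123": true}}
	// DEL called twice, the second time with no netns: both must succeed.
	cmdDel(store, "abc123", "/var/run/netns/abc123")
	if err := cmdDel(store, "abc123", ""); err != nil {
		fmt.Println("unexpected error:", err)
	}
	fmt.Println("still allocated:", store.allocated["abc123"])
}
```

Note that the sketch mirrors the point made above: teardown being called multiple times is expected, so DEL must be idempotent.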
pkg/kubelet/pod_container_deletor.go (outdated)

	if len(candidates) <= containersToKeep {
		return containerStatusbyCreatedList{}
	}
	// Always destroy infra containers to ensure networking gets torn down
kubelet still relies on one dead container as a checkpoint. I don't think we should change that.
Overall, I'd prefer not making changes to this file (pod_container_deletor.go). The podContainerDeletor simply does a more aggressive cleanup, and the regular container GC will kick in later.
It's also a plus to not pollute more files with leaky.PodInfraContainerName.
Ok, then we need somewhere else that actually inspects the dead docker containers and compares them against the running ones, which the GC code already does.
@yujuhong would you be OK with changing the code to tear down networking for all the dead infra containers, but not remove them completely from docker?
There are actually two different moments which may trigger network teardown:
(1) when the infra container is down
(2) when the infra container is removed
I think the code in removeContainer() handles (2), and here is a fix for (1).
To reduce IPAM leaking asap, I suggest we only trigger network teardown when the infra container is stopped, e.g. perform infra container network teardown after getting evictableContainers (and before removing any containers).
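The ordering suggested above can be sketched as a two-pass loop. This is a simplified illustration, not the real kubelet GC code: `gcContainer`, `infraName`, and `evict` are hypothetical stand-ins showing networking being released for stopped infra containers (moment (1)) before any container is removed (moment (2)).

```go
package main

import "fmt"

// gcContainer is a simplified stand-in for the dockertools GC bookkeeping.
type gcContainer struct {
	id            string
	containerName string
	isHostNetwork bool
}

// infraName stands in for the well-known infra (sandbox) container name.
const infraName = "POD"

// evict tears down networking for every stopped, non-host-network infra
// container first, then removes containers. IPAM is thus released as early
// as possible even if actual removal is deferred by GC policy.
func evict(evictable []gcContainer, tearDown, remove func(id string)) {
	for _, c := range evictable {
		if c.containerName == infraName && !c.isHostNetwork {
			tearDown(c.id)
		}
	}
	for _, c := range evictable {
		remove(c.id)
	}
}

func main() {
	var order []string
	evict(
		[]gcContainer{
			{id: "infra1", containerName: infraName},
			{id: "app1", containerName: "app"},
		},
		func(id string) { order = append(order, "teardown:"+id) },
		func(id string) { order = append(order, "remove:"+id) },
	)
	fmt.Println(order) // teardown of infra1 precedes all removals
}
```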
@dcbw I'd like to try making the change small and localized since it's so close to the release (unless this is not targeted for 1.5).
The change in this file is very docker-specific even though the file is used by all implementations (rkt, cri, etc). On top of that, kubenet has its own fix that handles the case where the infra containers get removed underneath kubelet. By leaving everything to the container GC, you have already fixed the leaks in most situations (e.g., docker restart). Would that be sufficient?
	func (cgc *containerGC) removeContainer(containerInfo containerGCInfo) {
		if !containerInfo.isHostNetwork && containerInfo.containerName == leaky.PodInfraContainerName {
			...
		glog.V(4).Infof("Removing container %q name %q", id, containerName)
		err := cgc.client.RemoveContainer(id, dockertypes.ContainerRemoveOptions{RemoveVolumes: true})
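The gating condition in the excerpt above can be isolated into a small sketch. This is illustrative, not the real dockertools code: `containerGCInfo` here is a stripped-down stand-in, and `podInfraContainerName` substitutes for the `leaky.PodInfraContainerName` constant discussed in the review.

```go
package main

import "fmt"

// containerGCInfo loosely mirrors the fields the dockertools GC uses to
// decide whether network teardown is needed; a simplified stand-in.
type containerGCInfo struct {
	id            string
	containerName string
	isHostNetwork bool
}

// podInfraContainerName stands in for leaky.PodInfraContainerName.
const podInfraContainerName = "POD"

// needsNetworkTeardown reproduces the condition above: only the infra
// (sandbox) container owns the pod's network namespace, and host-network
// pods have no plugin-managed networking to release.
func needsNetworkTeardown(c containerGCInfo) bool {
	return !c.isHostNetwork && c.containerName == podInfraContainerName
}

// removeContainer tears down networking when required, then (in the real
// code) removes the container from the runtime. Teardown errors are logged
// rather than aborting removal, matching the best-effort GC behavior.
func removeContainer(c containerGCInfo, tearDownPod func(id string) error) {
	if needsNetworkTeardown(c) {
		if err := tearDownPod(c.id); err != nil {
			fmt.Printf("network teardown for %s failed: %v\n", c.id, err)
		}
	}
	// ... actual container removal would happen here.
}

func main() {
	c := containerGCInfo{id: "abc", containerName: podInfraContainerName}
	removeContainer(c, func(id string) error {
		fmt.Println("TearDownPod for", id)
		return nil
	})
}
```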
s/leaky.PodInfraContainerName/PodInfraContainerName
It's already defined in dockertools/docker.go
@yujuhong yeah, kubenet has that workaround, but it was a workaround for this specific problem :) And it's only for kubenet, which doesn't help CNI plugins at all. For OpenShift we just ported your kubenet workaround to our CNI plugin, but I thought I'd try to fix it for real too.
@dcbw This PR is useful, but I think it can't resolve all the situations. Here is an exception: suppose we are using host-local ipam in the cni plugin.
This is because kubelet didn't hold a lock before starting the cni plugin to guarantee that network operations for the same pod don't overlap.
Force-pushed from 6d0b221 to 5c49d14
@yujuhong @chenchun @bprashanth I believe I've addressed all your comments. First, everything is now specific to the docker runtime, with no more general kubelet changes. Second, we now serialize all network plugin operations in the docker runtime to specifically guard against concurrent operations from kubelet's GC goroutine and kubelet's SyncPod() goroutine. This was always a problem before too, and we've run into instances of concurrent teardown in the past. @alexbrand I've made some changes to the Windows parts of dockertools too. Could you verify them? From the comments I understand that every container in the pod gets its own IP address on Windows, but that the infra container (while started) is basically ignored. So I attempted to ensure that all containers on Windows will get their networking cleaned up when GC happens, while on Linux only the infra container needs this.
Force-pushed from 5c49d14 to 17cd07a
Added a priority since this has been rebased more than 2 times
The PluginManager almost duplicates the network plugin interface, but not quite since the Init() function should be called by whatever actually finds and creates the network plugin instance. Only then does it get passed off to the PluginManager. The Manager synchronizes pod-specific network operations like setup, teardown, and pod network status. It passes through all other operations so that runtimes don't have to cache the network plugin directly, but can use the PluginManager as a wrapper.
…rations

We need to tear down networking when garbage collecting containers too, and GC is run from a different goroutine in kubelet. We don't want container network operations running for the same pod concurrently.
The docker runtime doesn't tear down networking when GC-ing pods; rkt already does, so make docker do it too. To ensure this happens, networking is always torn down for the container even if the container itself is not deleted. This prevents IPAM from leaking when the pod gets killed for some reason outside kubelet (like docker restart) or when pods are killed while kubelet isn't running. Fixes: kubernetes#14940 Related: kubernetes#35572
Dead infra containers may still have network resources allocated to them and may not be GC-ed for a long time. But allowing SyncPod() to restart an infra container before the old one is destroyed prevents network plugins from carrying the old network details (eg IPAM) over to the new infra container.
Force-pushed from 64dc4d2 to 20e1cdb
@freehan rebased, can you re-LGTM? Thanks!
/approve
[APPROVALNOTIFIER] This PR is APPROVED The following people have approved this PR: dcbw, yujuhong
Automatic merge from submit-queue (batch tested with PRs 40505, 34664, 37036, 40726, 41595)
When will this feature be released? We really need it.
@saiwl this is already included in v1.6.
Is this feature turned on by default? I restarted the docker daemon, but the ip wasn't released. @yujuhong
This is turned on by default (simply put, there is no way to disable it). If you want to report a bug, please open a new issue with more information (e.g., environment, network setup, steps to reproduce, etc), thanks.
The runtime calling CRI-O may not synchronize pod network operations, but to ensure that they don't overlap, ocicni must do so. Operations are synchronized on a per-pod basis, such that all operations for a given pod are synchronized, but individual pods' operations can run in parallel. This is essentially a port of the Kubernetes PluginManager code from kubernetes/kubernetes#37036. Signed-off-by: Dan Williams <dcbw@redhat.com>
The docker runtime doesn't tear down networking when GC-ing pods.
rkt already does, so make docker do it too. To ensure this happens,
infra pods are now always GC-ed rather than gating them by
containersToKeep.
This prevents IPAM from leaking when the pod gets killed for
some reason outside kubelet (like docker restart) or when pods
are killed while kubelet isn't running.
Fixes: #14940
Related: #35572