Kubeadm reset with Calico and Containerd #1639
Comments
/assign @bart0sh
It would also be great to get the perspective from the Calico team /cc @caseydavenport
Ah, it looks to me like the Calico CNI plugin is trying to access the apiserver and failing. It does this as part of tearing down networking for pods. This might be because kube-proxy has been removed and the iptables rules cleared out? That will cut off anyone trying to access the apiserver via the Service.
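For what it's worth, that failure mode is easy to observe from inside a pod by dialing the kubernetes Service VIP directly. A minimal Go sketch, assuming the common default ClusterIP of 10.96.0.1 (your cluster's Service CIDR may differ):

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// With kube-proxy's iptables rules cleared, the Service VIP is no
	// longer DNATed to a real apiserver endpoint, so this times out.
	conn, err := net.DialTimeout("tcp", "10.96.0.1:443", 3*time.Second)
	if err != nil {
		fmt.Println("apiserver Service unreachable:", err)
		return
	}
	conn.Close()
	fmt.Println("apiserver Service reachable")
}
```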
@caseydavenport right, in this case it's actually the api-server being stopped before other pods, so the Calico CNI won't ever be able to get that CRD. This behavior makes sense during normal execution when the del command comes from the Kubelet, but not in this case, where a user is trying to clean up a node with `kubeadm reset`. Looking at the code, it seems that it uses that CRD here. I'm wondering if there is a workaround to prevent the Calico CNI from trying to access the api-server?
I will think it through. At first glance, I'm worried that we'd be sacrificing the mainline case to cover for an exceptional one. But, there may be ways to safely remove that dependency. What options do we have on the kubeadm side? Could we make sure the apiserver is the last resource to be destroyed?
Ok, so some more info. Please let me know if you would like me to provide any more evidence.
I can't reproduce this. Here is what I did:
@dkoshkin I suspect that you're using a different calico version.
@bart0sh please try deploying a bunch of other non-static pods, let's say nginx for now. Depending on the order in which the pods are deleted, if the kube-apiserver or etcd pods are stopped first, the other nginx pods won't be able to be stopped. You can also reproduce this with
I've thought about this a little bit more and I'm worried it's a bit deeper than just the bit of code linked above. Even if we stop checking that CRD, fundamentally Calico uses the apiserver to store information about the pods it has networked, and on delete it needs to clean that up. For this scenario, obviously the apiserver is not coming back, but in normal operation you can experience intermittent loss of access to the API.

The option we have is to make Calico essentially ignore errors on teardown, thus orphaning state in the apiserver, and then write some external controller to do some correlation and clean it up later if the apiserver ever comes back. This approach is relatively complicated, and I worry it will reduce Calico's robustness in normal operation.
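Roughly, that "ignore errors on teardown" option could look like the sketch below: local cleanup still fails hard, while apiserver errors are logged and swallowed, orphaning the remote state for a later controller to reconcile. All names here are hypothetical stand-ins, not Calico's actual code:

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// errAPIServerUnreachable stands in for a failed apiserver call.
var errAPIServerUnreachable = errors.New("apiserver unreachable")

// Stand-ins for the real cleanup steps: releasing local IPAM state and
// deleting the pod's state stored via the apiserver.
func releaseLocalState(pod string) error { return nil }
func deleteRemoteState(pod string) error { return errAPIServerUnreachable }

func teardownPod(pod string) error {
	if err := releaseLocalState(pod); err != nil {
		return err // local cleanup must still succeed
	}
	if err := deleteRemoteState(pod); err != nil {
		// Best effort: log and continue, orphaning the remote state
		// for a hypothetical external controller to clean up later.
		log.Printf("orphaning state for %s: %v", pod, err)
	}
	return nil
}

func main() {
	fmt.Println(teardownPod("nginx-1")) // <nil>, even though the apiserver is down
}
```

The robustness concern above is exactly that this swallows errors which the mainline path currently treats as fatal.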
Please re-open if you find a bug on the kubeadm side.
Are there any updates on this?
Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

kubeadm version (use `kubeadm version`):

Environment:
- Kubernetes version (use `kubectl version`):
- Cloud provider or hardware configuration: AWS
- Kernel (e.g. `uname -a`): Linux ip-10-0-192-121.us-west-2.compute.internal 3.10.0-957.1.3.el7.x86_64 #1 SMP Thu Nov 29 14:49:43 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
What happened?

Installing with Containerd 1.2.5 and Calico v3.6.3 and then running `kubeadm reset` on one of the control-plane nodes.

What you expected to happen?

`kubeadm reset` should remove all pods.

How to reproduce it (as minimally and precisely as possible)?

1. Install Containerd 1.2.5, using `containerd config default` to get the default settings.
2. Install Calico v3.6.3.
3. Run `kubeadm reset` on one of the control-plane nodes.
Anything else we need to know?

This happens because `ListKubeContainers()` returns a list of pod IDs, and depending on the order of those IDs, trying to delete any pods after `kube-apiserver` is deleted will fail: on every stop the `calico` CNI plugin tries to reach out to the apiserver.

A possible simple solution, to increase the chances of all pods being deleted, is to order the returned list with `kube-apiserver` always being the last pod. This would still fail in a scenario where the `kube-apiserver` cannot be reached.
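A minimal sketch of that ordering idea in Go, assuming the pod list is a plain `[]string` of sandbox names (the helper below is illustrative, not kubeadm's actual `ListKubeContainers()` API):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// sortAPIServerLast reorders pod names so any kube-apiserver pods come
// last, keeping the apiserver reachable for as long as possible while
// the CNI plugin tears down the other pods.
func sortAPIServerLast(pods []string) {
	isAPIServer := func(name string) bool {
		return strings.Contains(name, "kube-apiserver")
	}
	sort.SliceStable(pods, func(i, j int) bool {
		// Non-apiserver pods sort before apiserver pods; SliceStable
		// preserves the existing order within each group.
		return !isAPIServer(pods[i]) && isAPIServer(pods[j])
	})
}

func main() {
	pods := []string{"kube-apiserver-master", "nginx-1", "calico-node-x2k", "nginx-2"}
	sortAPIServerLast(pods)
	fmt.Println(pods) // [nginx-1 calico-node-x2k nginx-2 kube-apiserver-master]
}
```

As noted above, this only raises the odds of a clean teardown; it does nothing for the case where the apiserver is already unreachable.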