kubeadm: Improve resiliency in CreateOrMutateConfigMap #85763
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: ereslibre. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
With several control plane nodes it was very easy to reproduce the issue.
/assign @neolit123 @rosti @fabriziopandini @yastij
@neolit123 sorry, did a last minute push to simplify a little more when we are setting |
/lgtm
/lgtm
As an aside: I think this is a backport candidate. It's a very focused patch that makes control plane join much more stable. Can we justify a backport here?
CreateOrMutateConfigMap was not resilient when trying to Create the ConfigMap. If this operation returned an unknown error, the whole operation would fail, because the code was strict about which error it expected right afterwards: if the error returned by the Create call was an `IsAlreadyExists` error, it would work fine. However, if an unexpected error (such as an EOF) happened, this call would fail. We are seeing this error especially when running control plane node joins in an automated fashion, where things happen at a relatively high pace. It was especially easy to reproduce with kind, with several control plane instances. E.g.:

```
[upload-config] Storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
I1130 11:43:42.788952 887 round_trippers.go:443] POST https://172.17.0.2:6443/api/v1/namespaces/kube-system/configmaps?timeout=10s in 1013 milliseconds
Post https://172.17.0.2:6443/api/v1/namespaces/kube-system/configmaps?timeout=10s: unexpected EOF
unable to create ConfigMap
k8s.io/kubernetes/cmd/kubeadm/app/util/apiclient.CreateOrMutateConfigMap
	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/util/apiclient/idempotency.go:65
```

This change makes this logic more resilient to unknown errors. It will retry in light of unknown errors until one of the expected outcomes happens: either `IsAlreadyExists`, in which case we will mutate the ConfigMap, or no error, in which case the ConfigMap has been created.
backports are critical-urgent or important-soon (nowadays) in terms of priority label. as much as it's worth having the backoff, i personally haven't seen reports about this particular flake. my vote is 50/50 on backport. in my opinion we should have the refactor we are discussing here in 1.18:
/lgtm
That's interesting. I could see that literally every time with kind when creating more than one control plane instance.
Absolutely, 👍
👍
is that in parallel? as in joining more than one CP in parallel.
No, in sequence: https://github.com/kubernetes-sigs/kind/blob/64a5e320ad92d5746c3e750545d955d0812464bf/pkg/internal/cluster/create/actions/kubeadmjoin/join.go#L82-L89
/retest |
we no longer have HA clusters based on kind (only kinder), but kinder also uses serial join of CP nodes and this flake is not seen there. i also haven't seen it in recent kind experiments.
Realized I had Output
Also, in kinder we are doing https://github.com/kubernetes/kubeadm/blob/f4019dfc8ffd9bac687757231422e649acf3e96b/kinder/pkg/cluster/manager/actions/kubeadm-join.go#L102 after every new control plane join. In addition, in kinder the load balancer is reconfigured after every join, instead of being preconfigured from the beginning as in kind. This might have an impact on the reproducibility of the issue. Seen in the wild as well: kubernetes-sigs/kind#1020 -- it was supposed to be a full hard disk, but not really conclusive IMO.
After seeing that this rarely happens in the wild, I don't think we should backport this one unless we get more reports.
My stance on backports is that we need to do it if it's a bug and it's P0. We may consider doing it if it's a bug + P1. But P2 is definitely not high enough.
It seems the consensus is not to backport for now.
we briefly mentioned this here: tracking issue: kubernetes/kubeadm#1606
I have been experiencing something that looks related to this here: In some reproducible scenarios, when setting up a multi-control-plane cluster, the first
…ill now run as static pods on the master nodes. Note that with some clusters using the nginx/keepalived setup (regardless of this commit), joining the first secondary master node currently always fails, while all subsequent masters can be joined just fine. A fix for this is expected with Kubernetes 1.18.0; see also kubernetes/kubernetes#85763
first join should not always fail pre 1.18. it is possible for the api-server (from the init node) to not be ready for a while, and an aggressive subsequent join would fail; adding some timeout before the join, or making sure the primary node is Ready, should resolve that. recently in cluster API we saw an issue where the second CP join failed with a ConfigMap update error, but it was caused by a load balancer blackout when a new member is added to the LB config. so waiting for a while, or for the right time to join, makes sense in higher level infrastructure. for kubeadm we are adding retries where we can and when we can. the timeline is not clear.
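As a rough illustration of the "make sure the primary node is Ready before the next join" advice above, here is a minimal client-go sketch; the package name, function name, intervals, and error handling are assumptions for illustration, not kubeadm or kind/kinder code:

```go
package join

import (
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	clientset "k8s.io/client-go/kubernetes"
)

// waitForNodeReady polls until the given node reports the Ready condition, so
// that higher-level tooling can delay the next control-plane join until the
// previous member has settled (and the load balancer has picked it up).
func waitForNodeReady(client clientset.Interface, nodeName string, timeout time.Duration) error {
	return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
		// Pre-1.18 typed client signature (no context argument) is assumed here.
		node, err := client.CoreV1().Nodes().Get(nodeName, metav1.GetOptions{})
		if err != nil {
			// Transient API errors (e.g. while the LB is being reconfigured): keep polling.
			return false, nil
		}
		for _, cond := range node.Status.Conditions {
			if cond.Type == v1.NodeReady {
				return cond.Status == v1.ConditionTrue, nil
			}
		}
		return false, nil
	})
}
```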
What type of PR is this?
/kind bug
What this PR does / why we need it:
CreateOrMutateConfigMap was not resilient when trying to Create
the ConfigMap. If this operation returned an unknown error, the whole
operation would fail, because the code was strict about which error it
expected right afterwards: if the error returned by the Create call
was an `IsAlreadyExists` error, it would work fine. However, if an
unexpected error (such as an EOF) happened, this call would fail.
We are seeing this error especially when running control plane node
joins in an automated fashion, where things happen at a relatively
high pace.
It was especially easy to reproduce with kind, with several control plane instances (see the log excerpt quoted earlier).
This change makes this logic more resilient to unknown errors. It will retry in light of unknown errors until one of the expected outcomes happens: either `IsAlreadyExists`, in which case we will mutate the ConfigMap, or no error, in which case the ConfigMap has been created.
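A minimal sketch of the retry behaviour described above, assuming the pre-1.18 client-go typed client signatures (no context argument); the polling intervals and the mutateConfigMap helper below are illustrative, not the exact upstream implementation:

```go
package apiclient

import (
	"time"

	v1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	clientset "k8s.io/client-go/kubernetes"
)

// createOrMutateConfigMap keeps retrying the Create call until one of the two
// expected outcomes happens: success, or IsAlreadyExists (in which case the
// existing ConfigMap is mutated instead). Any other error, such as an
// unexpected EOF, is treated as transient and retried.
func createOrMutateConfigMap(client clientset.Interface, cm *v1.ConfigMap, mutator func(*v1.ConfigMap) error) error {
	return wait.PollImmediate(500*time.Millisecond, 30*time.Second, func() (bool, error) {
		if _, err := client.CoreV1().ConfigMaps(cm.Namespace).Create(cm); err != nil {
			if apierrors.IsAlreadyExists(err) {
				// The ConfigMap already exists: mutate it instead of creating it.
				return true, mutateConfigMap(client, cm, mutator)
			}
			// Unknown (possibly transient) error: retry the Create.
			return false, nil
		}
		// Created successfully.
		return true, nil
	})
}

// mutateConfigMap is an illustrative helper: fetch the live object, apply the
// caller-provided mutator, and push the update back.
func mutateConfigMap(client clientset.Interface, cm *v1.ConfigMap, mutator func(*v1.ConfigMap) error) error {
	live, err := client.CoreV1().ConfigMaps(cm.Namespace).Get(cm.Name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if err := mutator(live); err != nil {
		return err
	}
	_, err = client.CoreV1().ConfigMaps(live.Namespace).Update(live)
	return err
}
```

In this shape the retry loop gives up only when the poll timeout expires, which matches the goal of tolerating transient failures such as a load balancer dropping a connection mid-join.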
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
/priority important-longterm
/cc @kubernetes/sig-cluster-lifecycle-pr-reviews