
kubeadm: Add ability to retry ConfigMap get if certain errors happen #78915

Merged

Conversation

@ereslibre
Member

commented Jun 11, 2019

What type of PR is this?
/kind bug

What this PR does / why we need it:
During control plane joins, the control plane sometimes returns an
expected error when trying to download the kubeadm-config ConfigMap.
This is a workaround for this issue until the root cause is completely
identified and fixed.

Which issue(s) this PR fixes:
xrefs kubernetes-sigs/kind#588

Does this PR introduce a user-facing change?:

kubeadm: implement retry logic for certain ConfigMap failures when joining nodes

/assign @fabriziopandini
/assign @neolit123

@neolit123

Member

commented Jun 11, 2019

@kubernetes/sig-cluster-lifecycle-pr-reviews
/priority critical-urgent

this is a workaround for an etcd sync failure we are seeing with 3.3.10.
hopefully newer versions of etcd fix it.

we'd need to cherry-pick for 1.15 tomorrow (i can do that).

@neolit123
Member

left a comment

/approve

@k8s-ci-robot

Contributor

commented Jun 11, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ereslibre, neolit123

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ereslibre

Member Author

commented Jun 11, 2019

/hold

Submitted for early feedback, will update soon.

@ereslibre ereslibre force-pushed the ereslibre:retry-configmap-get-on-unauthorized branch from f9c8516 to 22338e8 Jun 12, 2019

@ereslibre

Member Author

commented Jun 12, 2019

/hold cancel

I believe this should be ready.

@ereslibre ereslibre force-pushed the ereslibre:retry-configmap-get-on-unauthorized branch from 22338e8 to 59b1b74 Jun 12, 2019

@ereslibre ereslibre changed the title Add ability to retry ConfigMap get if certain errors happen kubeadm: Add ability to retry ConfigMap get if certain errors happen Jun 12, 2019

@ereslibre

Member Author

commented Jun 12, 2019

/retest

@neolit123
Member

left a comment

@rosti @fabriziopandini PTAL for more 👀
/lgtm
/hold

we need this before tonight GMT.
and thanks @ereslibre for the quick fix.

@ereslibre

Member Author

commented Jun 12, 2019

@neolit123 sorry, had to update the bazel build, it was outdated.

@neolit123

Member

commented Jun 12, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Jun 12, 2019

@liggitt

Member

commented Jun 12, 2019

/hold cancel

@timothysc
Member

left a comment

This seems pretty hacky imo.

We should trigger on a specific error condition IMO.
/hold
/assign @timothysc @gyuho

@ereslibre

Member Author

commented Jun 12, 2019

Do we have a specific error condition that occurs during the race of new members joining?

There are several things going on here. The problem being worked around here is not etcd specific: when a new apiserver joins and the LB has no healthchecks set, we can reach the newly joined apiserver before it is fully initialized. This will sometimes cause the new apiserver to respond with a 401 Unauthorized HTTP error code.

At first, this PR only tolerated that specific error (401), but we then changed it to tolerate and retry on any kind of error, at least for some backoff.

This whole discussion goes back to what kubeadm's responsibility is if, for example, the LB is not properly configured. Should we really write these mitigations at the kubeadm level? Can we wait at a specific place instead of retrying virtually everywhere? [Tricky if we have an LB in front: we could assume everything is fine, and the LB could round-robin us onto the faulty instance.]

I didn't have time to check the original issue and the real problem behind it (I was thinking of doing so next cycle), but was asked to write this workaround for 1.15.

My position on this has always been that, even though I don't really like this approach, we could get the mitigation in, revert it later, and take more time to think about the problem and investigate its root causes.
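
To make the shape of the workaround concrete, here is a minimal sketch of "retry on any error, with some backoff" using client-go with the 1.15-era Get signature; the package name, helper name, and backoff values are illustrative assumptions, not the code merged in this PR:

```go
package joinutil

import (
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// getConfigMapWithRetry keeps retrying the GET with exponential backoff,
// regardless of which error the apiserver returned ("tolerate any error").
func getConfigMapWithRetry(client kubernetes.Interface, namespace, name string) (*v1.ConfigMap, error) {
	var cm *v1.ConfigMap
	var lastErr error
	backoff := wait.Backoff{
		Steps:    10,
		Duration: 500 * time.Millisecond,
		Factor:   2.0,
		Jitter:   0.1,
	}
	err := wait.ExponentialBackoff(backoff, func() (bool, error) {
		var getErr error
		// client-go Get signature as of the Kubernetes 1.15 era (no context argument).
		cm, getErr = client.CoreV1().ConfigMaps(namespace).Get(name, metav1.GetOptions{})
		if getErr != nil {
			lastErr = getErr
			return false, nil // retry on any error, including 401 Unauthorized
		}
		return true, nil
	})
	if err != nil {
		return nil, fmt.Errorf("could not get ConfigMap %s/%s: %v", namespace, name, lastErr)
	}
	return cm, nil
}
```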

@liggitt

Member

commented Jun 12, 2019

the LB has no healthchecks set, we can reach the newly joined apiserver before it is fully initialized

That's the real issue, but for run-once setup operations, additionally retrying before hard failing doesn't bother me that much.

kubeadm: Add ability to retry ConfigMap get if certain errors happen
During control plane joins, the control plane sometimes returns an
expected error when trying to download the `kubeadm-config` ConfigMap.
This is a workaround for this issue until the root cause is completely
identified and fixed.

Ideally, this commit should be reverted in the near future.

@ereslibre ereslibre force-pushed the ereslibre:retry-configmap-get-on-unauthorized branch from 5eb63c1 to 26c9965 Jun 12, 2019

@k8s-ci-robot k8s-ci-robot removed the lgtm label Jun 12, 2019

@neolit123

Member

commented Jun 12, 2019

/lgtm
/hold cancel
we discussed during the kubeadm office hours that we can have this temporarily, at least.

@k8s-ci-robot k8s-ci-robot added lgtm and removed do-not-merge/hold labels Jun 12, 2019

@timothysc

Member

commented Jun 12, 2019

That's the real issue, but for run-once setup operations, additionally retrying before hard failing doesn't bother me that much.

Depends... We could be masking a deployment issue. I'd feel much safer if we had a specific list of known conditions. That said, we're doing some funky things that we never really did before, like trying to stand up a cluster concurrently.

I still don't like this, but I'm happy to let it go so long as we keep tracking and rooting out the main issues.
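
For contrast, a hedged sketch of the "specific list of known conditions" alternative suggested above; the particular predicates chosen are assumptions for illustration, not an agreed-upon list:

```go
package joinutil

import apierrors "k8s.io/apimachinery/pkg/api/errors"

// isTransientJoinError reports whether an apiserver error looks like the
// "not fully initialized yet" window seen behind an LB without healthchecks.
func isTransientJoinError(err error) bool {
	return apierrors.IsUnauthorized(err) || // 401 while the apiserver is still coming up
		apierrors.IsServiceUnavailable(err) || // 503 from the LB or a not-yet-ready server
		apierrors.IsServerTimeout(err) ||
		apierrors.IsTooManyRequests(err)
}
```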

@neolit123

Member

commented Jun 12, 2019

/retest

@k8s-ci-robot k8s-ci-robot merged commit 4f29960 into kubernetes:master Jun 12, 2019

23 checks passed

cla/linuxfoundation ereslibre authorized
pull-kubernetes-bazel-build Job succeeded.
pull-kubernetes-bazel-test Job succeeded.
pull-kubernetes-conformance-image-test Skipped.
pull-kubernetes-cross Skipped.
pull-kubernetes-dependencies Job succeeded.
pull-kubernetes-e2e-gce Job succeeded.
pull-kubernetes-e2e-gce-100-performance Job succeeded.
pull-kubernetes-e2e-gce-csi-serial Skipped.
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
pull-kubernetes-e2e-gce-iscsi Skipped.
pull-kubernetes-e2e-gce-iscsi-serial Skipped.
pull-kubernetes-e2e-gce-storage-slow Skipped.
pull-kubernetes-godeps Skipped.
pull-kubernetes-integration Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
pull-kubernetes-local-e2e Skipped.
pull-kubernetes-node-e2e Job succeeded.
pull-kubernetes-node-e2e-containerd Job succeeded.
pull-kubernetes-typecheck Job succeeded.
pull-kubernetes-verify Job succeeded.
pull-publishing-bot-validate Skipped.
tide In merge pool.

@ereslibre ereslibre deleted the ereslibre:retry-configmap-get-on-unauthorized branch Jun 13, 2019

@BenTheElder

Member

commented Jun 14, 2019

catching up on this...

During control plane joins, the control plane sometimes returns an
expected error when trying to download the kubeadm-config ConfigMap.
This is a workaround for this issue until the root cause is completely
identified and fixed.

Are these actually expected? If so, why would this be a "workaround"?

It might actually make sense to retry on all errors before failing a run-once setup tool like this

+1

That said, we're doing some funky things that we never really did before, like trying to stand up a cluster concurrently.

FWIW, kind still intentionally does not do joins concurrently while this stabilizes (though kinder appears to); it just does them faster than most deployers, and it still saw issues with this prior to this patch.


Related: it looks like kubespray offers a kubeadm HA mode with an nginx load balancer. This doesn't look to have healthchecks; I will reach out.

https://github.com/kubernetes-sigs/kubespray/blob/3b7797b1a1d478493fc9b1510499fe371ddd7e2c/roles/kubernetes/node/templates/loadbalancer/nginx.conf.j2

@ereslibre

Member Author

commented Jun 14, 2019

Are these actually expected? If so, why would this be a "workaround"?

I'd like to first state that I haven't debugged this issue yet personally, and I'd like to do so as soon as possible in order to put more facts on the table, but I will base my judgement on the evidence presented in kubernetes-sigs/kind#588.

It really depends on where the responsibility for handling fault tolerance lies. In my opinion it would be perfectly reasonable for kubeadm not to perform any kind of retry (modulo exceptions; we have retries on Conflict in certain places, and that's perfectly fine IMO) and, in case an LB is used, to document that it needs to be properly configured with HTTP healthchecks.

Assuming in kubeadm that virtually any request to the apiserver could fail has several downsides:

  • Maintenance burden: we have to assume that any request to the apiserver could fail. If only certain requests are protected against this, any change in the order of the logic could bring "unprotected" requests to the front and hit the same problem again.

  • Hiding potential problems: from my point of view it's extremely weird to retry a request when a 401 HTTP error code is received; it is a status code that is supposed to be final. In this workaround, as suggested, we are not even retrying only on a 401, but on literally any error.

    • If only 401 is retried, it looks arbitrary and tied to a specific implementation detail of the apiserver, and it's not kubeadm's responsibility to watch for this.
    • If any error is retried, the codebase is weakened and could potentially hide temporary random errors.

The apiserver returns a 401 HTTP error code while initializing; during this time its /healthz endpoint does not report success. Thus, I don't think it's safe to ask for a specific apiserver behavior during initialization. I think it's completely reasonable not to expect anything from the apiserver until its health endpoint reports success.

kubeadm does not know and does not care whether an LB is used, so it feels weird to add protective code for the case where reaching an apiserver goes wrong. On the other hand, the component creating an LB should ensure that only healthy apiservers are considered and used in its active pool; it's an LB configuration detail, and the component using kubeadm should either:

  • Use kubeadm in a sequential way, making sure that the next step can be done based on the current state of the cluster.
  • Allow for concurrent actions, making sure that all components used for this goal are properly configured.

All that said, this is based on the information presented in the mentioned issue; I haven't had the chance to reproduce and debug it yet. Sorry for the long post :/
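
As a rough sketch of the "don't expect anything until the health endpoint reports success" alternative described above (the endpoint path, polling interval, and helper name are assumptions):

```go
package joinutil

import (
	"net/http"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForAPIServerHealthy polls <endpoint>/healthz until it returns 200 OK or
// the timeout expires, so that callers only talk to an apiserver that already
// reports itself healthy.
func waitForAPIServerHealthy(client *http.Client, endpoint string, timeout time.Duration) error {
	return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
		resp, err := client.Get(endpoint + "/healthz")
		if err != nil {
			return false, nil // not reachable yet; keep polling
		}
		defer resp.Body.Close()
		return resp.StatusCode == http.StatusOK, nil
	})
}
```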

@BenTheElder

Member

commented Jun 14, 2019

I'd like to first state that I haven't debugged this issue yet personally, and I'd like to do so as soon as possible in order to put more facts on the table, but I will base my judgement on the evidence presented in kubernetes-sigs/kind#588.

Er, not talking about kind specifically, or about all errors. I read some of the discussion as @liggitt saying that some of the errors are actually expected to occur and should be handled (?)

@neolit123

Member

commented Jun 14, 2019

if i get a 401 i would assume that the server is already up and i'm not authorized to use it; the forbidden error is also misleading. like Tim said during the office hours, the server passes through these states that we are seeing, but i think we should instead be seeing an error that tells us "hey, try again later, i'm restarting".

@ereslibre

Member Author

commented Jun 15, 2019

the server passes through these states that we are seeing, but i think we should instead be seeing an error that tells us "hey, try again later, i'm restarting".

I agree. However, it already has a healthcheck endpoint and we are ignoring it. I wonder how fair it is to ask for a specific initialization behavior while the apiserver health check is not reporting success.

Besides, the health check is useful in any case, to remove unhealthy apiservers from the pool after initialization.
