Retry controller join for transient failures #962

jnummelin · 2021-06-06T09:16:39Z

Signed-off-by: Jussi Nummelin <jnummelin@mirantis.com>

Issue
Fixes #866

What this PR Includes
This PR makes the controller join process retry the CA sync. Automation tools such as k0sctl join controllers so quickly that the sync API on the initial controller might not be up and running yet. When that happens the join fails, and the underlying init system (e.g. systemd) restarts the service. The join does eventually succeed, but it takes longer than necessary. Testing this fix with k0sctl shows it shaves ~1 min off the 3+3 setup.

kke · 2021-06-07T08:52:56Z

cmd/controller/controller.go

+	err = retry.Do(func() error {
+		caData, err = joinClient.GetCA()
+		if err != nil {
+			return fmt.Errorf("failed to sync CA: %w", err)
+		}
+		return nil
+	})


FYI, total retry time is ~52 seconds, pause between two last attempts is 25 seconds.

Maybe could be improved by giving up early in case of some unrecoverable failures:

err = retry.Do(
	func() error { .... },
	retry.RetryIf(
		func(err error) bool {
			return !strings.Contains(err.Error(), "something")
		},
	),
)

> Maybe could be improved by giving up early in case of some unrecoverable failures

I actually did think of this, but it's really hard to tell which failures are unrecoverable and which are not. E.g. with k0sctl (as it's super fast) the initial tries might fail with a plain IO timeout (the join API is not yet up on the first controller), an RBAC-related 40x (the initial controller has not yet pushed the RBAC setup for join tokens), or something else.

> FYI, total retry time is ~52 seconds, pause between two last attempts is 25 seconds.

I think this is OK for this case.


jnummelin added this to the 1.21.x June milestone Jun 6, 2021

jnummelin requested a review from a team as a code owner June 6, 2021 09:16

jnummelin requested review from kke and mviitane June 6, 2021 09:16

kke previously approved these changes Jun 7, 2021


Retry controller join for transient failures

85a7104

Signed-off-by: Jussi Nummelin <jnummelin@mirantis.com>

jnummelin dismissed kke’s stale review via 85a7104 June 7, 2021 09:21

jnummelin force-pushed the retry-controller-join branch from 4298e92 to 85a7104 Compare June 7, 2021 09:21

kke approved these changes Jun 7, 2021


jnummelin merged commit 3e209f6 into k0sproject:main Jun 7, 2021

jnummelin deleted the retry-controller-join branch June 7, 2021 11:04
