Retry controller join for transient failures #962

Merged
merged 1 commit into from
Jun 7, 2021

Conversation

jnummelin
Collaborator

Signed-off-by: Jussi Nummelin <jnummelin@mirantis.com>

Issue
Fixes #866

What this PR Includes
This PR makes the controller join process retry the CA sync. With automation tools like k0sctl, controllers are joined so quickly that the sync API on the initial controller might not be up and running yet. When that happens, the join fails and the underlying init system (e.g. systemd) restarts the service. The join does eventually succeed, but it takes longer than necessary. Testing this fix with k0sctl shows that it shaves ~1 min off a 3+3 setup.

@jnummelin jnummelin added this to the 1.21.x June milestone Jun 6, 2021
@jnummelin jnummelin requested a review from a team as a code owner June 6, 2021 09:16
@jnummelin jnummelin requested review from kke and mviitane June 6, 2021 09:16
kke
kke previously approved these changes Jun 7, 2021
Comment on lines +138 to +144
		err = retry.Do(func() error {
			caData, err = joinClient.GetCA()
			if err != nil {
				return fmt.Errorf("failed to sync CA: %w", err)
			}
			return nil
		})
Contributor

FYI, the total retry time is ~52 seconds, and the pause between the last two attempts is 25 seconds.

Maybe could be improved by giving up early in case of some unrecoverable failures:

		err = retry.Do(
			func() error {
				....
			},
			retry.RetryIf(
				func(err error) bool {
					return !strings.Contains(err.Error(), "something")
				},
			),
		)
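
To make the "give up early" idea concrete without depending on any particular library, here is a minimal standard-library sketch of a retry loop with a retry predicate. All names in it (`retryIf`, `errNotReady`, `errBadToken`) are hypothetical and are not part of k0s or of the retry package used in this PR. It also matches errors with `errors.Is` on sentinel values rather than with `strings.Contains`, which is a deliberate swap and only works when the failing call returns errors that can be matched that way.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical sentinel errors, for illustration only.
var (
	errNotReady = errors.New("join API not ready")  // transient: worth retrying
	errBadToken = errors.New("invalid join token") // assumed unrecoverable
)

// retryIf calls op up to attempts times, doubling the pause after each
// failure, and stops early as soon as shouldRetry reports false.
func retryIf(attempts int, delay time.Duration, op func() error, shouldRetry func(error) bool) error {
	if attempts < 1 {
		return errors.New("attempts must be at least 1")
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		if !shouldRetry(err) {
			return fmt.Errorf("giving up after attempt %d: %w", i+1, err)
		}
		if i < attempts-1 {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("all %d attempts failed: %w", attempts, err)
}

func main() {
	calls := 0
	err := retryIf(5, 10*time.Millisecond,
		func() error {
			calls++
			if calls < 3 {
				return errNotReady // first two attempts fail transiently
			}
			return errBadToken // third attempt hits the unrecoverable case
		},
		func(err error) bool { return !errors.Is(err, errBadToken) },
	)
	fmt.Println("calls:", calls, "err:", err)
}
```

The predicate is only as good as the error classification behind it. If a transient failure is mistaken for an unrecoverable one, the join aborts early and falls back to the init-system restart that this PR is trying to avoid.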

Collaborator Author

@jnummelin jnummelin Jun 7, 2021

Maybe could be improved by giving up early in case of some unrecoverable failures

I actually did think of this, but it's just super hard to tell what is unrecoverable and what is not. E.g. with k0sctl (as it's super fast) the initial tries might give a plain IO timeout (the join API is not yet up on the first controller), some RBAC-related 40x (the initial controller has not yet pushed the RBAC stuff for join tokens) or something else.

FYI, the total retry time is ~52 seconds, and the pause between the last two attempts is 25 seconds.

I think this is OK for this case.
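
For reference, the ~52 second and 25 second figures quoted above are consistent with a plain doubling backoff. The sketch below assumes 10 attempts and a 100 ms initial delay that doubles after each failure. These values are an assumption about the retry library's defaults, not something stated in this PR, and `backoffSchedule` and `total` are hypothetical helpers. Jitter and the duration of the calls themselves are left out, so the real total would be slightly higher.

```go
package main

import (
	"fmt"
	"time"
)

// backoffSchedule returns the pauses between consecutive attempts when the
// delay starts at base and doubles after every failed attempt.
// Jitter is left out to keep the numbers deterministic.
func backoffSchedule(attempts int, base time.Duration) []time.Duration {
	if attempts < 2 {
		return nil
	}
	pauses := make([]time.Duration, 0, attempts-1)
	delay := base
	for i := 0; i < attempts-1; i++ {
		pauses = append(pauses, delay)
		delay *= 2
	}
	return pauses
}

// total sums all pauses in a schedule.
func total(pauses []time.Duration) time.Duration {
	var sum time.Duration
	for _, p := range pauses {
		sum += p
	}
	return sum
}

func main() {
	// Assumed values: 10 attempts, 100ms initial delay.
	pauses := backoffSchedule(10, 100*time.Millisecond)
	fmt.Println("last pause:", pauses[len(pauses)-1]) // 25.6s
	fmt.Println("total wait:", total(pauses))         // 51.1s
}
```

Ten attempts give nine pauses of 0.1 s, 0.2 s, and so on up to 25.6 s, which sum to 0.1 × (2⁹ − 1) = 51.1 s. That leaves a little under a second for jitter and request time before reaching the quoted ~52 seconds.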

Signed-off-by: Jussi Nummelin <jnummelin@mirantis.com>
@jnummelin jnummelin merged commit 3e209f6 into k0sproject:main Jun 7, 2021
@jnummelin jnummelin deleted the retry-controller-join branch June 7, 2021 11:04
Development

Successfully merging this pull request may close these issues.

CA syncking should be retried
2 participants