Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make the kubeadm bootstrap token TTL configurable #2770

Closed
sbueringer opened this issue Mar 25, 2020 · 23 comments
Closed

make the kubeadm bootstrap token TTL configurable #2770

sbueringer opened this issue Mar 25, 2020 · 23 comments
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.

Comments

@sbueringer
Copy link
Member

sbueringer commented Mar 25, 2020

What steps did you take and what happened:
I setup an e2e test for CAPO and in some cases nodes were not able to join the cluster.

What did you expect to happen:
Nodes are able to join the cluster

Anything else you would like to add:

(See also: https://kubernetes.slack.com/archives/C8TSNPY4T/p1585035222084200)

To be clear that's not a problem for me anymore, but maybe for somebody else. I'm opening this issue only to discuss if we should change anything. My use case is already solved.

I'm currently setting up e2e tests for CAPO in OpenLab. These tests are not using the built-in images / binaries from the OS. Before kubeadm runs, CI artifacts are downloaded from the kubernetes-release-dev GCS bucket. Because of oversubscription of the underlying OpenStack and some not-ideal bootstrap code (installing and using gsutil instead of just curl), this sometimes took over 20 min. The kubeadm join call then failed because the bootstrap token was not valid anymore.

The kubeadm bootstrapper generates bootstrap tokens with a (hard-coded) default TTL of 15 minutes. These tokens are renewed/refreshed until the infrastructure (e.g. OpenStackMachine) is marked as ready. In case of CAPA/CAPO this means the VM is in state running/active. The problem is now that if it takes longer then 15 minutes after the VM is in state running/active to execute kubeadm join, so the bootstrap token won't be valid anymore. As I said I solved my problem already (by speeding up the ci artifacts download).

Some ideas how this can be solved, if we want to change anything:

  • increase the hard coded default TTL
  • make the TTL configurable
  • refresh the token until the node joined the cluster (if we want, we can even revoke it after the node has joined)

Environment:

  • Cluster-api version: v0.3.2
  • Minikube/KIND version: 0.7.0
  • Kubernetes version: (use kubectl version): 1.17.3

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Mar 25, 2020
@sbueringer sbueringer changed the title Increase kubeadm bootstrap token ttl Increase kubeadm bootstrap token TTL Mar 25, 2020
@detiber
Copy link
Member

detiber commented Mar 25, 2020

  • increase the hard coded default TTL

I don't think this is necessarily something that we would want to do, unless (as mentioned below) we add revocation code after the related node has successfully joined.

  • make the TTL configurable
  • refresh the token until the node joined the cluster (if we want, we can even revoke it after the node has joined)

This makes a lot of sense given that we already refresh the token periodically before the infrastructure is provisioned.

Since there are already other controllers where we are adding watches to the workload cluster for Node resources, I do not see any reason why we wouldn't also do the same here, which would more easily allow us to add code to revoke the tokens after node joins.

If we do add code to revoke the tokens, we would have to be cautious that it wouldn't impact the use of a shared bootstrap token by MachinePools and would only affect tokens that are used by Nodes generated by individual Machines.

@neolit123
Copy link
Member

/remove-kind bug
/kind future

@k8s-ci-robot k8s-ci-robot removed the kind/bug Categorizes issue or PR as related to a bug. label Mar 25, 2020
@k8s-ci-robot
Copy link
Contributor

@neolit123: The label(s) kind/future cannot be applied, because the repository doesn't have them

In response to this:

/remove-kind bug
/kind future

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@neolit123
Copy link
Member

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Mar 25, 2020
@ncdc ncdc added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Mar 25, 2020
@ncdc ncdc added this to the Next milestone Mar 25, 2020
@prankul88
Copy link
Contributor

/assign

  • make the TTL configurable
  • refresh the token until the node joined the cluster (if we want, we can even revoke it after the node has joined)

@detiber You would suggest the latter one?

And if we plan the former, what would be the correct way to make TTL configurable?

@vincepri
Copy link
Member

vincepri commented Apr 6, 2020

Let's work on make it configurable, we can add a optional flag API field with the same default we have today to KubeadmConfig, this way it'll be backward-compatible and it won't be a breaking change.

@detiber
Copy link
Member

detiber commented Apr 6, 2020

@vincepri If we want to go down the configurable route, I'm wondering if a field might be better, since one might want a different TTL based on which infrastructure provider you may be using for a given workload cluster.

@vincepri
Copy link
Member

vincepri commented Apr 6, 2020

flag/field, yes I mistyped

@prankul88
Copy link
Contributor

@vincepri @detiber Thanks.

@zjs
Copy link

zjs commented May 4, 2020

refresh the token until the node joined the cluster (if we want, we can even revoke it after the node has joined)

This feels like a bug, not a feature, to me. What's the remediation if the token expires before the node joins the cluster?

@vincepri
Copy link
Member

vincepri commented May 4, 2020

@zjs There is no automatic remediation today, users can setup a MachineHealthCheck to remediate machines that weren’t able to join a cluster after a specific timeout

@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 2, 2020
@neolit123
Copy link
Member

/retitle make the kubeadm bootstrap token TTL configurable
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 3, 2020
@k8s-ci-robot k8s-ci-robot changed the title Increase kubeadm bootstrap token TTL make the kubeadm bootstrap token TTL configurable Aug 3, 2020
@detiber
Copy link
Member

detiber commented Aug 3, 2020

refresh the token until the node joined the cluster (if we want, we can even revoke it after the node has joined)

This feels like a bug, not a feature, to me. What's the remediation if the token expires before the node joins the cluster?

Refreshing the token does not change the token in this case, it only extends the expiration time for the token, which is actually the current behavior:

// refreshToken extends the TTL for an existing token
func refreshToken(c client.Client, token string) error {
substrs := bootstraputil.BootstrapTokenRegexp.FindStringSubmatch(token)
if len(substrs) != 3 {
return errors.Errorf("the bootstrap token %q was not of the form %q", token, bootstrapapi.BootstrapTokenPattern)
}
tokenID := substrs[1]
secretName := bootstraputil.BootstrapTokenSecretName(tokenID)
secret := &v1.Secret{}
if err := c.Get(context.TODO(), client.ObjectKey{Name: secretName, Namespace: metav1.NamespaceSystem}, secret); err != nil {
return err
}
if secret.Data == nil {
return errors.Errorf("Invalid bootstrap secret %q, remove the token from the kubadm config to re-create", secretName)
}
secret.Data[bootstrapapi.BootstrapTokenExpirationKey] = []byte(time.Now().UTC().Add(DefaultTokenTTL).Format(time.RFC3339))
return c.Update(context.TODO(), secret)
}

We would just need to extend the behavior longer than we currently do (until Machine.Status.InfrastructureReady) to something more indicative of bootstrapping being completed, such as the presence of a NodeRef.

@vincepri
Copy link
Member

vincepri commented Aug 3, 2020

Could we also detect if a token is expired (for some reason) and generate a new one?

@detiber
Copy link
Member

detiber commented Aug 3, 2020

Could we also detect if a token is expired (for some reason) and generate a new one?

Potentially, but my concern there would be around what happens for anything that may be attempting to use that bootstrap token, we don't really have a way to update it for anything that may be in the process of bootstrapping. Probably not much of a concern if we are talking about MachineDeployments/MachineSets/Machines, since an appropriately deployed MHC should "fix" things up eventually.

It might be more complicated when dealing with MachinePools, especially if we consider a potential future implementation where scaling is deferred to an external autoscaling integration.

@vincepri
Copy link
Member

vincepri commented Aug 3, 2020

I was thinking about MachinePool yeah, that might be a separate issue though

@vincepri
Copy link
Member

Possibly can be superseded by #3762

@superbrothers
Copy link
Contributor

superbrothers commented Dec 24, 2020

Our baremetal machine takes a long time to set up,so kubeadm bootstrap token expires.

error execution phase preflight: couldn't validate the identity of the API Server: could not find a JWS signature in the cluster-info ConfigMap for token ID "r7haco"

We will work to reduce the time it takes to set up our machine, but we want TTL to be configurable.

@superbrothers
Copy link
Contributor

TIL: kubeadm bootstrap token TTL has already been configurable.

	fs.DurationVar(&kubeadmbootstrapcontrollers.DefaultTokenTTL, "bootstrap-token-ttl", 15*time.Minute,
		"The amount of time the bootstrap token will be valid")

Therefore, the value of DefaultTokenTTL in bootstrap/kubeadm/controllers/token.go has no effect. I was confused about changing DefaultTokenTTL (It looks like a constant variable) with an argument in bootstrap/kubeadm/main.go.

@vincepri
Copy link
Member

vincepri commented Jan 4, 2021

The token should also be renewed if the Machine's Infrastructure doesn't transition to Ready state

@vincepri
Copy link
Member

vincepri commented Feb 8, 2021

/close

@k8s-ci-robot
Copy link
Contributor

@vincepri: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.
Projects
None yet
Development

No branches or pull requests

10 participants