
Enhanced kubernetes version upgrades for workload clusters #3203

Closed
JTarasovic opened this issue Jun 17, 2020 · 27 comments
Labels
  • area/upgrades Issues or PRs related to upgrades
  • kind/feature Categorizes issue or PR as related to a new feature.
  • lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.
Milestone

Comments

@JTarasovic

User Story

As an operator, I would like to be able to easily update the Kubernetes version of my workload clusters to be able to stay on top of security patches and new features.

Detailed Description

The current procedure for updating the k8s version is to copy the MachineTemplate for the KCP, then update the KCP with the new version and a reference to the new MachineTemplate, which triggers a rollout. Rinse and repeat for the MachineDeployments.

Ideally, I'd be able to declare my intent to upgrade the workload cluster and that would be reconciled and rolled out for me.
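
For concreteness, a minimal sketch of that manual flow, assuming the Docker provider and the v1alpha3 API that was current at the time; all names and versions here are placeholders:

```yaml
# Step 1: templates are immutable, so copy the control plane's MachineTemplate
# under a new name (provider and names are illustrative).
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
kind: DockerMachineTemplate
metadata:
  name: my-cluster-control-plane-v1-19-1
spec:
  template:
    spec: {}   # same infrastructure settings as the previous template
---
# Step 2: bump the version on the KubeadmControlPlane and point it at the new
# template; this triggers a control plane rollout.
apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
kind: KubeadmControlPlane
metadata:
  name: my-cluster-control-plane
spec:
  replicas: 3
  version: v1.19.1
  infrastructureTemplate:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
    kind: DockerMachineTemplate
    name: my-cluster-control-plane-v1-19-1
  # kubeadmConfigSpec and the remaining fields stay as they were.
# Step 3: repeat the copy-and-update steps for each MachineDeployment
# (spec.template.spec.version plus its infrastructureRef).
```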

Anything else you would like to add:

Discussed at the 17 June 2020 weekly meeting.

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Jun 17, 2020
@fabriziopandini
Member

This issue requires a certain degree of coordination across several components, so the first question in my mind is where to implement this logic.
I don't think this should go at the cluster level, because the Cluster's main responsibility is the cluster infrastructure, so what about assuming this should be implemented in a separate extension (with its own CRD/controller)?

@vincepri
Member

/milestone v0.4.0

We should revisit in the v1alpha4 timeframe; this probably needs a more detailed proposal.

@k8s-ci-robot k8s-ci-robot added this to the v0.4.0 milestone Jun 18, 2020
@CecileRobertMichon
Contributor

cc @rbitia

Ria, this might fit into your "cluster group" proposal?

@JTarasovic
Author

JTarasovic commented Jun 18, 2020

We have a relatively small (but growing) number of clusters so we're currently doing upgrades sort of manually. Conceptually, we think about our clusters in 3 streams - alpha, beta and stable - and roll out upgrades and configuration changes according to stream.

Our plan right now is to have common configuration for a stream in a CR (StreamConfig) w/ a controller. The StreamConfig controller would reconcile to ClusterConfigs based on a label/annotation, with that resource's controller handling the actual cluster resource reconciliation (e.g. creation, k8s version upgrades, etc).[1]

I don't think that it's CAPI's responsibility to implement all of that (or any of it), but if we can do some of the common stuff (version upgrades) here, that seems like it would be super valuable for the whole community. It also seems like the logic would be broadly applicable - copy template, update KCP, rollout, copy template, update MDs, rollout, profit.[2]


[1] Names are illustrative and not definitive. Something, something hard problems in Computer Science.
[2] Grossly over-simplified here for effect.
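
As a rough illustration of the shape such a CR could take (entirely hypothetical; neither the kind nor its fields exist in CAPI):

```yaml
apiVersion: example.io/v1alpha1        # hypothetical API group
kind: StreamConfig
metadata:
  name: beta
spec:
  kubernetesVersion: v1.19.1           # desired version for every cluster in this stream
  clusterSelector:
    matchLabels:
      stream: beta                     # clusters opt in to the stream via a label
```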

@vincepri
Member

Thanks for the extra context @JTarasovic. From everything I'm hearing here, it might be worth considering some extra utilities/libraries/commands under clusterctl which could perform some variations of the concepts described above.

@seh

seh commented Jun 23, 2020

Ideally, I'd be able to declare my intent to upgrade the workload cluster and that would be reconciled and rolled out for me.

I find that if I change the "spec.version" field in an existing KubeadmControlPlane object and apply the change, usually the controllers will upgrade my control plane, without me introducing a new (AWS)MachineTemplate. It sounds like that's not supposed to work, and yet it does—most of the time. Why is that?

@JTarasovic
Author

Does it actually change the version of the running cluster - eg kubectl get no -o wide shows the new version?

It did not in our experience. It would roll the control plane instances but they'd still be on the previous version.

@CecileRobertMichon
Contributor

CecileRobertMichon commented Jun 24, 2020

This is how upgrading k8s version on control planes works currently: https://cluster-api.sigs.k8s.io/tasks/kubeadm-control-plane.html?highlight=rolling#how-to-upgrade-the-kubernetes-control-plane-version

Note that you might need to update the image as well if you are specifying the image to use in the machine template.
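
For example, with the AWS provider the machine template typically pins an AMI, so the copied template for the new version also needs its image updated (a sketch; names and the AMI ID are placeholders):

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
kind: AWSMachineTemplate
metadata:
  name: my-cluster-control-plane-v1-19-1   # new copy, since templates are immutable
spec:
  template:
    spec:
      instanceType: t3.large
      sshKeyName: default
      ami:
        id: ami-0123456789abcdef0           # placeholder: an image built for v1.19.1
```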

@seh

seh commented Jun 24, 2020

Does it actually change the version of the running cluster - eg kubectl get no -o wide shows the new version?

Yes, it shows the new version there.

@Arvinderpal Arvinderpal mentioned this issue Aug 3, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 22, 2020
@vincepri
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 28, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 27, 2020
@vincepri
Member

vincepri commented Jan 4, 2021

Any updates or action items here?

@JTarasovic
Author

I think the clusterctl rollout issue linked above is a good first approximation but I agree w/ @detiber's comment there:

propose support in upstream Kubernetes/kubectl/kubebuilder for a sub-resource type

as that should allow folks to build controllers on top of it.

I'm cool with closing this issue in favor of that.

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 3, 2021
@CecileRobertMichon
Contributor

I think the clusterctl rollout feature doesn't solve the problem of having to update the image + k8s version for every machine deployment / machine pool / kubeadm control plane that you want to upgrade as a user, although it does give more control over the rollout of machines. It would still be nice to have some sort of higher-order "upgrade my cluster" automation.

@craiglpeters @devigned and I were discussing this earlier today, and one thing that came up was maybe having a way to tell your management cluster which image to use for which k8s version and having the machine template look that up, instead of having to individually update the image version on each cluster. This would also allow patching images across all your clusters if you have to rebuild an image for the same k8s version (e.g. because of a CVE).
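
One way to picture that version-to-image mapping (purely illustrative; no such API exists in CAPI) is a lookup table in the management cluster that templates or controllers could resolve against:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: kubernetes-version-images     # hypothetical lookup table
  namespace: capi-system
data:
  v1.19.1: ami-0123456789abcdef0      # placeholder image IDs, one per Kubernetes version
  v1.18.8: ami-0fedcba9876543210      # republishing an entry would roll a patched image everywhere
```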

@CecileRobertMichon
Contributor

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Feb 10, 2021
@fiunchinho
Contributor

fiunchinho commented Mar 9, 2021

We have a relatively small (but growing) number of clusters so we're currently doing upgrades sort of manually. Conceptually, we think about our clusters in 3 streams - alpha, beta and stable - and roll out upgrades and configuration changes according to stream.

Our plan right now is to have common configuration for a stream in a CR (StreamConfig) w/ a controller. The StreamConfig controller would reconcile to ClusterConfigs based on a label/annotation, with that resource's controller handling the actual cluster resource reconciliation (e.g. creation, k8s version upgrades, etc).

I don't think that it's CAPI's responsibility to implement all of that (or any of it), but if we can do some of the common stuff (version upgrades) here, that seems like it would be super valuable for the whole community. It also seems like the logic would be broadly applicable - copy template, update KCP, rollout, copy template, update MDs, rollout, profit.

We are in a really similar situation with a large number of clusters and three different pipelines/streams for development/staging/production clusters. We are starting the development of a new component to handle this in a similar fashion (copy template, update KCP, update MachinePool, etc), so it'd be great if we could share tooling. We were also interested in making this component capable of orchestrating this upgrade process so we could, for instance, decide to upgrade node pools one after the other, with some wait period in between, instead of all at once.

If I understand it correctly, this proposal adds kubectl rollout-like subcommands to clusterctl, but this wouldn't solve the use cases discussed above.

Should we submit a new CAEP proposal for discussion?
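
As a strawman for the node-pool-by-node-pool orchestration described above (a hypothetical CRD, not a concrete API proposal):

```yaml
apiVersion: example.io/v1alpha1        # hypothetical API group; nothing like this exists in CAPI
kind: ClusterUpgrade
metadata:
  name: prod-cluster-to-v1-19-1
spec:
  clusterName: prod-cluster
  kubernetesVersion: v1.19.1
  nodePoolOrder:                       # pools are upgraded one after the other
    - control-plane
    - pool-a
    - pool-b
  pauseBetweenPools: 30m               # wait period before moving on to the next pool
```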

@enxebre
Member

enxebre commented Apr 8, 2021

Same use case here: looping over scalable machine resources, e.g. MachineDeployments, to upgrade them one by one against the current control plane version.

For scenarios where more control is required, it would possibly be good to have an autoUpgrade: false/true control per scalable machine resource, so you can leverage a more controlled upgrade for a given machine pool, e.g. #4346.
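
A sketch of that per-resource control (the autoUpgrade field is hypothetical; the rest follows the v1alpha3 MachineDeployment shape, with required fields trimmed for brevity):

```yaml
apiVersion: cluster.x-k8s.io/v1alpha3
kind: MachineDeployment
metadata:
  name: pool-a
spec:
  clusterName: my-cluster
  autoUpgrade: false        # hypothetical field: opt this pool out of automatic upgrades
  template:
    spec:
      clusterName: my-cluster
      version: v1.18.8      # stays pinned until the pool is explicitly upgraded
      # bootstrap.configRef, infrastructureRef, and the other fields are unchanged
```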

@smcaine

smcaine commented Apr 14, 2021

We have a similar use case: we are using GitOps + CAPI to upgrade our clusters. For now we have to create a new MachineTemplate, update the KCP, wait for that to finish, delete the old template, create a new MachineTemplate for the MachineDeployment, wait for the rollout, and delete the old MachineTemplate. An operator or an additional feature/resource that could handle this lifecycle as a whole (declaratively) would be ideal for us, so we could update the KCP and MachineDeployment machineTemplate references at the same time, let the cluster reconcile and upgrade the control plane and workers in the correct order, and then purge the unwanted MachineTemplates.

@enxebre
Member

enxebre commented Apr 21, 2021

This relates to the ClusterClass discussion #4430.
This will require a considerable amount of work and thinking to get it right. @vincepri is this work still intended to make it into v1alpha4, or can we move it to the next milestone?

/area upgrades

@k8s-ci-robot k8s-ci-robot added the area/upgrades Issues or PRs related to upgrades label Apr 21, 2021
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 20, 2021
@fabriziopandini
Member

What about closing this given the ClusterClass work?

@sbueringer
Member

Agree. This will be 100% covered by what we want to do with ClusterClass.
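
For context, a rough sketch of how that ends up looking with the ClusterClass/topology work: the desired version is declared once on the Cluster object and the topology controller rolls it out to the control plane and worker pools (class and pool names are placeholders):

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: my-cluster
spec:
  topology:
    class: my-cluster-class        # placeholder ClusterClass name
    version: v1.22.0               # bumping this single field upgrades the whole cluster
    workers:
      machineDeployments:
        - class: default-worker    # placeholder worker class defined in the ClusterClass
          name: pool-a
          replicas: 3
```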

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 21, 2021
@fabriziopandini
Member

/close
As per comment above this is part of ClusterClass; ongoing work in #5059

@k8s-ci-robot
Contributor

@fabriziopandini: Closing this issue.

In response to this:

/close
As per comment above this is part of ClusterClass; ongoing work in #5059

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
