
Kops Rollout Strategies - WIP #4038

Closed · wants to merge 9 commits into base: master from gambol99:rollout
Conversation

10 participants
@gambol99 (Member) commented Dec 11, 2017

The current implementation only has two rollout strategies: cloud-only or a serial drain. This PR keeps the current behaviour, adds a few extra features, and also introduces duplication and re-scaling strategies.

  • Strategies can be specified / overridden on the command line, or they can be added to the InstanceGroups themselves (a rough sketch of the options follows this list)
  • The PR is written with the intent of a kops API server rather than a CLI, so error signalling and cancellation have been added to rollouts
  • The default rollout is retained: one by one, with or without a drain. Note, drain is now an option rather than an experimental feature and can be used with any of the rollout strategies.
  • Rollouts can occur in batches of X instances, and you can limit a rollout to a specific count of Y instances, i.e. when you want to test a rollout on only Y instances within an instancegroup
  • Instancegroups can also be rolled out concurrently via the node-batch option
  • Added a duplication strategy which copies the group and removes the source
  • Added a scale-up strategy which doubles the instancegroup size, can drain, and scales back
  • Added the default serial roll with additional count, drain and batch options
  • Note: I still need to add the tests and docs, but I'm raising this now to get feedback
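
For orientation, a rough sketch of how the per-group options described above might hang together; every name below is an illustrative assumption rather than this PR's final API.

package rollout

import "time"

// StrategyName selects how an instancegroup is rolled.
type StrategyName string

const (
	StrategySerial    StrategyName = "serial"    // default: roll instances one by one
	StrategyDuplicate StrategyName = "duplicate" // copy the group, then remove the source
	StrategyScaleUp   StrategyName = "scale-up"  // double the group, optionally drain, scale back
)

// Strategy captures the knobs described in the bullet points above.
// All names here are assumptions made for illustration.
type Strategy struct {
	Name           StrategyName  // which rollout strategy to use
	Drain          bool          // cordon and drain a node before terminating it
	Batch          int           // roll X instances at a time
	Count          int           // only roll Y instances in the group (e.g. to test a rollout)
	PostDrainDelay time.Duration // pause after draining before terminating
}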
@chrislovecnm (Member) commented Dec 11, 2017

Awesome - this is going to need a bunch of reviews. A thought that helped me and Justin was to pull apart the APIs and the business logic into separate PRs. The APIs are the important part.

I think we talked about this, but we want to follow the pattern of the Deployment:

  1. Deployment == instance group
  2. Replica set == machine set
  3. Pod == machine

I do not think we need the machine api in this yet, but I will let you make that call.

Again awesome - let me know

@gambol99 gambol99 force-pushed the gambol99:rollout branch 3 times, most recently from 127ac40 to 476b791 Dec 11, 2017

Yes bool
Force bool
// rollingUpdateOptions is the command Object for a Rolling Update.
type rollingUpdateOptions struct {

@justinsb (Member), Dec 14, 2017:

So we export these in the hope that someone embedding kops could call RunRollingUpdateCluster
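
For illustration, a minimal sketch of that embedding use case; the option fields and the run function below are placeholders assumed for this example, not the exported kops API.

package main

import (
	"context"
	"log"
)

// RollingUpdateOptions stands in for the exported options struct; only a
// couple of illustrative fields are shown.
type RollingUpdateOptions struct {
	ClusterName string
	Yes         bool
}

// RunRollingUpdateCluster is a placeholder for the exported entry point an
// embedder would call; the real function lives in kops with its own signature.
func RunRollingUpdateCluster(ctx context.Context, o *RollingUpdateOptions) error {
	// ... drive the rollout here ...
	return nil
}

func main() {
	opts := &RollingUpdateOptions{ClusterName: "example.k8s.local", Yes: true}
	if err := RunRollingUpdateCluster(context.Background(), opts); err != nil {
		log.Fatalf("rolling update failed: %v", err)
	}
}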


// ClusterName is the name of the kops cluster
ClusterName string
// Count provides a optional argument to indicate to only perform a subnet of instances with one or remove instancegroups

@justinsb (Member), Dec 14, 2017:

Comment here doesn't really make sense (to me)

// InitDefaults are some sane defaults for the rollouts
func (o *rollingUpdateOptions) initDefaults() {
o.BastionInterval = 5 * time.Minute
o.Batch = 0

@justinsb (Member), Dec 14, 2017:

Surprised this isn't 1

o.BastionInterval = 5 * time.Minute

o.PostDrainDelay = 90 * time.Second
o.PostDrainDelay = 0

@justinsb (Member), Dec 14, 2017:

I think we should avoid changing the behaviour (in this direction) unless we have a strong reason to do so. Is there a reason?

cmd.Flags().BoolVar(&options.CloudOnly, "cloudonly", options.CloudOnly, "Perform rolling update without confirming progress with k8s")

cmd.Flags().BoolVar(&drainOption, "drain", drainOption, "Indicates we should drain the node perform terminating")

@justinsb (Member), Dec 14, 2017:

drain the node before terminating ?

@gambol99 (Author, Member), Feb 22, 2018:

just means we do a cordon and drain to move pods before terminating the node

@justinsb (Member), Feb 28, 2018:

I meant I think there's a typo : s/perform/before/g ?
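
For readers following along, gambol99's point above is that drain here means cordoning the node and evicting its pods before termination. Below is a minimal client-go sketch of the cordon half (the eviction half is omitted); this is purely illustrative and not the PR's code, and the node name is a placeholder.

package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// cordon marks a node unschedulable so no new pods land on it; draining
// (evicting the existing pods) would follow before the node is terminated.
func cordon(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	node.Spec.Unschedulable = true
	_, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}
	// "node-to-roll" is a placeholder node name for the example
	if err := cordon(context.Background(), client, "node-to-roll"); err != nil {
		log.Fatal(err)
	}
}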

newMinSize = int32(p.GroupUpdate.CloudGroup.Size() * 2)
newMaxSize = int32(p.GroupUpdate.CloudGroup.MaxSize)
if newMinSize > newMaxSize {
newMaxSize = newMinSize

@justinsb (Member), Dec 14, 2017:

I'm missing something here. If min = max , won't this leave the cluster unchanged in size?

@gambol99 (Author, Member), Feb 22, 2018:

I'm basically using the min as a desired count here, so I double the original size, and if the new max size doesn't support it we set them equal.
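
To spell that out with a worked example (an illustrative sketch, not the exact code): a group currently running 3 instances with a max of 4 would be bumped to min 6 / max 6 for the duration of the rollout, then restored afterwards.

package rollout

// scaleUpSizes mirrors the doubling logic described above: the current size is
// treated as the desired count and doubled, and max is raised when it cannot
// accommodate the new min.
func scaleUpSizes(currentSize, maxSize int32) (newMin, newMax int32) {
	newMin = currentSize * 2
	newMax = maxSize
	if newMin > newMax {
		// the existing max can't hold the doubled count, so set them equal
		newMax = newMin
	}
	return newMin, newMax
}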

config.Spec.MaxSize = &newMaxSize

// @step: attempt to adjust the size of the instancegroup
if _, err := update.Clientset.InstanceGroupsFor(update.Cluster).Update(config); err != nil {

@justinsb (Member), Dec 14, 2017:

Again it feels we really want that intermediate object


// @check if we need to drain the nodes
if !update.CloudOnly && strategy.Drain {
update.Infof("attempting to drian the nodes from instancegroup: %s, batch: %d", name, strategy.Batch)

@justinsb (Member), Dec 14, 2017:

typo: drian

// fiddling with the max size
{
update.Infof("adjusting the size on instancegroup: %s back to original min/max (%d/%d)", name, oldMinSize, oldMaxSize)
config.Spec.MinSize = &oldMinSize

@justinsb (Member), Dec 14, 2017:

I'm a little concerned about the state we're "stuck in" if we crash here also
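
One illustrative mitigation for that concern (a sketch under assumed types, not what this PR does): register the restore before mutating the group, so an error on the happy path still rolls the sizes back. A crash of the process itself would still leave the group resized, which is the argument for keeping that state in an API object.

package rollout

// groupSizes is a stand-in for the instancegroup's min/max configuration.
type groupSizes struct {
	MinSize, MaxSize int32
}

// withTemporaryResize registers the rollback via defer before the rollout
// runs, so any error returned by roll() still triggers a best-effort restore
// of the original min/max.
func withTemporaryResize(g *groupSizes, newMin, newMax int32, roll func() error) error {
	oldMin, oldMax := g.MinSize, g.MaxSize
	g.MinSize, g.MaxSize = newMin, newMax
	defer func() {
		// best-effort restore of the original sizes
		g.MinSize, g.MaxSize = oldMin, oldMax
	}()
	return roll()
}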

fmt.Printf("\n")
fmt.Printf(starline)
fmt.Printf("\n")
glog.Infof("[UPDATE] A new kops version is available/recommended: %s\n", recommended)

@justinsb (Member), Dec 14, 2017:

I like the [UPDATE] delineation, but I previously thought that the whitespace and starline were needed so that it really stood out to users.

@gambol99 (Author, Member), Feb 22, 2018:

Happy to put this one back :-)

@chrislovecnm (Member), Mar 11, 2018:

Almost wondering if we should pull this out, and get it in another PR. It is kinda outside the scope of the PR.

@justinsb (Member) commented Dec 14, 2017

So I'm torn - this is a great cleanup, and I really want the functionality, but I think we're trying to get there via the machines API ( https://github.com/kubernetes/kube-deploy/tree/master/cluster-api ), which I think wouldn't have the problem of what happens if the rolling update is interrupted mid-flow, because we would have the state in the API. But there are still open issues here - like how do we do an update where we can't reach the API server in that world (perhaps we can only do a timed update).

Can we discuss how we can get this - either in office hours or some other convenient time?

@k8s-github-robot (Contributor) commented Dec 18, 2017

@gambol99 PR needs rebase

@chrislovecnm (Member) commented Jan 26, 2018

@gambol99 bump, I know you are super busy, but would love to get this updated, and possibly in. This is a huge win for kops.

@gambol99 (Author, Member) commented Jan 30, 2018

hi @chrislovecnm ... I'm going to be at this week's office hours, so we can discuss it then?

@gambol99 gambol99 force-pushed the gambol99:rollout branch from 476b791 to f61ed03 Jan 30, 2018

@gambol99 (Author, Member) commented Jan 30, 2018

/hold

@gambol99 gambol99 force-pushed the gambol99:rollout branch from f61ed03 to 22538b6 Jan 30, 2018

@gambol99 (Author, Member) commented Feb 2, 2018

Need to add tests and add the interactive flag back, as I removed it in a rebase :-)

@k8s-github-robot (Contributor) commented Feb 2, 2018

@gambol99 PR needs rebase

@gambol99 gambol99 force-pushed the gambol99:rollout branch from 22538b6 to 1fe70f6 Feb 3, 2018

@gambol99 gambol99 force-pushed the gambol99:rollout branch from 1fe70f6 to 71a3571 Feb 3, 2018

@k8s-github-robot (Contributor) commented Feb 4, 2018

@gambol99 PR needs rebase

@justinsb justinsb added this to the 1.10 milestone May 26, 2018

@gambol99 gambol99 force-pushed the gambol99:rollout branch from 45ba870 to d236512 Jun 5, 2018

@rdrgmnzs (Contributor) commented Jun 5, 2018

@gambol99 @justinsb it would be amazing to have this in 1.10, is there anything missing in this PR still?

@gambol99 gambol99 force-pushed the gambol99:rollout branch 2 times, most recently from 6ec4dd2 to 6767c48 Jun 11, 2018

@whithajess commented Jun 14, 2018

Could we just get an MVP out for this? There is a serious need for some of this functionality.

gambol99 added some commits Nov 10, 2017

Rolling Update Strategies
- working on rollout strategies
- fixed up the controller logic for coordinating the goroutines
- changed the name of the rollout strategy field to Name, rather than rollout
- added and tested the scale-up strategy
- using the runtime group size rather than the minimum size
- fixing up various vetting issues I came across along the way
- changed from a NewCloner to a DeepCopy()
- adding a termination policy method to the cloud interface; this is important to ensure we don't leave the instancegroup in a position where we can't scale down after an error (sketched after this list)
- updated the CLI documentation for cluster rolling update
- fixing the typo on the command line
- added a default of 20 seconds to the post-drain delay
- ensure we don't attempt to drain a bastion node
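
A rough sketch of what the termination-policy hook mentioned above might look like; the interface and method names are assumptions for illustration only, not the actual kops cloud interface (the policy values mirror AWS auto-scaling-group termination policy names).

package rollout

// TerminationPolicy is a hypothetical knob controlling which instances an
// autoscaling group removes first when it is scaled back down after a rollout.
type TerminationPolicy string

const (
	TerminateOldestFirst TerminationPolicy = "OldestInstance"
	TerminateNewestFirst TerminationPolicy = "NewestInstance"
)

// Cloud is a placeholder for the cloud interface the commit refers to; only
// the illustrative method is shown. Setting the policy before scaling down
// helps ensure a failed rollout can still be scaled back safely.
type Cloud interface {
	SetTerminationPolicy(groupName string, policy TerminationPolicy) error
}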

@gambol99 gambol99 force-pushed the gambol99:rollout branch from 6767c48 to aba058e Jun 18, 2018

@k8s-ci-robot (Contributor) commented Jun 24, 2018

@gambol99: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kimxogus (Contributor) commented Jul 13, 2018

Really hope this gets merged

@justinsb justinsb modified the milestones: 1.10, 1.11 Jul 18, 2018

@justinsb (Member) commented Jul 18, 2018

We discussed this in kops office hours, and are delaying it because of the overlap with the cluster-api / machines-api

@merlineus commented Sep 2, 2018

@gambol99 great job, thank you! Can't wait for it to be merged :)

What conditions are taken into account for cluster validation? Is it the pod livenessProbe or readinessProbe?

It is very useful to use storage-optimized nodes such as i3 or c5d for MongoDB or PostgreSQL clusters. The most common way is to declare the cluster members as a StatefulSet spread across different instancegroups (like the kops master nodes), where each member should be rolled only after successful data replication on the previous one. The best way to detect that replication has finished successfully is the readinessProbe. So is it possible to achieve this behaviour with your strategy?

@justinsb justinsb modified the milestones: 1.11, 1.12 Oct 10, 2018

@sstarcher (Contributor) commented Dec 15, 2018

@justinsb @gambol99 could we revisit this, or are we still waiting on cluster-api / machines-api changes? If any assistance is needed, I can probably make some time to help.

@gambol99 gambol99 closed this Feb 11, 2019

@gambol99 (Author, Member) commented Feb 11, 2019

Closing the PR ... it was discussed in kops office hours a long time back and we decided to forgo the merge as it conflicted with the cluster-api.
