Parallelize and improve rolling updates even more #1718

Closed
hubt opened this issue Jan 31, 2017 · 21 comments · Fixed by #8271
Labels
area/rolling-update lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@hubt commented Jan 31, 2017

Repeated from #1134 (comment)
It looks like rolling update only parallelizes by rolling one node from each instance group concurrently. In a medium or large cluster, that's going to take a while. Rather than having to wait for an hour or temporarily expand the ASG, it'd be nice if I could specify a roll concurrency parameter, either as an absolute number or as a percentage, similar to how a Deployment's rollingUpdate strategy has maxUnavailable. It's true that in a heterogeneous cluster it could simultaneously roll all your big nodes, but that's an acceptable risk to save hours of waiting.

If you want to be fancy, you could attempt to auto-detect an acceptable roll rate by making sure there are no Pending pods, but that's probably too tricky for right now.

Another good suggestion was to pre-scale the ASG before rolling.
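
For illustration only, a minimal shell sketch of the pre-scale idea combined with the "no Pending pods" gate, assuming a single node ASG named nodes.example.com and placeholder numbers (none of this is something kops does today):

  # Pre-scale the node ASG before the roll to create headroom for evicted pods.
  aws autoscaling set-desired-capacity \
    --auto-scaling-group-name nodes.example.com \
    --desired-capacity 12   # e.g. current desired (10) plus 2 spare nodes

  # Between batches, wait until there are no Pending pods anywhere in the cluster.
  while kubectl get pods --all-namespaces \
        --field-selector=status.phase=Pending --no-headers | grep -q .; do
    echo "Pending pods found, holding the roll..."
    sleep 30
  done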

@ese (Contributor) commented Jan 31, 2017

Trying to consolidate #1452 here

Rolling-update has some issues that should be handled:

  • Upgrading a cluster with high resource utilization can leave evicted pods unschedulable until new nodes boot up.
  • Large clusters can take a long time to complete a rolling update.

To handle these issues, rolling-update could adopt a strategy that brings up and terminates nodes at the same time, N instances at a time:

  • Bring up the first N nodes (maxSurge).
  • Be smart about the number of nodes it can terminate, or use a maxUnavailable parameter.

Should these parameters (maxSurge and maxUnavailable) be defined per cluster or per instance group? These values are surely tightly coupled to the min and max parameters of the ASGs, and can probably be inferred from those plus the current desired capacity.

Regarding being smarter and knowing whether there are enough resources to drain a node, the cluster autoscaler can serve as inspiration: https://github.com/kubernetes/contrib/blob/master/cluster-autoscaler

The simple flow should be:

before_desired_capacity = current desired capacity
while instances need updating:
  scale out the IG to desired capacity = (before_desired_capacity + maxSurge)
  validate the cluster
  detach instances from the IG down to capacity = (before_desired_capacity - maxUnavailable) and terminate them
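
A rough shell sketch of a single iteration of that flow, assuming an instance group backed by one ASG named nodes.example.com, maxSurge=2, maxUnavailable=1, and pre-computed variables for the old node and instance to remove (all names and numbers are placeholders):

  # Scale out: raise desired capacity by maxSurge so replacement nodes boot first.
  aws autoscaling set-desired-capacity \
    --auto-scaling-group-name nodes.example.com \
    --desired-capacity $((BEFORE_DESIRED + 2))

  # Validate the cluster before removing anything.
  kops validate cluster

  # Remove up to maxUnavailable old nodes: drain, detach from the ASG, terminate.
  kubectl drain "$OLD_NODE" --ignore-daemonsets --delete-local-data
  aws autoscaling detach-instances \
    --auto-scaling-group-name nodes.example.com \
    --instance-ids "$OLD_INSTANCE_ID" \
    --should-decrement-desired-capacity
  aws ec2 terminate-instances --instance-ids "$OLD_INSTANCE_ID"

Detaching with --should-decrement-desired-capacity keeps the ASG from immediately launching a replacement for the instance that is about to be terminated.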

@chrislovecnm (Contributor)

Another idea that @hubt mentioned on chat is that some users may want to cordon all nodes and wait for the new node before the pods are rescheduled.

@chrislovecnm (Contributor)

I realized that my algorithm will take forever if the number of nodes is more than a handful.
So an alternate and easier implementation, at least to begin with, could be as follows (a rough shell sketch of the eviction step follows the advantages below):

  • Start as many new nodes as there are old nodes.
  • Issue pod evictions from the old nodes (I am assuming that k8s will keep trying to find nodes to move the pods to; once the new nodes are ready, all pods will be evicted).
  • As each node's eviction is completed, remove it from the ASG and shut it down.

Advantages of this algorithm are:

  • simpler and quicker to implement, at least for a first cut
  • will not take hours or days, even if someone has 100s or 1000s of nodes
  • can easily be enhanced with a min/max surge algorithm afterwards
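
A sketch of the eviction step of that alternative, assuming OLD_NODES holds the names of the pre-upgrade nodes (collected with kubectl get nodes before the new capacity was added); everything here is a placeholder, not existing kops behaviour:

  # Evict pods from every old node; the scheduler keeps retrying and places the
  # evicted pods on the new nodes as they become Ready.
  for node in $OLD_NODES; do
    kubectl drain "$node" --ignore-daemonsets --delete-local-data &
  done
  wait

  # As each old node finishes draining, detach it from the ASG and shut it down:
  #   aws autoscaling detach-instances --instance-ids <id> \
  #     --auto-scaling-group-name nodes.example.com --should-decrement-desired-capacity
  #   aws ec2 terminate-instances --instance-ids <id>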

thanks @vinayagg

@chrislovecnm (Contributor)

I need to also research the Pod Disruption stuff upstream.

@edude03 commented Jun 23, 2017

Prescaling the number of nodes seems like a good idea, but what about masters? It's my understanding that kops doesn't yet manage etcd membership.

@chrislovecnm (Contributor)

I would not recommend prescaling masters, as we have already run into problems with HA upgrades. Headroom on the masters is usually not a huge issue, and the code running on the masters 'should' be built for failover. The practice for upgrading masters is one at a time.

@chrislovecnm (Contributor)

I have put in a PR, #2818, that starts to address this issue.

This PR introduces three strategies that influence node replacement. All masters and bastions are rolled sequentially before the nodes, and the strategy flag does not influence their replacement. The second and third strategies are gated behind the feature flags mentioned below.

  1. "default" - A node is drained then deleted. The cloud then replaces the node automatically.

  2. "create-all-new-ig-first" - All node instance groups are duplicated first; then all old nodes are cordoned.

  3. "create-new-by-ig" - As each node instance group rolls, the instance group is duplicated, then all old nodes are cordoned.

The second and third options create new instance groups. To use this ALPHA feature you need to enable the +DrainAndValidateRollingUpdate,+RollingUpdateStrategies feature flags.

The second option pre-creates a whole new set of instance groups first, while the third option only creates a new ig as each ig is rolled.

The first and default option is the original code path.
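
For anyone trying it, kops feature flags are set through the KOPS_FEATURE_FLAGS environment variable, so enabling the ALPHA flags above and rolling would look roughly like this (the exact strategy selection mechanism comes from the PR and may change):

  export KOPS_FEATURE_FLAGS="+DrainAndValidateRollingUpdate,+RollingUpdateStrategies"
  kops rolling-update cluster --yes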

@dcowden commented Oct 11, 2017

@chrislovecnm We've been talking about rolling updates a lot here over the past week. We've had a great experience with a running kops cluster [thanks], but we've had less great experiences with rolling updates. It seems that it's usually cluster networking [weave]. We think it's due to weave's decentralized approach to storing IPs.

Anyhow, we've concluded that rolling updates will take a long time no matter what, and we'd rather not risk the whole cluster at once.

We would like to take rolling updates slow. Like, really slow. For example, update a couple of nodes a day, with the idea that it's always going on -- an 'evergreen cluster' if you will. This way, we never risk a big downtime, and we never have more than a couple of nodes broken -- we can just cordon them and deal with it.

Is this a stupid idea? If not, I think what we want is a kops rolling update where the time between nodes is on the order of a day or more, which I think requires moving the state of the in-progress upgrade to the server side, or maybe taking a whole other strategy.

What are your thoughts?

@toidiu commented Oct 11, 2017

@dcowden We personally don't use rolling restarts because the cluster is fronted by HAProxy, and by the end of a rolling restart we would end up with stale records. However, we considered it a bit and decided it was actually too slow to be useful.

The restriction you pose makes perfect sense in the context of weave - but I would point out that your suggestion is to peg the performance to the 'lowest performing component' (not to hate on weave, it serves a purpose, but its peer-to-peer discovery makes it not so ideal for high-frequency node changes). You might want to look at calico and flannel, which use etcd for storing their state (disclaimer: I am nowhere close to being an expert in CNI technology).

Instead, what you are suggesting could be an additional feature on top of rolling update: "Roll the cluster with X nodes over Y time."

@dcowden commented Oct 11, 2017

Hi @toidiu. We are currently stuck with weave. Canal is our first choice, but unfortunately we have extremely high security requirements, and one of them is that our pod network is encrypted. Weave supports this. Flannel is working on adding encryption, and weave is working on the problems associated with the decentralized model. For now, we're stuck until one of them moves.

@dcowden commented Oct 12, 2017

As a side note, we're also looking into using wireguard. Has anyone tried that with kubernetes?

@chrislovecnm (Contributor)

Anyhow, we've concluded that rolling updates will take a long time no matter what, and we'd rather not risk the whole cluster at once.

Ish ... we have some ideas. As much as it has been tested, rolling-update is not that risky. It uses the same HA principles as failover.

We would like to take rolling updates slow. Like, really slow. For example, update a couple of nodes a day, with the idea that it's always going on -- an 'evergreen cluster' if you will. This way, we never risk a big downtime, and we never have more than a couple of nodes broken -- we can just cordon them and deal with it.

You can, but how many nodes? If you do it that slowly, just cordon and drain a node and delete it in EC2. Wash, rinse, repeat. You could build a simple controller that kills the oldest node :) It gets complicated managing the launch configurations :)
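
A minimal sketch of that wash-rinse-repeat loop, replacing one node per run (e.g. from a daily cron job), assuming an AWS kops cluster where node names match EC2 private DNS names and worker nodes carry the kubernetes.io/role=node label; the names and filters are illustrative:

  # Pick the oldest worker node.
  NODE=$(kubectl get nodes -l kubernetes.io/role=node \
           --sort-by=.metadata.creationTimestamp \
           --no-headers -o custom-columns=NAME:.metadata.name | head -n 1)

  # Cordon and drain it, then terminate the backing instance; the ASG replaces
  # it with a node built from the current launch configuration.
  kubectl cordon "$NODE"
  kubectl drain "$NODE" --ignore-daemonsets --delete-local-data
  INSTANCE_ID=$(aws ec2 describe-instances \
      --filters "Name=private-dns-name,Values=$NODE" \
      --query 'Reservations[0].Instances[0].InstanceId' --output text)
  aws ec2 terminate-instances --instance-ids "$INSTANCE_ID"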

Is this a stupid idea? If not, I think what we want is a kops rolling update where the time between nodes is on the order of a day or more, which I think requires moving the state of the in-progress upgrade to the server side, or maybe taking a whole other strategy.

I think you are overthinking it, personally. But I understand your concerns.

This is another approach, which is kinda what I am going to code next (a rough manual sketch follows the list):

  1. upgrade the masters (you have to upgrade them first, and at the same time; no choice)
  2. create a new ig, and let stuff migrate over there. Only when stuff deploys will it move over
  3. new ig is stable
  4. cordon the old ig
  5. drain it
  6. kill the ig
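
A rough manual version of steps 2-6 using existing kops commands, assuming an old instance group named nodes, a new one named nodes-v2, and that kops labels nodes with kops.k8s.io/instancegroup (all names are placeholders):

  # 2. Create a new instance group next to the old one and apply the change.
  kops create ig nodes-v2          # copy the spec from the old 'nodes' ig in the editor
  kops update cluster --yes

  # 3./4./5. Once the new ig is stable, cordon and drain the old ig's nodes.
  for node in $(kubectl get nodes -l kops.k8s.io/instancegroup=nodes \
                  --no-headers -o custom-columns=NAME:.metadata.name); do
    kubectl cordon "$node"
    kubectl drain "$node" --ignore-daemonsets --delete-local-data
  done

  # 6. Kill the old instance group (this removes its ASG and instances).
  kops delete ig nodes --yes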

@dcowden commented Oct 12, 2017

@chrislovecnm thanks. In our experience, rolling updates have broken things quite frequently. I think @toidiu is right -- that's weave's fault, and the right solution is to fix weave, not to tiptoe around it.

I like your ig-based rolling update approach. Would it be possible to have a manual step/trigger between steps 3 and 4?

@chrislovecnm (Contributor)

Have you worked with the weave team? Where is the issue?

@dcowden commented Oct 12, 2017

The main issue is still pending here, and one possible solution is underway here

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Jan 10, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Feb 11, 2018
@Deshke commented Feb 11, 2018

What is the latest here? After clicking through multiple tickets and PRs, maybe #4038 is the latest?

@chrislovecnm (Contributor)

/lifecycle frozen
/remove-lifecycle stale

@Deshke #4038 is the next generation of rolling-updates. Hopefully @gambol99 has some bandwidth to get the PR in :)

@k8s-ci-robot added the lifecycle/frozen label on Feb 13, 2018
@omerh commented Jan 23, 2019

Instead of rolling-update, I do the following after updating:

  • a rolling upgrade of the masters only
  • for each instance group, double the desired capacity and wait until the new nodes are ready
  • cordon all old nodes in each instance group
  • revert to the original desired capacity (so AWS autoscaling terminates the old instances, as is its default behaviour)

This cuts the time spent on rolling updates by more than half.
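
A sketch of that procedure for a single instance group, assuming an ASG named nodes.example.com with a current desired capacity of 10, OLD_NODES holding the pre-scale node names, and the default termination policy (which prefers instances running the oldest launch configuration); all values are placeholders:

  # Double the desired capacity, then wait for the new nodes to become Ready.
  aws autoscaling set-desired-capacity \
    --auto-scaling-group-name nodes.example.com --desired-capacity 20
  kops validate cluster

  # Cordon the old nodes so new work only lands on the fresh nodes.
  for node in $OLD_NODES; do kubectl cordon "$node"; done

  # Revert to the original desired capacity; on scale-in the default termination
  # policy prefers the instances with the oldest launch configuration.
  aws autoscaling set-desired-capacity \
    --auto-scaling-group-name nodes.example.com --desired-capacity 10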

@Freyert commented Apr 1, 2019

Since #4038 has been closed, I'm wondering if there's any opposition to a smaller change in the face of the cluster-api/machine-api changes?

What about sticking with the current rolling-update strategy, but surging by 1 VM before beginning the node drainage?

It's not ideal to temporarily shift more load onto the remaining VMs and then shortly afterwards take one of those extra-loaded VMs down as well.
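
In shell terms, the surge-by-one idea per replaced node is roughly this (placeholder ASG and node names, current desired capacity read beforehand):

  # Add one VM of headroom, wait for it, then drain the node being replaced.
  aws autoscaling set-desired-capacity \
    --auto-scaling-group-name nodes.example.com \
    --desired-capacity $((CURRENT_DESIRED + 1))
  kops validate cluster
  kubectl drain "$NODE_TO_REPLACE" --ignore-daemonsets --delete-local-data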

If not, would someone kindly update this issue with how cluster-api/machine-api will solve these problems :) Thanks!
