Parallelize and improve rolling updates even more #1718

Closed
hubt opened this issue Jan 31, 2017 · 21 comments · Fixed by #8271
Labels
area/rolling-update lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@hubt commented Jan 31, 2017

Repeated from #1134 (comment)
It looks like rolling update only parallelizes by rolling one node from each instance group concurrently. In a medium or large cluster, that's going to take a while. Rather than having to wait for an hour or temporarily expand the ASG, it'd be nice if I could specify a roll concurrency parameter, either as an absolute number or as a percentage, similar to how a Deployment's rollingUpdate strategy has maxUnavailable. It's true that in a heterogeneous cluster it could simultaneously roll all your big nodes, but that's an acceptable risk to save hours of waiting.

If you want to be fancy, you could attempt to auto-detect an acceptable roll rate by making sure there are no Pending pods, but that's probably too tricky for right now.

Another good suggestion was to pre-scale the ASG before rolling.
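
For illustration only, a minimal shell sketch of the pre-scale idea combined with the "no Pending pods" gate, assuming a single node ASG named nodes.example.com and placeholder numbers (none of this is something kops does today):

  # Pre-scale the node ASG before the roll to create headroom for evicted pods.
  aws autoscaling set-desired-capacity \
    --auto-scaling-group-name nodes.example.com \
    --desired-capacity 12   # e.g. current desired (10) plus 2 spare nodes

  # Between batches, wait until there are no Pending pods anywhere in the cluster.
  while kubectl get pods --all-namespaces \
        --field-selector=status.phase=Pending --no-headers | grep -q .; do
    echo "Pending pods found, holding the roll..."
    sleep 30
  done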

@ese (Contributor) commented Jan 31, 2017

Trying to consolidate #1452 here

Rolling-update has some issues that should be handled:

  • Upgrading a cluster with high resource utilization can leave evicted pods unschedulable until new nodes boot up.
  • Large clusters can take a long time to complete a rolling update.

To handle these issues, rolling-update could adopt a strategy that brings up and terminates nodes at the same time, N instances at a time:

  • Bring up the first N nodes (maxSurge).
  • Be smart about the number of nodes it can terminate, or use a maxUnavailable parameter.

Should these parameters (maxSurge and maxUnavailable) be defined per cluster or per instance group? These values are surely tightly coupled to the min and max parameters of the ASGs, and can probably be inferred from those plus the current desired capacity.

Regarding being smarter and knowing whether there are enough resources to drain a node, the cluster autoscaler can serve as inspiration: https://github.com/kubernetes/contrib/blob/master/cluster-autoscaler

The simple flow should be:

before_desired_capacity = current desired capacity
while instances need updating:
  scale out the IG to desired capacity = (before_desired_capacity + maxSurge)
  validate the cluster
  detach instances from the IG down to capacity = (before_desired_capacity - maxUnavailable) and terminate them
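
A rough shell sketch of a single iteration of that flow, assuming an instance group backed by one ASG named nodes.example.com, maxSurge=2, maxUnavailable=1, and pre-computed variables for the old node and instance to remove (all names and numbers are placeholders):

  # Scale out: raise desired capacity by maxSurge so replacement nodes boot first.
  aws autoscaling set-desired-capacity \
    --auto-scaling-group-name nodes.example.com \
    --desired-capacity $((BEFORE_DESIRED + 2))

  # Validate the cluster before removing anything.
  kops validate cluster

  # Remove up to maxUnavailable old nodes: drain, detach from the ASG, terminate.
  kubectl drain "$OLD_NODE" --ignore-daemonsets --delete-local-data
  aws autoscaling detach-instances \
    --auto-scaling-group-name nodes.example.com \
    --instance-ids "$OLD_INSTANCE_ID" \
    --should-decrement-desired-capacity
  aws ec2 terminate-instances --instance-ids "$OLD_INSTANCE_ID"

Detaching with --should-decrement-desired-capacity keeps the ASG from immediately launching a replacement for the instance that is about to be terminated.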

@chrislovecnm (Contributor)

Another idea that @hubt mentioned on chat is that some users may want to cordon all nodes and wait for the new node before the pods are rescheduled.

@chrislovecnm (Contributor)

I realized that my algorithm will take forever if the number of nodes is more than a handful.
So an alternate and easier implementation, at least to begin with, could be as follows (a rough shell sketch of the eviction step follows the advantages below):

  • Start as many new nodes as there are old nodes.
  • Issue pod evictions from the old nodes (I am assuming that k8s will keep trying to find nodes to move the pods to; once the new nodes are ready, all pods will be evicted).
  • As each node's eviction is completed, remove it from the ASG and shut it down.

Advantages of this algorithm are:

  • simpler and quicker to implement, at least for a first cut
  • will not take hours or days, even if someone has 100s or 1000s of nodes
  • can easily be enhanced with a min/max surge algorithm afterwards
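
A sketch of the eviction step of that alternative, assuming OLD_NODES holds the names of the pre-upgrade nodes (collected with kubectl get nodes before the new capacity was added); everything here is a placeholder, not existing kops behaviour:

  # Evict pods from every old node; the scheduler keeps retrying and places the
  # evicted pods on the new nodes as they become Ready.
  for node in $OLD_NODES; do
    kubectl drain "$node" --ignore-daemonsets --delete-local-data &
  done
  wait

  # As each old node finishes draining, detach it from the ASG and shut it down:
  #   aws autoscaling detach-instances --instance-ids <id> \
  #     --auto-scaling-group-name nodes.example.com --should-decrement-desired-capacity
  #   aws ec2 terminate-instances --instance-ids <id>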

thanks @vinayagg

@chrislovecnm (Contributor)

I need to also research the Pod Disruption stuff upstream.

@edude03 commented Jun 23, 2017

Prescaling the number of nodes seems like a good idea, but what about masters? It's my understanding that kops doesn't yet manage etcd membership.

@chrislovecnm (Contributor)

I would not recommend prescaling masters, as we have already run into problems with HA upgrades. Headroom on the masters is usually not a huge issue, and the code running on the masters 'should' be built for failover. The practice for upgrading masters is one at a time.

@chrislovecnm (Contributor)

I have put in a PR, #2818, that starts to address this issue.

This PR introduces three strategies that influence node replacement. All masters and bastions are rolled sequentially before the nodes, and the strategy flag does not influence their replacement. The second and third strategies are gated behind the feature flags mentioned below.

  1. "default" - A node is drained then deleted. The cloud then replaces the node automatically.

  2. "create-all-new-ig-first" - All node instance groups are duplicated first; then all old nodes are cordoned.

  3. "create-new-by-ig" - As each node instance group rolls, the instance group is duplicated, then all old nodes are cordoned.

The second and third options create new instance groups. To use this ALPHA feature you need to enable the +DrainAndValidateRollingUpdate,+RollingUpdateStrategies feature flags.

The second option pre-creates a whole new set of instance groups first, while the third option only creates a new ig as each ig is rolled.

The first and default option is the original code path.
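
For anyone trying it, kops feature flags are set through the KOPS_FEATURE_FLAGS environment variable, so enabling the ALPHA flags above and rolling would look roughly like this (the exact strategy selection mechanism comes from the PR and may change):

  export KOPS_FEATURE_FLAGS="+DrainAndValidateRollingUpdate,+RollingUpdateStrategies"
  kops rolling-update cluster --yes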

@dcowden commented Oct 11, 2017

@chrislovecnm We've been talking about rolling updates a lot here over the past week. We've had a great experience with a running kops cluster [thanks], but we've had less great experiences with rolling updates. It seems that it's usually cluster networking [weave]. We think it's due to weave's decentralized approach to storing IPs.

Anyhow, we've concluded that rolling updates will take a long time no matter what, and we'd rather not risk the whole cluster at once.

We would like to take rolling updates slow. Like, really slow. For example, update a couple of nodes a day, with the idea that it's always going on -- an 'evergreen cluster' if you will. This way, we never risk a big downtime, and we never have more than a couple of nodes broken -- we can just cordon them and deal with it.

Is this a stupid idea? If not, I think what we want is a kops rolling update where the time between nodes is on the order of a day or more, which I think requires moving the state of the in-progress upgrade to the server side, or maybe taking a whole other strategy.

What are your thoughts?

@toidiu commented Oct 11, 2017

@dcowden We personally don't use rolling restarts because the cluster is fronted by HAProxy, and by the end of a rolling restart we would end up with stale records. However, we considered it a bit and decided it was actually too slow to be useful.

The restriction you pose makes perfect sense in the context of weave - but I would point out that your suggestion is to peg the performance to the 'lowest performing component' (not to hate on weave, it serves a purpose, but its peer-to-peer discovery makes it not so ideal for high-frequency node changes). You might want to look at calico and flannel, which use etcd for storing their state (disclaimer: I am nowhere close to being an expert in CNI technology).

Instead, what you are suggesting could be an additional feature on top of rolling update: "Roll the cluster with X nodes over Y time."

@dcowden commented Oct 11, 2017

Hi @toidiu. We are currently stuck with weave. Canal is our first choice, but unfortunately we have extremely high security requirements, and one of them is that our pod network is encrypted. Weave supports this. Flannel is working on adding encryption, and weave is working on the problems associated with the decentralized model. For now, we're stuck until one of them moves.

@dcowden commented Oct 12, 2017

As a side note, we're also looking into using wireguard. Has anyone tried that with kubernetes?

@chrislovecnm (Contributor)

Anyhow, we've concluded that rolling updates will take a long time no matter what, and we'd rather not risk the whole cluster at once.

Ish ... we have some ideas. As much as it has been tested, rolling-update is not that risky. It uses the same HA principles as failover.

We would like to take rolling updates slow. Like, really slow. For example, update a couple of nodes a day, with the idea that it's always going on -- an 'evergreen cluster' if you will. This way, we never risk a big downtime, and we never have more than a couple of nodes broken -- we can just cordon them and deal with it.

You can, but how many nodes? If you do it that slowly, just cordon and drain a node and delete it in EC2. Wash, rinse, repeat. You could build a simple controller that kills the oldest node :) It gets complicated managing the launch configurations :)
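
A minimal sketch of that wash-rinse-repeat loop, replacing one node per run (e.g. from a daily cron job), assuming an AWS kops cluster where node names match EC2 private DNS names and worker nodes carry the kubernetes.io/role=node label; the names and filters are illustrative:

  # Pick the oldest worker node.
  NODE=$(kubectl get nodes -l kubernetes.io/role=node \
           --sort-by=.metadata.creationTimestamp \
           --no-headers -o custom-columns=NAME:.metadata.name | head -n 1)

  # Cordon and drain it, then terminate the backing instance; the ASG replaces
  # it with a node built from the current launch configuration.
  kubectl cordon "$NODE"
  kubectl drain "$NODE" --ignore-daemonsets --delete-local-data
  INSTANCE_ID=$(aws ec2 describe-instances \
      --filters "Name=private-dns-name,Values=$NODE" \
      --query 'Reservations[0].Instances[0].InstanceId' --output text)
  aws ec2 terminate-instances --instance-ids "$INSTANCE_ID"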

Is this a stupid idea? If not, I think what we want is a kops rolling update where the time between nodes is on the order of a day or more, which I think requires moving the state of the in-progress upgrade to the server side, or maybe taking a whole other strategy.

I think you are overthinking it, personally. But I understand your concerns.

This is another approach, which is kinda what I am going to code next (a rough manual sketch follows the list):

  1. upgrade the masters (you have to upgrade them first, and at the same time; no choice)
  2. create a new ig, and let stuff migrate over there. Only when stuff deploys will it move over
  3. new ig is stable
  4. cordon the old ig
  5. drain it
  6. kill the ig
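
A rough manual version of steps 2-6 using existing kops commands, assuming an old instance group named nodes, a new one named nodes-v2, and that kops labels nodes with kops.k8s.io/instancegroup (all names are placeholders):

  # 2. Create a new instance group next to the old one and apply the change.
  kops create ig nodes-v2          # copy the spec from the old 'nodes' ig in the editor
  kops update cluster --yes

  # 3./4./5. Once the new ig is stable, cordon and drain the old ig's nodes.
  for node in $(kubectl get nodes -l kops.k8s.io/instancegroup=nodes \
                  --no-headers -o custom-columns=NAME:.metadata.name); do
    kubectl cordon "$node"
    kubectl drain "$node" --ignore-daemonsets --delete-local-data
  done

  # 6. Kill the old instance group (this removes its ASG and instances).
  kops delete ig nodes --yes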

@dcowden commented Oct 12, 2017

@chrislovecnm thanks. In our experience, rolling updates have broken things quite frequently. I think @toidiu is right -- that's weave's fault, and the right solution is to fix weave, not to tiptoe around it.

I like your ig-based rolling update approach. Would it be possible to have a manual step/trigger between steps 3 and 4?

@chrislovecnm (Contributor)

Have you worked with the weave team? Where is the issue?

@dcowden commented Oct 12, 2017

The main issue is still pending here, and one possible solution is underway here

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Jan 10, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Feb 11, 2018
@Deshke commented Feb 11, 2018

What is the latest here? After clicking through multiple tickets and PRs, maybe #4038 is the latest?

@chrislovecnm (Contributor)

/lifecycle frozen
/remove-lifecycle stale

@Deshke #4038 is the next generation of rolling-updates. Hopefully @gambol99 has some bandwidth to get the PR in :)

@k8s-ci-robot added the lifecycle/frozen label on Feb 13, 2018
@omerh commented Jan 23, 2019

Instead of rolling-update, I do the following after updating:

  • a rolling upgrade of the masters only
  • for each instance group, double the desired capacity and wait until the new nodes are ready
  • cordon all old nodes in each instance group
  • revert to the original desired capacity (so AWS autoscaling terminates the old instances, as is its default behaviour)

This cuts the time spent on rolling updates by more than half.
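
A sketch of that procedure for a single instance group, assuming an ASG named nodes.example.com with a current desired capacity of 10, OLD_NODES holding the pre-scale node names, and the default termination policy (which prefers instances running the oldest launch configuration); all values are placeholders:

  # Double the desired capacity, then wait for the new nodes to become Ready.
  aws autoscaling set-desired-capacity \
    --auto-scaling-group-name nodes.example.com --desired-capacity 20
  kops validate cluster

  # Cordon the old nodes so new work only lands on the fresh nodes.
  for node in $OLD_NODES; do kubectl cordon "$node"; done

  # Revert to the original desired capacity; on scale-in the default termination
  # policy prefers the instances with the oldest launch configuration.
  aws autoscaling set-desired-capacity \
    --auto-scaling-group-name nodes.example.com --desired-capacity 10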

@Freyert commented Apr 1, 2019

Since #4038 has been closed, I'm wondering if there's any opposition to a smaller change in the face of the cluster-api/machine-api changes?

What about sticking with the current rolling-update strategy, but surging by 1 VM before beginning the node drainage?

It's not ideal to temporarily shift more load onto the remaining VMs and then shortly afterwards take one of those extra-loaded VMs down as well.
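
In shell terms, the surge-by-one idea per replaced node is roughly this (placeholder ASG and node names, current desired capacity read beforehand):

  # Add one VM of headroom, wait for it, then drain the node being replaced.
  aws autoscaling set-desired-capacity \
    --auto-scaling-group-name nodes.example.com \
    --desired-capacity $((CURRENT_DESIRED + 1))
  kops validate cluster
  kubectl drain "$NODE_TO_REPLACE" --ignore-daemonsets --delete-local-data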

If not, would someone kindly update this issue with how cluster-api/machine-api will solve these problems :) Thanks!
