
kops rolling-update should do a real rolling-update #37

Closed
justinsb opened this Issue Jul 5, 2016 · 13 comments

justinsb commented Jul 5, 2016

Right now it is a hacky timing-based loop.

Ideally we would wait for the new node to be registered.
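A rough sketch of what waiting for registration could look like with kubectl (the node name and polling interval are illustrative, and kops would talk to the API server directly rather than shelling out):

```shell
# Hypothetical wait loop: block until $NODE has registered and reports Ready
NODE=ip-10-0-1-23.ec2.internal   # illustrative node name
until kubectl get node "$NODE" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null \
    | grep -q True; do
  sleep 5
done
```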

@justinsb justinsb added the P1 label Jul 5, 2016

justinsb commented Jul 18, 2016

We should also evict nodes before terminating them.

gigaroby commented Jul 18, 2016

What about using a strategy similar to the one followed by pod rolling updates in Kubernetes?
There could be a parameter that controls how many nodes are affected at any given time (call it N), and the strategy could be something like:

  1. increase the size of the instance group by N
  2. wait for the new N nodes to register themselves to the master
  3. cordon and drain N old nodes, then delete them from the instance group
  4. wait for the new instances to be available on the master
  5. repeat from step 3 until there are no more old instances
  6. decrease the size of the instance group back by N again

I am not an expert on how ASGs work on EC2, but doing this naively may have some problems: it could be slow to wait for the ASG to notice that N nodes have been deleted and to spawn new ones, and the behavior of the last step (scaling back down) may be a bit problematic.
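The steps above can be sketched as a small in-memory simulation. This is a hedged sketch, not kops code: the `InstanceGroup` class and its methods are illustrative stand-ins for the cloud provider's ASG API and the Kubernetes node registry.

```python
# Illustrative simulation of the surge-based rolling update sketched above.
# InstanceGroup stands in for an EC2 ASG plus the Kubernetes node registry.

class InstanceGroup:
    def __init__(self, size):
        self.old = [f"old-{i}" for i in range(size)]  # nodes on the old config
        self.new = []                                 # nodes on the new config
        self.target_size = size

    def scale(self, delta):
        self.target_size += delta

    def launch_new(self, n):
        # stands in for "wait for the new N nodes to register with the master"
        start = len(self.new)
        self.new += [f"new-{start + i}" for i in range(n)]

    def drain_and_delete(self, nodes):
        # stands in for cordon + drain + terminate
        for node in nodes:
            self.old.remove(node)

    def terminate_new(self, n):
        del self.new[:n]


def rolling_update(group, n):
    group.scale(+n)        # 1. surge the instance group by N
    group.launch_new(n)    # 2. wait for the N new nodes to register
    while group.old:       # 5. repeat until no old instances remain
        batch = group.old[:n]
        group.drain_and_delete(batch)  # 3. cordon, drain, delete from group
        group.launch_new(len(batch))   # 4. ASG restores the target size
    group.scale(-n)        # 6. shrink back by N; note this terminates
    group.terminate_new(n) #    freshly registered nodes, which is the
                           #    "last step" problem mentioned above


group = InstanceGroup(size=5)
rolling_update(group, n=2)
print(len(group.old), len(group.new), group.target_size)  # → 0 5 5
```

Note that the final scale-down terminates nodes that just registered, which illustrates why the last step needs care (e.g. draining the surge nodes specifically rather than letting the ASG pick victims).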

justinsb commented Aug 2, 2016

Note: we should also rolling-update the masters before the nodes in the case of a version upgrade.

chrislovecnm commented Dec 19, 2016

This is handled much better in my open PR.

@chrislovecnm chrislovecnm self-assigned this Dec 19, 2016

kris-nova commented Dec 19, 2016

@chrislovecnm - pointer please? Which PR?

chrislovecnm commented Dec 19, 2016

#1134 Drain and validate in rolling update

@justinsb justinsb modified the milestones: 1.5.0, 1.5 Dec 28, 2016

@justinsb justinsb modified the milestones: 1.5.1, 1.5.0 Jan 29, 2017

mirague commented Feb 16, 2017

Is this expected in 1.5.1?

chrislovecnm commented Feb 16, 2017

@mirague I am working on the PR again today; hopefully it will land in our next release, behind a feature flag to turn it on.

mirague commented Feb 17, 2017

That's fantastic to hear, thanks @chrislovecnm !

toidiu commented Apr 10, 2017

Any update on this?

Miyurz commented Apr 20, 2017

@chrislovecnm any updates here?


chrislovecnm commented Apr 22, 2017

@Miyurz we have had a feature flag that allows for drain and validate. It is stable, especially for stateless applications. We need some more TLC for stateful applications.

Use KOPS_FEATURE_FLAGS="+DrainAndValidateRollingUpdate" to use the beta code that drains the nodes and validates the cluster. New flags for the drain and validation operations are shown when the environment variable is set.
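For concreteness, usage might look like this (the cluster name is illustrative; the flag name is quoted from the comment above):

```shell
# Enable the beta drain-and-validate behavior for this shell session
export KOPS_FEATURE_FLAGS="+DrainAndValidateRollingUpdate"

# Preview which instance groups need rotation, then apply
kops rolling-update cluster mycluster.example.com
kops rolling-update cluster mycluster.example.com --yes
```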

chrislovecnm commented Apr 22, 2017

Closing as this is stable, and we have #1718

cloudbow pushed a commit to cloudbow/kops that referenced this issue Jun 8, 2018

Arun George
Merge pull request #37 in SAPI/slingtv-proxy-api from feature/nba_ful…
…l to develop

* commit '0c6bd01682f69a91c55fb14cd872c83accc8c299':
  Refactor live job to abstract class and implementations