This repository has been archived by the owner on Sep 30, 2020. It is now read-only.

'AvailabilityZone' nodepool rolling strategy #1514

Conversation

davidmccormick
Contributor

@davidmccormick davidmccormick commented Dec 14, 2018

The default 'Parallel' nodepool rolling strategy is a bit dangerous because it can take down nodes across multiple availability zones at once, so if you are unlucky and have an application with only 1, 2 or 3 pods, the parallel roll could take down all of its instances at the same time, which is definitely not desirable!

The 'Sequential' nodepool rolling strategy is much safer because it rolls only a single nodepool at a time, but it is slow as a result. When you have several different node pools, or a large cluster, you may find that your rolls, although safe, take a long time to complete.

The new 'AvailabilityZone' nodepool rolling strategy sits between the two. It rolls all of the nodepools that belong to the same AWS availability zone at the same time, and rolls the availability zones one after another. Example:

worker:
  nodePoolRollingStrategy: AvailabilityZone
  nodePools:
    "TestA":
      subnet: a (in availability zone us-west-2a)
    "TestB":
      subnet: b (in availability zone us-west-2b)
    "TestC":
      subnet: a (in availability zone us-west-2a)
    "TestD":
      subnet: b (in availability zone us-west-2b)

In the config above, "TestA" and "TestC" will roll in parallel with each other first (because they are both in availability zone us-west-2a); then, once they have finished rolling, "TestB" and "TestD" will roll in parallel.

Note: only nodepools whose subnets are all in the same AWS availability zone can be rolled using this strategy; otherwise you will get an error when rendering the root CloudFormation stack.
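
To make the behaviour concrete, here is a minimal Go sketch of the grouping logic, written purely for illustration: the NodePool type and the rollPool and rollByAvailabilityZone functions are hypothetical names, not the actual kube-aws implementation. Node pools are grouped by the availability zone of their subnet, the zones are processed one at a time, and the pools within each zone are rolled in parallel.

package main

import (
	"fmt"
	"sort"
	"sync"
)

// NodePool is an illustrative stand-in for a kube-aws worker node pool.
type NodePool struct {
	Name             string
	AvailabilityZone string
}

// rollPool is a placeholder for whatever actually replaces the nodes in a pool
// (for kube-aws, updating the pool's CloudFormation stack).
func rollPool(p NodePool) {
	fmt.Printf("rolling %s in %s\n", p.Name, p.AvailabilityZone)
}

// rollByAvailabilityZone rolls one availability zone at a time; pools that
// share a zone are rolled concurrently.
func rollByAvailabilityZone(pools []NodePool) {
	byZone := map[string][]NodePool{}
	for _, p := range pools {
		byZone[p.AvailabilityZone] = append(byZone[p.AvailabilityZone], p)
	}

	zones := make([]string, 0, len(byZone))
	for z := range byZone {
		zones = append(zones, z)
	}
	sort.Strings(zones) // deterministic ordering, e.g. us-west-2a before us-west-2b

	for _, z := range zones {
		var wg sync.WaitGroup
		for _, p := range byZone[z] {
			wg.Add(1)
			go func(p NodePool) {
				defer wg.Done()
				rollPool(p)
			}(p)
		}
		wg.Wait() // the whole zone must finish before the next zone starts
	}
}

func main() {
	rollByAvailabilityZone([]NodePool{
		{"TestA", "us-west-2a"},
		{"TestB", "us-west-2b"},
		{"TestC", "us-west-2a"},
		{"TestD", "us-west-2b"},
	})
}

With the example pools above, the sketch rolls "TestA" and "TestC" together (in either order), waits for us-west-2a to finish, and only then rolls "TestB" and "TestD" together, which matches the ordering described in this PR.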

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 14, 2018
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: mumoshu

If they are not already assigned, you can assign the PR to them by writing /assign @mumoshu in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@davidmccormick davidmccormick changed the title from "Implement an 'AvailabilityZone' nodepool rolling strategy" to "WIP: Implement an 'AvailabilityZone' nodepool rolling strategy" Dec 14, 2018
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 14, 2018
@codecov-io

codecov-io commented Dec 14, 2018

Codecov Report

Merging #1514 into master will not change coverage.
The diff coverage is 33.33%.


@@           Coverage Diff           @@
##           master    #1514   +/-   ##
=======================================
  Coverage   25.46%   25.46%           
=======================================
  Files          97       97           
  Lines        5003     5003           
=======================================
  Hits         1274     1274           
  Misses       3582     3582           
  Partials      147      147
Impacted Files          Coverage Δ
pkg/api/controller.go   0% <0%> (ø) ⬆️
pkg/model/compiler.go   55.12% <50%> (ø) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e8b7380...cf6e7ed.

@davidmccormick davidmccormick force-pushed the roll-nodepools-by-availabilityzone branch from b9522f1 to cf6e7ed on January 2, 2019 13:15
@davidmccormick
Contributor Author

Rebased this on top of the fix for workers not correctly starting when using the experimental TLS bootstrap feature. This issue was preventing me from performing real testing of this AZ rolling change.

@davidmccormick davidmccormick changed the title from "WIP: Implement an 'AvailabilityZone' nodepool rolling strategy" to "'AvailabilityZone' nodepool rolling strategy" Jan 2, 2019
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 2, 2019
@davidmccormick
Contributor Author

This is now ready and tested. I think we should consider making this the default nodepool rolling strategy, as it is less likely to cause outages than 'Parallel' because it ensures that disruption is localised within each well-defined failure domain rather than spread across the whole cluster.

@davidmccormick
Contributor Author

Merge after #1522

Contributor

@mumoshu mumoshu left a comment


Awesome! Yes, making this the default makes sense to me. Would you mind opening up another PR for that? Thanks for your contribution!

@mumoshu mumoshu merged commit d7e4216 into kubernetes-retired:master Jan 9, 2019
@mumoshu mumoshu added this to the v0.13.0 milestone Jan 9, 2019
kevtaylor pushed a commit to HotelsDotCom/kube-aws that referenced this pull request Jan 9, 2019
* ALWAYS write a kubeconfig file - even when using the TLS bootstrapping feature

* TLS worker pem symlink is always created so use it always in the kubeconfig

* Implement nodePoolRollingStrategy - AvailabilityZone

* spelling correction