This repository has been archived by the owner on Sep 30, 2020. It is now read-only.

'AvailabilityZone' nodepool rolling strategy #1514

Conversation

davidmccormick
Contributor

@davidmccormick davidmccormick commented Dec 14, 2018

The default 'Parallel' nodepool rolling strategy is a bit dangerous because it can take down nodes across multiple availability zones at once, so if you are unlucky and have an application with only 1, 2 or 3 pods, the parallel roll could take down all of its instances at the same time, which is definitely not desirable!

The 'Sequential' nodepool rolling strategy is much safer because it rolls only a single nodepool at a time, but it is slow as a result. When you have several different node pools, or a large cluster, you may find that your rolls, although safe, take a long time to complete.

The new 'AvailabilityZone' nodepool rolling strategy sits between the two. It rolls all of the nodepools that belong to the same AWS availability zone at the same time, and rolls the availability zones one after another. Example:

worker:
  nodePoolRollingStrategy: AvailabilityZone
  nodePools:
    "TestA":
      subnet: a (in availability zone us-west-2a)
    "TestB":
      subnet: b (in availability zone us-west-2b)
    "TestC":
      subnet: a (in availability zone us-west-2a)
    "TestD":
      subnet: b (in availability zone us-west-2b)

In the config above, "TestA" and "TestC" will roll in parallel with each other first (because they are both in availability zone us-west-2a); then, once they have finished rolling, "TestB" and "TestD" will roll in parallel.

Note: only nodepools whose subnets are all in the same AWS availability zone can be rolled using this strategy; otherwise you will get an error when rendering the root CloudFormation stack.
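
To make the behaviour concrete, here is a minimal Go sketch of the grouping logic, written purely for illustration: the NodePool type and the rollPool and rollByAvailabilityZone functions are hypothetical names, not the actual kube-aws implementation. Node pools are grouped by the availability zone of their subnet, the zones are processed one at a time, and the pools within each zone are rolled in parallel.

package main

import (
	"fmt"
	"sort"
	"sync"
)

// NodePool is an illustrative stand-in for a kube-aws worker node pool.
type NodePool struct {
	Name             string
	AvailabilityZone string
}

// rollPool is a placeholder for whatever actually replaces the nodes in a pool
// (for kube-aws, updating the pool's CloudFormation stack).
func rollPool(p NodePool) {
	fmt.Printf("rolling %s in %s\n", p.Name, p.AvailabilityZone)
}

// rollByAvailabilityZone rolls one availability zone at a time; pools that
// share a zone are rolled concurrently.
func rollByAvailabilityZone(pools []NodePool) {
	byZone := map[string][]NodePool{}
	for _, p := range pools {
		byZone[p.AvailabilityZone] = append(byZone[p.AvailabilityZone], p)
	}

	zones := make([]string, 0, len(byZone))
	for z := range byZone {
		zones = append(zones, z)
	}
	sort.Strings(zones) // deterministic ordering, e.g. us-west-2a before us-west-2b

	for _, z := range zones {
		var wg sync.WaitGroup
		for _, p := range byZone[z] {
			wg.Add(1)
			go func(p NodePool) {
				defer wg.Done()
				rollPool(p)
			}(p)
		}
		wg.Wait() // the whole zone must finish before the next zone starts
	}
}

func main() {
	rollByAvailabilityZone([]NodePool{
		{"TestA", "us-west-2a"},
		{"TestB", "us-west-2b"},
		{"TestC", "us-west-2a"},
		{"TestD", "us-west-2b"},
	})
}

With the example pools above, the sketch rolls "TestA" and "TestC" together (in either order), waits for us-west-2a to finish, and only then rolls "TestB" and "TestD" together, which matches the ordering described in this PR.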

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 14, 2018
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: mumoshu

If they are not already assigned, you can assign the PR to them by writing /assign @mumoshu in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@davidmccormick davidmccormick changed the title from "Implement an 'AvailabilityZone' nodepool rolling strategy" to "WIP: Implement an 'AvailabilityZone' nodepool rolling strategy" Dec 14, 2018
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 14, 2018
@codecov-io

codecov-io commented Dec 14, 2018

Codecov Report

Merging #1514 into master will not change coverage.
The diff coverage is 33.33%.


@@           Coverage Diff           @@
##           master    #1514   +/-   ##
=======================================
  Coverage   25.46%   25.46%           
=======================================
  Files          97       97           
  Lines        5003     5003           
=======================================
  Hits         1274     1274           
  Misses       3582     3582           
  Partials      147      147
Impacted Files          Coverage Δ
pkg/api/controller.go   0% <0%> (ø) ⬆️
pkg/model/compiler.go   55.12% <50%> (ø) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e8b7380...cf6e7ed.

@davidmccormick davidmccormick force-pushed the roll-nodepools-by-availabilityzone branch from b9522f1 to cf6e7ed on January 2, 2019 13:15
@davidmccormick
Contributor Author

Rebased this on top of the fix for workers not correctly starting when using the experimental TLS bootstrap feature. This issue was preventing me from performing real testing of this AZ rolling change.

@davidmccormick davidmccormick changed the title from "WIP: Implement an 'AvailabilityZone' nodepool rolling strategy" to "'AvailabilityZone' nodepool rolling strategy" Jan 2, 2019
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 2, 2019
@davidmccormick
Contributor Author

This is now ready and tested. I think we should consider making this the default nodepool rolling strategy, as it is less likely to cause outages than 'Parallel' because it ensures that disruption is localised within each well-defined failure domain rather than spread across the whole cluster.

@davidmccormick
Contributor Author

Merge after #1522

Contributor

@mumoshu mumoshu left a comment


Awesome! Yes, making this the default makes sense to me. Would you mind opening up another PR for that? Thanks for your contribution!

@mumoshu mumoshu merged commit d7e4216 into kubernetes-retired:master Jan 9, 2019
@mumoshu mumoshu added this to the v0.13.0 milestone Jan 9, 2019
kevtaylor pushed a commit to HotelsDotCom/kube-aws that referenced this pull request Jan 9, 2019
* ALWAYS write a kubeconfig file - even when using the TLS bootstrapping feature

* TLS worker pem symlink is always created so use it always in the kubeconfig

* Implement nodePoolRollingStrategy - AvailabilityZone

* spelling correction