'AvailabilityZone' nodepool rolling strategy #1514
Codecov Report
@@ Coverage Diff @@
## master #1514 +/- ##
=======================================
Coverage 25.46% 25.46%
=======================================
Files 97 97
Lines 5003 5003
=======================================
Hits 1274 1274
Misses 3582 3582
Partials 147 147
b9522f1 to cf6e7ed
Rebased this on top of the fix for workers not starting correctly when using the experimental TLS bootstrap feature. That issue was preventing me from doing real testing of this AZ rolling change.
This is now ready and tested. I think that we should consider making this the default nodepool rolling strategy, as it is less likely to cause outages than 'Parallel': it ensures that disruption is localised within each well-defined failure domain rather than spread across the whole cluster.
Merge after #1522
Awesome! Yes, making this the default makes sense to me. Would you mind opening another PR for that? Thanks for your contribution anyway!
* ALWAYS write a kubeconfig file - even when using the TLS bootstrapping feature
* TLS worker pem symlink is always created so use it always in the kubeconfig
* Implement nodePoolRollingStrategy - AvailabilityZone
* spelling correction
The default 'Parallel' nodepool rolling strategy is a bit dangerous because it can take down nodes across multiple availability zones at once. If you are unlucky and have an application with only 1, 2 or 3 pods, a parallel roll could take down all of its instances at the same time, which is definitely not desirable!
The 'Sequential' nodepool rolling strategy is a lot safer because it will only roll a single nodepool at a time, but is slow as a result. When you have multiple different node pools, or large clusters, you may find that your rolls, although safe, take a long time to complete.
The new 'AvailabilityZone' nodepool rolling strategy fits between the two. It rolls all of the nodepools that belong to the same AWS availability zone at the same time, and rolls the availability zones one after another, in order. Example:
In the config above, "TestA" and "TestC" will roll in parallel with each other first (because they are both in availability zone `us-west-2a`). Once they have finished rolling, "TestB" and "TestD" will roll in parallel.
Note: A nodepool can only be rolled with this strategy if all of its subnets are in the same AWS availability zone; otherwise you will get an error when rendering the root CloudFormation stack.
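
The ordering described above can be sketched in a few lines of Go. This is only an illustration of the idea, not the code from this PR: the `NodePool` type, the `rollWaves` and `singleAZ` helpers, and the `us-west-2b` zone used for "TestB" and "TestD" are all hypothetical. Pools are grouped by availability zone, the zones are processed in sorted order, and a pool whose subnets span more than one zone is rejected, mirroring the note above.

```go
package main

import (
	"fmt"
	"sort"
)

// NodePool is a hypothetical, simplified stand-in for a node pool:
// just a name and the availability zone of each of its subnets.
type NodePool struct {
	Name      string
	SubnetAZs []string
}

// singleAZ returns the one availability zone shared by all of the
// pool's subnets, or an error if the pool has no subnets or spans zones.
func singleAZ(p NodePool) (string, error) {
	if len(p.SubnetAZs) == 0 {
		return "", fmt.Errorf("node pool %q has no subnets", p.Name)
	}
	az := p.SubnetAZs[0]
	for _, other := range p.SubnetAZs[1:] {
		if other != az {
			return "", fmt.Errorf("node pool %q spans %s and %s", p.Name, az, other)
		}
	}
	return az, nil
}

// rollWaves groups pool names by availability zone and returns the
// groups in sorted zone order. Pools within one group would roll in
// parallel; the groups themselves would roll one after another.
func rollWaves(pools []NodePool) ([][]string, error) {
	byAZ := map[string][]string{}
	for _, p := range pools {
		az, err := singleAZ(p)
		if err != nil {
			return nil, err
		}
		byAZ[az] = append(byAZ[az], p.Name)
	}
	zones := make([]string, 0, len(byAZ))
	for az := range byAZ {
		zones = append(zones, az)
	}
	sort.Strings(zones)
	waves := make([][]string, 0, len(zones))
	for _, az := range zones {
		waves = append(waves, byAZ[az])
	}
	return waves, nil
}

func main() {
	// The zone for TestB and TestD is an assumption made for this sketch.
	pools := []NodePool{
		{Name: "TestA", SubnetAZs: []string{"us-west-2a"}},
		{Name: "TestB", SubnetAZs: []string{"us-west-2b"}},
		{Name: "TestC", SubnetAZs: []string{"us-west-2a"}},
		{Name: "TestD", SubnetAZs: []string{"us-west-2b"}},
	}
	waves, err := rollWaves(pools)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for i, w := range waves {
		fmt.Printf("wave %d: %v\n", i+1, w)
	}
}
```

Under these assumptions the first wave contains "TestA" and "TestC" and the second contains "TestB" and "TestD", which matches the order described above.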