Allow major Etcd upgrades with safe roll-back #1773
Conversation
…igration (copy of all kubernetes data). Similar to the approach used during the stack migration, but this time the major/minor version of etcd is used to control migration, e.g. 3.2.x -> 3.3.x will cause a migration. It is safer because, should the CF roll fail, the previous etcds are still available to fall back to.
- Bring all new etcds up at the same time during a migration.
- Correct the lookup of configsets now that the instance name has changed.
- When an etcd has an attached NIC, use that address rather than the machine's private DNS name.
- Update etcd to the 3.3.17 release.
- Fix etcdadm so that it can still detect cluster health, which is now written to stderr.
- Update the etcd migration to respect keys with leases.
- Fix building etcd endpoints where the interfaces are listed in different orders.
- Move to a two-export process for retrieving keys and values: use etcdctl's 'json' export type for key/value data, and then its 'fields' export type to extract key/lease data. Process the two files back together with a nod to performance.
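The two-export merge described in the last point can be sketched in Go. This is a hedged illustration only: the `mergeExports` helper, its types, and the `leases` map (standing in for data parsed from the 'fields' export) are hypothetical names, not kube-aws's actual code; the JSON shape mirrors `etcdctl get -w json` output, where keys and values are base64-encoded.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// kvExport mirrors the shape of `etcdctl get --prefix -w json` output,
// in which keys and values are base64-encoded. The struct names here
// are an illustrative assumption, not kube-aws's actual types.
type kvExport struct {
	Kvs []struct {
		Key   string `json:"key"`
		Value string `json:"value"`
	} `json:"kvs"`
}

// record is the merged key/value/lease triple to restore into the new cluster.
type record struct {
	Key, Value string
	Lease      int64
}

// mergeExports joins the JSON key/value export with a key->lease map
// (built separately from the `-w fields` export) in one pass over the kvs,
// so keys with leases keep their lease association.
func mergeExports(jsonExport []byte, leases map[string]int64) ([]record, error) {
	var exp kvExport
	if err := json.Unmarshal(jsonExport, &exp); err != nil {
		return nil, err
	}
	out := make([]record, 0, len(exp.Kvs))
	for _, kv := range exp.Kvs {
		k, err := base64.StdEncoding.DecodeString(kv.Key)
		if err != nil {
			return nil, err
		}
		v, err := base64.StdEncoding.DecodeString(kv.Value)
		if err != nil {
			return nil, err
		}
		out = append(out, record{Key: string(k), Value: string(v), Lease: leases[string(k)]})
	}
	return out, nil
}

func main() {
	// "/registry/x" -> "L3JlZ2lzdHJ5L3g=", "v1" -> "djE=" (base64)
	export := []byte(`{"kvs":[{"key":"L3JlZ2lzdHJ5L3g=","value":"djE="}]}`)
	recs, err := mergeExports(export, map[string]int64{"/registry/x": 7})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s %s %d\n", recs[0].Key, recs[0].Value, recs[0].Lease)
}
```

Joining on the decoded key keeps the merge a single linear pass, which is presumably the "nod to performance" the description mentions.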
…ror if the default etcd version hadn't been correctly linked into the binary - which it isn't during testing).
Codecov Report
@@ Coverage Diff @@
## master #1773 +/- ##
==========================================
- Coverage 24.76% 24.52% -0.25%
==========================================
Files 98 98
Lines 5023 5085 +62
==========================================
+ Hits 1244 1247 +3
- Misses 3640 3699 +59
Partials 139 139
Continue to review full report at Codecov.
Couple of comments.
pkg/model/context.go
Outdated
if state.EtcdMigrationEnabled {
	logger.Warn("Performing a Major Etcd Version Upgrade: -")
	logger.Warn("To do this we will spin up new etcd servers and then export the existing kubernetes state to them.")
	logger.Warn("There will be cluster disruption until all of your existing controllers have rolled.")
What is the window of disruption here? Are we talking service outages?
Good question - we are talking about a service outage of the kubernetes api of several minutes in the good case, and longer if a roll-back is triggered after the api-servers have been rolled.
Values: []*string{aws.String("owned")},
},
{
	Name: aws.String("tag:kube-aws:etcd_upgrade_group"),
Is this backward compatible with older clusters? I guess it'd just return 0 instances right?
Yes - by returning 0 matching servers it knows it needs to perform a migration. In effect, even if we kept the version of etcd the same (3.2.26), it would still perform a migration the first time this code is merged, creating Etcdv3dot2iX instances and removing the old Etcd[0-2] ones.
logger.Debugf("<- received %d instances from AWS", len(resp.Reservations))
if len(resp.Reservations) == 0 {
	logger.Debugf("There are 0 instances matching major-minor version %s - this is an upgrade...", c.Etcd.Cluster.MajorMinorVersion())
Looks like this catches the backwards compatibility issues, cool.
pkg/model/context.go
Outdated
logger.Warn("If the cloudformation update fails (at any point) then we will roll back to the original etcd servers.")
logger.Warn("You MAY lose/rollback changes that are made to the cluster AFTER the etcd export has been performed!")
logger.Warn("This operation is best scheduled for a quiet time or in an outage window.")
if state.EtcdMigrationExistingEndpoints, err = s.lookupExistingEtcdEndpoints(c); err != nil {
If we're going to bomb out on lookupExistingEtcdEndpoints, should we try and do that before telling the user that we're Performing a Major Etcd Upgrade? As a layman, I might be concerned that something may have changed before the thing blew up.
Good idea - I've swapped it around so we announce the migration only if the lookup is successful.
/lgtm
Safer major etcd upgrades by spinning up new etcds and performing a migration (copy of all kubernetes data)
Similar to the approach used during the stack migration, but this time the major/minor version of etcd is used to control migration, e.g. 3.2.x -> 3.3.x will cause a migration, and the migration is handled within the same etcd stack. It is safer because, should the CF roll fail, the previous etcds are still available to roll back to. It works by embedding the version into the etcd cloudformation logical names, so that a version upgrade generates a new set of etcds, e.g. Etcdv3dot2iX instances for a 3.2.x cluster.
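Embedding the version into the logical names might look like the following sketch. The `etcdLogicalName` helper is hypothetical (kube-aws's real naming code may differ); the "dot" substitution reflects the Etcdv3dot2iX instance names mentioned elsewhere in this PR, and is also needed because CloudFormation logical IDs must be alphanumeric.

```go
package main

import (
	"fmt"
	"strings"
)

// etcdLogicalName builds a CloudFormation logical name that embeds the
// etcd major.minor version, so bumping the version (e.g. 3.2.x -> 3.3.x)
// yields a fresh set of resources rather than an in-place update.
// This function is an illustrative sketch, not kube-aws's actual code.
func etcdLogicalName(version string, index int) string {
	parts := strings.SplitN(version, ".", 3)
	// "3.3.17" -> "3dot3": only major.minor drives a migration,
	// and logical IDs cannot contain dots.
	majorMinor := strings.Join(parts[:2], "dot")
	return fmt.Sprintf("Etcdv%si%d", majorMinor, index)
}

func main() {
	fmt.Println(etcdLogicalName("3.3.17", 0)) // a 3.3.x cluster gets new logical names
	fmt.Println(etcdLogicalName("3.2.26", 2)) // while 3.2.x names stay distinct
}
```

Because CloudFormation treats a changed logical ID as a new resource, the old instances survive until the stack update succeeds, which is what makes the roll-back safe.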
kube-aws uses a new tag on the etcds, kube-aws:etcd_upgrade_group, to look for etcds in the same upgrade group as the requested deployment; if the cluster exists (i.e. the cluster's etcd stack exists) but there are no matching etcd_upgrade_group instances, then kube-aws will trigger migration mode. In migration mode the leader of the new etcd cluster exports the kubernetes registry from the existing/old etcds and restores it into the new cluster.

Features
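The trigger condition described above reduces to a small predicate. This is a hedged sketch under stated assumptions: `needsMigration` and its parameters are hypothetical names standing in for kube-aws's actual stack check and the EC2 tag-filtered instance count, not its real API.

```go
package main

import "fmt"

// needsMigration reports whether kube-aws should enter migration mode:
// the cluster's etcd stack already exists, but no running instances carry
// the kube-aws:etcd_upgrade_group tag matching the requested etcd
// major.minor version. Names here are illustrative, not kube-aws's code.
func needsMigration(etcdStackExists bool, matchingInstances int) bool {
	return etcdStackExists && matchingInstances == 0
}

func main() {
	fmt.Println(needsMigration(false, 0)) // fresh cluster: nothing to migrate from
	fmt.Println(needsMigration(true, 0))  // existing cluster, no tagged instances: migrate
	fmt.Println(needsMigration(true, 3))  // already on this upgrade group: no migration
}
```

This also covers the backward-compatibility case discussed in the review: pre-existing clusters have no tagged instances at all, so the first deploy after this change performs a migration even when the etcd version is unchanged.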