Allow major Etcd upgrades with safe roll-back #1773

davidmccormick · 2019-11-13T10:22:44Z

Safer major etcd upgrades by spinning up new etcds and performing a migration (copy of all kubernetes data)

Similar to the approach used during the stack migration but this time the major/minor version of etcd is used to control migration e.g 3.2.x -> 3.3.x will cause a migration and the migration is handled within the same etcd stack. It is safer because should the CF roll fail the previous etcd's should still be available to roll-back to. It works by embedding the version into the etcd cloudformation logical names so that version upgrades generates a new set of etcds, e.g.

+00:00:58	CREATE_IN_PROGRESS      		Etcdv3dot3i2ENI
+00:00:58	CREATE_IN_PROGRESS      		Etcdv3dot3i2LC
+00:00:58	CREATE_IN_PROGRESS      		Etcdv3dot3i0LC
+00:00:58	CREATE_IN_PROGRESS      		Etcdv3dot3i1LC
+00:00:59	CREATE_IN_PROGRESS      		Etcdv3dot3i2ENI       	"Resource creation Initiated"
+00:00:59	CREATE_IN_PROGRESS      		Etcdv3dot3i1ENI
+00:00:59	CREATE_IN_PROGRESS      		Etcdv3dot3i2LC        	"Resource creation Initiated"
+00:00:59	CREATE_IN_PROGRESS      		Etcdv3dot3i0LC        	"Resource creation Initiated"
+00:00:59	CREATE_IN_PROGRESS      		Etcdv3dot3i1LC        	"Resource creation Initiated"
+00:00:59	CREATE_IN_PROGRESS      		Etcdv3dot3i0ENI
+00:00:59	UPDATE_IN_PROGRESS      		IAMRoleEtcd
+00:00:59	CREATE_IN_PROGRESS      		Etcdv3dot3i1ENI       	"Resource creation Initiated"
+00:00:59	CREATE_COMPLETE         		Etcdv3dot3i2LC
+00:00:59	CREATE_COMPLETE         		Etcdv3dot3i0LC
+00:00:59	CREATE_COMPLETE         		Etcdv3dot3i1LC
+00:00:59	CREATE_IN_PROGRESS      		Etcdv3dot3i0ENI       	"Resource creation Initiated"
+00:01:09	UPDATE_COMPLETE         		IAMRoleEtcd
+00:01:15	CREATE_COMPLETE         		Etcdv3dot3i2ENI
+00:01:15	CREATE_COMPLETE         		Etcdv3dot3i1ENI
+00:01:15	CREATE_COMPLETE         		Etcdv3dot3i0ENI
+00:01:17	CREATE_IN_PROGRESS      		Etcdv3dot3i2EBS
+00:01:17	CREATE_IN_PROGRESS      		Etcdv3dot3i2EBS       	"Resource creation Initiated"
+00:01:17	CREATE_IN_PROGRESS      		Etcdv3dot3i1EBS
+00:01:18	CREATE_IN_PROGRESS      		Etcdv3dot3i1EBS       	"Resource creation Initiated"
+00:01:18	CREATE_IN_PROGRESS      		Etcdv3dot3i0EBS
+00:01:18	CREATE_IN_PROGRESS      		Etcdv3dot3i0EBS       	"Resource creation Initiated"
+00:01:33	CREATE_COMPLETE         		Etcdv3dot3i2EBS
+00:01:34	CREATE_COMPLETE         		Etcdv3dot3i1EBS
+00:01:34	CREATE_COMPLETE         		Etcdv3dot3i0EBS
+00:01:35	CREATE_IN_PROGRESS      		Etcdv3dot3i2
+00:01:36	CREATE_IN_PROGRESS      		Etcdv3dot3i1
+00:01:36	CREATE_IN_PROGRESS      		Etcdv3dot3i2          	"Resource creation Initiated"
+00:01:36	CREATE_IN_PROGRESS      		Etcdv3dot3i0
+00:01:36	CREATE_IN_PROGRESS      		Etcdv3dot3i1          	"Resource creation Initiated"
+00:01:37	CREATE_IN_PROGRESS      		Etcdv3dot3i0          	"Resource creation Initiated"
+00:03:59	CREATE_IN_PROGRESS      		Etcdv3dot3i2          	"Received SUCCESS signal with UniqueId i-0797f12fe4b3d5840"
+00:04:00	CREATE_COMPLETE         		Etcdv3dot3i2
+00:04:00	CREATE_IN_PROGRESS      		Etcdv3dot3i0          	"Received SUCCESS signal with UniqueId i-09caf63ec80dd3f37"
+00:04:01	CREATE_COMPLETE         		Etcdv3dot3i0
+00:04:39	CREATE_IN_PROGRESS      		Etcdv3dot3i1          	"Received SUCCESS signal with UniqueId i-02b447cadb74b4265"
+00:04:40	CREATE_COMPLETE         		Etcdv3dot3i1

kube-aws uses a new tag on the etcds kube-aws:etcd_upgrade_group to look for etcds in the same upgrade group as the requested deployment; if the cluster exists (i.e. cluster etcd stack exists) but there are no matching etcd_upgrade_group instances then kube-aws will trigger the migration mode. In migration mode the leader of the new etcd cluster exports the kubernetes registry from the existing/old etcds and restores them into the new cluster.

Features

Brings all new etcds up at same time during a migration.
When an etcd has an attached NIC use that address rather than the machine's private dnsname
Update Etcd to 3.3.17 release
Fix etcdadm so that it can still detect cluster healthy now written to stderr
Update etcd migration to respect keys with leases
Fix building etcd endpoints where the interfaces are listed in different orders
Move to a two export process for retrieving keys and values using etcdctl 'json' export type for key/value data, and then again using its 'fields' export type in order to successfully extract key/lease data. * Process the two files back together with a nod to performance.

…igration (copy of all kubernetes data) Simialr to the approach used during the stack migration but this time the major/minor version of etcd is used to control migration e.g 3.2.x -> 3.3.x will cause a migration. It is safer because should the CF roll fail the previous etcd's should still be available to fall-back to. Bring all new etcds up at same time during a migration. Correct looking up of configsets now that the instance name has changed. When an etcd has an attached NIC use that address rather than the machine's private dnsname Update Etcd to 3.3.17 release Fix etcdadm so that it can still detect cluster healthy now written to stderr Update etcd migration to respect keys with leases Fix building etcd endpoints where the interfaces are listed in different orders Move to a two export process for retrieving keys and values using etcdctl 'json' export type for key/value data, and then again using its 'fields' export type in order to successfully extract key/lease data. Process the two files back together with a nod to performance.

k8s-ci-robot · 2019-11-13T10:23:02Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign davidmccormick
You can assign the PR to them by writing /assign @davidmccormick in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

…ror if the default etcd version hadn't been correctly linked into the binary - which it isn't during testing).

codecov-io · 2019-11-13T12:39:32Z

Codecov Report

Merging #1773 into master will decrease coverage by 0.24%.
The diff coverage is 5.4%.

@@            Coverage Diff             @@
##           master    #1773      +/-   ##
==========================================
- Coverage   24.76%   24.52%   -0.25%     
==========================================
  Files          98       98              
  Lines        5023     5085      +62     
==========================================
+ Hits         1244     1247       +3     
- Misses       3640     3699      +59     
  Partials      139      139

Impacted Files	Coverage Δ
pkg/api/etcd_cluster.go	`18.51% <0%> (-7.8%)`	⬇️
pkg/model/context.go	`0% <0%> (ø)`	⬆️
pkg/api/etcd.go	`29.54% <0%> (-1.78%)`	⬇️
pkg/model/etcd_node.go	`6.95% <10%> (-0.26%)`	⬇️
pkg/model/etcd_cluster.go	`91.3% <100%> (+1.3%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 37d1ec7...8da1fcf. Read the comment docs.

dominicgunn

Couple of comments.

dominicgunn · 2019-11-14T16:02:41Z

pkg/model/context.go

+		if state.EtcdMigrationEnabled {
+			logger.Warn("Performing a Major Etcd Version Upgrade: -")
+			logger.Warn("To do this we will spin up new etcd servers and then export the existing kubernetes state to them.")
+			logger.Warn("There will be cluster disruption until all of your existing controllers have rolled.")


What is the window of disruption here? Are we talking service outages?

Good question - we are talking about a service outage of the kubernetes api of several minutes (good case) longer if a roll-back is triggered after api-servers have been rolled.

dominicgunn · 2019-11-14T16:05:35Z

pkg/model/context.go

+				Values: []*string{aws.String("owned")},
+			},
+			{
+				Name:   aws.String("tag:kube-aws:etcd_upgrade_group"),


Is this backward compatible with older clusters? I guess it'd just return 0 instances right?

Yes by returning 0 servers which match it knows it needs to perform a migration - so in effect even if we kept the version of etcd the same (3.2.26) it would still want to perform a migration the first time this code is merged and create Etcdv3dot2iX instances (removing the old Etcd[0-2] ones)

dominicgunn · 2019-11-14T16:06:26Z

pkg/model/context.go

+
+	logger.Debugf("<- received %d instances from AWS", len(resp.Reservations))
+	if len(resp.Reservations) == 0 {
+		logger.Debugf("There are 0 instances matching major-minor version %s - this is an upgrade...", c.Etcd.Cluster.MajorMinorVersion())


Looks like this catches the backwards compatibility issues, cool.

dominicgunn · 2019-11-14T16:09:15Z

pkg/model/context.go

+			logger.Warn("If the cloudformation update fails (at any point) then we will roll back to the original etcd servers.")
+			logger.Warn("You MAY lose/rollback changes that are made to the cluster AFTER the etcd export has been performed!")
+			logger.Warn("This operation is best scheduled for a quiet time or in an outage window.")
+			if state.EtcdMigrationExistingEndpoints, err = s.lookupExistingEtcdEndpoints(c); err != nil {


If we're going to bomb out on lookupExistingEtcdEndpoints, should we try and do that before telling the user that we're Performing a Major Etcd Upgrade?

As a layman, I might be concerned that something may have changed before the thing blew up.

Good idea - I've swapped it around so we announce the migration only if the lookup is successful.

…eeded.

dominicgunn · 2019-11-15T12:47:24Z

/lgtm

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Nov 13, 2019

davidmccormick requested a review from dominicgunn November 13, 2019 10:22

k8s-ci-robot requested a review from danielfm November 13, 2019 10:22

k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Nov 13, 2019

Fix tests by specifying etcdversion (I added some code to throw an er…

865fa09

…ror if the default etcd version hadn't been correctly linked into the binary - which it isn't during testing).

davidmccormick added this to the v0.15.0 milestone Nov 13, 2019

davidmccormick requested a review from omar-nahhas November 13, 2019 12:13

dominicgunn reviewed Nov 14, 2019

View reviewed changes

Only announce the migration if the lookup of etcd endpoints have succ…

8da1fcf

…eeded.

k8s-ci-robot assigned dominicgunn Nov 15, 2019

k8s-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 15, 2019

davidmccormick merged commit ea799be into kubernetes-retired:master Nov 16, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow major Etcd upgrades with safe roll-back #1773

Allow major Etcd upgrades with safe roll-back #1773

davidmccormick commented Nov 13, 2019 •

edited

k8s-ci-robot commented Nov 13, 2019

codecov-io commented Nov 13, 2019 •

edited

dominicgunn left a comment

dominicgunn Nov 14, 2019

davidmccormick Nov 14, 2019

dominicgunn Nov 14, 2019

davidmccormick Nov 14, 2019 •

edited

dominicgunn Nov 14, 2019

dominicgunn Nov 14, 2019

davidmccormick Nov 15, 2019

dominicgunn commented Nov 15, 2019

Allow major Etcd upgrades with safe roll-back #1773

Allow major Etcd upgrades with safe roll-back #1773

Conversation

davidmccormick commented Nov 13, 2019 • edited

k8s-ci-robot commented Nov 13, 2019

codecov-io commented Nov 13, 2019 • edited

Codecov Report

dominicgunn left a comment

Choose a reason for hiding this comment

dominicgunn Nov 14, 2019

Choose a reason for hiding this comment

davidmccormick Nov 14, 2019

Choose a reason for hiding this comment

dominicgunn Nov 14, 2019

Choose a reason for hiding this comment

davidmccormick Nov 14, 2019 • edited

Choose a reason for hiding this comment

dominicgunn Nov 14, 2019

Choose a reason for hiding this comment

dominicgunn Nov 14, 2019

Choose a reason for hiding this comment

davidmccormick Nov 15, 2019

Choose a reason for hiding this comment

dominicgunn commented Nov 15, 2019

davidmccormick commented Nov 13, 2019 •

edited

codecov-io commented Nov 13, 2019 •

edited

davidmccormick Nov 14, 2019 •

edited