
kubeadm etcd modifying recovery steps #56500

Merged
merged 1 commit into from Nov 29, 2017

Conversation

sbezverk
Contributor

Closes #56499

Modifying etcd recovery steps for the case of failed upgrade

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 28, 2017
@sbezverk
Contributor Author

/assign @luxas

@sbezverk sbezverk changed the title kubeadm etcd modifying recover steps kubeadm etcd modifying recovery steps Nov 28, 2017
@sbezverk
Contributor Author

/test pull-kubernetes-e2e-gce

@luxas luxas added this to the v1.9 milestone Nov 28, 2017
@luxas luxas added kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. status/approved-for-milestone and removed milestone/incomplete-labels labels Nov 28, 2017
Member

@luxas luxas left a comment


As discussed in the SIG meeting, we won't downgrade etcd (as etcd doesn't support that), so the line should look something like this instead:

return true, fmt.Errorf("the requested etcd version (%s) for Kubernetes v(%s) is lower than the currently running version (%s)", desiredEtcdVersion.String(), cfg.KubernetesVersion, currentEtcdVersion.String())
return false, fmt.Errorf("the requested etcd version (%s) for Kubernetes v(%s) is lower than the currently running version (%s). Proceeding with the Kubernetes downgrade but won't downgrade etcd", desiredEtcdVersion.String(), cfg.KubernetesVersion, currentEtcdVersion.String())
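For context, a rough sketch of where that line sits (the surrounding guard is an assumption based on the identifiers in the snippet above, not a quote of the actual file):

// Sketch (assumption): the guard comparing the etcd version requested for the
// target Kubernetes release against the version currently running.
if desiredEtcdVersion.LessThan(currentEtcdVersion) {
	// etcd does not support downgrades, so warn and keep the running etcd
	// while the Kubernetes downgrade itself proceeds.
	return false, fmt.Errorf("the requested etcd version (%s) for Kubernetes v(%s) is lower than the currently running version (%s). Proceeding with the Kubernetes downgrade but won't downgrade etcd",
		desiredEtcdVersion.String(), cfg.KubernetesVersion, currentEtcdVersion.String())
}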

Also, we concluded to not try to automatically restore the etcd data, so you should remove the rollbackEtcdData function.

// completed but api/controller-manager/scheduler experienced a problem as a result ALL manifests, etcd including
// would be rolled back. Currently downgrade for etcd is not working and this case needs to be prevented.
// rollbackOldManifests needs to be aware whether to restore etcd manifest or not.
// (TODO) re-evaluate etcd downgrade story.
Member


As discussed in the meeting, we'll just skip the etcd downgrade procedure when downgrading from v1.9 to v1.8

Contributor Author


done

@@ -127,6 +127,12 @@ func (spm *KubeStaticPodPathManager) BackupEtcdDir() string {
}

func upgradeComponent(component string, waiter apiclient.Waiter, pathMgr StaticPodPathManager, cfg *kubeadmapi.MasterConfiguration, beforePodHash string, recoverManifests map[string]string) error {
Member


nit: change the cfg parameter to just nodeName string, as that is the only thing used from cfg
I'd prefer if you parameterized the rollback function to call here instead of baking in this recoverEtcd logic... can you do that please?
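A minimal sketch of what that parameterization could look like (the rollbackFn type, the extra parameter, and this signature are illustrative assumptions, not the final code):

// Sketch (assumption): let the caller decide how to roll back instead of
// baking etcd-specific recovery into upgradeComponent.
type rollbackFn func(oldManifests map[string]string, origErr error, pathMgr StaticPodPathManager) error

func upgradeComponent(component string, waiter apiclient.Waiter, pathMgr StaticPodPathManager,
	nodeName string, beforePodHash string, recoverManifests map[string]string, rollback rollbackFn) error {
	// ... on failure, call the injected rollback; an etcd-specific variant can
	// restore the backed-up data dir before restoring the old manifest.
	return nil
}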

Contributor Author


Here is the thing: if a separate function is used for the etcd manifest rollback, then:

  1. We just end up duplicating the code.
  2. componentUpgrade will need to be changed and checked every time rollback is called, to see whether the normal or the etcd-specific rollback is needed. I think it will bring more confusion than gain from separating. Please let me know if you still want me to change it.

Member


ok, no need to change it

return true, fmt.Errorf("fatal error upgrading local etcd cluster: %v, the backup of etcd database is stored here:(%s)", err, backupEtcdDir)
}
// Since the etcd cluster came back up with the old manifest, it is not a fatal failure; in theory the upgrade can proceed further
return false, fmt.Errorf("non-fatal error upgrading local etcd cluster: %v", err)
Member


let's consider this a fatal error (quit the general upgrade, as the etcd version would be wrong if we can't upgrade it).
The purpose of the rollback and re-check here is to avoid disruption for the user (the upgrade failed but the state was preserved)

Member


s/non-fatal error upgrading local etcd cluster/fatal error when trying to upgrade the etcd cluster: %v. Rolled the state back to pre-upgrade state./

Contributor Author


Done, but then we will not have any non-fatal cases unless the complete upgrade is successful. Now the question is: why do we need the fatal/non-fatal return?

Member


the first case is non-fatal
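For reference, a rough sketch of how a caller might consume that fatal flag (the caller-side names and flow here are assumptions for illustration):

// Sketch (assumption): only a fatal etcd error aborts the rest of the
// control plane upgrade; a non-fatal one is reported and the upgrade continues.
fatal, err := upgradeEtcd(waiter, pathMgr, cfg, recoverManifests)
if err != nil {
	if fatal {
		return err // etcd could not be upgraded or restored; stop here
	}
	fmt.Printf("[upgrade/etcd] non-fatal issue during etcd upgrade: %v\n", err)
}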

// would be rolled back. Currently downgrade for etcd is not working and this case needs to be prevented.
// rollbackOldManifests needs to be aware whether to restore etcd manifest or not.
// (TODO) re-evaluate etcd downgrade story.
func rollbackOldManifests(oldManifests map[string]string, origErr error, pathMgr StaticPodPathManager, restoreEtcd bool) error {
Member


write your own rollback func for etcd instead of passing this extra flag, for better readability?

Contributor Author


See previous comment. With the way the upgrade is currently coded, having a separate rollback for etcd will make things more complex. Please reconsider, and ping me on Slack to discuss.

Member

@luxas luxas left a comment


/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 28, 2017
@k8s-github-robot k8s-github-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Nov 28, 2017
if _, err = etcdCluster.GetEtcdClusterStatus(); err != nil {
// Despite the fact that upgradeComponent was successful, there is something wrong with the etcd cluster
// First step is to rollback to the old etcd manifest
if err := rollbackOldManifests(recoverManifests, err, pathMgr, true); err != nil {
Member


why does this exist? If it started up, you can't roll it back safely without restoring the snapshot.

Contributor Author


Here I address the situation where upgradeComponent was successful, but upgradeComponent checks ONLY that the pod is running and that its sha has changed. This additional check verifies that the etcd process is actually responsive. If it is not, then we try to roll back.
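A compact sketch of the flow being described (the enclosing function boundaries and exact return values are assumptions; rollbackOldManifests and GetEtcdClusterStatus are taken from the diff above):

// Sketch (assumption): the pod hash changed, so upgradeComponent reported
// success, but we still confirm etcd answers before moving on.
if _, err := etcdCluster.GetEtcdClusterStatus(); err != nil {
	// etcd is not responsive: restore the old manifest (and backed-up data)
	// and treat the upgrade step as failed.
	if rollbackErr := rollbackOldManifests(recoverManifests, err, pathMgr, true); rollbackErr != nil {
		return true, fmt.Errorf("failed to roll back etcd manifest: %v", rollbackErr)
	}
	return true, fmt.Errorf("fatal error upgrading local etcd cluster: %v, the backup of etcd database is stored here:(%s)", err, backupEtcdDir)
}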

Member

@timothysc timothysc left a comment


see comment.

@timothysc
Member

When upgrading etcd we will need to do the following:

  • backup the etcd data (cp -r) and force a snapshot for good measure
  • roll the upgrade
  • if for any reason the upgrade is not smooth,
    • data must be restored
    • restore original manifests
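A rough sketch of that belt-and-suspenders flow (the helper name, etcdDataDir, and backupEtcdDir are illustrative assumptions; imports needed would be fmt and os/exec):

// Sketch (assumption): back up before rolling the upgrade, restore on failure.
func backupAndUpgradeEtcd(etcdDataDir, backupEtcdDir string) error {
	// 1. Copy the data dir (cp -r) and force a snapshot for good measure.
	if out, err := exec.Command("cp", "-r", etcdDataDir, backupEtcdDir).CombinedOutput(); err != nil {
		return fmt.Errorf("failed to back up etcd data: %v (%s)", err, out)
	}
	// 2. Roll the upgrade by swapping in the new static pod manifest.
	// 3. If anything is not smooth: restore the copied data dir first, then
	//    put the original manifests back so the old etcd starts on its old data.
	return nil
}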

@sbezverk
Contributor Author

@luxas @timothysc Gents, please sync up with each other; I am getting contradicting reviews. On one side we do not restore data, on the other we do. Could you please agree on one approach?

@timothysc
Member

@sbezverk sorry about that. We've been chatting on Slack about this; I'm a belt-and-suspenders person when it comes to etcd transitions... take all precautions.

@sbezverk
Contributor Author

@luxas @timothysc I will restore the rollbackEtcdData func and call it before each rollback.

Member

@timothysc timothysc left a comment


minor comments about the messages at the end of the block.

return true, fmt.Errorf("fatal error upgrading local etcd cluster: %v, the backup of etcd database is stored here:(%s)", err, backupEtcdDir)
}

return true, fmt.Errorf("fatal error upgrading local etcd cluster: %v, the backup of etcd database is stored here:(%s)", err, backupEtcdDir)
Member


Isn't this a successful rollback point here? Am I missing something?

Contributor Author


You are right, fixed.

return true, fmt.Errorf("fatal error upgrading local etcd cluster: %v, the backup of etcd database is stored here:(%s)", err, backupEtcdDir)
}

return true, fmt.Errorf("fatal error upgrading local etcd cluster: %v, the backup of etcd database is stored here:(%s)", err, backupEtcdDir)
Member


same comment.

Contributor Author


fixed

Member

@timothysc timothysc left a comment


/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 29, 2017
@k8s-github-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: luxas, sbezverk, timothysc

Associated issue: 56499

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot

[MILESTONENOTIFIER] Milestone Pull Request Current

@dmmcquay @fabriziopandini @luxas @sbezverk @timothysc

Note: This pull request is marked as priority/critical-urgent, and must be updated every 1 day during code freeze.

Example update:

ACK.  In progress
ETA: DD/MM/YYYY
Risks: Complicated fix required
Pull Request Labels
  • sig/cluster-lifecycle: Pull Request will be escalated to these SIGs if needed.
  • priority/critical-urgent: Never automatically move pull request out of a release milestone; continually escalate to contributor and SIG through all available channels.
  • kind/bug: Fixes a bug discovered during the current release.
Help

@k8s-github-robot

Automatic merge from submit-queue (batch tested with PRs 56497, 56500, 55018, 56544, 56425). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit b86569f into kubernetes:master Nov 29, 2017