
kubeadm etcd modifying recovery steps #56500

Merged
merged 1 commit into from Nov 29, 2017

Conversation

sbezverk
Contributor

Closes #56499

Modifying etcd recovery steps for the case of failed upgrade

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 28, 2017
@sbezverk
Contributor Author

/assign @luxas

@sbezverk sbezverk changed the title kubeadm etcd modifying recover steps kubeadm etcd modifying recovery steps Nov 28, 2017
@sbezverk
Contributor Author

/test pull-kubernetes-e2e-gce

@luxas luxas added this to the v1.9 milestone Nov 28, 2017
@luxas luxas added kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. status/approved-for-milestone and removed milestone/incomplete-labels labels Nov 28, 2017
Member

@luxas luxas left a comment


As discussed in the SIG meeting, we won't downgrade etcd (as etcd doesn't support that), so the line should look something like this instead:

return true, fmt.Errorf("the requested etcd version (%s) for Kubernetes v(%s) is lower than the currently running version (%s)", desiredEtcdVersion.String(), cfg.KubernetesVersion, currentEtcdVersion.String())
return false, fmt.Errorf("the requested etcd version (%s) for Kubernetes v(%s) is lower than the currently running version (%s). Proceeding with the Kubernetes downgrade but won't downgrade etcd", desiredEtcdVersion.String(), cfg.KubernetesVersion, currentEtcdVersion.String())
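For context, a rough sketch of where that line sits (the surrounding guard is an assumption based on the identifiers in the snippet above, not a quote of the actual file):

// Sketch (assumption): the guard comparing the etcd version requested for the
// target Kubernetes release against the version currently running.
if desiredEtcdVersion.LessThan(currentEtcdVersion) {
	// etcd does not support downgrades, so warn and keep the running etcd
	// while the Kubernetes downgrade itself proceeds.
	return false, fmt.Errorf("the requested etcd version (%s) for Kubernetes v(%s) is lower than the currently running version (%s). Proceeding with the Kubernetes downgrade but won't downgrade etcd",
		desiredEtcdVersion.String(), cfg.KubernetesVersion, currentEtcdVersion.String())
}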

Also, we concluded to not try to automatically restore the etcd data, so you should remove the rollbackEtcdData function.

// completed but api/controller-manager/scheduler experienced a problem as a result ALL manifests, etcd including
// would be rolled back. Currently downgrade for etcd is not working and this case needs to be prevented.
// rollbackOldManifests needs to be aware whether to restore etcd manifest or not.
// (TODO) re-evaluate etcd downgrade story.
Member


As discussed in the meeting, we'll just skip the etcd downgrade procedure when downgrading from v1.9 to v1.8

Contributor Author


done

@@ -127,6 +127,12 @@ func (spm *KubeStaticPodPathManager) BackupEtcdDir() string {
}

func upgradeComponent(component string, waiter apiclient.Waiter, pathMgr StaticPodPathManager, cfg *kubeadmapi.MasterConfiguration, beforePodHash string, recoverManifests map[string]string) error {
Member


nit: change the cfg parameter to just nodeName string, as that is the only thing used from cfg
I'd prefer if you parameterized the rollback function to call here instead of baking in this recoverEtcd logic... can you do that please?
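A minimal sketch of what that parameterization could look like (the rollbackFn type, the extra parameter, and this signature are illustrative assumptions, not the final code):

// Sketch (assumption): let the caller decide how to roll back instead of
// baking etcd-specific recovery into upgradeComponent.
type rollbackFn func(oldManifests map[string]string, origErr error, pathMgr StaticPodPathManager) error

func upgradeComponent(component string, waiter apiclient.Waiter, pathMgr StaticPodPathManager,
	nodeName string, beforePodHash string, recoverManifests map[string]string, rollback rollbackFn) error {
	// ... on failure, call the injected rollback; an etcd-specific variant can
	// restore the backed-up data dir before restoring the old manifest.
	return nil
}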

Contributor Author


Here is the thing: if a separate function is used for the etcd manifest rollback, then:

  1. We just end up duplicating the code.
  2. componentUpgrade will need to be changed and checked every time rollback is called, to see whether the normal or the etcd-specific rollback is needed. I think it will bring more confusion than gain from separating. Please let me know if you still want me to change it.

Member


ok, no need to change it

return true, fmt.Errorf("fatal error upgrading local etcd cluster: %v, the backup of etcd database is stored here:(%s)", err, backupEtcdDir)
}
// Since the etcd cluster came back up with the old manifest, it is not a fatal failure; in theory the upgrade can proceed further
return false, fmt.Errorf("non-fatal error upgrading local etcd cluster: %v", err)
Member


let's consider this a fatal error (quit the general upgrade, as the etcd version would be wrong if we can't upgrade it).
The purpose of the rollback and re-check here is to avoid disruption for the user (the upgrade failed but the state was preserved)

Member


s/non-fatal error upgrading local etcd cluster/fatal error when trying to upgrade the etcd cluster: %v. Rolled the state back to pre-upgrade state./

Contributor Author


Done, but then we will not have any non-fatal cases unless the complete upgrade is successful. Now the question is: why do we need the fatal/non-fatal return?

Member


the first case is non-fatal
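For reference, a rough sketch of how a caller might consume that fatal flag (the caller-side names and flow here are assumptions for illustration):

// Sketch (assumption): only a fatal etcd error aborts the rest of the
// control plane upgrade; a non-fatal one is reported and the upgrade continues.
fatal, err := upgradeEtcd(waiter, pathMgr, cfg, recoverManifests)
if err != nil {
	if fatal {
		return err // etcd could not be upgraded or restored; stop here
	}
	fmt.Printf("[upgrade/etcd] non-fatal issue during etcd upgrade: %v\n", err)
}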

// would be rolled back. Currently downgrade for etcd is not working and this case needs to be prevented.
// rollbackOldManifests needs to be aware whether to restore etcd manifest or not.
// (TODO) re-evaluate etcd downgrade story.
func rollbackOldManifests(oldManifests map[string]string, origErr error, pathMgr StaticPodPathManager, restoreEtcd bool) error {
Member


write your own rollback func for etcd instead of passing this extra flag, for better readability?

Contributor Author


See previous comment. With the way the upgrade is currently coded, having a separate rollback for etcd will make things more complex. Please reconsider, and ping me on Slack to discuss.

Member

@luxas luxas left a comment


/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 28, 2017
@k8s-github-robot k8s-github-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Nov 28, 2017
if _, err = etcdCluster.GetEtcdClusterStatus(); err != nil {
// Despite the fact that upgradeComponent was successful, there is something wrong with the etcd cluster
// First step is to rollback to the old etcd manifest
if err := rollbackOldManifests(recoverManifests, err, pathMgr, true); err != nil {
Member


why does this exist? If it started up, you can't roll it back safely without restoring the snapshot.

Contributor Author


Here I address the situation where upgradeComponent was successful, but upgradeComponent checks ONLY that the pod is running and that its sha has changed. This additional check verifies that the etcd process is actually responsive. If it is not, then we try to roll back.
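A compact sketch of the flow being described (the enclosing function boundaries and exact return values are assumptions; rollbackOldManifests and GetEtcdClusterStatus are taken from the diff above):

// Sketch (assumption): the pod hash changed, so upgradeComponent reported
// success, but we still confirm etcd answers before moving on.
if _, err := etcdCluster.GetEtcdClusterStatus(); err != nil {
	// etcd is not responsive: restore the old manifest (and backed-up data)
	// and treat the upgrade step as failed.
	if rollbackErr := rollbackOldManifests(recoverManifests, err, pathMgr, true); rollbackErr != nil {
		return true, fmt.Errorf("failed to roll back etcd manifest: %v", rollbackErr)
	}
	return true, fmt.Errorf("fatal error upgrading local etcd cluster: %v, the backup of etcd database is stored here:(%s)", err, backupEtcdDir)
}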

Member

@timothysc timothysc left a comment


see comment.

@timothysc
Member

When upgrading etcd we will need to do the following:

  • backup the etcd data (cp -r) and force a snapshot for good measure
  • roll the upgrade
  • if for any reason the upgrade is not smooth,
    • data must be restored
    • restore original manifests
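A rough sketch of that belt-and-suspenders flow (the helper name, etcdDataDir, and backupEtcdDir are illustrative assumptions; imports needed would be fmt and os/exec):

// Sketch (assumption): back up before rolling the upgrade, restore on failure.
func backupAndUpgradeEtcd(etcdDataDir, backupEtcdDir string) error {
	// 1. Copy the data dir (cp -r) and force a snapshot for good measure.
	if out, err := exec.Command("cp", "-r", etcdDataDir, backupEtcdDir).CombinedOutput(); err != nil {
		return fmt.Errorf("failed to back up etcd data: %v (%s)", err, out)
	}
	// 2. Roll the upgrade by swapping in the new static pod manifest.
	// 3. If anything is not smooth: restore the copied data dir first, then
	//    put the original manifests back so the old etcd starts on its old data.
	return nil
}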

@sbezverk
Contributor Author

@luxas @timothysc Gents, please sync up with each other; I am getting contradicting reviews. On one side we do not restore data, on the other we do. Could you please agree on one approach?

@timothysc
Member

@sbezverk sorry about that. We've been chatting on Slack about this; I'm a belt-and-suspenders person when it comes to etcd transitions... take all precautions.

@sbezverk
Contributor Author

@luxas @timothysc I will restore the rollbackEtcdData func and call it before each rollback.

Member

@timothysc timothysc left a comment


minor comments about the messages at the end of the block.

return true, fmt.Errorf("fatal error upgrading local etcd cluster: %v, the backup of etcd database is stored here:(%s)", err, backupEtcdDir)
}

return true, fmt.Errorf("fatal error upgrading local etcd cluster: %v, the backup of etcd database is stored here:(%s)", err, backupEtcdDir)
Member


Isn't this a successful rollback point here? Am I missing something?

Contributor Author


You are right, fixed.

return true, fmt.Errorf("fatal error upgrading local etcd cluster: %v, the backup of etcd database is stored here:(%s)", err, backupEtcdDir)
}

return true, fmt.Errorf("fatal error upgrading local etcd cluster: %v, the backup of etcd database is stored here:(%s)", err, backupEtcdDir)
Member


same comment.

Contributor Author


fixed

Member

@timothysc timothysc left a comment


/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 29, 2017
@k8s-github-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: luxas, sbezverk, timothysc

Associated issue: 56499

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot

[MILESTONENOTIFIER] Milestone Pull Request Current

@dmmcquay @fabriziopandini @luxas @sbezverk @timothysc

Note: This pull request is marked as priority/critical-urgent, and must be updated every 1 day during code freeze.

Example update:

ACK.  In progress
ETA: DD/MM/YYYY
Risks: Complicated fix required
Pull Request Labels
  • sig/cluster-lifecycle: Pull Request will be escalated to these SIGs if needed.
  • priority/critical-urgent: Never automatically move pull request out of a release milestone; continually escalate to contributor and SIG through all available channels.
  • kind/bug: Fixes a bug discovered during the current release.
Help

@k8s-github-robot

Automatic merge from submit-queue (batch tested with PRs 56497, 56500, 55018, 56544, 56425). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit b86569f into kubernetes:master Nov 29, 2017