kubeadm etcd modifying recovery steps #56500
Conversation
/assign @luxas
/test pull-kubernetes-e2e-gce
As discussed in the SIG meeting, we won't downgrade etcd (as etcd doesn't support that), so instead of this line:

return true, fmt.Errorf("the requested etcd version (%s) for Kubernetes v(%s) is lower than the currently running version (%s)", desiredEtcdVersion.String(), cfg.KubernetesVersion, currentEtcdVersion.String())

it should look something like this:

return false, fmt.Errorf("the requested etcd version (%s) for Kubernetes v(%s) is lower than the currently running version (%s). Proceeding with the Kubernetes downgrade but won't downgrade etcd", desiredEtcdVersion.String(), cfg.KubernetesVersion, currentEtcdVersion.String())

Also, we concluded not to try to automatically restore the etcd data, so you should remove the rollbackEtcdData function.
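The suggested change above can be sketched as a small standalone check. This is a hedged illustration, not kubeadm's actual code: checkRequestedEtcdVersion and the hand-rolled compareVersions are placeholder names (kubeadm uses a proper version library), and the bool result models the fatal/non-fatal distinction debated later in this thread.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// compareVersions does a minimal major.minor.patch comparison, returning
// -1, 0, or 1. Only enough for this sketch; real code should use a
// dedicated semver package.
func compareVersions(a, b string) int {
	as, bs := strings.Split(a, "."), strings.Split(b, ".")
	for i := 0; i < 3; i++ {
		ai, bi := 0, 0
		if i < len(as) {
			ai, _ = strconv.Atoi(as[i])
		}
		if i < len(bs) {
			bi, _ = strconv.Atoi(bs[i])
		}
		if ai != bi {
			if ai < bi {
				return -1
			}
			return 1
		}
	}
	return 0
}

// checkRequestedEtcdVersion mirrors the suggested behavior: a requested etcd
// version lower than the running one yields a NON-fatal error (false), so the
// Kubernetes downgrade proceeds without downgrading etcd.
func checkRequestedEtcdVersion(desired, current, k8sVersion string) (fatal bool, err error) {
	if compareVersions(desired, current) < 0 {
		return false, fmt.Errorf("the requested etcd version (%s) for Kubernetes v(%s) is lower than the currently running version (%s); proceeding with the Kubernetes downgrade but won't downgrade etcd", desired, k8sVersion, current)
	}
	return false, nil
}

func main() {
	fatal, err := checkRequestedEtcdVersion("3.0.17", "3.1.10", "1.8.6")
	fmt.Println(fatal) // false: the caller keeps going despite the error
	fmt.Println(err != nil)
}
```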
// completed but api/controller-manager/scheduler experienced a problem as a result ALL manifests, etcd including
// would be rolled back. Currently downgrade for etcd is not working and this case needs to be prevented.
// rollbackOldManifests needs to be aware whether to restore etcd manifest or not.
// (TODO) re-evaluate etcd downgrade story.
As discussed in the meeting, we'll just skip the etcd downgrade procedure when downgrading from v1.9 to v1.8
done
@@ -127,6 +127,12 @@ func (spm *KubeStaticPodPathManager) BackupEtcdDir() string {
}

func upgradeComponent(component string, waiter apiclient.Waiter, pathMgr StaticPodPathManager, cfg *kubeadmapi.MasterConfiguration, beforePodHash string, recoverManifests map[string]string) error {
nit: change the cfg parameter to just nodeName string, as that is the only thing used in cfg.

I'd prefer if you parameterized the rollback function to call here instead of baking in this recoverEtcd logic... can you do that please?
Here is the thing: if a separate function is used for the etcd manifest rollback, then:
- We just end up duplicating the code.
- componentUpgrade would need to be changed, and every time rollback is called we'd have to check which rollback to invoke, normal or etcd-specific. I think it would bring more confusion than gain from separating. Please let me know if you still want me to change it.
ok, no need to change it
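For reference, the alternative the reviewer floated, passing the rollback behavior into upgradeComponent instead of baking in etcd-specific recovery, would look roughly like this. All names below are hypothetical; the PR kept the single shared rollback with a flag instead.

```go
package main

import "fmt"

// rollbackFunc is the shape a parameterized rollback could take: the caller
// injects the recovery behavior instead of upgradeComponent hard-coding it.
type rollbackFunc func(recoverManifests map[string]string, cause error) error

// upgradeComponentSketch simulates an upgrade; on failure it delegates
// recovery to whatever rollback the caller supplied.
func upgradeComponentSketch(component string, shouldFail bool, rollback rollbackFunc) error {
	recoverManifests := map[string]string{
		component: "/etc/kubernetes/tmp/" + component + ".yaml",
	}
	if shouldFail {
		return rollback(recoverManifests, fmt.Errorf("simulated upgrade failure for %s", component))
	}
	return nil
}

func main() {
	err := upgradeComponentSketch("etcd", true, func(m map[string]string, cause error) error {
		for c, p := range m {
			fmt.Printf("restoring %s from %s after: %v\n", c, p, cause)
		}
		return nil
	})
	fmt.Println(err)
}
```

The trade-off discussed in the thread is visible here: every call site must now decide which rollbackFunc to pass, which is exactly the extra bookkeeping sbezverk objected to.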
return true, fmt.Errorf("fatal error upgrading local etcd cluster: %v, the backup of etcd database is stored here:(%s)", err, backupEtcdDir)
}
// Since the etcd cluster came back up with the old manifest, it is not a fatal failure; in theory the upgrade can proceed further
return false, fmt.Errorf("non-fatal error upgrading local etcd cluster: %v", err)
let's consider this a fatal error (quit the general upgrade as the etcd version would be wrong if we can't upgrade it)
The purpose of the rollback here and re-check is to avoid disruption for the user (the upgrade failed but the state was preserved)
s/non-fatal error upgrading local etcd cluster/fatal error when trying to upgrade the etcd cluster: %v. Rolled the state back to pre-upgrade state./
done, but then we will not have any non-fatal cases unless the complete upgrade is successful. Now the question is: why do we need the fatal/non-fatal return at all?
the first case is non-fatal
// would be rolled back. Currently downgrade for etcd is not working and this case needs to be prevented.
// rollbackOldManifests needs to be aware whether to restore etcd manifest or not.
// (TODO) re-evaluate etcd downgrade story.
func rollbackOldManifests(oldManifests map[string]string, origErr error, pathMgr StaticPodPathManager, restoreEtcd bool) error {
Write your own rollback func for etcd instead of passing this extra flag, for better readability?
See previous comment. With the current way the upgrade is coded, having a separate rollback for etcd will make things more complex. Please reconsider, and ping me on Slack to discuss.
/lgtm
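The flag-based design that survived the review can be sketched like this. rollbackOldManifestsSketch is a hypothetical stand-in that only models which manifests get restored, not the actual file moves the PR performs.

```go
package main

import (
	"fmt"
	"sort"
)

// rollbackOldManifestsSketch models the kept design: one rollback routine
// for all components, with restoreEtcd deciding whether the etcd manifest
// is restored too. It returns the components it would restore, sorted for
// deterministic output.
func rollbackOldManifestsSketch(oldManifests map[string]string, restoreEtcd bool) []string {
	var restored []string
	for component := range oldManifests {
		if component == "etcd" && !restoreEtcd {
			continue // skip etcd: downgrading it is not supported
		}
		restored = append(restored, component)
	}
	sort.Strings(restored)
	return restored
}

func main() {
	backups := map[string]string{
		"etcd":           "/etc/kubernetes/tmp/etcd.yaml",
		"kube-apiserver": "/etc/kubernetes/tmp/kube-apiserver.yaml",
	}
	fmt.Println(rollbackOldManifestsSketch(backups, false)) // [kube-apiserver]
	fmt.Println(rollbackOldManifestsSketch(backups, true))  // [etcd kube-apiserver]
}
```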
if _, err = etcdCluster.GetEtcdClusterStatus(); err != nil {
// Despite the fact that upgradeComponent was successful, there is something wrong with the etcd cluster
// First step is to roll back to the old etcd manifest
if err := rollbackOldManifests(recoverManifests, err, pathMgr, true); err != nil {
why does this exist? If it started up, you can't roll it back safely without restoring the snapshot.
Here I address the situation where upgradeComponent was successful, but upgradeComponent checks ONLY whether the Pod is running and its sha has changed. This additional check verifies that the etcd process is responsive. If it is not, then we try to roll back.
see comment.
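The extra responsiveness check being debated can be sketched as follows. The interface and the stub client are illustrative only; the real code calls GetEtcdClusterStatus on kubeadm's etcd client and rolls back via rollbackOldManifests.

```go
package main

import "fmt"

// etcdStatusClient stands in for kubeadm's etcd client; only the one probe
// this check needs is modeled.
type etcdStatusClient interface {
	GetEtcdClusterStatus() error
}

// downCluster simulates an etcd that came up as a Pod but does not answer.
type downCluster struct{}

func (downCluster) GetEtcdClusterStatus() error {
	return fmt.Errorf("context deadline exceeded")
}

// verifyEtcdAfterUpgrade sketches the logic above: the static Pod may look
// upgraded (running, new hash), yet etcd itself can be unresponsive, so we
// probe the cluster and roll back the old manifest if the probe fails.
func verifyEtcdAfterUpgrade(c etcdStatusClient, rollback func() error) error {
	if err := c.GetEtcdClusterStatus(); err != nil {
		if rbErr := rollback(); rbErr != nil {
			return fmt.Errorf("fatal error: etcd unresponsive (%v) and rollback failed: %v", err, rbErr)
		}
		return fmt.Errorf("etcd unresponsive after upgrade, rolled back old manifest: %v", err)
	}
	return nil
}

func main() {
	err := verifyEtcdAfterUpgrade(downCluster{}, func() error {
		fmt.Println("restoring pre-upgrade etcd manifest and data backup")
		return nil
	})
	fmt.Println(err)
}
```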
When upgrading etcd we will need to do the following:
@luxas @timothysc Gents, please sync up with each other, I am getting contradicting reviews. From one side we do not restore data, from the other we do. Could you please agree on one?
@sbezverk sorry about that. We've been chatting on Slack about this. I'm a belt-and-suspenders person when it comes to etcd transitions... take all precautions.
@luxas @timothysc I will restore the etcd data-restore func and call it before each rollback.
minor comments about the messages at the end of the block.
return true, fmt.Errorf("fatal error upgrading local etcd cluster: %v, the backup of etcd database is stored here:(%s)", err, backupEtcdDir)
}

return true, fmt.Errorf("fatal error upgrading local etcd cluster: %v, the backup of etcd database is stored here:(%s)", err, backupEtcdDir)
Isn't this a successful rollback point here? Am I missing something?
you are right, fixed.
return true, fmt.Errorf("fatal error upgrading local etcd cluster: %v, the backup of etcd database is stored here:(%s)", err, backupEtcdDir)
}

return true, fmt.Errorf("fatal error upgrading local etcd cluster: %v, the backup of etcd database is stored here:(%s)", err, backupEtcdDir)
same comment.
fixed
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: luxas, sbezverk, timothysc. Associated issue: 56499. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS files.
[MILESTONENOTIFIER] Milestone Pull Request: Current. @dmmcquay @fabriziopandini @luxas @sbezverk @timothysc
Automatic merge from submit-queue (batch tested with PRs 56497, 56500, 55018, 56544, 56425). If you want to cherry-pick this change to another branch, please follow the instructions here.
Closes #56499