(5.5) Check last upgrade state when creating new upgrade operation #1731
Conversation
```go
	return nil
}

func (g *operationGroup) checkLastOperation(operation ops.SiteOperation) error {
```
Made this more or less generic on purpose, so we can potentially check for other operations too, not just last upgrade.
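A minimal sketch of what such a generic check could look like (the types, state names, and error message below are illustrative assumptions, not Gravity's actual definitions):

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-in for Gravity's operation type; the field and state
// names here are assumptions made for illustration.
type SiteOperation struct {
	Type  string
	State string
}

const (
	OperationUpdate         = "operation_update"
	OperationStateCompleted = "completed"
)

var errUnfinished = errors.New("last operation has not been finalized; resume or roll it back, or pass --force to override")

// checkLastOperation refuses to create a new operation while the most
// recent operation of the same type has not reached a terminal state.
// Keeping the check generic lets it guard other operation types too,
// not just upgrades.
func checkLastOperation(last *SiteOperation) error {
	if last == nil {
		// No previous operation of this type: nothing to check.
		return nil
	}
	if last.State == OperationStateCompleted {
		return nil
	}
	return fmt.Errorf("%v: %w", last.Type, errUnfinished)
}

func main() {
	fmt.Println(checkLastOperation(nil) == nil) // true: no prior operation
	unfinished := &SiteOperation{Type: OperationUpdate, State: "in_progress"}
	fmt.Println(checkLastOperation(unfinished) == nil) // false: must resolve it first
}
```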
```diff
-func GetLastUpdateOperation(siteKey SiteKey, operator Operator) (*SiteOperation, error) {
-	lastOperation, _, err := GetLastOperation(siteKey, operator)
+// GetLastUpgradeOperation returns the most recent upgrade operation or NotFound.
+func GetLastUpgradeOperation(key SiteKey, operator Operator) (*SiteOperation, error) {
```
This function wasn't actually used so it's ok that its logic changed.
I'm curious, why this approach instead of, say, seeing that we're already upgrading and just resuming the previous upgrade attempt?
@knisbet I think it might be confusing to the user if the command previously used to create a new upgrade operation now changed its behavior to resuming the last operation in certain cases. An explicit warning plus the ability to resume seems clearer. On the other hand, if what you suggest were the only mode of operation, it might be complicated to back out of an upgrade that cannot be resumed (i.e. when one of the nodes has failed permanently).
Yes, exactly. This is to catch the scenario where someone started an upgrade, it failed and was abandoned or forgotten, the cluster state was then reset, and later they upload a new version and try to upgrade again. The warning will remind them about the incomplete upgrade and prompt them to either finish the last one or roll it back. The decision will be explicit.
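The decision flow described above might be sketched like this (the names, states, and message text are hypothetical; the real check lives in the operation group and consults the operator):

```go
package main

import "fmt"

// Illustrative stand-in for Gravity's operation type.
type SiteOperation struct {
	ID    string
	State string
}

// checkCanCreateUpgrade blocks a new upgrade while a previous one is
// unfinished, unless the user explicitly bypasses the check with --force.
// Returning an error (instead of silently resuming) keeps the decision
// to finish or roll back the old upgrade explicit.
func checkCanCreateUpgrade(last *SiteOperation, force bool) error {
	if last == nil || last.State == "completed" {
		// No prior upgrade, or it finished cleanly: safe to proceed.
		return nil
	}
	if force {
		// The user acknowledged the risk with --force.
		return nil
	}
	return fmt.Errorf("found an unfinished upgrade operation %v: resume it or roll it back, or pass --force to start a new one anyway", last.ID)
}

func main() {
	stale := &SiteOperation{ID: "op-1", State: "failed"}
	fmt.Println(checkCanCreateUpgrade(stale, false) != nil) // true: blocked
	fmt.Println(checkCanCreateUpgrade(stale, true))         // <nil>: bypassed
}
```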
Force-pushed from 59db11c to 911683b.
I didn't say this should work if someone is trying to upgrade to a different version. But what we do see users do is run upgrade, see it fail, and then run upgrade again, without knowing what to do to resume the upgrade, inspect the plan, etc.
Force-pushed from 911683b to 067a73e.
Description
This PR adds more sanity checks when creating a new upgrade operation.
Previously, we only checked for "active" cluster state, which can pose a problem if somebody abandons an upgrade operation midway and manually resets the cluster state using the internal `status-reset` command (which, unfortunately, support folks don't hesitate to use; more on that below too). The additional checks (which can be bypassed with the `--force` flag) are:

Also, I updated the `gravity status-reset` command to display a large warning message explaining that the command can potentially lead to an inconsistent cluster state and to confirm the user's intent. The warning can be bypassed by providing the `--confirm` flag.

Type of change
Linked tickets and other PRs
TODOs
Testing done