This repository has been archived by the owner on Feb 9, 2024. It is now read-only.

(5.5) Check last upgrade state when creating new upgrade operation #1731

Merged
merged 3 commits on Jun 23, 2020

Conversation

@r0mant (Contributor) commented on Jun 18, 2020

Description

This PR adds more sanity checks when creating a new upgrade operation.

Previously, we only checked for the "active" cluster state, which can pose a problem if somebody abandons an upgrade operation midway and manually resets the cluster state using the internal status-reset command (which, unfortunately, support folks don't hesitate to use; more on that below).

The additional checks (which can be bypassed with --force flag) are:

  • That previous upgrade operation (if any) successfully completed, or
  • That previous upgrade operation failed and its plan was fully rolled back.

Also, I updated the gravity status-reset command to display a large warning message explaining that the command can potentially lead to an inconsistent cluster state, and to require the user to confirm their intent. The warning can be bypassed by providing the --confirm flag.
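
For illustration, here is a minimal, self-contained sketch of the decision logic described above. The types, field names, and helper are simplified stand-ins for this description only, not the actual gravity/lib/ops API:

package main

import (
	"errors"
	"fmt"
)

// operation is a simplified stand-in for gravity's operation record;
// the real ops.SiteOperation has a different shape.
type operation struct {
	ID         string
	Type       string // e.g. "operation_update"
	Completed  bool   // finished successfully
	Failed     bool   // ended in a failed state
	RolledBack bool   // failed plan was fully rolled back
}

// checkLastUpgrade mirrors the checks described above: a new upgrade is
// allowed only if the previous upgrade completed, or failed and was fully
// rolled back. The force flag bypasses the check, like --force on the CLI.
func checkLastUpgrade(last *operation, force bool) error {
	if force || last == nil {
		return nil // no previous upgrade, or the operator overrode the check
	}
	switch {
	case last.Completed:
		return nil
	case last.Failed && last.RolledBack:
		return nil
	case last.Failed:
		return fmt.Errorf("the last %v operation (%v) is in a failed state but its plan isn't fully rolled back", last.Type, last.ID)
	default:
		return fmt.Errorf("the last %v operation (%v) is still in progress", last.Type, last.ID)
	}
}

func main() {
	inProgress := &operation{ID: "c4433ee8", Type: "operation_update"}
	fmt.Println(checkLastUpgrade(inProgress, false)) // blocked: still in progress
	fmt.Println(checkLastUpgrade(inProgress, true))  // allowed: check bypassed
	_ = errors.New                                   // errors kept for the sketches below
}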

Type of change

  • Internal change (not necessarily a bug fix or a new feature)
  • This change has a user-facing impact

Linked tickets and other PRs

TODOs

  • Self-review the change
  • Write tests
  • Perform manual testing
  • Address review feedback

Testing done

# Start operation in manual mode.
ubuntu@node-1:~/upgrade$ sudo ./gravity upgrade --manual
Wed Jun 17 23:31:58 UTC	Upgrading cluster from 5.5.48-dev.8 to 5.5.49-dev.5

# Attempt to start next one right away fails.
ubuntu@node-1:~/upgrade$ sudo ./gravity upgrade --manual
Wed Jun 17 23:32:06 UTC	Upgrading cluster from 5.5.48-dev.8 to 5.5.49-dev.5
[ERROR]: Upgrade operation can only be triggered for active clusters. This cluster is currently updating.
Use gravity status to see the cluster status and make sure that the cluster is active and healthy before retrying.

# Force-set active state and retry - also fails since the first operation is in progress.
ubuntu@node-1:~/upgrade$ sudo ./gravity status-reset --confirm
Cluster test state has been set to active
ubuntu@node-1:~/upgrade$ sudo ./gravity upgrade --manual
Wed Jun 17 23:32:24 UTC	Upgrading cluster from 5.5.48-dev.8 to 5.5.49-dev.5
[ERROR]: The last operation_update operation (c4433ee8-2f72-4499-b112-79fe02765844) is still in progress.
Use gravity plan to view the operation plan, and either resume or rollback it before attempting to start another operation.
You can provide --force flag to override this check.

# Complete first operation - next one should launch fine.
ubuntu@node-1:~/upgrade$ sudo ./gravity plan complete --operation-id=c4433ee8-2f72-4499-b112-79fe02765844
ubuntu@node-1:~/upgrade$ sudo ./gravity upgrade --manual
Wed Jun 17 23:32:59 UTC	Upgrading cluster from 5.5.48-dev.8 to 5.5.49-dev.5

# Execute a few phases, make one of them fail
ubuntu@node-1:~/upgrade$ sudo ./gravity plan execute --phase=/init
ubuntu@node-1:~/upgrade$ sudo ./gravity plan execute --phase=/bootstrap
Wed Jun 17 23:34:48 UTC	Executing "/bootstrap/node-1" locally
Wed Jun 17 23:34:49 UTC	Executing phase "/bootstrap" finished in 1 second
[ERROR]: rename /var/lib/gravity/site/update/394376198 /var/lib/gravity/site/update/gravity: file exists

# Complete the plan - next attempt should still fail since plan isn't rolled back.
ubuntu@node-1:~/upgrade$ sudo ./gravity plan complete --operation-id=2e28ade7-b450-47a3-a61a-b12720024cef
ubuntu@node-1:~/upgrade$ sudo ./gravity upgrade --manual
Wed Jun 17 23:35:29 UTC	Upgrading cluster from 5.5.48-dev.8 to 5.5.49-dev.5
[ERROR]: The last operation_update operation (2e28ade7-b450-47a3-a61a-b12720024cef) is in a failed state but its operation plan isn't fully rolled back.
Use gravity plan to view the operation plan, and either resume or rollback it before attempting to start another operation.
You can provide --force flag to override this check.

# Make sure the check can be forced.
ubuntu@node-1:~/upgrade$ sudo ./gravity upgrade --manual --force
Thu Jun 18 01:02:48 UTC	Upgrading cluster from 5.5.48-dev.8 to 5.5.49-dev.5

@r0mant r0mant requested a review from a team June 18, 2020 01:03
@r0mant r0mant self-assigned this Jun 18, 2020
return nil
}

func (g *operationGroup) checkLastOperation(operation ops.SiteOperation) error {
@r0mant (Contributor, author) commented:
Made this more or less generic on purpose, so we can potentially check for other operations too, not just the last upgrade.
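
A rough, purely illustrative sketch of how such a generic check could be reused for other operation types, building on the simplified types and checkLastUpgrade helper from the sketch in the description (none of this is the actual operationGroup code):

// Illustrative only: a check keyed by operation type could guard
// expand/shrink the same way it guards upgrades.
func checkLastOperationOfType(history []operation, opType string, force bool) error {
	for i := len(history) - 1; i >= 0; i-- {
		if history[i].Type == opType {
			return checkLastUpgrade(&history[i], force) // same state checks
		}
	}
	return nil // no previous operation of this type
}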

- func GetLastUpdateOperation(siteKey SiteKey, operator Operator) (*SiteOperation, error) {
-     lastOperation, _, err := GetLastOperation(siteKey, operator)
+ // GetLastUpgradeOperation returns the most recent upgrade operation or NotFound.
+ func GetLastUpgradeOperation(key SiteKey, operator Operator) (*SiteOperation, error) {
@r0mant (Contributor, author) commented:
This function wasn't actually used, so it's OK that its logic changed.
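
For intuition, a simplified sketch of the "most recent upgrade operation or NotFound" behavior, again using the stand-in operation type from the description sketch rather than the actual lib/ops types:

// Sketch only: operations are assumed to be ordered oldest to newest.
func getLastUpgradeOperation(all []operation) (*operation, error) {
	for i := len(all) - 1; i >= 0; i-- {
		if all[i].Type == "operation_update" {
			return &all[i], nil
		}
	}
	// The real code would return a NotFound-style error here.
	return nil, errors.New("no upgrade operation found")
}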

@knisbet (Contributor) commented on Jun 18, 2020

I'm curious: why this approach, instead of, say, seeing that we're already upgrading and just resuming the previous upgrade attempt?

@a-palchikov (Contributor) commented on Jun 18, 2020

@knisbet I think it might be confusing to the user if the command previously used to create a new upgrade operation now changed its behavior to resume the last operation in certain cases. I think an explicit warning plus the ability to resume (there's still gravity plan resume) is better.

On the other hand, if what you suggest were the only mode of operation, it might be complicated to back out of an upgrade that cannot be resumed (i.e. when one of the nodes has failed permanently).

@r0mant (Contributor, author) commented on Jun 18, 2020

Yes, exactly. This is to catch the scenario where someone started an upgrade, it failed and was abandoned or forgotten about, the cluster state was reset, and then they upload a new version and try to upgrade again. This warning will remind them about the incomplete upgrade and prompt them to either finish the last one or roll it back. The decision will be explicit.

Resolved review threads (now outdated):

  • lib/ops/ops.go
  • lib/ops/opsservice/clusterconfig.go
  • lib/ops/opsservice/operationgroup.go (3 threads)
@knisbet (Contributor) commented on Jun 22, 2020

> Yes, exactly. This is to catch the scenario where someone started an upgrade, it failed and was abandoned or forgotten about, the cluster state was reset, and then they upload a new version and try to upgrade again. This warning will remind them about the incomplete upgrade and prompt them to either finish the last one or roll it back. The decision will be explicit.

I didn't say this should work if someone is trying to upgrade to a different version. But what we do see users do is run upgrade, see it fail, and then run upgrade again, without knowing or being aware of how to resume the upgrade, inspect the plan, etc.
