This repository has been archived by the owner on Feb 9, 2024. It is now read-only.

(5.5) Check last upgrade state when creating new upgrade operation #1731

Merged
merged 3 commits on Jun 23, 2020

Conversation

@r0mant (Contributor) commented on Jun 18, 2020

Description

This PR adds more sanity checks when creating a new upgrade operation.

Previously, we only checked for the "active" cluster state, which can pose a problem if somebody abandons an upgrade operation midway and manually resets the cluster state using the internal status-reset command (which, unfortunately, support folks don't hesitate to use; more on that below).

The additional checks (which can be bypassed with --force flag) are:

  • That previous upgrade operation (if any) successfully completed, or
  • That previous upgrade operation failed and its plan was fully rolled back.

Also, I updated the gravity status-reset command to display a large warning message explaining that the command can potentially lead to an inconsistent cluster state, and to require the user to confirm their intent. The warning can be bypassed by providing the --confirm flag.
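
For illustration, here is a minimal, self-contained sketch of the decision logic described above. The types, field names, and helper are simplified stand-ins for this description only, not the actual gravity/lib/ops API:

package main

import (
	"errors"
	"fmt"
)

// operation is a simplified stand-in for gravity's operation record;
// the real ops.SiteOperation has a different shape.
type operation struct {
	ID         string
	Type       string // e.g. "operation_update"
	Completed  bool   // finished successfully
	Failed     bool   // ended in a failed state
	RolledBack bool   // failed plan was fully rolled back
}

// checkLastUpgrade mirrors the checks described above: a new upgrade is
// allowed only if the previous upgrade completed, or failed and was fully
// rolled back. The force flag bypasses the check, like --force on the CLI.
func checkLastUpgrade(last *operation, force bool) error {
	if force || last == nil {
		return nil // no previous upgrade, or the operator overrode the check
	}
	switch {
	case last.Completed:
		return nil
	case last.Failed && last.RolledBack:
		return nil
	case last.Failed:
		return fmt.Errorf("the last %v operation (%v) is in a failed state but its plan isn't fully rolled back", last.Type, last.ID)
	default:
		return fmt.Errorf("the last %v operation (%v) is still in progress", last.Type, last.ID)
	}
}

func main() {
	inProgress := &operation{ID: "c4433ee8", Type: "operation_update"}
	fmt.Println(checkLastUpgrade(inProgress, false)) // blocked: still in progress
	fmt.Println(checkLastUpgrade(inProgress, true))  // allowed: check bypassed
	_ = errors.New                                   // errors kept for the sketches below
}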

Type of change

  • Internal change (not necessarily a bug fix or a new feature)
  • This change has a user-facing impact

Linked tickets and other PRs

TODOs

  • Self-review the change
  • Write tests
  • Perform manual testing
  • Address review feedback

Testing done

# Start operation in manual mode.
ubuntu@node-1:~/upgrade$ sudo ./gravity upgrade --manual
Wed Jun 17 23:31:58 UTC	Upgrading cluster from 5.5.48-dev.8 to 5.5.49-dev.5

# Attempt to start next one right away fails.
ubuntu@node-1:~/upgrade$ sudo ./gravity upgrade --manual
Wed Jun 17 23:32:06 UTC	Upgrading cluster from 5.5.48-dev.8 to 5.5.49-dev.5
[ERROR]: Upgrade operation can only be triggered for active clusters. This cluster is currently updating.
Use gravity status to see the cluster status and make sure that the cluster is active and healthy before retrying.

# Force-set active state and retry - also fails since the first operation is in progress.
ubuntu@node-1:~/upgrade$ sudo ./gravity status-reset --confirm
Cluster test state has been set to active
ubuntu@node-1:~/upgrade$ sudo ./gravity upgrade --manual
Wed Jun 17 23:32:24 UTC	Upgrading cluster from 5.5.48-dev.8 to 5.5.49-dev.5
[ERROR]: The last operation_update operation (c4433ee8-2f72-4499-b112-79fe02765844) is still in progress.
Use gravity plan to view the operation plan, and either resume or rollback it before attempting to start another operation.
You can provide --force flag to override this check.

# Complete first operation - next one should launch fine.
ubuntu@node-1:~/upgrade$ sudo ./gravity plan complete --operation-id=c4433ee8-2f72-4499-b112-79fe02765844
ubuntu@node-1:~/upgrade$ sudo ./gravity upgrade --manual
Wed Jun 17 23:32:59 UTC	Upgrading cluster from 5.5.48-dev.8 to 5.5.49-dev.5

# Execute a few phases, make one of them fail
ubuntu@node-1:~/upgrade$ sudo ./gravity plan execute --phase=/init
ubuntu@node-1:~/upgrade$ sudo ./gravity plan execute --phase=/bootstrap
Wed Jun 17 23:34:48 UTC	Executing "/bootstrap/node-1" locally
Wed Jun 17 23:34:49 UTC	Executing phase "/bootstrap" finished in 1 second
[ERROR]: rename /var/lib/gravity/site/update/394376198 /var/lib/gravity/site/update/gravity: file exists

# Complete the plan - next attempt should still fail since plan isn't rolled back.
ubuntu@node-1:~/upgrade$ sudo ./gravity plan complete --operation-id=2e28ade7-b450-47a3-a61a-b12720024cef
ubuntu@node-1:~/upgrade$ sudo ./gravity upgrade --manual
Wed Jun 17 23:35:29 UTC	Upgrading cluster from 5.5.48-dev.8 to 5.5.49-dev.5
[ERROR]: The last operation_update operation (2e28ade7-b450-47a3-a61a-b12720024cef) is in a failed state but its operation plan isn't fully rolled back.
Use gravity plan to view the operation plan, and either resume or rollback it before attempting to start another operation.
You can provide --force flag to override this check.

# Make sure the check can be forced.
ubuntu@node-1:~/upgrade$ sudo ./gravity upgrade --manual --force
Thu Jun 18 01:02:48 UTC	Upgrading cluster from 5.5.48-dev.8 to 5.5.49-dev.5

@r0mant r0mant requested a review from a team June 18, 2020 01:03
@r0mant r0mant self-assigned this Jun 18, 2020
return nil
}

func (g *operationGroup) checkLastOperation(operation ops.SiteOperation) error {
@r0mant (Contributor, author) commented:
Made this more or less generic on purpose, so we can potentially check for other operations too, not just the last upgrade.
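
A rough, purely illustrative sketch of how such a generic check could be reused for other operation types, building on the simplified types and checkLastUpgrade helper from the sketch in the description (none of this is the actual operationGroup code):

// Illustrative only: a check keyed by operation type could guard
// expand/shrink the same way it guards upgrades.
func checkLastOperationOfType(history []operation, opType string, force bool) error {
	for i := len(history) - 1; i >= 0; i-- {
		if history[i].Type == opType {
			return checkLastUpgrade(&history[i], force) // same state checks
		}
	}
	return nil // no previous operation of this type
}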

- func GetLastUpdateOperation(siteKey SiteKey, operator Operator) (*SiteOperation, error) {
-     lastOperation, _, err := GetLastOperation(siteKey, operator)
+ // GetLastUpgradeOperation returns the most recent upgrade operation or NotFound.
+ func GetLastUpgradeOperation(key SiteKey, operator Operator) (*SiteOperation, error) {
@r0mant (Contributor, author) commented:
This function wasn't actually used, so it's OK that its logic changed.
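
For intuition, a simplified sketch of the "most recent upgrade operation or NotFound" behavior, again using the stand-in operation type from the description sketch rather than the actual lib/ops types:

// Sketch only: operations are assumed to be ordered oldest to newest.
func getLastUpgradeOperation(all []operation) (*operation, error) {
	for i := len(all) - 1; i >= 0; i-- {
		if all[i].Type == "operation_update" {
			return &all[i], nil
		}
	}
	// The real code would return a NotFound-style error here.
	return nil, errors.New("no upgrade operation found")
}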

@knisbet (Contributor) commented on Jun 18, 2020

I'm curious: why this approach, instead of, say, seeing that we're already upgrading and just resuming the previous upgrade attempt?

@a-palchikov (Contributor) commented on Jun 18, 2020

@knisbet I think it might be confusing to the user if the command previously used to create a new upgrade operation now changed its behavior to resume the last operation in certain cases. I think an explicit warning plus the ability to resume (there's still gravity plan resume) is better.

On the other hand, if what you suggest were the only mode of operation, it might be complicated to back out of an upgrade that cannot be resumed (i.e. when one of the nodes has failed permanently).

@r0mant (Contributor, author) commented on Jun 18, 2020

Yes, exactly. This is to catch the scenario where someone started an upgrade, it failed and was abandoned or forgotten about, the cluster state was reset, and then they upload a new version and try to upgrade again. This warning will remind them about the incomplete upgrade and prompt them to either finish the last one or roll it back. The decision will be explicit.

Resolved review threads (now outdated):

  • lib/ops/ops.go
  • lib/ops/opsservice/clusterconfig.go
  • lib/ops/opsservice/operationgroup.go (3 threads)
@knisbet (Contributor) commented on Jun 22, 2020

> Yes, exactly. This is to catch the scenario where someone started an upgrade, it failed and was abandoned or forgotten about, the cluster state was reset, and then they upload a new version and try to upgrade again. This warning will remind them about the incomplete upgrade and prompt them to either finish the last one or roll it back. The decision will be explicit.

I didn't say this should work if someone is trying to upgrade to a different version. But what we do see users do is run upgrade, see it fail, and then run upgrade again, without knowing or being aware of how to resume the upgrade, inspect the plan, etc.
