
virt-handler canary upgrade #6702

Merged: 6 commits into kubevirt:main on Dec 10, 2021

Conversation

@acardace (Member) commented Oct 29, 2021

What this PR does / why we need it:

This PR introduces a new upgrade mechanism that first safely upgrades one virt-handler (one node) and then proceeds with the rest in batches, making the upgrade procedure faster (but still safe) on large clusters.

This upgrade mechanism ensures that virt-handler is safe to upgrade by launching one (updated) canary virt-handler and making sure that the pod is fully functional. After that we set MaxUnavailable to 10% and start the real rollout, which executes in batches of 10% of the nodes at a time, making the daemonset rollout on large clusters (> 20 nodes) up to 10 times faster.

If the canary pod fails, the virt-handler daemonset is rolled back to the previous working revision and a warning event is created.

The canary-pod upgrade procedure, in steps (a sketch of the maxUnavailable patching follows the list):

  • update virt-handler with maxUnavailable=1
  • patch daemonSet with new version
  • wait for a new virt-handler to be ready
  • set maxUnavailable=10%
  • start the rollout of the new virt-handler again
  • wait for all nodes to complete the rollout
  • set maxUnavailable back to 1
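
A minimal sketch of the maxUnavailable flipping, assuming a client-go helper along these lines (names, namespace, and the helper's shape are illustrative, not the PR's exact code):

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/apimachinery/pkg/util/intstr"
	"k8s.io/client-go/kubernetes"
)

// setMaxUnavailable patches only spec.updateStrategy.rollingUpdate.maxUnavailable
// on the virt-handler DaemonSet; intstr marshals ints as 1 and strings as "10%".
func setMaxUnavailable(c kubernetes.Interface, val intstr.IntOrString) error {
	raw, err := json.Marshal(val)
	if err != nil {
		return err
	}
	patch := fmt.Sprintf(`{"spec":{"updateStrategy":{"rollingUpdate":{"maxUnavailable":%s}}}}`, raw)
	_, err = c.AppsV1().DaemonSets("kubevirt").Patch(
		context.Background(), "virt-handler",
		types.MergePatchType, []byte(patch), metav1.PatchOptions{})
	return err
}

// Canary phase:    setMaxUnavailable(c, intstr.FromInt(1))
// Batched rollout: setMaxUnavailable(c, intstr.FromString("10%"))
```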

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Release note:

implement virt-handler canary upgrade and rollback for faster and safer rollouts

@kubevirt-bot added the release-note, dco-signoff: yes, and size/XXL labels on Oct 29, 2021
@acardace (Member, Author)

/cc @davidvossel @rmohr
/hold

Please have a first pass. I still have to implement e2e tests, but I wanted to post this anyway to receive some initial feedback, as I'm sure this will take some time and several revisions before getting merged. Thanks!

@kubevirt-bot added the do-not-merge/hold label on Oct 29, 2021
@davidvossel (Member) left a comment

The steps you have in the description look accurate; however, I think the execution can be simplified. For example, I don't think we need rollback, and we shouldn't need an annotation to signify a "canary" pod either.

Review threads (outdated, resolved) on:
• pkg/util/rollback/rollback.go
• pkg/virt-operator/util/canary.go
• pkg/virt-operator/resource/apply/apps.go
@acardace (Member, Author) commented Nov 4, 2021

@davidvossel I've simplified everything by removing the rollback; please take a look.

@davidvossel (Member)

/retest

@acardace force-pushed the canary-upgrade branch 3 times, most recently from 6766b1a to d7dc372 on November 5, 2021
@acardace changed the title from "virt-handler canary upgrade and rollback" to "virt-handler canary upgrade" on Nov 10, 2021
@acardace force-pushed the canary-upgrade branch 2 times, most recently from 1b9b462 to bacc114 on November 10, 2021
@davidvossel (Member)

it looks like unit tests are failing

@acardace (Member, Author)

> it looks like unit tests are failing

Thanks, they should be fixed now.

@acardace (Member, Author)

/retest

@davidvossel (Member) left a comment

/approve

excellent work! thanks for sticking with this.

I made one minor comment about consistency of how the initial maxUnavailable value is set, but that's about it from a feature standpoint.

From a functional test standpoint, it would be nice if we could at least observe the daemonset going from 1 to 10% back to 1. What would you think about making a small change to the virt-handler using the Customize Patch logic, and having an informer in the test catch those various stages as the daemonset rolls out?
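
A rough sketch of that idea, assuming a test helper along these lines (names and namespace are illustrative, not the PR's actual test code): a DaemonSet informer records every maxUnavailable value it sees, so the test can assert the 1 -> 10% -> 1 sequence.

```go
package tests

import (
	"time"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// watchMaxUnavailable streams each maxUnavailable value observed on the
// virt-handler DaemonSet while the rollout is in progress.
func watchMaxUnavailable(client kubernetes.Interface, stop <-chan struct{}) <-chan string {
	values := make(chan string, 16)
	factory := informers.NewSharedInformerFactoryWithOptions(
		client, 30*time.Second, informers.WithNamespace("kubevirt"))
	factory.Apps().V1().DaemonSets().Informer().AddEventHandler(
		cache.ResourceEventHandlerFuncs{
			UpdateFunc: func(_, newObj interface{}) {
				ds := newObj.(*appsv1.DaemonSet)
				ru := ds.Spec.UpdateStrategy.RollingUpdate
				if ds.Name == "virt-handler" && ru != nil && ru.MaxUnavailable != nil {
					values <- ru.MaxUnavailable.String() // expect "1", "10%", "1"
				}
			},
		})
	factory.Start(stop)
	return values
}
```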

```go
case updatedAndReadyPods == 0:
	if !isDaemonSetUpdated {
		// start canary upgrade
		setMaxUnavailable(newDS, intstr.FromInt(1))
```
davidvossel (Member) commented:

for consistency, should this be setMaxUnavailable(newDS, daemonSetDefaultMaxUnavailable) instead of 1?

acardace (Member, Author) replied:

Will change it in the next push, thanks for spotting this!
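
The suggestion amounts to replacing the magic number with the package's named default, something like this (the constant name is taken from the snippets in this thread; the rest is a sketch):

```go
// One shared named default keeps the canary phase and the post-rollout
// reset in sync instead of scattering the literal 1.
var daemonSetDefaultMaxUnavailable = intstr.FromInt(1)

// start canary upgrade
setMaxUnavailable(newDS, daemonSetDefaultMaxUnavailable)
```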

@kubevirt-bot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: davidvossel

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@xpivarc (Member) left a comment

Granting LGTM based on David's review and the fact that the last push contains only the serial -> Serial correction.

@kubevirt-bot added the lgtm label on Dec 2, 2021
@xpivarc (Member) left a comment

/hold
@acardace I think there is a chance this will flake. We fixed the serial "bug", but we didn't address the fact that this test can affect the following test.

Last question: Should this be part of the operator lane?

Review thread: tests/canary_upgrade_test.go (resolved)
@kubevirt-bot added the do-not-merge/hold label on Dec 2, 2021
Wait for all virt-handler pods to be ready before continuing with the control-plane rollout. Previously this function would consider the daemonSet rollout complete as soon as N pods were updated and only N-1 were ready.

Signed-off-by: Antonio Cardace <acardace@redhat.com>
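
A minimal sketch of the corrected check this commit describes, assuming the counts come straight from the DaemonSet status (illustrative names, not the exact KubeVirt helper):

```go
package apply

import appsv1 "k8s.io/api/apps/v1"

// The rollout only counts as complete once every desired pod is both
// updated and ready; N updated with N-1 ready is no longer good enough.
func handlerRolloutComplete(ds *appsv1.DaemonSet, podsReady int32) bool {
	desired := ds.Status.DesiredNumberScheduled
	return ds.Status.UpdatedNumberScheduled == desired && podsReady == desired
}
```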
@kubevirt-bot removed the lgtm label on Dec 6, 2021
@xpivarc (Member) left a comment

Thanks for your work! I have only a few suggestions that I leave up to you.

```diff
@@ -75,7 +75,7 @@ func DaemonsetIsReady(kv *v1.KubeVirt, daemonset *appsv1.DaemonSet, stores Store
 		return false
 	}

-	return true
+	return podsReady == daemonset.Status.DesiredNumberScheduled
```
xpivarc (Member) commented:

👍

```go
	return getMaxUnavailable(daemonSet) == daemonSetDefaultMaxUnavailable.IntValue()
}

func (r *Reconciler) processCanaryUpgrade(cachedDaemonSet, newDS *appsv1.DaemonSet, forceUpdate bool) (bool, error, CanaryUpgradeStatus) {
```
xpivarc (Member) commented:

I would consider named return variables here.

acardace (Member, Author) replied:

I'd keep it like this, IMO the code gets a tiny bit less readable with named variables in this case.

```go
if update.MaxUnavailable != nil {
	return update.MaxUnavailable.IntValue()
}
return 1
```
xpivarc (Member) commented:

Wouldn't daemonSetDefaultMaxUnavailable be more precise?

acardace (Member, Author) replied:

Fixed.

```go
	}
} else {
	// check for a crashed canary pod
	canary := r.getCanaryPod(cachedDaemonSet)
```
xpivarc (Member) commented:

I would break the assumption here that there will be only one matching pod.

acardace (Member, Author) replied:

Fixed.
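
A hypothetical sketch of what breaking that assumption can look like (identifiers are illustrative, not the PR's actual code): scan every pod of the updated template generation instead of assuming exactly one canary exists.

```go
package apply

import corev1 "k8s.io/api/core/v1"

// findCrashedCanary reports the first updated-generation pod that has
// failed or is crash-looping, handling any number of matching pods.
func findCrashedCanary(pods []*corev1.Pod, newGeneration string) *corev1.Pod {
	for _, pod := range pods {
		if pod.Labels["pod-template-generation"] != newGeneration {
			continue // old-revision pod, not a canary
		}
		if pod.Status.Phase == corev1.PodFailed || isCrashLooping(pod) {
			return pod
		}
	}
	return nil
}

func isCrashLooping(pod *corev1.Pod) bool {
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.State.Waiting != nil && cs.State.Waiting.Reason == "CrashLoopBackOff" {
			return true
		}
	}
	return false
}
```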

@kubevirt-bot added the lgtm label on Dec 6, 2021
This upgrade mechanism ensures that virt-handler is safe to upgrade by launching one (updated) canary virt-handler and making sure that the pod is fully functional. After that we set `MaxUnavailable` to 10% and start the real rollout, which executes in batches of 10% of the nodes at a time, making the daemonset rollout on large clusters (> 20 nodes) up to 10 times faster.

The canary-pod upgrade procedure in steps:
- update virt-handler with maxUnavailable=1
- patch daemonSet with new version
- wait for a new virt-handler to be ready
- set maxUnavailable=10%
- start the rollout of the new virt-handler
- wait for all nodes to complete the rollout
- set maxUnavailable back to 1

Signed-off-by: Antonio Cardace <acardace@redhat.com>
Signed-off-by: Antonio Cardace <acardace@redhat.com>
addAll() was always adding resources to the defaultConfig rather than using the config supplied as the function's parameter.

Signed-off-by: Antonio Cardace <acardace@redhat.com>
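
A hypothetical sketch of the bug class this commit describes (identifiers are illustrative, not the actual KubeVirt code): the function mutates the package-level default instead of the config it was handed.

```go
// Illustrative types only.
type config struct{ resources []string }

var defaultConfig = &config{}

func (c *config) add(r string) { c.resources = append(c.resources, r) }

// The bug: addAll mutated the package-level defaultConfig instead of the
// config passed in, so per-call configs were silently ignored.
func addAll(cfg *config, resources []string) {
	for _, r := range resources {
		defaultConfig.add(r) // fix: cfg.add(r)
	}
}
```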
Signed-off-by: Antonio Cardace <acardace@redhat.com>
Signed-off-by: Antonio Cardace <acardace@redhat.com>
@kubevirt-bot removed the lgtm label on Dec 9, 2021
@xpivarc (Member) commented Dec 9, 2021

/hold cancel
/lgtm

@kubevirt-bot added the lgtm label and removed the do-not-merge/hold label on Dec 9, 2021
@acardace (Member, Author) commented Dec 9, 2021

/retest

1 similar comment from @acardace followed.

@kubevirt-commenter-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

2 similar comments from @kubevirt-commenter-bot followed.

@acardace (Member, Author)

/retest

@kubevirt-bot (Contributor) commented Dec 10, 2021

@acardace: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Required | Rerun command |
| --- | --- | --- | --- |
| pull-kubevirt-check-tests-for-flakes | ac051b4 | false | /test pull-kubevirt-check-tests-for-flakes |

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@kubevirt-bot kubevirt-bot merged commit 54d9964 into kubevirt:main Dec 10, 2021