
virt-handler canary upgrade #6702

Merged: 6 commits into kubevirt:main on Dec 10, 2021

Conversation

@acardace (Member) commented Oct 29, 2021

What this PR does / why we need it:

This PR introduces a new upgrade mechanism that first safely upgrades one virt-handler (one node) and then proceeds with the rest in batches, making the upgrade procedure faster (but still safe) on large clusters.

This upgrade mechanism ensures that virt-handler is safe to upgrade by launching one (updated) canary virt-handler and making sure that the pod is fully functional. After that we set MaxUnavailable to 10% and start the real rollout, which executes in batches of 10% of the nodes at a time, making the daemonset rollout on large clusters (> 20 nodes) up to 10 times faster.

If the canary pod fails, the virt-handler daemonset is rolled back to the previous working revision and a warning event is created.

The canary-pod upgrade procedure, in steps (a sketch of the maxUnavailable patching follows the list):

  • update virt-handler with maxUnavailable=1
  • patch daemonSet with new version
  • wait for a new virt-handler to be ready
  • set maxUnavailable=10%
  • start the rollout of the new virt-handler again
  • wait for all nodes to complete the rollout
  • set maxUnavailable back to 1
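
A minimal sketch of the maxUnavailable flipping, assuming a client-go helper along these lines (names, namespace, and the helper's shape are illustrative, not the PR's exact code):

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/apimachinery/pkg/util/intstr"
	"k8s.io/client-go/kubernetes"
)

// setMaxUnavailable patches only spec.updateStrategy.rollingUpdate.maxUnavailable
// on the virt-handler DaemonSet; intstr marshals ints as 1 and strings as "10%".
func setMaxUnavailable(c kubernetes.Interface, val intstr.IntOrString) error {
	raw, err := json.Marshal(val)
	if err != nil {
		return err
	}
	patch := fmt.Sprintf(`{"spec":{"updateStrategy":{"rollingUpdate":{"maxUnavailable":%s}}}}`, raw)
	_, err = c.AppsV1().DaemonSets("kubevirt").Patch(
		context.Background(), "virt-handler",
		types.MergePatchType, []byte(patch), metav1.PatchOptions{})
	return err
}

// Canary phase:    setMaxUnavailable(c, intstr.FromInt(1))
// Batched rollout: setMaxUnavailable(c, intstr.FromString("10%"))
```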

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Release note:

implement virt-handler canary upgrade and rollback for faster and safer rollouts

@kubevirt-bot added the release-note, dco-signoff: yes, and size/XXL labels on Oct 29, 2021
@acardace (Member, Author)

/cc @davidvossel @rmohr
/hold

Please have a first pass. I still have to implement e2e tests, but I wanted to post this anyway to receive some initial feedback, as I'm sure this will take some time and several revisions before getting merged. Thanks!

@kubevirt-bot added the do-not-merge/hold label on Oct 29, 2021
@davidvossel (Member) left a comment

The steps you have in the description look accurate; however, I think the execution can be simplified. For example, I don't think we need rollback, and we shouldn't need an annotation to signify a "canary" pod either.

Review threads (outdated, resolved) on:
• pkg/util/rollback/rollback.go
• pkg/virt-operator/util/canary.go
• pkg/virt-operator/resource/apply/apps.go
@acardace (Member, Author) commented Nov 4, 2021

@davidvossel I've simplified everything by removing the rollback; please take a look.

@davidvossel (Member)

/retest

@acardace force-pushed the canary-upgrade branch 3 times, most recently from 6766b1a to d7dc372 on November 5, 2021
@acardace changed the title from "virt-handler canary upgrade and rollback" to "virt-handler canary upgrade" on Nov 10, 2021
@acardace force-pushed the canary-upgrade branch 2 times, most recently from 1b9b462 to bacc114 on November 10, 2021
@davidvossel (Member)

it looks like unit tests are failing

@acardace (Member, Author)

> it looks like unit tests are failing

Thanks, they should be fixed now.

@acardace (Member, Author)

/retest

@davidvossel (Member) left a comment

/approve

excellent work! thanks for sticking with this.

I made one minor comment about consistency of how the initial maxUnavailable value is set, but that's about it from a feature standpoint.

From a functional test standpoint, it would be nice if we could at least observe the daemonset going from 1 to 10% back to 1. What would you think about making a small change to the virt-handler using the Customize Patch logic, and having an informer in the test catch those various stages as the daemonset rolls out?
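
A rough sketch of that idea, assuming a test helper along these lines (names and namespace are illustrative, not the PR's actual test code): a DaemonSet informer records every maxUnavailable value it sees, so the test can assert the 1 -> 10% -> 1 sequence.

```go
package tests

import (
	"time"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// watchMaxUnavailable streams each maxUnavailable value observed on the
// virt-handler DaemonSet while the rollout is in progress.
func watchMaxUnavailable(client kubernetes.Interface, stop <-chan struct{}) <-chan string {
	values := make(chan string, 16)
	factory := informers.NewSharedInformerFactoryWithOptions(
		client, 30*time.Second, informers.WithNamespace("kubevirt"))
	factory.Apps().V1().DaemonSets().Informer().AddEventHandler(
		cache.ResourceEventHandlerFuncs{
			UpdateFunc: func(_, newObj interface{}) {
				ds := newObj.(*appsv1.DaemonSet)
				ru := ds.Spec.UpdateStrategy.RollingUpdate
				if ds.Name == "virt-handler" && ru != nil && ru.MaxUnavailable != nil {
					values <- ru.MaxUnavailable.String() // expect "1", "10%", "1"
				}
			},
		})
	factory.Start(stop)
	return values
}
```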

```go
case updatedAndReadyPods == 0:
	if !isDaemonSetUpdated {
		// start canary upgrade
		setMaxUnavailable(newDS, intstr.FromInt(1))
```
davidvossel (Member) commented:

for consistency, should this be setMaxUnavailable(newDS, daemonSetDefaultMaxUnavailable) instead of 1?

acardace (Member, Author) replied:

Will change it in the next push, thanks for spotting this!
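
The suggestion amounts to replacing the magic number with the package's named default, something like this (the constant name is taken from the snippets in this thread; the rest is a sketch):

```go
// One shared named default keeps the canary phase and the post-rollout
// reset in sync instead of scattering the literal 1.
var daemonSetDefaultMaxUnavailable = intstr.FromInt(1)

// start canary upgrade
setMaxUnavailable(newDS, daemonSetDefaultMaxUnavailable)
```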

@kubevirt-bot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: davidvossel

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@xpivarc (Member) left a comment

Granting LGTM based on David's review and the fact that the last push contains only the serial -> Serial correction.

@kubevirt-bot added the lgtm label on Dec 2, 2021
@xpivarc (Member) left a comment

/hold
@acardace I think there is a chance this will flake. We fixed the serial "bug", but we didn't address the fact that this test can affect the following test.

Last question: Should this be part of the operator lane?

Review thread: tests/canary_upgrade_test.go (resolved)
@kubevirt-bot added the do-not-merge/hold label on Dec 2, 2021
Wait for all virt-handler pods to be ready before continuing with the control-plane rollout. Previously this function would consider the daemonSet rollout complete as soon as N pods were updated and only N-1 were ready.

Signed-off-by: Antonio Cardace <acardace@redhat.com>
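
A minimal sketch of the corrected check this commit describes, assuming the counts come straight from the DaemonSet status (illustrative names, not the exact KubeVirt helper):

```go
package apply

import appsv1 "k8s.io/api/apps/v1"

// The rollout only counts as complete once every desired pod is both
// updated and ready; N updated with N-1 ready is no longer good enough.
func handlerRolloutComplete(ds *appsv1.DaemonSet, podsReady int32) bool {
	desired := ds.Status.DesiredNumberScheduled
	return ds.Status.UpdatedNumberScheduled == desired && podsReady == desired
}
```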
@kubevirt-bot removed the lgtm label on Dec 6, 2021
@xpivarc (Member) left a comment

Thanks for your work! I have only a few suggestions that I leave up to you.

```diff
@@ -75,7 +75,7 @@ func DaemonsetIsReady(kv *v1.KubeVirt, daemonset *appsv1.DaemonSet, stores Store
 		return false
 	}

-	return true
+	return podsReady == daemonset.Status.DesiredNumberScheduled
```
xpivarc (Member) commented:

👍

```go
	return getMaxUnavailable(daemonSet) == daemonSetDefaultMaxUnavailable.IntValue()
}

func (r *Reconciler) processCanaryUpgrade(cachedDaemonSet, newDS *appsv1.DaemonSet, forceUpdate bool) (bool, error, CanaryUpgradeStatus) {
```
xpivarc (Member) commented:

I would consider named return variables here.

acardace (Member, Author) replied:

I'd keep it like this, IMO the code gets a tiny bit less readable with named variables in this case.

```go
if update.MaxUnavailable != nil {
	return update.MaxUnavailable.IntValue()
}
return 1
```
xpivarc (Member) commented:

Wouldn't daemonSetDefaultMaxUnavailable be more precise?

acardace (Member, Author) replied:

Fixed.

```go
	}
} else {
	// check for a crashed canary pod
	canary := r.getCanaryPod(cachedDaemonSet)
```
xpivarc (Member) commented:

I would break the assumption here that there will be only one matching pod.

acardace (Member, Author) replied:

Fixed.
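
A hypothetical sketch of what breaking that assumption can look like (identifiers are illustrative, not the PR's actual code): scan every pod of the updated template generation instead of assuming exactly one canary exists.

```go
package apply

import corev1 "k8s.io/api/core/v1"

// findCrashedCanary reports the first updated-generation pod that has
// failed or is crash-looping, handling any number of matching pods.
func findCrashedCanary(pods []*corev1.Pod, newGeneration string) *corev1.Pod {
	for _, pod := range pods {
		if pod.Labels["pod-template-generation"] != newGeneration {
			continue // old-revision pod, not a canary
		}
		if pod.Status.Phase == corev1.PodFailed || isCrashLooping(pod) {
			return pod
		}
	}
	return nil
}

func isCrashLooping(pod *corev1.Pod) bool {
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.State.Waiting != nil && cs.State.Waiting.Reason == "CrashLoopBackOff" {
			return true
		}
	}
	return false
}
```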

@kubevirt-bot added the lgtm label on Dec 6, 2021
This upgrade mechanism ensures that virt-handler is safe to upgrade by launching one (updated) canary virt-handler and making sure that the pod is fully functional. After that we set `MaxUnavailable` to 10% and start the real rollout, which executes in batches of 10% of the nodes at a time, making the daemonset rollout on large clusters (> 20 nodes) up to 10 times faster.

The canary-pod upgrade procedure in steps:
- update virt-handler with maxUnavailable=1
- patch daemonSet with new version
- wait for a new virt-handler to be ready
- set maxUnavailable=10%
- start the rollout of the new virt-handler
- wait for all nodes to complete the rollout
- set maxUnavailable back to 1

Signed-off-by: Antonio Cardace <acardace@redhat.com>
Signed-off-by: Antonio Cardace <acardace@redhat.com>
addAll() was always adding resources to the defaultConfig rather than using the config supplied as the function's parameter.

Signed-off-by: Antonio Cardace <acardace@redhat.com>
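
A hypothetical sketch of the bug class this commit describes (identifiers are illustrative, not the actual KubeVirt code): the function mutates the package-level default instead of the config it was handed.

```go
// Illustrative types only.
type config struct{ resources []string }

var defaultConfig = &config{}

func (c *config) add(r string) { c.resources = append(c.resources, r) }

// The bug: addAll mutated the package-level defaultConfig instead of the
// config passed in, so per-call configs were silently ignored.
func addAll(cfg *config, resources []string) {
	for _, r := range resources {
		defaultConfig.add(r) // fix: cfg.add(r)
	}
}
```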
Signed-off-by: Antonio Cardace <acardace@redhat.com>
Signed-off-by: Antonio Cardace <acardace@redhat.com>
@kubevirt-bot removed the lgtm label on Dec 9, 2021
@xpivarc (Member) commented Dec 9, 2021

/hold cancel
/lgtm

@kubevirt-bot added the lgtm label and removed the do-not-merge/hold label on Dec 9, 2021
@acardace (Member, Author) commented Dec 9, 2021

/retest

1 similar comment from @acardace followed.

@kubevirt-commenter-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

2 similar comments from @kubevirt-commenter-bot followed.

@acardace (Member, Author)

/retest

@kubevirt-bot (Contributor) commented Dec 10, 2021

@acardace: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Required | Rerun command |
| --- | --- | --- | --- |
| pull-kubevirt-check-tests-for-flakes | ac051b4 | false | /test pull-kubevirt-check-tests-for-flakes |

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@kubevirt-bot kubevirt-bot merged commit 54d9964 into kubevirt:main Dec 10, 2021