virt-handler canary upgrade #6702
Conversation
/cc @davidvossel @rmohr Please have a first pass. I still have to implement e2e tests, but I wanted to post this anyway to get some initial feedback, as I'm sure this will take some time and a few revisions before getting merged. Thanks!
Force-pushed from 29f0af0 to 2314b47.
The steps you have in the description look accurate, however I think the execution can be simplified. For example, I don't think we need rollback, and we shouldn't need an annotation to signify a "canary" pod either.
Force-pushed from 85b58ba to 3131abb.
@davidvossel I've simplified everything by removing the rollback; please take a look.
/retest
Force-pushed from 6766b1a to d7dc372.
Force-pushed from 1b9b462 to bacc114.
It looks like the unit tests are failing.
Force-pushed from bacc114 to 54a82eb.
Thanks, they should be fixed now.
/retest
/approve
excellent work! thanks for sticking with this.
I made one minor comment about consistency of how the initial maxUnavailable value is set, but that's about it from a feature standpoint.
From a functional test standpoint, it would be nice if we could at least observe the daemonset's maxUnavailable going from 1 to 10% and back to 1. What would you think about making a small change to virt-handler using the Customize Patch logic, and having an informer in the test catch those various stages as the daemonset rolls out?
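A rough sketch of what observing those stages could look like, using a plain client-go watch rather than KubeVirt's own test helpers (the function name, namespace handling, and printed output are assumptions for illustration, not part of this PR):

```go
package tests

import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// observeMaxUnavailable watches the virt-handler DaemonSet and records every
// distinct maxUnavailable value it sees, in order, until the context is done.
func observeMaxUnavailable(ctx context.Context, client kubernetes.Interface, namespace string) ([]string, error) {
	w, err := client.AppsV1().DaemonSets(namespace).Watch(ctx, metav1.ListOptions{
		FieldSelector: "metadata.name=virt-handler",
	})
	if err != nil {
		return nil, err
	}
	defer w.Stop()

	var seen []string
	for {
		select {
		case <-ctx.Done():
			return seen, nil
		case event, ok := <-w.ResultChan():
			if !ok {
				return seen, nil
			}
			ds, ok := event.Object.(*appsv1.DaemonSet)
			if !ok {
				continue
			}
			ru := ds.Spec.UpdateStrategy.RollingUpdate
			if ru == nil || ru.MaxUnavailable == nil {
				continue
			}
			value := ru.MaxUnavailable.String()
			// only record transitions, not every status update
			if len(seen) == 0 || seen[len(seen)-1] != value {
				seen = append(seen, value)
				fmt.Printf("virt-handler maxUnavailable is now %s\n", value)
			}
		}
	}
}
```

A functional test could run this in a goroutine while triggering the update, then assert that the recorded sequence contains 1, 10% and 1 in that order.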
case updatedAndReadyPods == 0:
    if !isDaemonSetUpdated {
        // start canary upgrade
        setMaxUnavailable(newDS, intstr.FromInt(1))
For consistency, should this be `setMaxUnavailable(newDS, daemonSetDefaultMaxUnavailable)` instead of 1?
Will change it in the next push, thanks for spotting this!
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: davidvossel. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Force-pushed from cb89c22 to 3df3573.
Granting LGTM based on David's review and the fact that the last push only contains the correction of `serial` -> `Serial`.
/hold
@acardace I think there is a chance that this could flake.
We fixed the `serial` "bug" but we didn't address that this test can affect the following test.
Last question: Should this be part of the operator lane?
Wait for all virt-handler pods to be ready before continuing with the control-plane rollout. Previously this function would assume the daemonSet rollout to be complete as soon as N pods were updated and N-1 were ready.
Signed-off-by: Antonio Cardace <acardace@redhat.com>
Force-pushed from 3df3573 to 0edf316.
Force-pushed from 0edf316 to 0e19af9.
Thanks for your work! I have only a few suggestions that I leave up to you.
@@ -75,7 +75,7 @@ func DaemonsetIsReady(kv *v1.KubeVirt, daemonset *appsv1.DaemonSet, stores Store
         return false
     }

-    return true
+    return podsReady == daemonset.Status.DesiredNumberScheduled
👍
    return getMaxUnavailable(daemonSet) == daemonSetDefaultMaxUnavailable.IntValue()
}

func (r *Reconciler) processCanaryUpgrade(cachedDaemonSet, newDS *appsv1.DaemonSet, forceUpdate bool) (bool, error, CanaryUpgradeStatus) {
I would consider named return variables here.
I'd keep it like this; IMO the code gets a tiny bit less readable with named variables in this case.
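For reference, the named-return form the reviewer is alluding to would look roughly like this; the names `done`, `err` and `status` are purely illustrative and not taken from the PR:

```go
func (r *Reconciler) processCanaryUpgrade(cachedDaemonSet, newDS *appsv1.DaemonSet, forceUpdate bool) (done bool, err error, status CanaryUpgradeStatus) {
	// ... same body as before; a bare "return" would now return the named values
	return
}
```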
    if update.MaxUnavailable != nil {
        return update.MaxUnavailable.IntValue()
    }
    return 1
Wouldn't `daemonSetDefaultMaxUnavailable` be more precise?
Fixed.
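The resulting fallback would then read roughly as follows; the enclosing function signature and the nil check are assumptions for illustration, and only the final line reflects the suggested change:

```go
func getMaxUnavailable(daemonSet *appsv1.DaemonSet) int {
	update := daemonSet.Spec.UpdateStrategy.RollingUpdate
	if update == nil {
		return 0
	}
	if update.MaxUnavailable != nil {
		return update.MaxUnavailable.IntValue()
	}
	// fall back to the shared default rather than a bare literal
	return daemonSetDefaultMaxUnavailable.IntValue()
}
```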
    }
} else {
    // check for a crashed canary pod
    canary := r.getCanaryPod(cachedDaemonSet)
I would break the assumption here that there will be only one matching pod.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
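One way to drop the single-pod assumption, sketched here with an assumed helper shape rather than the code that finally landed, is to collect every updated pod owned by the DaemonSet and let the caller inspect each of them:

```go
package apply

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// getCanaryPods returns every pod of the given DaemonSet that already runs the
// updated template, instead of assuming there is exactly one canary.
// podIsUpdated stands in for whatever up-to-date check the reconciler uses.
func getCanaryPods(pods []*corev1.Pod, ds *appsv1.DaemonSet, podIsUpdated func(*corev1.Pod, *appsv1.DaemonSet) bool) []*corev1.Pod {
	var canaries []*corev1.Pod
	for _, pod := range pods {
		// only consider pods actually owned by this DaemonSet
		if !metav1.IsControlledBy(pod, ds) {
			continue
		}
		if podIsUpdated(pod, ds) {
			canaries = append(canaries, pod)
		}
	}
	return canaries
}
```

The caller can then flag the upgrade as failed if any of the returned canaries is crash-looping, instead of only inspecting a single pod.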
This upgrade mechanism ensures that virt-handler is safe to upgrade by launching one (updated) canary virt-handler and making sure that the pod is fully functional; after that we set `MaxUnavailable` to 10% and start the real rollout, which executes on batches of 10% of nodes at a time, thus making the daemonset rollout on large clusters (> 20 nodes) 10 times faster.

The canary-pod upgrade procedure in steps:
- update virt-handler with maxUnavailable=1
- patch daemonSet with the new version
- wait for a new virt-handler to be ready
- set maxUnavailable=10%
- start the rollout of the new virt-handler
- wait for all nodes to complete the rollout
- set maxUnavailable back to 1

Signed-off-by: Antonio Cardace <acardace@redhat.com>
Signed-off-by: Antonio Cardace <acardace@redhat.com>
The addAll() function was always adding resources to the defaultConfig rather than using the one supplied as the function's parameter.
Signed-off-by: Antonio Cardace <acardace@redhat.com>
Signed-off-by: Antonio Cardace <acardace@redhat.com>
Signed-off-by: Antonio Cardace <acardace@redhat.com>
Force-pushed from 0e19af9 to ac051b4.
/hold cancel
/retest
1 similar comment
/retest
/retest
2 similar comments
/retest
/retest
/retest
@acardace: The following test failed, say `/retest` to rerun all failed tests.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
What this PR does / why we need it:
This PR introduces a new upgrade mechanism to first safely upgrade one virt-handler (one node) and then proceed with the rest in batches, making the upgrade procedure faster (but still safe) on large clusters.
This upgrade mechanism ensures that virt-handler is safe to upgrade by launching one (updated) canary virt-handler and making sure that the pod is fully functional; after that we set `MaxUnavailable` to 10% and start the real rollout, which executes on batches of 10% of nodes at a time, thus making the daemonset rollout on large clusters (> 20 nodes) 10 times faster.
If the canary pod fails, the virt-handler daemonset is rolled back to a previous working revision and one warning event is created.
The canary-pod upgrade procedure in steps:
- update virt-handler with maxUnavailable=1
- patch daemonSet with the new version
- wait for a new virt-handler to be ready
- set maxUnavailable=10%
- start the rollout of the new virt-handler
- wait for all nodes to complete the rollout
- set maxUnavailable back to 1
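A minimal sketch of how these stages could be driven from the reconciler, reusing the `setMaxUnavailable` call seen in the review snippets above; the status constants, counters, and helper body are assumptions for illustration, not the actual PR code:

```go
package apply

import (
	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// CanaryUpgradeStatus is a simplified stand-in for the status the reconciler
// reports while the canary upgrade progresses.
type CanaryUpgradeStatus string

const (
	CanaryUpgradeStatusStarted          CanaryUpgradeStatus = "started"
	CanaryUpgradeStatusWaitingForCanary CanaryUpgradeStatus = "waiting-for-canary"
	CanaryUpgradeStatusInRollout        CanaryUpgradeStatus = "in-rollout"
	CanaryUpgradeStatusComplete         CanaryUpgradeStatus = "complete"
)

// setMaxUnavailable mirrors the helper used in the review snippets above.
func setMaxUnavailable(ds *appsv1.DaemonSet, value intstr.IntOrString) {
	ds.Spec.UpdateStrategy.RollingUpdate = &appsv1.RollingUpdateDaemonSet{
		MaxUnavailable: &value,
	}
}

// processCanaryUpgradeSketch mirrors the steps listed above:
//  1. start with maxUnavailable=1 and roll out a single canary pod
//  2. once the canary is updated and ready, widen maxUnavailable to 10%
//  3. when every node runs an updated, ready pod, reset maxUnavailable to 1
func processCanaryUpgradeSketch(newDS *appsv1.DaemonSet, updatedAndReadyPods, desiredPods int32, daemonSetUpdated bool) CanaryUpgradeStatus {
	switch {
	case updatedAndReadyPods == 0:
		if !daemonSetUpdated {
			// step 1: launch the canary
			setMaxUnavailable(newDS, intstr.FromInt(1))
			return CanaryUpgradeStatusStarted
		}
		return CanaryUpgradeStatusWaitingForCanary
	case updatedAndReadyPods < desiredPods:
		// step 2: canary is healthy, roll out in batches of 10% of nodes
		setMaxUnavailable(newDS, intstr.FromString("10%"))
		return CanaryUpgradeStatusInRollout
	default:
		// step 3: rollout finished, restore the conservative default
		setMaxUnavailable(newDS, intstr.FromInt(1))
		return CanaryUpgradeStatusComplete
	}
}
```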
Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged):
Fixes #
Special notes for your reviewer:
Release note: