EvictionStrategy: "External" #7493
Conversation
Just a few comments, but everything looks good!

One other thing: can you please copy-paste what you wrote in the PR description into the first commit message? That would help a lot in understanding where this change originates when going through the git history.
Review comment threads (resolved):
- pkg/virt-api/webhooks/validating-webhook/admitters/pod-eviction-admitter.go
- pkg/virt-api/webhooks/validating-webhook/admitters/vmi-create-admitter.go (outdated)
- pkg/virt-api/webhooks/validating-webhook/admitters/vmi-create-admitter.go (outdated)
- pkg/virt-controller/watch/drain/disruptionbudget/disruptionbudget_test.go
- pkg/virt-controller/watch/drain/disruptionbudget/disruptionbudget.go (outdated)
Force-pushed from 39c4b16 to bd2c061 (compare)
/retest

1 similar comment

/retest
Thanks David!

> Q: why not use termination grace period and single drain internally when ACPI shutdown is detected?
> A: We need to support the node drain process and timeouts that the cluster-api controllers execute today

I don't really understand the answer here. I mean, why can't the "shutdown process" change the VMI status (like you did), then get stuck until the cluster-api controllers (somehow) notify that the draining is done? Thanks.
```go
Request: &admissionv1.AdmissionRequest{
	Name:      pod.Name,
	Namespace: pod.Namespace,
	DryRun:    &dryRun,
```
This can be replaced with `pointer.Bool(false)`; it's both more readable and better in terms of performance (only one pointer for all false/true pointers out there).
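If I read the suggestion right, the change would look roughly like the sketch below (the helper name is hypothetical and the `k8s.io/utils/pointer` import is assumed to be available; the `AdmissionRequest` fields are taken from the snippet above):

```go
import (
	admissionv1 "k8s.io/api/admission/v1"
	k8sv1 "k8s.io/api/core/v1"
	"k8s.io/utils/pointer"
)

// newEvictionAdmissionRequest is a hypothetical helper illustrating the
// suggested change: pointer.Bool(false) replaces taking the address of a
// local dryRun variable.
func newEvictionAdmissionRequest(pod *k8sv1.Pod) *admissionv1.AdmissionRequest {
	return &admissionv1.AdmissionRequest{
		Name:      pod.Name,
		Namespace: pod.Namespace,
		DryRun:    pointer.Bool(false),
	}
}
```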
@davidvossel could it be that https://kubernetes.io/docs/concepts/architecture/nodes/#graceful-node-shutdown will help us to address this in a different way? If the shutdown is signaled properly to the launcher, then libvirt/qemu will properly signal (via acpi or GA) a shutdown to the guest, giving it time to gracefully shut down.
@fabiand @iholder-redhat Since your questions are similar, I'll answer them both here.

Anything involving triggering something within the guest OS needs to stay 100% out of the equation here. From a cluster level, we need to know that an eviction is occurring on a guest without the guest being touched by kubevirt at all, then allow our external cluster controllers (capi controllers) to perform the shutdown according to the policies defined there.
Here's a document [1] that outlines the cluster-api-provider-related process that this `EvictionStrategy: External` will be used in.
@davidvossel the PR looks good; can you just improve the first commit message by giving some context on why this change is required (a copy-paste of the PR description would be enough)?
If this is the line you draw, then indeed https://kubernetes.io/docs/concepts/architecture/nodes/#graceful-node-shutdown is of no help and the external orchestration is needed. However, if we assume that there is some awareness around ACPI signals in the guest, then the regular (mgmt cluster level) VM eviction flow should suffice (IMHO).
My last 2 cents: I wonder if "manual" instead of "external" would provide more context on how the eviction is taking place.
/hold

I want to test this eviction strategy in capk end to end before we commit to this new API field.

The issue I'm sorting through here is around safety and observation of PDBs within the tenant cluster. As soon as we mark a VMI for deletion (`DeletionTimestamp != nil`), the VMI is definitely going down eventually. Maybe that's necessary though to prevent the mgmt cluster from blocking on tenant node shutdown. I did some more research and I see that the kubelet (as of 1.21) has a graceful shutdown feature (https://kubernetes.io/blog/2021/04/21/graceful-node-shutdown-beta/). I want to test this in the wild a little more before committing to this new API field.
```diff
@@ -712,29 +712,38 @@ func podNetworkRequiredStatusCause(field *k8sfield.Path) metav1.StatusCause {
	}
}

func isValidEvictionStrategy(evictionStrategy *v1.EvictionStrategy) bool {
	if evictionStrategy == nil {
```
Just a suggestion:

```go
return evictionStrategy == nil ||
	*evictionStrategy == v1.EvictionStrategyLiveMigrate ||
	*evictionStrategy == v1.EvictionStrategyNone ||
	*evictionStrategy == v1.EvictionStrategyExternal
```
fixed
```diff
@@ -154,6 +154,7 @@ var _ = Describe("Validating VMICreate Admitter", func() {
	},
	Entry("migration policy to be set to LiveMigrate", v1.EvictionStrategyLiveMigrate),
	Entry("migration policy to be set None", v1.EvictionStrategyNone),
	Entry("migration policy to be set External", v1.EvictionStrategyExternal),
```
I know this isn't exclusively related to your PR, but could you also add an entry for `nil`?
added
```go
	return migrations.VMIMigratableOnEviction(c.clusterConfig, vmi)

	evictionStrategy := migrations.VMIEvictionStrategy(c.clusterConfig, vmi)
	if evictionStrategy == nil {
```
Just a suggestion:

```go
return evictionStrategy != nil && (*evictionStrategy == virtv1.EvictionStrategyLiveMigrate || *evictionStrategy == virtv1.EvictionStrategyExternal)
```
I prefer the switch statement for readability.
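For context, a sketch of what the switch-based variant could look like (the function name is hypothetical; only the LiveMigrate/External cases are taken from the discussion above):

```go
import virtv1 "kubevirt.io/api/core/v1"

// Hypothetical sketch of the switch form: LiveMigrate and External are the
// strategies that keep eviction blocked/handled by a controller; everything
// else (including a nil strategy) is not protected.
func evictionBlocked(evictionStrategy *virtv1.EvictionStrategy) bool {
	if evictionStrategy == nil {
		return false
	}
	switch *evictionStrategy {
	case virtv1.EvictionStrategyLiveMigrate, virtv1.EvictionStrategyExternal:
		return true
	default:
		return false
	}
}
```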
```diff
@@ -205,6 +206,39 @@ var _ = Describe("[rfe_id:273][crit:high][arm64][vendor:cnv-qe@redhat.com][level
	Expect(pod.Annotations).To(HaveKey("kubernetes.io/test"), "kubernetes annotation should not be carried to the pod")
})

It("Should prevent eviction when EvictionStrategy: External", func() {
```
Would it make sense to also simulate the external shutdown of the VM?
We're only interested in ensuring that the VMI is marked with the EvacuationNodeName when this occurs so the external controller can manage the situation.

All we're guaranteed with this new feature is that KubeVirt will provide the signal `vmi.Status.EvacuationNodeName` when eviction is blocked. It's up to the external controller to determine what that means... it might mean shutdown, it might mean something else.
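To make the contract concrete, here is a minimal, hypothetical sketch of the check an external controller (such as capk) might run in its reconcile loop; the function name and anything the controller does afterwards are assumptions, not part of this PR:

```go
import virtv1 "kubevirt.io/api/core/v1"

// evictionRequested is a hypothetical helper for an external controller.
// KubeVirt only guarantees that Status.EvacuationNodeName is set when an
// eviction is blocked under EvictionStrategy: External; interpreting that
// signal (shutdown, drain, something else) is up to the controller.
func evictionRequested(vmi *virtv1.VirtualMachineInstance) bool {
	return vmi.Status.EvacuationNodeName != ""
}
```

A consumer like capk would then drain the tenant node and tear down the VMI on its own schedule.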
```go
Expect(pod).ToNot(BeNil())

By("calling evict on VMI's pod")
err = virtClient.CoreV1().Pods(vmi.Namespace).EvictV1beta1(context.Background(), &policyv1beta1.Eviction{ObjectMeta: metav1.ObjectMeta{Name: pod.Name}})
```
In some cases this can actually be flaky. Can you issue a retry here just to be sure?
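If a retry were added, it would presumably be a Gomega `Eventually` around the eviction call, along these lines (a sketch only, meant to sit inside the existing `It` block with `virtClient`, `vmi`, and `pod` already in scope; the timeouts and the `IsTooManyRequests` check for the PDB-blocked response are assumptions):

```go
// Retry the eviction call until the PDB-blocked (429 TooManyRequests)
// response is observed, instead of asserting on a single attempt.
// Assumes "time" and k8serrors "k8s.io/apimachinery/pkg/api/errors" are
// imported alongside the policy/v1beta1 types already used by this test.
Eventually(func() bool {
	err := virtClient.CoreV1().Pods(vmi.Namespace).EvictV1beta1(
		context.Background(),
		&policyv1beta1.Eviction{ObjectMeta: metav1.ObjectMeta{Name: pod.Name}})
	return k8serrors.IsTooManyRequests(err)
}, 30*time.Second, 1*time.Second).Should(BeTrue(), "eviction should stay blocked by the PDB")
```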
We expect this to work under normal operation. I'm not seeing anywhere else in the test code base where we re-attempt eviction, except in the migration tests where we're ensuring the pods stay protected for the entire duration of the migration.
You are right. I didn't read through the whole test and hadn't realized you would not see the `EvacuationNodeName` if it fails.
That is a matter of policy. It may not be a correct assumption on some clusters.
Force-pushed from bd2c061 to 6f19675 (compare)
…o react to VMI eviction

The cluster-api-provider-kubevirt (capk) project needs a way for VMIs to be blocked from eviction, yet signal capk that eviction has been called on the VMI so the capk controller can handle tearing the VMI down.

Here's why. When a VMI is being evicted, we need a way to have that eviction blocked so the cluster-api related controllers can properly drain the k8s node running within the VMI before the VMI is torn down. We already have all the mechanisms in place in the cluster-api controllers to coordinate this; we just need a way to detect the VMI needs to go down (without actually taking the VMI down).

With `EvictionStrategy: External`, kubevirt will create a PDB for the VMI which blocks eviction, and it will also set `vmi.Status.EvacuationNodeName` on the VMI's status. When the capk controllers see that `vmi.Status.EvacuationNodeName`, we'll start the process of draining and tearing down the VMI gracefully from our side.

Q: why not use termination grace period and single drain internally when ACPI shutdown is detected?
A: We need to support the node drain process and timeouts that the cluster-api controllers execute today.

Q: Should we be concerned that this feature could block node drain indefinitely?
A: Users can already create a PDB today for their VMIs to block node drain indefinitely, so we're not doing anything here that a user couldn't achieve on their own. This feature primarily just adds a way to detect that eviction was called on the VMI (via `vmi.Status.EvacuationNodeName`).

```release-note
Adds new EvictionStrategy "External" for blocking eviction which is handled by an external controller
```

Signed-off-by: David Vossel <davidvossel@gmail.com>
Force-pushed from 6f19675 to 03c8b08 (compare)
/hold cancel

@fabiand I spent some time considering the suggestion to name this eviction strategy `manual`. This feature is meant for automation. It's for when another controller creates a VM and controls the lifecycle of that VM. For end users who want to protect their VMs for a
done
/lgtm
@davidvossel you also copied the release note line into the first commit message; it's not a big deal, but you might want to remove it.
/retest
The automation focus makes a lot of sense to me. :+1:

The content and the use-case make sense to me.

/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rmohr

The full list of commands accepted by this bot can be found here. The pull request process is described here.
That is true. The PDB logic is, however, not new, and we have been creating PDBs for quite some time (as kind of a workaround for not needing a required webhook on pods, in case KubeVirt goes down). It would be interesting to see if we can somehow properly deal with cases where people don't have access to PDBs. It may indeed have unexpected side effects if admins don't know about our drain handling.
/retest

/retest-required

3 similar comments

/retest-required

/retest-required

/retest-required
The cluster-api-provider-kubevirt (capk) project needs a way for VMIs to be blocked from eviction, yet signal capk that eviction has been called on the VMI so the capk controller can handle tearing the VMI down.

Here's why.

When a VMI is being evicted, we need a way to have that eviction blocked so the cluster-api related controllers can properly drain the k8s node running within the VMI before the VMI is torn down. We already have all the mechanisms in place in the cluster-api controllers to coordinate this; we just need a way to detect the VMI needs to go down (without actually taking the VMI down).
With `EvictionStrategy: External`, kubevirt will create a PDB for the VMI which blocks eviction, and it will also set `vmi.Status.EvacuationNodeName` on the VMI's status. When the capk controllers see that `vmi.Status.EvacuationNodeName`, we'll start the process of draining and tearing down the VMI gracefully from our side (a usage sketch follows the Q&A below).

Q&A
Q: why not use termination grace period and single drain internally when ACPI shutdown is detected?
A: We need to support the node drain process and timeouts that the cluster-api controllers execute today.

Q: Should we be concerned that this feature could block node drain indefinitely?
A: Users can already create a PDB today for their VMIs to block node drain indefinitely, so we're not doing anything here that a user couldn't achieve on their own. This feature primarily just adds a way to detect that eviction was called on the VMI (via `vmi.Status.EvacuationNodeName`).
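For completeness, a hedged sketch of how a controller like capk might opt a VMI into this behavior using the Go types from `kubevirt.io/api/core/v1`; the metadata and function name are illustrative only:

```go
import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	virtv1 "kubevirt.io/api/core/v1"
)

// externalEvictionVMI returns a VMI that opts into the new strategy:
// eviction is blocked by a KubeVirt-managed PDB and
// Status.EvacuationNodeName is set as the signal to the external controller.
func externalEvictionVMI() *virtv1.VirtualMachineInstance {
	strategy := virtv1.EvictionStrategyExternal
	return &virtv1.VirtualMachineInstance{
		ObjectMeta: metav1.ObjectMeta{Name: "tenant-cluster-node-0"},
		Spec: virtv1.VirtualMachineInstanceSpec{
			EvictionStrategy: &strategy,
		},
	}
}
```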