
EvictionStrategy: "External" #7493

Merged
merged 3 commits into kubevirt:main on Apr 14, 2022

Conversation

davidvossel
Member

The cluster-api-provider-kubevirt (capk) project needs a way for VMIs to be blocked from eviction, yet signal capk that eviction has been called on the VMI so the capk controller can handle tearing the VMI down.

Here's why.

When a VMI is being evicted, we need a way to have that eviction blocked so the cluster-api-related controllers can properly drain the k8s node running within the VMI before the VMI is torn down. We already have all the mechanisms in place in the cluster-api controllers to coordinate this; we just need a way to detect that the VMI needs to go down (without actually taking the VMI down).

With EvictionStrategy: External, kubevirt will create a PDB for the VMI which blocks eviction, and it will also set vmi.Status.EvacuationNodeName in the VMI's status. When the capk controllers see that vmi.Status.EvacuationNodeName is set, we'll start the process of draining and tearing down the VMI gracefully from our side.
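
To make the reaction side concrete, here is a minimal controller-runtime sketch (not capk's actual code) of how an external controller could react once vmi.Status.EvacuationNodeName is set; drainTenantNode and deleteMachineFor are hypothetical placeholders for the capk-side drain and teardown logic:

```go
package external

import (
	"context"

	kubevirtv1 "kubevirt.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// evacuationReconciler sketches how a capk-style controller could react to the
// EvacuationNodeName signal. drainTenantNode and deleteMachineFor are placeholders.
type evacuationReconciler struct {
	client.Client
}

func (r *evacuationReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	vmi := &kubevirtv1.VirtualMachineInstance{}
	if err := r.Get(ctx, req.NamespacedName, vmi); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// With EvictionStrategy: External, KubeVirt only sets this field and blocks
	// the eviction via a PDB; it never tears the VMI down itself.
	if vmi.Status.EvacuationNodeName == "" {
		return ctrl.Result{}, nil
	}

	// The external controller owns the teardown: drain the tenant node first,
	// then remove the machine/VM through its normal lifecycle.
	if err := drainTenantNode(ctx, vmi); err != nil {
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, deleteMachineFor(ctx, vmi)
}

// Placeholder stubs standing in for capk's drain and teardown logic.
func drainTenantNode(ctx context.Context, vmi *kubevirtv1.VirtualMachineInstance) error  { return nil }
func deleteMachineFor(ctx context.Context, vmi *kubevirtv1.VirtualMachineInstance) error { return nil }
```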

Q&A

Q: why not use termination grace period and signal drain internally when ACPI shutdown is detected?
A: We need to support the node drain process and timeouts that the cluster-api controllers execute today

Q: Should we be concerned that this feature could block node drain indefinitely?
A: Users can already create a PDB today for their VMIs to block node drain indefinitely so we're not doing anything here that a user couldn't achieve on their own. This feature primarily just adds a way to detect that eviction was called on the VMI (via vmi.Status.EvacuationNodeName)

Adds new EvictionStrategy "External" for blocking eviction which is handled by an external controller
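
For illustration only, a sketch of what opting a VMI into the new strategy looks like from Go; the name and namespace are made up and the rest of the spec is elided:

```go
package example

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	kubevirtv1 "kubevirt.io/api/core/v1"
)

// externalEvictionVMI shows the shape of a VMI that opts into the new strategy.
func externalEvictionVMI() *kubevirtv1.VirtualMachineInstance {
	external := kubevirtv1.EvictionStrategyExternal
	return &kubevirtv1.VirtualMachineInstance{
		// Hypothetical identifiers for a capk-managed tenant node.
		ObjectMeta: metav1.ObjectMeta{Name: "tenant-node-0", Namespace: "tenant-cluster"},
		Spec: kubevirtv1.VirtualMachineInstanceSpec{
			// On eviction, KubeVirt blocks the eviction with a PDB and populates
			// status.evacuationNodeName instead of shutting the VMI down.
			EvictionStrategy: &external,
			// Domain, volumes, and networks omitted for brevity.
		},
	}
}
```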

@kubevirt-bot kubevirt-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. size/L labels Mar 30, 2022
Member

@acardace left a comment

just a few comments but everything looks good!

One other thing, can you please copy-paste what you wrote in the PR description into the first commit message? That would help a lot in understanding where this change originates when going through the git history.

@davidvossel
Member Author

/retest

1 similar comment
@davidvossel
Member Author

/retest

Contributor

@iholder101 left a comment

Thanks David!

Q: why not use termination grace period and signal drain internally when ACPI shutdown is detected?
A: We need to support the node drain process and timeouts that the cluster-api controllers execute today

I don't really understand the answer here. I mean, why can't the "shutdown process" change the VMI status (like you did), then get stuck until the cluster-api controllers (somehow) notify that the draining is done? Thanks.

Request: &admissionv1.AdmissionRequest{
	Name:      pod.Name,
	Namespace: pod.Namespace,
	DryRun:    &dryRun,
Contributor

This can be replaced with pointer.Bool(false); it's both more readable and better in terms of performance (only one pointer for all false/true pointers out there).
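
For context, a minimal sketch of the suggested change, assuming k8s.io/utils/pointer is available; the newEvictionRequest helper name is hypothetical:

```go
package admitters

import (
	admissionv1 "k8s.io/api/admission/v1"
	k8sv1 "k8s.io/api/core/v1"
	"k8s.io/utils/pointer"
)

// newEvictionRequest builds a dry-run-disabled admission request without
// declaring a local dryRun variable just to take its address.
func newEvictionRequest(pod *k8sv1.Pod) *admissionv1.AdmissionRequest {
	return &admissionv1.AdmissionRequest{
		Name:      pod.Name,
		Namespace: pod.Namespace,
		DryRun:    pointer.Bool(false),
	}
}
```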

@fabiand
Member

fabiand commented Apr 4, 2022

@davidvossel could it be that https://kubernetes.io/docs/concepts/architecture/nodes/#graceful-node-shutdown will help us to address this in a different way?

If the shutdown is signaled properly to the launcher, then libvirt/qemu will properly signal (via acpi or GA) a shutdown to the guest, giving it time to gracefully shut down.

@davidvossel
Member Author

I mean, why can't the "shutdown process" change the VMI status (like you did), then get stuck until the cluster-api controllers (somehow) notify that the draining is done? Thanks.

If the shutdown is signaled properly to the launcher, then libvirt/qemu will properly signal (via acpi or GA) a shutdown to the guest, giving it time to gracefully shut down.

@fabiand @iholder-redhat Since your questions are similar, I'll answer them both here.

  1. We don't control what is in the guest in our scenario. There may or may not be a guest agent and we shouldn't enforce that as a requirement.
  2. The graceful shutdown is actually controlled by another cluster-level controller (cluster-api's machine controller). That controller is using the k8s API to drain the node living within the guest, then telling the machine to go away (which ultimately deletes the VM/VMI through a series of owner references).
  3. We can't commit to the deletion of the VMI pod in this process by marking it for deletion (DeletionTimestamp != nil). There is no appropriate termination grace period here that we can rely on. It's up to the cluster-api controllers to determine if the VM has been drained successfully and can shut down, not a termination grace period timeout.

Anything involving triggering something within the guest OS needs to stay 100% out of the equation here. From a cluster level, we need to know that an eviction is occurring on a guest without the guest being touched by kubevirt at all, then allow our external cluster controllers (capi controllers) to perform the shutdown according to the policies defined there.

@davidvossel
Member Author

Here's a document [1] that outlines the cluster-api provider related process that this EvictionStrategy: External will be used in.

  1. https://docs.google.com/document/d/1ePTdleIxBPYU52cV00P55myOIKVjyH9aCJc10WbJmms/edit#

@acardace
Member

acardace commented Apr 6, 2022

@davidvossel the PR looks good, can you just improve the first commit message by giving some context on why this change is required (a copy-paste of the PR description would be enough)?

@fabiand
Member

fabiand commented Apr 6, 2022

We don't control what is in the guest in our scenario. There may or may not be a guest agent and we shouldn't enforce that as a requirement.

If this is the line you draw, then indeed https://kubernetes.io/docs/concepts/architecture/nodes/#graceful-node-shutdown is of no help and the external orchestration is needed.

However, if we assume that there is some awareness around ACPI signals in the guest, then the regular (mgmt cluster level) VM eviction flow should suffice (IMHO):

  1. mgmt cluster drains node A
  2. kubelet on node A tries to gracefully terminate pods
  3. launcher signals shutdown to guest A in VMI A via ACPI
  4. systemd in guest A receives ACPI signal
  5. kubelet in guest A tries to gracefully terminate pods
  6. kubelet in guest A eventually succeeds
  7. guest A is shutting down, is shut down
  8. VMI A runs to completion
  9. node A is drained

My last 2 cents: I wonder if "manual" instead of "external" would provide more context on how the eviction is taking place.

@davidvossel
Member Author

/hold

I want to test this eviction strategy in capk end to end before we commit to this new api field.

However, if we assume that there is some awareness around ACPI signals in the guest, then the regular (mgmt cluster level) VM eviction flow should suffice (IMHO):

mgmt cluster drains node A
kubelet on node A tries to gracefully terminate pods
launcher signals shutdown to guest A in VMI A via ACPI
systemd in guest A receives ACPI signal
kubelet in guest A tries to gracefully terminate pods
kubelet in guest A eventually succeeds
guest A is shutting down, is shut down
VMI A runs to completion
node A is drained

The issue I'm sorting through here is around safety and observation of PDBs within the tenant cluster.

As soon as we mark a VMI for deletion (DeletionTimestamp != nil), the VMI is definitely going down eventually. Maybe that's necessary though to prevent the mgmt cluster from blocking on tenant node shutdown.

I did some more research and I see that the kubelet (as of 1.21) has a graceful node shutdown feature (https://kubernetes.io/blog/2021/04/21/graceful-node-shutdown-beta/). I want to test this in the wild a little more before committing to this new External strategy field.

@kubevirt-bot kubevirt-bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 6, 2022
@@ -712,29 +712,38 @@ func podNetworkRequiredStatusCause(field *k8sfield.Path) metav1.StatusCause {
}
}

func isValidEvictionStrategy(evictionStrategy *v1.EvictionStrategy) bool {
if evictionStrategy == nil {
Member

Just a suggestion.

return evictionStrategy == nil ||
	*evictionStrategy == v1.EvictionStrategyLiveMigrate ||
	*evictionStrategy == v1.EvictionStrategyNone ||
	*evictionStrategy == v1.EvictionStrategyExternal

Member Author

fixed

@@ -154,6 +154,7 @@ var _ = Describe("Validating VMICreate Admitter", func() {
},
Entry("migration policy to be set to LiveMigrate", v1.EvictionStrategyLiveMigrate),
Entry("migration policy to be set None", v1.EvictionStrategyNone),
Entry("migration policy to be set External", v1.EvictionStrategyExternal),
Member

I know this isn't exclusively related to your PR but could you also add an entry for nil?

Member Author

added

return migrations.VMIMigratableOnEviction(c.clusterConfig, vmi)

evictionStrategy := migrations.VMIEvictionStrategy(c.clusterConfig, vmi)
if evictionStrategy == nil {
Member

Just a suggestion.

return evictionStrategy != nil && (*evictionStrategy == virtv1.EvictionStrategyLiveMigrate || *evictionStrategy == virtv1.EvictionStrategyExternal)

Member Author

I prefer the switch statement for readability.
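
For reference, a sketch of the switch-based form being referred to; the function name and exact shape are illustrative, not necessarily the merged code:

```go
package controller

import virtv1 "kubevirt.io/api/core/v1"

// vmiNeedsEvacuationHandling mirrors the logic under discussion: only the
// LiveMigrate and External strategies require evacuation handling on eviction.
func vmiNeedsEvacuationHandling(evictionStrategy *virtv1.EvictionStrategy) bool {
	if evictionStrategy == nil {
		return false
	}
	switch *evictionStrategy {
	case virtv1.EvictionStrategyLiveMigrate, virtv1.EvictionStrategyExternal:
		return true
	default:
		return false
	}
}
```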

@@ -205,6 +206,39 @@ var _ = Describe("[rfe_id:273][crit:high][arm64][vendor:cnv-qe@redhat.com][level
Expect(pod.Annotations).To(HaveKey("kubernetes.io/test"), "kubernetes annotation should not be carried to the pod")
})

It("Should prevent eviction when EvictionStratgy: External", func() {
Member

Would it make sense to also simulate the external shutdown of the VM?

Member Author

we're only interested in ensuring that the VMI is marked with the EvacuationNodeName when this occurs so the external controller can manage the situation.

All we're guaranteed with this new feature is that KubeVirt will provide the signal vmi.Status.EvacuationNodeName when eviction is blocked. It's up to the external controller to determine what that means... it might mean shutdown, it might mean something else.
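
In other words, the functional test only needs an assertion along these lines (a hedged sketch against the KubeVirt client API of that era; virtClient, namespace, and name would come from the surrounding test):

```go
package tests

import (
	"time"

	. "github.com/onsi/gomega"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"kubevirt.io/client-go/kubecli"
)

// expectEvacuationSignal is a hypothetical helper: after the eviction call is
// rejected by the PDB, the only guarantee this feature gives is that
// vmi.Status.EvacuationNodeName gets populated for the external controller.
func expectEvacuationSignal(virtClient kubecli.KubevirtClient, namespace, name string) {
	Eventually(func() string {
		vmi, err := virtClient.VirtualMachineInstance(namespace).Get(name, &metav1.GetOptions{})
		Expect(err).ToNot(HaveOccurred())
		return vmi.Status.EvacuationNodeName
	}, 30*time.Second, time.Second).ShouldNot(BeEmpty())
}
```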

Expect(pod).ToNot(BeNil())

By("calling evict on VMI's pod")
err = virtClient.CoreV1().Pods(vmi.Namespace).EvictV1beta1(context.Background(), &policyv1beta1.Eviction{ObjectMeta: metav1.ObjectMeta{Name: pod.Name}})
Member

In some cases, this can actually be flaky. Can you issue a retry here just to be sure?

Member Author

We expect this to work under normal operation. I'm not seeing anywhere else in the test code base where we re-attempt eviction except in the migration tests where we're ensuring the pods stay protected the entire duration of the migration.

Member

You are right. I hadn't read through the whole test and hadn't realized you would not see the EvacuationNodeName if it fails.

@kfox1111
Contributor

kfox1111 commented Apr 6, 2022

A: Users can already create a PDB today for their VMIs to block node drain indefinitely so we're not doing anything here that a user couldn't achieve on their own. This feature primarily just adds a way to detect that eviction was called on the VMI (via vmi.Status.EvacuationNodeName)

That is a matter of policy. It may not be a correct assumption on some clusters.

…o react to VMI eviction

The cluster-api-provider-kubevirt (capk) project needs a way for VMIs to be blocked from eviction, yet signal capk that eviction has been called on the VMI so the capk controller can handle tearing the VMI down.

Here's why.

When a VMI is being evicted, we need a way to have that eviction blocked so the cluster-api-related controllers can properly drain the k8s node running within the VMI before the VMI is torn down. We already have all the mechanisms in place in the cluster-api controllers to coordinate this; we just need a way to detect that the VMI needs to go down (without actually taking the VMI down).

With `EvictionStrategy: External`, kubevirt will create a PDB for the VMI which blocks eviction, and it will also set `vmi.Status.EvacuationNodeName` in the VMI's status. When the capk controllers see that `vmi.Status.EvacuationNodeName` is set, we'll start the process of draining and tearing down the VMI gracefully from our side.

Q: why not use termination grace period and signal drain internally when ACPI shutdown is detected?
A: We need to support the node drain process and timeouts that the cluster-api controllers execute today

Q: Should we be concerned that this feature could block node drain indefinitely?
A: Users can already create a PDB today for their VMIs to block node drain indefinitely so we're not doing anything here that a user couldn't achieve on their own. This feature primarily just adds a way to detect that eviction was called on the VMI (via vmi.Status.EvacuationNodeName)

```release-note
Adds new EvictionStrategy "External" for blocking eviction which is handled by an external controller
```

Signed-off-by: David Vossel <davidvossel@gmail.com>
Signed-off-by: David Vossel <davidvossel@gmail.com>
Signed-off-by: David Vossel <davidvossel@gmail.com>
@davidvossel
Member Author

/hold cancel

@fabiand I spent some time considering the suggestion to name this eviction strategy Manual instead of External. I'd prefer to stick with External. Here's why.

This feature is meant for automation. It's for when another controller creates a VM and controls the lifecycle of that VM. The External eviction strategy provides that controller a signal, vmi.Status.EvacuationNodeName, for when the external controller needs to process the VMI's eviction with custom logic... The alternative here is that the external controller would need to register an eviction webhook itself and create/manage a PDB to protect the VMI's pod. KubeVirt already exposes this behavior for EvictionStrategy: LiveMigrate, so we have the opportunity to leverage mechanisms that already exist in KubeVirt for custom eviction handling.

For end users who want to protect their VMs for a Manual shutdown, my expectation is that they use their own PDBs to do this. This aligns with how users protect normal pod workloads.

@davidvossel the PR looks good, can you just improve the first commit message by giving some context on why this change is required (a copy-paste of the PR description would be enough)?

done

@kubevirt-bot kubevirt-bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 7, 2022
@acardace
Member

acardace commented Apr 7, 2022

/lgtm

@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Apr 7, 2022
@acardace
Member

acardace commented Apr 7, 2022

@davidvossel you also copied the release note line into the first commit message; it's not a big deal, but you might want to remove it.

@davidvossel
Member Author

/retest

@rmohr
Member

rmohr commented Apr 12, 2022

I'd prefer to stick with External. Here's why.

This feature is meant for automation

The automation focus makes a lot of sense to me. :+1:

The content and the use-case make sense to me.

/approve

@kubevirt-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rmohr


@kubevirt-bot kubevirt-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 12, 2022
@rmohr
Member

rmohr commented Apr 12, 2022

A: Users can already create a PDB today for their VMIs to block node drain indefinitely so we're not doing anything here that a user couldn't achieve on their own. This feature primarily just adds a way to detect that eviction was called on the VMI (via vmi.Status.EvacuationNodeName)

That is a matter of policy. It may not be a correct assumption on some clusters.

That is true. The PDB logic is however not new, and we have been creating PDBs for quite some time (as kind of a workaround to not needing a required webhook on pods, in case kubevirt goes down). It would be interesting to see if we can somehow properly deal with cases where people don't have access to PDBs. It may indeed have unexpected side-effects if admins don't know about our drain handling.

@acardace
Member

/retest

@kubevirt-commenter-bot

/retest-required
This bot automatically retries required jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

3 similar comments
@kubevirt-commenter-bot

/retest-required
This bot automatically retries required jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@kubevirt-commenter-bot

/retest-required
This bot automatically retries required jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@kubevirt-commenter-bot

/retest-required
This bot automatically retries required jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@kubevirt-bot kubevirt-bot merged commit a684a6f into kubevirt:main Apr 14, 2022