
EvictionStrategy: "External" #7493

Merged
merged 3 commits into kubevirt:main on Apr 14, 2022

Conversation

davidvossel
Member

The cluster-api-provider-kubevirt (capk) project needs a way for VMIs to be blocked from eviction, yet signal capk that eviction has been called on the VMI so the capk controller can handle tearing the VMI down.

Here's why.

When a VMI is being evicted, we need a way to have that eviction blocked so the cluster-api-related controllers can properly drain the k8s node running within the VMI before the VMI is torn down. We already have all the mechanisms in place in the cluster-api controllers to coordinate this; we just need a way to detect that the VMI needs to go down (without actually taking the VMI down).

With EvictionStrategy: External, kubevirt will create a PDB for the VMI which blocks eviction, and it will also set vmi.Status.EvacuationNodeName in the VMI's status. When the capk controllers see that vmi.Status.EvacuationNodeName is set, we'll start the process of draining and tearing down the VMI gracefully from our side.
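
To make the reaction side concrete, here is a minimal controller-runtime sketch (not capk's actual code) of how an external controller could react once vmi.Status.EvacuationNodeName is set; drainTenantNode and deleteMachineFor are hypothetical placeholders for the capk-side drain and teardown logic:

```go
package external

import (
	"context"

	kubevirtv1 "kubevirt.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// evacuationReconciler sketches how a capk-style controller could react to the
// EvacuationNodeName signal. drainTenantNode and deleteMachineFor are placeholders.
type evacuationReconciler struct {
	client.Client
}

func (r *evacuationReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	vmi := &kubevirtv1.VirtualMachineInstance{}
	if err := r.Get(ctx, req.NamespacedName, vmi); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// With EvictionStrategy: External, KubeVirt only sets this field and blocks
	// the eviction via a PDB; it never tears the VMI down itself.
	if vmi.Status.EvacuationNodeName == "" {
		return ctrl.Result{}, nil
	}

	// The external controller owns the teardown: drain the tenant node first,
	// then remove the machine/VM through its normal lifecycle.
	if err := drainTenantNode(ctx, vmi); err != nil {
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, deleteMachineFor(ctx, vmi)
}

// Placeholder stubs standing in for capk's drain and teardown logic.
func drainTenantNode(ctx context.Context, vmi *kubevirtv1.VirtualMachineInstance) error  { return nil }
func deleteMachineFor(ctx context.Context, vmi *kubevirtv1.VirtualMachineInstance) error { return nil }
```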

Q&A

Q: why not use termination grace period and signal drain internally when ACPI shutdown is detected?
A: We need to support the node drain process and timeouts that the cluster-api controllers execute today

Q: Should we be concerned that this feature could block node drain indefinitely?
A: Users can already create a PDB today for their VMIs to block node drain indefinitely so we're not doing anything here that a user couldn't achieve on their own. This feature primarily just adds a way to detect that eviction was called on the VMI (via vmi.Status.EvacuationNodeName)

Adds new EvictionStrategy "External" for blocking eviction which is handled by an external controller
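
For illustration only, a sketch of what opting a VMI into the new strategy looks like from Go; the name and namespace are made up and the rest of the spec is elided:

```go
package example

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	kubevirtv1 "kubevirt.io/api/core/v1"
)

// externalEvictionVMI shows the shape of a VMI that opts into the new strategy.
func externalEvictionVMI() *kubevirtv1.VirtualMachineInstance {
	external := kubevirtv1.EvictionStrategyExternal
	return &kubevirtv1.VirtualMachineInstance{
		// Hypothetical identifiers for a capk-managed tenant node.
		ObjectMeta: metav1.ObjectMeta{Name: "tenant-node-0", Namespace: "tenant-cluster"},
		Spec: kubevirtv1.VirtualMachineInstanceSpec{
			// On eviction, KubeVirt blocks the eviction with a PDB and populates
			// status.evacuationNodeName instead of shutting the VMI down.
			EvictionStrategy: &external,
			// Domain, volumes, and networks omitted for brevity.
		},
	}
}
```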

@kubevirt-bot kubevirt-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. size/L labels Mar 30, 2022
Member

@acardace left a comment

just a few comments but everything looks good!

One other thing, can you please copy-paste what you wrote in the PR description into the first commit message? That would help a lot in understanding where this change originates when going through the git history.

@davidvossel
Member Author

/retest

1 similar comment
@davidvossel
Member Author

/retest

Contributor

@iholder101 left a comment

Thanks David!

Q: why not use termination grace period and signal drain internally when ACPI shutdown is detected?
A: We need to support the node drain process and timeouts that the cluster-api controllers execute today

I don't really understand the answer here. I mean, why can't the "shutdown process" change the VMI status (like you did), then get stuck until the cluster-api controllers (somehow) notify that the draining is done? Thanks.

Request: &admissionv1.AdmissionRequest{
	Name:      pod.Name,
	Namespace: pod.Namespace,
	DryRun:    &dryRun,
Contributor

This can be replaced with pointer.Bool(false); it's both more readable and better in terms of performance (only one pointer for all false/true pointers out there).
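
For context, a minimal sketch of the suggested change, assuming k8s.io/utils/pointer is available; the newEvictionRequest helper name is hypothetical:

```go
package admitters

import (
	admissionv1 "k8s.io/api/admission/v1"
	k8sv1 "k8s.io/api/core/v1"
	"k8s.io/utils/pointer"
)

// newEvictionRequest builds a dry-run-disabled admission request without
// declaring a local dryRun variable just to take its address.
func newEvictionRequest(pod *k8sv1.Pod) *admissionv1.AdmissionRequest {
	return &admissionv1.AdmissionRequest{
		Name:      pod.Name,
		Namespace: pod.Namespace,
		DryRun:    pointer.Bool(false),
	}
}
```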

@fabiand
Member

fabiand commented Apr 4, 2022

@davidvossel could it be that https://kubernetes.io/docs/concepts/architecture/nodes/#graceful-node-shutdown will help us to address this in a different way?

If the shutdown is signaled properly to the launcher, then libvirt/qemu will properly signal (via acpi or GA) a shutdown to the guest, giving it time to gracefully shut down.

@davidvossel
Member Author

I mean, why can't the "shutdown process" change the VMI status (like you did), then get stuck until the cluster-api controllers (somehow) notify that the draining is done? Thanks.

If the shutdown is signaled properly to the launcher, then libvirt/qemu will properly signal (via acpi or GA) a shutdown to the guest, giving it time to gracefully shut down.

@fabiand @iholder-redhat Since your questions are similar, I'll answer them both here.

  1. We don't control what is in the guest in our scenario. There may or may not be a guest agent and we shouldn't enforce that as a requirement.
  2. The graceful shutdown is actually controlled by another cluster-level controller (cluster-api's machine controller). That controller is using the k8s API to drain the node living within the guest, then telling the machine to go away (which ultimately deletes the VM/VMI through a series of owner references).
  3. We can't commit to the deletion of the VMI pod in this process by marking it for deletion (DeletionTimestamp != nil). There is no appropriate termination grace period here that we can rely on. It's up to the cluster-api controllers to determine if the VM has been drained successfully and can shut down, not a termination grace period timeout.

Anything involving triggering something within the guest OS needs to stay 100% out of the equation here. From a cluster level, we need to know that an eviction is occurring on a guest without the guest being touched by kubevirt at all, then allow our external cluster controllers (capi controllers) to perform the shutdown according to the policies defined there.

@davidvossel
Member Author

Here's a document [1] that outlines the cluster-api provider related process that this EvictionStrategy: External will be used in.

  1. https://docs.google.com/document/d/1ePTdleIxBPYU52cV00P55myOIKVjyH9aCJc10WbJmms/edit#

@acardace
Member

acardace commented Apr 6, 2022

@davidvossel the PR looks good, can you just improve the first commit message by giving some context on why this change is required (a copy-paste of the PR description would be enough)?

@fabiand
Member

fabiand commented Apr 6, 2022

We don't control what is in the guest in our scenario. There may or may not be a guest agent and we shouldn't enforce that as a requirement.

If this is the line you draw, then indeed https://kubernetes.io/docs/concepts/architecture/nodes/#graceful-node-shutdown is of no help and the external orchestration is needed.

However, if we assume that there is some awareness around ACPI signals in the guest, then the regular (mgmt cluster level) VM eviction flow should suffice (IMHO):

  1. mgmt cluster drains node A
  2. kubelet on node A tries to gracefully terminate pods
  3. launcher signals shutdown to guest A in VMI A via ACPI
  4. systemd in guest A receives ACPI signal
  5. kubelet in guest A tries to gracefully terminate pods
  6. kubelet in guest A eventually succeeds
  7. guest A is shutting down, is shut down
  8. VMI A runs to completion
  9. node A is drained

My last 2 cents: I wonder if "manual" instead of "external" would provide more context on how the eviction is taking place.

@davidvossel
Member Author

/hold

I want to test this eviction strategy in capk end to end before we commit to this new api field.

However, if we assume that there is some awareness around ACPI signals in the guest, then the regular (mgmt cluster level) VM eviction flow should suffice (IMHO):

mgmt cluster drains node A
kubelet on node A tries to gracefully terminate pods
launcher signals shutdown to guest A in VMI A via ACPI
systemd in guest A receives ACPI signal
kubelet in guest A tries to gracefully terminate pods
kubelet in guest A eventually succeeds
guest A is shutting down, is shut down
VMI A runs to completion
node A is drained

The issue I'm sorting through here is around safety and observation of PDBs within the tenant cluster.

As soon as we mark a VMI for deletion (DeletionTimestamp != nil), the VMI is definitely going down eventually. Maybe that's necessary though to prevent the mgmt cluster from blocking on tenant node shutdown.

I did some more research and I see that the kubelet (as of 1.21) has a graceful node shutdown feature (https://kubernetes.io/blog/2021/04/21/graceful-node-shutdown-beta/). I want to test this in the wild a little more before committing to this new External strategy field.

@kubevirt-bot kubevirt-bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 6, 2022
@@ -712,29 +712,38 @@ func podNetworkRequiredStatusCause(field *k8sfield.Path) metav1.StatusCause {
}
}

func isValidEvictionStrategy(evictionStrategy *v1.EvictionStrategy) bool {
if evictionStrategy == nil {
Member

Just a suggestion.

return evictionStrategy == nil ||
	*evictionStrategy == v1.EvictionStrategyLiveMigrate ||
	*evictionStrategy == v1.EvictionStrategyNone ||
	*evictionStrategy == v1.EvictionStrategyExternal

Member Author

fixed

@@ -154,6 +154,7 @@ var _ = Describe("Validating VMICreate Admitter", func() {
},
Entry("migration policy to be set to LiveMigrate", v1.EvictionStrategyLiveMigrate),
Entry("migration policy to be set None", v1.EvictionStrategyNone),
Entry("migration policy to be set External", v1.EvictionStrategyExternal),
Member

I know this isn't exclusively related to your PR but could you also add an entry for nil?

Member Author

added

return migrations.VMIMigratableOnEviction(c.clusterConfig, vmi)

evictionStrategy := migrations.VMIEvictionStrategy(c.clusterConfig, vmi)
if evictionStrategy == nil {
Member

Just a suggestion.

return evictionStrategy != nil && (*evictionStrategy == virtv1.EvictionStrategyLiveMigrate || *evictionStrategy == virtv1.EvictionStrategyExternal)

Member Author

I prefer the switch statement for readability.
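
For reference, a sketch of the switch-based form being referred to; the function name and exact shape are illustrative, not necessarily the merged code:

```go
package controller

import virtv1 "kubevirt.io/api/core/v1"

// vmiNeedsEvacuationHandling mirrors the logic under discussion: only the
// LiveMigrate and External strategies require evacuation handling on eviction.
func vmiNeedsEvacuationHandling(evictionStrategy *virtv1.EvictionStrategy) bool {
	if evictionStrategy == nil {
		return false
	}
	switch *evictionStrategy {
	case virtv1.EvictionStrategyLiveMigrate, virtv1.EvictionStrategyExternal:
		return true
	default:
		return false
	}
}
```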

@@ -205,6 +206,39 @@ var _ = Describe("[rfe_id:273][crit:high][arm64][vendor:cnv-qe@redhat.com][level
Expect(pod.Annotations).To(HaveKey("kubernetes.io/test"), "kubernetes annotation should not be carried to the pod")
})

It("Should prevent eviction when EvictionStratgy: External", func() {
Member

Would it make sense to also simulate the external shutdown of the VM?

Member Author

we're only interested in ensuring that the VMI is marked with the EvacuationNodeName when this occurs so the external controller can manage the situation.

All we're guaranteed with this new feature is that KubeVirt will provide the signal vmi.Status.EvacuationNodeName when eviction is blocked. It's up to the external controller to determine what that means... it might mean shutdown, it might mean something else.
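
In other words, the functional test only needs an assertion along these lines (a hedged sketch against the KubeVirt client API of that era; virtClient, namespace, and name would come from the surrounding test):

```go
package tests

import (
	"time"

	. "github.com/onsi/gomega"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"kubevirt.io/client-go/kubecli"
)

// expectEvacuationSignal is a hypothetical helper: after the eviction call is
// rejected by the PDB, the only guarantee this feature gives is that
// vmi.Status.EvacuationNodeName gets populated for the external controller.
func expectEvacuationSignal(virtClient kubecli.KubevirtClient, namespace, name string) {
	Eventually(func() string {
		vmi, err := virtClient.VirtualMachineInstance(namespace).Get(name, &metav1.GetOptions{})
		Expect(err).ToNot(HaveOccurred())
		return vmi.Status.EvacuationNodeName
	}, 30*time.Second, time.Second).ShouldNot(BeEmpty())
}
```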

Expect(pod).ToNot(BeNil())

By("calling evict on VMI's pod")
err = virtClient.CoreV1().Pods(vmi.Namespace).EvictV1beta1(context.Background(), &policyv1beta1.Eviction{ObjectMeta: metav1.ObjectMeta{Name: pod.Name}})
Member

In some cases, this can actually be flaky. Can you issue a retry here just to be sure?

Member Author

We expect this to work under normal operation. I'm not seeing anywhere else in the test code base where we re-attempt eviction except in the migration tests where we're ensuring the pods stay protected the entire duration of the migration.

Member

You are right. I hadn't read through the whole test and hadn't realized you would not see the EvacuationNodeName if it fails.

@kfox1111
Contributor

kfox1111 commented Apr 6, 2022

A: Users can already create a PDB today for their VMIs to block node drain indefinitely so we're not doing anything here that a user couldn't achieve on their own. This feature primarily just adds a way to detect that eviction was called on the VMI (via vmi.Status.EvacuationNodeName)

That is a matter of policy. It may not be a correct assumption on some clusters.

…o react to VMI eviction

The cluster-api-provider-kubevirt (capk) project needs a way for VMIs to be blocked from eviction, yet signal capk that eviction has been called on the VMI so the capk controller can handle tearing the VMI down.

Here's why.

When a VMI is being evicted, we need a way to have that eviction blocked so the cluster-api-related controllers can properly drain the k8s node running within the VMI before the VMI is torn down. We already have all the mechanisms in place in the cluster-api controllers to coordinate this; we just need a way to detect that the VMI needs to go down (without actually taking the VMI down).

With `EvictionStrategy: External`, kubevirt will create a PDB for the VMI which blocks eviction, and it will also set `vmi.Status.EvacuationNodeName` in the VMI's status. When the capk controllers see that `vmi.Status.EvacuationNodeName` is set, we'll start the process of draining and tearing down the VMI gracefully from our side.

Q: why not use termination grace period and signal drain internally when ACPI shutdown is detected?
A: We need to support the node drain process and timeouts that the cluster-api controllers execute today

Q: Should we be concerned that this feature could block node drain indefinitely?
A: Users can already create a PDB today for their VMIs to block node drain indefinitely so we're not doing anything here that a user couldn't achieve on their own. This feature primarily just adds a way to detect that eviction was called on the VMI (via vmi.Status.EvacuationNodeName)

```release-note
Adds new EvictionStrategy "External" for blocking eviction which is handled by an external controller
```

Signed-off-by: David Vossel <davidvossel@gmail.com>
Signed-off-by: David Vossel <davidvossel@gmail.com>
Signed-off-by: David Vossel <davidvossel@gmail.com>
@davidvossel
Member Author

/hold cancel

@fabiand I spent some time considering the suggestion to name this eviction strategy Manual instead of External. I'd prefer to stick with External. Here's why.

This feature is meant for automation. It's for when another controller creates a VM and controls the lifecycle of that VM. The External eviction strategy provides that controller a signal, vmi.Status.EvacuationNodeName, for when the external controller needs to process the VMI's eviction with custom logic... The alternative here is that the external controller would need to register an eviction webhook itself and create/manage a PDB to protect the VMI's pod. KubeVirt already exposes this behavior for EvictionStrategy: LiveMigrate, so we have the opportunity to leverage mechanisms that already exist in KubeVirt for custom eviction handling.

For end users who want to protect their VMs for a Manual shutdown, my expectation is that they use their own PDBs to do this. This aligns with how users protect normal pod workloads.

@davidvossel the PR looks good, can you just improve the first commit message by giving some context on why this change is required (a copy-paste of the PR description would be enough)?

done

@kubevirt-bot kubevirt-bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 7, 2022
@acardace
Member

acardace commented Apr 7, 2022

/lgtm

@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Apr 7, 2022
@acardace
Member

acardace commented Apr 7, 2022

@davidvossel you also copied the release note line into the first commit message; it's not a big deal, but you might want to remove it.

@davidvossel
Member Author

/retest

@rmohr
Member

rmohr commented Apr 12, 2022

I'd prefer to stick with External. Here's why.

This feature is meant for automation

The automation focus makes a lot of sense to me. :+1:

The content and the use-case make sense to me.

/approve

@kubevirt-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rmohr


@kubevirt-bot kubevirt-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 12, 2022
@rmohr
Member

rmohr commented Apr 12, 2022

A: Users can already create a PDB today for their VMIs to block node drain indefinitely so we're not doing anything here that a user couldn't achieve on their own. This feature primarily just adds a way to detect that eviction was called on the VMI (via vmi.Status.EvacuationNodeName)

That is a matter of policy. It may not be a correct assumption on some clusters.

That is true. The PDB logic is however not new, and we have been creating PDBs for quite some time (as kind of a workaround to not needing a required webhook on pods, in case kubevirt goes down). It would be interesting to see if we can somehow properly deal with cases where people don't have access to PDBs. It may indeed have unexpected side-effects if admins don't know about our drain handling.

@acardace
Member

/retest

@kubevirt-commenter-bot

/retest-required
This bot automatically retries required jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

3 similar comments
@kubevirt-commenter-bot

/retest-required
This bot automatically retries required jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@kubevirt-commenter-bot

/retest-required
This bot automatically retries required jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@kubevirt-commenter-bot

/retest-required
This bot automatically retries required jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@kubevirt-bot kubevirt-bot merged commit a684a6f into kubevirt:main Apr 14, 2022