
Cluster-Wide CPU Feature Disablement in KubeVirt #11407

Open · wants to merge 3 commits into base: main

Conversation

Barakmor1
Member

@Barakmor1 Barakmor1 commented Mar 3, 2024

This PR introduces a Kubevirt CR Configuration for cluster-wide CPU feature disablement.

Why is this Feature needed?

  1. In the past, we encountered cases where libvirt's validation process failed for a CPU model even though it was usable according to the Libvirt API. This typically happens when a CPU feature associated with the CPU model gets deprecated.
    Examples of such features are "mpx" and "tsx".
    Explicit disablement of these features in the domain XML prevents such validation failures.

  2. Currently, we rely on reading Libvirt's internal data to prevent the failures mentioned in point 1. However, Libvirt strongly advises against accessing its internal data. Additionally, it prevents us from supporting some CPU models when a deprecated CPU feature is not supported on the node.

  3. CPU vulnerabilities might appear in the future, and administrators need a way to mitigate these vulnerabilities at the cluster level quickly, even before they are fixed in the libvirt/qemu level.

How to use this Feature:
To disable features across the entire cluster, set the cpuFeaturesToDisable field under Kubevirt.spec.configuration. For instance:

spec:
  configuration:
    cpuFeaturesToDisable:
      Skylake-Client-noTSX-IBRS:
        - aes
        - mpx
      Opteron_G2"
        - svm

This configuration disables aes and mpx whenever the CPU model is Skylake-Client-noTSX-IBRS, and disables svm whenever the CPU model is Opteron_G2. Note that setting this in Kubevirt.spec.configuration.cpuFeaturesToDisable is not reflected in the VMI spec.
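For context, here is a minimal sketch of the full KubeVirt CR that the snippet above would sit in. The CR name and namespace below (kubevirt/kubevirt) are the usual installation defaults and are assumptions; adjust them to your deployment:

apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    cpuFeaturesToDisable:
      Skylake-Client-noTSX-IBRS:
        - aes
        - mpx
      Opteron_G2:
        - svm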

How do users benefit from this solution?

Before this PR, if a CPU feature needed to be disabled for a CPU model due to a bug in qemu/libvirt or due to CPU vulnerabilities, users might have had to take multiple actions:

  1. Modify all existing VMs to disable the problematic CPU feature
  2. Restart all existing VMs with the problematic CPU feature to disable it.
  3. Create all new VMs with the problematic CPU feature disabled.

After this PR, the required actions will be reduced to:

  1. The admin would have to use the Kubevirt.spec.configuration.cpuFeaturesToDisable field to disable the problematic CPU feature.
  2. Restart VMs with the problematic CPU feature.

Furthermore, this solution enables broader CPU model support because it allows us to avoid reading Libvirt internal data. For example, after this PR, KubeVirt will support Skylake on nodes that lack the mpx feature (which is currently not supported).

The following alternatives were considered:

  • Using a ConfigMap: Admins could create a ConfigMap with a fixed name to describe features to disable, defined similarly to the Kubevirt.spec.configuration.cpuFeaturesToDisable field. However, this makes configuration harder, as it involves setting multiple objects to configure KubeVirt's behavior.

  • Manual Feature Disabling: Users would manually disable problematic features each time they create a new VM and restart all VMs with the problematic feature disabled until the issue is fixed, instead of introducing a new API. However, it might take a significant amount of time until it's fixed in QEMU/libvirt, and KubeVirt will need to be upgraded to propagate the fix. During this time, the burden would fall on the VM creator, who would have to manually configure feature disablement each time they create a VM.

How does this feature affect new VMs?
Kubevirt.spec.configuration.cpuFeaturesToDisable only affects the libvirt domain XML input provided by KubeVirt, resulting in feature disablement at the QEMU level.

For example, if the cpuFeaturesToDisable map contains the key and value Skylake-Client-noTSX-IBRS: "aes,mpx", the libvirt domain XML input will include the following:

...
<cpu mode='custom' match='exact' check='full'>
  <model fallback='forbid'>Skylake-Client-noTSX-IBRS</model>
  <topology sockets='1' dies='1' cores='1' threads='1'/>
  <feature policy='disable' name='aes'/>
  <feature policy='disable' name='mpx'/>
...

With this new configuration, existing guest VMs won't be affected because it only changes the domain XML, which doesn't change once the VMs are running, even during live migrations. Additionally, the VMI CRs at the Kubernetes level won't be affected in any way by the configuration.

For more information, please refer to https://libvirt.org/formatdomain.html#cpu-model-and-topology.

How does this feature affect running VMs and live migration?
With this new feature, existing VMs won't be affected because it only changes the domain XML, which doesn't change once the VMs are running, even during live migrations.

Handling Conflicts Between cpuFeaturesToDisable and CPU Features API:
If Kubevirt.spec.configuration.cpuFeaturesToDisable conflicts with the CPU features specified in VMI.Spec.Domain.CPU.Features, the configuration policy specified in VMI.Spec.Domain.CPU.Features will determine the behavior.
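As a sketch of that rule: with mpx listed under cpuFeaturesToDisable for Skylake-Client-noTSX-IBRS, a VMI that explicitly requires the feature keeps its own policy:

spec:
  domain:
    cpu:
      model: Skylake-Client-noTSX-IBRS
      features:
      - name: "mpx"
        policy: "require"

Here the per-VMI "require" policy wins and mpx stays enabled for that guest, while VMIs that do not mention mpx get the cluster-wide disablement.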

Impact on scheduling:
Kubevirt.spec.configuration.cpuFeaturesToDisable does not affect scheduling. Neither launcher affinity nor label selectors are influenced by it. Additionally, node labels remain unchanged regardless of the features included in the cpuFeaturesToDisable.

Changes to Kubevirt.spec.configuration.cpuFeaturesToDisable:

  • After removing cpuFeaturesToDisable from KubeVirt's configuration, restarting existing VMs or creating new ones will no longer disable features in the guest VMs.
  • After modifying cpuFeaturesToDisable in KubeVirt's configuration, restarting existing VMs or creating new ones will disable features in the guest VMs according to the modified configuration.

In both cases, existing VMs will remain unchanged even when live migrating them to different nodes, so the disabled features will remain disabled.

PR for documentation in the user-guide:
kubevirt/user-guide#771

Fixes #

Why we need it and why it was done in this way

Allowing broader CPU model support when the svm or mpx features are absent.

Special notes for your reviewer

Checklist

This checklist is not enforcing, but it's a reminder of items that could be relevant to every PR.
Approvers are expected to review this list.

Release note

Introduce Cluster-Wide CPU Feature Disablement option

@kubevirt-bot kubevirt-bot added release-note-none Denotes a PR that doesn't merit a release note. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. size/XL labels Mar 3, 2024
@Barakmor1
Member Author

@dankenigsberg I linked a bug for Libvirt in the PR description:
https://gitlab.com/libvirt/libvirt/-/issues/608

@fabiand
Member

fabiand commented Mar 4, 2024

/hold

To avoid the validation error mentioned above we will disable
some of the features in the /usr/share/libvirt/cpu_map/[CPU Model].xml files:

disable svm for Opteron_G2 to avoid:
https://bugzilla.redhat.com/show_bug.cgi?id=2122283
disable mpx for Skylake, Cascadelake, and Icelake to avoid:
https://gitlab.com/libvirt/libvirt/-/issues/304

Same for

In addition, for future issues we allow setting
a config map called "cpu-model-to-features-to-disable"

The PR description does not explain why KubeVirt has to do this expensive task of manually tracking certain CPU features.
We should request 2 things

  1. Why does libvirt not do this?
  2. Why should KubeVirt do this?

Going forward this will be an ever-growing list of CPU features which need special treatment - why should a high-level mgmt layer (KubeVirt) manage those, and not something like a VMM runtime daemon (libvirt)?

@kubevirt-bot kubevirt-bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 4, 2024
@Barakmor1
Member Author

/hold

To avoid the validation error mentioned above we will disable
some of the features in the /usr/share/libvirt/cpu_map/[CPU Model].xml files:

disable svm for Opteron_G2 to avoid:
https://bugzilla.redhat.com/show_bug.cgi?id=2122283
disable mpx for Skylake, Cascadelake, and Icelake to avoid:
https://gitlab.com/libvirt/libvirt/-/issues/304

I didn't open these bugs; these are old bugs describing why we had to read the /usr/share/libvirt/cpu_map/[CPU Model].xml files.

Same for

In addition, for future issues we allow setting
a config map called "cpu-model-to-features-to-disable"

This is meant for fixing similar issues in the future or when we encounter problematic features. It's not part of the kubevirt API, and only administrators can use it since it is in the kubevirt installation namespace.

The PR description does not explain why KubeVirt has to do this expensive task of manually tracking certain CPU features. We should request 2 things

  1. Why does libvirt not do this?
  2. Why should KubeVirt do this?

Going forward this will be an ever-growing list of CPU features which need special treatment - why should a high-level mgmt layer (KubeVirt) manage those, and not something like a VMM runtime daemon (libvirt)?

Hey Fabian.
This is a temporary workaround to avoid failures until the bug is fixed.
We also included a way for a quick fix whenever we hit the problem again.

I modified the PR description to be clear on that and also added a bug I created for libvirt: https://gitlab.com/libvirt/libvirt/-/issues/608

Thanks for the notes

@fabiand
Member

fabiand commented Mar 4, 2024

Can't we then wait for libvirt to fix the bug instead of working around it?

@Barakmor1
Member Author

Can't we then wait for libvirt to fix the bug instead of working around it?

It might take a while for libvirt to fix the issue. From what I understand, there's a compatibility problem with QEMU, but I'm not sure about the details.

The new workaround replaces the old one, which is removed in the first commit of this PR. However, using the old workaround is not recommended by libvirt.

These are the advantages of the new approach:

  • Allowing broader CPU model support (Cascadelake, Icelake, Skylake, and more) when the svm or mpx features are absent.
  • Including a way for a quick fix whenever we hit the problem again.

@Barakmor1
Member Author

It might take a while for libvirt to fix the issue. From what I understand, there's a compatibility problem with QEMU, but I'm not sure about the details.

The new workaround replaces the old one, which is removed in the first commit of this PR. However, using the old workaround is not recommended by libvirt.

These are the advantages of the new approach:

  • Allowing broader CPU model support (Cascadelake, Icelake, Skylake, and more) when the svm or mpx features are absent.
  • Including a way for a quick fix whenever we hit the problem again.

@fabiand I'll also add this to the PR description to make it more clear

@vladikr
Member

vladikr commented Mar 5, 2024

Can't we then wait for libvirt to fix the bug instead of working around it?

@fabiand As I answered here

This is not directly related to the bug. This PR should allow KubeVirt to react and mitigate issues with CPU models quickly instead of waiting for fixes from libvirt and QEMU.
There can be all kinds of CPU vulnerabilities that can theoretically be mitigated at the cluster level quickly (Spectre / Meltdown).
That's in addition to a known inconsistency between libvirt/QEMU wrt svm in a certain model.
mpx is just another case that could have been mitigated cluster-wide until a relevant fix arrives.

By default, no action is required from the cluster admin.

@dankenigsberg
Member

There can be all kinds of CPU vulnerabilities that can theoretically be mitigated at the cluster level quickly (Spectre / Meltdown).

@vladikr is the root approver so he makes the call.

I take a different approach: these sorts of issues should be fixed in the relevant component (libvirt, qemu) and then consumed by KubeVirt via a rebuild. Exposing this in our API surface makes KubeVirt complicated, adding a burden on our users, tests and documentation. @Barakmor1 I'd appreciate it if you keep cpu-model-to-features-to-disable in its separate commit, and include documentation for it somewhere.

@fabiand
Member

fabiand commented Mar 5, 2024

I take a different approach

That's fine, but please recognize that this is a joint effort, @dankenigsberg

@fabiand
Member

fabiand commented Mar 5, 2024

Thanks for the link @vladikr, it would just be good (and now it is) to have this in the PR description as well

@Barakmor1 if this is a mitigation - What is the complete workflow for an admin in order to safely use this feature?

The timeline

  1. There is a bug detected in a CPU model (qemu/libvirt)
  2. This feature is used to work around it
  3. qemu/libvirt have fixed the issue
  4. KubeVirt adopts the fix

I think it's clear how an admin would use the feature in order to workaround the issue, now some questions:

  • How would VMs look that get created after step 2?
  • How would VMs work after step 3?
  • How would the admin remove the workaround again? What happens with the VMs impacted by this workaround?
  • How high or low is the uncertainty that this workaround will or will not work?

I'm trying to widen the view in order to get us to understand what burden we are putting onto the admin.

@Barakmor1 can you help to find some answers to the questions above?
And what other questions should we look at?

@fabiand
Member

fabiand commented Mar 5, 2024

@Barakmor1 do we know when https://gitlab.com/libvirt/libvirt/-/issues/608 will see some attention?
What are we doing to get this resolved by libvirt/qemu?

@fabiand
Member

fabiand commented Mar 5, 2024

@Barakmor1 @vladikr FWIW wouldn't an alternative be to be able to define custom CPU models?
I.e. a user would specify a base model, and then add or remove flags and give it a custom name.

The diff to this proposal (IIUIC) would be: A custom CPU is independent of what libvirt/qemu provides. Aka the upgrade path would be clearer.

@Barakmor1
Member Author

There can be all kinds of CPU vulnerabilities that can theoretically be mitigated at the cluster level quickly (Spectre / Meltdown).

@vladikr is the root approver so he makes the call.

I take a different approach: these sorts of issues should be fixed in the relevant component (libvirt, qemu) and then consumed by KubeVirt via a rebuild. Exposing this in our API surface makes KubeVirt complicated, adding a burden on our users, tests and documentation. @Barakmor1 I'd appreciate it if you keep cpu-model-to-features-to-disable in its separate commit, and include documentation for it somewhere.

Hi @dankenigsberg ,

I appreciate the important notes.

I've added a link to the libvirt issue in the comments.
The addition of the config map is now in a separate commit.

I'll also include a clear explanation in the documentation at https://github.com/kubevirt/user-guide. I'm currently working on a PR and will notify you once it's published.

@Barakmor1
Member Author

Barakmor1 commented Mar 6, 2024

Thanks for the link @vladikr, it would just be good (and now it is) to have this in the PR description as well

@Barakmor1 if this is a mitigation - What is the complete workflow for an admin in order to safely use this feature?

The timeline

  1. There is a bug detected in a CPU model (qemu/libvirt)
  2. This feature is used to work around it
  3. qemu/libvirt have fixed the issue
  4. KubeVirt adopts the fix

Yes, that's the process. Just a heads up: during the second stage, some existing VMs may need to be restarted.

This process is simpler compared to the current one, which involves disabling Node Labeller, adding labels to nodes, modifying VMs to disable the feature, restarting them, and manually disabling the feature for new VMs.

I think it's clear how an admin would use the feature in order to workaround the issue, now some questions:

  • How would VMs look that get created after step 2?
  • How would VMs work after step 3?

With this new workaround, the VMs would appear the same in all the steps. The only thing that needs to be done is to create a config map called "cpu-model-to-features-to-disable" in the installation namespace with the problematic feature and relevant models.
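For illustration, a rough sketch of such a ConfigMap. Only the name and the installation namespace are stated in the thread; the data layout below (CPU model as key, comma-separated features as value) is an assumption, and the namespace may differ per deployment:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cpu-model-to-features-to-disable
  namespace: kubevirt
data:
  Skylake-Client-noTSX-IBRS: "mpx"
  Opteron_G2: "svm"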

With the current workaround (without this PR's changes):

  • The VMs of the second step would have to manually disable the problematic feature.
  • The VMs of the third step would look the same as before the issue (the problematic feature would be disabled implicitly by default).
  • How would the admin remove the workaround again? What happens with the VMs impacted by this workaround?
  • How high or low is the uncertainty that this workaround will or will not work?

With the workaround, we will list features that are deprecated and should not be enabled, such as mpx, in a ConfigMap.
If the VM explicitly requires mpx in the VM spec, like in this example:

spec:
  domain:
    cpu:
      model: Skylake-Client-noTSX-IBRS
      features:
      - name: "mpx"
        policy: "require"

The workaround will not affect anything, and the VM will enable mpx even though it exists in the map. The workaround will only help admins disable deprecated features that should not be used anyway. The libvirt/qemu/kubevirt fix will ensure that these features are disabled by default, even without the workaround.

So there is no risk if an admin forgets to remove the ConfigMap, as this won't affect anything.

@Barakmor1
Member Author

@Barakmor1 do we know when https://gitlab.com/libvirt/libvirt/-/issues/608 will see some attention? What are we doing to get this resolved by libvirt/qemu?

There is a discussion regarding this issue with the Libvirt team in a mailing list. Hopefully, this will be addressed soon :) .

@Barakmor1
Member Author

@Barakmor1 @vladikr FWIW wouldn't an alternative be to be able to define custom CPU models? I.e. a user would specify a base model, and then add or remove flags and give it a custom name.

The diff to this proposal (IIUIC) would be: A custom CPU is independent of what libvirt/qemu provides. Aka the upgrade path would be clearer.

This is similar to what we are doing here: the features in the map will be disabled automatically for some CPU models. There is no need to add features if the user didn't request additional features in the VMI's spec.

@Barakmor1
Member Author

@dankenigsberg here is the documentation PR:
kubevirt/user-guide#771

@fabiand can we remove the hold label?

@vladikr
Member

vladikr commented Mar 6, 2024

@Barakmor1 @vladikr FWIW wouldn't an alternative be to be able to define custom CPU models? I.e. a user would specify a base model, and then add or remove flags and give it a custom name.

@fabiand partially this is possible today already. https://kubevirt.io/user-guide/virtual_machines/virtual_hardware/#features
We're only missing a way to disable a specific CPU feature, which can be added.

I don't see how this is an alternative to a cluster-wide approach. This approach puts a burden on the guest user, who may not be aware of CVEs and cluster-wide issues.

This PR is about allowing the admin to react quickly to known issues until fixes arrive from the lower-layer projects.

The diff to this proposal (IIUIC) would be: A custom CPU is independent of what libvirt/qemu provides. Aka the upgrade path would be clearer.

@Barakmor1
Member Author

/unhold

@kubevirt-bot kubevirt-bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 11, 2024
@alaypatel07
Contributor

@Barakmor1 I still think it's worth splitting it into two separate PRs so we can have time to discuss API changes instead of rushing it in.

@vladikr
Member

vladikr commented Apr 9, 2024

@Barakmor1 I still think it's worth splitting it into two separate PRs so we can have time to discuss API changes instead of rushing it in.

I thought we already agreed that these API changes make sense.

Kubevirt needs to have a way to act quickly on CPU vulnerabilities and be able to disable features, cluster-wide for newly created/restarted VMs.
There is enough evidence that this is needed. The description of this PR has enough details on the issues as well.

If there are no new suggestions on how this API should look, I'd prefer to go ahead with the currently suggested one.

@alaypatel07
Contributor

Is there a reason why it was decided to have this KV configuration API only for feature disabling and not for feature enabling? There are some use cases where we are enabling certain features cluster-wide for performance. I am wondering if the following API makes sense:

kubevirt.spec.kubevirtConfiguration:

  defaultCPUFeatures:
  - name: featurefoo
    enabled: false
  - name: featurebar
    enabled: true

This is already present in the vmi spec here https://github.com/kubevirt/kubevirt/blob/main/staging/src/kubevirt.io/api/core/v1/schema.go#L365

Functionally, it would mean that if these features are not defined by the user in the VMI spec, then a mutating webhook will add defaults to vmi.spec.domain.cpu.features. @fabiand This will handle the case where the state of a CPU feature is not reflected in the object.
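A rough sketch of what such defaulting could produce on a VMI that did not set these features itself, assuming enabled: false maps onto the existing "disable" policy and enabled: true onto "require" (this defaultCPUFeatures API is a suggestion, not something implemented in this PR):

spec:
  domain:
    cpu:
      features:
      - name: featurefoo
        policy: "disable"
      - name: featurebar
        policy: "require"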

@EdDev
Member

EdDev commented Apr 9, 2024

@Barakmor1 I still think it's worth splitting it into two separate PRs so we can have time to discuss API changes instead of rushing it in.

I thought we already agreed that these API changes make sense.

Kubevirt needs to have a way to act quickly on CPU vulnerabilities and be able to disable features, cluster-wide for newly created/restarted VMs. There is enough evidence that this is needed. The description of this PR has enough details on the issues as well.

If there are no new suggestions on how this API should look, I'd prefer to go ahead with the currently suggested one.

Suggestions have been raised and they are still coming.
For now, I would like to understand why this cannot be resolved by instancetypes [1].

[1] #11407 (comment)

The notion of creating a new API to try and resolve an intermediate problem until formal release is surprising to me.
Do we have examples from other projects that do stuff like this?

@vladikr
Member

vladikr commented Apr 15, 2024

Is there a reason why it was decided to have this KV configuration API only for feature disabling and not for feature enabling? There are some use cases where we are enabling certain features cluster-wide for performance. I am wondering if the following API makes sense:

Yes, the reason is that there is no use case nor a justification for an admin to request specific low-level CPU features at the cluster level on behalf of VM owners.

On the other hand, there are use cases for making specific hardware unavailable/disabled for users cluster-wide to mitigate security issues or critical bugs.
The PR description talks about this.

kubevirt.spec.kubevirtConfiguration:

  defaultCPUFeatures:
  - name: featurefoo
    enabled: false
  - name: featurebar
    enabled: true

This is already present in the vmi spec here https://github.com/kubevirt/kubevirt/blob/main/staging/src/kubevirt.io/api/core/v1/schema.go#L365

Functionally, it would mean that if these features are not defined by the user in the VMI spec, then a mutating webhook will add defaults to vmi.spec.domain.cpu.features. @fabiand This will handle the case where the state of a CPU feature is not reflected in the object.

@vladikr
Member

vladikr commented Apr 15, 2024

@Barakmor1 I still think it's worth splitting it into two separate PRs so we can have time to discuss API changes instead of rushing it in.

I thought we already agreed that these API changes make sense.
Kubevirt needs to have a way to act quickly on CPU vulnerabilities and be able to disable features, cluster-wide for newly created/restarted VMs. There is enough evidence that this is needed. The description of this PR has enough details on the issues as well.
If there are no new suggestions on how this API should look, I'd prefer to go ahead with the currently suggested one.

Suggestions have been raised and they are still coming. For now, I would like to understand why this cannot be resolved by instancetypes [1].

I suppose for the same reason why device plugins are not managed via instance-types.

[1] #11407 (comment)

The notion of creating a new API to try and resolve an intermediate problem until formal release is surprising to me. Do we have examples from other projects that do stuff like this?

As Barak has repeatedly explained, this PR was not created to address a bug.
This PR allows KubeVirt to manage the availability/enablement of CPU features cluster-wide to mitigate either security issues or bugs.

In general, CPU features and models are managed by QEMU. libvirt has a layer to verify its usage and rationalize these.
In most cases, KubeVirt relies on libvirt for that matter, however, in some cases, KubeVirt needs to have a way to deviate from this approach temporarily. (re: description)

@EdDev
Member

EdDev commented Apr 15, 2024

Suggestions have been raised and they are still coming. For now, I would like to understand why this cannot be resolved by instancetypes [1].

I suppose for the same reason why device plugins are not managed via instance-types.

Sorry, I do not see the relation.
IIUC we can use instancetypes to share a common configuration over a group of VMs.
The only missing piece is to have a "default" one that we do not need to specify per VM but per cluster (or per group of them). If we could express that we want an instancetype to be added for all VMs through the Kubevirt CR, then we could set whatever fields we want, including these CPU feature settings.

Is this something doable?

[1] #11407 (comment)
The notion of creating a new API to try and resolve an intermediate problem until formal release is surprising to me. Do we have examples from other projects that do stuff like this?

As Barak has repeatedly explained, this PR was not created to address a bug. This PR allows KubeVirt to manage the availability/enablement of CPU features cluster-wide to mitigate either security issues or bugs.

"mitigate either security issues or bugs" sounds exactly like treating something intermediate to me.

In general, CPU features and models are managed by QEMU. libvirt has a layer to verify its usage and rationalize these. In most cases, KubeVirt relies on libvirt for that matter, however, in some cases, KubeVirt needs to have a way to deviate from this approach temporarily. (re: description)

That is what I understood as well.
We need this for a temporary period of time until it is treated correctly.
I guess my problem here is that IMO introducing an API just to solve an intermediate need like this is not worth it.
The solution is to release a fix quickly and until then to find a mitigating solution that is ugly but works.
Several such solutions have been given.

My main issue with the suggested API is that it's too specific, without an option to leverage it for other reasonable uses.
It comes down to creating another method to "template" parameters over the whole cluster's VMs. But this time it is focused on CPU features being disabled.

@fabiand
Member

fabiand commented Apr 18, 2024

Hi.

@Barakmor1 @vladikr the topic got started because of the mpx flag, IIUIC.
Ignoring this work, and Barak's fix in the other PR: Could we have also mitigated the issue by setting

features:
- name: mpx
  policy: disable

?

This PR introduces a Kubevirt CR Configuration for cluster-wide CPU feature disablement.

Why is this Feature needed?

- In the past, we encountered cases where libvirt's validation process failed with a CPU model, even though it is usable according to the Libvirt API. This typically happens when a CPU feature associated with the CPU model gets deprecated.
Examples of such features are "mpx" and "tsx".
Explicit disablement of these features in the domain XML prevents such validation failures.

- Currently, we rely on reading Libvirt's internal data to prevent the failures mentioned in point 1. However, Libvirt strongly advises against accessing its internal data. Additionally, it prevents us from supporting some CPU models when a deprecated CPU feature is not supported on the node.

- CPU vulnerabilities might appear in the future, and administrators need a way to mitigate these vulnerabilities at the cluster level quickly, even before they are fixed in the libvirt/qemu level.

How to use this Feature:
To disable features across the entire cluster, a field named "cpuFeaturesToDisable" should be set in Kubevirt.spec.configuration.cpuFeaturesToDisable. For instance:

spec:
  configuration:
    cpuFeaturesToDisable:
      Skylake-Client-noTSX-IBRS:
        - aes
        - mpx
      Opteron_G2:
        - svm

This configuration will result in the disablement of aes and mpx whenever the CPU Model is Skylake-Client-noTSX-IBRS, and also result in the disablement of svm whenever the CPU Model is Opteron_G2. Note that setting this in Kubevirt.spec.configuration.cpuFeaturesToDisable won't be reflected in the VMI Spec.

Why is this solution preferred?

Before this PR, if a CPU feature needed to be disabled for a CPU model due to a bug in qemu/libvirt or due to CPU vulnerabilities, users might have had to take multiple actions:

  1. Disable the Node Labeller.
  2. Add CPU model labels to nodes.
  3. Restart VMs with the problematic CPU feature to disable it.
  4. Create VMs with the problematic CPU feature disabled.

After this PR, the required actions will be reduced to:

  1. Restart VMs with the problematic CPU feature.

Furthermore, this solution enables broader CPU model support. For example, after this PR, KubeVirt will support Skylake on nodes that lack the mpx feature (which is currently not supported).

How does this feature affect new VMs?

Kubevirt.spec.configuration.cpuFeaturesToDisable only affects the libvirt domain XML input provided by KubeVirt, resulting in feature disablement at the QEMU level.

For example, if the cpuFeaturesToDisable map contains the key and value Skylake-Client-noTSX-IBRS: "aes,mpx", the libvirt domain XML input will include the following:

...
<cpu mode='custom' match='exact' check='full'>
  <model fallback='forbid'>Skylake-Client-noTSX-IBRS</model>
  <topology sockets='1' dies='1' cores='1' threads='1'/>
  <feature policy='disable' name='aes'/>
  <feature policy='disable' name='mpx'/>
...
With this new configuration, existing guest VMs won't be affected because it only changes the domain XML, which doesn't change once the VMs are running, even during live migrations. Additionally, the VMI CRs at the Kubernetes level won't be affected in any way by the configuration.

For more information, please refer to:
https://libvirt.org/formatdomain.html#cpu-model-and-topology

How does this feature affect running VMs?
With this new feature, existing VMs won't be affected because it only changes the domain XML, which doesn't change once the VMs are running, even during live migrations.

Handling Conflicts Between cpuFeaturesToDisable and CPU Features API:
If Kubevirt.spec.configuration.cpuFeaturesToDisable conflicts with the CPU features specified in VMI.Spec.Domain.CPU.Features, the configuration policy specified in VMI.Spec.Domain.CPU.Features will determine the behavior.

Impact on scheduling:
Kubevirt.spec.configuration.cpuFeaturesToDisable does not affect scheduling. Neither launcher affinity nor label selectors are influenced by it. Additionally, node labels remain unchanged regardless of the features included in the cpuFeaturesToDisable.

Changes to Kubevirt.spec.configuration.cpuFeaturesToDisable:

  • After removing cpuFeaturesToDisable from KubeVirt's configuration, restarting existing VMs or creating new ones will no longer disable features in the guest VMs.
  • After modifying cpuFeaturesToDisable in KubeVirt's configuration, restarting existing VMs or creating new ones will disable features in the guest VMs according to the modified configuration.

In both cases, existing VMs will remain unchanged even when live migrating them to different nodes, so the disabled features will remain disabled.

Signed-off-by: bmordeha <bmordeha@redhat.com>
Signed-off-by: bmordeha <bmordeha@redhat.com>
using cpuFeaturesToDisable configuration

Signed-off-by: bmordeha <bmordeha@redhat.com>
@kubevirt-bot kubevirt-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 18, 2024
@Barakmor1
Member Author

Hi.

@Barakmor1 @vladikr the topic got started because of the mpx flag, IIUIC. Ignoring this work, and Barak's fix in the other PR: Could we have also mitigated the issue by setting

features:
- name: mpx
  policy: disable

?

Yes, for each new VM, we'll need to disable mpx through the vmi.spec. Additionally, we'll need to update the existing ones to disable mpx and then restart them.
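For completeness, a sketch of how that per-VM mitigation looks in a VM/VMI spec, using the same shape as the earlier example in this thread but with the "disable" policy:

spec:
  domain:
    cpu:
      model: Skylake-Client-noTSX-IBRS
      features:
      - name: "mpx"
        policy: "disable"

Every new VM would need this stanza, and every existing VM would need it added followed by a restart, which is the manual burden this PR aims to remove.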

@xpivarc
Member

xpivarc commented Apr 18, 2024

Suggestions have been raised and they are still coming. For now, I would like to understand why this cannot be resolved by instancetypes [1].

I suppose for the same reason why device plugins are not managed via instance-types.

Sorry, I do not see the relation. IIUC we can use instancetypes to share a common configuration over a group of VMs. The only missing piece is to have a "default" one that we do not need to specify per VM but per cluster (or per group of them). If we could express that we want an instancetype to be added for all VMs through the Kubevirt CR, then we could set whatever fields we want, including these CPU feature settings.

The instancetypes' goal is different and they come with limitations; you can't use them in all cases.

Is this something doable?

[1] #11407 (comment)
The notion of creating a new API to try and resolve an intermediate problem until formal release is surprising to me. Do we have examples from other projects that do stuff like this?

As Barak has repeatedly explained, this PR was not created to address a bug. This PR allows KubeVirt to manage the availability/enablement of CPU features cluster-side to mitigate either security issues or bugs.

"mitigate either security issues or bugs" sounds exactly like treating something intermediate to me.

I am not sure what you mean here. Is an API designed to help with an intermediate problem not allowed? Is Kubevirt not allowed to provide a better user/admin experience? I would argue that faster security patches or bug fixes are worth the additional (low) maintenance.

We need to take into account that Kubevirt is just one layer on top of many and each layer can take considerable time to provide fixes.

In general, CPU features and models are managed by QEMU. libvirt has a layer to verify its usage and rationalize these. In most cases, KubeVirt relies on libvirt for that matter, however, in some cases, KubeVirt needs to have a way to deviate from this approach temporarily. (re: description)

That is what I understood as well. We need this for a temporary period of time until it is treated correctly. I guess my problem here is that IMO introducing an API just to solve an intermediate need like this is not worth it. The solution is to release a fix quickly and until then to find a mitigating solution that is ugly but works. Several such solutions have been given.

My main issue with the suggested API is that it's too specific, without an option to leverage it for other reasonable uses. It comes down to creating another method to "template" parameters over the whole cluster's VMs. But this time it is focused on CPU features being disabled.

@kubevirt-bot
Contributor

kubevirt-bot commented Apr 18, 2024

@Barakmor1: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
pull-kubevirt-e2e-k8s-1.29-sig-monitoring | 2b5979a | link | true | /test pull-kubevirt-e2e-k8s-1.29-sig-monitoring
pull-kubevirt-unit-test-arm64 | 90a4662 | link | false | /test pull-kubevirt-unit-test-arm64
pull-kubevirt-conformance-arm64 | 90a4662 | link | false | /test pull-kubevirt-conformance-arm64

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@fabiand
Member

fabiand commented Apr 18, 2024

@Barakmor1 thanks.

Now, if we divide our small KubeVirt world into two phases: an Env with a problematic CPU model, and an Env with a fixed CPU model, then we are saying

  1. This PR is needed to target the "BEFORE" environment
  2. We assume that eventually qemu + libvirt + us will deliver a fix which will take us to the "AFTER" environment

Thus we look for a temporary measure to fix one or more (possibly not all) VMs on a cluster to get through the time before "AFTER" appears.

I have my concerns - raised above - with the mangling of CPU models (now sharpening my concerns) on our KubeVirt level. I really wonder - and eventually worry - how we keep track of it, and how to mitigate or rather foresee the impact of this.
This being said, we now know that the manual reconfiguration of a VM addresses the "BEFORE" case and can be applied to one or more VMs. And we know that a specific fix like #11671 would allow KubeVirt to quickly target a specific CPU issue.

For multiple VMs, a webhook, a tool like Kyverno, or something like the patch-operator could help to do the job.
Having it in the API makes this also very explicit, and we do not need to have any additional measures to protect ourselves, or guide admins on how to upgrade/remove these flags again.
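For concreteness, a rough sketch of what such a Kyverno-based mitigation could look like. This is an untested illustration, not part of this PR: the match/mutate structure follows Kyverno's ClusterPolicy API, but whether the strategic-merge patch behaves as intended on the VirtualMachine CRD's features list would need verification:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disable-mpx
spec:
  rules:
  - name: add-mpx-disable-to-vms
    match:
      any:
      - resources:
          kinds:
          - VirtualMachine
    mutate:
      patchStrategicMerge:
        spec:
          template:
            spec:
              domain:
                cpu:
                  features:
                  - name: mpx
                    policy: disable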

TLDR We have two strategies to address critical situations (kubevirt fix and CPU features on VMI API), and automation tools could be used to address multiple VMs. The implementation in this PR will require a maintenance burden, additional knowledge by admins, and possibly other things we don't know of yet.
Due to this I wonder if we should simply settle with what we have (VMI CPU API, kubevirt fixes as needed) and close this topic and this PR.

@vladikr
Member

vladikr commented Apr 18, 2024

@fabiand For me, relying on external tools to set a cluster wide policy that only affects VMs is not good enough.
This way, we could say that managing device plugins should also be done outside of KubeVirt.

This API is specific because it should enforce a temporary policy for all VMs without affecting the VM/VMI spec.

Also, it could be that admins would want to disable certain CPU features and make these unavailable. For example, SVM was part of the QEMU CPU models.

TLDR We have two strategies to address critical situations (kubevirt fix and CPU features on VMI API), and automation tools could be used to address multiple VMs. The implementation in this PR will require a maintenance burden, additional knowledge by admins, and possibly other things we don't know of yet.

Managing CPU flags on a VM level cannot be considered a cluster-wide strategy or an admin's policy.

In addition, I'm not sure whether CPU

features:
- name: mpx
  policy: disable

can work with instance types.

@Barakmor1
Member Author

Barakmor1 commented Apr 18, 2024

@fabiand @vladikr BTW another interesting use case:
harvester/harvester#3015 (comment)

@EdDev
Member

EdDev commented Apr 18, 2024

The instancetypes's goal is different and they come with limitations, you can't use them in all cases.

My understanding is that instancetypes are a kind of template, allowing a user to share a configuration snippet across a group of VMs. In my view, this is exactly what you are trying to do here with the CPU feature disablement.
In any case, this is an example of how one can share a central configuration with many VM/VMIs.

"mitigate either security issues or bugs" sounds exactly like treating something intermediate to me.

I am not sure what you mean here. Is an API designed to help with an intermediate problem not allowed?

It is allowed, but it is not recommended.
Creating an API to solve intermediate glitches is a sign of a problem.

Is Kubevirt not allowed to provide better user/admin experience? I would argue that faster security patches or bug fixes are worth the additional (low) maintenance.

I guess this is exactly where we see things differently.
As an Admin, I would not want you to fix things using configuration, but using a release. Quickly and efficiently without me needing to manage any configuration.

We need to take into account that Kubevirt is just one layer on top of many and each layer can take considerable time to provide fixes.

My expectation is for Kubevirt to ideally provide the mitigation through a hotfix release and not through configuration. Then, it is up to Kubevirt to pressure the layered drivers it uses to deliver a proper fix.
The management is done internally and the Admin is not bothered with configuration burden.

I was unable to understand why a configuration API is better than a quick, hotfix release with the same logic embedded into Kubevirt and managed by it until a proper resolution is found.

Deferring the problem to the Admin using a configuration workaround sounds problematic to me.

@vladikr
Member

vladikr commented Apr 18, 2024

My expectation is for Kubevirt to ideally provide the mitigation through a hotfix release and not through configuration. Then, it is up to Kubevirt to pressure the layered drivers it uses to deliver a proper fix. The management is done internally and the Admin is not bothered with configuration burden.

I was unable to understand why a configuration API is better than a quick, hotfix release with the same logic embedded into Kubevirt and managed by it until a proper resolution is found.

Deferring the problem to the Admin using a configuration workaround sounds problematic to me.

I find it hard to agree with this. As an admin who is dealing with a security issue or an inconsistency that affects the cluster, I'd prefer to have a simple configuration rather than wait for an update and be exposed while the hotfix arrives.

Issues that are described in this PR have been waiting for a solution for many months. Even though some of these issues are fixed in some layers, we still cannot get a sufficient release.
KubeVirt can control its layer. And this API aims to let KubeVirt provide the flexibility needed to address issues as they appear.
Other platforms have this API and even more extensive (per node) configuration - which solved issues in the past.

@vladikr
Member

vladikr commented Apr 18, 2024

My understanding is that instancetypes are a kind of template, allowing a user to share a configuration snippet across a group of VMs. In my view, this is exactly what you are trying to do here with the CPU feature disablement.
In any case, this is an example of how one can share a central configuration with many VM/VMIs.

Not all VMIs are started from a VM that references an instance type.

@EdDev
Member

EdDev commented Apr 18, 2024

My understanding is that instancetypes are a kind of template, allowing a user to share a configuration snippet across a group of VMs. In my view, this is exactly what you are trying to do here with the CPU feature disablement.
In any case, this is an example of how one can share a central configuration with many VM/VMIs.

Not all VMIs are started from a VM that references an instance type.

I understand it is not perfect.
However, we have many features that depend on a VM and not a VMI, so it is not that terrible IMO.

Personally, if VMIs need to be treated as a must, I would prefer templating to be extended to VMIs over adding a new specific CPU-feature disabling API. But I do not think it is desired.

@lyarwood
Member

lyarwood commented Apr 19, 2024

tl;dr instance types (well preferences) are not a valid solution to this issue

Chiming in late with my $0.05 on instance types (well preferences) somehow being another way to fix this. IMHO they are not. AFAICT we would end up in the middle ground between the admin driven workflow proposed by this PR and the current purely VM owner driven workflow.

Instance types are resource containers and do not model CPU features [1]. Preferences can provide preferred CPU Features but these can be overridden within the VM definition [2]. Both only apply to the VM and have a 1:1 point in time relationship with the VM making it awkward to apply cluster-wide updates to already deployed VMs.

Introducing preferences into this mix means that an admin would need to mutate all VirtualMachineClusterPreferences to disable certain CPUFeatures and then move all existing VMs to use the new version of the preference. That in itself might move the VM from a much older generation of the resource to something totally new that itself might be undesirable. VM owners might also have their own namespaced VirtualMachinePreferences where they also need to repeat these steps.

Hopefully this brings the idea of using instance types and preferences to solve this particular use case to a close.

I've not been following this PR closely enough to provide a complete review but at a high level providing an admin driven workflow over something involving a VM owner is IMHO the way to go.

[1] https://kubevirt.io/api-reference/main/definitions.html#_v1beta1_cpuinstancetype
[2] https://kubevirt.io/api-reference/main/definitions.html#_v1beta1_cpupreferences

@lyarwood
Member

/cc
