
Adding and fixing support for host-model migration in heterogeneous cluster #7672

Merged

Conversation


@Barakmor1 Barakmor1 commented May 3, 2022

What this PR does / why we need it:

This PR addresses two bugs that affect VMIs with a host-model CPU:

  1. A VMI with a host-model CPU can't migrate back to older nodes; it can only migrate to nodes that have the same or more features than the current one.
  2. A VMI with a host-model CPU can't migrate if the target node doesn't have the same host model, even if the target node supports the source node's host model and has all the required features.

For instance:

  • If a VMI starts with a host-model CPU on node01, which lacks feature A, and then migrates to node02, which has feature A, we can't migrate back to node01, because the target virt-launcher pod is scheduled with a nodeSelector for feature A even though the VMI can run on node01 without any problem (see the illustrative sketch after this description).
  • If a VMI starts with a host-model CPU on node01, whose host model is A, and tries to migrate to node02, whose host model is B but which has all the required features and supports cpuModel A, the migration fails (it should not).

Added a new field to know whether the VMI has already migrated; if it has, the old node selectors (of the initial node) are used.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):

Fixes #

Special notes for your reviewer:

I noticed that there are more bugs related to the host-model cpuModel, for instance: a VMI with host-model can't migrate if the target node doesn't have the same host model, even if the target node supports the source node's host model and has all the required features. I will publish more PRs soon.

Release note:

NONE
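
For illustration, a minimal, self-contained Go sketch of the problematic node selector described in the first bullet above. The cpu-feature label prefix matches KubeVirt's node-labeller convention; the feature name feature-A and the map contents are invented for the example.

package main

import "fmt"

func main() {
        // Hypothetical: the nodeSelector the migration target pod ends up with before this fix,
        // derived from node02's host-model-required-features labels. "feature-A" is a made-up
        // feature name used only for this sketch.
        targetPodNodeSelector := map[string]string{
                "cpu-feature.node.kubevirt.io/feature-A": "true",
        }

        // node01 never published that label, so a later migration back to node01 cannot be
        // scheduled, even though the guest never depended on feature-A.
        fmt.Println(targetPodNodeSelector)
}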

@kubevirt-bot kubevirt-bot added the release-note, dco-signoff: yes and size/L labels May 3, 2022
@Barakmor1
Member Author

/cc @vladikr @iholder-redhat @davidvossel

@Barakmor1 Barakmor1 force-pushed the allowing_migrating_back_to_old_node branch 3 times, most recently from 15dbfcd to 5b5d25d Compare May 3, 2022 14:17
@Barakmor1
Member Author

/retest-required

@Barakmor1
Member Author

/retest-required

@Barakmor1
Member Author

/retest-required

Member

@vladikr vladikr left a comment


Thanks for taking on this issue @Barakmor1!
I have a suggestion inline.

@@ -1533,19 +1533,28 @@ func (c *MigrationController) alertIfHostModelIsUnschedulable(vmi *virtv1.Virtua
	}
}

-func prepareNodeSelectorForHostCpuModel(node *k8sv1.Node, pod *k8sv1.Pod) error {
+func prepareNodeSelectorForHostCpuModel(node *k8sv1.Node, pod *k8sv1.Pod, vmi *virtv1.VirtualMachineInstance, sourcePod *k8sv1.Pod) error {
Member

I think that we can make this function look something like the following, or similar...

func prepareNodeSelectorForHostCpuModel(node *k8sv1.Node, pod *k8sv1.Pod, sourcePod *k8sv1.Pod) error {
        var hostCpuModel, cpuModelLabelKey, hostModelLabelValue string
        haveCPUNodeSelector := false

        // If the source pod already carries CPU model/feature node selectors (i.e. the VMI
        // has migrated before), simply propagate them to the target pod.
        for key, value := range sourcePod.Spec.NodeSelector {
                if strings.Contains(key, virtv1.CPUFeatureLabel) || strings.Contains(key, virtv1.CPUModelLabel) {
                        pod.Spec.NodeSelector[key] = value
                        haveCPUNodeSelector = true
                }
        }

        // Otherwise this is the first migration: translate the source node's host-model
        // labels into a node selector for the target pod.
        if !haveCPUNodeSelector {
                for key, value := range node.Labels {
                        if strings.HasPrefix(key, virtv1.HostModelCPULabel) {
                                hostCpuModel = strings.TrimPrefix(key, virtv1.HostModelCPULabel)
                                hostModelLabelValue = value
                        }

                        if strings.HasPrefix(key, virtv1.HostModelRequiredFeaturesLabel) {
                                requiredFeature := strings.TrimPrefix(key, virtv1.HostModelRequiredFeaturesLabel)
                                pod.Spec.NodeSelector[virtv1.CPUFeatureLabel+requiredFeature] = value
                        }
                }

                if hostCpuModel == "" {
                        return fmt.Errorf("node does not contain label \"%s\" with information about host cpu model", virtv1.HostModelCPULabel)
                }

                cpuModelLabelKey = virtv1.CPUModelLabel + hostCpuModel
                pod.Spec.NodeSelector[cpuModelLabelKey] = hostModelLabelValue

                log.Log.Object(pod).Infof("cpu model label selector (\"%s\") defined for migration target pod", cpuModelLabelKey)
        }
        return nil
}
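
For context, a rough sketch of how the call site in the migration controller might gate this on host-model CPUs (as noted further down in this thread, that check already happens before the function is called). node, targetPod and sourcePod are assumed to be resolved by the controller already; this is illustrative, not the actual controller code.

        // Sketch only: translate/copy CPU node selectors solely for host-model VMIs.
        if cpu := vmi.Spec.Domain.CPU; cpu != nil && cpu.Model == virtv1.CPUModeHostModel {
                if err := prepareNodeSelectorForHostCpuModel(node, targetPod, sourcePod); err != nil {
                        return err
                }
        }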

	var hostCpuModel, hostCpuModelLabelKey, hostModelLabelValue string

	for key, value := range node.Labels {
		if strings.HasPrefix(key, virtv1.HostModelCPULabel) {
			hostCpuModel = strings.TrimPrefix(key, virtv1.HostModelCPULabel)
			hostModelLabelValue = value
		}

-		if strings.HasPrefix(key, virtv1.HostModelRequiredFeaturesLabel) {
+		if vmi.Status.MigratedAtLeastOnce == false && strings.HasPrefix(key, virtv1.HostModelRequiredFeaturesLabel) {
Member

Perhaps we don't need the vmi.Status.MigratedAtLeastOnce field at all?
I think we can simply rely on the presence of the previous node selector and just copy it to the destination pod.

Member

The key here is that previously we were searching by the HostModelCPULabel prefix, but we should have been searching by the CPUModelLabel prefix.

So this

hostCpuModelLabelKey = virtv1.HostModelCPULabel + hostCpuModel
pod.Spec.NodeSelector[hostCpuModelLabelKey] = hostModelLabelValue

would become

cpuModelLabelKey = virtv1.CPUModelLabel + hostCpuModel
pod.Spec.NodeSelector[cpuModelLabelKey] = hostModelLabelValue
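
To make the difference concrete, a small runnable sketch (not code from this PR). The two label prefixes match the kubevirt.io v1 label conventions; the node labels, label values and model names are invented for the example.

package main

import "fmt"

// Label prefixes copied from the kubevirt.io v1 constants for this sketch.
const (
        hostModelCPULabel = "host-model-cpu.node.kubevirt.io/"
        cpuModelLabel     = "cpu-model.node.kubevirt.io/"
)

func main() {
        // Hypothetical target node: it can run Skylake-Client-IBRS guests, but its own
        // host model is a newer one.
        nodeLabels := map[string]string{
                hostModelCPULabel + "Cascadelake-Server": "true",
                cpuModelLabel + "Skylake-Client-IBRS":    "true",
                cpuModelLabel + "Cascadelake-Server":     "true",
        }

        hostCpuModel := "Skylake-Client-IBRS" // host model detected on the source node

        oldKey := hostModelCPULabel + hostCpuModel // selector key built before the change
        newKey := cpuModelLabel + hostCpuModel     // selector key built after the change

        _, oldMatches := nodeLabels[oldKey]
        _, newMatches := nodeLabels[newKey]
        fmt.Println("old key matches target node:", oldMatches) // false: host models differ
        fmt.Println("new key matches target node:", newMatches) // true: the model is supported
}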

Member Author

@Barakmor1 Barakmor1 May 6, 2022

Hey Vladik,
Thanks for reviewing.
Your idea sounds good, but what if the initial node doesn't have any required-feature labels? In that case, if we don't use vmi.Status.MigratedAtLeastOnce, we'll be coupled to the target node's required features.
On second thought, we could instead search for a CPUModelLabel selector rather than a CPUFeatureLabel one, since it is always present on virt-launcher after the first migration. I'll update the PR soon.

Regarding the cpuModelLabelKey: I have a similar branch that solves that issue. I'm waiting for this PR to be merged before publishing it, because it reuses this PR's new test Context to test those modifications, and I prefer to keep this PR small and cohesive so it is easier to review in the future.
Just to be clear, the next PR solves this issue: a VMI with host-model can't migrate if the target node doesn't have the same host model, even if the target node supports the source node's host model and has all the required features.
It will be published immediately after this one is merged.

Member

Your idea sounds good, but what if the initial node doesn't have any required-feature labels? In that case, if we don't use vmi.Status.MigratedAtLeastOnce, we'll be coupled to the target node's required features.

If the problem we're trying to solve here involves knowledge of the first node a VMI runs on, can we store all that information in the VMI status after the VMI starts so we don't have to keep track of whether migrations have already occurred or not?

Member

Hey Vladik, Thanks for reviewing. Your idea sounds good, but what if the initial node doesn't have any required-feature labels? In that case, if we don't use vmi.Status.MigratedAtLeastOnce, we'll be coupled to the target node's required features. On second thought, we could instead search for a CPUModelLabel selector rather than a CPUFeatureLabel one, since it is always present on virt-launcher after the first migration. I'll update the PR soon.

The function I've posted above searches for either CPUFeatureLabel or CPUModelLabel; if one of these labels exists in the source pod's nodeSelector, we propagate it to the destination pod's nodeSelector.
Otherwise, we create a new nodeSelector for the destination pod based on the translation of the host-model CPU, as this is probably the first time we are migrating this VMI.

        for key, value := range sourcePod.Spec.NodeSelector {
                if strings.Contains(key, virtv1.CPUFeatureLabel) || strings.Contains(key, virtv1.CPUModelLabel) {
                        pod.Spec.NodeSelector[key] = value
                        haveCPUNodeSelector = true
                }
        }
        if !haveCPUNodeSelector {

Regarding the cpuModelLabelKey: I have a similar branch that solves that issue. I'm waiting for this PR to be merged before publishing it, because it reuses this PR's new test Context to test those modifications, and I prefer to keep this PR small and cohesive so it is easier to review in the future. Just to be clear, the next PR solves this issue: a VMI with host-model can't migrate if the target node doesn't have the same host model, even if the target node supports the source node's host model and has all the required features. It will be published immediately after this one is merged.

In my view, these two changes should happen in the same PR.
Previously, we were generating the nodeSelector using the HostModelCPULabel field. Without changing the cpuModelLabelKey, I don't see how we can test this PR.

Does it make sense?

Member

@davidvossel we probably could, but perhaps we can get away without updating the vmi status.
On the first migration, the translated CPU model will be stored in the destination pod nodeSelector.
Any following migration could simply reuse this nodeSelector.

My concern with storing this information in the VMI status is that it won't be backward compatible.

Member Author

The function I've posted above searches for either CPUFeatureLabel or CPUModelLabel; if one of these labels exists in the source pod's nodeSelector, we propagate it to the destination pod's nodeSelector.
Otherwise, we create a new nodeSelector for the destination pod based on the translation of the host-model CPU, as this is probably the first time we are migrating this VMI.

Right, I missed the OR operator (||), sorry about that.

Member Author

@Barakmor1 Barakmor1 May 8, 2022

@vladikr I updated the PR so it no longer adds the new field to vmi.status, and updated this PR's description accordingly.

In my view, these two changes should happen in the same PR.
Previously, we were generating the nodeSelector using the HostModelCPULabel field. Without changing the cpuModelLabelKey, I don't see how we can test this PR.

I also added the modification from my second branch to this PR, including a test that disables the node labeller on the target node and simulates a target node with a fake host model that supports the source node's host model.
(If it isn't possible to simulate this in the cluster, the test is skipped, just like the first test.)

Regarding the changes to hostCpuModelLabelKey:
This modification is necessary, but we also need to advertise a node's host-model cpuModel as one of the node's supported cpuModels (using CPUModelLabel + <host-model> : true; see the small sketch after this comment). I included this modification as part of the fourth commit of this PR.
The new functional test in the third commit of this PR doesn't pass without both modifications.

Honestly, I think that keeping a PR as cohesive as possible is best practice.
From my point of view, we should deal with one bug per PR and add a test for that bug to make sure it doesn't happen again;
this way it is easier to follow and understand in future reviews.
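
A minimal sketch of the labelling rule mentioned above (not the actual node-labeller code): besides the host-model label, the node also advertises its host model as a supported cpuModel, so selectors built from CPUModelLabel can match it. The label prefixes follow the kubevirt.io v1 label conventions; the helper name and signature are made up for illustration.

// addHostModelAsSupportedModel is a hypothetical helper illustrating the idea of the
// fourth commit: publish the node's host model under the CPUModelLabel prefix as well.
func addHostModelAsSupportedModel(nodeLabels map[string]string, hostModel string) {
        nodeLabels["host-model-cpu.node.kubevirt.io/"+hostModel] = "true" // existing host-model label
        nodeLabels["cpu-model.node.kubevirt.io/"+hostModel] = "true"      // new: host model also advertised as supported
}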

@Barakmor1 Barakmor1 force-pushed the allowing_migrating_back_to_old_node branch 2 times, most recently from 5960cbd to 797dd92 Compare May 8, 2022 11:50
@kubevirt-bot kubevirt-bot added the do-not-merge/release-note-label-needed and release-note-none labels and removed the release-note and do-not-merge/release-note-label-needed labels May 8, 2022
@Barakmor1 Barakmor1 force-pushed the allowing_migrating_back_to_old_node branch 2 times, most recently from 180f5b8 to 4ebc7f2 Compare May 8, 2022 14:13
@Barakmor1
Member Author

/retest-required

@Barakmor1 Barakmor1 force-pushed the allowing_migrating_back_to_old_node branch 3 times, most recently from dc4b4a9 to 65c5cf6 Compare May 9, 2022 06:36
@Barakmor1
Member Author

/retest-required

Member

@vladikr vladikr left a comment

Looks great. I think it's almost there, from my side.
@jean-edouard can you please take a look as well?

-func prepareNodeSelectorForHostCpuModel(node *k8sv1.Node, pod *k8sv1.Pod) error {
-	var hostCpuModel, hostCpuModelLabelKey, hostModelLabelValue string
+func prepareNodeSelectorForHostCpuModel(node *k8sv1.Node, pod *k8sv1.Pod, sourcePod *k8sv1.Pod) error {
+	var hostCpuModel, nodeSelectorKeyForHostModel, hostModelLabelValue string
Member

I think something like the following is needed here.

if cpu := vmi.Spec.Domain.CPU; cpu == nil || cpu.Model != virtv1.CPUModeHostModel {
	return
}

Member Author

@Barakmor1 Barakmor1 May 11, 2022

We already check this before we call the function.

Member

Ah, perfect, thanks, I missed it.

@kubevirt-bot
Contributor

@Barakmor1: new pull request created: #7853

In response to this:

/cherrypick release-0.53

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Barakmor1
Member Author

/cherrypick release-0.52

@kubevirt-bot
Contributor

@Barakmor1: #7672 failed to apply on top of branch "release-0.52":

Applying: Add functional test to prove that migrating back to source node from newer node is not allowed when using host-model which is a sneaky bug.
Using index info to reconstruct a base tree...
M	tests/migration_test.go
M	tests/utils.go
Falling back to patching base and 3-way merge...
Auto-merging tests/utils.go
CONFLICT (content): Merge conflict in tests/utils.go
Auto-merging tests/migration_test.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Add functional test to prove that migrating back to source node from newer node is not allowed when using host-model which is a sneaky bug.
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherrypick release-0.52

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Barakmor1 pushed a commit to Barakmor1/kubevirt that referenced this pull request Jun 6, 2022
… features,

Allow vmi with host-model to migrate when target node does't have the same host-model cpuModel.
Manual back port kubevirt#7672,kubevirt#7770 from main

Signed-off-by: bmordeha <bmodeha@redhat.com>
Labels
approved, dco-signoff: yes, lgtm, release-note-none, size/L