Fix double migration during node drain #4542
Conversation
This bug originates from a race between the moment we finish a migration and remove the PDB that used to protect the original launcher, and the eviction requests that the client keeps issuing every 5 seconds. To fix the issue, a check was added to the eviction webhook that prevents it from operating on a VMI that was already migrated to another node.

This PR also ties up loose ends in the evacuation controller to prevent a race between the time we mark the VMI's migration as completed and the time the evacuation sync loop runs.

https://bugzilla.redhat.com/show_bug.cgi?id=1888790

Signed-off-by: Daniel Belenky <dbelenky@redhat.com>
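To make the webhook change concrete, here is a minimal, self-contained sketch of the kind of guard the description refers to: if the VMI's status shows it has already left the node the eviction targets, or an evacuation from that node is already recorded, the webhook should treat the eviction as satisfied instead of kicking off another migration. The trimmed-down status struct and the helper name below are made up for illustration and are not KubeVirt's actual code.

```go
package main

import "fmt"

// vmiStatus is an illustrative stand-in for the VMI status fields involved;
// the real type lives in KubeVirt's API package.
type vmiStatus struct {
	NodeName           string // node the VMI currently runs on
	EvacuationNodeName string // node the VMI is being evacuated from, if any
}

// shouldTriggerEvacuation sketches the webhook guard: an eviction aimed at a
// node the VMI has already left (or is already evacuating from) must not
// start another migration.
func shouldTriggerEvacuation(status vmiStatus, drainedNode string) bool {
	if status.NodeName != drainedNode {
		// Already migrated away from the drained node; the eviction
		// retries arriving every 5 seconds are stale.
		return false
	}
	if status.EvacuationNodeName == drainedNode {
		// An evacuation from this node is already in flight.
		return false
	}
	return true
}

func main() {
	migrated := vmiStatus{NodeName: "node-b"}
	fmt.Println(shouldTriggerEvacuation(migrated, "node-a")) // false: nothing left to do
}
```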
tests.WaitAgentConnected(virtClient, vmi)

// Mark the masters as schedulable so we can migrate there
setMastersUnschedulable(false)
Why do you have to do that?
This is a lesson from stabilization efforts that we were doing on the migration tests. Because we have only 1 master in the test environment, we first ensure that it will not be the node that we drain by tainting it. Then we need to remove the taint to allow the VM to migrate there.
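To make the taint-then-untaint dance concrete, here is a minimal sketch under the assumption that a helper like setMastersUnschedulable simply adds or removes a NoSchedule taint on nodes carrying the master role label via client-go. The taint key, the extra client parameter, and the client-go plumbing are all illustrative; the actual test helper may work differently.

```go
package tests

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// mastersNoScheduleTaint is an illustrative taint key; the real helper may
// use a different key or mechanism entirely.
var mastersNoScheduleTaint = corev1.Taint{
	Key:    "tests.kubevirt.io/drain-protection",
	Effect: corev1.TaintEffectNoSchedule,
}

// setMastersUnschedulable(client, true) taints every master so the drain test
// never picks the single master as the node to drain; calling it with false
// later removes the taint so the VMI is free to migrate there.
func setMastersUnschedulable(client kubernetes.Interface, unschedulable bool) error {
	masters, err := client.CoreV1().Nodes().List(context.Background(), metav1.ListOptions{
		LabelSelector: "node-role.kubernetes.io/master",
	})
	if err != nil {
		return err
	}
	for i := range masters.Items {
		node := masters.Items[i].DeepCopy()
		// Drop any previous copy of our taint, then re-add it if requested.
		var taints []corev1.Taint
		for _, t := range node.Spec.Taints {
			if t.Key != mastersNoScheduleTaint.Key {
				taints = append(taints, t)
			}
		}
		if unschedulable {
			taints = append(taints, mastersNoScheduleTaint)
		}
		node.Spec.Taints = taints
		if _, err := client.CoreV1().Nodes().Update(context.Background(), node, metav1.UpdateOptions{}); err != nil {
			return err
		}
	}
	return nil
}
```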
This might actually cause unexpected results. In some clusters I'm aware of, masters don't have /dev/kvm (e.g. nested virt is not enabled), so they can't be scheduled on even with this approach.
@stu-gott if the master doesn't have /dev/kvm, then you need another node to execute the migration tests anyway.
But how would it cause unexpected results? The only thing that is done here is ensuring that we don't drain one of the masters.
I assumed passing false to this function would render the master schedulable? If I'm wrong, then nevermind.
I think we should stop doing real "drains" in our test suite. Repeatedly calling the eviction endpoint until the virt-launcher pod is gone and no other migration gets triggered should do. A full "drain" is, in my opinion, too error-prone on such small clusters.
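As an illustration of that suggestion, the loop below keeps issuing an eviction for the virt-launcher pod until the pod is gone, instead of running a full drain; the test can then assert that only one migration was created. evictLauncherPod and launcherPodGone are hypothetical placeholders for the real eviction and pod-lookup calls; the shape of the loop, not the exact API, is the point.

```go
package tests

import (
	"fmt"
	"time"
)

// evictUntilGone keeps calling the eviction endpoint for a virt-launcher pod
// until the pod disappears or a timeout expires. Evictions may legitimately
// be rejected while a migration PDB still protects the pod, so a failed
// attempt is simply retried on the next tick.
func evictUntilGone(evictLauncherPod func() error, launcherPodGone func() (bool, error)) error {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	timeout := time.After(5 * time.Minute)
	for {
		_ = evictLauncherPod() // rejection is expected while the PDB holds
		gone, err := launcherPodGone()
		if err != nil {
			return err
		}
		if gone {
			return nil // caller can now assert only one migration was created
		}
		select {
		case <-timeout:
			return fmt.Errorf("virt-launcher pod was not evicted in time")
		case <-ticker.C:
		}
	}
}
```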
/lgtm
@@ -707,6 +709,8 @@ func (d *VirtualMachineController) updateVMIStatus(vmi *v1.VirtualMachineInstanc
	// the target node has seen the domain event.
	vmi.Labels[v1.NodeNameLabel] = migrationHost
	vmi.Status.NodeName = migrationHost
	// clean the evacuation node name since we have already migrated to a new node
	vmi.Status.EvacuationNodeName = ""
ha, good improvement.
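For context on why this one-line change helps, here is a hypothetical illustration (not KubeVirt's actual controller code) of how an evacuation-style sync loop might select VMIs to migrate. Once the handler above clears EvacuationNodeName right after the migration completes, the freshly migrated VMI drops out of that selection, so the loop cannot create a second migration for it.

```go
package tests

// vmiView is an illustrative, trimmed-down view of the status fields the
// evacuation logic cares about; it is not KubeVirt's actual type.
type vmiView struct {
	Name               string
	NodeName           string
	EvacuationNodeName string
}

// vmisNeedingEvacuation returns the VMIs still marked as evacuating from the
// drained node. A VMI whose EvacuationNodeName was cleared after its
// migration completed is skipped, which is the race the PR closes.
func vmisNeedingEvacuation(vmis []vmiView, drainedNode string) []vmiView {
	var out []vmiView
	for _, v := range vmis {
		if v.EvacuationNodeName == drainedNode {
			out = append(out, v)
		}
	}
	return out
}
```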
Ok, looks good to me.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: vladikr. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/retest

/retest

/retest

/cherry-pick release-0.34
@stu-gott: once the present PR merges, I will cherry-pick it on top of release-0.34 in a new PR and assign it to you.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/test pull-kubevirt-e2e-k8s-1.19
@stu-gott: No presubmit jobs available for kubevirt/kubevirt@master
/retest

/retest
/override pull-kubevirt-e2e-k8s-1.17 the failed test in this lane is unrelated
@stu-gott: Overrode contexts on behalf of stu-gott: pull-kubevirt-e2e-k8s-1.17
/retest

/retest

/retest

/retest
@stu-gott: #4542 failed to apply on top of branch "release-0.34".