Fix double migration during node drain #4542

Merged
merged 1 commit into kubevirt:master on Nov 14, 2020

Conversation

danielBelenky
Member

This bug originates from a race between the moment we finish a migration
and remove the PDB that protected the original launcher, and the eviction
requests that the drain client keeps issuing every 5 seconds.

To fix the issue, a check was added to the eviction webhook that prevents
it from acting on a VMI that has already been migrated to another node.

This PR also closes loose ends in the evacuation controller to prevent a
race between the time we mark the VMI's migration as completed and the
time the evacuation sync loop runs.

https://bugzilla.redhat.com/show_bug.cgi?id=1888790

Signed-off-by: Daniel Belenky dbelenky@redhat.com
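
For readers unfamiliar with the codebase, a minimal sketch of the kind of guard the description refers to is shown below. This is not the actual pod-eviction-admitter code from this PR; the helper name shouldMarkForEvacuation, its podNodeName parameter, and the import path are assumptions made for illustration.

// Hypothetical sketch only; the real check lives in
// pkg/virt-api/webhooks/validating-webhook/admitters/pod-eviction-admitter.go
// and may be structured differently. The import path may vary by KubeVirt version.
package admitters

import (
	v1 "kubevirt.io/client-go/api/v1"
)

// shouldMarkForEvacuation (hypothetical helper) returns false when evicting the
// launcher pod must not trigger another migration of the VMI.
func shouldMarkForEvacuation(vmi *v1.VirtualMachineInstance, podNodeName string) bool {
	// The VMI already reports a different node than the pod being evicted,
	// i.e. the migration has completed; a second migration would be redundant.
	if vmi.Status.NodeName != "" && vmi.Status.NodeName != podNodeName {
		return false
	}
	// The VMI is already marked for evacuation from this node; the evacuation
	// controller will (or already did) create a migration for it.
	if vmi.Status.EvacuationNodeName == podNodeName {
		return false
	}
	return true
}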

@kubevirt-bot added the release-note, dco-signoff: yes, and size/L labels on Nov 11, 2020
tests.WaitAgentConnected(virtClient, vmi)

// Mark the masters as schedulable so we can migrate there
setMastersUnschedulable(false)
Member

Why do you have to do that?

Member Author

This is a lesson from the stabilization efforts we did on the migration tests. Because we have only one master in the test environment, we first ensure that it will not be the node we drain by tainting it. Then we need to remove the taint to allow the VM to migrate there.
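
Below is a minimal sketch of what a setMastersUnschedulable-style helper could look like with a plain client-go clientset. It is not the actual helper from tests/migration_test.go; the label selector and the use of spec.unschedulable (rather than a taint, which the comment above mentions) are assumptions.

// Hypothetical sketch of a setMastersUnschedulable-style helper; the actual
// test helper may cordon via taints instead, as described above.
package tests

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

func setMastersUnschedulable(client kubernetes.Interface, unschedulable bool) error {
	// The master role label is a common convention; the exact selector used
	// by the real test suite is an assumption here.
	masters, err := client.CoreV1().Nodes().List(context.Background(), metav1.ListOptions{
		LabelSelector: "node-role.kubernetes.io/master",
	})
	if err != nil {
		return err
	}
	patch := []byte(fmt.Sprintf(`{"spec":{"unschedulable":%t}}`, unschedulable))
	for _, node := range masters.Items {
		// Cordon (true) or uncordon (false) each master with a strategic merge patch.
		if _, err := client.CoreV1().Nodes().Patch(context.Background(), node.Name,
			types.StrategicMergePatchType, patch, metav1.PatchOptions{}); err != nil {
			return err
		}
	}
	return nil
}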

Member

This might actually cause unexpected results. In some clusters I'm aware of, masters don't have /dev/kvm (e.g. nested virt is not enabled), so they can't be scheduled on even with this approach.

Member Author

@stu-gott if the master doesn't have /dev/kvm, then you need another node to execute the migration tests anyway.
But how would it cause unexpected results? The only thing done here is ensuring that we don't drain one of the masters.

Member

I assumed passing false to this function would render the master schedulable? If I'm wrong, then nevermind.

Member

I think we should stop doing real "drains" in our test suite. Repeatedly calling the eviction endpoint until the virt-launcher pod is gone and no other migration gets triggered should do. A full "drain" is, in my opinion, too error-prone on such small clusters.
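
For illustration, here is a rough sketch of driving eviction directly from a test, as suggested above. This is not code from this PR; evictLauncherPod and its parameters are placeholders, and the Evict call shown matches client-go versions that take a context argument.

// Hypothetical sketch; not code from this PR. A test would call this in a
// loop until the virt-launcher pod is gone, asserting that only one
// migration gets created along the way.
package tests

import (
	"context"

	policyv1beta1 "k8s.io/api/policy/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func evictLauncherPod(client kubernetes.Interface, namespace, podName string) error {
	eviction := &policyv1beta1.Eviction{
		ObjectMeta: metav1.ObjectMeta{Name: podName, Namespace: namespace},
	}
	// Posts to the pods/eviction subresource; the PDB and the eviction webhook
	// decide whether the request is admitted.
	return client.CoreV1().Pods(namespace).Evict(context.Background(), eviction)
}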

Member

@stu-gott left a comment

/lgtm

@kubevirt-bot added the lgtm label on Nov 11, 2020
@@ -707,6 +709,8 @@ func (d *VirtualMachineController) updateVMIStatus(vmi *v1.VirtualMachineInstanc
// the target node has seen the domain event.
vmi.Labels[v1.NodeNameLabel] = migrationHost
vmi.Status.NodeName = migrationHost
// clean the evacuation node name since have already migrated to a new node
vmi.Status.EvacuationNodeName = ""
Member

ha, good improvement.

@vladikr
Member

vladikr commented Nov 11, 2020

Ok, looks good to me.
/approve

@kubevirt-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vladikr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubevirt-bot added the approved label on Nov 11, 2020
@kubevirt-commenter-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@kubevirt-commenter-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@danielBelenky
Member Author

/retest

@stu-gott
Member

/cherry-pick release-0.34

@kubevirt-bot
Contributor

@stu-gott: once the present PR merges, I will cherry-pick it on top of release-0.34 in a new PR and assign it to you.

In response to this:

/cherry-pick release-0.34

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@stu-gott
Member

/test pull-kubevirt-e2e-k8s-1.19

@openshift-ci-robot
Collaborator

@stu-gott: No presubmit jobs available for kubevirt/kubevirt@master

In response to this:

/test pull-kubevirt-e2e-k8s-1.19

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kubevirt-commenter-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

1 similar comment
@kubevirt-commenter-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@stu-gott
Member

/override pull-kubevirt-e2e-k8s-1.17

the failed test in this lane is unrelated

@kubevirt-bot
Contributor

@stu-gott: Overrode contexts on behalf of stu-gott: pull-kubevirt-e2e-k8s-1.17

In response to this:

/override pull-kubevirt-e2e-k8s-1.17

the failed test in this lane is unrelated

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kubevirt-commenter-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@stu-gott
Member

/retest

@kubevirt-commenter-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@stu-gott
Member

/retest

@kubevirt-bot merged commit 25181c9 into kubevirt:master on Nov 14, 2020
@kubevirt-bot
Contributor

@stu-gott: #4542 failed to apply on top of branch "release-0.34":

error: Failed to merge in the changes.
Using index info to reconstruct a base tree...
M	pkg/virt-api/webhooks/validating-webhook/admitters/pod-eviction-admitter.go
M	pkg/virt-handler/vm.go
M	tests/migration_test.go
Falling back to patching base and 3-way merge...
Auto-merging tests/migration_test.go
Auto-merging pkg/virt-handler/vm.go
Auto-merging pkg/virt-api/webhooks/validating-webhook/admitters/pod-eviction-admitter.go
CONFLICT (content): Merge conflict in pkg/virt-api/webhooks/validating-webhook/admitters/pod-eviction-admitter.go
Patch failed at 0001 Fix double migration during node drain

In response to this:

/cherry-pick release-0.34

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
