
[virt-controller]: consider the ParallelOutboundMigrationsPerNode when evicting VMs #8701

Merged

2 commits merged into kubevirt:main on Nov 17, 2022

Conversation

Contributor

@enp0s3 enp0s3 commented Oct 31, 2022

Signed-off-by: Igor Bezukh ibezukh@redhat.com

What this PR does / why we need it:
If the per-source-node limit on active migrations is not taken into account, we can reach the cluster-wide cap on concurrent migrations while most of those migrations are pending, because they all originate from the same source node.

This is problematic when multiple node drains occur at the same time: evictions from different source nodes should be able to start simultaneously, but they cannot, because the migration queue is already occupied by VMs from a single source node.

The solution is to observe the migration queue for active migrations from the same source node and limit the creation of new migrations to ParallelOutboundMigrationsPerNode.
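The gist of the fix can be sketched in Go. This is a simplified model, not the merged KubeVirt code: the `Migration` struct and the helper name `remainingSlotsForNode` are illustrative stand-ins for the real `VirtualMachineInstanceMigration` objects and controller logic.

```go
package main

import "fmt"

// Migration is a simplified stand-in for an active (not yet completed)
// VirtualMachineInstanceMigration, which occupies one migration slot.
type Migration struct {
	SourceNode string
}

// remainingSlotsForNode returns how many new migrations may be created for
// VMs evicted from the given node, honoring both the cluster-wide cap and
// the per-source-node cap (ParallelOutboundMigrationsPerNode).
func remainingSlotsForNode(node string, active []Migration, perCluster, perNode int) int {
	fromNode := 0
	for _, m := range active {
		if m.SourceNode == node {
			fromNode++
		}
	}
	free := perCluster - len(active)
	if nodeFree := perNode - fromNode; nodeFree < free {
		free = nodeFree
	}
	if free < 0 {
		free = 0
	}
	return free
}

func main() {
	active := []Migration{{"node-a"}, {"node-a"}, {"node-b"}}
	// Defaults: 5 per cluster, 2 outbound per node.
	fmt.Println(remainingSlotsForNode("node-a", active, 5, 2)) // node-a already at its per-node cap
	fmt.Println(remainingSlotsForNode("node-c", active, 5, 2)) // limited only by free cluster slots
}
```

With the defaults, node-a (already running 2 outbound migrations) gets 0 new slots even though 2 cluster slots remain free, leaving those slots available to other draining nodes.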

Which issue(s) this PR fixes
https://bugzilla.redhat.com/show_bug.cgi?id=2069098

Special notes for your reviewer:
In this PR I'm not aiming to refactor or improve the current eviction flow, only to concentrate on the fix.

Release note:

Consider the ParallelOutboundMigrationsPerNode when evicting VMs

@kubevirt-bot kubevirt-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. labels Oct 31, 2022
@enp0s3
Contributor Author

enp0s3 commented Nov 2, 2022

/cc @acardace @iholder-redhat Hi, can you please take a look?

Contributor

@iholder101 iholder101 left a comment

Great work @enp0s3! Thanks very much!
Left some mostly minor comments

@bbenshab

bbenshab commented Nov 7, 2022

Just FYI, I tested a temporary workaround for this issue by setting
MaxParallelMigrationsPerCluster == MaxParallelMigrationsPerOutboundNode
I hoped that this would make us drain all the VMs from one node at a time, so we could still hit the maximum number of parallel VM migrations; unfortunately, it didn't work.
So I'm really looking forward to testing this fix.

@iholder101
Contributor

Just FYI, I tested a temporary workaround for this issue by setting MaxParallelMigrationsPerCluster == MaxParallelMigrationsPerOutboundNode. I hoped that this would make us drain all the VMs from one node at a time, so we could still hit the maximum number of parallel VM migrations; unfortunately, it didn't work. So I'm really looking forward to testing this fix.

Hey!
This makes sense. With the current behavior, the first node to start evacuating creates MaxParallelMigrationsPerCluster vmim objects, so it is the only one migrating (at least at the beginning). I believe @enp0s3's changes address that problem 🤞
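The scenario above can be illustrated with a toy simulation (the `allocate` helper is hypothetical, not KubeVirt code): without a per-node cap the first draining node claims every cluster slot, while the per-node cap spreads slots across nodes.

```go
package main

import "fmt"

// allocate hands out migration slots to draining nodes in order, capping the
// total at perCluster and each node at perNode. perNode <= 0 models the old
// behavior, with no per-source-node limit applied at creation time.
func allocate(nodes []string, perCluster, perNode int) map[string]int {
	slots := map[string]int{}
	used := 0
	for _, n := range nodes {
		for used < perCluster && (perNode <= 0 || slots[n] < perNode) {
			slots[n]++
			used++
		}
	}
	return slots
}

func main() {
	nodes := []string{"node-a", "node-b", "node-c"}
	fmt.Println(allocate(nodes, 5, 0)) // old: first node takes all 5 slots
	fmt.Println(allocate(nodes, 5, 2)) // new: slots spread across the draining nodes
}
```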

@enp0s3 enp0s3 force-pushed the max-mig branch 2 times, most recently from ff0496c to 6116051 on November 7, 2022 14:47
@iholder101
Contributor

Thanks a lot @enp0s3! Great job!
/lgtm

@xpivarc PTAL

@kubevirt-bot kubevirt-bot added lgtm Indicates that a PR is ready to be merged. and removed lgtm Indicates that a PR is ready to be merged. labels Nov 7, 2022
@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Nov 7, 2022
@enp0s3 enp0s3 marked this pull request as draft November 13, 2022 15:36
@kubevirt-bot kubevirt-bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 13, 2022
@kubevirt-bot kubevirt-bot removed the lgtm Indicates that a PR is ready to be merged. label Nov 14, 2022
@enp0s3
Contributor Author

enp0s3 commented Nov 14, 2022

@xpivarc Hey. I think I addressed everything, can you PTAL?

Member

@xpivarc xpivarc left a comment

/approve
I have just a few suggestions to improve this.

// No free slots, need to wait till migrations will finish. Re-enqueue and return
if len(migrationCandidates) > 0 || len(nonMigrateable) > 0 {
Member

Don't we handle the if len(migrationCandidates) > 0 || len(nonMigrateable) > 0 on line 372?

vmisToMigrate := vmisToMigrate(node, vmisOnNode, taint)
	if len(vmisToMigrate) == 0 {
		return nil
	}

Contributor Author

@xpivarc vmisToMigrate is the list of VMIs marked for eviction because of their eviction strategy. migrationCandidates are the VMIs that are marked and meet the conditions for live migration. nonMigrateable are the VMIs that are marked but do not meet the requirements for live migration.
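The distinction can be sketched like this (a simplified model: `LiveMigratable` in the real code is a VMI condition, and the type and helper names below are illustrative, not the actual KubeVirt signatures):

```go
package main

import "fmt"

// VMI is a minimal stand-in for a VirtualMachineInstance that has been
// selected for eviction by its eviction strategy.
type VMI struct {
	Name           string
	LiveMigratable bool
}

// splitEvictionCandidates partitions the VMIs marked for eviction into those
// that can be live-migrated (migrationCandidates) and those that cannot
// (nonMigrateable).
func splitEvictionCandidates(vmisToMigrate []VMI) (migrationCandidates, nonMigrateable []VMI) {
	for _, vmi := range vmisToMigrate {
		if vmi.LiveMigratable {
			migrationCandidates = append(migrationCandidates, vmi)
		} else {
			nonMigrateable = append(nonMigrateable, vmi)
		}
	}
	return
}

func main() {
	marked := []VMI{{"vm-a", true}, {"vm-b", false}, {"vm-c", true}}
	can, cannot := splitEvictionCandidates(marked)
	fmt.Println(len(can), len(cannot)) // 2 1
}
```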

pkg/virt-controller/watch/drain/evacuation/evacuation.go (outdated; resolved)
@kubevirt-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: xpivarc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubevirt-bot kubevirt-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 15, 2022
@enp0s3 enp0s3 marked this pull request as ready for review November 16, 2022 10:33
@kubevirt-bot kubevirt-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 16, 2022
If the maximum number of active migrations per source node is not
considered, we can reach the maximum capacity of concurrent
migrations in a cluster while most of the migrations are pending,
because all of them originate from the same source node.

This is problematic when multiple drains occur at the same time:
evictions from different source nodes could start simultaneously,
but they cannot, because the migration queue is occupied by VMs
from the same source node.

The solution is to observe the migration queue for active migrations
from the same source node and limit the creation of new migrations
to ParallelOutboundMigrationsPerNode.

Signed-off-by: Igor Bezukh <ibezukh@redhat.com>
@enp0s3
Contributor Author

enp0s3 commented Nov 16, 2022

/hold
looking at unit tests

@kubevirt-bot kubevirt-bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 16, 2022
Some of the tests didn't consider the ParallelOutboundMigrationsPerNode
value, whose default (2) is lower than the default of
ParallelMigrationsPerCluster (5)

Signed-off-by: enp0s3 <ibezukh@redhat.com>
@enp0s3
Contributor Author

enp0s3 commented Nov 16, 2022

/unhold
Addressed unit test issues

@kubevirt-bot kubevirt-bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 16, 2022
Contributor

@iholder101 iholder101 left a comment

Thanks @enp0s3! Great stuff!
I have a tiny nit; feel free to leave it to a follow-up PR
/lgtm

vmisOnNode []*virtv1.VirtualMachineInstance,
activeMigrations []*virtv1.VirtualMachineInstanceMigration) (activeMigrationsFromThisSourceNode int) {

vmiMap := make(map[string]bool)
Contributor

nit: better to use make(map[string]struct{}) which will allocate less space
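The suggested set idiom, in the context of the counting helper this review comment is attached to (a simplified sketch with illustrative types, not the merged code): `struct{}` occupies zero bytes, so a `map[string]struct{}` set allocates less per entry than `map[string]bool` while expressing membership only.

```go
package main

import "fmt"

// VMI and Migration are simplified stand-ins for the KubeVirt types.
type VMI struct {
	Name string
}

type Migration struct {
	VMIName string
}

// countActiveOutbound counts the active migrations whose VMI runs on this
// source node. The map is used purely as a set, so the zero-byte empty
// struct is a better value type than bool.
func countActiveOutbound(vmisOnNode []VMI, activeMigrations []Migration) int {
	vmiSet := make(map[string]struct{}, len(vmisOnNode))
	for _, vmi := range vmisOnNode {
		vmiSet[vmi.Name] = struct{}{}
	}
	count := 0
	for _, m := range activeMigrations {
		if _, onNode := vmiSet[m.VMIName]; onNode {
			count++
		}
	}
	return count
}

func main() {
	vmis := []VMI{{"vm-a"}, {"vm-b"}}
	migs := []Migration{{"vm-a"}, {"vm-c"}, {"vm-b"}}
	fmt.Println(countActiveOutbound(vmis, migs)) // 2
}
```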

@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Nov 16, 2022
@enp0s3
Contributor Author

enp0s3 commented Nov 16, 2022

/retest-required

@kubevirt-bot
Contributor

kubevirt-bot commented Nov 16, 2022

@enp0s3: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-kubevirt-e2e-k8s-1.22-sig-storage | 6116051 | link | true | /test pull-kubevirt-e2e-k8s-1.22-sig-storage |
| pull-kubevirt-e2e-k8s-1.24-sig-storage-nonroot | 6116051 | link | true | /test pull-kubevirt-e2e-k8s-1.24-sig-storage-nonroot |

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@acardace
Member

/retest-required

@kubevirt-bot kubevirt-bot merged commit ad0a6e1 into kubevirt:main Nov 17, 2022
@enp0s3
Contributor Author

enp0s3 commented Nov 17, 2022

/cherry-pick release-0.58

@kubevirt-bot
Contributor

@enp0s3: new pull request created: #8806

In response to this:

/cherry-pick release-0.58


@bbenshab

bbenshab commented Dec 8, 2022

Results are here, and they show 533% faster migration. As the charts show, on 4.11 we were almost exclusively migrating 1-2 VMs in parallel, while on 4.12 it is 9-16 VMs (11 VMs in parallel most of the time), bringing our total migration time down from 2522 seconds to 473.
Tested on a 30-node cluster with 1000 VMs and 500 VM migrations.

liveMigrationConfig:
  parallelMigrationsPerCluster: 20
  parallelOutboundMigrationsPerNode: 4

[chart: OpenShift Virtualization 4.12.0-755 (nightly) - parallel migration results]
[chart: OpenShift Virtualization 4.11.1 - parallel migration results]

@iholder101
Contributor

Results are here, and they show 533% faster migration

Awesome job @enp0s3 and @bbenshab!
