[virt-controller]: consider the ParallelOutboundMigrationsPerNode when evicting VMs #8701
/cc @acardace @iholder-redhat Hi, can you please take a look?
Great work @enp0s3! Thanks very much!
Left some mostly minor comments
Just FYI, I tested a temporary workaround for this issue by setting |
Hey!
ff0496c to 6116051
@xpivarc Hey. I think I addressed everything, can you PTAL?
/approve
I have just a few suggestions to improve this.
if len(migrationCandidates) > 0 || len(nonMigrateable) > 0 {

// No free slots, need to wait till migrations will finish. Re-enqueue and return
if len(migrationCandidates) > 0 || len(nonMigrateable) > 0 {
Don't we already handle the if len(migrationCandidates) > 0 || len(nonMigrateable) > 0 case on line 372?
vmisToMigrate := vmisToMigrate(node, vmisOnNode, taint)
if len(vmisToMigrate) == 0 {
return nil
}
@xpivarc vmisToMigrate is the list of VMs marked for eviction because of their eviction strategy. migrationCandidates are the marked VMs that meet the conditions for live migration. nonMigrateable are the marked VMs that don't meet the requirements for live migration.
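To make the distinction between the three lists concrete, here is a minimal sketch of the partitioning. The VMI type and its MarkedForEviction and LiveMigratable fields are hypothetical stand-ins for illustration only; the real controller works on virtv1.VirtualMachineInstance and derives these properties differently.

```go
package main

import "fmt"

// VMI is a hypothetical, simplified stand-in for virtv1.VirtualMachineInstance.
type VMI struct {
	Name              string
	MarkedForEviction bool // assumption: true when the eviction strategy applies
	LiveMigratable    bool // assumption: true when live-migration requirements are met
}

// partition splits the VMIs that are marked for eviction into those that can
// be live migrated and those that cannot. Unmarked VMIs are ignored.
func partition(vmis []VMI) (migrationCandidates, nonMigrateable []VMI) {
	for _, vmi := range vmis {
		if !vmi.MarkedForEviction {
			continue
		}
		if vmi.LiveMigratable {
			migrationCandidates = append(migrationCandidates, vmi)
		} else {
			nonMigrateable = append(nonMigrateable, vmi)
		}
	}
	return migrationCandidates, nonMigrateable
}

func main() {
	vmis := []VMI{
		{Name: "vmi-a", MarkedForEviction: true, LiveMigratable: true},
		{Name: "vmi-b", MarkedForEviction: true, LiveMigratable: false},
		{Name: "vmi-c", MarkedForEviction: false, LiveMigratable: true},
	}
	candidates, nonMigrateable := partition(vmis)
	fmt.Println("candidates:", candidates)
	fmt.Println("nonMigrateable:", nonMigrateable)
}
```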
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: xpivarc
If the maximum number of active migrations per source node is not considered, the cluster can reach its maximum capacity of concurrent migrations while most of those migrations are pending, because they all originate from the same source node. This is problematic when multiple drains occur at the same time: evictions from different source nodes could start simultaneously, but they don't, because the migration queue is occupied by VMs from the same source node. The solution is to count the active migrations in the queue that originate from the same source node and limit the creation of new migrations to ParallelOutboundMigrationsPerNode. Signed-off-by: Igor Bezukh <ibezukh@redhat.com>
/hold
Some of the tests didn't consider the ParallelOutboundMigrationsPerNode value, whose default is 2, which is lower than the default of ParallelMigrationsPerCluster (5). Signed-off-by: enp0s3 <ibezukh@redhat.com>
/unhold
Thanks @enp0s3! Great stuff!
I have a tiny nit, feel free to save it to follow-up PRs
/lgtm
vmisOnNode []*virtv1.VirtualMachineInstance,
activeMigrations []*virtv1.VirtualMachineInstanceMigration) (activeMigrationsFromThisSourceNode int) {

vmiMap := make(map[string]bool)
nit: better to use make(map[string]struct{}), which will allocate less space
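For reference, a rough sketch of what the suggested set idiom could look like when counting active migrations from one source node. The VMI and Migration types and the namespace/name key are simplified assumptions made for this example, not the PR's actual code. The empty struct occupies zero bytes, whereas a bool value takes one byte per entry.

```go
package main

import "fmt"

// VMI and Migration are hypothetical, simplified stand-ins for the virtv1 types.
type VMI struct{ Namespace, Name string }
type Migration struct{ Namespace, VMIName string }

// countActiveMigrationsFromNode counts the active migrations whose VMI runs on
// the node in question. The set uses struct{} values, which take no space.
func countActiveMigrationsFromNode(vmisOnNode []VMI, activeMigrations []Migration) int {
	vmiSet := make(map[string]struct{}, len(vmisOnNode))
	for _, vmi := range vmisOnNode {
		vmiSet[vmi.Namespace+"/"+vmi.Name] = struct{}{}
	}

	count := 0
	for _, m := range activeMigrations {
		// The comma-ok form tests membership without needing a stored value.
		if _, ok := vmiSet[m.Namespace+"/"+m.VMIName]; ok {
			count++
		}
	}
	return count
}

func main() {
	vmis := []VMI{{"default", "vmi-a"}, {"default", "vmi-b"}}
	migrations := []Migration{{"default", "vmi-a"}, {"other", "vmi-z"}}
	fmt.Println(countActiveMigrationsFromNode(vmis, migrations))
}
```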
/retest-required
/retest-required
/cherry-pick release-0.58
@enp0s3: new pull request created: #8806
Results are here, and they show migration running at about 5.3x the previous speed (533%). As we can see on the charts, on 4.11 we were almost exclusively migrating 1-2 VMs in parallel, while on 4.12 it is 9-16 VMs, with 11 VMs in parallel most of the time. This brought our total migration time down from 2522 seconds to 473. liveMigrationConfig: |
Signed-off-by: Igor Bezukh <ibezukh@redhat.com>
What this PR does / why we need it:
If the maximum number of active migrations per source node is not considered, the cluster can reach its maximum capacity of concurrent migrations while most of those migrations are pending, because they all originate from the same source node.
This is problematic when multiple drains occur at the same time: evictions from different source nodes could start simultaneously, but they don't, because the migration queue is occupied by VMs from the same source node.
The solution is to count the active migrations in the queue that originate from the same source node and limit the creation of new migrations to ParallelOutboundMigrationsPerNode.
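The limiting step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: freeSlots and selectForMigration are hypothetical helper names, the candidates are plain strings, and the actual controller logic in this PR is structured differently.

```go
package main

import "fmt"

// freeSlots returns how many new outbound migrations may still be created for
// a source node, given the number already active from that node and the
// per-node limit (ParallelOutboundMigrationsPerNode, default 2 per this PR).
func freeSlots(activeFromNode, perNodeLimit int) int {
	if free := perNodeLimit - activeFromNode; free > 0 {
		return free
	}
	return 0
}

// selectForMigration picks at most free candidates to migrate now. The rest
// are deferred until running migrations finish and the node is re-enqueued.
func selectForMigration(candidates []string, free int) (selected, deferred []string) {
	if free < 0 {
		free = 0
	}
	if free > len(candidates) {
		free = len(candidates)
	}
	return candidates[:free], candidates[free:]
}

func main() {
	const perNodeLimit = 2 // default of ParallelOutboundMigrationsPerNode
	activeFromNode := 1    // assumption: one migration already running from this node

	candidates := []string{"vmi-1", "vmi-2", "vmi-3"}
	selected, deferred := selectForMigration(candidates, freeSlots(activeFromNode, perNodeLimit))
	fmt.Println("migrate now:", selected)
	fmt.Println("wait for a free slot:", deferred)
}
```

Capping per source node this way leaves room in the cluster-wide queue for migrations from other nodes that are being drained at the same time.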
Which issue(s) this PR fixes
https://bugzilla.redhat.com/show_bug.cgi?id=2069098
Special notes for your reviewer:
In this PR I'm not aiming to re-factor or improve the current eviction flow, but only concentrate on the fix.
Release note: