
SR-IOV Migration: Move attach SRIOV devices to virt-handler #6581

Merged
merged 4 commits into from Jan 24, 2022

Conversation

ormergi
Contributor

@ormergi ormergi commented Oct 12, 2021

What this PR does / why we need it:

Currently, when an SR-IOV VM is migrated, we detach its SR-IOV network devices just before the migration
starts and attach equivalent devices to the VM on the target once the migration finishes successfully.
Attaching the SR-IOV devices is a one-shot operation, performed only once at post-migration.

Because it is a one-shot operation, the VM may end up in an incomplete state (missing SR-IOV devices)
when an SR-IOV device is disconnected manually or a migration is aborted.
The current implementation also doesn't follow the Kubernetes desired-state design that is followed all over the project.

With this PR, the virt-handler VMController is now aware of the SR-IOV network devices' state and reconciles
them, which means they will be attached to the VM whenever needed.
Since attaching an SR-IOV host-device is an intrusive operation, it is rate-limited and stops after a while.

Additionally, when a migration is aborted (due to a failure or a client request), the SR-IOV devices that were detached will be re-attached to the source VM, and if SR-IOV devices are disconnected from the guest they will be attached
again.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):

Special notes for your reviewer:

Release note:

SRIOV network interfaces are now hot-plugged when disconnected manually or due to aborted migrations.

@kubevirt-bot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@kubevirt-bot kubevirt-bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. size/S labels Oct 12, 2021
@ormergi
Contributor Author

ormergi commented Oct 12, 2021

/uncc @enp0s3 @vatsalparekh

@kubevirt-bot kubevirt-bot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Oct 12, 2021
@ormergi ormergi changed the title sriov migration: Move reattach SRIOV devices logic to virt-launcher SR-IOV Migration: Attach SRIOV devices as part of virt-handler reconcile loop Nov 1, 2021
@kubevirt-bot kubevirt-bot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XL and removed size/M labels Nov 1, 2021
@kubevirt-bot kubevirt-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 16, 2021
@ormergi
Contributor Author

ormergi commented Nov 16, 2021

/test pull-kubevirt-e2e-kind-1.19-sriov pull-kubevirt-unit-test

@ormergi ormergi changed the title SR-IOV Migration: Attach SRIOV devices as part of virt-handler reconcile loop SR-IOV Migration: Move attach SRIOV devices to virt-handler Nov 16, 2021
Member

@EdDev EdDev left a comment


I reviewed the first commit; aside from the inline comments, it would be nice if you could extract the addition of the new command into a separate commit. The next commit would then just use it.
(This should help keep the review focused.)

pkg/virt-handler/vm.go (outdated review thread, resolved)
pkg/virt-handler/vm.go (outdated review thread, resolved)
@ormergi
Contributor Author

ormergi commented Nov 22, 2021

/test pull-kubevirt-unit-test pull-kubevirt-e2e-kind-1.19-sriov

@ormergi
Contributor Author

ormergi commented Nov 22, 2021

Rebased

/test pull-kubevirt-unit-test pull-kubevirt-e2e-kind-1.19-sriov

@@ -2560,6 +2565,13 @@ func (d *VirtualMachineController) hotplugSriovInterfaces(vmi *v1.VirtualMachine
return nil
}

rateLimitedExecutor := d.sriovHotplugExecutorPool.LoadOrStore(vmi.UID)
Member

Why don't you use vendor/k8s.io/client-go/util/workqueue/rate_limiting_queue.go, the same as the controller?
workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "sriov")

The workqueue.DefaultControllerRateLimiter is an ItemExponentialFailureRateLimiter.

Contributor Author

@ormergi ormergi Jan 9, 2022


I reviewed its implementation and tried to change the code to use it, but I reached a point where it's basically a controller that requires another goroutine.
In more detail, using the workqueue in this case involves dequeuing an element, performing the hotplug, and, depending on the result, adding the element back to the queue through the rate-limiter.
Dequeuing an element is done with Get(), which is a blocking func (until it is able to dequeue an element).
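
For context, here is a minimal, self-contained sketch of what the workqueue-based alternative discussed above could look like (illustrative only, not code from this PR; runSriovHotplugWorker and the hotplug callback are hypothetical names). It shows why a dedicated, blocking worker goroutine would be needed:

package sriovhotplug // hypothetical package, for illustration only

import (
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/util/workqueue"
)

// runSriovHotplugWorker drains a rate-limited workqueue. queue.Get() blocks
// until an item is available, which is why this cannot run inline in the
// reconcile loop and needs its own goroutine.
func runSriovHotplugWorker(queue workqueue.RateLimitingInterface, hotplug func(uid types.UID) error) {
	for {
		item, shutdown := queue.Get()
		if shutdown {
			return
		}
		uid := item.(types.UID)
		if err := hotplug(uid); err != nil {
			// Re-queue with exponential backoff on failure.
			queue.AddRateLimited(uid)
		} else {
			// Reset the per-item failure count on success.
			queue.Forget(uid)
		}
		queue.Done(item)
	}
}

// Producer side, e.g. from the reconcile loop:
//
//	queue := workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "sriov")
//	go runSriovHotplugWorker(queue, doHotplug)
//	queue.Add(vmi.UID)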

return newLimitedBackoffWithClock(l.baseBackoff.backoff, l.baseBackoff.limit, l.baseBackoff.clock)
}

func NewExponentialLimitedBackoffCreator() LimitedBackoffCreator {
Member

I don't understand why this creator is needed. It creates an instance identical to LimitedBackoffCreator.baseBackoff. What am I missing?

Contributor Author

As part of LimitedBackoff creation, maxStepTime (time.Now() + limit duration) is set relative to the time the instance is created.
If we didn't do this, by the time the RateLimitedExecutor tries to Exec(..) it could fail because the limit time has already passed, in case the first call happens after the limit time.

Member

In other words, there is a need to clone baseBackoff each time, but at the same time to stamp it with the time of its instantiation.
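
A minimal, self-contained sketch of that idea (type and field names here are illustrative, not necessarily the PR's actual ones):

package backoff // hypothetical package, for illustration only

import "time"

// limitedBackoff caps retries at an absolute end time that is stamped when
// the instance is created.
type limitedBackoff struct {
	step    time.Duration
	limit   time.Duration
	endTime time.Time // time.Now() + limit, set at creation time
}

// Ready reports whether another retry is still allowed at the given time.
func (b limitedBackoff) Ready(now time.Time) bool {
	return now.Before(b.endTime)
}

// limitedBackoffCreator clones the base configuration and stamps each new
// instance with its own creation time; this is why a creator is needed
// instead of reusing a single, pre-built backoff.
type limitedBackoffCreator struct {
	base limitedBackoff
}

func (c limitedBackoffCreator) New() limitedBackoff {
	b := c.base
	b.endTime = time.Now().Add(c.base.limit)
	return b
}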

testsClock = clock.NewFakeClock(time.Time{})
backoff = ratelimitcmd.NewExponentialLimitedBackoffWithClock(ratelimitcmd.DefaultMaxStep, testsClock)

testsClock.Step(time.Nanosecond)
Member

Why should this be called before we start?

Contributor Author

The test clock (fake clock) Now() returns the exact same timestamp as backoff.stepEnd, since both get the same fake clock.
That causes backoff.Ready to return false, which is expected, so we need to bump the test clock a bit before starting.
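
A standalone illustration of that timing detail (generic fake-clock usage only, not the PR's test code; the readiness check here is a stand-in for backoff.Ready):

package backoff_test

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/clock"
)

func ExampleFakeClockStep() {
	fakeClock := clock.NewFakeClock(time.Time{})

	// stepEnd equals the clock's creation time, mirroring backoff.stepEnd above.
	stepEnd := fakeClock.Now()

	// Now() == stepEnd, so a readiness check of the form Now().After(stepEnd) fails.
	fmt.Println(fakeClock.Now().After(stepEnd))

	// Bumping the fake clock by one nanosecond moves Now() past stepEnd.
	fakeClock.Step(time.Nanosecond)
	fmt.Println(fakeClock.Now().After(stepEnd))

	// Output:
	// false
	// true
}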

pkg/virt-handler/ratelimitcmd/executor.go (outdated review thread, resolved)
*
*/

package ratelimitcmd_test
Member

Please consider moving it under the already existing pkg/util/ratelimiter.

Member

@EdDev EdDev Jan 10, 2022


That one is a flowcontrol rate limiter, no idea what it is.. but it is for sure unrelated to an executor/cmd rate limiter.

Member

I"m not saying the flowcontrol should be used. Just having the new ratelimiter under the same package.

Contributor Author

I think ratelimitercmd and pkg/util/ratelimiter, even though they resemble each other semantically, are two totally different things.
But I do agree ratelimitercmd should be moved to pkg/util/;
what do you think about moving it there as-is, alongside the one that already exists?

Member

Yes, that's what I meant.

EDIT:
Sorry, I misunderstood you, that's not what I meant :)
I meant moving the content of ratelimitercmd into pkg/util/ratelimiter.
I find it weird to have both ratelimiter and ratelimitercmd under pkg/util. You can create a hierarchy under pkg/util/ratelimiter.

Member

I do not think anything should be under something called util, and there is nothing in common about ratelimiter as a super-package. Multiple objects may have a rate-limiter behavior; in this case it is a cmd/exec.

How about creating an executor package, which has a rate-limiter behavior?
IMO it should have been the same with the other one that exists now: it is a "flowcontrol" thing that has a rate-limiter option, not the other way around.

Member

I think the ratelimiter can live on its own and is not only for the executor. The executor is just one usage of the ratelimiter. But I won't block the PR on it.
@ormergi what do you think?

Contributor Author

@ormergi ormergi Jan 16, 2022


Thinking about this again, I do agree we should not put it under pkg/util (we've had bad experience with util packages..);
the current ratelimiter package we have (the one that wraps flowcontrol) should have a different name, and that is what's confusing.
If there were more usages for the rate-limiter part of the executor (basically backoff.go),
it would have been natural to put it under its own package, pkg/ratelimiter, for reuse.
But since there are no other consumers at the moment, we can keep it all under one executor package.

If you prefer to have the rate-limiter part (backoff.go) in its own package for visibility and to encourage others to use it,
we can split it into two packages, pkg/ratelimiter and pkg/executor, and deal with pkg/util/ratelimiter later:

pkg/
|_ executor/
|  |_ pool.go
|  |_ executor.go
|    ...
|_ ratelimiter/
   |_ backoff.go

@AlonaKaplan @EdDev WDYT?

Member

I think it is a property of an executor at the moment. I would only promote it to its own package if another user of it appears.

I think the ratelimiter can live on its own and is not only for the executor

Yes, but it is still a property of the executor, and having its own package is just an option if we see it used by another functionality. I do not have one in mind at the moment.
And I also do not like util, or having a name under which this one can live nicely with the other one you mentioned.

@@ -2560,6 +2565,13 @@ func (d *VirtualMachineController) hotplugSriovInterfaces(vmi *v1.VirtualMachine
return nil
Member

Shouldn't you delete the VM from d.sriovHotplugExecutorPool in this case? All the SR-IOV NICs are plugged, so the rate-limiting backoff should be zeroed.

Contributor Author

Good catch!
Thanks for the heads up, done.

@@ -2517,6 +2517,10 @@ func (d *VirtualMachineController) vmUpdateHelperDefault(origVMI *v1.VirtualMach
			return fmt.Errorf("failed to adjust resources: %v", err)
		}
	} else if vmi.IsRunning() {
		if err := d.hotplugSriovInterfaces(vmi); err != nil {
			log.Log.Object(vmi).Error(err.Error())
Member

Shouldn't you re-enqueue the vmi in such a case? How do you make sure the vmi reconcile will be invoked again?

Also, in our hangouts meeting @EdDev mentioned that d.hotplugSriovInterfaces is async. In that case, even if no error is returned, the operation may still fail. How do we make sure the vmi reconcile is called again?

Contributor Author

In general, currently, when not all SR-IOV interfaces are plugged we don't change the VM phase to Failing; we make a best effort to do the hotplug without disrupting the VM update flow, similar to how it was before this PR.
That being said, it may change in the future.

The virt-launcher domain-notifier periodically sends a (Modified) event that triggers virt-handler to perform a VMI sync, which eventually calls hotplugSriovInterfaces, which does the hotplug.

In an SR-IOV VM's logs there are periodic "Synced VMI" log messages every minute or so.
And on virt-handler I added some debug messages to indicate a VM update and SR-IOV hotplug; both are seen periodically.
I will add references to the code soon.

Member

So you're saying the re-enqueue we do in the controller is redundant?
The reconcile for the vmi is periodically invoked anyway?

EDIT: or are you saying that in the SR-IOV hotplug case the re-enqueue is not needed, since the VM is running and for a running VM we have a periodic reconcile?

Contributor Author

EDIT: or are you saying that in the SR-IOV hotplug case the re-enqueue is not needed, since the VM is running and for a running VM we have a periodic reconcile?

Yes.
Regarding the periodic VMI sync, here are the references to what I described earlier:
The domains informer on virt-handler triggers a VMI sync every 5 minutes [1] [2] [3] [4].
Other than that, the virt-launcher domain-notifier client triggers a VMI sync (by sending a domain Modified event) every two minutes when the QEMU guest-agent is present [5] [6] [7] [8] [9] [10].
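
For illustration, this is the generic client-go mechanism behind such periodic syncs: a shared informer with a resync period re-delivers every cached object to its handlers on each resync, so a running object gets reconciled even when nothing changed. This is a generic sketch, not virt-handler's actual wiring; the 5-minute value matches the interval mentioned above.

package main

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	// Resync every 5 minutes: UpdateFunc fires for every cached object on each resync.
	factory := informers.NewSharedInformerFactory(clientset, 5*time.Minute)
	podInformer := factory.Core().V1().Pods().Informer()
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			// Periodic reconcile entry point; a controller would enqueue the key here.
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	<-stop // block forever, like a controller process
}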

@kubevirt-bot kubevirt-bot removed the lgtm Indicates that a PR is ready to be merged. label Jan 11, 2022
@AlonaKaplan
Member

/approve

@kubevirt-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: AlonaKaplan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubevirt-bot kubevirt-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 16, 2022
@ormergi ormergi force-pushed the hotplug-sriov-on-reconciler branch 2 times, most recently from bb58d66 to 98a9de1 Compare January 20, 2022 14:55
@ormergi
Contributor Author

ormergi commented Jan 20, 2022

Following the offline discussion about how this PR affects SR-IOV hotplug on old virt-launcher pods and backward compatibility in general,
I did some testing on my local env that emulates different stages of KubeVirt during an upgrade, in order to verify that SR-IOV hotplug is performed as expected.
link to my test branch

There are two interesting scenarios:

  1. When a VM is migrated to a node that runs an old version of virt-handler (without this PR's changes) and a new virt-launcher migration target pod (with this PR's changes).
    For example: KubeVirt was upgraded, but a node is still running the old virt-handler and an SR-IOV VM is migrated to that node.
    virt-handler sends the FinalizeVirtualMachineMigration command as part of the post-migration flow on the target.
    But the new virt-launcher no longer attaches host-devices as part of FinalizeVirtualMachineMigration.
    Thus SR-IOV interfaces are not plugged into the guest at the end of the migration.
    Once virt-handler is upgraded, it will send HotplugHostDevices as part of the reconcile loop, which will trigger
    virt-launcher to attach the host-devices.

This is a transient state, as we expect all virt-handler pods to eventually be upgraded as part of the KubeVirt upgrade.

  2. When a VM is migrated to a node that runs the new virt-handler and the virt-launcher migration target pod is old (without this PR's changes).
    For example: KubeVirt is upgrading and the virt-handler pods have just finished their upgrade.
    virt-handler sends the FinalizeVirtualMachineMigration command as part of the post-migration flow on the target.
    The old virt-launcher attaches host-devices as part of FinalizeVirtualMachineMigration, but fails because the QEMU process lacks resources:
{"component":"virt-launcher","level":"error","msg":"cannot limit locked memory of process 117 to 1586495488: Operation not permitted","pos":"virProcessSetMaxMemLock:962","subcomponent":"libvirt","thread":"51","timestamp":"2022-01-18T15:46:56.085000Z"}
{"component":"virt-launcher","level":"error","msg":""msg":"failed to hot-plug host-devices","name":"testvmi-bcxxb","namespace":"kubevirt-test-default1","pos":"live-migration-target.go:42",
"reason":"failed to attach host-device \u003chostdev type=\"pci\" managed=\"no\"\u003e\u003csource\u003e\u003caddress type=\"pci\" domain=\"0x0000\" bus=\"0x04\" slot=\"0x07\" function=\"0x0\"\u003e\u003c/address\u003e\u003c/source\u003e\u003calias name=\"ua-sriov-sriov\"\u003e\u003c/alias\u003e\u003c/hostdev\u003e, 
err: virError(Code=38, Domain=0, Message='cannot limit locked memory of process 117 to 65536: Operation not permitted')\n","timestamp":"2022-01-18T15:46:56.091744Z","uid":"da2b6fb3-faef-4546-adcc-f44c76e1e2ba"}

Next, when the migration is completed and virt-handler has switched to the regular "vmUpdate" flow, it sends the HotplugHostDevices command as part of the reconciliation loop, but the old virt-launcher does not support it and returns the following error:

"failed to hot-plug SR-IOV interfaces: unknown error encountered sending command HotplugHostDevices: rpc error: code = Unimplemented desc = unknown method HotplugHostDevices for service"

To solve this and support old virt-launcher pods, virt-handler must keep adjusting the QEMU process memlock limits (which is a prerequisite for attaching an SR-IOV host-device) as part of post-migration on the target node.

I have pushed new changes to fix it.

Currently, when an SR-IOV VM is migrated, we detach its SR-IOV devices
just before the migration starts and attach them back to the target VM
when the migration finishes successfully.
This is performed only once, at post-migration.

The current implementation doesn't leverage the VMController, may leave
the VM in an incomplete state (missing SR-IOV devices) and also doesn't
follow the Kubernetes desired-state design that is followed all over
the project.

With this change, instead of attaching SR-IOV devices only at
post-migration, the virt-handler VMController handles it as part of its
reconcile loop whenever needed.

In order to support host-device attachment at post-migration on
older virt-launcher pods, virt-handler keeps adjusting the QEMU process
memlock limits as part of the post-migration flow on the target node.

Signed-off-by: Or Mergi <ormergi@redhat.com>
Attaching an SR-IOV host-device is an intrusive operation that may
disturb the VM workloads and overall availability.
It should be called with a backoff in order to give the underlying
components time to finish.

Performing host-device hot-plug with a limited backoff ensures it is
done with reasonable time gaps and stops after a while.

Signed-off-by: Or Mergi <ormergi@redhat.com>
Now that host-device hot-plug is done as part of the virt-handler
VM controller reconcile loop, it should run in the background
instead of blocking the loop and causing the VM update flow to hang.

With this change, the host-device hot-plug logic runs in
its own goroutine in a way that does not block the virt-handler
VM controller reconcile loop.

Also, in order to prevent disruption of the VM workloads and
redundant resource consumption, there will be no more than one
concurrent host-device hot-plug goroutine per virt-launcher.

Signed-off-by: Or Mergi <ormergi@redhat.com>
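
A minimal, self-contained sketch of that concurrency guarantee (illustrative only, not the PR's actual code; names are hypothetical): run hot-plug in the background while keeping at most one in-flight attempt per VMI, so the reconcile loop never blocks.

package hotplug // hypothetical package, for illustration only

import (
	"sync"

	"k8s.io/apimachinery/pkg/types"
)

// asyncHotplugPool tracks which VMIs already have a hot-plug attempt running.
type asyncHotplugPool struct {
	mu       sync.Mutex
	inFlight map[types.UID]bool
}

func newAsyncHotplugPool() *asyncHotplugPool {
	return &asyncHotplugPool{inFlight: map[types.UID]bool{}}
}

// RunOnce starts hotplug in a goroutine unless one is already running for
// this VMI, so the reconcile loop never blocks and workers never pile up.
func (p *asyncHotplugPool) RunOnce(uid types.UID, hotplug func() error) {
	p.mu.Lock()
	if p.inFlight[uid] {
		p.mu.Unlock()
		return
	}
	p.inFlight[uid] = true
	p.mu.Unlock()

	go func() {
		defer func() {
			p.mu.Lock()
			delete(p.inFlight, uid)
			p.mu.Unlock()
		}()
		_ = hotplug() // errors are logged by the caller in the real flow
	}()
}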
@ormergi
Contributor Author

ormergi commented Jan 20, 2022

I accidentally pushed a wrong change along with the one that was needed [1]; I have pushed a new change to remove just that [2].

@ormergi
Contributor Author

ormergi commented Jan 23, 2022

/hold

Placing a hold until we have more eyes on it.

@kubevirt-bot kubevirt-bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 23, 2022
Member

@EdDev EdDev left a comment


Thank you, the result looks really good!

@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Jan 23, 2022
@EdDev
Member

EdDev commented Jan 23, 2022

@AlonaKaplan, this change was added (after your approval) to support the (edge) scenario where a new virt-handler handles the migration target of an old virt-launcher.

It was explained in detail here.

I think we are good to go.

@EdDev
Member

EdDev commented Jan 23, 2022

/unhold

We seem to be good, let's get this in.

@kubevirt-bot kubevirt-bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 23, 2022
@kubevirt-commenter-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@kubevirt-bot
Contributor

kubevirt-bot commented Jan 23, 2022

@ormergi: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-kubevirt-e2e-k8s-1.20-sig-network
Commit: fcef8f3
Details: link
Required: true
Rerun command: /test pull-kubevirt-e2e-k8s-1.20-sig-network

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@kubevirt-commenter-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@kubevirt-bot kubevirt-bot merged commit bb05152 into kubevirt:main Jan 24, 2022
@phoracek
Member

phoracek commented Oct 3, 2022

/cherry-pick release-0.49

@kubevirt-bot
Contributor

@phoracek: new pull request created: #8560

In response to this:

/cherry-pick release-0.49

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. lgtm Indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL