
vHostUser interface Design Proposal #218

Open · wants to merge 1 commit into base: main

Conversation

nvinnakota10

This is a design proposal for the vhostuser interface to support the userspace networking feature for KubeVirt VMs.

The document describes the proposed feature/interface, why it is needed, and how it enables fast packet processing when DPDK is available.

@kubevirt-bot kubevirt-bot added the dco-signoff: yes Indicates the PR's author has DCO signed all their commits. label Apr 27, 2023
@kubevirt-bot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign cwilkers for approval by writing /assign @cwilkers in a comment. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubevirt-bot kubevirt-bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 27, 2023
@kubevirt-bot

Hi @nvinnakota10. Thanks for your PR.

I'm waiting for a kubevirt member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@alicefr
Member

alicefr commented May 3, 2023

@nvinnakota10 I'm still missing the scheduling part, or is it part of the CNI? My question is: will a VMI requesting this network interface be automatically scheduled on nodes where the CNI plugin is installed, or does the virt-launcher pod require additional information?

@maiqueb
Contributor

maiqueb commented May 3, 2023

/cc

@kubevirt-bot kubevirt-bot requested a review from maiqueb May 3, 2023 14:29
Member

@xpivarc xpivarc left a comment


Hi @nvinnakota10 ,
Would it be possible to refer to the CNI providing this functionality and include how it works for Pods?


- Creating the VMs will follow the same process with a few changes highlighted below:
1. **Once the VM spec is created, virt-controller will add two new volumes to the virt-launcher pod.**\
a. **EmptyDir volume named (shared-dir) is used to share the vhostuser socket file with the virt-launcher pod and the dpdk dataplane.**\
Member

Can you follow up here and explain why this empty dir is required? Can it be mounted on an arbitrary path?

Author

The purpose of the emptyDir is to act as a socket directory in the compute container and allow qemu to create a vhostuser socket in the directory /var/lib/kubelet/pods/<podID>/volumes/kubernetes.io~empty-dir/shared-dir, which will be mounted by the CNI on the respective kubelet node. The host path's mount can vary based on where the CNI wants to mount it.

Moreover, for better transparency with k8s, emptyDir is more advantageous, as the cleanup will be done by k8s, i.e., if we delete the pod, the emptyDir will also be deleted.

ref: https://kubernetes.io/docs/concepts/storage/volumes/#emptydir
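
For illustration only, here is a minimal sketch of how the shared-dir emptyDir could look on the virt-launcher pod; the container name and in-container mount path are assumptions, not the exact spec generated by virt-controller:

    # Hypothetical excerpt of the virt-launcher pod spec (illustrative names and paths).
    spec:
      containers:
      - name: compute
        volumeMounts:
        - name: shared-dir
          mountPath: /var/lib/cni/usrcni/   # assumed in-container socket directory
      volumes:
      - name: shared-dir
        emptyDir: {}
        # Backed on the host by:
        # /var/lib/kubelet/pods/<podID>/volumes/kubernetes.io~empty-dir/shared-dir

The CNI can then link or bind-mount that host path to whatever socket directory it prefers, as described above.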

Member

This isn't always true, see the comments in #218 (comment)

Contributor

@maiqueb maiqueb left a comment


Thanks for your effort in driving / helping move this forward by providing a design doc.

I didn't review in depth; I hope I can find the time to provide a more comprehensive review.

devices:
  interfaces:
  - name: vhost-user-vn-blue
    vhostuser: {}
Contributor

this implies adding yet another binding mechanism.

We've tried not to do that in the past (just a heads up ...) .

We are actually trying to decouple this from KubeVirt, thus allowing end users to provide (and use ...) their bindings without moving the maintenance burden to KubeVirt.

Author

@maiqueb , can you please give a few pointers on how this works? Thanks.

Contributor

Sure.

  1. update the KubeVirt API with this new interface type.
  2. AFAIU, nothing needs to be done for phase #1 (virt-handler performs some privileged operations on behalf of the launcher). Everything you need should be done by userspace-cni, right?
  3. the launcher domain manager must generate the interface domXML snippet for your particular interface. More stuff done here (example for macvtap).

Contributor

But as I've said in this comment, we're currently investigating how to allow users to extend the network binding options without requiring the core KubeVirt maintainers to maintain said bindings... We've recently discouraged users from posting new binding types ...

FYI .

@EdDev can provide more information; I've just floated a few half-baked ideas on how this could be done, but there's nothing concrete yet.

Author

@maiqueb, I believe I am not using a binding mechanism for the vhostuser interface, as I am not doing exactly what has been done for interfaces like bridge, masquerade, or PASST.

I am using this PR kubevirt/kubevirt#3208 as a reference, and most of the changes are from it.
I have built kubevirt with the changes from the mentioned PR and made the necessary changes to make it work with the latest kubevirt, for releases 0.58, 0.59 and 0.48. I was able to bring up the interface without using the binding mechanism.

Please let me know if you think this is still not clear.

Contributor

Hm, you may be right, seems we also don't need to generate the interface domXML snippet (on the launcher pod):
https://github.com/kubevirt/kubevirt/pull/3208/files#diff-030aa8b7ff9bf4a511d1d5641c1dec04205a944137672be3a691f0b1b4068753R93

Member

While pod networking configuration is not needed by virt-handler, the domain configuration is surely needed [1].

I am thinking that we can generalize these types of bindings, i.e., no-op bindings which do not touch the pod networking.
But we are still left with the logic of how to define the domain spec (libvirt domain config), including NUMA, hugepages, etc.

[1] https://github.com/kubevirt/kubevirt/pull/3208/files#diff-bc5b4c898f931f6ffeea71321777a71452dcb2e521946f77d28a72a885154a53

Author

@EdDev, @maiqueb, sorry for the confusion. I am actually using the interface binding method. As I mentioned, I made a few changes and forgot that I was using the interface binding method for the vhostuser interface.

kubevirt/kubevirt@release-0.59...Juniper:kubevirt:release-0.59

The above comparison shows the exact changes we are currently using. I have rechecked my changes, as I remember this binding method for the interface was added some time ago.

https://github.com/Juniper/kubevirt/blob/f441f1068f3210ff007cea1fed7b8281ce3f36cb/staging/src/kubevirt.io/api/core/v1/schema.go#L1239

What would be your suggestion in this case?

Author

@maiqueb , @EdDev

Can you please confirm whether we can proceed with the interface binding method, or whether there is any work in progress for the same, as mentioned in the comment:
#218 (comment)

Thanks

@nvinnakota10
Author

@nvinnakota10 I'm still missing the scheduling part, or is it part of the CNI? My question is: will a VMI requesting this network interface be automatically scheduled on nodes where the CNI plugin is installed, or does the virt-launcher pod require additional information?

The scheduling part is done by k8s. In the VM spec, the user can specify custom labels and use them as node selectors for scheduling. Moreover, the user can specify the required huge pages and memory requests. This allows the user to label the nodes that can support the feature and lets k8s schedule accordingly based on those requirements.
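
For example, a rough VMI sketch that steers scheduling through a node selector plus hugepages and memory requests; the label, sizes and network name are illustrative, and `vhostuser: {}` is the interface type proposed in this document:

    apiVersion: kubevirt.io/v1
    kind: VirtualMachineInstance
    metadata:
      name: vm-dpdk                                # illustrative name
    spec:
      nodeSelector:
        example.com/userspace-dataplane: "true"    # hypothetical label applied to capable nodes
      domain:
        memory:
          hugepages:
            pageSize: 1Gi                          # example hugepage size
        resources:
          requests:
            memory: 2Gi
        devices:
          interfaces:
          - name: vhost-user-vn-blue
            vhostuser: {}                          # interface type proposed here
      networks:
      - name: vhost-user-vn-blue
        multus:
          networkName: userspace-vpp-250-1         # assumed Multus NAD backed by the userspace CNI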

Member

@EdDev EdDev left a comment


Sorry for the late feedback on this.
There is a lot to digest here and it will probably take me some time to learn all the details, sorry in advance.

As @maiqueb mentioned earlier, we are trying to find an elegant way to decouple the different network interface types as much as possible, to ease and distribute the maintenance effort.
This type seems not to require the first step, which configures the pod networking, only the second step, where there is a need to define the domain configuration to consume the endpoint (in this case a socket).

In the meantime, I think the discussion on this proposal can continue as usual.

- Creating the VMs will follow the same process with a few changes highlighted below:
1. **Once the VM spec is created, virt-controller will add two new volumes to the virt-launcher pod.**\
a. **EmptyDir volume named (shared-dir) "/var/lib/cni/usrcni/" is used to share the vhostuser socket file with the virt-launcher pod and the dpdk dataplane. This socket file acts as a UNIX socket between the VM interface and the dataplane running in userspace.**\
b. **DownwardAPI volume named (podinfo) is used to share the vhostuser socket file name with the virt-launcher via pod annotations.**
Member

If the EmptyDir mounts the actual socket to a specific path, why do we need the DownwardAPI?
What information needs to arrive on the virt-launcher filesystem, and why can it not be passed through a field in the VMI (e.g. in the interface details)?

Author

@EdDev, since both the kubemanager and the CNI need to use the same name for the socket file, I am using the downwardAPI to achieve this. This is just to make sure the pod creates a VM with a socket with the right name.

Author

@EdDev , Also, the CNI/KubeManager will be creating the socket files while creating the virtio interfaces. So, for consistency, we expect the CNI to use a socket name representing poduid+intfname+id. This info will not be available from the VM YAML spec.

DownwardAPI will help us maintain consistency, as the CNI creates the sockets in a custom path, which will be shared with the launcher pod. This socket name will be passed to qemu.
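
For reference, a minimal sketch of the podinfo downwardAPI volume, assuming the "/etc/podnetinfo/annotations" mount path mentioned elsewhere in this proposal (the container name is an assumption):

    spec:
      containers:
      - name: compute
        volumeMounts:
        - name: podinfo
          mountPath: /etc/podnetinfo
      volumes:
      - name: podinfo
        downwardAPI:
          items:
          - path: annotations
            fieldRef:
              fieldPath: metadata.annotations   # exposes pod annotations (incl. CNI-written socket info) as a file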


@EdDev , Also, the CNI/KubeManager will be creating the socket files while creating the virtio interfaces. So, for consistency, we expect the CNI to use a socket name representing poduid+intfname+id. This info will not be available from the VM YAML spec.

DownwardAPI will help us maintain consistency, as the CNI creates the sockets in a custom path, which will be shared with the launcher pod. This socket name will be passed to qemu.

@nvinnakota10, if the socket name is built after poduid+intfname+id, live migration of the VM will fail, because a new pod will be created where the running VM is transferred to, and qemu will be started with the same parameters, including the path/socket name.
But obviously the poduid changed, and the CNI created a new socket named after newpoduid+intfname+id, and also uses this name to configure the vswitch port. qemu will then hang waiting for the socket poduid+intfname+id to be opened by the client (vswitch).
We did experience this issue with userspace CNI, and had to explicitly specify the socket name in the userspace CNI config in order to have a successful VM live migration:

    {
        "cniVersion": "0.3.1",
        "type": "userspace",
        "name": "userspace-vpp-250-1",
        "KubeConfig": "/etc/cni/net.d/multus.d/multus.kubeconfig",
        "SharedDir": "/var/lib/cni/usrcni/",
        "logFile": "/var/log/userspace-vpp-250-1.log",
        "logLevel": "debug",
        "host": {
            "engine": "vpp",
            "iftype": "vhostuser",
            "netType": "bridge",
            "vhost": {
                "mode": "client",
                "socketfile": "net1"
            },
            "bridge": {
                "bridgeName": "250"
            }
        }
    }

The socket name poduid+intfname+id is the default one in userspace CNI. Maybe it is worth changing this default naming in the CNI (standard interface?) or changing the qemu parameters regarding the vhost-user-net socket name when live migrating a VM?

Author

@TReZ42 , Yes, we will have to look at the migration part, as the new pod created might still have the same socket name, but the downwardAPI might provide a new socket config based on the podUID. This might cause the VM to fail while waiting for the socket.

I will check the multi-interface case as well. I am not sure if the interface supports live migration. Though it is a secondary interface, it can cause live migration to fail if not handled appropriately, as the socket names might change when the podUID changes, and the VM should have the appropriate socket name.

Author

@TReZ42 , One possible way is hashing the combination of name+namespace of the VM created. But for this, can you please confirm whether we can live migrate VMs created using the VirtualMachineInstanceReplicaSet kind? How do we handle the names of the VMs when the VirtualMachineInstanceReplicaSet kind is used to create them?


@nvinnakota10, VMI names are created after the template name metadata if it exists, or VirtualMachineInstanceReplicaSet name if not, appended with an id: testvmi5wfxp, testvmibjkpz.

We can also think about VirtualMachinePool: the VMs and VMIs names are created after VirtualMachinePool name, appended with -0, -1, etc. : test-vmp-0, test-vmp-1.

We can Live Migrate each of the VMI created by VirtualMachineInstanceReplicaSet or VirtualMachinePool, but not directly the VirtualMachineInstanceReplicaSet or the VirtualMachinePool.

VMI name does not change after live migration, so using a hash of vmi+namespace+intfname as the socket name could be OK. But the userspace CNI must also be adapted to use socket names according to this naming convention...

Don't you think it is possible to let the CNI define the socket name, and, for live migration, patch the domain of the target VMI with the new socket name?

Author

@TReZ42 , Yes. This part needs to be handled by the CNI. For the changes I am planning to upstream for the vHostUser interface, KubeVirt needs no information on how the socket name is derived. But the CNI needs to make sure the socket name is consistent if live migration is needed.

Also, if we create VMs using replica sets, the hash appended at the end of the VMI name is immutable. So, even if we live migrate these VMs, they still have the same name in the VMI object, which shouldn't affect the socket name consistency.

I was asking the question to make sure the approach makes sense, or whether I need to look at other possibilities for having the CNI maintain consistent names.


@nvinnakota10, if I understand correctly, you want the CNI to derive the socket name from the VM name (VMI, VMIRS, VMPool)? But is this info available to the CNI in the Downward API? Perhaps in annotations?

Is it not possible to change the target domain with the updated socket name, generated by the CNI as usual, the way you are doing in createDomainInterfaces for the vhostuser case?

Author

Yes, I am using createDomainInterfaces. As soon as the CNI creates the socket name and interface name, it stores the annotations in the path mounted for the downwardAPI in the launcher pod.

The createDomainInterfaces will retrieve this information to create the socket using qemu args.
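
Purely as an illustration (the actual annotation keys depend on the CNI; the keys below are hypothetical), the pod annotations written by the CNI, which the downwardAPI then renders as key="value" lines in /etc/podnetinfo/annotations, might look like:

    metadata:
      annotations:
        # Hypothetical keys, for illustration only:
        vhostuser.cni.example.com/socket-name: "net1"
        vhostuser.cni.example.com/socket-dir: "/var/lib/cni/usrcni/"

Per the description above, createDomainInterfaces would read that file and use the values when building the qemu vhost-user arguments.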

@stefanha

stefanha commented May 16, 2023

Please use the term vhost-user-net to refer to the network device since there are other vhost-user devices too (blk, fs, gpio, etc). KubeVirt may support other vhost-user devices in the future, so it helps to be clear what is specific vhost-user-net and what is generic vhost-user infrastructure that other devices will also use.

Edit: To clarify what I mean, please change the title of this proposal to vhost-user-net so it's clear this vhost-user integration is specific to network interfaces.

@stefanha

Hi @nvinnakota10 , Would it be possible to refer to the CNI providing this functionality and include how it works for Pods?

👍
Yes, please. It's worth defining the interface for regular k8s Pods (without KubeVirt in the picture). For example, someone may want to run a DPDK application in a regular Pod attached to a DPDK software switch provided by CNI using vhost-user.
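
To make the Pod-only case concrete, here is a rough sketch of a plain Pod consuming a vhost-user socket; the image, names and shared-dir layout are assumptions, and the network annotation refers to a Multus NetworkAttachmentDefinition backed by the userspace CNI, such as the userspace-vpp-250-1 example in this thread:

    apiVersion: v1
    kind: Pod
    metadata:
      name: dpdk-testpmd                                   # illustrative
      annotations:
        k8s.v1.cni.cncf.io/networks: userspace-vpp-250-1   # assumed NAD name
    spec:
      containers:
      - name: dpdk-app
        image: example.com/dpdk-testpmd:latest             # placeholder image
        volumeMounts:
        - name: shared-dir
          mountPath: /var/lib/cni/usrcni/                  # where the CNI drops the vhost-user socket
        resources:
          requests:
            memory: 1Gi
            hugepages-1Gi: 2Gi
          limits:
            memory: 1Gi
            hugepages-1Gi: 2Gi
      volumes:
      - name: shared-dir
        emptyDir: {}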

spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "userspace",


Why is it called "userspace" and not "vhost-user-net"? Is the idea to support other userspace network interfaces in the future?

Contributor

this must have the name of the CNI binary which will run on the host.

AFAIU, the "reference" CNI to use is userspace CNI, whose binary is named userspace: https://github.com/intel/userspace-cni-network-plugin#work-standalone

Author

@stefanha , yes, the type here points to the CNI binary name. Since the userspace CNI by Intel is one of the open-source userspace CNIs, I mentioned it in the example.


@maiqueb @nvinnakota10
Thanks for sharing the link to the Userspace CNI plugin! It seems the Userspace CNI plugin doesn't define how Pods need to be configured - it's left up to the user to create volumes, choose paths, and I'm not sure it uses annotations and the downward API.

I think a standard interface should be defined by the Userspace CNI plugin, not in KubeVirt. Then other container workloads can benefit from a standard interface that doesn't require coming up with configuration (volumes, paths, etc) from scratch.

The KubeVirt proposal would just be about implementing the Userspace CNI interface, not defining annotations, etc.


As a first step towards this, I suggest splitting this proposal into two parts:

  1. The standard Userspace CNI plugin interface that allows any Pod to detect vhost-user-net devices and consume them. This will eventually go into the Userspace CNI plugin project (and other CNI plugins can implement a compatible interface).
  2. The KubeVirt-specific part that describes how to implement the interface in virt-launcher, etc.


I raised this topic in the video call we had last week, but no one had a suggestion for how the Userspace CNI plugin (and third-party plugins that work along the same lines) could provide a standard interface. I am not familiar with CNI myself so I also don't have a way forward.

In light of this, I think it's okay to keep things as they are for now and consider this review comment resolved.

@stefanha

@nvinnakota10 I don't see a URL for a video conference in the kick-off event calendar invite you sent. Can you share the URL?

- Removing the VMs will follow the changes below:
1. **The CNI should delete the socket files of each interface. If 'n' interfaces are present, 'n' socket files will be deleted, allowing k8s to proceed with a clean pod deletion.**

No additional changes are required while deleting the VM; it will be deleted as a regular VM, and the CNI should handle the deletion of the interfaces. In the case of vHostUser, the CNI should delete the socket files before k8s initiates pod deletion.
Member

I apologize if I insist on this point, but does kubelet request the deletion of the network interface from the CNI again if there is a failure? If so, then the cleanup shouldn't be a problem.

Contributor

AFAIU no; depending on the implementation, kubelet may not even be aware something failed.

The CNI spec in fact indicates a CNI DEL should not return an error:

Plugins should generally complete a DEL action without error even if some resources are missing.

Some entity should reconcile the resources afterwards ensuring those are gone.

Member

Then we might still have a problem here :/ @nvinnakota10, if you have a working prototype, I'd be really interested to know what happens if you try to kill the component doing the unmount and at the same time delete the VMI/pod.

Author

@alicefr , I tried something similar to what you asked. I made changes to my CNI so that it won't execute the deletion of sockets; I just made sure CNI DEL does nothing.

I tried deleting the VM using the kubectl delete command. The kubelet logs from journalctl showed that the kubelet did unmount both volumes, shared-dir (emptyDir) and podinfo (downwardAPI). The virt-launcher deletion did not have any issues.

Please find the logs related to the same:

May 23 22:07:12 vihari-worker-1 kubelet[5747]: I0523 22:07:12.908533 5747 reconciler.go:357] "operationExecutor.VerifyControllerAttachedVolume started for volume "shared-dir" (UniqueName: "kubernetes.io/empty-dir/9f57a12d-5ccf-47b2-afe0-a010241b6c77-shared-dir") pod "virt-launcher-vm-single-virtio-sktwl" (UID: "9f57a12d-5ccf-47b2-afe0-a010241b6c77") " pod="kubevirttest/virt-launcher-vm-single-virtio-sktwl"
May 23 22:07:12 vihari-worker-1 kubelet[5747]: I0523 22:07:12.908556 5747 reconciler.go:357] "operationExecutor.VerifyControllerAttachedVolume started for volume "podinfo" (UniqueName: "kubernetes.io/downward-api/9f57a12d-5ccf-47b2-afe0-a010241b6c77-podinfo") pod "virt-launcher-vm-single-virtio-sktwl" (UID: "9f57a12d-5ccf-47b2-afe0-a010241b6c77") " pod="kubevirttest/virt-launcher-vm-single-virtio-sktwl"

————————————
May 23 22:21:32 vihari-worker-1 kubelet[5747]: I0523 22:21:32.729817 5747 reconciler.go:211] "operationExecutor.UnmountVolume started for volume "podinfo" (UniqueName: "kubernetes.io/downward-api/9f57a12d-5ccf-47b2-afe0-a010241b6c77-podinfo") pod "9f57a12d-5ccf-47b2-afe0-a010241b6c77" (UID: "9f57a12d-5ccf-47b2-afe0-a010241b6c77") "

May 23 22:21:32 vihari-worker-1 kubelet[5747]: I0523 22:21:32.729871 5747 reconciler.go:211] "operationExecutor.UnmountVolume started for volume "shared-dir" (UniqueName: "kubernetes.io/empty-dir/9f57a12d-5ccf-47b2-afe0-a010241b6c77-shared-dir") pod "9f57a12d-5ccf-47b2-afe0-a010241b6c77" (UID: "9f57a12d-5ccf-47b2-afe0-a010241b6c77") "

May 23 22:21:32 vihari-worker-1 kubelet[5747]: I0523 22:21:32.737107 5747 operation_generator.go:890] UnmountVolume.TearDown succeeded for volume "kubernetes.io/downward-api/9f57a12d-5ccf-47b2-afe0-a010241b6c77-podinfo" (OuterVolumeSpecName: "podinfo") pod "9f57a12d-5ccf-47b2-afe0-a010241b6c77" (UID: "9f57a12d-5ccf-47b2-afe0-a010241b6c77"). InnerVolumeSpecName "podinfo". PluginName "kubernetes.io/downward-api", VolumeGidValue ""

May 23 22:21:32 vihari-worker-1 kubelet[5747]: I0523 22:21:32.772221 5747 operation_generator.go:890] UnmountVolume.TearDown succeeded for volume "kubernetes.io/empty-dir/9f57a12d-5ccf-47b2-afe0-a010241b6c77-shared-dir" (OuterVolumeSpecName: "shared-dir") pod "9f57a12d-5ccf-47b2-afe0-a010241b6c77" (UID: "9f57a12d-5ccf-47b2-afe0-a010241b6c77"). InnerVolumeSpecName "shared-dir". PluginName "kubernetes.io/empty-dir", VolumeGidValue ""


May 23 22:21:32 vihari-worker-1 kubelet[5747]: I0523 22:21:32.831658 5747 reconciler.go:399] "Volume detached for volume "podinfo" (UniqueName: "kubernetes.io/downward-api/9f57a12d-5ccf-47b2-afe0-a010241b6c77-podinfo") on node "vihari-worker-1" DevicePath """

May 23 22:21:32 vihari-worker-1 kubelet[5747]: I0523 22:21:32.831680 5747 reconciler.go:399] "Volume detached for volume "shared-dir" (UniqueName: "kubernetes.io/empty-dir/9f57a12d-5ccf-47b2-afe0-a010241b6c77-shared-dir") on node "vihari-worker-1" DevicePath """

1. **Once the VM spec is created, virt-controller will add two new volumes to the virt-launcher pod.**\
a. **EmptyDir volume named (shared-dir) "/var/lib/cni/usrcni/" from the virt-launcher pod is used to share the vhostuser socket file with the virt-launcher pod and the dpdk dataplane. This socket file acts as a UNIX socket between the VM interface and the dataplane running in userspace.**\
b. **DownwardAPI volume named (podinfo) at "/etc/podnetinfo/annotations" is used to share the vhostuser socket file name with the virt-launcher via pod annotations. The downwardAPI is used here so that the pod knows the socket details, like name and path, which are created by the CNI/kubemanager while bringing up the virt-launcher pod; this info is only available after the pod is created.**
2. **The CNI should mount the shared-dir, i.e., the EmptyDir volume's host path /var/lib/kubelet/pods/<podID>/volumes/kubernetes.io~empty-dir/shared-dir, by linking it to a custom path the CNI prefers. The CNI should create the sockets on the host in that custom path. For each interface, the CNI should create a socket.**
Member

Can we guarantee that the socket is always created before we start QEMU?

Contributor

@maiqueb maiqueb May 19, 2023

If the socket is created by CNI, it is indeed guaranteed; first CNI creates the pod sandbox, and only after that are the pod's containers started.

Addressed a few comments and added a sequence diagram as requested.

Signed-off-by: nvinnakota <nvinnakota@juniper.net>
@@ -0,0 +1,166 @@

# Overview
- User space networking is a feature in which the packet path goes from the VM to a data plane running in userspace (such as OVS-DPDK), bypassing the kernel.
Member

Hi, any idea how to make the vhost-user interface available for block devices and qemu-storage-daemon as well?


@alicefr and @albertofaria have investigated vhost-user-blk devices in the past separately from this PR. Perhaps they can share a proposal or the current state of the discussion with you.

@vasiliy-ul
Contributor

Hey, @nvinnakota10 and all. Raised this also in the mailing list, but just decided to duplicate here. I am curious about the status of this PR. Any chance we can move this forward? What are the current blockers? Ping @maiqueb, @EdDev, @xpivarc, @alicefr.

@alicefr
Member

alicefr commented Sep 7, 2023

@vasiliy-ul I might be wrong, but I think this is blocked by the fact that we would like to separate the network plugin externally and we are missing the infrastructure for this

UPDATE: this might be already available from this PR: kubevirt/kubevirt#10284

@vasiliy-ul
Contributor

@vasiliy-ul I might be wrong, but I think this is blocked by the fact that we would like to separate the network plugin externally and we are missing the infrastructure for this

UPDATE: this might be already available from this PR: kubevirt/kubevirt#10284

@alicefr, thanks for the link 👍 I was not aware of that PR. Will take a closer look.

I've seen this concern raised here in this PR, but from the conversations it is not clear if it's a blocker or not: #218 (comment) and #218 (review)

So, does that mean that from now on this new API is the way to go if we need to introduce network bindings?

@alicefr
Member

alicefr commented Sep 7, 2023

I've seen this concern raised here in this PR, but from the conversations it is not clear if it's a blocker or not: #218 (comment) and #218 (review)

So, does that mean that from now on this new API is the way to go if we need to introduce network bindings?

This was at least my understanding, @EdDev could you please confirm this?

@kubevirt-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@kubevirt-bot kubevirt-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 6, 2023
@kubevirt-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

@kubevirt-bot kubevirt-bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 5, 2024
@aburdenthehand aburdenthehand removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jan 10, 2024
@maiqueb
Contributor

maiqueb commented Feb 21, 2024

The networking bindings were already refactored, and it is now possible for community-provided bindings to be maintained out of (KubeVirt) tree.

Should this proposal be revisited under these new terms ?

@sbrivio-rh

Should this proposal be revisited under these new terms ?

Which prompts an update from my side: we're about to merge vhost-user support in passt.

I guess it's going to be transparent for KubeVirt, that is, libvirt will eventually pass different options to both passt (--vhost-user and perhaps different socket path) and QEMU (essentially the memory-backend-memfd object).

So, that support doesn't aim in any way at replacing this proposal, and probably it doesn't integrate with it at all, but please let me know if opinions differ.

@maiqueb
Contributor

maiqueb commented Feb 21, 2024

Should this proposal be revisited under these new terms ?

Which prompts an update from my side: we're about to merge vhost-user support in passt.

I guess it's going to be transparent for KubeVirt, that is, libvirt will eventually pass different options to both passt (--vhost-user and perhaps different socket path) and QEMU (essentially the memory-backend-memfd object).

So, that support doesn't aim in any way at replacing this proposal, and probably it doesn't integrate with it at all, but please let me know if opinions differ.

Well, while it doesn't intend to replace this proposal, it surely brings forward a "cheaper" alternative.

It would be interesting to see an "alternatives" section in this proposal focusing on that @sbrivio-rh .

Thanks for raising it.

@fabiand
Member

fabiand commented May 2, 2024

@alicefr @EdDev @phoracek in light of the other plugin-based vhost-user proposal - what should we do about this one? Can it be closed?

@alicefr
Member

alicefr commented May 2, 2024

@fabiand I would, at least, wait until we see the other proposal. Otherwise, @nvinnakota10 are you still working on it?

@fabiand
Member

fabiand commented May 2, 2024 via email
