
Added Vhostuser implementation #3208

Closed
wants to merge 2 commits

Conversation


@krsacme krsacme commented Mar 26, 2020

Closed: It's being reworked as smaller PRs. Tracking list

What this PR does / why we need it:
Add the implementation to use OvS-DPDK with a KubeVirt VM.

Special notes for your reviewer:
Integrated the userspace CNI for the ovsdpdk type. The vhostuser socket is shared between the KubeVirt VM and Open vSwitch with DPDK running on the host. In order to support vhostuser, shared memory is required, so NUMA cells are added with shared-memory support. Added app-netutil in order to get the vhostuser interface details from the annotations, which are updated by the userspace CNI.

  • Run Open vSwitch on the host as openvswitch:hugetlbfs (the default)
  • Enable DPDK with the required configs and add a netdev bridge and DPDK ports (see the sketch after this list)
  • Create the user qemu with UID 107 for the vhost socket directory
  • Create the vhost socket directory:
    * mkdir -p /var/lib/vhost_sockets
    * touch /var/lib/vhost_sockets/touched
    * chown qemu:hugetlbfs -R /var/lib/vhost_sockets/
    * chmod g+r -R /var/lib/vhost_sockets/
  • Build the userspace CNI
  • Libvirt container image changes:
    * Added the openvswitch package
    * Removed the explicit user and group config in qemu.conf (it is the default anyway), so that the group can be overridden
  • Create a VMI with an interface of type vhostuser
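
For context, a minimal sketch of the DPDK enablement mentioned in the second bullet above; the bridge name, socket-mem split, and PCI address are illustrative placeholders rather than values taken from this PR:

    # Enable DPDK in Open vSwitch and pre-allocate hugepage memory per NUMA node
    ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
    ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem="1024,0"

    # Create a userspace (netdev) bridge and attach a physical DPDK port to it
    ovs-vsctl add-br br-dpdk -- set bridge br-dpdk datapath_type=netdev
    ovs-vsctl add-port br-dpdk dpdk0 -- set Interface dpdk0 type=dpdk \
        options:dpdk-devargs=0000:05:00.0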

VM XML with vhostuser

    <interface type='vhostuser'>
      <mac address='52:54:00:df:73:da'/>
      <source type='unix' path='/var/lib/cni/usrspcni/6390c723f122-net1' mode='server'/>
      <target dev='6390c723f122-net1'/>
      <model type='virtio'/>
      <driver rx_queue_size='1024' tx_queue_size='1024'/>
      <alias name='ua-vhost-user-net-2'/>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
    </interface>

VM XML with shared memory

  <cpu mode='custom' match='exact' check='full'>
    ...
    <numa>
      <cell id='0' cpus='0-7' memory='8388608' unit='KiB' memAccess='shared'/>
    </numa>
  </cpu>
Release note: Added vhostuser implementation to support OvS-DPDK-based VMs

@kubevirt-bot kubevirt-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: no Indicates the PR's author has not DCO signed all their commits. labels Mar 26, 2020
@openshift-ci-robot
Collaborator

Hi @krsacme. Thanks for your PR.

I'm waiting for a kubevirt member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

1 similar comment

@openshift-ci-robot openshift-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 26, 2020

@kubevirt-bot kubevirt-bot added the kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API label Mar 26, 2020
@vladikr
Member

vladikr commented Mar 26, 2020

/test all

@vladikr
Member

vladikr commented Mar 26, 2020

@krsacme Great to see this PR, Saravanan, thanks.

@kubevirt-bot
Contributor

@krsacme: The following tests failed, say /retest to rerun all failed tests:

Test name | Commit | Rerun command
pull-kubevirt-e2e-k8s-multus-1.13.3 | cc6b184 | /test pull-kubevirt-e2e-k8s-multus-1.13.3
pull-kubevirt-e2e-k8s-1.15-ceph | cc6b184 | /test pull-kubevirt-e2e-k8s-1.15-ceph

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@phoracek
Member

Thanks a lot for the contribution @krsacme!

The code looks okay, but we need to make sure this feature is understood, tested and maintainable. In general, we don't ship code that does not have test coverage. I'd like to ask you to add the following:

@fabiand
Member

fabiand commented Mar 27, 2020

Please also remember the user-guide

@krsacme
Author

krsacme commented Mar 27, 2020

Thanks @vladikr @dankenigsberg @phoracek @fabiand for the inputs. Now that I have consensus on the approach, I will add the required items specified above and move changes like unit >> uint32 into a separate PR.

Member

@davidvossel davidvossel left a comment


Hey,

There's a lot going on in this PR. I generally understand the basic intent of what is being accomplished here, but the approach has a couple of concerning elements, specifically exposing the VMI pod to the /var/lib/vhost_sockets/ and /var/run/openvswitch/ host paths.

Can we get some more information about why this is necessary, and why a CNI plugin can't perform this privileged access on the pod's behalf in a more controlled way?

Now that I have the consensus on the approach, I will add the required items specified above and remove changes like unit >> uint32 to a separate PR.

Is there some sort of design document around this? It's unclear to me what the consensus is you're talking about. Maybe I'm missing a conversation?

@kubevirt-bot kubevirt-bot added dco-signoff: yes Indicates the PR's author has DCO signed all their commits. and removed dco-signoff: no Indicates the PR's author has not DCO signed all their commits. labels Mar 31, 2020
@krsacme
Author

krsacme commented Mar 31, 2020

@davidvossel thanks for your comments. Here are the reasons for these two host directories:
/var/lib/vhost_sockets/ - This directory is used to share the vhostuser socket between OvS and QEMU. I have explained this in the docs (with the recent update).
/var/run/openvswitch/ - Libvirt executes an ovs-vsctl command (from virNetDevOpenvswitchInterfaceStats) when a vhostuser interface is used. It requires access to the OvS DB socket, which lives in this directory. Without this mount the functionality is not affected (the VM still launches), but libvirt logs errors like the ones below:

2020-03-31T20:15:37.628837581+05:30 stderr F {"component":"virt-launcher","level":"error","msg":"Cannot find 'ovs-vsctl' in path: No such file or directory","pos":"virExec:559","subcomponent":"libvirt","thread":"35","timestamp":"2020-03-31T14:45:37.628000Z"}
2020-03-31T20:15:37.628837581+05:30 stderr F {"component":"virt-launcher","level":"error","msg":"internal error: Interface not found","pos":"virNetDevOpenvswitchInterfaceStats:410","subcomponent":"libvirt","thread":"35","timestamp":"2020-03-31T14:45:37.628000Z"}
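
For illustration only (not part of this PR), the kind of OVSDB query that needs the database socket under /var/run/openvswitch/ can be approximated with ovs-vsctl; the interface name below is the target dev from the domain XML in the PR description:

    # Query the Interface record (statistics included) through the OvS DB socket
    # that libvirt's virNetDevOpenvswitchInterfaceStats ultimately depends on.
    ovs-vsctl --db=unix:/var/run/openvswitch/db.sock list Interface 6390c723f122-net1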

Is there some sort of design document around this? It's unclear to me what the consensus is you're talking about. Maybe I'm missing a conversation?

I was referring to consensus on the direction of the approach; I didn't want to continue in the wrong direction. I have added the doc now with detailed steps.

@phoracek I have added the steps in the docs directory, which I have been using on the bare-metal deployment. The vfio-pci driver cannot be used in a virtualized deployment; alternatively, the uio_pci_generic and igb_uio drivers can be used in virtual environments. I will evaluate it with the kubevirtci repo and will reach out to you.

@davidvossel
Member

/var/lib/vhost_sockets/ - This directory is used to share the vhostuser socket between OvS and QEMU. I have explained about this in the docs (with the recent update).

OvS is acting as a client to the vhost socket set up by the qemu process, right?

If we mount an EmptyDir named vhost_sockets at the VMI pod's /var/lib/vhost_sockets/ directory, then OvS on the host can connect to any socket in that directory, as long as the pod's UID is known, through /var/lib/kubelet/pods/<pod uid>/volumes/kubernetes.io~empty-dir/vhost_sockets/<some socket file> on the host. That would let us remove the shared hostpath volume.

What we're trying to avoid here is a shared hostpath volume containing vhost sockets accessible to all VMI pods on the host. The technique above would avoid hostpath entirely, which is ideal if possible.
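
For illustration (this is a sketch of the proposal, not something the PR implements), the host side could look roughly like this, assuming OvS-DPDK attaches as the vhost-user client; the bridge name, port name, pod UID, and socket file name are placeholders. Note that, as the discussion below shows, the full kubelet path can exceed the 108-character unix socket path limit:

    # QEMU inside the VMI pod creates the socket in the EmptyDir (server mode);
    # OvS-DPDK on the host attaches to it as a vhost-user client via the kubelet path.
    POD_UID="<pod-uid>"   # placeholder: the VMI pod's UID
    SOCK="/var/lib/kubelet/pods/${POD_UID}/volumes/kubernetes.io~empty-dir/vhost_sockets/net1"

    ovs-vsctl add-port br-dpdk vhost-net1 -- \
        set Interface vhost-net1 type=dpdkvhostuserclient \
        options:vhost-server-path="${SOCK}"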

@krsacme
Author

krsacme commented Apr 1, 2020

OvS is acting as a client to the vhost socket set up by the qemu process, right?

Yes,

What we're trying to avoid here is a shared hostpath volume containing vhost sockets accessible to all VMI pods on the host. The technique above would avoid hostpath entirely, which is ideal if possible.

This looks feasible. I will try this approach and update accordingly. Thanks.

@krsacme
Author

krsacme commented Apr 2, 2020

I had trouble using the complete EmptyDir path as-is. Open vSwitch was unable to connect to the socket and kept reconnecting continuously. This is because there is a limit of 108 characters for a unix socket path. The complete path where the socket is created is

/var/lib/kubelet/pods/45483214-7a48-4ccd-984e-90eeeb27086e/volumes/kubernetes.io~empty-dir/shared-dir/a3d9ee70c35d-net1

As per the man pages, the size of the sun_path buffer that stores the path is 108 bytes, whereas this path exceeds that limit. The DPDK library does not error out when the path exceeds the limit; instead it copies only as much as fits.

I am going to try linking the above-mentioned directory to a local path like /var/lib/vhost_sockets/45483214-7a48-4ccd-984e-90eeeb27086e/a3d9ee70c35d-net1, so the limit can be maintained. OvS will then connect to the shorter path instead of the full path. This will be managed by the userspace CNI.
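
A minimal sketch of that workaround, reusing the example pod UID and socket name from above; in practice the userspace CNI would manage these paths, so the layout here is only illustrative:

    # Long per-pod EmptyDir path (it exceeds the 108-byte sun_path limit once a
    # socket name is appended) and the short alias that OvS would actually use.
    POD_UID=45483214-7a48-4ccd-984e-90eeeb27086e
    LONG="/var/lib/kubelet/pods/${POD_UID}/volumes/kubernetes.io~empty-dir/shared-dir"
    SHORT="/var/lib/vhost_sockets/${POD_UID}"

    mkdir -p /var/lib/vhost_sockets
    ln -s "${LONG}" "${SHORT}"

    # OvS then connects to e.g. ${SHORT}/a3d9ee70c35d-net1 instead of the full kubelet path.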

@davidvossel
Member

This is because there is a limit of 108 characters for a unix socket path.

that's really frustrating. I've hit the same limit a few times as well.

I am going to try linking the above-mentioned directory to a local path like /var/lib/vhost_sockets/45483214-7a48-4ccd-984e-90eeeb27086e/a3d9ee70c35d-net1, so the limit can be maintained. OvS will then connect to the shorter path instead of the full path. This will be managed by the userspace CNI.

If OvS is managed as a Pod, you can expose /var/lib/kubelet/pods as a HostPath volume to OvS in the Pod definition and specify the volumeMount's MountPath as something simple like /pods for the OvS container.

maybe that's what you're already doing?

@krsacme
Author

krsacme commented Apr 6, 2020

If OvS is managed as a Pod, you can expose /var/lib/kubelet/pods as a HostPath volume to OvS in the Pod definition and specify the volumeMount's MountPath as something simple like /pods for the OvS container.

The end goal is to run OvS with DPDK inside a pod, but that is not achieved yet with this PR. Whether OvS runs on the host or in a pod has no impact on this PR.

Currently, with this PR and the userspace CNI, an intermediate directory /var/lib/vhost_sockets/ is used to create a shorter path for the vhostuser sockets. When OvS is running on the host, the port is added directly against this path. When OvS is running in a pod, this directory is mounted into the OvS container and the port is added against the same path.

@krsacme krsacme force-pushed the vhostuser branch 2 times, most recently from 54b98cf to e9bd87a Compare April 9, 2020 12:03
@wavezhang

Is this PR OK to merge now? @davidvossel


@wavezhang wavezhang left a comment


There is no code related to the following changes:

libvirt container image changes
Added openvswitch package
Removed explicit user and group config in qemu.conf,
as it is default, so that group can be overridden

@kubevirt-bot
Contributor

@wavezhang: changing LGTM is restricted to collaborators

In response to this:

There is no code related to the following changes:

libvirt container image changes
Added openvswitch package
Removed explicit user and group config in qemu.conf,
as it is default, so that group can be overridden

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 24, 2020
Saravanan KR added 2 commits July 8, 2020 11:20
Signed-off-by: Saravanan KR <skramaja@redhat.com>
Signed-off-by: Saravanan KR <skramaja@redhat.com>
@kubevirt-bot kubevirt-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 8, 2020
@kubevirt-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign danielbelenky
You can assign the PR to them by writing /assign @danielbelenky in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

    } else if iface.Vhostuser != nil {
        domainIface.Type = "vhostuser"
        interfaceName := GetPodInterfaceName(networks, cniNetworks, iface.Name)
        vhostPath, vhostMode, err := getVhostuserInfo(interfaceName, c)


I think it is better to set the MAC address here if the CNI returns a MAC.

Author


The userspace CNI does not provide a MAC address, and it is not possible to validate one without it. In this iteration, the MAC address is left to libvirt's auto-generation. In the next iteration, support will be added in both the CNI and KubeVirt for custom MAC addresses.

@krsacme
Author

krsacme commented Jul 9, 2020

@vladikr @davidvossel @phoracek - I know this is a big patch to review; do you suggest breaking it into smaller patches for easier review?

@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 10, 2020
@kubevirt-bot
Contributor

@krsacme: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@phoracek
Member

@krsacme please do split it into smaller commits guiding us through the implementation. @EdDev you may want to review this PR.

@krsacme
Author

krsacme commented Jul 16, 2020

@krsacme please do split it into smaller commits guiding us through the implementation. @EdDev you may want to review this PR.

Thanks @phoracek, I have created a list of tasks in the issue, and I will create smaller commits. I will close this PR with appropriate comments.

@krsacme krsacme closed this Jul 16, 2020
@EdDev
Member

EdDev commented Jul 16, 2020

@krsacme please assign/cc me on your future PRs.

@fgschwan

fgschwan commented Nov 8, 2021

@davidvossel @wavezhang @phoracek @SchSeba @krsacme @cijohnson Hi,
I'm interested in this feature because I want to use it in a networking project. My use case is a little bit different (replacing OvS-DPDK with VPP, which is also supported by userspace CNI), but from the KubeVirt perspective this is what I need. I would not only be interested in having a working implementation, but also in having it merged.
I checked the PRs and it seems that these things stand between the current state and merging:

  1. Finish the functionality implementation:
    1.1 Resolve hugetlbfs, or find another way for QEMU and VPP to have access to the vhost-user socket
    1.2 I started from "4 virt launcher domain spec" #5662, which also contains the previous implementation tasks, but got some compilation problems -> rebase only?
    1.3 Test locally that it actually works
  2. Write an e2e test of this functionality (some pinging between VPP and the VM through vhost; using VPP in the test can maybe simplify the test-host setup)
  3. Write documentation
  4. Help with the user guide (this is done after merging, but I guess that you, as developers of KubeVirt, need enough info about this functionality to write the user guide before the merge)

Could you please confirm that after doing these things you don't see a reason not to merge the functionality? Before I consider investing the effort, I want to know whether there are any other known (but not written down) problems preventing a merge (e.g. some QEMU incompatibility, a developer decision not to have it, ...).

@krsacme If I can ask, do you have any particular reason why you stopped implementing this feature? Any kind of stoppers/blockers that I could eventually hit too? Also, did you get the topology working locally at any point (i.e. before the PR split)?

PS: sorry if asking in a closed PR is not the right place (do you want me to open a question issue or something?). Feel free also to add people to the discussion/reply if needed.

@maiqueb
Contributor

maiqueb commented Nov 8, 2021


AFAIU, you're right.

You're welcome to bring this topic to the KubeVirt community call; I'd love to hear about it there.


@krsacme
Author

krsacme commented Nov 9, 2021

@krsacme If I can ask, do you have any particular reason why you stopped implementing this feature? Any kind of stoppers/blockers that I could eventually hit too? Also, did you get the topology working locally at any point (i.e. before the PR split)?

@fgschwan There is no technical reason for stopping it other than the traction and effort required to integrate it with kubevirtci. Due to a change in priorities, I was not able to contribute further, and @cijohnson worked on it to bring in the Contrail use case, but could not complete it. Here is a document that covers the changes done on the bare-metal server: https://github.com/krsacme/kubevirt/blob/ovsdpdk_docs/docs/ovsdpdk.md

I was focusing on using vhostuser ports as extra networks; the default network is not changed. I was able to get it working on bare-metal servers end to end. And I had a somewhat working version on kubevirtci (deployment works, but ping was not working in the kubevirtci local deployment; I need to dig further to find the exact reason).

We all know that this feature will primarily be used on bare-metal servers, but making it work in the kubevirtci environment (a KubeVirt VM running in a container inside a worker VM node), which is mandatory for getting the changes merged, was challenging and requires effort to complete. Let me know if any specific things need to be discussed regarding it.

@maiqueb
Contributor

maiqueb commented Nov 16, 2021

@fgschwan I'd just like to point out that I do see value in this proposal, and in re-starting this effort.

This is extremely valuable for single-node clusters, or for traffic patterns where intra-node east-west traffic is more common.

Did you summarize your use cases?

@fgschwan

@maiqueb I provided some use-case info in the kubevirt-dev mailing thread (https://groups.google.com/g/kubevirt-dev/c/opZPTUNsoMQ/m/nHgRMrsKAgAJ). However, I can add some more information here.

I want to implement service chaining (also called service function chaining/SFC - https://www.sdxcentral.com/data-center/virtualization/definitions/what-is-network-service-chaining/). This is a standard SDN (software-defined networking) solution for dynamically creating a path between chosen network functions (one container or one VM container = one network function). Ideally all functions would run as native containers, but some software is hard to containerize and therefore runs as a VM; hence the use of KubeVirt. For the data plane I need high-performance networking (for the control/management plane it is OK to use any interface that KubeVirt already supports). So the main single-node use case should look like this:

  1. configure k8s and VPP on the k8s node(s)
  2. dataplane: traffic going into the k8s node -> fd.io's VPP (router) -> Service 1 (container-native service) -> VPP (router) -> Service 2 (VM container service) -> traffic going out of the k8s node

There can be multiple variations of this use case, including but not limited to using various combinations of container-native and VM-container services, or using multiple k8s nodes (ideally keeping as many related services as possible on one node).

From the perspective of the KubeVirt implementation, this is not that important because KubeVirt will not interact with all those components. It will interact only with VPP (via the userspace CNI, which exports the vhost socket). Therefore the kubevirtci test should probably just ping between VPP and the KubeVirt container VM over the vhost interface. However, I understand that from an overall perspective the use case is really relevant.

@maiqueb
Contributor

maiqueb commented Nov 16, 2021

I had missed your email !!

Thanks for kicking this forward :)
