
Mainstream some day? #1

Closed
CodeFetch opened this issue Nov 16, 2021 · 14 comments

Comments

@CodeFetch

Hi! I've been fiddling with VPNs for some years now. Did you write this module for performance reasons? Is it meant to become a mainstream module some day?

We had performance issues with OpenVPN some years back because TUN/TAP devices require a syscall, and hence a context switch, for every packet read. An optimized userspace clone of OpenVPN, called fastd, was therefore created. It is faster than OpenVPN, but the context-switch bottleneck remained, so I wrote a hacky patch for fastd to utilize io_uring. With that, the context-switch bottleneck is gone.
Maybe you could adopt this for the OpenVPN userspace version to reach a performance similar to the kernel module:
CodeFetch/fastd@059cdf7
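To make the bottleneck concrete, here is a minimal sketch (illustrative only, not fastd's actual code) of the classic per-packet read loop: one read() syscall, and hence one user/kernel transition, per packet. A pipe stands in for the TUN fd so the sketch runs unprivileged.

```python
# Per-packet read loop: the syscall-per-packet pattern described above.
# A pipe stands in for the TUN fd; all names here are illustrative.
import os

PKT_SIZE = 64  # pretend every "packet" is a fixed 64 bytes

def drain_per_packet(fd, n_packets):
    """Read n_packets from fd, costing one syscall (context switch) each."""
    syscalls = 0
    for _ in range(n_packets):
        pkt = os.read(fd, PKT_SIZE)  # one user/kernel transition per packet
        if not pkt:
            break
        syscalls += 1
    return syscalls

r, w = os.pipe()
for _ in range(8):
    os.write(w, b"\xab" * PKT_SIZE)
print(drain_per_packet(r, 8))  # 8 packets cost 8 read() syscalls
```

With io_uring, the reads are queued in a shared ring instead, so the per-packet transition disappears.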

@ordex
Member

ordex commented Nov 16, 2021

Hi @CodeFetch and thanks for your message.
To answer your questions: yes, the idea is to take this kernel module upstream once it reaches reasonable maturity.
We are currently working on supporting it in OpenVPN2, so that it can get to a larger user base.

The main reason is definitely performance: with this kernel module we are basically moving the whole data plane (not the control plane!) into kernel space, similarly to other device drivers and tunnel implementations. On the other hand, it allows us to greatly simplify the Linux part of the userspace implementation, as it no longer needs to handle user data.

io_uring sounds interesting, but at the moment it doesn't fit the direction we are taking (we are moving the data plane directly into the kernel), so it wouldn't be meaningful to spend energy on that side right now.

Still, if you want to play with it and work on a PoC (that may result in clean patches) for OpenVPN, please do.

@dsommers
Member

This approach is interesting for platforms which will not or cannot support our DCO kernel module. I would expect DCO still to be faster (as the network packets will still have the fastest path between the physical network interface and the virtual one, without any context switching at all).

But a faster userspace implementation with io_uring might be useful on lower-end routers where getting ovpn-dco running is too difficult, or in setups where the user insists on not-recommended non-GCM ciphers, compression, or other protocol features not available in the ovpn-dco module.

Is io_uring supported on platforms other than Linux? I believe I read something about Jens Axboe, a Linux kernel developer, being involved.

@CodeFetch
Author

@dsommers io_uring is only supported by Linux as far as I know. It lets you define ring buffers that a kernel thread works on, which allows networking in userspace with almost no context switches. Together with MSG_ZEROCOPY this gives performance equal to running natively in the kernel, minus the skb_copy operation for sending packets. As most resource-limited devices that would profit from this are practically CPEs with higher downstream bandwidth usage, that would not hurt much, I guess. Tests have shown that raw encryption performance in userspace is near what a kernel thread can achieve, e.g. with WireGuard.
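Setting up a real io_uring ring takes a fair amount of code, but the core idea, servicing a whole batch of packets per user/kernel transition instead of one, can be illustrated with plain vectored I/O (a sketch under that analogy, not io_uring itself; a pipe again stands in for the TUN fd):

```python
# Batched reads: one readv() syscall fills a whole batch of fixed-size
# packet buffers, so the syscall count drops by the batch factor.
import os

PKT_SIZE = 64   # fixed "packet" size for the sketch
BATCH = 16      # packets serviced per syscall

def drain_batched(fd, n_packets):
    """Read n_packets from fd, one readv() syscall per BATCH packets.
    Returns (packets_read, syscalls_made)."""
    syscalls = 0
    got = 0
    while got < n_packets:
        bufs = [bytearray(PKT_SIZE)
                for _ in range(min(BATCH, n_packets - got))]
        n = os.readv(fd, bufs)   # one user/kernel transition per batch
        syscalls += 1
        if n <= 0:
            break
        got += n // PKT_SIZE
    return got, syscalls

r, w = os.pipe()
for _ in range(32):
    os.write(w, b"\xcd" * PKT_SIZE)
print(drain_batched(r, 32))  # 32 packets in 2 syscalls instead of 32
```

io_uring goes further still: submissions and completions live in shared rings, so a batch can be queued and reaped with even fewer transitions, and with MSG_ZEROCOPY the payload copy can be avoided as well.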

@dsommers
Member

dsommers commented Nov 16, 2021

Thanks! So the advantage of ovpn-dco will basically be that it can utilize all the CPU cores on the data plane. OpenVPN 2.x is (still!) single-threaded and will therefore hit limitations on the server side when more clients are connected, and I expect io_uring in an OpenVPN implementation to be limited in that regard as well. However, for servers with only one client connected, the performance difference might not be that big.

@huangya90
Contributor

@CodeFetch Could you share some performance numbers demonstrating the advantage of io_uring?

@CodeFetch
Author

@huangya90 I don't have any statistics on that anymore, but it was a constant 20-30% increase in throughput on an Intel 4820K, single-threaded. I looked at it with oprofile: the recvmsg/sendmsg syscalls, which previously accounted for 20-30%, are gone. So that matches.

@CodeFetch
Author

@huangya90 Keep in mind there is more potential for improvement. As far as I know, OpenVPN not only lacks multithreading support, it also has no buffer pool and isn't optimized for cache hotness.

@huangya90
Contributor

huangya90 commented Nov 17, 2021

I don't have any statistics on that anymore, but it was a constant 20-30% increase in throughput on an Intel 4820K. Single threaded... I have looked at it with oprofile. The recvmsg sendmsg syscalls are gone which accounted for 20-30% previously. So that matches.

If so, speeding things up in kernel space is much better. Please refer to the ovpn-dco performance numbers [1] from earlier testing.

[1] https://www.mail-archive.com/openvpn-devel@lists.sourceforge.net/msg21584.html

@CodeFetch
Author

@huangya90 That's the gain for fastd, which is better optimized than OpenVPN. The numbers don't look comparable to me. I guess he used a multicore processor, and there must have been AES hardware acceleration, as ChaCha20 would otherwise be faster.

@cron2
Contributor

cron2 commented Nov 17, 2021 via email

@CodeFetch
Author

@cron2 Yes, but it's like comparing apples with Microsoft. Using CPU threads for all cores is possible with the userspace implementation too, but it isn't implemented in OpenVPN. ovpn-dco will definitely perform better than a userspace version, but not by that much on a single-thread machine without hardware encryption.

BTW does ovpn-dco also support layer 2 tunnels?

@andywangevertz

@CodeFetch We are using OpenVPN on a layer 2 device (tap0), and the process looks like this (RX side):
packets -> kernel -> openvpn(decrypt) -> kernel -> tap0 -> Application

As you can see, it goes into and comes out of the kernel twice. I am not sure what fastd does on a layer 2 device/tunnel; do you think io_uring could also benefit the OpenVPN userspace application (like 20-30%)? I would like to give it a try and would like to hear some advice on io_uring.

Thanks!

@CodeFetch
Author

@andywangevertz Indeed. You will save the syscalls for the reads/writes on the TAP device: ordinarily a TAP device can only accept one packet per syscall, which means a lot of context switching. io_uring lifts that limitation; in my experience you can send/receive about 64 packets per syscall with it. It's still slower than e.g. WireGuard, because WireGuard has multithreading support and does not need to copy packets between userspace and kernel memory, but the skb_copy for that is not the performance killer. I guess with multithreading support, or on single-core devices, or with multiple OpenVPN instances, you could reach almost 90% of WireGuard's throughput.

@ordex
Member

ordex commented Dec 9, 2021

I am closing this, but for further discussions, please reach out to the openvpn-devel mailing list, where a broader audience will be able to join the conversation.
Cheers!
