SoftNIC performance and NIC offload features (scatter-gather lists and TSO) #390
You are right. When a single TCP flow is considered (and thus only one core is utilized in kernel space), the Linux TCP implementation processes roughly 0.5Mpps (and their ACKs). This effectively limits its throughput to 1500B * 8 * 0.5Mpps = 6Gbps. Since the overhead mostly comes from per-packet cost, not per-byte cost, increasing the effective segment size (a larger MTU, or TSO and LRO) will immediately boost throughput.
Also note that you would need a multi-queue vport and would have to distribute packets across its queues (e.g., with the HashLB module). Otherwise, all RX packets will be forwarded to the same core, and the ACK packets will trigger TX packets to be generated on that very same core, effectively utilizing only one core for all flows.
In the early days of SoftNIC there were modules that implemented NIC offloading functionality, such as TSO, LRO, checksumming, FlowDirector, etc. Although BESS inherited most of the SoftNIC code, I decided to disable these features simply because I did not have enough cycles to maintain them. I believe a fair amount of the code is still there on the kernel module side; reviving it shouldn't be terribly difficult. You are welcome to put the fragmented code pieces back together and make them work again. I attached the old TSO/LRO code of SoftNIC.
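To illustrate the hash-based distribution the HashLB module performs: every packet of a flow (and the ACK processing it triggers) stays on one queue/core, while distinct flows spread across queues. A minimal standalone C sketch of the idea; the struct, field names, and hash function are illustrative, not BESS's actual HashLB code:

```c
#include <stddef.h>
#include <stdint.h>

/* An illustrative 5-tuple; zero-initialize instances so that struct
 * padding bytes hash deterministically. */
struct five_tuple {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

/* FNV-1a: a simple byte-wise hash, adequate for illustration. */
static uint32_t fnv1a(const uint8_t *p, size_t n)
{
    uint32_t h = 2166136261u;
    while (n--) {
        h ^= *p++;
        h *= 16777619u;
    }
    return h;
}

/* Map a flow to one of num_queues queues. Every packet of the same
 * flow lands on the same queue, so its processing stays on one core,
 * while different flows spread across cores. */
static unsigned pick_queue(const struct five_tuple *ft, unsigned num_queues)
{
    return fnv1a((const uint8_t *)ft, sizeof(*ft)) % num_queues;
}
```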
Thanks for the response. It seems like I do want to build a version of SoftNIC that implements at least the TSO and IP checksumming NIC offload features, so the "segmentation.c" file that you sent along should be helpful. Thanks for sharing it. However, I have a few follow-up questions/comments:
Question 1: Although "segmentation.c" seems like it will be helpful, it appears to cover only one aspect of the BESS daemon's implementation of TSO and LRO. Additional changes seem to be needed in the SoftNIC kernel driver and in the interface between the BESS daemon and the kernel driver. As a quick example, it seems like SNBUF_DATA may need to be changed so that the kernel driver can pass large segments to the BESS daemon. Similarly, even though there seems to be support for TX checksumming in the kernel driver, simply enabling the "NETIF_F_IP_CSUM" flag in the kernel driver isn't working for me at the moment. Do you have a more complete snapshot of a version of both the SoftNIC kernel driver and the BESS daemon that implements TSO that you would be willing to share with me? Given some of the problems I'm currently trying to solve, such a snapshot could save me significant development time.
Question/Comment 2: The reason I'm currently trying to use SoftNIC/BESS is to evaluate changes to two things: 1) how the NIC schedules TX packets, and 2) how kernel drivers and kernel-bypass applications (DPDK) interface with the NIC, e.g., what information is communicated through transmit descriptors and how flows are mapped to queues. While SoftNIC/BESS seems to be a good platform for evaluating changes to the NIC scheduler, the implementation of the SoftNIC kernel module makes it difficult to draw any meaningful conclusions about the overheads of different implementations of the OS/NIC (and app/NIC) interface. This is because the TX interface between the kernel module and the BESS daemon is backwards from that of a physical NIC, in a way that places additional computational overhead on the kernel driver and changes how the memory shared between the kernel driver and the daemon is used and laid out.
At a high level, I feel that this is a mistake. I think it would be better if the SoftNIC kernel module were as close as possible to a driver for a physical NIC, and any extra overhead of emulating a NIC were instead incurred by the BESS daemon, not the kernel driver. More concretely, in BESS, the SoftNIC kernel module is responsible for copying packet data into SNBUFs for use by BESS as part of its xmit function. In contrast, with a physical NIC, the kernel module would only create DMA mappings and send metadata and pointers to the NIC, which is then responsible for copying packet data into NIC memory through DMA. Since this extra memory-copying overhead in SoftNIC/BESS is incurred by the kernel driver, and since this implementation detail changes how memory is laid out, it is not easy to draw conclusions from the CPU and memory overheads the kernel incurs when it interfaces with SoftNIC/BESS. Further, as currently implemented, the SoftNIC/BESS kernel module breaks the byte queue limits (BQL) kernel optimization. Given these limitations, it's looking like I may reimplement the TX side of SoftNIC (the BESS VPort).
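On the NETIF_F_IP_CSUM point above: in the standard kernel model, advertising checksum offload means skbs arrive at xmit with ip_summed == CHECKSUM_PARTIAL, and the "device" (here, presumably the BESS daemon) must finish the checksum; likewise, TSO requires forwarding gso_size so the segmenter knows the MSS. A hedged C sketch of that driver-side plumbing, where struct sn_meta and its fields are hypothetical rather than the actual SoftNIC metadata layout:

```c
#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Illustrative per-packet metadata handed to the BESS daemon; the
 * struct and its field names are hypothetical, not SoftNIC's layout. */
struct sn_meta {
    u16 csum_start;  /* offset where checksumming should begin */
    u16 csum_dest;   /* offset where the result should be written */
    u16 gso_mss;     /* nonzero: daemon should segment at this MSS */
};

static void sn_setup_offloads(struct net_device *netdev)
{
    /* Advertise the offloads; this alone is not enough -- the xmit
     * path must also honor the resulting skb state (see below). */
    netdev->hw_features |= NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_TSO;
    netdev->features |= netdev->hw_features;
}

static void sn_fill_meta(struct sk_buff *skb, struct sn_meta *meta)
{
    if (skb->ip_summed == CHECKSUM_PARTIAL) {
        /* The stack left the checksum unfinished; record where the
         * daemon must compute it and where to store the result. */
        meta->csum_start = skb_checksum_start_offset(skb);
        meta->csum_dest  = meta->csum_start + skb->csum_offset;
    }
    if (skb_is_gso(skb)) {
        /* Pass the MSS through so the daemon-side segmenter
         * (e.g., the segmentation.c logic) can cut the payload. */
        meta->gso_mss = skb_shinfo(skb)->gso_size;
    }
}
```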
My questions about the TX implementation of the SoftNIC kernel driver are the following: First, can anyone give me some insight into why the current implementation of SoftNIC/BESS incurs the overhead of copying packets out of skbs in the driver instead of in the BESS daemon? Second, has an alternate implementation approach for the SoftNIC TX interface ever been considered? Third, below is a high-level description of what I think would be a better implementation of the TX interface between the SoftNIC kernel driver and the BESS daemon; do you see any obvious limitations, drawbacks, or problems with taking this approach instead of the current one? My high-level outline of how the SoftNIC kernel driver's TX interface should change to be more faithful to how kernel drivers for physical NICs are implemented is as follows:
1. In its xmit function, the kernel driver only creates DMA mappings and posts metadata and pointers (descriptors) to memory shared with the daemon, just as a driver for a physical NIC would.
2. The BESS daemon, playing the role of the NIC, reads the descriptors and performs the copy of packet data out of kernel memory into SNBUFs.
3. Completions are reported back to the driver so that it can unmap and free skbs, and so that byte queue limits (BQL) accounting works as intended.
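As a rough illustration of step 1 of the outline above, here is a minimal C sketch of such an xmit path, assuming a hypothetical sn_tx_desc descriptor format and sn_ring_post() helper (neither exists in BESS today); the daemon, acting as the NIC, would copy from the posted addresses and then signal completion so the driver can unmap, free the skb, and credit BQL:

```c
#include <linux/dma-mapping.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Hypothetical descriptor the driver posts to a shared ring. */
struct sn_tx_desc {
    u64 addr;    /* bus address of one fragment */
    u32 len;     /* fragment length in bytes */
    u16 flags;   /* e.g., end-of-packet marker */
};

/* Hypothetical helper that writes a descriptor into the per-queue
 * ring shared with the BESS daemon. */
void sn_ring_post(struct net_device *dev, const struct sn_tx_desc *desc);

static netdev_tx_t sn_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
    struct netdev_queue *txq = skb_get_tx_queue(dev, skb);
    struct device *dma_dev = dev->dev.parent;
    unsigned int nfrags = skb_shinfo(skb)->nr_frags;
    struct sn_tx_desc desc;
    unsigned int i;

    /* Map the linear head; the daemon copies from this address.
     * (DMA-mapping error handling elided for brevity.) */
    desc.addr = dma_map_single(dma_dev, skb->data, skb_headlen(skb),
                               DMA_TO_DEVICE);
    desc.len = skb_headlen(skb);
    desc.flags = (nfrags == 0);  /* end-of-packet if no frags follow */
    sn_ring_post(dev, &desc);

    /* Map each page fragment -- this is exactly what SG support buys:
     * no copy into one contiguous buffer inside the driver. */
    for (i = 0; i < nfrags; i++) {
        const skb_frag_t *frag = &skb_shinfo(skb)->frags[i];

        desc.addr = skb_frag_dma_map(dma_dev, frag, 0,
                                     skb_frag_size(frag), DMA_TO_DEVICE);
        desc.len = skb_frag_size(frag);
        desc.flags = (i == nfrags - 1);
        sn_ring_post(dev, &desc);
    }

    /* BQL accounting: bytes posted now; the completion path must later
     * call netdev_tx_completed_queue(txq, pkts, bytes), unmap the
     * fragments, and dev_kfree_skb_any() -- the skb must stay alive
     * until the daemon has consumed its descriptors. */
    netdev_tx_sent_queue(txq, skb->len);
    return NETDEV_TX_OK;
}
```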
Thanks in advance for taking the time to read this message, ~Brent
Hi,
I'm currently working on a project evaluating changes to how NIC hardware schedules packets. In my evaluation, I want to compare the performance of different POSIX applications under different NIC packet scheduling behaviors. From what I can tell so far, BESS/SoftNIC seems like it could be a good platform for evaluating the NIC changes I have in mind. However, as currently implemented, SoftNIC doesn't seem to provide good enough performance (throughput) to be useful in my experiments. I go into more detail about these performance issues in the rest of this message, but I believe they do not stem from any performance problem in the userspace BESS daemon; rather, they stem from the overhead the kernel incurs when interfacing with the SoftNIC driver, because the driver does not implement important offload features like scatter-gather (SG) lists and TSO. As a result, even on just a 10Gbps link, with SoftNIC configured to use a large number of queues and BESS configured with multiple worker threads, traditional POSIX applications struggle to drive line rate when sending data over a SoftNIC VPort. Given this performance problem, I have the following questions:
Is there a reason that SoftNIC does not currently support scatter-gather lists or TSO? From line 702 in core/kmod/sn_netdev.c (https://github.com/NetSys/bess/blob/master/core/kmod/sn_netdev.c#L702), it seems like supporting these features was considered at some point before being disabled in the current version.
Is there any development branch or plan to support scatter-gather lists and TSO in SoftNIC in the future?
If I wanted to implement support for scatter-gather lists and TSO in SoftNIC myself, are there any caveats or design challenges that I should know about?
Is there some other best practice for hooking up applications to BESS so that they can drive a 10Gbps link?
In the rest of this message, I provide some more background on my methodology and why I think the performance problems I observe are caused by a lack of scatter-gather and TSO support.
I've observed this problem on two different platforms. My methodology is as follows. Both systems use an Intel 82599 10Gbps NIC. The first is a machine on my local testbed with a 4-core/8-thread 2.8GHz Xeon E5-1410; the second is a machine on CloudLab Clemson that has two sockets of Intel Xeon E5-2683 v3 (2.00GHz) CPUs, for a total of 56 logical threads. In my BESS configuration, I create a VPort with 16 queues, configure a PMDPort to run the physical 82599 device, also with 16 queues, and directly connect the incoming and outgoing queues of the VPort and PMDPort. I also configure BESS to use 4 worker threads.
To measure throughput, I have been using the "iperf3" network benchmarking tool with a variable number of parallel connections ("-P <x>"). With the default ixgbe driver, iperf3 can drive 10Gbps line rate with any number of parallel connections that I've tried, including a single connection. However, with SoftNIC/BESS, iperf3 can only drive ~6Gbps, and performance can be highly variable. As best I can tell, this poor performance is solely because the SoftNIC driver does not implement scatter-gather lists or TSO. I believe this because of the following results:
With the ixgbe driver, which supports SG and TSO, no core is ever saturated according to dstat/htop, and iperf3 can consistently drive line rate.
With SoftNIC/BESS, in addition to the BESS daemon cores, some of the cores running application and kernel code become saturated. In this experiment, the application can drive at most ~6Gbps.
If I disable SG (and TSO) on the ixgbe driver, then the performance of ixgbe matches that of SoftNIC: throughput is limited to ~6Gbps, and at least one of the kernel/app cores is saturated.
Out of curiosity, I also looked at the example KNI application in the DPDK repository, because SoftNIC seems to be derived from KNI. The performance of KNI is comparable to that of SoftNIC. It is worth noting that KNI also does not implement SG or TSO.
Thanks for taking the time to help me look into this problem,
~Brent