SoftNIC performance and NIC offload features (scatter-gather lists and TSO) #390

bestephe · 2017-04-17T18:54:08Z

Hi,

I'm currently working on a project that evaluates changes to how NIC hardware schedules packets. In my evaluation, I want to compare the performance of different POSIX applications under different NIC packet scheduling behaviors. From what I can tell so far, BESS/SoftNIC seems like it could be a good platform for evaluating the NIC changes I have in mind. However, as currently implemented, SoftNIC doesn't seem to provide good enough performance (throughput) to be useful in my experiments.

I go into more detail about these performance issues in the rest of this message, but I believe they do not stem from the BESS daemon in userspace. Instead, they seem to stem from the overheads of the kernel interfacing with the SoftNIC driver, because the driver does not implement important offloading features like scatter-gather lists and TSO. As a result, even on only a 10Gbps link, with SoftNIC configured to use a large number of queues and BESS configured with multiple worker threads, traditional POSIX applications struggle to drive line rate when sending data over a SoftNIC VPort. Given this performance problem, I have the following questions:

Is there a reason that SoftNIC does not currently support scatter-gather lists or TSO? From line 702 in core/kmod/sn_netdev.c (https://github.com/NetSys/bess/blob/master/core/kmod/sn_netdev.c#L702), it seems like supporting these features was considered at some point in time before they were disabled in the current version.
Is there any development branch or plan to support scatter-gather lists and TSO in SoftNIC in the future?
If I wanted to implement support for scatter-gather lists and TSO in SoftNIC myself, are there any caveats or design challenges that I should know about?
Is there some other best practice for hooking up applications to BESS so that they can drive a 10Gbps link?

In the rest of this message, I provide some more background on my methodology and why I think the performance problems I observe are caused by a lack of scatter-gather and TSO support.

I've observed this problem on two different platforms. My methodology is as follows: Both systems use an Intel 82599 10Gbps NIC. The first is a machine on my local testbed that uses a 4-core/8-thread 2.8GHz Xeon E5-1410, and the second is a machine on CloudLab Clemson that has two sockets of "Intel(R) Xeon(R) CPU E5-2683 v3 @ 2.00GHz" cores for a total of 56 logical threads. In my BESS configuration, I create a VPort with 16 queues, I configure a PMDPort to run the physical 82599 device also with 16 queues, and I directly connect the incoming and outgoing queues on the VPort and PMDPort. I also configured BESS to use 4 worker threads.

To measure throughput, I have been using the "iperf3" network benchmarking tool with a variable number of parallel connections "-P <x>". With the default ixgbe driver, iperf3 can drive 10Gbps line rate with any number of parallel connections that I've used, including only using a single connection. However, with SoftNIC/BESS, iperf3 can only drive ~6Gbps. Further, performance with SoftNIC/BESS can be highly variable. As best as I can tell, this poor performance is solely because the SoftNIC driver does not implement scatter-gather lists or TSO. I think this because of the following results:

With the ixgbe driver which supports SG and TSO, no core is ever saturated according to dstat/htop and iperf3 can consistently drive line-rate.
With SoftNIC/BESS, in addition to the BESS daemon cores, some of the cores running application and kernel code become saturated. In this experiment, the application can drive at most ~6Gbps.
If I disable SG (and TSO) on the ixgbe driver, then the performance of ixgbe matches that of SoftNIC in both that throughput is limited to ~6Gbps and that at least one of the kernel/app cores is saturated.
Out of curiosity, I also looked at the example KNI application in the DPDK repository because SoftNIC seems to be derived from KNI. The performance of KNI is comparable to that of SoftNIC. It is worth noting that KNI also does not implement SG or TSO.

Thanks for taking the time to help me look into this problem,

~Brent

sangjinhan · 2017-04-19T16:31:51Z

You are right. When one TCP flow is considered (thus only one core is utilized in the kernel space), the Linux TCP implementation processes roughly 0.5Mpps (and their ACKs). It effectively limits its throughput to 1500B * 8 * 0.5Mpps = 6Gbps. Since the overheads mostly come from per-packet cost, not per-byte cost, increasing the MTU (TSO and LRO) will immediately boost the throughput.
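The arithmetic above can be checked with a small sketch. It assumes the per-core packet rate stays fixed at the quoted 0.5Mpps regardless of segment size, which is a simplification; the function names are hypothetical and only serve the illustration.

```python
def tcp_throughput_gbps(pps: float, bytes_per_packet: float) -> float:
    """Throughput in Gbps for a given packet rate and bytes per packet."""
    return pps * bytes_per_packet * 8 / 1e9


def min_segment_bytes(target_gbps: float, pps: float) -> float:
    """Smallest per-packet size (bytes) needed to reach target_gbps at pps."""
    return target_gbps * 1e9 / (8 * pps)


PPS_PER_CORE = 0.5e6  # packet rate quoted above for a single kernel core

# With 1500B packets the per-packet cost caps a single flow at 6 Gbps.
print(tcp_throughput_gbps(PPS_PER_CORE, 1500))   # 6.0

# Segment size the stack would need to hand to the driver to reach 10 Gbps.
print(min_segment_bytes(10, PPS_PER_CORE))       # 2500.0
```

Under that assumption, any segment larger than about 2500B handed down by the stack (as TSO allows) moves the bottleneck from the per-packet cost to the link itself.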

Also note that you would need a multi-queue vport and a way to distribute packets across its queues (e.g., the HashLB module). Otherwise all RX packets will be forwarded to the same core. And these ACK packets will trigger TX packets to be generated on the very same core, effectively utilizing only one kernel core for all flows.
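As a rough illustration of what distributing packets across queues means, here is a minimal sketch of hash-based queue selection. It is not the HashLB implementation: the `pick_queue` helper, the 5-tuple key format, and the use of CRC32 are all assumptions made for the example.

```python
import zlib


def pick_queue(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
               proto: int, num_queues: int) -> int:
    """Map a flow's 5-tuple to a queue index (hypothetical helper)."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    # CRC32 is a stand-in for whatever hash a real load balancer uses.
    return zlib.crc32(key) % num_queues


NUM_QUEUES = 16

# Every packet of one flow lands on the same queue, so per-flow ordering holds.
q1 = pick_queue("10.0.0.1", "10.0.0.2", 40000, 5201, 6, NUM_QUEUES)
q2 = pick_queue("10.0.0.1", "10.0.0.2", 40000, 5201, 6, NUM_QUEUES)

# Many flows (here: different source ports) spread over several queues.
used = {pick_queue("10.0.0.1", "10.0.0.2", port, 5201, 6, NUM_QUEUES)
        for port in range(40000, 41000)}
print(q1 == q2, len(used))
```

With a single queue, every flow would map to index 0 and all RX processing (and the TX work triggered by ACKs) would end up on one core, which is the situation described above.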

In the early days of SoftNIC there were modules that implement NIC offloading functionality, such as TSO, LRO, checksum, FlowDirector, etc. Although BESS inherited most of the SoftNIC code I decided to disable the features simply because I did not have enough cycles to maintain the features. I believe a fair amount of code is still there in the kernel module side; reviving them shouldn't be terribly difficult. You are welcome to put the fragmented code pieces back together and make it work again.

I attached the old TSO/LRO code of SoftNIC.
segmentation.c.zip

bestephe · 2017-04-19T21:02:41Z

Thanks for the response. It seems like I do want to build a version of SoftNIC that implements at least the TSO and IP checksumming NIC offload features. Given this, I think the "segmentation.c" file that you sent along should be helpful. Thanks for sharing it. However, I have a few follow-up questions/comments:

Question 1:

Although "segmentation.c" seems like it will be helpful, this only seems to be one aspect of the BESS daemon's implementation of TSO and LRO. It seems like additional changes need to be made to the SoftNIC kernel driver and the interface between the BESS daemon and the SoftNIC kernel driver. As a quick example, it seems like SNBUF_DATA may need to be changed so that the kernel driver can pass large segments to the BESS daemon. Similarly, even though it seems like there is support for TX checksumming in the kernel driver, simply enabling the "NETIF_F_IP_CSUM" flag in the kernel driver isn't working for me at the moment. Do you have a more complete snapshot of a version of both the SoftNIC kernel driver and BESS daemon that implements TSO that you would be willing to share with me? Given some of the problems I'm currently trying to solve, I think such a snapshot could be useful in helping save me some development time.

Question/Comment 2:

The reason that I'm currently trying to use SoftNIC/BESS is to evaluate changes to two things: 1) how the NIC schedules TX packets, and 2) how kernel drivers and kernel-bypass applications (DPDK) interface with the NIC, e.g., what information is communicated through transmit descriptors and how flows are mapped to queues. While SoftNIC/BESS seems to be a good platform for evaluating changes to the NIC scheduler, the implementation of the SoftNIC kernel module makes it difficult to draw any meaningful conclusions about the overheads of different implementations of the OS/NIC (and App/NIC) interface. This is because the TX interface between the kernel module and the BESS daemon is backwards from that of a physical NIC, in a way that places additional computational overheads on the kernel driver and changes how the memory shared between the kernel driver and the daemon is used and laid out. At a high level, I feel that this is a mistake. I think it would be better if the SoftNIC kernel module were as close to a driver for a physical NIC as possible, and that any extra overheads in emulating a NIC were instead incurred by the BESS daemon, not the kernel driver.

More concretely, in BESS, the SoftNIC kernel module is responsible for copying packet data into SNBUFs for use by BESS as part of its xmit function. In contrast, with a physical NIC, the kernel module would only create DMA mappings and send metadata and pointers to the NIC, which is then responsible for copying packet data into NIC memory through DMAs. Since this extra overhead of memory copying in SoftNIC/BESS is incurred by the kernel driver, and this implementation detail changes how memory is laid out, it is not easy to draw any conclusions from the CPU and memory overheads incurred by the kernel when it interfaces with SoftNIC/BESS. Further, as it is currently implemented, the SoftNIC/BESS kernel module breaks the byte queue limits (BQL) kernel optimization.

Given these limitations, it's looking to me like I may reimplement the TX side of SoftNIC (BESS VPort).

My questions about the TX implementation of the SoftNIC kernel driver are the following: First, can anyone give me some insight into why the current implementation of SoftNIC/BESS incurs the overheads of copying packets from skbs in the driver instead of in the BESS daemon? Second, has an alternate implementation approach for the SoftNIC TX interface ever been considered? Third, in what follows, I give a high level description of what I think would be a better implementation of the TX interface between the SoftNIC kernel driver and the BESS daemon. Do you see any obvious limitations, drawbacks, or problems with taking this approach instead of the current approach taken by SoftNIC?

My high-level outline of how I think the SoftNIC kernel driver's TX interface should change, to be more faithful to how kernel drivers for physical NICs are implemented, is as follows:

In the xmit function, the driver will start by mapping the skb frags into the address space of the BESS daemon.
Next, the driver will create tx descriptors to communicate both any necessary metadata and the skb fragment pointers to BESS. These descriptors will be enqueued into the llring instead of enqueueing SNBUF pointers directly like SoftNIC currently does.
Once the BESS VPort driver reads the transmit descriptors, it will create packets by allocating memory and copying data as necessary.
Once the BESS VPort is done with the TX descriptors, it will update some registers shared with the SoftNIC kernel driver and generate a TX interrupt for the SoftNIC kernel driver to process.
In the SoftNIC TX interrupt handler, the kernel driver unmaps skb fragments and frees skbs that BESS has acknowledged that it is finished with.
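The proposed flow can be modeled in userspace with a short sketch. This is only an illustration of the ordering of the steps, not SoftNIC or BESS code: every class and method name is hypothetical, `memoryview` stands in for mapping skb fragments without copying, and two deques stand in for the llring and the shared completion registers.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class TxDescriptor:
    skb_id: int   # identifies the skb to free on completion
    frags: list   # zero-copy views of the skb fragments


class TxRing:
    """Stand-in for the shared ring plus completion state."""
    def __init__(self, size: int):
        self.size = size
        self.pending = deque()    # descriptors posted by the driver
        self.completed = deque()  # skb ids the daemon has finished with


class Driver:
    def __init__(self, ring: TxRing):
        self.ring = ring
        self.in_flight = {}  # skb_id -> views kept "mapped" until completion

    def xmit(self, skb_id: int, frags: list) -> bool:
        if len(self.in_flight) >= self.ring.size:
            return False  # ring full: the caller would stop the queue
        views = [memoryview(f) for f in frags]  # "map" frags, no copy
        self.in_flight[skb_id] = views
        self.ring.pending.append(TxDescriptor(skb_id, views))  # post descriptor
        return True

    def tx_interrupt(self) -> list:
        """Unmap and free every skb the daemon has acknowledged."""
        freed = []
        while self.ring.completed:
            skb_id = self.ring.completed.popleft()
            for view in self.in_flight.pop(skb_id):
                view.release()  # "unmap" the fragment
            freed.append(skb_id)
        return freed


class Daemon:
    def __init__(self, ring: TxRing):
        self.ring = ring

    def poll(self) -> list:
        """Build packets from descriptors; the data copy happens here."""
        packets = []
        while self.ring.pending:
            desc = self.ring.pending.popleft()
            packets.append(b"".join(bytes(v) for v in desc.frags))
            self.ring.completed.append(desc.skb_id)  # signal completion
        return packets


ring = TxRing(size=2)
driver, daemon = Driver(ring), Daemon(ring)
driver.xmit(1, [b"hdr", b"payload"])
print(daemon.poll(), driver.tx_interrupt())
```

The point of the model is where the copy lives: `xmit` only creates views and a descriptor, while `poll` on the daemon side pays for assembling the packet. Because skbs stay in flight until `tx_interrupt` runs, the driver also has a natural completion point at which byte accounting for BQL could be done.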

Thanks in advance for taking the time to read this message,

~Brent

sangjinhan closed this as completed Apr 19, 2017

sangjinhan mentioned this issue Nov 1, 2017

performance question #708

Closed
