The same high-speed user space network driver written in many different high-level languages.
## Overview

Ixy is an educational user space network driver for the Intel ixgbe family of 10 Gbit/s NICs (82599ES aka X520, X540, X550, ...). Its goal is to show that writing a super-fast network driver can be surprisingly simple; check out the full description in the main repository.

Ixy was originally written in C as the lowest common denominator of systems programming languages, but it is possible to write user space drivers in any programming language.

Yes, these drivers are really full implementations of an actual PCIe driver in these languages; they handle everything from setting up DMA memory to receiving and transmitting packets in a high-level language. You don't need to write any kernel code to build drivers! Some languages require a few lines of C stubs for features not offered by the language, usually related to getting the memory address of buffers or poking MMIO registers in the right way. But all the core logic is in high-level languages; the implementations are about 1000 lines of code each.

| Language | Code | Status | Full evaluation |
| --- | --- | --- | --- |
| C | ixy.c* | Finished | Paper (draft) |
| Rust | ixy.rs | Finished | Thesis |
| Go | ixy.go | Finished | Thesis |
| C# | ixy.cs | Finished | Thesis |
| Swift | ixy.swift | Finished | Documentation |
| OCaml | ixy.ml | WIP | Documentation |
| Haskell | ixy.hs | WIP | WIP |
| Python | ixy.py* | WIP | WIP |

*) also features a VirtIO driver for easy testing in VMs with Vagrant

This repository is only a short summary of the project; check out the repositories and full evaluations linked above for all the gory details.

## Performance

Our benchmarking script for the MoonGen packet generator subjects the forwarder example application to full bidirectional load at 20 Gbit/s with 64 byte packets (29.76 Mpps). The forwarder increments one byte in each packet to ensure that the packet is loaded all the way into the L1 cache. The script also verifies correct functionality of the forwarder by validating sequence numbers.

Running this on a single core of a Xeon E3-1230 v3 CPU yields these results when varying the CPU frequency.

CPU frequency vs. throughput

A main driver of performance for network drivers is sending and receiving packets in batches from/to the NIC. Ixy achieves high performance with relatively small batch sizes of 32-64 because it is a full user space driver. Other user space packet processing frameworks like netmap that rely on a kernel driver need larger batch sizes of 512 and above to amortize the larger overhead of communicating with the driver in the kernel.

Performance with different batch sizes, CPU at 3.3 GHz

Performance with different batch sizes, CPU at 1.6 GHz

[TODO: measure cache misses for other languages, similar to the cache measurement for the C version in the paper]

Notes on multi-core: Some languages implement multi-threading for ixy, but some can't or are limited by the language's design (e.g., the GIL in Python and OCaml). However, this isn't a real problem because multi-threading within one process isn't really necessary. Network cards can split the traffic at the hardware level (via a feature called RSS), distributing it across independent processes. For example, Snabb works like this, and many DPDK applications use multiple threads that do not communicate.

## Latency

Average and median latency is the same regardless of the programming language; the evaluation script can sample the latency of up to 1000 packets per second with hardware timestamping (precision: 12.8 nanoseconds). This yields somewhat interesting results depending on queue sizes and NUMA configuration, see the ixy paper for an evaluation.

Latency spikes induced by languages featuring a garbage collector and/or a JIT compiler might not be caught by the test setup above. We are working on an additional latency measurement setup based on fiber taps with our MoonSniff framework. This allows us to timestamp every single packet and detect JIT warmup phases and individual GC cycles.