Matthieu Baerts edited this page Mar 31, 2022 · 60 revisions

Linux MPTCP Upstream Project

Overview

The goal of this community is to develop and improve the MPTCP protocol (RFC 8684) in the upstream Linux kernel.

Our basic guidelines for the upstream implementation are:

  • TCP complexity can't increase. It's already a complex, performance-sensitive piece of software that every Linux user depends on. Intrusive changes have a risk of creating bugs or changing operation of the stack in unexpected ways.

  • sk_buff structure size can't get bigger. It's already large and, if anything, the maintainers hope to reduce its size. Changes to the data structure size are amplified by the large number of instances in a busy system.

  • An additional protocol like MPTCP should be opt-in, so users of regular TCP continue to get the same type of connection and performance unless MPTCP is requested.

The Linux MultiPath TCP Project also has an MPTCP-enabled kernel available; however, it was developed with different requirements in mind and was not upstreamable as-is.

Talks and Articles

In English:

In Chinese (中文):

ChangeLog

  • v5.6: Create MPTCPv1 sockets with a single subflow
    • Prerequisites: modifications in TCP and Socket API
    • Single subflow & RFC8684 support
    • Selftests for single subflow
  • v5.7: Create multiple subflows but use them one at a time and get stats
    • Multiple subflows: one subflow is used at a time
    • Path management: global, controlled via Netlink
    • MIB counters
    • Subflow context exposed via inet_diag
    • Selftests for multiple-subflow operation
    • Selftests for the Netlink path manager interface
    • Bug-fix
  • v5.8: Stabilisation and more MPTCPv1 spec support
    • Shared receive window across multiple subflows
    • A few optimisations
    • Bug-fix
  • v5.9: Stabilisation and more MPTCPv1 spec support
    • Token refactoring
    • KUnit tests
    • Receive buffer auto-tuning
    • diag features (list MPTCP connections using ss)
    • Full DATA FIN support
    • MPTCP SYN Cookie support
    • A few optimisations
    • Bug-fix
  • v5.10: Send over multiple subflows at the same time and more MPTCPv1 spec support
    • Multiple xmit: possibility to send data over multiple subflows
    • ADD_ADDR support with echo-bit
    • REMOVE_ADDR support
    • A few optimisations
    • Bug-fix
  • v5.11: Performance and more MPTCPv1 spec support
    • Refined receive buffer autotuning
    • Improved GRO and RX coalescing with MPTCP skbs
    • Improved multiple xmit streams support
    • MP_ADD_ADDR v6 support
    • Sending MP_ADD_ADDR port support
    • Incoming MP_FAST_CLOSE support
    • A few optimisations
    • Bug-fix
  • v5.12: Performance improvements, PM events and more MPTCPv1 spec support
    • Support for accepting MP_JOIN on another port (after having sent an ADD_ADDR with this port)
    • MP_PRIO support
    • Per connection netlink PM events
    • "Delegated actions" framework to improve communications between MPTCP socket and subflows
    • Support IPv4-mapped in IPv6 for additional subflows
    • Performance improvements
    • A few optimisations
    • Bug-fix
  • v5.13: Supporting more options and items from the protocol
    • Outgoing MP_FAST_CLOSE support
    • MP_TCPRST support
    • RM_ADDR: support for lists of addresses
    • Switch to next available address when a subflow creation fails
    • Support removing subflows with ID 0
    • New MIB counters: active MPC, token creation fallback
    • socket options:
      • only admit explicitly supported ones
      • support new ones: SO_KEEPALIVE, SO_PRIORITY, SO_RCVBUF/SO_SNDBUF, SO_BINDTODEVICE/SO_BINDTOIFINDEX, SO_LINGER, SO_MARK, SO_INCOMING_CPU, SO_DEBUG, TCP_CONGESTION and TCP_INFO
    • debug: new tracepoints support
    • Retransmit DATA_FIN support
    • MSG_TRUNC and MSG_PEEK support
    • A few optimisations/cleanup
    • Bug-fix
  • v5.14: Supporting more options and items from the protocol
    • Checksum support
    • MP_CAPABLE C flag support
    • Receive path cmsg support (e.g. timestamp)
    • MIB counters for invalid mapping
    • A few optimisations/cleanup (that might affect performance)
    • Bug-fix
  • v5.15: Supporting more options and usability improvements
    • MP_FAIL support (without TCP fallback / infinite mapping)
    • Packet scheduler improvements (especially with backup subflows)
    • Full mesh path management support
    • Refactoring of ADD_ADDR and ECHO handling
    • Memory and execution optimization of option header transmit and receive
    • Bug-fix and small optimisations
  • v5.16: Supporting more socket options
    • Support for MPTCP_INFO socket option (similar to TCP_INFO)
    • Default max additional subflows for the in-kernel PM is now set to 2
    • Batch SNMP operations
    • Bug-fix and optimisations
  • v5.17: Even more socket options
    • Support for new ioctls: SIOCINQ, SIOCOUTQ, and SIOCOUTQNSD
    • Support for new socket options: IP_TOS, IP_FREEBIND, IPV6_FREEBIND, IP_TRANSPARENT, IPV6_TRANSPARENT, TCP_CORK and TCP_NODELAY
    • Support for cmsgs: TCP_INQ
    • PM: Support changing the "backup" bit via Netlink (ip mptcp)
    • PM: Do not block subflows creation on errors
    • Packet scheduler improvement: better HoL-blocking estimation, improving stability
    • Support sending MP_FASTCLOSE option (quick shutdown of the full MPTCP connection, similar to TCP RST in regular TCP)
    • Bug-fix and optimisations
  • v5.18: Stabilisation
    • Support dynamic change of the Fullmesh PM flag
    • Support for a new socket option: SO_SNDTIMEO
    • Code cleanup:
      • Clarify when MPTCP options can be used together
      • Constify a bunch of helpers
      • Make some ops structures read-only
    • Add MIBs for MP_FASTCLOSE and MP_RST
    • Add tracepoint in mptcp_sendmsg_frag()
    • Restrict RM_ADDR generation to previously explicitly announced addresses
    • Send ADD_ADDR echo before creating subflows
  • v5.19: TODO

Resources

How to use MPTCP?

Here is a checklist:

  • Use a "recent" kernel with MPTCP support (grep MPTCP /boot/config-<version>), see ChangeLog section
  • Make sure it has not been disabled: sysctl net.mptcp.enabled
  • Your app should create sockets with the IPPROTO_MPTCP protocol (socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP);). Legacy apps can be forced to create and use MPTCP sockets instead of TCP ones via the mptcpize command bundled with the mptcpd daemon. There are also additional older workarounds available (1 - 2)
  • Configure the routing rules to use multiple subflows from multiple interfaces: Configure Routing
  • Configure the path manager using ip mptcp or mptcpd on both the client and server, e.g.
ip mptcp limits set subflow 2 add_addr_accepted 2
ip mptcp endpoint add <ip> dev <iface> <subflow|signal>

Please also DO NOT use multipath-tcp.org or amiusingmptcp.de to check that MPTCP is working: these services only support v0 of the protocol while the upstream implementation only supports v1, and the two are not compatible. Please use tools like tcpdump and wireshark instead, or check counters with nstat or directly in /proc/net/netstat.

Upstream vs out-of-tree implementations

There are two different but active Linux kernel projects related to MPTCP:

  • out-of-tree:
    • URL: https://github.com/multipath-tcp/mptcp
    • It cannot be "upstreamed" to the official Linux kernel as-is: there are too many modifications in the TCP stack.
    • It is designed to have very good performance with MPTCP, but it has an impact on normal TCP and its maintenance is harder.
    • This version is used for the server behind http://multipath-tcp.org/
    • MPTCPv0 spec is supported
    • MPTCPv1 spec support is available from the v0.96 version
  • upstream: (here)
    • URL: https://github.com/multipath-tcp/mptcp_net-next
    • Available since v5.6 in the official Linux kernel (if enabled in the kernel config)
    • It is a new implementation designed to fit with upstream standards.
    • The work is ongoing; see the ChangeLog section above for what is currently supported
    • Only MPTCPv1 is supported
    • Note: RHEL8 has MPTCP support based on this upstream implementation.

MPTCPv0 vs MPTCPv1

For the moment, there are also different versions of the protocol: RFC 6824 (MPTCPv0) and RFC 8684 (MPTCPv1).

MPTCPv1 has significant changes that make it incompatible with v0. By design, the upstream version is not compatible with MPTCPv0. That is why curl http://multipath-tcp.org/ will always say you don't support MPTCP(v0).

Please note that MPTCPv0 and MPTCPv1 do not refer to the different Linux kernel implementations (out-of-tree vs upstream); they are just versions of the protocol. Please use 'out-of-tree' and 'upstream' when talking about the Linux kernel implementations.

curl http://multipath-tcp.org reports that I don't have MPTCP support

$ curl http://multipath-tcp.org
Nay, Nay, Nay, your have an old computer that does not speak MPTCP. Shame on you!

Please see the above sections for more details, but the server behind http://multipath-tcp.org uses the out-of-tree implementation with MPTCPv0 only. It is therefore not compatible with MPTCPv1 and reports this error.

It is planned to have a public MPTCP server with the upstream kernel but it is not ready yet.

Kernel Development environment

The Docker image used by the public CIs can be used to create a basic kernel dev environment.

Download the kernel source code and then run these two commands to download the latest Docker image and run it:

$ docker pull mptcp/mptcp-upstream-virtme-docker:latest
$ docker run -v "${PWD}:${PWD}:rw" -w "${PWD}" --privileged --rm -it mptcp/mptcp-upstream-virtme-docker:latest <manual-normal | manual-debug | auto-normal | auto-debug | auto-all>

For more details: https://github.com/multipath-tcp/mptcp-upstream-virtme-docker

Under heavy load, high number of retransmissions / dropped packets at the NIC RX queue level

Even though with MPTCP the subflow processing is done by the TCP stack, the main difference with plain TCP is that this processing does not use the socket backlog and always happens in BH (bottom-half) context. When the host is under heavy load, BH processing happens in ksoftirqd context, and there is some latency between ksoftirqd being scheduled and the moment it actually runs. This depends on the process scheduler's decisions (and settings).

A way to reduce these retransmissions and avoid dropped packets at the NIC level is to increase the size of the NIC RX queue. See issue #253 for more details.
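As a sketch, the RX ring can be inspected and grown with ethtool; eth0 and the value 4096 are placeholders, so check the "Pre-set maximums" reported by -g before picking a size for your hardware:

```shell
# Show current and maximum RX/TX ring sizes for the interface
# (eth0 is a placeholder for the busy NIC):
ethtool -g eth0

# Grow the RX ring towards the hardware maximum so bursts can be
# absorbed while ksoftirqd is waiting to be scheduled:
ethtool -G eth0 rx 4096
```

A larger ring trades a little memory and latency for fewer drops while BH processing is delayed under load.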