authors | state | discussion |
---|---|---|
Robert Mustacchi <rm@joyent.com> | draft | |
When sending and receiving networking traffic, there are many different checksums that need to be calculated and verified in a packet. These checksums are often included in the L2, L3, and L4 headers. An example of an L2 checksum is the Ethernet CRC. An example of an L3 header checksum is the IPv4 header checksum. An example of an L4 header checksum is the TCP header checksum.
Specifically, we're concerned with L3 and L4 checksums. The L2 Ethernet checksum is usually taken care of by hardware. Not all protocols have checksums. Consider ARP or LLDP. Both of these are L3 protocols, but neither has a checksum. Similarly, while IPv4 has an L3 header checksum, IPv6 does not!
We're specifically concerned about the following L3 protocols with checksums:
- IPv4
We're specifically concerned about the following L4 protocols with checksums:
- TCP
- UDP
- ICMP
- SCTP
Traditionally, these checksums were all calculated by the networking stack. However, hardware has offered support to offload the calculation of these checksums. This RFD covers some additional checksum extensions offered by modern hardware that we'd like to take advantage of.
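As a reference point for what the software path computes, the IPv4 header, TCP, UDP, and ICMP checksums all use the Internet checksum described in RFC 1071 (SCTP uses CRC32c instead). The following is a minimal sketch of that algorithm, not the stack's actual implementation; the function name `inet_cksum` is made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Internet checksum (RFC 1071): the 16-bit ones' complement of the ones'
 * complement sum of the data, read as big-endian 16-bit words. The 32-bit
 * accumulator is adequate for packet-sized buffers (up to 64 KiB).
 * inet_cksum() is an illustrative name, not an existing kernel routine.
 */
uint16_t
inet_cksum(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;

	while (len > 1) {
		sum += ((uint32_t)buf[0] << 8) | buf[1];
		buf += 2;
		len -= 2;
	}

	/* An odd trailing byte is padded with a zero byte on the right. */
	if (len == 1)
		sum += (uint32_t)buf[0] << 8;

	/* Fold the carries back into the low 16 bits. */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return ((uint16_t)~sum);
}
```

To generate a checksum, the checksum field is zeroed first and the result is stored in it. To verify one, the same function is run over the header with the checksum in place; a result of zero means the checksum is valid.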
The GLDv3 is the framework that networking device drivers implement. The
GLDv3 has a notion of capabilities, one of which, MAC_CAPAB_HCKSUM,
is used to identify whether hardware supports checksum offload
features. A driver indicates what features it supports for transmit.
When the networking stack is assembling packets, it will use this
information to determine whether a checksum will be calculated in
software or in hardware.
Drivers can currently indicate the following flags. These flags are
documented in mac(9E):
- HCKSUM_IPHDRCKSUM: This indicates that the calculation of a checksum of an IPv4 header can be offloaded.
- HCKSUM_INET_FULL_V4: This indicates that the calculation of a checksum of an IPv4 packet's L4 header can be offloaded.
- HCKSUM_INET_FULL_V6: This indicates that the calculation of a checksum of an IPv6 packet's L4 header can be offloaded.
- HCKSUM_INET_PARTIAL: The hardware requires that a pseudo-header is calculated for it when performing L4 offloads.
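To make the pseudo-header requirement concrete, here is a minimal sketch of the IPv4 pseudo-header sum that software would place in the L4 checksum field before handing the packet to such hardware. The function names are made up for illustration, and whether a particular device expects the L4 length to be included in the seed is device-specific; this sketch assumes that it does.

```c
#include <stdint.h>

/* Fold a 32-bit accumulator into a 16-bit ones' complement sum. */
uint16_t
ex_cksum_fold(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return ((uint16_t)sum);
}

/*
 * Ones' complement sum of the IPv4 pseudo-header: source address,
 * destination address, protocol number, and L4 length, all given in host
 * byte order. The result is deliberately NOT inverted: the hardware
 * continues the sum over the L4 header and payload and inverts at the end.
 * Both function names here are hypothetical.
 */
uint16_t
ex_ipv4_pseudo_sum(uint32_t src, uint32_t dst, uint8_t proto, uint16_t l4len)
{
	uint32_t sum = 0;

	sum += (src >> 16) + (src & 0xffff);
	sum += (dst >> 16) + (dst & 0xffff);
	sum += proto;
	sum += l4len;

	return (ex_cksum_fold(sum));
}
```

The caller would store the returned value in the TCP or UDP checksum field in network byte order.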
Later, when a driver transmits a packet, it calls the
mac_hcksum_get(9F) function to retrieve the set of operations it must
perform on the packet. Based on these options, it generally sets bits in
the packet's descriptor header.
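The transmit-side translation might look like the following sketch. The descriptor bits and the `exnic` device are invented for illustration, and the HCK_* values are placeholders so that the block compiles on its own; a real driver takes them from the system headers and obtains the flags word from mac_hcksum_get(9F).

```c
#include <stdint.h>

/* Placeholder values; a real driver uses the system header definitions. */
#define	HCK_IPV4_HDRCKSUM	0x01
#define	HCK_PARTIALCKSUM	0x02
#define	HCK_FULLCKSUM		0x04

/* Hypothetical transmit descriptor bits for an imaginary NIC. */
#define	EXNIC_TXD_IPCSUM	0x0100	/* insert IPv4 header checksum */
#define	EXNIC_TXD_L4CSUM	0x0200	/* insert TCP/UDP checksum */

/*
 * Translate the checksum flags requested for a packet into the bits this
 * imaginary device expects in its transmit descriptor.
 */
uint32_t
exnic_tx_cksum_bits(uint32_t hck_flags)
{
	uint32_t bits = 0;

	if (hck_flags & HCK_IPV4_HDRCKSUM)
		bits |= EXNIC_TXD_IPCSUM;
	if (hck_flags & (HCK_FULLCKSUM | HCK_PARTIALCKSUM))
		bits |= EXNIC_TXD_L4CSUM;

	return (bits);
}
```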
When a driver receives data, it looks at the packet's descriptor header
to determine which checksum features were triggered. Depending on the
type of packet, nothing may have been done, or checksums may have been
validated and errors detected. Once the driver understands what has
occurred, it calls the mac_hcksum_set(9F) function and sets checksum
information as documented in the manual page.
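The receive direction can be sketched the same way. The status bits below belong to an imaginary device and the HCK_* values are placeholders; a real driver would pass the resulting flags to mac_hcksum_set(9F). The sketch assumes the common pattern of only reporting success, so that a packet the hardware flagged as bad is re-verified in software rather than trusted.

```c
#include <stdint.h>

/* Placeholder values; a real driver uses the system header definitions. */
#define	HCK_IPV4_HDRCKSUM_OK	0x01
#define	HCK_FULLCKSUM_OK	0x08

/* Hypothetical receive descriptor status bits for an imaginary NIC. */
#define	EXNIC_RXD_IPCS_DONE	0x01	/* IPv4 header checksum was checked */
#define	EXNIC_RXD_IPCS_ERR	0x02	/* ... and it was wrong */
#define	EXNIC_RXD_L4CS_DONE	0x04	/* L4 checksum was checked */
#define	EXNIC_RXD_L4CS_ERR	0x08	/* ... and it was wrong */

/*
 * Turn descriptor status into checksum flags. A flag is set only when the
 * hardware both performed the check and found no error; otherwise no flag
 * is set and the stack falls back to software verification.
 */
uint32_t
exnic_rx_cksum_flags(uint32_t status)
{
	uint32_t flags = 0;

	if ((status & EXNIC_RXD_IPCS_DONE) && !(status & EXNIC_RXD_IPCS_ERR))
		flags |= HCK_IPV4_HDRCKSUM_OK;
	if ((status & EXNIC_RXD_L4CS_DONE) && !(status & EXNIC_RXD_L4CS_ERR))
		flags |= HCK_FULLCKSUM_OK;

	return (flags);
}
```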
Aside from this, everything else is handled internally in the OS and by consumers such as DLS, IP, viona, etc.
Today there are two major issues that we want to bring up.
- Support for VXLAN encapsulation checksum offload
- Clarifying the HCKSUM_INET_FULL_V* and HCKSUM_INET_PARTIAL values.
The first issue is that we want to have the ability to leverage hardware checksum offload for encapsulation protocols. Modern hardware has the ability to peer into and understand VXLAN, GRE, and Geneve encapsulation headers and calculate the inner and outer L3 and L4 header checksums. This can be used on transmit to calculate the checksum in hardware and on receive to verify that the checksum is correct in hardware.
This is complicated by the fact that in the VXLAN spec, it is considered optional to set a checksum in the UDP header. However, if one does, then the checksum must be validated.
The second issue is somewhat related. Today, when we have the
HCKSUM_INET_FULL_V4, HCKSUM_INET_FULL_V6, and HCKSUM_INET_PARTIAL flags,
we do not indicate the set of L4 protocols that are covered by them.
The issue comes to the forefront when hardware supports a different set
from what the operating system expects.
The four L4 checksums that come up most often are the ones we mentioned
earlier: TCP, UDP, ICMP, and SCTP. Of these, the OS assumes that the
flags cover TCP, UDP, and ICMP. However, while most hardware supports
TCP and UDP checksum offload, the same is not true for ICMP. For
example, the qede driver does not support ICMPv6 checksum offload and
the i40e driver does not support partial checksum calculation for ICMP.
To settle what HCKSUM_INET_FULL_V4, HCKSUM_INET_FULL_V6, and
HCKSUM_INET_PARTIAL mean, I propose that we clarify the manual page to
say specifically that these only apply to the TCP and UDP protocols.
This does mean that there are some drivers that currently provide ICMP checksum support which will be missing out. However, the percentage of traffic that is ICMP and the corresponding cost for such drivers will be minimal. In addition, we should add three new flags to cover ICMP.
- HCKSUM_INET_FULL_ICMPv4: This indicates that hardware supports checksum offload for IPv4 ICMP packets.
- HCKSUM_INET_FULL_ICMPv6: This indicates that hardware supports checksum offload for IPv6 ICMP packets.
- HCKSUM_INET_PARTIAL_ICMP: This indicates that hardware supports partial checksum offload for ICMP packets.
These could also be extended to cover SCTP in a similar fashion by
adding HCKSUM_INET_FULL_SCTPv4, HCKSUM_INET_FULL_SCTPv6, and
HCKSUM_INET_PARTIAL_SCTP. Or if some future IPv4 protocol has hardware
checksum offload support, we can indicate it in a similar way.
One nice side effect of this is that it does not impact the set of flags
that we need to offer to use with mac_hcksum_get(9F) or
mac_hcksum_set(9F).
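Under this proposal, the stack's decision about whether a full L4 checksum can be left to hardware might be sketched as follows. The bit values are placeholders (the ICMP flags do not exist yet), the function and `EX_PROTO_*` names are made up, and the sketch assumes that "IPv6 ICMP" means ICMPv6 (protocol 58). Partial checksums are left out for brevity.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit values; the ICMP flags are only proposed. */
#define	HCKSUM_INET_FULL_V4	0x04
#define	HCKSUM_INET_FULL_V6	0x08
#define	HCKSUM_INET_FULL_ICMPv4	0x100
#define	HCKSUM_INET_FULL_ICMPv6	0x200

/* IP protocol numbers, named locally for the sketch. */
#define	EX_PROTO_ICMP	1
#define	EX_PROTO_TCP	6
#define	EX_PROTO_UDP	17
#define	EX_PROTO_ICMPV6	58

/*
 * Return true if the capability word says hardware can compute the full L4
 * checksum for this IP version (4 or 6) and protocol. Anything without a
 * dedicated flag, such as SCTP, stays in software.
 */
bool
ex_hw_full_l4_cksum(uint32_t caps, int ipver, uint8_t proto)
{
	uint32_t need;

	switch (proto) {
	case EX_PROTO_TCP:
	case EX_PROTO_UDP:
		need = (ipver == 4) ? HCKSUM_INET_FULL_V4 : HCKSUM_INET_FULL_V6;
		break;
	case EX_PROTO_ICMP:
		if (ipver != 4)
			return (false);
		need = HCKSUM_INET_FULL_ICMPv4;
		break;
	case EX_PROTO_ICMPV6:
		if (ipver != 6)
			return (false);
		need = HCKSUM_INET_FULL_ICMPv6;
		break;
	default:
		return (false);
	}

	return ((caps & need) != 0);
}
```

With this split, a driver that advertises only HCKSUM_INET_FULL_V4 and HCKSUM_INET_FULL_V6 makes no promise about ICMP, which matches the clarified meaning of those flags.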
There are a few more challenges with checksum offload features. First, the following hardware table of features is useful to share. This table surveys devices that are currently on the market and what they support for VXLAN encapsulation. Inner L4 only refers to TCP and UDP.
Driver | Inner L3 | Inner L4 | Outer L3 | Outer L4 | Notes |
---|---|---|---|---|---|
bnxt | yes | yes | yes | yes | no illumos driver |
cxgbe | yes* | yes* | yes* | yes* | Pending verification |
i40e | yes | yes | yes | no* | Outer L4 only supported on X722 MAC, Partial Checksums |
ixgbe | yes* | yes* | yes* | no* | Only supported on X550 MAC |
mlx4x | yes* | yes* | yes* | yes* | no illumos driver, pending verification |
mlx5x | yes* | yes* | yes* | yes* | no illumos driver, pending verification |
qede | yes | yes | yes | yes | - |
Based on this there are a few important things to note. The first is that this table is focused entirely on VXLAN. However, there are other encapsulation protocols such as GENEVE that might need to be dealt with in the future.
Next, we should concern ourselves only with the inner TCP and UDP headers. While this means that we're leaving out inner ICMP traffic, as a proportion of traffic it isn't very large and not a lot of hardware supports it.
In addition, if we're talking about VXLAN, then really that means that the outer L4 is UDP. Today's hardware seems to always support the inner L3 and L4 offloads, even if it varies on the support of the outer L3 and L4. As such, we should start with a more limited bit set that covers what we care about. If hardware offers more options then we can add flags to cover that hardware if it's required.
As such, I'd propose that we add the following flags to the capability set:
HCKSUM_VXLAN_FULL
This indicates that the inner L3, inner L4, outer L3, and outer L4 checksums can all be offloaded to hardware for processing. Specifically this means the following:
- Inner L4 can be TCP or UDP
- Inner L3 can be IPv4 or IPv6
- Outer L4 must be UDP
- Outer L3 can be IPv4 or IPv6
There is no checksum in an IPv6 header. However, this indicates that an L3 IPv6 header can be understood by the hardware.
HCKSUM_VXLAN_PSEUDO
This is similar to HCKSUM_VXLAN_FULL; however, both the inner and
outer L4 checksums require that the pseudo-header be calculated and
filled into the checksum field. This is similar to, but distinct from,
the traditional HCKSUM_INET_PARTIAL. Importantly, we don't end up being
able to retrieve all the fields that we would from HCKSUM_INET_PARTIAL.
HCKSUM_VXLAN_PSEUDO_NO_OL4
This is similar to HCKSUM_VXLAN_PSEUDO; however, there is no support
for offloading the checksum of the outer L4 header. Instead, that
checksum must be calculated separately in software.
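One way to read the three proposed flags is as a description of how much work software must still do before transmit. The sketch below encodes that reading; the bit values, the `vxlan_cksum_plan_t` type, and its field names are all hypothetical, and the precedence order among the flags is an assumption rather than something this document specifies.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit values for the proposed capability flags. */
#define	HCKSUM_VXLAN_FULL		0x1000
#define	HCKSUM_VXLAN_PSEUDO		0x2000
#define	HCKSUM_VXLAN_PSEUDO_NO_OL4	0x4000

/* Hypothetical summary of the software work needed for a VXLAN packet. */
typedef struct vxlan_cksum_plan {
	bool vcp_hw_inner;	/* hardware computes inner L3/L4 checksums */
	bool vcp_inner_pseudo;	/* seed inner L4 field with pseudo-header */
	bool vcp_outer_pseudo;	/* seed outer UDP field with pseudo-header */
	bool vcp_sw_outer_l4;	/* outer UDP checksum, if used, is software */
} vxlan_cksum_plan_t;

vxlan_cksum_plan_t
ex_vxlan_cksum_plan(uint32_t caps)
{
	vxlan_cksum_plan_t p = { false, false, false, false };

	if (caps & HCKSUM_VXLAN_FULL) {
		p.vcp_hw_inner = true;
	} else if (caps & HCKSUM_VXLAN_PSEUDO) {
		p.vcp_hw_inner = true;
		p.vcp_inner_pseudo = true;
		p.vcp_outer_pseudo = true;
	} else if (caps & HCKSUM_VXLAN_PSEUDO_NO_OL4) {
		p.vcp_hw_inner = true;
		p.vcp_inner_pseudo = true;
		p.vcp_sw_outer_l4 = true;
	} else {
		/* No encapsulation offload: everything stays in software. */
		p.vcp_sw_outer_l4 = true;
	}

	return (p);
}
```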
Finally, there are no non-verified checksum flags present here. The intent is to only support the full header checksums that are completely validated by the OS at this time. If hardware comes along that requires this mode, then we can add support for this.
We must add a few more flags to the mac_hcksum_get(9F) and
mac_hcksum_set(9F) family of functions. First, we suggest that we
treat all of the existing flags as referring to the outer headers. This
means that there are no changes to the existing values and their meaning
in drivers, even when dealing with encapsulated packets.
When receiving we should add the following flags:
- HCK_INNER_IPV4_HDRCKSUM_OK: This is the equivalent of the HCK_IPV4_HDRCKSUM_OK flag; however, it applies to the inner IPv4 header.
- HCK_INNER_FULLCKSUM_OK: This is the equivalent of the HCK_FULLCKSUM_OK flag; however, it applies to the inner L4 header.
When transmitting we should add the following flags:
- HCK_INNER_IPV4_HDRCKSUM_NEEDED: Indicates that the hardware must calculate the inner IPv4 header checksum. This is the equivalent of HCK_IPV4_HDRCKSUM.
- HCK_INNER_FULLCKSUM_NEEDED: Indicates that the hardware must calculate the inner L4 header checksum. This is the equivalent of HCK_FULLCKSUM.
- HCK_INNER_PSEUDO_NEEDED: Indicates that the hardware must calculate the inner L4 header checksum. This is similar to HCK_PARTIALCKSUM; however, while the driver requires that the pseudo-header be present in the L4 checksum field, it does not require any of the other HCK_PARTIALCKSUM fields to be available in the data block for performing checksum offload.
To round this off, we may need to add additional support functions. These functions should be used by the overlay driver and other internal routines that need to know how to shift values from a message block with an encapsulated packet to the message block that just has the inner payload or vice versa.
For example, when a frame is being encapsulated, something may have
already set the HCK_FULLCKSUM flag. When the frame is encapsulated,
this will need to transform from HCK_FULLCKSUM to
HCK_INNER_FULLCKSUM_NEEDED. The opposite will need to happen in
decapsulation. If HCK_INNER_IPV4_HDRCKSUM_OK was set, then
HCK_IPV4_HDRCKSUM_OK will need to be set on the decapsulated packet.
The following are function prototypes that may be able to offer this functionality. However, this is subject to change as we further develop this feature and gain a better understanding of how it is supposed to work.

```c
/*
 * Take the flags that indicate a needed set of checksum values and turn
 * them into the corresponding flags that refer to inner values. This
 * should be called when a mblk_t is being encapsulated.
 */
extern void mac_hcksum_encap_shift(mblk_t *mp, uint32_t flags);

/*
 * Take the flags that were applied to a full message block and
 * transform them so that they apply to the decapsulated version. For
 * example, if HCK_INNER_FULLCKSUM_OK was set in flags, instead
 * HCK_FULLCKSUM_OK will be set on mp.
 */
extern void mac_hcksum_decap_shift(mblk_t *mp, uint32_t flags);
```

With this scheme, we'll still need to make sure that the checksum flags are cleared on the original message blocks, so nothing gets confused. In addition, we may need to go through and improve mac_fixup_cksum(), viona, and others to handle these newer flags and work around them.
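The flag translation at the heart of the two shift functions could be sketched as pure functions over the flags word, as below. All bit values are placeholders, the `ex_` function names are made up, and the real routines would additionally read and write the flags on the mblk_t. HCK_PARTIALCKSUM is deliberately left out, since its relationship to HCK_INNER_PSEUDO_NEEDED is described above as similar rather than equivalent.

```c
#include <stdint.h>

/* Placeholder values; the HCK_INNER_* flags are only proposed. */
#define	HCK_IPV4_HDRCKSUM		0x0001	/* transmit */
#define	HCK_FULLCKSUM			0x0004	/* transmit */
#define	HCK_IPV4_HDRCKSUM_OK		0x0001	/* receive */
#define	HCK_FULLCKSUM_OK		0x0008	/* receive */
#define	HCK_INNER_IPV4_HDRCKSUM_NEEDED	0x0100
#define	HCK_INNER_FULLCKSUM_NEEDED	0x0200
#define	HCK_INNER_IPV4_HDRCKSUM_OK	0x0400
#define	HCK_INNER_FULLCKSUM_OK		0x0800

/*
 * Encapsulation (transmit): requests made against the original frame
 * become requests against the inner headers of the encapsulated packet.
 */
uint32_t
ex_hcksum_encap_shift_flags(uint32_t flags)
{
	uint32_t out = 0;

	if (flags & HCK_IPV4_HDRCKSUM)
		out |= HCK_INNER_IPV4_HDRCKSUM_NEEDED;
	if (flags & HCK_FULLCKSUM)
		out |= HCK_INNER_FULLCKSUM_NEEDED;

	return (out);
}

/*
 * Decapsulation (receive): results reported for the inner headers become
 * results for the decapsulated packet. Results that described the outer
 * headers are dropped because those headers no longer exist.
 */
uint32_t
ex_hcksum_decap_shift_flags(uint32_t flags)
{
	uint32_t out = 0;

	if (flags & HCK_INNER_IPV4_HDRCKSUM_OK)
		out |= HCK_IPV4_HDRCKSUM_OK;
	if (flags & HCK_INNER_FULLCKSUM_OK)
		out |= HCK_FULLCKSUM_OK;

	return (out);
}
```

Returning a fresh flags word rather than modifying one in place keeps the old outer-level bits from leaking onto the new message block.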
It is also likely that we're going to end up coming up with other additions that we need when we're working with the specific drivers. For example, it may be useful to indicate that VXLAN encapsulation specifically was used with some additional metadata. We'll let the implementation guide us here.