Title : In-band Network Telemetry (INT) Dataplane Specification | |
Title Note : Version 2.1 | |
Title Footer: 2020-11-11 | |
Author : The P4.org Applications Working Group. Contributions from | |
Affiliation : *Alibaba, Arista, CableLabs, Cisco Systems, Dell, Intel, Marvell, Netronome, VMware* | |
Heading depth: 5 | |
Pdf Latex: xelatex | |
Document Class: [11pt]article | |
Package: [top=1in, bottom=1.25in, left=1in, right=1in]{geometry} | |
Package: fancyhdr | |
Tex Header: | |
\setlength{\headheight}{30pt} | |
\renewcommand{\footrulewidth}{0.5pt} | |
@if html { | |
body.madoko { | |
font-family: utopia-std, serif; | |
} | |
title,titlenote,titlefooter,authors,h1,h2,h3,h4,h5 { | |
font-family: helvetica, sans-serif; | |
font-weight: bold; | |
} | |
pre, code { | |
language: p4; | |
font-family: monospace; | |
font-size: 10pt; | |
} | |
} | |
@if tex { | |
body.madoko { | |
font-family: UtopiaStd-Regular; | |
} | |
title,titlenote,titlefooter,authors { | |
font-family: sans-serif; | |
font-weight: bold; | |
} | |
pre, code { | |
language: p4; | |
font-family: LuxiMono; | |
font-size: 75%; | |
} | |
} | |
Colorizer: p4 | |
.token.keyword { | |
font-weight: bold; | |
} | |
@if html { | |
p4example { | |
replace: "~ Begin P4ExampleBlock&nl;\ | |
````&nl;&source;&nl;````&nl;\ | |
~ End P4ExampleBlock"; | |
padding:6pt; | |
margin-top: 6pt; | |
margin-bottom: 6pt; | |
border: solid; | |
background-color: #ffffdd; | |
border-width: 0.5pt; | |
} | |
} | |
@if tex { | |
p4example { | |
replace: "~ Begin P4ExampleBlock&nl;\ | |
````&nl;&source;&nl;````&nl;\ | |
~ End P4ExampleBlock"; | |
breakable: true; | |
padding: 6pt; | |
margin-top: 6pt; | |
margin-bottom: 6pt; | |
border: solid; | |
background-color: #ffffdd; | |
border-width: 0.5pt; | |
} | |
} | |
@if html { | |
p4pseudo { | |
replace: "~ Begin P4PseudoBlock&nl;\ | |
````&nl;&source;&nl;````&nl;\ | |
~ End P4PseudoBlock"; | |
padding: 6pt; | |
margin-top: 6pt; | |
margin-bottom: 6pt; | |
border: solid; | |
background-color: #e9fce9; | |
border-width: 0.5pt; | |
} | |
} | |
@if tex { | |
p4pseudo { | |
replace: "~ Begin P4PseudoBlock&nl;\ | |
````&nl;&source;&nl;````&nl;\ | |
~ End P4PseudoBlock"; | |
breakable : true; | |
padding: 6pt; | |
margin-top: 6pt; | |
margin-bottom: 6pt; | |
background-color: #e9fce9; | |
border: solid; | |
border-width: 0.5pt; | |
} | |
} | |
@if html { | |
p4grammar { | |
replace: "~ Begin P4GrammarBlock&nl;\ | |
````&nl;&source;&nl;````&nl;\ | |
~ End P4GrammarBlock"; | |
border: solid; | |
margin-top: 6pt; | |
margin-bottom: 6pt; | |
padding: 6pt; | |
background-color: #e6ffff; | |
border-width: 0.5pt; | |
} | |
} | |
@if tex { | |
p4grammar { | |
replace: "~ Begin P4GrammarBlock&nl;\ | |
````&nl;&source;&nl;````&nl;\ | |
~ End P4GrammarBlock"; | |
breakable: true; | |
margin-top: 6pt; | |
margin-bottom: 6pt; | |
padding: 6pt; | |
background-color: #e6ffff; | |
border: solid; | |
border-width: 0.5pt; | |
} | |
} | |
[TITLE] | |
[]{tex-cmd: "\newpage"} | |
[]{tex-cmd: "\fancyfoot[L]{&date; &time;}"} | |
[]{tex-cmd: "\fancyfoot[C]{In-band Network Telemetry}"} | |
[]{tex-cmd: "\fancyfoot[R]{\thepage}"} | |
[]{tex-cmd: "\pagestyle{fancy}"} | |
[]{tex-cmd: "\sloppy"} | |
[TOC] | |
# Introduction | |
In-band Network Telemetry (“INT”) is a framework designed to allow the
collection and reporting of network state, by the data plane, without requiring | |
intervention or work by the control plane in collecting and delivering | |
the state from the data plane. In the INT architectural model, | |
packets may contain header fields that are interpreted as “telemetry | |
instructions” by network devices. | |
INT traffic sources (applications, end-host networking stacks, | |
hypervisors, NICs, send-side ToRs, etc.) can embed the instructions either in | |
normal data packets, cloned copies of the data packets or in special probe packets. | |
Alternatively, the instructions may be | |
programmed in the network data plane to match on particular network flows | |
and to execute the instructions on the matched flows. | |
These instructions tell an INT-capable device what state to collect. | |
The network state information may be directly exported by the data plane | |
to the telemetry monitoring system, or can be | |
written into the packet as it traverses the | |
network. When the information is embedded in the packets, INT traffic sinks | |
retrieve (and optionally report) the collected results of these instructions, | |
allowing the traffic sinks to monitor the exact data plane state that the | |
packets “observed” while being forwarded. | |
Some examples of traffic sink behavior are described below: | |
* OAM – the traffic sink[^Transit] might simply collect the encoded network state, then | |
export that state to an external controller. This export could be in a raw | |
format, or could be combined with basic processing (such as compression, | |
deduplication, truncation). | |
* Real-time control or feedback loops – traffic sinks might use the encoded | |
data plane information to feed back control information to traffic sources, | |
which could in turn use this information to make changes to traffic engineering | |
or packet forwarding. (Explicit congestion notification schemes are an example | |
of these types of feedback loops). | |
* Network Event Detection – if the collected path state indicates a condition
that requires immediate attention or resolution (such as severe congestion or
violation of certain data-plane invariants), the traffic sinks[^Transit] could generate
immediate actions to respond to the network events, forming a feedback control | |
loop either in a centralized or a fully decentralized fashion (a la TCP). | |
[]{tex-cmd: "\newpage"} | |
The INT architectural model is intended to be generic and enables a | |
number of interesting high level applications, such as: | |
* Network troubleshooting and performance monitoring | |
- Traceroute, micro-burst detection, packet history (a.k.a. postcards[^Postcard]) | |
* Advanced congestion control | |
* Advanced routing | |
- Utilization-aware routing (For example, HULA[^HULA], CLOVE[^CLOVE]) | |
* Network data plane verification | |
Several use cases are described and evaluated in the Millions
of Little Minions paper [^Minions].
[^Transit]: While this is commonly done by Sink nodes, Transit nodes may also generate OAM reports or carry out Network Event Detection.
[^Postcard]: I Know What Your Packet Did Last Hop: Using Packet Histories to Troubleshoot Networks, USENIX NSDI 2014. | |
[^HULA]: HULA: Scalable Load Balancing Using Programmable Data Planes, ACM SOSR 2016 | |
[^CLOVE]: CLOVE: Congestion-Aware Load Balancing at the Virtual Edge, ACM CoNEXT 2017 | |
[^Minions]: Millions of Little Minions: Using Packets for Low Latency Network Programming and Visibility, ACM SIGCOMM 2014. | |
# Terminology | |
* **Monitoring System**: | |
: A system that collects telemetry data sent from different network devices. | |
The monitoring system components may be physically distributed but logically | |
centralized. | |
* **INT Header**: | |
: A packet header that carries INT information. There are three types of INT | |
Headers -- *eMbed data (MD-type)*, *eMbed instruction (MX-type)* and | |
*Destination-type* (See Section [#sec-int-header-types]). | |
* **INT Packet**: | |
: A packet containing an INT Header. | |
* **INT Node**: | |
: An INT-capable network device that participates in the INT data plane by | |
regularly carrying out at least one of the following: inserting, adding to, | |
removing, or processing instructions from INT Headers in INT packets. | |
Depending on deployment scenarios, examples of INT Nodes may include devices | |
such as routers, switches, and NICs. | |
* **INT Instruction**: | |
: Instructions indicating which INT Metadata (defined below) to collect at | |
each INT node. The instructions are either configured at each INT-capable | |
node's Flow Watchlist or written into the INT Header. | |
* **Flow Watchlist**: | |
: A dataplane table that matches on packet headers and inserts or applies | |
INT instructions on each matched flow. A flow is a set of packets having the | |
same values on the selected header fields. | |
* **INT Source**: | |
: A trusted entity that creates and inserts INT Headers into the packets it | |
sends. A Flow Watchlist is configured to select the flows in which INT | |
headers are to be inserted. | |
* **INT Sink**: | |
: A trusted entity that extracts the INT Headers and collects the path state | |
contained in the INT Headers. The INT Sink is responsible for removing INT | |
Headers so as to make INT transparent to upper layers. (Note that this does | |
not preclude having nested or hierarchical INT domains.) The INT Sink can | |
decide to send the collected information to the monitoring system. | |
* **INT Transit Hop**: | |
: A trusted entity that collects metadata from the data plane by following | |
the INT Instructions. Based on the instructions, the data may be directly | |
exported to the telemetry monitoring system or embedded into the INT Header | |
of the packet. | |
Note that one physical device | |
may play multiple roles -- INT Source, Transit, Sink -- at the same time | |
for the same or different flows. For example, an INT Source node may | |
embed its own metadata into the packet, playing the roles of INT Transit as well. | |
* **INT Metadata**: | |
: Information that an INT Source or an INT Transit Hop node inserts into the | |
INT Header, or into a telemetry report. Examples of metadata are described | |
in Section [#what-to-monitor]. | |
* **INT Domain**: | |
: A set of inter-connected INT nodes under the same administration. This | |
specification defines the behavior and packet header formats for | |
interoperability between INT nodes from different vendors in an INT | |
domain. The INT nodes within the same domain must be configured in a | |
consistent way to ensure interoperability between the nodes. Operators of | |
an INT domain should deploy INT Sink capability at domain edges to prevent | |
INT information from leaking out of the domain. | |
# INT Modes of Operation | |
Since INT was first introduced at P4.org in 2015, a number of variations of INT
have evolved and been discussed in IETF and industry communities.
Also, the term 'INT' has been used to broadly indicate data plane telemetry in
general, not limited to the original classic INT where both instructions and | |
metadata are embedded in the data packets. Hence we define different modes of | |
INT operation based on the degree of packet modifications, i.e., what to | |
embed in the packets. | |
The different modes of operation are described in detail below, and summarized | |
in Figure [#fig-int-modes]. | |
## INT Application Modes | |
Original data packets are monitored and may be modified | |
to carry INT instructions and metadata. | |
There are three variations based on the level of packet modifications. | |
* **INT-XD** (eXport Data): INT nodes directly export metadata from their | |
dataplane to the monitoring system based on the INT instructions configured | |
at their Flow Watchlists. No packet modification is needed.
This mode was also known as "Postcard" mode in the previous versions of | |
the Telemetry Report spec, originally inspired by [^Postcard]. | |
* **INT-MX** (eMbed instruct(X)ions): The INT Source node embeds INT | |
instructions in the packet header, then the INT Source, each INT Transit, and | |
the INT sink directly send the metadata to the monitoring system by following | |
the instructions embedded in the packets. The INT Sink node strips the | |
instruction header before forwarding the packet to the receiver. Packet
modification is limited to the instruction header, so the packet size does not
grow as the packet traverses more Transit nodes.
INT-MX also supports 'source-inserted' metadata as part of Domain Specific | |
Instructions. This allows the INT Source to embed additional metadata that | |
other nodes or the monitoring system can consume. | |
This mode is inspired by IOAM's "Direct Export" [^IOAM] [^IOAM-DEX]. | |
* **INT-MD** (eMbed Data): In this mode both INT instructions and metadata are | |
written into the packets. This is the classic hop-by-hop INT where 1) INT | |
Source embeds instructions, 2) INT Source & Transit embed metadata, and 3) INT | |
Sink strips the instructions and aggregated metadata out of the packet and | |
(selectively) sends the data to the monitoring system. | |
This mode modifies the packet the most, but it minimizes the overhead
at the monitoring system of collating reports from multiple INT nodes.
Since v2.0, INT-MD mode supports 'source-only' metadata as part of Domain | |
Specific Instructions. This allows the INT Source to embed additional | |
metadata for the INT Sink or the monitoring system to consume. | |
**NOTE: the rest of the spec assumes INT-MD as the default mode, unless
specified otherwise.**
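
The three modes differ mainly in what is written into the packet and in which nodes export data to the monitoring system. The following Python sketch tabulates those differences as described above. The names `IntMode`, `ModeProperties` and `MODE_PROPERTIES` are illustrative only and are not defined by this specification.

```python
from dataclasses import dataclass
from enum import Enum


class IntMode(Enum):
    XD = "INT-XD"  # eXport Data
    MX = "INT-MX"  # eMbed instruct(X)ions
    MD = "INT-MD"  # eMbed Data


@dataclass(frozen=True)
class ModeProperties:
    instructions_in_packet: bool  # INT instructions are carried in the packet
    metadata_in_packet: bool      # per-hop metadata is carried in the packet
    packet_grows_per_hop: bool    # packet size grows at each node that adds metadata
    every_node_exports: bool      # each INT node reports directly to the monitoring system


# Summary of the mode descriptions in this section (illustrative table).
MODE_PROPERTIES = {
    IntMode.XD: ModeProperties(False, False, False, True),
    IntMode.MX: ModeProperties(True, False, False, True),
    # In INT-MD the Sink (selectively) reports the aggregated metadata.
    IntMode.MD: ModeProperties(True, True, True, False),
}
```

For example, `MODE_PROPERTIES[IntMode.MX].packet_grows_per_hop` is `False`, reflecting that INT-MX only adds the instruction header at the Source.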
[^IOAM]: Data Fields for In-situ OAM, [draft-ietf-ippm-ioam-data-09](https://tools.ietf.org/html/draft-ietf-ippm-ioam-data-09), March 2020. | |
[^IOAM-DEX]: In-situ OAM Direct Exporting, [draft-ietf-ippm-ioam-direct-export-00](https://tools.ietf.org/html/draft-ietf-ippm-ioam-direct-export-00), February 2020. | |
~ Figure { #fig-int-modes; caption: "Various modes of INT operation." } | |
![int-modes] | |
~ | |
[int-modes]: images/INT_modes.pdf { width: 5.5in } | |
## INT Applied to Synthetic Traffic | |
INT Source nodes may generate INT-marked synthetic traffic either by | |
cloning original data packets or by generating special probe packets. | |
INT is applied to this traffic by transit nodes in exactly the same way
as to all other traffic.
The only difference between live traffic and synthetic traffic is that
INT Sink nodes may need to discard synthetic traffic after extracting | |
the collected INT data as opposed to forwarding the traffic. This is | |
indicated by using the 'D' bit of the INT Header to mark relevant packets | |
as being copies/clones or probes, to be 'D'iscarded at the INT Sink. | |
All INT modes may be used on these synthetic/probe packets, as decided by the INT Source node. | |
Specifically the **INT-MD** (eMbed Data) mode applied to Synthetic or probe packets allows | |
functionality similar to **IFA**[^IFA]. | |
It is likely that synthetic traffic created by cloning would be discarded at the Sink, | |
while Probe packets might be marked for forwarding or discarding, depending on the | |
use-case. It is the responsibility of the INT Source node to mark packets correctly | |
to determine if the INT Sink will forward or discard packets after extracting the | |
INT Data collected along the path. | |
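
The Sink-side handling of the 'D' bit described above can be sketched as follows. This is a minimal Python illustration; the function name and the returned strings are hypothetical, and the extraction and reporting of INT data are assumed to happen elsewhere.

```python
def sink_disposition(has_int_header: bool, d_bit: bool) -> str:
    """Return what an INT Sink does with a packet after extracting INT data."""
    if not has_int_header:
        return "forward"            # not an INT packet, nothing to strip
    if d_bit:
        return "discard"            # clone or probe marked by the INT Source
    return "strip-and-forward"      # live traffic: remove INT headers, then forward
```

The decision depends only on the marking applied by the INT Source, which is why the Source is responsible for setting the bit correctly.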
[^IFA]: Inband Flow Analyzer, [draft-kumar-ippm-ifa-02](https://tools.ietf.org/html/draft-kumar-ippm-ifa-02), April 2020. | |
# What To Monitor { #what-to-monitor } | |
In theory, one may be able to define and collect any device-internal | |
information using the INT approach. In practice, however, it seems useful to | |
define a small baseline set of metadata that can be made available on a wide | |
variety of devices: the metadata listed in this section comprises such a set. | |
As the INT specification evolves, we expect to add more metadata to this | |
INT specification. | |
The exact meaning of the following metadata (e.g., the unit of timestamp | |
values, the precise definition of hop latency, queue occupancy or buffer | |
occupancy) can vary from one device to another for any number of reasons, | |
including the heterogeneity of device architecture, feature sets, | |
resource limits, etc. Thus, defining the exact meaning of each metadata is | |
beyond the scope of this document. Instead we assume that the semantics of | |
metadata for each device model used in a deployment is communicated to
the entities interpreting/analyzing the reported data in an out-of-band fashion.
## Device-level Information | |
* Node id | |
: The unique ID of an INT node. | |
This is generally administratively assigned. Node IDs must be unique | |
within an INT domain. | |
## Ingress Information | |
* Ingress interface identifier | |
: The interface on which the INT packet was received. A packet may be received | |
on an arbitrary stack of interface constructs starting with a physical port. | |
For example, a packet may be received on a physical port that belongs to | |
a link aggregation port group, which in turn is part of a Layer 3 Switched | |
Virtual Interface, and at Layer 3 the packet may be received in a tunnel. | |
Although the entire interface stack may be monitored in theory, this | |
specification allows for monitoring of up to two levels of ingress interface | |
identifiers. The first level of ingress interface identifier would typically | |
be used to monitor the physical port on which the packet was received, hence | |
a 16-bit field (half of a 4-Byte metadata) is deemed adequate. The second | |
level of ingress interface identifier occupies a full 4-Byte metadata field, | |
which may be used to monitor a logical interface on which the packet was | |
received. A 32-bit space at the second level allows for an adequately large | |
number of logical interfaces at each network element. The semantics of | |
interface identifiers may differ across devices; each INT hop chooses the
interface type it reports at each of the two levels.
* Ingress timestamp | |
: The device local time when the INT packet was received on the ingress | |
physical or logical port. | |
## Egress Information | |
* Egress interface identifier | |
: The interface on which the INT packet was sent out. A packet may | |
be transmitted on an arbitrary stack of interface constructs ending at a | |
physical port. For example, a packet may be transmitted on a tunnel, | |
out of a Layer 3 Switched Virtual Interface, on a Link Aggregation Group, | |
out of a particular physical port belonging to the Link Aggregation Group. | |
Although the entire interface stack may be monitored in theory, this | |
specification allows for monitoring of up to two levels of egress interface | |
identifiers. The first level of egress interface identifier would typically | |
be used to monitor the physical port on which the packet was transmitted, | |
hence a 16-bit field (half of a 4-Byte metadata) is deemed adequate. | |
The second level of egress interface identifier occupies a full | |
4-Byte metadata field, which may be used to monitor a logical interface on | |
which the packet was transmitted. A 32-bit space at the second level | |
allows for an adequately large number of logical interfaces at each network | |
element. The semantics of interface identifiers may differ across devices;
each INT hop chooses the interface type it reports at each of the two levels.
* Egress timestamp | |
: The device local time when the INT packet was processed by the egress | |
physical or logical port. | |
* Hop latency | |
: Time taken for the INT packet to be switched within the device. | |
* Egress interface TX Link utilization | |
: Current utilization of the egress interface via which the INT packet was | |
sent out. Again, devices can use different mechanisms to keep track of the | |
current rate, such as bin bucketing or moving average. While the latter is | |
clearly superior to the former, the INT framework does not stipulate the | |
mechanics and simply leaves those decisions to device vendors. | |
* Queue occupancy | |
: The build-up of traffic in the queue (in bytes, cells, or packets) that the
INT packet observes in the device while being forwarded. The format of this | |
4-octet metadata field is implementation specific and the metadata semantics | |
YANG model shall describe the format and units of this metadata field in the | |
metadata stack. | |
* Buffer occupancy | |
: The build-up of traffic in the buffer (in bytes, cells, or packets) that the | |
INT packet observes in the device while being forwarded. This is useful when
the buffer is shared between multiple queues. The format of this 4-octet
metadata field is implementation specific and the metadata semantics YANG | |
model shall describe the format and units of this metadata field in the | |
metadata stack. | |
A metadata semantics YANG model [^metadata-yang] is being developed that allows | |
nodes to report details of the metadata format, units, and semantics. | |
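
As an illustration of the interface identifier sizes discussed above, the following Python sketch packs a 16-bit level-1 identifier into half of a 4-byte metadata word and a level-2 identifier into a full 4-byte word. The placement of the ingress identifier in the upper half and the egress identifier in the lower half is an assumption made for this sketch, as are the function names.

```python
import struct


def pack_level1_interface_ids(ingress_id: int, egress_id: int) -> bytes:
    """Pack two 16-bit level-1 interface IDs into one 4-byte word.

    Assumed layout (for illustration only): ingress in the upper 16 bits,
    egress in the lower 16 bits, network byte order.
    """
    for value in (ingress_id, egress_id):
        if not 0 <= value <= 0xFFFF:
            raise ValueError("level-1 interface ID must fit in 16 bits")
    return struct.pack("!HH", ingress_id, egress_id)


def pack_level2_interface_id(interface_id: int) -> bytes:
    """Pack one 32-bit level-2 (logical) interface ID into a 4-byte word."""
    if not 0 <= interface_id <= 0xFFFFFFFF:
        raise ValueError("level-2 interface ID must fit in 32 bits")
    return struct.pack("!I", interface_id)
```

What each identifier denotes (physical port, LAG, tunnel, and so on) remains device specific and is expected to be described out of band.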
[^metadata-yang]: p4-dtel-metadata-semantics, [https://github.com/p4lang/p4-applications/blob/master/telemetry/code/models/p4-dtel-metadata-semantics.yang](https://github.com/p4lang/p4-applications/blob/master/telemetry/code/models/p4-dtel-metadata-semantics.yang) | |
[]{tex-cmd: "\newpage"} | |
# INT Headers | |
This section specifies the format and location of INT Headers. | |
INT Headers and their locations are relevant for INT-MX and INT-MD modes where | |
the INT instructions (and metadata stack in case of MD mode) are written into | |
the packets. | |
## INT Header Types | |
There are three types of INT Headers: MD-type, MX-type and Destination-type. | |
A given INT packet may carry either an MD-type or an MX-type header, and/or
a Destination-type header. When both a Destination-type and an MD-type or MX-type
headers are present, the MD-type header or MX-type header must precede the | |
Destination-type header. | |
* MD-type (**INT Header type 1**) | |
- Intermediate nodes (INT Transit Hops) must process this type of INT | |
Header. The format of this header is defined in Section | |
[#sec-int-md-metadata-header-format]. | |
* Destination-type (**INT Header type 2**) | |
- Destination headers must only be consumed by the INT Sink. Intermediate | |
nodes must ignore Destination headers. | |
- Destination headers can be used to enable Edge-to-Edge communication between | |
the INT Source and INT Sink. For example: | |
- INT Source can add a sequence number to detect loss of INT packets. | |
- INT Source can add the original values of IP TTL and INT Remaining | |
Hop Count, thus enabling the INT sink to detect network devices | |
on the path that do not support INT by comparing the IP TTL | |
decrement against INT Remaining Hop Count decrement (assuming each | |
network device is an L3 hop) | |
- The format of Destination-type headers will be defined in a future | |
revision. Note some Edge-to-Edge INT use cases can be supported by | |
'source-only' and 'source-inserted' metadata, part of Domain Specific | |
Instructions in the MD-type and MX-type headers. | |
* MX-type (**INT Header type 3**) | |
- Intermediate nodes (INT Transit Hops) must process this type of INT | |
Header and generate reports to the monitoring system as instructed. | |
The format of this header is defined in Section [#int-mx-header]. | |
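
The ordering rule for header types can be checked with a few lines of Python. This sketch assumes that a packet carries at most one header of each type, which is one reading of the text above; the constant and function names are illustrative.

```python
MD_TYPE = 1           # INT Header type 1
DESTINATION_TYPE = 2  # INT Header type 2
MX_TYPE = 3           # INT Header type 3


def header_sequence_is_valid(header_types: list) -> bool:
    """Check a packet's INT header types, listed in on-wire order."""
    if not header_types:
        return False  # an INT packet contains at least one INT Header
    if any(t not in (MD_TYPE, DESTINATION_TYPE, MX_TYPE) for t in header_types):
        return False
    hop_by_hop = [t for t in header_types if t in (MD_TYPE, MX_TYPE)]
    if len(hop_by_hop) > 1:
        return False  # either MD-type or MX-type, not both (assumed: not repeated)
    if header_types.count(DESTINATION_TYPE) > 1:
        return False  # assumed: at most one Destination-type header
    if hop_by_hop and DESTINATION_TYPE in header_types:
        # The MD-type or MX-type header must precede the Destination-type header.
        return header_types.index(hop_by_hop[0]) < header_types.index(DESTINATION_TYPE)
    return True
```

For instance, `[MD_TYPE, DESTINATION_TYPE]` is accepted, while `[DESTINATION_TYPE, MD_TYPE]` and `[MD_TYPE, MX_TYPE]` are rejected.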
## Per-Hop Header Operations | |
### INT Source Node | |
In the INT-MD and INT-MX modes, the INT Source node in the packet forwarding | |
path creates the INT-MD or INT-MX Header. | |
In INT-MD, the source node adds its own INT metadata after the header. To avoid
exhausting header space in the case of a forwarding loop or any other anomalies, | |
it is strongly recommended to limit the number of total INT metadata fields added | |
by Transit Hop nodes by setting the *Remaining Hop Count* field in INT header | |
appropriately. | |
The INT-MD and INT-MX headers are described in detail in the subsequent | |
sections. | |
### INT Transit Hop Node | |
In the INT-MD mode, each node in the packet forwarding path creates additional | |
space in the INT-MD Header on-demand to add its own INT metadata. To avoid | |
exhausting header space in the case of a forwarding loop or any other anomalies, | |
each INT Transit Hop must decrement the *Remaining Hop Count* field in the INT | |
header appropriately. | |
In the INT-MX mode, each node in the packet forwarding path follows the | |
instructions in the INT-MX Header, gathers the device specific metadata and
exports the device metadata using the Telemetry Report. | |
INT Transit Hop nodes may update the *DS Flags* field in the INT-MD or INT-MX | |
header. The *Hop ML*, *Instruction Bitmap*, *Domain Specific ID* and | |
*DS Instruction* fields must not be modified by Transit Hop nodes. | |
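
The restriction on which fields a Transit Hop may change can be expressed as a simple before/after comparison. The Python sketch below is illustrative: the `IntHeaderFields` class and its attribute names are hypothetical stand-ins for the header fields named above, not the on-wire format.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class IntHeaderFields:
    hop_ml: int
    remaining_hop_count: int
    instruction_bitmap: int
    domain_specific_id: int
    ds_instruction: int
    ds_flags: int


# Fields that Transit Hop nodes must not modify.
IMMUTABLE_AT_TRANSIT = (
    "hop_ml",
    "instruction_bitmap",
    "domain_specific_id",
    "ds_instruction",
)


def transit_preserved_immutable_fields(before: IntHeaderFields,
                                       after: IntHeaderFields) -> bool:
    """Return True if a Transit Hop left all immutable fields unchanged."""
    return all(getattr(before, name) == getattr(after, name)
               for name in IMMUTABLE_AT_TRANSIT)
```

A check of this kind could be used in a test bench: a transit update built with `replace(header, ds_flags=..., remaining_hop_count=...)` passes, while one that alters `hop_ml` does not.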
### INT Sink Node | |
In INT-MD mode, the INT Sink node removes the INT Headers and Metadata stack from | |
the packet, and decides whether to report the collected information. | |
In INT-MX mode, the INT Sink node removes the INT-MX header, gathers the | |
device specific metadata and decides whether to report that metadata. | |
## MTU Settings | |
In both INT-MX and INT-MD modes, it is possible that insertion of the INT header | |
at the INT Source node may cause the egress link MTU to be exceeded. | |
In INT-MD mode, as each hop creates additional space in the INT header | |
to add its metadata, the packet size increases. This can potentially | |
cause egress link MTU to be exceeded at an INT node. | |
This may be addressed in the following ways:
* It is recommended that the MTU of links between INT sources and sinks be | |
configured to a value higher than the MTU of preceding links (server/VM NIC | |
MTUs) by an appropriate amount. Configuring an MTU differential of | |
[Per-hop Metadata Length\*4\*INT Hop Count + Fixed INT Header Length] bytes | |
(just [Fixed INT Header Length] for INT-MX mode), | |
based on conservative values of total number of INT hops and Per-hop | |
Metadata Length, will prevent egress MTU being exceeded due to INT metadata | |
insertion at INT hops. The Fixed INT Header Length is the sum of INT metadata | |
header length (12B) and the size of encapsulation-specific shim/option header | |
(4B) as defined in Section [#sec-header-location]. | |
* An INT source/transit node may optionally participate in dynamic discovery | |
of Path MTU for flows being monitored by INT by transmitting an ICMP message to
the traffic source as per Path MTU Discovery mechanisms of the corresponding | |
L3 protocol (RFC 1191 for IPv4, RFC 1981 for IPv6). An INT source or transit | |
node may report a conservative MTU in the ICMP message, assuming that the | |
packet will go through the maximum number of allowed INT hops (i.e. *Remaining | |
Hop Count* will decrement to zero), accounting for cumulative metadata | |
insertion at all INT hops, and assuming that the egress MTU at all downstream | |
INT hops is the same as its own egress link MTU. This will help the path | |
MTU discovery source to converge to a path MTU estimate | |
faster, although this would be a conservative path MTU estimate. | |
Alternatively, each INT hop may report an MTU only accounting for the metadata | |
it inserts. This would enable the path MTU discovery source to converge to a
precise path MTU, at the cost of receiving more ICMP messages, one from each | |
INT hop. | |
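
The MTU differential from the first recommendation above can be computed directly. The Python sketch below uses the 12-byte INT metadata header and 4-byte shim/option header lengths stated in that item; the function name and the example values (8 words of per-hop metadata, 6 INT hops) are chosen only for illustration.

```python
INT_METADATA_HEADER_LEN = 12  # bytes, INT metadata header
SHIM_HEADER_LEN = 4           # bytes, encapsulation-specific shim/option header
FIXED_INT_HEADER_LEN = INT_METADATA_HEADER_LEN + SHIM_HEADER_LEN


def mtu_differential(mode: str,
                     per_hop_metadata_len_words: int = 0,
                     int_hop_count: int = 0) -> int:
    """Extra MTU (in bytes) to provision on links between INT sources and sinks.

    Per-hop Metadata Length is expressed in 4-byte words.
    """
    if mode == "INT-MX":
        # Only the instruction header is added; the packet does not grow per hop.
        return FIXED_INT_HEADER_LEN
    if mode == "INT-MD":
        return per_hop_metadata_len_words * 4 * int_hop_count + FIXED_INT_HEADER_LEN
    raise ValueError(f"unsupported mode: {mode}")


# Example: 8 words per hop over at most 6 INT hops.
print(mtu_differential("INT-MD", per_hop_metadata_len_words=8, int_hop_count=6))  # 208
print(mtu_differential("INT-MX"))  # 16
```

With these example values, links inside the INT domain would need an MTU at least 208 bytes larger than the server-facing MTU in INT-MD mode, and 16 bytes larger in INT-MX mode.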
Regardless of whether or not an INT transit node participates in Path MTU | |
discovery, if it cannot insert all requested metadata because doing so will | |
cause the packet length to exceed egress link MTU, it must either: | |
- not insert any metadata and set the M bit in the INT header, indicating | |
that egress MTU was exceeded at an INT hop, or | |
- report the metadata stack collected from previous hops (setting the
Intermediate Report bit if a Telemetry Report 2.0 [^telem-report] packet is
generated) and remove the reported metadata stack from the packet. The
metadata from this transit hop is then either included in the report or
embedded in the INT-MD metadata header.
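
The first of the two options above (skip insertion and set the M bit) can be sketched as follows. This Python fragment is a simplified model that tracks only the packet length and the M bit; the function name and return convention are hypothetical, and the second option (intermediate reporting) is not modelled.

```python
def transit_embed_or_mark(packet_len: int,
                          hop_ml_words: int,
                          egress_mtu: int,
                          m_bit_in: bool = False):
    """Return (new_packet_len, m_bit) after an INT-MD transit hop.

    hop_ml_words is the Per-hop Metadata Length in 4-byte words.
    """
    metadata_len = 4 * hop_ml_words
    if packet_len + metadata_len <= egress_mtu:
        # Metadata fits: embed it and leave the M bit as received.
        return packet_len + metadata_len, m_bit_in
    # Metadata does not fit: insert nothing and set the M bit.
    return packet_len, True
```

An M bit that was already set by an upstream hop is preserved in either branch.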
An INT source inserts 12 bytes of fixed INT headers, and may also insert | |
Per-hop Metadata Length\*4 bytes of its own metadata. If inserting the | |
fixed headers causes egress link MTU to be exceeded, INT cannot be
initiated for such packets. If an INT source is programmed to insert | |
its own INT metadata, and there is enough room in a packet to insert fixed INT | |
headers, but no additional room for its INT metadata, the source must | |
initiate INT and set the M bit in the INT header. | |
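
The three outcomes at an INT source described above can be sketched in Python as follows. The 12-byte constant is taken from the preceding paragraph; any encapsulation-specific shim would add to it. The function name and returned labels are illustrative only.

```python
FIXED_INT_HEADER_LEN = 12  # bytes of fixed INT headers, per the paragraph above


def source_decision(packet_len: int,
                    hop_ml_words: int,
                    egress_mtu: int,
                    inserts_own_metadata: bool = True) -> str:
    """Decide how an INT source handles a packet given the egress link MTU."""
    with_headers = packet_len + FIXED_INT_HEADER_LEN
    if with_headers > egress_mtu:
        return "no-int"  # INT cannot be initiated for this packet
    if not inserts_own_metadata:
        return "int-headers-only"
    if with_headers + 4 * hop_ml_words > egress_mtu:
        # Room for the fixed headers but not for the source's own metadata.
        return "int-headers-only-m-bit"
    return "int-with-metadata"
```

With an egress MTU of 1500 bytes and 8 words of per-hop metadata, a 1490-byte packet is left untouched, a 1480-byte packet gets the fixed headers with the M bit set, and a 1400-byte packet gets both headers and source metadata.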
In theory, an INT transit node can perform IPv4 fragmentation to overcome | |
egress MTU limitation when inserting its metadata. However, IPv4 fragmentation | |
can have adverse impact on applications. Moreover, IPv6 packets cannot be | |
fragmented at intermediate hops. Also, fragmenting packets at INT transit hops,
with or without copying preceding INT metadata into fragments, imposes
the extra complexity of correlating fragments in the INT monitoring engine.
Considering all these factors, this specification requires that an INT | |
node must not fragment packets in order to append INT information to | |
the packet. | |
## Congestion Considerations | |
Use of the INT encapsulation should not increase the impact of congestion on | |
the network. While many transport protocols (e.g. TCP, SCTP, DCCP, QUIC) | |
inherently provide congestion control mechanisms, other transport protocols | |
(e.g. UDP) do not. For the latter case, applications may provide congestion | |
control or limit the traffic volume. | |
It is recommended not to apply INT to application traffic that is known not to | |
be congestion controlled (as described in RFC 8085 [^RFC8085] Section 3.1.11). | |
In order to achieve this, packet filtering mechanisms such as access control | |
lists should be provided, with match criteria including IP protocol and L4 | |
ports. | |
Because INT encapsulation endpoints are located within the same administrative | |
domain, an operator may allow for INT encapsulation of traffic that is known | |
not to be congestion controlled. In this case, the operator should carefully | |
consider the potential impact of congestion, and implement appropriate | |
mechanisms for controlling or mitigating the effects of congestion. This | |
includes capacity planning, traffic engineering, rate limiting, and other | |
mechanisms. | |
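
A packet filter of the kind recommended above, matching on IP protocol and L4 port, might look like the following Python sketch. The protocol numbers are the IANA-assigned values for TCP, DCCP, SCTP and UDP; the function name and the operator-supplied UDP port allowlist are assumptions made for illustration.

```python
# IANA IP protocol numbers for transports with built-in congestion control.
CONGESTION_CONTROLLED_PROTOCOLS = {6: "TCP", 33: "DCCP", 132: "SCTP"}
UDP_PROTOCOL = 17


def int_allowed(ip_protocol: int,
                dst_port: int,
                udp_port_allowlist: frozenset = frozenset()) -> bool:
    """Return True if INT may be applied to a flow under this example policy."""
    if ip_protocol in CONGESTION_CONTROLLED_PROTOCOLS:
        return True
    if ip_protocol == UDP_PROTOCOL:
        # UDP itself is not congestion controlled; allow only ports the
        # operator has explicitly listed (for example, ports known to carry
        # a congestion-controlled protocol, or traffic the operator has
        # otherwise accounted for).
        return dst_port in udp_port_allowlist
    return False
```

An operator who knows that UDP port 443 carries QUIC in their network could pass `udp_port_allowlist=frozenset({443})`; by default all UDP traffic is excluded.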
[^RFC8085]: UDP Usage Guidelines, [RFC 8085](https://www.rfc-editor.org/info/rfc8085), March 2017. | |
## INT over any encapsulation | |
The specific location for INT Headers is intentionally not specified: | |
an INT Header can be inserted as an option or payload of any encapsulation | |
type. The only requirements are that the encapsulation header provides | |
sufficient space to carry the INT information and that all INT nodes | |
(Sources, transit hops and Sinks) agree on the location of the INT Headers. | |
The following choices are potential encapsulations using common protocol | |
stacks, although a deployment may choose a different encapsulation format | |
if better suited to their needs and environment. | |
* INT over VXLAN (as VXLAN payload, per GPE extension) | |
* INT over Geneve (as Geneve option) | |
* INT over NSH (as NSH payload) | |
* INT over TCP (as payload) | |
* INT over UDP (as payload) | |
* INT over GRE (as a shim between GRE header and encapsulated payload) | |
## Checksum Update | |
As described above in Section [#sec-int-over-any-encapsulation], INT | |
headers and metadata may be carried in an L4 protocol such as TCP or UDP, | |
or in an encapsulation header that includes an L4 header, such as VXLAN. | |
The checksum field in the TCP or UDP L4 header needs | |
to be updated as INT nodes modify the L4 payload via insertion/removal of | |
INT headers and metadata. However, there are certain exceptions. | |
For example, when UDP is transported over IPv4, it is possible to assign a | |
zero checksum, causing the receiver to ignore the value of the checksum field | |
(as defined in RFC 768). For UDP over IPv6, there are specific use cases in | |
which it is possible to assign a zero Checksum (as defined in RFC 6936). | |
INT source, transit and sink nodes must comply with IETF standards | |
for Layer 4 transport protocols with respect to whether or not Layer 4 | |
checksum is to be updated upon modification of Layer 4 payload. | |
For example, if an INT source/transit/sink hop receives
UDP traffic with a zero L4 checksum, it must leave the L4 checksum unchanged,
in conformance with the behavior defined in relevant IETF standards
such as RFC 768 and RFC 6936.
When L4 checksum update is required, an INT source/transit node may update | |
the checksum in one of two ways: | |
* Update the L4 Checksum field such that the new value is equal to the checksum | |
of the new packet, after the INT-related updates (header additions/removals, | |
field updates), or | |
* If the INT source indicates that Checksum-neutral updates are allowed by | |
setting an instruction bit corresponding to the Checksum Complement metadata, | |
then the INT source/transit nodes may assign a value to the Checksum | |
Complement metadata which guarantees that the existing L4 Checksum is the | |
correct value of the packet after the INT-related updates. | |
The motivation for the Checksum Complement is that some hardware implementations
process data packets in a serial order, which may pose a problem when INT
fields and metadata that reside after the L4 Checksum field are inserted or
modified. Therefore, the Checksum Complement metadata, if present, is the last
metadata field in the stack.
Note that when the Checksum Complement metadata is present, source/transit
nodes may choose to update the L4 Checksum field instead of using the
Checksum Complement metadata. In this case the Checksum Complement metadata
must be assigned the reserved value 0xFFFFFFFF. A host that verifies the
L4 Checksum will be unaffected by whether some or all of the nodes chose
not to use the Checksum Complement, since the value of the L4 Checksum
matches the checksum of the payload in either case.
An INT sink cannot perform a Checksum-neutral update using Checksum Complement
metadata, as it removes all INT headers from the packet. Thus, an INT sink
that performs a checksum update has to do so by updating the L4 Checksum
field.
Regardless of whether checksum update is performed via modifying the L4 | |
checksum field or via use of Checksum Complement metadata, performing the | |
update based on an incremental checksum calculation (as is typically done) | |
will ensure that any potential corruption is detected at the point of | |
checksum validation. If full checksum computation is performed at an INT | |
node, it should be preceded by checksum validation so as to not mask out any | |
corruption at preceding hops. | |
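
The two update methods above can be illustrated with a minimal Python sketch of the 16-bit ones' complement arithmetic used by TCP and UDP checksums. The function names are hypothetical, the inserted bytes are assumed to be 16-bit aligned, and the sketch only covers the bytes that an INT node inserts; a real node must also account for any L4 or pseudo-header length fields that change. How the 16-bit complement is laid out inside the 4-byte Checksum Complement metadata is not described in this section, so the sketch returns only the 16-bit value.

```python
def ones_complement_sum(data: bytes) -> int:
    """Return the folded 16-bit ones' complement sum of data."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return total


def internet_checksum(data: bytes) -> int:
    """Full checksum computation over all covered bytes."""
    return ~ones_complement_sum(data) & 0xFFFF


def incremental_update(old_checksum: int, inserted: bytes) -> int:
    """Method 1: derive the new checksum from the old one and the inserted bytes."""
    total = (~old_checksum & 0xFFFF) + ones_complement_sum(inserted)
    total = (total & 0xFFFF) + (total >> 16)  # a single fold is enough here
    return ~total & 0xFFFF


def checksum_complement(inserted_without_complement: bytes) -> int:
    """Method 2: 16-bit value that cancels the contribution of the inserted bytes."""
    return ~ones_complement_sum(inserted_without_complement) & 0xFFFF
```

Because the incremental form starts from the received checksum, any corruption introduced upstream is carried forward and still detected at the point of validation, which is the property the paragraph above relies on.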
## Header Location | |
We describe four encapsulation formats in this specification, covering | |
different deployment scenarios, with and without network virtualization: | |
1. *INT over IPv4/GRE* - INT headers are carried between the GRE header and the | |
encapsulated GRE payload. | |
2. *INT over TCP/UDP* - A shim header is inserted following TCP/UDP | |
header. INT Headers are carried between this shim header and TCP/UDP payload. | |
Since v2.0, the spec also supports an option to insert a new UDP header | |
(followed by INT headers) before the existing L4 header. | |
This approach doesn’t rely on any tunneling/virtualization mechanism and is
versatile enough to apply INT to both native and virtualized traffic.
3. *INT over VXLAN* - VXLAN generic protocol extensions [^VXLAN-GPE] are used | |
to carry INT Headers between the VXLAN header and the encapsulated VXLAN | |
payload. | |
4. *INT over Geneve* - Geneve is an extensible tunneling framework, allowing | |
Geneve options to be defined for INT Headers. | |
[^VXLAN-GPE]: Generic Protocol Extension for VXLAN, [draft-ietf-nvo3-vxlan-gpe-09](https://tools.ietf.org/html/draft-ietf-nvo3-vxlan-gpe-09), December 2019. | |
### INT over IPv4/GRE | |
In case the traffic being monitored is not encapsulated by any virtualization
header, INT over VXLAN or INT over Geneve is not helpful. Instead, a GRE
encapsulation as defined in RFC 2784 [^RFC2784] can be utilized. The INT
metadata header and INT metadata follow the GRE header. In an administrative
domain where INT is used, insertion of the INT metadata header and metadata in
GRE is enabled at the INT source and deletion of the INT metadata header
and metadata is enabled at the INT sink by means of configuration.
There are two scenarios when utilizing GRE encapsulation to support INT: | |
1. If the incoming packet at the source node of the INT domain is GRE | |
encapsulated, then the source node should add the INT Metadata Header | |
and Metadata following the GRE header. The sink node of the INT domain | |
should remove the INT Metadata Header and Metadata stack before forwarding | |
the GRE encapsulated packet to the destination. | |
2. If the incoming packet at the source node of the INT domain is not GRE | |
encapsulated, then the source node should add a GRE encapsulation and insert | |
the INT Metadata Header and Metadata following the GRE header. The sink node | |
of the INT domain should remove the GRE encapsulation along with removing | |
the INT Metadata Header and the Metadata stack before forwarding the packet | |
to the destination. | |
IPv4 GRE Option format for carrying INT Header and | |
Metadata: | |
```
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<-+
|C|R|K|S|s|Recur|  Flags  | Ver |    Protocol Type = TBD_INT    |  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  |
|      Checksum (optional)      |       Offset (Optional)       |  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  G
|                        Key (Optional)                         |  R
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  E
|                  Sequence Number (Optional)                   |  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  |
|                      Routing (Optional)                       |  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<-+
| Type  |G| Rsvd|    Length     |         Next Protocol         |  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  I
|                                                               |  N
|   Variable Option Data (INT Metadata Headers and Metadata)    |  T
|                                                               |  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<-+
```
The GRE header and fields are defined in RFC 2784 [^RFC2784]. The GRE Protocol | |
Type value is TBD_INT. | |
The INT Shim header for GRE option is defined as follows: | |
* **Type (4b):** This field indicates the type of INT Header following the shim | |
header. The Type values are defined in Section [#sec-int-header-types]. | |
* **G (1b):** Indicates whether the GRE headers were inserted to transport INT | |
by the INT source. | |
- **0:** Original packet (before insertion of INT headers and metadata) had | |
GRE encapsulation. | |
- **1:** Original packet had no GRE encapsulation, hence the INT source | |
inserted GRE. | |
- This is a hint that helps the INT sink (when it is not the GRE tunnel endpoint)
determine whether to remove the GRE headers as part of INT decapsulation (if G=1).
* **Rsvd (3b):** reserved for future use, set to zero upon transmission and
ignored upon reception.
* **Length (8b):** This is the total length of the INT metadata header and INT
stack, excluding the shim header, in 4-byte words. A non-INT device may read this
field and skip over INT headers.
* **Next Protocol (16b):** this field contains an EtherType value (defined in | |
the IANA registry [^ETYPES]) | |
indicating the type of the protocol following the INT stack. An implementation | |
receiving a packet containing a type value which is not listed in the registry | |
should discard the packet. | |
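
As an illustration of the shim layout described above, the following Python sketch packs and parses the 4-byte INT shim header for GRE. The function names are hypothetical, and the Type value used in the usage note is a placeholder, since the actual Type values are defined in another section.

```python
import struct


def pack_gre_int_shim(int_type: int, gre_inserted: bool,
                      length_words: int, next_protocol: int) -> bytes:
    """Build the shim: Type (4b), G (1b), Rsvd (3b), Length (8b), Next Protocol (16b)."""
    if not (0 <= int_type <= 0xF and 0 <= length_words <= 0xFF
            and 0 <= next_protocol <= 0xFFFF):
        raise ValueError("field out of range")
    first = (int_type << 4) | (int(gre_inserted) << 3)  # Rsvd bits stay zero
    return struct.pack("!BBH", first, length_words, next_protocol)


def unpack_gre_int_shim(shim: bytes) -> dict:
    """Parse the shim; Rsvd bits are ignored upon reception."""
    first, length_words, next_protocol = struct.unpack("!BBH", shim[:4])
    return {
        "type": first >> 4,
        "g": bool(first & 0x08),
        "length_words": length_words,
        "next_protocol": next_protocol,
    }


def bytes_to_skip(shim: bytes) -> int:
    """Bytes a non-INT device skips: the shim itself plus Length 4-byte words."""
    return 4 + 4 * unpack_gre_int_shim(shim)["length_words"]
```

For example, `pack_gre_int_shim(1, True, 5, 0x0800)` describes a packet for which the INT source added the GRE encapsulation (G=1), with 20 bytes of INT data followed by an IPv4 payload (EtherType 0x0800). An INT sink that parses G=1 would remove the GRE header together with the INT headers.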
[^RFC2784]: Generic Routing Encapsulation (GRE), [RFC 2784](https://www.rfc-editor.org/info/rfc2784), March 2000. | |
[^ETYPES]: [IANA Ethernet Numbers](https://www.iana.org/assignments/ieee-802-numbers/ieee-802-numbers.xhtml). | |
### INT over TCP/UDP | |
In case the traffic being monitored is not encapsulated by any virtualization | |
header, one can also put the INT metadata just after layer 4 headers (TCP/UDP). | |
The scheme assumes that the non-INT devices between the INT source and the | |
INT sink either do not parse beyond layer-4 headers or can skip through the | |
INT stack using the Length field in the INT shim header. If TCP has any options, | |
the INT stack may come before or after the TCP options but the decision must | |
be consistent within an INT domain. | |
Note that INT over UDP can be used even when the packet is encapsulated by VXLAN, | |
Geneve, or GUE (Generic UDP Encapsulation). INT over TCP/UDP also makes it | |
easier to add INT stack into outer, inner, or even both layers. In such cases | |
both INT header stacks carry information for respective layers and need not be | |
considered interfering with each other. | |
A field in Ethernet, IP, or TCP/UDP should indicate if the | |
INT header exists after the TCP/UDP header. We propose three options. | |
1. UDP destination port field: a new UDP port number (INT_TBD) will be assigned | |
by IANA to indicate the existence of INT after UDP. This option supports two | |
cases: | |
- The original packet already has UDP header either as user application | |
protocol or as part of another UDP-based encapsulation such as VXLAN, | |
GENEVE, RoCEv2. INT is inserted after the UDP header with the UDP | |
destination port number changed to INT_TBD. | |
The original destination port number | |
is carried in the shim header for the INT sink to restore, when it | |
removes the INT stack from the packet. | |
- A new UDP header for INT is inserted between IP and the existing L4 header. | |
The protocol field of IP header is set to 17 for UDP and the original | |
IP protocol value is carried in the INT shim header. | |
In the new UDP header, INT_TBD is used as the destination port number. | |
It is recommended that the source port number of the new UDP header be | |
calculated using a hash of fields from the original packet, for example | |
the original outer 5 tuple or the original L4 header fields. | |
This is to enable a level of entropy for ECMP/LAG load balancing logic. | |
It is recommended that the checksum in the new UDP header be set to zero. | |
For IPv6 packets, this falls under the case of tunnel protocols, | |
which are allowed to use zero UDP checksums as specified in RFC 6936. | |
The existing L4 header will typically include a checksum computed | |
using the encapsulating IPv6 header fields, thus offering some protection | |
against IPv6 header corruption. | |
In both cases, traffic with INT headers is likely to be hashed | |
to a different path in the network as the new UDP | |
destination port (INT_TBD) becomes part of the outer 5 tuple used by ECMP. | |
The INT shim header for UDP has a field NPT (Next Protocol Type) that
indicates which of the two cases applies to a given INT packet. If
a new UDP header was inserted, the
INT sink must copy the original IP protocol number from the shim header
to the IP header, and strip the newly added UDP header along with all INT headers.
If the original packet already had a UDP header, the INT sink must
restore the original destination port number from the shim header
into the UDP header and strip the INT headers.
2. IPv4 DSCP or IPv6 Traffic Class field: A value or a bit can be used to | |
indicate the existence of INT after TCP/UDP. When the INT source inserts the | |
INT header into a packet, it sets the reserved value in the field or sets the | |
bit. The INT source may write the original DSCP value in the INT headers so | |
that the INT sink can restore the original value. Restoring the original value | |
is optional. | |
- Allocating a bit, as opposed to a value codepoint, will allow the rest of | |
DSCP field to be used for QoS, hence allowing the coexistence of DSCP-based | |
QoS and INT. If the traffic being monitored is subjected to QoS services | |
such as rate limiting, shaping, or differentiated queueing based on DSCP | |
field, QoS classification in the network must be programmed to | |
ignore the designated bit position to ensure that the INT-enabled traffic | |
receives the same treatment as the original traffic being monitored. | |
- In brownfield scenarios, however, the network operator may not find a bit | |
available to allocate for INT but may still have a fragmented space of 32 | |
unused DSCP values. The operator can allocate an INT-enabled DSCP value | |
for every QoS DSCP value, map the INT-enabled DSCP value to the same | |
QoS behavior as the corresponding QoS DSCP value. This may double the | |
number of QoS rules but will allow the co-existence of DSCP-based QoS and | |
INT even when a single DSCP bit is not available for INT. | |
- Within an INT domain, DSCP values used for INT must exclusively be used | |
for INT. INT transit and sink nodes must not receive non-INT packets | |
marked with DSCP values used for INT. Any time a node forwards a packet | |
into the INT domain and there is no INT header present, it must ensure that | |
the DSCP/Traffic class value is not the same as any of the values used | |
to indicate INT. | |
3. Probe Marker fields: If the DSCP field or DSCP values cannot be reserved for INT,
the probe marker option can be used. A specific 64-bit value can be inserted
after the TCP/UDP header to indicate the existence of INT after TCP/UDP.
These fields must be
interpreted as unsigned integer values in network byte order. This approach is
a variation of an early IETF draft with an existing implementation[^DPP].
[^DPP]: Data-plane probe for in-band telemetry collection, [draft-lapukhov-dataplane-probe-01](https://tools.ietf.org/html/draft-lapukhov-dataplane-probe-01), June 2016. | |
INT probe marker for TCP/UDP: | |
``` | |
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       Probe Marker (1)                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       Probe Marker (2)                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
```
``` | |
With arbitrary values being inserted after TCP/UDP header as probe markers, | |
the likelihood of conflicting with user traffic in a data center is | |
low, but cannot be completely eliminated. To further reduce the chance of | |
conflict, a deployment could choose to also examine TCP/UDP port numbers | |
to validate INT probe marker. | |
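
A minimal Python sketch of the probe marker check described above follows. The two marker values and the port set are hypothetical placeholders chosen for illustration; this specification does not fix them, and a deployment would pick its own.

```python
import struct

# Hypothetical deployment-chosen values, not defined by this specification.
PROBE_MARKER_1 = 0xEFBEADDE
PROBE_MARKER_2 = 0x0BADCAFE
INT_PROBE_PORTS = {5000}


def has_int_probe_marker(l4_payload: bytes, dst_port=None) -> bool:
    """Return True if the bytes after the TCP/UDP header start with the probe marker.

    Both 32-bit markers are read as unsigned integers in network byte order.
    If dst_port is given, it must also be in INT_PROBE_PORTS, which reduces
    the chance of a false match against user traffic.
    """
    if len(l4_payload) < 8:
        return False
    marker_1, marker_2 = struct.unpack("!II", l4_payload[:8])
    if (marker_1, marker_2) != (PROBE_MARKER_1, PROBE_MARKER_2):
        return False
    return dst_port is None or dst_port in INT_PROBE_PORTS
```

The optional port check corresponds to the additional validation suggested above; without it, a user payload that happens to begin with the same 64 bits would be treated as INT.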
Any of the above options may be used in an INT domain, provided that the INT
transit and sink nodes in the INT domain comply with the mechanism chosen
at the INT sources, and are able to correctly identify the presence and location
of INT headers. The above approaches are not intended to interoperate in a
mixed environment. For example, it would be incorrect to mark a packet for INT
using both DSCP and probe marker, as INT nodes that only understand
DSCP marking and do not recognize probe markers may incorrectly interpret the
first four bytes of the probe marker as the INT shim header.
It is strongly recommended that only one option be used within an INT domain. | |
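
The two DSCP allocation strategies described under option 2 above can be sketched in Python as follows. The bit position and all names are hypothetical; the sketch only shows how QoS classification can ignore a designated INT bit, and how INT-enabled DSCP values can be paired with QoS DSCP values when no bit is free.

```python
INT_DSCP_BIT = 0x01  # hypothetical choice: least significant bit of the 6-bit DSCP


def mark_int(dscp: int) -> int:
    """Bit-based marking: set the INT bit and keep the other five DSCP bits."""
    return dscp | INT_DSCP_BIT


def is_int_marked(dscp: int) -> bool:
    return bool(dscp & INT_DSCP_BIT)


def qos_class_key(dscp: int) -> int:
    """Key used for QoS classification: the DSCP with the INT bit masked out."""
    return dscp & ~INT_DSCP_BIT & 0x3F


def build_value_map(qos_dscps, unused_dscps) -> dict:
    """Value-based marking: pair each QoS DSCP with a distinct unused DSCP."""
    qos, unused = sorted(qos_dscps), sorted(unused_dscps)
    if len(unused) < len(qos):
        raise ValueError("not enough unused DSCP values")
    return dict(zip(qos, unused))
```

With the value-based map, each INT-enabled DSCP value would be programmed with the same QoS behavior as the QoS value it is paired with, which is why the number of QoS rules may double.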
We introduce an INT shim header for TCP/UDP. The INT | |
metadata header and INT metadata stack will be encapsulated between | |
the shim header and the TCP/UDP payload. | |
INT shim header for TCP/UDP: | |
``` | |
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Type  |NPT|R|R|    Length     |  UDP port, IP Proto, or DSCP  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
``` | |
* **Type (4b):** This field indicates the type of INT Header following the shim | |
header. | |
The Type values are defined in Section [#sec-int-header-types]. | |
* **NPT (Next Protocol Type, 2b):** This field is meaningful only when the UDP destination | |
port number (INT_TBD) is used to indicate the existence of INT. In the other cases, | |
this field must be zero. When UDP destination port is INT_TBD, this field may have | |
one of the two values: | |
- **one (1):** indicates that the original UDP payload follows the INT stack, | |
and the last two bytes of the shim header carry the original UDP destination port. | |
- **two (2):** indicates that another (the original) L4 header follows | |
the INT stack, and the last byte of the shim header carries the IP protocol | |
value for the L4 layer. | |
* **Length (8b):** This is the total length of INT metadata header and INT stack | |
in 4-byte words. The length of the shim header (1 word) is NOT counted | |
since INT version 2.0. | |
A non-INT device may read this field and skip over INT headers. | |
* **UDP port, IP proto, or DSCP (16b):** The contents of this field differ | |
depending on the value of NPT. | |
- **NPT=0:** The first byte and the last two bits of this 16b field are reserved, | |
set to zero upon transmission and ignored upon reception. The first 6 bits of | |
the second byte may optionally carry the original DSCP value. | |
- **NPT=1:** The original UDP destination port value. | |
- **NPT=2:** The first byte is reserved, set to zero upon transmission and ignored | |
upon reception. The second byte carries the original IP protocol value. | |
The other bits in the shim header are reserved (R) for future use, | |
set to zero upon transmission and ignored upon reception. | |
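
The NPT-dependent encoding of the last 16 bits can be sketched in Python as below. The function and constant names are hypothetical, and the Type value used in the usage note is a placeholder for a value defined elsewhere in this specification.

```python
import struct

NPT_NONE, NPT_UDP_PAYLOAD, NPT_L4_HEADER = 0, 1, 2


def pack_l4_int_shim(int_type: int, npt: int, length_words: int, value: int) -> bytes:
    """value is the original DSCP (NPT=0), UDP destination port (NPT=1),
    or IP protocol number (NPT=2)."""
    first = (int_type << 4) | (npt << 2)  # the two R bits stay zero
    if npt == NPT_NONE:
        last16 = (value & 0x3F) << 2      # DSCP in the first 6 bits of the second byte
    elif npt == NPT_UDP_PAYLOAD:
        last16 = value & 0xFFFF
    elif npt == NPT_L4_HEADER:
        last16 = value & 0xFF             # first byte reserved, second byte = protocol
    else:
        raise ValueError("unknown NPT value")
    return struct.pack("!BBH", first, length_words, last16)


def unpack_l4_int_shim(shim: bytes) -> dict:
    """Parse the shim and report which original field the INT sink may restore."""
    first, length_words, last16 = struct.unpack("!BBH", shim[:4])
    npt = (first >> 2) & 0x3
    if npt == NPT_UDP_PAYLOAD:
        restore = ("udp_dst_port", last16)
    elif npt == NPT_L4_HEADER:
        restore = ("ip_protocol", last16 & 0xFF)
    else:
        restore = ("dscp", (last16 >> 2) & 0x3F)
    return {"type": first >> 4, "npt": npt,
            "length_words": length_words, "restore": restore}
```

For instance, `pack_l4_int_shim(1, NPT_UDP_PAYLOAD, 6, 4789)` records that the original UDP destination port was 4789 (VXLAN), which the INT sink writes back into the UDP header when it strips the 24 bytes of INT data.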
### INT over VXLAN GPE | |
VXLAN is a common tunneling protocol for network virtualization and is supported | |
by most software virtual switches and hardware network elements. The VXLAN | |
header as defined in RFC 7348 is a fixed 8-byte header as shown below. | |
VXLAN Header: | |
``` | |
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|R|R|Ver|I|P|B|O|                   Reserved                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|        VXLAN Network Identifier (VNI)         |   Reserved    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
``` | |
The amount of free space in the VXLAN header allows for carrying minimal network | |
state information. Hence, we embed INT metadata in a shim header between the | |
VXLAN header and the encapsulated payload. | |
The VXLAN header as defined in RFC 7348 does not specify the protocol being | |
encapsulated and assumes that the payload following the VXLAN header is an | |
Ethernet payload. Internet draft draft-ietf-nvo3-vxlan-gpe [^VXLAN-GPE] proposes | |
changes to the VXLAN header to allow for multi-protocol encapsulation. We use | |
this VXLAN generic protocol extension draft and propose a new | |
“Next Protocol” field value for INT. | |
VXLAN GPE Header: | |
``` | |
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|R|R|Ver|I|P|B|O|           Reserved            | Next Protocol |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|        VXLAN Network Identifier (VNI)         |   Reserved    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
``` | |
* **P bit:** Flag bit 5 is defined as the Next Protocol bit. The P bit MUST be | |
set to 1 to indicate the presence of the 8-bit next protocol field. | |
* **Next Protocol Values:** | |
* 0x01: IPv4 | |
* 0x02: IPv6 | |
* 0x03: Ethernet | |
* 0x04: Network Service Header (NSH) | |
* 0x05 to 0x7F: Unassigned | |
* 0x80 to 0xFF: Unassigned (shim headers) | |
* **0x82:** In-band Network Telemetry Header (This value has not been reserved | |
by VXLAN GPE specification yet, and is hence subject to change) | |
When there is one INT Header in the VXLAN GPE stack, the VXLAN GPE header for | |
the INT Header will have a next protocol value other than INT Header indicating | |
the payload following the INT Header - typically Ethernet. If there are multiple | |
INT Headers in the VXLAN GPE stack (for example if both MD and | |
destination type INT headers are being carried), then all VXLAN GPE shim | |
headers for the INT Headers other than the last one will carry 0x82 for | |
their next protocol values, and the VXLAN GPE header for the last INT Header | |
will carry next protocol value of the original VXLAN payload (e.g., Ethernet). | |
To embed a variable-length data (i.e., INT metadata) in the VXLAN GPE stack, we | |
introduce the INT shim header. This header follows each VXLAN GPE header for | |
INT. | |
INT shim header for VXLAN GPE encapsulation: | |
``` | |
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Type  | Rsvd  |    Length     |G|  Reserved   | Next Protocol |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Variable Option Data (INT Metadata Headers and Metadata)    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
``` | |
* **Type (4b):** This field indicates the type of INT Header following the shim | |
header. | |
The Type values are defined in Section [#sec-int-header-types]. | |
* **Rsvd (4b):** These 4 bits must be set to zero in order to align with the
shim header format recommended by Internet draft draft-ietf-nvo3-vxlan-gpe,
which allocates 8 bits for the Type field.
* **Length:** This is the total length of the variable INT option data, | |
not including the shim header, in 4-byte words. | |
* **G:** Indicates whether the original packet (before insertion of INT headers | |
and metadata) used a VXLAN or VXLAN GPE encapsulation. | |
- **0:** Original packet used VXLAN GPE encapsulation. | |
- **1:** Original packet used VXLAN encapsulation. | |
- This may be used as a hint that helps the INT sink (when it is not the VTEP) | |
determine whether to progress the packet using a VXLAN GPE encapsulation, | |
or whether to convert the VXLAN GPE encapsulation back to a VXLAN (without | |
GPE) encapsulation. | |
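
The shim layout and the Next Protocol chaining for multiple INT Headers can be sketched in Python as follows. This is one reading of the rules above, under the assumption that every INT shim except the last carries 0x82 and the last carries the protocol of the original VXLAN payload. The function names and the Type values in the usage note are hypothetical.

```python
import struct

GPE_NEXT_PROTO_ETHERNET = 0x03
GPE_NEXT_PROTO_INT = 0x82  # tentative value, subject to change as noted above


def pack_vxlan_gpe_int_shim(int_type: int, length_words: int,
                            original_was_vxlan: bool, next_protocol: int) -> bytes:
    """Type (4b), Rsvd (4b), Length (8b), G (1b), Reserved (7b), Next Protocol (8b)."""
    type_byte = (int_type & 0xF) << 4                 # Rsvd nibble stays zero
    g_byte = 0x80 if original_was_vxlan else 0x00     # G is the top bit of byte 2
    return struct.pack("!BBBB", type_byte, length_words, g_byte, next_protocol)


def build_int_section(int_headers, original_next_protocol: int,
                      original_was_vxlan: bool) -> bytes:
    """Concatenate shim + INT header pairs that follow the VXLAN GPE header.

    int_headers is a list of (int_type, header_bytes) tuples, where each
    header_bytes length is a multiple of 4.
    """
    out = b""
    for index, (int_type, header) in enumerate(int_headers):
        last = index == len(int_headers) - 1
        next_protocol = original_next_protocol if last else GPE_NEXT_PROTO_INT
        out += pack_vxlan_gpe_int_shim(int_type, len(header) // 4,
                                       original_was_vxlan, next_protocol)
        out += header
    return out
```

With two INT Headers and an Ethernet payload, the first shim carries Next Protocol 0x82 and the second carries 0x03.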
### INT over Geneve | |
Geneve is a generic and extensible tunneling framework, allowing for INT | |
metadata to be carried in TLV format as “Option headers” in the tunnel header. | |
Geneve Header: | |
``` | |
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Ver|  Opt Len  |O|C|   Rsvd.   |         Protocol Type         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       Virtual Network Identifier (VNI)        |   Reserved    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Variable Length Options                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
``` | |
Geneve Option for INT: | |
``` | |
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         Option Class          |     Type      |R|R|R| Length  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Variable Option Data (INT Metadata Headers and Metadata)    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
``` | |
Note: | |
* We do not need to reserve any special values for fields in the base | |
Geneve header for INT. | |
* Users may or may not use INT with Geneve along with VNI (network | |
virtualization), though using INT with Geneve without network virtualization | |
would be a bit wasteful. | |
* The Geneve **Option Class** codepoint 0x0103 has been tentatively assigned for | |
INT [^IANA-Geneve]. | |
* The Geneve Option **Type** field indicates the type of INT Header in the | |
Geneve Option. The Type values are defined in Section [#sec-int-header-types]. | |
* The variable length option data following the Geneve Option Header carries | |
the actual INT metadata header and metadata. | |
* The Length field of the Geneve Option header is 5 bits long, which
limits a single Geneve option instance to no more than 124 bytes (31 * 4).
*Remaining Hop Count* in the INT-MD type header has to be set accordingly at the
INT source to ensure that the Geneve option does not overflow. The entire
INT-MD header must fit in a single Geneve option.
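
The last note can be made concrete with a small Python sketch that derives the largest safe *Remaining Hop Count* from the 124-byte option limit. It assumes the 12-byte INT-MD metadata header defined in the next section, and the function and parameter names are hypothetical.

```python
GENEVE_OPTION_MAX_WORDS = 31  # 5-bit Length field, in 4-byte words
INT_MD_HEADER_WORDS = 3       # 12-byte INT-MD metadata header


def max_remaining_hop_count(hop_ml: int, source_only_words: int = 0) -> int:
    """Largest number of hops whose metadata still fits in one Geneve option.

    hop_ml is the per-hop metadata length in 4-byte words; source_only_words
    is any extra 'source-only' metadata that the INT source adds once.
    """
    if hop_ml <= 0:
        raise ValueError("hop_ml must be positive")
    budget = GENEVE_OPTION_MAX_WORDS - INT_MD_HEADER_WORDS - source_only_words
    return max(budget, 0) // hop_ml
```

For example, with a *Hop ML* of 3 words, at most (31 - 3) // 3 = 9 hops can add metadata before the option would exceed 124 bytes.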
[^IANA-Geneve]: [IANA Network Virtualization Overlay (NVO3), Geneve Option Class](https://www.iana.org/assignments/nvo3/nvo3.xhtml) | |
## INT-MD Metadata Header Format | |
In this section, we define the format of the INT-MD metadata header, | |
and the metadata itself. | |
INT-MD Metadata Header and Metadata Stack: | |
``` | |
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Ver = 2|D|E|M|       Reserved        | Hop ML  |RemainingHopCnt|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      Instruction Bitmap       |      Domain Specific ID       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|        DS Instruction         |           DS Flags            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| INT Metadata Stack (Each hop inserts Hop ML * 4B of metadata) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                             . . .                             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       Last INT metadata                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
``` | |
* The INT-MD metadata header is 12 bytes long, followed by a stack of INT metadata.
Each metadata is either 4 bytes or 8 bytes in length. Each INT hop adds
the same length of metadata, except for the source node if there is any
'source-only' metadata. The total length of the metadata stack is
variable, as different packets may traverse different paths and hence
a different number of INT hops.
* **Ver (4b):** INT metadata header version. Should be 2 for this version.
* **D (1b):** Discard.
- The INT sink must discard the packet after extracting the INT-MD metadata.
* **E (1b):** Max Hop Count exceeded.
- This flag must be set if a node cannot prepend its own metadata due to
the *Remaining Hop Count* reaching zero.
- The E bit must be set to 0 by the INT source.
* **M (1b):** MTU exceeded | |
- This flag must be set if a node cannot add all of the requested metadata | |
because doing so will cause the packet length to exceed egress link MTU. | |
In this case, the node must not add any metadata to the packet, and set | |
the M bit in the INT header. Note that it is possible for egress MTU | |
limitation to prevent INT metadata insertion at multiple hops along a | |
path. The M bit simply serves as an indication that INT metadata was not | |
inserted at one or more hops and corrective action such as reconfiguring | |
MTU at some links may be needed, particularly when INT nodes are not | |
participating in path MTU discovery. The M bit is not aimed at readily | |
identifying which node(s) did not insert INT metadata due to egress MTU | |
limitation. In theory, if this does not occur at consecutive hops, | |
it may be possible for the monitoring system to derive which | |
node(s) set the M bit based on knowledge of the network topology | |
and "Node ID, Ingress interface ID, Egress interface ID" tuples in the INT | |
metadata stack. | |
* **R (12b):** Reserved bits, should be set to 0 by the INT source and ignored | |
by other nodes. | |
* **Hop ML (5b):** Per-hop Metadata Length. This is the length of metadata | |
including the Domain Specific Metadata in 4-Byte words to be inserted at | |
each INT transit hop. | |
- *Hop ML* is set by the INT source for transit and sink hops to abide by. | |
If an INT domain uses 'source-only' Domain Specific Metadata, | |
defined below, the length of the source-only Domain Specific Metadata | |
is *excluded* from the *Hop ML*. | |
- The largest value of *Hop ML* for baseline and domain specific | |
metadata is 31. | |
* **Remaining Hop Count (8b):** The remaining number of hops that are allowed to | |
add their metadata to the packet. | |
- Upon creation of an INT metadata header, the INT Source must set this | |
value to the maximum number of hops that are allowed to add metadata | |
instance(s) to the packet. Each INT node on the path, including | |
the INT Source as well as INT Transit Hops, must decrement the | |
*Remaining Hop Count* if and when it pushes its local metadata onto the | |
stack. | |
- When a packet is received with the *Remaining Hop Count* equal to 0, the | |
node must ignore the INT instructions in the *Instruction Bitmap* and | |
*DS Instruction*, pushing no new metadata onto the stack, and the node | |
must set the E bit. | |
* **Instruction Bitmap:** Each bit corresponds to a specific standard metadata | |
as specified in Section [#what-to-monitor]. | |
- bit0 (MSB): Node ID | |
- bit1: Level 1 Ingress Interface ID (16 bits) + Egress Interface ID (16 bits) | |
- bit2: Hop latency | |
- bit3: Queue ID (8 bits) + Queue occupancy (24 bits) | |
- bit4: Ingress timestamp (8 bytes) | |
- bit5: Egress timestamp (8 bytes) | |
- bit6: Level 2 Ingress Interface ID + Egress Interface ID (4 bytes each) | |
- bit7: Egress interface Tx utilization | |
- bit8: Buffer ID (8 bits) + Buffer occupancy (24 bits) | |
- bit15: Checksum Complement | |
- The remaining bits are reserved. | |
The semantics of Queue occupancy and Buffer occupancy is the default semantics of | |
those two metadata. Additional semantics as needed for different implementation | |
can be defined in the metadata semantics YANG model [^metadata-yang]. | |
Bits 0 - 14 are Baseline INT Instructions. Each instruction bit that is set
requests 4 bytes of metadata to be inserted at each hop, except for bits 4-6,
each of which requires 8 bytes of metadata. Per-hop metadata length (*Hop ML*) is set
accordingly at the INT source.
* **Domain Specific ID (16b):** The unique ID of the INT Domain. | |
If the *Domain Specific ID* matches any Domain ID known to this node, then | |
additional processing of the Domain Specific Flags (*DS Flags*) and Domain | |
Specific Instruction (*DS Instruction*) is required. | |
The *Domain Specific ID* value 0x0000 is the default, known to all INT nodes. | |
For this value, all *DS Instruction* bits are treated as reserved. Operators | |
can assign values in the range 0x0001 to 0xFFFF. | |
* **DS Instruction (16b):** Instruction bitmap specific to the INT domain | |
identified by the *Domain Specific ID*. Each bit that is set requests that | |
Domain Specific Metadata be appended to the Baseline Metadata before the | |
Checksum Complement is inserted. | |
Some instruction bits can be defined as 'source-only' metadata by the INT domain. Those metadata | |
will be inserted only by the INT source, not by INT transit or sink nodes. | |
In a sense, 'source-only' bits do not serve as instructions for downstream INT nodes | |
to follow. The INT source sets the bits to indicate which source-only DS metadata it's adding | |
such that the monitoring system (or any consumer of the metadata) knows how to parse and use | |
the data. | |
The amount of Domain Specific Metadata added by each hop must be a multiple of 4 bytes, | |
determined from the *DS Instruction*. In case of INT transit, the amount | |
must be consistent with the per-hop metadata length (*Hop ML*) set by the INT source. | |
The amount of Domain Specific Metadata added by the INT source can be larger than | |
the amount added by a transit hop and | |
the delta must match the total size of 'source-only' Domain Specific Metadata. | |
Although the delta is excluded in *Hop ML*, it must be counted in the INT length field of | |
the INT shim header. | |
* Each INT Transit node along the path that supports INT, and the INT Source | |
node as well, adds its own metadata values as specified in the *Instruction | |
Bitmap* and *DS Instruction*, immediately after the INT metadata header. | |
- When adding new metadata, each node must prepend its metadata in | |
front of the metadata that are already added by the upstream nodes. | |
This is similar to the push operation on a stack data structure. | |
Hence, the most recently | |
added metadata appears at the top of the stack. The node must add | |
metadata in the order of bits set in the *Instruction Bitmap* and *DS | |
Instruction*, except that the Checksum Complement is last. | |
- If a node is unable to provide a metadata value specified in the | |
instruction bitmap because its value is not available, it must add a | |
special all-ones reserved value indicating "invalid" (4 or 8 bytes of 0xFF | |
depending on metadata length). | |
- Reserved bits in the *Instruction Bitmap* are to be handled similarly. If an
INT transit hop receives a reserved bit set in the *Instruction Bitmap* (e.g.
set by an INT source that is running a newer version), the transit hop must
either add corresponding metadata filled with the reserved value 0xFFFFFFFF
or must not add any INT metadata to the packet. This means that an
instruction bit marked reserved in this specification may be
used for a 4B metadata in a subsequent minor version while still being
backward compatible with this specification. However, an instruction bit
marked reserved in this specification may be used for an 8B metadata only
in the next major version, breaking backward compatibility and requiring all
INT nodes to be upgraded to the new major version. For example,
a version 2.0 INT node cannot operate alongside version 3.0 INT nodes
if a new 8B metadata is introduced in version 3.0, as the version 2.0
INT node could insert the 4-byte reserved value 0xFFFFFFFF for an 8B metadata field.
- If the *Domain Specific ID* does not match any Domain ID known to this node, | |
then the node is required to either: | |
- Pad the node's INT Metadata stack with the special all-ones reserved value | |
for a Domain Specific Metadata length, calculated by subtracting from the | |
*Hop ML* a length computed from all bits in the 16-bit *Instruction | |
Bitmap*, or | |
- Skip INT processing altogether and not insert any metadata into the packet. | |
- An INT-capable node may be limited in the maximum number | |
of instructions it can process and/or maximum length of metadata it can | |
insert in data packets. An INT hop that cannot process all instructions | |
must still insert *Hop ML* \* 4 bytes, with all-ones | |
reserved value (4 or 8 bytes of 0xFF depending on the length of metadata) | |
for the metadata corresponding to instructions it cannot process. An | |
INT hop that cannot insert Per-hop Metadata Length \* 4 bytes must skip | |
INT processing altogether and not insert any metadata in the packet. This | |
ensures that each INT node adds either zero bytes or | |
*Hop ML* \* 4 bytes to the packet. | |
- If an INT hop does not add metadata to a packet due to any of the | |
above reasons, it must not decrement the *Remaining Hop Count* in the INT | |
metadata header. | |
* The INT Sink node has the option to add its local telemetry metadata in either
of the following ways, with differing implementation-dependent impact:
1. The INT Sink's local telemetry metadata may be added to the INT-MD metadata | |
stack by following the same procedures described just above for INT | |
Transit nodes. The metadata stack with the INT Sink's local telemetry | |
metadata is included in the Telemetry Report, typically in a truncated | |
packet fragment. | |
2. The INT Sink's local telemetry metadata may be added to the Telemetry | |
Report's *Variable Optional Baseline Metadata* and *Variable Optional | |
Domain Specific Metadata*, following procedures similar to those described | |
for INT-MX nodes in Section [#int-mx-header], with the following addition | |
and restriction: | |
- *RepMdBits* bit 15 is also cleared since the Checksum Complement is not | |
applicable in the *Individual Report Header*. If the packet is dropped, | |
then *RepMdBits* bit 15 may be set since the bit is repurposed from its | |
usage in the INT-MD metadata header. | |
- The 'source-only' metadata is reported in the metadata stack, typically | |
in a truncated packet fragment, rather than the *Variable Optional Domain | |
Specific Metadata*. | |
The expectation is that each INT node implementation will only support one of | |
these, while monitoring systems should support both. | |
* Summary of the field usage | |
- The INT Source must set the following fields: | |
- *Ver*, *D*, *M*, *Hop ML*, *Remaining Hop Count*, | |
and *Instruction Bitmap*. | |
- INT Source should set all reserved bits to zero. | |
- INT Source may set the Domain-specific fields. | |
- Intermediate transit nodes can set the following fields: | |
- *E*, *M*, *Remaining Hop Count*, and *DS Flags* fields. | |
- Intermediate transit nodes must not modify the *Hop ML*, *Instruction | |
Bitmap*, *Domain Specific ID* and *DS Instruction* in the INT-MD header. | |
* The length (in bytes) of the INT metadata stack must always | |
be a multiple of (*Hop ML* \* 4), plus the size of 'source-only' Domain Specific | |
Metadata if added by the source. The total stack length can be determined | |
by subtracting the total INT fixed header sizes (12 bytes) | |
from (shim header length \* 4). | |
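The length rules above can be sketched in Python. This is a minimal illustration under stated assumptions, not a normative procedure: the function and constant names are hypothetical, Domain Specific Metadata and 'source-only' metadata are left out, and bit 0 is taken as the most significant bit of the 16-bit *Instruction Bitmap*.

```python
# Illustrative sketch only; names are hypothetical, not part of the specification.

EIGHT_BYTE_BITS = {4, 5, 6}    # instruction bits that request 8 bytes of metadata
INT_MD_HEADER_BYTES = 12       # fixed INT-MD metadata header size


def baseline_hop_words(instruction_bitmap: int) -> int:
    """Return the baseline per-hop metadata length in 4-byte words.

    Bit 0 is the most significant bit of the 16-bit bitmap.
    Domain Specific Metadata is not counted here.
    """
    words = 0
    for bit in range(16):
        if instruction_bitmap & (0x8000 >> bit):
            words += 2 if bit in EIGHT_BYTE_BITS else 1
    return words


def stack_length_bytes(shim_length_words: int) -> int:
    """Return the metadata stack length implied by the shim header length."""
    return shim_length_words * 4 - INT_MD_HEADER_BYTES


def invalid_value(bit: int) -> bytes:
    """Return the all-ones 'invalid' value for the metadata requested by `bit`."""
    return b"\xff" * (8 if bit in EIGHT_BYTE_BITS else 4)


def push_transit_metadata(stack: bytes, hop_metadata: bytes, hop_ml: int) -> bytes:
    """Prepend one transit hop's metadata in front of the existing stack."""
    if len(hop_metadata) != hop_ml * 4:
        # A transit hop adds either zero bytes or exactly Hop ML * 4 bytes.
        raise ValueError("a transit hop must add exactly Hop ML * 4 bytes")
    return hop_metadata + stack
```

For an *Instruction Bitmap* of 0x9000 (bits 0 and 3 set), `baseline_hop_words` gives 2 words, and a shim length of 7 words leaves 16 bytes of stack, which corresponds to two hops.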
[]{tex-cmd: "\newpage"} | |
## INT-MX Header Format { #int-mx-header } | |
In this section, we define the format of the INT-MX header. | |
INT-MX Header: | |
``` | |
0 1 2 3 | |
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
|Ver = 2|D| Reserved | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Instruction Bitmap | Domain Specific ID | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| DS Instruction | DS Flags | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Optional Domain Specific 'Source-Inserted' Metadata | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
``` | |
* The INT-MX header is 12 bytes long. Each metadata requested in the | |
INT-MX Header instruction is either 4 bytes or 8 bytes in length. Each INT | |
node in the forwarding path will send the requested metadata to the | |
monitoring system in the Telemetry Report. | |
Details of the metadata semantics and the format of the Telemetry Report can
be found in the Telemetry Report Format Specification [^telem-report].
[^telem-report]: [Telemetry Report Format Specification Version 2.0](https://github.com/p4lang/p4-applications/blob/master/docs/telemetry_report_v2_0.pdf), May 2020. | |
* **Ver (4b):** INT-MX header version. Should be 2 for this version. | |
* **D (1b):** Discard. | |
- INT Sink must Discard the packet after sending the metadata requested | |
in the INT-MX header to the monitoring system. | |
* **R (27b):** Reserved bits. | |
- Should be set to 0 by INT Source and ignored by other nodes. | |
* **Instruction Bitmap:** Each bit corresponds to a specific standard metadata | |
as specified in Section [#what-to-monitor]. | |
- bit0 (MSB): Node ID | |
- bit1: Level 1 Ingress Interface ID (16 bits) + Egress Interface ID (16 bits) | |
- bit2: Hop latency | |
- bit3: Queue ID (8 bits) + Queue occupancy (24 bits) | |
- bit4: Ingress timestamp (8 bytes) | |
- bit5: Egress timestamp (8 bytes) | |
- bit6: Level 2 Ingress Interface ID + Egress Interface ID (4 bytes each) | |
- bit7: Egress interface Tx utilization | |
- bit8: Buffer ID (8 bits) + Buffer occupancy (24 bits) | |
- The remaining bits are reserved. | |
The semantics defined for Queue occupancy and Buffer occupancy are the default
semantics of those two metadata. Additional semantics needed by different
implementations can be defined in the metadata semantics YANG model [^metadata-yang].
Bits 0 - 14 are Baseline INT Instructions that will be reported in *RepMdBits*
and *Variable Optional Baseline Metadata* in the Telemetry Report. Each
instruction bit that is set requests 4 bytes of metadata to be sent to the
monitoring system at each hop, except for bits 4-6, each of which requires 8 bytes
of metadata.
* **Domain Specific ID (16b):** The unique ID of the INT Domain. | |
If the *Domain Specific ID* matches any Domain ID known to this node, then | |
additional processing of the Domain Specific Flags (*DS Flags*) and Domain | |
Specific Instruction (*DS Instruction*) is required. | |
The *Domain Specific ID* value 0x0000 is the default, known to all INT nodes. | |
For this value, all *DS Instruction* bits are treated as reserved. Operators | |
can assign values in the range 0x0001 to 0xFFFF. | |
* **DS Instruction (16b):** Instruction bitmap specific to the INT domain | |
identified by the *Domain Specific ID*. When a bit is defined for a particular | |
domain, in addition to the metadata semantics and syntax, the definition must | |
specify the behavior with respect to the following properties: | |
- **DS Instruction Mode:** | |
- **Export**: Each bit that is set requests that each node send this | |
metadata in its telemetry report. This is similar to the bits in the | |
*Instruction Bitmap*, except that the metadata syntax and semantics are | |
defined in a domain specific manner. | |
- **Source-Inserted**: Each bit that is set represents metadata that is | |
inserted by the source node into the *Optional Domain Specific | |
'Source-Inserted' Metadata* in this packet. | |
- **Source-Inserted Metadata Reporting Requirement:** | |
- **All Nodes**: Each node including the source, transit, and sink nodes | |
should report this source-inserted metadata to the monitoring system | |
along with the node's own metadata. One example of a type of metadata | |
that would benefit from *All Nodes* behavior is a sequence number, which | |
can be used by the monitoring system to assist in correlation of | |
multiple telemetry reports for the same flow. Note that there are two | |
ways to include the 'source-inserted' metadata in the Telemetry Report, | |
described below. | |
- **Sink Node**: The sink node should report this source-inserted metadata | |
to the monitoring system along with the node's own metadata. One example | |
of a type of metadata that would benefit from *Sink Node* behavior is a | |
timestamp, which could be used to determine the latency experienced by a | |
packet from the source node to the sink node. | |
- **None**: This source-inserted metadata is meant to be consumed by other | |
nodes, and need not be included in any of the telemetry reports directed | |
to the monitoring system. | |
- **Source-Inserted Metadata Mutability:** | |
- **Source-Only**: The source node inserts the metadata. Transit and sink | |
nodes must not change the value of this source-inserted metadata. | |
- **Cumulative**: Transit and sink nodes may update or replace the value | |
of the source-inserted metadata. | |
The amount of Domain Specific Metadata sent by each hop must be a multiple of | |
4 bytes, determined from the *DS Instruction*, as described below. | |
* **Optional Domain Specific 'Source-Inserted' Metadata:** The metadata | |
corresponding to 'source-inserted' *DS Instruction* bits follows the | |
*DS Instruction* and the *DS Flags* fields in the INT-MX header. The length | |
of this field must be counted in the INT length field of the INT shim header. | |
* Each INT node along the path that supports INT-MX (Source, Transit and Sink | |
nodes) sends its own metadata values, based on the *Instruction Bitmap* and | |
*DS Instruction* in the INT-MX header, as follows: | |
- Copy the *Instruction Bitmap* and *DS Instruction* from the INT-MX header | |
to *RepMdBits* and *DSMdBits*, respectively, in the Telemetry Report | |
[^telem-report]. Then modify *RepMdBits* and *DSMdBits* as described in | |
the following bullets. | |
- If INT-MX *Instruction Bitmap* bit 0 was set, clear *RepMdBits* bit 0 | |
since the *Node ID* is already included in the common *Telemetry Report | |
Group Header* that precedes the individual reports. | |
- If a node is unable to provide a metadata value specified in the
instruction bitmap because its value is not available, or because it
corresponds to a reserved bit, the node must ensure that the
corresponding bit in *RepMdBits* and/or *DSMdBits* is not set in the
Telemetry Report sent to the monitoring system.
- If the *Domain Specific ID* does not match any Domain ID known to this | |
node, then the node is required to either: | |
- Send the metadata corresponding to *Instruction Bitmap* and ensure that | |
*DSMdBits* is not set in the Telemetry Report, or | |
- Not send any of its own metadata to the monitoring systems. | |
- The INT node should make every effort to include domain specific | |
'source-inserted' metadata (from the INT-MX header) in the Telemetry | |
Report, if that metadata's *Source-Inserted Metadata Reporting | |
Requirement* is *All Nodes*, or *Sink Node* if this is the sink node. | |
Two ways to accomplish this are described, with differing
implementation-dependent impact:
a. The truncated packet in the *Individual Report Inner Contents* | |
includes the INT-MX header with embedded 'source-inserted' metadata, | |
or | |
b. The embedded 'source-inserted' metadata from the INT-MX header is | |
copied into the *Variable Optional Domain Specific Metadata* in the | |
*Individual Report Main Contents*. Note that in order to achieve this, | |
the node must understand the *Domain Specific ID* and corresponding | |
*DSMdBits* definition, so that it can place that metadata in the | |
proper order relative to other domain specific metadata. The | |
corresponding bits in *DSMdBits* must remain set so that the | |
monitoring system (or any consumer of the metadata) knows how to parse | |
and use the data. | |
Although typically only one of these ways would be applied to a given | |
packet at a specific node, the combination of both of the above is | |
allowed. The expectation is that for each role (source, transit, sink), | |
each INT node implementation will rely on one of these to report | |
'source-inserted' metadata, while monitoring systems should support both. | |
If the 'source-inserted' metadata is *not* copied into the *Variable | |
Optional Domain Specific Metadata*, then the corresponding bits in | |
*DSMdBits* must be cleared. | |
Due to backward compatibility implications, domain administrators need | |
to be careful when leaving some bits in *DSMdBits* reserved, with regard | |
to defining any of those bits as 'source-inserted' in the future. | |
- An INT node may be limited in the maximum number of instructions it can | |
process and/or maximum length of metadata it can gather for each data | |
packet. An INT hop that cannot process all instructions must send to the | |
monitoring system the metadata it can process, updating the *RepMdBits* | |
and *DSMdBits* fields appropriately in the Telemetry Report. | |
- The value to be placed in the *MD Length* field must be computed based on | |
the resulting values of *RepMdBits* and *DSMdBits*. | |
- The *Variable Optional Baseline Metadata* and *Variable Optional Domain | |
Specific Metadata* in the Telemetry Report are populated based on the | |
resulting values of *RepMdBits* and *DSMdBits*. | |
* Summary of the field usage | |
- The INT Source must set the following fields: | |
- *Ver*, *D* and *Instruction Bitmap*. | |
- INT Source should set all reserved bits to zero. | |
- INT Source may set the Domain-specific fields. | |
- Intermediate transit nodes may set bits in the *DS Flags* field. | |
Intermediate transit nodes must not modify the *Instruction Bitmap*, | |
*Domain Specific ID* and *DS Instruction* in the INT-MX header. | |
* The length (in bytes) of the INT metadata sent to the monitoring system | |
must always be a multiple of 4B. | |
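The *RepMdBits* derivation described above can be sketched as follows. This is only an illustration under assumptions: the function name and the `available` parameter are hypothetical, bit 0 is taken as the most significant bit, and *DSMdBits* handling is omitted because it depends on the domain definition.

```python
# Illustrative sketch only; names are hypothetical, not part of the specification.

DEFINED_BITS = set(range(9))   # bits 0-8 are defined for INT-MX; the rest are reserved


def derive_rep_md_bits(instruction_bitmap: int, available: set) -> int:
    """Derive RepMdBits from the INT-MX Instruction Bitmap.

    `available` holds the bit numbers (bit 0 = MSB) this node can report.
    """
    rep_md_bits = instruction_bitmap & 0xFFFF      # start from a copy of the bitmap
    for bit in range(16):
        mask = 0x8000 >> bit
        if not rep_md_bits & mask:
            continue
        if bit == 0:
            # Node ID is already carried in the Telemetry Report Group Header.
            rep_md_bits &= ~mask
        elif bit not in DEFINED_BITS or bit not in available:
            # Reserved or unavailable metadata must not be advertised.
            rep_md_bits &= ~mask
    return rep_md_bits
```

With an *Instruction Bitmap* of 0x9000 and a node that can report both Node ID and queue occupancy, the result is 0x1000: bit 0 is cleared and bit 3 stays set.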
[]{tex-cmd: "\newpage"} | |
# Examples | |
This section shows example INT Headers with two hosts (Host1 and Host2), | |
communicating over a network path composed of three network switches | |
(Switch1, Switch2 and Switch3) as shown below. | |
``` | |
==> packet P travels from Host1 to Host2 ==> | |
Host1 --------> Switch1 ---------> Switch2 ---------> Switch3 --------> Host2 | |
``` | |
Detailed assumptions made for this example are as follows:
* The INT source requests each INT hop to insert node ID and queue occupancy.
(For the sake of illustration we only consider node ID and queue occupancy
being inserted at each hop. Queue IDs are typically defined per port, hence
in a real use case queue occupancy is likely to be collected along with the egress
interface ID.)
* There are three INT nodes (hops) on the path, and all the nodes expose | |
both metadata (node ID and queue occupancy). | |
* The maximum number of hops (network diameter) is 8. | |
* The values of INT metadata header fields in this example are as follows: | |
- *Ver* = 2 | |
- *D* = 0 (Packet is not a clone/copy, hence the Sink must not Discard) | |
- *E* = 0 (Max Hop Count not exceeded) | |
- *M* = 0 (MTU not exceeded at any node) | |
- Per-hop Metadata Length = 2 (for node id & queue occupancy) | |
- *Remaining Hop Count* starts at 8, decremented by 1 at each hop | |
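The values used in the INT-MD examples follow from these assumptions by simple arithmetic. The short Python sketch below works them out; the variable names are illustrative only, and the 12-byte figure is the fixed INT-MD metadata header size.

```python
# Worked arithmetic for the example assumptions; variable names are illustrative.

requested_metadata_bytes = {"node_id": 4, "queue_occupancy": 4}

hop_ml = sum(requested_metadata_bytes.values()) // 4    # per-hop length in 4-byte words
initial_remaining_hop_count = 8                         # network diameter
hops_before_sink = 2                                    # Switch1 (source) and Switch2

remaining_at_sink = initial_remaining_hop_count - hops_before_sink
stack_bytes_at_sink = hops_before_sink * hop_ml * 4
int_md_header_bytes = 12
shim_length_words = (int_md_header_bytes + stack_bytes_at_sink) // 4

print(hop_ml, remaining_at_sink, stack_bytes_at_sink, shim_length_words)
# prints: 2 6 16 7
```

These are the *Hop ML*, *Remaining Hop Count*, and shim *Length* values that appear in the packets received by the INT sink in the INT-MD examples.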
## Example with INT-MD over TCP | |
We consider a scenario where host1 sends a TCP packet to host2. The ToR switch
of host1 (Switch1) acts as the INT source. It adds INT-MD headers and its own
metadata to the packet. Switch2 prepends its metadata. Finally, the ToR
switch of host2 (Switch3) acts as the INT sink and removes the INT-MD headers before
forwarding the packet to host2.
Below is the packet received by INT sink Switch3, starting from the IPv4 | |
header. We use the value of 0x17 for IPv4.DSCP to indicate the existence of | |
INT headers. | |
IP Header: | |
``` | |
0 1 2 3 | |
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Ver=4 | IHL=5 | DSCP=0x17 |ECN| Length | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Identification |Flags| Fragment Offset | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Time to Live | Proto = 6 | Header Checksum | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Source Address | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Destination Address | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
``` | |
TCP Header: | |
``` | |
0 1 2 3 | |
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Source Port | Destination Port | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Sequence Number | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Acknowledgment Number | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Data | |U|A|P|R|S|F| | | |
| Offset| Reserved |R|C|S|S|Y|I| Window | | |
| | |G|K|H|T|N|N| | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Checksum | Urgent Pointer | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
``` | |
INT Shim Header for TCP/UDP, INT type is INT-MD (1) and | |
NPT (Next Protocol Type) is zero: | |
``` | |
0 1 2 3 | |
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
|Type=1 | 0 |R R| Length=7 | Reserved | DSCP |R R| | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
``` | |
INT-MD Metadata Header and Metadata Stack, followed by TCP payload: | |
``` | |
0 1 2 3 | |
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Ver=2 |0|0|0| Reserved | HopML=2 |RemainingHopC=6| | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
|1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0| | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0| | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| node id of hop2 | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| queue occupancy of hop2 | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| node id of hop1 | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| queue occupancy of hop1 | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| TCP payload | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
``` | |
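As a rough cross-check of the diagram above, the 12-byte INT-MD metadata header can be packed in Python. This is a sketch under assumptions: the function name is hypothetical, the bit widths of the first word (Ver 4b, D/E/M 1b each, Reserved 12b, Hop ML 5b, Remaining Hop Count 8b) are read off the diagram, and the second and third words are taken to be *Instruction Bitmap* | *Domain Specific ID* and *DS Instruction* | *DS Flags*.

```python
import struct


def pack_int_md_header(ver, d, e, m, hop_ml, remaining_hop_count,
                       instruction_bitmap, domain_specific_id=0,
                       ds_instruction=0, ds_flags=0) -> bytes:
    """Pack the fixed INT-MD metadata header in network byte order (hypothetical helper)."""
    word0 = (
        ((ver & 0xF) << 28)              # Ver: bits 0-3
        | ((d & 0x1) << 27)              # D: bit 4
        | ((e & 0x1) << 26)              # E: bit 5
        | ((m & 0x1) << 25)              # M: bit 6
        | ((hop_ml & 0x1F) << 8)         # Hop ML: bits 19-23 (reserved bits stay zero)
        | (remaining_hop_count & 0xFF)   # Remaining Hop Count: bits 24-31
    )
    return struct.pack("!IHHHH", word0, instruction_bitmap,
                       domain_specific_id, ds_instruction, ds_flags)


# Header as received by the INT sink in this example.
header = pack_int_md_header(ver=2, d=0, e=0, m=0, hop_ml=2,
                            remaining_hop_count=6, instruction_bitmap=0x9000)
print(header.hex())
# prints: 200002069000000000000000
```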
## Example with INT-MX over TCP | |
We consider a scenario where host1 sends a TCP packet to host2. The ToR switch
of host1 (Switch1) acts as the INT source. It adds an INT-MX header to the packet.
Switch2 processes the INT-MX header. Finally, the ToR switch of host2 (Switch3)
acts as the INT sink and removes the INT-MX header before forwarding the packet to host2.
Below is the packet received by INT sink Switch3, starting from the IPv4 | |
header. We use the value of 0x17 for IPv4.DSCP to indicate the existence of | |
INT headers. | |
IP Header: | |
``` | |
0 1 2 3 | |
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Ver=4 | IHL=5 | DSCP=0x17 |ECN| Length | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Identification |Flags| Fragment Offset | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Time to Live | Proto = 6 | Header Checksum | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Source Address | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Destination Address | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
``` | |
TCP Header: | |
``` | |
0 1 2 3 | |
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Source Port | Destination Port | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Sequence Number | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Acknowledgment Number | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Data | |U|A|P|R|S|F| | | |
| Offset| Reserved |R|C|S|S|Y|I| Window | | |
| | |G|K|H|T|N|N| | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Checksum | Urgent Pointer | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
``` | |
INT Shim Header for TCP/UDP, INT type is INT-MX (3) and | |
NPT (Next Protocol Type) is zero: | |
``` | |
0 1 2 3 | |
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
|Type=3 | 0 |R R| Length=3 | Reserved | DSCP |R R| | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
``` | |
INT-MX Header, followed by TCP payload: | |
``` | |
0 1 2 3 | |
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Ver=2 |0| Reserved | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
|1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0| | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0| | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| TCP payload | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
``` | |
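The INT-MX header shown above can be packed in a similar way. The sketch below is illustrative only: the function name is hypothetical, and the check that 'source-inserted' metadata is a multiple of 4 bytes is an assumption that follows from the shim length being counted in 4-byte words.

```python
import struct


def pack_int_mx_header(ver, d, instruction_bitmap, domain_specific_id=0,
                       ds_instruction=0, ds_flags=0,
                       source_inserted: bytes = b"") -> bytes:
    """Pack an INT-MX header plus optional 'source-inserted' metadata (hypothetical helper)."""
    if len(source_inserted) % 4:
        raise ValueError("source-inserted metadata must be a multiple of 4 bytes")
    word0 = ((ver & 0xF) << 28) | ((d & 0x1) << 27)   # Ver (4b), D (1b), 27 reserved bits
    fixed = struct.pack("!IHHHH", word0, instruction_bitmap,
                        domain_specific_id, ds_instruction, ds_flags)
    return fixed + source_inserted


mx_header = pack_int_mx_header(ver=2, d=0, instruction_bitmap=0x9000)
shim_length_words = len(mx_header) // 4   # length carried in the INT shim header

print(mx_header.hex(), shim_length_words)
# prints: 200000009000000000000000 3
```

Any 'source-inserted' metadata appended by the source would increase the shim length accordingly, for example to 4 words for a single 4-byte value.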
## Example with new UDP header and INT-MD inserted before TCP | |
As before we consider a scenario where host1 sends a TCP packet to host2. | |
The ToR switch | |
of host1 (Switch1) acts as the INT source. It adds a new UDP header, | |
INT-MD headers and its own metadata in the packet. | |
Switch2 prepends its metadata. Finally, the ToR | |
switch of host2 (Switch3) acts as the INT sink and removes the UDP and | |
INT-MD headers before forwarding the packet to host2. | |
Below is the packet received by INT sink Switch3, starting from the IPv4 | |
header. We use INT_TBD for UDP.Destination_Port to indicate the existence of | |
INT headers. | |
IP Header: | |
``` | |
0 1 2 3 | |
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Ver=4 | IHL=5 | DSCP |ECN| Length | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Identification |Flags| Fragment Offset | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Time to Live | Proto = 17 | Header Checksum | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Source Address | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Destination Address | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
``` | |
UDP Header: | |
``` | |
0 1 2 3 | |
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Source Port | Destination Port = INT_TBD | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Length | Checksum | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
``` | |
INT Shim Header for UDP, INT type is INT-MD (1) and | |
NPT (Next Protocol Type) is 2 indicating another L4 header follows INT. | |
IP proto is 6 to indicate that TCP follows INT: | |
``` | |
0 1 2 3 | |
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
|Type=1 | 2 |R R| Length=7 | Reserved | IP proto = 6 | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
``` | |
INT-MD Metadata Header and Metadata Stack, followed by TCP header: | |
``` | |
0 1 2 3 | |
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Ver=2 |0|0|0| Reserved | HopML=2 |RemainingHopC=6| | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
|1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0| | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0| | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| node id of hop2 | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| queue occupancy of hop2 | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| node id of hop1 | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| queue occupancy of hop1 | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| TCP header | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
``` | |
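A receiver such as the INT sink or a monitoring system could walk the metadata stack shown above roughly as follows. This is a sketch under assumptions: the function names are hypothetical, only the node ID and queue occupancy instructions of this example are handled, no 'source-only' Domain Specific Metadata is present, and the Queue ID is assumed to occupy the upper 8 bits of the queue word.

```python
import struct


def split_stack(stack: bytes, hop_ml: int) -> list:
    """Split the metadata stack into per-hop chunks, most recent hop first."""
    hop_bytes = hop_ml * 4
    if hop_bytes == 0 or len(stack) % hop_bytes:
        raise ValueError("stack length must be a multiple of Hop ML * 4")
    return [stack[i:i + hop_bytes] for i in range(0, len(stack), hop_bytes)]


def decode_hop(hop: bytes) -> dict:
    """Decode one hop carrying node ID followed by Queue ID + queue occupancy."""
    node_id, queue_word = struct.unpack("!II", hop)
    return {
        "node_id": node_id,
        "queue_id": queue_word >> 24,              # assumed: upper 8 bits
        "queue_occupancy": queue_word & 0xFFFFFF,  # assumed: lower 24 bits
    }


# Hypothetical stack: hop2 (node 2) was pushed after hop1 (node 1).
stack = struct.pack("!IIII", 2, (1 << 24) | 100, 1, (3 << 24) | 7)
hops = [decode_hop(chunk) for chunk in split_stack(stack, hop_ml=2)]
print([hop["node_id"] for hop in hops])
# prints: [2, 1]
```

Because each node prepends its metadata, the first decoded entry belongs to the hop closest to the sink.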
## Example with new UDP header and INT-MX inserted before TCP { #example-mx-udp-tcp } | |
As before we consider a scenario where host1 sends a TCP packet to host2. | |
The ToR switch of host1 (Switch1) acts as the INT source. It adds a new UDP header | |
and a new INT-MX header in the packet. Switch2 processes the INT-MX Header. | |
Finally, the ToR switch of host2 (Switch3) acts as the INT sink and removes | |
the UDP and INT-MX headers before forwarding the packet to host2. | |
Below is the packet received by INT sink Switch3, starting from the IPv4 | |
header. We use INT_TBD for UDP.Destination_Port to indicate the existence of | |
INT headers. | |
[]{tex-cmd: "\newpage"} | |
IP Header: | |
``` | |
0 1 2 3 | |
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Ver=4 | IHL=5 | DSCP |ECN| Length | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Identification |Flags| Fragment Offset | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Time to Live | Proto = 17 | Header Checksum | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Source Address | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Destination Address | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
``` | |
UDP Header: | |
``` | |
0 1 2 3 | |
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Source Port | Destination Port = INT_TBD | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Length | Checksum | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
``` | |
INT Shim Header for UDP, INT type is INT-MX (3) and | |
NPT (Next Protocol Type) is 2 indicating another L4 header follows INT. | |
IP proto is 6 to indicate that TCP follows INT: | |
``` | |
0 1 2 3 | |
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
|Type=3 | 2 |R R| Length=3 | Reserved | IP proto = 6 | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
``` | |
INT-MX Header, followed by TCP header: | |
``` | |
0 1 2 3 | |
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Ver=2 |0| Reserved | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
|1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0| | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0| | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| TCP header | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
``` | |
## Example with INT-MD in-between UDP header and UDP payload | |
In this scenario host1 sends a UDP packet to host2. The ToR switch
of host1 (Switch1) acts as the INT source. It alters the UDP destination port
to INT_TBD and inserts INT-MD headers before the UDP payload.
Switch2 prepends its metadata. Finally, the ToR
switch of host2 (Switch3) acts as the INT sink, removes the
INT-MD headers, and restores the original UDP destination port
before forwarding the packet to host2.
Below is the packet received by INT sink Switch3, starting from the IPv4 | |
header. We use INT_TBD for UDP.Destination_Port to indicate the existence of | |
INT headers. | |
IP Header: | |
``` | |
0 1 2 3 | |
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Ver=4 | IHL=5 | DSCP |ECN| Length | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Identification |Flags| Fragment Offset | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Time to Live | Proto = 17 | Header Checksum | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Source Address | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Destination Address | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
``` | |
UDP Header: | |
``` | |
0 1 2 3 | |
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Source Port | Destination Port = INT_TBD | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Length | Checksum | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
``` | |
INT Shim Header for UDP. The INT type is INT-MD (1) and
NPT (Next Protocol Type) is 1, indicating that the UDP payload follows the
INT headers. The original destination port number XYZ is stored in the shim
header:
``` | |
0 1 2 3 | |
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
|Type=1 | 1 |R R| Length=7 | UDP port = XYZ | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
``` | |
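The shim layout in the diagram above can be exercised with a minimal Python sketch. It assumes the field widths read off the diagram (Type 4 bits, NPT 2 bits, two reserved bits, Length 8 bits, then a 16-bit field). The function names are hypothetical, and port 5000 is only a stand-in for the original port XYZ.

```python
import struct

def pack_udp_shim(int_type, npt, length_words, last16):
    """Pack Type(4) | NPT(2) | R R(2) | Length(8) | 16-bit field."""
    first_byte = ((int_type & 0xF) << 4) | ((npt & 0x3) << 2)  # R bits stay 0
    return struct.pack("!BBH", first_byte, length_words & 0xFF, last16 & 0xFFFF)

def unpack_udp_shim(data):
    """Split the first 4 bytes of `data` back into the shim fields."""
    first_byte, length_words, last16 = struct.unpack("!BBH", data[:4])
    return {
        "type": first_byte >> 4,
        "npt": (first_byte >> 2) & 0x3,
        "length": length_words,
        "last16": last16,
    }

# Type=1 (INT-MD), NPT=1, Length=7; 5000 stands in for the original port XYZ.
shim = pack_udp_shim(int_type=1, npt=1, length_words=7, last16=5000)
print(shim.hex())
print(unpack_udp_shim(shim))
```

Length=7 here is the 3-word INT-MD metadata header plus two hops of two words each; the shim word itself is not counted.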
[]{tex-cmd: "\newpage"} | |
INT-MD Metadata Header and Metadata Stack, followed by UDP payload: | |
``` | |
0 1 2 3 | |
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Ver=2 |0|0|0| Reserved | HopML=2 |RemainingHopC=6| | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
|1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0| | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0| | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| node id of hop2 | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| queue occupancy of hop2 | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| node id of hop1 | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| queue occupancy of hop1 | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| UDP payload | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
``` | |
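The way the metadata stack above is built up hop by hop can be modelled in a few lines of Python. This is a sketch under stated assumptions: the source is assumed to start with a Remaining Hop Count of 8 (consistent with the value 6 seen after two hops), the packet state is a plain dictionary, and `push_hop_metadata` is a hypothetical helper name.

```python
def push_hop_metadata(state, hop_words):
    """Model one INT-MD hop adding its metadata on top of the stack."""
    if state["remaining_hop_cnt"] == 0:
        # No room left. A real device would also flag this in the header,
        # which this sketch omits.
        return state
    if len(hop_words) != state["hop_ml"]:
        raise ValueError("hop metadata must be exactly HopML words")
    return {
        "hop_ml": state["hop_ml"],
        "remaining_hop_cnt": state["remaining_hop_cnt"] - 1,
        "shim_length": state["shim_length"] + state["hop_ml"],
        "stack": list(hop_words) + state["stack"],  # newest hop goes first
    }

# Before any metadata is pushed: 3-word metadata header, empty stack.
# The initial Remaining Hop Count of 8 is an assumption.
state = {"hop_ml": 2, "remaining_hop_cnt": 8, "shim_length": 3, "stack": []}
state = push_hop_metadata(state, ["node id of hop1", "queue occupancy of hop1"])
state = push_hop_metadata(state, ["node id of hop2", "queue occupancy of hop2"])
print(state["shim_length"], state["remaining_hop_cnt"])
print(state["stack"])
```

After the two pushes the model matches the packet received by Switch3: shim Length 7, Remaining Hop Count 6, and the hop2 metadata ahead of the hop1 metadata.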
## Example with INT-MX in-between UDP header and UDP payload { #example-mx-udp } | |
In this scenario host1 sends a UDP packet to host2. The ToR switch
of host1 (Switch1) acts as the INT source. It changes the UDP destination port
to INT_TBD and inserts an INT-MX header before the UDP payload. Switch2
processes the INT-MX header. Finally, the ToR switch of host2 (Switch3) acts
as the INT sink: it removes the INT-MX header and restores the original UDP
destination port before forwarding the packet to host2.
Below is the packet received by INT sink Switch3, starting from the IPv4 | |
header. We use INT_TBD for UDP.Destination_Port to indicate the existence of | |
INT headers. | |
IP Header: | |
``` | |
0 1 2 3 | |
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Ver=4 | IHL=5 | DSCP |ECN| Length | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Identification |Flags| Fragment Offset | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Time to Live | Proto = 17 | Header Checksum | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Source Address | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Destination Address | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
``` | |
[]{tex-cmd: "\newpage"} | |
UDP Header: | |
``` | |
0 1 2 3 | |
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Source Port | Destination Port = INT_TBD | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Length | Checksum | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
``` | |
INT Shim Header for UDP. The INT type is INT-MX (3) and
NPT (Next Protocol Type) is 1, indicating that the UDP payload follows the
INT headers. The original destination port number XYZ is stored in the shim
header:
``` | |
0 1 2 3 | |
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
|Type=3 | 1 |R R| Length=3 | UDP port = XYZ | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
``` | |
INT-MX Header, followed by UDP payload: | |
``` | |
0 1 2 3 | |
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Ver=2 |0| Reserved | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
|1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0| | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0| | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| UDP payload | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
``` | |
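The sink behaviour described for this example (remove the INT headers and restore the original UDP destination port) might be sketched as follows. The sketch operates only on the bytes after the UDP header and assumes NPT=1. `sink_strip_udp_int` is a hypothetical name, port 5000 again stands in for XYZ, and the required updates to the IP and UDP length and checksum fields are deliberately left out.

```python
import struct

def sink_strip_udp_int(udp_payload):
    """Return (original_dst_port, original_payload) for an NPT=1 packet."""
    first_byte, length_words, orig_port = struct.unpack("!BBH", udp_payload[:4])
    npt = (first_byte >> 2) & 0x3
    if npt != 1:
        raise ValueError("this sketch only handles NPT=1")
    int_bytes = 4 * (1 + length_words)  # shim word plus Length words
    return orig_port, udp_payload[int_bytes:]

# Shim (Type=3, NPT=1, Length=3, port 5000), then the 3-word INT-MX header
# from the diagram, then a placeholder UDP payload.
packet = (
    bytes.fromhex("34031388")
    + bytes.fromhex("20000000 90000000 00000000")
    + b"hello"
)
port, payload = sink_strip_udp_int(packet)
print(port, payload)
```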
## Example with new IP and UDP headers and INT-MX inserted before IPSec | |
We consider a scenario where host1 sends an IPSec transport mode packet to | |
host2. | |
The ToR switch of host1 (Switch1) acts as the INT source. It adds new IP and
UDP headers and a new INT-MX header to the packet. Switch2 processes the
INT-MX header.
Finally, the ToR switch of host2 (Switch3) acts as the INT sink and removes
the outer IP, UDP, and INT-MX headers before forwarding the packet to host2.
Below is the packet received by INT sink Switch3, starting from the outer IPv4 | |
header. We use INT_TBD for UDP.Destination_Port to indicate the existence of | |
INT headers. | |
[]{tex-cmd: "\newpage"} | |
IP Header: | |
``` | |
0 1 2 3 | |
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Ver=4 | IHL=5 | DSCP |ECN| Length | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Identification |Flags| Fragment Offset | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Time to Live | Proto = 17 | Header Checksum | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Source Address | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Destination Address | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
``` | |
UDP Header: | |
``` | |
0 1 2 3 | |
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Source Port | Destination Port = INT_TBD | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Length | Checksum | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
``` | |
INT Shim Header for UDP. The INT type is INT-MX (3) and
NPT (Next Protocol Type) is 2, indicating that an IP protocol value specifies
the header that follows INT.
IP proto is 4 to indicate that an inner IPv4 header follows INT:
``` | |
0 1 2 3 | |
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
|Type=3 | 2 |R R| Length=3 | Reserved | IP proto = 4 | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
``` | |
INT-MX Header: | |
``` | |
0 1 2 3 | |
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Ver=2 |0| Reserved | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
|1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0| | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0| | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
``` | |
[]{tex-cmd: "\newpage"} | |
Inner IP Header followed by AH Header and IP Payload: | |
``` | |
0 1 2 3 | |
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Ver=4 | IHL=5 | DSCP |ECN| Length | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Identification |Flags| Fragment Offset | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Time to Live | Proto = 51 | Header Checksum | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Source Address | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Destination Address | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| AH Header | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| IP Payload | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
``` | |
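How a receiver might use NPT to decide what follows the INT headers, and where that header starts, can be illustrated with a short Python sketch. It assumes the shim layouts shown in the UDP examples (NPT=1 carries the original UDP port in the last 16 bits, NPT=2 carries an IP protocol number in the last 8 bits). The function names and the small protocol-name table are illustrative only.

```python
import struct

# A few IANA-assigned IP protocol numbers used in these examples.
IP_PROTO_NAMES = {4: "IPv4", 6: "TCP", 17: "UDP", 51: "AH"}

def next_after_int(shim):
    """Describe what follows the INT headers, based on the shim's NPT."""
    first_byte, length_words, last16 = struct.unpack("!BBH", shim[:4])
    npt = (first_byte >> 2) & 0x3
    if npt == 1:
        return {"kind": "udp_payload", "original_port": last16}
    if npt == 2:
        proto = last16 & 0xFF
        return {"kind": "ip_proto", "proto": proto,
                "name": IP_PROTO_NAMES.get(proto, "unknown")}
    return {"kind": "unsupported_npt", "npt": npt}

def offset_after_int(ihl_words, shim_length_words):
    """Byte offset, from the start of the outer IPv4 header, of whatever follows INT."""
    return 4 * ihl_words + 8 + 4 * (1 + shim_length_words)  # IP + UDP + shim + INT

shim = bytes.fromhex("38030004")  # Type=3, NPT=2, Length=3, IP proto=4
print(next_after_int(shim))
print(offset_after_int(ihl_words=5, shim_length_words=3))
```

With IHL=5 and Length=3, the inner IPv4 header of this example starts 44 bytes after the start of the outer IPv4 header.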
## Example with INT-MD over IPv4/GRE (Original packet IPv4) | |
In this scenario host1 sends an IPv4 packet to host2. | |
The ToR switch of host1 (Switch1) acts as the INT source. It does IPv4/GRE | |
encapsulation and inserts INT-MD headers before the inner (original) IPv4. | |
Switch2 prepends its metadata. Finally, the ToR switch of host2 (Switch3) | |
acts as the INT sink and removes the INT-MD headers and decapsulates outer | |
IPv4/GRE before forwarding the packet to host2. | |
Below is the packet received by INT sink Switch3, starting from the | |
outer IPv4 header. The G bit of INT shim is set to 1 to indicate that | |
GRE was inserted by the INT source. | |
[]{tex-cmd: "\newpage"} | |
``` | |
0 1 2 3 | |
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<-+ | |
| Ver=4 | IHL=5 | DSCP |ECN| Length | O | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ U | |
| Identification |Flags| Fragment Offset | T | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ E | |
| Time to Live | Proto = 0x2F | Header Checksum | R | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | |
| (Outer) Source Address | I | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ P | |
| (Outer) Destination Address | | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<-+ | |
|C|R|K|S|s|Recur| Flags | Ver | Protocol Type = TBD_INT | | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | |
| Checksum (optional) | Offset (Optional) | | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ G | |
| Key (Optional) | R | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ E | |
| Sequence Number (Optional) | | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | |
| Routing (Optional) | | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<-+ | |
|Type=1 |1| Rsvd| Length=7 | Protocol = 0x0800 | | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | |
| Ver=2 |0|0|0| Reserved | HopML=2 |RemainingHopC=6| | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | |
|1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0| | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | |
|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0| I | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ N | |
| node id of hop2 | T | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | |
| queue occupancy of hop2 | | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | |
| node id of hop1 | | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | |
| queue occupancy of hop1 | | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<-+ | |
| Inner IP Header + Payload + Pad (L3/ESP...) | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
``` | |
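The GRE variant of the shim differs from the UDP variant in its first byte and in the meaning of its last 16 bits. The following Python sketch assumes the layout read off the diagram above (Type 4 bits, G 1 bit, 3 reserved bits, Length 8 bits, 16-bit Protocol). The helper names are hypothetical.

```python
import struct

def pack_gre_shim(int_type, g_bit, length_words, protocol):
    """Pack Type(4) | G(1) | Rsvd(3) | Length(8) | Protocol(16)."""
    first_byte = ((int_type & 0xF) << 4) | ((g_bit & 0x1) << 3)
    return struct.pack("!BBH", first_byte, length_words & 0xFF, protocol & 0xFFFF)

def unpack_gre_shim(data):
    """Split the first 4 bytes of `data` into the GRE shim fields."""
    first_byte, length_words, protocol = struct.unpack("!BBH", data[:4])
    return {
        "type": first_byte >> 4,
        "g": (first_byte >> 3) & 0x1,
        "length": length_words,
        "protocol": protocol,
    }

# INT-MD, GRE added by the INT source, inner packet is IPv4 (0x0800).
shim = pack_gre_shim(int_type=1, g_bit=1, length_words=7, protocol=0x0800)
fields = unpack_gre_shim(shim)
# G=1 means the source inserted GRE, so the sink removes it along with INT.
sink_removes_gre = fields["g"] == 1
print(shim.hex(), fields, sink_removes_gre)
```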
## Example with INT-MX over IPv4/GRE (Original packet IPv4) | |
In this scenario host1 sends an IPv4 packet to host2. | |
The ToR switch of host1 (Switch1) acts as the INT source. It does IPv4/GRE | |
encapsulation and inserts INT-MX header before the inner (original) IPv4. | |
Switch2 processes the INT-MX header. Finally, the ToR switch of host2 | |
(Switch3) acts as the INT sink and removes the INT-MX header and decapsulates | |
outer IPv4/GRE before forwarding the packet to host2. | |
Below is the packet received by INT sink Switch3, starting from the | |
outer IPv4 header. The G bit of INT shim is set to 1 to indicate that | |
GRE was inserted by the INT source. | |
``` | |
0 1 2 3 | |
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<-+ | |
| Ver=4 | IHL=5 | DSCP |ECN| Length | O | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ U | |
| Identification |Flags| Fragment Offset | T | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ E | |
| Time to Live | Proto = 0x2F | Header Checksum | R | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | |
| (Outer) Source Address | I | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ P | |
| (Outer) Destination Address | | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<-+ | |
|C|R|K|S|s|Recur| Flags | Ver | Protocol Type = TBD_INT | | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | |
| Checksum (optional) | Offset (Optional) | | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ G | |
| Key (Optional) | R | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ E | |
| Sequence Number (Optional) | | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | |
| Routing (Optional) | | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<-+ | |
|Type=3 |1| Rsvd| Length=3 | Protocol = 0x0800 | | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | |
| Ver=2 |0| Reserved | I | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ N | |
|1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0| T | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | |
|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0| | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<-+ | |
| Inner IP Header + Payload + Pad (L3/ESP...) | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
``` | |
## Example with INT-MD over IPv4/GRE (Original packet CE or IP) | |
In this scenario host1 sends an IPv4 packet to host2. | |
The ToR switch of host1 (Switch1) acts as the INT source. It adds an Ethernet
(L2) header, does IPv4/GRE encapsulation and inserts INT-MD headers before the
inner (original) packet that starts with an Ethernet header. Switch2 prepends
its metadata. Finally, the ToR switch of host2 (Switch3) acts as the INT sink:
it removes the INT-MD headers and decapsulates the outer L2 and IPv4/GRE
headers before forwarding the packet to host2.
[]{tex-cmd: "\newpage"} | |
Below is the packet received by INT sink Switch3, starting from the | |
outer Ethernet header. The G bit of INT shim is set to 1 to indicate that | |
GRE was inserted by the INT source. | |
``` | |
0 1 2 3 | |
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<-+ | |
| Ethernet Header (L2) | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<-+ | |
| Ver=4 | IHL=5 | DSCP |ECN| Length | O | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ U | |
| Identification |Flags| Fragment Offset | T | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ E | |
| Time to Live | Proto = 0x2F | Header Checksum | R | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | |
| (Outer) Source Address | I | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ P | |
| (Outer) Destination Address | | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<-+ | |
|C|R|K|S|s|Recur| Flags | Ver | Protocol Type = TBD_INT | | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | |
| Checksum (optional) | Offset (Optional) | | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ G | |
| Key (Optional) | R | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ E | |
| Sequence Number (Optional) | | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | |
| Routing (Optional) | | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<-+ | |
|Type=1 |1| Rsvd| Length=7 | Protocol = 0x6558 | | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | |
| Ver=2 |0|0|0| Reserved | HopML=2 |RemainingHopC=6| | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | |
|1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0| | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | |
|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0| I | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ N | |
| node id of hop2 | T | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | |
| queue occupancy of hop2 | | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | |
| node id of hop1 | | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | |
| queue occupancy of hop1 | | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<-+ | |
| Payload (Original Packet starting with an Ethernet header) | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
``` | |
[]{tex-cmd: "\newpage"} | |
## Example with INT-MX over IPv4/GRE (Original packet CE or IP) | |
In this scenario host1 sends an IPv4 packet to host2. | |
The ToR switch of host1 (Switch1) acts as the INT source. It adds an Ethernet
(L2) header, does IPv4/GRE encapsulation and inserts an INT-MX header before
the inner (original) packet that starts with an Ethernet header. Switch2
processes the INT-MX header. Finally, the ToR switch of host2 (Switch3) acts
as the INT sink: it removes the INT-MX header and decapsulates the outer L2
and IPv4/GRE headers before forwarding the packet to host2.
Below is the packet received by INT sink Switch3, starting from the | |
outer Ethernet header. The G bit of INT shim is set to 1 to indicate that | |
GRE was inserted by the INT source. | |
``` | |
0 1 2 3 | |
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<-+ | |
| Ethernet Header (L2) | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<-+ | |
| Ver=4 | IHL=5 | DSCP |ECN| Length | O | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ U | |
| Identification |Flags| Fragment Offset | T | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ E | |
| Time to Live | Proto = 0x2F | Header Checksum | R | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | |
| (Outer) Source Address | I | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ P | |
| (Outer) Destination Address | | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<-+ | |
|C|R|K|S|s|Recur| Flags | Ver | Protocol Type = TBD_INT | | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | |
| Checksum (optional) | Offset (Optional) | | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ G | |
| Key (Optional) | R | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ E | |
| Sequence Number (Optional) | | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | |
| Routing (Optional) | | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<-+ | |
|Type=3 |1| Rsvd| Length=3 | Protocol = 0x6558 | | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | |
| Ver=2 |0| Reserved | I | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ N | |
|1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0| T | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | |
|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0| | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<-+ | |
| Payload (Original Packet starting with an Ethernet header) | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
``` | |
[]{tex-cmd: "\newpage"} | |
## Example with INT-MD over VXLAN GPE | |
We now consider a scenario where Host1 and Host2 use VXLAN encapsulation.
Host1 acts as VXLAN tunnel endpoint and INT source. It inserts VXLAN and INT-MD
headers with instruction bits corresponding to the network state to be
reported at intermediate switches. In this example, Host1 itself does
not insert any INT metadata. Intermediate switches parse through the VXLAN
header and populate the INT metadata. Host2 acts as INT sink and VXLAN
tunnel endpoint and removes the INT-MD and VXLAN headers.
The packet headers received at Host2 are as follows, starting with the VXLAN
GPE header (the encapsulating Ethernet, IP and UDP headers are not shown here):
VXLAN GPE Header: | |
``` | |
0 1 2 3 | |
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
|R|R|Ver|1|1|0|0| Reserved | NextProto=0x82| | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| VXLAN Network Identifier (VNI) | Reserved | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
``` | |
INT Shim Header for VXLAN-GPE, INT-MD type: | |
``` | |
0 1 2 3 | |
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
|Type=1 |Rsvd=0 | Length=9 |0| Reserved | NextProto=0x3 | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
``` | |
INT-MD Metadata Header and Metadata Stack: | |
``` | |
0 1 2 3 | |
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Ver=2 |0|0|0| Reserved | HopML=2 |RemainingHopC=5| | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
|1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0| | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0| | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| node id of hop3 | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| queue occupancy of hop3 | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| node id of hop2 | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| queue occupancy of hop2 | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| node id of hop1 | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| queue occupancy of hop1 | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
``` | |
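A receiver such as Host2 could decode the INT-MD metadata header and stack shown above roughly as follows. This Python sketch assumes the bit widths read off the diagram (Ver 4 bits, three flag bits, 12 reserved bits, HopML 5 bits, Remaining Hop Count 8 bits) and treats the second and third words as four 16-bit fields. The function name and the numeric node ids and queue occupancies are made up for illustration.

```python
import struct

def parse_int_md(data):
    """Parse the 12-byte INT-MD metadata header and the stack that follows it."""
    word0, instr, ds_id, ds_instr, ds_flags = struct.unpack("!IHHHH", data[:12])
    hop_ml = (word0 >> 8) & 0x1F
    header = {
        "version": word0 >> 28,
        "flags": (word0 >> 25) & 0x7,       # the three bits shown as 0|0|0
        "hop_ml": hop_ml,
        "remaining_hop_cnt": word0 & 0xFF,
        "instruction_bitmap": instr,
        "ds_id": ds_id,
        "ds_instruction": ds_instr,
        "ds_flags": ds_flags,
    }
    stack = data[12:]
    words = struct.unpack("!%dI" % (len(stack) // 4), stack)
    # One tuple of HopML words per hop; the most recent hop comes first.
    hops = [words[i:i + hop_ml] for i in range(0, len(words), hop_ml)] if hop_ml else []
    return header, hops

# Ver=2, HopML=2, Remaining Hop Count=5, instruction bitmap 0x9000,
# followed by (node id, queue occupancy) for hop3, hop2 and hop1.
data = struct.pack("!IHHHH", 0x20000205, 0x9000, 0, 0, 0)
data += struct.pack("!6I", 3, 30, 2, 20, 1, 10)
header, hops = parse_int_md(data)
print(header)
print(hops)
print(len(data) // 4)  # total words, matching Length=9 in the shim
```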
## Example with INT-MX over VXLAN GPE | |
We now consider a scenario where Host1 and Host2 use VXLAN encapsulation.
Host1 acts as VXLAN tunnel endpoint and INT source. It inserts VXLAN and INT-MX
headers with instruction bits corresponding to the network state to be
reported at intermediate switches. In this example, Host1 itself does
not send any metadata to the monitoring system. Intermediate switches parse
through the VXLAN header, process the INT-MX header and send the requested INT
metadata to the monitoring system. Host2 acts as INT sink and
VXLAN tunnel endpoint and removes the INT-MX and VXLAN headers.
The packet headers received at Host2 are as follows, starting with the VXLAN
GPE header (the encapsulating Ethernet, IP and UDP headers are not shown here):
VXLAN GPE Header: | |
``` | |
0 1 2 3 | |
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
|R|R|Ver|1|1|0|0| Reserved | NextProto=0x82| | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| VXLAN Network Identifier (VNI) | Reserved | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
``` | |
INT Shim Header for VXLAN-GPE, INT-MX type: | |
``` | |
0 1 2 3 | |
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
|Type=3 |Rsvd=0 | Length=3 |0| Reserved | NextProto=0x3 | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
``` | |
INT-MX Header: | |
``` | |
0 1 2 3 | |
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Ver=2 |0| Reserved | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
|1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0| | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0| | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
``` | |
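The instruction bits in these examples are drawn with bit 0 as the leftmost (most significant) bit. A small Python sketch of reading the 16-bit instruction bitmap follows. The name table covers only the two bits used in these examples, with names taken from the per-hop fields in the INT-MD diagrams, and `set_instruction_bits` is a hypothetical helper.

```python
def set_instruction_bits(bitmap, width=16):
    """Return the positions of set bits, counting bit 0 as the most significant."""
    return [i for i in range(width) if bitmap & (1 << (width - 1 - i))]

# Only the bits that appear in these examples are named here.
EXAMPLE_FIELDS = {0: "node id", 3: "queue occupancy"}

bits = set_instruction_bits(0x9000)  # the pattern 1 0 0 1 0 0 0 0 ...
requested = [EXAMPLE_FIELDS.get(b, "bit %d" % b) for b in bits]
print(bits)
print(requested)
```

In the INT-MD examples each of these two fields occupies one 4-byte word per hop, which is consistent with HopML=2 there.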
[]{tex-cmd: "\newpage"} | |
## Example with INT-MD over Geneve | |
We now consider a scenario where Host1 and Host2 use Geneve encapsulation.
Host1 acts as Geneve tunnel endpoint and INT source. It inserts Geneve and
INT-MD headers with instruction bits corresponding to the network state to be
reported at intermediate switches. In this example, Host1 itself does
not insert any INT metadata. Intermediate switches parse through the Geneve
header and populate the INT metadata. Host2 acts as INT sink and Geneve
tunnel endpoint and removes the INT-MD and Geneve headers.
The Geneve and INT headers attached to the packet received by
Host2 are as follows.
Geneve Header: | |
``` | |
0 1 2 3 | |
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
|Ver| OptLen=9 |O|C| Rsvd. | Protocol Type=EtherType | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Virtual Network Identifier (VNI) | Reserved | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
``` | |
Geneve Option for INT-MD type: | |
``` | |
0 1 2 3 | |
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Option Class=0x0103 | Type=1 |R|R|R| Len=9 | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
``` | |
INT-MD Metadata Header and Metadata Stack: | |
``` | |
0 1 2 3 | |
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Ver=2 |0|0|0| Reserved | HopML=2 |RemainingHopC=5| | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
|1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0| | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0| | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| node id of hop3 | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| queue occupancy of hop3 | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| node id of hop2 | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| queue occupancy of hop2 | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| node id of hop1 | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| queue occupancy of hop1 | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
``` | |
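Since INT is carried as a Geneve option, a receiver has to locate it among whatever options are present. The Python sketch below walks a Geneve options area using the option header layout from the diagram above (16-bit Option Class, 8-bit Type, three reserved bits, 5-bit Len in 4-byte words excluding the option header). The function name and the second option class 0x0001 used in the example are hypothetical.

```python
import struct

INT_OPTION_CLASS = 0x0103  # option class shown in the example above

def find_int_option(options):
    """Walk Geneve options and return (type, data) of the first INT option."""
    offset = 0
    while offset + 4 <= len(options):
        opt_class, opt_type, rsvd_len = struct.unpack_from("!HBB", options, offset)
        data_len = 4 * (rsvd_len & 0x1F)  # Len counts words after the option header
        data = options[offset + 4:offset + 4 + data_len]
        if len(data) < data_len:
            raise ValueError("truncated option")
        if opt_class == INT_OPTION_CLASS:
            return opt_type, data
        offset += 4 + data_len
    return None

# A made-up 1-word option followed by the INT-MD option (Type=1, Len=9).
other_option = struct.pack("!HBB", 0x0001, 0x10, 1) + bytes(4)
int_option = struct.pack("!HBB", INT_OPTION_CLASS, 1, 9) + bytes(36)
found = find_int_option(other_option + int_option)
print(found[0], len(found[1]))
```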
## Example with INT-MX over Geneve | |
We now consider a scenario where Host1 and Host2 use Geneve encapsulation.
Host1 acts as Geneve tunnel endpoint and INT source. It inserts Geneve and
INT-MX headers with instruction bits corresponding to the network state to be
reported at intermediate switches. In this example, Host1 does not send INT
metadata to the monitoring system. Intermediate switches parse through the
Geneve header, process the INT-MX header and send the requested INT metadata
to the monitoring system. Host2 acts as INT sink and Geneve
tunnel endpoint and removes the INT-MX and Geneve headers.
The Geneve and INT headers attached to the packet received by
Host2 are as follows.
Geneve Header: | |
``` | |
0 1 2 3 | |
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
|Ver| OptLen=9 |O|C| Rsvd. | Protocol Type=EtherType | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Virtual Network Identifier (VNI) | Reserved | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
``` | |
Geneve Option for INT-MX type: | |
``` | |
0 1 2 3 | |
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Option Class=0x0103 | Type=3 |R|R|R| Len=3 | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
``` | |
INT-MX Header: | |
``` | |
0 1 2 3 | |
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Ver=2 |0| Reserved | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
|1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0| | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0| | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
``` | |
[]{tex-cmd: "\newpage"} | |
## Example with INT-MX including domain specific source-inserted metadata | |
We consider a scenario similar to Section [#example-mx-udp-tcp], with the addition | |
of 'source-inserted' metadata that consists of a Sequence Number with current | |
value 15 and a Flow ID with current value 0x12345678. The *Domain Specific ID* | |
value is 0xabcd. | |
INT Shim Header for UDP. The INT type is INT-MX (3) and
NPT (Next Protocol Type) is 2, indicating that another L4 header follows INT.
IP proto is 6 to indicate that TCP follows INT:
``` | |
0 1 2 3 | |
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
|Type=3 | 2 |R R| Length=5 | Reserved | IP proto = 6 | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
``` | |
INT-MX Header, followed by TCP header and payload: | |
``` | |
0 1 2 3 | |
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Ver=2 |0| Reserved | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
|1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0|1 0 1 0 1 0 1 1 1 1 0 0 1 1 0 1| | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
|1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0| | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Sequence Number (assigned by source node) | | |
|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1| | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Flow ID (assigned by source node) | | |
|0 0 0 1 0 0 1 0 0 0 1 1 0 1 0 0 0 1 0 1 0 1 1 0 0 1 1 1 1 0 0 0| | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-| | |
| TCP header | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
``` | |
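The layout above can be cross-checked by packing the fields into bytes. The following is a minimal Python sketch and not part of the specification: the field widths are read off the diagrams in this example, and the function names `pack_shim` and `pack_int_mx` are hypothetical.

```python
import struct

def pack_shim(int_type, npt, length_words, ip_proto):
    """Pack the 4-byte INT shim header for UDP as drawn in the example.

    Assumed widths (read from the diagram): Type 4 bits, NPT 2 bits,
    2 reserved bits, Length 8 bits, Reserved 8 bits, IP proto 8 bits.
    """
    first_byte = (int_type << 4) | (npt << 2)  # the two R bits stay zero
    return struct.pack("!BBBB", first_byte, length_words, 0, ip_proto)

def pack_int_mx(instructions, domain_id, ds_instructions, ds_flags, source_words):
    """Pack the 12-byte INT-MX header followed by source-inserted 4-byte words."""
    word0 = 2 << 28  # Ver=2 in the top 4 bits; all other bits are zero here
    header = struct.pack("!IHHHH", word0, instructions, domain_id,
                         ds_instructions, ds_flags)
    metadata = b"".join(struct.pack("!I", w) for w in source_words)
    return header + metadata

# Values from the example: Sequence Number 15, Flow ID 0x12345678.
mx = pack_int_mx(0x9000, 0xABCD, 0xC000, 0x0000, [15, 0x12345678])
shim = pack_shim(int_type=3, npt=2, length_words=len(mx) // 4, ip_proto=6)

print(shim.hex())  # 38050006
print(mx.hex())    # 200000009000abcdc00000000000000f12345678
```

The shim Length of 5 falls out of the computation: 3 words of INT-MX header plus 2 words of source-inserted metadata, with the shim header itself excluded.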
[]{tex-cmd: "\newpage"} | |
## Example with INT-MD including domain specific source-only metadata | |
We consider a scenario where Host1 is on a wireless network behind a NAT
and its identity cannot be confirmed by port location. Host1 does not support
INT. The gateway acts as the INT source, inserting the header and the MAC address
of Host1. Intermediate switches populate the INT metadata while preserving
the source-only domain specific metadata. This example can work with any of
the above encapsulation methods such as UDP encapsulation.
The MAC address of Host1 is copied from the source address in the Ethernet frame.
The MAC address is a 6-byte identifier. Since the INT metadata header is measured in blocks of 4 bytes,
the address occupies two 4-byte words and the last 2 bytes are reserved (zero). For example, if a source device's MAC address is
"a6:1a:f6:b1:64:7d", the source-only INT metadata would be 0xA61AF6B1647D0000.
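The padding rule can be illustrated with a short Python sketch. The helper name `mac_to_source_only_metadata` is hypothetical, and the sketch assumes the two reserved bytes are zero, as in the example value.

```python
def mac_to_source_only_metadata(mac: str) -> bytes:
    """Return the 8-byte source-only metadata for a colon-separated MAC address.

    The 6 MAC bytes come first, followed by 2 reserved zero bytes so that
    the result fills two 4-byte words.
    """
    octets = bytes.fromhex(mac.replace(":", ""))
    if len(octets) != 6:
        raise ValueError("expected a 6-byte MAC address")
    return octets + b"\x00\x00"

metadata = mac_to_source_only_metadata("a6:1a:f6:b1:64:7d")
print(metadata.hex().upper())  # A61AF6B1647D0000
```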
Following is the INT-MD Header attached to the packet transmitted by | |
the hop2 switch. | |
Details of the Transparent Security domain specific model can be accessed | |
in the Transparent Security INT header reference definition | |
[^transparent-security-INT]. | |
[^transparent-security-INT]: Transparent Security INT header reference definition, [https://github.com/cablelabs/transparent-security/blob/master/docs/int_header/INT_header.md](https://github.com/cablelabs/transparent-security/blob/master/docs/int_header/INT_header.md) | |
INT Metadata Header and Metadata Stack. The *Domain
Specific ID* is 0x5453 ('TS' in ASCII). Bit 0 is set in the domain bitmap to
indicate the 16-bit source-only device source, and bit 1 is set in the *DS Flags*
to indicate that this was set by the gateway:
``` | |
0 1 2 3 | |
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Ver=2 |0|0|0| Reserved | HopML=1 |RemainingHopC=5| | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
|1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0|0 1 0 1 0 1 0 0 0 1 0 1 0 0 1 1| | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
|1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0|0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0| | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| node id of hop2 | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| node id of hop1 | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| node id of gateway | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| First 4 bytes of host1 MAC | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| Last 2 bytes of host1 MAC |0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0| | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
``` | |
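As a cross-check of the bit positions drawn above, the following Python sketch decodes the three header words and splits the metadata stack. It is illustrative only: the field widths are read from the diagram, bit 0 is taken to be the most significant bit as drawn, the node id values are made up, and the 8-byte source-only length is the value used by this example's domain. The names `parse_int_md_header`, `bit_is_set` and `split_stack` are hypothetical.

```python
import struct

def parse_int_md_header(data: bytes) -> dict:
    """Decode the 12-byte INT-MD header using the widths shown in the diagram."""
    word0, instructions, domain_id, ds_instructions, ds_flags = struct.unpack(
        "!IHHHH", data[:12])
    return {
        "version": word0 >> 28,              # top 4 bits
        "hop_ml": (word0 >> 8) & 0x1F,       # 5 bits, in 4-byte words
        "remaining_hop_count": word0 & 0xFF, # low 8 bits
        "instructions": instructions,
        "domain_id": domain_id,
        "ds_instructions": ds_instructions,
        "ds_flags": ds_flags,
    }

def bit_is_set(bitmap16: int, bit: int) -> bool:
    """Bit 0 is the leftmost (most significant) bit of a 16-bit field."""
    return bool(bitmap16 & (0x8000 >> bit))

def split_stack(stack: bytes, hop_ml: int, source_only_len: int):
    """Split the stack into per-hop blocks and trailing source-only metadata."""
    per_hop = hop_ml * 4
    hops_len = len(stack) - source_only_len
    hops = [stack[i:i + per_hop] for i in range(0, hops_len, per_hop)]
    return hops, stack[hops_len:]

header = bytes.fromhex("20000105" "80005453" "80004000")
# Hypothetical node ids for hop2, hop1 and the gateway, then the padded MAC.
stack = bytes.fromhex("00000002" "00000001" "000000aa" "a61af6b1647d0000")

fields = parse_int_md_header(header)
hops, source_only = split_stack(stack, fields["hop_ml"], source_only_len=8)
print(fields["domain_id"].to_bytes(2, "big").decode("ascii"))  # TS
print(len(hops), source_only.hex())  # 3 a61af6b1647d0000
```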
[]{tex-cmd: "\newpage"} | |
# Appendix: An extensive (but not exhaustive) set of Metadata {@h1:"A"} | |
Here we list a set of exemplary metadata that future versions of the spec may
support, as well as those that are supported in the current spec.
## Node-level | |
* Node id | |
: The unique ID of an INT node. | |
This is generally administratively assigned. Node IDs must be unique | |
within an INT domain. | |
* Control plane state version number | |
: Whenever a control-plane state changes (e.g., IP FIB update), the node's | |
control plane can also update this version number in the data plane. INT | |
packets may use these version numbers to determine which control-plane state | |
was active at the time packets were forwarded. | |
## Ingress | |
* Ingress interface identifier | |
: The interface on which the INT packet was received. A packet may be received | |
on an arbitrary stack of interface constructs starting with a physical port. | |
For example, a packet may be received on a physical port that belongs to a | |
link aggregation port group, which in turn is part of a Layer 3 Switched | |
Virtual Interface, and at Layer 3 the packet may be received in a tunnel. | |
Although the entire interface stack may be monitored in theory, this | |
specification allows for monitoring of up to two levels of ingress interface | |
identifiers. The semantics of interface identifiers may differ across devices;
each INT hop chooses the interface type it reports at each of the two levels.
* Ingress timestamp | |
: The device local time when the INT packet was received on the **_ingress_** | |
physical or logical port. | |
* Ingress interface RX pkt count | |
: Total # of packets received so far (since device initialization or counter | |
reset) on the ingress physical port or logical interface where the INT | |
packet was received. | |
* Ingress interface RX byte count | |
: Total # of bytes received so far on the ingress physical port or logical | |
interface where the INT packet was received. | |
* Ingress interface RX drop count | |
: Total # of packet drops that have occurred so far on the ingress physical port or
logical interface where the INT packet was received. | |
* Ingress interface RX utilization | |
: Current utilization of the ingress physical port or logical interface where | |
the INT packet was received. The exact mechanism (bin bucketing, moving
average, etc.) is device specific, and while the latter is clearly superior
to the former, the INT framework leaves those decisions to device vendors.
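As one illustration of the moving-average option mentioned for RX utilization, the Python sketch below keeps an exponentially weighted moving average over fixed sampling intervals. The specification leaves the mechanism to device vendors, so the class name `EwmaUtilization`, the smoothing factor `alpha` and the sampling interval are all assumptions made for the example.

```python
class EwmaUtilization:
    """Exponentially weighted moving average of link utilization (0.0 to 1.0)."""

    def __init__(self, link_bps: float, alpha: float = 0.25):
        self.link_bps = link_bps  # link capacity in bits per second
        self.alpha = alpha        # weight given to the newest sample
        self.value = 0.0          # current smoothed utilization

    def update(self, bytes_in_interval: int, interval_s: float) -> float:
        """Fold one interval's byte count into the running average."""
        sample = (bytes_in_interval * 8) / (self.link_bps * interval_s)
        self.value = self.alpha * sample + (1 - self.alpha) * self.value
        return self.value

# A 10 Gb/s port that receives 625 MB in each 1-second interval (50% load).
rx_util = EwmaUtilization(link_bps=10e9, alpha=0.5)
first = rx_util.update(625_000_000, 1.0)
second = rx_util.update(625_000_000, 1.0)
print(first, second)
```

With a steady 50% load the average climbs toward 0.5 over successive intervals. A smaller `alpha` makes it smoother but slower to react.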
[]{tex-cmd: "\newpage"} | |
## Egress | |
* Egress interface identifier | |
: The interface on which the INT packet was sent out. A packet may be transmitted | |
on an arbitrary stack of interface constructs ending at a physical port. | |
For example, a packet may be transmitted on a tunnel, out of a Layer 3 | |
Switched Virtual Interface, on a Link Aggregation Group, out of a | |
particular physical port belonging to the Link Aggregation Group. | |
Although the entire interface stack may be monitored in theory, this | |
specification allows for monitoring of up to two levels of egress interface | |
identifiers. The semantics of interface identifiers may differ across devices;
each INT hop chooses the interface type it reports at each of the two levels.
* Egress timestamp | |
: The device local time when the INT packet was processed by the egress | |
physical port or logical interface. | |
* Egress interface TX pkt count | |
: Total # of packets forwarded so far (since device initialization or counter | |
reset) through the egress physical port or logical interface where the INT | |
packet was also forwarded. | |
* Egress interface TX byte count | |
: Total # of bytes forwarded so far through the egress physical port or | |
logical interface where the INT packet was forwarded. | |
* Egress interface TX drop count | |
: Total # of packet drops that have occurred so far on the egress physical port or
logical interface where the INT packet was forwarded. | |
* Egress interface TX utilization | |
: Current utilization of the egress interface via which the INT packet was | |
sent out. | |
## Buffer Information | |
* Queue id | |
: The id of the queue the device used to serve the INT packet. | |
* Instantaneous queue length | |
: The instantaneous length (in bytes, cells, or packets) of the queue the INT | |
packet has observed in the device while being forwarded. The units used | |
need not be consistent across an INT domain, but care must be taken to | |
ensure that there is a known, consistent mapping of {device, queue} values | |
to their respective unit {packets, bytes, cells}. | |
* Average queue length | |
: The average length (in bytes, cells, or packets) of the queue via which the | |
INT packet was served. The calculation mechanism of this value is device | |
specific. | |
* Queue drop count | |
: Total # of packets dropped from the queue. | |
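The requirement for a known mapping of {device, queue} to units can be sketched from the collector's point of view. The Python example below is hypothetical throughout: the table contents, node ids, cell size and the function name `queue_length_bytes` are illustrative assumptions, not values defined by this specification.

```python
# Hypothetical per-domain table: (node_id, queue_id) -> unit of reported length.
QUEUE_UNITS = {
    (1, 0): "bytes",
    (2, 0): "cells",
    (3, 0): "packets",
}

# Hypothetical cell size in bytes for nodes that report in cells.
CELL_BYTES = {2: 80}

def queue_length_bytes(node_id: int, queue_id: int, raw_value: int):
    """Convert a reported queue length to bytes where an exact conversion exists.

    Returns None for packet-based units, since packets have no fixed size.
    Raises KeyError if the (node, queue) pair has no known unit.
    """
    unit = QUEUE_UNITS[(node_id, queue_id)]
    if unit == "bytes":
        return raw_value
    if unit == "cells":
        return raw_value * CELL_BYTES[node_id]
    return None

print(queue_length_bytes(1, 0, 1500))  # 1500
print(queue_length_bytes(2, 0, 10))    # 800
print(queue_length_bytes(3, 0, 4))     # None
```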
The metadata below capture the buffer occupancy that the INT packet
observes in the device while being forwarded. They are useful when a buffer is
shared between multiple queues.
* Buffer id | |
: The id of the buffer the device used to serve the INT packet. | |
[]{tex-cmd: "\newpage"} | |
* Instantaneous buffer occupancy | |
: The instantaneous value (in bytes or cells) of the buffer occupancy the INT
packet has observed in the device while being forwarded. The units used | |
need not be consistent across an INT domain, but care must be taken to | |
ensure that there is a known, consistent mapping of {device, buffer} values | |
to their respective unit {bytes, cells}. | |
* Average buffer occupancy | |
: The average value (in bytes or cells) of the buffer occupancy that the INT
packet observed. The calculation mechanism of this value is device
specific. | |
## Miscellaneous | |
* Checksum Complement | |
: This field enables a Checksum-neutral update when INT is encapsulated over | |
an L4 protocol that uses a Checksum field, such as TCP or UDP. | |
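The idea behind a checksum-neutral update can be shown with one's-complement arithmetic. The Python sketch below is a simplified illustration, not the normative procedure: it covers only 16-bit-aligned inserted bytes and ignores other fields that change when INT is added (for example length fields covered by the L4 checksum), which a real implementation would also have to account for. The function names are hypothetical.

```python
import struct

def ones_sum(data: bytes) -> int:
    """16-bit one's-complement sum of data (zero-padded to an even length)."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return total

def internet_checksum(data: bytes) -> int:
    """One's complement of the one's-complement sum."""
    return ~ones_sum(data) & 0xFFFF

def checksum_complement(inserted: bytes) -> int:
    """Value that makes the inserted bytes plus the complement sum to 0xFFFF,
    which is zero in one's-complement arithmetic."""
    return ~ones_sum(inserted) & 0xFFFF

original = bytes.fromhex("0001f203f4f5f6f7")      # data covered by the checksum
inserted = struct.pack("!II", 15, 0x12345678)     # example inserted metadata
complement = checksum_complement(inserted)
updated = original + inserted + struct.pack("!H", complement)

# The checksum over the longer data matches the original, so the L4
# Checksum field does not need to be rewritten.
print(internet_checksum(original) == internet_checksum(updated))  # True
```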
# Acknowledgements | |
We thank the following individuals for their contributions to the design, | |
specification and implementation of this spec. | |
* Daniel Alvarez | |
* Parag Bhide | |
* Dennis Cai | |
* Dan Daly | |
* Bruce Davie | |
* Ed Doe | |
* Senthil Ganesan | |
* Anoop Ghanwani | |
* Mukesh Hira | |
* Hugh Holbrook | |
* Raja Jayakumar | |
* Changhoon Kim | |
* Jeongkeun Lee | |
* Randy Levensalor | |
* Tal Mizrahi | |
* Masoud Moshref | |
* Michael Orr | |
* Heidi Ou | |
* Ramesh Sivakolundu | |
* Mickey Spiegel | |
* Bapi Vinnakota | |
[]{tex-cmd: "\newpage"} | |
# Change log | |
* 2015-09-28 | |
- Initial release | |
* 2016-06-19 | |
- Updated section [#sec-int-over-vxlan-gpe], the | |
Length field definition of VXLAN GPE shim header, to be consistent with the | |
example in section [#sec-examples]. | |
* 2017-10-17 | |
- Introduced INT over TCP/UDP (section | |
[#sec-int-over-tcpudp] and new example) | |
- Removed BOS (Bottom-Of-Stack) bit at each 4B metadata, from the header | |
definition and examples | |
- Updated the INT instruction bitmap and the meaning of a few instructions | |
(section [#sec-int-md-metadata-header-format]) | |
- Moved the INT transit P4 program from Appendix to the main section. | |
Re-wrote the program in p4_16. | |
* 2017-12-11 | |
- Increased the size of Version field from 2b to 4b in INT Metadata Header | |
- Improved the header presentation of the examples and clarified the | |
assumptions in section [#sec-examples] | |
- Formatted the spec as a Madoko file | |
- **Tag v0.5 spec** | |
* 2018-02-13 | |
- Elaborated on interactions between INT and MTU settings. Defined switch
behavior when inserting INT metadata in a packet would cause the egress
link MTU to be exceeded.
- Defined behavior of INT transit switch when it receives reserved bits | |
set in the INT header | |
* 2018-02-14 | |
- Replaced Max Hop Count and Total Hop Count with Remaining Hop Count | |
* 2018-02-28 | |
- Added Probe Marker approach as another way to indicate the existence of | |
INT over TCP/UDP (section [#sec-int-over-tcpudp]). | |
* 2018-03-08 | |
- Added support for monitoring of two levels of ingress and egress | |
port identifiers | |
* 2018-03-13 | |
- Defined INT domain in section [#sec-terminology]. | |
- Described a possible allocation of non-contiguous DSCP codepoints for | |
INT over TCP/UDP | |
in section [#sec-int-over-tcpudp]. | |
- Relaxed the location of INT stack relative to TCP options in | |
section [#sec-int-over-tcpudp]. | |
* 2018-03-14 | |
- Added the Checksum Complement metadata. | |
* 2018-03-29 | |
- Removed queue congestion status from the list of metadata. | |
- Removed Section 4.2 (Handling INT Packets) on slow path processing using | |
follow-up packets. | |
- Removed the examples of piggybacked metadata for closed loop control. | |
- The expectation is that any of these may be reintroduced in future | |
versions of INT. They could benefit from a better understanding of use | |
cases and some preliminary implementation experience. | |
* 2018-03-31 | |
- Defined checksum update behavior more precisely | |
- Miscellaneous editorial changes in preparation for v1.0 | |
* 2018-04-02 | |
- Revised the example transit code to be compliant with spec v1.0,
perform incremental TCP/UDP checksum updates, and
target the PSA architecture instead of v1model.
* 2018-04-03 | |
- Some more editorial changes for v1.0 | |
* 2018-04-10 | |
- Removed the option to modify L4 destination port to indicate INT over | |
TCP/UDP. | |
- Removed INT Tail header from INT over TCP/UDP encapsulation. | |
- Added DSCP to the INT over TCP/UDP shim header. | |
* 2018-04-20 | |
- Some more editorial changes for v1.0 | |
- **Tag v1.0 spec** | |
* 2018-05-08 | |
- Fixed checksum subtract/add calls in the reference code | |
* 2018-08-17 | |
- Fixed INT DSCP mask in the reference code | |
* 2018-12-07 | |
- Added instruction bit for buffer occupancy | |
* 2019-07-03 | |
- Added INT modes of operation: INT-XD/MX/MD and INT-CLONE/PROBE-MD. | |
- Removed the reference code | |
* 2020-01-15 | |
- Added GPE bit to INT shim header for VXLAN GPE encapsulation | |
- Swapped the length and reserved bytes in the shim header to align with | |
other INT transports, and to align with the VXLAN GPE shim header format. | |
- Changed the length definition in the shim header to exclude the shim | |
header itself, in order to align with the VXLAN GPE shim header format. | |
- Increased ingress and egress timestamp size to 8 bytes. | |
* 2020-01-16 | |
- Changed the meaning of the length field in every INT shim header. | |
The length of the shim header is NOT included any more. | |
- Revised INT over UDP encap, using a new UDP destination port number (INT_TBD). | |
* 2020-01-28 | |
- Added Domain ID, Domain Specific (DS) Instructions, and DS Flags. | |
- As a result, the INT common header size is increased from 8B to 12B. | |
- Removed Rep bits and 'C' bit, introduced 'D' bit for Discarding Copy/Clone | |
at INT Sink. | |
* 2020-02-11 | |
- Added 'source-only' metadata as part of DS instruction. | |
- Changed the Hop ML to be required only by Transit and Sink devices. | |
* 2020-02-14 | |
- Added IPv4/GRE transport for INT. | |
- **Tag V2.0 spec** | |
* 2020-03-04 | |
- Added example with 'source-only' domain-specific metadata. | |
* 2020-04-06 | |
- Added INT-MX Header format, per-hop header operations, and examples | |
* 2020-04-17 | |
- Changed 'port identifier' terminology to 'interface identifier' | |
* 2020-04-30 | |
- Added INT-MD alternative MTU processing, generating 'Intermediate Report' | |
- Changed 'switch id' terminology to 'node id' | |
* 2020-05-12 | |
- Changed Geneve Option Class codepoint to value assigned by IANA | |
* 2020-06-10 | |
- Specified that INT-MX 'source-only' metadata must be embedded in the packet | |
and should be reported to the monitoring system by each node. | |
* 2020-06-15 | |
- Simplified INT-MX processing by removing the padding option. | |
* 2020-10-08 | |
- Added an example of INT over UDP with an IPsec payload
* 2020-11-11 | |
- Generalized INT-MX 'source-only' metadata to 'source-inserted' metadata. | |
When the bit is defined for a domain, the definition specifies the | |
reporting requirement and whether the 'source-inserted' metadata is mutable. | |
- **Tag V2.1 spec** |