Title : In-band Network Telemetry (INT) Dataplane Specification
Title Note : Version 2.1
Title Footer: 2020-11-11
Author : The P4.org Applications Working Group. Contributions from
Affiliation : *Alibaba, Arista, CableLabs, Cisco Systems, Dell, Intel, Marvell, Netronome, VMware*
Heading depth: 5
Pdf Latex: xelatex
Document Class: [11pt]article
Package: [top=1in, bottom=1.25in, left=1in, right=1in]{geometry}
Package: fancyhdr
Tex Header:
\setlength{\headheight}{30pt}
\renewcommand{\footrulewidth}{0.5pt}
@if html {
body.madoko {
font-family: utopia-std, serif;
}
title,titlenote,titlefooter,authors,h1,h2,h3,h4,h5 {
font-family: helvetica, sans-serif;
font-weight: bold;
}
pre, code {
language: p4;
font-family: monospace;
font-size: 10pt;
}
}
@if tex {
body.madoko {
font-family: UtopiaStd-Regular;
}
title,titlenote,titlefooter,authors {
font-family: sans-serif;
font-weight: bold;
}
pre, code {
language: p4;
font-family: LuxiMono;
font-size: 75%;
}
}
Colorizer: p4
.token.keyword {
font-weight: bold;
}
@if html {
p4example {
replace: "~ Begin P4ExampleBlock&nl;\
````&nl;&source;&nl;````&nl;\
~ End P4ExampleBlock";
padding:6pt;
margin-top: 6pt;
margin-bottom: 6pt;
border: solid;
background-color: #ffffdd;
border-width: 0.5pt;
}
}
@if tex {
p4example {
replace: "~ Begin P4ExampleBlock&nl;\
````&nl;&source;&nl;````&nl;\
~ End P4ExampleBlock";
breakable: true;
padding: 6pt;
margin-top: 6pt;
margin-bottom: 6pt;
border: solid;
background-color: #ffffdd;
border-width: 0.5pt;
}
}
@if html {
p4pseudo {
replace: "~ Begin P4PseudoBlock&nl;\
````&nl;&source;&nl;````&nl;\
~ End P4PseudoBlock";
padding: 6pt;
margin-top: 6pt;
margin-bottom: 6pt;
border: solid;
background-color: #e9fce9;
border-width: 0.5pt;
}
}
@if tex {
p4pseudo {
replace: "~ Begin P4PseudoBlock&nl;\
````&nl;&source;&nl;````&nl;\
~ End P4PseudoBlock";
breakable : true;
padding: 6pt;
margin-top: 6pt;
margin-bottom: 6pt;
background-color: #e9fce9;
border: solid;
border-width: 0.5pt;
}
}
@if html {
p4grammar {
replace: "~ Begin P4GrammarBlock&nl;\
````&nl;&source;&nl;````&nl;\
~ End P4GrammarBlock";
border: solid;
margin-top: 6pt;
margin-bottom: 6pt;
padding: 6pt;
background-color: #e6ffff;
border-width: 0.5pt;
}
}
@if tex {
p4grammar {
replace: "~ Begin P4GrammarBlock&nl;\
````&nl;&source;&nl;````&nl;\
~ End P4GrammarBlock";
breakable: true;
margin-top: 6pt;
margin-bottom: 6pt;
padding: 6pt;
background-color: #e6ffff;
border: solid;
border-width: 0.5pt;
}
}
[TITLE]
[]{tex-cmd: "\newpage"}
[]{tex-cmd: "\fancyfoot[L]{&date; &time;}"}
[]{tex-cmd: "\fancyfoot[C]{In-band Network Telemetry}"}
[]{tex-cmd: "\fancyfoot[R]{\thepage}"}
[]{tex-cmd: "\pagestyle{fancy}"}
[]{tex-cmd: "\sloppy"}
[TOC]
# Introduction
In-band Network Telemetry (“INT”) is a framework designed to allow the
collection and reporting of network state, by the data plane, without requiring
intervention or work by the control plane in collecting and delivering
the state from the data plane. In the INT architectural model,
packets may contain header fields that are interpreted as “telemetry
instructions” by network devices.
INT traffic sources (applications, end-host networking stacks,
hypervisors, NICs, send-side ToRs, etc.) can embed the instructions either in
normal data packets, cloned copies of the data packets or in special probe packets.
Alternatively, the instructions may be
programmed in the network data plane to match on particular network flows
and to execute the instructions on the matched flows.
These instructions tell an INT-capable device what state to collect.
The network state information may be directly exported by the data plane
to the telemetry monitoring system, or can be
written into the packet as it traverses the
network. When the information is embedded in the packets, INT traffic sinks
retrieve (and optionally report) the collected results of these instructions,
allowing the traffic sinks to monitor the exact data plane state that the
packets “observed” while being forwarded.
Some examples of traffic sink behavior are described below:
* OAM – the traffic sink[^Transit] might simply collect the encoded network state, then
export that state to an external controller. This export could be in a raw
format, or could be combined with basic processing (such as compression,
deduplication, truncation).
* Real-time control or feedback loops – traffic sinks might use the encoded
data plane information to feed back control information to traffic sources,
which could in turn use this information to make changes to traffic engineering
or packet forwarding. (Explicit congestion notification schemes are an example
of these types of feedback loops).
* Network Event Detection - If the collected path state indicates a condition
that requires immediate attention or resolution (such as severe congestion or
violation of certain data-plane invariants), the traffic sinks[^Transit] could generate
immediate actions to respond to the network events, forming a feedback control
loop either in a centralized or a fully decentralized fashion (a la TCP).
[]{tex-cmd: "\newpage"}
The INT architectural model is intended to be generic and enables a
number of interesting high level applications, such as:
* Network troubleshooting and performance monitoring
- Traceroute, micro-burst detection, packet history (a.k.a. postcards[^Postcard])
* Advanced congestion control
* Advanced routing
- Utilization-aware routing (For example, HULA[^HULA], CLOVE[^CLOVE])
* Network data plane verification
A number of use case descriptions and evaluations are described in the Millions
of Little Minions paper [^Minions].
[^Transit]: While this will be commonly done by Sink nodes, Transit nodes may also generate OAMs or carry out Network Event Detection.
[^Postcard]: I Know What Your Packet Did Last Hop: Using Packet Histories to Troubleshoot Networks, USENIX NSDI 2014.
[^HULA]: HULA: Scalable Load Balancing Using Programmable Data Planes, ACM SOSR 2016
[^CLOVE]: CLOVE: Congestion-Aware Load Balancing at the Virtual Edge, ACM CoNEXT 2017
[^Minions]: Millions of Little Minions: Using Packets for Low Latency Network Programming and Visibility, ACM SIGCOMM 2014.
# Terminology
* **Monitoring System**:
: A system that collects telemetry data sent from different network devices.
The monitoring system components may be physically distributed but logically
centralized.
* **INT Header**:
: A packet header that carries INT information. There are three types of INT
Headers -- *eMbed data (MD-type)*, *eMbed instruction (MX-type)* and
*Destination-type* (See Section [#sec-int-header-types]).
* **INT Packet**:
: A packet containing an INT Header.
* **INT Node**:
: An INT-capable network device that participates in the INT data plane by
regularly carrying out at least one of the following: inserting, adding to,
removing, or processing instructions from INT Headers in INT packets.
Depending on deployment scenarios, examples of INT Nodes may include devices
such as routers, switches, and NICs.
* **INT Instruction**:
: Instructions indicating which INT Metadata (defined below) to collect at
each INT node. The instructions are either configured at each INT-capable
node's Flow Watchlist or written into the INT Header.
* **Flow Watchlist**:
: A dataplane table that matches on packet headers and inserts or applies
INT instructions on each matched flow. A flow is a set of packets having the
same values on the selected header fields.
* **INT Source**:
: A trusted entity that creates and inserts INT Headers into the packets it
sends. A Flow Watchlist is configured to select the flows in which INT
headers are to be inserted.
* **INT Sink**:
: A trusted entity that extracts the INT Headers and collects the path state
contained in the INT Headers. The INT Sink is responsible for removing INT
Headers so as to make INT transparent to upper layers. (Note that this does
not preclude having nested or hierarchical INT domains.) The INT Sink can
decide to send the collected information to the monitoring system.
* **INT Transit Hop**:
: A trusted entity that collects metadata from the data plane by following
the INT Instructions. Based on the instructions, the data may be directly
exported to the telemetry monitoring system or embedded into the INT Header
of the packet.
Note that one physical device
may play multiple roles -- INT Source, Transit, Sink -- at the same time
for the same or different flows. For example, an INT Source node may
embed its own metadata into the packet, playing the role of an INT Transit Hop as well.
* **INT Metadata**:
: Information that an INT Source or an INT Transit Hop node inserts into the
INT Header, or into a telemetry report. Examples of metadata are described
in Section [#what-to-monitor].
* **INT Domain**:
: A set of inter-connected INT nodes under the same administration. This
specification defines the behavior and packet header formats for
interoperability between INT nodes from different vendors in an INT
domain. The INT nodes within the same domain must be configured in a
consistent way to ensure interoperability between the nodes. Operators of
an INT domain should deploy INT Sink capability at domain edges to prevent
INT information from leaking out of the domain.
# INT Modes of Operation
Since INT was first introduced at P4.org in 2015, a number of variations of INT
have evolved and been discussed in IETF and industry communities.
Also the term 'INT' has been used to broadly indicate data plane telemetry in
general, not limited to the original classic INT where both instructions and
metadata are embedded in the data packets. Hence we define different modes of
INT operation based on the degree of packet modifications, i.e., what to
embed in the packets.
The different modes of operation are described in detail below, and summarized
in Figure [#fig-int-modes].
## INT Application Modes
Original data packets are monitored and may be modified
to carry INT instructions and metadata.
There are three variations based on the level of packet modifications.
* **INT-XD** (eXport Data): INT nodes directly export metadata from their
dataplane to the monitoring system based on the INT instructions configured
at their Flow Watchlists. No packet modification is needed.
This mode was also known as "Postcard" mode in the previous versions of
the Telemetry Report spec, originally inspired by [^Postcard].
* **INT-MX** (eMbed instruct(X)ions): The INT Source node embeds INT
instructions in the packet header, then the INT Source, each INT Transit, and
the INT sink directly send the metadata to the monitoring system by following
the instructions embedded in the packets. The INT Sink node strips the
instruction header before forwarding the packet to the receiver. Packet
modification is limited to the instruction header, so the packet size doesn't
grow as the packet traverses more Transit nodes.
INT-MX also supports 'source-inserted' metadata as part of Domain Specific
Instructions. This allows the INT Source to embed additional metadata that
other nodes or the monitoring system can consume.
This mode is inspired by IOAM's "Direct Export" [^IOAM] [^IOAM-DEX].
* **INT-MD** (eMbed Data): In this mode both INT instructions and metadata are
written into the packets. This is the classic hop-by-hop INT where 1) INT
Source embeds instructions, 2) INT Source & Transit embed metadata, and 3) INT
Sink strips the instructions and aggregated metadata out of the packet and
(selectively) sends the data to the monitoring system.
The packet is modified the most in this mode while it minimizes the overhead
at the monitoring system to collate reports from multiple INT nodes.
Since v2.0, INT-MD mode supports 'source-only' metadata as part of Domain
Specific Instructions. This allows the INT Source to embed additional
metadata for the INT Sink or the monitoring system to consume.
**NOTE: the rest of the spec assumes INT-MD as the default mode, unless
specified otherwise.**
[^IOAM]: Data Fields for In-situ OAM, [draft-ietf-ippm-ioam-data-09](https://tools.ietf.org/html/draft-ietf-ippm-ioam-data-09), March 2020.
[^IOAM-DEX]: In-situ OAM Direct Exporting, [draft-ietf-ippm-ioam-direct-export-00](https://tools.ietf.org/html/draft-ietf-ippm-ioam-direct-export-00), February 2020.
~ Figure { #fig-int-modes; caption: "Various modes of INT operation." }
![int-modes]
~
[int-modes]: images/INT_modes.pdf { width: 5.5in }
## INT Applied to Synthetic Traffic
INT Source nodes may generate INT-marked synthetic traffic either by
cloning original data packets or by generating special probe packets.
INT is applied to this traffic by transit nodes in exactly the same way
as all traffic.
The only difference between live traffic and synthetic traffic is that
INT Sink nodes may need to discard synthetic traffic after extracting
the collected INT data as opposed to forwarding the traffic. This is
indicated by using the 'D' bit of the INT Header to mark relevant packets
as being copies/clones or probes, to be 'D'iscarded at the INT Sink.
All INT modes may be used on these synthetic/probe packets, as decided by the INT Source node.
Specifically, the **INT-MD** (eMbed Data) mode applied to synthetic or probe packets allows
functionality similar to **IFA**[^IFA].
It is likely that synthetic traffic created by cloning would be discarded at the Sink,
while Probe packets might be marked for forwarding or discarding, depending on the
use-case. It is the responsibility of the INT Source node to mark packets correctly
to determine if the INT Sink will forward or discard packets after extracting the
INT Data collected along the path.
[^IFA]: Inband Flow Analyzer, [draft-kumar-ippm-ifa-02](https://tools.ietf.org/html/draft-kumar-ippm-ifa-02), April 2020.
# What To Monitor { #what-to-monitor }
In theory, one may be able to define and collect any device-internal
information using the INT approach. In practice, however, it seems useful to
define a small baseline set of metadata that can be made available on a wide
variety of devices: the metadata listed in this section comprises such a set.
As the INT specification evolves, we expect to add more metadata to this
INT specification.
The exact meaning of the following metadata (e.g., the unit of timestamp
values, the precise definition of hop latency, queue occupancy or buffer
occupancy) can vary from one device to another for any number of reasons,
including the heterogeneity of device architecture, feature sets,
resource limits, etc. Thus, defining the exact meaning of each metadata is
beyond the scope of this document. Instead we assume that the semantics of
metadata for each device model used in a deployment are communicated to
the entities interpreting/analyzing the reported data in an out-of-band fashion.
## Device-level Information
* Node id
: The unique ID of an INT node.
This is generally administratively assigned. Node IDs must be unique
within an INT domain.
## Ingress Information
* Ingress interface identifier
: The interface on which the INT packet was received. A packet may be received
on an arbitrary stack of interface constructs starting with a physical port.
For example, a packet may be received on a physical port that belongs to
a link aggregation port group, which in turn is part of a Layer 3 Switched
Virtual Interface, and at Layer 3 the packet may be received in a tunnel.
Although the entire interface stack may be monitored in theory, this
specification allows for monitoring of up to two levels of ingress interface
identifiers. The first level of ingress interface identifier would typically
be used to monitor the physical port on which the packet was received, hence
a 16-bit field (half of a 4-Byte metadata) is deemed adequate. The second
level of ingress interface identifier occupies a full 4-Byte metadata field,
which may be used to monitor a logical interface on which the packet was
received. A 32-bit space at the second level allows for an adequately large
number of logical interfaces at each network element. The semantics of
interface identifiers may differ across devices; each INT hop chooses the
interface type it reports at each of the two levels.
* Ingress timestamp
: The device local time when the INT packet was received on the ingress
physical or logical port.
## Egress Information
* Egress interface identifier
: The interface on which the INT packet was sent out. A packet may
be transmitted on an arbitrary stack of interface constructs ending at a
physical port. For example, a packet may be transmitted on a tunnel,
out of a Layer 3 Switched Virtual Interface, on a Link Aggregation Group,
out of a particular physical port belonging to the Link Aggregation Group.
Although the entire interface stack may be monitored in theory, this
specification allows for monitoring of up to two levels of egress interface
identifiers. The first level of egress interface identifier would typically
be used to monitor the physical port on which the packet was transmitted,
hence a 16-bit field (half of a 4-Byte metadata) is deemed adequate.
The second level of egress interface identifier occupies a full
4-Byte metadata field, which may be used to monitor a logical interface on
which the packet was transmitted. A 32-bit space at the second level
allows for an adequately large number of logical interfaces at each network
element. The semantics of interface identifiers may differ across devices;
each INT hop chooses the interface type it reports at each of the two levels.
* Egress timestamp
: The device local time when the INT packet was processed by the egress
physical or logical port.
* Hop latency
: Time taken for the INT packet to be switched within the device.
* Egress interface TX Link utilization
: Current utilization of the egress interface via which the INT packet was
sent out. Again, devices can use different mechanisms to keep track of the
current rate, such as bin bucketing or moving average. While the latter is
clearly superior to the former, the INT framework does not stipulate the
mechanics and simply leaves those decisions to device vendors.
* Queue occupancy
: The build-up of traffic in the queue (in bytes, cells, or packets) that the
INT packet observes in the device while being forwarded. The format of this
4-octet metadata field is implementation specific and the metadata semantics
YANG model shall describe the format and units of this metadata field in the
metadata stack.
* Buffer occupancy
: The build-up of traffic in the buffer (in bytes, cells, or packets) that the
INT packet observes in the device while being forwarded. A use case is when
the buffer is shared between multiple queues. The format of this 4-octet
metadata field is implementation specific and the metadata semantics YANG
model shall describe the format and units of this metadata field in the
metadata stack.
A metadata semantics YANG model [^metadata-yang] is being developed that allows
nodes to report details of the metadata format, units, and semantics.
[^metadata-yang]: p4-dtel-metadata-semantics, [https://github.com/p4lang/p4-applications/blob/master/telemetry/code/models/p4-dtel-metadata-semantics.yang](https://github.com/p4lang/p4-applications/blob/master/telemetry/code/models/p4-dtel-metadata-semantics.yang)
[]{tex-cmd: "\newpage"}
# INT Headers
This section specifies the format and location of INT Headers.
INT Headers and their locations are relevant for INT-MX and INT-MD modes where
the INT instructions (and metadata stack in case of MD mode) are written into
the packets.
## INT Header Types
There are three types of INT Headers: MD-type, MX-type and Destination-type.
A given INT packet may carry either an MD-type or an MX-type header, and/or
a Destination-type header. When Destination-type and MD-type or MX-type
headers are present, the MD-type header or MX-type header must precede the
Destination-type header.
* MD-type (**INT Header type 1**)
- Intermediate nodes (INT Transit Hops) must process this type of INT
Header. The format of this header is defined in Section
[#sec-int-md-metadata-header-format].
* Destination-type (**INT Header type 2**)
- Destination headers must only be consumed by the INT Sink. Intermediate
nodes must ignore Destination headers.
- Destination headers can be used to enable Edge-to-Edge communication between
the INT Source and INT Sink. For example:
- INT Source can add a sequence number to detect loss of INT packets.
- INT Source can add the original values of IP TTL and INT Remaining
Hop Count, thus enabling the INT sink to detect network devices
on the path that do not support INT by comparing the IP TTL
decrement against the INT Remaining Hop Count decrement (assuming each
network device is an L3 hop); a sketch of this comparison follows this list
- The format of Destination-type headers will be defined in a future
revision. Note some Edge-to-Edge INT use cases can be supported by
'source-only' and 'source-inserted' metadata, part of Domain Specific
Instructions in the MD-type and MX-type headers.
* MX-type (**INT Header type 3**)
- Intermediate nodes (INT Transit Hops) must process this type of INT
Header and generate reports to the monitoring system as instructed.
The format of this header is defined in Section [#int-mx-header].
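For illustration, the following non-normative Python sketch (not part of this
specification) shows how an INT Sink could combine the original IP TTL and INT
Remaining Hop Count carried Edge-to-Edge with the values it observes on arrival
to estimate how many L3 devices on the path did not participate in INT. All
names are hypothetical.

```
def non_int_l3_hops(orig_ttl, ttl_at_sink, orig_rhc, rhc_at_sink):
    # Every L3 hop decrements the IP TTL; only INT hops that process the
    # packet decrement the INT Remaining Hop Count.
    ttl_decrement = orig_ttl - ttl_at_sink
    rhc_decrement = orig_rhc - rhc_at_sink
    # The difference estimates the number of L3 devices that did not act
    # as INT hops (assuming each network device is an L3 hop).
    return ttl_decrement - rhc_decrement
```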
## Per-Hop Header Operations
### INT Source Node
In the INT-MD and INT-MX modes, the INT Source node in the packet forwarding
path creates the INT-MD or INT-MX Header.
In INT-MD, the source node adds its own INT metadata after the header. To avoid
exhausting header space in the case of a forwarding loop or any other anomalies,
it is strongly recommended to limit the total number of INT metadata fields added
by Transit Hop nodes by setting the *Remaining Hop Count* field in INT header
appropriately.
The INT-MD and INT-MX headers are described in detail in the subsequent
sections.
### INT Transit Hop Node
In the INT-MD mode, each node in the packet forwarding path creates additional
space in the INT-MD Header on-demand to add its own INT metadata. To avoid
exhausting header space in the case of a forwarding loop or any other anomalies,
each INT Transit Hop must decrement the *Remaining Hop Count* field in the INT
header appropriately.
In the INT-MX mode, each node in the packet forwarding path follows the
instructions in the INT-MX Header, gathers the device-specific metadata, and
exports the device metadata using the Telemetry Report.
INT Transit Hop nodes may update the *DS Flags* field in the INT-MD or INT-MX
header. The *Hop ML*, *Instruction Bitmap*, *Domain Specific ID* and
*DS Instruction* fields must not be modified by Transit Hop nodes.
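As a simplified, non-normative sketch (Python pseudocode rather than P4), an
INT-MD transit hop's handling of the *Remaining Hop Count* could look as
follows, assuming that a hop whose budget has reached zero adds no further
metadata; the object and field names are hypothetical.

```
def transit_hop_int_md(md_header, hop_metadata_words):
    # Respect the Remaining Hop Count budget set by the INT Source.
    if md_header.remaining_hop_count == 0:
        return  # budget exhausted: do not grow the metadata stack
    md_header.remaining_hop_count -= 1
    # Create additional space and add this hop's metadata (Hop ML words);
    # the exact placement in the stack is defined by the INT-MD metadata
    # header format in a later section.
    md_header.metadata_stack = hop_metadata_words + md_header.metadata_stack
```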
### INT Sink Node
In INT-MD mode, the INT Sink node removes the INT Headers and Metadata stack from
the packet, and decides whether to report the collected information.
In INT-MX mode, the INT Sink node removes the INT-MX header, gathers the
device specific metadata and decides whether to report that metadata.
## MTU Settings
In both INT-MX and INT-MD modes, it is possible that insertion of the INT header
at the INT Source node may cause the egress link MTU to be exceeded.
In INT-MD mode, as each hop creates additional space in the INT header
to add its metadata, the packet size increases. This can potentially
cause egress link MTU to be exceeded at an INT node.
This may be addressed in the following ways -
* It is recommended that the MTU of links between INT sources and sinks be
configured to a value higher than the MTU of preceding links (server/VM NIC
MTUs) by an appropriate amount. Configuring an MTU differential of
[Per-hop Metadata Length\*4\*INT Hop Count + Fixed INT Header Length] bytes
(just [Fixed INT Header Length] for INT-MX mode),
based on conservative values of total number of INT hops and Per-hop
Metadata Length, will prevent the egress MTU from being exceeded due to INT metadata
insertion at INT hops (a worked example follows this list). The Fixed INT Header Length is the sum of INT metadata
header length (12B) and the size of encapsulation-specific shim/option header
(4B) as defined in Section [#sec-header-location].
* An INT source/transit node may optionally participate in dynamic discovery
of Path MTU for flows being monitored by INT by transmitting ICMP message to
the traffic source as per Path MTU Discovery mechanisms of the corresponding
L3 protocol (RFC 1191 for IPv4, RFC 1981 for IPv6). An INT source or transit
node may report a conservative MTU in the ICMP message, assuming that the
packet will go through the maximum number of allowed INT hops (i.e. *Remaining
Hop Count* will decrement to zero), accounting for cumulative metadata
insertion at all INT hops, and assuming that the egress MTU at all downstream
INT hops is the same as its own egress link MTU. This will help the path
MTU discovery source to converge to a path MTU estimate
faster, although this would be a conservative path MTU estimate.
Alternatively, each INT hop may report an MTU only accounting for the metadata
it inserts. This would enable the path MTU discovery source to converge to a
precise path MTU, at the cost of receiving more ICMP messages, one from each
INT hop.
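As mentioned in the first bullet above, the MTU differential can be computed
from conservative deployment parameters. A non-normative worked example in
Python (the parameter values are illustrative only, not mandated by this spec):

```
def int_md_mtu_headroom(per_hop_md_len_words, max_int_hops):
    # Fixed INT Header Length = 12-byte INT metadata header plus a
    # 4-byte encapsulation-specific shim/option header.
    fixed_int_header_len = 12 + 4
    return per_hop_md_len_words * 4 * max_int_hops + fixed_int_header_len

# e.g. 6 metadata words per hop and at most 8 INT hops:
print(int_md_mtu_headroom(6, 8))   # 6*4*8 + 16 = 208 bytes of headroom
```

For INT-MX mode, only the Fixed INT Header Length term applies, since the
packet does not grow at transit hops.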
Regardless of whether or not an INT transit node participates in Path MTU
discovery, if it cannot insert all requested metadata because doing so would
cause the packet length to exceed the egress link MTU, it must either:
- not insert any metadata and set the M bit in the INT header, indicating
that egress MTU was exceeded at an INT hop, or
- report the metadata stack collected from previous hops (setting the
Intermediate Report bit if a Telemetry Report 2.0 [^telem-report] packet is
generated) and remove the reported metadata stack from the packet, including
the metadata from this transit hop in either the report or embedding in the
INT-MD metadata header.
An INT source inserts 12 bytes of fixed INT headers, and may also insert
Per-hop Metadata Length\*4 bytes of its own metadata. If inserting the
fixed headers causes the egress link MTU to be exceeded, INT cannot be
initiated for such packets. If an INT source is programmed to insert
its own INT metadata, and there is enough room in a packet to insert fixed INT
headers, but no additional room for its INT metadata, the source must
initiate INT and set the M bit in the INT header.
In theory, an INT transit node can perform IPv4 fragmentation to overcome
egress MTU limitation when inserting its metadata. However, IPv4 fragmentation
can have adverse impact on applications. Moreover, IPv6 packets cannot be
fragmented at intermediate hops. Also, fragmenting packets at INT transit hops,
with or without copying preceding INT metadata into fragments, imposes the
extra complexity of correlating fragments in the INT monitoring engine.
Considering all these factors, this specification requires that an INT
node must not fragment packets in order to append INT information to
the packet.
## Congestion Considerations
Use of the INT encapsulation should not increase the impact of congestion on
the network. While many transport protocols (e.g. TCP, SCTP, DCCP, QUIC)
inherently provide congestion control mechanisms, other transport protocols
(e.g. UDP) do not. For the latter case, applications may provide congestion
control or limit the traffic volume.
It is recommended not to apply INT to application traffic that is known not to
be congestion controlled (as described in RFC 8085 [^RFC8085] Section 3.1.11).
In order to achieve this, packet filtering mechanisms such as access control
lists should be provided, with match criteria including IP protocol and L4
ports.
Because INT encapsulation endpoints are located within the same administrative
domain, an operator may allow for INT encapsulation of traffic that is known
not to be congestion controlled. In this case, the operator should carefully
consider the potential impact of congestion, and implement appropriate
mechanisms for controlling or mitigating the effects of congestion. This
includes capacity planning, traffic engineering, rate limiting, and other
mechanisms.
[^RFC8085]: UDP Usage Guidelines, [RFC 8085](https://www.rfc-editor.org/info/rfc8085), March 2017.
## INT over any encapsulation
The specific location for INT Headers is intentionally not specified:
an INT Header can be inserted as an option or payload of any encapsulation
type. The only requirements are that the encapsulation header provides
sufficient space to carry the INT information and that all INT nodes
(Sources, transit hops and Sinks) agree on the location of the INT Headers.
The following choices are potential encapsulations using common protocol
stacks, although a deployment may choose a different encapsulation format
if better suited to their needs and environment.
* INT over VXLAN (as VXLAN payload, per GPE extension)
* INT over Geneve (as Geneve option)
* INT over NSH (as NSH payload)
* INT over TCP (as payload)
* INT over UDP (as payload)
* INT over GRE (as a shim between GRE header and encapsulated payload)
## Checksum Update
As described above in Section [#sec-int-over-any-encapsulation], INT
headers and metadata may be carried in an L4 protocol such as TCP or UDP,
or in an encapsulation header that includes an L4 header, such as VXLAN.
The checksum field in the TCP or UDP L4 header needs
to be updated as INT nodes modify the L4 payload via insertion/removal of
INT headers and metadata. However, there are certain exceptions.
For example, when UDP is transported over IPv4, it is possible to assign a
zero checksum, causing the receiver to ignore the value of the checksum field
(as defined in RFC 768). For UDP over IPv6, there are specific use cases in
which it is possible to assign a zero Checksum (as defined in RFC 6936).
INT source, transit and sink nodes must comply with IETF standards
for Layer 4 transport protocols with respect to whether or not Layer 4
checksum is to be updated upon modification of Layer 4 payload.
For example, if an INT source/transit/sink hop receives
UDP traffic with zero L4 checksum, it must not update the L4 checksum
in conformance with the behavior defined in relevant IETF standards
such as RFC 768 and RFC 6936.
When L4 checksum update is required, an INT source/transit node may update
the checksum in one of two ways:
* Update the L4 Checksum field such that the new value is equal to the checksum
of the new packet, after the INT-related updates (header additions/removals,
field updates), or
* If the INT source indicates that Checksum-neutral updates are allowed by
setting an instruction bit corresponding to the Checksum Complement metadata,
then the INT source/transit nodes may assign a value to the Checksum
Complement metadata which guarantees that the existing L4 Checksum is the
correct value of the packet after the INT-related updates.
The motivation for the Checksum Complement is that some hardware implementations
process data packets in a serial order, which may impose a problem when INT
fields and metadata that reside after the L4 Checksum field are inserted or
modified. Therefore, the Checksum Complement metadata, if present, is the last
metadata field in the stack.
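The following non-normative Python sketch illustrates one way a source or
transit node could pick a Checksum Complement value so that the bytes it
inserts contribute "zero" to the one's-complement L4 checksum, leaving the
existing checksum valid; the helper names are hypothetical and 16-bit
alignment of the inserted data is assumed.

```
def ones_complement_sum(data: bytes) -> int:
    # 16-bit one's-complement sum over a byte string (INT insertions are
    # 4-byte aligned, so odd-length padding is not needed in practice).
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # fold the carry
    return total

def checksum_complement(inserted_bytes: bytes) -> int:
    # Value carried in the low 16 bits of the 4-byte Checksum Complement
    # metadata so that the inserted data plus the complement sum to
    # one's-complement zero, keeping the existing L4 Checksum correct.
    return (~ones_complement_sum(inserted_bytes)) & 0xFFFF
```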
Note that when the Checksum Complement metadata is present, source/transit
nodes may choose to update the L4 Checksum field instead of using the
Checksum Complement metadata. In this case the Checksum Complement metadata
must be assigned the reserved value 0xFFFFFFFF. A host that verifies the
L4 Checksum will be unaffected by whether some or all of the nodes chose
not to use the Checksum Complement, since the value of the L4 Checksum
should match the checksum of the payload in either case.
An INT sink cannot perform a Checksum-neutral update using Checksum Complement
metadata, as it removes all INT headers from the packet. Thus, an INT sink
performing a checksum update has to do so by updating the L4 Checksum
field.
Regardless of whether checksum update is performed via modifying the L4
checksum field or via use of Checksum Complement metadata, performing the
update based on an incremental checksum calculation (as is typically done)
will ensure that any potential corruption is detected at the point of
checksum validation. If full checksum computation is performed at an INT
node, it should be preceded by checksum validation so as to not mask out any
corruption at preceding hops.
## Header Location
We describe four encapsulation formats in this specification, covering
different deployment scenarios, with and without network virtualization:
1. *INT over IPv4/GRE* - INT headers are carried between the GRE header and the
encapsulated GRE payload.
2. *INT over TCP/UDP* - A shim header is inserted following TCP/UDP
header. INT Headers are carried between this shim header and TCP/UDP payload.
Since v2.0, the spec also supports an option to insert a new UDP header
(followed by INT headers) before the existing L4 header.
This approach doesn’t rely on any tunneling/virtualization mechanism and is
versatile enough to apply INT to both native and virtualized traffic.
3. *INT over VXLAN* - VXLAN generic protocol extensions [^VXLAN-GPE] are used
to carry INT Headers between the VXLAN header and the encapsulated VXLAN
payload.
4. *INT over Geneve* - Geneve is an extensible tunneling framework, allowing
Geneve options to be defined for INT Headers.
[^VXLAN-GPE]: Generic Protocol Extension for VXLAN, [draft-ietf-nvo3-vxlan-gpe-09](https://tools.ietf.org/html/draft-ietf-nvo3-vxlan-gpe-09), December 2019.
### INT over IPv4/GRE
In case the traffic being monitored is not encapsulated by any virtualization
header, INT over VXLAN or INT over Geneve is not helpful. Instead, a GRE
encapsulation as defined in RFC 2784 [^RFC2784] can be utilized. The INT
metadata header and INT metadata follows the GRE header. In an administrative
domain where INT is used, insertion of the INT metadata header and metadata in
GRE is enabled at the INT source and deletion of INT metadata header
and metadata is enabled at the INT sink by means of configuration.
There are two scenarios when utilizing GRE encapsulation to support INT:
1. If the incoming packet at the source node of the INT domain is GRE
encapsulated, then the source node should add the INT Metadata Header
and Metadata following the GRE header. The sink node of the INT domain
should remove the INT Metadata Header and Metadata stack before forwarding
the GRE encapsulated packet to the destination.
2. If the incoming packet at the source node of the INT domain is not GRE
encapsulated, then the source node should add a GRE encapsulation and insert
the INT Metadata Header and Metadata following the GRE header. The sink node
of the INT domain should remove the GRE encapsulation along with removing
the INT Metadata Header and the Metadata stack before forwarding the packet
to the destination.
IPv4 GRE Option format for carrying INT Header and
Metadata:
```
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<-+
|C|R|K|S|s|Recur| Flags | Ver | Protocol Type = TBD_INT | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| Checksum (optional) | Offset (Optional) | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ G
| Key (Optional) | R
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ E
| Sequence Number (Optional) | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| Routing (Optional) | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<-+
| Type |G| Rsvd| Length | Next Protocol | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ I
| | N
| Variable Option Data (INT Metadata Headers and Metadata) | T
| | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<-+
```
The GRE header and fields are defined in RFC 2784 [^RFC2784]. The GRE Protocol
Type value is TBD_INT.
The INT Shim header for GRE option is defined as follows:
* **Type (4b):** This field indicates the type of INT Header following the shim
header. The Type values are defined in Section [#sec-int-header-types].
* **G (1b):** Indicates whether the GRE headers were inserted to transport INT
by the INT source.
- **0:** Original packet (before insertion of INT headers and metadata) had
GRE encapsulation.
- **1:** Original packet had no GRE encapsulation, hence the INT source
inserted GRE.
- This is a hint that helps the INT sink (when it is not the GRE tunnel endpoint)
determine whether to remove the GRE headers as part of INT decapsulation (if G=1).
* **Rsvd (3b):** reserved for future use, set to zero upon transmission and
ignored upon reception.
* **Length (8b):** This is the total length of the INT metadata header and INT stack,
excluding the shim header, in 4-byte words. A non-INT device may read this
field and skip over INT headers.
* **Next Protocol (16b):** this field contains an EtherType value (defined in
the IANA registry [^ETYPES])
indicating the type of the protocol following the INT stack. An implementation
receiving a packet containing a type value which is not listed in the registry
should discard the packet.
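As a non-normative illustration, the 4-byte INT shim header for the GRE option
described above could be built and parsed as follows in Python (the TBD
protocol values remain placeholders and are not assigned here):

```
import struct

def pack_gre_int_shim(int_type, g, length_words, next_protocol):
    # | Type(4) | G(1) | Rsvd(3) | Length(8) | Next Protocol(16) |
    first_byte = ((int_type & 0xF) << 4) | ((g & 0x1) << 3)   # Rsvd = 0
    return struct.pack("!BBH", first_byte, length_words & 0xFF,
                       next_protocol & 0xFFFF)

def parse_gre_int_shim(shim: bytes):
    first_byte, length_words, next_protocol = struct.unpack("!BBH", shim[:4])
    return {
        "type": first_byte >> 4,            # INT Header type that follows
        "g": (first_byte >> 3) & 0x1,       # 1 = GRE inserted by the INT source
        "length_words": length_words,       # INT headers + stack, in 4-byte words
        "next_protocol": next_protocol,     # EtherType following the INT stack
    }
```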
[^RFC2784]: Generic Routing Encapsulation (GRE), [RFC 2784](https://www.rfc-editor.org/info/rfc2784), March 2000.
[^ETYPES]: [IANA Ethernet Numbers](https://www.iana.org/assignments/ieee-802-numbers/ieee-802-numbers.xhtml).
### INT over TCP/UDP
In case the traffic being monitored is not encapsulated by any virtualization
header, one can also put the INT metadata just after layer 4 headers (TCP/UDP).
The scheme assumes that the non-INT devices between the INT source and the
INT sink either do not parse beyond layer-4 headers or can skip through the
INT stack using the Length field in the INT shim header. If TCP has any options,
the INT stack may come before or after the TCP options but the decision must
be consistent within an INT domain.
Note that INT over UDP can be used even when the packet is encapsulated by VXLAN,
Geneve, or GUE (Generic UDP Encapsulation). INT over TCP/UDP also makes it
easier to add INT stack into outer, inner, or even both layers. In such cases
both INT header stacks carry information for their respective layers and do not
interfere with each other.
A field in Ethernet, IP, or TCP/UDP should indicate if the
INT header exists after the TCP/UDP header. We propose three options.
1. UDP destination port field: a new UDP port number (INT_TBD) will be assigned
by IANA to indicate the existence of INT after UDP. This option supports two
cases:
- The original packet already has UDP header either as user application
protocol or as part of another UDP-based encapsulation such as VXLAN,
GENEVE, RoCEv2. INT is inserted after the UDP header with the UDP
destination port number changed to INT_TBD.
The original destination port number
is carried in the shim header for the INT sink to restore, when it
removes the INT stack from the packet.
- A new UDP header for INT is inserted between IP and the existing L4 header.
The protocol field of IP header is set to 17 for UDP and the original
IP protocol value is carried in the INT shim header.
In the new UDP header, INT_TBD is used as the destination port number.
It is recommended that the source port number of the new UDP header be
calculated using a hash of fields from the original packet, for example
the original outer 5-tuple or the original L4 header fields.
This is to enable a level of entropy for ECMP/LAG load balancing logic
(a sketch of such a hash appears after the discussion of the three options below).
It is recommended that the checksum in the new UDP header be set to zero.
For IPv6 packets, this falls under the case of tunnel protocols,
which are allowed to use zero UDP checksums as specified in RFC 6936.
The existing L4 header will typically include a checksum computed
using the encapsulating IPv6 header fields, thus offering some protection
against IPv6 header corruption.
In both cases, traffic with INT headers is likely to be hashed
to a different path in the network as the new UDP
destination port (INT_TBD) becomes part of the outer 5-tuple used by ECMP.
The INT shim header for UDP has a field NPT (Next Protocol Type) that
indicates which of the two cases is applied to a given INT packet. In case
a new UDP header was inserted,
INT sink must copy the original IP protocol number from the shim header
to IP header, and strip the newly added UDP header with all INT headers.
For the case that original packet already had UDP header, INT sink must
restore the original destination port number from the shim header
into the UDP header and strip the INT headers.
2. IPv4 DSCP or IPv6 Traffic Class field: A value or a bit can be used to
indicate the existence of INT after TCP/UDP. When the INT source inserts the
INT header into a packet, it sets the reserved value in the field or sets the
bit. The INT source may write the original DSCP value in the INT headers so
that the INT sink can restore the original value. Restoring the original value
is optional.
- Allocating a bit, as opposed to a value codepoint, will allow the rest of
DSCP field to be used for QoS, hence allowing the coexistence of DSCP-based
QoS and INT. If the traffic being monitored is subjected to QoS services
such as rate limiting, shaping, or differentiated queueing based on DSCP
field, QoS classification in the network must be programmed to
ignore the designated bit position to ensure that the INT-enabled traffic
receives the same treatment as the original traffic being monitored.
- In brownfield scenarios, however, the network operator may not find a bit
available to allocate for INT but may still have a fragmented space of 32
unused DSCP values. The operator can allocate an INT-enabled DSCP value
for every QoS DSCP value, and map the INT-enabled DSCP value to the same
QoS behavior as the corresponding QoS DSCP value. This may double the
number of QoS rules but will allow the co-existence of DSCP-based QoS and
INT even when a single DSCP bit is not available for INT.
- Within an INT domain, DSCP values used for INT must exclusively be used
for INT. INT transit and sink nodes must not receive non-INT packets
marked with DSCP values used for INT. Any time a node forwards a packet
into the INT domain and there is no INT header present, it must ensure that
the DSCP/Traffic class value is not the same as any of the values used
to indicate INT.
3. Probe Marker fields: If the DSCP field or values cannot be reserved for INT, a
probe marker option can be used. A specific 64-bit value can be inserted
after the TCP/UDP header to indicate the existence of INT after TCP/UDP.
These fields must be
interpreted as unsigned integer values in network byte order. This approach is
a variation of an early IETF draft with existing implementation[^DPP].
[^DPP]: Data-plane probe for in-band telemetry collection, [draft-lapukhov-dataplane-probe-01](https://tools.ietf.org/html/draft-lapukhov-dataplane-probe-01), June 2016.
INT probe marker for TCP/UDP:
```
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Probe Marker (1) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Probe Marker (2) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
```
With arbitrary values being inserted after TCP/UDP header as probe markers,
the likelihood of conflicting with user traffic in a data center is
low, but cannot be completely eliminated. To further reduce the chance of
conflict, a deployment could choose to also examine TCP/UDP port numbers
to validate INT probe marker.
Any of the above options may be used in an INT domain, provided that the INT
transit and sink nodes in the INT domain comply with the mechanism chosen
at the INT sources, and are able to correctly identify the presence and location
of INT headers. The above approaches are not intended to interoperate in a
mixed environment; for example, it would be incorrect to mark a packet for INT
using both DSCP and probe marker, as INT nodes that only understand
DSCP marking and do not recognize probe markers may incorrectly interpret the
first four bytes of the probe marker as INT shim header.
It is strongly recommended that only one option be used within an INT domain.
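As an aside on option 1 above, the entropy source port of a newly inserted UDP
header could be derived from the original packet roughly as in this
non-normative sketch; INT_UDP_PORT stands in for the unassigned INT_TBD value
and the hash choice is purely illustrative.

```
import zlib

INT_UDP_PORT = 0xFFFF   # placeholder for the IANA-assigned INT_TBD port

def entropy_source_port(src_ip, dst_ip, ip_proto, src_port, dst_port):
    # Hash the original outer 5-tuple so ECMP/LAG still sees per-flow entropy.
    key = f"{src_ip}|{dst_ip}|{ip_proto}|{src_port}|{dst_port}".encode()
    return 0xC000 | (zlib.crc32(key) & 0x3FFF)   # keep within the dynamic range

def new_udp_header_fields(orig_5tuple):
    return {
        "src_port": entropy_source_port(*orig_5tuple),
        "dst_port": INT_UDP_PORT,   # indicates that INT follows this UDP header
        "checksum": 0,              # zero UDP checksum, as recommended above
    }
```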
We introduce an INT shim header for TCP/UDP. The INT
metadata header and INT metadata stack will be encapsulated between
the shim header and the TCP/UDP payload.
INT shim header for TCP/UDP:
```
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Type |NPT|R|R| Length | UDP port, IP Proto, or DSCP |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
```
* **Type (4b):** This field indicates the type of INT Header following the shim
header.
The Type values are defined in Section [#sec-int-header-types].
* **NPT (Next Protocol Type, 2b):** This field is meaningful only when the UDP destination
port number (INT_TBD) is used to indicate the existence of INT. In the other cases,
this field must be zero. When UDP destination port is INT_TBD, this field may have
one of the two values:
- **one (1):** indicates that the original UDP payload follows the INT stack,
and the last two bytes of the shim header carry the original UDP destination port.
- **two (2):** indicates that another (the original) L4 header follows
the INT stack, and the last byte of the shim header carries the IP protocol
value for the L4 layer.
* **Length (8b):** This is the total length of INT metadata header and INT stack
in 4-byte words. The length of the shim header (1 word) is NOT counted
since INT version 2.0.
A non-INT device may read this field and skip over INT headers.
* **UDP port, IP proto, or DSCP (16b):** The contents of this field differ
depending on the value of NPT.
- **NPT=0:** The first byte and the last two bits of this 16b field are reserved,
set to zero upon transmission and ignored upon reception. The first 6 bits of
the second byte may optionally carry the original DSCP value.
- **NPT=1:** The original UDP destination port value.
- **NPT=2:** The first byte is reserved, set to zero upon transmission and ignored
upon reception. The second byte carries the original IP protocol value.
The other bits in the shim header are reserved (R) for future use,
set to zero upon transmission and ignored upon reception.
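A non-normative Python sketch of packing this shim header and interpreting its
last 16 bits according to the NPT value (field names follow the description
above):

```
import struct

def pack_tcpudp_int_shim(int_type, npt, length_words, npt_field):
    # | Type(4) | NPT(2) | R(1) | R(1) | Length(8) | NPT-dependent 16 bits |
    first_byte = ((int_type & 0xF) << 4) | ((npt & 0x3) << 2)   # R bits = 0
    return struct.pack("!BBH", first_byte, length_words & 0xFF,
                       npt_field & 0xFFFF)

def interpret_npt_field(npt, npt_field):
    if npt == 1:
        # Original UDP destination port, restored by the INT sink.
        return {"original_udp_dst_port": npt_field}
    if npt == 2:
        # Second byte carries the original IP protocol of the following L4 header.
        return {"original_ip_proto": npt_field & 0xFF}
    # NPT = 0: the first 6 bits of the second byte may carry the original DSCP.
    return {"original_dscp": (npt_field >> 2) & 0x3F}
```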
### INT over VXLAN GPE
VXLAN is a common tunneling protocol for network virtualization and is supported
by most software virtual switches and hardware network elements. The VXLAN
header as defined in RFC 7348 is a fixed 8-byte header as shown below.
VXLAN Header:
```
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1