# Achieving sub-second IGP convergence in large IP networks: Article Summary

José Duarte

March 2019

## 1 Introduction

The paper describes and analyses the factors that influence convergence time in IGP networks. It gives a short description of the IS-IS protocol and then breaks the convergence time into its components. It shows that the main obstacles to sub-second convergence lie in the Routing Information Base (RIB) and Forwarding Information Base (FIB) updates, and that their impact can be reduced by limiting the number of prefixes at the design stage, prioritizing prefixes during RIB/FIB updates, and updating the FIB incrementally.

A simulation model was used to study convergence time in larger networks, concluding that the sub-second convergence goal is achievable in ISP-scale networks.

## 2 IS-IS Protocol

In IS-IS, a router exchanges HELLO PDUs with its neighbors to determine its local topology. It then floods a link-state packet (LSP) describing that topology; the LSP contains at least the identifiers of its neighbors.

On broadcast networks, IS-IS routers elect one router to "represent" the broadcast network. That router then generates an LSP describing the network and all routers attached to it.

Two situations force an LSP to be flooded: either the information it contains has changed, so a new LSP must be generated and flooded, or its lifetime has expired and it must be flooded again. To make flooding reliable, each LSP is acknowledged on each link.

When a router receives an LSP describing a topology change, it updates its Link State Database (LSDB). This triggers an update of the RIB, which in turn triggers an update of the FIB. Updating the RIB requires computing a new Shortest Path Tree (SPT) from the information in the LSDB.
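The SPT computation can be illustrated with a minimal Dijkstra sketch. It assumes the LSDB is modelled as a plain dictionary mapping each router to its neighbors and link metrics; the function name `shortest_path_tree` and the three-router topology are hypothetical and are not taken from the paper or from any real IS-IS implementation.

```python
import heapq

def shortest_path_tree(lsdb, root):
    """Run Dijkstra over an LSDB given as {router: {neighbor: metric}}.

    Returns {destination: (cost, first_hop)} for every reachable router;
    the first hop is what a RIB entry would use as next hop.
    """
    best = {root: (0, None)}
    heap = [(0, root)]
    done = set()
    while heap:
        cost, node = heapq.heappop(heap)
        if node in done:
            continue  # stale heap entry, a cheaper path was already settled
        done.add(node)
        first_hop = best[node][1]
        for neighbor, metric in lsdb.get(node, {}).items():
            new_cost = cost + metric
            # Directly attached neighbors are their own first hop.
            hop = neighbor if node == root else first_hop
            if neighbor not in best or new_cost < best[neighbor][0]:
                best[neighbor] = (new_cost, hop)
                heapq.heappush(heap, (new_cost, neighbor))
    return best

# Hypothetical topology: A-B (metric 1), B-C (metric 1), A-C (metric 5).
lsdb = {
    "A": {"B": 1, "C": 5},
    "B": {"A": 1, "C": 1},
    "C": {"A": 5, "B": 1},
}
before = shortest_path_tree(lsdb, "A")

# A new LSP reports that the A-B link failed: update the LSDB, rerun the SPT.
del lsdb["A"]["B"]
del lsdb["B"]["A"]
after = shortest_path_tree(lsdb, "A")
```

In this toy case the route from A to C moves from the two-hop path via B to the direct link, which is the kind of RIB change that then has to be pushed into the FIB.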

## 3 Convergence Time Components

The convergence time can be characterized as D + O + F + SPT + RIB + DD where:

- D: link failure detection time
- O: time to originate the LSP describing the new topology
- F: complete flooding time from the node detecting the failure to the rerouting nodes that require a FIB update to bring the network to a consistent forwarding state
- SPT: Shortest Path Tree computation time
- RIB: time taken to update the RIB and the FIB
- DD: time to distribute the FIB updates to the linecards (in the case of a distributed router architecture)
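As a small worked illustration of the sum above, the sketch below adds the six components and checks the result against the one-second target. The class name, field names, and numeric values are placeholders chosen for the example; they are not measurements from the paper.

```python
from dataclasses import dataclass

@dataclass
class ConvergenceComponents:
    """The six terms of D + O + F + SPT + RIB + DD, in milliseconds."""
    detection_ms: float      # D
    origination_ms: float    # O
    flooding_ms: float       # F
    spt_ms: float            # SPT
    rib_fib_ms: float        # RIB (RIB and FIB update)
    distribution_ms: float   # DD

    def total_ms(self):
        return (self.detection_ms + self.origination_ms + self.flooding_ms
                + self.spt_ms + self.rib_fib_ms + self.distribution_ms)

    def is_sub_second(self):
        return self.total_ms() < 1000.0

# Placeholder values only, chosen to make the arithmetic easy to follow.
example = ConvergenceComponents(
    detection_ms=10, origination_ms=10, flooding_ms=30,
    spt_ms=40, rib_fib_ms=400, distribution_ms=50,
)
print(example.total_ms(), example.is_sub_second())
```

Writing the budget this way makes it easy to see which term dominates when one component is varied.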

### 3.1 Router Architecture, Processor Performance, Operating System

One of the main bottlenecks in convergence time is the time taken to update the RIB and the FIB, so a faster processor yields faster convergence.

A distributed router architecture with hardware packet processors is presented as well suited to the problem, since the central CPU (the route processor, RP) can dedicate all its power to control plane operation: it handles all routing operations and delegates to the linecards the writing of FIB updates to the hardware packet processors.

The operating system (OS) running on the RP and on the linecard (LC) CPUs implements a process scheduler with multiple priorities and preemption, allowing the IS-IS process to be scheduled immediately upon link failure.

During convergence on a distributed platform, two processes share the CPU: the IS-IS process, which updates the RIB and the FIB, and the IPC process, which distributes the FIB updates to the LC CPUs.

Given that the main bottleneck is the RIB update, the IS-IS process starts by updating the prefixes with the highest priority. To ensure that the LCs are also updated, the CPU is shared using a quantum: when the IS-IS quantum expires, the IPC process is scheduled and distributes the updates to the LCs; when its quantum expires, the prefix update resumes, and the cycle repeats.
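The alternation between the two processes can be sketched as a toy simulation. It assumes the quantum is counted in prefixes rather than in CPU time, and that a lower priority number means a more important prefix; both are simplifications for illustration. The function `converge` and the event names are hypothetical.

```python
from collections import deque

def converge(prefixes, quantum=2):
    """Interleave RIB/FIB updates with distribution to the linecards.

    prefixes: list of (prefix, priority) pairs, lower number = higher priority.
    quantum:  number of prefixes each process handles before yielding
              (an assumption; the real quantum is a time slice).
    Returns the ordered list of events.
    """
    pending = deque(sorted(prefixes, key=lambda p: p[1]))
    to_distribute = deque()
    log = []
    while pending or to_distribute:
        # IS-IS quantum: update the highest-priority pending prefixes.
        for _ in range(min(quantum, len(pending))):
            prefix, _priority = pending.popleft()
            log.append(("rib_fib_update", prefix))
            to_distribute.append(prefix)
        # IPC quantum: push already-updated prefixes to the linecards.
        for _ in range(min(quantum, len(to_distribute))):
            log.append(("distribute_to_lc", to_distribute.popleft()))
    return log

events = converge(
    [("10.0.0.0/8", 2), ("192.0.2.1/32", 0), ("198.51.100.0/24", 1)],
    quantum=2,
)
for event in events:
    print(event)
```

With this interleaving, the two highest-priority prefixes reach the linecards before the low-priority prefix is even processed, which is the point of combining prioritization with the quantum.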

### 3.2 Link Failure Detection

The use of Packet over SDH/SONET (POS) links is identified as one of the major enablers of sub-second IGP convergence, given their ability to detect failures within tens of milliseconds.

Mechanisms built into SDH and SONET allow the LC hardware to detect a failure in less than 10 milliseconds. When a failure is detected, a high-priority interrupt fires and causes a POS routine to be executed. This routine enforces a user-defined hold-time, which delays reporting the failure to the central CPU and gives SDH/SONET protection time to act. If protection is not available, the failure is signaled immediately to the central CPU, which updates the interface state and schedules IS-IS to react.
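The hold-time behaviour can be sketched as a small decision function. This is a simplified model under stated assumptions: the case without protection is represented by a hold-time of zero, and a link restored within the hold-time is never reported. The function name and parameters are hypothetical and do not correspond to any router's actual configuration knobs.

```python
def time_reported_to_cpu(detect_ms, hold_time_ms, protection_restores_at_ms=None):
    """Return when the failure is signaled to the central CPU, in ms.

    detect_ms:                 time at which the LC hardware detects the failure
    hold_time_ms:              user-defined hold-time (0 when no protection exists)
    protection_restores_at_ms: time at which SDH/SONET protection restores the
                               link, or None if it never does
    Returns None when protection restores the link within the hold-time,
    so IS-IS is never notified.
    """
    deadline = detect_ms + hold_time_ms
    if protection_restores_at_ms is not None and protection_restores_at_ms <= deadline:
        return None
    return deadline

# No protection: hold-time of 0, so the failure is reported as soon as detected.
print(time_reported_to_cpu(detect_ms=10, hold_time_ms=0))
# Protection restores the link inside the hold-time: nothing is reported.
print(time_reported_to_cpu(detect_ms=10, hold_time_ms=50, protection_restores_at_ms=40))
```

The trade-off is visible in the model: every millisecond of hold-time is added directly to the detection term D whenever protection does not recover the link.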