From a9f53a74744bee5acdd78e0c3440159f96ce5b45 Mon Sep 17 00:00:00 2001 From: Olivier Bonaventure Date: Thu, 12 Sep 2013 09:04:19 +0200 Subject: [PATCH] reliability, resource sharing --- book-2nd/README.rst | 14 +- book-2nd/conf.py | 4 +- book-2nd/index.rst | 25 ++- book-2nd/preface.rst | 13 +- book-2nd/principles/dv.rst | 8 +- book-2nd/principles/linkstate.rst | 10 +- book-2nd/principles/network.rst | 316 ++++++++++++++++++++------- book-2nd/principles/reliability.rst | 319 +++++++++++++--------------- book-2nd/principles/sharing.rst | 195 ++++++++++++++--- 9 files changed, 596 insertions(+), 308 deletions(-) diff --git a/book-2nd/README.rst b/book-2nd/README.rst index 61696cc..d157c4f 100644 --- a/book-2nd/README.rst +++ b/book-2nd/README.rst @@ -1,6 +1,9 @@ Computer Networking : Principles, Protocols and Practice, 2nd Edition -======================================================== +===================================================================== + +This is the current draft for the second edition of the Computer Networking : Principles, Protocols and Practice open-source ebook. This draft will be updated on a regular basis until the end of the year. + (c) Olivier Bonaventure, Universite catholique de Louvain, Belgium http://perso.uclouvain.be/olivier.bonaventure @@ -8,13 +11,6 @@ Computer Networking : Principles, Protocols and Practice, 2nd Edition All the files in this subversion repository are licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License. -The book was compiled on MacOS/X Snow Leopard using sphinx. inkscape is required to convert some of the images in png format. Most of the images will be converted to the SVG format to improve the portability of the textbook. python is also required. When US and British spelling disagree, we opt for US spelling. - -The textbook was written on a Mac running Snow Leopard, but it should rebuild on other Unix based systems. Comments on issues in rebuilding the textbook are welcome. 
- - +The ebook compiles on MacOS/X Snow Leopard using sphinx. inkscape is required to convert some of the images in png format. Most of the images will be converted to the SVG format to improve the portability of the textbook. python is also required. When US and British spelling disagree, we opt for US spelling. - - Olivier Bonaventure - Louvain-la-Neuve, winter 2013 diff --git a/book-2nd/conf.py b/book-2nd/conf.py index 9c00073..dc3f0dc 100644 --- a/book-2nd/conf.py +++ b/book-2nd/conf.py @@ -22,7 +22,7 @@ # Add any Sphinx extension module names here, as strings. They can be extensions # coming with Sphinx (named 'sphinx.ext.*') or your custom ones.# -extensions = ['sphinx.ext.todo', 'sphinx.ext.pngmath', 'sphinxcontrib.mscgen'] +extensions = ['sphinx.ext.todo', 'sphinx.ext.pngmath', 'sphinxcontrib.mscgen','sphinx.ext.graphviz'] #'rst2pdf.pdfbuilder', 'rst2pdf.pdfmath', 'sphinx.ext.pngmath' ] #, # 'sphinx.ext.autodoc','rst2pdf.pdfbuilder'] @@ -72,7 +72,7 @@ # List of files that should not be automatically compiled by sphynx because they are included -exclude_patterns = [ '.#*', # emacs backups +exclude_patterns = [ '*.#*', # emacs backups # 'intro/organisation.rst', # 'intro/referencemodels.rst', # 'intro/services-protocols.rst', diff --git a/book-2nd/index.rst b/book-2nd/index.rst index db19484..9190c1c 100644 --- a/book-2nd/index.rst +++ b/book-2nd/index.rst @@ -6,25 +6,28 @@ Computer Networking : Principles, Protocols and Practice, 2nd edition ##################################################################### +.. only:: html + + .. figure:: cnp3.png :align: center :scale: 60 -.. only:: html - - This is the current HTML version of the `Computer Networking : Principles, Protocols and Practice `_. You can login with your yahoo, google or openid account to provide comments and suggestions to improve the text. 
You can also directly download the textbook in various formats from the links below : + This is the draft of the second edition of `Computer Networking : Principles, Protocols and Practice`. The ebook is being entirely rewritten until the end of 2013. It will be updated on a weekly basis. You can also directly download the current ebook draft in various formats from the links below : - :download:`distrib/cnp3b.epub` suitable for viewing on tablets like ipad - :download:`distrib/cnp3b.mobi` suitable for viewing on amazon kindle - :download:`distrib/cnp3b.pdf` suitable for viewing and printing anywhere - The development of this edition of the textbook is done on `github - `_ + The development of this edition of the textbook is carried out on `github + `_ + + You can help to improve this ebook by : - - posting comments, suggestions or bug reports on github - - proposing new exercices or sending patches on the `CNP3 `_ mailing list. + - posting comments, suggestions or bug reports on `github `_ + - proposing new exercices or sending patches on `github `_ - The source code of the entire textbook is written in `reStructuredText `_ and uses several `sphinx `_ features. You can browse it from https://github.com/obonaventure/cnp3 + The source code of the entire textbook is written in `reStructuredText `_ and uses several `sphinx `_ features. You can browse it from `github `_ Table of Contents @@ -44,6 +47,12 @@ Part 1: Principles principles/reliability principles/network + principles/dv + principles/linkstate + principles/sharing + principles/transport + + .. intro/organisation diff --git a/book-2nd/preface.rst b/book-2nd/preface.rst index 09bfe13..7a94e99 100644 --- a/book-2nd/preface.rst +++ b/book-2nd/preface.rst @@ -6,18 +6,17 @@ Preface ======= +This is the current draft of the second edition of the `Computer Networking : Principles, Protocols and Practice`. The document is updated every week. -This textbook came from a frustration of its main author. 
Many authors chose to write a textbook because there are no textbooks in their field or because they are not satisfied with the existing textbooks. This frustration has produced several excellent textbooks in the networking community. At a time when networking textbooks were mainly theoretical, `Douglas Comer`_ chose to write a textbook entirely focused on the TCP/IP protocol suite [Comer1988]_, a difficult choice at that time. He later extended his textbook by describing a complete TCP/IP implementation, adding practical considerations to the theoretical descriptions in [Comer1988]_. `Richard Stevens`_ approached the Internet like an explorer and explained the operation of protocols by looking at all the packets that were exchanged on the wire [Stevens1994]_. `Jim Kurose`_ and `Keith Ross`_ reinvented the networking textbooks by starting from the applications that the students use and later explained the Internet protocols by removing one layer after the other [KuroseRoss09]_. +.. This textbook came from a frustration of its main author. Many authors chose to write a textbook because there are no textbooks in their field or because they are not satisfied with the existing textbooks. This frustration has produced several excellent textbooks in the networking community. At a time when networking textbooks were mainly theoretical, `Douglas Comer`_ chose to write a textbook entirely focused on the TCP/IP protocol suite [Comer1988]_, a difficult choice at that time. He later extended his textbook by describing a complete TCP/IP implementation, adding practical considerations to the theoretical descriptions in [Comer1988]_. `Richard Stevens`_ approached the Internet like an explorer and explained the operation of protocols by looking at all the packets that were exchanged on the wire [Stevens1994]_. 
`Jim Kurose`_ and `Keith Ross`_ reinvented the networking textbooks by starting from the applications that the students use and later explained the Internet protocols by removing one layer after the other [KuroseRoss09]_. -.. comment:: I'm having some trouble with the second last sentence,perhaps: "the organisation of the information on these websites are badly suited to student learning",or "is not best suited to facilitate student learning". I will propose more +.. The frustrations that motivated this book are different. When I started to teach networking in the late 1990s, students were already Internet users, but their usage was limited. Students were still using reference textbooks and spent time in the library. Today's students are completely different. They are avid and experimented web users who find lots of information on the web. This is a positive attitude since they are probably more curious than their predecessors. Thanks to the information that is available on the Internet, they can check or obtain additional information about the topics explained by their teachers. This abundant information creates several challenges for a teacher. Until the end of the nineteenth century, a teacher was by definition more knowledgeable than his students and it was very difficult for the students to verify the lessons given by their teachers. Today, given the amount of information available at the fingertips of each student through the Internet, verifying a lesson or getting more information about a given topic is sometimes only a few clicks away. Websites such as `wikipedia `_ provide lots of information on various topics and students often consult them. Unfortunately, the organisation of the information on these websites is not well suited to allow students to learn from them. Furthermore, there are huge differences in the quality and depth of the information that is available for different topics. -The frustrations that motivated this book are different. 
When I started to teach networking in the late 1990s, students were already Internet users, but their usage was limited. Students were still using reference textbooks and spent time in the library. Today's students are completely different. They are avid and experimented web users who find lots of information on the web. This is a positive attitude since they are probably more curious than their predecessors. Thanks to the information that is available on the Internet, they can check or obtain additional information about the topics explained by their teachers. This abundant information creates several challenges for a teacher. Until the end of the nineteenth century, a teacher was by definition more knowledgeable than his students and it was very difficult for the students to verify the lessons given by their teachers. Today, given the amount of information available at the fingertips of each student through the Internet, verifying a lesson or getting more information about a given topic is sometimes only a few clicks away. Websites such as `wikipedia `_ provide lots of information on various topics and students often consult them. Unfortunately, the organisation of the information on these websites is not well suited to allow students to learn from them. Furthermore, there are huge differences in the quality and depth of the information that is available for different topics. +.. The second reason is that the computer networking community is a strong participant in the open-source movement. Today, there are high-quality and widely used open-source implementations for most networking protocols. This includes the TCP/IP implementations that are part of linux_, freebsd_ or the uIP_ stack running on 8bits controllers, but also servers such as bind_, unbound_, apache_ or sendmail_ and implementations of routing protocols such as xorp_ or quagga_ . 
Furthermore, the documents that define almost all of the Internet protocols have been developed within the Internet Engineering Task Force (IETF_) using an open process. The IETF publishes its protocol specifications in the publicly available RFC_ and new proposals are described in `Internet drafts`_. -The second reason is that the computer networking community is a strong participant in the open-source movement. Today, there are high-quality and widely used open-source implementations for most networking protocols. This includes the TCP/IP implementations that are part of linux_, freebsd_ or the uIP_ stack running on 8bits controllers, but also servers such as bind_, unbound_, apache_ or sendmail_ and implementations of routing protocols such as xorp_ or quagga_ . Furthermore, the documents that define almost all of the Internet protocols have been developed within the Internet Engineering Task Force (IETF_) using an open process. The IETF publishes its protocol specifications in the publicly available RFC_ and new proposals are described in `Internet drafts`_. +.. This open textbook aims to fill the gap between the open-source implementations and the open-source network specifications by providing a detailed but pedagogical description of the key principles that guide the operation of the Internet. The book is released under a `creative commons licence `_. Such an open-source license is motivated by two reasons. The first is that we hope that this will allow many students to use the book to learn computer networks. The second is that I hope that other teachers will reuse, adapt and improve it. Time will tell if it is possible to build a community of contributors to improve and develop the book further. As a starting point, the first release contains all the material for a one-semester first upper undergraduate or a graduate networking course. 
-This open textbook aims to fill the gap between the open-source implementations and the open-source network specifications by providing a detailed but pedagogical description of the key principles that guide the operation of the Internet. The book is released under a `creative commons licence `_. Such an open-source license is motivated by two reasons. The first is that we hope that this will allow many students to use the book to learn computer networks. The second is that I hope that other teachers will reuse, adapt and improve it. Time will tell if it is possible to build a community of contributors to improve and develop the book further. As a starting point, the first release contains all the material for a one-semester first upper undergraduate or a graduate networking course. - -As of this writing, most of the text has been written by `Olivier Bonaventure`_. `Laurent Vanbever`_, `Virginie Van den Schriek`_, `Damien Saucez`_ and `Mickael Hoerdt`_ have contributed to exercises. Pierre Reinbold designed the icons used to represent switches and Nipaul Long has redrawn many figures in the SVG format. Stephane Bortzmeyer sent many suggestions and corrections to the text. Additional information about the textbook is available at http://inl.info.ucl.ac.be/CNP3 +The first edition of this ebook has been written by `Olivier Bonaventure`_. `Laurent Vanbever`_, `Virginie Van den Schriek`_, `Damien Saucez`_ and `Mickael Hoerdt`_ have contributed to exercises. Pierre Reinbold designed the icons used to represent switches and Nipaul Long has redrawn many figures in the SVG format. Stephane Bortzmeyer sent many suggestions and corrections to the text. Additional information about the textbook is available at http://inl.info.ucl.ac.be/CNP3 .. The overall objective of the book is to explain the principles and the protocols used in computer networks such as the Internet and also provide the students with some intuition about the important practical problems that often arise. 
The textbook was developed for the .. The course follows a hybrid problem-based learning (:term:`PBL`) approach. During each week, the students follow a 2 hours theoretical course that describes the principles and some of the protocols. They also receive a set of small problems that they need to solve in groups. These problems are designed to reinforce the student's knowledge but also to explore the practical problems that arise in real networks by allowing the students to perform experiments by writing prototype networking code. diff --git a/book-2nd/principles/dv.rst b/book-2nd/principles/dv.rst index 068d7f3..87b2c72 100644 --- a/book-2nd/principles/dv.rst +++ b/book-2nd/principles/dv.rst @@ -61,7 +61,7 @@ The first condition ensures that the router discovers the shortest path towards To understand the operation of a distance vector protocol, let us consider the network of five routers shown below. -.. figure:: svg/dv-1.png +.. figure:: ../../book/network/svg/dv-1.png :align: center :scale: 100 @@ -78,7 +78,7 @@ Assume that `A` is the first to send its distance vector `[A=0]`. At this point, all routers can reach all other routers in the network thanks to the routing tables shown in the figure below. -.. figure:: svg/dv-full.png +.. figure:: ../../book/network/svg/dv-full.png :align: center :scale: 100 @@ -97,7 +97,7 @@ At this point, all routers have a routing table allowing them to reach all anoth .. _fig-afterfailure: -.. figure:: svg/dv-failure-2.png +.. figure:: ../../book/network/svg/dv-failure-2.png :align: center :scale: 100 @@ -149,7 +149,7 @@ This technique is called `split-horizon`. With this technique, the count to infi Unfortunately, split-horizon, is not sufficient to avoid all count to infinity problems with distance vector routing. Consider the failure of link `A-B` in the network of four routers below. -.. figure:: svg/dv-infinity.png +.. 
figure:: ../../book/network/svg/dv-infinity.png :align: center :scale: 100 diff --git a/book-2nd/principles/linkstate.rst b/book-2nd/principles/linkstate.rst index 3bffd51..ab6f577 100644 --- a/book-2nd/principles/linkstate.rst +++ b/book-2nd/principles/linkstate.rst @@ -24,7 +24,7 @@ Other variants are possible. Some networks use optimisation algorithms to find t When a link-state router boots, it first needs to discover to which routers it is directly connected. For this, each router sends a HELLO message every `N` seconds on all of its interfaces. This message contains the router's address. Each router has a unique address. As its neighbouring routers also send HELLO messages, the router automatically discovers to which neighbours it is connected. These HELLO messages are only sent to neighbours who are directly connected to a router, and a router never forwards the HELLO messages that they receive. HELLO messages are also used to detect link and router failures. A link is considered to have failed if no HELLO message has been received from the neighbouring router for a period of :math:`k \times N` seconds. -.. figure:: svg/ls-hello.png +.. figure:: ../../book/network/svg/ls-hello.png :align: center :scale: 100 @@ -77,7 +77,7 @@ In this pseudo-code, `LSDB(r)` returns the most recent `LSP` originating from ro Flooding is illustrated in the figure below. By exchanging HELLO messages, each router learns its direct neighbours. For example, router `E` learns that it is directly connected to routers `D`, `B` and `C`. Its first LSP has sequence number `0` and contains the directed links `E->D`, `E->B` and `E->C`. Router `E` sends its LSP on all its links and routers `D`, `B` and `C` insert the LSP in their LSDB and forward it over their other links. -.. figure:: svg/ls-flooding.png +.. 
figure:: ../../book/network/svg/ls-flooding.png :align: center :scale: 100 @@ -88,7 +88,7 @@ Flooding allows LSPs to be distributed to all routers inside the network without To ensure that all routers receive all LSPs, even when there are transmissions errors, link state routing protocols use `reliable flooding`. With `reliable flooding`, routers use acknowledgements and if necessary retransmissions to ensure that all link state packets are successfully transferred to all neighbouring routers. Thanks to reliable flooding, all routers store in their LSDB the most recent LSP sent by each router in the network. By combining the received LSPs with its own LSP, each router can compute the entire network topology. -.. figure:: svg/ls-lsdb.png +.. figure:: ../../book/network/svg/ls-lsdb.png :align: center :scale: 100 @@ -104,7 +104,7 @@ To ensure that all routers receive all LSPs, even when there are transmissions e When a link fails, the two routers attached to the link detect the failure by the lack of HELLO messages received in the last :math:`k \times N` seconds. Once a router has detected a local link failure, it generates and floods a new LSP that no longer contains the failed link and the new LSP replaces the previous LSP in the network. As the two routers attached to a link do not detect this failure exactly at the same time, some links may be announced in only one direction. This is illustrated in the figure below. Router `E` has detected the failures of link `E-B` and flooded a new LSP, but router `B` has not yet detected the failure. -.. figure:: svg/ls-twoway.png +.. figure:: ../../book/network/svg/ls-twoway.png :align: center :scale: 100 @@ -117,7 +117,7 @@ When a router has failed, its LSP must be removed from the LSDB of all routers [ To compute its routing table, each router computes the spanning tree rooted at itself by using Dijkstra's shortest path algorithm [Dijkstra1959]_. 
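The two steps described here, computing the spanning tree with Dijkstra's algorithm and then deriving a nexthop for each destination, can be sketched in a few lines of Python. This is only an illustration: the topology, the link costs and the function names below are invented for the example and do not come from the book.

```python
import heapq

# Toy topology (invented for illustration): symmetric adjacency map with link costs.
graph = {
    'R1': {'R2': 1, 'R3': 2},
    'R2': {'R1': 1, 'R3': 2, 'R4': 3},
    'R3': {'R1': 2, 'R2': 2, 'R4': 1},
    'R4': {'R2': 3, 'R3': 1},
}

def dijkstra(graph, root):
    """Return (distance, predecessor) maps of the shortest-path tree rooted at root."""
    dist = {root: 0}
    pred = {}
    heap = [(0, root)]
    done = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        for neigh, cost in graph[node].items():
            if d + cost < dist.get(neigh, float('inf')):
                dist[neigh] = d + cost
                pred[neigh] = node
                heapq.heappush(heap, (d + cost, neigh))
    return dist, pred

def routing_table(graph, root):
    """Derive the nexthop towards each destination by walking the spanning tree."""
    dist, pred = dijkstra(graph, root)
    table = {}
    for dest in dist:
        if dest == root:
            continue
        hop = dest
        while pred[hop] != root:   # climb towards the root to find the first hop
            hop = pred[hop]
        table[dest] = hop
    return table
```

With the invented costs above, `routing_table(graph, 'R1')` maps `R4` to nexthop `R3`, since the path via `R3` (cost 3) is shorter than the direct link through `R2` (cost 4).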
The routing table can be derived automatically from the spanning tree as shown in the figure below. -.. figure:: svg/ls-computation.png +.. figure:: ../../book/network/svg/ls-computation.png :align: center :scale: 100 diff --git a/book-2nd/principles/network.rst b/book-2nd/principles/network.rst index 7068d33..4c528a6 100644 --- a/book-2nd/principles/network.rst +++ b/book-2nd/principles/network.rst @@ -38,6 +38,25 @@ Even if we only consider the point-to-point datalink layers, there is an important As a first step, let us assume that we only need to exchange a small amount of data. In this case, there is no issue with the maximum length of the frames. However, there are other more interesting problems that we need to tackle. To understand these problems, let us consider the network represented in the figure below. + +.. graphviz:: + + graph foo { + A [shape=box]; + B [shape=box]; + C [shape=box]; + A--R1; + R1--R3; + R3--R5; + R1--R2; + R2--R4; + R4--R5; + R3--R4; + R2--C; + R4--C; + R5--B; + } + .. figure:: todo TODO figure with 5 routers and hosts @@ -86,8 +105,12 @@ The computation of the forwarding tables of all the routers inside a network is In a network, a path can be defined as the list of all intermediate routers for a given source/destination pair. For a given source/destination pair, the path can be derived by first consulting the forwarding table of the router attached to the source to determine the next router on the path towards the chosen destination. Then, the forwarding table of this router is queried for the same destination... The queries continue until the destination is reached. In a network that has valid forwarding tables, all the paths between all source/destination pairs contain a finite number of intermediate routers. However, if forwarding tables have not been correctly computed, two types of invalid path can occur. +.. index:: black hole + A path may lead to a black hole.
In a network, a black hole is a router that receives packets for at least one given source/destination pair but does not have any entry inside its forwarding table for this destination. Since it does not know how to reach the destination, the router cannot forward the received packets and must discard them. Any centralized or distributed algorithm that computes forwarding tables must ensure that there are no black holes inside the network. +.. index:: forwarding loop + A second type of problem may exist in networks using the datagram organisation. Consider a path that contains a cycle. For example, router `R1` sends all packets towards destination `D` via router `R2`, router `R2` forwards these packets to router `R3` and finally router `R3`'s forwarding table uses router `R1` as its nexthop to reach destination `D`. In this case, if a packet destined to `D` is received by router `R1`, it will loop on the `R1 -> R2 -> R3 -> R1` cycle and will never reach its final destination. As in the black hole case, the destination is not reachable from all sources in the network. However, in practice the loop problem is worse than the black hole problem because when a packet is caught in a forwarding loop, it unnecessarily consumes bandwidth. In the black hole case, the problematic packet is quickly discarded. We will see later that network layer protocols include techniques to minimize the impact of such forwarding loops. Any solution that is used to compute the forwarding tables of a network must ensure that all destinations are reachable from any source. This implies that it must guarantee the absence of black holes and forwarding loops. @@ -104,143 +127,290 @@ Besides the `data plane`, a network is also characterized by its `control plane` In most networks, manual forwarding tables are not a solution for two reasons. First, most networks are too large to enable a manual computation of the forwarding tables.
Second, with manually computed forwarding tables, it is very difficult to deal with link and router failures. Networks need to operate 24 hours a day, 365 days per year. During the lifetime of a network, many events can affect the routers and links that it contains. Link failures are regular events in deployed networks. Links can fail for various reasons, including electromagnetic interference, fiber cuts, hardware or software problems on the terminating routers, ... Some links also need to be added to the network or removed because their utilisation is too low or their cost is too high. Similarly, routers also fail. There are two types of failures that affect routers. A router may stop forwarding packets due to a hardware or software problem (e.g. due to a crash of its operating system). A router may also need to be halted from time to time (e.g. to upgrade its operating system to fix some bugs). These planned and unplanned events affect the set of links and routers that can be used to forward packets in the network. Still, most network users expect that their network will continue to correctly forward packets despite all these events. With manually computed forwarding tables, it is usually impossible to precompute the forwarding tables while taking into account all possible failure scenarios. + An alternative to manually computed forwarding tables is to use a network management platform that tracks the network status and can push new forwarding tables on the routers when it detects any modification to the network topology. This solution gives some flexibility to the network managers in computing the paths inside their network. However, this solution only works if the network management platform is always capable of reaching all routers even when the network topology changes. This may require a dedicated network that allows the management platform to push information on the forwarding tables.
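The two pathologies introduced earlier, black holes and forwarding loops, can both be detected by simply walking the forwarding tables from a source towards a destination, as in the path-derivation procedure described above. The sketch below is a minimal illustration with hypothetical forwarding tables; the function name and the table layout are assumptions of the example.

```python
def check_path(tables, source, dest):
    """Walk the forwarding tables from `source` towards `dest` and classify
    the resulting path as 'ok', 'black hole' or 'loop'."""
    current, visited = source, set()
    while current != dest:
        if current in visited:
            return 'loop'          # we came back to an already visited router
        visited.add(current)
        nexthop = tables[current].get(dest)
        if nexthop is None:
            return 'black hole'    # no entry for this destination
        current = nexthop
    return 'ok'

# Hypothetical forwarding tables for three routers and two destinations.
# Destination `D` is caught in the cycle R1 -> R2 -> R3 -> R1 described above,
# and R3 has no entry at all for destination `B`.
tables = {
    'R1': {'B': 'R2', 'D': 'R2'},
    'R2': {'B': 'B', 'D': 'R3'},
    'R3': {'D': 'R1'},
}
```

A validation tool built on this idea would run `check_path` for every source/destination pair and flag any result other than `'ok'`.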
-Nowadays, most deployed networks rely on distributed algorithms, called routing protocols, to compute the forwarding tables that are installed on the routers. These distributed algorithms are part of the `control plane`. Their are usually implemented in software and are executed on the router's CPU. We will discuss later the two main families of routing protocols : distance vector routing and link state routing. Both are capable of discovering autonomously the network and react dynamically to topology changes. +.. todo:: cite references + +.. Openflow is an example of this kind of solution. + +Another interesting point that is worth discussing is when the forwarding tables are computed. A widely used solution is to compute the entries of the forwarding tables for all destinations on all routers. This ensures that each router has a valid route towards each destination. These entries can be updated when an event occurs and the network topology changes. A drawback of this approach is that the forwarding tables can become large in large networks since each router must maintain one entry for each destination at all times inside its forwarding table. + +Some networks use the arrival of packets as the trigger to compute the corresponding entries in the forwarding tables. Several technologies have been built upon this principle. When a packet arrives, the router consults its forwarding table to find a path towards the destination. If the destination is present in the forwarding table, the packet is forwarded. Otherwise, the router needs to find a way to forward the packet and update its forwarding table. + +Several techniques to update the forwarding tables upon the arrival of a packet have been used in deployed networks. In this section, we briefly present the principles that underlie three of these techniques. + +The first technique assumes that the underlying network topology is a tree. A tree is the simplest network to be considered when forwarding packets.
The main advantage of using a tree is that there is only one path between any pair of nodes inside the network. Since a tree does not contain any cycle, it is impossible to have forwarding loops in a tree-shaped network. + +.. index:: port-address table -The datagram organisation has been very popular in computer networks. Datagram based network layers include IPv4 and IPv6 in the global Internet, CLNP defined by the ISO, IPX defined by Novell or XNS defined by Xerox [Perlman2000]_. +In a tree-shaped network, it is relatively simple for each node to automatically compute its forwarding table by inspecting the packets that it receives. For this, each node uses the source and destination addresses present inside each packet. The source address allows each node to learn the location of the different sources inside the network. Each source has a unique address. When a node receives a packet over a given interface, it learns that the source (address) of this packet is reachable via this interface. The node maintains a data structure that maps each known source address to an incoming interface. This data structure is often called the port-address table since it indicates the interface (or port) to reach a given address. Learning the location of the sources is not sufficient; nodes also need to forward packets towards their destination. When a node receives a packet whose destination address is already present inside its port-address table, it simply forwards the packet on the interface listed in the port-address table. In this case, the packet will follow the port-address table entries in the downstream nodes and will reach the destination. If the destination address is not included in the port-address table, the node simply forwards the packet on all its interfaces, except the interface from which the packet was received. Forwarding a packet over all interfaces is usually called `broadcasting` in the terminology of computer networks.
Sending the packet over all interfaces except one is a costly operation since the packet will be sent over links that do not reach the destination. Given the tree shape of the network, the packet will explore all downstream branches of the tree and will thus finally reach its destination. In practice, the `broadcasting` operation does not occur too often and its cost is limited. + +To understand the operation of the port-address table, let us consider the example network shown in the figure below. This network contains three hosts : `A`, `B` and `C` and five nodes, `R1` to `R5`. When the network boots, all the forwarding tables of the nodes are empty. + +.. graphviz:: + + graph foo { + A [shape=box]; + B [shape=box]; + C [shape=box]; + A--R1; + R1--R3; + R3--R5; + R1--R2; + R3--R4; + R2--C; + R5--B; + } + + +Host `A` sends a packet towards `B`. When receiving this packet, `R1` learns that `A` is reachable via its `North` interface. Since it does not have an entry for destination `B` in its port-address table, it forwards the packet to both `R2` and `R3`. When `R2` receives the packet, it updates its own forwarding table and forwards the packet to `C`. Since `C` is not the intended recipient, it simply discards the received packet. Node `R3` also received the packet. It learns that `A` is reachable via its `North` interface and broadcasts the packet to `R4` and `R5`. `R5` also updates its forwarding table and finally forwards it to destination `B`. Let us now consider what happens when `B` sends a reply to `A`. `R5` first learns that `B` is attached to its `South` port. It then consults its port-address table and finds that `A` is reachable via its `North` interface. The packet is then forwarded hop-by-hop to `A` without any broadcasting. If `C` sends a packet to `B`, this packet will reach `R1`, which contains a valid forwarding entry for `B` in its forwarding table.
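This walkthrough can be simulated in a few lines of Python. The sketch below follows the figure (hosts `A`, `B`, `C` and nodes `R1` to `R5`), but it is only an illustration: ports are identified by the neighbour reachable through them rather than by `North`/`South` names, and the function names are invented for the example.

```python
# The tree topology of the figure; each "port" of a node is identified here
# by the neighbour reachable through it.
links = {
    'A': ['R1'], 'B': ['R5'], 'C': ['R2'],
    'R1': ['A', 'R2', 'R3'],
    'R2': ['R1', 'C'],
    'R3': ['R1', 'R4', 'R5'],
    'R4': ['R3'],
    'R5': ['R3', 'B'],
}
hosts = {'A', 'B', 'C'}
tables = {n: {} for n in links if n not in hosts}   # port-address tables, all empty

def receive(node, src, dst, from_port):
    if node in hosts:
        return                         # hosts consume (or discard) the packet
    tables[node][src] = from_port      # learn on which port `src` lies
    if dst in tables[node]:
        out_ports = [tables[node][dst]]                          # known destination
    else:
        out_ports = [p for p in links[node] if p != from_port]   # broadcast
    for port in out_ports:
        receive(port, src, dst, node)

def send(host, dst):
    receive(links[host][0], host, dst, host)

send('A', 'B')    # floods the whole tree; every node learns where `A` lies
send('B', 'A')    # follows the learned entries hop-by-hop, no broadcasting
```

After the first packet, every node's port-address table contains an entry for `A`; the reply from `B` then teaches `R5`, `R3` and `R1` (but not `R2`, which the reply never crosses) where `B` lies.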
+ +By inspecting the source and destination addresses of packets, network nodes can automatically derive their forwarding tables. As we will discuss later, this technique is used in Ethernet networks. Despite being widely used, it has two important drawbacks. First, packets sent to unknown destinations are broadcasted in the network even if the destination is not attached to the network. Consider the transmission of ten packets destined to `Z` in the network above. When a node receives a packet towards this destination, it can only broadcast the packet. Since `Z` is not attached to the network, no node will ever receive a packet whose source is `Z` to update its forwarding table. The second and more important problem is that few networks have a tree-shaped topology. It is interesting to analyze what happens when a port-address table is used in a network that contains a cycle. Consider the simple network shown below. + +.. graphviz:: + + graph foo { + A [shape=box]; + B [shape=box]; + A--R1 ; + R1--R2 ; + R1--R3; + R2--R3; + R3--B; + } + +Assume that the network has started and all port-address and forwarding tables are empty. Host `A` sends a packet towards `B`. Upon reception of this packet, `R1` updates its port-address table. Since `B` is not present in the port-address table, the packet is broadcasted. Both `R2` and `R3` receive a copy of the packet sent by `A`. They both update their port-address table. Unfortunately, they also both broadcast the received packet. `B` receives a first copy of the packet, but `R3` and `R2` receive it again. `R3` will then broadcast this copy of the packet to `B` and `R1` while `R2` will broadcast its copy to `R1`. Although `B` has already received two copies of the packet, copies are still inside the network and will continue to loop. Due to the presence of the cycle, a single packet towards an unknown destination generates copies that loop and will eventually saturate the network bandwidth.
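The effect of the cycle can be reproduced with a small simulation (a sketch; as in the text, every router floods the packets sent towards the unknown destination `B`, and the hosts absorb the packets that they receive):

```python
from collections import deque

# Links of the cyclic network of the figure.
links = {"A": ["R1"], "R1": ["A", "R2", "R3"],
         "R2": ["R1", "R3"], "R3": ["R1", "R2", "B"], "B": ["R3"]}
hosts = {"A", "B"}

in_flight = deque([("A", "R1")])        # (previous hop, current node)
copies_at_b = 0
for _ in range(6):                      # six forwarding rounds
    nxt = deque()
    for prev, node in in_flight:
        if node in hosts:               # hosts absorb packets
            copies_at_b += (node == "B")
            continue
        for neigh in links[node]:       # destination B is unknown: flood
            if neigh != prev:
                nxt.append((node, neigh))
    in_flight = nxt

print(copies_at_b)                      # 3: B keeps receiving copies
print(len(in_flight))                   # 3: copies are still looping
```

A single packet sent by `A` thus produces an endless stream of duplicates: after six rounds, `B` has already received three copies and three more are still circulating.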
Network operators who are using port-address tables to automatically compute the forwarding tables also use distributed algorithms to ensure that the network topology is always a tree. .. - .. figure:: svg/simple-lan.png - :align: center - :scale: 80 - - A local area network + // imagepath="../svg/icons/:../../svg/icons/"; + // r1 [label="R1" labelloc=bottom shapefile="router.png" ]; + // r2 [label="R2" labelloc=bottom shape=box imagescale=true image="router.png" ]; + // r3 [label="R3" labelloc=bottom shape=none image="router.png" ]; -.. An important difference between the point-to-point datalink layers and the datalink layers used in LANs is that in a LAN, each communicating device is identified by a unique `datalink layer address`. This address is usually embedded in the hardware of the device and different types of LANs use different types of datalink layer addresses. A communicating device attached to a LAN can send a datalink frame to any other communicating device that is attached to the same LAN. Most LANs also support special broadcast and multicast datalink layer addresses. A frame sent to the broadcast address of the LAN is delivered to all communicating devices that are attached to the LAN. The multicast addresses are used to identify groups of communicating devices. When a frame is sent towards a multicast datalink layer address, it is delivered by the LAN to all communicating devices that belong to the corresponding group. +.. http://support.novell.com/techcenter/articles/ana19910501.html reference source routing token ring +.. index:: source routing -.. index:: NBMA, Non-Broadcast Multi-Access Networks +Another technique can be used to automatically compute forwarding tables. It has been used in interconnected Token Ring networks and is used in some wireless networks. Intuitively, `Source routing` enables a destination to automatically discover the paths from a given source towards itself. 
This technique requires the nodes to update some information inside the packets that they forward. For simplicity, let us assume that the `data plane` supports two types of packets : -.. The third type of datalink layers are used in Non-Broadcast Multi-Access (NBMA) networks. These networks are used to interconnect devices like a LAN. All devices attached to an NBMA network are identified by a unique datalink layer address. However, and this is the main difference between an NBMA network and a traditional LAN, the NBMA service only supports unicast. The datalink layer service provided by an NBMA network supports neither broadcast nor multicast. + - the `data packets` + - the `control packets` +`Data packets` are used to exchange data while `control packets` are used to discover the paths between endhosts. With `Source routing`, network nodes can be kept as simple as possible and all the complexity is placed on the endhosts. This is in contrast with the previous technique where the nodes had to maintain a port-address and a forwarding table while the hosts simply sent and received packets. Each node is configured with one unique address and there is one identifier per outgoing link. For simplicity and to avoid cluttering the figures with those identifiers, we will assume that each node uses `north`, `west`, `south`, ... as link identifiers. In practice, a node would associate one integer to each outgoing link. -The network layer itself relies on the following principles : +.. graphviz:: - - #. Each network layer entity is identified by a `network layer address`. This address is independent of the datalink layer addresses that it may use. - #. The service provided by the network layer does not depend on the service or the internal organisation of the underlying datalink layers. - #. The network layer is conceptually divided into two planes : the `data plane` and the `control plane`.
The `data plane` contains the protocols and mechanisms that allow hosts and routers to exchange packets carrying user data. The `control plane` contains the protocols and mechanisms that enable routers to efficiently learn how to forward packets towards their final destination. + graph foo { + A [shape=box]; + B [shape=box]; + A--R1 ; + R1--R2 []; + R1--R3 []; + R2--R3 []; + R3--R4 []; + R4--B ; + } +In the network above, node `R2` is attached to two outgoing links, whose identifiers are respectively `R2.ne` and `R2.se`. `R2` is connected to both `R1` and `R3`. `R2` can easily determine that it is connected to these two nodes by exchanging packets with them or observing the packets that it receives over each interface. Assume for example that when a host or node starts, it sends a special control packet over each of its interfaces to advertise its own address to its neighbors. When a host or node receives such a packet, it automatically replies with its own address. This exchange can also be used to verify whether a neighbor, either node or host, is still alive. With `source routing`, the data plane packets include a list of identifiers. This list is called a `source route` and indicates the path to be followed by the packet as a sequence of link identifiers. When a node receives such a `data plane` packet, it first checks whether the packet's destination is a direct neighbor. In this case, the packet is forwarded to the destination. Otherwise, the node extracts the next address from the list and forwards the packet to this neighbor. This allows the source to specify the explicit path to be followed for each packet. For example, in the figure above there are two possible paths between `A` and `B`. To use the path via `R2`, `A` would send a packet that contains `R1,R2,R3,R4` as source route. To avoid going via `R2`, `A` would place `R1,R3,R4` as the source route in its transmitted packet.
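Under these assumptions, the forwarding step performed by each node can be sketched as follows (the neighbor sets encode the topology of the figure; `forward` is a hypothetical helper, not part of any real protocol):

```python
# Topology of the figure: A-R1, R1-R2, R1-R3, R2-R3, R3-R4, R4-B.
neighbors = {"A": {"R1"}, "R1": {"A", "R2", "R3"}, "R2": {"R1", "R3"},
             "R3": {"R1", "R2", "R4"}, "R4": {"R3", "B"}, "B": {"R4"}}

def forward(node, packet):
    """Return the next hop of a source-routed packet at a given node."""
    if packet["dst"] in neighbors[node]:
        return packet["dst"]          # the destination is a direct neighbor
    return packet["route"].pop(0)     # otherwise, consume the next identifier

# A sends a packet to B along the path via R2.
packet = {"dst": "B", "route": ["R1", "R2", "R3", "R4"]}
node = forward("A", packet)
while node != "B":
    node = forward(node, packet)
print(node)   # B
```

Each hop simply pops the next identifier from the source route until some node sees that the destination is one of its direct neighbors.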
If `A` knows the complete network topology and all link identifiers, it can easily compute the source route towards each destination. If needed, it could even use different paths, e.g. for redundancy, to reach a given destination. However, in a real network hosts do not usually have a map of the entire network topology. -The independence of the network layer from the underlying datalink layer is a key principle of the network layer. It ensures that the network layer can be used to allow hosts attached to different types of datalink layers to exchange packets through intermediate routers. Furthermore, this allows the datalink layers and the network layer to evolve independently from each other. This enables the network layer to be easily adapted to a new datalink layer every time a new datalink layer is invented. +.. index:: record route -There are two types of service that can be provided by the network layer : +In networks that rely on source routing, hosts use control packets to automatically discover the best path(s). In addition to the source and destination addresses, `control packets` contain a list that records the intermediate nodes. This list is often called the `record route` because it allows the route followed by a given packet to be recorded. When a node receives a `control packet`, it first checks whether its address is included in the record route. If so, the control packet is silently discarded. Otherwise, it adds its own address to the `record route` and forwards the packet on all its interfaces, except the interface over which the packet has been received. Thanks to this, the `control packet` will be able to explore all paths between a source and a given destination. + + +For example, consider again the network topology above. `A` sends a control packet towards `B`. The initial `record route` is empty. When `R1` receives the packet, it adds its own address to the `record route` and forwards a copy to `R2` and another to `R3`.
`R2` receives the packet, adds itself to the `record route` and forwards it to `R3`. `R3` receives two copies of the packet. The first contains the `[R1]` `record route` and the second `[R1,R2]`. In the end, `B` will receive two control packets containing `[R1,R2,R3,R4]` and `[R1,R3,R4]` as `record routes`. +`B` can keep these two paths or select the best one and discard the second. A popular heuristic is to select the `record route` of the first received packet as being the best one since this likely corresponds to the shortest delay path. + +With the received `record route`, `B` can send a `data packet` to `A`. For this, it simply reverses the chosen `record route`. However, we still need to communicate the chosen path to `A`. This can be done by putting the `record route` inside a control packet which is sent back to `A` over the reverse path. An alternative is to simply send a `data packet` back to `A`. This packet will travel back to `A`. To allow `A` to inspect the entire path followed by the `data packet`, its `source route` must contain all intermediate routers when it is received by `A`. This can be achieved by encoding the `source route` using a data structure that contains an index and the ordered list of node addresses. The index always points to the next address in the `source route`. It is initialized at `0` when a packet is created and incremented by each intermediate node. - - an `unreliable connectionless` service - - a `connection-oriented`, reliable or unreliable, service -Connection-oriented services have been popular with technologies such as :term:`X.25` and :term:`ATM` or :term:`frame-relay`, but nowadays most networks use an `unreliable connectionless` service. This is our main focus in this chapter.
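The exploration performed by the control packets can be sketched as follows (a sketch; the duplicate check on the `record route` is what eventually stops the flooding):

```python
from collections import deque

# Topology of the figure: A-R1, R1-R2, R1-R3, R2-R3, R3-R4, R4-B.
neighbors = {"A": ["R1"], "R1": ["A", "R2", "R3"], "R2": ["R1", "R3"],
             "R3": ["R1", "R2", "R4"], "R4": ["R3", "B"], "B": ["R4"]}

def discover(src, dst):
    """Flood a control packet from src and collect the record routes at dst."""
    routes = []
    queue = deque([(None, src, [])])     # (previous hop, node, record route)
    while queue:
        prev, node, record = queue.popleft()
        if node == dst:
            routes.append(record)        # a control packet reached dst
            continue
        if node in record:
            continue                     # already visited: silently discard
        if node != src:
            record = record + [node]     # add our address to the record route
        for neigh in neighbors[node]:
            if neigh != prev:            # flood, except incoming interface
                queue.append((node, neigh, record))
    return routes

print(discover("A", "B"))   # [['R1', 'R3', 'R4'], ['R1', 'R2', 'R3', 'R4']]
```

In this sketch the shortest record route is collected first, which matches the heuristic of keeping the first received control packet.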
+Flat or hierarchical addresses +------------------------------ -Organisation of the network layer -================================= +The last, but important, point to discuss about the `data plane` of the networks that rely on the datagram mode is their addressing scheme. In the examples above, we have used letters to represent the addresses of the hosts and network nodes. In practice, all addresses are encoded as a bit string. Most network technologies use a fixed size bit string to represent source and destination address. These addresses can be organized in two different ways. -.. index:: datagram, virtual circuit +The first organisation, which is the one that we have implicitly assumed until now, is the `flat addressing` scheme. Under this scheme, each host and network node has a unique address. The unicity of the addresses is important for the operation of the network. If two hosts have the same address, it can become difficult for the network to forward packets towards this destination. `Flat addresses` are typically used in situations where network nodes and hosts need to be able to communicate immediately with unique addresses. These `flat addresses` are often embedded inside the hardware of network interface cards. The network card manufacturer creates one unique address for each interface and this address is stored in the read-only memory of the interface. An advantage of this addressing scheme is that it easily supports ad-hoc and mobile networks. When a host moves, it can attach to another network and remain confident that its address is unique and enables it to communicate inside the new network. -There are two possible internal organisations of the network layer : +With `flat addressing` the lookup operation in the forwarding table can be implemented as an exact match. The `forwarding table` contains the (sorted) list of all known destination addresses. 
When a packet arrives, a network node only needs to check whether this address is part of the forwarding table or not. In software, this is an `O(log(n))` operation if the list is sorted. In hardware, Content Addressable Memories can perform this lookup operation efficiently, but their size is usually limited. - - datagram - - virtual circuits +.. https://www.pagiamtzis.com/pubs/pagiamtzis-jssc2006.pdf -The internal organisation of the network is orthogonal to the service that it provides, but most of the time a datagram organisation is used to provide a connectionless service while a virtual circuit organisation is used in networks that provide a connection-oriented service. +A drawback of the `flat addressing scheme` is that the forwarding tables grow linearly with the number of hosts and nodes in the network. With this addressing scheme, each forwarding table must contain an entry that points to every address reachable inside the network. Since large networks can contain tens of millions of hosts or more, this is a major problem on network nodes that need to be able to quickly forward packets. As an illustration, it is interesting to consider the case of an interface running at 10 Gbps. Such interfaces are found on high-end servers and in various network nodes today. Assuming an average packet size of 1000 bits, such an interface must forward ten million packets every second. This implies that a network node that receives packets over such a link must forward one 1000 bits packet every 100 nanoseconds. This is the same order of magnitude as the memory access times of old DRAMs. -Datagram organisation --------------------- +A widely used alternative to the `flat addressing scheme` is the `hierarchical addressing scheme`. This addressing scheme builds upon the fact that networks usually contain many more hosts than network nodes.
In this case, a first solution to reduce the size of the forwarding tables is to create a hierarchy of addresses. This is the solution chosen by the post office where addresses contain a country, sometimes a state or province, a city, a street and finally a street number. When an envelope is forwarded by a post office in a remote country, it only looks at the destination country, while a post office in the same province will look at the city information. Only the post office responsible for a given city will look at the street name and only the postman will use the street number. `Hierarchical addresses` provide a similar solution for network addresses. For example, the address of an Internet host attached to a campus network could contain in the high-order bits an identification of the Internet Service Provider (ISP) that serves the campus network. Then, a subsequent block of bits identifies the campus network which is one of the customers of the ISP. Finally, the low order bits of the address identify the host in the campus network. +This hierarchical allocation of addresses can be applied in any type of network. In practice, the allocation of the addresses must follow the network topology. Usually, this is achieved by dividing the addressing space into consecutive blocks and then allocating these blocks to different parts of the network. In a small network, the simplest solution is to allocate one block of addresses to each network node and to assign the addresses of the hosts from the block of the node to which they are attached. + +.. graphviz:: + + graph foo { + A [shape=box]; + B [shape=box]; + A--R1 ; + R1--R2 []; + R1--R3 []; + R2--R3 []; + R3--R4 []; + R4--B ; + } + +In the above figure, assume that the network uses 16-bit addresses and that the prefix `01001010` has been assigned to the entire network. Since the network contains four routers, the network operator could assign one block of sixty-four addresses to each router.
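With these assumptions (16-bit addresses, the 8-bit prefix `01001010` for the whole network and one block of sixty-four addresses per router), the allocation can be sketched as:

```python
PREFIX = "01001010"          # 8-bit prefix assigned to the entire network

# Two bits identify the router and the remaining six bits identify a host,
# i.e. one block of 64 addresses per router.
blocks = {}
for i, router in enumerate(["R1", "R2", "R3", "R4"]):
    blocks[router] = PREFIX + format(i, "02b") + "000000"

print(blocks["R1"])   # 0100101000000000
print(blocks["R4"])   # 0100101011000000
```

Each router then assigns the addresses of its attached hosts from its own block.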
`R1` would use address `0100101000000000` while `A` could use address `0100101000000001`. `R2` could be assigned all addresses from `0100101001000000` to `0100101001111111`. `R4` could then use `0100101011000000` and assign `0100101011000001` to `B`. Other allocation schemes are possible. For example, `R3` could be allocated a larger block of addresses than `R2` and `R4` could use a sub-block from `R3`'s address block. + +The main advantage of hierarchical addresses is that it is possible to significantly reduce the size of the forwarding tables. In many networks, the number of nodes can be several orders of magnitude smaller than the number of hosts. A campus network may contain a few dozen network nodes for thousands of hosts. The largest Internet Service Providers typically contain not more than a few tens of thousands of network nodes but still serve tens or hundreds of millions of hosts. + +Despite their popularity, `hierarchical addresses` have some drawbacks. The first drawback is that a lookup in the forwarding table is more complex than when using `flat addresses`. For example, on the Internet, network nodes have to perform a longest-match to forward each packet. This is partially compensated by the reduction in the size of the forwarding tables, but the additional complexity of the lookup operation has made it more difficult to implement packet forwarding in hardware. A second drawback of the utilisation of hierarchical addresses is that when a host connects for the first time to a network, it must contact one network node to determine its own address. This requires some packet exchanges between the host and some network nodes. Furthermore, if a host moves and is attached to another network node, its network address will change. This can be an issue with some mobile hosts. Virtual circuit organisation ---------------------------- -The main advantage of the datagram organisation is its simplicity.
The principles of this organisation can easily be understood. Furthermore, it allows a host to easily send a packet towards any destination at any time. However, as each packet is forwarded independently by intermediate routers, packets sent by a host may not follow the same path to reach a given destination. This may cause packet reordering, which may be annoying for transport protocols. Furthermore, as a router using `hop-by-hop forwarding` always forwards packets sent towards the same destination over the same outgoing interface, this may cause congestion over some links. The second organisation of the network layer, called `virtual circuits`, has been inspired by the organisation of telephone networks. Telephone networks have been designed to carry phone calls that usually last a few minutes. Each phone is identified by a telephone number and is attached to a telephone switch. To initiate a phone call, a telephone first needs to send the destination's phone number to its local switch. The switch cooperates with the other switches in the network to create a bi-directional channel between the two telephones through the network. This channel will be used by the two telephones during the lifetime of the call and will be released at the end of the call. Until the 1960s, most of these channels were created manually, by telephone operators, upon request of the caller. Today's telephone networks use automated switches and allow several channels to be carried over the same physical link, but the principles remain roughly the same. -In a network using virtual circuits, all hosts are identified with a network layer address. However, a host must explicitly request the establishment of a `virtual circuit` before being able to send packets to a destination host. -The request to establish a virtual circuit is processed by the `control plane`, which installs state to create the virtual circuit between the source and the destination through intermediate routers. 
All the packets that are sent on the virtual circuit contain a virtual circuit identifier that allows the routers to determine to which virtual circuit each packet belongs. This is illustrated in the figure below with one virtual circuit between host `A` and host `I` and another one between host `A` and host `J`. +.. index:: label switching -.. figure:: svg/simple-internetwork-vc.png - :align: center - :scale: 70 +In a network using virtual circuits, all hosts are also identified with a network layer address. However, packet forwarding is not performed by looking at the destination address of each packet. With the `virtual circuit` organisation, each data packet contains one label [#flabels]_. A label is an integer which is part of the packet header. Network nodes implement `label switching` to forward `labelled data packets`. Upon reception of a packet, a network node consults its `label forwarding table` to find the outgoing interface for this packet. In contrast with the datagram mode, this lookup is very simple. The `label forwarding table` is an array stored in memory and the label of the incoming packet is the index to access this array. This implies that the lookup operation has an `O(1)` complexity in contrast with other packet forwarding techniques. To ensure that on each node the packet label is an index in the `label forwarding table`, each network node that forwards a packet replaces the label of the forwarded packet with the label found in the `label forwarding table`. Each entry of the `label forwarding table` contains two pieces of information : - A simple internetwork using virtual-circuits + - the outgoing interface for the packet + - the label for the outgoing packet - -The establishment of a virtual circuit is performed using a `signalling protocol` in the `control plane`.
Usually, the source host sends a signalling message to indicate to its router the address of the destination and possibly some performance characteristics of the virtual circuit to be established. The first router can process the signalling message in two different ways. +For example, consider the `label forwarding table` of a network node below. -A first solution is for the router to consult its routing table, remember the characteristics of the requested virtual circuit and forward it over its outgoing interface towards the destination. The signalling message is thus forwarded hop-by-hop until it reaches the destination and the virtual circuit is opened along the path followed by the signalling message. This is illustrated with the red virtual circuit in the figure below. -.. figure:: svg/simple-internetwork-vc-estab.png - :align: center - :scale: 70 ++--------+--------------------+----------+ +| index  | outgoing interface | label    | ++--------+--------------------+----------+ +| 0      | South              | 7        | ++--------+--------------------+----------+ +| 1      | none               | none     | ++--------+--------------------+----------+ +| 2      | West               | 2        | ++--------+--------------------+----------+ +| 3      | East               | 2        | ++--------+--------------------+----------+ - Virtual circuit establishment +If this node receives a packet with `label=2`, it forwards the packet on its `West` interface and sets the `label` of the outgoing packet to `2`. If the received packet's `label` is set to `3`, then the packet is forwarded over the `East` interface and the `label` of the outgoing packet is set to `2`. If a packet is received with a label field set to `1`, the packet is discarded since the corresponding `label forwarding table` entry is invalid. +`Label switching` enables full control over the paths followed by packets inside the network. Consider the network below and assume that we want to use two virtual circuits : `R1->R3->R4->R2->R5` and `R2->R1->R3->R4->R5`. -.. index:: source routing, label + +.. 
graphviz:: -A second solution can be used if the routers know the entire topology of the network. In this case, the first router can use a technique called `source routing`. Upon reception of the signalling message, the first router chooses the path of the virtual circuit in the network. This path is encoded as the list of the addresses of all intermediate routers to reach the destination. It is included in the signalling message and intermediate routers can remove their address from the signalling message before forwarding it. This technique enables routers to spread the virtual circuits throughout the network better. If the routers know the load of remote links, they can also select the least loaded path when establishing a virtual circuit. This solution is illustrated with the blue circuit in the figure above. - -The last point to be discussed about the virtual circuit organisation is its `data plane`. The `data plane` mainly defines the format of the data packets and the algorithm used by routers to forward packets. The data packets contain a virtual circuit identifier, encoded as a fixed number of bits. These virtual circuit identifiers are usually called `labels`. + graph foo { + R1--R2 []; + R1--R3 []; + R2--R4 []; + R3--R4 []; + R4--R5 []; + R2--R5 []; + } -Each host maintains a flow table that associates a label with each virtual circuit that is has established. When a router receives a packet containing a label, it extracts the label and consults its `label forwarding table`. This table is a data structure that maps each couple `(incoming interface, label)` to the outgoing interface to be used to forward the packet as well as the label that must be placed in the outgoing packets. In practice, the label forwarding table can be implemented as a vector and the couple `(incoming interface, label)` is the index of the entry in the vector that contains the outgoing interface and the outgoing label. 
Thus a single memory access is sufficient to consult the label forwarding table. The utilisation of the label forwarding table is illustrated in the figure below. -.. figure:: svg/label-forwarding.png - :align: center - :scale: 70 +To create these virtual circuits, we need to configure the +`label forwarding tables` of all network nodes. For simplicity, assume that a label forwarding table only contains two entries. Assume that `R5` wants to receive the packets from the virtual circuit created by `R1` (resp. `R2`) with `label=1` (`label=0`). `R4` could use the following `label forwarding table`: - Label forwarding tables in a network using virtual circuits ++--------+--------------------+----------+ +| index  | outgoing interface | label    | ++--------+--------------------+----------+ +| 0      | ->R2               | 1        | ++--------+--------------------+----------+ +| 1      | ->R5               | 0        | ++--------+--------------------+----------+ -The virtual circuit organisation has been mainly used in public networks, starting from X.25 and then Frame Relay and Asynchronous Transfer Mode (ATM) network. +Since a packet received with `label=1` must be forwarded to `R5` with `label=1`, `R2`'s `label forwarding table` could contain : ++--------+--------------------+----------+ +| index  | outgoing interface | label    | ++--------+--------------------+----------+ +| 0      | none               | none     | ++--------+--------------------+----------+ +| 1      | ->R5               | 1        | ++--------+--------------------+----------+ -Both the datagram and virtual circuit organisations have advantages and drawbacks. The main advantage of the datagram organisation is that hosts can easily send packets to any number of destinations while the virtual circuit organisation requires the establishment of a virtual circuit before the transmission of a data packet. This solution can be costly for hosts that exchange small amounts of data.
On the other hand, the main advantage of the virtual circuit organisation is that the forwarding algorithm used by routers is simpler than when using the datagram organisation. Furthermore, the utilisation of virtual circuits may allow the load to be better spread through the network thanks to the utilisation of multiple virtual circuits. The MultiProtocol Label Switching (MPLS) technique that we will discuss in another revision of this book can be considered as a good compromise between datagram and virtual circuits. MPLS uses virtual circuits between routers, but does not extend them to the endhosts. Additional information about MPLS may be found in [ML2011]_. +Two virtual circuits pass through `R3`. They both need to be forwarded to `R4`, but `R4` expects `label=1` for packets belonging to the virtual circuit originated by `R2` and `label=0` for packets belonging to the other virtual circuit. `R3` could choose to leave the labels unchanged. ++--------+--------------------+----------+ +| index  | outgoing interface | label    | ++--------+--------------------+----------+ +| 0      | ->R4               | 0        | ++--------+--------------------+----------+ +| 1      | ->R4               | 1        | ++--------+--------------------+----------+ -.. maybe add more information +With the above `label forwarding table`, `R1` needs to originate the packets that belong to the `R1->R3->R4->R2->R5` circuit with `label=0`. The packets received from `R2` and belonging to the `R2->R1->R3->R4->R5` circuit then use `label=1` on the `R1-R3` link. Assuming that `R2` sends these packets to `R1` with `label=0`, `R1`'s label forwarding table could be built as follows : -The control plane -================= ++--------+--------------------+----------+ +| index  | outgoing interface | label    | ++--------+--------------------+----------+ +| 0      | ->R3               | 1        | ++--------+--------------------+----------+ +| 1      | none               | none     | ++--------+--------------------+----------+ -One of the objectives of the `control plane` in the network layer is to maintain the routing tables that are used on all routers.
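This walkthrough can be checked with a small simulation (a sketch; the dictionaries below encode one consistent set of label forwarding tables, in which `R5` receives the circuit originated by `R1` with `label=1` and the circuit originated by `R2` with `label=0`):

```python
# Label forwarding tables: incoming label -> (next node, outgoing label).
tables = {
    "R1": {0: ("R3", 1)},                # transit entry for the R2->...->R5 circuit
    "R2": {1: ("R5", 1)},                # transit entry for the R1->...->R5 circuit
    "R3": {0: ("R4", 0), 1: ("R4", 1)},  # labels left unchanged
    "R4": {0: ("R2", 1), 1: ("R5", 0)},
}

def follow(first_hop, label):
    """Label-switch a packet hop by hop until it reaches R5."""
    node = first_hop
    while node != "R5":
        node, label = tables[node][label]
    return label

# R1 originates R1->R3->R4->R2->R5 by sending to R3 with label 0.
print(follow("R3", 0))   # 1
# R2 originates R2->R1->R3->R4->R5 by sending to R1 with label 0.
print(follow("R1", 0))   # 0
```

At each hop, a single array lookup indexed by the incoming label yields both the outgoing interface and the label to write into the forwarded packet.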
As indicated earlier, a routing table is a data structure that contains, for each destination address (or block of addresses) known by the router, the outgoing interface over which the router must forward a packet destined to this address. The routing table may also contain additional information such as the address of the next router on the path towards the destination or an estimation of the cost of this path. -In this section, we discuss the three main techniques that can be used to maintain the routing tables in a network. -Static routing -------------- +We will discuss later Multi-Protocol Label Switching (MPLS) as an example of a deployed networking technology that relies on label switching. MPLS is more complex than the above description because it has been designed to be easily integrated with datagram technologies. However, the principles remain the same. `Asynchronous Transfer Mode` (ATM) and Frame Relay are other examples of technologies that rely on `label switching`. + +Nowadays, most deployed networks rely on distributed algorithms, called routing protocols, to compute the forwarding tables that are installed on the network nodes. These distributed algorithms are part of the `control plane`. They are usually implemented in software and are executed on the main CPU of the network nodes. There are two main families of routing protocols : distance vector routing and link state routing. Both are capable of autonomously discovering the network and reacting dynamically to topology changes. + +.. The datagram organisation has been very popular in computer networks. Datagram based network layers include IPv4 and IPv6 in the global Internet, CLNP defined by the ISO, IPX defined by Novell or XNS defined by Xerox [Perlman2000]_. + +.. + .. figure:: svg/simple-lan.png + :align: center + :scale: 80 + + A local area network + +.. 
An important difference between the point-to-point datalink layers and the datalink layers used in LANs is that in a LAN, each communicating device is identified by a unique `datalink layer address`. This address is usually embedded in the hardware of the device and different types of LANs use different types of datalink layer addresses. A communicating device attached to a LAN can send a datalink frame to any other communicating device that is attached to the same LAN. Most LANs also support special broadcast and multicast datalink layer addresses. A frame sent to the broadcast address of the LAN is delivered to all communicating devices that are attached to the LAN. The multicast addresses are used to identify groups of communicating devices. When a frame is sent towards a multicast datalink layer address, it is delivered by the LAN to all communicating devices that belong to the corresponding group. -.. comment:: comment formaliser l'absence de boucles -The simplest solution is to pre-compute all the routing tables of all routers and to install them on each router. Several algorithms can be used to compute these tables. +.. index:: NBMA, Non-Broadcast Multi-Access Networks -A simple solution is to use shortest path routing and to minimise the number of intermediate routers to reach each destination. More complex algorithms can take into account the expected load on the links to ensure that congestion does not occur for a given traffic demand. These algorithms must all ensure that : +.. The third type of datalink layers are used in Non-Broadcast Multi-Access (NBMA) networks. These networks are used to interconnect devices like a LAN. All devices attached to an NBMA network are identified by a unique datalink layer address. However, and this is the main difference between an NBMA network and a traditional LAN, the NBMA service only supports unicast. The datalink layer service provided by an NBMA network supports neither broadcast nor multicast. + +.. 
The network layer itself relies on the following principles : + + + +.. #. Each network layer entity is identified by a `network layer address`. This address is independent of the datalink layer addresses that it may use. +.. #. The service provided by the network layer does not depend on the service or the internal organisation of the underlying datalink layers. +.. #. The network layer is conceptually divided into two planes : the `data plane` and the `control plane`. The `data plane` contains the protocols and mechanisms that allow hosts and routers to exchange packets carrying user data. The `control plane` contains the protocols and mechanisms that enable routers to efficiently learn how to forward packets towards their final destination. + +.. The independence of the network layer from the underlying datalink layer is a key principle of the network layer. It ensures that the network layer can be used to allow hosts attached to different types of datalink layers to exchange packets through intermediate routers. Furthermore, this allows the datalink layers and the network layer to evolve independently from each other. This enables the network layer to be easily adapted to a new datalink layer every time a new datalink layer is invented. + +.. + + There are two types of service that can be provided by the network layer : + +.. + + - an `unreliable connectionless` service + - a `connection-oriented`, reliable or unreliable, service + +.. Connection-oriented services have been popular with technologies such as :term:`X.25` and :term:`ATM` or :term:`frame-relay`, but nowadays most networks use an `unreliable connectionless` service. This is our main focus in this chapter. + + +.. maybe add more information + +The control plane +================= + +One of the objectives of the `control plane` in the network layer is to maintain the routing tables that are used on all routers. 
As indicated earlier, a routing table is a data structure that contains, for each destination address (or block of addresses) known by the router, the outgoing interface over which the router must forward a packet destined to this address. The routing table may also contain additional information such as the address of the next router on the path towards the destination or an estimation of the cost of this path. + +In this section, we discuss the three main techniques that can be used to maintain the routing tables in a network. - - all routers are configured with a route to reach each destination - - none of the paths composed with the entries found in the routing tables contain a cycle. Such a cycle would lead to a forwarding loop. -The figure below shows sample routing tables in a five routers network. -.. figure:: svg/routing-tables.png - :align: center - :scale: 70 - Routing tables in a simple network +.. rubric:: Footnotes -The main drawback of static routing is that it does not adapt to the evolution of the network. When a new router or link is added, all routing tables must be recomputed. Furthermore, when a link or router fails, the routing tables must be updated as well. +.. [#flabels] We will see later a more detailed description of Multiprotocol Label Switching, a networking technology that is capable of using one or more labels. -.. include:: dv.rst +.. include:: /links.rst -.. 
include:: linkstate.rst diff --git a/book-2nd/principles/reliability.rst b/book-2nd/principles/reliability.rst index 7c5f14e..e5a6132 100644 --- a/book-2nd/principles/reliability.rst +++ b/book-2nd/principles/reliability.rst @@ -56,7 +56,7 @@ A `time-sequence diagram` describes the interactions between communicating hosts d [label="", linecolour=white]; a=>b [ label = "DATA.req(0)" ] , - b>>c [ label = "0", arcskip=1]; + b>>c [ label = "0", arcskip="1"]; c=>d [ label = "DATA.ind(0)" ]; @@ -75,7 +75,7 @@ The first problem is that electrical transmission can be affected by electromagn d [label="", linecolour=white]; a=>b [ label = "DATA.req(0)" ] , - b>>c [ label = "", arcskip=1]; + b>>c [ label = "", arcskip="1"]; c=>d [ label = "DATA.ind(1)" ]; @@ -90,14 +90,14 @@ With the above transmission scheme, a bit is transmitted by setting the voltage d [label="", linecolour=white]; a=>b [ label = "DATA.req(0)" ] , - b>>c [ label = "", arcskip=1]; + b>>c [ label = "", arcskip="1"]; c=>d [ label = "DATA.ind(0)" ]; a=>b [ label = "DATA.req(0)" ]; a=>b [ label = "DATA.req(1)" ] , - b>>c [ label = "", arcskip=1]; + b>>c [ label = "", arcskip="1"]; c=>d [ label = "DATA.ind(1)" ]; @@ -307,7 +307,7 @@ Reliable data transfer on top of a perfect physical service The datalink layer will send and receive frames on behalf of a user. We model these interactions by using the `DATA.req` and `DATA.ind` primitives. 
However, to simplify the presentation and to avoid confusion between a `DATA.req` primitive issued by the user of the datalink layer entity, and a `DATA.req` issued by the datalink layer entity itself, we will use the following terminology : - the interactions between the user and the datalink layer entity are represented by using the classical `DATA.req` and the `DATA.ind` primitives - - the interactions between the datalink layer entity and the framing layer are represented by using `send` instead of `DATA.req` and `recvd` instead of `DATA.ind` + - the interactions between the datalink layer entity and the framing sublayer are represented by using `send` instead of `DATA.req` and `recvd` instead of `DATA.ind` This is illustrated in the figure below. @@ -319,42 +319,106 @@ This is illustrated in the figure below. TODO Interactions between the transport layer, its user, and its network layer provider -When running on top of a perfect connectionless network service, a transport level entity can simply issue a `send(SDU)` upon arrival of a `DATA.req(SDU)`. Similarly, the receiver issues a `DATA.ind(SDU)` upon receipt of a `recvd(SDU)`. Such a simple protocol is sufficient when a single SDU is sent. +When running on top of a perfect framing sublayer, a datalink entity can simply issue a `send(SDU)` upon arrival of a `DATA.req(SDU)`. Similarly, the receiver issues a `DATA.ind(SDU)` upon receipt of a `recvd(SDU)`. Such a simple protocol is sufficient when a single SDU is sent. This is illustrated in the figure below. + + + .. msc:: + + a [label="", linecolour=white], + b [label="Host A", linecolour=black], + z [label="", linecolour=white], + c [label="Host B", linecolour=black], + d [label="", linecolour=white]; + + a=>b [ label = "DATA.req(SDU)" ] , + b>>c [ label = "Frame(SDU)", arcskip="1"]; + c=>d [ label = "DATA.ind(SDU)" ]; -.. figure:: ../../book/transport/svg/transport-fig-004.* + +.. .. 
figure:: ../../book/transport/svg/transport-fig-004.*
    :align: center
    :scale: 70

-    The simplest transport protocol
+    The simplest reliable protocol
+
+
+Unfortunately, this is not always sufficient to ensure a reliable delivery of the SDUs. Consider the case where a client sends tens of SDUs to a server. If the server is faster than the client, it will be able to receive and process all the SDUs sent by the client and deliver their content to its user. However, if the server is slower than the client, problems may arise. The datalink entity contains buffers to store SDUs that have been received as a `Data.request` but have not yet been sent. If the application is faster than the physical link, the buffer may become full. At this point, the operating system suspends the application to let the datalink entity empty its transmission queue. The datalink entity also uses a buffer to store the received frames that have not yet been processed by the application. If the application is slow to process the data, this buffer may overflow and the datalink entity will not be able to accept any additional frame. The buffers of the datalink entity have a limited size [#fqueuesize]_ and if they overflow, the arriving frames will be discarded, even if they are correct.
+
+To solve this problem, a reliable protocol must include a feedback mechanism that allows the receiver to inform the sender that it has processed a frame and that another one can be sent. This feedback is required even though there are no transmission errors. To include such a feedback, our reliable protocol must process two types of frames :
+
+ - data frames carrying an SDU
+ - control frames carrying an acknowledgment indicating that the previous frame was processed correctly

-Unfortunately, this is not always sufficient to ensure a reliable delivery of the SDUs. Consider the case where a client sends tens of SDUs to a server.
If the server is faster that the client, it will be able to receive and process all the segments sent by the client and deliver their content to its user. However, if the server is slower than the client, problems may arise. The transport layer entity contains buffers to store SDUs that have been received as a `Data.request` from the application but have not yet been sent via the network service. If the application is faster than the network layer, the buffer becomes full and the operating system suspends the application to let the transport entity empty its transmission queue. The transport entity also uses a buffer to store the segments received from the network layer that have not yet been processed by the application. If the application is slow to process the data, this buffer becomes full and the transport entity is not able to accept anymore the segments from the network layer. The buffers of the transport entity have a limited size [#fqueuesize]_ and if they overflow, the transport entity is forced to discard received segments.

+These two types of frames can be distinguished by dividing the frame into two parts :

-To solve this problem, our transport protocol must include a feedback mechanism that allows the receiver to inform the sender that it has processed a segment and that another one can be sent. This feedback is required even though the network layer provides a perfect service. To include such a feedback, our transport protocol must process two types of segments :

+ - the `header` that contains one bit set to `0` in data frames and set to `1` in control frames
+ - the payload that contains the SDU supplied by the application

- - data segments carrying a SDU
- - control segments carrying an acknowledgment indicating that the previous segment was processed correctly

+The datalink entity can then be modelled as a finite state machine, containing two states for the receiver and two states for the sender.
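These two finite state machines can be sketched in Python. The single header bit and the state names follow the text; the class layout and frame representation are illustrative assumptions, not part of any real implementation.

```python
# Sketch of the simplest reliable protocol, assuming a perfect framing
# sublayer: the sender waits for a control frame C(OK) after each data
# frame D(SDU) before accepting the next DATA.req from its user.

DATA, CONTROL = 0, 1  # value of the single header bit

class Sender:
    def __init__(self):
        self.state = "Wait for SDU"

    def data_req(self, sdu):
        """DATA.req from the user: send one data frame."""
        assert self.state == "Wait for SDU"
        self.state = "Wait for OK"
        return (DATA, sdu)               # send(D(SDU))

    def recvd(self, frame):
        """A control frame acknowledges the previous data frame."""
        assert frame[0] == CONTROL and self.state == "Wait for OK"
        self.state = "Wait for SDU"      # ready for the next SDU

class Receiver:
    def recvd(self, frame):
        """A data frame is delivered to the user and acknowledged."""
        assert frame[0] == DATA
        self.sdu = frame[1]              # DATA.ind(SDU)
        return (CONTROL, "OK")           # send(C(OK))
```

Exchanging one SDU thus costs two frames: the acknowledgement returned by `Receiver.recvd` brings the sender back to its `Wait for SDU` state.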
The figure below provides a graphical representation of this state machine with the sender above and the receiver below.

-These two types of segments can be distinguished using a segment composed of two parts :

+.. digraph:: sender
+
+   rankdir=LR;
+   node [shape = circle label="Wait\nfor\nSDU"] Wait_SDU;
+   node [shape = circle label="Wait\nfor\nOK"] Wait_OK;
+   Wait_SDU -> Wait_OK [label=<DATA.req(SDU)<br/>========<br/>send(D(SDU))>];
+   Wait_OK -> Wait_SDU [label=<recvd(C(OK))<br/>=======<br/>>];
+
+next FSM
+
+.. digraph:: receiver
+
+   rankdir=LR;
+   node [shape=circle label=<Wait<br/>for<br/>frame>] Wait_frame;
+   node [shape=circle label=<Process<br/>SDU>] Process_SDU;
+   Wait_frame -> Process_SDU [label=<recvd(D(SDU))<br/>=========<br/>DATA.ind(SDU)>];
+
+
+
+.. Process_SDU -> Wait_frame [label=<
+..
=======
+.. send(C(OK))
+.. >];

- - the `header` that contains one bit set to `0` in data segments and set to `1` in control segments
- - the payload that contains the SDU supplied by the user application

-The transport entity can then be modelled as a finite state machine, containing two states for the receiver and two states for the sender.

 .. figure:: ../../book/transport/png/transport-fig-008-c.png
    :align: center
    :scale: 60

-   Finite state machine of the simplest transport protocol
+   Finite state machine of the simplest reliable protocol

-The above FSM shows that the sender has to wait for an acknowledgement from the receiver before being able to transmit the next SDU. The figure below illustrates the exchange of a few segments between two hosts.
+The above FSM shows that the sender has to wait for an acknowledgement from the receiver before being able to transmit the next SDU. The figure below illustrates the exchange of a few frames between two hosts.
+
+ .. msc::
+
+    a [label="", linecolour=white],
+    b [label="Host A", linecolour=black],
+    z [label="", linecolour=white],
+    c [label="Host B", linecolour=black],
+    d [label="", linecolour=white];
+
+    a=>b [ label = "DATA.req(a)"], b>>c [ label = "D(a)", arcskip="1"];
+    c=>d [ label = "DATA.ind(a)" ],c>>b [label= "C(OK)", arcskip="1"];
+    |||;
+    a=>b [ label = "DATA.req(b)" ], b>>c [ label = "D(b)",arcskip="1"];
+    c=>d [ label = "DATA.ind(b)" ], c>>b [label= "C(OK)", arcskip="1"];
+    |||;
+
+
+Time sequence diagram illustrating the operation of the simplest reliable protocol
+
+
+..
+   .. figure:: ../../book/transport/svg/transport-fig-009.*
+      :align: center
+      :scale: 80

-   Time sequence diagram illustrating the operation of the simplest transport protocol

..
note:: Services and protocols

@@ -362,24 +426,27 @@ The above FSM shows that the sender has to wait for an acknowledgement from the
-
-
-Reliable data transfer on top of an imperfect network service
-=============================================================
+Reliable data transfer on top of an imperfect link
+==================================================
+
+The datalink layer must deal with transmission errors. In practice, we mainly have to deal with two types of errors in the datalink layer :

-The transport layer must deal with the imperfections of the network layer service. There are three types of imperfections that must be considered by the transport layer :

+ #. Frames can be corrupted by transmission errors
+ #. Frames can be lost or unexpected frames can appear

- #. Segments can be corrupted by transmission errors
- #. Segments can be lost
- #. Segments can be reordered or duplicated

+.. #. Segments can be reordered or duplicated

-To deal with these types of imperfections, transport protocols rely on different types of mechanisms. The first problem is transmission errors. The segments sent by a transport entity is processed by the network and datalink layers and finally transmitted by the physical layer. All of these layers are imperfect. For example, the physical layer may be affected by different types of errors :

+At first glance, losing frames might seem strange on a single link. However, if we take framing into account, transmission errors can affect the frame delineation mechanism and make the frame unreadable. For the same reason, a receiver could receive two (likely invalid) frames after a sender has transmitted a single frame.
+
+To deal with these types of imperfections, reliable protocols rely on different types of mechanisms. The first problem is transmission errors.
Data transmission on a physical link can be affected by the following errors : - random isolated errors where the value of a single bit has been modified due to a transmission error - random burst errors where the values of `n` consecutive bits have been changed due to transmission errors - random bit creations and random bit removals where bits have been added or removed due to transmission errors -The only solution to protect against transmission errors is to add redundancy to the segments that are sent. `Information Theory` defines two mechanisms that can be used to transmit information over a transmission channel affected by random errors. These two mechanisms add redundancy to the information sent, to allow the receiver to detect or sometimes even correct transmission errors. A detailed discussion of these mechanisms is outside the scope of this chapter, but it is useful to consider a simple mechanism to understand its operation and its limitations. +The only solution to protect against transmission errors is to add redundancy to the frames that are sent. `Information Theory` defines two mechanisms that can be used to transmit information over a transmission channel affected by random errors. These two mechanisms add redundancy to the transmitted information, to allow the receiver to detect or sometimes even correct transmission errors. A detailed discussion of these mechanisms is outside the scope of this chapter, but it is useful to consider a simple mechanism to understand its operation and its limitations. -`Information theory` defines `coding schemes`. There are different types of coding schemes, but let us focus on coding schemes that operate on binary strings. A coding scheme is a function that maps information encoded as a string of `m` bits into a string of `n` bits. The simplest coding scheme is the even parity coding. 
This coding scheme takes an `m` bits source string and produces an `m+1` bits coded string where the first `m` bits of the coded string are the bits of the source string and the last bit of the coded string is chosen such that the coded string will always contain an even number of bits set to `1`. For example : +`Information theory` defines `coding schemes`. There are different types of coding schemes, but let us focus on coding schemes that operate on binary strings. A coding scheme is a function that maps information encoded as a string of `m` bits into a string of `n` bits. The simplest coding scheme is the (even) parity coding. This coding scheme takes an `m` bits source string and produces an `m+1` bits coded string where the first `m` bits of the coded string are the bits of the source string and the last bit of the coded string is chosen such that the coded string will always contain an even number of bits set to `1`. For example : - `1001` is encoded as `10010` - `1101` is encoded as `11011` @@ -395,19 +462,21 @@ For example, consider a sender that sends `111`. If there is one bit in error, t This simple coding scheme forces the sender to transmit three bits for each source bit. However, it allows the receiver to correct single bit errors. More advanced coding systems that allow to recover from errors are used in several types of physical layers. -Transport protocols use error detection schemes, but none of the widely used transport protocols rely on error correction schemes. To detect errors, a segment is usually divided into two parts : +Reliable protocols use error detection schemes, but none of the widely used reliable protocols rely on error correction schemes. To detect errors, a frame is usually divided into two parts : - - a `header` that contains the fields used by the transport protocol to ensure reliable delivery. 
The header contains a checksum or Cyclical Redundancy Check (CRC) [Williams1993]_ that is used to detect transmission errors - - a `payload` that contains the user data passed by the application layer. + - a `header` that contains the fields used by the reliable protocol to ensure reliable delivery. The header contains a checksum or Cyclical Redundancy Check (CRC) [Williams1993]_ that is used to detect transmission errors + - a `payload` that contains the user data -Some segment headers also include a `length` , which indicates the total length of the segment or the length of the payload. +Some headers also include a `length` field, which indicates the total length of the frame or the length of the payload. -The simplest error detection scheme is the checksum. A checksum is basically an arithmetic sum of all the bytes that a segment is composed of. There are different types of checksums. For example, an eight bit checksum can be computed as the arithmetic sum of all the bytes of (both the header and trailer of) the segment. The checksum is computed by the sender before sending the segment and the receiver verifies the checksum upon reception of each segment. The receiver discards segments received with an invalid checksum. Checksums can be easily implemented in software, but their error detection capabilities are limited. Cyclical Redundancy Checks (CRC) have better error detection capabilities [SGP98]_, but require more CPU when implemented in software. +The simplest error detection scheme is the checksum. A checksum is basically an arithmetic sum of all the bytes that a frame is composed of. There are different types of checksums. For example, an eight bit checksum can be computed as the arithmetic sum of all the bytes of (both the header and trailer of) the frame. The checksum is computed by the sender before sending the frame and the receiver verifies the checksum upon frame reception. The receiver discards frames received with an invalid checksum. 
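The even parity coding and the eight bit checksum described above can be sketched as follows. This is illustrative Python, not the exact checksum used by any deployed protocol; the function names are assumptions.

```python
def parity_encode(bits: str) -> str:
    """Append a parity bit so the coded string has an even number of 1s."""
    return bits + ("1" if bits.count("1") % 2 else "0")

def checksum8(data: bytes) -> int:
    """Eight bit checksum: arithmetic sum of all the bytes, modulo 256."""
    return sum(data) % 256

# The two examples from the text:
assert parity_encode("1001") == "10010"
assert parity_encode("1101") == "11011"

# The sender appends the checksum to the frame; the receiver recomputes
# it and discards the frame when the two values differ.
frame = b"SDU" + bytes([checksum8(b"SDU")])
assert checksum8(frame[:-1]) == frame[-1]          # frame accepted

corrupted = bytes([frame[0] ^ 1]) + frame[1:]      # single bit error
assert checksum8(corrupted[:-1]) != corrupted[-1]  # frame discarded
```

Note that the scheme only detects errors: the receiver knows the frame is damaged but cannot tell which bit was flipped, which is why a retransmission mechanism is still needed.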
Checksums can be easily implemented in software, but their error detection capabilities are limited. Cyclical Redundancy Checks (CRC) have better error detection capabilities [SGP98]_, but require more CPU when implemented in software. .. note:: Checksums, CRCs, ... - Most of the protocols in the TCP/IP protocol suite rely on the simple Internet checksum in order to verify that the received segment has not been affected by transmission errors. Despite its popularity and ease of implementation, the Internet checksum is not the only available checksum mechanism. Cyclical Redundancy Checks (CRC_) are very powerful error detection schemes that are used notably on disks, by many datalink layer protocols and file formats such as zip or png. They can easily be implemented efficiently in hardware and have better error-detection capabilities than the Internet checksum [SGP98]_ . However, when the first transport protocols were designed, CRCs were considered to be too CPU-intensive for software implementations and other checksum mechanisms were used instead. The TCP/IP community chose the Internet checksum, the OSI community chose the Fletcher checksum [Sklower89]_ . Now, there are efficient techniques to quickly compute CRCs in software [Feldmeier95]_ , the SCTP protocol initially chose the Adler-32 checksum but replaced it recently with a CRC (see :rfc:`3309`). + Most of the protocols in the TCP/IP protocol suite rely on the simple Internet checksum in order to verify that a received packet has not been affected by transmission errors. Despite its popularity and ease of implementation, the Internet checksum is not the only available checksum mechanism. Cyclical Redundancy Checks (CRC_) are very powerful error detection schemes that are used notably on disks, by many datalink layer protocols and file formats such as zip or png. They can easily be implemented efficiently in hardware and have better error-detection capabilities than the Internet checksum [SGP98]_ . 
However, CRCs are sometimes considered to be too CPU-intensive for software implementations and other checksum mechanisms are preferred. The TCP/IP community chose the Internet checksum, the OSI community chose the Fletcher checksum [Sklower89]_ . Now, there are efficient techniques to quickly compute CRCs in software [Feldmeier95]_ + +.. , the SCTP protocol initially chose the Adler-32 checksum but replaced it recently with a CRC (see :rfc:`3309`). .. CRC, checksum, fletcher, crc-32, Internet checksum .. real checksum http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.55.8520 @@ -417,9 +486,34 @@ The simplest error detection scheme is the checksum. A checksum is basically an .. tcp offload engine http://www.10gea.org/tcp-ip-offload-engine-toe.htm .. stcp used Adler-32 but it now uses CRC :rfc:`3309` -The second imperfection of the network layer is that segments may be lost. As we will see later, the main cause of packet losses in the network layer is the lack of buffers in intermediate routers. Since the receiver sends an acknowledgement segment after having received each data segment, the simplest solution to deal with losses is to use a retransmission timer. When the sender sends a segment, it starts a retransmission timer. The value of this retransmission timer should be larger than the `round-trip-time`, i.e. the delay between the transmission of a data segment and the reception of the corresponding acknowledgement. When the retransmission timer expires, the sender assumes that the data segment has been lost and retransmits it. This is illustrated in the figure below. +.. The second imperfection of the network layer is that segments may be lost. As we will see later, the main cause of packet losses in the network layer is the lack of buffers in intermediate routers. Since the receiver sends an acknowledgement segment after having received each data segment, the simplest solution to deal with losses is to use a retransmission timer. 
When the sender sends a segment, it starts a retransmission timer. The value of this retransmission timer should be larger than the `round-trip-time`, i.e. the delay between the transmission of a data segment and the reception of the corresponding acknowledgement. When the retransmission timer expires, the sender assumes that the data segment has been lost and retransmits it. This is illustrated in the figure below. + + +.. msc:: -.. figure:: ../../book/transport/svg/transport-fig-018.* + a [label="", linecolour=white], + b [label="Host A", linecolour=black], + z [label="", linecolour=white], + c [label="Host B", linecolour=black], + d [label="", linecolour=white]; + + a=>b [ label = "DATA.req(a)\nstart timer" ] , + b>>c [ label = "D(a)", arcskip="1"]; + c=>d [ label = "DATA.ind(a)" ]; + c>>b [label= "C(OK)", arcskip="1"], + b->a [linecolour=white, label="cancel timer"]; + + a=>b [ label = "DATA.req(b)\nstart timer" ] , + b>>c [ label = "D(b)", arcskip="1"], + c-x d [label="lost frame", linecolour=white]; + |||; + a=>b [ linecolour=white, label = "timer expires" ] , + b>>c [ label = "D(b)", arcskip="1"]; + c=>d [ label = "DATA.ind(b)" ], + c>>b [label= "C(OK)", arcskip="1"]; + + +.. figure:: ../../book/transport/svg/transport-fig-018.png :align: center :scale: 70 @@ -429,7 +523,7 @@ The second imperfection of the network layer is that segments may be lost. As we Unfortunately, retransmission timers alone are not sufficient to recover from segment losses. Let us consider, as an example, the situation depicted below where an acknowledgement is lost. In this case, the sender retransmits the data segment that has not been acknowledged. Unfortunately, as illustrated in the figure below, the receiver considers the retransmission as a new segment whose payload must be delivered to its user. -.. figure:: ../../book/transport/svg/transport-fig-019.* +.. 
figure:: ../../book/transport/svg/transport-fig-019.png :align: center :scale: 70 @@ -445,7 +539,9 @@ To solve this problem, datalink protocols associate a `sequence number` to each The Alternating Bit Protocol uses a single bit to encode the sequence number. It can be implemented easily. The sender and the receivers only require a four states Finite State Machine. -.. figure:: ../../book/transport/svg/transport-fig-021.* + + +.. figure:: ../../book/transport/svg/transport-fig-021.png :align: center :scale: 80 @@ -457,7 +553,7 @@ The initial state of the sender is `Wait for D(0,...)`. In this state, the sende The receiver first waits for `D(0,...)`. If the frame contains a correct `CRC`, it passes the SDU to its user and sends `OK0`. If the frame contains an invalid CRC, it is immediately discarded. Then, the receiver waits for `D(1,...)`. In this state, it may receive a duplicate `D(0,...)` or a data frame with an invalid CRC. In both cases, it returns an `OK0` frame to allow the sender to recover from the possible loss of the previous `OK0` frame. -.. figure:: ../../book/transport/svg/transport-fig-022.* +.. figure:: ../../book/transport/svg/transport-fig-022.png :align: center :scale: 70 @@ -469,7 +565,7 @@ The receiver first waits for `D(0,...)`. If the frame contains a correct `CRC`, The figure below illustrates the operation of the alternating bit protocol. -.. figure:: ../../book/transport/svg/transport-fig-023.* +.. figure:: ../../book/transport/svg/transport-fig-023.png :align: center :scale: 70 @@ -548,7 +644,7 @@ The simplest sliding window protocol uses the `go-back-n` recovery. Intuitively, The figure below shows the FSM of a simple `go-back-n` receiver. This receiver uses two variables : `lastack` and `next`. `next` is the next expected sequence number and `lastack` the sequence number of the last data frame that has been acknowledged. The receiver only accepts the frame that are received in sequence. 
`maxseq` is the number of different sequence numbers (:math:`2^n`). -.. figure:: ../../book/transport/svg/transport-fig-029.* +.. figure:: ../../book/transport/svg/transport-fig-029.png :align: center :scale: 70 @@ -558,7 +654,7 @@ The figure below shows the FSM of a simple `go-back-n` receiver. This receiver u A `go-back-n` sender is also very simple. It uses a sending buffer that can store an entire sliding window of frames [#fsizesliding]_ . The frames are sent with increasing sequence numbers (modulo `maxseq`). The sender must wait for an acknowledgement once its sending buffer is full. When a `go-back-n` sender receives an acknowledgement, it removes from the sending buffer all the acknowledged frames and uses a retransmission timer to detect frame losses. A simple `go-back-n` sender maintains one retransmission timer per connection. This timer is started when the first frame is sent. When the `go-back-n sender` receives an acknowledgement, it restarts the retransmission timer only if there are still unacknowledged frames in its sending buffer. When the retransmission timer expires, the `go-back-n` sender assumes that all the unacknowledged frames currently stored in its sending buffer have been lost. It thus retransmits all the unacknowledged frames in the buffer and restarts its retransmission timer. -.. figure:: ../../book/transport/svg/transport-fig-030.* +.. figure:: ../../book/transport/svg/transport-fig-030.png :align: center :scale: 70 @@ -581,7 +677,7 @@ The main advantage of `go-back-n` is that it can be easily implemented, and it c .. index:: selective repeat -`Selective repeat` is a better strategy to recover from segment losses. Intuitively, `selective repeat` allows the receiver to accept out-of-sequence segments. Furthermore, when a `selective repeat` sender detects losses, it only retransmits the frames that have been lost and not the frames that have already been correctly received. 
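The `go-back-n` receiver described above, with its `lastack` and `next` variables, can be sketched in Python. The value of `MAXSEQ`, the frame representation and the acknowledgement format are illustrative assumptions.

```python
# Sketch of a go-back-n receiver. `next` is the next expected sequence
# number and `lastack` the sequence number of the last acknowledged
# data frame, as in the FSM described in the text.

MAXSEQ = 8  # number of distinct sequence numbers (2**n), illustrative

class GoBackNReceiver:
    def __init__(self):
        self.next = 0
        # before anything is received, lastack holds the sequence
        # number that precedes 0 (modulo MAXSEQ)
        self.lastack = MAXSEQ - 1
        self.delivered = []

    def frame_received(self, seq, sdu):
        """Process one data frame and return the acknowledgement."""
        if seq == self.next:
            # in-sequence frame: deliver its SDU and slide the window
            self.delivered.append(sdu)
            self.lastack = seq
            self.next = (self.next + 1) % MAXSEQ
        # out-of-sequence frames are discarded, but the receiver still
        # acknowledges the last frame that was received in sequence
        return ("OK", self.lastack)
```

A frame that arrives out of sequence (for example after a loss) is dropped, and the repeated acknowledgement of `lastack` is what eventually triggers the sender's retransmission.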
+`Selective repeat` is a better strategy to recover from losses. Intuitively, `selective repeat` allows the receiver to accept out-of-sequence frames. Furthermore, when a `selective repeat` sender detects losses, it only retransmits the frames that have been lost and not the frames that have already been correctly received. A `selective repeat` receiver maintains a sliding window of `W` frames and stores in a buffer the out-of-sequence frames that it receives. The figure below shows a five-frame receive window on a receiver that has already received frames `7` and `9`. @@ -674,138 +770,17 @@ Reliable protocols often need to send data in both directions. To reduce the ove Piggybacking -.. todo:: update - -Connection establishment and release ------------------------------------- - -The last points to be discussed about the transport protocol are the mechanisms used to establish and release a transport connection. - -We explained in the first chapters the service primitives used to establish a connection. The simplest approach to establish a transport connection would be to define two special control segments : `CR` and `CA`. The `CR` segment is sent by the transport entity that wishes to initiate a connection. If the remote entity wishes to accept the connection, it replies by sending a `CA` segment. The transport connection is considered to be established once the `CA` segment has been received and data segments can be sent in both directions. - -.. figure:: ../../book/transport/png/transport-fig-045-c.png - :align: center - :scale: 70 - - Naive transport connection establishment - -Unfortunately, this scheme is not sufficient for several reasons. First, a transport entity usually needs to maintain several transport connections with remote entities. Sometimes, different users (i.e. processes) running above a given transport entity request the establishment of several transport connections to different users attached to the same remote transport entity. 
These different transport connections must be clearly separated to ensure that data from one connection is not passed to the other connections. This can be achieved by using a connection identifier, chosen by the transport entities and placed inside each segment to allow the entity which receives a segment to easily associate it to one established connection. - -Second, as the network layer is imperfect, the `CR` or `CA` segment can be lost, delayed, or suffer from transmission errors. To deal with these problems, the control segments must be protected by using a CRC or checksum to detect transmission errors. Furthermore, since the `CA` segment acknowledges the reception of the `CR` segment, the `CR` segment can be protected by using a retransmission timer. - -Unfortunately, this scheme is not sufficient to ensure the reliability of the transport service. Consider for example a short-lived transport connection where a single, but important transfer (e.g. money transfer from a bank account) is sent. Such a short-lived connection starts with a `CR` segment acknowledged by a `CA` segment, then the data segment is sent, acknowledged and the connection terminates. Unfortunately, as the network layer service is unreliable, delays combined to retransmissions may lead to the situation depicted in the figure below, where a delayed `CR` and data segments from a former connection are accepted by the receiving entity as valid segments, and the corresponding data is delivered to the user. Duplicating SDUs is not acceptable, and the transport protocol must solve this problem. - - -.. figure:: ../../book/transport/png/transport-fig-047-c.png - :align: center - :scale: 70 - - Duplicate transport connections ? - - -.. index:: Maximum Segment Lifetime (MSL), transport clock - - -To avoid these duplicates, transport protocols require the network layer to bound the `Maximum Segment Lifetime (MSL)`. 
The organisation of the network must guarantee that no segment remains in the network for longer than `MSL` seconds. On today's Internet, `MSL` is expected to be 2 minutes. To avoid duplicate transport connections, transport protocol entities must be able to safely distinguish between a duplicate `CR` segment and a new `CR` segment, without forcing each transport entity to remember all the transport connections that it has established in the past. - -A classical solution to avoid remembering the previous transport connections to detect duplicates is to use a clock inside each transport entity. This `transport clock` has the following characteristics : - - - the `transport clock` is implemented as a `k` bits counter and its clock cycle is such that :math:`2^k \times cycle >> MSL`. Furthermore, the `transport clock` counter is incremented every clock cycle and after each connection establishment. This clock is illustrated in the figure below. - - the `transport clock` must continue to be incremented even if the transport entity stops or reboots - -.. figure:: ../../book/transport/png/transport-fig-048-c.png - :align: center - :scale: 70 - - Transport clock - - -It should be noted that `transport clocks` do not need and usually are not synchronised to the real-time clock. Precisely synchronising real-time clocks is an interesting problem, but it is outside the scope of this document. See [Mills2006]_ for a detailed discussion on synchronising the real-time clock. - -The `transport clock` is combined with an exchange of three segments, called the `three way handshake`, to detect duplicates. This `three way handshake` occurs as follows : - - #. The initiating transport entity sends a `CR` segment. This segment requests the establishment of a transport connection. It contains a connection identifier (not shown in the figure) and a sequence number (`seq=x` in the figure below) whose value is extracted from the `transport clock` . 
The transmission of the `CR` segment is protected by a retransmission timer. - - #. The remote transport entity processes the `CR` segment and creates state for the connection attempt. At this stage, the remote entity does not yet know whether this is a new connection attempt or a duplicate segment. It returns a `CA` segment that contains an acknowledgement number to confirm the reception of the `CR` segment (`ack=x` in the figure below) and a sequence number (`seq=y` in the figure below) whose value is extracted from its transport clock. At this stage, the connection is not yet established. - - #. The initiating entity receives the `CA` segment. The acknowledgement number of this segment confirms that the remote entity has correctly received the `CA` segment. The transport connection is considered to be established by the initiating entity and the numbering of the data segments starts at sequence number `x`. Before sending data segments, the initiating entity must acknowledge the received `CA` segments by sending another `CA` segment. - - #. The remote entity considers the transport connection to be established after having received the segment that acknowledges its `CA` segment. The numbering of the data segments sent by the remote entity starts at sequence number `y`. - - The three way handshake is illustrated in the figure below. - -.. figure:: ../../book/transport/png/transport-fig-049-c.png - :align: center - :scale: 70 - - Three-way handshake - -Thanks to the three way handshake, transport entities avoid duplicate transport connections. This is illustrated by the three scenarios below. - -The first scenario is when the remote entity receives an old `CR` segment. It considers this `CR` segment as a connection establishment attempt and replies by sending a `CA` segment. However, the initiating host cannot match the received `CA` segment with a previous connection attempt. 
It sends a control segment (`REJECT` in the figure below) to cancel the spurious connection attempt. The remote entity cancels the connection attempt upon reception of this control segment. - -.. figure:: ../../book/transport/png/transport-fig-050-c.png - :align: center - :scale: 70 - - Three-way handshake : recovery from a duplicate `CR` - -A second scenario is when the initiating entity sends a `CR` segment that does not reach the remote entity and receives a duplicate `CA` segment from a previous connection attempt. This duplicate `CA` segment cannot contain a valid acknowledgement for the `CR` segment as the sequence number of the `CR` segment was extracted from the transport clock of the initiating entity. The `CA` segment is thus rejected and the `CR` segment is retransmitted upon expiration of the retransmission timer. - - -.. figure:: ../../book/transport/png/transport-fig-051-c.png - :align: center - :scale: 70 - - Three-way handshake : recovery from a duplicate `CA` - -The last scenario is less likely, but it it important to consider it as well. The remote entity receives an old `CR` segment. It notes the connection attempt and acknowledges it by sending a `CA` segment. The initiating entity does not have a matching connection attempt and replies by sending a `REJECT`. Unfortunately, this segment never reaches the remote entity. Instead, the remote entity receives a retransmission of an older `CA` segment that contains the same sequence number as the first `CR` segment. This `CA` segment cannot be accepted by the remote entity as a confirmation of the transport connection as its acknowledgement number cannot have the same value as the sequence number of the first `CA` segment. - -.. figure:: ../../book/transport/png/transport-fig-052-c.png - :align: center - :scale: 70 - - Three-way handshake : recovery from duplicates `CR` and `CA` - - -.. 
index:: abrupt connection release - -When we discussed the connection-oriented service, we mentioned that there are two types of connection releases : `abrupt release` and `graceful release`. - -The first solution to release a transport connection is to define a new control segment (e.g. the `DR` segment) and consider the connection to be released once this segment has been sent or received. This is illustrated in the figure below. - - -.. figure:: ../../book/transport/png/transport-fig-053-c.png - :align: center - :scale: 70 - - Abrupt connection release - -As the entity that sends the `DR` segment cannot know whether the other entity has already sent all its data on the connection, SDUs can be lost during such an `abrupt connection release`. - -.. index:: graceful connection release - -The second method to release a transport connection is to release independently the two directions of data transfer. Once a user of the transport service has sent all its SDUs, it performs a `DISCONNECT.req` for its direction of data transfer. The transport entity sends a control segment to request the release of the connection *after* the delivery of all previous SDUs to the remote user. This is usually done by placing in the `DR` the next sequence number and by delivering the `DISCONNECT.ind` only after all previous `DATA.ind`. The remote entity confirms the reception of the `DR` segment and the release of the corresponding direction of data transfer by returning an acknowledgement. This is illustrated in the figure below. - -.. figure:: ../../book/transport/png/transport-fig-054-c.png - :align: center - :scale: 70 - - Graceful connection release - -.. rubric:: Footnotes - -.. [#fqueuesize] In the application layer, most servers are implemented as processes. The network and transport layer on the other hand are usually implemented inside the operating system and the amount of memory that they can use is limited by the amount of memory allocated to the entire kernel. +.. .. 
[#fqueuesize] In the application layer, most servers are implemented as processes. The network and transport layer on the other hand are usually implemented inside the operating system and the amount of memory that they can use is limited by the amount of memory allocated to the entire kernel. -.. [#fsizesliding] The size of the sliding window can be either fixed for a given protocol or negotiated during the connection establishment phase. We'll see later that it is also possible to change the size of the sliding window during the connection's lifetime. +.. [#fsizesliding] The size of the sliding window can be either fixed for a given protocol or negotiated during the connection establishment phase. Some protocols allow the maximum window size to change during the data transfer. We will explain this later with real protocols. -.. [#fautotune] For a discussion on how the sending buffer can change, see e.g. [SMM1998]_ +.. .. [#fautotune] For a discussion on how the sending buffer can change, see e.g. [SMM1998]_ -.. [#facklost] Note that if the receive window shrinks, it might happen that the sender has already sent a segment that is not anymore inside its window. This segment will be discarded by the receiver and the sender will retransmit it later. +.. .. [#facklost] Note that if the receive window shrinks, it might happen that the sender has already sent a segment that is not anymore inside its window. This segment will be discarded by the receiver and the sender will retransmit it later. -.. [#fmsl] As we will see in the next chapter, the Internet does not strictly enforce this MSL. However, it is reasonable to expect that most packets on the Internet will not remain in the network during more than 2 minutes. There are a few exceptions to this rule, such as :rfc:`1149` whose implementation is described in http://www.blug.linux.no/rfc1149/ but there are few real links supporting :rfc:`1149` in the Internet. +.. .. 
[#fmsl] As we will see in the next chapter, the Internet does not strictly enforce this MSL. However, it is reasonable to expect that most packets on the Internet will not remain in the network during more than 2 minutes. There are a few exceptions to this rule, such as :rfc:`1149` whose implementation is described in http://www.blug.linux.no/rfc1149/ but there are few real links supporting :rfc:`1149` in the Internet. .. include:: /links.rst diff --git a/book-2nd/principles/sharing.rst b/book-2nd/principles/sharing.rst index 2a618f8..5578a35 100644 --- a/book-2nd/principles/sharing.rst +++ b/book-2nd/principles/sharing.rst @@ -6,34 +6,169 @@ Sharing resources ------------------ +A network is designed to support a potentially large number of users that exchange information with each other. These users produce and consume information which is exchanged through the network. To support its users, a network uses several types of resources. It is important to keep in mind the different resources that are shared inside the network. +The first and most important resource inside a network is the link bandwidth. There are two situations where link bandwidth needs to be shared between different users. The first situation is when several hosts are attached to the same physical link. This situation mainly occurs in Local Area Networks (LAN). A LAN is a network that efficiently interconnects several hosts (usually a few dozens to a few hundreds) in the same room, building or campus. Consider for example a network with five hosts. Any of these hosts needs to be able to exchange information with any of the other four hosts. A first organisation for this LAN is the full-mesh. +.. figure:: ../../book/intro/svg/fullmesh.* :align: center :scale: 50 + A Full mesh network +The full-mesh is the most reliable and highest performing network to interconnect these five hosts. However, this network organisation has two important drawbacks. 
First, if a network contains `n` hosts, then :math:`\frac{n\times(n-1)}{2}` links are required. If the network contains more than a few hosts, it becomes impossible to lay down the required physical links. Second, if the network contains `n` hosts, then each host must have :math:`n-1` interfaces to terminate :math:`n-1` links. This is beyond the capabilities of most hosts. Furthermore, if a new host is added to the network, new links have to be laid down and one interface has to be added to each participating host. However, full-mesh has the advantage of providing the lowest delay between the hosts and the best resiliency against link failures. In practice, full-mesh networks are rarely used except when there are few network nodes and resiliency is key. -MAC --- -Medium Access Control ##################### +The second possible physical organisation, which is also used inside computers to connect different extension cards, is the bus. In a bus network, all hosts are attached to a shared medium, usually a cable, through a single interface. When one host sends an electrical signal on the bus, the signal is received by all hosts attached to the bus. A drawback of bus-based networks is that if the bus is physically cut, then the network is split into two isolated networks. For this reason, bus-based networks are sometimes considered to be difficult to operate and maintain, especially when the cable is long and there are many places where it can break. Such a bus-based topology was used in early Ethernet networks. -Point-to-point datalink layers need to select one of the framing techniques described above and optionally add retransmission algorithms such as those explained for the transport layer to provide a reliable service. Datalink layers for Local Area Networks face two additional problems. A LAN is composed of several hosts that are attached to the same shared physical medium. 
From a physical layer perspective, a LAN can be organised in four different ways : +.. figure:: ../../book/intro/svg/bus.* + :align: center + :scale: 50 + + A network organized as a Bus - - a bus-shaped network where all hosts are attached to the same physical cable - - a ring-shaped where all hosts are attached to an upstream and a downstream node so that the entire network forms a ring - - a star-shaped network where all hosts are attached to the same device - - a wireless network where all hosts can send and receive frames using radio signals +A third organisation of a computer network is a star topology. In such topologies, hosts have a single physical interface and there is one physical link between each host and the center of the star. The node at the center of the star can be either a piece of equipment that amplifies an electrical signal, or an active device, such as a piece of equipment that understands the format of the messages exchanged through the network. Of course, the failure of the central node implies the failure of the network. However, if one physical link fails (e.g. because the cable has been cut), then only one node is disconnected from the network. In practice, star-shaped networks are easier to operate and maintain than bus-shaped networks. Many network administrators also appreciate the fact that they can control the network from a central point. Administered from a Web interface, or through a console-like connection, the center of the star is a useful point of control (enabling or disabling devices) and an excellent observation point (usage statistics). -These four basic physical organisations of Local Area Networks are shown graphically in the figure below. We will first focus on one physical organisation at a time. -.. figure:: svg/bus-ring-star.png +.. figure:: ../../book/intro/svg/star.* :align: center - :scale: 90 + :scale: 50 + + A network organised as a Star + +A fourth physical organisation of a network is the ring topology. 
Like the bus organisation, each host has a single physical interface connecting it to the ring. Any signal sent by a host on the ring will be received by all hosts attached to the ring. From a redundancy point of view, a single ring is not the best solution, as the signal only travels in one direction on the ring; thus if one of the links composing the ring is cut, the entire network fails. In practice, such rings have been used in local area networks, but are now often replaced by star-shaped networks. In metropolitan networks, rings are often used to interconnect multiple locations. In this case, two parallel links, composed of different cables, are often used for redundancy. With such a dual ring, when one ring fails all the traffic can be quickly switched to the other ring. + +.. figure:: ../../book/intro/svg/ring.* :align: center :scale: 50 + + A network organised as a ring + +A fifth physical organisation of a network is the tree. Such networks are typically used when a large number of customers must be connected in a very cost-effective manner. Cable TV networks are often organised as trees. + +.. figure:: ../../book/intro/svg/tree.* :align: center :scale: 50 + + A network organised as a Tree - Bus, ring and star-shaped Local Area Network +In all these networks, except the full-mesh, the link bandwidth is shared among all connected hosts. Various algorithms have been proposed and are used to efficiently share the access to this resource. We explain several of them in the Medium Access Control section below. + + +Sharing bandwidth among the hosts directly attached to a link is not the only bandwidth sharing problem that occurs in computer networks. To understand the general problem, let us consider a very simple network which contains only point-to-point links. This network contains three hosts and two network nodes. All links inside the network have the same capacity. 
For example, let us assume that all links have a bandwidth of 1000 bits per second and that the hosts send packets containing exactly one thousand bits. + + +.. graphviz:: + + graph foo { + A [shape=box]; + B [shape=box]; + C [shape=box]; + A--R1 ; + B--R1; + R1--R2 []; + R2--C []; + } + + +In the network above, consider the case where host `A` is transmitting packets to destination `C`. `A` can send one packet per second and its packets will be delivered to `C`. Now, let us explore what happens when host `B` also starts to transmit a packet. Node `R1` will receive two packets that must be forwarded to `R2`. Unfortunately, due to the limited bandwidth on the `R1-R2` link, only one of these two packets can be transmitted. The outcome of the second packet will depend on the available buffers on `R1`. If `R1` has one available buffer, it could store the packet that has not been transmitted on the `R1-R2` link until the link becomes available. If `R1` does not have available buffers, then the packet needs to be discarded. + +.. index:: network congestion + +Besides the link bandwidth, the buffers on the network nodes are the second type of resource that needs to be shared inside the network. The node buffers play an important role in the operation of the network because they can be used to absorb transient traffic peaks. Consider again the example above. Assume that, on average, host `A` and host `B` send a group of three packets every ten seconds. Their combined transmission rate (0.6 packets per second) is, on average, lower than the network capacity (1 packet per second). However, if they both start to transmit at the same time, node `R1` will have to absorb a burst of packets. This burst of packets is a small `network congestion`. We will say that a network is congested when the sum of the traffic demand from the hosts is larger than the network capacity (:math:`\sum{demand}>capacity`). 
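The burst scenario just described, where `A` and `B` transmit simultaneously through `R1`, can be sketched with a tiny back-of-the-envelope model. The timing is collapsed to a single instant and the `burst_through_r1` helper is purely illustrative, not a real queueing simulator:

```python
# Two hosts send their bursts into R1 at the same instant.  The R1-R2
# link transmits one packet immediately; the buffer absorbs the rest,
# and anything beyond the buffer capacity is discarded.

def burst_through_r1(burst_a, burst_b, buffer_size):
    """Return (delivered, dropped) packet counts for a simultaneous burst."""
    arriving = burst_a + burst_b
    queued = min(arriving - 1, buffer_size)   # one packet leaves at once
    dropped = arriving - 1 - queued
    return arriving - dropped, dropped

# With enough buffers the burst is fully absorbed (at the cost of delay).
print(burst_through_r1(3, 3, buffer_size=10))   # (6, 0)
# With only two buffers, part of the burst must be discarded.
print(burst_through_r1(3, 3, buffer_size=2))    # (3, 3)
```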
This `network congestion` problem is one of the most difficult resource sharing problems in computer networks. `Congestion` occurs in almost all networks. Minimizing the amount of congestion is a key objective for many network operators. In most cases, they will have to accept transient congestion, i.e. congestion lasting a few seconds or perhaps minutes, but will want to prevent congestion that lasts days or months. For this, they can rely on a wide range of solutions. We briefly present some of these in the paragraphs below. A detailed overview of the congestion problem would require an entire book. + +.. todo:: provide references to congestion books + +.. index:: congestion collapse + +If `R1` has enough buffers, it will be able to absorb the load without having to discard packets. The packets sent by hosts `A` and `B` will reach their final destination `C`, but will experience a longer delay than when they are transmitting alone. The amount of buffering on the network node is the first parameter that a network operator can tune to control congestion inside her network. Given the decreasing cost of memory, one could be tempted to put as many buffers [#fbufferbloat]_ as possible on the network nodes. Let us consider this case in the network above and assume that `R1` has infinite buffers. Assume now that hosts `A` and `B` try to transmit a file that corresponds to one thousand packets each. Both are using a reliable protocol that relies on go-back-n to recover from transmission errors. The transmission starts and packets start to accumulate in `R1`'s buffers. The presence of these packets in the buffers increases the delay between the transmission of a packet by `A` and the return of the corresponding acknowledgement. Given the increasing delay, host `A` (and `B` as well) will consider that some of the packets that it sent have been lost. These packets will be retransmitted and will enter the buffers of `R1`. 
The occupancy of the buffers of `R1` will continue to increase and the delays as well. This will cause new retransmissions, ... In the end, several copies of the same packet will be transmitted over the `R1-R2` link, but only one file will be delivered (very slowly) to the destination. This is known as the `congestion collapse` problem :rfc:`896`. Congestion collapse is the nightmare for network operators. When it happens, the network carries packets without delivering useful data to the end users. + +.. note:: Congestion collapse on the Internet + + Congestion collapse is unfortunately not only an academic experience. Van Jacobson reports in [Jacobson1988]_ one of these events that affected him while he was working at the Lawrence Berkeley Laboratory (LBL). LBL was two network nodes away from the University of California in Berkeley. At that time, the link between the two sites had a bandwidth of 32 Kbps, but some hosts were already attached to 10 Mbps LANs. `In October 1986, the data throughput from LBL to UC Berkeley ... dropped from 32 Kbps to 40 bps. We were fascinated by this sudden factor-of-thousand drop in bandwidth and embarked on an investigation of why things had gotten so bad.` This work led to the development of various congestion control techniques that have allowed the Internet to continue to grow without experiencing widespread congestion collapse events. + +Besides bandwidth and memory, a third resource that needs to be shared inside a network is the (packet) processing capacity. To forward a packet, a network node needs bandwidth on the outgoing link, but it also needs to analyze the packet header to perform a lookup inside its forwarding table. Performing these lookup operations requires resources such as CPU cycles or memory accesses. Network nodes are usually designed to be able to sustain a given packet processing rate, measured in packets per second. 
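The interplay between link capacity in bits per second and lookup capacity in packets per second can be illustrated with a back-of-the-envelope computation. The capacity and lookup figures below are invented for the example and do not describe any real device:

```python
# Which constraint binds: the outgoing link (bits/s) or the
# forwarding-table lookups (packets/s)?  All figures are illustrative.

LINK_CAPACITY_BPS = 10_000_000   # assumed 10 Mbps outgoing link
LOOKUP_RATE_PPS = 2_000          # assumed lookups per second

def max_forwarding_rate_bps(packet_size_bits):
    """Achievable forwarding rate for a stream of equal-size packets."""
    by_link = LINK_CAPACITY_BPS
    by_lookups = LOOKUP_RATE_PPS * packet_size_bits
    return min(by_link, by_lookups)

# With large packets, the link capacity is the bottleneck.
print(max_forwarding_rate_bps(12_000))   # 10000000 (10 Mbps)
# With small packets, the lookup engine is the bottleneck.
print(max_forwarding_rate_bps(512))      # 1024000 (~1 Mbps)
```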
note:: Packets per second versus bits per second + + + The performance of network nodes can be characterized by two key metrics : + + - the node's capacity measured in bits per second + - the node's lookup performance measured in packets per second + + The node's capacity in bits per second mainly depends on the physical interfaces that it uses and also on the capacity of the internal interconnection (bus, crossbar switch, ...) between the different interfaces inside the node. Many vendors, in particular for low-end devices, will use the sum of the bandwidth of the node's interfaces as the node capacity in bits per second. Measurements do not always match this maximum theoretical capacity. A well designed network node will usually have a capacity in bits per second larger than the sum of its link capacities. Such nodes will usually reach this maximum capacity when forwarding large packets. + + When a network node forwards small packets, its performance is usually limited by the number of lookup operations that it can perform every second. This lookup performance is measured in packets per second. The performance may depend on the length of the forwarded packets. The key performance factor is the number of minimal size packets that are forwarded by the node every second. This rate can lead to a capacity in bits per second which is much lower than the sum of the bandwidth of the node's links. + +.. add something on bisection bandwidth ? +.. http://courses.cs.washington.edu/courses/csep524/99wi/lectures/lecture7/sld006.htm + +Let us now try to present a broad overview of the congestion problem in networks. We will assume that the network is composed of dedicated links having a fixed bandwidth [#fadjust]_. A network contains hosts that generate and receive packets and nodes that forward packets. Assuming that each host is connected via a single link to the network, the largest demand is the sum of the bandwidths of the access links, :math:`\sum{AccessLinks}`. 
In practice, this largest demand is never reached and the network will be engineered to sustain a much lower traffic demand. The difference between the worst-case traffic demand and the sustainable traffic demand can be large, up to several orders of magnitude. Fortunately, the hosts are not completely dumb and they can adapt their traffic demand to the current state of the network and the available bandwidth. For this, the hosts need to `sense` the current level of congestion and adjust their own traffic demand based on the estimated congestion. Network nodes can react in different ways to network congestion and hosts can sense the level of congestion in different ways. + +Let us first explore which mechanisms can be used inside a network to control congestion and how these mechanisms can influence the behavior of the end hosts. + +As explained earlier, one of the first manifestations of congestion on network nodes is the saturation of the network links that leads to a growth in the occupancy of the buffers of the node. This growth of the buffer occupancy implies that some packets will spend more time in the buffer and thus in the network. If hosts measure the network delays (e.g. by measuring the round-trip-time between the transmission of a packet and the return of the corresponding acknowledgement) they could start to sense congestion. On low bandwidth links, a growth in the buffer occupancy can lead to an increase of the delays which can be easily measured by the end hosts. On high bandwidth links, a few packets inside the buffer will cause a small variation in the delay which may not necessarily be larger than the natural fluctuations of the delay measurements. + +If the buffer's occupancy continues to grow, it will overflow and packets will need to be discarded. Discarding packets during congestion is the second possible reaction of a network node to congestion. 
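The earlier observation, that a given buffer occupancy is easy to sense on a slow link but hidden in measurement noise on a fast one, can be checked with a one-line queueing-delay computation. The packet size and link speeds are arbitrary illustrative values:

```python
# Extra delay added by packets waiting in a node's buffer:
# delay = queued_bits / link_bandwidth.  All values are illustrative.

PACKET_SIZE_BITS = 8_000   # assumed 1000-byte packets

def queueing_delay_s(queued_packets, link_bps):
    """Seconds added to the RTT by `queued_packets` waiting in the buffer."""
    return queued_packets * PACKET_SIZE_BITS / link_bps

# On a 64 Kbps link, 10 queued packets add 1.25 s: easy for a host to measure.
print(queueing_delay_s(10, 64_000))           # 1.25
# On a 10 Gbps link, the same 10 packets add only 8 microseconds,
# smaller than the natural fluctuations of most RTT measurements.
print(queueing_delay_s(10, 10_000_000_000))   # 8e-06
```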
Before looking at how a node can discard packets, it is interesting to discuss qualitatively the impact of the buffer occupancy on the reliable delivery of data through a network. This is illustrated by the figure below, adapted from [Jain1990]_. + +.. figure:: jain.png + :align: center + + Network congestion + + +When the network load is low, buffer occupancy and link utilizations are low. The buffers on the network nodes are mainly used to absorb very short bursts of packets, but on average the traffic demand is lower than the network capacity. If the demand increases, the average buffer occupancy will increase as well. Measurements have shown that the total throughput increases as well. If the buffer occupancy is zero or very low, transmission opportunities on network links can be missed. This is not the case when the buffer occupancy is small but non zero. However, if the buffer occupancy continues to increase, the buffer becomes overloaded and the throughput does not increase anymore. When the buffer occupancy is close to the maximum, the throughput may decrease. This drop in throughput can be caused by excessive retransmissions of reliable protocols that incorrectly assume that previously sent packets have been lost while they are still waiting in the buffer. The network delay on the other hand increases with the buffer occupancy. In practice, a good operating point for a network buffer is a low occupancy to achieve high link utilization and also low delay for interactive applications. + +.. index:: packet discard mechanism + +Discarding packets is one of the signals that the network nodes can use to inform the hosts of the current level of congestion. Buffers on network nodes are usually used as FIFO queues to preserve packet ordering. Several `packet discard mechanisms` have been proposed for network nodes. 
These techniques basically answer two different questions : + + - `What triggers a packet to be discarded ?` What are the conditions that lead a network node to decide to discard a packet ? The simplest answer to this question is : `When the buffer is full`. Although this is a good congestion indication, it is probably not the best one from a performance viewpoint. An alternative is to discard packets when the buffer occupancy grows too much. In this case, it is likely that the buffer will become full shortly. Since a packet discard is a signal that allows hosts to adapt their transmission rate, discarding packets early could allow hosts to react earlier and thus prevent congestion from happening. + - `Which packet(s) should be discarded ?` Once the network node has decided to discard packets, it still needs to select the actual packet(s) to remove from its buffer. + + +By combining different answers to these questions, network researchers have developed different packet discard mechanisms. + + - `tail drop` is the simplest packet discard technique. When a buffer is full, the arriving packet is discarded. `Tail drop` can be easily implemented. This is, by far, the most widely used packet discard mechanism. However, it suffers from two important drawbacks. First, since `tail drop` discards packets only when the buffer is full, buffers tend to remain congested and realtime applications may suffer from the increased delays. Second, `tail drop` is blind when it discards a packet. It may discard a packet from a low bandwidth interactive flow while most of the buffer is used by large file transfers. + - `drop from front` is an alternative packet discard technique. When a packet arrives and the buffer is full, instead of discarding the arriving packet, the node removes the packet that is at the head of the queue. Discarding this packet instead of the arriving one has two advantages. First, this packet has already spent a long time in the buffer. Second, hosts should be able to detect the loss (and thus the congestion) earlier. 
+ - `probabilistic drop`. Various random packet discard techniques have been proposed as refinements of the previous deterministic techniques. A frequently cited technique is `Random Early Discard` (RED) [FJ1993]_. RED measures the average buffer occupancy and probabilistically discards packets when this average occupancy is too high. Compared to `tail drop` and `drop from front`, an advantage of `RED` is that, thanks to the probabilistic drops, packets should be discarded from the different flows in proportion to their bandwidth. + + +Discarding packets is a frequent reaction to network congestion. Unfortunately, discarding packets is not optimal since a packet which is discarded on a network node has already consumed resources on the upstream nodes. There are other ways for the network to inform the end hosts of the current congestion level. A first solution is to mark the packets when a node is congested. Several networking technologies have relied on this kind of packet marking. + +.. index:: Forward Explicit Congestion Notification, FECN + +In datagram networks, `Forward Explicit Congestion Notification` (FECN) can be used. One field of the packet header, typically one bit, is used to indicate congestion. When a host sends a packet, the congestion bit is reset. If the packet passes through a congested node, the congestion bit is set. The destination can then determine the current congestion level by measuring the fraction of the packets that it received with the congestion bit set. It may then return this information to the sending host to allow it to adapt its transmission rate. Compared to packet discarding, the main advantage of FECN is that hosts can detect congestion explicitly without having to rely on packet losses. + +In virtual circuit networks, packet marking can be improved if the return packets follow the reverse path of the forward packets. In this case, a network node can detect congestion on the forward path (e.g. 
due to the size of its buffer), but mark the packets on the return path. Marking the return packets (e.g. the acknowledgements used by reliable protocols) provides faster feedback to the sending hosts compared to FECN. This technique is usually called `Backward Explicit Congestion Notification (BECN)`. + +If the packet header does not contain any bit to represent the current congestion level, an alternative is to allow the network nodes to send a control packet to the source to indicate the current congestion level. Some networking technologies use such control packets to explicitly regulate the transmission rate of sources. However, their usage is mainly restricted to small networks. In large networks, network nodes usually avoid using such control packets. These control packets are even considered to be dangerous in some networks. First, using them increases the network load when the network is congested. Second, while network nodes are optimized to forward packets, they are usually pretty slow at creating new packets. + +Delays, packet discards, packet markings and control packets are the main types of information that the network can exchange with the end hosts. Discarding packets is the main action that a network node can perform if the congestion is too severe. Besides tackling congestion at each node, it is also possible to divert some traffic flows from heavily loaded links to reduce congestion. Early routing algorithms [MRR1980]_ used delay measurements to detect congestion between network nodes and update the link weights dynamically. By reflecting the delay perceived by applications in the link weights used for the shortest paths computation, these routing algorithms managed to dynamically change the forwarding paths in reaction to congestion. However, deployment experience showed that these dynamic routing algorithms could cause oscillations and did not necessarily lower congestion. 
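The idea of reflecting measured delays in the link weights can be illustrated on a toy topology. In the sketch below (invented topology and numbers, not taken from [MRR1980]_), inflating the weight of a congested link makes the shortest path computation route around it.

```python
import heapq

# Sketch: forwarding paths react to delay-based link weights
# (toy topology and numbers, for illustration only).

def shortest_path(graph, src, dst):
    """Dijkstra's algorithm; returns the list of nodes on the best path."""
    heap = [(0, src, [src])]
    visited = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == dst:
            return path
        if node in visited:
            continue
        visited.add(node)
        for neigh, weight in graph[node].items():
            if neigh not in visited:
                heapq.heappush(heap, (cost + weight, neigh, path + [neigh]))
    return None

# Link weights proportional to the measured delay (in milliseconds)
graph = {'A': {'B': 1, 'C': 5},
         'B': {'A': 1, 'C': 1},
         'C': {'A': 5, 'B': 1}}
print(shortest_path(graph, 'A', 'C'))   # ['A', 'B', 'C']

# Congestion on the A-B link inflates its measured delay ...
graph['A']['B'] = graph['B']['A'] = 10
# ... and the forwarding path changes accordingly
print(shortest_path(graph, 'A', 'C'))   # ['A', 'C']
```

The oscillation problem mentioned above appears when such recomputations shift traffic onto the new path, which then becomes the congested one, causing the weights and the paths to flip back and forth.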
Deployed datagram networks rarely use dynamic routing algorithms, except in some wireless networks. In datagram networks, the state of the art reaction to long term congestion, i.e. congestion lasting hours, days or more, is to measure the traffic demand and then select the link weights [FRT2002]_ that minimize the maximum link loads. If the congestion lasts longer, changing the weights is not sufficient anymore and the network needs to be upgraded with new or faster links. However, in Wide Area Networks, adding new links can take months. + +In virtual circuit networks, another way to manage or prevent congestion is to limit the number of circuits that use the network at any time. This technique is usually called `connection admission control`. When a host requests the creation of a new circuit in the network, it specifies the destination and, in some networking technologies, the required bandwidth. With this information, the network can check whether there are enough resources available to reach this particular destination. If yes, the circuit is established. Otherwise, the request is denied and the host will have to defer the creation of its virtual circuit. `Connection admission control` schemes are widely used in telephone networks. In these networks, a busy tone corresponds to an unavailable destination or a congested network. + +In datagram networks, this technique cannot be easily used since the basic assumption of such a network is that a host can send any packet towards any destination at any time. A host does not need to request the authorization of the network to send packets towards a particular destination. + +Based on the feedback received from the network, the hosts can adjust their transmission rate. We discuss in section `Congestion control` some techniques that allow hosts to react to congestion. + + +.. note:: Spatial congestion control with Bittorrent + +.. ieee ethernet mac ? + + +MAC +--- .. 
index:: collision -The common problem among all of these network organisations is how to efficiently share the access to the Local Area Network. If two devices send a frame at the same time, the two electrical, optical or radio signals that correspond to these frames will appear at the same time on the transmission medium and a receiver will not be able to decode either frame. Such simultaneous transmissions are called `collisions`. A `collision` may involve frames transmitted by two or more devices attached to the Local Area Network. Collisions are the main cause of errors in wired Local Area Networks. +The common problem among Local Area Networks is how to efficiently share access to the shared bandwidth. If two devices send a frame at the same time, the two electrical, optical or radio signals that correspond to these frames will appear at the same time on the transmission medium and a receiver will not be able to decode either frame. Such simultaneous transmissions are called `collisions`. A `collision` may involve frames transmitted by two or more devices attached to the Local Area Network. Collisions are the main cause of errors in wired Local Area Networks. All Local Area Network technologies rely on a `Medium Access Control` algorithm to regulate the transmissions to either minimize or avoid collisions. There are two broad families of `Medium Access Control` algorithms : @@ -65,7 +200,7 @@ Limited resources need to be shared in other environments than Local Area Networ `Time Division Multiplexing` (TDM) is a static bandwidth allocation method that was initially defined for the telephone network. In the fixed telephone network, a voice conversation is usually transmitted as a 64 Kbps signal. Thus, a telephone conversation generates 8 KBytes per second or one byte every 125 microseconds. Telephone conversations often need to be multiplexed together on a single line. 
For example, in Europe, thirty 64 Kbps voice signals are multiplexed over a single 2 Mbps (E1) line. This is done by using `Time Division Multiplexing` (TDM). TDM divides the transmission opportunities into slots. In the telephone network, a slot corresponds to 125 microseconds. A position inside each slot is reserved for each voice signal. The figure below illustrates TDM on a link that is used to carry four voice conversations. The vertical lines represent the slot boundaries and the letters the different voice conversations. One byte from each voice conversation is sent during each 125 microseconds slot. The byte corresponding to a given conversation is always sent at the same position in each slot. -.. figure:: png/lan-fig-012-c.png +.. figure:: ../../book/lan/png/lan-fig-012-c.png :align: center :scale: 70 @@ -128,7 +263,7 @@ ALOHA and slotted ALOHA can easily be implemented, but unfortunately, they can o .. index:: persistent CSMA, CSMA (persistent) -.. code-block:: text +.. code-block:: python # persistent CSMA N=1 @@ -149,7 +284,7 @@ The above pseudo-code is often called `persistent CSMA` [KT1975]_ as the termina .. index:: non-persistent CSMA, CSMA (non-persistent) -.. code-block:: text +.. code-block:: python # Non persistent CSMA N=1 @@ -180,7 +315,7 @@ Carrier Sense Multiple Access with Collision Detection CSMA improves channel utilization compared to ALOHA. However, the performance can still be improved, especially in wired networks. Consider the situation of two terminals that are connected to the same cable. This cable could, for example, be a coaxial cable as in the early days of Ethernet [Metcalfe1976]_. It could also be built with twisted pairs. Before extending CSMA, it is useful to understand more intuitively how frames are transmitted in such a network and how collisions can occur. The figure below illustrates the physical transmission of a frame on such a cable. 
To transmit its frame, host A must send an electrical signal on the shared medium. The first step is thus to begin the transmission of the electrical signal. This is point `(1)` in the figure below. This electrical signal will travel along the cable. Although electrical signals travel fast, we know that information cannot travel faster than the speed of light (i.e. 300,000 kilometers/second). On a coaxial cable, an electrical signal propagates slightly slower than the speed of light and 200,000 kilometers per second is a reasonable estimation. This implies that if the cable has a length of one kilometer, the electrical signal will need 5 microseconds to travel from one end of the cable to the other. The ends of coaxial cables are equipped with termination points that ensure that the electrical signal is not reflected back to its source. This is illustrated at point `(3)` in the figure, where the electrical signal has reached the left endpoint and host B. At this point, B starts to receive the frame being transmitted by A. Notice that there is a delay between the transmission of a bit on host A and its reception by host B. If there were other hosts attached to the cable, they would receive the first bit of the frame at slightly different times. As we will see later, this timing difference is a key problem for MAC algorithms. At point `(4)`, the electrical signal has reached both ends of the cable and occupies it completely. Host A continues to transmit the electrical signal until the end of the frame. As shown at point `(5)`, when the sending host stops its transmission, the electrical signal corresponding to the end of the frame leaves the coaxial cable. The channel becomes empty again once the entire electrical signal has been removed from the cable. -.. figure:: png/lan-fig-024-c.png +.. figure:: ../../book/lan/png/lan-fig-024-c.png :align: center :scale: 70 @@ -189,7 +324,7 @@ CSMA improves channel utilization compared to ALOHA. 
However, the performance ca Now that we have looked at how a frame is actually transmitted as an electrical signal on a shared bus, it is interesting to look in more detail at what happens when two hosts transmit a frame at almost the same time. This is illustrated in the figure below, where hosts A and B start their transmission at the same time (point `(1)`). At this time, if host C senses the channel, it will consider it to be free. This will not last a long time and at point `(2)` the electrical signals from both host A and host B reach host C. The combined electrical signal (shown graphically as the superposition of the two curves in the figure) cannot be decoded by host C. Host C detects a collision, as it receives a signal that it cannot decode. Since host C cannot decode the frames, it cannot determine which hosts are sending the colliding frames. Note that host A (and host B) will detect the collision after host C (point `(3)` in the figure below). -.. figure:: png/lan-fig-025-c.png +.. figure:: ../../book/lan/png/lan-fig-025-c.png :align: center :scale: 70 @@ -205,7 +340,7 @@ As shown above, hosts detect collisions when they receive an electrical signal t To better understand these collisions, it is useful to analyse what would be the worst collision on a shared bus network. Let us consider a wire with two hosts attached at both ends, as shown in the figure below. Host A starts to transmit its frame and its electrical signal is propagated on the cable. Its propagation time depends on the physical length of the cable and the speed of the electrical signal. Let us use :math:`\tau` to represent this propagation delay in seconds. Slightly less than :math:`\tau` seconds after the beginning of the transmission of A's frame, B decides to start transmitting its own frame. After :math:`\epsilon` seconds, B senses A's frame, detects the collision and stops transmitting. The beginning of B's frame travels on the cable until it reaches host A. 
Host A can thus detect the collision at time :math:`\tau-\epsilon+\tau \approx 2\times\tau`. An important point to note is that a collision can only occur during the first :math:`2\times\tau` seconds of a frame's transmission. If a collision did not occur during this period, it cannot occur afterwards since the transmission channel is busy after :math:`\tau` seconds and CSMA/CD hosts sense the transmission channel before transmitting their frame. -.. figure:: png/lan-fig-027-c.png +.. figure:: ../../book/lan/png/lan-fig-027-c.png :align: center :scale: 70 @@ -218,7 +353,7 @@ Furthermore, on the wired networks where CSMA/CD is used, collisions are almost Removing acknowledgements is an interesting optimisation as it reduces the number of frames that are exchanged on the network and the number of frames that need to be processed by the hosts. However, to use this optimisation, we must ensure that all hosts will be able to detect all the collisions that affect their frames. The problem is important for short frames. Let us consider two hosts, A and B, that are sending a small frame to host C as illustrated in the figure below. If the frames sent by A and B are very short, the situation illustrated below may occur. Hosts A and B send their frame and stop transmitting (point `(1)`). When the two short frames arrive at the location of host C, they collide and host C cannot decode them (point `(2)`). The two frames are absorbed by the ends of the wire. Neither host A nor host B has detected the collision. They both consider their frame to have been received correctly by its destination. -.. figure:: png/lan-fig-026-c.png +.. figure:: ../../book/lan/png/lan-fig-026-c.png :align: center :scale: 70 @@ -248,7 +383,7 @@ In the second and third cases, both hosts have flipped different coins. The dela If two hosts are competing, the algorithm above will avoid a second collision 50% of the time. 
However, if the network is heavily loaded, several hosts may be competing at the same time. In this case, the hosts should be able to automatically adapt their retransmission delay. The `binary exponential back-off` performs this adaptation based on the number of collisions that have affected a frame. After the first collision, the host flips a coin and waits 0 or 1 `slot time`. After the second collision, it generates a random number and waits 0, 1, 2 or 3 `slot times`, etc. The duration of the waiting time is doubled after each collision. The complete pseudo-code for the CSMA/CD algorithm is shown in the figure below. -.. code-block:: text +.. code-block:: python # CSMA/CD pseudo-code N=1 @@ -296,7 +431,7 @@ Compared to CSMA, CSMA/CA defines more precisely when a device is allowed to sen The figure below shows the basic operation of CSMA/CA devices. Before transmitting, host `A` verifies that the channel is empty for a long enough period. Then, it sends its data frame. After checking the validity of the received frame, the recipient sends an acknowledgement frame after a short SIFS delay. Host `C`, which does not participate in the frame exchange, senses the channel to be busy at the beginning of the data frame. Host `C` can use this information to determine how long the channel will remain busy. Note that as :math:`SIFS