# <mark>The Internet</mark>

## `Have a broad understanding of what the internet is and how it works`

*(Come back to this)*

## `Understand the characteristics of the physical network, such as latency and bandwidth`

**Physical Network**
- The Physical layer is the bottommost layer (Layer 1) of the OSI model of networked communication.
- The physical network is the tangible infrastructure that transmits the data encapsulated by the layers above as bits, in the form of the electrical signals, light, and radio waves which carry network communications.
- The functionality at this level is essentially concerned with the transfer of bits (binary data) across a physical medium.
- The physical limitations of networked communication, latency and bandwidth, are the result of unavoidable physical laws that govern this layer.

**Latency**
- Latency is a measure of the time it takes for some data to get from one point in a network to another point in a network.
- It is a measure of delay: the difference between the time the data leaves the start point and the time it arrives at the end point.
- It is determined by real physical laws, such as the distance traveled and the speed at which the signal travels (e.g. the speed of light, sound, or electricity).
- Latency has four main aspects that occur during each network "hop" that data takes during its overall journey through the network:
    - **Propagation delay**: the amount of time it takes for a message to travel from the sender to the receiver; it can be calculated as the ratio between distance and speed.
    - **Transmission delay**: the amount of time it takes to push the data onto the "link" between nodes in the overall network.
    - **Processing delay**: data travelling across the physical network doesn't cross directly from one link to another, but is processed in various ways at each "node"; this is the amount of time that processing takes.
    - **Queuing delay**: the amount of time the data waits in a queue or "buffer" before it is processed.
- The total latency between two points, such as a client and a server, is the sum of all these delays across every hop (usually given in milliseconds (ms)). Two related terms are:
    - **Last-mile latency**: a "slowing down" that takes place at the network edge, as smaller and more frequent hops take place as data moves lower in the network hierarchy.
    - **Round-trip Time (RTT)**: the length of time for a signal to be sent, added to the length of time for an acknowledgement or response to be received.
        - The latency overhead associated with additional round trips is often a trade-off to consider when dealing with the implementation of network reliability in TCP.
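
As a rough illustration of how these delays combine, the sketch below sums the four per-hop delays over a two-hop path. The signal speed, distances, and delay figures are assumed values chosen only for illustration, and the function names are hypothetical.

```python
# Rough latency estimate for a two-hop path. Every figure here is an assumption.
SIGNAL_SPEED_M_PER_S = 2e8  # assumed: roughly two-thirds of the speed of light


def propagation_delay_ms(distance_m, speed_m_per_s=SIGNAL_SPEED_M_PER_S):
    """Propagation delay = distance / speed, converted to milliseconds."""
    return distance_m / speed_m_per_s * 1000


def hop_latency_ms(propagation, transmission, processing, queuing):
    """Latency of one hop is the sum of its four delay components (in ms)."""
    return propagation + transmission + processing + queuing


hops = [
    # 1,000 km long-haul link with small per-node delays
    hop_latency_ms(propagation_delay_ms(1_000_000), 0.1, 0.05, 0.2),
    # 10 km last-mile link with a busier queue
    hop_latency_ms(propagation_delay_ms(10_000), 0.1, 0.05, 1.0),
]

one_way_ms = sum(hops)
# Assumes the return path has the same latency as the outbound path.
round_trip_ms = 2 * one_way_ms
print(f"one-way: {one_way_ms:.2f} ms, round trip: {round_trip_ms:.2f} ms")
```

Even in this toy example the long link is dominated by propagation delay, while the short last-mile link is dominated by queuing.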
        
**Bandwidth**
- Bandwidth is the amount of data that can be sent along the physical structure of the network in a particular unit of time (typically, a second).
- It is a measure of capacity.
- It is also determined by real physical laws, such as the capacity of the medium down which data is being transported.
- Because bandwidth is almost never constant along a path, we consider the bandwidth of a connection to be the lowest value found at any point over the entire connection (the bottleneck).
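
The "lowest value wins" idea can be sketched in a few lines. The link speeds and file size below are made-up numbers, and the helper names are hypothetical.

```python
def bottleneck_bandwidth(link_bandwidths_mbps):
    """The usable bandwidth of a path is that of its slowest link."""
    return min(link_bandwidths_mbps)


def transfer_time_s(size_megabytes, bandwidth_mbps):
    """Time to move a payload, ignoring latency: megabits / (megabits per second)."""
    return size_megabytes * 8 / bandwidth_mbps


# Hypothetical path: fast core link, slower regional link, slow last-mile link.
path_mbps = [1000, 100, 40]
usable = bottleneck_bandwidth(path_mbps)
print(f"usable bandwidth: {usable} Mbps")
print(f"50 MB transfer: {transfer_time_s(50, usable)} s")
```

Note that this only models capacity; the latency described earlier is a separate cost added on top.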

## `Have a basic understanding of how lower level protocols operate`

**The Link/ Data Link Layer**

- The protocols operating at this layer are primarily concerned with identifying the next network "node" to which data should be sent, and with moving data over the physical network between the devices that comprise it, such as hosts (e.g. computers), switches, and routers.
- The **Ethernet Protocol** is a set of standards and protocols that enables communication between devices on a local network. It is the most commonly used protocol at this layer.
- Ethernet governs communication between devices in a local network, and is responsible for delivering data to the correct physical address rather than a logical one (this is left to IP). For this reason, it acts as an interface between the physical infrastructure below it and the more logical layers above.
- The Ethernet protocol provides two main functions:
    - **Framing**, which provides logical structure to the streams of bits traveling through the physical infrastructure/layer of the network by categorizing data into 'fields' that have specific lengths and orders.
        - **Ethernet Frames**: a Protocol Data Unit (PDU) that encapsulates data from the Internet/ Network layer above.
        - The Link/ Data Link layer is the lowest layer at which encapsulation takes place.
        - It adds logical structure to this binary data. The data in the frame is still in the form of bits, but the structure defines which bits are actually the data payload, and which are metadata to be used in the process of transporting the frame.
        - The "fields" of a frame include:
            - **Source and Destination MAC address**: The source address is the physical address of the device which created the frame. The destination MAC address is the physical address of the device on the local network that should receive the frame next (which is not necessarily the final recipient of the data).
            - **Data Payload**: Contains the entire Protocol Data Unit (PDU) from the layer above, commonly an IP packet.
    - **Addressing**, which identifies the next network "node" to which data should be sent with the use of MAC addressing.
        - Ethernet uses **MAC addressing** to identify devices (rather than locations) connected to the local network.
        - Since this address is linked to the specific physical device, and (usually) doesn't change, it is sometimes referred to as the **physical address** or **burned-in address**.
        - MAC Addresses are formatted as a sequence of six two-digit hexadecimal numbers, e.g. `00:40:96:9d:68:0a`, with different ranges of addresses being assigned to different network hardware manufacturers.
        - MAC addresses work well in LANs, where devices are connected to a central switch that keeps a record of their specific MAC addresses.
        - They do not work well in large decentralized systems, nor are they scalable:
            - They are physical, not logical, i.e. they do not change based on location. Each MAC address is tied (burned in) to a specific physical device.
            - They are flat, and do not possess a hierarchical structure that allows us to categorize them into searchable subdivisions. The entire address is a single sequence of values and can't be broken down into sub-divisions.
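
To make framing more concrete, here is a minimal sketch that splits a byte string into the fields described above. It assumes the common Ethernet II layout (6-byte destination MAC, 6-byte source MAC, 2-byte EtherType, then payload); the EtherType field is not covered in these notes, and the preamble and trailing checksum that real hardware handles are ignored. The sample source address and payload bytes are hypothetical.

```python
import struct


def format_mac(raw):
    """Render 6 raw bytes as six two-digit hexadecimal numbers."""
    return ":".join(f"{byte:02x}" for byte in raw)


def parse_frame(frame):
    """Split a simplified Ethernet II frame into header fields and payload."""
    # "!" = network byte order, "6s" = 6 raw bytes, "H" = unsigned 16-bit integer
    destination, source, ether_type = struct.unpack("!6s6sH", frame[:14])
    return {
        "destination": format_mac(destination),
        "source": format_mac(source),
        "ether_type": ether_type,
        "payload": frame[14:],
    }


# Hypothetical frame: the bits are just bytes until the structure is applied.
sample_frame = (
    bytes.fromhex("0040969d680a")    # destination MAC
    + bytes.fromhex("aabbccddeeff")  # source MAC (made up)
    + struct.pack("!H", 0x0800)      # EtherType value used for IPv4
    + b"<IP packet bytes>"           # payload: the PDU from the layer above
)
print(parse_frame(sample_frame))
```

The frame is the same sequence of bits before and after parsing; only the agreed field lengths and order tell us which bits are addresses and which are payload.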
            
**The Internet/ Network Layer**

- Whereas the Ethernet protocol provides communication between devices on the same local network, the Internet Protocol enables communication between two networked devices anywhere in the world.
- The primary function of protocols at this layer is to facilitate communication between hosts (e.g. computers) on different networks (i.e. inter-network communication).
- It sits between the protocols at the Link/ Data Link Layer and the protocols at the Transport Layer.
- The **Internet Protocol (IP)** is the predominant protocol used at this layer for inter-network communication.
- IP provides routing capability between devices on different networks via IP addresses. It also encapsulates data into packets.
- A **Packet** is the Protocol Data Unit (PDU) within the IP Protocol
    - Just as with Ethernet Frames, the Data Payload of an IP Packet is the PDU from the layer above (generally a TCP segment or a UDP datagram from the Transport layer).
    - A packet consists of a header and a data payload
    - The IP packet is responsible for routing all the encapsulated data on its journey, which consists of a series of network "hops", or jumps between various nodes (routers) on the overall network.
    - The Header is split into logical fields which provide metadata used in transporting the packet.
    - The header fields include:
        - **Source Address**: the 32-bit IP address of the source (sender) of the packet. Allows for IP addressing.
        - **Destination Address**: the 32-bit IP address of the destination (intended recipient) of the packet. Allows for IP addressing.
- An **IP Address** is a unique address that we can use to identify a device or host on the internet.
    - IP addresses have two main features that allow for inter-network communication across a large distributed system:
        - They are logical: they are assigned as required when devices join a network
        - They are hierarchical: the structure of the address allows us to categorize them into searchable subdivisions (subnets). The overall network is divided into logical sub-networks and numbers are allocated according to this hierarchy.
        - A range of IP addresses is defined by network hierarchy, and each subnetwork is assigned a given range of addresses.
        - The network address is assigned to the first address in the range and the broadcast address is assigned to be the last address in that range.
        - There are two types of IP addresses in two different versions of IP:
            - IPv4: 32-bit addresses provide about 4.3 billion possible addresses, which is not enough for all the devices on the network.
            - IPv6: 128-bit addresses provide about 340 undecillion addresses, which will hopefully be enough for a long time to come.
- MAC addresses, because they are physical (*not logical*) and flat (*not hierarchical*), are not scalable. IP addresses fill this gap: because they are logical and hierarchical, they work well in large distributed systems.
- Being logical means that IP addresses are not tied to a specific device, but can be assigned as required to devices as they join a network.
- The IP address only gets us in communication with the intended device. It does not allow us to isolate any particular application or process running on that device. For that we need the port numbers provided by the Transport Layer protocol.
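
The hierarchical structure of IP addresses can be explored with Python's standard `ipaddress` module. This is a small sketch; the `192.0.2.0/24` subnet is an assumed example range, not one taken from these notes.

```python
import ipaddress

# Assumed example subnet (a range reserved for documentation).
subnet = ipaddress.ip_network("192.0.2.0/24")

print("network address:  ", subnet.network_address)    # first address in the range
print("broadcast address:", subnet.broadcast_address)  # last address in the range
print("addresses in range:", subnet.num_addresses)

# Hierarchy makes membership easy to check: is this host inside the subnet?
host = ipaddress.ip_address("192.0.2.57")
print(host, "in", subnet, "->", host in subnet)

# Size of each address space follows directly from the address length in bits.
ipv4_total = 2 ** 32
ipv6_total = 2 ** 128
print(f"IPv4: {ipv4_total:,} addresses, IPv6: {ipv6_total:.2e} addresses")
```

A router only needs to know which range an address falls into, not the location of every individual device, which is what makes this scheme scale where flat MAC addresses do not.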

## `Know what an IP address is and what a port number is`

**Ports**
- A port is an identifier for a specific process running on a host. 
- This identifier is an integer in the range 0-65535.
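- Each networked process is assigned a port. Common services conventionally use the same port on every device (e.g. 80 for HTTP), so a port number can also be used to identify that same kind of process running on a different device.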
- The source and destination port numbers are included in the Protocol Data Units (PDU) for the transport layer.
- Data from the application layer is encapsulated as the data payload in this PDU, and the source and destination port numbers within the PDU can be used to direct that data to specific processes on a host.
- The entire PDU is then encapsulated as the data payload in an IP packet.
- The IP addresses in the packet header can be used to direct data from one host to another. 
- The IP address and the port number together are what enables end-to-end communication between specific applications on different machines.

**Socket**
- An IP address and port number combined define a communication end-point known as a network socket.
- It is a communication end-point defined by an address-port pair.
- The IP address and the port number together allow the protocols operating in the Transport Layer to facilitate data exchange between specific applications running on separate devices across the network.
- These sockets allow both IP and the protocol operating at the Transport Layer (TCP/UDP) to transmit data between devices and processes.
- The IP address gets us the correct device on the network and the port number gets us to the correct application on that device.
- This is how we can achieve end-to-end communication between devices.<br><br>
- ***clarification for concept of sockets vs. implementation***
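
On the concept versus implementation point, one way to see the difference is to look at socket objects in code. The sketch below uses Python's standard `socket` module on the loopback address; the function name is hypothetical and port `0` simply asks the operating system for any free port. Conceptually a socket is the IP/port end-point; in the implementation it is an object, and a connected TCP socket also knows its peer's end-point.

```python
import socket


def demo_endpoints():
    """Open a TCP connection to ourselves and report each socket's end-points."""
    # Listening socket: identified by the local (IP, port) pair only.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    server_endpoint = listener.getsockname()

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(server_endpoint)

    # accept() returns a NEW socket object dedicated to this one connection.
    connection, _peer = listener.accept()
    result = {
        "server_endpoint": server_endpoint,
        "client_endpoint": client.getsockname(),
        "connection_local": connection.getsockname(),
        "connection_peer": connection.getpeername(),
    }
    for sock in (connection, client, listener):
        sock.close()
    return result


for name, endpoint in demo_endpoints().items():
    print(f"{name}: {endpoint}")
```

The accepted connection shares the server's IP and port but is distinguished by the client's IP and port as well, which is the pairing of two end-points that makes a dedicated connection possible.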

**IP Address**
- *Refer to previous question*

## `Have an understanding of how DNS works`

- DNS, or the Domain Name System, is a distributed database which translates/maps domain names like `www.google.com` to an IP address (a dotted value in the format `192.0.2.44`, where each of the four numbers is between 0 and 255), so that the IP address can then be used to make a request to the server.
- There is a very large world-wide network of hierarchically organized DNS servers, and no single DNS server contains the complete database.
- If a DNS server does not contain a requested domain name, the DNS server routes the request to another DNS server up the hierarchy.
- Eventually, the domain name will be found in the DNS database on a particular DNS server, and the corresponding IP address will be returned so that the request can be sent to it.
- Your typical interaction with the Internet starts with a web browser when you:
    1. Enter a URL like `http://www.google.com` into your web browser's address bar.
    2. The browser creates an HTTP request, which is packaged up and sent to your device's network interface.
    3. If your device already has a record of the IP address for the domain name in its DNS cache, it will use this cached address. If the IP address isn't cached, a DNS request will be made to the Domain Name System to obtain the IP address for the domain.
    4. If the DNS server that receives the request does not have a record for the domain name, it will route the request up the hierarchical system until a server that does have one is found.
    5. Once DNS has supplied the IP address, it is handed to the lower-level protocols that are responsible for routing, and the packaged-up HTTP request goes over the Internet to the server with the matching IP address.
    6. The remote server accepts the request and sends a response over the Internet back to your network interface which hands it to your browser.
    7. Finally, the browser displays the response in the form of a web page.
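
The cache-then-ask-up-the-hierarchy behaviour in steps 3 and 4 can be modelled with a toy sketch. This is not how real DNS software is written; the class names, the domain, and the address are all hypothetical, and real resolution involves more roles than a simple parent chain.

```python
class ToyDnsServer:
    """Toy model of one DNS server: local records plus an optional parent."""

    def __init__(self, records, parent=None):
        self.records = dict(records)
        self.parent = parent

    def resolve(self, domain):
        if domain in self.records:
            return self.records[domain]
        if self.parent is None:
            raise LookupError(f"no record for {domain}")
        # Not known here: pass the request up the hierarchy.
        return self.parent.resolve(domain)


class ToyClient:
    """Toy model of a device that caches answers it has already received."""

    def __init__(self, server):
        self.server = server
        self.cache = {}
        self.queries_sent = 0

    def lookup(self, domain):
        if domain not in self.cache:
            self.queries_sent += 1
            self.cache[domain] = self.server.resolve(domain)
        return self.cache[domain]


upstream = ToyDnsServer({"www.example.com": "192.0.2.10"})  # made-up record
local = ToyDnsServer({}, parent=upstream)
client = ToyClient(local)

print(client.lookup("www.example.com"))  # answered via the upstream server
print(client.lookup("www.example.com"))  # answered from the client's cache
print("queries sent:", client.queries_sent)
```

The second lookup never leaves the device, which is why cached addresses make repeat visits to a site faster.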

## `Understand the client-server model of web interactions, and the role of HTTP as a protocol within that model`

*(Come back to this)*

# <mark>TCP & UDP</mark>

## `Have a clear understanding of the TCP and UDP protocols, their similarities and differences`

**TCP**

- Transmission Control Protocol (TCP) is a **connection-oriented protocol** that ensures reliable data transfer between applications on top of the unreliable channel of the lower-layer protocols.
    - A connection-oriented system instantiates a new socket object to establish a dedicated virtual connection between two processes running on separate devices.
    - It doesn't start sending application data until a connection has been established between the application processes.
    - As in a connectionless system, there is a socket object defined by the host IP and process port, which uses a `listen()` method to wait for incoming messages.
    - When a new communication comes into this listening socket, a new socket is created. The new socket object is defined not just by the local IP and port number, but also by the IP and port of the process/host which sent the message.
    - This new socket listens specifically for messages that match its four-tuple, i.e. the IP and port of the sender along with the IP and port of the receiver.
    - Implementing communication in this way effectively creates a dedicated virtual connection for communication between a specific process running on one host and a specific process running on another host. 
    - The advantage of having a dedicated connection like this is that it more easily allows you to put in place rules for managing the communication such as the order of messages, acknowledgements that messages had been received, retransmission of messages that weren't received, and so on.
- It provides **multiplexing** services
    - In the context of a communication network, multiplexing is the idea of transmitting multiple signals over a single channel, such as a single device running a browser, an e-mail client, and a music-streaming app all through the same network connection.
    - Multiplexing is enabled through the use of network ports (port numbers) alongside IP addresses.
    - This is important because there are often multiple applications running on a single device, and yet an IP address only provides a ***single channel*** to that device.
    - Each specific process is assigned a single port, which can be used to identify that same process running on a different device.
    - An IP address and port number combined define a communication end-point known as a network socket.
    - These sockets allow both IP and the protocol operating at the Transport Layer to transmit data between devices and processes.
- The purpose of these additional communication rules is to add **network reliability** to the communication.
    - Network reliability ensures that a reliable communication channel is established between processes.
    - That is, all transmitted data is received at the communication end-point in the correct order.
    - Consists of 4 key elements:
        - **In-order delivery**: data is received in the order that it was sent
        - **Error detection**: corrupt data is identified using a checksum
        - **Handling data loss**: missing data is retransmitted based on acknowledgements and timeouts
        - **Handling duplication**: duplicate data is eliminated through the use of sequence numbers
    - Network reliability is implemented by TCP in the Transport Layer.
    - Lower level protocols (Ethernet and the Internet Protocol) are inherently unreliable; they include checksum data as part of their header or trailer so that the data transported as frames and packets can be tested to ensure it hasn't become corrupt during its journey.
    - If the data is corrupt however, these protocols simply discard it (dropping the frame or packet); there is no provision within these protocols for enabling the replacement of lost data. The possibility of losing data and it not being replaced means that the network up to and including the Internet Protocol is effectively an unreliable communication channel.
- **Segments** are the Protocol Data Unit (PDU) of TCP. Like the PDUs of the protocols we've looked at for other network layers, they use a combination of headers and payload to provide encapsulation of data from the layer above.
    - The Source and Destination port numbers are fields in the segment header, while data such as an HTTP request is part of the payload.
    - It provides five main services:
        - **Multiplexing** through source and destination port numbers
        - **Error detection** through a checksum
        - **In-order delivery, handling data loss, and handling data duplication (data reliability)** through sequence and acknowledgment numbers
        - **Flow control** through window size data
        - **Congestion avoidance** through dynamic adjustment of flow according to data loss
- The **main downsides of TCP** are the latency overhead of establishing a connection, and the potential Head-of-line blocking as a result of in-order delivery.
    - TCP provides reliability at the cost of speed (that is, its reliability functions can contribute greatly to latency).
    - Establishing a connection with the three-way handshake adds overhead: a full round trip must complete before any application data can be sent.
    - **Head-of-Line (HOL) blocking** relates to how issues in delivering or processing one message in a sequence of messages can delay or 'block' the delivery or processing of the subsequent messages in the sequence.
        - HOL blocking can occur as a result of the fact that TCP provides for in-order delivery of segments. If one of the segments goes missing and needs to be retransmitted, the segments that come after it in the sequence can't be processed, and need to be buffered until the retransmission has occurred.
        - This can lead to increased queuing delay which is one of the elements of latency.
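
In-order delivery, de-duplication, and Head-of-Line blocking can be seen together in a small receiver model. This is a simplified sketch, not TCP's actual implementation: it numbers whole segments rather than bytes, and the class and attribute names are hypothetical.

```python
class InOrderReceiver:
    """Toy receiver that only hands data to the application in sequence order."""

    def __init__(self):
        self.next_expected = 0  # sequence number we are waiting for
        self.buffer = {}        # out-of-order segments held back
        self.delivered = []     # payloads released to the application

    def receive(self, sequence_number, payload):
        already_delivered = sequence_number < self.next_expected
        if already_delivered or sequence_number in self.buffer:
            return  # duplicate segment: discard it
        self.buffer[sequence_number] = payload
        # Release every segment that is now contiguous with what was delivered.
        while self.next_expected in self.buffer:
            self.delivered.append(self.buffer.pop(self.next_expected))
            self.next_expected += 1


receiver = InOrderReceiver()
receiver.receive(0, "A")
receiver.receive(2, "C")  # segment 1 is missing, so this must wait
receiver.receive(3, "D")  # also blocked behind the missing segment
receiver.receive(2, "C")  # duplicate, ignored
print("before retransmission:", receiver.delivered)

receiver.receive(1, "B")  # retransmitted segment arrives
print("after retransmission: ", receiver.delivered)
```

Segments `C` and `D` arrived on time but sat in the buffer until `B` was retransmitted, which is the Head-of-Line blocking described above.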
    
**UDP**

- User Datagram Protocol (UDP) is a very simple protocol compared to TCP. It provides multiplexing (through source and destination port numbers) and ***optional*** error detection (through checksum), but no reliability, no in-order delivery, and no congestion or flow control.
- It enables end-to-end communication between processes at the Transport Layer, without establishing a connection between them.
- UDP is **connectionless**, and so doesn't need to establish a connection before it starts sending data.
    - A connectionless system relies on a single socket for all communication, does not establish dedicated communication channels, and responds to all communications individually as they arrive.
    - One socket object defined by the IP address of the host machine and the port assigned to a particular process running on that machine.
    - That object could call a `listen()` method which would allow it to wait for incoming messages directed to that particular IP/port pair.
    - It would simply process any incoming messages as they arrived and send any responses as necessary.
    - It does not matter from what process transmissions come, a single socket listens to all messages regardless and responds to each as it arrives.
    - This is useful because it is a) a simpler and more flexible process than a connection-oriented system and b) it reduces latency overhead because a connection does not have to be established.
- Specifically, UDP provides speed in several ways: it doesn't take the time to establish a dedicated connection; its lack of in-order delivery means no latency due to Head-of-Line blocking; the one-way data flow of a connectionless system cuts down on latency due to extra round trips (there are no acknowledgments); and, as a connectionless protocol, it does no connection state tracking.
- Furthermore, UDP acts as a "base template" that programmers can build upon. The specifics of what type of reliability functions to include are left up to the developer to implement at the Application level.
- UDP does not provide any of the reliability of TCP. It is just as inherently unreliable as the layers below it.
- With UDP there is no guarantee of message delivery, delivery order, congestion avoidance, flow control, or state tracking.
- For example, video calling applications and online games that prioritize speed and low latency/lag over the potential for small amounts of lost data, can utilize UDP.
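
A connectionless exchange can be sketched with Python's standard `socket` module. The function name is hypothetical, and the example sends to the loopback address so it stays on one machine. Note that in this API a UDP socket waits for messages with `bind()` and `recvfrom()` rather than a literal `listen()` call.

```python
import socket


def udp_round_trip(message):
    """Send one datagram to ourselves over loopback, with no connection setup."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    receiver.settimeout(2.0)         # UDP gives no delivery guarantee, so don't wait forever

    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # No handshake: the very first thing sent is application data.
        sender.sendto(message, receiver.getsockname())
        data, source = receiver.recvfrom(4096)
        return data, source, sender.getsockname()
    finally:
        sender.close()
        receiver.close()


data, source, sender_name = udp_round_trip(b"ping")
print(f"received {data!r} from {source}")
```

The single receiving socket would accept datagrams from any sender; it learns who sent each message only from the source address attached to that message.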

## `Have a broad understanding of the three-way handshake and its purpose`

- The three-way handshake is what TCP uses to establish a dedicated and reliable connection between processes over the network.
- First, the sender sends a SYN segment, which essentially asks whether the receiver is ready to receive.
- Upon receipt of the SYN segment, the receiver sends back a SYN ACK segment, indicating that it received the previous message and checking that its own messages are also being received.
- Finally, upon receiving the SYN ACK, the original sender sends an ACK segment, indicating that it is also receiving messages from the receiver, and the connection is established.
- This not only ensures a reliable connection between both devices, but also synchronizes the sequence numbers that will be used during the connection.
- It is this aspect of TCP that enables network reliability, that is, handling data loss through message acknowledgement, and ensuring in-order delivery and de-duplication via the synchronized sequence numbers.
- A key characteristic of the process is that the sender cannot send any application data until after it has sent the ACK segment.
- In practical terms, this means there is an entire round trip of latency before any application data can be exchanged. Since the handshake occurs every time a TCP connection is made, it adds to the overall latency of any application that uses TCP at the transport layer.
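
The exchange of sequence and acknowledgement numbers during the handshake can be modelled as plain data. This sketch only illustrates the numbering; the function name and the dictionary layout are hypothetical, and the initial sequence numbers in the example are arbitrary values.

```python
def three_way_handshake(client_isn, server_isn):
    """Return the three segments exchanged, given each side's initial sequence number."""
    # 1. Client -> server: "here is my starting sequence number"
    syn = {"flags": {"SYN"}, "seq": client_isn}
    # 2. Server -> client: acknowledges the client's number and offers its own
    syn_ack = {"flags": {"SYN", "ACK"}, "seq": server_isn, "ack": syn["seq"] + 1}
    # 3. Client -> server: acknowledges the server's number
    ack = {"flags": {"ACK"}, "seq": syn["seq"] + 1, "ack": syn_ack["seq"] + 1}
    return [syn, syn_ack, ack]


for segment in three_way_handshake(client_isn=100, server_isn=300):
    print(sorted(segment["flags"]), {k: v for k, v in segment.items() if k != "flags"})
```

Each acknowledgement number is one more than the sequence number it acknowledges, so after the third segment both sides agree on where the other's data will start.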

## `Have a broad understanding of flow control and congestion avoidance`

**Flow Control**

- Flow control is a mechanism to prevent the sender from overwhelming the receiver with too much data at once.
- Provided by TCP, flow control helps to ensure that data is transmitted as efficiently as possible.
- This, in turn, helps to mitigate the increased latency inherent in TCP connections.
- It is implemented via the window field of the TCP segment header.
    - The window header field contains data sent by the receiver letting the sender know the maximum amount of data it can accept at any given time.
    - This number is dynamically generated, and therefore the receiver can lower the amount if the buffer is getting full, and the sender will respond accordingly.
    - Data awaiting processing is stored in a **'buffer'**. The buffer size will depend on the amount of memory allocated according to the configuration of the OS and the physical resources available.
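
A minimal sketch of the idea: the sender never transmits more than the window the receiver most recently advertised. The function name, the round-by-round model, and the window values are assumptions made for illustration; real TCP tracks this continuously per connection.

```python
def send_with_window(data, advertised_windows):
    """Split data into chunks, each no larger than the window advertised for that round."""
    chunks = []
    sent = 0
    for window in advertised_windows:
        if sent >= len(data):
            break  # everything has been sent
        if window == 0:
            continue  # receiver's buffer is full: the sender pauses this round
        chunk = data[sent:sent + window]
        chunks.append(chunk)
        sent += len(chunk)
    return chunks


# Hypothetical windows: the receiver's buffer fills up, then drains again.
print(send_with_window(b"abcdefghij", [4, 2, 0, 4]))
```

As the advertised window shrinks the chunks shrink with it, and a window of zero stops the sender until the receiver has room again.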

**Congestion Avoidance**

- Congestion avoidance is a service provided by TCP that attempts to prevent network congestion, a situation in which more data is being transmitted than there is capacity.
- To implement this, TCP uses data loss as a feedback mechanism to determine how "congested" the network is, by tracking how many retransmissions are required.
- A lot of data loss, or a lot of retransmissions, indicates there is more data on the network than there is capacity to process that data.
- TCP will take this as a sign to reduce the size of the transmission window, that is, it will send less data along the given channel.
- This is done to make data transmission as efficient as possible to mitigate the latency overhead inherent in TCP connections.
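
The "loss as feedback" idea can be sketched with a simple rule: grow the window slowly while things go well and cut it sharply when loss is detected. This additive-increase, multiplicative-decrease rule is an assumption used for illustration; the notes above do not specify an algorithm, and real TCP congestion control is more involved. The function name and numbers are hypothetical.

```python
def adjust_window(window, loss_detected, minimum=1):
    """Shrink the transmission window on loss, otherwise grow it by one segment."""
    if loss_detected:
        return max(minimum, window // 2)  # back off sharply
    return window + 1                     # probe gently for spare capacity


window = 8
history = []
# Hypothetical sequence of rounds; True means a retransmission was needed.
for loss in [False, False, True, False, True]:
    window = adjust_window(window, loss)
    history.append(window)

print(history)
```

The window climbs while no retransmissions are needed and halves each time data loss signals that the network is congested.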

# <mark>URLs</mark>

## `Be able to identify the components of a URL, including query strings`

- A URL (Uniform Resource Locator) is a consistently formatted string that allows us to locate a certain resource on the web.
- It provides us with a systematic means of locating resources that we are requesting (via an HTTP request).
- A **URI or Uniform Resource Identifier** is an identifier for a particular resource within an information space.
- URL refers to the subset of URIs that, in addition to identifying a resource, provide a means of locating the resource by describing its primary access mechanism (e.g., its network "location").
- A URL, ***unlike*** other kinds of URI, must include some piece of data that allows us to locate the resource in question; a URI in general does not have this requirement.<br><br>
- URL components include the:
    - **scheme**: tells the web client which protocol to use to access the resource.
        - It is the first part of the URL.
        - A scheme is different from a protocol, although these terms are sometimes used interchangeably.
        - It indicates which protocol group should be used, but not the specific version.
        - Schemes and protocols can be differentiated by their case; the convention is to refer to scheme names in lowercase, e.g. `http`, and protocol names in uppercase, e.g. HTTP.
    - **host** (or hostname): tells the client where the resource is hosted or located.
        - This is written in the format of a domain name.
        - DNS takes this human-readable domain name and finds the equivalent IP address so the request can be routed.
        - It is a mandatory component of the URL.
    - **port**: an identifier for the specific process to which the communication should be routed.
        - It is only required if you want to use a port other than the default.
        - The default port is 80 for HTTP and 443 for HTTPS.
    - **path**: shows which local resource is being requested from the host.
        - This part of the URL is optional.
        - If the resource in question is a home page, the path might consist of a single forward slash (`/`).
        - Historically, the path indicated specifically where the resource was located on the server, but with the proliferation of dynamically generated content, it no longer always follows the file path on the server.
    - **query string/parameters**: passes additional data to the server during an HTTP request in the form of specially formatted query parameters.
        - This part of the URL is also optional.
        - Query parameters take the form of name/value pairs separated by an `=` sign. Multiple name/value pairs are separated by an `&` sign. The start of the query string is indicated by a `?`.
        - Because query strings are passed in through the URL, they are most commonly used with HTTP GET requests.
        - Query strings are limited in use in that they have a maximum length, and are not suitable for sensitive information as they are plainly visible in the URL.
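
The components above can be picked out of a URL with Python's standard `urllib.parse` module. The URLs here are made-up examples.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

# Hypothetical URL containing every component discussed above.
url = "http://www.example.com:88/home?item=book&page=2"
parts = urlsplit(url)

print("scheme:", parts.scheme)      # which protocol family to use
print("host:  ", parts.hostname)    # where the resource lives
print("port:  ", parts.port)        # only present because it isn't the default
print("path:  ", parts.path)        # which resource on that host
print("query: ", parts.query)       # everything after the "?"
print("params:", parse_qs(parts.query))

# Going the other way: build a URL from its components.
# urlencode() formats the name/value pairs and encodes characters such as spaces.
query = urlencode({"q": "tcp udp"})
built = urlunsplit(("https", "www.example.com", "/search", query, ""))
print(built)
```

Building URLs from components like this, rather than by joining strings by hand, avoids mistakes with the `?`, `&`, and `=` separators.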

## `Be able to construct a valid URL`

## `Have an understanding of what URL encoding is and when it might be used`