# <mark>The Internet</mark>

## `Have a broad understanding of what the internet is and how it works`

*(Come back to this)*

## `Understand the characteristics of the physical network, such as latency and bandwidth`

**Physical Network**
- The Physical layer is the bottommost layer (Layer 1) of the OSI model of networked communication.
- The physical network is the tangible infrastructure that transmits the encapsulated data from the layers above as bits, in the form of the electrical signals, light, and radio waves that carry network communications.
- The functionality at this level is essentially concerned with the transfer of bits (binary data) across a physical medium.
- The physical limitations of networked communication, latency and bandwidth, all come as a result of unavoidable physical laws that govern this layer.

**Latency**
- Latency is a measure of the time it takes for some data to get from one point in a network to another point in a network.
- It is a measure of delay. The difference between the start and end point is the delay.
- It is determined by real physical laws, such as the distance traveled and the speed of the signal traveling (i.e. speed of light, sound, or electricity).
- Latency has four main aspects that occur during each network "hop" that data takes during its overall journey through the network:
    - **Propagation delay**: this is the amount of time it takes for a message to travel from the sender to the receiver, and can be calculated as the ratio between distance and speed.
    - **Transmission delay**: the amount of time it takes to push the data onto the link between two nodes.
    - **Processing delay**: data travelling across the physical network doesn't cross directly from one link to another, but is processed in various ways at each node (e.g. a router); this is the amount of time that processing takes.
    - **Queuing delay**: The amount of time the data is waiting in the queue or "buffer" to be processed is the queuing delay.
- The total latency between two points, such as a client and a server, is the sum of all these delays across every hop (usually given in milliseconds (ms)). Two related terms are:
    - **Last-mile latency**: a "slowing down" that takes place at the network edge, as smaller and more frequent hops take place as data moves lower in the network hierarchy.
    - **Round-trip Time (RTT)**: the length of time for a signal to be sent, added to the length of time for an acknowledgement or response to be received.
        - Latency overhead associated with additional round trips is often a trade off to consider when dealing with the implementation of network reliability in TCP.
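
The delay components above can be combined in a small calculation. This is only a sketch: the signal speed, distance, frame size, link bandwidth, and the processing and queuing figures are all assumed values chosen for illustration, and the round trip assumes a single symmetric hop.

```python
# Assumed signal speed: roughly two-thirds of the speed of light (typical for fibre).
SIGNAL_SPEED_M_PER_S = 2e8


def propagation_delay_ms(distance_m: float, speed_m_per_s: float = SIGNAL_SPEED_M_PER_S) -> float:
    """Propagation delay is the ratio between distance and speed."""
    return distance_m / speed_m_per_s * 1000


def transmission_delay_ms(size_bits: float, link_bandwidth_bps: float) -> float:
    """Time needed to push all the bits onto the link."""
    return size_bits / link_bandwidth_bps * 1000


def hop_latency_ms(propagation: float, transmission: float, processing: float, queuing: float) -> float:
    """Latency of one hop is the sum of the four delays."""
    return propagation + transmission + processing + queuing


# Hypothetical hop: 1,000 km of cable, a 12,000-bit frame, a 100 Mbps link.
example_hop_ms = hop_latency_ms(
    propagation=propagation_delay_ms(1_000_000),
    transmission=transmission_delay_ms(12_000, 100e6),
    processing=0.05,  # assumed
    queuing=0.20,     # assumed
)

# Round trip, assuming the return journey costs the same as the outward one.
example_rtt_ms = 2 * example_hop_ms
```

With these numbers the propagation delay dominates, which is why distance matters so much for latency.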
        
**Bandwidth**
- Bandwidth is the amount of data that can be sent along the physical structure of the network in a particular unit of time (typically, a second).
- It is a measure of capacity.
- It is also determined by real physical laws, such as the capacity of the medium down which data is being transported.
- Bandwidth is almost never constant across the whole path, so we consider the bandwidth of a connection to be the lowest bandwidth of any link along it (the bottleneck).
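
A minimal sketch of the "lowest value wins" idea, using a made-up list of link capacities (the figures are hypothetical):

```python
def connection_bandwidth_mbps(link_bandwidths_mbps: list[float]) -> float:
    """The slowest link on the path caps the bandwidth of the whole connection."""
    return min(link_bandwidths_mbps)


# Hypothetical path: fast core links with a slower link near the network edge.
path_mbps = [1000.0, 100.0, 40.0, 1000.0]
bottleneck_mbps = connection_bandwidth_mbps(path_mbps)
```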

## `Have a basic understanding of how lower level protocols operate`

**The Link/ Data Link Layer**

- The protocols operating at this layer are primarily concerned with identifying devices on the physical network and moving data between the devices that comprise it, such as hosts (e.g. computers), switches, and routers. This includes identifying the next network "node" to which data should be sent.
- Ethernet governs communication between devices in a local network, and is responsible for navigating to the correct physical address, rather than logical one (this is left to IP). For this reason, it acts as an interface between the physical infrastructure below it and the more logical layers above.
- The **Ethernet Protocol** is a set of standards and protocols that enables communication between devices on a local network.
- It is the most commonly used protocol at this layer
- The Ethernet protocol provides two main functions:
    - **Framing**, which provides logical structure to the streams of bits traveling through the physical infrastructure/layer of the network by categorizing data into 'fields' that have specific lengths and orders.
        - **Ethernet Frames**: a Protocol Data Unit (PDU) that encapsulates data from the Internet/ Network layer above.
        - The Link/ Data Link layer is the lowest layer at which encapsulation takes place.
        - Adds logical structure to this binary data.  The data in the frame is still in the form of bits, but the structure defines which bits are actually the data payload, and which are metadata to be used in the process of transporting the frame.
        - The "fields" of a frame include:
            - **Source and Destination MAC address**: The source address is the physical address of the device which created the frame. The destination MAC address is the physical address of the device on the local network for which the frame is intended (the next hop, which may be a router rather than the final recipient).
            - **Data Payload**: Contains the data for the entire Protocol Data Unit (PDU) from the layer above, commonly an IP Packet.
    - **Addressing** which identifies the next network "node" to which data should be sent with the use of MAC addressing.
        - Ethernet uses **MAC addressing** to identify devices (rather than location) connected to the local network.  This is how Ethernet implements addressing
        - Since this address is linked to the specific physical device, and (usually) doesn't change, it is sometimes referred to as the **physical address** or **burned-in address**.
        - MAC Addresses are formatted as a sequence of six two-digit hexadecimal numbers, e.g. `00:40:96:9d:68:0a`, with different ranges of addresses being assigned to different network hardware manufacturers.
        - MAC addresses work well in LANs, where devices are connected to a central switch that keeps a record of each device's MAC address.
        - They do not work well in large decentralized systems, nor are they scalable:
            - They are physical, not logical, i.e. they do not change based on location. Each MAC Address is tied (burned in) to a specific physical device
            - They are flat, and do not possess a hierarchical structure that allows us to categorize them into searchable subdivisions. The entire address is a single sequence of values and can't be broken down into sub-divisions.
- With Ethernet there's decapsulation and re-encapsulation at every point on the journey. So when a device such as a router receives a frame that has an IP packet as its payload, it decapsulates the packet from the frame, and re-encapsulates it into a new frame for the next 'hop' on its journey.
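
Framing and MAC addressing can be sketched in a few lines. This is a deliberately simplified frame, not a full Ethernet implementation: it leaves out the preamble, padding, and the checksum trailer, and it adds a two-byte EtherType field (not covered in these notes) so the layout matches the usual header. The second MAC address and the payload are hypothetical.

```python
import struct

ETHERTYPE_IPV4 = 0x0800  # EtherType value indicating an IPv4 packet as payload


def mac_to_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' into its six raw bytes."""
    parts = mac.split(":")
    if len(parts) != 6:
        raise ValueError(f"expected 6 octets, got {len(parts)}")
    return bytes(int(part, 16) for part in parts)


def bytes_to_mac(raw: bytes) -> str:
    """Format six raw bytes as colon-separated two-digit hex."""
    return ":".join(f"{byte:02x}" for byte in raw)


def build_frame(dest_mac: str, src_mac: str, payload: bytes) -> bytes:
    """Header fields come first, in a fixed order and length, then the payload."""
    header = mac_to_bytes(dest_mac) + mac_to_bytes(src_mac) + struct.pack("!H", ETHERTYPE_IPV4)
    return header + payload


def parse_frame(frame: bytes) -> dict:
    """Use the fixed field lengths to tell metadata apart from payload."""
    return {
        "destination": bytes_to_mac(frame[0:6]),
        "source": bytes_to_mac(frame[6:12]),
        "payload": frame[14:],
    }


# The source address is the example from these notes; the destination is made up.
frame = build_frame("00:1a:2b:3c:4d:5e", "00:40:96:9d:68:0a", b"<ip packet bytes>")
fields = parse_frame(frame)
```

The frame is still just a run of bytes; only the agreed field positions let the receiver separate the addresses from the encapsulated packet.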
            
**The Internet/ Network Layer**

- Whereas the Ethernet protocol provides communication between devices on the same local network, the Internet Protocol enables communication between two networked devices anywhere in the world.
- The primary function of protocols at this layer is to facilitate communication between hosts (e.g. computers) on different networks (i.e. inter-network communication).
- It comes between protocols at the Link/DataLink Layer and protocols at the Transport Layer
- The **Internet Protocol (IP)** is the predominant protocol used at this layer for inter-network communication.
- IP provides routing capability between devices on different networks via IP addresses.  It also encapsulates data into packets
- IP is end to end (i.e. it only cares about the two end points in the communication, such as the client and server, not particularly about how the packets are routed through the network). 
- A **Packet** is the Protocol Data Unit (PDU) within the IP Protocol
    - Just as with Ethernet Frames, the Data Payload of an IP Packet is the PDU from the layer above (generally a TCP segment or a UDP datagram from the Transport layer).
    - A packet consists of a header and a data payload
    - The IP packet is responsible for routing all the encapsulated data on its journey, which consists of a series of network "hops", or jumps between various nodes (routers) on the overall network.
    - The Header is split into logical fields which provide metadata used in transporting the packet.
    - The header fields include:
        - **Source Address**: the 32-bit IP address of the source (sender) of the packet. Allows for IP addressing.
        - **Destination Address**: the 32-bit IP address of the destination (intended recipient) of the packet. Allows for IP addressing.
- An **IP Address** is a unique address that we can use to identify a device or host on the internet.
    - IP addresses have two main features that allow for inter-network communication across a large distributed system:
        - They are logical: they are assigned as required when devices join a network
        - They are hierarchical: the structure of the address allows us to categorize them into searchable subdivisions (subnets). The overall network is divided into logical sub-networks and numbers are allocated according to this hierarchy.
        - A range of IP addresses is defined by network hierarchy, and each subnetwork is assigned a given range of addresses.
        - The network address is assigned to the first address in the range and the broadcast address is assigned to be the last address in that range.
        - There are two types of IP addresses in two different versions of IP:
            - IPv4: 32-bit addresses provide roughly 4.3 billion possible addresses, which is not enough for all the devices on the Internet.
            - IPv6: 128-bit addresses provide roughly 340 undecillion addresses, which should be enough for a long time to come.
- MAC addresses, due to their nature (physical, *not logical*; flat, *not hierarchical*), are not scalable. IP addresses fill this gap. Because they are logical and hierarchical, they work well in large distributed systems.
- Unlike MAC Addresses, IP Addresses are logical in nature. This means that they are not tied to a specific device, but can be assigned as required to devices as they join a network.
- The IP address only gets us in communication with the intended device. It does not allow us to isolate any particular application or process running on that device. For that we need the Port numbers provided by the Transport Layer protocol.
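
The hierarchical structure of IP addresses can be explored with Python's standard `ipaddress` module. The subnet and host address below are hypothetical examples from a private range.

```python
import ipaddress

# Hypothetical subnetwork: a range of 256 addresses.
subnet = ipaddress.ip_network("192.168.1.0/24")

network_address = subnet.network_address      # first address in the range
broadcast_address = subnet.broadcast_address  # last address in the range

# Because addresses are hierarchical, membership is a simple prefix check.
host = ipaddress.ip_address("192.168.1.42")
host_is_in_subnet = host in subnet

# Size of each address space.
ipv4_total = 2 ** 32    # about 4.3 billion
ipv6_total = 2 ** 128   # about 340 undecillion
```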

## `Know what an IP address is and what a port number is`

**Ports**
- A port is an identifier for a specific process running on a host. 
- This identifier is an integer in the range 0-65535.
- Services are conventionally assigned particular port numbers (e.g. 80 for HTTP), so the same port number can be used to reach the same kind of service on a different device.
- The source and destination port numbers are included in the Protocol Data Units (PDU) for the transport layer.
- Data from the application layer is encapsulated as the data payload in this PDU, and the source and destination port numbers within the PDU can be used to direct that data to specific processes on a host.
- The entire PDU is then encapsulated as the data payload in an IP packet.
- The IP addresses in the packet header can be used to direct data from one host to another. 
- The IP address and the port number together are what enables end-to-end communication between specific applications on different machines.

**Socket**
- An IP address and port number combined define a communication end-point known as a network socket.
- It is a communication end-point defined by an address-port pair.
- The IP address and the port number together allow the protocols operating in the Transport Layer to facilitate data exchange between specific applications running on separate devices across the network.
- These sockets allow both IP and the protocol operating at the Transport Layer (TCP/UDP) to transmit data between devices and processes.
- The IP address gets us the correct device on the network and the port number gets us to the correct application on that device.
- This is how we can achieve end-to-end communication between devices.<br><br>
- ***clarification for concept of sockets vs. implementation***
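
One way to see the difference between the concept and an implementation: conceptually a socket is just the address-port pair, while in code it is an object that the operating system ties to that pair. A minimal sketch using Python's standard `socket` module, binding to the loopback address and letting the OS pick a free port:

```python
import socket

# The socket *object* is the implementation...
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
    # Port 0 asks the operating system to choose any free port.
    listener.bind(("127.0.0.1", 0))
    # ...and the (IP address, port) pair it is bound to is the conceptual end-point.
    host, port = listener.getsockname()
```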

**IP Address**
- *Refer to previous question*

## `Have an understanding of how DNS works`

- DNS or the Domain Name System is a distributed database which translates/maps domain names like `www.google.com` to an IP address (something of the form `203.0.113.10`), so that the IP address can then be used to make a request to the server.
- There is a very large world-wide network of hierarchically organized DNS servers, and no single DNS server contains the complete database. 
- If a DNS server does not contain a requested domain name, the DNS server routes the request to another DNS server up the hierarchy. 
- Eventually, the domain name will be found in the DNS database on a particular DNS server, and the corresponding IP address will be used to send the request to the right server.
- A typical interaction with the Internet starts in a web browser:
    1. You enter a URL like `http://www.google.com` into your web browser's address bar.
    2. The browser creates an HTTP request, which is packaged up and sent to your device's network interface.
    3. If your device already has a record of the IP address for the domain name in its DNS cache, it will use this cached address. If the IP address isn't cached, a DNS request will be made to the Domain Name System to obtain the IP address for the domain.
    4. If the DNS server that receives the request does not have a record for the domain name, it will route the request up the hierarchical system until the record is found.
    5. The packaged-up HTTP request then goes over the Internet, where it is directed to the server with the matching IP address. (The IP address obtained from DNS is used by the lower-level protocols that are responsible for routing the HTTP request to the proper location.)
    6. The remote server accepts the request and sends a response over the Internet back to your network interface which hands it to your browser.
    7. Finally, the browser displays the response in the form of a web page.
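
The lookup described above (check the local cache first, then pass the request up the hierarchy) can be modelled with a toy resolver. This mirrors the simplified description in these notes rather than the real DNS protocol; the server names, domain names, and addresses are all hypothetical.

```python
# Hypothetical DNS servers; each knows some records and which server sits above it.
ROOT_SERVER = {"records": {"www.example.com": "192.0.2.10"}, "parent": None}
ISP_SERVER = {"records": {"intranet.example.net": "192.0.2.20"}, "parent": ROOT_SERVER}

# The device's own DNS cache, empty to begin with.
local_cache: dict[str, str] = {}


def resolve(domain: str, first_server: dict = ISP_SERVER) -> str:
    """Return the IP address for a domain, consulting the cache before any server."""
    if domain in local_cache:
        return local_cache[domain]

    server = first_server
    while server is not None:
        if domain in server["records"]:
            ip_address = server["records"][domain]
            local_cache[domain] = ip_address  # remember the answer for next time
            return ip_address
        server = server["parent"]  # not found here: route the request up the hierarchy

    raise LookupError(f"no record found for {domain}")


resolved_ip = resolve("www.example.com")
```

A second call to `resolve("www.example.com")` is answered from `local_cache` without touching any server.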

## `Understand the client-server model of web interactions, and the role of HTTP as a protocol within that model`

*(Come back to this)*

# <mark>TCP & UDP</mark>

## `Have a clear understanding of the TCP and UDP protocols, their similarities and differences`

**TCP**

- Transmission Control Protocol (TCP) is a **connection-oriented protocol** that ensures reliable data transfer between applications on top of the unreliable channel of the lower-layer protocols.
- It establishes end-to-end connections between processes in the Transport Layer.
    - A connection-oriented system: instantiates new socket object to establish a dedicated virtual connection channel between two processes running on separate devices.
    - Doesn't start sending application data until a connection has been established between application processes
    - You could have a socket object defined by the host IP and process port, just as in the connectionless system, also using a listen() method to wait for incoming messages
    - When new communication comes into the first listening socket, a new socket is created.   This new socket object wouldn't just be defined by the local IP and port number, but also by the IP and port of the process/host which sent the message. 
    - This socket listens specifically for messages that match its four-tuple, i.e. the IP and port of sender along with the IP and port of the receiver.
    - Implementing communication in this way effectively creates a dedicated virtual connection for communication between a specific process running on one host and a specific process running on another host. 
    - The advantage of having a dedicated connection like this is that it more easily allows you to put in place rules for managing the communication such as the order of messages, acknowledgements that messages had been received, retransmission of messages that weren't received, and so on.
- It provides **multiplexing** services
    - In the context of a communication network, multiplexing is the idea of transmitting multiple signals over a single channel, such as a single device running a browser, an e-mail client, and a Spotify stream all through the same network connection.
    - Multiplexing is enabled through the use of network ports (port numbers) alongside IP addresses
    - This is important because often there are multiple applications running on a single device, and yet IP addresses only provide a ***single channel***.
    - Services are conventionally assigned particular port numbers (e.g. 80 for HTTP), so the same port number can be used to reach the same kind of service on a different device.
    - An IP address and port number combined define a communication end-point known as a network socket.
    - These sockets allow both IP and the protocol operating at the Transport Layer to transmit data between devices and processes.
- The purpose of these types of additional communication rules is to add **reliability** to the communication (**network reliability**).
    - **Network Reliability** ensures that a reliable communication channel is established between processes.
    - That is, that all transmitted data is received at communication end-point in the correct order.
    - Consists of 4 key elements:
        - **In-order delivery**: data is received in the order that it was sent
        - **Error detection**: corrupt data is identified using a checksum
        - **Handling data loss**: missing data is retransmitted based on acknowledgements and timeouts
        - **Handling duplication**: duplicate data is eliminated through the use of sequence numbers
    - Network reliability is implemented by TCP in the Transport Layer.
    - Lower level protocols (Ethernet and the Internet Protocols) are inherently unreliable; they include checksum data as part of their header or trailer so that the data transported as frames and packets can be tested to ensure it hasn't become corrupt during its journey. 
    - If the data is corrupt however, these protocols simply discard it (dropping the frame or packet); there is no provision within these protocols for enabling the replacement of lost data. The possibility of losing data and it not being replaced means that the network up to and including the Internet Protocol is effectively an unreliable communication channel.
- **Segments** are the Protocol Data Unit (PDU) of TCP. Like the PDUs of protocols we've looked at for other network layers, it uses a combination of headers and payload to provide encapsulation of data from the layer above.
    - Data from the application layer is encapsulated as the data payload in this PDU, and the source and destination port numbers within the PDU can be used to direct that data to specific processes on a host. 
    - The Source and Destination port numbers are fields in the segment header, while data such as an HTTP request is part of the payload.
    - It provides five main services:
        - **Multiplexing** through source and destination port numbers
        - **Error detection** through a checksum
        - **In-order delivery, handling data loss, and handling data duplication (data reliability)** through sequence and acknowledgment numbers
        - **Flow control** through window size data
        - **Congestion avoidance** through dynamic adjustment of flow according to data loss
- The **main downsides of TCP** are the latency overhead of establishing a connection, and the potential Head-of-line blocking as a result of in-order delivery.
    - TCP provides reliability at the cost of speed (that is, its reliability functions can contribute greatly to latency)
    - There is added overhead from establishing a connection with the three-way handshake, which costs a full round trip before any application data can be sent.
    - **Head-of-Line (HOL) blocking** relates to how issues in delivering or processing one message in a sequence of messages can delay or 'block' the delivery or processing of the subsequent messages in the sequence.
        - HOL blocking can occur as a result of the fact that TCP provides for in-order delivery of segments. If one of the segments goes missing and needs to be retransmitted, the segments that come after it in the sequence can't be processed, and need to be buffered until the retransmission has occurred.
        - This can lead to increased queuing delay which is one of the elements of latency.
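
In-order delivery, de-duplication, and Head-of-Line blocking can all be seen in a small simulation of a receive buffer. This is a sketch only: it numbers whole segments 1, 2, 3 and so on, whereas real TCP sequence numbers count bytes, and the arrival pattern is made up.

```python
def deliver_in_order(arrivals: list[tuple[int, str]]) -> list[list[str]]:
    """For each arriving segment, return what could be handed to the application.

    Segments that arrive early are buffered until the gap before them is filled.
    """
    buffered: dict[int, str] = {}
    next_expected = 1
    delivered_per_arrival = []

    for sequence_number, data in arrivals:
        # Duplicate: already delivered, or already waiting in the buffer.
        if sequence_number < next_expected or sequence_number in buffered:
            delivered_per_arrival.append([])
            continue

        buffered[sequence_number] = data
        released = []
        # Release every segment that is now contiguous with what was delivered.
        while next_expected in buffered:
            released.append(buffered.pop(next_expected))
            next_expected += 1
        delivered_per_arrival.append(released)

    return delivered_per_arrival


# Segment 2 is lost and retransmitted last; segment 3 also arrives twice.
arrivals = [(1, "a"), (3, "c"), (4, "d"), (3, "c"), (2, "b")]
delivery_log = deliver_in_order(arrivals)
```

Segments 3 and 4 arrive on time but sit in the buffer until the retransmitted segment 2 shows up, which is exactly the Head-of-Line blocking described above.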
    
**UDP**

- User Datagram Protocol (UDP) is a very simple protocol compared to TCP. It provides multiplexing (through source and destination port numbers) and ***optional*** error detection (through checksum), but no reliability, no in-order delivery, and no congestion or flow control.
- It provides end-to-end communication between processes at the Transport Layer, but without establishing connections.
- UDP is **connectionless**, and so doesn't need to establish a connection before it starts sending data
    - A connectionless system relies on a single socket for all communication, does not establish dedicated communication channels, and responds to all communications individually as they arrive.
    - One socket object defined by the IP address of the host machine and the port assigned to a particular process running on that machine.
    - That object could call a `listen()` method which would allow it to wait for incoming messages directed to that particular IP/port pair.
    - It would simply process any incoming messages as they arrived and send any responses as necessary.
    - It does not matter from what process transmissions come, a single socket listens to all messages regardless and responds to each as it arrives.
    - This is useful because it is a) a simpler and more flexible process than a connection-oriented system and b) it reduces latency overhead because a connection does not have to be established.
- Specifically, UDP provides speed for several reasons: it doesn't take the time to establish a dedicated connection; its lack of in-order delivery means no latency due to Head-of-Line blocking; the one-way data flow of a connectionless system cuts down on latency due to extra round trips (there are no acknowledgments); and, since it is a connectionless protocol, it does no connection state tracking.
- Furthermore, UDP acts as a "base template" that programmers can build upon. The specifics of what type of reliability functions to include are left up to the developer to implement at the Application level.
- UDP does not provide any of the reliability of TCP. It is just as inherently unreliable as the layers below it.
- With UDP there is no guarantee of message delivery, delivery order, congestion avoidance, flow control, or state tracking.
- For example, video calling applications and online games that prioritize speed and low latency/lag over the potential for small amounts of lost data, can utilize UDP.
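
The connectionless style can be tried out with Python's standard `socket` module. A minimal sketch over the loopback interface: the receiver binds to one address-port pair and accepts a datagram from whoever sends one, with no handshake first. The message contents and the two-second timeout are arbitrary choices.

```python
import socket

with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as receiver, \
        socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
    # One socket, defined only by the local IP address and port (0 = OS picks a free port).
    receiver.bind(("127.0.0.1", 0))
    receiver.settimeout(2.0)  # don't wait forever: UDP gives no delivery guarantee
    receiver_address = receiver.getsockname()

    # No connection is established; the datagram is simply sent.
    sender.sendto(b"hello", receiver_address)

    # The receiver learns who sent the datagram only when it arrives.
    data, sender_address = receiver.recvfrom(1024)
```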

## `Have a broad understanding of the three-way handshake and its purpose`

- The three-way handshake is what TCP uses to establish a dedicated and reliable connection between processes over the network.
- First the sender sends a SYN segment, which essentially asks if the receiver is ready to receive.
- Upon receipt of the SYN segment, the receiver sends back a SYN ACK segment, indicating that it received the previous message and ensuring its messages are also being received.
- Finally, upon receiving the SYN ACK, the original sender sends an ACK segment, indicating it is also receiving messages from the receiver, and the connection can be (and subsequently is) established.
- This not only ensures a reliable connection between both devices, but synchronizes sequence numbers that will be used during the connection.
- It is this aspect of TCP that enables network reliability, that is, handling data loss through message acknowledgement, and ensuring in-order delivery and de-duplication via the synchronized sequence numbers.
- A key characteristic of the process is that the sender cannot send any application data until after it has sent the ACK Segment.
- What this means in practical terms, is that there is an entire round-trip of latency before any application data can be exchanged. Since this hand-shake process occurs every time a TCP connection is made, this clearly has an impact on any application which uses TCP at the transport layer.
- This extra round trip, rather than the data itself, is what adds to the overall latency of every new TCP connection.
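
The three segments and the way they synchronize sequence numbers can be written out as plain data. This is a sketch: the initial sequence numbers (100 and 300) are made up, and the dictionary fields are a simplification of the real segment header. Each acknowledgement number is the other side's sequence number plus one.

```python
def three_way_handshake(client_isn: int, server_isn: int) -> list[dict]:
    """Return the three handshake segments in the order they are sent."""
    syn = {"from": "client", "flags": "SYN", "seq": client_isn, "ack": None}
    # The server acknowledges the client's SYN and announces its own sequence number.
    syn_ack = {"from": "server", "flags": "SYN-ACK", "seq": server_isn, "ack": syn["seq"] + 1}
    # The client acknowledges the server's SYN; application data may follow this segment.
    ack = {"from": "client", "flags": "ACK", "seq": client_isn + 1, "ack": syn_ack["seq"] + 1}
    return [syn, syn_ack, ack]


handshake = three_way_handshake(client_isn=100, server_isn=300)
```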

## `Have a broad understanding of flow control and congestion avoidance`

**Flow Control**

- Flow control is a mechanism to prevent the sender from overwhelming the receiver with too much data at once.
- Provided by TCP, flow control helps to ensure that data is transmitted as efficiently as possible.
- This, in turn, helps to mitigate the increased latency inherent in TCP connections.
- It is implemented via the window field of the TCP segment header.
    - The window header field contains data sent by the receiver letting the sender know the maximum amount of data it can accept at any given time.  Each side of a connection can let the other side know the amount of data that it is willing to accept
    - This number is dynamically generated, and therefore the receiver can lower the amount if the buffer is getting full, and the sender will respond accordingly.
    - Data awaiting processing is stored in a **'buffer'**. The buffer size will depend on the amount of memory allocated according to the configuration of the OS and the physical resources available.
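
A sketch of how the advertised window limits the sender. Each entry in `advertised_windows` stands for the window value the receiver reported before that round of sending; the data and window sizes are hypothetical, and a window of zero is modelled as an empty chunk (the sender pauses).

```python
def send_with_flow_control(data: bytes, advertised_windows: list[int]) -> list[bytes]:
    """Split data into chunks no larger than the window advertised for each round."""
    chunks = []
    offset = 0
    for window in advertised_windows:
        if offset >= len(data):
            break  # everything has been sent
        chunk = data[offset:offset + window]  # never exceed what the receiver can accept
        chunks.append(chunk)
        offset += len(chunk)
    return chunks


# The receiver's buffer fills up (window shrinks to 0), then frees up again.
sent_chunks = send_with_flow_control(b"abcdefghij", [4, 2, 0, 4])
```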

**Congestion Avoidance**

- Congestion avoidance is a service provided by TCP that attempts to prevent network congestion, a situation in which more data is being transmitted than there is capacity.
- To implement this, TCP uses data loss as a feedback mechanism to determine how "congested" the network is, by tracking how many retransmissions are required.
- A lot of data loss, or a lot of retransmissions, indicates there is more data on the network than there is capacity to process that data.
- TCP will take this as a sign to reduce the size of the transmission window, that is, it will send less data along the given channel.
- This is done to make data transmission as efficient as possible to mitigate the latency overhead inherent in TCP connections.
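
The feedback loop can be illustrated with a toy rule: grow the window slowly while nothing is lost, and cut it sharply when loss is detected. This "add one, halve on loss" rule is an assumption chosen for illustration and is not taken from these notes; real TCP congestion control algorithms are considerably more involved.

```python
def adjust_window(window: int, loss_detected: bool, minimum: int = 1) -> int:
    """Shrink the transmission window on loss, otherwise grow it by one."""
    if loss_detected:
        return max(minimum, window // 2)  # loss suggests congestion: send less
    return window + 1                     # no loss: cautiously send more


# Hypothetical sequence of rounds; True means a retransmission was needed.
window = 8
window_history = [window]
for loss_detected in [False, False, True, False, True]:
    window = adjust_window(window, loss_detected)
    window_history.append(window)
```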

# <mark>URLs</mark>

## `Be able to identify the components of a URL, including query strings`

- A URL (Uniform Resource Locator) is a consistently formatted string that allows us to locate a certain resource on the web.
- It provides us with a systematic means of locating resources that we are requesting (via an HTTP request).
- A **URI or Uniform Resource Identifier** is an identifier for a particular resource within an information space.
- URL refers to the subset of URIs that, in addition to identifying a resource, provide a means of locating the resource by describing its primary access mechanism (e.g., its network "location").
- A URL, ***unlike*** a URI, must include some piece of data that allows us to locate the resource in question, while a URI does not have this requirement.<br><br>
- URL components include the:
    - **scheme**: tells the web client how/which protocol to use to access the resource.
        - The first part of the URL
        - A scheme is different from a protocol, although these terms are sometimes used interchangeably
        - The scheme identifies which protocol should be used to access the resource, but not the specific version
        - Schemes and protocols can be differentiated by their case; the convention is to refer to scheme names in lowercase, e.g. http, and protocol names in uppercase, e.g. HTTP.
        - It is a mandatory component of the URL
    - **host (or hostname)**: It tells the client where the resource is hosted or located.
        - This is written in the format of a domain name.
        - DNS takes this human readable domain and finds the equivalent IP so the request can be routed.
        - It is a mandatory component of the URL
    - **port**: an identifier for the specific process to which the communication should be routed.
        - It is only required if you want to use a port other than the default.
        - The default port is 80 for HTTP and 443 for HTTPS.
    - **path**: It shows what local resource is being requested from the host.
        - This part of the URL is optional.
        - If the resource in question is a home page, the path might consist of a single forward slash (`/`).
        - Historically, the path has indicated specifically where the resource was located on the server, but with the proliferation of dynamically generated content, this no longer always follows the absolute file path of the server.
    - **query string/parameters**: passes additional information in the form of specially formatted query parameters to the server.
        - It is made up of query parameters and is used to send data to the server. This part of the URL is also optional.
        - Query strings are used to pass additional data to the server during an HTTP Request. They take the form of name/value pairs separated by an `=` sign. Multiple name/value pairs are separated by an `&` sign. The start of the query string is indicated by a `?`.
        - Because query strings are passed in through the URL, they are most commonly used in HTTP GET requests.
        - Query strings are limited in use: they have a maximum length, they are not suitable for sensitive information as they are plainly visible in the URL, and spaces and special characters like `&` cannot be used in them directly and must be URL encoded.<br><br>
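
A query string can be built and read back with Python's standard `urllib.parse` module. The parameter names, values, and the `example.com` host below are hypothetical. Note that `urlencode` writes spaces as `+`, a form-style convention for query strings, and encodes the literal `&` inside a value so it is not mistaken for a separator.

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical name/value pairs; the second value contains a space and an `&`.
params = {"course": "RB109", "search": "fish & chips"}

# Pairs are joined with `=`, and multiple pairs are separated by `&`.
query_string = urlencode(params)

# The `?` marks where the query string begins.
url = "https://example.com/search?" + query_string

# Reading the query string back recovers the original values.
decoded_params = parse_qs(query_string)
```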

## `Be able to construct a valid URL`

**Examples**

`https://amazon.com/Double-Stainless-Commercial-Refrigerator/B60HON32?ie=UTF8&qid=142952676&sr=93&keywords=commercial+fridge`

- **host**: `amazon.com`
- **names of the query parameters**: `ie, qid, sr, keywords`
- **values of the query parameters**: `UTF8, 142952676, 93, commercial+fridge`
- **scheme**: `https`
- **path**: `/Double-Stainless-Commercial-Refrigerator/B60HON32`
- **port**: This URL does not contain a port. Most software will use **port 443** by default when working with this URL due to it having a scheme of https, but that information is not contained within the URL.

`http://localhost:4567/todos/15`

- **host**: `localhost`
- **query parameters**: None
- **scheme**: `http`
- **path**: `/todos/15`
- **port**: `4567`

`https://launchschool.com/staff/assessments/completed?course=RB109&verdict=passed`

- **host**: `launchschool.com`
- **names of the query parameters**: `course, verdict`
- **values of the query parameters**: `RB109, passed`
- **scheme**: `https`
- **path**: `/staff/assessments/completed`
- **port**: This URL does not contain a port. Most software will use **port 443** by default when working with this URL due to it having a scheme of https, but that information is not contained within the URL.
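
The same breakdown can be checked programmatically with `urlsplit` from Python's standard `urllib.parse` module. The helper name `url_components` is made up for this sketch. Note that `port` comes back as `None` when the URL does not contain one, which matches the point above that the default port is not part of the URL itself.

```python
from urllib.parse import parse_qs, urlsplit


def url_components(url: str) -> dict:
    """Split a URL into the components discussed in these notes."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,              # None if the URL has no explicit port
        "path": parts.path,
        "query": parse_qs(parts.query),  # name -> list of values
    }


todos = url_components("http://localhost:4567/todos/15")
assessments = url_components(
    "https://launchschool.com/staff/assessments/completed?course=RB109&verdict=passed"
)
```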

## `Have an understanding of what URL encoding is and when it might be used`

- URL encoding is a special technique that replaces characters that aren't allowed in a URL with an ASCII code.
- URLs are designed to accept only certain characters in the standard 128-character ASCII character set.
- URL encoding is used if 
    - a character has no corresponding character in the original ASCII set
    - is unsafe because it can be misinterpreted or modified by some systems (i.e. `%`, spaces, quotation marks, the `#` character, `<` and `>`, `{` and `}`, `and`, and `~`)
    - or the character is reserved for special use within the url. (such a `?` which indicates the beginning of the query string or `&` which separates query parameters.  Also `/`, `:`, `@`)
- URL encoding serves the purpose of replacing these non-conforming characters with a % symbol followed by two hexadecimal digits that represent the equivalent UTF-8 character.
- Only alphanumeric and special characters `$-_.+!'()"`, and reserved characters when used for their reserved purposes can be used unencoded within a URL.  
- A reserved character that is not being used for its reserved purpose has to be encoded.
- We need a safe way to represent these characters in a URL because using them literally can "break" the URL, in that it will no longer be able to locate the resource in question.

**Examples**

- Space -> `%20`
- `$` -> `%24`
- `£` -> `%C2%A3`
- `€` -> `%E2%82%AC`
- `𐍈` -> `%F0%90%8D%88`
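
The examples above can be checked with `urllib.parse.quote` from Python's standard library. This is a small sketch rather than a full encoder: the wrapper name `encode` is made up for illustration, and passing `safe=""` is a choice made here so that `quote` also encodes `/`, which it leaves alone by default.

```python
from urllib.parse import quote, unquote


def encode(text):
    """Percent-encode text using its UTF-8 bytes (hypothetical wrapper)."""
    # safe="" asks quote() to encode everything outside its always-safe set
    return quote(text, safe="")


for ch in [" ", "$", "£", "€", "𐍈"]:
    print(repr(ch), "->", encode(ch))

# Decoding reverses the process.
print(unquote("%E2%82%AC"))
```

Each `%XX` group corresponds to one byte, which is why `£` needs two groups, `€` three, and `𐍈` four.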

# <mark>HTTP and the Request/Response Cycle</mark>

## `Be able to explain what HTTP requests and responses are, and identify the components of each`

**HTTP**

- HTTP, or Hypertext Transfer Protocol, is a system of rules (a protocol) that serves as a link between applications and the transfer of hypertext documents.
- HTTP operates at the application layer and is concerned with structuring the messages that are exchanged between applications.
- It determines how requests for resources on the web are made, as well as how those requests should be responded to.
- It provides uniformity to the way resources are transferred. In other words, it is an agreed-upon format on how to communicate.
- HTTP is based on the client-server paradigm, in which a client (usually some kind of browser) makes a request through the network for a particular web resource stored on a server.
- The server then sends a response to this request that ideally contains the requested resource or, if not, some kind of message that explains what happened.
    - The server's response provides the client with the requested resource, informs the client that the action requested has been carried out, or else informs the client that an error occurred in the process.
- HTTP governs the syntax of these messages, which together make up the request/response cycle.
- HTTP is a text-based protocol. All requests and responses are made in plain text, which makes it inherently insecure.

**Requests**

- An HTTP request is a text-based message sent from the client to the server with the aim of accessing a resource on the server.
- Entering something into the browser address bar, clicking a link, submitting a form, or any number of other user interactions with a resource on the web can instigate the sending of an HTTP request.
- It consists of a **request line**, **headers**, and an **optional body**.
- The HTTP request line contains the **method**, **path**, and **version**
    - The **method** indicates what kind of action the request is performing (for example, `GET` or `POST`).  This is required.
    - The **path** indicates where to find the particular resource locally within the server.  This is required.
        - Technically speaking the 'path' portion of the request-line is known as the 'request-URI', and incorporates the actual path to the resource and the optional parameters if present. In practice, most people simply refer to this part of the request-line as the 'path'.
    - The **version** tells us which version of HTTP is being used (e.g. 1.0, 1.1, 2).  As of HTTP 1.0, the HTTP version also forms part of the request-line.
    - The **parameters** are optional.
- **HTTP Headers**: allow the client and the server to send additional information during the HTTP request/response cycle.
    - a way to give more information about both the client and the resource that is being requested
    - Headers are colon-separated name-value pairs that are sent in plain text.
    - The **host** header:
        - has been required since HTTP 1.1
        - indicates where the resource in question is located as a server may contain many hosts
    - All other headers are optional
    - Other headers might include:
        - `Accept-Language` lists which languages are accepted by the client
        - `User-Agent` specially formatted string that identifies the client software, such as the browser and its version
        - `Cookie` information about cookies that help applications maintain the appearance of state
        - `Connection` what type of connection the client prefers (such as `keep-alive`)
- **HTTP body** - the body contains the data that is being transmitted in an HTTP message and is optional. 
    - In other words, an HTTP message can be sent with an empty body. When used, the body can contain HTML, images, audio and so on.
    - What this looks like depends on the request method used
    - The body is mainly used with a `POST` request, which is used to send data to the server
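
To make the request structure above concrete, here is a minimal sketch that splits a raw HTTP/1.1 request into its request line, headers, and body. The function name `parse_request` and the sample request (a form submission to a made-up `/todos` path on `localhost:4567`) are assumptions for illustration; real servers handle many more edge cases.

```python
def parse_request(raw):
    """Split a raw HTTP/1.1 request into its parts (illustrative only)."""
    # A blank line (CRLF CRLF) separates the head from the optional body.
    head, _, body = raw.partition("\r\n\r\n")
    request_line, *header_lines = head.split("\r\n")
    method, path, version = request_line.split(" ")
    headers = {}
    for line in header_lines:
        # Headers are colon-separated name-value pairs.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return {
        "method": method,
        "path": path,
        "version": version,
        "headers": headers,
        "body": body,
    }


# Hypothetical request, shown as the plain text that travels over the network.
RAW_REQUEST = (
    "POST /todos HTTP/1.1\r\n"
    "Host: localhost:4567\r\n"
    "Content-Type: application/x-www-form-urlencoded\r\n"
    "Content-Length: 14\r\n"
    "\r\n"
    "title=Buy+milk"
)

print(parse_request(RAW_REQUEST))
```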

**Responses**

- HTTP responses are text-based messages sent from the server to the client with the aim of responding to the client's request.
- Raw data returned by the server is called a response.
- They either:
    - Provide the client with the resource required
    - Inform the client that the action it requested was carried out
    - Inform the client that an error occurred in the process
- A response consists of a **status line**, **optional headers**, and an **optional body**.
- The **status line** contains the **version**, **status code**, and **status text**, in that order
    - The status code is a three digit number indicating the specific status of the response, i.e. whether or not it was successful
    - It is accompanied by the status text that tells the status of the response
- HTTP response **headers** contain additional information about the response.  These are optional.
    - `Content-Encoding` information about the type of encoding used on the data
    - `Server` the name of the server
    - `Location` a new resource location if applicable (Location header), which helps the client redirect to the requested resource if it has been moved
    - `Content-Type` the content-type (i.e. text/html), which helps the client correctly render the data in a user friendly way
- The HTTP response **body** consists of the raw data for the requested resource.  This is optional.
    - This might be the HTML of the webpage, or the raw data of any files being requested, such as images, videos, or audio files

## `Be able to describe the HTTP request/response cycle`

- The HTTP request/response cycle is the interaction between client (browser) and server (software running on the server) in which the client makes a request and the server makes a response.
- It begins with the client making an HTTP request.
    - For our purposes, this is typically issued by a browser in response to some kind of user action or event (e.g. typing a url into an address bar, clicking a link, submitting a form, etc).
    - Components of an HTTP request (only the method and path are always required):
        - The method (`GET`, `POST` ...)
        - The path (`/task`)
        - Parameters (optional)
        - Headers (optional)
            - **The `Host` header is required for HTTP/1.1**
            - HTTP 1.1 requires that the client sends the host name in a `Host` request header to the server. This header is typically set by the client automatically, based on the URI requested by the user.
        - Body (optional)
    - **The domain name is just used to determine what server to send the request to, but it's not a part of the request itself.  Once a connection has been established between a host and server, the domain name is not really used again.**
    - The request is sent off to the server by means of the lower layer network protocols.
- When the server receives the request, it will analyze it.
    - This may include actions like verifying the user's session, loading any necessary data from a database or rendering HTML
- Once the server has analyzed the request, it will send a response to the client.  This includes:
    - Status: A numeric code and a short string of text (ex. 200 OK). Used to signify if the request was successful or not.
    - Headers: A collection of metadata about the contents of the response. (ex. `Content-Type: text/html`).
        - Helps the client process the response
        - This value tells the browser that once it receives the response it can then be displayed in that format (ex. as if it were a webpage).
    - Body: the bulk of the actual raw data being sent.
        - In the case of a web page, the body will contain all the HTML code that the browser will use to display the result to the user.
- When the browser receives the response, it will process the information within and render the resource in a user-friendly manner.
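
The whole cycle can be exercised locally with Python's standard library. This is a sketch under assumptions, not a production setup: `TaskHandler` and `run_cycle` are made-up names, the server only knows the `/task` path used as an example above, and it listens on the loopback address with a port chosen by the operating system.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


class TaskHandler(BaseHTTPRequestHandler):
    """Server side: analyze the request, then send status, headers, and body."""

    def do_GET(self):
        found = self.path == "/task"
        body = b"<h1>Tasks</h1>" if found else b"<h1>Not Found</h1>"
        self.send_response(200 if found else 404)        # status line
        self.send_header("Content-Type", "text/html")    # headers
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)                           # body

    def log_message(self, format, *args):
        pass  # keep the example quiet


def run_cycle(path):
    """Client side: send one GET request and return what came back."""
    server = HTTPServer(("127.0.0.1", 0), TaskHandler)  # port 0: OS picks a free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        conn = HTTPConnection("127.0.0.1", server.server_port)
        conn.request("GET", path)  # http.client adds the Host header itself
        response = conn.getresponse()
        result = (
            response.status,
            response.reason,
            response.getheader("Content-Type"),
            response.read(),
        )
        conn.close()
        return result
    finally:
        server.shutdown()
        server.server_close()
```

Calling `run_cycle("/task")` performs one complete request/response cycle; a browser would go one step further and render the returned HTML.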

## `Be able to explain what status codes are, and provide examples of different status code types`

- **Status Codes**: three-digit numbers that are part of the status line in a HTTP Response. They indicate the status of the request. There are various categories of status code:
    - **200 OK**: the request was successfully handled, and the resource has been transmitted.  All 200 level response codes indicate success
    - **302 Found**: When your browser sees a response status code of 302, it knows that the resource has been moved, and will automatically follow the new re-routed URL in the `Location` response header.
        - All 300 level status codes indicate some kind of redirect status
        - When the browser receives the 302 response, it will automatically issue an HTTP request to the updated URL provided in the `Location` header.
        - This, ideally, will result in the HTTP 200 OK response so that the browser can render the resource for the user.
    - **404 Not Found**: The server returns this status code when the requested resource cannot be found due to a client error with the request
        - All 400 level status codes indicate various client errors
    - **500 Internal Server Error**: A 500 status code says "there's something wrong on the server side".
        - Indicates a generic server-side error took place while trying to retrieve the requested resource.
        - All 500 level status codes indicate server side errors.
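
The categories above follow from the first digit of the code. A small sketch using the standard `http.HTTPStatus` enum for the status text; the `CATEGORIES` table and the `describe` function are names made up for illustration.

```python
from http import HTTPStatus

# First digit of the status code mapped to the categories listed above.
CATEGORIES = {2: "success", 3: "redirect", 4: "client error", 5: "server error"}


def describe(code):
    """Return a status code with its standard text and category."""
    phrase = HTTPStatus(code).phrase
    category = CATEGORIES.get(code // 100, "other")
    return f"{code} {phrase} ({category})"


for code in (200, 302, 404, 500):
    print(describe(code))
```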

## `Understand what is meant by 'state' in the context of the web, and be able to explain some techniques that are used to simulate state`

**State**

- A "stateful" web application is one that maintains knowledge of past interactions
- This might include keeping track of individual user accounts and maintaining a "logged in" status across multiple resource requests and refreshes.
- When your e-mail client identifies you by name and displays some kind of customized greeting, this is also an aspect of "state"
- When you go to Facebook, for example, and log in, you expect to see the internal Facebook page. That was one complete request/response cycle. You then click on the picture -- another request/response cycle -- but you do not expect to be logged out after that action.
- Statefulness can be simulated through techniques which use **session IDs**, **cookies**, and **AJAX**.
    - There's also a 4th approach: sending stateful data as query parameters when making a request. This approach used to be nearly universal, but is mostly gone from all modern web sites.

**Stateless**

- HTTP is a stateless protocol. This means that each request/response cycle is independent of the requests and responses that came before or those that come after.
- No information is kept on the server between request/response cycles
- Stateless protocols are resilient, fast, and flexible as the server doesn't have to retain any information between each request/response cycle nor does any part of the system have to perform any clean up.
- However, because of the statelessness of HTTP, it can be very difficult to simulate a stateful experience and make it seem like a persistent connection exists as many modern web apps do.
- This statelessness is what makes HTTP and the internet so distributed and difficult to control, but it's also the same ephemeral attribute that makes it difficult for web developers to build stateful web applications.

**Sessions**

- One way to maintain a sense of statefulness is to have the server send some form of a unique token to the client. 
- Whenever a client makes a request to that server, the client appends this token as part of the request, allowing the server to identify clients.
- We call this unique token that gets passed back and forth the **session identifier**.
- This mechanism of passing a `session id` back and forth between the client and server creates a sense of persistent connection between requests.
- This sort of faux statefulness has several consequences:
    - This can be difficult to maintain because each HTTP request must be analyzed for a `session id`.
    - Furthermore, each `session id` must be validated and the server must establish procedures for invalid ids;  the server needs to retrieve the session data based on the session id
    - If a session id is valid, the server needs to store and retrieve data associated with each `session id`, as well as recreate the application state from that data when sending back a response
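
The server-side bookkeeping described above might look like the following in-memory sketch. The `SessionStore` class and its method names are assumptions for illustration; a real application would typically persist sessions in a database or cache and expire them.

```python
import secrets


class SessionStore:
    """Toy in-memory session store (hypothetical; not for production use)."""

    def __init__(self):
        self._sessions = {}

    def create(self, data):
        # Generate a hard-to-guess token to serve as the session id.
        session_id = secrets.token_hex(16)
        self._sessions[session_id] = dict(data)
        return session_id

    def load(self, session_id):
        # Each incoming id must be validated; unknown ids yield None.
        return self._sessions.get(session_id)


store = SessionStore()
sid = store.create({"user": "ann", "logged_in": True})
print(store.load(sid))          # the session data for a valid id
print(store.load("not-valid"))  # None for an invalid id
```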

**Cookies or HTTP cookies**

- Cookies are a way for the browser to store data sent from the server that helps maintain the appearance of persistent application state. They work in conjunction with `session id`s.
- A cookie is a piece of data that's sent from the server and stored in the client (browser) during a request/response cycle.  It contains information about the `session id`.
- Cookies are small files that are stored in the browser and contain the session information.
- These files are stored even if the browser is closed or shut down, which enables a longer and more consistent appearance of state.
- Session data is generated and stored on the server-side and the session id is sent to the client in the form of a cookie.
- The information stored in cookies is sent with each request to the server, then used to "unlock" the correct stored session data
- The client side cookie is compared with the server-side session data on each request to identify the current session.
- This allows the server to recreate the correct state of the application, and the session id to be recognized each time a website is visited, even if some time has passed.  When you visit the same website again, your session will be recognized because of the stored cookie with its associated information.
- The session id is stored on the client, and it is used as a "key" to the session data stored server side.<br><br>
- In fact, this is what many web applications with authentication systems do. When a user's username and password match, the session id is stored on their browser so that on the next request they won't have to re-authenticate.
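
On the wire, this exchange uses the `Set-Cookie` response header and the `Cookie` request header. A minimal sketch with the standard `http.cookies` module; the cookie name `session_id` and both function names are assumptions for illustration.

```python
from http.cookies import SimpleCookie


def set_cookie_header(session_id):
    """Server side: build the header line that hands the session id to the browser."""
    cookie = SimpleCookie()
    cookie["session_id"] = session_id
    cookie["session_id"]["httponly"] = True  # hide the cookie from page scripts
    return cookie.output()  # a "Set-Cookie: ..." header line


def read_session_id(cookie_header_value):
    """Server side: pull the session id back out of a later request's Cookie header."""
    cookie = SimpleCookie()
    cookie.load(cookie_header_value)
    morsel = cookie.get("session_id")
    return morsel.value if morsel else None


print(set_cookie_header("abc123"))
print(read_session_id("session_id=abc123; theme=dark"))
```

The value returned by `read_session_id` is what the server would then use as the "key" to look up the stored session data.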

**AJAX or Asynchronous JavaScript and XML**

- Its main feature is that it allows browsers to issue requests and process responses without a full page refresh.
- When AJAX is used, all requests sent from the client are performed asynchronously, which just means that the page doesn't refresh.
- Modern web pages tend to be fairly complex, including dynamically generated content as well as many resource dependencies.
- Therefore, it behooves us to have a means of responding to both server data and user actions without having to refresh and reload the whole page.
- AJAX enabled this functionality, allowing the client to send and retrieve information in small pieces that can be used to update the state of an application without refreshing/reloading, making it much easier to maintain state.
- AJAX requests are sent like normal HTTP requests, and the server responds to them with a normal HTTP response.
- Instead of the browser refreshing to process the HTTP response, it will process the response with a **callback** function (which is usually some client-side JavaScript code), which can update the state of the web app.

## `Explain the difference between GET and POST, and know when to choose each`

**GET Requests**

- **GET requests**: Used to retrieve a resource from the server
- GET requests are typically initiated by clicking a link or via the address bar of a browser.
- The response from a GET request can be anything, but if it's HTML and that HTML references other resources, your browser will automatically request those referenced resources. A pure HTTP tool will not.

**POST requests**

- **POST request**: Used when you want to initiate some action on the server (server side action), or send data to a server.
- Typically, from within a browser, you use POST when submitting a form (for example, for user authentication).
- Without `POST` requests, we are limited to sending data to the server via query strings
- Using a `POST` request in a form fixes the problem of exposing credentials in the URL query string. With a `POST` request, we can send more sensitive data, such as a username or password, in the request body.
- `POST` requests also help sidestep the query string size limitation that you have with GET requests. With POST requests, we can send significantly larger forms of information (such as images or videos) to the server.
- Search forms are a notable exception to this rule: they often use `GET` since they are not changing any data on the server, only viewing it.
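
The difference in where the data travels can be seen by building (without sending) the two kinds of request with `urllib.request.Request`. The `localhost:4567` paths, the parameter names, and both function names are made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_search(term):
    """GET: the data rides in the URL's query string, visible to anyone who sees the URL."""
    query = urlencode({"q": term})
    return Request(f"http://localhost:4567/search?{query}", method="GET")


def build_login(username, password):
    """POST: the data rides in the request body, not in the URL."""
    body = urlencode({"username": username, "password": password}).encode("ascii")
    return Request("http://localhost:4567/login", data=body, method="POST")


search = build_search("red shoes")
login = build_login("ann", "s3cret")
print(search.get_method(), search.full_url, search.data)
print(login.get_method(), login.full_url, login.data)
```

Keeping credentials out of the URL does not encrypt them; over plain HTTP the body is still readable text.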

# <mark>Security</mark>

## `Have an understanding of various security risks that can affect HTTP, and be able to outline measures that can be used to mitigate against these risks`

**HTTP risks**

- HTTP is a text-based protocol, and all its requests and responses consist of plain text.  As such, HTTP is inherently insecure.
- As the client and server send requests and responses to each other, all information in both requests and responses is sent as strings. If a malicious hacker were attached to the same network, they could employ ***packet sniffing*** techniques to read the messages being sent back and forth.
- As we learned previously, requests can contain the session id, which uniquely identifies you to the server. If someone else copied this session id, they could craft a request to the server and pose as your client, and thereby be logged in automatically without even having access to your username or password.

**HTTPS**

- A resource that's accessed by HTTPS will start with `https://` instead of `http://`, and usually be displayed with a lock icon in most browsers.
- With HTTPS every request/response is encrypted before being transported on the network. This means if a malicious hacker sniffed out the HTTP traffic, the information would be encrypted and useless.
- HTTPS sends messages through a cryptographic protocol called **TLS** for encryption.
    - These cryptographic protocols use certificates to communicate with remote servers and exchange security keys before data encryption happens. 
    
**Same-origin policy**

- The same-origin policy permits unrestricted interaction between resources originating from the same origin, but restricts certain interactions between resources originating from different origins.
- By **origin**, we mean the combination of the **scheme**, **host**, and **port**.  Only those resources that share all three aspects are allowed to issue requests unrestrictedly.
    - So `http://mysite.com/doc1`:
        - has the same origin as `http://mysite.com/doc2`
        - but a different origin from `https://mysite.com/doc1` (different scheme),
        - `http://mysite.com:4000/doc1` (different port),
        - and `http://anothersite.com/doc1` (different host).
- This helps prevent attackers on a different origin from gaining access to `session id`s or other session information.
- Designing for the same-origin policy can help to mitigate the lack of security in HTTP by restricting interactions between resources.
- The same-origin policy is an important guard against **session hijacking** attacks and serves as a cornerstone of web application security.
    - **Session Hijacking** refers to a malicious action in which a hacker uses a stolen `session id` to pose as an authenticated user and share that user's session
    - When an attacker gets a hold of the session id, both the attacker and the user now share the same session and both can access the web application.
    - Because a `session id` is used to identify a user to the server, it can also be used by hackers to pose as the user and get logged in without needing to authenticate with a username and password.
    - Countermeasures for Session Hijacking include:
        - **Resetting sessions**. With authentication systems, this means a successful login must render an old `session id` invalid and create a new one.
        - **Setting an expiration time on sessions** gives attackers a narrower window for access to the `session id`.
        - **Use HTTPS across the entire app** to minimize the chance that an attacker can get to the `session id`

**Cross-Site Scripting (XSS)**

- This type of attack happens when you allow users to input HTML or JavaScript that ends up being displayed by the site directly.
- Websites that allow some kind of input, such as allowing users to enter a comment that will be displayed, must protect against cross site scripting or XSS.
- Because it's just a normal HTML `<textarea>`, users are free to input anything into the form. This means users can add raw HTML and JavaScript into the text area and submit it to the server as well
- If the server side code doesn't do any sanitization of input, the user input will be injected into the page contents, and the browser will interpret the HTML and JavaScript and execute it.
- Potential solutions for cross-site scripting include:
    - making sure to always **sanitize user input**. This is done by eliminating problematic input, such as `<script>` tags, or by disallowing HTML and JavaScript input altogether.
    - **Escape all user input** data when displaying it so that the browser does not interpret it as code.  (To escape a character means to replace an HTML character with an HTML entity, a combination of ASCII characters that tells the client to display that character as is and not to process it.)
    - Sites can also choose to only **accept a safer form of input**, such as Markdown.
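
Escaping can be illustrated with Python's standard `html.escape`. The `render_comment` function is a made-up stand-in for whatever template code displays user input; real applications usually rely on their templating library's built-in escaping.

```python
import html


def render_comment(comment):
    """Escape user input before placing it into the page (hypothetical helper)."""
    return f"<p>{html.escape(comment)}</p>"


malicious = "<script>alert(1)</script>"
print(render_comment(malicious))
```

Because `<` and `>` become `&lt;` and `&gt;`, the browser displays the text of the script tag instead of executing it.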

## `Be aware of the different services that TLS can provide, and have a broad understanding of each of those services`

**TLS**

- Because HTTP is a text based protocol, it is inherently insecure.
- Any intercepted requests/responses are easy to read.
- Furthermore, HTTP is a fairly simple protocol, concerned only with basic message structure.
- It provides no check for whether or not the source of an HTTP response is trustworthy, nor does it provide a means of determining if the messages are being tampered with in transit.
- When thinking about TLS it can be useful to think of it as operating between HTTP and TCP.
- TLS adds security to HTTP communications.
- Purpose of TLS:
    - TLS enables us to provide encryption to the inherently insecure plain text of the HTTP protocol.  Encryption is a process of encoding a message so that it can only be read by those with an authorized means of decoding the message
    - It provides authentication services: a process to verify the identity of a particular party in the message exchange, such as checking whether the source of an HTTP response is trustworthy.
    - It also provides a means of ensuring data integrity, that is, detecting whether or not HTTP messages have been tampered with or faked in transit
    
**TLS Encryption**

- Allows us to encode messages so that they can only be read by those with an authorized means of decoding the message
- TLS encryption uses a combination of Symmetric Key Encryption and Asymmetric Key Encryption. Encryption of the initial key exchange is performed asymmetrically, and subsequent communications are symmetrically encrypted.
- This secure channel is established with the TLS handshake, which uses both symmetric and asymmetric key encryption
- Encryption increases security, but setting it up adds up to two round-trips of latency, which impacts performance
    - **Symmetric Key Encryption**: an encrypted communication system in which both the sender and receiver possess a shared encryption key.
        - The advantages to this are that it facilitates two-way communication. Both parties can use the shared key to encode, send, and decode messages to and from the other.
        - The disadvantage is that a symmetric system relies on the fact that no one else has access to the key in order for it to remain secure.
        - This means that it requires a secure way for both parties to exchange keys before symmetric encryption can be established, and this is difficult to do on the web (can't exchange keys in-person when communicating over a network).
        - For this reason, it is used in conjunction with asymmetric key encryption, which facilitates a secure exchange of a shared key
    - **Asymmetric Key Encryption**: also known as public key encryption, an encrypted communications system which uses two distinct keys: a public key and a private key.
        - The public key is used to encrypt and send a secure message to the recipient, who holds the private key, which is used to decode the encrypted message.
        - This only facilitates one way communication, in which only the party who holds the private key can receive and decode secure communications.
        - Encryption is primarily intended to work in one direction.  Bob can send Alice messages encrypted with the public key which she can then decrypt with the private one, but they can't be used in the other direction.
        - Even though it only works one way, we can use asymmetric key encryption as a means for hosts to securely exchange symmetric encryption keys during the TLS handshake process.
        - Unlike the symmetric system where the same key is used to encrypt and decrypt messages, in the asymmetric system the keys in the pair are non-identical: the public key is used to encrypt and the private key to decrypt.<br><br>
    - **The TLS Handshake**:  a special process that takes place after the TCP Handshake in which the client and the server exchange encryption keys.  This is how TLS sets up an encrypted connection.
    - This exchange allows both parties to communicate via encrypted messages, thus giving a security advantage over the inherently insecure messages of HTTP.
    - TLS uses a combination of symmetric and asymmetric cryptography.
        - The bulk of the message exchange is conducted via symmetric key encryption but the initial symmetric key exchange is conducted using asymmetric key encryption.
        - Asymmetric key encryption is a mechanism used to encrypt the encryption key itself, so that even if it is intercepted it can't be used.
    - The key points to remember about the TLS Handshake process are that it is used to:
        - Agree on which version of TLS will be used in establishing a secure connection.
        - Agree on the various algorithms that will be included in the cipher suite.
        - Enable the exchange of symmetric keys that will be used for message encryption.
    - The TLS Handshake must be performed before secure data exchange can begin; it involves up to two round-trips of latency and therefore has an impact on performance.<br><br>
    - How the TLS Handshake is implemented:
        - The client sends a message to the server in the form of a `ClientHello`, which includes the maximum version of TLS protocol it supports and a list of available cipher suites.
        - The server responds with a `ServerHello`, which contains a decision regarding which TLS version and cipher suite will be used. It also includes the server's certificate and public key. This ends with a `ServerHelloDone` marker.
        - Next the client initiates the symmetric key exchange process, using the server's public key for asymmetric key encryption.
        - Once the keys have been exchanged, the server sends a ready-to-go message using the symmetric key and secure message exchange commences.<br><br>
    - Trade offs:
        - Allows us to implement secure message exchange over the inherently insecure text based protocol of HTTP
        - Because the TLS handshake is a complex process, it can add up to two round-trips of latency, which has an impact on speed and performance.
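
The idea of using asymmetric encryption only to exchange a symmetric key can be sketched with toy cryptography. Everything here is an assumption for illustration and is NOT secure: the RSA key pair uses tiny textbook primes (61 and 53), the symmetric "cipher" is a simple XOR against a hash-derived keystream, and none of it reflects the actual algorithms or message formats that TLS negotiates.

```python
import hashlib
import secrets

# Toy RSA key pair from the primes 61 and 53; far too small to be secure.
N, E, D = 3233, 17, 2753  # public key: (N, E); private key: (N, D)


def rsa_encrypt(number):
    """Anyone may do this with the server's public key."""
    return pow(number, E, N)


def rsa_decrypt(number):
    """Only the holder of the private key can do this."""
    return pow(number, D, N)


def xor_cipher(shared_key, data):
    """Toy symmetric cipher: the same call with the same key encrypts and decrypts."""
    key_bytes = shared_key.to_bytes(2, "big")
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key_bytes + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


def handshake_and_send(message):
    # Client: choose a symmetric key and encrypt it with the server's public key.
    client_key = secrets.randbelow(N - 2) + 2
    encrypted_key = rsa_encrypt(client_key)
    # Server: recover the symmetric key using its private key.
    server_key = rsa_decrypt(encrypted_key)
    # Both sides now hold the same key and switch to symmetric encryption.
    ciphertext = xor_cipher(client_key, message)
    return server_key == client_key, xor_cipher(server_key, ciphertext)


print(handshake_and_send(b"GET /todos/15 HTTP/1.1"))
```

An eavesdropper sees only `encrypted_key` and `ciphertext`; without the private exponent it cannot recover the shared key (in a real system with properly sized keys).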
        
**TLS authentication**  

- Provides a means of verifying the identity of a participant(s) in a message exchange,  to ensure that party is trustworthy.
- This ensures that the source of an HTTP response is trustworthy, and so the provided resource can be safely processed.
- TLS Authentication is implemented through the use of Digital Certificates, which are signed by a chain of Certificate Authorities.  
- Digital certificates are provided by the server during the TLS handshake.
- The certificate includes a public key, a signature (which consists of data encrypted with the private key), and the original data that was used to create the signature.
- Upon receipt, the receiver decrypts the signature with the public key and checks that it matches against the original data, which tells it that the sender is who it says it is (because it holds the private key).
- The digital certificate the server provides is considered to be trustworthy on the basis of the issuing certificate authority and the chain of trust.
- Certificates are signed by a **Certificate Authority**, and work on the basis of a Chain of Trust which leads to one of a small group of highly trusted Root CAs.
    - Certificate Authorities are trustworthy sources that issue certificates used by servers to establish authentication.
    - We use certificates provided by these authorities to ensure that the certificate in question is not being faked.
    - Certificate authorities exist in a hierarchy known as the "chain of trust"
    - Within this hierarchy, the certificate for lower level authorities is signed by the CA one level above it
    - At the top of the chain there exists a Root CA whose certificate is "self-signed"
    - These consist of a small group of organizations who have proved their high level trustworthiness through prominence and longevity.
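
In practice, client libraries perform this certificate checking for us. As a small sketch, Python's standard `ssl` module can build a client context whose defaults require a certificate that chains to a trusted CA and matches the host name; the function name `strict_client_context` is made up for illustration.

```python
import ssl


def strict_client_context():
    """Build a client-side TLS context with certificate verification enabled."""
    # create_default_context() loads the system's trusted root CA certificates.
    return ssl.create_default_context()


context = strict_client_context()
print(context.verify_mode == ssl.CERT_REQUIRED)  # the server must present a valid certificate
print(context.check_hostname)                    # the certificate must match the requested host
```

A connection made with this context fails during the TLS handshake if the server's certificate cannot be traced back to a trusted Root CA.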
    
**TLS Integrity** 

- TLS integrity provides a means of checking whether a message has been altered or interfered with in transit.
- Data that is being exchanged via HTTP is encapsulated within the TLS payload
- Metadata fields such as the Message Authentication Code (MAC) allow us to check whether the message has been interfered with.
- This is slightly different than a regular checksum, which is only concerned with error detection.
- The sender creates a digest of the data payload with a hashing algorithm (pre-agreed upon in the TLS handshake)
- The data payload, together with the digest, is then encrypted with the symmetric key and sent to the receiver
- The receiver decrypts the data, creates a digest with the same pre-agreed upon hashing algorithm, and checks to see if the two match.
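
The digest comparison above can be sketched with Python's standard `hmac` and `hashlib` modules. This is only an illustration of the idea: the function names and the sample key are made up, SHA-256 stands in for whatever hashing algorithm the handshake agreed on, and the encryption step is left out.

```python
import hashlib
import hmac


def make_mac(shared_key, payload):
    """Sender: create a keyed digest (MAC) of the data payload."""
    return hmac.new(shared_key, payload, hashlib.sha256).digest()


def is_untampered(shared_key, payload, received_mac):
    """Receiver: recompute the digest and check that the two match."""
    expected = make_mac(shared_key, payload)
    return hmac.compare_digest(expected, received_mac)


key = b"hypothetical-shared-key"
message = b"HTTP/1.1 200 OK"
mac = make_mac(key, message)

print(is_untampered(key, message, mac))             # payload arrived unchanged
print(is_untampered(key, b"HTTP/1.1 500 ??", mac))  # payload was altered in transit
```

Because the digest depends on the shared key, someone who alters the message cannot produce a matching MAC, which is what sets it apart from a plain checksum.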