# Denison CS181/DA210 SW Lab #13 - Step 1

Before you turn this problem in, make sure everything runs as expected. This is a combination of **restarting the kernel** and then **running all cells** (in the menubar, select Kernel$\rightarrow$Restart And Run All).

Make sure you fill in any place that says `# YOUR CODE HERE` or "YOUR ANSWER HERE".

---

#### Setup

Note that for these exercises, we'll use `mysocket.py`, provided with the book, and available to you now in `modules/` in this repository.

In [None]:
import os
import os.path
import sys
import importlib

if os.path.isdir(os.path.join("../../..", "modules")):
    module_dir = os.path.join("../../..", "modules")
else:
    module_dir = os.path.join("../..", "modules")

module_path = os.path.abspath(module_dir)
if module_path not in sys.path:
    sys.path.append(module_path)

import mysocket as sock
importlib.reload(sock)

---

## Part A: Identifying Resources with URLs and URIs

Uniform Resource Identifiers (URIs) and Uniform Resource Locators (URLs) define a standard notation for specifying the files, data, and resources of the internet.  Note that URI is the broader term, so all URLs are URIs.

Using an explicit protocol scheme, host location, and resource path, URLs can be used to uniquely identify a resource at a specific location on the internet.  These components are summarized in the following table:

Item | Description
:----|:--------------
_protocol_ | The network stack layer above TCP; we'll use `http` and `https`
_location_ | The server/host machine within the internet
_port_     | Identifies the server program that accepts connections; by default, port 80 is used for HTTP web server programs and port 443 for HTTPS web server programs
_resource-path_ | Identifies a particular resource within the host/port endpoint; could also include a query string

The _resource-path_ given above is a resource relative to the _location_.  This starts with a `'/'` to indicate the root, and then is specified like we have seen for trees in a file system.

The _resource-path_ includes the _endpoint-path_ and an optional _query-string_, which is a &-separated list of name-value pairs:

_query-string_ |= ?_name_=_value_[&_name_=_value_]*

Note that the individual components of a URL may not contain spaces or other special characters, such as `:`, `/`, `=`, or `&`, unless they are percent-encoded, because these characters have special meanings as separators.
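
Characters with special meanings can still appear inside a component if they are percent-encoded first. The following is a minimal sketch using Python's standard `urllib.parse` module; the sample values are made up for illustration and are not part of the lab.

```python
from urllib.parse import quote, urlencode

# A space inside a path segment is encoded as %20.
encoded_segment = quote("Denison University")
print(encoded_segment)

# urlencode builds a query string (without the leading '?') and
# encodes characters such as '&' that would otherwise be ambiguous.
query_string = urlencode({"q": "cats&dogs", "page": "2"})
print(query_string)
```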

The general form of a URL is given by the following (shown with extra spaces for readability):

_protocol_ : // _location_ [ : _port_ ] _resource-path_
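
As a rough illustration of this general form, Python's standard `urllib.parse.urlsplit` can pull a URL apart into its components. The URL below is a made-up example, and note that `urlsplit` reports the endpoint path and the query string separately, whereas this lab treats both together as the _resource-path_.

```python
from urllib.parse import urlsplit, parse_qs

# A made-up URL, used only to illustrate the pieces.
url = "https://www.example.com:8443/search/results?q=sockets&page=2"
parts = urlsplit(url)

protocol = parts.scheme      # text before '://'
location = parts.hostname    # the host machine
port = parts.port            # an int when a port is given explicitly
endpoint_path = parts.path   # path portion, starting with '/'
query_string = parts.query   # text after '?', without the '?'

# parse_qs turns the query string into a dict of name -> list of values.
pairs = parse_qs(query_string)
print(protocol, location, port, endpoint_path, pairs)
```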

**Q1:** Type the following URL in a web browser: http://datasystems.denison.edu:80/topnames.html.  What are the _protocol_, _location_, _port_, and _resource-path_ for this URL?

YOUR ANSWER HERE

**Q2:** Now, use a search engine to search for "Denison University".  What are the _protocol_, _location_, _port_, and _resource-path_ for the resulting URL? 

YOUR ANSWER HERE

> You've reached the first checkpoint in the lab.  Make sure to have it signed off by the instructor or TA.
>
> Checkpoint 1: Web browsers provide some shortcuts.  Which parts of the URL can we _not_ specify?  What are the defaults in that case?  (Hint: try leaving out parts of the URL and see if you still get to the same page.)

---

## Part B: HTTP Definition

Web browsers are simply programs that request data (often HTML of web pages) from web servers, and display them to the user.  HTTP exists to enable these requests.

As discussed in class, HTTP is an application protocol, and is therefore built on TCP and the sockets interface.

1. The web server program is in an "always ready" state, waiting with an unresolved TCP socket endpoint, listening for requests on port 80 (for HTTP).
2. A client (e.g., your web browser or this notebook) makes a TCP connection to the server endpoint, and bidirectional communication is initiated.
3. The client constructs an _HTTP request_, described below.
4. The request is sent over the TCP socket connection to the server.
5. The server receives the request and processes it, constructing an _HTTP response_.
6. The response is sent over the TCP socket connection back to the client.
7. The client receives the response and processes it.
8. Both the client and server close the TCP socket connection.

Note that steps 3-7 can happen just once or many times, depending on the HTTP request parameters.

#### HTTP message format

HTTP messages, both requests and responses, have the following syntax:

_message_ |= _start-line_ \
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[ _header-line_ ]* \
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;_empty-line_ \
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[ _body_ ]

The _start-line_ is either a _request-line_ for a request message or a _status-line_ for a response message:

_start-line_ |= _request-line_ | _status-line_.

Note that for both, the optional body is separated from any headers by an empty line.  Each line (the _start-line_, every _header-line_, and the _empty-line_) is terminated by a carriage return (`'\r'`) followed by a newline character (`'\n'`): `'\r\n'`.

----

#### HTTP requests

For an HTTP request, the _request-line_ specifies the method, URI, and version.

The method is one of several defined by HTTP, including `GET`, `POST`, `HEAD`, `PUT`, and `DELETE`.  For now, we'll focus on `GET` requests, which retrieve data and do not include a body in the HTTP request message.

The version is either `'HTTP/1.0'` or `'HTTP/1.1'`.

An example _request-line_ is given in the following cell:

In [None]:
# method: GET
# URI: /
# version: HTTP/1.1
# line ends with: \r\n
request_line = 'GET / HTTP/1.1\r\n'
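
Going the other way, the three fields of a _request-line_ can be recovered by removing the line ending and splitting on spaces. This is only a small sketch on the same example line; the variable names are chosen here for illustration.

```python
# Same example request-line as in the cell above.
example_line = 'GET / HTTP/1.1\r\n'

# Drop the trailing '\r\n', then split on the single spaces between fields.
method, uri, version = example_line.rstrip('\r\n').split(' ')
print(method, uri, version)
```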

#### Request/Response Sequence

The following steps outline the general sequence of client-side communication over HTTP:

1. Establish a TCP socket connection with server machine $\textit{host}$ at port $\textit{port}$; call it `connection`.  
2. Build a correctly formatted HTTP request string and assign it to a string variable, `request_message`.  This will use an HTTP method (`GET`), include a `Host` header with value $\textit{host}$, and use the $\textit{resource-path}$ as the URI in the *request-line*.  
3. Perform a `send` of the string `request_message` over `connection`.  
4. Perform a `receive` of the HTTP message response from `connection`.  Assuming a valid response message, this must retrieve the string of characters up to and including the $\textit{empty-line}$ after the message headers, and then, based on additional information, must retrieve the `body` of the response.  
5. Perform a `close()` operation on `connection`.  

A module, `mysocket`, is included with our textbook and was imported above with `import mysocket as sock`.  It provides the following helper functions:

Function                                           | Description
---------------------------------------------------|-------------------------------------------------------------------
`makeConnection(host, port)`                       | Establish a TCP connection from the client machine to a server at the given machine `host` and listening at the given `port`. This returns the socket connection.  This corresponds to Step 1 of the client-side steps.
`sendString(conn, s)`                              | Given an established socket `conn`, take `s`, a string, and send it over the connection.  This corresponds to Step 3 of the client-side steps, where `s` would define all the characters making up a complete HTTP request.
`receiveTillClose(conn)`                            | This performs a socket `recv()` from the connection, consuming data until the server closes the connection.  This returns the complete HTTP response message. This corresponds to Step 4 of the client-side steps, and assumes that a connection close will define the end of the response message.

-----------------------------------------------------------------------------------------------------------------------
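
For intuition about what helpers like these do, here is a minimal sketch of similar functions written directly against Python's standard `socket` module. The function names and details below are assumptions made for illustration; they are not the textbook's actual `mysocket` implementation, which may differ (for example, in encoding or error handling).

```python
import socket

def make_connection(host, port):
    """Open a TCP connection to (host, port) and return the socket."""
    return socket.create_connection((host, port))

def send_string(conn, s):
    """Encode s as bytes and send all of it over conn."""
    conn.sendall(s.encode("utf-8"))

def receive_till_close(conn):
    """Read from conn until the peer closes, then return the decoded text."""
    chunks = []
    while True:
        data = conn.recv(4096)
        if not data:          # empty bytes means the other side closed
            break
        chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")
```

The loop in `receive_till_close` is why a `Connection: close` header matters with this approach: without it, the server may keep the connection open and the loop would keep waiting for more data.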

Let's now walk through the steps of communication:

**Step 1**

In [None]:
connection = sock.makeConnection("httpbin.org", 80)
assert connection is not None

**Step 2**

In [None]:
request_line = 'GET / HTTP/1.1\r\n'     # we've already seen this
host_line = 'Host: httpbin.org\r\n'     # required for HTTP 1.1
one_and_done = 'Connection: close\r\n'  # specifies whether to keep connection alive
empty_line = '\r\n'                     # we need this before the (optional) body

request_message = request_line + host_line + \
                  one_and_done + empty_line
                  
print(request_message)

**Step 3**

In [None]:
sock.sendString(connection, request_message)

**Step 4**

In [None]:
response = sock.receiveTillClose(connection)

**Step 5**

In [None]:
connection.close()

We can view the first 250 characters of the response (lines are separated by `'\r\n'`):

In [None]:
print(response[:250])
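
Because `print` renders each `'\r\n'` as a line break, the separators themselves are not visible in the output. A small sketch, using a short hand-written response string (not real server output), shows how `repr` makes them visible:

```python
# A short, hand-written response used only for illustration.
sample = 'HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>\n'

print(sample)                 # '\r\n' shows up as line breaks
print(repr(sample))           # '\r\n' shows up as visible escape sequences
print(sample.count('\r\n'))   # number of line terminators in the message
```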

---

## Part C: Practice with HTTP Requests

**Q3:** Suppose we wish to retrieve (GET) a file via HTTP (so port 80) from `datasystems.denison.edu`.  The resource path of the file is `/data/ind0.json`.  We wish to use version 1.1 of HTTP and to request that the connection be closed after a single request/reply exchange.  We will need a header line to satisfy the HTTP 1.1 requirement of a valid `Host` header.  Write a sequence of code to compose a valid HTTP request as a Python string, and assign the result to `message`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

print(message)
print("--------------------")

In [None]:
# Testing cell
assert type(message) == str
assert message[:3] == "GET"
assert message[4:4+len("/data/ind0.json")] == "/data/ind0.json"
assert "Host: datasystems.denison.edu" in message
assert message.count('\r\n') == 4
assert message[-4:] == '\r\n\r\n'

**Q4:** Write a sequence of code to establish a connection to the host `datasystems.denison.edu` at port 80, to send the string `message` from the previous problem to the host, receive the reply from the host until the server closes the connection, assigning the reply to `reply`, and close the connection.  Note: if the request is not completely correct, a network connection can wait forever for a reply that will never come.  So if you have difficulty here, double check your answer to the previous problem.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

print(reply)

In [None]:
# Testing cell
assert type(reply) == str
assert "200 OK" in reply
assert "application/json" in reply
assert reply.endswith("19485.4}}}")

**Q5:** Suppose we want to generalize the scenario from the first exercise, where the two things that can change are the *host location* and the *resource path*.  For example, we might want to change the host to `httpbin.org` and the resource path to `/`, or many other combinations.  Write a function
```
    buildRequest(location, resource)
```    
that constructs and returns a Python string containing a valid HTTP GET request that incorporates the parameters `location` and `resource` into the request at the appropriate places, and includes the appropriate header lines (for the required `Host` and to request the server close the connection after the exchange).

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

print(buildRequest("httpbin.org", "/get"))
print("---------------------")

In [None]:
# Testing cell

r1 = buildRequest("datasystems.denison.edu", "/data/ind0.json")
assert r1[:3] == "GET"
assert r1[4:4+len("/data/ind0.json")] == "/data/ind0.json"
assert "Host: datasystems.denison.edu" in r1
assert r1.count('\r\n') == 4
assert r1[-4:] == '\r\n\r\n'

r2 = buildRequest("httpbin.org", "/get")
assert r2[:3] == "GET"
assert r2[4:4+len("/get")] == "/get"
assert "Host: httpbin.org" in r2
assert r2.count('\r\n') == 4
assert r2[-4:] == '\r\n\r\n'

**Q6:** Write a function
```
    makeRequest(location, resource)
```
that first constructs a valid HTTP GET request for `resource` at host `location`, as a Python string (using your function from the previous question), and then performs the  request-reply steps of making the connection, sending the string request, receiving a reply until the connection closes, and finally closing the client side of the connection.  The function should return the reply.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

print(makeRequest("datasystems.denison.edu", "/basic.html"))

In [None]:
# Testing cell

resp1 = makeRequest("datasystems.denison.edu", "/basic.html")
# print(resp1)
assert "200 OK" in resp1
assert "text/html" in resp1
assert resp1.endswith("</html>\n")

resp2 = makeRequest("datasystems.denison.edu", "/data/ind0.json")
# print(resp2)
assert "200 OK" in resp2
assert "application/json" in resp2
assert resp2.endswith("19485.4}}}")

resp3 = makeRequest("httpbin.org", "/get")
# print(resp3)
assert "200 OK" in resp3
assert "application/json" in resp3
assert resp3.endswith(""""url": "http://httpbin.org/get"\n}\n""")

> You've reached the second checkpoint in the lab.  Make sure to have it signed off by the instructor or TA.
>
> Checkpoint 2: Consider the responses in the previous question's testing cell.  What do you think the `Content-Type` header line is used for by web browsers?  (Uncomment some of the `print` statements to see the responses.)

---

## Part D: HTTP Response Messages

The _start-line_ for an HTTP response message is a _status-line_, which consists of the protocol version, a status code, and the corresponding reason.  Status codes are all 3 digits long and are divided into categories by their first digit:

Status Code | Reason | Examples
:----|:--------------|:--------------
100s | Informational | 100: continue, 102: processing
200s | Success | 200: OK
300s | Redirection | 301: moved permanently
400s | Client error | 400: bad request, 401: unauthorized, 404: not found
500s | Server error | 500: internal server error

If the request is for a web page, we expect the _body_ of the response to contain a character sequence in the HTML format, with the `Content-Type` header of `text/html`.  Alternatively, the body may contain other types of text, such as JSON, with `Content-Type` of `application/json`.  As with a request message, the body is provided after a single empty line.
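
Since the category of a status code is determined by its first digit, the status code table above can be expressed as a small lookup. This is an illustrative sketch; the function name `status_category` is hypothetical and not part of `mysocket`.

```python
def status_category(code):
    """Return the category name for a 3-digit HTTP status code."""
    categories = {
        1: "Informational",
        2: "Success",
        3: "Redirection",
        4: "Client error",
        5: "Server error",
    }
    # Integer division by 100 keeps only the leading digit.
    return categories.get(code // 100, "Unknown")

print(status_category(200))
print(status_category(404))
```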

The next set of exercises is about parsing the reply that results from a request.  An HTTP reply can be partitioned into a status line, the set of headers, and the body.  The exercises ask for functions that, given a reply, parse it and return each of these pieces.

**Q7:** Write a function
```
    parseStatus(reply)
```
that finds and returns a Python string consisting of only the status line of a reply.  The returned value should include the line-terminating `"\r\n"`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

reply = makeRequest("datasystems.denison.edu", "/basic.html")
print(repr(parseStatus(reply)))
reply = makeRequest("datasystems.denison.edu", "/foobar.txt")
print(repr(parseStatus(reply)))

In [None]:
r1 = makeRequest("datasystems.denison.edu", "/basic.html")
s1 = parseStatus(r1)
assert s1 == "HTTP/1.1 200 OK\r\n"

r2 = makeRequest("datasystems.denison.edu", "/foobar.txt")
s2 = parseStatus(r2)
assert s2 == "HTTP/1.1 404 Not Found\r\n"

**Q8:** Write a function
```
    parseHeaders(reply)
```
that finds and returns a single Python string that starts with the first header in the reply and continues up through the last header in the reply, including the line-terminating `"\r\n"`, but *not* the empty line separating the headers from the body.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

reply = makeRequest("datasystems.denison.edu", "/basic.html")
print(repr(parseHeaders(reply)))
reply = makeRequest("datasystems.denison.edu", "/foobar.txt")
print(repr(parseHeaders(reply)))

In [None]:
# Testing cell

r1 = makeRequest("datasystems.denison.edu", "/basic.html")
h1 = parseHeaders(r1)
assert "Server: Apache" in h1
assert "Connection: close\r\n" in h1
assert "Content-Type: text/html" in h1

r2 = makeRequest("datasystems.denison.edu", "/foobar.txt")
h2 = parseHeaders(r2)
assert "Server: Apache" in h2
assert "Connection: close\r\n" in h2
assert "Content-Type: text/html" in h2

**Q9:** Write a function
```
    parseBody(reply)
```
that finds and returns a single Python string that starts with the beginning of the body (i.e. after the empty line of the reply) and continues to the end of the reply.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

reply = makeRequest("datasystems.denison.edu", "/basic.html")
print(parseBody(reply))
reply = makeRequest("datasystems.denison.edu", "/foobar.txt")
print(parseBody(reply))

In [None]:
# Testing cell
r1 = makeRequest("datasystems.denison.edu", "/basic.html")
b1 = parseBody(r1)
r2 = makeRequest("datasystems.denison.edu", "/foobar.txt")
b2 = parseBody(r2)
assert b1.startswith("<!DOCTYPE html>")
assert b1.endswith("</html>\n")
assert b2.startswith("<!DOCTYPE HTML")
assert b2.endswith("</body></html>\n")

> You've reached the third (and final) checkpoint in the lab.  Make sure to have it signed off by the instructor or TA.
>
> Checkpoint 3: How does the displayed _body_ in the previous question compare with what you would see for the page source using developer tools in a web browser?

---

## Part E

How much time (in minutes/hours) did you spend on this lab outside of class?

YOUR ANSWER HERE