# HTTP and Web Requests

## The Internet

When we navigate to pages on the internet, our devices communicate with computers elsewhere that run code specifically designed to respond to our requests. Those computers are referred to as servers. The devices we use to access these pages are referred to as clients.




<img src="internet.png" width="500">

Whenever we navigate to a website, like cat-soop, we receive pages back whose content and styling determine what the website will look like in our browser.

In Lab 10, we will be building a tool to help download files from the Internet. You will likely encounter new terminology and techniques. We'll go over them here so that you understand the motivation behind the lab.

## Web Protocols

URLs are the identifiers we use to make requests to specific servers. For example, typing www.google.com in our browser makes our computer send a request to the servers Google owns that serve its home page.

Communication on the web can take many forms. The client making a request and the server responding need to agree on a standard language for communication, one that dictates what requests and responses look like. These languages are called protocols. There are different protocols for getting webpages, for streaming video, and for sending email.

The URLs we request often take the form:

&nbsp;&nbsp;&nbsp;&nbsp; protocol://hostname:port/path-or-file-name
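
To make the pieces of that form concrete, here is a small sketch that splits a URL into its parts with Python's standard `urllib.parse.urlparse`. The URL itself is a made-up example, not one used in the lab.

```python
from urllib.parse import urlparse

# Hypothetical URL, chosen only to show every piece of the form above.
example_url = "https://example.com:8080/docs/index.html"

parts = urlparse(example_url)
print(parts.scheme)    # the protocol: https
print(parts.hostname)  # the hostname: example.com
print(parts.port)      # the port: 8080
print(parts.path)      # the path: /docs/index.html
```

Most URLs leave the port out, in which case `parts.port` is `None` and the protocol's default port is used (80 for HTTP, 443 for HTTPS).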

The protocol that we typically use on the web is HTTP(S). The most basic type of request in HTTP is a GET request, which asks a server for some piece of information, often a web page. When we type a URL into Google Chrome or another browser, the browser makes a GET request for that page. You can do the same in your terminal by typing something like

curl --dump-header - https://6009.cat-soop.org/_static/spring19/helloworld.txt  
curl --dump-header - https://www.google.com
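
Under the hood, a GET request is just a short block of text sent to the server. The sketch below builds that text by hand so you can see its shape. `build_get_request` is a hypothetical helper written for illustration. It does not send anything over the network, and real tools such as curl or a browser add more headers than this.

```python
def build_get_request(host, path="/"):
    """Return the bytes of a minimal HTTP/1.1 GET request (illustrative helper)."""
    lines = [
        "GET {} HTTP/1.1".format(path),  # request line: method, path, protocol version
        "Host: {}".format(host),         # HTTP/1.1 requires a Host header
        "Connection: close",             # ask the server to close the connection afterwards
        "",                              # a blank line marks the end of the headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


request = build_get_request("6009.cat-soop.org", "/_static/spring19/helloworld.txt")
print(request.decode("ascii"))
```

With HTTPS, the same text is sent, but over an encrypted connection.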

## Response Codes

Most of the time, when we make a request to a website such as https://6009.cat-soop.org/spring19/, we successfully get a page back. This is not always the case. Numerous things can happen, and each outcome typically has its own response code. 200 means that a valid response was sent back. 404, which you may be familiar with, means that the server could not find the resource at the requested URL.

Response codes of 1xx are informational.  
Response codes of 2xx mean the request was successful.  
Response codes of 3xx mean a redirection is needed.  
Response codes of 4xx mean the client making the request had an error.  
Response codes of 5xx mean the server encountered an error.  
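
Since the class of a response code is determined by its first digit, a program can sort codes into these groups with integer division. The following is a minimal sketch. `status_class` is a hypothetical name, not part of the lab code.

```python
def status_class(code):
    """Map an HTTP status code to the name of its class (illustrative helper)."""
    categories = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    # Integer division by 100 keeps only the leading digit: 404 // 100 == 4.
    return categories.get(code // 100, "unknown")


for code in (200, 301, 404, 500):
    print(code, status_class(code))
```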

Here is a list of common response codes that will be useful to know for the lab.

| Response Code        | Title                  | Description            |
|:---------------------|:-----------------------|:-----------------------|
| 200                  | OK                     | The request succeeded  |
| 301                  | Moved Permanently      | The URI of the requested resource has been changed permanently |
| 302                  | Found                  | The URI of the requested resource has been changed temporarily |
| 307                  | Temporary Redirect     | The URI of the requested resource has been changed temporarily; the client should repeat the request with the same method |
| 403                  | Forbidden              | The client does not have access rights to the content |
| 404                  | Not Found              | The server cannot find the requested resource |
| 500                  | Internal Server Error  | The server has encountered a situation it doesn't know how to handle |
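
If you want to look up the standard title for a code from within Python, the standard library's `http.HTTPStatus` enumeration knows them. This sketch prints the titles for the codes in the table. It is shown only as a convenience and is not something the lab requires.

```python
from http import HTTPStatus

# The codes listed in the table above.
common_codes = [200, 301, 302, 307, 403, 404, 500]

for code in common_codes:
    status = HTTPStatus(code)          # look up the enum member by its numeric value
    print(status.value, status.phrase)
```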

#### Examples
curl --dump-header - https://6009.cat-soop.org/_static/spring19/helloworld.txt: 200 successful!    
curl --dump-header - https://py.mit.edu/:  301 redirects to https://py.mit.edu/6.145   
curl --dump-header - https://6009.cat-soop.org/spring19/labs/lab11: 404 doesn't exist because lab 10 is the last lab!
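
When a server answers with a redirect code, it also sends a `Location` header saying where to look instead, and the client is expected to make a second request there. The sketch below shows one way that loop could work. Everything in it is an assumption for illustration: `follow_redirects`, `FakeResponse`, and the example.com URLs are hypothetical, and the fake responses stand in for a real server so the code runs without a network. Real `http.client.HTTPResponse` objects offer the same `status` attribute and `getheader` method.

```python
from urllib.parse import urljoin

REDIRECT_CODES = (301, 302, 307)


def follow_redirects(url, fetch, max_hops=5):
    """
    Call fetch(url) repeatedly, following redirects, and return the final
    (url, response) pair. `fetch` is any callable that returns an object
    with a `status` attribute and a `getheader(name)` method.
    """
    for _ in range(max_hops + 1):
        response = fetch(url)
        if response.status not in REDIRECT_CODES:
            return url, response
        location = response.getheader("location")
        if location is None:
            return url, response  # a redirect with no target; nothing to follow
        # The Location header may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    raise RuntimeError("too many redirects")


class FakeResponse:
    """Stand-in for a server response (hypothetical, for demonstration only)."""

    def __init__(self, status, location=None):
        self.status = status
        self._location = location

    def getheader(self, name, default=None):
        return self._location if name.lower() == "location" else default


# A pretend website: the old address redirects to the new one.
fake_site = {
    "http://example.com/old": FakeResponse(301, "/new"),
    "http://example.com/new": FakeResponse(200),
}

final_url, final_response = follow_redirects("http://example.com/old", fake_site.get)
print(final_url, final_response.status)
```

The `max_hops` limit guards against two pages that redirect to each other forever.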

## Caching

Our browsers and computers do a lot of work automatically to improve performance and usability. One technique that you'll hear referenced a lot in computer science is caching. In its simplest terms, caching means saving things so that they are available if you need to use them again. Our browsers do a lot of caching to make them faster. For example, if you visit Facebook.com a lot, your browser might have logos, images, and other resources from Facebook's servers saved (cached), so that the next time you go back to Facebook, it doesn't have to request them again.

https://www.giftofspeed.com/cache-checker/

In Lab 10 we will use caching so that if we download a looping sequence of files (say, for a GIF), we don't have to keep redownloading them, which makes everything much faster!

In [1]:
import io
import os
import socket
import http.client

from urllib.parse import urlparse

def http_response(url):
    """
    Opens a request to the given URL using the `http.client` library.

    Parameters:
        url (str or bytes):
            The URL containing the resource to be downloaded

    Returns:
        A file-like object representing the response received from the server.
        In the case of http:// or https:// requests, the return value will be
        an instance of `http.client.HTTPResponse`.  In the case of a file://
        request (representing a local file on disk), the return value will be
        an `io.BytesIO` object.

        In either case, the returned object will support `read` and `readlines`,
        and it will have a `status` attribute containing an HTTP status code.
    """
    if isinstance(url, bytes):
        url = url.decode('utf-8')
    url = urlparse(url)
    assert url.scheme in ('http', 'file', 'https')
    if url.scheme == 'file':
        fname = os.path.join(url.netloc, url.path)
        if os.path.isfile(fname):
            out = open(fname, 'rb')
            out.status = 200
        else:
            out = io.BytesIO()
            out.status = 404
        return out
    cls = http.client.HTTPConnection if url.scheme == 'http' else http.client.HTTPSConnection
    try:
        connection = cls(url.netloc, timeout=20)
        connection.request('GET', url.path)
    except socket.timeout:
        raise ConnectionError('no response from server within 20 seconds; connection attempt timed out') from None
    except socket.gaierror:
        raise ConnectionError('could not connect') from None
    return connection.getresponse()

In [2]:
response = http_response('https://6009.cat-soop.org/_static/spring19/helloworld.txt')
print ("Status: {}\n".format(response.status))
print ("Content: {}".format(response.read()))

ConnectionError: could not connect

In [None]:
response = http_response('https://6009.cat-soop.org/_static/spring19/nonexistent.txt')
print ("Status: {}\n".format(response.status))
print ("Content: {}".format(response.read()))

In [None]:
import time

start = time.time()

needed_file = 'https://6009.cat-soop.org/_static/spring19/helloworld.txt'
for i in range(20):
    response = http_response(needed_file)
    print ("Iteration: {}, Status: {}, Content: {}".format(i, response.status, response.read()))
    
end = time.time()
print ("\nTime Spent: {}".format(end - start))

In [None]:
start = time.time()

needed_file = 'https://6009.cat-soop.org/_static/spring19/helloworld.txt'
cached_results = None
for i in range(20):
    # Only contact the server the first time; afterwards, reuse the saved result.
    if cached_results is None:
        response = http_response(needed_file)
        cached_results = {'status': response.status, 'content': response.read()}
    print ("Iteration: {}, Status: {}, Content: {}".format(i, cached_results['status'], cached_results['content']))
        
end = time.time()
print ("\nTime Spent: {}".format(end - start))
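
The cell above remembers a single result. When several different files are involved, a dictionary keyed by URL can play the same role. The following is a minimal sketch of that idea, not the lab's required design. `make_cached_fetch` and `slow_fetch` are hypothetical names, and `slow_fetch` only pretends to download so the example runs without a network. In practice you could pass in a function built on `http_response` instead.

```python
def make_cached_fetch(fetch):
    """Wrap `fetch` so that each URL is only ever fetched once."""
    cache = {}  # maps URL -> result of the first fetch

    def cached_fetch(url):
        if url not in cache:
            cache[url] = fetch(url)  # cache miss: do the slow work and save it
        return cache[url]            # cache hit: reuse the saved result

    return cached_fetch


calls = []  # records every URL that actually gets "downloaded"


def slow_fetch(url):
    """Pretend download (hypothetical stand-in for a real request)."""
    calls.append(url)
    return {"status": 200, "content": b"data for " + url.encode("utf-8")}


cached = make_cached_fetch(slow_fetch)
for url in ["a.txt", "b.txt", "a.txt", "a.txt"]:
    cached(url)

print(len(calls))  # 2: each distinct URL was fetched only once
```

One design caveat: a cache like this never forgets, so it trades memory for speed and will keep returning the old copy even if the file changes on the server.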