# Retrieving Data from the Web

* We don't always have to download data to our local machines before loading it into Python
* If the data are openly available on the web we can retrieve them programmatically
    * We can even log into systems with access control, but that is a more complicated topic
* Getting remote data requires the use of *web protocols* to GET data


## What is HTTP

* HTTP is the *HyperText Transfer Protocol* and is the lingua franca of the web
 > HTTP is a protocol which allows the fetching of resources, such as HTML documents. It is the foundation of any data exchange on the Web and a client-server protocol, which means requests are initiated by the recipient, usually the Web browser. A complete document is reconstructed from the different sub-documents fetched, for instance text, layout description, images, videos, scripts, and more. - [MDN Web Docs](https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview)

![HTTP Flow](images/http-flow.png)

## Elements of HTTP

* Request Methods - Verbs
    * GET - Requests a representation of a specific resource. Retrieve only.
    * POST - Submit an entity to a specified resource, often causing a change in state on the server.
    * PUT - Replace the current representation of the specified resource with the request payload.
    * DELETE - Remove the specified resource from the server.
    * HEAD - Same as GET, but without the response body.
* User Agent - Information about the application making the request
* Headers - Metadata about the request
* Body - Data sent or received
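
* The elements listed above can be seen by building a request without sending it. This is a minimal sketch using `urllib.request` from the standard library so it runs offline; the `my-notebook/0.1` user agent string and the `Accept` header are made-up values for illustration.

```python
from urllib.request import Request

# Build (but do not send) a HEAD request; nothing goes over the network here
req = Request(
    "https://www.loc.gov/",
    headers={"User-Agent": "my-notebook/0.1", "Accept": "text/html"},
    method="HEAD",
)

print("Method:", req.get_method())   # the verb
print("URL:", req.full_url)          # the resource being requested
for name, value in req.header_items():
    print("Header:", name, "=", value)
print("Body:", req.data)             # None, because this request carries no payload
```

* `urllib` changes the capitalization of header names when it stores them (for example `User-agent`). That is harmless, because HTTP header names are case-insensitive.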


## HTTP Status Codes

* HTTP has five categories of status code
    * 1xx: Informational (interim responses, rarely seen in everyday use)
    * 2xx: Successful response
    * 3xx: Redirection
    * 4xx: Client Error
    * 5xx: Server Error
* Frequently used codes:
    * 200 - success
    * 301 and 302 - Moved permanently or temporarily
    * 400 - bad request
    * 401 - unauthorized
    * 403 - forbidden
    * 404 - not found
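
* Python's standard library knows these codes too. The sketch below looks up the standard phrase for each of the frequently used codes and works out its category from the first digit. The `describe_status` helper and the `CATEGORIES` table are hypothetical names written for this example, not part of any library.

```python
from http import HTTPStatus

# Map the first digit of a status code to its category
CATEGORIES = {
    1: "Informational",
    2: "Successful response",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}

def describe_status(code):
    """Return a readable summary such as '404 Not Found (Client Error)'."""
    phrase = HTTPStatus(code).phrase     # standard reason phrase for the code
    category = CATEGORIES[code // 100]   # for example, 404 // 100 is 4
    return f"{code} {phrase} ({category})"

for code in (200, 301, 302, 400, 401, 403, 404):
    print(describe_status(code))
```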


## HTTP Request & Response

![HTTP Request and Response](images/http-request-response.png)

## Working with HTTP in Python

* Because Python comes with *batteries included*, there is an [HTTP client module](https://docs.python.org/3/library/http.client.html) in the standard library
    * It is fine in a pinch, but there is a better third-party library
* The [Requests](https://2.python-requests.org/en/master/) library by [Kenneth Reitz](https://www.kennethreitz.org/)
    * It is *HTTP for humans*
* Requests is the most popular library for fetching data from the web
* It is very powerful, but we will only touch on a little bit of it today.

In [None]:
# load the requests library
import requests

In [None]:
# put the address of the page we want to load into a variable
URL = "http://loc.gov"

# make an HTTP GET request to the specified URL
# Save the response in a variable
response = requests.get(URL)


In [None]:
# Inspect the response status code
response.status_code

* This means our HTTP request was successful
* Requests makes it easy to inspect various bits of information related to our HTTP transaction

In [None]:
# Display the HTTP headers we got from the server
response.headers

In [None]:
# Look at the content type of the resource we got back from the server
response.headers['Content-Type']

* This means we got an HTML document back from loc.gov
* You can access the response body in the `response.text` or `response.content` fields
    * Be careful, they can be really long!
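
* The difference between the two fields is that `response.content` holds raw bytes while `response.text` holds those bytes decoded into a string. The sketch below imitates that with a made-up value instead of a real response; `raw_bytes` and `decoded` are illustrative names only.

```python
# Stand-in for response.content: the raw bytes a server might send
raw_bytes = "Bibliothèque".encode("utf-8")

# Stand-in for response.text: the same bytes decoded into a Python string
decoded = raw_bytes.decode("utf-8")

print(type(raw_bytes), len(raw_bytes))   # bytes; the accented letter takes two bytes
print(type(decoded), len(decoded))       # str; length is counted in characters
```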

In [None]:
# display the first 1000 characters of the response string
response.text[0:1000]

In [None]:
# use the print function so the newlines aren't escaped
print(response.text[0:1000])

* From here we could use a library like [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to parse the HTML and extract specific pieces of information

In [None]:
from bs4 import BeautifulSoup

In [None]:
# Grab the HTML page for the digitized books collection
url = "https://loc.gov/collections/selected-digitized-books"
response = requests.get(url)

In [None]:
# Parse the HTML string with BeautifulSoup so we can search it
# Naming the parser explicitly avoids a warning and keeps results consistent
soup = BeautifulSoup(response.text, "html.parser")

# find all the HTML elements with the titles
span_elements = soup.find_all("span", class_="item-description-title")

# Use a list comprehension to extract just the text from each HTML element
titles = [item.text.strip() for item in span_elements]
# Display the list of title strings
titles

* But we don't have to parse these titles from HTML
* LC has provided a *much easier* way of programmatically accessing information

In [None]:
# Grab the JSON for the digital book collection
url = "https://loc.gov/collections/selected-digitized-books/?fo=json"
response = requests.get(url)

In [None]:
response.status_code

In [None]:
response.headers["Content-Type"]

In [None]:
response.text[0:1000]

* That looks like JSON!

In [None]:
collection = response.json()
titles = [item['title'] for item in collection['results']]
titles
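
* Calling `response.json()` is roughly equivalent to parsing the response text with the standard `json` module. The sketch below shows that step on a tiny, made-up payload. Only the `results` and `title` keys are taken from the code above; the titles themselves are invented and the real response contains far more fields.

```python
import json

# A tiny, made-up payload shaped like the part of the response we use
sample_text = '''
{
  "results": [
    {"title": "An example book"},
    {"title": "Another example book"}
  ]
}
'''

# JSON objects become dictionaries and JSON arrays become lists
sample_collection = json.loads(sample_text)
print(type(sample_collection), type(sample_collection["results"]))

# The same list comprehension now works on the parsed data
sample_titles = [item["title"] for item in sample_collection["results"]]
print(sample_titles)
```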

## Using HTTP Parameters with Requests

* This time we are going to use some HTTP parameters to search for certain items
* What we want to search for are images of kittens
    * CUTE!

In [None]:
# Specify the search endpoint and criteria
search_endpoint = 'http://www.loc.gov/search/'
parameters = {
    'fo' : 'json',
    'q'  : 'kittens',
    'fa' : 'online-format:image'
}

* Now that we have our query as Python data, we can pass it to Requests

In [None]:
# make the request with the additional parameters
response = requests.get(search_endpoint, params = parameters)

print('URL:',response.url)
print('Response code:',response.status_code)
for header, value in response.headers.items():
    print('Header:', header, value)
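
* The URL printed above has the parameters attached as a *query string*. The sketch below approximates what Requests does with the `params` dictionary, using `urlencode` from the standard library. It is an illustration of the idea rather than the library's actual implementation, so the escaping may differ in small details.

```python
from urllib.parse import urlencode

search_endpoint = 'http://www.loc.gov/search/'
parameters = {
    'fo' : 'json',
    'q'  : 'kittens',
    'fa' : 'online-format:image'
}

# Encode the dictionary as key=value pairs joined by &
# Characters that are not URL-safe, such as the colon, are percent-encoded
query_string = urlencode(parameters)
print(query_string)

# The query string is attached to the endpoint after a ?
print(search_endpoint + '?' + query_string)
```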

In [None]:
# parse the response into Python dictionaries
kitten_data = response.json()
# look at the first result
kitten_data['results'][0]

* If we look at this result we can see that it contains some image URLs, such as:
* [//cdn.loc.gov/service/pnp/hec/43400/43433v.jpg#h=793&w=1024](//cdn.loc.gov/service/pnp/hec/43400/43433v.jpg#h=793&w=1024)
* Can we programmatically access this image using Python?
    * The answer is YES!

In [None]:
# Extract the URL from the JSON data
kitten_url = kitten_data['results'][0]['image_url'][-1]
kitten_url

In [None]:
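# make a request for the image
# Requests can't handle protocol-relative URLs (ones that start with //),
# so prepend a scheme if the URL doesn't already have one
if kitten_url.startswith('//'):
    kitten_url = 'https:' + kitten_url
response = requests.get(kitten_url)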
# Check to make sure we got JPEG data back
response.headers['content-type']
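
* A URL that starts with `//` is *protocol-relative*: it borrows the scheme of the page it appears on. One way to turn it into a full URL is `urljoin` from the standard library, sketched below. The base URL passed to `urljoin` is an assumption chosen for this example.

```python
from urllib.parse import urljoin, urldefrag

# The protocol-relative URL from the search result
relative_url = "//cdn.loc.gov/service/pnp/hec/43400/43433v.jpg#h=793&w=1024"

# Resolve it against the page it came from; it inherits that page's scheme
absolute_url = urljoin("https://www.loc.gov/search/", relative_url)
print(absolute_url)

# Optionally split off the #fragment, which is not sent to the server
clean_url, fragment = urldefrag(absolute_url)
print(clean_url)
print(fragment)
```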

* Now we have the JPEG image of a kitten, let's look at it!!!

In [None]:
# Display the content
response.content[0:1000]

* EEK, that is not a cute kitten picture, that is binary data being barfed into plain text
* We need a mechanism for displaying this raw image data not as text but as an image
    * Jupyter provides mechanisms for doing this because we are running in a web browser
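
* Even though the bytes are unreadable as text, they are not random. JPEG data begins with a fixed two-byte marker, so we can sanity-check a download without rendering it. The sketch below uses made-up byte strings in place of `response.content`; `looks_like_jpeg` is a hypothetical helper written for this example.

```python
# JPEG data begins with the two-byte Start Of Image marker FF D8
JPEG_MAGIC = b"\xff\xd8"

def looks_like_jpeg(data):
    """Return True if the bytes start with the JPEG signature."""
    return data[:2] == JPEG_MAGIC

# Made-up byte strings standing in for response.content
fake_jpeg = b"\xff\xd8\xff\xe0" + b"\x00" * 16
fake_html = b"<!DOCTYPE html><html>"

print(looks_like_jpeg(fake_jpeg))
print(looks_like_jpeg(fake_html))
```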

In [None]:
# load up the Jupyter/IPython display library
from IPython import display

* This imports the `display` module, which includes a function that renders JPEG image data as a JPEG image
    * See the [documentation](https://ipython.readthedocs.io/en/stable/api/generated/IPython.display.html#IPython.display.display_jpeg) for more information 
* Jupyter supports a bunch of different formats, including PNG, SVG, and HTML

In [None]:
# Use the display function to render the JPEG image we downloaded in the notebook
display.display_jpeg(response.content, raw=True)

***KITTENS!!!***