# Web Scraping Part 1

*Inspired by web scraping lessons from [Lauren Klein](https://github.com/laurenfklein/emory-qtm340/blob/master/notebooks/class4-web-scraping-complete.ipynb) and [Allison Parrish](https://github.com/aparrish/dmep-python-intro/blob/master/scraping-html.ipynb)*

In this lesson, we'll learn how to "scrape" data from the internet with the Python libraries `requests` and `BeautifulSoup`.

😺 Kittens toy website: http://static.decontextualize.com/kittens.html

## Responses and Requests

To programmatically access the text data attached to every URL, we can use a Python library called [requests](https://requests.readthedocs.io/en/master/).

When you type a URL into your browser's address bar, you're sending an HTTP **request** for a web page, and the server that stores that web page sends back a **response**: the web page data that your browser will render.

## Import Requests 

In [3]:
import requests

## Get HTML Data

With the `.get()` method, we can request to "get" web page data for a specific URL, which we will store in a variable called `response`.

In [29]:
response = requests.get("https://www.dailyscript.com/scripts/Juno.txt")

## HTTP Status Code

If we check out `response`, it will simply tell us its [HTTP response code](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status), aka whether the request was successful or not.

"200" is a successful response, while "404" is a common "Page Not Found" error.

In [107]:
response

<Response [200]>

Let's see what happens if we change the title of the movie from *Juno* to *Guno* in the URL...

In [12]:
bad_response = requests.get("https://www.dailyscript.com/scripts/Guno.txt")

In [13]:
bad_response

<Response [404]>
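If you'd like a human-readable label for a status code, here is a minimal sketch that uses Python's built-in `http.HTTPStatus`. The helper name `describe_status` is made up for this example. In the notebook, the number itself is available as `response.status_code`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short label such as '404 Not Found' for a numeric status code."""
    status = HTTPStatus(code)
    return f"{status.value} {status.phrase}"

# In the notebook, you could pass response.status_code instead of a literal number.
for code in (200, 404):
    print(describe_status(code))
```

This prints `200 OK` and `404 Not Found`, the two codes we saw above.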

### Extract Text From Web Page

To actually get at the text data in the response, we need to use `.text`, which we will save in a variable called `html_string`. The text data that we're getting is formatted in the HTML markup language, which we will talk more about in the BeautifulSoup section below.

In [None]:
html_string = response.text
print(html_string)

## Extract Text From Multiple Web Pages

In [10]:
urls = ['https://www.dailyscript.com/scripts/Juno.txt',
        'https://www.dailyscript.com/scripts/Titanic.txt',
        'https://imsdb.com/scripts/Mulan.html']

Use a `for` loop to iterate through the list of screenplay URLs called `urls`, and print out the first 500 characters of each screenplay.

## Your turn!

In [None]:
# Your code here



    print(html_string[:500])

## HTML & BeautifulSoup

Not all web pages will be as easy to scrape as these screenplay files, however. If web pages are messy and complicated, how can we extract just the things that we want?

Well, we can use a Python library called BeautifulSoup, but first we need to learn a little about how web pages are written.

Poet and professor Allison Parrish made a toy website called "Kittens and the TV Shows They Love." It can be found at the following URL: http://static.decontextualize.com/kittens.html

If we use our requests library on this Kittens TV website, this is what we get:

In [None]:
response = requests.get("http://static.decontextualize.com/kittens.html")
html_string = response.text
print(html_string)

This is an HTML document. HTML stands for HyperText Markup Language. It is the standard language for writing web page documents. The most important thing you need to know about HTML is that the language uses HTML "tags" to represent different elements, such as a main header `<h1>`. 

| HTML Tag or Attribute | Explanation                                             |
|-----------------------|---------------------------------------------------------|
| `<!DOCTYPE>`          | Defines document type                                   |
| `<html>`              | Defines HTML document                                   |
| `<head>`              | Main information about document                         |
| `<title>`             | Title for document                                      |
| `<body>`              | Document body                                           |
| `<h1>` to `<h6>`      | Headings                                                |
| `<p>`                 | Paragraph                                               |
| `<br>`                | Line break                                              |
| `<!--comment here-->` | Comment                                                 |
| `<img>`               | Image                                                   |
| `<a>`                 | Hyperlink                                               |
| `<ul>`                | Unordered list                                          |
| `<ol>`                | Ordered list                                            |
| `<li>`                | List item                                               |
| `<style>`             | Style information for a document                        |
| `<div>`               | Section in a document                                   |
| `<span>`              | Inline section in a document                            |
| `class=`              | Certain kind of element, can apply to multiple elements |
| `id=`                 | Unique identifier for an element                        |

HTML tags often, but not always, require a "closing" tag. For example, the main header "Kittens and the TV Shows They Love" will be surrounded by `<h1>` (opening tag) and `</h1>` (closing tag) on either side: `<h1>Kittens and the TV Shows They Love</h1>`
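To see opening and closing tags in action, here is a small sketch that uses Python's built-in `html.parser` module (not BeautifulSoup, which we'll get to next). The class name `TagLogger` and the HTML snippet are made up for this example. It records every opening tag, closing tag, and piece of text it encounters.

```python
from html.parser import HTMLParser

class TagLogger(HTMLParser):
    """Record each opening tag, closing tag, and non-empty piece of text."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("open", tag))

    def handle_endtag(self, tag):
        self.events.append(("close", tag))

    def handle_data(self, data):
        if data.strip():
            self.events.append(("text", data.strip()))

# A made-up snippet: <h1> and <p> have closing tags, <br> does not.
snippet = "<h1>Kittens and the TV Shows They Love</h1><p>First line<br>Second line</p>"

logger = TagLogger()
logger.feed(snippet)
logger.close()

for event in logger.events:
    print(event)
```

Notice that `h1` and `p` each show up as both "open" and "close", while `br` only ever shows up as "open".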

## Extract HTML Elements

To use BeautifulSoup, we need to import it.

In [6]:
from bs4 import BeautifulSoup

To make a BeautifulSoup document, we call `BeautifulSoup()` with two arguments: the `html_string` from our HTTP request and [the kind of parser](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#specifying-the-parser-to-use) that we want to use, which will always be `"html.parser"` for our purposes.

In [11]:
response = requests.get("http://static.decontextualize.com/kittens.html")
html_string = response.text

document = BeautifulSoup(html_string, "html.parser")

We can use the `.find()` method to find and extract certain elements, such as a main header.

In [12]:
document.find("h1")

<h1>Kittens and the TV Shows They Love</h1>

If we want only the text contained between those tags, we can use `.text` to extract just the text.

In [9]:
document.find("h1").text

'Kittens and the TV Shows They Love'
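One thing to watch out for: `.find()` returns only the *first* match, and it returns `None` when nothing matches. Here is a minimal sketch on a made-up HTML snippet (a stand-in, not the real Kittens page), so it runs without making a request. The variable names are just for illustration.

```python
from bs4 import BeautifulSoup

# A tiny stand-in page (not the real Kittens page), so no request is needed.
sample_html = (
    "<html><body><h1>Sample Heading</h1>"
    "<p>First paragraph</p><p>Second paragraph</p></body></html>"
)
sample_document = BeautifulSoup(sample_html, "html.parser")

first_paragraph = sample_document.find("p")  # only the first <p> is returned
missing = sample_document.find("h3")         # there is no <h3> in the snippet

print(first_paragraph.text)
print(missing)

# Check for None before asking for .text, which would otherwise raise an error.
heading_text = missing.text if missing is not None else "(no <h3> found)"
print(heading_text)
```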

Find the HTML element that contains an image.

In [None]:
document.find("img")

You can also extract multiple HTML elements at a time with `.find_all()`, which returns every match rather than only the first.

In [None]:
document.find_all("img")
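The result of `.find_all()` behaves like a Python list, so we can count its items and loop through them. An element's attributes, such as an image's `src`, can be read with square brackets. Here is a sketch on a made-up snippet (the file names are invented for this example):

```python
from bs4 import BeautifulSoup

# Made-up HTML with two images; the file names are invented for this example.
sample_html = """
<div><img src="cat-one.png" alt="first"></div>
<div><img src="cat-two.png" alt="second"></div>
"""
sample_document = BeautifulSoup(sample_html, "html.parser")

images = sample_document.find_all("img")
print(len(images))

# Read the src attribute of each image with square brackets.
image_sources = []
for image in images:
    image_sources.append(image["src"])
print(image_sources)
```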

You can extract elements that are only of a certain `class`:

In [None]:
document.find_all("div", attrs={"class": "kitten"})
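BeautifulSoup also accepts a `class_` keyword argument (with a trailing underscore, because `class` is a reserved word in Python) as a shorthand for the `attrs` dictionary. Here is a sketch on a made-up snippet, where the `puppy` class and the names are invented for this example:

```python
from bs4 import BeautifulSoup

# Made-up HTML: two divs share the class "puppy", one has a different class.
sample_html = """
<div class="puppy"><h2>Rex</h2></div>
<div class="sidebar"><h2>About</h2></div>
<div class="puppy"><h2>Biscuit</h2></div>
"""
sample_document = BeautifulSoup(sample_html, "html.parser")

# Two ways of asking for the same thing.
with_attrs = sample_document.find_all("div", attrs={"class": "puppy"})
with_keyword = sample_document.find_all("div", class_="puppy")

print(len(with_attrs), len(with_keyword))

# Each match is itself searchable, so we can look inside it with .find().
for puppy_div in with_keyword:
    print(puppy_div.find("h2").text)
```

Both calls skip the `sidebar` div and return only the two `puppy` divs.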

### Your Turn!
Find the name of **one** of the kittens and then return the text of the name (either "Fluffy" or "Monsieur Whiskers").

To do so, open the web page (http://static.decontextualize.com/kittens.html) and then use your Developer Tools to find the HTML tag associated with the kitten names.

In [42]:
# Your code here

Extract the names of all the TV shows listed on the web page.

In [42]:
# Your code here

### Extract Multiple HTML Elements

Let's try to extract the text from all the `<h2>` (second-level header) elements:

In [None]:
document.find_all("h2").text

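Uh oh. That didn't work! `.find_all()` returns a list-like collection of elements rather than a single element, and the collection itself has no `.text`. In order to extract text data from multiple HTML elements, we need a `for` loop and some list-building.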
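If you'd like to see the error without stopping your notebook, here is a sketch that reproduces it on a made-up two-header snippet and catches it with `try`/`except`. The variable names are just for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML with two <h2> headers.
sample_html = "<h2>First</h2><h2>Second</h2>"
sample_document = BeautifulSoup(sample_html, "html.parser")
headers = sample_document.find_all("h2")

try:
    headers.text  # the whole collection has no .text
    error_raised = False
except AttributeError as error:
    error_raised = True
    print("AttributeError:", error)

# A single element from the collection does have .text
print(headers[0].text)
```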

In [15]:
all_h2_headers = document.find_all("h2")

In [None]:
all_h2_headers

First we will make an empty list called `h2_headers`.

We will loop through the headers, grab the `.text`, put it into a variable called `header_contents`, then `.append()` it to our `h2_headers` list.

In [16]:
h2_headers = []
for header in all_h2_headers:
    header_contents = header.text
    h2_headers.append(header_contents)

In [None]:
h2_headers

🚨 Heads up! New Python concept!🚨

You can also use something called a [list comprehension](https://www.w3schools.com/python/python_lists_comprehension.asp) to make a new Python list in a single line of code. 

In [None]:
h2_headers = [header.text for header in all_h2_headers]
h2_headers
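A list comprehension can also transform each item as it goes. Here is a sketch that compares the loop version and the comprehension version side by side, using a made-up list of strings with extra whitespace (a stand-in for text scraped from a page) and the string method `.strip()`:

```python
# Made-up strings with stray whitespace, standing in for scraped text.
raw_names = ["  Fluffy ", "Monsieur Whiskers\n"]

# Version 1: empty list, for loop, .append()
cleaned_with_loop = []
for name in raw_names:
    cleaned_with_loop.append(name.strip())

# Version 2: the same steps as a list comprehension
cleaned_with_comprehension = [name.strip() for name in raw_names]

print(cleaned_with_comprehension)
print(cleaned_with_loop == cleaned_with_comprehension)
```

Both versions build the same list, so which one you use is mostly a matter of readability.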