# Collecting Texts from the Internet: Introduction to Webscraping

There is a huge amount of text and material available openly online, and there are ways to programmatically access and collect texts from websites. 

Webscraping saves you from manually navigating to each URL and copying and pasting the text you want into a file. Instead, you can programmatically access the text data at each URL.

In this notebook we’ll learn how to do this using the Python libraries `requests` and `BeautifulSoup`. 

(More specialized libraries also exist. For example, the [Newspaper](https://newspaper.readthedocs.io/en/latest/) library is designed to collect online news articles.)

## Accessing web pages with Requests

When you type a URL into your browser’s address bar, you’re sending an HTTP *request* for a web page, and the server that stores that web page sends back a *response*: web page data that your browser then renders.

![Request-Response](Request-Response.png)

We're using the `requests` library to send a request and store the response.

In [None]:
import requests

In [None]:
#Use the get method to get web page data from a specified URL
#Store data in a variable called response
response = requests.get("https://www.gutenberg.org/files/59227/59227-0.txt")

We can check whether the request succeeded by looking at the [HTTP response code](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) stored in `response`.

In [None]:
response

“200” is a successful response.

If we modify the URL to a non-existent address, what response code do we get?

In [None]:
bad_response = requests.get("https://www.gutenberg.org/files/59227.txt")

In [None]:
bad_response

“404” is a common “Page Not Found” error.
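
The response object also exposes the code as an integer through `response.status_code`, which is easier to check inside a script than the printed object. As a minimal sketch, the standard-library `http.HTTPStatus` enum can translate a numeric code into its standard phrase. The helper name `describe_status` is made up for this illustration.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short human-readable description of an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The code is not one that the standard library knows about
        phrase = "Unknown status"
    # Codes from 200 to 299 indicate success
    outcome = "success" if 200 <= code < 300 else "not a success"
    return f"{code} {phrase} ({outcome})"

print(describe_status(200))
print(describe_status(404))
```

With a real response, you could call `describe_status(response.status_code)`.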

`requests.get()` stores the server's response, but to actually get text data out of that response we need to use `.text`.

In [None]:
# store text data in variable html_string
html_string = response.text
print(html_string)
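
In a longer script it can help to guard against requests that fail. The following is a minimal sketch rather than the only way to do it: the helper name `fetch_text` and the 10-second timeout are assumptions made for this illustration, while the `timeout` argument, `raise_for_status()`, and `requests.exceptions.RequestException` are part of `requests`.

```python
import requests

def fetch_text(url, timeout=10):
    """Return the text of the page at `url`, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        # Raise an exception for 4xx and 5xx status codes
        response.raise_for_status()
    except requests.exceptions.RequestException as error:
        print(f"Could not fetch {url}: {error}")
        return None
    return response.text

# This URL is missing its "https://" prefix, so requests rejects it
# before contacting any server and fetch_text returns None
print(fetch_text("www.gutenberg.org/files/59227/59227-0.txt"))
```

Called with the full Gutenberg URL used earlier, `fetch_text` should return the same string as `response.text`, assuming the page is reachable.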

## Extracting the text we want from HTML using BeautifulSoup

In the example above, we get all the text for Charlotte Gilman's *The Yellow Book* because the webpage itself is a .txt document: `https://www.gutenberg.org/files/59227/59227-0.txt`. But not all webpages are like that.

For example, if we look at [The Dartmouth](https://www.thedartmouth.com/) homepage, the page seems to display just text and images, but there’s a lot more information behind what we see.

If we do the same as before and convert our `response` into text, we can see that the output is much messier than it was for the Gilman URL.

In [None]:
response = requests.get("https://www.thedartmouth.com/")
html_string = response.text
print(html_string)

This messy information is in fact quite structured. Web pages are written in HTML, and what we see is the HTML source of the webpage. If we learn how to read HTML, we can use the Python library `BeautifulSoup` to extract only the text we want from the pages.

### Introduction to HTML

Just like Markdown, HTML (which stands for HyperText Markup Language) is a way of writing web page documents. It uses tags to signal what kind of element each part of the page is (e.g., headers, titles, body text). Often, a particular part of the page is marked with an opening and a closing tag. For example, a main header is surrounded by an opening tag, `<h1>`, and a closing tag, `</h1>`. **We can use these HTML tags to identify the parts of the webpage we want to extract.**
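
To see the opening and closing pattern concretely, here is a small sketch that walks through a snippet of HTML and reports each tag it meets. It uses `html.parser.HTMLParser` from Python's standard library rather than `BeautifulSoup`; the class name `TagLister` and the sample snippet are made up for this illustration.

```python
from html.parser import HTMLParser

class TagLister(HTMLParser):
    """Record every opening and closing tag, plus the text between them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(f"open <{tag}>")

    def handle_endtag(self, tag):
        self.events.append(f"close </{tag}>")

    def handle_data(self, data):
        # Skip the whitespace between tags
        if data.strip():
            self.events.append(f"text: {data.strip()}")

# A made-up snippet for illustration
sample_html = "<h1>My Page Title</h1><p>Some body text.</p>"

lister = TagLister()
lister.feed(sample_html)
lister.close()
for event in lister.events:
    print(event)
```

Each element shows up as an opening tag, its text, and then the matching closing tag.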

**Reading HTML**

Poet, programmer, and professor Allison Parrish has created a website for learning about HTML entitled [“Kittens and the TV Shows They Love”](http://static.decontextualize.com/kittens.html).

As before, let’s get the response from the address and convert it to text. Read over the output. Can you identify any labels and patterns in the labels?

In [None]:
response = requests.get("http://static.decontextualize.com/kittens.html")
html_string = response.text
print(html_string)

Here is a list of a few HTML tags: 

![Html-tags.png](Html-tags.png)

### Inspect HTML Elements in your Browser

You can inspect webpages and identify the HTML elements you’re interested in using your browser. When you are on a webpage you want to inspect, right-click and select “Inspect”, or go to “View” > “Inspect Elements”. You can also use the shortcut *control-shift-C* on Windows or *command-option-C* on macOS.

The HTML elements will appear on the right, and you can hover over them to explore.

![Inspect-1.png](Inspect-1.png)

![Inspect-2.png](Inspect-2.png)

### Extracting elements from HTML using BeautifulSoup

Now that we can request and get text data from webpages, and we can read and identify relevant labels from HTML, we can use the library `BeautifulSoup` to extract the parts of the page we are interested in. 

To make a BeautifulSoup document, we call `BeautifulSoup()` with two arguments:
- the response we got from our HTTP request (which we've converted to a string and stored in the variable `html_string`)
- the kind of parser that we want to use, which will always be `"html.parser"` for our purposes.

In [None]:
from bs4 import BeautifulSoup

In [None]:
response = requests.get("http://static.decontextualize.com/kittens.html")
html_string = response.text

document = BeautifulSoup(html_string, "html.parser")

document

We can use the `.find()` method to find and extract certain elements, such as a main header.

In [None]:
document.find("h1")

If we want only the text contained between those tags, we can use `.text` to extract just the text.

In [None]:
document.find("h1").text
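
One thing to watch for: `.find()` returns `None` when no element matches, and calling `.text` on `None` raises an `AttributeError`. The sketch below shows one way to guard against that. The helper name `first_text` and the sample HTML are made up for this illustration and are not the real page source.

```python
from bs4 import BeautifulSoup

# A made-up snippet for illustration, not the real page source
sample_html = "<html><body><h1>Cats and Their Hobbies</h1><p>Welcome!</p></body></html>"
sample_document = BeautifulSoup(sample_html, "html.parser")

def first_text(soup, tag_name):
    """Return the text of the first matching tag, or None if there is no match."""
    element = soup.find(tag_name)
    if element is None:
        return None
    return element.text

print(first_text(sample_document, "h1"))
# There is no h3 tag in the snippet, so this prints None
print(first_text(sample_document, "h3"))
```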

You can also extract multiple HTML elements at a time with `.find_all()`.

In [None]:
document.find_all("h2")

In [None]:
#Find all div elements that have the class "kitten"
document.find_all("div", attrs={"class": "kitten"})
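
Filtering by class is common enough that `BeautifulSoup` also accepts a `class_` keyword argument (with a trailing underscore, because `class` is a reserved word in Python). The sketch below compares the two spellings on a small made-up snippet; the class names and pet names are invented for this illustration.

```python
from bs4 import BeautifulSoup

# A made-up snippet for illustration
sample_html = """
<div class="pet"><h2>Biscuit</h2></div>
<div class="pet"><h2>Mochi</h2></div>
<div class="footer"><h2>Contact</h2></div>
"""
sample_document = BeautifulSoup(sample_html, "html.parser")

# Two ways to ask for div elements with the class "pet"
by_attrs = sample_document.find_all("div", attrs={"class": "pet"})
by_keyword = sample_document.find_all("div", class_="pet")

# Pull the h2 text out of each matching div
names_by_attrs = [div.find("h2").text for div in by_attrs]
names_by_keyword = [div.find("h2").text for div in by_keyword]

print(names_by_attrs)
print(names_by_keyword)
```

Both calls skip the `footer` div, so the "Contact" header is left out of the results.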

In order to extract text data from multiple HTML elements, we need a `for` loop and some list-building.

In [None]:
#Find all h2 headers in the document
all_h2_headers = document.find_all("h2")

#Create an empty list called h2_headers.
#Create a for loop:
#for each header in all_h2_headers,
#we will grab the .text,
#put it into a variable called header_contents,
#then .append() it to our h2_headers list.
h2_headers = []
for header in all_h2_headers:
    header_contents = header.text
    h2_headers.append(header_contents)
h2_headers
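
The same loop can be written more compactly as a list comprehension. The sketch below also uses `.get_text(strip=True)`, which removes leading and trailing whitespace from the extracted text. The sample HTML is made up for this illustration so that the example runs without a network connection.

```python
from bs4 import BeautifulSoup

# A made-up snippet for illustration, with stray whitespace inside the tags
sample_html = "<h2> Biscuit </h2><h2>Mochi\n</h2>"
sample_document = BeautifulSoup(sample_html, "html.parser")

# Build the list of header texts in a single expression
h2_headers = [header.get_text(strip=True) for header in sample_document.find_all("h2")]
print(h2_headers)
```

Whether to use the explicit loop or the comprehension is a matter of taste; the loop is easier to extend when each element needs several processing steps.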