# Web Scraping

[Download relevant files here](https://melaniewalsh.org/Web-Scraping.zip)

Inspired by web scraping lessons from [Lauren Klein](https://github.com/laurenfklein/emory-qtm340/blob/master/notebooks/class4-web-scraping-complete.ipynb) and [Allison Parrish](https://github.com/aparrish/dmep-python-intro/blob/master/scraping-html.ipynb)

How did the data journalists of *The Pudding* actually collect the necessary screenplay and rap song data for their visualizations and analyses? 

<img src='../images/Pudding-film-dialogue-Mean-Girls.png' width=100%>

<img src='../images/Pudding-rap-viz.png' width=100%>

Well, they almost certainly used some form of web scraping. Web scraping is a way of computationally extracting data from the internet. It's one of the two ways of computationally collecting data from the internet that we're going to discuss in this class.

# Why Do We Need To Scrape At All?

To understand the necessity and significance of web scraping, let's walk through the likely data collection process behind [“Film Dialogue from 2,000 screenplays, Broken Down by Gender and Age”](https://pudding.cool/2017/03/film-dialogue/) or any project similar to it.

One of the biggest sources for *The Pudding*'s screenplay data was the [Cornell Movie Dialogues Corpus](http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html). This is a corpus created by Cornell CIS professors Cristian Danescu-Niculescu-Mizil and Lillian Lee for their paper ["Chameleons in imagined conversations"](http://www.cs.cornell.edu/~cristian/papers/chameleons.pdf). Go Big Red! These researchers helpfully shared a dataset of every URL that they used to find and access the screenplays in their own project.

Let's take a look:

In [53]:
import pandas as pd

In [500]:
urls = pd.read_csv("../data/cornell-movie-corpus/raw_script_urls.csv", delimiter='\t', encoding='utf-8')

In [501]:
urls

Unnamed: 0,id,movie_title,script_url
0,m0,10 things i hate about you,http://www.dailyscript.com/scripts/10Things.html
1,m1,1492: conquest of paradise,http://www.hundland.org/scripts/1492-Conquest...
2,m2,15 minutes,http://www.dailyscript.com/scripts/15minutes....
3,m3,2001: a space odyssey,http://www.scifiscripts.com/scripts/2001.txt
4,m4,48 hrs.,http://www.awesomefilm.com/script/48hours.txt
...,...,...,...
612,m612,watchmen,http://www.scifiscripts.com/scripts/wtchmn.txt
613,m613,xxx,http://www.dailyscript.com/scripts/xXx.txt
614,m614,x-men,http://www.scifiscripts.com/scripts/xmenthing...
615,m615,young frankenstein,http://www.horrorlair.com/scripts/young.txt
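
To see what `delimiter='\t'` is doing, here is a minimal sketch that reads a tiny tab-separated excerpt from a string instead of from the CSV file. The `sample_tsv` text, the `sample_urls_df` variable, and the `find_script_url()` helper are made up for illustration; the two rows are copied from the table above.

```python
import io

import pandas as pd

# A made-up, two-row excerpt in the same tab-separated layout as the real file
sample_tsv = (
    "\tid\tmovie_title\tscript_url\n"
    "0\tm0\t10 things i hate about you\thttp://www.dailyscript.com/scripts/10Things.html\n"
    "3\tm3\t2001: a space odyssey\thttp://www.scifiscripts.com/scripts/2001.txt\n"
)

# delimiter="\t" tells pandas that columns are separated by tabs, not commas
sample_urls_df = pd.read_csv(io.StringIO(sample_tsv), delimiter="\t")


def find_script_url(dataframe, title):
    """Return the script URL for the first row whose movie_title matches `title`."""
    matches = dataframe[dataframe["movie_title"] == title]
    return matches["script_url"].iloc[0]


print(find_script_url(sample_urls_df, "2001: a space odyssey"))
```

The same lookup should work on the full `urls` dataframe, since it has the same column names.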


This is an extremely useful dataset! But how can we actually use these URLs to get workable, computationally tractable text data? Well, we could manually navigate to each URL and then copy and paste each screenplay into a plain text file....

But that route would be suuuuper slow and painstaking, not to mention that we would lose some crucial data along the way—for example, information that might help us automatically distinguish the title of the movie from the screenplay itself. It would be much better if we could programmatically access the text data attached to every URL.

# Responses and Requests

The first step down this more efficient web scraping path is to import a Python library called [requests](https://requests.readthedocs.io/en/master/), which will help us access the web page data associated with every URL. We're going to practice by **requesting** the screenplay data for the movie *Ghostbusters*.

<img src="https://pbs.twimg.com/profile_images/1203012648406667264/RR4pig4F_400x400.jpg" width=100%>

When you type a URL into your browser's address bar, you're sending an HTTP **request** for a web page, and the server that stores that web page will send back a **response**: the web page data that your browser then renders.

<img src="../images/request-response.png" width=100%>

In [538]:
import requests

## `.get()`

With the `.get()` method, we can request to "get" web page data for a specific URL, which we will store in a variable called `response`.

In [74]:
response = requests.get("http://www.scifiscripts.com/scripts/Ghostbusters.txt")

## HTTP Status Code

If you check out `response`, it will simply tell you its [HTTP response code](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status), aka whether the request was successful or not. "200" is a successful response, while "404" is a common "Page Not Found" error.

In [540]:
response

<Response [200]>

Let's see what happens if I change the title of the movie from *Ghostbusters* to *Ghostboogers* in the URL...

In [545]:
bad_response = requests.get("http://www.scifiscripts.com/scripts/Ghostboogers.txt")

<img src="../images/Ghostboogers.png" width=100%>

In [546]:
bad_response

<Response [404]>
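
The number inside `<Response [...]>` is also available as an integer through `response.status_code`. As a small sketch of what these numbers mean, Python's built-in `http.HTTPStatus` can translate a code into its standard phrase. The `describe_status()` helper is a hypothetical name used only for this example.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a status code together with its standard phrase."""
    status = HTTPStatus(code)
    return f"{status.value} {status.phrase}"


# The two codes we saw above
for code in [200, 404]:
    print(describe_status(code))
```

In a scraping script, you might check `response.status_code == 200` before trusting `response.text`.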

# Grab the `.text`

To actually get at the text data in the response, we need to use `.text`, which we will save in a variable called `html_string`. This particular file is plain text, but the text data for most web pages is formatted in the HTML markup language, which we will talk more about in the BeautifulSoup section below.

In [75]:
html_string = response.text

Voila! Here's the screenplay now in a variable.

In [76]:
print(html_string)

						Ghostbusters

							by
						Harold Ramis
							and
						Dan Aykroyd

					Final Shooting Script
				Last revised October 7, 1983


											FADE IN

EXT. NEW YORK PUBLIC LIBRARY -- DAY

The sun shines brightly on the classic facade of the main library at Fifth
Avenue and 42nd Street.  In the adjacent park area, pretty hustlers and
drug peddlers go about their business.

FRONT STEPS

A few people lounge on the steps flanked by the familiar stone lions.

INT. MAIN READING ROOM -- DAY

People are dotted throughout the room sitting at the long oak tables
polished by decades of use.  Reading lamps with green glass shades cast a
golden glow on the tables.  The patina of age is everywhere.  It is very
quiet.

LIBRARIAN

A slightly stout, studious looking girl in her late twenties circulates
quietly among the tables picking up books and putting them on her cart.
Everything seems completely normal and peaceful.

POV

A single eerie musical note signals the presence of something stra

# Looping Requests

Let's quickly demonstrate how we might loop through the URLs and get text data for each film. We're going to create a smaller dataframe from the Cornell Movie Dialogue Corpus, which consists of 10 randomly selected movies.

In [79]:
import pandas as pd

In [80]:
urls = pd.read_csv("../data/cornell-movie-corpus/raw_script_urls.csv", delimiter='\t', encoding='utf-8')

In [99]:
sample_urls = urls.sample(10)

In [100]:
sample_urls

Unnamed: 0,id,movie_title,script_url
610,m610,the wizard of oz,http://www.scifiscripts.com/scripts/wizoz.txt
589,m589,u turn,http://www.dailyscript.com/scripts/u-turn_sho...
205,m205,taxi driver,http://www.dailyscript.com/scripts/Taxi%20Dri...
416,m416,kramer vs. kramer,http://www.dailyscript.com/scripts/Kramer%20v...
76,m76,gladiator,http://www.angelfire.com/movies/ridleyscott/s...
73,m73,the ghost and the darkness,http://www.dailyscript.com/scripts/Ghost%20An...
478,m478,predator,http://www.scifiscripts.com/scripts/predator.txt
8,m8,a nightmare on elm street: the dream child,http://www.hundland.org/scripts/A-Nightmare-o...
332,m332,l.a. confidential,http://www.dailyscript.com/scripts/LA%20Confi...
438,m438,midnight cowboy,http://www.dailyscript.com/scripts/Midnight%2...


Then we're going to make a function called `scrape_screenplay()` that includes our `requests.get()` and `response.text` code.

In [83]:
def scrape_screenplay(url):
    response = requests.get(url)
    html_string = response.text
    return html_string

Then we're going to loop through every URL in our smaller sample dataframe, scrape each screenplay from each URL, and then print the first 900 characters for each screenplay.

In [101]:
for url in sample_urls['script_url']:
    full_screenplay = scrape_screenplay(url)
    sample_screenplay = full_screenplay[:900]
    print(f"\n🎬🎬🎬🎬🎬🎬🎬\n{sample_screenplay}\n🎬🎬🎬🎬🎬🎬🎬\n")


🎬🎬🎬🎬🎬🎬🎬
The Wizard Of Oz -- Movie Script 
** DISCLAIMER & CREDITS **
This script was transcribed by Paul Rudoff
script copyright © 1939 Metro-Goldwyn-Meyer.
All rights reserved.

            http://spookcentral.cjb.net



                      The Wizard Of Oz

                             by
                        Noel Langley
                      Florence Ryerson
                   and Edgar Allen Woolf

                 Cutting Continuity Script
                 Taken From Printer's Dupe
                Last revised March 15, 1939


FADE IN -- Title:

For nearly forty years this story has given faithful service to the Young
in Heart; and Time has been powerless to put its kindly philosophy out of
fashion.

To those of you who have been faithful to it in return

...and to the Young in Heart --- we dedicate this picture.

								FADE OUT:

MS -- Dorothy stoo
🎬🎬🎬🎬🎬🎬🎬


🎬🎬🎬🎬🎬🎬🎬
<html>

<head>
   <title>"U Turn", shooting draft, revised by Richard Rutowski & Oliver Stone</title>
</he

ConnectionError: ('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer'))
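
The loop above stopped partway through because one server reset the connection. One possible way to make the loop more forgiving, sketched here under the assumption that skipping a failed URL is acceptable, is to catch request errors, set a timeout, and pause between requests. The names `scrape_screenplay_safely()` and `scrape_many()`, the 10-second timeout, and the 1-second pause are illustrative choices, not part of the original lesson.

```python
import time

import requests


def scrape_screenplay_safely(url, timeout=10):
    """Return the page text for `url`, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        # Turn 4xx/5xx status codes (like 404) into exceptions
        response.raise_for_status()
    except requests.exceptions.RequestException as error:
        # Covers connection errors, timeouts, bad URLs, and bad status codes
        print(f"Skipping {url}: {error}")
        return None
    return response.text


def scrape_many(urls, pause_seconds=1):
    """Scrape each URL in turn, pausing between requests to be polite to servers."""
    results = {}
    for url in urls:
        results[url] = scrape_screenplay_safely(url)
        time.sleep(pause_seconds)
    return results
```

Calling `scrape_many(sample_urls['script_url'])` would return a dictionary that maps each URL to its text, with `None` for any screenplay that could not be fetched.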

# BeautifulSoup & HTML

Not all web pages will be as easy to scrape as these screenplay files, however. Let's say we wanted to scrape the lyrics for Missy Elliott's song "The Rain (Supa Dupa Fly)" (1997) from *Genius*.

<img src="../images/Missy-Elliott.png" width=100%>

Even at a glance, we can tell that this *Genius* web page is a lot more complicated than the *Ghostbusters* page and that it contains a lot of information beyond the lyrics. Sure enough, if we use our requests library again and try to grab the data for this web page, the underlying data is much more complicated, too.

In [30]:
response = requests.get("https://genius.com/Missy-elliott-the-rain-supa-dupa-fly-lyrics")
html_string = response.text
print(html_string)



<!DOCTYPE html>
<html class="snarly apple_music_player--enabled bagon_song_page--enabled song_stories_public_launch--enabled react_forums--disabled" xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml" lang="en" xml:lang="en">
  <head>
    <base target='_top' href="//genius.com/">

    <script type="text/javascript">
//<![CDATA[

  var _sf_startpt=(new Date()).getTime();
  if (window.performance && performance.mark) {
    window.performance.mark('parse_start');
  }

//]]>
</script>

<title>Missy Elliott – The Rain (Supa Dupa Fly) Lyrics | Genius Lyrics</title>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta content='width=device-width,initial-scale=1' name='viewport'>

  <meta name="apple-itunes-app" content="app-id=709482991">

<link href="https://assets.genius.com/images/apple-touch-icon.png?1582666308" rel="apple-touch-icon" />


  

  <link href="https://assets.genius.com/images/apple-touch-icon.png?1582666308" rel="apple-

How can we extract just the song lyrics from this messy soup of a document? Luckily there's a Python library that can help us called BeautifulSoup, which parses HTML documents.

To understand BeautifulSoup and HTML, we're going to briefly depart from our Missy Elliott lyrics challenge to consider a much simpler website. (But we will return to Missy soon!) This toy website was made by the poet, programmer, and professor Allison Parrish explicitly for the purposes of teaching BeautifulSoup.

## HTML

Parrish's website is titled "Kittens and the TV Shows They Love," and it can be found at the following URL: http://static.decontextualize.com/kittens.html Let's check it out.

<img src="../images/kittens-web.png" width=100%>

If we use our requests library on this Kittens TV website, this is what we get:

In [32]:
response = requests.get("http://static.decontextualize.com/kittens.html")
html_string = response.text
print(html_string)

<!doctype html>
<html>
	<head>
		<title>Kittens!</title>
		<style type="text/css">
			span.lastcheckup { font-family: "Courier", fixed; font-size: 11px; }
		</style>
	</head>
	<body>
		<h1>Kittens and the TV Shows They Love</h1>
		<div class="kitten">
			<h2>Fluffy</h2>
			<div><img src="http://placekitten.com/120/120"></div>
			<ul class="tvshows">
				<li>
					<a href="http://www.imdb.com/title/tt0106145/">Deep Space Nine</a>
				</li>
				<li>
					<a href="http://www.imdb.com/title/tt0088576/">Mr. Belvedere</a>
				</li>
			</ul>
			Last check-up: <span class="lastcheckup">2014-01-17</span>
		</div>
		<div class="kitten">
			<h2>Monsieur Whiskeurs</h2>
			<div><img src="http://placekitten.com/110/110"></div>
			<ul class="tvshows">
				<li>
					<a href="http://www.imdb.com/title/tt0106179/">The X-Files</a>
				</li>
				<li>
					<a href="http://www.imdb.com/title/tt0098800/">Fresh Prince</a>
				</li>
			</ul>
			Last check-up: <span class="lastcheckup">2013-11-02</span>
		</div

### HTML Tags

This is an HTML document. HTML stands for HyperText Markup Language. It is the standard language for writing web page documents. The most important thing you need to know about HTML is that the language uses HTML "tags" to represent different elements, such as a main header `<h1>`. 

| HTML Tag | Explanation |
|----------|-------------|
| `<!DOCTYPE>` | Defines document type |
| `<html>` | Defines HTML document |
| `<head>` | Main information about document |
| `<title>` | Title for document |
| `<body>` | Document body |
| `<h1>` to `<h6>` | Headings |
| `<p>` | Paragraph |
| `<br>` | Line break |
| `<!-- comment here -->` | Comment |
| `<img>` | Image |
| `<a>` | Hyperlink |
| `<ul>` | Unordered list |
| `<ol>` | Ordered list |
| `<li>` | List item |
| `<style>` | Style information for a document |
| `<div>` | Section in a document |
| `<span>` | Inline section in a document |

HTML tags often, but not always, require a "closing" tag. For example, the main header "Kittens and the TV Shows They Love" will be surrounded by `<h1>` (opening tag) and `</h1>` (closing tag) on either side: `<h1>Kittens and the TV Shows They Love</h1>`

### HTML Attributes, Classes, and IDs

HTML elements sometimes come with even more information inside a tag. This will often be a keyword (like `class` or `id`) followed by an equals sign `=` and a further descriptor, as in `<div class="kitten">`.

We need to know about tags as well as attributes, classes, and IDs because this is how we're going to extract specific HTML data with BeautifulSoup.
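
To make tags and attributes concrete, here is a rough sketch that lists every opening tag and its attributes in a small HTML snippet. It uses Python's built-in `html.parser` module rather than BeautifulSoup, and the `TagLister` class and `snippet` string are hypothetical names invented for this illustration.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Collect each opening tag along with its attributes."""

    def __init__(self):
        super().__init__()
        self.opening_tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs, e.g. [("class", "kitten")]
        self.opening_tags.append((tag, dict(attrs)))


# A shortened piece of the Kittens page
snippet = '<div class="kitten"><h2>Fluffy</h2><span class="lastcheckup">2014-01-17</span></div>'

lister = TagLister()
lister.feed(snippet)

for tag, attributes in lister.opening_tags:
    print(tag, attributes)
```

The `div` and `span` tags each carry a `class` attribute, while the `h2` tag has no attributes at all.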

# BeautifulSoup

In [6]:
from bs4 import BeautifulSoup

To make a BeautifulSoup document, we call `BeautifulSoup()` with two parameters: the `html_string` from our HTTP request and [the kind of parser](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#specifying-the-parser-to-use) that we want to use, which will always be `"html.parser"` for our purposes.

In [103]:
response = requests.get("http://static.decontextualize.com/kittens.html")
html_string = response.text
document = BeautifulSoup(html_string, "html.parser")

In [35]:
document

<!DOCTYPE doctype html>

<html>
<head>
<title>Kittens!</title>
<style type="text/css">
			span.lastcheckup { font-family: "Courier", fixed; font-size: 11px; }
		</style>
</head>
<body>
<h1>Kittens and the TV Shows They Love</h1>
<div class="kitten">
<h2>Fluffy</h2>
<div><img src="http://placekitten.com/120/120"/></div>
<ul class="tvshows">
<li>
<a href="http://www.imdb.com/title/tt0106145/">Deep Space Nine</a>
</li>
<li>
<a href="http://www.imdb.com/title/tt0088576/">Mr. Belvedere</a>
</li>
</ul>
			Last check-up: <span class="lastcheckup">2014-01-17</span>
</div>
<div class="kitten">
<h2>Monsieur Whiskeurs</h2>
<div><img src="http://placekitten.com/110/110"/></div>
<ul class="tvshows">
<li>
<a href="http://www.imdb.com/title/tt0106179/">The X-Files</a>
</li>
<li>
<a href="http://www.imdb.com/title/tt0098800/">Fresh Prince</a>
</li>
</ul>
			Last check-up: <span class="lastcheckup">2013-11-02</span>
</div>
</body>
</html>

## `.find()` HTML Elements

We can use the `.find()` method to find and extract certain elements, such as a main header.

In [39]:
document.find("h1")

<h1>Kittens and the TV Shows They Love</h1>

If we want only the text contained between those tags, we can use `.text` to extract just the text.

In [68]:
document.find("h1").text

'Kittens and the TV Shows They Love'
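
One thing to watch for: `.find()` returns `None` when nothing matches, so calling `.text` on the result raises an `AttributeError`. A minimal sketch of one way to guard against that follows; the `find_text()` helper and the one-line `sample_document` are assumptions made up for this example.

```python
from bs4 import BeautifulSoup

# A one-element stand-in for the Kittens page
sample_document = BeautifulSoup("<h1>Kittens and the TV Shows They Love</h1>", "html.parser")


def find_text(document, tag_name):
    """Return the text of the first matching element, or None if there is no match."""
    element = document.find(tag_name)
    if element is None:
        return None
    return element.text.strip()


print(find_text(sample_document, "h1"))
print(find_text(sample_document, "h3"))
```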

Find the HTML element that contains an image. Hint: the HTML image tag is "img"

In [69]:
#Your Code Here

## `.find_all()` HTML Elements

You can also extract multiple HTML elements at a time with `.find_all()`

In [51]:
document.find_all("img")

[<img src="http://placekitten.com/120/120"/>,
 <img src="http://placekitten.com/110/110"/>]

In [104]:
document.find_all("div", attrs={"class": "kitten"})

[<div class="kitten">
 <h2>Fluffy</h2>
 <div><img src="http://placekitten.com/120/120"/></div>
 <ul class="tvshows">
 <li>
 <a href="http://www.imdb.com/title/tt0106145/">Deep Space Nine</a>
 </li>
 <li>
 <a href="http://www.imdb.com/title/tt0088576/">Mr. Belvedere</a>
 </li>
 </ul>
 			Last check-up: <span class="lastcheckup">2014-01-17</span>
 </div>, <div class="kitten">
 <h2>Monsieur Whiskeurs</h2>
 <div><img src="http://placekitten.com/110/110"/></div>
 <ul class="tvshows">
 <li>
 <a href="http://www.imdb.com/title/tt0106179/">The X-Files</a>
 </li>
 <li>
 <a href="http://www.imdb.com/title/tt0098800/">Fresh Prince</a>
 </li>
 </ul>
 			Last check-up: <span class="lastcheckup">2013-11-02</span>
 </div>]
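
The `attrs={"class": "kitten"}` argument narrows the search to `div` elements whose `class` attribute is `kitten`. As a small sketch on a cut-down, made-up version of the page (`sample_html` below), we can then search inside each matching `div` to pull out a kitten's name and last check-up date. BeautifulSoup also accepts the keyword `class_="kitten"` as a shorthand for the same filter.

```python
from bs4 import BeautifulSoup

# A shortened stand-in for the Kittens page
sample_html = """
<div class="kitten"><h2>Fluffy</h2><span class="lastcheckup">2014-01-17</span></div>
<div class="kitten"><h2>Monsieur Whiskeurs</h2><span class="lastcheckup">2013-11-02</span></div>
"""
sample_document = BeautifulSoup(sample_html, "html.parser")

# Two equivalent ways to filter by class
kitten_divs = sample_document.find_all("div", attrs={"class": "kitten"})
same_divs = sample_document.find_all("div", class_="kitten")

# Search within each kitten's div rather than the whole document
kitten_checkups = []
for kitten in kitten_divs:
    name = kitten.find("h2").text
    checkup = kitten.find("span", attrs={"class": "lastcheckup"}).text
    kitten_checkups.append((name, checkup))

print(kitten_checkups)
```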

In [41]:
document.find("h2")

<h2>Fluffy</h2>

In [42]:
document.find_all("h2")

[<h2>Fluffy</h2>, <h2>Monsieur Whiskeurs</h2>]

Let's try to extract the text from all the header2 elements:

In [70]:
document.find_all("h2").text

AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?

Uh oh. That won't work. In order to extract the text data from multiple HTML elements, we're going to need a `for` loop and some list-building powers.

In [44]:
all_h2_headers = document.find_all("h2")

In [63]:
h2_headers = []
for header in all_h2_headers:
    header_contents = header.text
    h2_headers.append(header_contents)

In [46]:
h2_headers

['Fluffy', 'Monsieur Whiskeurs']

### List Comprehension?

How might we transform this exact same `for` loop into a one-line list comprehension instead? Refer back to [List Comprehensions](https://melaniewalsh.github.io/Intro-Cultural-Analytics/Python/More-Lists-Loops.html#List-Comprehensions) to jog your memory.

In [61]:
h2_headers = #Your Code Here

In [62]:
h2_headers

['Fluffy', 'Monsieur Whiskeurs']

## Inspect The HTML 🧐

Most of the time, if you're looking to extract something from an HTML document, it's best to use the "Inspect" tool in your web browser. You can hover over elements that you're interested in and find that specific element in the HTML.

<img src="../images/inspect.png" width=100%>

For example, if we hover over the main header:

<img src="../images/inspect-h1.png" width=100%>

Or if we hover over a link:

<img src="../images/inspect-a.png" width=100%>
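
Once inspecting the page tells you which tag and attribute hold the data you want, you can read that attribute with square brackets, much like a dictionary. Here is a minimal sketch that collects each link's text and its `href` value; the `sample_html` string is a shortened excerpt of the Kittens page, and the `shows` list is a name chosen only for this example.

```python
from bs4 import BeautifulSoup

# A shortened excerpt of one kitten's TV show list
sample_html = """
<ul class="tvshows">
  <li><a href="http://www.imdb.com/title/tt0106145/">Deep Space Nine</a></li>
  <li><a href="http://www.imdb.com/title/tt0088576/">Mr. Belvedere</a></li>
</ul>
"""
sample_document = BeautifulSoup(sample_html, "html.parser")

shows = []
for link in sample_document.find_all("a"):
    # link.text is the visible link text; link["href"] is the URL it points to
    shows.append((link.text, link["href"]))

print(shows)
```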

# Back to Missy Elliott — Your Turn!

OK, so now we've learned a little bit about how to use BeautifulSoup to parse HTML documents. How would we apply what we've learned to extract Missy Elliott lyrics?

<img src="../images/Missy-Elliott.png" width=100%>

In [56]:
response = requests.get("https://genius.com/Missy-elliott-the-rain-supa-dupa-fly-lyrics")
html_str = response.text
document = BeautifulSoup(html_str, "html.parser")

In [57]:
document


<!DOCTYPE html>

<html class="snarly apple_music_player--enabled bagon_song_page--enabled song_stories_public_launch--enabled react_forums--disabled" lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml">
<head>
<base href="//genius.com/" target="_top"/>
<script type="text/javascript">
//<![CDATA[

  var _sf_startpt=(new Date()).getTime();
  if (window.performance && performance.mark) {
    window.performance.mark('parse_start');
  }

//]]>
</script>
<title>Missy Elliott – The Rain (Supa Dupa Fly) Lyrics | Genius Lyrics</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="width=device-width,initial-scale=1" name="viewport"/>
<meta content="app-id=709482991" name="apple-itunes-app"/>
<link href="https://assets.genius.com/images/apple-touch-icon.png?1582666308" rel="apple-touch-icon"/>
<link href="https://assets.genius.com/images/apple-touch-icon.png?1582666308" rel="apple-touch-icon"/>
<!-- Mobil

What HTML element do we need to "find" to extract the song lyrics?

In [None]:
missy_lyrics = #Your Code Here

In [None]:
print(missy_lyrics)

What HTML element do we need to "find" to extract the title?

In [None]:
song_title = #Your Code Here

In [None]:
print(song_title)