# Web Scraping with Beautiful Soup

Lauren F. Klein wrote version 1.0 of this notebook, based on lessons by [Allison Parrish](http://www.decontextualize.com/) and [Jinho Choi](https://github.com/emory-courses/data-science/blob/master/course/data_aggregation/data_aggregation.ipynb), which I have supplemented with material from Melanie Walsh's chapter [TF-IDF](https://melaniewalsh.github.io/Intro-Cultural-Analytics/features/Text-Analysis/TF-IDF.html) from her online textbook [_Introduction to Cultural Analytics & Python_](https://melaniewalsh.github.io/Intro-Cultural-Analytics/features/welcome.html).

## Why Do We Need To Scrape At All?

To do data science with text, we need... **text**. Some text is prepared for computational analysis and publicly available in digital libraries. We can easily get novels in the public domain as .txt files, for example, from [Project Gutenberg](https://www.gutenberg.org/). Or we can work with text from HathiTrust's vast collections through the [HathiTrust Research Center](https://analytics.hathitrust.org/).

If we want to work with text from popular culture and the internet--and we do!--we often need to scrape the web. To understand the necessity and significance of web scraping, let's walk through the likely data collection process behind [“Film Dialogue from 2,000 screenplays, Broken Down by Gender and Age”](https://pudding.cool/2017/03/film-dialogue/) or any project similar to it.

One of the biggest sources for *The Pudding*'s screenplay data was the [Cornell Movie-Dialogs Corpus](http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html). This is a corpus created by Cornell CIS professors Cristian Danescu-Niculescu-Mizil and Lillian Lee for their paper ["Chameleons in imagined conversations"](http://www.cs.cornell.edu/~cristian/papers/chameleons.pdf). These researchers helpfully shared a dataset of every URL that they used to find and access the screenplays in their own project.

Let's take a look:

First, we import pandas, a Python library for managing data.

In [1]:
import pandas as pd

Then, we read in the data, which is stored as a .csv file. First you have to download the file from Canvas and store it somewhere you can easily find it again. I've stored mine in the `docs` folder within my `QTM-340` folder, which contains the materials for this course. Go to Canvas now to download it or get there by clicking [here](https://canvas.emory.edu/courses/76593). It's called `raw_script_urls` and it's under the `Datasets` module.

You'll see three "arguments" between parentheses in the code below. The first is the filepath, telling pandas where to find the data. You will have to edit this argument to match where you've stored the data on your computer. Don't worry about the next two arguments for now.

In [2]:
urls = pd.read_csv("../docs/raw_script_urls.csv", delimiter='\t', encoding='utf-8')
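
If you are curious about those other two arguments, here is a minimal sketch. It builds a tiny tab-separated table in memory (the rows are copied from the preview of the real file; the `sample_text` and `demo` names are made up for illustration) and reads it with the same `delimiter` setting. The `delimiter` argument tells pandas which character separates the columns, and the `encoding` argument tells pandas how to turn a file's bytes into characters (UTF-8 is the most common choice).

```python
import io
import pandas as pd

# A tiny stand-in for the real file: columns are separated by tabs ("\t"),
# not commas, even though the file extension is .csv.
sample_text = (
    "id\tmovie_title\tscript_url\n"
    "m0\t10 things i hate about you\thttp://www.dailyscript.com/scripts/10Things.html\n"
    "m3\t2001: a space odyssey\thttp://www.scifiscripts.com/scripts/2001.txt\n"
)

# delimiter="\t" tells pandas to split each line on tabs.
# (No encoding is needed here because the text is already a Python string.)
demo = pd.read_csv(io.StringIO(sample_text), delimiter="\t")

print(demo.shape)           # (number of rows, number of columns)
print(list(demo.columns))   # the column names taken from the first line
```

Without `delimiter="\t"`, pandas would look for commas, find none, and squash each line into a single column.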

Now, when we run the code naming the data we've just read in, Jupyter will render the top and bottom five rows from the .csv for us. It will also tell us how many rows and columns are in the data.

In [3]:
urls

Unnamed: 0,id,movie_title,script_url
0,m0,10 things i hate about you,http://www.dailyscript.com/scripts/10Things.html
1,m1,1492: conquest of paradise,http://www.hundland.org/scripts/1492-Conquest...
2,m2,15 minutes,http://www.dailyscript.com/scripts/15minutes....
3,m3,2001: a space odyssey,http://www.scifiscripts.com/scripts/2001.txt
4,m4,48 hrs.,http://www.awesomefilm.com/script/48hours.txt
...,...,...,...
612,m612,watchmen,http://www.scifiscripts.com/scripts/wtchmn.txt
613,m613,xxx,http://www.dailyscript.com/scripts/xXx.txt
614,m614,x-men,http://www.scifiscripts.com/scripts/xmenthing...
615,m615,young frankenstein,http://www.horrorlair.com/scripts/young.txt


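This is an extremely useful dataset! But how can we actually use these URLs to get workable, computationally tractable text data? Well, we could manually navigate to each URL and then copy and paste each screenplay into a plain text file... but with more than 600 screenplays, that would be slow and tedious. Instead, we can write code that does the fetching for us.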

## Responses and Requests

The first step down this more efficient web scraping path is to import a Python library called [requests](https://requests.readthedocs.io/en/master/), which will help us access the web page data associated with every URL. We're going to practice by **requesting** the screenplay data for the movie *Ghostbusters*.

<img src="https://pbs.twimg.com/profile_images/1203012648406667264/RR4pig4F_400x400.jpg" width=100%>

When you type a URL into your browser's address bar, you're sending an HTTP **request** for a web page, and the server that stores that web page will send back a **response**: some web page data that your browser will render.

In [4]:
import requests

With the `.get()` method, we can request to "get" web page data for a specific URL, which we will store in a variable called `response`.

**Side note:** Since you're familiar with R, you know what a function is: a block of organized, reusable code that is called by a name, and performs some sort of action. Python has functions too, as well as things called methods, which are functions that are associated only with a particular object or class. Keep on reading to see an example of the `get` method in action.

In [5]:
response = requests.get("http://www.scifiscripts.com/scripts/Ghostbusters.txt")

### HTTP Status Code

If you check out `response`, it will simply tell you its [HTTP response code](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status), aka whether the request was successful or not. "200" is a successful response, while "404" is a common "Page Not Found" error.

In [6]:
response

<Response [200]>

Let's see what happens if I change the title of the movie from *Ghostbusters* to *Ghostboogers* in the URL...

In [7]:
bad_response = requests.get("http://www.scifiscripts.com/scripts/Ghostboogers.txt")

In [8]:
bad_response

<Response [404]>
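
If you want to check the outcome in code rather than by eye, the response object also stores the number itself in its `status_code` attribute (so `response.status_code` would be `200` for the successful request above). As a small sketch, Python's built-in `http` module can translate a status code into its standard phrase; the `describe_status` helper is a made-up name for illustration.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a status code together with its standard phrase."""
    return f"{code} {HTTPStatus(code).phrase}"

print(describe_status(200))
print(describe_status(404))
```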

### Grab the `.text`

To actually get at the text data in the response, we need to use `.text`, which we will save in a variable called `html_string`. For most web pages, this text is formatted in HTML, a markup language that we will talk more about in the BeautifulSoup section below. (This particular screenplay happens to be a plain text file, so there is no markup to deal with yet.)

In [9]:
html_string = response.text

Voila! Here's the screenplay now in a variable.

In [10]:
print(html_string)

						Ghostbusters

							by
						Harold Ramis
							and
						Dan Aykroyd

					Final Shooting Script
				Last revised October 7, 1983


											FADE IN

EXT. NEW YORK PUBLIC LIBRARY -- DAY

The sun shines brightly on the classic facade of the main library at Fifth
Avenue and 42nd Street.  In the adjacent park area, pretty hustlers and
drug peddlers go about their business.

FRONT STEPS

A few people lounge on the steps flanked by the familiar stone lions.

INT. MAIN READING ROOM -- DAY

People are dotted throughout the room sitting at the long oak tables
polished by decades of use.  Reading lamps with green glass shades cast a
golden glow on the tables.  The patina of age is everywhere.  It is very
quiet.

LIBRARIAN

A slightly stout, studious looking girl in her late twenties circulates
quietly among the tables picking up books and putting them on her cart.
Everything seems completely normal and peaceful.

POV

A single eerie musical note 

## Looping Requests

Let's quickly demonstrate how we might loop through the URLs and get text data for each film. We're going to create a smaller dataframe from the Cornell Movie-Dialogs Corpus, which consists of 10 randomly selected movies.

In [11]:
sample_urls = urls.sample(10)

In [12]:
sample_urls

Unnamed: 0,id,movie_title,script_url
296,m296,chinatown,http://www.dailyscript.com/scripts/Chinatown.txt
304,m304,contact,http://www.scifiscripts.com/scripts/contact.txt
156,m156,panther,http://www.dailyscript.com/scripts/Panther.txt
284,m284,bringing out the dead,http://www.dailyscript.com/scripts/Bringing+O...
407,m407,jennifer eight,http://www.dailyscript.com/scripts/jennifer-e...
470,m470,pet sematary ii,http://www.dailyscript.com/scripts/petsemetar...
91,m91,hope and glory,http://www.dailyscript.com/scripts/Hope+And+G...
501,m501,the salton sea,http://www.dailyscript.com/scripts/salton_sea...
520,m520,snow falling on cedars,http://www.dailyscript.com/scripts/snow-falli...
51,m51,drop dead gorgeous,http://www.dailyscript.com/scripts/Drop+Dead+...


Then we're going to make a function called `scrape_screenplay()` that includes our `requests.get()` and `response.text` code.

In [13]:
def scrape_screenplay(url):
    response = requests.get(url)
    html_string = response.text
    return html_string

Then we're going to loop through every URL in our smaller sample dataframe, scrape each screenplay from each URL, and then print the first 900 characters for each screenplay.

In [14]:
for url in sample_urls['script_url']:
    full_screenplay = scrape_screenplay(url)
    sample_screenplay = full_screenplay[:900]
    print(f"\n🎬🎬🎬🎬🎬🎬🎬\n{sample_screenplay}\n🎬🎬🎬🎬🎬🎬🎬\n")


🎬🎬🎬🎬🎬🎬🎬
                                       "CHINATOWN"

                                            by

                                       ROBERT TOWNE

                

               FULL SCREEN PHOTOGRAPH Grainy but unmistakably a man and 
               woman making love. Photograph shakes. SOUND of a man MOANING 
               in anguish. The photograph is dropped, REVEALING ANOTHER, 
               MORE compromising one. Then another, and another. More moans.

                                     CURLY'S VOICE
                              (crying out)
                         Oh, no.

               INT. GITTES' OFFICE

               CURLY drops the photos on Gittes' desk. Curly towers over 
               GITTES and sweats heavily through his workman's clothes, his 
               breathing progressively more labored. A drop plunks on Gittes' 
               shiny desk top.

 
🎬🎬🎬🎬🎬🎬🎬


🎬🎬🎬🎬🎬🎬🎬
					 CONTACT

			Rewrite by Michael Goldenberg

		    Based on the Nove


🎬🎬🎬🎬🎬🎬🎬
                                   "DROP DEAD GORGEOUS"

                                      Screenplay by

                                      Lona Williams

                

               FADE IN:

               EXT. COUNTRY ROAD - MINNESOTA - DAY

               Vintage black and white stock footage of some farms and 
               farmhouses.

                                                               DISSOLVE TO:

               EXT. COUNTRY ROAD - DAY

               Color footage of cotton fields passing by. We FREEZE and

                                                             FADE TO BLACK.

               TITLE WIPES IN:

                           1995 MARKED THE FIFTIETH ANNIVERSARY
                         OF THE NATION'S OLDEST BEAUTY CONTEST...
                 THE SARAH ROSE COSMETICS AMERICAN TEEN PRINCESS PAGEANT
                           A DOCUMENTARY
🎬🎬🎬🎬🎬🎬🎬



Nice work! We now know how to scrape text from simple websites! We can get a lot of text data this way!
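
One caveat before scaling this loop up to hundreds of URLs: some requests will fail (dead links, slow servers), and sending requests as fast as possible is hard on the sites being scraped. The sketch below is one cautious variant, not the only way to do it. The function names `scrape_screenplay_safely` and `scrape_many`, the 10-second timeout, and the 1-second pause are all assumptions chosen for illustration.

```python
import time
import requests

def scrape_screenplay_safely(url, timeout=10):
    """Return the text at `url`, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        # Turn error codes such as 404 into exceptions so we can skip them.
        response.raise_for_status()
    except requests.exceptions.RequestException as error:
        print(f"Skipping {url}: {error}")
        return None
    return response.text

def scrape_many(urls, pause_seconds=1):
    """Scrape each URL in turn, pausing between requests to be polite."""
    screenplays = {}
    for url in urls:
        text = scrape_screenplay_safely(url)
        if text is not None:
            screenplays[url] = text
        time.sleep(pause_seconds)
    return screenplays
```

Calling `scrape_many(sample_urls['script_url'])` would then return a dictionary that maps each working URL to its screenplay text and quietly skips the broken ones.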

# BeautifulSoup & HTML

But not all web pages will be as easy to scrape as these screenplay files. Let's say we wanted to scrape the lyrics for Missy Elliott's song "The Rain (Supa Dupa Fly)" (1997) from *Genius*.

<img src="../web-scraping/images/Missy-Elliott.png" width=100%>

Even at a glance, we can tell that this *Genius* web page is a lot more complicated than the *Ghostbusters* page and that it contains a lot of information beyond the lyrics. Sure enough, if we use our requests library again and try to grab the data for this web page, the underlying data is much more complicated, too.

In [15]:
response = requests.get("https://genius.com/Missy-elliott-the-rain-supa-dupa-fly-lyrics")
html_string = response.text
print(html_string)



<!DOCTYPE html>
<html class="snarly apple_music_player--enabled bagon_song_page--enabled song_stories_public_launch--enabled react_forums--disabled" xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml" lang="en" xml:lang="en">
  <head>
    <base target='_top' href="//genius.com/">

    <script type="text/javascript">
//<![CDATA[

  var _sf_startpt=(new Date()).getTime();
  if (window.performance && performance.mark) {
    window.performance.mark('parse_start');
  }

//]]>
</script>

<title>Missy Elliott – The Rain (Supa Dupa Fly) Lyrics | Genius Lyrics</title>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta content='width=device-width,initial-scale=1' name='viewport'>

  <meta name="apple-itunes-app" content="app-id=709482991">

<link href="https://assets.genius.com/images/apple-touch-icon.png?1600204144" rel="apple-touch-icon" />


  

  <link href="https://assets.genius.com/images/apple-touch-icon.png?1600204144" rel="apple-

How can we extract just the song lyrics from this messy soup of a document? Luckily there's a Python library that can help us called BeautifulSoup, which parses HTML documents.

To understand BeautifulSoup and HTML, we're going to briefly depart from our Missy Elliott lyrics challenge to consider a much simpler website. (But we will return to Missy soon!) This toy website was made by the poet, programmer, and professor Allison Parrish explicitly for the purposes of teaching BeautifulSoup.

## HTML

Parrish's website is titled "Kittens and the TV Shows They Love," and it can be found at the following URL: http://static.decontextualize.com/kittens.html. Let's check it out.

If we use our requests library on this Kittens TV website, this is what we get:

In [16]:
response = requests.get("http://static.decontextualize.com/kittens.html")
html_string = response.text
print(html_string)

<!doctype html>
<html>
	<head>
		<title>Kittens!</title>
		<style type="text/css">
			span.lastcheckup { font-family: "Courier", fixed; font-size: 11px; }
		</style>
	</head>
	<body>
		<h1>Kittens and the TV Shows They Love</h1>
		<div class="kitten">
			<h2>Fluffy</h2>
			<div><img src="http://placekitten.com/120/120"></div>
			<ul class="tvshows">
				<li>
					<a href="http://www.imdb.com/title/tt0106145/">Deep Space Nine</a>
				</li>
				<li>
					<a href="http://www.imdb.com/title/tt0088576/">Mr. Belvedere</a>
				</li>
			</ul>
			Last check-up: <span class="lastcheckup">2014-01-17</span>
		</div>
		<div class="kitten">
			<h2>Monsieur Whiskeurs</h2>
			<div><img src="http://placekitten.com/110/110"></div>
			<ul class="tvshows">
				<li>
					<a href="http://www.imdb.com/title/tt0106179/">The X-Files</a>
				</li>
				<li>
					<a href="http://www.imdb.com/title/tt0098800/">Fresh Prince</a>
				</li>
			</ul>
			Last check-up: <span class="lastcheckup">2013-11-02</span>
		</div

This is an HTML document. HTML stands for HyperText Markup Language. It is the standard language for writing web page documents. The most important thing you need to know about HTML is that the language uses HTML "tags" to represent different elements, such as a main header `<h1>`. 

| HTML Tag | Explanation |
|----------|-------------|
| `<!DOCTYPE>` | Defines document type |
| `<html>` | Defines HTML document |
| `<head>` | Main information about document |
| `<title>` | Title for document |
| `<body>` | Document body |
| `<h1>` to `<h6>` | Headings |
| `<p>` | Paragraph |
| `<br>` | Line break |
| `<!-- comment here -->` | Comment |
| `<img>` | Image |
| `<a>` | Hyperlink |
| `<ul>` | Unordered list |
| `<ol>` | Ordered list |
| `<li>` | List item |
| `<style>` | Style information for a document |
| `<div>` | Section in a document |
| `<span>` | Inline section in a document |
HTML tags often, but not always, require a "closing" tag. For example, the main header "Kittens and the TV Shows They Love" will be surrounded by `<h1>` (opening tag) and `</h1>` (closing tag) on either side: `<h1>Kittens and the TV Shows They Love</h1>`

### HTML Attributes, Classes, and IDs

HTML elements sometimes come with even more information inside a tag. This will often be a keyword (like `class` or `id`) followed by an equals sign `=` and a further descriptor, such as `<div class="kitten">`.

We need to know about tags as well as attributes, classes, and IDs because this is how we're going to extract specific HTML data with BeautifulSoup.
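
To make this vocabulary concrete, here is a small sketch that uses Python's built-in `html.parser` module (not BeautifulSoup, which comes later) to list every opening tag in a fragment along with its attributes. The `TagLister` class and the shortened HTML fragment are made up for illustration.

```python
from html.parser import HTMLParser

class TagLister(HTMLParser):
    """Record every opening tag and its attributes."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs, e.g. [("class", "kitten")]
        self.seen.append((tag, dict(attrs)))

fragment = '<div class="kitten"><h2>Fluffy</h2><img src="http://placekitten.com/120/120"></div>'

lister = TagLister()
lister.feed(fragment)
for tag, attributes in lister.seen:
    print(tag, attributes)
```

The `<div>` and `<img>` tags each carry one attribute (`class` and `src`, respectively), while the `<h2>` tag carries none.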

Here again is the source code for `kittens.html`:

	<!doctype html>
	<html>
	  <head>
	    <title>Kittens!</title>
	  </head>
	  <body>
	    <h1>Kittens and the TV Shows They Love</h1>
	    <div class="kitten">
	      <h2>Fluffy</h2>
	      <div><img src="http://placekitten.com/100/100"></div>
	      <ul class="tvshows">
	        <li><a href="http://www.imdb.com/title/tt0106145/">Deep Space Nine</a></li>
	        <li><a href="http://www.imdb.com/title/tt0088576/">Mr. Belvedere</a></li>
	      </ul>
	      Last check-up: <span class="lastcheckup">2014-01-17</span>
	    </div>
	    <div class="kitten">
	      <h2>Monsieur Whiskeurs</h2>
	      <div><img src="http://placekitten.com/150/100"></div>
	      <ul class="tvshows">
	        <li><a href="http://www.imdb.com/title/tt0106179/">The X-Files</a></li>
	        <li><a href="http://www.imdb.com/title/tt0098800/">Fresh Prince</a></li>
	      </ul>
	      Last check-up: <span class="lastcheckup">2013-11-02</span>
	    </div>
	  </body>
	</html>

This is pretty well organized HTML, but if you don't know how to read HTML, it will still look like a big jumble. Here's how I would characterize the structure of this HTML, reading in my own idea of what the elements mean.

* We have two "kittens," both of which are contained in `<div>` tags with class `kitten`.
* Each "kitten" `<div>` has an `<h2>` tag with that kitten's name.
* There's an image for each kitten, specified with an `<img>` tag.
* Each kitten has a list (a `<ul>` with class `tvshows`) of television shows, contained within `<li>` tags.
* Those list items themselves have links (`<a>` tags) with an `href` attribute that contains a link to an IMDB entry for that show.

**SOME HTML QUESTIONS FOR YOU:**
* What's the parent tag of `<a href="http://www.imdb.com/title/tt0088576/">Mr. Belvedere</a>`? 

* Both `<div class="kitten">` tags share a parent tag. What is it?

* What attributes are present on both `<img>` tags?

## BeautifulSoup

### Scraping kittens with Beautiful Soup

We've examined `kittens.html` a bit now. What we'd like to do is write some code that is going to extract information from the HTML, like "what is the last checkup date for each of these kittens?" or "what are Monsieur Whiskeurs's favorite TV shows?" To do so, we need to *parse* the HTML, and create a representation of it in our program that we can manipulate with Python.

As mentioned earlier, HTML is hard to parse by hand. (Don't even try it. In particular, [don't parse HTML with regular expressions](http://stackoverflow.com/a/1732454).)

Let's scrape text from HTML!

In [17]:
from bs4 import BeautifulSoup

To make a BeautifulSoup document, we call `BeautifulSoup()` with two parameters: the `html_string` from our HTTP request and [the kind of parser](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#specifying-the-parser-to-use) that we want to use, which will always be `"html.parser"` for our purposes.

In [18]:
response = requests.get("http://static.decontextualize.com/kittens.html")
html_string = response.text

document = BeautifulSoup(html_string, "html.parser")

In [19]:
document

<!DOCTYPE doctype html>

<html>
<head>
<title>Kittens!</title>
<style type="text/css">
			span.lastcheckup { font-family: "Courier", fixed; font-size: 11px; }
		</style>
</head>
<body>
<h1>Kittens and the TV Shows They Love</h1>
<div class="kitten">
<h2>Fluffy</h2>
<div><img src="http://placekitten.com/120/120"/></div>
<ul class="tvshows">
<li>
<a href="http://www.imdb.com/title/tt0106145/">Deep Space Nine</a>
</li>
<li>
<a href="http://www.imdb.com/title/tt0088576/">Mr. Belvedere</a>
</li>
</ul>
			Last check-up: <span class="lastcheckup">2014-01-17</span>
</div>
<div class="kitten">
<h2>Monsieur Whiskeurs</h2>
<div><img src="http://placekitten.com/110/110"/></div>
<ul class="tvshows">
<li>
<a href="http://www.imdb.com/title/tt0106179/">The X-Files</a>
</li>
<li>
<a href="http://www.imdb.com/title/tt0098800/">Fresh Prince</a>
</li>
</ul>
			Last check-up: <span class="lastcheckup">2013-11-02</span>
</div>
</body>
</html>

### `.find()` HTML Elements

We can use the `.find()` method to find and extract certain elements, such as a main header.

In [20]:
document.find("h1")

<h1>Kittens and the TV Shows They Love</h1>

If we want only the text contained between those tags, we can use `.text` to extract just the text.

In [21]:
document.find("h1").text

'Kittens and the TV Shows They Love'

In [22]:
type(document.find("h1").text)

str

Find the HTML element that contains an image. Hint: the HTML image tag is `<img>`.

In [23]:
document.find("img")

<img src="http://placekitten.com/120/120"/>

**Note:** You might have noticed that there is more than one `<img>` tag in `kittens.html`! If more than one tag matches the name you pass to `.find()`, it returns only the *first* matching tag. (A better name for `.find()` might be `find_first`.)

### `.find_all()` HTML Elements

You can also extract multiple HTML elements at a time with `.find_all()`.

In [24]:
document.find_all("img")

[<img src="http://placekitten.com/120/120"/>,
 <img src="http://placekitten.com/110/110"/>]
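
Each tag that BeautifulSoup returns also gives access to its attributes with dictionary-style square brackets. As a sketch, the code below parses a shortened copy of the kittens HTML (the `snippet` string is an abbreviation made for illustration, so the example runs without a network connection) and pulls the `src` attribute out of each `<img>` tag. It also shows that `.find()` returns `None` when nothing matches.

```python
from bs4 import BeautifulSoup

snippet = """
<div class="kitten"><h2>Fluffy</h2>
  <div><img src="http://placekitten.com/120/120"></div></div>
<div class="kitten"><h2>Monsieur Whiskeurs</h2>
  <div><img src="http://placekitten.com/110/110"></div></div>
"""
soup = BeautifulSoup(snippet, "html.parser")

# Square brackets look up an attribute on a tag.
image_urls = []
for img in soup.find_all("img"):
    image_urls.append(img["src"])
print(image_urls)

# .find() returns None (not an error) when no tag matches.
print(soup.find("table"))
```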

In [25]:
document.find_all("div", attrs={"class": "kitten"})

[<div class="kitten">
 <h2>Fluffy</h2>
 <div><img src="http://placekitten.com/120/120"/></div>
 <ul class="tvshows">
 <li>
 <a href="http://www.imdb.com/title/tt0106145/">Deep Space Nine</a>
 </li>
 <li>
 <a href="http://www.imdb.com/title/tt0088576/">Mr. Belvedere</a>
 </li>
 </ul>
 			Last check-up: <span class="lastcheckup">2014-01-17</span>
 </div>, <div class="kitten">
 <h2>Monsieur Whiskeurs</h2>
 <div><img src="http://placekitten.com/110/110"/></div>
 <ul class="tvshows">
 <li>
 <a href="http://www.imdb.com/title/tt0106179/">The X-Files</a>
 </li>
 <li>
 <a href="http://www.imdb.com/title/tt0098800/">Fresh Prince</a>
 </li>
 </ul>
 			Last check-up: <span class="lastcheckup">2013-11-02</span>
 </div>]

In [26]:
document.find("h2").text

'Fluffy'

In [27]:
document.find_all("h2")

[<h2>Fluffy</h2>, <h2>Monsieur Whiskeurs</h2>]

Let's try to extract the text from all the `<h2>` elements:

In [28]:
document.find_all("h2").text

AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?

Uh oh. That didn't work! To extract text data from multiple HTML elements, we need a `for` loop and some list-building.

In [29]:
# Let's store all the headers in a list
all_h2_headers = document.find_all("h2")

In [30]:
all_h2_headers

[<h2>Fluffy</h2>, <h2>Monsieur Whiskeurs</h2>]

What if we just want to see the text of the headers?

To find out, we can use a loop.

We're going to use some `for` loop syntax that's very common in Python:

In [31]:
for tag in all_h2_headers:
    print(tag.string) 

Fluffy
Monsieur Whiskeurs
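
You may have noticed that this loop uses `.string` where earlier cells used `.text`. For a tag that holds nothing but text, such as these `<h2>` headers, the two give the same result. They differ when a tag contains other tags: `.string` gives `None`, while `.text` joins together all the text inside. Here is a sketch on a shortened, made-up fragment of the kittens page:

```python
from bs4 import BeautifulSoup

fragment = '<div class="kitten"><h2>Fluffy</h2>Last check-up: <span class="lastcheckup">2014-01-17</span></div>'
soup = BeautifulSoup(fragment, "html.parser")

header = soup.find("h2")
kitten = soup.find("div")

print(header.string)   # the <h2> holds only text
print(header.text)     # same text as above
print(kitten.string)   # the <div> holds several children, so this is None
print(kitten.text)     # all the text inside the <div>, joined together

# get_text() can insert a separator and strip extra whitespace
print(kitten.get_text(" ", strip=True))
```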


For our purposes, we'll need to build a list with just the text from the headers. Here's how.

First we will make an empty list called `h2_headers`. 

Then `for` each `header` in `all_h2_headers`, we will grab the `.text`, put it into a variable called `header_contents`, then `.append()` it to our `h2_headers` list.

In [32]:
h2_headers = []
for header in all_h2_headers:
    header_contents = header.text
    h2_headers.append(header_contents)

Let's see what we've got.

In [33]:
h2_headers

['Fluffy', 'Monsieur Whiskeurs']

How might we transform this exact same `for` loop into one line?

In [34]:
h2_headers = [header.text for header in all_h2_headers]

In [35]:
h2_headers

['Fluffy', 'Monsieur Whiskeurs']

**Important note about list comprehension in Python**

The one-line version is a helpful shorthand: it creates a list (similar to a vector or list in R) containing the `.text` of each header in `all_h2_headers`.

In more official terms, it's called a *list comprehension*, and it helps with a very common task in both data analysis and computer programming: when you want to apply an operation to every item in a list (e.g., scaling the numbers in a list by a fixed factor), or create a copy of a list with only those items that match a particular criterion (e.g., eliminating values that fall below a certain threshold). 

A list comprehension has a few parts:

* a source list, or the list whose values will be transformed or filtered;
* a predicate expression, to be evaluated for every item in the list;
* (optionally) a membership expression that determines whether or not an item in the source list will be included in the result of evaluating the list comprehension, based on whether the expression evaluates to True or False; and
* a temporary variable name by which each value from the source list will be known in the predicate expression and membership expression.

These parts are arranged like so:

> `[` *predicate expression* `for` *temporary variable name* `in` *source list* `if` *membership expression* `]`

The words `for`, `in`, and `if` are a part of the syntax of the expression. They don't mean anything in particular (and in fact, they do completely different things in other parts of the Python language). You just have to spell them right and put them in the right place in order for the list comprehension to work.
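
Here is a small sketch of the two uses described above, with made-up numbers: one comprehension scales every item by a fixed factor, and another keeps only the items at or above a threshold.

```python
temperatures = [12, 31, 7, 25, 18]

# Predicate expression only: apply an operation to every item.
doubled = [t * 2 for t in temperatures]

# Predicate plus membership expression: keep only items that pass the test.
warm = [t for t in temperatures if t >= 18]

# Both at once: double only the warm values.
doubled_warm = [t * 2 for t in temperatures if t >= 18]

print(doubled)
print(warm)
print(doubled_warm)
```

In each case, `temperatures` is the source list and `t` is the temporary variable name.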

You can see more examples of this in action [here](lists.ipynb).

**An unrelated note, but before we move on:**

Beautiful Soup's `.find()` and `.find_all()` methods are actually more powerful than we're letting on here. [Check out the details in the official Beautiful Soup documentation.](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all)

Both the `.find()` and `.find_all()` methods can search not just for tags with particular names, but also for tags that have particular attributes. For that, we use the `attrs` keyword argument, giving it a dictionary that maps attribute names (the keys) to the attribute values we want (the values). We already used this above to find every `<div>` with class `kitten`. Combined with a list comprehension, it lets us, for example, collect the text of all `span` tags with a `class` attribute of `lastcheckup`:

In [36]:
checkup_tags = document.find_all('span', attrs={'class': 'lastcheckup'})
[tag.string for tag in checkup_tags]

['2014-01-17', '2013-11-02']
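
Putting these pieces together, we can answer the questions posed at the start of this section: each kitten's favorite shows and last check-up date. The sketch below is one possible approach, run on a shortened copy of the kittens HTML so that it works offline; the `kittens_html` string and the dictionary keys are choices made for illustration. The key idea is that `.find()` and `.find_all()` can be called on any tag, not just on the whole document, which limits the search to that tag's contents.

```python
from bs4 import BeautifulSoup

kittens_html = """
<div class="kitten">
  <h2>Fluffy</h2>
  <ul class="tvshows">
    <li><a href="http://www.imdb.com/title/tt0106145/">Deep Space Nine</a></li>
    <li><a href="http://www.imdb.com/title/tt0088576/">Mr. Belvedere</a></li>
  </ul>
  Last check-up: <span class="lastcheckup">2014-01-17</span>
</div>
<div class="kitten">
  <h2>Monsieur Whiskeurs</h2>
  <ul class="tvshows">
    <li><a href="http://www.imdb.com/title/tt0106179/">The X-Files</a></li>
    <li><a href="http://www.imdb.com/title/tt0098800/">Fresh Prince</a></li>
  </ul>
  Last check-up: <span class="lastcheckup">2013-11-02</span>
</div>
"""
soup = BeautifulSoup(kittens_html, "html.parser")

kittens = []
for kitten_tag in soup.find_all("div", attrs={"class": "kitten"}):
    # Searching from kitten_tag only looks inside this kitten's <div>.
    kittens.append({
        "name": kitten_tag.find("h2").text,
        "shows": [a.text for a in kitten_tag.find_all("a")],
        "last_checkup": kitten_tag.find("span", attrs={"class": "lastcheckup"}).text,
    })

for kitten in kittens:
    print(kitten)
```

Each dictionary in `kittens` now holds one kitten's name, shows, and check-up date, which is a convenient shape for turning into a pandas dataframe later.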

### Inspecting HTML 🧐

It is often helpful to inspect the HTML from the rendered site to get a sense of how to scrape what you want to scrape.

Click on the link: http://static.decontextualize.com/kittens.html. In most browsers, you can ctrl-click (or right click) anywhere on the page and select "Inspect Element." Go ahead. Your screen should look (something) like this:

<a href="http://static.decontextualize.com/snaps/kittens-dev-tools.png"><img src="http://static.decontextualize.com/snaps/kittens-dev-tools.png" alt="kittens-dev-tools"/></a>

In the upper panel, you see the web page you're inspecting. In the lower panel, you see a version of the HTML source code, with little arrows next to some of the lines. (The little arrows allow you to collapse parts of the HTML source that are hierarchically related.) As you move your mouse over the elements in the top panel, different parts of the source code will be highlighted. Your browser is showing you which parts of the source code are causing which parts of the page to show up. Pretty spiffy!

This relationship also works in reverse: you can move your mouse over some part of the source code in the lower panel, which will highlight in the top panel what that source code corresponds to on the page. We can use this to visually identify the parts of the page that are interesting to us, so we can write code that extracts the contents of those parts automatically.

## Back to Missy Elliott — Your Turn!

OK, so now we've learned a little bit about how to use BeautifulSoup to parse HTML documents. How would we apply what we've learned to extract Missy Elliott lyrics?

In [37]:
response = requests.get("https://genius.com/Missy-elliott-the-rain-supa-dupa-fly-lyrics")
html_str = response.text

document = BeautifulSoup(html_str, "html.parser")

In [38]:
document


<!DOCTYPE html>

<html class="snarly apple_music_player--enabled bagon_song_page--enabled song_stories_public_launch--enabled react_forums--disabled" lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml">
<head>
<base href="//genius.com/" target="_top"/>
<script type="text/javascript">
//<![CDATA[

  var _sf_startpt=(new Date()).getTime();
  if (window.performance && performance.mark) {
    window.performance.mark('parse_start');
  }

//]]>
</script>
<title>Missy Elliott – The Rain (Supa Dupa Fly) Lyrics | Genius Lyrics</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="width=device-width,initial-scale=1" name="viewport"/>
<meta content="app-id=709482991" name="apple-itunes-app"/>
<link href="https://assets.genius.com/images/apple-touch-icon.png?1600204144" rel="apple-touch-icon"/>
<link href="https://assets.genius.com/images/apple-touch-icon.png?1600204144" rel="apple-touch-icon"/>
<!-- Mobil

What HTML element do we need to "find" to extract the song lyrics?

In [39]:
missy_lyrics = document.find("p").text

**Check answer below**

In [40]:
missy_lyrics = document.find("p").text

In [41]:
print(missy_lyrics)

[Intro: Missy Elliott & Timbaland]
[*Yawning*]
Run the track

[Chorus: Missy Elliott & Ann Peebles]
Me I'm super fly, super dupa fly
(I can't stand the rain) Supa dupa fly
Me I'm super fly, (against my window) super dupa fly
(I can't stand the rain) supa dupa fly
Me I'm super fly, (against my window) super dupa fly
(I can't stand the rain) supa dupa fly
Me I'm super fly, (against my window)

[Verse 1: Missy Elliott]
When the rain hits my window
I take and *cough* me some indo
Me and Timbaland, ooh, we sang a jangle
We so tight that you get our styles tangled
Sway on dosie-do like you loco
Can we get kinky tonight?
Like Coko, so-so
You don't wanna play with my Yo-Yo
I smoke my hydro on the D-low (D-D-D-D-D-low)

[Hook: Ann Peebles]
I can't stand the rain against my window
I can't stand the rain against my window
I can't stand the rain against my window
I can't stand the rain against my window
I can't stand the rain

[Verse 2: Missy Elliott]
Beep, beep, who got the keys to the Jeep, vroo

What HTML element do we need to "find" to extract the title?

In [42]:
song_title = document.find("h1").text

In [43]:
print(song_title)

The Rain (Supa Dupa Fly)


**CONGRATULATIONS!!** 

You now know how to scrape the web! Endless text is now available for your data science needs!