# Web Scraping with Beautiful Soup (continued)

Based on lessons by [Allison Parrish](http://www.decontextualize.com/) and [Jinho Choi](https://github.com/emory-courses/data-science/blob/master/course/data_aggregation/data_aggregation.ipynb)

*Quick discussion of platforms (again) and file formats. We'll leave character encoding to next class.*

### Inspecting HTML's anatomy with Developer Tools

Thanks to Allison Parrish, we have a very simple example of HTML to begin with. It is about kittens. [Here's the rendered version](http://static.decontextualize.com/kittens.html), and [here's the HTML source code](https://raw.githubusercontent.com/ledeprogram/courses/master/databases/data/kittens.html).

First we're going to use Developer Tools in Chrome to take a look at how `kittens.html` is organized. Click on the "rendered version" link above. In Chrome, ctrl-click (or right click) anywhere on the page and select "Inspect Element." This will open Chrome's Developer Tools. Your screen should look (something) like this:

<a href="http://static.decontextualize.com/snaps/kittens-dev-tools.png"><img src="http://static.decontextualize.com/snaps/kittens-dev-tools.png" alt="kittens-dev-tools"/></a>

In the upper panel, you see the web page you're inspecting. In the lower panel, you see a version of the HTML source code, with little arrows next to some of the lines. (The little arrows allow you to collapse parts of the HTML source that are hierarchically related.) As you move your mouse over the elements in the top panel, different parts of the source code will be highlighted. Chrome is showing you which parts of the source code are causing which parts of the page to show up. Pretty spiffy!

This relationship also works in reverse: you can move your mouse over some part of the source code in the lower panel, which will highlight in the top panel what that source code corresponds to on the page. We'll be using this later to visually identify the parts of the page that are interesting to us, so we can write code that extracts the contents of those parts automatically.

### Characterizing the structure of kittens

Here's what the source code of kittens.html looks like:

	<!doctype html>
	<html>
	  <head>
	    <title>Kittens!</title>
	  </head>
	  <body>
	    <h1>Kittens and the TV Shows They Love</h1>
	    <div class="kitten">
	      <h2>Fluffy</h2>
	      <div><img src="http://placekitten.com/100/100"></div>
	      <ul class="tvshows">
	        <li><a href="http://www.imdb.com/title/tt0106145/">Deep Space Nine</a></li>
	        <li><a href="http://www.imdb.com/title/tt0088576/">Mr. Belvedere</a></li>
	      </ul>
	      Last check-up: <span class="lastcheckup">2014-01-17</span>
	    </div>
	    <div class="kitten">
	      <h2>Monsieur Whiskeurs</h2>
	      <div><img src="http://placekitten.com/150/100"></div>
	      <ul class="tvshows">
	        <li><a href="http://www.imdb.com/title/tt0106179/">The X-Files</a></li>
	        <li><a href="http://www.imdb.com/title/tt0098800/">Fresh Prince</a></li>
	      </ul>
	      Last check-up: <span class="lastcheckup">2013-11-02</span>
	    </div>
	  </body>
	</html>

This is pretty well organized HTML, but if you don't know how to read HTML, it will still look like a big jumble. Here's how I would characterize the structure of this HTML, reading in my own idea of what the elements mean.

* We have two "kittens," both of which are contained in `<div>` tags with class `kitten`.
* Each "kitten" `<div>` has an `<h2>` tag with that kitten's name.
* There's an image for each kitten, specified with an `<img>` tag.
* Each kitten has a list (a `<ul>` with class `tvshows`) of television shows, contained within `<li>` tags.
* Those list items themselves have links (`<a>` tags) with an `href` attribute that contains a link to an IMDB entry for that show.

**SOME HTML QUESTIONS FOR YOU:**
* What's the parent tag of `<a href="http://www.imdb.com/title/tt0088576/">Mr. Belvedere</a>`?
* Both `<div class="kitten">` tags share a parent tag---what is it?
* What attributes are present on both `<img>` tags?

### Scraping kittens with Beautiful Soup

We've examined `kittens.html` a bit now. What we'd like to do is write some code that is going to extract information from the HTML, like "what is the last checkup date for each of these kittens?" or "what are Monsieur Whiskeurs's favorite TV shows?" To do so, we need to *parse* the HTML, and create a representation of it in our program that we can manipulate with Python.

As mentioned earlier, HTML is hard to parse by hand. (Don't even try it. In particular, [don't parse HTML with regular expressions](http://stackoverflow.com/a/1732454).)

[Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/) is a Python library that parses HTML (even if it's poorly formatted) and allows us to extract and manipulate its contents. More specifically, it gives us some Python objects that we can call methods on to poke at the data contained therein. So instead of working with strings and bytes, we can work with Python objects, methods and data structures.

If you're using Anaconda, Beautiful Soup comes pre-installed, so there's no extra installation step! (We will learn how to install new libraries in future classes.)

Note that Beautiful Soup only *parses* HTML. It's left to us to actually *get* the HTML from somewhere. In most cases, we'll want to download the HTML directly from the actual web. For that, we'll use the `get` function from the Python library `requests` ([link](https://2.python-requests.org/en/master/)):

*Quick discussion of functions vs. methods; differences are masked in R but not in Python*

In [None]:
# first let's import the requests library
import requests 

If it worked, you won't get an error message. You will see that the execution counter has been incremented by 1.

Now let's use the `get` function to make an HTTP request (the eponymous "requests") to get the contents of kittens.html.

In [None]:
resp = requests.get("http://static.decontextualize.com/kittens.html") 

Note that "resp" is a Python object, and not plain text.

Also note that `requests` makes things easy by guessing at the document's character encoding.

We're not really going to talk about character encoding until next class, but since we can, let's check this page's encoding real quick.

In [None]:
resp.encoding

Now let's create a string with the contents of the web page in text format. We use the `text` attribute of the response for this.

In [None]:
html_str = resp.text

Let's see what it looks like:

In [None]:
html_str

That looks like a mess but it's apparent that we've obtained the data as desired.
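
Before parsing, it can be worth checking that the request actually succeeded. The sketch below wraps the steps above in a small helper. The name `fetch_html` and the 10-second timeout are my own hypothetical choices, not part of the lesson; `timeout` and `raise_for_status()` are standard parts of `requests`.

```python
import requests

def fetch_html(url, timeout=10):
    """Download a page and return its HTML as a string (hypothetical helper)."""
    resp = requests.get(url, timeout=timeout)  # give up if the server is too slow
    resp.raise_for_status()                    # raise an error on 4xx/5xx responses
    return resp.text                           # decoded text, as in the cell above

# Example call (needs a network connection, so it is left commented out):
# html_str = fetch_html("http://static.decontextualize.com/kittens.html")
```

If the server answers with an error page, this fails loudly at the download step instead of handing Beautiful Soup something that isn't the page you wanted.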

Now we need to create a Beautiful Soup object from that data. Here's how to go about doing that:

In [None]:
from bs4 import BeautifulSoup

document = BeautifulSoup(html_str, "html.parser")
type(document)

Calling `BeautifulSoup(...)` creates a new Beautiful Soup object. It takes two arguments: the string containing the HTML data, and a string that designates which underlying parser to use to build the parsed version of the document. (Leave this as `"html.parser"`.) I've assigned this object to the variable `document`. This object supports a number of interesting methods that allow us to dig into the contents of the HTML. Primarily, we'll be working with two kinds of objects: `Tag` objects, and `ResultSet` objects, which are essentially just lists of `Tag` objects.
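
To get a feel for what the parsed object gives you, here is a minimal sketch that parses a tiny made-up HTML string (`tiny_html` is invented for illustration and is not part of `kittens.html`) and prints it back out with `.prettify()`, which re-indents the document so the nesting is visible.

```python
from bs4 import BeautifulSoup

# A made-up one-line page, just for illustration
tiny_html = "<html><body><h1>Hello</h1><p>A note</p></body></html>"

tiny_doc = BeautifulSoup(tiny_html, "html.parser")

print(type(tiny_doc).__name__)   # the class of the parsed document object
pretty = tiny_doc.prettify()     # the same HTML, one tag per line, indented
print(pretty)
```

The same call on the kittens document is a handy way to read a page that arrived as one long, messy string.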

### Finding a tag

As we've previously discussed, HTML documents are composed of tags. To represent this, Beautiful Soup has a type of value that represents tags. We can use the `.find()` method of the `BeautifulSoup` object to find a tag that matches a particular tag name. For example:

In [None]:
h1_tag = document.find('h1')
type(h1_tag)

A `Tag` object has several interesting attributes and methods. The `string` attribute of a `Tag` object, for example, returns a string representing that tag's contents:

In [None]:
h1_tag.string
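
One caveat worth knowing: `.string` only gives you text when the tag contains a single piece of text. If the tag has several children, `.string` is `None`, and `.get_text()` is the usual way to collect all the text inside. The snippet below uses a made-up bit of HTML to show the difference.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration
snippet = BeautifulSoup("<div><h2>Fluffy</h2> likes <b>naps</b></div>", "html.parser")

h2 = snippet.find("h2")
div = snippet.find("div")

print(h2.string)        # one text child, so we get that text
print(div.string)       # several children, so this is None
print(div.get_text())   # all the text inside the div, joined together
```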

You can access the attributes of a tag by treating the tag object as though it were a dictionary, using the square-bracket index syntax, with the name of the attribute whose value you want as a string inside the brackets. 

**NOTE FOR FOLKS USING PYTHON FOR THE FIRST TIME:**

Unlike R, which (I think) just has lists, Python has two data types you'll see a lot for storing collections of values: lists, which keep their values in order and look them up by position, and dictionaries (what I'm talking about here), which store key:value pairs and look values up by key rather than by position.
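
Here is a small, self-contained comparison of the two. The values are borrowed from the kittens page, but the variable names are made up for this sketch.

```python
# A list keeps values in order; you look them up by position (starting at 0).
shows = ["Deep Space Nine", "Mr. Belvedere"]
print(shows[0])

# A dictionary stores key: value pairs; you look values up by key.
kitten = {"name": "Fluffy", "lastcheckup": "2014-01-17"}
print(kitten["name"])

# Adding (or replacing) a pair uses the same square-bracket syntax.
kitten["favorite_show"] = shows[0]
print(kitten)
```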

For more on dictionaries, see [this notebook](dictionaries-sets-tuples.ipynb).

For now, though, just see if you can follow this example:

To print out the `src` attribute of the first `<img>` tag in the document, you would write:

In [None]:
img_tag = document.find('img')
img_tag['src']

> Note: You might have noticed that there is more than one `<img>` tag in `kittens.html`! If more than one tag matches the name you pass to `.find()`, it returns only the *first* matching tag. (A better name for `.find()` might be `find_first`.)
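
Since a tag behaves like a dictionary of its attributes, a couple of dictionary-style tools are available too. This sketch, on a one-line made-up document, shows `.attrs` (all attributes at once) and `.get()`, which returns `None` for a missing attribute where square brackets would raise a `KeyError`.

```python
from bs4 import BeautifulSoup

img_snippet = BeautifulSoup('<img src="http://placekitten.com/100/100">', "html.parser")
img = img_snippet.find("img")

print(img.attrs)        # all of the tag's attributes, as a dictionary
print(img.get("src"))   # same value as img["src"]
print(img.get("alt"))   # no alt attribute here, so None (img["alt"] would raise KeyError)
```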

### Finding multiple tags

It's very often the case that we want to find not just one tag that matches particular criteria, but ALL tags matching those criteria. For that, we use the `.find_all()` method of the `BeautifulSoup` object. For example, to find all `h2` tags in the document:

In [None]:
h2_tags = document.find_all('h2')
type(h2_tags)

But what's in the `ResultSet`?

In order to find out, we're going to need to use a loop.

We're going to use some 'for' loop syntax that's very common in Python:

In [None]:
for tag in h2_tags:
    print(tag.string) 

This conveniently gives you a variable, `tag`, that updates with the appropriate value each time you iterate.

Does R have this too? You tell me! 

(For more on loops and counting, see [this notebook](counting.ipynb)). 

In any case, both the `.find()` and `.find_all()` methods can search not just for tags with particular names, but also for tags that have particular attributes. For that, we use the `attrs` keyword argument, giving it a dictionary that associates attribute names as keys and the desired attribute value as values. For example, to find all `span` tags with a `class` attribute of `lastcheckup`:

In [None]:
checkup_tags = document.find_all('span', attrs={'class': 'lastcheckup'})
[tag.string for tag in checkup_tags]

**Important note about list comprehension in Python**

Line 2 above is a helpful shorthand: it creates a list (same as R lists) with each of the `tag.string`s in `checkup_tags`.

In more official terms, it's called a *list comprehension*, and it helps with a very common task in both data analysis and computer programming: when you want to apply an operation to every item in a list (e.g., scaling the numbers in a list by a fixed factor), or create a copy of a list with only those items that match a particular criterion (e.g., eliminating values that fall below a certain threshold). 

A list comprehension has a few parts:

* a source list, or the list whose values will be transformed or filtered;
* a predicate expression, to be evaluated for every item in the list;
* (optionally) a membership expression that determines whether or not an item in the source list will be included in the result of evaluating the list comprehension, based on whether the expression evaluates to True or False; and
* a temporary variable name by which each value from the source list will be known in the predicate expression and membership expression.

These parts are arranged like so:

> `[` *predicate expression* `for` *temporary variable name* `in` *source list* `if` *membership expression* `]`

The words for, in, and if are a part of the syntax of the expression. They don't mean anything in particular (and in fact, they do completely different things in other parts of the Python language). You just have to spell them right and put them in the right place in order for the list comprehension to work.
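
A short sketch with made-up numbers shows both uses mentioned above (transforming every item, and keeping only some items), next to the equivalent `for` loop.

```python
weights = [3, 8, 1, 10]   # made-up numbers for illustration

# Transform every item: scale each number by a fixed factor.
doubled = [w * 2 for w in weights]

# Filter: keep only the items at or above a threshold.
heavy = [w for w in weights if w >= 8]

# The first comprehension does the same job as this loop:
doubled_loop = []
for w in weights:
    doubled_loop.append(w * 2)

print(doubled, heavy, doubled == doubled_loop)
```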

You can see more examples of this in action [here](lists.ipynb).

**An unrelated note, but before we move on:**

Beautiful Soup's `.find()` and `.find_all()` methods are actually more powerful than we're letting on here. [Check out the details in the official Beautiful Soup documentation.](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all)
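
As a taste of what's in there, here is a minimal sketch of three documented options: the `class_` keyword (a shortcut for `attrs={'class': ...}`), passing a list of tag names, and `limit`. The HTML is a cut-down, made-up version of the kittens page.

```python
from bs4 import BeautifulSoup

demo_html = """
<h1>Kittens</h1>
<div class="kitten"><h2>Fluffy</h2><span class="lastcheckup">2014-01-17</span></div>
<div class="kitten"><h2>Monsieur Whiskeurs</h2><span class="lastcheckup">2013-11-02</span></div>
"""
demo = BeautifulSoup(demo_html, "html.parser")

# class_ (note the trailing underscore) is shorthand for attrs={'class': ...}
dates = [tag.string for tag in demo.find_all("span", class_="lastcheckup")]

# A list of tag names matches any of them, in document order
headings = [tag.string for tag in demo.find_all(["h1", "h2"])]

# limit stops the search after that many matches
first_kitten = demo.find_all("div", class_="kitten", limit=1)

print(dates, headings, len(first_kitten))
```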

### Finding tags within tags

Let's say that we wanted to print out a list of the name of each kitten, along with a list of the names of that kitten's favorite TV shows. In other words, we want to print out something that looks like this:

    Fluffy: Deep Space Nine, Mr. Belvedere
    Monsieur Whiskeurs: The X-Files, Fresh Prince
    
In order to do this, we need to find *not just* tags with particular names, but tags with *particular hierarchical relationships* with other tags. I.e., we need to identify all of the kittens, and then find the shows that belong to each kitten. This kind of search is made easy by the fact that you can use `.find()` and `.find_all()` methods not just on the entire document, but on *individual tags*. When you use these methods on tags, they search for matching tags that are specifically *nested inside* the tag that you call them on (its children, their children, and so on).
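
One detail to keep in mind: by default these methods look at everything nested inside the tag, at any depth, not only its immediate children. The sketch below (made-up HTML) shows the default behavior next to `recursive=False`, which restricts the search to direct children.

```python
from bs4 import BeautifulSoup

nested = BeautifulSoup(
    '<div class="kitten"><a href="#top">top</a>'
    '<ul><li><a href="#show">show</a></li></ul></div>',
    "html.parser",
)
div = nested.find("div")

all_links = div.find_all("a")                      # everything nested inside the div
direct_links = div.find_all("a", recursive=False)  # only the div's immediate children

print(len(all_links), len(direct_links))
```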

In our kittens example, we can see that information about individual kittens is grouped together under `<div>` tags with a `class` attribute of `kitten`. So, to find a list of all `<div>` tags with `class` set to `kitten`, we might do this:

In [None]:
kitten_tags = document.find_all("div", attrs={"class": "kitten"})

Now, we'll loop over that list of tags and find, inside each of them, the `<h2>` tag that is its child:


In [None]:
for kitten_tag in kitten_tags:
    h2_tag = kitten_tag.find('h2')
    print(h2_tag.string)

Now, we'll go one extra step. Looping over all of the kitten tags, we'll find not just the `<h2>` tag with the kitten's name, but all `<a>` tags (which contain the names of the TV shows that we were looking for):

In [None]:
for kitten_tag in kitten_tags:
    h2_tag = kitten_tag.find('h2')
    a_tags = kitten_tag.find_all('a')
    a_tag_strings = [tag.string for tag in a_tags] # note list comprehension syntax
    a_tag_strings_joined = ", ".join(a_tag_strings) # this is some string manipulation
                                                    # that I think is similar to R.
                                                    # It joins the items in the list,
                                                    # putting ", " between them
    print(h2_tag.string + ": " + a_tag_strings_joined)
        # The '+' operator is different from R. It's a very simple way of concatenating
        # two strings. (Other data types must be converted with str() first.)
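
For a closer look at the two string operations used in that loop, here is a self-contained sketch with plain Python values (no Beautiful Soup needed). It also shows that `+` only works between strings, so a number has to go through `str()` first.

```python
names = ["The X-Files", "Fresh Prince"]

joined = ", ".join(names)   # put ", " between the items of the list
count = len(names)          # an integer, not a string

# "Shows (" + count would raise a TypeError, so convert with str() first
line = "Shows (" + str(count) + "): " + joined
print(line)
```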

**EXERCISE:** Paste the code above into the next cell, and modify it to print out a list of kitten names along with the last check-up date for that kitten.

HINT: Look up above for the code that deals with check-up dates. 

In [None]:
# One possible solution:

for kitten_tag in kitten_tags:
    h2_tag = kitten_tag.find('h2') # keep this to find the kittens
    checkup_tag = kitten_tag.find('span', attrs={'class': 'lastcheckup'})

    print(h2_tag.string + ", " + checkup_tag.string)

**ANOTHER EXERCISE**: Rewrite the code above to print a list of kitten names with links to that kitten's favorite shows. I.e., you should end up with something of the format:

`Name: [kitten name]
 URLS: www.asdasdsa.com, www.sdkalskdjsa.com`
 
Hint: Look up above for how you access the attribute of a tag. 

In [None]:
# One possible solution:

for kitten_tag in kitten_tags:
    h2_tag = kitten_tag.find('h2') # keep this to find the kittens
    url_tags = kitten_tag.find_all('a')
    url_strings = [url['href'] for url in url_tags]
    urls_joined = ", ".join(url_strings)
    print("Name: " + h2_tag.string)
    print("URLs: " + urls_joined)

 

### Finding sibling tags

Often, the tags we're looking for don't have a distinguishing characteristic, like a `class` attribute, that allows us to find them using `.find()` and `.find_all()`, and the tags also aren't in a parent-child relationship. This can be tricky! Take the following HTML snippet, for example:    

In [None]:
cheese_html = """
<h2>Camembert</h2>
<p>A soft cheese made in the Camembert region of France.</p>

<h2>Cheddar</h2>
<p>A yellow cheese made in the Cheddar region of... France, probably, idk whatevs.</p>
"""

If our task was to create a list of the name of each cheese along with the description in the `<p>` tag directly afterward, `.find()` and `.find_all()` alone would leave us out of luck. Fortunately, Beautiful Soup has a `.find_next_sibling()` method, which searches for the next tag that is a *sibling* of the tag you're calling it on (i.e., the two tags share a parent) and that also matches particular criteria. So, for example, to accomplish the task outlined above:

In [None]:
document = BeautifulSoup(cheese_html, "html.parser")
cheese_dict = {}
for h2_tag in document.find_all('h2'):
    cheese_name = h2_tag.string
    cheese_desc_tag = h2_tag.find_next_sibling('p')
    cheese_dict[cheese_name] = cheese_desc_tag.string

cheese_dict
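
Two details of `.find_next_sibling()` are easy to miss: it skips over siblings that don't match, and it returns `None` when no later sibling matches. A minimal sketch with made-up HTML:

```python
from bs4 import BeautifulSoup

menu = BeautifulSoup(
    "<h2>Brie</h2><hr><p>Soft and mild.</p><h2>Gouda</h2>",
    "html.parser",
)
brie, gouda = menu.find_all("h2")

brie_desc = brie.find_next_sibling("p")    # skips the <hr> and finds the <p>
gouda_desc = gouda.find_next_sibling("p")  # nothing follows Gouda, so this is None

print(brie_desc.string)
print(gouda_desc)
```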

You now know most of what you need to know to scrape web pages effectively. Good job!

In a minute, we're going to attempt to level up, and scrape an actual web page on the internet. 

But before that, we should talk about what to do when things go wrong-- which they invariably will.

### When things go wrong with Beautiful Soup

A number of things might go wrong with Beautiful Soup. You might, for example, search for a tag that doesn't exist in the document:

In [None]:
footer_tag = document.find("footer")

Beautiful Soup doesn't return an error if it can't find the tag you want. Instead, it returns `None`:

In [None]:
type(footer_tag)

If you try to call a method on the object that Beautiful Soup returned anyway, you might end up with an error like this:

In [None]:
footer_tag.find("p")

You might also inadvertently try to get an attribute of a tag that wasn't actually found. You'll get a similar error in that case:

In [None]:
footer_tag['title']

Whenever you see something like `AttributeError: 'NoneType' object has no attribute 'find'` or `TypeError: 'NoneType' object is not subscriptable`, it's a good idea to check to see whether your method calls are indeed finding the thing you were looking for.

However, the `.find_all()` method will return an empty list if it doesn't find any of the tags you wanted:

In [None]:
footer_tags = document.find_all("footer")
print(footer_tags)

If you attempt to access one of the elements of this regardless...

In [None]:
print(footer_tags[0].string)

...you'll get an `IndexError`.
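
A common way to avoid both kinds of error is to check the result before using it. The sketch below is one possible pattern, not the only one; the page and the fallback text are made up for illustration.

```python
from bs4 import BeautifulSoup

page = BeautifulSoup("<h1>Kittens!</h1>", "html.parser")

# .find() returns None when nothing matches, so test for that first
footer = page.find("footer")
if footer is None:
    footer_text = "(no footer found)"
else:
    footer_text = footer.get_text()

# .find_all() returns an empty result, which counts as False in a condition
footers = page.find_all("footer")
first_footer = footers[0] if footers else None

print(footer_text, first_footer)
```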

## Back to the task at hand: Scraping contents off the actual web

If you recall, I had you read this article from *The New York Times* for class today. Were you wondering why? Well, we're going to scrape it for a list of all the songs that the candidates have played.

Does anyone remember the first step?

In [None]:
# enter the first step here (hint: it uses the requests library)

resp = ... # complete

In [None]:
# now, see what it looks like

html_str = ... # complete

Whoa! That's a lot more complicated than kittens. Let's go back to Chrome and take a look using View Source.

To get a lay of the (HTML) land, try doing a simple command-f for "Aretha" (as in Franklin; the first artist mentioned at the top of the site). 

Questions:
* What is the tag enclosing the first mention of Aretha Franklin? 
* What is its previous sibling? 
* What is the parent to that tag?
* What is the parent to *that* tag?
* What is the parent to *that* tag?
* What is the parent to *that* tag?

Remember the answer to that last question, and let's go take a look using Developer Tools.

Now I'm going to show you a shortcut: Chrome's "Inspect Element" feature. It's really helpful for locating specific elements in complicated HTML.

Using "Inspect Element," can we confirm our answers to the questions above? If so, we're ready to move on to Beautiful Soup.

Does anyone remember (or want to scroll back up to the top) and tell me how you create a Beautiful Soup object from the html?

In [None]:
document = ... # complete 

Now that we've got the Beautiful Soup object, we need to find all of the song-artist tags and all of the song-title tags.

What Beautiful Soup method can we use to do this?

In [None]:
title_tags = ... # complete 

Let's see what it looks like:

In [None]:
print(title_tags)

It looks like we got it! Now, let's format this nicely into a list. 

In [None]:
# make a list
song_titles = []

for title in title_tags:
    pass  # complete this
    
song_titles

If there's time, can we use another method we learned above to create a dictionary of the format
{song title: artist}?

In [None]:
# if there's time, can we use another method we learned above to create a dictionary of the format
# {song title: artist}?

# make a dict
song_dict = {}

for title in title_tags:
    pass  # complete this
    
song_dict

Bonus question: Why didn't I ask you to make the dictionary in the format {artist: title}?