# Web Scraping on the Actual Web

The [class 5 notebook](class5-web-scraping-preview.ipynb) explained all of the steps involved in web scraping. But what about when it comes time to try it on the actual web? 

We're going to use this [*The New York Times* article](https://www.nytimes.com/interactive/2019/08/19/us/politics/presidential-campaign-songs-playlists.html) about 2020 presidential candidate playlists as our real-world example. Our task for today's class is to scrape it for a list of all the songs that the candidates have played.

**Does anyone remember the first step?**

Hint: It involves the Requests library

In [None]:
# import the requests library and request the contents of the page

Once we've gotten the contents of the page, it's a good idea to take a look. 

**How can you print the response from the server as text?** 

In [None]:
# now, see what it looks like
html_str = # your code here

Whoa! That's a lot more complicated than kittens! Let's go back to Chrome and take a look using View Source.

To get a lay of the (HTML) land, try doing a simple command-f for "Aretha" (as in Franklin; the first artist mentioned at the top of the site). 

**First round of questions for the class**:
* **What is the tag enclosing the first mention of Aretha Franklin?** 
* **Does this tag have an attribute, and if so, what is it?**

Let's do a little review and see if we can remember how to find all tags with a specific attribute.

* **What is the BeautifulSoup function we should use?**

Now let's review the syntax of find_all when you want to include an attribute as a parameter. It looks like this:

In [None]:
# need to import BeautifulSoup since we haven't yet used it in this notebook
from bs4 import BeautifulSoup

# now let's use BeautifulSoup to parse the html document that we got from the web just a few minutes ago
document = BeautifulSoup(html_str, "html.parser")

# and now let's do our find_all query for <span> tags with the attribute "class" and the value "song-artist"
document.find_all("span", attrs={"class": "song-artist"})
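
To see what that kind of query returns without depending on the live page, here is a minimal sketch run against a tiny hand-written snippet. The `sample_html` string and its second song are made up for illustration; only the tag and class names mirror what we saw in View Source. It also tries `class_`, Beautiful Soup's keyword shortcut for matching on the `class` attribute, which should find the same tags.

```python
from bs4 import BeautifulSoup

# hypothetical stand-in for the real page (the second song is invented)
sample_html = """
<div class="song order-1">
    <span class="song-title">Respect</span>
    <span class="song-artist">Aretha Franklin</span>
</div>
<div class="song order-2">
    <span class="song-title">Example Song</span>
    <span class="song-artist">Example Artist</span>
</div>
"""

sample_doc = BeautifulSoup(sample_html, "html.parser")

# the attrs dictionary form, same as the query above
artist_tags = sample_doc.find_all("span", attrs={"class": "song-artist"})

# the class_ keyword form (note the trailing underscore, since "class" is a Python keyword)
artist_tags_short = sample_doc.find_all("span", class_="song-artist")

# pull just the text out of each tag
artist_names = [tag.string for tag in artist_tags]
print(artist_names)
```

Either form is fine; the `attrs` dictionary is handy when the attribute name isn't a valid Python keyword argument.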

Yes! We're getting somewhere! But first, one last thing to learn about BeautifulSoup that we didn't have time to cover yesterday: traversing the HTML document tree.

### A final helpful method for finding sibling tags

Sometimes, the tags we're looking for don't have a distinguishing characteristic, like a `class` attribute with a specific value, that allows us to find them using `.find()` and `.find_all()`. Other times (or sometimes in addition), the tags also aren't in a nice neat parent-child structure that we can traverse easily. This can be tricky! Take the following HTML snippet, for example:

```
<h2>Camembert</h2>
<p>A soft cheese made in the Camembert region of France.</p>

<h2>Cheddar</h2>
<p>A yellow cheese made in the Cheddar region of... France, probably, idk whatevs.</p>
```

If our task was to print the name of each cheese followed by the description in the `<p>` tag directly after it, we'd be out of luck with `.find()` and `.find_all()` alone: nothing in the markup ties each `<p>` to its `<h2>`.

Fortunately, Beautiful Soup has a `.find_next_sibling()` method, which searches forward from the tag you call it on and returns the first later sibling (i.e., a tag that shares the same parent) that matches particular criteria.

So, for example, to accomplish the task outlined above, we'd do the following:

In [None]:
# convert that html up above into a string that we can use
# and remember, from the homework, the triple-quote syntax for feeding in more than a single line of text

cheese_html = """ 
<h2>Camembert</h2>
<p>A soft cheese made in the Camembert region of France.</p>

<h2>Cheddar</h2>
<p>A yellow cheese made in the Cheddar region of... France, probably, idk whatevs.</p>
"""

# now parse our cheese_html web doc using BeautifulSoup 
cheese_doc = BeautifulSoup(cheese_html, "html.parser")

# define a dictionary for future use 
# and remember, a dictionary is a data structure like a list but with key/value pairs
cheese_dict = {}

# first find all the h2 tags
for h2_tag in cheese_doc.find_all('h2'):
    # the cheese name is easy: that's the text in the h2 tag, so just use the .string attribute
    cheese_name = h2_tag.string
    
    # but then we need to find the description
    # here's where we use "find_next_sibling":
    cheese_desc_tag = h2_tag.find_next_sibling('p')
    
    # now, we could just print the string version of cheese_desc_tag and be done
    # but we're practicing dictionaries right now too, so we'll associate the key with the value like this:
    cheese_dict[cheese_name] = cheese_desc_tag.string

# when we're done, dump out our dictionary
cheese_dict

So, armed with our three BeautifulSoup methods, `find()`, `find_all()`, and `find_next_sibling()`, we've got everything we need to move on with our NYT web-scraping project. 

Let's go back to the HTML of our *New York Times* feature that we're scraping. 

Remember: in Chrome, `View->Developer->View source` works well if the page doesn't seem that complicated or, like this particular page, is well-formatted. But you can also use `View->Developer->Developer Tools` if you're not entirely sure what you're looking for, or if you like that nesting feature.

In [None]:
# go look at it in view source...

There are always multiple ways to scrape text from a web page, but since we've just learned the `find_next_sibling` method, let's try to incorporate that. 

In [None]:
# remember, we already have our "document" object from up above, which is the whole doc parsed by Beautiful Soup

# so we can jump into...
title_tags = document.find_all('song-title')

title_tags


**But this does not work... why?**

Let's try again with the *attribute* value as "song-title" rather than the tag itself:

In [None]:
title_tags = document.find_all("span", attrs={"class": "song-title"})

title_tags
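
If you'd like to see the difference between the two queries side by side, here is a small offline sketch. The one-line `snippet` string is a hypothetical stand-in for the real page. The first argument to `find_all` is treated as a tag *name*, so the first query goes looking for `<song-title>` tags, which don't exist.

```python
from bs4 import BeautifulSoup

# hypothetical one-song stand-in for the real page
snippet = '<div class="song"><span class="song-title">Respect</span></div>'
snippet_doc = BeautifulSoup(snippet, "html.parser")

# this looks for tags literally named <song-title>, and there aren't any
by_tag_name = snippet_doc.find_all("song-title")

# this looks for <span> tags whose class attribute is "song-title"
by_class = snippet_doc.find_all("span", attrs={"class": "song-title"})

print(len(by_tag_name))
print(len(by_class))
```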

Let's format the song titles nicely into a list. 

NB: I'm doing this both so you remember the `.string` attribute and also to make sure I tell you about the `append()` method for adding values to lists, which I forgot to tell you about last class.

In [None]:
song_titles = [] # define a list for future use

for title in title_tags:
    song_titles.append(title.string) # the append method is how you add elements to the end of the list
                                     # and remember the convenient .string attribute!  
    
song_titles
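
One caveat about `.string` that's worth knowing: it only gives you text when the tag contains a single piece of text and nothing else. If a tag has other tags nested inside it, `.string` is `None`, and `.get_text()` is the usual fallback. The sketch below uses made-up markup (the nested `<em>` is an assumption for illustration, not something taken from the NYT page) to show the difference.

```python
from bs4 import BeautifulSoup

# hypothetical markup: the second title has an <em> tag nested inside it
nested_html = """
<span class="song-title">Respect</span>
<span class="song-title">Example <em>Song</em></span>
"""
nested_doc = BeautifulSoup(nested_html, "html.parser")

strings = []  # what .string gives us
texts = []    # what .get_text() gives us

for title in nested_doc.find_all("span", attrs={"class": "song-title"}):
    strings.append(title.string)
    texts.append(title.get_text())

print(strings)
print(texts)
```

So if your list of titles ever has a surprise `None` in it, nested tags are a likely culprit.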


But we've got more work to do:

Since the ultimate goal is to have the song title and artist name associated with each other, we're going to have to put a few things that we've recently learned all together.  

To begin, let's take another look at this HTML:

```
<div class="song order-1">
    <span class="song-title">Respect</span>
    <span class="song-artist">Aretha Franklin</span>
</div>
```

**What is the relationship between the tag with the song title and the tag with the song artist?**

**And what Beautiful Soup method lets us find siblings?**

So now we can get started. 

Since we're looking to associate each song's title with the song's artist, this is a perfect use case for a dictionary. 

But note, since some artists have multiple songs, and dictionaries can't have multiple keys that are the same, we're going to make our dictionary in the format title/artist rather than the other way around. 

In any case, let's start with that:

In [None]:
# make a dict
song_dict = {}  # goal is format {song title: artist}

# then what? 

# 1) find all title tags
## your code here


# 2) then find the sibling of each of the title tags


    # and add it to the dict (and be sure to add only the string)
    
    
song_dict


**Amazing! We did it!**
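
If you want to check your approach away from the live page, here is one possible version run against a small hand-written sample. It's a sketch, not the only right answer: `sample_html` is modeled on the snippet above, and the second song is invented so the loop has more than one thing to do.

```python
from bs4 import BeautifulSoup

# hypothetical sample modeled on the page's structure (second song is made up)
sample_html = """
<div class="song order-1">
    <span class="song-title">Respect</span>
    <span class="song-artist">Aretha Franklin</span>
</div>
<div class="song order-2">
    <span class="song-title">Example Song</span>
    <span class="song-artist">Example Artist</span>
</div>
"""
sample_doc = BeautifulSoup(sample_html, "html.parser")

sample_song_dict = {}

# 1) find all title tags
for title_tag in sample_doc.find_all("span", attrs={"class": "song-title"}):
    # 2) find the artist tag that is the next sibling of this title tag
    artist_tag = title_tag.find_next_sibling("span", attrs={"class": "song-artist"})

    # add only the strings to the dict, in the format title: artist
    sample_song_dict[title_tag.string] = artist_tag.string

print(sample_song_dict)
```

One thing to keep in mind with this design: if two different artists had songs with the same title, the later one would overwrite the earlier one, since dictionary keys have to be unique.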

And one final note:

### When things go wrong with Beautiful Soup

A number of things might go wrong with Beautiful Soup. You might, for example, search for a tag that doesn't exist in the document:

In [None]:
footer_tag = cheese_doc.find("footer")

footer_tag

Beautiful Soup doesn't raise an error if it can't find the tag you want. Instead, `.find()` returns `None`, whose type is `NoneType`:

In [None]:
type(footer_tag)

If you try to call a method on the object that Beautiful Soup returned anyway, you might end up with an error like this:

In [None]:
footer_tag.find("p")

You might also inadvertently try to get an attribute of a tag that wasn't actually found. You'll get a similar error in that case:

In [None]:
footer_tag['title']

Whenever you see something like `TypeError: 'NoneType' object is not subscriptable`, it's a good idea to check to see whether your method calls are indeed finding the thing you were looking for.

Making things slightly more complicated, the `.find_all()` method will return an empty list if it doesn't find any of the tags you wanted, which can also make you unaware that you didn't find what you want:

In [None]:
footer_tags = cheese_doc.find_all("footer")
print(footer_tags)

If you attempt to access one of the elements of this empty list regardless, you get... an `IndexError`!

In [None]:
print(footer_tags[0].string)

As I've mentioned before, all of these errors are helpful pointers to where you might have gone wrong. In general, try to learn about the various types of errors you might encounter since they'll help you debug your code faster.
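
One way to head off both kinds of error is to check what came back before you use it. The sketch below is just one pattern, not the only one; `mini_doc` and the placeholder text are made up for illustration.

```python
from bs4 import BeautifulSoup

# hypothetical mini document with no <footer> tag in it
mini_html = "<h2>Camembert</h2><p>A soft cheese.</p>"
mini_doc = BeautifulSoup(mini_html, "html.parser")

# .find() returns None when nothing matches, so test for that first
footer_tag = mini_doc.find("footer")
if footer_tag is None:
    footer_text = "(no footer found)"
else:
    footer_text = footer_tag.get_text()

# .find_all() returns an empty list when nothing matches,
# and an empty list counts as False in an if statement
footer_tags = mini_doc.find_all("footer")
if footer_tags:
    first_footer = footer_tags[0].string
else:
    first_footer = None

print(footer_text)
print(first_footer)
```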

## And some further reading

* [Chapter 11](https://automatetheboringstuff.com/chapter11/) from Al Sweigart's [Automate the Boring Stuff with Python](https://automatetheboringstuff.com/) is another good take on this material (and discusses a wider range of techniques).
* [The official Beautiful Soup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) provides a systematic walkthrough of the library's functionality. If you find yourself thinking, "it really should be easy to do the thing that I want to do, why isn't it easier?" then check the documentation! Leonard's probably already thought of a way to make it easier and implemented a feature in the code to help you out.
* Beautiful Soup is the best scraping library out there for quick jobs, but if you have a larger site that you need to scrape, you might look into [Scrapy](http://scrapy.org/), which bundles a good parser with a framework for writing web "spiders" (i.e., programs that parse web pages and follow the links found there, in order to make a catalog of an entire web site, not just a single web page).