# Web Scraping on the Actual Web

The [class 4 notebook](class4-web-scraping.ipynb) explained all of the steps involved in web scraping. But what about when it comes time to try it on the actual web? 

We're going to use [this *New York Times* article](https://www.nytimes.com/interactive/2019/08/19/us/politics/presidential-campaign-songs-playlists.html) about candidate playlists as our real-world example. Our task is to scrape it for a list of all the songs that the candidates have played.

**Does anyone remember the first step?**

Hint: It involves the Requests library

In [None]:
# request the contents of the page
import requests 
resp = requests.get("https://www.nytimes.com/interactive/2019/08/19/us/politics/presidential-campaign-songs-playlists.html") 

Once we've gotten the contents of the page, it's a good idea to take a look. 

**How can you print the response from the server as text?** 

In [None]:
# now, see what it looks like
html_str = resp.text

html_str

Whoa! That's a lot more complicated than kittens. Let's go back to Chrome and take a look using View Source.

To get a lay of the (HTML) land, try doing a simple command-f for "Aretha" (as in Franklin; the first artist mentioned at the top of the site). 

**Questions for the class**:
* What is the tag enclosing the first mention of Aretha Franklin? 
* What is its previous sibling? 
* What is the parent to that tag?
* What is the parent to *that* tag?
* What is the parent to *that* tag?
* What is the parent to *that* tag?

Now let's go take a look using Chrome's "Inspect Elements" feature. As I mentioned last class, it's really helpful for locating specific elements in complicated HTML. 

Using "Inspect Elements," can we confirm our answers to the questions above? If so, we're ready to move on to Beautiful Soup. 

Does anyone remember (or want to look it up in the Class 4 notebook):

**How do you create a Beautiful Soup object from the html?**

In [None]:
from bs4 import BeautifulSoup

document = BeautifulSoup(html_str, "html.parser")
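Once a Beautiful Soup object exists, the sibling and parent questions from earlier can also be checked in code rather than by eye. The sketch below is only an illustration on a small hand-written snippet modeled on the song markup, not on the live page, so the real page's parent chain will be longer. The names `sample_html`, `sample_doc`, `artist_tag`, `previous_tag`, and `parent_tag` are made up for this example.

```python
from bs4 import BeautifulSoup

# Hand-written sample (an assumption, not the live page's full structure)
sample_html = """
<div class="song order-1">
    <span class="song-title">Respect</span>
    <span class="song-artist">Aretha Franklin</span>
</div>
"""

sample_doc = BeautifulSoup(sample_html, "html.parser")

# Find the <span> whose text is exactly "Aretha Franklin"
artist_tag = sample_doc.find("span", string="Aretha Franklin")

# find_previous_sibling skips over the whitespace between the two tags
previous_tag = artist_tag.find_previous_sibling("span")

# .parent moves one level up the tree
parent_tag = artist_tag.parent

print(artist_tag.name, artist_tag["class"])
print(previous_tag.string)
print(parent_tag.name, parent_tag["class"])
```

Repeating `.parent` on `parent_tag` would keep climbing the tree, which mirrors the "what is the parent to *that* tag?" questions.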

Now that we've got the Beautiful Soup object, we need to find all of the `song-artist` tags and all of the `song-title` tags. 

Let's start with the `song-title` tags. 

This does not work:

In [None]:
title_tags = document.find_all('song-title')

title_tags

**Why does that not work?**

Remember that the two tags that have Aretha Franklin's song "Respect" look like this:

`<span class="song-title">Respect</span>
<span class="song-artist">Aretha Franklin</span>`

So instead:

* **What tag do we need to look for?**
* **What attribute(s) do we need to look for?**
* **What attribute *values* do we need to look for?**

**How might we modify this code (below) so that we find what we are looking for?**

`title_tags = document.find_all('song-title')`

In [None]:
title_tags = document.find_all("span", attrs={"class": "song-title"})

# let's see what it looks like 
title_tags
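To make the difference concrete, here is a minimal sketch on a hand-written snippet (not the live page). It shows that the first argument to `find_all` is matched against tag *names*, and it compares the `attrs` dictionary with two other ways Beautiful Soup offers to match on a class: the `class_` keyword and a CSS selector passed to `select`. The variable names are made up for this example.

```python
from bs4 import BeautifulSoup

# Hand-written sample (an assumption, not the live page's full structure)
sample_html = """
<div class="song order-1">
    <span class="song-title">Respect</span>
    <span class="song-artist">Aretha Franklin</span>
</div>
"""
sample_doc = BeautifulSoup(sample_html, "html.parser")

# The first argument is a tag NAME, and there is no <song-title> tag
by_name = sample_doc.find_all("song-title")

# Three ways to match <span class="song-title">
by_attrs = sample_doc.find_all("span", attrs={"class": "song-title"})
by_keyword = sample_doc.find_all("span", class_="song-title")
by_selector = sample_doc.select("span.song-title")

print(len(by_name))
print([tag.string for tag in by_attrs])
print([tag.string for tag in by_keyword])
print([tag.string for tag in by_selector])
```

Which of the three to use is mostly a matter of taste; the `attrs` dictionary is the most explicit about what is a tag and what is an attribute.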


For ease of (future) reference, let's format the song titles nicely into a list. 

In [None]:
# make a list
song_titles = [] # this is how you create a new (empty) list

for title in title_tags:
    song_titles.append(title.string) # the append method is how you add elements to the end of the list
    
song_titles
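One thing to watch for: `.string` returns `None` when a tag contains more than one child, for example if a title had a nested tag inside it. The sketch below uses hypothetical markup (the second title and its `<em>` tag are invented, not taken from the article) to show the difference between `.string` and `get_text()`, and it builds the lists with list comprehensions as a shorter alternative to the loop.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second title contains a nested <em> tag
sample_html = """
<span class="song-title">Respect</span>
<span class="song-title">Hypothetical <em>Remix</em> Title</span>
"""
sample_doc = BeautifulSoup(sample_html, "html.parser")
sample_tags = sample_doc.find_all("span", attrs={"class": "song-title"})

# .string is None when a tag has more than one child
with_string = [tag.string for tag in sample_tags]

# get_text() joins all of the text inside the tag, nested tags included
with_get_text = [tag.get_text().strip() for tag in sample_tags]

print(with_string)
print(with_get_text)
```

If the titles on a page are always plain text inside the `span`, the two approaches give the same strings.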


Now, let's turn to the artist names.

Since the ultimate goal is to have the song title and artist name associated with each other, we can build on the work we've already done.

Take another look at this HTML:

```
<div class="song order-1">
    <span class="song-title">Respect</span>
    <span class="song-artist">Aretha Franklin</span>
</div>
```

**What is the relationship between the tag with the song title and the tag with the song artist?**

In [None]:
# sibling!!!



**How can we use our list of title_tags, which we've already created, to find each song's artist?**

Note that since we're looking to associate each song's title with the song's artist, this is a perfect use case for a dictionary. So let's start there:

In [None]:
# make a dict
song_dict = {}  # goal is the format {song title: artist}

for title in title_tags:
    # same thing as above; first get the string associated w/ the tag
    song_title = title.string
    # while we're looking at that title, look for the next sibling
    artist_name = title.find_next_sibling('span')
    # add it to the dict (and be sure to add only the string)
    song_dict[song_title] = artist_name.string
    
song_dict

**Amazing! We did it!**

One bonus question: Why didn't I ask you to make the dictionary in the format {artist: title}?

Some artists have multiple songs, and a dictionary can't hold the same key twice: each new song would overwrite that artist's previous one.
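Here is a small sketch of that overwriting behavior, along with one possible workaround: mapping each artist to a *list* of titles with `collections.defaultdict`. The pairs below are illustrative; only "Respect" comes from the example above, and the other titles and the second artist are placeholders.

```python
from collections import defaultdict

# Illustrative (title, artist) pairs; only the first comes from the example above
pairs = [
    ("Respect", "Aretha Franklin"),
    ("Hypothetical Song A", "Aretha Franklin"),
    ("Hypothetical Song B", "Another Artist"),
]

# A plain {artist: title} dict keeps only the LAST title seen for each artist
overwritten = {}
for title, artist in pairs:
    overwritten[artist] = title

# Mapping each artist to a LIST of titles keeps them all
titles_by_artist = defaultdict(list)
for title, artist in pairs:
    titles_by_artist[artist].append(title)

print(overwritten)
print(dict(titles_by_artist))
```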

**Now, APIs!**