# Getting Lyrics

<div class="alert alert-block alert-warning">
This notebook contains only 4 cells of code. You are of course free to delete all the documentation in between, but note the word I just used, <em>documentation</em>. Delete the current documentation, but be sure to create your own. 
</div>

Let's suppose for a moment you are, like Braxton, interested in building a corpus of lyrics. You need to ask a couple of questions upfront:

- What lyrics am I choosing and why?
- Can I access those lyrics in the time I have available?

While Braxton is interested in artists, you may be interested in lyrics that address certain topics. Given the way most lyrics sites are organized, working by artist is probably easier, but which artists and why? That is, at some point Braxton will probably need to articulate his choices. The conversation, either internal to him or between him and an undefined interlocutor (maybe me), might go something like this: 

**Q: "Why'd you choose these artists?"**  
A: "Because I like them."  
**Q: "Why do you like them?"**  
A: "I think their songs are cool."  
**Q: "What makes their lyrics cool?"**  
A: "I guess I need to figure this out."

The bad news is that *yes, yes you do*. The good news is that you can make that the point of your exploration. It is very important, though, that you have this conversation early on and that you speculate as much as you can in your research notebook. Your speculations might include: "I like the topics they sing about." Okay, what do you think those topics are? "I like the way they rhyme things." Okay, what do you think those rhymes are?

In each and every one of those speculations, you have a basis for beginning your analysis: topic modeling for the first, or grabbing the words at the end of lines and compiling them as pairs for the second.
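To make the second idea concrete, here is a minimal sketch of grabbing line-ending words and pairing each one with the word that ends the next line. The function names `end_words` and `end_word_pairs` are my own, the sample text is made-up placeholder lines rather than real lyrics, and pairing adjacent lines is only one assumption about what "pairs" should mean (it ignores stanza breaks, for instance).

```python
import string


def end_words(lyrics):
    """Return the last word of each non-empty line, lowercased, punctuation stripped."""
    words = []
    for line in lyrics.splitlines():
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines between stanzas
        last = tokens[-1].strip(string.punctuation).lower()
        if last:
            words.append(last)
    return words


def end_word_pairs(lyrics):
    """Pair each line-ending word with the line-ending word that follows it."""
    words = end_words(lyrics)
    return list(zip(words, words[1:]))


# Made-up placeholder lines, not real lyrics
sample = "The cat sat on the mat,\nwearing a funny hat.\n\nThen it ran away"
print(end_word_pairs(sample))  # [('mat', 'hat'), ('hat', 'away')]
```

Whether two end words actually rhyme is a separate question; this only collects the candidates.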

Okay, that's nice. But how do you get started? Well, I copied one of the artists Braxton named, **MF Doom**, and pasted "MF Doom lyrics" into a web search. (I think it was DuckDuckGo, but go with the search engine you like.) The first result was a site called *[Genius](https://genius.com/artists/Mf-doom)*, but the UI was cluttered and my first impression was that there might be something better. The next result was for a site I recognize, [AZlyrics](https://www.azlyrics.com/m/mfdoom.html), and when a list of albums and songs appeared on the first page with links to the songs, like ["Meddle with Metal"](https://www.azlyrics.com/lyrics/mfdoom/meddlewithmetal.html), I thought *this might work*.

So I have two URLs. The first gives me a list of songs:
```
https://www.azlyrics.com/m/mfdoom.html
```
And the second shows me what a song page looks like, both in terms of the URL and in terms of the HTML:
```
https://www.azlyrics.com/lyrics/mfdoom/meddlewithmetal.html
```

If this turns out to be relatively easy, I will need only `urllib` and `BeautifulSoup` to do this. 

Do I want anything else other than the song lyrics? Do I want to associate a song with a particular album? Do I want to include the album year? Given that I have multiple artists, do I want to associate an artist with each song? How am I going to create that metadata and how am I going to store it? 
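One possible answer to the storage question, sketched under my own assumptions: keep one record per song as a dictionary and write the records out as CSV. The field names and the `write_metadata` helper are hypothetical choices, not something the site dictates; the two example records use the album, year, and song details that appear in the page source for this artist. I write to an in-memory buffer here so nothing touches the disk, but an open file handle would work the same way.

```python
import csv
import io

# Hypothetical column layout: adjust to whatever metadata you decide to keep
FIELDS = ["artist", "album", "year", "title", "filename"]

records = [
    {"artist": "MF Doom", "album": "Operation: Doomsday", "year": 1999,
     "title": "The Time We Faced Doom (Skit)",
     "filename": "thetimewefaceddoomskit.html"},
    {"artist": "MF Doom", "album": "Operation: Doomsday", "year": 1999,
     "title": "Doomsday", "filename": "doomsday.html"},
]


def write_metadata(rows, handle):
    """Write one CSV row per song record, with a header row first."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


buffer = io.StringIO()
write_metadata(records, buffer)
print(buffer.getvalue())
```

A CSV like this sits next to the downloaded files and lets you look up the artist, album, and year for any filename later.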

So the first thing I want to do is load the libraries and see if I can establish a connection.

In [1]:
import urllib.request
from bs4 import BeautifulSoup

# My starting web page
myurl = "https://www.azlyrics.com/m/mfdoom.html"

# Open the connection the way urllib expects, then read the raw bytes:
myconnection = urllib.request.urlopen(myurl)
myhtml = myconnection.read()

# Check to see that things work
print(myhtml[0:100])

b'<!DOCTYPE html>\r\n<html lang="en">\r\n<head>\r\n<meta charset="utf-8">\r\n<meta http-equiv="X-UA-Compatible'


With the connection established and HTML being returned, I need to find a way to get what I want from the page. I used my web browser to save the page's source and ran through the HTML until I got to the first song title:

```html
<!-- start of song list -->
<div id="listAlbum">
<div id="45679" class="album">album: <b>"Operation: Doomsday"</b> (1999)<div><img src="/images/albums/456/d08d0c9c2e4eeb01a5166c594d2328fd.jpg" class="album-image" alt="MF Doom - Operation: Doomsday album cover" /></div></div>
<div class="listalbum-item"><a href="/lyrics/mfdoom/thetimewefaceddoomskit.html" target="_blank">The Time We Faced Doom (Skit)</a></div>
```
To some degree, the work of creating metadata is done for me: the song list is broken up by album, and it looks like the DIVs are hierarchical. Let's strip everything else out and just look at the DIVs:
```
<div id="listAlbum">
    <div id="45679" class="album"> ALBUM NAME (DATE)</div>
    <div class="listalbum-item"> SONG </div>
</div>
```
No, it's not hierarchical: the album names (and dates!) simply sit at the top of the list of song DIVs, as siblings rather than parents.
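Because the structure is flat, one way to recover the album for each song is to walk the children of `listAlbum` in order and remember the most recent album DIV seen. The following is only a sketch: `sample_html` is a trimmed-down stand-in for the real page (I am assuming the live page keeps this sibling layout), the variable names are mine, and I use Python's built-in `"html.parser"` here so the example does not depend on `lxml`.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the real page source
sample_html = """
<div id="listAlbum">
<div id="45679" class="album">album: <b>"Operation: Doomsday"</b> (1999)</div>
<div class="listalbum-item"><a href="/lyrics/mfdoom/thetimewefaceddoomskit.html" target="_blank">The Time We Faced Doom (Skit)</a></div>
<div class="listalbum-item"><a href="/lyrics/mfdoom/doomsday.html" target="_blank">Doomsday</a></div>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
container = soup.find(id="listAlbum")

songs = []
current_album = None
# recursive=False looks only at direct children, so nested DIVs are ignored
for div in container.find_all("div", recursive=False):
    classes = div.get("class", [])
    if "album" in classes:
        # The album title sits inside <b>, wrapped in literal quote marks
        current_album = div.find("b").get_text().strip('"')
    elif "listalbum-item" in classes:
        anchor = div.find("a")
        songs.append((current_album, anchor.get_text(), anchor["href"]))

for song in songs:
    print(song)
```

Each tuple now carries the album alongside the song title and link, which is most of the metadata asked about earlier.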

`myhtml` is still in memory from the first cell, so I can parse it directly.

In [2]:
# Parse the divs with the song links
mysoup = BeautifulSoup(myhtml, "lxml")
mydivs = mysoup.find_all(class_="listalbum-item")

# Check work
print(type(mydivs))
for i in mydivs[0:2]:
    print(i)

<class 'bs4.element.ResultSet'>
<div class="listalbum-item"><a href="/lyrics/mfdoom/thetimewefaceddoomskit.html" target="_blank">The Time We Faced Doom (Skit)</a></div>
<div class="listalbum-item"><a href="/lyrics/mfdoom/doomsday.html" target="_blank">Doomsday</a></div>


Okay, I have something called an `element.ResultSet`, which I can treat like a list and iterate over.

Each item in the set is an entire div. I could possibly use some regex, but I would also like to see if there's a way to do this in BeautifulSoup. I think I'll try iterating over the divs and getting the links out:

In [3]:
# Empty list to hold the results
links = []

# Iterate over the list of DIVs
for div in mydivs[0:2]:
    the_link = div.find('a')['href']
    links.append(the_link)

# Check results
print(links)

['/lyrics/mfdoom/thetimewefaceddoomskit.html', '/lyrics/mfdoom/doomsday.html']


Okay, so now we have a list holding the path portion of each URL we want to use.
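For comparison, the regex route I mentioned would look something like the sketch below. It runs on a two-line snippet copied from the output above rather than on the whole page, and the pattern is my own guess at what would be sufficient: it assumes `href` is always the first attribute of the `<a>` tag, which is exactly the kind of assumption that makes regex more brittle than a parser.

```python
import re

# Two DIVs copied from the parsed output, joined into one string
html_snippet = (
    '<div class="listalbum-item"><a href="/lyrics/mfdoom/thetimewefaceddoomskit.html" '
    'target="_blank">The Time We Faced Doom (Skit)</a></div>\n'
    '<div class="listalbum-item"><a href="/lyrics/mfdoom/doomsday.html" '
    'target="_blank">Doomsday</a></div>'
)

# Capture whatever sits between the quotes of an href that directly follows <a
href_pattern = re.compile(r'<a\s+href="([^"]+)"')
regex_links = href_pattern.findall(html_snippet)
print(regex_links)
# ['/lyrics/mfdoom/thetimewefaceddoomskit.html', '/lyrics/mfdoom/doomsday.html']
```

The result matches what BeautifulSoup gave, but only as long as the markup keeps that exact attribute order.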

Let's take a look at the URL we are trying to replicate:
```
https://www.azlyrics.com/lyrics/mfdoom/meddlewithmetal.html
```

It looks like all we need to do is prepend `https://www.azlyrics.com` to build the full URL. (Note that we have a slash already, so we don't need the trailing slash in our base URL.)
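Plain string concatenation works here, but the standard library's `urllib.parse.urljoin` handles the slash question for you, so it is worth a quick sketch. The variable names below are mine; the two paths are the ones collected above.

```python
from urllib.parse import urljoin

base_url = "https://www.azlyrics.com"
paths = [
    "/lyrics/mfdoom/thetimewefaceddoomskit.html",
    "/lyrics/mfdoom/doomsday.html",
]

# urljoin resolves each path against the base, with or without a trailing slash
full_urls = [urljoin(base_url, path) for path in paths]
for url in full_urls:
    print(url)
# https://www.azlyrics.com/lyrics/mfdoom/thetimewefaceddoomskit.html
# https://www.azlyrics.com/lyrics/mfdoom/doomsday.html
```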

### Using Our List of Links to Download Files

If you are comfortable downloading files in your current directory, then you can start doing that. If you want to see where you are, you can use one of the built-in "magic" commands, like `%pwd`, which should return something like:
```
'/Path/to/where/you/are'
```
You can then change to where you want to save the files, `%cd /Path/to/desired/directory`. 
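If you would rather not rely on magic commands (they only work inside IPython or Jupyter), the same checks can be done in plain Python with `pathlib`. This is a small sketch; `lyrics_dir` is a hypothetical helper name of my own, and the folder layout it creates (one folder per artist under a root) is just one reasonable assumption.

```python
from pathlib import Path


def lyrics_dir(root, artist):
    """Create root/artist if it does not exist yet and return it as a Path."""
    target = Path(root) / artist
    target.mkdir(parents=True, exist_ok=True)
    return target


# Same information that %pwd reports
print(Path.cwd())
```

Calling something like `lyrics_dir("corpus", "mfdoom")` would then give you a folder to save into without changing the notebook's working directory at all.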

In [4]:
# Our list of links all start at /lyrics
# So we need to prepend the site URL
baseURL = "https://www.azlyrics.com"

# When we go to save the files, 
# we want to remove this part of the link string:
myfilter = "/lyrics/mfdoom/"

for link in links:
    filename = link.replace(myfilter, '')
    # Context managers close the connection and the file even if an error occurs
    with urllib.request.urlopen(baseURL + link) as remotefile:
        with open(filename, 'wb') as localfile:
            localfile.write(remotefile.read())
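Once you move from two links to the full list, two things start to matter: a single failed request should not stop the whole run, and it is kinder to the site to pause between requests. Here is a minimal sketch of that idea. The names `fetch` and `save_pages`, the `fetcher` parameter, and the five-second default pause are all my own assumptions, not anything the site specifies; passing the fetch function in as a parameter just makes it possible to try the loop out without touching the network.

```python
import time
import urllib.request
from pathlib import Path


def fetch(url):
    """Download one page and return its raw bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def save_pages(links, base_url, dest, fetcher=fetch, pause=5.0):
    """Save each linked page into dest and return the links that failed."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    failed = []
    for link in links:
        filename = link.rsplit("/", 1)[-1]  # e.g. "doomsday.html"
        try:
            data = fetcher(base_url + link)
        except OSError as err:
            # urllib's URLError and HTTPError are both subclasses of OSError
            print(f"Skipping {link}: {err}")
            failed.append(link)
        else:
            (dest / filename).write_bytes(data)
        time.sleep(pause)  # wait before the next request
    return failed
```

A call such as `save_pages(links, "https://www.azlyrics.com", "mfdoom")` would save the pages into an `mfdoom` folder and hand back any links worth retrying.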