# Lil' Scrapy: Building a Rap Lyric Dataset

In [1]:
import re
import requests
from bs4 import BeautifulSoup

### Build List of Lyrics URLs

First let's collect a list of URLs to scrape. I've chosen [ohhla.com](http://www.ohhla.com) since the lyrics are all pretty centrally located, are stored in .txt format, and don't contain much overhead or JavaScript to work past.

In [2]:
# all the lyrics are contained in 5 subsets of the main index page

root_urls = ["http://ohhla.com/all.html", "http://ohhla.com/all_two.html", "http://ohhla.com/all_three.html",
             "http://ohhla.com/all_four.html", "http://ohhla.com/all_five.html"]

Here we iterate through the 5 subsets, grab any URLs that don't have "anonymous" in the name (since these are more complicated to scrape and seem to be associated with lesser-known artists), and then append them to a master list of URLs.

In [3]:
urls = []

for root in root_urls:
    index_page = requests.get(root).content
    soup = BeautifulSoup(index_page, "lxml")
    filtered_soup = soup.find("pre")
    links = filtered_soup.find_all("a")
    for link in links:
        # .get() returns None for anchors without an href, so no bare except is needed
        url = link.get('href')
        if url and "anonymous" not in url:
            urls.append(url)
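
If you want to check the "anonymous" filter without hitting the site, here is a minimal offline sketch. It swaps BeautifulSoup for the standard library's `html.parser`, and the names `HrefCollector` and `artist_pages` are made up for this illustration. It assumes, like the cell above, that the artist links live inside a `<pre>` block.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects href values of <a> tags that sit inside a <pre> block."""

    def __init__(self):
        super().__init__()
        self.in_pre = False
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.in_pre = True
        elif tag == "a" and self.in_pre:
            href = dict(attrs).get("href")
            if href:  # named anchors have no href and are skipped
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == "pre":
            self.in_pre = False


def artist_pages(html):
    """Return hrefs found inside <pre>, minus anything containing 'anonymous'."""
    parser = HrefCollector()
    parser.feed(html)
    return [h for h in parser.hrefs if "anonymous" not in h]


# Hand-written snippet standing in for one of the index pages
snippet = (
    '<a href="outside.html">nav</a>'
    '<pre><a href="YFA_BOB.html">B.o.B</a>'
    '<a name="b"></a>'
    '<a href="anonymous/some_artist/">Some Artist</a></pre>'
)
print(artist_pages(snippet))
```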

In [4]:
urls = list(set(urls))

print("Total URLs found:", len(urls))
print("\nSamples:")
urls[:5]

Total URLs found: 141

Samples:


['YFA_snoopdogg_two.html',
 'YFA_BOB.html',
 'YFA_mop.html',
 'YFA_peterock.html',
 'YFA_nappy.html']
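
The sample URLs above are relative, and `set()` doesn't keep them in any particular order. If you'd rather have absolute URLs in a stable order, something like the sketch below could work. `BASE_URL` and `absolute_unique` are names I'm making up here, and the base address is an assumption taken from the site URL used elsewhere in this notebook.

```python
from urllib.parse import urljoin

# Assumed base URL, based on the site address used elsewhere in this notebook
BASE_URL = "http://www.ohhla.com/"


def absolute_unique(hrefs, base=BASE_URL):
    """Resolve hrefs against base and drop duplicates, keeping first-seen order."""
    resolved = (urljoin(base, href) for href in hrefs)
    # dict keys keep insertion order, so this de-duplicates without shuffling
    return list(dict.fromkeys(resolved))


sample = ["YFA_BOB.html", "YFA_mop.html", "YFA_BOB.html"]
print(absolute_unique(sample))
# ['http://www.ohhla.com/YFA_BOB.html', 'http://www.ohhla.com/YFA_mop.html']
```

`urljoin` also leaves already-absolute links alone, which plain string concatenation would not.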

### Clean the Data

Before we move into actually parsing the lyrics and writing them to file, let's make sure we can do some basic cleaning:

In [5]:
# Sample lyrics

s = """
Artist: Da Brat
Album:  Funkafied
Song:   Da Shit Ya Can't Fuc Wit
Typed by: lyric724@aol.com

I promise the funk the whole funk and nothin' but the funk...Yoooooohoo

Chorus: I be Da B-R-A-T the new lady wit dat shit ya can't fuck wit (say 2x)

Verse 1:

Uh, fool sittin' all fat brat tat tat tat (bitch and its like that)
Well let me lift you to the sky
Just climb aboard the B-R-A-T ride
Those with no love I stay about like GOD
Quick to pull ya trigga nigga quick to pull ya card
And it don't stop and it don't quit
In ninety-fo I be the sho shot shit
And in years to come shit ain't gonna change
Sosodef, you know the name of the game
And those that say they don't, nigga bitch please
Cuz, we be known for makin' dem Geeees
Settin them swole, steady going gold
Whateva we release, whateva we unfold
So now you know in ninety-fo who's the shit
And who's got the shit dat you just can't fuck wit
"""

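Feel free to add other regex/cleaning below. In this example I'm just removing the header and email addresses as a demonstration. I went back later and cleaned out a lot of the `[Chorus]`-type notation that's in the data. The output shown under the cell appears to come from that later pass, which is why labels such as `Chorus:` and `Verse 1:` are missing from it.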

In [6]:
def clean(s):
    """Cleans string of some fluff"""
    # raw string so the backslash escapes reach the regex engine unchanged
    email = re.findall(r"""(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])""", s)
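    # re.DOTALL lets "." span newlines; an inline (?s) in mid-pattern is an error on Python 3.11+
    header = re.findall(r'Artist:(.*)Typed by:', s, flags=re.DOTALL)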

    for h in header:
        s = s.replace("Artist:{}Typed by:".format(h), "")

    for e in email:
        s = s.replace(e, "")
        
    return s


clean(s)

"\n \n\nI promise the funk the whole funk and nothin' but the funk...Yoooooohoo\n\n I be Da B-R-A-T the new lady wit dat shit ya can't fuck wit (say )\n\n\n\nUh, fool sittin' all fat brat tat tat tat (bitch and its like that)\nWell let me lift you to the sky\nJust climb aboard the B-R-A-T ride\nThose with no love I stay about like GOD\nQuick to pull ya trigga nigga quick to pull ya card\nAnd it don't stop and it don't quit\nIn ninety-fo I be the sho shot shit\nAnd in years to come shit ain't gonna change\nSosodef, you know the name of the game\nAnd those that say they don't, nigga bitch please\nCuz, we be known for makin' dem Geeees\nSettin them swole, steady going gold\nWhateva we release, whateva we unfold\nSo now you know in ninety-fo who's the shit\nAnd who's got the shit dat you just can't fuck wit\n"

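The `[Chorus]`-type cleanup mentioned above isn't shown in this notebook, so here is a rough sketch of what such a pass could look like. The three patterns and the name `strip_section_labels` are my own guesses based on the sample lyrics, not the code that was actually used, and real files will likely need more label variants.

```python
import re

# Hypothetical patterns; the label formats are assumptions based on the sample lyrics
BRACKET_TAG = re.compile(r"\[[^\]\n]*\]")  # e.g. [Chorus], [Verse 2]
LINE_LABEL = re.compile(
    r"^(?:Chorus|Verse(?: \d+)?|Hook|Intro|Outro)[ \t]*:[ \t]*",
    flags=re.IGNORECASE | re.MULTILINE,
)  # e.g. "Chorus:" or "Verse 1:" at the start of a line
REPEAT_NOTE = re.compile(r"\b\d+x\b", flags=re.IGNORECASE)  # e.g. 2x


def strip_section_labels(text):
    """Remove bracketed tags, leading section labels, and repeat counts."""
    text = BRACKET_TAG.sub("", text)
    text = LINE_LABEL.sub("", text)
    return REPEAT_NOTE.sub("", text)


demo = "Chorus: I be Da B-R-A-T (say 2x)\n\nVerse 1:\n\nUh, fool"
print(repr(strip_section_labels(demo)))
# 'I be Da B-R-A-T (say )\n\n\n\nUh, fool'
```

The label pattern only eats spaces and tabs after the colon, so a bare `Verse 1:` line leaves its newlines in place instead of gluing the verse onto the previous line.
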
### Get Scrapin'

In [1]:
base_url = "http://www.ohhla.com/"

# open the output file once instead of reopening it for every song
with open("scraped_lyrics.txt", "a", encoding="utf-8") as f:
    for url in urls:
        page = requests.get(base_url + url).content
        soup = BeautifulSoup(page, "lxml")
        table_links = soup.select("td > a")

        for link in table_links:
            try:
                href = link['href']

                if ".txt" in href:
                    lyric_page = requests.get(base_url + href).content
                    lyrics_soup = BeautifulSoup(lyric_page, "lxml")

                    # some pages wrap the lyrics in <pre>; others are plain text
                    pre_tag = lyrics_soup.find("pre")

                    if pre_tag:
                        text = pre_tag.text
                    else:
                        text = str(lyric_page, 'utf-8')

                    f.write(clean(text))

            except Exception as e:
                # report the offending link and keep going
                print(link)
                print(e)
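
One thing to watch in the loop above: `str(lyric_page, 'utf-8')` raises if a file isn't valid UTF-8, and that song then gets skipped by the `except`. I haven't checked how many files this affects, so treat the following as a sketch under that assumption. `decode_lyrics` is a made-up helper, and `cp1252` as the fallback encoding is a guess.

```python
def decode_lyrics(raw, encodings=("utf-8", "cp1252")):
    """Return raw bytes decoded with the first encoding that works."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # last resort: keep the song, replacing undecodable bytes with U+FFFD
    return raw.decode("utf-8", errors="replace")


print(decode_lyrics(b"caf\xc3\xa9"))  # valid UTF-8
print(decode_lyrics(b"caf\xe9"))      # not UTF-8, falls back to cp1252
```

Both calls print `café`. In the scraping loop you could call `decode_lyrics(lyric_page)` in place of the `str(...)` line.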