# INTRO
Ever since coming across [Matt Daniels' Rapper Vocabulary Chart](https://pudding.cool/projects/vocabulary/index.html), I've been interested in where one of my favorite rappers, Buck 65, would place on it. To find out, I'll pull as many of his lyrics as I can using LyricsGenius, then cut them down to 35,000 words in accordance with the original methodology:

```
35,000 words covers 3-5 studio albums and EPs. I included mixtapes if the artist was just short of the 35,000 words. Quite a few rappers don’t have enough official material to be included (e.g., Biggie, Kendrick Lamar). As a benchmark, I included data points for Shakespeare and Herman Melville, using the same approach (35,000 words across several plays for Shakespeare, first 35,000 of Moby Dick).

I used a research methodology called token analysis to determine each artist’s vocabulary. Each word is counted once, so pimps, pimp, pimping, and pimpin are four unique words. To avoid issues with apostrophes (e.g., pimpin’ vs. pimpin), they’re removed from the dataset. It still isn’t perfect. Hip hop is full of slang that is hard to transcribe (e.g., shorty vs. shawty), compound words (e.g., king shit), featured vocalists, and repetitive choruses.
```

With those lyrics, I'll clean the data to remove apostrophes and (possibly) other special characters, then use NLTK to break the lyrics into tokens and count the number of unique words.
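As a rough illustration of the counting rule quoted above (apostrophes removed, every distinct spelling counted once), here is a minimal sketch. It uses only the standard library and a plain whitespace split as a stand-in for NLTK tokenization; the function name `count_unique_words`, its `limit` parameter, and the sample text are made up for this example, and the punctuation handling is simpler than the cleaning done later in the notebook.

```python
import re

def count_unique_words(text, limit=35000):
    """Count distinct tokens among the first `limit` words of `text`."""
    # drop apostrophes (straight and curly) so pimpin' and pimpin match
    text = re.sub(r"['’]", "", text.lower())
    # treat every other non-alphanumeric run as a word separator
    text = re.sub(r"[^a-z0-9]+", " ", text)
    # whitespace split stands in for NLTK tokenization here
    tokens = text.split()[:limit]
    return len(set(tokens))

sample = "Pimpin' ain't easy, pimpin ain’t easy; pimps pimp pimping"
unique = count_unique_words(sample)
print(unique)
```

Because no stemming is applied, `pimps`, `pimp`, `pimping`, and `pimpin` each count separately, which matches the methodology quoted above.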

**NB** The first three cells only need to be run once

In [7]:
import lyricsgenius
import json

# load the Genius API token from a local secrets file
with open('secrets.json') as secrets_file:
    secrets = json.load(secrets_file)

In [36]:
# Get lyrics: this step will take a while
genius = lyricsgenius.Genius(secrets['CLIENT_ACCESS_TOKEN'])
genius.remove_section_headers = True


artist = genius.search_artist("Buck 65")
artist.save_lyrics()

Searching for songs by Buck 65...

Song 1: "The Centaur"
Song 2: "Blood of a Young Wolf"
Song 3: "Wicked And Weird"
Song 4: "Super Pretty Naughty"
Song 5: "Pants on Fire"
Song 6: "Whispers Of The Waves"
Song 7: "Je T’aime Mon Amour"
Song 8: "Love Will Fuck You Up"
Song 9: "Paper Airplane"
Song 10: "Blood, Pt. 2"
Song 11: "Heart of Stone"
Song 12: "Roses In The Rain"
Song 13: "1957"
Song 14: "463"
Song 15: "Secret Splendour"
Song 16: "You Know The Science"
Song 17: "Craftsmanship"
Song 18: "Roses and Bluejays"
Song 19: "Untitled"
Song 20: "Phil"
Song 21: "Cries a Girl"
Song 22: "Square One"
Song 23: "15 Minutes To Live"
Song 24: "Devil’s Eyes"
Song 25: "The Floor"
Song 26: "Gee Whiz"
Song 27: "Indestructible Sam"
Song 28: "Cold Steel Drum"
Song 29: "Secret Splendor"
Song 30: "Sleep Apnea"
Song 31: "50 Gallon Drum"
Song 32: "Zombie Delight"
Song 33: "Danger And Play"
Song 34: "Bachelor of Science"
Song 35: "Out of Focus"
Song 36: "Rough House Blues"
Song 37: "A Case For Us"
Song 38: "Tot

In [8]:
# turn JSON into pandas dataframe
import pandas as pd

with open("Lyrics_Buck65.json") as f:
    buck_json = json.load(f)
songs = pd.DataFrame(buck_json['songs'])

# we only need these four columns, so we drop the rest
needed_cols = ['title', 'release_date', 'album', 'lyrics']
unneeded_cols = [col for col in songs.columns if col not in needed_cols]

songs = songs.drop(unneeded_cols, axis=1)
songs.head()

Unnamed: 0,title,release_date,album,lyrics
0,The Centaur,,"{'api_path': '/albums/1210', 'cover_art_url': ...",Most people are curious\nSome wanna get dirt o...
1,Blood of a Young Wolf,,"{'api_path': '/albums/1203', 'cover_art_url': ...","Ten thousand horses, Sable island, endless sum..."
2,Wicked And Weird,,"{'api_path': '/albums/1207', 'cover_art_url': ...","Driving with a yellow dog, I-95\nHe's got a sm..."
3,Super Pretty Naughty,2014-09-02,"{'api_path': '/albums/417822', 'cover_art_url'...",Fancy time. Naked Saturday. Wild stylin’\nNow ...
4,Pants on Fire,,"{'api_path': '/albums/1201', 'cover_art_url': ...","Sky diver, your pants are on fire, and the res..."


In [19]:
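# get the album name from the JSON 
def find_album(album):
    # This cell raised "TypeError: string indices must be integers" when a value
    # was already a plain string (for example, on a re-run after the column had
    # been mapped), so anything that is not a dict is passed through unchanged
    return album['name'] if isinstance(album, dict) else album
    
albums = songs['album']
songs['album'] = albums.map(find_album, na_action='ignore')

songs.head()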

In [21]:
songs.sort_values(by='album', inplace=True)
songs.head(20)

Unnamed: 0,title,release_date,album,lyrics
81,Who By Fire,,20 Odd Years,And who by fire?\nWho by water?\nWho in the su...
27,Cold Steel Drum,2011-01-01,20 Odd Years,"I lay down for you, in black and blue\nLife he..."
31,Zombie Delight,2011-02-07,20 Odd Years,Zombie Delight Zombie Delight\nZombies are com...
54,She Said Yes,2011-01-01,20 Odd Years,"She wrote back, too alone\nA single pair of sh..."
59,BCC,2011-01-01,20 Odd Years,BCC the ADD\nWe don't have much time you see\n...
62,Tears Of Your Heart,2011-02-07,20 Odd Years,"(French singing)\n\nAge is beauty, bewildered ..."
76,Final Approach,2011-01-01,20 Odd Years,"The sun is always shining bright, at thirty th..."
8,Paper Airplane,2011-01-01,20 Odd Years,Down by the lake you saw me\nAnd you knew I wa...
25,Gee Whiz,,20 Odd Years,"Tell me what is it is, Gee Whiz, I don't think..."
5,Whispers Of The Waves,2011-01-01,20 Odd Years,"I am the deck, you are the sea...*\nI am the l..."


In [5]:
# clean data

# remove remixes and acoustic versions
# (raw string and non-capturing group avoid escape and match-group warnings)
remixes = songs['title'].str.contains(r'(?:[rR]emix\)|\[Acoustic Version\])')
songs = songs[~remixes].copy()

# replace newlines and tabs with spaces
# (regex=True is explicit because newer pandas versions no longer assume it)
songs['lyrics'] = songs['lyrics'].str.replace(r'[\n\t]', ' ', regex=True)
# replace hyphens/dashes with spaces
songs['lyrics'] = songs['lyrics'].str.replace(r'[-–—]', ' ', regex=True)
# remove all other punctuation
songs['lyrics'] = songs['lyrics'].str.replace(r'[^a-zA-Z0-9 ]', '', regex=True)

songs['lyrics'] = songs['lyrics'].str.lower()

# save
songs.to_csv('buck65.tsv', sep="\t", index=False)

# total number of characters (not words) in the cleaned lyrics
songs['lyrics'].str.len().sum()

297778

# RUNDOWN

So far, we have gathered, sorted, and cleaned all of the lyrics from Buck 65's Genius entry. The cleaned lyrics total 297,778 characters across 163 songs (`.str.len()` counts characters, not words). The next step is to narrow that down to the 35,000 words used in the original project. More specifically, we need the chronologically first 35,000 words. To do that, we'll need a list of his albums in release order, which I've taken from [this Wikipedia article](https://en.wikipedia.org/wiki/Buck_65_discography). The most relevant part of that article is pasted below:

```
Studio albums
Buck 65

Game Tight (1994)
Year Zero (1996)
Weirdo Magnet (1996)
Language Arts (1996)
Vertex (1998)
Man Overboard (Anticon, 2001)
Synesthesia (Endemik, 2001)
Square (WEA, 2002)
Talkin' Honky Blues (WEA, 2003)
Secret House Against the World (WEA, 2005)
Situation (Strange Famous, 2007)
20 Odd Years (WEA, 2011)
Laundromat Boogie (2014)
Neverlove (2014)
```

In [2]:
import pandas as pd
import nltk

songs = pd.read_csv('buck65.tsv', sep="\t", keep_default_na=False)

In [11]:
# studio albums in release order, from the Wikipedia discography
albums = [
    'Game Tight',
    'Year Zero',
    'Weirdo Magnet',
    'Language Arts',
    'Vertex',
    'Man Overboard',
    'Synesthesia',
    'Square',
    "Talkin' Honky Blues",
    'Secret House Against the World',
    'Situation',
    '20 Odd Years',
    'Laundromat Boogie',
    'Neverlove'
]

def get_album_lyric_count(album_series):
    # NOTE: .str.len() returns the number of characters in each lyric string,
    # so this total (and the 35000 cutoff below) is in characters, not words
    return album_series['lyrics'].str.len().sum()

total_count = 0
relevant_albums = []
for album in albums:
    album_series = songs[songs['album'] == album]
    lyrics = get_album_lyric_count(album_series)
    # skip albums that have no lyrics in the dataset
    if lyrics == 0:
        continue
    relevant_albums.append(album)
    total_count += lyrics
    # stop once the running total passes the cutoff
    if total_count > 35000:
        break

print(relevant_albums)
print(total_count)



['Weirdo Magnet', 'Language Arts', 'Vertex']
39847
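
The count above is based on `.str.len()`, which measures characters, so the 35,000 cutoff is reached much sooner than it would be in words. A word-based version could look roughly like the sketch below. It is an illustration only: it uses plain lists and dicts as a stand-in for the `songs` DataFrame, and the function name `first_n_words` and the toy data are assumptions made for this example, not part of the notebook.

```python
def first_n_words(songs, albums, limit=35000):
    """Walk albums in the given order and return the albums used plus the first `limit` words."""
    words = []
    used_albums = []
    for album in albums:
        # gather every word from every song on this album
        album_words = [
            word
            for song in songs
            if song["album"] == album
            for word in song["lyrics"].split()
        ]
        # skip albums that have no lyrics in the dataset
        if not album_words:
            continue
        used_albums.append(album)
        words.extend(album_words)
        # stop as soon as enough words have been collected
        if len(words) >= limit:
            break
    # trim the overshoot so exactly `limit` words (or fewer) remain
    return used_albums, words[:limit]

# toy data standing in for the cleaned lyrics
toy_songs = [
    {"album": "B", "lyrics": "four five six seven"},
    {"album": "A", "lyrics": "one two three"},
    {"album": "C", "lyrics": "eight nine"},
]
used, words = first_n_words(toy_songs, ["Missing", "A", "B", "C"], limit=5)
print(used, words)
```

Trimming to exactly `limit` words mirrors the original project's approach of using the first 35,000 words rather than whole albums.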
