# INTRO
Ever since coming across [Matt Daniels' Rapper Vocabulary Chart](https://pudding.cool/projects/vocabulary/index.html), I've wondered where one of my favorite rappers, Buck 65, would place on it. To find out, I'll pull as many of his lyrics as I can with LyricsGenius, aiming for the 35,000 words that the original methodology calls for:

```
35,000 words covers 3-5 studio albums and EPs. I included mixtapes if the artist was just short of the 35,000 words. Quite a few rappers don’t have enough official material to be included (e.g., Biggie, Kendrick Lamar). As a benchmark, I included data points for Shakespeare and Herman Melville, using the same approach (35,000 words across several plays for Shakespeare, first 35,000 of Moby Dick).

I used a research methodology called token analysis to determine each artist’s vocabulary. Each word is counted once, so pimps, pimp, pimping, and pimpin are four unique words. To avoid issues with apostrophes (e.g., pimpin’ vs. pimpin), they’re removed from the dataset. It still isn’t perfect. Hip hop is full of slang that is hard to transcribe (e.g., shorty vs. shawty), compound words (e.g., king shit), featured vocalists, and repetitive choruses.
```

With those lyrics, I'll clean the data by removing apostrophes and (possibly) other special characters, then use NLTK to split the lyrics into tokens and count the unique words.
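
The cleaning and counting steps described above can be sketched roughly as follows. This is only an illustration under a few assumptions: it uses a plain regular expression as a stand-in for an NLTK tokenizer (so it runs without any downloads), it lowercases the text before counting (the quoted methodology does not say whether case matters), and the name `count_unique_words` is hypothetical.

```python
import re

# Straight and curly apostrophes are deleted outright, so "pimpin'" and
# "pimpin" collapse into the same token, as in the quoted methodology.
APOSTROPHES = str.maketrans("", "", "'\u2019\u2018")


def count_unique_words(lyrics, word_limit=35_000):
    """Count distinct tokens among the first `word_limit` words of `lyrics`."""
    # Lowercasing is an assumption; it is not stated in the methodology.
    cleaned = lyrics.translate(APOSTROPHES).lower()
    # Regex tokenization is a swapped-in substitute for an NLTK tokenizer.
    tokens = re.findall(r"\w+", cleaned)
    # No stemming: "pimp", "pimps" and "pimping" stay distinct words.
    return len(set(tokens[:word_limit]))


sample = "Pimpin' ain't easy\nPimpin\u2019 pimps pimp pimping"
print(count_unique_words(sample))
```

Swapping the `re.findall` call for an NLTK tokenizer should leave the rest of the function unchanged, since only the list of tokens matters for the count.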

In [5]:
import json
import lyricsgenius

# Load the Genius API credentials from a local file kept out of version control
with open('secrets.json') as secrets_file:
    secrets = json.load(secrets_file)

{'CLIENT_ID': '<redacted>', 'CLIENT_SECRET': '<redacted>', 'CLIENT_ACCESS_TOKEN': '<redacted>'}
Searching for songs by Buck 65...

Song 1: "15 Minutes To Live"
Song 2: "1957"
Song 3: "463"

Reached user-specified song limit (3).
Done. Found 3 songs.
Buck 65, 3 songs


In [6]:
# Get lyrics: this step will take a while
genius = lyricsgenius.Genius(secrets['CLIENT_ACCESS_TOKEN'])
genius.remove_section_headers = True


artist = genius.search_artist("Buck 65")
artist.save_lyrics()

Searching for songs by Buck 65...

Song 1: "The Centaur"
Song 2: "Blood of a Young Wolf"
Song 3: "Wicked And Weird"
Song 4: "Super Pretty Naughty"
Song 5: "Pants on Fire"
Song 6: "Whispers Of The Waves"
Song 7: "Je T’aime Mon Amour"
Song 8: "Love Will Fuck You Up"
Song 9: "Paper Airplane"
Song 10: "Blood, Pt. 2"
Song 11: "Heart of Stone"
Song 12: "Roses In The Rain"
Song 13: "1957"
Song 14: "463"
Song 15: "Secret Splendour"
Song 16: "You Know The Science"
Song 17: "Craftsmanship"
Song 18: "Roses and Bluejays"
Song 19: "Untitled"
Song 20: "Phil"
Song 21: "Cries a Girl"
Song 22: "Square One"
Song 23: "15 Minutes To Live"
Song 24: "Devil’s Eyes"
Song 25: "The Floor"
Song 26: "Gee Whiz"
Song 27: "Indestructible Sam"
Song 28: "Cold Steel Drum"
Song 29: "Secret Splendor"
Song 30: "Sleep Apnea"
Song 31: "50 Gallon Drum"
Song 32: "Zombie Delight"
Song 33: "Danger And Play"
Song 34: "Bachelor of Science"
Song 35: "Out of Focus"
Song 36: "Rough House Blues"
Song 37: "A Case For Us"
Song 38: "Tot

In [27]:
import json
import pandas as pd

# Load the lyrics file saved in the previous step
with open("Lyrics_Buck65.json") as f:
    buck_json = json.load(f)
songs = pd.DataFrame(buck_json['songs'])

# We only need these three columns, so we keep them and drop the rest
songs = songs[['title', 'release_date', 'lyrics']]
songs.head()

| | title | release_date | lyrics |
|---|---|---|---|
| 0 | The Centaur | | Most people are curious\nSome wanna get dirt o... |
| 1 | Blood of a Young Wolf | | Ten thousand horses, Sable island, endless sum... |
| 2 | Wicked And Weird | | Driving with a yellow dog, I-95\nHe's got a sm... |
| 3 | Super Pretty Naughty | 2014-09-02 | Fancy time. Naked Saturday. Wild stylin’\nNow ... |
| 4 | Pants on Fire | | Sky diver, your pants are on fire, and the res... |
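
To stay within the 35,000-word budget from the methodology, the lyrics have to be cut off once enough words have been collected. Below is a minimal sketch of one way to do that, assuming each song is a dict with a `lyrics` key (as in `buck_json['songs']`) and that songs are simply taken in file order rather than album by album. The helper name `take_words` and the demo data are hypothetical.

```python
def take_words(songs, word_limit=35_000):
    """Collect whitespace-separated words from songs until the limit is hit."""
    words = []
    for song in songs:
        # Some entries may have no lyrics; treat those as empty strings.
        lyrics = song.get("lyrics") or ""
        words.extend(lyrics.split())
        if len(words) >= word_limit:
            break
    # Trim any overshoot from the last song that was added.
    return words[:word_limit]


# Made-up demo data, not real lyrics.
demo_songs = [
    {"title": "A", "lyrics": "one two three"},
    {"title": "B", "lyrics": None},
    {"title": "C", "lyrics": "four five"},
]
print(take_words(demo_songs, word_limit=4))
```

With the real data, something like `take_words(buck_json['songs'])` would return the word list to clean and count. Ordering the songs by `release_date` first would be closer to the album-based approach in the original chart, but most of the release dates shown above are missing.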
