# INTRO
Ever since coming across [Matt Daniels's Rapper Vocabulary Chart](https://pudding.cool/projects/vocabulary/index.html), I've been interested in how one of my favorite rappers, Buck 65, would place on it. To find out, I'll pull as many lyrics as I can with LyricsGenius so that I can reach the 35,000 words required by the original methodology:

```
35,000 words covers 3-5 studio albums and EPs. I included mixtapes if the artist was just short of the 35,000 words. Quite a few rappers don’t have enough official material to be included (e.g., Biggie, Kendrick Lamar). As a benchmark, I included data points for Shakespeare and Herman Melville, using the same approach (35,000 words across several plays for Shakespeare, first 35,000 of Moby Dick).

I used a research methodology called token analysis to determine each artist’s vocabulary. Each word is counted once, so pimps, pimp, pimping, and pimpin are four unique words. To avoid issues with apostrophes (e.g., pimpin’ vs. pimpin), they’re removed from the dataset. It still isn’t perfect. Hip hop is full of slang that is hard to transcribe (e.g., shorty vs. shawty), compound words (e.g., king shit), featured vocalists, and repetitive choruses.
```

With those lyrics, I'll be cleaning the data to remove apostrophes and (possibly) other special characters, and then using NLTK to break the lyrics into tokens and count the number of individual words.
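
As a rough illustration of the counting rule quoted above, here is a minimal sketch. It assumes whitespace splitting as a stand-in for NLTK's tokenizer, and `count_unique_words` is a name I made up for this example, not part of the notebook.

```python
import re

def count_unique_words(text):
    """Count distinct tokens after lowercasing and dropping apostrophes."""
    # Remove straight and curly apostrophes so "pimpin'" and "pimpin" merge.
    text = re.sub(r"['’]", "", text.lower())
    # Assumption: whitespace splitting stands in for nltk.word_tokenize here.
    tokens = text.split()
    return len(set(tokens))

# The four spellings from the quoted methodology, plus an apostrophe
# variant and a capitalized duplicate.
example = "pimps pimp pimping pimpin pimpin' Pimp"
print(count_unique_words(example))  # 4
```

Each spelling counts as its own word, while the apostrophe and capitalized variants collapse into spellings that are already present.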

In [1]:
import lyricsgenius
import json
import pandas as pd

# load API credentials; the context manager closes the file automatically
with open('secrets.json') as secrets_file:
    secrets = json.load(secrets_file)

def make_initial_dataframe(json_file):
    # read the JSON file saved by lyricsgenius
    with open(json_file) as f:
        buck_json = json.load(f)
    songs = pd.DataFrame(buck_json['songs'])

    unneeded_cols = list(songs.columns.values)

    # we only need these four columns, so we drop the rest
    unneeded_cols.remove('lyrics')
    unneeded_cols.remove('title')
    unneeded_cols.remove('release_date')
    unneeded_cols.remove('album')

    songs = songs.drop(unneeded_cols, axis=1)
    return songs

def find_album(album):
    return album['name']

def format_albums(songs):
    albums = songs['album']
    songs['album'] = albums.map(find_album, na_action='ignore')

    return songs

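def clean_lyrics(songs):
    # remove remixes and acoustic versions
    # (raw string and non-capturing group avoid escape and match-group warnings)
    remixes = songs['title'].str.contains(r'(?:[rR]emix\)|\[Acoustic Version\])')
    # copy so the assignments below do not write to a slice of the original
    songs = songs[~remixes].copy()

    # replace newlines and tabs with spaces
    songs['lyrics'] = songs['lyrics'].str.replace(r'[\n\t]', ' ', regex=True)
    # replace hyphens/dashes with spaces
    songs['lyrics'] = songs['lyrics'].str.replace(r'[-–—]', ' ', regex=True)
    # remove all other punctuation (including apostrophes)
    songs['lyrics'] = songs['lyrics'].str.replace(r'[^a-zA-Z0-9 ]', '', regex=True)

    songs['lyrics'] = songs['lyrics'].str.lower()
    return songs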
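
To see what the title filter and the cleaning steps do, here is a small sketch using the standard `re` module instead of pandas. The titles and the lyric line are hypothetical, and `is_alternate_version` and `clean_line` are names invented for this example.

```python
import re

# Same pattern as the title filter, as a raw string with a non-capturing group.
VERSION_PATTERN = re.compile(r"(?:[rR]emix\)|\[Acoustic Version\])")

def is_alternate_version(title):
    """True if the title ends a parenthetical in 'remix)' or is an acoustic version."""
    return VERSION_PATTERN.search(title) is not None

def clean_line(text):
    """Apply the three substitutions in the same order as clean_lyrics."""
    text = re.sub(r"[\n\t]", " ", text)        # newlines and tabs become spaces
    text = re.sub(r"[-–—]", " ", text)         # hyphens and dashes become spaces
    text = re.sub(r"[^a-zA-Z0-9 ]", "", text)  # everything else non-alphanumeric is dropped
    return text.lower()

# Hypothetical titles, not taken from the dataset.
titles = ["Example Song (Remix)", "Example Song [Acoustic Version]", "Remix Culture"]
print([is_alternate_version(t) for t in titles])  # [True, True, False]

print(clean_line("Hip-hop don't\nstop"))  # hip hop dont stop
```

The order matters: because hyphens are turned into spaces before the remaining punctuation is stripped, "hip-hop" becomes two tokens rather than the fused "hiphop".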

In [2]:
# set up genius api access
genius = lyricsgenius.Genius(secrets['CLIENT_ACCESS_TOKEN'])
genius.remove_section_headers = True

In [3]:
# this only needs to be run if Lyrics_Buck65.json doesn't exist
# it will also take a while
# buck = genius.search_artist("Buck 65")
# buck.save_lyrics()

In [4]:
# turn JSON into pandas dataframe
songs = make_initial_dataframe("Lyrics_Buck65.json")

In [5]:
# get the album name from the JSON 
songs = format_albums(songs)

In [6]:
songs.sort_values(by='album', inplace=True)
songs.head(20)

Unnamed: 0,title,release_date,album,lyrics
81,Who By Fire,,20 Odd Years,And who by fire?\nWho by water?\nWho in the su...
27,Cold Steel Drum,2011-01-01,20 Odd Years,"I lay down for you, in black and blue\nLife he..."
31,Zombie Delight,2011-02-07,20 Odd Years,Zombie Delight Zombie Delight\nZombies are com...
54,She Said Yes,2011-01-01,20 Odd Years,"She wrote back, too alone\nA single pair of sh..."
59,BCC,2011-01-01,20 Odd Years,BCC the ADD\nWe don't have much time you see\n...
62,Tears Of Your Heart,2011-02-07,20 Odd Years,"(French singing)\n\nAge is beauty, bewildered ..."
76,Final Approach,2011-01-01,20 Odd Years,"The sun is always shining bright, at thirty th..."
8,Paper Airplane,2011-01-01,20 Odd Years,Down by the lake you saw me\nAnd you knew I wa...
25,Gee Whiz,,20 Odd Years,"Tell me what is it is, Gee Whiz, I don't think..."
5,Whispers Of The Waves,2011-01-01,20 Odd Years,"I am the deck, you are the sea...*\nI am the l..."


In [7]:
# clean data
songs = clean_lyrics(songs)


# save
songs.to_csv('buck65.tsv', sep="\t", index=False)

songs['lyrics'].str.len().sum()







297778
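
One thing to keep in mind about the number above: `str.len()` returns the length of each string in characters, so the sum is a character total rather than a word count. A minimal sketch of the difference, using two made-up cleaned lyrics:

```python
# Hypothetical cleaned lyrics standing in for the 'lyrics' column.
cleaned = ["zombie delight zombie delight", "who by fire"]

# What str.len().sum() measures: total characters, spaces included.
total_characters = sum(len(lyric) for lyric in cleaned)

# A rough word count, assuming words are separated by whitespace.
total_words = sum(len(lyric.split()) for lyric in cleaned)

print(total_characters)  # 40
print(total_words)       # 7
```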

# Rundown

So far, we have gathered, sorted, and cleaned all of the lyrics from Buck 65's Genius entry. The total above, 297,778, is a character count (`str.len()` measures string length, not words), spread across 163 songs. The first step, then, is to narrow that down to the 35,000 words used in the original project. More specifically, we need the (chronologically) first 35,000 words. To do that, we'll need a list of his albums, which I've taken from [this Wikipedia article](https://en.wikipedia.org/wiki/Buck_65_discography). The most relevant part of that article is pasted below:

```
Studio albums
Buck 65

Game Tight (1994)
Year Zero (1996)
Weirdo Magnet (1996)
Language Arts (1996)
Vertex (1998)
Man Overboard (Anticon, 2001)
Synesthesia (Endemik, 2001)
Square (WEA, 2002)
Talkin' Honky Blues (WEA, 2003)
Secret House Against the World (WEA, 2005)
Situation (Strange Famous, 2007)
20 Odd Years (WEA, 2011)
Laundromat Boogie (2014)
Neverlove (2014)
```

In [8]:
import nltk
nltk.download('punkt')

songs = pd.read_csv('buck65.tsv', sep="\t", keep_default_na=False)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [9]:
# sort albums by album release
albums = [
    'Game Tight',
    'Year Zero',
    'Weirdo Magnet',
    'Language Arts',
    'Vertex',
    'Man Overboard',
    'Synesthesia',
    'Square',
    'Talkin\' Honky Blues',
    'Secret House Against the World',
    'Situation',
    '20 Odd Years',
    'Laundromat Boogie',
    'Neverlove'
]

songs['ordered_album'] = pd.Categorical(
    songs['album'], 
    categories=albums, 
    ordered=True
)

songs = songs.sort_values(by='ordered_album')

songs.head()

Unnamed: 0,title,release_date,album,lyrics,ordered_album
150,Untitled,,Weirdo Magnet,yes the actual name of the song is untitled i...,Weirdo Magnet
38,Pubic’s Tube,,Language Arts,stinkin rich is x rated guess whos squirtin co...,Language Arts
37,Bush Pilot,,Language Arts,we should take a break from the computers and ...,Language Arts
36,Totem Pole,1997-01-01,Language Arts,we dont need to watch tv tonight this is for u...,Language Arts
35,Seventeen,,Language Arts,this ones goin out to all of those that dont k...,Language Arts
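
The ordering idea can also be sketched in plain Python. In the cell above, `pd.Categorical` turns any album that is not in the list into a missing value, and `sort_values` places missing values last by default. The rows below are hypothetical, and `album_rank` is a helper name made up for this sketch.

```python
# A subset of the release-ordered album list from the cell above.
release_order = ["Weirdo Magnet", "Language Arts", "Vertex"]
rank = {name: position for position, name in enumerate(release_order)}

# Hypothetical rows; the empty album mimics a song with no album on Genius.
songs = [
    {"title": "song a", "album": "Vertex"},
    {"title": "song b", "album": ""},
    {"title": "song c", "album": "Weirdo Magnet"},
    {"title": "song d", "album": "Language Arts"},
    {"title": "song e", "album": "Weirdo Magnet"},
]

def album_rank(song):
    # Unknown or missing albums sort after every listed album.
    return rank.get(song["album"], len(release_order))

# sorted() is stable, so songs on the same album keep their original order.
ordered = sorted(songs, key=album_rank)
print([s["title"] for s in ordered])
# ['song c', 'song e', 'song d', 'song a', 'song b']
```

This matters for the "first 35,000 words" cut: songs without a recognized album end up at the tail of the token list, not at the start.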


In [13]:
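def get_unique_lyrics(tokens):
    # number of distinct tokens
    return len(set(tokens))

def tokenize_lyrics(songs):
    lyrics = songs['lyrics']
    # NOTE: str.cat() joins songs with no separator, so the last word of one
    # song fuses with the first word of the next; str.cat(sep=' ') avoids
    # this but would shift the counts recorded below
    lyric_string = lyrics.str.cat()
    return nltk.word_tokenize(lyric_string)

lyric_tokens = tokenize_lyrics(songs)
# get unique words in the first 35,000 tokens
# NOTE: this slice keeps 34,999 tokens; [:35000] would keep exactly 35,000
# and could change the count below by at most one
limited_tokens = lyric_tokens[:34999]
get_unique_lyrics(limited_tokens)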



6351

# First Conclusion

The above cell gives us Buck 65's vocabulary according to Daniels's first-35,000-words methodology: 6,351 unique words. This puts him in a close 5th place behind GZA's 6,390 unique words, but comfortably ahead of Wu-Tang Clan's 6,196 unique words. While this is a good result, I want to see how sensitive it is to changes in the sample.


In [14]:
# all lyrics
print(get_unique_lyrics(lyric_tokens))

8457


In [15]:
# last words
last_tokens = lyric_tokens[-35000:]
print(get_unique_lyrics(last_tokens))

6579


In [16]:
# random samplings
from random import sample
from statistics import mean

# draw ten random samples of 35,000 tokens each (without replacement)
results = [
    get_unique_lyrics(sample(lyric_tokens, 35000))
    for _ in range(10)
]

print(sorted(results))
print(mean(results))

[6493, 6505, 6521, 6524, 6525, 6526, 6546, 6546, 6554, 6568]
6530.8
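
Because `sample` draws a fresh random subset on every run, the numbers above will not reproduce exactly. One option is a seeded generator, sketched below on a toy token list. The seed value and the `sample_unique_counts` helper are my own assumptions for illustration, not part of the original analysis.

```python
from random import Random
from statistics import mean

def sample_unique_counts(tokens, sample_size, runs, seed):
    """Unique-token counts for `runs` random samples drawn without replacement."""
    rng = Random(seed)  # a private, seeded generator makes the draws repeatable
    return [len(set(rng.sample(tokens, sample_size))) for _ in range(runs)]

# Hypothetical tokens standing in for lyric_tokens.
tokens = ("the quick brown fox jumps over the lazy dog " * 50).split()

counts = sample_unique_counts(tokens, sample_size=100, runs=10, seed=65)
print(sorted(counts))
print(mean(counts))
```

Note that these samples pull tokens from all over the corpus rather than from a contiguous run of songs, so they answer a slightly different question than the first-35,000 and last-35,000 slices.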


# Second Conclusion

Using his whole corpus (297,778 characters of cleaned lyrics), we find 8,457 unique words. His last 35,000 words contain 6,579 unique words, which suggests his vocabulary grew over time, although any songs whose album is not in the list above sort to the end and so fall into this slice as well. Finally, a series of random samplings of 35,000 words averages just over 6,500; in the run above, the counts range from 6,493 to 6,568.

While this is another good result, I suspect these numbers will all go up noticeably if I include two albums he recorded as part of a collaboration with DJ Greetings from Tuskan.

In [29]:
# bike = genius.search_artist("Bike for Three!")
# bike.save_lyrics()

Searching for songs by Bike for Three...

Changing artist name to 'Bike For Three!'
Song 1: "Lazarus Phenomenon"
Song 2: "Always I Will Miss You. Always You."
Song 3: "There Is Only One Of Us"
Song 4: "All There Is to Say About Love"
Song 5: "Sublimation"
Song 6: "No Idea How"
Song 7: "You Can Be Everything"
Song 8: "Nightdriving"
Song 9: "Heart as Hell"
Song 10: "Wolf Sister"
Song 11: "Let’s Never Meet"
Song 12: "Can Feel Love (anymore)"
Song 13: "Ethereal Love"
Song 14: "More Heart Than Brains"
Song 15: "Full Moon"
Song 16: "The Departure"
Song 17: "MC Space"
Song 18: "Agony"
Song 19: "One More Time Forever"
Song 20: "Successful With Heavy Losses"
Song 21: "The Muse Inside Me"
Song 22: "First Embrace"
Song 23: "The Last Romance"
Song 24: "Ending"
Song 25: "Intro"
Song 26: "Beginning"
Done. Found 26 songs.
Wrote Lyrics_BikeForThree.json.
