## How are the songs of the singers with the largest and smallest vocabularies evaluated by our Lexical Diversity approach?

In [4]:
import nltk
nltk.download('punkt') 
from PyLyrics import *
from nltk import word_tokenize, WordPunctTokenizer

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/mohamedkhanafer/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### 1. Introduction
I came across an article that compares the overall vocabulary musicians used across their careers. I wanted to compare songs by the top 3 artists on that list, and by the last 2 artists, with the 3 songs we saw in class.
At the top of the list is (unsurprisingly) Eminem, who used around 8,800 words in his 100 lengthiest songs. Next come Jay Z (6,900 words) and Tupac Shakur (6,600 words). At the bottom of the list are the Spice Girls and Bruno Mars (around 1,500 words).
I tried to choose songs that seemed more lexically diverse. As expected, the results showed a large difference in diversity between the two groups, and also when compared with the 3 songs we saw in class.
(The article can be found here: https://lab.musixmatch.com/largest_vocabulary/)

In [26]:
# Using the script we saw in class
wpt = WordPunctTokenizer()
def lexical_diversity(text):
    # Ratio of distinct tokens to total tokens (raises ZeroDivisionError on an empty list)
    return len(set(text)) / len(text)

def read_and_split(lyrics):
    return lyrics.split()

def read_and_tokenize(lyrics):
    return word_tokenize(lyrics)

def read_and_wpt(lyrics):
    return wpt.tokenize(lyrics)

def calculate_diversity(author, song, pt=False, ll=20):
    # pt: if True, also print the first ll tokens produced by each tokenizer
    song_lyrics = PyLyrics.getLyrics(author,song)
    txt_split = read_and_split(song_lyrics)
    txt_tok = read_and_tokenize(song_lyrics)
    txt_wpt = read_and_wpt(song_lyrics)
    print("\n--- LD of %s ---" % song)
    print("sp=%f, tk=%f, wpt=%f" % (lexical_diversity(txt_split), lexical_diversity(txt_tok), lexical_diversity(txt_wpt)))
    print("--------- // ----------")
    if pt:
        print(txt_split[:ll])
        print(txt_tok[:ll])
        print(txt_wpt[:ll])

# Note:
# sp: using Python's str.split
# tk: using NLTK's word_tokenize
# wpt: using NLTK's WordPunctTokenizer
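
To make the gap between the `sp` and `wpt` scores easier to read, here is a minimal sketch of the same ratio on a made-up lyric line. It assumes the regular expression `\w+|[^\w\s]+` as a stand-in for `WordPunctTokenizer`, so that it runs without NLTK. The sample line and the names `WORDPUNCT`, `split_tokens` and `wpt_tokens` are hypothetical and not part of the class script.

```python
import re

# Assumption: this regex stands in for NLTK's WordPunctTokenizer
# (runs of word characters, or runs of punctuation).
WORDPUNCT = re.compile(r"\w+|[^\w\s]+")

def lexical_diversity(tokens):
    # Distinct tokens divided by total tokens; 0.0 for an empty list.
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Hypothetical lyric line, used only for illustration.
sample = "Hey, hey, hey! Don't stop, don't stop."

split_tokens = sample.split()            # punctuation stays glued to the words
wpt_tokens = WORDPUNCT.findall(sample)   # punctuation becomes separate tokens

print("sp :", len(split_tokens), "tokens, LD =", lexical_diversity(split_tokens))
print("wpt:", len(wpt_tokens), "tokens, LD =", lexical_diversity(wpt_tokens))
```

With whitespace splitting, "stop," and "stop." count as different tokens, which inflates the score. Splitting off punctuation adds many repeated tokens such as "," and "'", which lowers it. Neither variant lowercases the text, so "Hey" and "hey" still count as distinct.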

### 2. The 3 examples seen in class

In [11]:
print("\n-- Lexical Diversity --\n")
calculate_diversity('The Beatles','All You Need Is Love', pt = False)
calculate_diversity('Queen','Bohemian Rhapsody', pt = False)
calculate_diversity('Luis Fonsi','Despacito', pt = False)


-- Lexical Diversity --


--- LD of All You Need Is Love ---
sp=0.167059, tk=0.120594, wpt=0.120135
--------- // ----------

--- LD of Bohemian Rhapsody ---
sp=0.526596, tk=0.373695, wpt=0.362903
--------- // ----------

--- LD of Despacito ---
sp=0.444211, tk=0.394636, wpt=0.394636
--------- // ----------


### 3. Three songs by each of the top 3 artists on the list

In [9]:
print("3 songs by Eminem:")
calculate_diversity('Eminem','Rap God', pt = False)
calculate_diversity('Eminem','Headlights', pt = False)
calculate_diversity('Eminem','Sing for the Moment', pt = False)

print("")
print("")

print("3 songs by Jay-Z:")

calculate_diversity('Jay-Z','Dirt Off Your Shoulder', pt = False)
calculate_diversity('Jay-Z','Empire State of Mind', pt = False)
calculate_diversity('Jay-Z','Run This Town', pt = False)

print("")
print("")

print("3 songs by Tupac:")
calculate_diversity('Tupac','Changes', pt = False)
calculate_diversity('Tupac','Ghetto Gospel', pt = False)
calculate_diversity('Tupac','UNTIL THE END OF TIME', pt = False)

3 songs by Eminem:

--- LD of Rap God ---
sp=0.459530, tk=0.382319, wpt=0.360384
--------- // ----------

--- LD of Headlights ---
sp=0.457739, tk=0.385447, wpt=0.367041
--------- // ----------

--- LD of Sing for the Moment ---
sp=0.473373, tk=0.385930, wpt=0.375121
--------- // ----------


3 songs by Jay-Z:

--- LD of Dirt Off Your Shoulder ---
sp=0.388554, tk=0.304130, wpt=0.301075
--------- // ----------

--- LD of Empire State of Mind ---
sp=0.504478, tk=0.430628, wpt=0.408537
--------- // ----------

--- LD of Run This Town ---
sp=0.531915, tk=0.424899, wpt=0.404822
--------- // ----------


3 songs by Tupac:

--- LD of Changes ---
sp=0.455247, tk=0.379592, wpt=0.360051
--------- // ----------

--- LD of Ghetto Gospel ---
sp=0.518367, tk=0.454874, wpt=0.441536
--------- // ----------

--- LD of UNTIL THE END OF TIME ---
sp=0.490427, tk=0.421466, wpt=0.409149
--------- // ----------


### 4. Three songs by each of the last 2 artists on the list

In [27]:
print("3 songs by the Spice Girls")
calculate_diversity('Spice Girls','Who Do You Think You Are', pt = False)
calculate_diversity('Spice Girls','Wannabe', pt = False)
calculate_diversity('Spice Girls','Spice Up Your Life', pt = False)

print("")
print("")

print("3 songs by Bruno Mars:")

calculate_diversity('Bruno Mars','LOCKED OUT OF HEAVEN', pt = False)
calculate_diversity('Bruno Mars','MARRY YOU', pt = False)
calculate_diversity('Bruno Mars','GRENADE', pt = False)


3 songs by the Spice Girls

--- LD of Who Do You Think You Are ---
sp=0.172291, tk=0.112549, wpt=0.119946
--------- // ----------

--- LD of Wannabe ---
sp=0.285156, tk=0.195051, wpt=0.206325
--------- // ----------

--- LD of Spice Up Your Life ---
sp=0.227488, tk=0.183365, wpt=0.178439
--------- // ----------


3 songs by Bruno Mars:

--- LD of LOCKED OUT OF HEAVEN ---
sp=0.297386, tk=0.240695, wpt=0.233831
--------- // ----------

--- LD of MARRY YOU ---
sp=0.339450, tk=0.270619, wpt=0.248792
--------- // ----------

--- LD of GRENADE ---
sp=0.345882, tk=0.250929, wpt=0.244248
--------- // ----------
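
One rough way to summarize the comparison is to average the `wpt` scores per group. The sketch below copies the `wpt` values from the outputs above. The group labels and the use of a plain mean are my own assumptions, and with so few songs per group this is only an illustration.

```python
from statistics import mean

# wpt scores copied from the printed outputs above; group labels are illustrative.
wpt_scores = {
    "class examples": [0.120135, 0.362903, 0.394636],
    "top 3 artists": [
        0.360384, 0.367041, 0.375121,   # Eminem
        0.301075, 0.408537, 0.404822,   # Jay-Z
        0.360051, 0.441536, 0.409149,   # Tupac
    ],
    "last 2 artists": [
        0.119946, 0.206325, 0.178439,   # Spice Girls
        0.233831, 0.248792, 0.244248,   # Bruno Mars
    ],
}

# Average wpt lexical diversity per group.
group_means = {group: mean(values) for group, values in wpt_scores.items()}

for group, value in group_means.items():
    print(f"{group}: mean wpt LD = {value:.3f}")
```

Because the ratio depends on song length and on which songs were picked, these averages describe this particular selection rather than the artists' vocabularies in general.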
