Nitya Kashyap
CIS 9 Final Project
6/27/2023

This project studies how well NLP can classify text. I've trained an ML model to pick the correct artist, out of 3, given the lyrics of one of their songs. I chose this topic because I wanted to see whether an artist's songs share a noticeable theme, genre, or set of recurring topics that an NLP model can pick up on, and how accurately the model can match songs to the correct artist.

Here's the dataset from Kaggle that I obtained my data from: https://www.kaggle.com/datasets/deepshah16/song-lyrics-dataset

For the first part of my analysis, I have intentionally chosen 3 very distinct artists. The rapper Drake produces songs in the hip hop and trap genres. Singer-songwriter Taylor Swift's domain lies primarily in the pop and country genres. Singer-songwriter Beyonce's songs fall mostly under the R&B/soul and pop genres. Because each artist produces a different style of music, I hypothesize that the NLP model will be able to tell their songs apart with high accuracy.

In [1]:
import numpy as np
import pandas as pd
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

drake = pd.read_csv("Drake.csv")
beyonce = pd.read_csv("Beyonce.csv")
taylor = pd.read_csv("TaylorSwift.csv")

print("Here's a sample of the data for the 3 artists:")
display(drake.head())
display(beyonce.head())
display(taylor.head())

Here's a sample of the data for the 3 artists:


Unnamed: 0.1,Unnamed: 0,Artist,Title,Album,Year,Date,Lyric
0,0,Drake,God’s Plan,Scorpion,2018.0,2018-01-19,and they wishin' and wishin' and wishin' and w...
1,1,Drake,In My Feelings,Scorpion,2018.0,2018-06-29,drake trap trapmoneybenny this shit got me in ...
2,2,Drake,Hotline Bling,Views,2015.0,2015-07-25,you used to call me on my you used to you used...
3,3,Drake,One Dance,Views,2016.0,2016-04-05,kyla baby i like your style drake grips on y...
4,4,Drake,"Hold On, We’re Going Home",Nothing Was the Same,2013.0,2013-08-07,produced by nineteen85 majid jordan noah 40 s...


Unnamed: 0.1,Unnamed: 0,Artist,Title,Album,Year,Date,Lyric
0,0,Beyoncé,Drunk in Love,BEYONCÉ,2013.0,2013-12-17,beyoncé i've been drinkin' i've been drinkin' ...
1,1,Beyoncé,Formation,Lemonade,2016.0,2016-02-06,messy mya what happened at the new wil'ins bit...
2,2,Beyoncé,Partition,BEYONCÉ,2013.0,2013-12-13,part yoncé let me hear you say hey ms carte...
3,3,Beyoncé,Mine,BEYONCÉ,2013.0,2013-12-13,beyoncé i've been watching for the signs took ...
4,4,Beyoncé,Hold Up,Lemonade,2016.0,2016-04-23,hold up they don't love you like i love you sl...


Unnamed: 0.1,Unnamed: 0,Artist,Title,Album,Year,Date,Lyric
0,0,Taylor Swift,​cardigan,folklore,2020.0,2020-07-24,vintage tee brand new phone high heels on cobb...
1,1,Taylor Swift,​exile,folklore,2020.0,2020-07-24,justin vernon i can see you standing honey wit...
2,2,Taylor Swift,Lover,Lover,2019.0,2019-08-16,we could leave the christmas lights up 'til ja...
3,3,Taylor Swift,​the 1,folklore,2020.0,2020-07-24,i'm doing good i'm on some new shit been sayin...
4,4,Taylor Swift,Look What You Made Me Do,reputation,2017.0,2017-08-25,i don't like your little games don't like your...


Now I'll clean the data by removing irrelevant columns and NaNs. Since the lyrics and artist columns are the only two I need to train the model, I'll drop every other column. Then I'll drop any rows that have a NaN in the artist or lyrics column.
When I was browsing this data on Kaggle, I noticed that for some of the songs whose lyrics were missing, the lyrics column was filled with "lyrics for this song have yet to be released please check back once the song has been released". These rows are essentially NaNs. The cell below does not filter on that phrase directly; because the placeholder text is identical in every such row, removing duplicate lyrics collapses those rows to at most one per artist. Some lyrics entries were just plain null, so those are dropped with dropna. I also noticed that in some of the files the same lyrics had been recorded in multiple entries, so I'll use drop_duplicates, a function we have not used or learned in this class, to drop any duplicates that exist.

In [2]:
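artists = [drake, taylor, beyonce]
for artist in artists:
    # keep only the Artist and Lyric columns
    artist.drop(columns=["Unnamed: 0", "Title", "Album", "Year", "Date"], inplace=True)
    # keep the first copy of each lyric (identical placeholder rows collapse to one)
    artist.drop_duplicates(subset="Lyric", keep='first', inplace=True)
    # drop rows with a missing artist or lyric
    artist.dropna(inplace=True)
    display(artist.head())
    print("rows, columns left:", artist.shape)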

Unnamed: 0,Artist,Lyric
0,Drake,and they wishin' and wishin' and wishin' and w...
1,Drake,drake trap trapmoneybenny this shit got me in ...
2,Drake,you used to call me on my you used to you used...
3,Drake,kyla baby i like your style drake grips on y...
4,Drake,produced by nineteen85 majid jordan noah 40 s...


rows, columns left: (461, 2)


Unnamed: 0,Artist,Lyric
0,Taylor Swift,vintage tee brand new phone high heels on cobb...
1,Taylor Swift,justin vernon i can see you standing honey wit...
2,Taylor Swift,we could leave the christmas lights up 'til ja...
3,Taylor Swift,i'm doing good i'm on some new shit been sayin...
4,Taylor Swift,i don't like your little games don't like your...


rows, columns left: (445, 2)


Unnamed: 0,Artist,Lyric
0,Beyoncé,beyoncé i've been drinkin' i've been drinkin' ...
1,Beyoncé,messy mya what happened at the new wil'ins bit...
2,Beyoncé,part yoncé let me hear you say hey ms carte...
3,Beyoncé,beyoncé i've been watching for the signs took ...
4,Beyoncé,hold up they don't love you like i love you sl...


rows, columns left: (380, 2)
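
The placeholder rows described earlier could also be removed explicitly instead of relying on duplicate removal. Here's a minimal sketch of that idea. The toy DataFrame, the `PLACEHOLDER` constant, and the `clean_lyrics` helper are hypothetical names made up for this illustration; they are not part of the dataset or of the cell above.

```python
import pandas as pd

# Start of the placeholder text seen in the Kaggle data for unreleased songs
PLACEHOLDER = "lyrics for this song have yet to be released"

# Toy stand-in for one artist's file (hypothetical rows, not from the dataset)
songs = pd.DataFrame({
    "Artist": ["A", "A", "A", "A"],
    "Lyric": [
        "first verse of a real song",
        "lyrics for this song have yet to be released please check back once the song has been released",
        None,
        "first verse of a real song",
    ],
})

def clean_lyrics(df):
    """Drop null lyrics, placeholder lyrics, and duplicate lyrics."""
    df = df.dropna(subset=["Lyric"])
    # plain substring match, so regex=False
    is_placeholder = df["Lyric"].str.contains(PLACEHOLDER, regex=False)
    df = df[~is_placeholder]
    return df.drop_duplicates(subset="Lyric", keep="first").reset_index(drop=True)

cleaned = clean_lyrics(songs)
print(cleaned)
```

Applying a filter like this to the real files could lower the row counts reported above by at most one per artist, since duplicate removal already leaves no more than one placeholder row in each file.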


After all that cleaning, Beyonce is left with noticeably less data than the other two. Drake has about 20% more songs than Beyonce does (461 vs. 380), so the data can't be considered balanced between them. Before I combine all 3 artists' data, I'm going to downsample Taylor Swift's and Drake's data to the size of Beyonce's to balance the classes; this is also something new that we haven't covered in class.

In [3]:
smallest = min([drake.shape[0], beyonce.shape[0], taylor.shape[0]])
drake = drake.sample(n=smallest, ignore_index=True)
taylor = taylor.sample(n=smallest, ignore_index=True)
print("drake:", drake.shape, "taylor:", taylor.shape)

drake: (380, 2) taylor: (380, 2)
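
The downsampling above draws a different random subset every time the notebook runs. Here's a small sketch of the same idea with a fixed seed so the subset is repeatable. The `downsample_to_smallest` helper, the seed value 42, and the toy frames are all assumptions made up for this example.

```python
import pandas as pd

def downsample_to_smallest(frames, seed=42):
    """Randomly shrink every DataFrame to the size of the smallest one.

    frames: dict mapping a name to a DataFrame
    seed:   arbitrary value passed to random_state so the sample is repeatable
    """
    smallest = min(len(frame) for frame in frames.values())
    return {
        name: frame.sample(n=smallest, random_state=seed, ignore_index=True)
        for name, frame in frames.items()
    }

# Toy stand-ins for the artists' DataFrames (hypothetical data)
frames = {
    "a": pd.DataFrame({"Lyric": [f"a{i}" for i in range(5)]}),
    "b": pd.DataFrame({"Lyric": [f"b{i}" for i in range(3)]}),
}
balanced = downsample_to_smallest(frames)
print({name: frame.shape for name, frame in balanced.items()})
```

With a fixed seed, rerunning the notebook would train and test on the same songs, which makes the results easier to compare between runs.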


Now that everything is balanced (all artists have 380 songs), I'll be combining all the DataFrames into one. Since it's easier for the ML model to work with numbers, I'll convert the artist categories into numbers.

In [4]:
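# create an aggregate DataFrame made up of the 3 artists' data
df = pd.concat([drake, beyonce, taylor], ignore_index=True)
names = df.Artist.unique()  # in order of first appearance: Drake, Beyoncé, Taylor Swift
lookup = dict(zip(range(len(names)), names))   # number -> artist
convert = dict(zip(names, range(len(names))))  # artist -> number
# assign the result back; a chained in-place replace on df.Artist is not guaranteed to update df in newer pandas
df["Artist"] = df["Artist"].map(convert)
print("combined data:")
df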

combined data:


Unnamed: 0,Artist,Lyric
0,0,it's 0 o'clock on a wednesday and i know your ...
1,0,tell me lies scene melissa mcintyre as ash...
2,0,drake partynextdoor it's your fuckin' birthda...
3,0,drake yeah yeah drake dropped outta school n...
4,0,drake listen girl you so bad and you single wh...
...,...,...
1135,2,i remember the eyes of the kid in the crowd wh...
1136,2,last christmas i gave you my heart but the ver...
1137,2,i was reminiscing just the other day while hav...
1138,2,watch me go into the world today watch me try ...


The combined DataFrame has 1,140 rows (380 per artist), so it seems the model has sufficient data!

Next, I'll create the X and y datasets and print the shape of each:

In [5]:
X = pd.DataFrame(df.Lyric)
y = df.Artist
print("shape of X:", X.shape, "shape of y:", y.shape)

shape of X: (1140, 1) shape of y: (1140,)


I'll take a look at one sample song's lyrics to see whether they need any preprocessing for the model.

In [7]:
print(X.iloc[1138, 0]) # this is one of Taylor Swift's!

watch me go into the world today watch me try to blow the past away watch me laugh and watch me cry watch me fall and watch me fly cause im living in a brand new world and you think im just another girl but im living just a day at a time and keeping it mine someday im gonna fly   watch me catch a star and pin it down watch me live another time around watch me doubt the things i love watch me find what im proud of cause im living in a brand new world and you think im just another girl but im living just a day at a time and keeping it mine someday im gonna fly   and i may break my wings and fall flat on my shattered heart and i may hit the ground but i will never fall apart cause im living in a brand new world and you think im just another girl but im living just a day at a time and keeping it mine im living in a brand new world and you think im just another girl but im living just a day at a time and keeping it mine someday im gonna fly


It looks like I'll need to remove stop words and apply stemming. Words like "the" still exist in the lyrics, and verbs like "living" and "keeping" have not been stemmed, indicating that the data has not yet been processed.

In [8]:
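tokenizer = RegexpTokenizer(r'\w+')  # raw string so \w is not read as a string escape
stop_words = set(stopwords.words("english"))  # needs the NLTK stopwords corpus to be downloaded
stemmer = PorterStemmer()

def preprocess(string):
    s = tokenizer.tokenize(string.lower())  # split into lowercase word tokens
    s = [word for word in s if word not in stop_words]  # remove stop words
    s = [stemmer.stem(word) for word in s]  # reduce each word to its stem
    return ' '.join(s)

X_processed = pd.Series([preprocess(X.loc[i, "Lyric"]) for i in range(len(X))])
X_processed.head()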

0    0 clock wednesday know home drivin street hopi...
1    tell lie scene melissa mcintyr ashley kerwin s...
2    drake partynextdoor fuckin birthday oh birthda...
3    drake yeah yeah drake drop outta school dumb r...
4    drake listen girl bad singl hear met yet name ...
dtype: object

Now I'll convert the preprocessed data to numbers so it's ready for the ML model.

In [9]:
vect = CountVectorizer()
vect.fit(X_processed)
X_vectors = vect.transform(X_processed)
print("shape of the X dataset that will be used with the model:", X_vectors.shape)

shape of the X dataset that will be used with the model: (1140, 11253)
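
To make the shape above more concrete: each row is a song and each column is one distinct word, with the cell holding how many times that word appears in that song. Here's a simplified, standard-library illustration of that counting idea on two made-up processed lyrics. It is only a sketch of the concept, not how CountVectorizer is implemented; for example, CountVectorizer's default tokenizer also ignores one-character tokens, which this sketch does not replicate.

```python
from collections import Counter

# Two made-up, already preprocessed "lyrics"
docs = ["watch fli fli", "drake yeah yeah drake"]

# Vocabulary: one column per distinct token
vocab = sorted({token for doc in docs for token in doc.split()})

def to_counts(doc):
    """Return the word counts of one document, in vocabulary order."""
    counts = Counter(doc.split())
    return [counts[token] for token in vocab]

matrix = [to_counts(doc) for doc in docs]
print(vocab)   # column names
print(matrix)  # one row of counts per document
```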


Now I'll train and test the model. I'll need to create X and y training and testing datasets.

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X_vectors, y, test_size = 0.3)
print("X training dataset size:", X_train.shape, "X testing dataset size:", X_test.shape, "y training dataset size:", y_train.shape, "y testing dataset size:", y_test.shape)

X training dataset size: (798, 11253) X testing dataset size: (342, 11253) y training dataset size: (798,) y testing dataset size: (342,)
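
The split above is random and unseeded, so the test set changes on every run and does not have to contain the same number of songs from each artist (the rows of the first confusion matrix further down add up to 121, 98 and 123 test songs). Here's a sketch of a seeded, stratified split on synthetic stand-in data. The `X_demo` and `y_demo` arrays and the seed value 0 are assumptions for illustration, not the real lyric vectors.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 380 "songs" for each of 3 artist codes
y_demo = np.repeat([0, 1, 2], 380)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo,
    y_demo,
    test_size=0.3,
    stratify=y_demo,  # keep the artist proportions the same in both sets
    random_state=0,   # arbitrary seed so the split is repeatable
)

test_counts = np.bincount(y_te)
print("test songs per artist:", test_counts)
```

Stratifying would give every artist the same number of test songs, so the per-artist results would rest on equally sized samples.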


Training and testing the ML model:

In [11]:
classifier = MultinomialNB()
classifier.fit(X_train, y_train)
y_predict = classifier.predict(X_test)

# measure accuracy
print("accuracy score:", metrics.accuracy_score(y_test, y_predict))
print("confusion matrix:")
metrics.confusion_matrix(y_test, y_predict)

accuracy score: 0.8830409356725146
confusion matrix:


array([[116,   3,   2],
       [  9,  70,  19],
       [  2,   5, 116]])

The confusion matrix (rows are the true artist and columns the predicted artist, in the order Drake, Beyonce, Taylor Swift) indicates that the model is somewhat unreliable at telling Beyonce's songs apart from Taylor Swift's: it classified almost 20% of Beyonce's songs (19 of 98) as Swift's. This might be because both Swift and Beyonce dabble in pop, Swift more so than Beyonce, and there seems to be enough of a pop vibe in Beyonce's songs for the model to confuse her with Swift for a significant proportion of her songs. Interestingly, this confusion doesn't go the other way: the model rarely classifies Swift's songs as Beyonce's (5 of 123).
Since Swift's and Drake's songs are so vastly different, the model rarely classifies Swift's songs as Drake's (or vice versa). The overall accuracy score paints a more optimistic picture than the confusion matrix. Even though the model had an overall accuracy of 88%, it correctly classified only 71% of Beyonce's songs. I think its high performance on Drake's and Swift's songs (96% and 94% classified correctly, respectively) pulled the accuracy score up and made the model seem like it did very well, which it did, but not uniformly across all the categories.
One factor that could have confused the model is that some of the songs in an artist's csv file are collaborations with other artists. The style of the featured artist is usually different from the main artist's. In this dataset, only the main artist is listed for the entire song, even though the whole song cannot be attributed to just them.
Despite featuring other artists, the songs of each of these 3 artists are similar enough to one another for the model to detect fairly accurately that they're from the same artist (more so for Drake and Taylor Swift than for Beyonce).
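
The per-artist percentages quoted above come from dividing each diagonal entry of the confusion matrix by its row total. Here's a short sketch of that arithmetic, with the matrix copied from the output above; the label order follows the order in which the artists were concatenated.

```python
labels = ["Drake", "Beyoncé", "Taylor Swift"]

# Rows are the true artist, columns the predicted artist
confusion = [
    [116, 3, 2],
    [9, 70, 19],
    [2, 5, 116],
]

# Share of each artist's test songs that were classified correctly
per_artist = {
    name: row[i] / sum(row)
    for i, (name, row) in enumerate(zip(labels, confusion))
}

# Overall accuracy: correct predictions (the diagonal) over all test songs
correct = sum(confusion[i][i] for i in range(len(labels)))
total = sum(sum(row) for row in confusion)
accuracy = correct / total

for name, share in per_artist.items():
    print(f"{name}: {share:.1%}")
print(f"overall: {accuracy:.1%}")
```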

I would like to take it a step further in testing the limits of the NLP models we've talked about, and see how similar two artists' songs can get before the model can no longer tell them apart. I've done one iteration of the project with 3 very distinct artists; now I'll do a second iteration that includes two artists who are really similar in terms of genre, and see how the model's results compare with the first iteration.
I'll swap out Taylor Swift for Rihanna. Rihanna's songs mostly fall under the reggae, R&B, and pop genres, which overlap heavily with Beyonce's main genres. I expect that the model will not differentiate Beyonce and Rihanna as well as it differentiates Drake and Beyonce or Drake and Rihanna. I would like to see just how confused the model gets between Beyonce and Rihanna: how much worse will it perform when the artists overlap so much in genre? Do Beyonce and Rihanna have unique enough styles in their lyrics for the model to be able to tell them apart?

Read in Rihanna's data and perform the same cleaning steps:

In [12]:
rihanna = pd.read_csv("Rihanna.csv")
print("Here's a sample of Rihanna's data:")
display(rihanna.head())

Here's a sample of Rihanna's data:


Unnamed: 0.1,Unnamed: 0,Artist,Title,Album,Year,Date,Lyric
0,0,Rihanna,Work,ANTI,2016.0,2016-01-27,rihanna work work work work work work he said ...
1,1,Rihanna,Love on the Brain,ANTI,2016.0,2016-01-28,and you got me like oh what you want from me w...
2,2,Rihanna,Needed Me,ANTI,2016.0,2016-01-28,yg mustard on the beat ho i was good on my o...
3,3,Rihanna,Stay,Unapologetic,2013.0,2013-01-07,rihanna all along it was a fever a cold sweat ...
4,4,Rihanna,Kiss It Better,ANTI,2016.0,2016-01-28,kiss it kiss it better baby kiss it kiss it be...


In [13]:
rihanna.drop(columns=["Unnamed: 0", "Title", "Album", "Year", "Date"], inplace=True)
rihanna.drop_duplicates(subset="Lyric", keep='first', inplace=True)
rihanna.dropna(inplace=True)
display(rihanna.head())
print("rows, columns left:", rihanna.shape)

Unnamed: 0,Artist,Lyric
0,Rihanna,rihanna work work work work work work he said ...
1,Rihanna,and you got me like oh what you want from me w...
2,Rihanna,yg mustard on the beat ho i was good on my o...
3,Rihanna,rihanna all along it was a fever a cold sweat ...
4,Rihanna,kiss it kiss it better baby kiss it kiss it be...


rows, columns left: (360, 2)


After all that cleaning, Rihanna is left with a little less data than the other two (Drake and Beyonce each have 380 songs from the last analysis). So before I combine all 3 artists' data, I'm going to downsample Beyonce's and Drake's data to the size of Rihanna's to balance the data.

In [35]:
artists = [drake, beyonce, rihanna]
smallest = min([drake.shape[0], beyonce.shape[0], rihanna.shape[0]])
drake = drake.sample(n=smallest, ignore_index=True)
beyonce = beyonce.sample(n=smallest, ignore_index=True)

In [36]:
print("drake:", drake.shape, "beyonce:", beyonce.shape)

drake: (360, 2) beyonce: (360, 2)


Now that everything is balanced, I'll be combining all the DataFrames into one. Since it's easier for the ML model to work with numbers, I'll convert the artist categories into numbers.

In [39]:
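# create an aggregate DataFrame made up of the 3 artists' data
df = pd.concat([drake, beyonce, rihanna], ignore_index=True)
names = df.Artist.unique()  # in order of first appearance: Drake, Beyoncé, Rihanna
lookup = dict(zip(range(len(names)), names))   # number -> artist
convert = dict(zip(names, range(len(names))))  # artist -> number
# assign the result back; a chained in-place replace on df.Artist is not guaranteed to update df in newer pandas
df["Artist"] = df["Artist"].map(convert)
print("combined data:")
df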

combined data:


Unnamed: 0,Artist,Lyric
0,0,drake yeah 9th wonder don't judge me man they ...
1,0,sampha don't think about it too much too much ...
2,0,uh uh yeah lube up get ready boys little crust...
3,0,produced by boida frank dukes noah 40 shebib ...
4,0,yeah never thought i'd be talkin' from this p...
...,...,...
1075,2,and you can see my heart said im terrified pul...
1076,2,and you can see my heart beating you can see i...
1077,2,take a breath take it deep calm yourself he sa...
1078,2,rihanna feels so good being bad ohohohohoh the...


The combined DataFrame has 1,080 rows (360 per artist), so it seems the model has sufficient data again!

Next, I'll create the X and y datasets and print the shape of each:

In [40]:
X = pd.DataFrame(df.Lyric)
y = df.Artist
print("shape of X:", X.shape, "shape of y:", y.shape)

shape of X: (1080, 1) shape of y: (1080,)


Process the data:

In [41]:
X_processed = pd.Series([preprocess(X.loc[i, "Lyric"]) for i in range(len(X))])
X_processed.head()

0    drake yeah 9th wonder judg man tend say us rap...
1    sampha think much much much much need us rush ...
2    uh uh yeah lube get readi boy littl crusti go ...
3    produc boida frank duke noah 40 shebib ninetee...
4    yeah never thought talkin perspect realli sure...
dtype: object

In [42]:
vect.fit(X_processed)
X_vectors = vect.transform(X_processed)
print("shape of the X dataset that will be used with the model:", X_vectors.shape)

shape of the X dataset that will be used with the model: (1080, 10509)


Now I'll train and test the model. I'll need to create X and y training and testing datasets.

In [43]:
X_train, X_test, y_train, y_test = train_test_split(X_vectors, y, test_size = 0.3)
print("X training dataset size:", X_train.shape, "X testing dataset size:", X_test.shape, "y training dataset size:", y_train.shape, "y testing dataset size:", y_test.shape)

X training dataset size: (756, 10509) X testing dataset size: (324, 10509) y training dataset size: (756,) y testing dataset size: (324,)


Training and testing the ML model:

In [53]:
classifier.fit(X_train, y_train)
y_predict = classifier.predict(X_test)

# measure accuracy
print("accuracy score:", metrics.accuracy_score(y_test, y_predict))
print("confusion matrix:")
metrics.confusion_matrix(y_test, y_predict)

accuracy score: 0.8333333333333334
confusion matrix:


array([[102,   6,   3],
       [ 13,  78,   9],
       [  9,  14,  90]])

The model does have a lower accuracy score on this iteration (83% versus 88%). The confusion matrix (rows in the order Drake, Beyonce, Rihanna) indicates that the model did really well with Drake's songs (92% classified correctly) and noticeably worse on both Beyonce's and Rihanna's songs (78% and 80% respectively). What's puzzling is that Beyonce's songs are more often confused with Drake's (13 songs) than with Rihanna's (9 songs). Rihanna's songs, however, are more often confused with Beyonce's (14 songs) than with Drake's (9 songs). It seems that since Beyonce's and Rihanna's songs are mostly in the same genres, the model is less accurate in classifying their songs, while Drake's songs, which sit in a largely different genre, remain easy for it to identify. It's worth noting that Beyonce's 78% is actually higher than the 71% she got in the first iteration, and that each iteration uses a single random train/test split, so differences of a few percentage points should be read with caution.
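
To get a rough feel for how much weight the drop in accuracy between the two iterations can carry, here's a sketch that puts a simple binomial standard error on each accuracy score. It assumes every test song is an independent trial, which is only an approximation, and the `accuracy_with_se` helper is a name made up for this example.

```python
import math

def accuracy_with_se(correct, total):
    """Return (accuracy, standard error) under a simple binomial model."""
    p = correct / total
    return p, math.sqrt(p * (1 - p) / total)

# Correct predictions are the diagonal sums of the two confusion matrices
acc1, se1 = accuracy_with_se(116 + 70 + 116, 342)  # Drake / Beyonce / Taylor Swift
acc2, se2 = accuracy_with_se(102 + 78 + 90, 324)   # Drake / Beyonce / Rihanna

diff = acc1 - acc2
se_diff = math.sqrt(se1 ** 2 + se2 ** 2)  # treats the two test sets as independent
ratio = diff / se_diff

print(f"iteration 1: {acc1:.3f} +/- {se1:.3f}")
print(f"iteration 2: {acc2:.3f} +/- {se2:.3f}")
print(f"difference: {diff:.3f}, about {ratio:.1f} standard errors")
```

A gap of fewer than about two standard errors is suggestive rather than conclusive, so repeating both iterations over several random splits would give a firmer comparison.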
It's also worth noting the limits of this project. Artists express their unique style through more than just their lyrics. Because the model is only given the lyrics in this project, it may not distinguish the artists as well as it could if it were given the actual audio of the songs. Training a model on the audio of these songs is a potential future improvement to this project.