# Song Genre Classifier (via Lyric Analysis)

**By Josiah Nielsen**

**For: Collecting and Analyzing Big Data**

This notebook contains code to analyze a dataset of song lyrics and genres. I will attempt to classify the genre of each song using only the song lyrics, through the use of various algorithms and feature extraction.  

First, I preprocessed the data by removing punctuation and stop words from the lyrics. The lyrics were then tokenized and stemmed.

Then, I performed classification (Naive Bayes, SVM, XGBoost, Random Forests, etc.), applying Count and TF-IDF vectorization before classification.

Finally, I performed feature extraction using parts of speech and embeddings (Word2Vec), followed by classification modelling using these features. These models performed poorly, and only my Gradient Boosting Classifier with Word2Vec has been retained to demonstrate the use of other methods.

The XGBoost classifier with TF-IDF vectorization performed the best of all models, with 63% accuracy. The SVM classifier performed almost as well, with an accuracy of 62%.

In [None]:
#Import dependencies
import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None

import nltk
nltk.download('stopwords')
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

import json
import re
import string

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)

from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer 
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

from xgboost import XGBClassifier

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


# Data Preprocessing

In [None]:
#Load Data
df = pd.read_csv("SongLyrics.csv")

In [None]:
#Remove songs with too few lyrics
df['word_count'] = df['lyrics'].str.split( ).str.len()
df = df[df['word_count'] > 50]

In [None]:
df.genre.value_counts()

Country    12339
Pop        12070
Rock       11875
Metal      11346
Hip-Hop    10651
Name: genre, dtype: int64

In [None]:
#Create copy of unprocessed dataframe (.copy() is needed; plain assignment only creates an alias)
df_unprocessed = df.copy()

In [None]:
#Define function for preprocessing the lyrics
def preprocessText(text, remove_stops=True):
    
    # Remove everything between hard brackets
    text = re.sub(pattern=r"\[.+?\]( )?", repl='', string=text)

    # Change "walkin'" to "walking", for example
    text = re.sub(pattern=r"n' ", repl='ng ', string=text)

    # Remove x4 and (x4), for example
    text = re.sub(pattern=r"(\()?x\d+(\))?", repl=' ', string=text)

    # Fix apostrophe issues
    text = re.sub(pattern=r"\x91", repl="'", string=text)
    text = re.sub(pattern=r"\x92", repl="'", string=text)
    # Case-insensitive, since this runs before the text is lowercased
    text = re.sub(pattern=r"<u\+0092>", repl="'", string=text, flags=re.IGNORECASE)
    
    # Make lowercase
    text = text.lower()
    
    # Special cases/words (straight and curly apostrophes)
    text = re.sub(pattern="'til", repl="til", string=text)
    text = re.sub(pattern="’til", repl="til", string=text)
    text = re.sub(pattern="gon'", repl="gon", string=text)

    # Remove \n from beginning
    text = re.sub(pattern='^\n', repl='', string=text)

    # Strip , ! ? : and remaining \n from lyrics
    text = re.sub(pattern="[,!?:]", repl='', string=text)
    text = text.replace('\n', ' ')

    # Remove contractions
    # specific
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"won\’t", "will not", text)
    text = re.sub(r"can't", "can not", text)
    text = re.sub(r"can\’t", "can not", text)
    text = re.sub(r"let's", "let us", text)
    text = re.sub(r"let\’s", "let us", text)
    text = re.sub(r"ain't", "aint", text)
    text = re.sub(r"ain\’t", "aint", text)

    # general
    text = re.sub(r"n\'t", " not", text)
    text = re.sub(r"\'re", " are", text)
    text = re.sub(r"\'s", " is", text)
    text = re.sub(r"\'d", " would", text)
    text = re.sub(r"\'ll", " will", text)
    text = re.sub(r"\'t", " not", text)
    text = re.sub(r"\'ve", " have", text)
    text = re.sub(r"\'m", " am", text)
    text = re.sub(r"n\’t", " not", text)
    text = re.sub(r"\’re", " are", text)
    text = re.sub(r"\’s", " is", text)
    text = re.sub(r"\’d", " would", text)
    text = re.sub(r"\’ll", " will", text)
    text = re.sub(r"\’t", " not", text)
    text = re.sub(r"\’ve", " have", text)
    text = re.sub(r"\’m", " am", text)

    # Remove remaining punctuation (currently disabled)
    # punc = string.punctuation
    # text = ''.join([char for char in text if char not in punc])

    # Remove stopwords (a set makes the membership test fast)
    if remove_stops:
        stops = set(stopwords.words('english'))
        text = ' '.join([word for word in text.split(' ') if word not in stops])
    
    # Remove double spaces and beginning/trailing whitespace
    text = re.sub(pattern='( ){2,}', repl=' ', string=text)
    text = text.strip()
    
    return(text)
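
The contraction rules in `preprocessText` are order-sensitive: the specific patterns have to run before the general `n't` rule, otherwise "won't" would become "wo not". A minimal sketch of that idea on a reduced rule set (the helper name `expand_contractions` and the shortened rule lists are illustrative, not part of the notebook):

```python
import re

# Reduced, illustrative rule lists; the notebook's function has more rules.
SPECIFIC = [(r"won't", "will not"), (r"can't", "can not"), (r"let's", "let us")]
GENERAL = [(r"n't", " not"), (r"'re", " are"), (r"'ll", " will"), (r"'m", " am")]

def expand_contractions(text, specific_first=True):
    """Apply the substitution rules in order (hypothetical helper)."""
    rules = SPECIFIC + GENERAL if specific_first else GENERAL + SPECIFIC
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text

line = "i won't stop, we're sure you can't"
good = expand_contractions(line)                        # specific rules first
bad = expand_contractions(line, specific_first=False)   # general rules first
print(good)
print(bad)
```

Running the general rules first leaves fragments such as "wo" and "ca" in the text, which would then become spurious tokens in the vocabulary.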

In [None]:
#Apply preprocessing function to data
df['clean_lyrics'] = df.apply(lambda x: preprocessText(x['lyrics']), axis=1)

In [None]:
#Save processed dataframe to .csv
df.to_csv("lyrics_clean.csv")

In [None]:
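#Stem the lyrics; stemming() is assumed to wrap the SnowballStemmer imported above
stemmer = SnowballStemmer('english')
def stemming(text): return ' '.join(stemmer.stem(word) for word in text.split())
df['stemmed_lyrics'] = df.apply(lambda x: stemming(x['lyrics']), axis=1)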

Tokenization was not needed for the models below, as the CountVectorizer and TfidfVectorizer both tokenize the text on their own. Stemming was still performed as it improves model fit. 
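
To make the vectorizers' built-in tokenization concrete, the sketch below mimics what `ngram_range=(1,2)` extracts from one lyric: lowercase word tokens of two or more characters (scikit-learn's default token pattern), then all unigrams and bigrams. It is a dependency-free approximation for illustration, not scikit-learn's implementation, and `extract_ngrams` is a hypothetical name.

```python
import re

def extract_ngrams(text, ngram_range=(1, 2)):
    """Return word n-grams roughly as CountVectorizer/TfidfVectorizer would."""
    # Default scikit-learn token pattern: runs of 2+ word characters.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams

features = extract_ngrams("I walk the line")
print(features)
```

Single-character tokens such as "I" are dropped by the default pattern, so each lyric contributes its remaining words plus every adjacent word pair as features.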

# Classification Models (with CountVectorizer and TfidfVectorizer)

**Support Vector Machine (SVM)**

I trained an SVM model using SGDClassifier and TfidfVectorizer. The SVM model performed the second best of all models attempted, with slightly lower accuracy than the XGBoost model. 
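
The cells below use `train` and `test` frames whose creation is not shown. The test support in the reports (17,485 of 58,281 songs, with each genre at almost exactly 30%) is consistent with a stratified 70/30 holdout; the exact call and seed are assumptions. A dependency-free sketch of what such a split does (`stratified_split` is a hypothetical helper):

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.3, seed=123):
    """Return (train_idx, test_idx) with about test_size of each label held out."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    rng = random.Random(seed)  # seed value is an assumption
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_test = round(test_size * len(indices))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels standing in for df.genre
genres = ["Rock"] * 10 + ["Pop"] * 20
train_idx, test_idx = stratified_split(genres)
print(len(train_idx), len(test_idx))
```

With scikit-learn, the equivalent would be along the lines of `train, test = train_test_split(df, test_size=0.3, stratify=df.genre, random_state=123)`, where the `random_state` value is again an assumption.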

In [None]:
#SVM model with k-fold CV
text_svm = Pipeline([('vect', TfidfVectorizer(ngram_range=(1,2))),
                     ('svm', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-4,
                                           max_iter=25, random_state=123))])
text_svm = text_svm.fit(train.stemmed_lyrics, train.genre)
cross_val_score(estimator=text_svm, X=train.stemmed_lyrics, y=train.genre, cv=7).mean()

0.6219237180115698

In [None]:
#SVM test results
print(text_svm.score(y=test.genre, X=test.stemmed_lyrics))
preds_svm = text_svm.predict(test.stemmed_lyrics)
print(classification_report(y_pred=preds_svm, y_true=test.genre))
pd.crosstab(preds_svm, test.genre)

0.6215041464112097
              precision    recall  f1-score   support

     Country       0.58      0.86      0.69      3702
     Hip-Hop       0.73      0.86      0.79      3195
       Metal       0.64      0.84      0.73      3404
         Pop       0.56      0.42      0.48      3621
        Rock       0.54      0.16      0.25      3563

    accuracy                           0.62     17485
   macro avg       0.61      0.63      0.59     17485
weighted avg       0.61      0.62      0.58     17485



predicted \ actual genre,Country,Hip-Hop,Metal,Pop,Rock
Country,3179,105,134,941,1151
Hip-Hop,67,2746,151,467,313
Metal,177,92,2855,415,920
Pop,197,219,165,1521,613
Rock,82,33,99,277,566


The SVM model frequently misclassified Rock and Pop songs as Country (1,151 and 941 songs, respectively), similarly to the Naive Bayes model. Lyrics in these genres tend to be similar, so this is not too surprising.
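
The crosstab is built as `pd.crosstab(preds_svm, test.genre)`, so rows are predicted genres and columns are actual genres. As a check on how to read it, the sketch below recomputes overall accuracy and the Rock precision and recall from the counts printed above, using plain Python (the variable names are illustrative):

```python
genres = ["Country", "Hip-Hop", "Metal", "Pop", "Rock"]

# Rows: predicted genre; columns: actual genre (counts copied from the SVM crosstab).
svm_counts = [
    [3179, 105, 134, 941, 1151],
    [67, 2746, 151, 467, 313],
    [177, 92, 2855, 415, 920],
    [197, 219, 165, 1521, 613],
    [82, 33, 99, 277, 566],
]

total = sum(sum(row) for row in svm_counts)
correct = sum(svm_counts[i][i] for i in range(len(genres)))
accuracy = correct / total

rock = genres.index("Rock")
predicted_rock = sum(svm_counts[rock])                # row sum
actual_rock = sum(row[rock] for row in svm_counts)    # column sum
rock_precision = svm_counts[rock][rock] / predicted_rock
rock_recall = svm_counts[rock][rock] / actual_rock

print(round(accuracy, 4), round(rock_precision, 2), round(rock_recall, 2))
```

The low Rock recall comes from the Rock column: most actual Rock songs land in the Country (1,151) and Metal (920) rows rather than on the diagonal.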

In [None]:
#Which songs did the SVM model misclassify?
test['preds_svm'] = preds_svm
misclasses_svm = test[test.preds_svm != test.genre]
misclasses_svm['misclass_combo'] = misclasses_svm.apply(lambda x: x['genre']+'-'+x['preds_svm'], axis=1)

In [None]:
#Number of misclassifications per (actual genre)-(predicted genre) pair
misclasses_svm.misclass_combo.value_counts()

Rock-Country       1151
Pop-Country         941
Rock-Metal          920
Rock-Pop            613
Pop-Hip-Hop         467
Pop-Metal           415
Rock-Hip-Hop        313
Pop-Rock            277
Hip-Hop-Pop         219
Country-Pop         197
Country-Metal       177
Metal-Pop           165
Metal-Hip-Hop       151
Metal-Country       134
Hip-Hop-Country     105
Metal-Rock           99
Hip-Hop-Metal        92
Country-Rock         82
Country-Hip-Hop      67
Hip-Hop-Rock         33
Name: misclass_combo, dtype: int64

**XGBoost**

Now I fit an XGBoost classifier model, in conjunction with the TfidfVectorizer. This model performed the best of all the models/parameterizations I attempted, with an accuracy of 63.3%.

In [None]:
#Vectorize the lyrics for the XGB model
vect = TfidfVectorizer(ngram_range=(1,2))
vect_train = vect.fit_transform(train.stemmed_lyrics)
vect_test = vect.transform(test.stemmed_lyrics)

In [None]:
#Define XGBoost model
xgb = XGBClassifier(learning_rate=0.25, subsample=0.8, gamma=1, random_state=123, max_depth=4, max_delta_step=1).fit(vect_train, train.genre)

In [None]:
#Look at results of XGBoost model
print(xgb.score(y=test.genre, X=vect_test))
preds_xgb = xgb.predict(vect_test)
print(classification_report(y_pred=preds_xgb, y_true=test.genre))
pd.crosstab(preds_xgb, test.genre)

0.6327137546468401
              precision    recall  f1-score   support

     Country       0.64      0.73      0.68      3702
     Hip-Hop       0.88      0.79      0.83      3195
       Metal       0.70      0.73      0.72      3404
         Pop       0.54      0.51      0.52      3621
        Rock       0.43      0.42      0.43      3563

    accuracy                           0.63     17485
   macro avg       0.64      0.64      0.64     17485
weighted avg       0.63      0.63      0.63     17485



predicted \ actual genre,Country,Hip-Hop,Metal,Pop,Rock
Country,2699,70,116,547,753
Hip-Hop,16,2527,65,203,60
Metal,129,85,2495,235,599
Pop,371,344,220,1839,648
Rock,487,169,508,797,1503


The XGBoost model performed the best of all the models I attempted. It classified songs in the Hip-Hop (f1=0.83) and Metal (f1=0.72) genres the most accurately, while poorly classifying songs in the Rock genre (f1=0.43).
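
The F1 scores quoted above are the harmonic mean of precision and recall, F1 = 2PR/(P+R). A small sketch that recomputes them from the XGBoost crosstab counts (rows predicted, columns actual; the helper name `f1_from_counts` is illustrative):

```python
genres = ["Country", "Hip-Hop", "Metal", "Pop", "Rock"]

# Rows: predicted genre; columns: actual genre (counts copied from the XGBoost crosstab).
xgb_counts = [
    [2699, 70, 116, 547, 753],
    [16, 2527, 65, 203, 60],
    [129, 85, 2495, 235, 599],
    [371, 344, 220, 1839, 648],
    [487, 169, 508, 797, 1503],
]

def f1_from_counts(counts, k):
    """F1 for class k: harmonic mean of precision (row) and recall (column)."""
    true_positive = counts[k][k]
    precision = true_positive / sum(counts[k])
    recall = true_positive / sum(row[k] for row in counts)
    return 2 * precision * recall / (precision + recall)

f1_scores = {g: round(f1_from_counts(xgb_counts, i), 2) for i, g in enumerate(genres)}
print(f1_scores)
```

Because F1 penalizes an imbalance between precision and recall, it is a more informative per-genre summary here than accuracy alone.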

In [None]:
#Which songs did the xgb model misclassify?
test['preds_xgb'] = preds_xgb
misclasses_xgb = test[test.preds_xgb != test.genre]
misclasses_xgb['misclass_combo'] = misclasses_xgb.apply(lambda x: x['genre']+'-'+x['preds_xgb'], axis=1)

In [None]:
#Misclassified songs for each genre pair
misclasses_xgb.misclass_combo.value_counts()

Pop-Rock           797
Rock-Country       753
Rock-Pop           648
Rock-Metal         599
Pop-Country        547
Metal-Rock         508
Country-Rock       487
Country-Pop        371
Hip-Hop-Pop        344
Pop-Metal          235
Metal-Pop          220
Pop-Hip-Hop        203
Hip-Hop-Rock       169
Country-Metal      129
Metal-Country      116
Hip-Hop-Metal       85
Hip-Hop-Country     70
Metal-Hip-Hop       65
Rock-Hip-Hop        60
Country-Hip-Hop     16
Name: misclass_combo, dtype: int64

# Classification Using Word2Vec

**Word2Vec - Feature Extraction**

In [None]:
#Import Dependencies 
import gensim
import pandas as pd
import nltk
import numpy as np
from nltk.corpus import brown
nltk.download('brown')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


True

In [None]:
#Tokenize lyrics using simple_preprocessor
from gensim.utils import simple_preprocess
# Tokenize the text column to get the new column 'tokenized_text'
df['tokenized_lyrics'] = [simple_preprocess(line, deacc=True) for line in df['clean_lyrics']] 
print(df['tokenized_lyrics'].head(10))

0     [love, crawdads, wheels, old, mag, fact, joint...
2     [female, newsreporter, talking, also, hearing,...
3     [ala, derecha, dale, suave, cabra³n, dale, sua...
4     [uh, ah, uh, uhh, hahaha, check, flipmode, squ...
5     [verse, getting, hard, tell, months, weeks, st...
6     [living, phat, pockets, flat, wit, tha, gat, r...
7     [word, bond, got, goin, amon, throw, ya, hands...
8     [gorilla, fucking, coupe, finna, pull, zoo, ni...
10    [many, brothers, fell, victim, streets, rest, ...
11    [ohhh, ohhh, ohhhhhh, ohh, ohhh, ohhhhh, heard...
Name: tokenized_lyrics, dtype: object


In [None]:
#Stem the tokenized lyrics via PorterStemmer
from gensim.parsing.porter import PorterStemmer
porter_stemmer = PorterStemmer()
df['stemmed_tokens'] = [[porter_stemmer.stem(word) for word in tokens] for tokens in df['tokenized_lyrics'] ]

In [None]:
#Create list of tokenized lyrics
Lyrics=list(df['clean_lyrics'])

In [None]:
#Create corpus from the list of lyrics 
corpus=(Lyrics)

In [None]:
#Tokenize and Create Vocabulary List using the lyrics and the very large Brown Corpus of words
sentence_data=[] 
Tokens=[]
for lyric in corpus:
    g=lyric.split()
    # Note: set() drops word order and repeated words, which Word2Vec's context
    # window relies on; passing the token list g itself would preserve both.
    sentence_data.append(set(g))
    Tokens.append(g)
#Brown Corpus Data
sentence_brown = brown.sents()   
for sentence in sentence_brown:
    sentence_data.append(set(sentence))
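
Word2Vec learns from the words that appear near each other within a sliding window over each token sequence, so the order and repetition of tokens shape the training data. The sketch below is a simplified illustration of that pairing with a fixed window, not gensim's implementation (gensim additionally samples window sizes and subsamples frequent words); `context_pairs` is a hypothetical helper.

```python
def context_pairs(tokens, window=2):
    """Return (center, context) pairs within `window` positions of each token."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "baby baby please come home".split()
ordered_pairs = context_pairs(tokens, window=1)
print(ordered_pairs)

# A set keeps each distinct word once and in arbitrary order,
# so the repeated "baby" and the original adjacency are lost.
print(len(set(tokens)), "distinct words from", len(tokens), "tokens")
```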

In [None]:
#Fit lyrics data to gensim model (gensim 3.x API; gensim 4.0+ renamed size to vector_size)
model = gensim.models.Word2Vec(sentence_data, size=50, window=5, min_count=5)

In [None]:
#Create the vocab list from the fitted Word2Vec model (gensim 3.x; gensim 4.0+ exposes model.wv.key_to_index instead)
vocab=set(model.wv.vocab)

In [None]:
#Apply Word2Vec to all of the data
word2vecTokens=[] 
#Run a loop over all the tokens
for g in Tokens: 
    vc=[] #to store the word vectors of one song
    for s in g:
        if (s in vocab):
            vc.append(model.wv[s])
    word2vecTokens.append(vc) # appending all the vectors of an instance

In [None]:
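#Average the word vectors of each song into one 50-dimensional vector
#(songs with no in-vocabulary words yield NaN here and are zero-filled below)
g=[]
for i in range (len(word2vecTokens)):
    g.append(np.sum(word2vecTokens[i], axis=0)/len(word2vecTokens[i]))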
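
Each song is represented by the mean of its word vectors. A song with no in-vocabulary words has nothing to average, which is where the NaN values handled in the later cells come from. A minimal dependency-free sketch of mean pooling with an explicit zero-vector fallback (the toy embeddings and `mean_vector` are illustrative, not taken from the trained model):

```python
def mean_vector(tokens, embeddings, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(component) / len(vectors) for component in zip(*vectors)]

# Toy 3-dimensional embeddings standing in for model.wv
toy_embeddings = {
    "love": [1.0, 0.0, 2.0],
    "truck": [3.0, 2.0, 0.0],
}

song_vec = mean_vector(["love", "truck", "zzz"], toy_embeddings, dim=3)
empty_vec = mean_vector(["zzz"], toy_embeddings, dim=3)
print(song_vec, empty_vec)
```

Handling the empty case at the point of averaging avoids the division-by-zero warning and the separate clean-up pass over the results.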



In [None]:
#Create the column names for the 50 embedding dimensions
columns=["w2v_"+str(i) for i in range(1,51)]

In [None]:
#Convert each averaged vector to a plain Python list (NaN scalars become Python floats)
for x in range (len(g)):
    g[x]=np.asarray(g[x]).tolist()
In [None]:
#Convert the Word2Vec features into a pandas dataframe
#Songs with no in-vocabulary words (NaN averages) get an all-zero vector
for x in range (len(g)):
    if isinstance(g[x], float):
        g[x]=[0]*50
w2v_df=pd.DataFrame(g,columns=columns)
w2v_df.head()

Unnamed: 0,w2v_1,w2v_2,w2v_3,w2v_4,w2v_5,w2v_6,w2v_7,w2v_8,w2v_9,w2v_10,w2v_11,w2v_12,w2v_13,w2v_14,w2v_15,w2v_16,w2v_17,w2v_18,w2v_19,w2v_20,w2v_21,w2v_22,w2v_23,w2v_24,w2v_25,w2v_26,w2v_27,w2v_28,w2v_29,w2v_30,w2v_31,w2v_32,w2v_33,w2v_34,w2v_35,w2v_36,w2v_37,w2v_38,w2v_39,w2v_40,w2v_41,w2v_42,w2v_43,w2v_44,w2v_45,w2v_46,w2v_47,w2v_48,w2v_49,w2v_50
0,0.209785,0.181589,0.110289,0.109864,0.298785,-0.378487,0.146127,0.044905,0.02104,0.100183,-0.000413,-0.820316,-0.157412,-0.409688,-0.852307,0.228529,-0.537322,0.429066,0.378882,-0.032191,0.105517,0.440574,0.349704,0.657562,-0.18592,-0.477514,0.345601,0.204147,-0.075273,-0.206848,-0.045877,-0.069673,-0.066996,-0.186448,-0.012974,0.323681,-0.281494,0.945937,-0.54209,0.494095,-0.157334,-0.785356,-0.271331,-0.439568,0.191536,0.002863,0.060437,0.31354,-1.144182,-0.270229
1,0.147199,0.214286,0.371582,0.243956,0.18548,-0.534881,-0.071013,0.204643,0.098194,0.150684,0.173831,-0.347603,-0.317625,-0.494478,-0.641134,0.077006,-0.380763,0.288521,0.307991,0.023034,-0.059503,-0.039249,0.250375,0.601551,-0.165686,-0.347886,0.210369,0.256628,-0.02567,-0.29795,-0.246597,-0.074416,0.073985,-0.604431,-0.149943,0.260114,-0.182868,0.621283,-0.378396,0.392131,-0.086574,-0.417222,-0.345181,-0.567082,0.284757,0.004774,0.019047,0.45957,-1.035808,-0.219823
2,-0.023518,-0.641457,-1.102616,0.380452,-1.563758,-0.624014,0.615991,1.620445,0.710752,0.300238,-0.407481,-0.011696,1.617881,0.648333,1.317425,0.245065,-1.071978,0.299933,0.573878,0.9378,-0.032214,-0.41871,-1.294968,-1.844275,-1.447322,0.665796,-2.58309,-0.276443,-0.919307,-1.212407,0.919041,-2.441631,0.402402,0.129662,1.93881,1.376162,-0.031456,-0.560316,-0.5137,0.740478,0.116013,0.45014,-0.342815,-0.540263,0.702031,-0.791063,1.578859,1.100754,-0.190145,1.338918
3,0.419271,0.081302,0.228797,0.156991,-0.062284,-0.295888,-0.12368,-0.054129,-0.107155,0.173701,-0.043598,-0.361134,-0.134886,-0.504117,-0.7306,0.153167,-0.736951,0.262178,0.704675,-0.058566,-0.170795,0.531184,0.174029,0.765985,-0.321834,-0.084343,0.08742,0.368625,-0.315431,-0.255213,-0.395816,0.243046,0.134219,-0.186369,-0.047282,0.532313,-0.196392,1.082691,-0.249974,0.476764,-0.025872,-0.43888,-0.510833,-1.136053,-0.064628,-0.077814,-0.106368,0.475145,-1.34144,-0.276306
4,-0.489338,0.686841,-0.446464,0.35198,0.492617,-0.199488,0.246129,-0.238532,-0.061256,1.106874,-0.564325,-0.58148,-0.406552,-0.863759,-0.65103,0.327285,-0.983178,0.918216,0.010463,-0.175525,-0.296195,-0.080797,0.934673,1.08481,-0.220089,-1.16309,0.935209,0.449411,-0.035598,0.508712,-0.063932,0.08124,-0.268076,-0.354819,0.19315,-0.078406,0.109638,0.786496,0.520406,0.774771,0.619919,-1.41184,-0.000969,-0.530563,0.288686,1.004326,0.331517,1.322752,-1.471923,-0.246995


In [None]:
#Save Word2Vec data to .csv file
w2v_df.to_csv("w2vData.csv",index=False) # Saving the dataframe to a file

**Gradient Boosting Classification Using Word2Vec Features**

In [None]:
#Import dependencies
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [None]:
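#Create column for genres in Word2Vec dataframe
#Caution: this assignment aligns on index labels, and df's index has gaps after
#filtering; df['genre'].values would assign by position instead
w2v_df['genre'] = df['genre']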

In [None]:
#Drop all NAs
w2v_df = w2v_df.dropna(how='any')

In [None]:
#Create train/test datasets
train,test,train_y,test_y = train_test_split(w2v_df[w2v_df.columns.difference(['genre'])],w2v_df['genre'],train_size=0.67)

In [None]:
#Fit gradient boosting classification model
gd = GradientBoostingClassifier(max_depth=20)
gd.fit(train,train_y)
pred = gd.predict(test)
print (accuracy_score(test_y,pred))

0.4504994224366379


The Gradient Boosting Classifier achieved an accuracy of 45%, far lower than the XGBoost and SVM models achieved. This is likely because song lyrics often lack the cohesive word structure that, say, a book would have. The Word2Vec training cell also passes each lyric as an unordered set of words, which may have weakened the embeddings further.

Other models that I tried on the Word2Vec features included Naive Bayes, SVM, Random Forests, and XGBoost. However, these all had even worse accuracy than the Gradient Boosting model.