# Predicting Song Genre from Song Lyrics

### Data Cleaning, Tokenization, & Lemmatization (Part 1)

The data is an accumulation of 380,000+ lyrics from the website MetroLyrics.com. We have multiple artists in 10 varying genres ranging from Pop & Rock to Indie & Folk. The songs range from years 1960 to 2016. 

The goal of the project was, as said in title, to predict genre based on the lyrics. In order to tackle this problem, this involved doing natural language processing in order to effectively analyze the lyrics. 

In [1075]:
import pandas as pd
import numpy as np
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from collections import defaultdict
from nltk.corpus import wordnet as wn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import model_selection, naive_bayes, svm
from sklearn.metrics import accuracy_score
import seaborn as sns
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score
from sklearn import preprocessing

In [1076]:
import nltk
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Manda\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Manda\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Manda\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Manda\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [1077]:
df1 = pd.read_csv('lyrics.csv', delimiter=',')
df1.dataframeName = 'lyrics.csv'
emptyLyrics = len(df1)
df1 = df1[df1['lyrics']!='instrumental'].dropna()
emptyLyrics -= len(df1)
print(str(emptyLyrics) + " rows dropped (no lyrics)")

#drop 'Not Available' genre
df1 = df1[df1['genre']!='Not Available'].dropna()
#drop "other" genre
df1 = df1[df1['genre']!='Other'].dropna()

df1['lyrics'].dropna(inplace=True)
df1['lyrics'].dropna(inplace=True)

#Change all the text to lower case. 
df1['lyrics'] = [entry.lower() for entry in df1['lyrics']]

95761 rows dropped (no lyrics)


In [1078]:
df1 = df1.sample(frac = 0.5)

### Work Tokenization

Basically, what word tokenization does is splits the text into individual words. For instance, for the lyrics "oh baby, how you doing?" into a list with the tokens ['oh','baby', ",",'how','you','doing','?']. As seen here, it splits punctuations into their own tokens which will be removed later.

In [1079]:
df1['lyrics']= [word_tokenize(entry) for entry in df1['lyrics']]

### Lemmatizing the Lyrics and Removing Stop Words from Lyrics

##### Stop Words:
As most of us are aware, there are words in songs that have no significance to the listener for understanding the song. Some examples include "la la", "oh", "ooo", "na", etc. It is important that we remove these words along with other words that the NLTK package deems as stopwords such as "the", "an", "a", etc.

In [1080]:
#reset the index (because we took random sample of data) for next function
df1 = df1.drop(columns = ['index'])
df1 = df1.reset_index(drop = True)

In [1081]:
customStopWords = ["'s", "n't", "'m", "'re", "'ll","'ve","...", "ä±", "''", '``',\
                  '--', "'d", 'el', 'la', 'chorus', 'verse', 'oh', 'la', 'ya', 'na', 'wo', 'wan', 'Chorus', 'Verse',
                  'ca', 'cuz', '[Verse 1:]', '[Intro:]', '[Chorus]', '\n', 's', 't', 'n', 'don',
                  'ya','aah','ye','hey','ba','da','buh','duh','doo','oh','ooh','woo','uh','hoo','ah','yeah',
                   'oo','la','chorus','beep','ha', 'un', 'se', 'que', 'mi', 'con', 'en', 'por', 'un', 'de', 'yo', 'los', 
                   'si', 'le', 'los']

stopWords = stopwords.words('english') + stopwords.words('spanish') + stopwords.words('german')+ customStopWords

###### Lemmatization:
WordNetLemmatizer() enables the process of converting a word to its base form. This involves grouping together different inflected forms of a word, while still keeping the context of the word. Words with multiple variations, but with similar meanings, can be analysed as a single item. An example of this is "better" to "good", "saying" to "say", and "heard" to "hear".

WordNetLemmatizer() takes a part of speech parameter: here it is "pos_tag". This is important for lemmatization so the computer can recognize the context of the word to lemmatize it properly. 

If a speech parameter is not supplied, the default is "noun." This means that an attempt will be made to find the closest noun, which can create trouble.

### Note: I ran this notebook for 1.5 days so the result (after my lack of patience) was the dataframe "df2"

In [1082]:
#labeling words as their respective parts of speech
tag_map = defaultdict(lambda : wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV

#segmenting by parts of speech
for index,entry in enumerate(df1['lyrics']):
    Final_words = []
    verb_words = []
    adv_words = []
    noun_words = []
    adj_words = []
    #WordNetLemmatizer()
    word_Lemmatized = WordNetLemmatizer()
    # pos_tag function below will provide the 'tag' i.e if the word is Noun(N) or Verb(V) or something else.
    for word, tag in pos_tag(entry):
        if word not in stopwords.words('english') and word not in stopWords and word.isalpha():
            word_Final = word_Lemmatized.lemmatize(word,tag_map[tag[0]])
            Final_words.append(word_Final)
        df1.loc[index,'lyrics_final'] = str(Final_words)
        
        if word not in stopwords.words('english') and word not in stopWords and word.isalpha() and 'NN' in tag:
            noun_Final = word_Lemmatized.lemmatize(word,tag_map[tag[0]])
            noun_words.append(noun_Final)
        df1.loc[index,'lyrics_noun'] = str(noun_words)


KeyboardInterrupt: 

In [None]:
#I ran a random sample of 50% of through the code, and it took me 1.5 days. df2 is the result of 
#what was finished after my patience ran out.
df2 = df1[0:46940] 

In [1128]:
#df2.to_csv('C:/Users/Manda/Documents/NCF/Machine Learning/ML Project/lyrics_final.csv')