# Can NLP helps with musical genre classification?

## 1. Overview

As seen in the [previous project](https://github.com/gustavolopeso/spotify-genre-classifier), audio features extracted from songs by Spotify audio analysis software can be useful to help us to differentiate brazilian Rap from brazilian Indie. But how could we improve the classifier accuracy?

Natural Language Processing (NLP) is a set of concepts and methods that look to make it possible for computers to understand natural human language. As rap and indie are different genres that are different in how they "sound", they are also different in what they talk about and how they do it.

Brazilian rap is a genre well known for dealing with social problems, representing the urban peripheral youth. Its lyrics protest, tell real stories, and seek to bring a motivational message to those who listen.

On the other hand, Brazilian Indie, which aggregates, in the case of this project, other genres such as New MPB and Alternative Rock, brings lyrics that deal with emotions and, to a certain extent, criticize the status quo. Many times the lyrics are not so obvious about the message they want to convey, being full of figures of speech, unlike rap, which is usually more direct.


## 2. ETL

We will use the song data extracted in the previous project.

In [1]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import nltk
from scipy.stats import ttest_ind

import matplotlib.pyplot as plt
import seaborn as sns

from tqdm import tqdm

In [2]:
df = pd.read_csv('spotify_data.csv')

Now, we need to get the lyrics for each song in the dataframe. For that, we will use the [Letras.mus.br website](www.letras.mus.br). The url for accessing the lyrics page from a song has the following format:

https://www.letras.mus.br/ARTIST_NAME/SONG_NAME

A GET request will be made for each url and BeautifulSoup will help us to find the lyrics on the website html code. The lyrics will be stored in the dataframe as a list of string for each song.

The get_lyrics function will be applied for the dataset to create the **lyrics** column.

In [5]:
import requests
from bs4 import BeautifulSoup

In [6]:
def get_lyrics(x):
    try:
        r = requests.get('https://www.letras.mus.br/{}/{}'.format(x['artist'].replace(' ','-').strip(),x['name'].replace(' ','-').strip()))
        soup = BeautifulSoup(r.content)
        lyrics = list(soup.find_all(class_='cnt-letra')[0].find_all('p'))
        lyrics = [str(item) for item in lyrics]
        lyrics =  '\n'.join(lyrics).replace('<br/>','\n').replace('<p>','\n').replace('</p>','\n').split('\n')
        lyrics = [item for item in lyrics if item != '' and '[' not in item]
        return lyrics
    except:
        return np.nan

In [5]:
lyrics_list = []
for i in tqdm(df.index):
    row = df.iloc[i]
    lyrics_list.append(get_lyrics(row))
df['lyrics'] = lyrics_list

100%|██████████████████████████████████████████████████████████████████████████████| 1058/1058 [09:16<00:00,  1.90it/s]


The lyrics for some songs couldn't be found, so we are going to drop these from the dataset. We found the lyrics for almost all songs.

In [35]:
df = df.dropna(subset=['lyrics'])
df['id'].count()

Unnamed: 0,index,id,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,...,track_href,analysis_url,duration_ms,time_signature,artist,name,album,release,genre,lyrics
0,0,5zwvCa9LuVB46IQwKODSW3,0.509,0.696,5,-7.341,1,0.0384,0.1100,0.434000,...,https://api.spotify.com/v1/tracks/5zwvCa9LuVB4...,https://api.spotify.com/v1/audio-analysis/5zwv...,300587,4,Boogarins,Começa em Você,"Manchaca, Vol. 2 (A Compilation of Boogarins M...",2021-04-02,indie,"[Começa em você, Todo dedo sujo aponta, Encont..."
1,1,7gTAaRW4AqsArF1vDUFY03,0.491,0.819,9,-5.520,1,0.1520,0.0565,0.000075,...,https://api.spotify.com/v1/tracks/7gTAaRW4AqsA...,https://api.spotify.com/v1/audio-analysis/7gTA...,281067,4,Boogarins,Correndo em Fúria,"Manchaca, Vol. 2 (A Compilation of Boogarins M...",2021-04-02,indie,"[Correndo em fúria, Contra toda essa angústia,..."
2,2,4XLqe8UMsnlsaa2qguc5xW,0.293,0.828,2,-6.863,1,0.0517,0.2590,0.001440,...,https://api.spotify.com/v1/tracks/4XLqe8UMsnls...,https://api.spotify.com/v1/audio-analysis/4XLq...,129493,5,Boogarins,Vc Sabe Mto - Improviso Fábrica dos Sonhos,"Manchaca, Vol. 2 (A Compilation of Boogarins M...",2021-04-02,indie,"[Você sabe muito, Você sabe muito, Suas ideias..."
3,3,3Tnaywzt8FACGnI0dBeeAv,0.724,0.769,6,-5.489,0,0.0737,0.4470,0.015300,...,https://api.spotify.com/v1/tracks/3Tnaywzt8FAC...,https://api.spotify.com/v1/audio-analysis/3Tna...,270653,4,Boogarins,Basic Lines,"Manchaca, Vol. 2 (A Compilation of Boogarins M...",2021-04-02,indie,"[Basic lines, Are like zig zag roads, Blowing ..."
4,4,3blzViOH3HZuRnCfVIYuPg,0.578,0.824,11,-5.838,1,0.0460,0.4960,0.003360,...,https://api.spotify.com/v1/tracks/3blzViOH3HZu...,https://api.spotify.com/v1/audio-analysis/3blz...,203453,4,Boogarins,No Meio de Tanto Cobertor - Medo de Falar,"Manchaca, Vol. 2 (A Compilation of Boogarins M...",2021-04-02,indie,"[Nem te achava, No meio de tanto cobertor, E a..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
997,1053,0qrsrne61cwFynJ0uzC0iS,0.695,0.705,1,-8.525,1,0.3100,0.1260,0.000000,...,https://api.spotify.com/v1/tracks/0qrsrne61cwF...,https://api.spotify.com/v1/audio-analysis/0qrs...,236058,4,Dalsin,Luxo,Cinza Chumbo,2015-08-25,rap,"[No janelão do apê mais alto da city, Vendo as..."
998,1054,0RKRkL6tM4OiBDaQ3qPUpL,0.391,0.526,9,-8.879,0,0.1800,0.3210,0.000000,...,https://api.spotify.com/v1/tracks/0RKRkL6tM4Oi...,https://api.spotify.com/v1/audio-analysis/0RKR...,237567,4,Dalsin,Blindado,Cinza Chumbo,2015-08-25,rap,"[A meta é ser melhor que ontem meu chapa, Nem ..."
999,1055,1Y64QrrZqNxbjnqsl9lxOl,0.690,0.648,0,-8.226,1,0.4870,0.2220,0.000000,...,https://api.spotify.com/v1/tracks/1Y64QrrZqNxb...,https://api.spotify.com/v1/audio-analysis/1Y64...,131591,4,Dalsin,Full,Cinza Chumbo,2015-08-25,rap,[E eu faço essa porra virar antes que tu imagi...
1000,1056,2Y7sFVvhXLHrll5wm8eguy,0.592,0.578,3,-9.127,0,0.3100,0.3950,0.000000,...,https://api.spotify.com/v1/tracks/2Y7sFVvhXLHr...,https://api.spotify.com/v1/audio-analysis/2Y7s...,184941,5,Dalsin,Nave,Cinza Chumbo,2015-08-25,rap,"[Bateu neurose e ela correu pro mar, Com aquel..."


In [36]:
df.to_csv('spotify_lyrics_data.csv')

In [7]:
def str2list(x): ### This function solves the problem of load dataframe .csv file containing list objects.
    result_list = []
    for i in x.split("',"):
        result_list.append(i.replace("'",'').replace('[','').replace(']','').strip())
    return result_list

In [8]:
df = pd.read_csv('spotify_lyrics_data.csv')
df['lyrics'] = df['lyrics'].apply(str2list)

## 3. Feature Engineering

Now that we have extracted the lyrics, we are going to use NLP to extract features from them. As said in the Overview section, the two genres are in different in **what** they talk about and **how** they do it. Therefore, we will use two methods to address these two problems.

NLTK is one of the most important NLP toolkits for python, and we will use it in addition to sklearn nlp-related functions.

### 3.1 Bag of Words

Bag of Words is a very simple wayto extract features from text. It's based on counting the occurences of the words from a vocabulary in a text. For example, having the following vocabulary:

- "I"
- "LOVE"
- "SHE"
- "APPLE"
- "ME"
- "HIM"
- "MONEY"
- "PEOPLE"

Considereing the following text:

"I LOVE MONEY. PEOPLE LOVE APPLES."

We can now score this text according to that vocabulary:

- "I": 1
- "LOVE": 2
- "SHE": 0
- "APPLES": 1
- "ME": 0
- "HIM": 0
- "MONEY": 1
- "PEOPLE": 1

The bag of words model doesn't capture any relationship between words or the order in which they are placed in the text, but instead focuses on the count of occurrences. It's easy to see that with Bag of Words we can turn complex texts into vectors of word occurences, what can be useful to train our classification model.

In order to create the bag of words, we will need to transform the text into a list of "tokens", that is, groups of characters. These tokens will be stemmed, or, in another words, reduced to their stem, the "root" word that carries the meaning of the word. Finally, the stems will be counted and stored in a vector.

#### 3.1.1 Word Tokenizing

The tokenizing process could be done through some string splits, replaces and regex, but we will use nltk.word_tokenize() function to do it. For the bag of words model we are going to use the lyrics as an unique string, and not a list of strings.

In [12]:
def lyrics_tokenize(text):
    text = text.lower()
    tokens = nltk.word_tokenize(text)
    return tokens

lyrics_tokenize('Eu sou o Gustavo! Sou brasileiro!')

['eu', 'sou', 'o', 'gustavo', '!', 'sou', 'brasileiro', '!']

#### 3.1.2 Token Stemming

Stemming is a good way for generalization. It makes it possible to understand the context more easily and to reduce the vocabulary complexity, reducing, in this way, the number of features, and, finally, the computational cost of training models.

We are going to suppose that almost all lyrics are in portuguese. The language of the text is very important to the stemming process, because it's based on vocabulary dictionaries created for each language. Therefore, we will use the "RSLP Stemmer", a portuguese Stemmer created by **Viviane Moreira Orengo** and **Christian Huyck**.

In [13]:
nltk.download('rslp')

[nltk_data] Downloading package rslp to
[nltk_data]     C:\Users\irong\AppData\Roaming\nltk_data...
[nltk_data]   Package rslp is already up-to-date!


True

In [14]:
def token_stemming(tokens):
    stemmer = nltk.stem.RSLPStemmer()
    stemmed_sentence = []
    for token in tokens:
        stemmed_sentence.append(stemmer.stem(token))
    return stemmed_sentence
token_stemming(lyrics_tokenize('Eu sou o Gustavo! Sou brasileiro!'))

['eu', 'sou', 'o', 'gustav', '!', 'sou', 'brasil', '!']

#### 3.1.3 Count Vectorizing

Now we need to create our vocabulary and describe each lyrics as a vector of word counts from it. A good way to do that is counting the stem occurences for each lyrics, storing it in a dictionary. This dictionary will be appended to an auxiliary dataframe. If a new stem, that is, a stem that was not seen in any previous lyrics, is added, a new column will be created, and a NaN value will be assigned to all previous value in that column. Finally, the NaN values will be replaced by 0, indicating that this stem was not found in that lyrics.

In [15]:
len(df)

1002

In [16]:
bow_df = pd.DataFrame()
dict_list = []
for i in tqdm(df.reset_index().index):
    row = df.iloc[i]
    lyrics = ' '.join(row['lyrics'])
    tokens = lyrics_tokenize(lyrics)
    stemmed_tokens = token_stemming(tokens)
    count_vector = {}
    word_count = 0
    for stemmed_token in stemmed_tokens:
        word_count += 1
        if stemmed_token in count_vector.keys():
            count_vector[stemmed_token] += 1
        else:
            count_vector[stemmed_token] = 1
    for key in count_vector.keys():
        count_vector[key] = count_vector[key]/word_count
    count_vector['word_count'] = word_count
    count_vector['id'] = row['id']
    count_vector['genre'] = row['genre']
    dict_list.append(count_vector)
bow_df = pd.DataFrame.from_records(dict_list)
bow_df = bow_df.fillna(0)
bow_df.to_csv('bow_df.csv')

100%|██████████████████████████████████████████████████████████████████████████████| 1002/1002 [00:43<00:00, 22.87it/s]


In [17]:
len(bow_df.columns)

9165

#### 3.1.4 Analysis

We got 9162 new features from the Bag of Words model. Maybe we could select only the most important ones to keep in our dataset, reducing the computational cost of dealing with a large number of features. For that, we are going to use Student's t to find out the tokens that are more likely to appear in one genre than in another. scipy module stats has a funcion called ttest_ind() that can help us with that.

In [127]:
from scipy.stats import ttest_ind

bow_p_dict = {}

for col in tqdm(bow_df.columns):
    if col not in ['id','genre','word_count']:
        bow_p_dict[col] = ttest_ind(bow_df.loc[bow_df['genre'] == 'indie'][col],bow_df.loc[bow_df['genre'] == 'rap'][col])[1]

100%|██████████████████████████████████████████████████████████████████████████████| 9166/9166 [08:41<00:00, 17.57it/s]


Now, we are going to select the 50 features with the lowest calculated p-values to train our model. The smaller the p-value, the more different are the distributions of that stem in each genre.

In [142]:
index = bow_p_dict.keys()
values = bow_p_dict.values()

bow_p_df = pd.DataFrame(zip(bow_p_dict.keys(),bow_p_dict.values()))
bow_p_df.columns = ['stem','p']
bow_p_df = bow_p_df.sort_values(by='p')
selected_words = list(bow_p_df.iloc[:50]['stem'])

Analysing the selected stems we can see that they are very common in rap songs and not in indie songs. We will use a strategy to keep the balance of the features.

For each stem, we are going to check the genre it's more common to appear, so we can select characteristic stems from each genre. Maybe this approach can reduce the model accuracy, but if we want to increase the number of genres that the model is capable of classifying in the future, it's very important that it has information about all genres.

In [149]:
genre_list = []
genre_count = {}
genres = bow_df['genre'].unique()
for genre in genres:
    genre_count[genre] = 0
for stem in tqdm(bow_p_df['stem']):
    max_mean = 0
    for genre in genres:
        if max_mean < bow_df.loc[bow_df['genre'] == genre][stem].mean():
            max_genre = genre
    genre_count[max_genre] += 1
    flag = True
    genre_list.append(max_genre)
    for genre in genres:
        if genre_count[genre] < 25:
            flag = False
    if flag:
        break
zeros = [0 for i in range(len(bow_p_df) - len(genre_list))]
bow_p_df['genre'] = genre_list + zeros

 21%|████████████████                                                              | 1890/9162 [01:50<07:06, 17.03it/s]


In [150]:
nans = [None for i in range(len(bow_p_df) - len(genre_list))]
bow_p_df['genre'] = genre_list + nans
bow_p_df.columns = ['stem','p','genre']
indie_stems = list(bow_p_df['stem'].loc[bow_p_df['genre'] == 'indie'][:25])
rap_stems = list(bow_p_df['stem'].loc[bow_p_df['genre'] == 'rap'][:25])
selected_stems = indie_stems + rap_stems

#### 3.1.5 Classification Perfomance

Using the selected stems as features, the classification perfomance will be evaluated with the RandomForest Classification algorithm.

In [131]:
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from sklearn.ensemble import RandomForestClassifier

In [152]:
bow_train_df = bow_df[['id']+selected_stems+['genre']].copy()
bow_train_df['target'] = bow_train_df['genre'].map({'rap': True,'indie': False})
X = bow_train_df[selected_stems]
y = bow_train_df['target']
acc = []
auc = []
for i in range(10):
    X_train, X_test, y_train, y_test = train_test_split(X,y)
    scaler = RobustScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    rfc = RandomForestClassifier(n_estimators = 200)
    rfc.fit(X_train,y_train)
    y_pred = rfc.predict(X_test)
    acc.append(accuracy_score(y_test,y_pred))
    y_score = rfc.predict_proba(X_test)[:, 1]
    auc.append(roc_auc_score(y_test,y_score))
print('Average accuracy: {:.4f}'.format(sum(acc)/len(acc)))
print('Average ROC AUC score: {:.4f}'.format(sum(auc)/len(auc)))

Average accuracy: 0.9271
Average ROC AUC score: 0.9760


The classifier has achieved an average accuracy of 92.7%, which can be considered high.

### 3.2 Pos tagging

The Part of Speech (POS) tagging process has the objetive of assigning a gramatical class for each word in a sentence. This method can, in a certain way, extract features about how the text message is delivered, that is, there are many ways to delivery the same message using text. Counting the frequencies of occurences of each POS in the lyrics is a good way to extract features about them. For example, considering the following sentence:

"I love to play soccer!"

We can POS tag the words as:

- I: Personal Pronoun
- love: Verb, 3rd person singular present
- to: To (The word "to" is considered a POS tag)
- play: Verb, base form
- soccer: Noun, singular or mass

A complete list of Parts of Speech is available [here](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).


To do that, we will use a POS tagger created by [Matheus Inoue](https://github.com/inoueMashuu/POS-tagger-portuguese-nltk) for portuguese text. For this task it's important to have the lyrics separated line by line.

We will need to tokenize the text again, so we will use the function lyrics_tokenize() created before.

In [39]:
import joblib

folder = 'trained_POS_taggers/'
pos_tagger = joblib.load(folder+'POS_tagger_brill.pkl')

In [103]:
dict_list = [] 
pos_df = pd.DataFrame()
for i in tqdm(df.index):
    row = df.iloc[i]
    lyric = row['lyrics']
    word_count = 0
    pos_vector = {}
    for phrase in lyric:
        tokens = lyrics_tokenize(phrase)
        if len(tokens) > 3:
            for word in pos_tagger.tag(tokens):
                if word[1] in pos_vector.keys():
                    word_count += 1
                    pos_vector[word[1]] += 1
                else:
                    word_count += 1
                    pos_vector[word[1]] = 1
    for key in pos_vector.keys():
        pos_vector[key] = pos_vector[key]/word_count
    pos_vector['id'] = row['id']
    pos_vector['genre'] = row['genre']
    pos_vector['word_count'] = word_count
    dict_list.append(pos_vector)
pos_df = pd.DataFrame.from_records(dict_list)
pos_df = pos_df.fillna(0)
pos_df.to_csv('pos_df.csv')

100%|██████████████████████████████████████████████████████████████████████████████| 1002/1002 [00:36<00:00, 27.43it/s]


We will use Student's t test again to select the 25 most important feature to use in the training of our model.

In [104]:
from scipy.stats import ttest_ind

pos_dict = {}

for col in tqdm(pos_df.columns):
    if col not in ['id','genre','word_count','name']:
        pos_dict[col] = ttest_ind(pos_df.loc[pos_df['genre'] == 'indie'][col],pos_df.loc[pos_df['genre'] == 'rap'][col])[1]

100%|█████████████████████████████████████████████████████████████████████████████████| 50/50 [00:00<00:00, 353.41it/s]


In [109]:
index = pos_dict.keys()
values = pos_dict.values()

pos_p_df = pd.DataFrame(zip(pos_dict.keys(),pos_dict.values()))
pos_p_df.columns = ['pos','p']
pos_p_df = pos_p_df.sort_values(by='p')
selected_pos = list(pos_p_df.iloc[:25]['pos'])

With the selected POS, we will repeat the classification evaluation process made to the selected stems.

In [153]:
pos_train_df = pos_df[['id']+selected_pos+['genre']].copy()
pos_train_df['target'] = pos_train_df['genre'].map({'rap': True,'indie': False})
X = pos_train_df[selected_pos]
y = pos_train_df['target']
acc = []
auc = []
for i in range(10):
    X_train, X_test, y_train, y_test = train_test_split(X,y)
    scaler = RobustScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    rfc = RandomForestClassifier(n_estimators = 200)
    rfc.fit(X_train,y_train)
    y_pred = rfc.predict(X_test)
    acc.append(accuracy_score(y_test,y_pred))
    y_score = rfc.predict_proba(X_test)[:, 1]
    auc.append(roc_auc_score(y_test,y_score))
print('Average accuracy: {:.4f}'.format(sum(acc)/len(acc)))
print('Average ROC AUC score: {:.4f}'.format(sum(auc)/len(auc)))

Average accuracy: 0.9143
Average ROC AUC score: 0.9733


The classifier has achieved an accuracy of 91.4%, which is lower than the accuracy obtained for the bag of words features and the audio features based model.

## 4 Model Training

Firstly, we will create a dataset with all features (audio features, pos tags e stems) and store it in a .csv file.

In [184]:
audio_features = ['danceability','energy','key','loudness','mode','speechiness','acousticness','instrumentalness','liveness','valence','tempo','time_signature']

train_df = df[['id'] + audio_features + ['genre']].copy()

bow_join_df = bow_train_df[['id']+selected_stems].copy()
bow_join_df.columns = ['bow_'+x if x != 'id' else x for x in bow_join_df.columns]

pos_join_df = pos_train_df[['id']+selected_pos].copy()
pos_join_df.columns = ['pos_'+x if x != 'id' else x for x in pos_join_df.columns]

train_df = train_df.merge(bow_join_df,on='id')
train_df = train_df.merge(pos_join_df,on='id')
train_df['target'] = train_df['genre'].map({'rap': True,'indie': False})
train_df.to_csv('song_data.csv')

In [189]:
train_df

Unnamed: 0,id,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,...,pos_NPROP,pos_VAUX,pos_ADV-KS-REL,pos_PREP|+,pos_KC|[,pos_PCP,pos_PREP,pos_-,pos_?,target
0,5zwvCa9LuVB46IQwKODSW3,0.509,0.696,5,-7.341,1,0.0384,0.1100,0.434000,0.1250,...,0.021277,0.000000,0.000000,0.000000,0.000000,0.021277,0.063830,0.0,0.021277,False
1,7gTAaRW4AqsArF1vDUFY03,0.491,0.819,9,-5.520,1,0.1520,0.0565,0.000075,0.3520,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.058824,0.0,0.000000,False
2,4XLqe8UMsnlsaa2qguc5xW,0.293,0.828,2,-6.863,1,0.0517,0.2590,0.001440,0.3960,...,0.013699,0.000000,0.000000,0.013699,0.000000,0.000000,0.027397,0.0,0.000000,False
3,3Tnaywzt8FACGnI0dBeeAv,0.724,0.769,6,-5.489,0,0.0737,0.4470,0.015300,0.7260,...,0.692308,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,False
4,3blzViOH3HZuRnCfVIYuPg,0.578,0.824,11,-5.838,1,0.0460,0.4960,0.003360,0.1650,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.204545,0.0,0.000000,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
997,0qrsrne61cwFynJ0uzC0iS,0.695,0.705,1,-8.525,1,0.3100,0.1260,0.000000,0.1200,...,0.044655,0.014885,0.004060,0.001353,0.000000,0.018945,0.077131,0.0,0.002706,True
998,0RKRkL6tM4OiBDaQ3qPUpL,0.391,0.526,9,-8.879,0,0.1800,0.3210,0.000000,0.1400,...,0.049223,0.031088,0.000000,0.000000,0.002591,0.020725,0.090674,0.0,0.000000,True
999,1Y64QrrZqNxbjnqsl9lxOl,0.690,0.648,0,-8.226,1,0.4870,0.2220,0.000000,0.6570,...,0.074359,0.007692,0.000000,0.000000,0.000000,0.020513,0.071795,0.0,0.005128,True
1000,2Y7sFVvhXLHrll5wm8eguy,0.592,0.578,3,-9.127,0,0.3100,0.3950,0.000000,0.0965,...,0.065934,0.010989,0.002747,0.019231,0.000000,0.005495,0.074176,0.0,0.010989,True


### 4.1 Model Selection

Now, we need to select which model we will use to the classifier. As we did in the previous project, we will test the following models:

- K Nearest Neighbors
- Support Vector Machine
- Random Forest

#### 4.1.1 KNN

In [178]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

In [199]:
features = [x for x in train_df.columns if x not in ['id','genre','target']]
X = train_df[features]
y = train_df['target']
acc = []
auc = []
X_train, X_test, y_train, y_test = train_test_split(X,y)
scaler = RobustScaler()
X = scaler.fit_transform(X)
knn = KNeighborsClassifier()
param_grid = {'n_neighbors': [1,3,5,10,20,30]}
clf = GridSearchCV(knn,param_grid,cv=10)
clf.fit(X,y)
params = clf.cv_results_['params']
score = list(clf.cv_results_['mean_test_score'])
index = score.index(max(score))
print(params[index],score[index])

{'n_neighbors': 1} 0.8422475247524751


The maximum accuracy (84.2%) was reached for KNN algorithm with 1 neighbor.

#### 4.1.2 SVC

In [186]:
from sklearn.svm import SVC

In [200]:
features = [x for x in train_df.columns if x not in ['id','genre','target']]
X = train_df[features]
y = train_df['target']
acc = []
auc = []
X_train, X_test, y_train, y_test = train_test_split(X,y)
scaler = RobustScaler()
X = scaler.fit_transform(X)
svc = SVC()
param_grid = {'kernel': ['linear','rbf'], 'C': [0.01,0.1,1,10]}
clf = GridSearchCV(svc,param_grid,cv=10)
clf.fit(X,y)
params = clf.cv_results_['params']
score = list(clf.cv_results_['mean_test_score'])
index = score.index(max(score))
print(params[index],score[index])

{'C': 10, 'kernel': 'rbf'} 0.8831089108910891


The SVC model with a C of 10 and a rbf kernel reached 88.3% of accuracy, which is considerably higher than the accuracy of the KNN model.

#### 4.1.3 RFC

In [201]:
features = [x for x in train_df.columns if x not in ['id','genre','target']]
X = train_df[features]
y = train_df['target']
acc = []
auc = []
X_train, X_test, y_train, y_test = train_test_split(X,y)
scaler = RobustScaler()
X = scaler.fit_transform(X)
rfc = RandomForestClassifier()
param_grid = {'n_estimators': [20,50,100,200,500]}
clf = GridSearchCV(rfc,param_grid,cv=10)
clf.fit(X,y)
params = clf.cv_results_['params']
score = list(clf.cv_results_['mean_test_score'])
index = score.index(max(score))
print(params[index],score[index])

{'n_estimators': 100} 0.9410792079207921


RFC outperformed both KNN and SVC, reaching 94.1% accuracy with a forest of 100 trees each trained on a random slice of the dataset, which is expected because of the results of the previous project.

### 4.2 Performance Evaluation

So, using the Random Forest Classifier with 100 estimators we will check the accuracy and the AUC-ROC metric from the classificator.

In [202]:
X = train_df[features]
y = train_df['target']
acc = []
auc = []
for i in range(10):
    X_train, X_test, y_train, y_test = train_test_split(X,y)
    scaler = RobustScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    rfc = RandomForestClassifier(n_estimators = 100)
    rfc.fit(X_train,y_train)
    y_pred = rfc.predict(X_test)
    acc.append(accuracy_score(y_test,y_pred))
    y_score = rfc.predict_proba(X_test)[:, 1]
    auc.append(roc_auc_score(y_test,y_score))

In [203]:
print('Average accuracy: {:.4f}'.format(sum(acc)/len(acc)))
print('Average ROC AUC score: {:.4f}'.format(sum(auc)/len(auc)))

Average accuracy: 0.9653
Average ROC AUC score: 0.9917


Finally, we got 96.5% accuracy and 0.99 of ROC AUC score. This result is significantly better than the result achieved by the RFC trained with audio features only.

## 5 Conclusion

As seen, the addition of BOW and POS features extracted by NLP increased the accuracy of the RFC model trained on the dataset by almost 5% (91.9% -> 96.5%). We usually associate NLP with sentiment analysis and other complex tasks such as topic classification and translation, but NLP includes other simpler concepts such as Bag of Words and Part of Speech tagging, which can be very useful for extracting features from text without requiring a lot of computational power.