# Can NLP helps with musical genre classification?

## 1. Overview

As seen in the [previous project](https://github.com/gustavolopeso/spotify-genre-classifier), audio features extracted from songs by Spotify audio analysis software can be useful to help us to differentiate brazilian Rap from brazilian Indie. But how could we improve the classifier accuracy?

Natural Language Processing (NLP) is a set of concepts and methods that look to make it possible for computers to understand natural human language. As rap and indie are different genres that are different in how they "sound", they are also different in what they talk about and how they do it.

Brazilian rap is a genre well known for dealing with social problems, representing the urban peripheral youth. Its lyrics protest, tell real stories, and seek to bring a motivational message to those who listen.

On the other hand, Brazilian Indie, which aggregates, in the case of this project, other genres such as New MPB and Alternative Rock, brings lyrics that deal with emotions and, to a certain extent, criticize the status quo. Many times the lyrics are not so obvious about the message they want to convey, being full of figures of speech, unlike rap, which is usually more direct.


## 2. ETL

We will use the song data extracted in the previous project.

In [75]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns


In [2]:
df = pd.read_csv('spotify_data.csv')

Now, we need to get the lyrics for each song in the dataframe. For that, we will use the [Letras.mus.br website](www.letras.mus.br). The url for accessing the lyrics page from a song has the following format:

https://www.letras.mus.br/{ARTIST_NAME}/{SONG_NAME}

A GET request will be made for each url and BeautifulSoup will help us to find the lyrics on the website html code. The lyrics will be stored in the dataframe as a list of string for each song.

The get_lyrics function will be applied for the dataset to create the **lyrics** column.

In [3]:
import requests
from bs4 import BeautifulSoup

In [4]:
def get_lyrics(x):
    try:
        r = requests.get('https://www.letras.mus.br/{}/{}'.format(x['artist'].replace(' ','-').strip(),x['name'].replace(' ','-').strip()))
        soup = BeautifulSoup(r.content)
        lyrics = list(soup.find_all(class_='cnt-letra')[0].find_all('p'))
        lyrics = [str(item) for item in lyrics]
        lyrics =  '\n'.join(lyrics).replace('<br/>','\n').replace('<p>','\n').replace('</p>','\n').split('\n')
        lyrics = [item for item in lyrics if item != '' and '[' not in item]
        return lyrics
    except:
        return np.nan

In [5]:
df['lyrics'] = df.apply(get_lyrics,axis=1)

The lyrics for some songs couldn't be found, so we are going to drop these from the dataset. We found the lyrics for almost all songs.

In [6]:
df = df.dropna(subset=['lyrics'])
df['id'].count()

1010

In [12]:
df.to_csv('spotify_lyrics_data.csv')

## 3. Feature Engineering

Now that we have extracted the lyrics, we are going to use NLP to extract features from them. As said in the Overview section, the two genres are in different in **what** they talk about and **how** they do it. Therefore, we will use two methods to address these two problems.

NLTK is one of the most important NLP toolkits for python, and we will use it in addition to sklearn nlp-related functions.

### 3.1 Bag of Words

Bag of Words is a very simple wayto extract features from text. It's based on counting the occurences of the words from a vocabulary in a text. For example, having the following vocabulary:

- "I"
- "LOVE"
- "SHE"
- "APPLE"
- "ME"
- "HIM"
- "MONEY"
- "PEOPLE"

Considereing the following text:

"I LOVE MONEY. PEOPLE LOVE APPLES."

We can now score this text according to that vocabulary:

- "I": 1
- "LOVE": 2
- "SHE": 0
- "APPLES": 1
- "ME": 0
- "HIM": 0
- "MONEY": 1
- "PEOPLE": 1

The bag of words model doesn't capture any relationship between words or the order in which they are placed in the text, but instead focuses on the count of occurrences. It's easy to see that with Bag of Words we can turn complex texts into vectors of word occurences, what can be useful to train our classification model.

In order to create the bag of words, we will need to transform the text into a list of "tokens", that is, groups of characters. These tokens will be stemmed, or, in another words, reduced to their stem, the "root" word that carries the meaning of the word. Finally, the stems will be counted and stored in a vector.

#### 3.1.1 Word Tokenizing

The tokenizing process could be done through some string splits, replaces and regex, but we will use nltk.word_tokenize() function to do it. For the bag of words model we are going to use the lyrics as an unique string, and not a list of strings.

In [7]:
import nltk

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\irong\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [8]:
def lyrics_tokenize(text):
    text = text.lower()
    tokens = nltk.word_tokenize(text)
    return tokens

lyrics_tokenize('Eu sou o Gustavo! Sou brasileiro!')

['eu', 'sou', 'o', 'gustavo', '!', 'sou', 'brasileiro', '!']

#### 3.1.2 Token Stemming

Stemming is a good way for generalization. It makes it possible to understand the context more easily and to reduce the vocabulary complexity, reducing, in this way, the number of features, and, finally, the computational cost of training models.

We are going to suppose that almost all lyrics are in portuguese. The language of the text is very important to the stemming process, because it's based on vocabulary dictionaries created for each language. Therefore, we will use the "RSLP Stemmer", a portuguese Stemmer created by **Viviane Moreira Orengo** and **Christian Huyck**.

In [9]:
nltk.download('rslp')

[nltk_data] Downloading package rslp to
[nltk_data]     C:\Users\irong\AppData\Roaming\nltk_data...
[nltk_data]   Package rslp is already up-to-date!


True

In [10]:
def token_stemming(tokens):
    stemmer = nltk.stem.RSLPStemmer()
    stemmed_sentence = []
    for token in tokens:
        stemmed_sentence.append(stemmer.stem(token))
    return stemmed_sentence
token_stemming(lyrics_tokenize('Eu sou o Gustavo! Sou brasileiro!'))

['eu', 'sou', 'o', 'gustav', '!', 'sou', 'brasil', '!']

#### 3.1.3 Count Vectorizing

Now we need to create our vocabulary and describe each lyrics as a vector of word counts from it. A good way to do that is counting the stem occurences for each lyrics, storing it in a dictionary. This dictionary will be appended to an auxiliary dataframe. If a new stem, that is, a stem that was not seen in any previous lyrics, is added, a new column will be created, and a NaN value will be assigned to all previous value in that column. Finally, the NaN values will be replaced by 0, indicating that this stem was not found in that lyrics.

In [13]:
aux_df = pd.DataFrame()
vocab = []
for i, row in df.iterrows():
    print(i)
    lyrics = ' '.join(row['lyrics'])
    tokens = lyrics_tokenize(lyrics)
    stemmed_tokens = token_stemming(tokens)
    count_vector = {'id': row['id'], 'genre': row['genre']}
    word_count = 0
    for stemmed_token in stemmed_tokens:
        word_count += 1
        if stemmed_token in count_vector.keys():
            count_vector[stemmed_token] += 1
        else:
            count_vector[stemmed_token] = 1
    count_vector['word_count'] = word_count
    aux_df = aux_df.append(count_vector, ignore_index=True)
aux_df = aux_df.fillna(0)
for col in aux_df.columns:
    if col not in ['id','genre','word_count']:
        aux_df[col] = aux_df[col]/aux_df['word_count']
aux_df = aux_df.fillna(0)

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
46
47
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
214
215
217
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284


In [16]:
len(aux_df.columns)

9374

In [23]:
aux_df.to_csv('stem_df.csv')

#### 3.1.4 Analysis

We got 9371 new features from the Bag of Words model. Maybe we could select only the most important ones to keep in our dataset, reducing the computational cost of dealing with a large number of features. For that, we are going to use Student's t to find out the tokens that are more likely to appear in one genre than in another. scipy module stats has a funcion called ttest_ind() that can help us with that.

In [17]:
import tqdm

In [18]:
from scipy.stats import ttest_ind

p_dict = {}

for col in tqdm.tqdm(aux_df.columns):
    if col not in ['id','genre','word_count']:
        p_dict[col] = ttest_ind(aux_df.loc[aux_df['genre'] == 'indie'][col],aux_df.loc[aux_df['genre'] == 'rap'][col])[1]

100%|██████████████████████████████████████████████████████████████████████████████| 9374/9374 [10:45<00:00, 14.52it/s]


Now, we are going to select the 50 features with the lowest calculated p-values to train our model. The smaller the p-value, the more different are the distributions of that stem in each genre.

In [25]:
index = p_dict.keys()
values = p_dict.values()

p_df = pd.DataFrame(zip(p_dict.keys(),p_dict.values()))
p_df.columns = ['stem','p']
p_df = p_df.sort_values(by='p')
p_df.to_csv('stem_p_df.csv')
selected_words = list(p_df.iloc[:50].index)

Looking the selected stems we can see that they are very common in rap songs and not in indie songs. We will use a strategy to keep the balance of the features.

For each stem, we are going to check the genre it's more common to appear, so we can select characteristic stems from each genre. Maybe this approach can reduce the model accuracy, but if we want to increase the number of genres that the model is capable of classifying in the future, it's very important that it has information about all genres.

In [26]:
genre_list = []
genre_count = {}
genres = aux_df['genre'].unique()
for genre in genres:
    genre_count[genre] = 0
for stem in tqdm.tqdm(p_df['stem']):
    max_mean = 0
    for genre in genres:
        if max_mean < aux_df.loc[aux_df['genre'] == genre][stem].mean():
            max_genre = genre
    genre_count[max_genre] += 1
    flag = True
    genre_list.append(max_genre)
    for genre in genres:
        if genre_count[genre] < 25:
            flag = False
    if flag:
        break
zeros = [0 for i in range(len(p_df) - len(genre_list))]
p_df['genre'] = genre_list + zeros

 20%|███████████████▌                                                              | 1865/9371 [02:09<08:40, 14.42it/s]


In [27]:
nans = [np.nan for i in range(len(p_df) - len(genre_list))]
p_df['genre'] = genre_list + nans
p_df.columns = ['stem','p','genre']
indie_stems = list(p_df['stem'].loc[p_df['genre'] == 'indie'][:25])
rap_stems = list(p_df['stem'].loc[p_df['genre'] == 'rap'][:25])
selected_stems = indie_stems + rap_stems

#### 3.1.5 Classification Perfomance

Using the selected stems as features, the classification perfomance will be evaluated with the RandomForest Classification algorithm.

In [29]:
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from sklearn.ensemble import RandomForestClassifier

In [30]:
stem_df = aux_df[['id']+selected_stems+['genre']]
stem_df['target'] = stem_df['genre'].map({'rap': True,'indie': False})
X = stem_df[selected_stems]
y = stem_df['target']
acc = []
auc = []
for i in range(10):
    X_train, X_test, y_train, y_test = train_test_split(X,y)
    scaler = RobustScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    rfc = RandomForestClassifier(n_estimators = 200)
    rfc.fit(X_train,y_train)
    y_pred = rfc.predict(X_test)
    acc.append(accuracy_score(y_test,y_pred))
    y_score = rfc.predict_proba(X_test)[:, 1]
    auc.append(roc_auc_score(y_test,y_score))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  stem_df['target'] = stem_df['genre'].map({'rap': True,'indie': False})


In [31]:
print('Average accuracy: {:.4f}'.format(sum(acc)/len(acc)))
print('Average ROC AUC score: {:.4f}'.format(sum(auc)/len(auc)))

Average accuracy: 0.9273
Average ROC AUC score: 0.9720


The classifier has achieved an accuracy of 92.7%, which can be considered high.

### 3.2 Pos tagging

The Part of Speech (POS) tagging process has the objetive of assigning a gramatical class for each word in a sentence. This method can, in a certain way, extract features about how the text message is delivered, that is, there are many ways to delivery the same message using text. Counting the frequencies of occurences of each POS in the lyrics is a good way to extract features about them. For example, considering the following sentence:

"I love to play soccer!"

We can POS tag the words as:

- I: Personal Pronoun
- love: Verb, 3rd person singular present
- to: To (The word "to" is considered a POS tag)
- play: Verb, base form
- soccer: Noun, singular or mass

A complete list of Parts of Speech is available [here](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).


To do that, we will use a POS tagger created by [Matheus Inoue](https://github.com/inoueMashuu/POS-tagger-portuguese-nltk) for portuguese text. For this task it's important to have the lyrics separated line by line.

We will need to tokenize the text again, so we will use the function lyrics_tokenize() created before.

In [32]:
import joblib

folder = 'trained_POS_taggers/'
pos_tagger = joblib.load(folder+'POS_tagger_brill.pkl')

In [33]:
pos_df = pd.DataFrame()
                                  
for i, row in df.iterrows():
    print(i)
    lyric = row['lyrics']
    lyric_pos_df = {'id': row['id'], 'word_count': 0, 'genre': row['genre']}
    for phrase in lyric:
        tokens = lyrics_tokenize(phrase)
        if len(tokens) > 3:
            lyric_pos_df['word_count'] += len(tokens)
            for word in pos_tagger.tag(tokens):
                if word[1] in lyric_pos_df.keys():
                    lyric_pos_df[word[1]] += 1
                else:
                    lyric_pos_df[word[1]] = 1
    pos_df = pos_df.append(lyric_pos_df,ignore_index=True)
pos_df = pos_df.set_index('index')
pos_df = pos_df.fillna(0)

for col in pos_df.columns:
    if col != 'name' and col != 'genre' and col != 'word_count' and 'norm' not in col:
        pos_df[col] = pos_df[col]/pos_df['word_count']

pos_df = pos_df.fillna(0)
pos_df.to_csv('pos_df.csv')

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
46
47
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
214
215
217
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284


KeyError: "None of ['index'] are in the columns"

We will use Student's t test again to select the 20 most important feature to use in the training of our model.

In [40]:
from scipy.stats import ttest_ind

pos_dict = {}

for col in tqdm.tqdm(pos_df.columns):
    if col not in ['id','genre','word_count','name']:
        pos_dict[col] = ttest_ind(pos_df.loc[pos_df['genre'] == 'indie'][col],pos_df.loc[pos_df['genre'] == 'rap'][col])[1]

100%|█████████████████████████████████████████████████████████████████████████████████| 51/51 [00:00<00:00, 361.46it/s]


In [41]:
index = pos_dict.keys()
values = pos_dict.values()

p_df = pd.DataFrame(values,index=index)
p_df.columns = ['p']
p_df = p_df.sort_values(by='p')
p_df
selected_pos = list(p_df.iloc[:20].index)

With the selected POS, we will repeat the classification evaluation process made to the selected stems.

In [44]:
pos_df = pos_df[['id']+selected_pos+['genre']]
pos_df['target'] = pos_df['genre'].map({'rap': True,'indie': False})
X = pos_df[selected_pos]
y = pos_df['target']
acc = []
auc = []
for i in range(10):
    X_train, X_test, y_train, y_test = train_test_split(X,y)
    scaler = RobustScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    rfc = RandomForestClassifier(n_estimators = 200)
    rfc.fit(X_train,y_train)
    y_pred = rfc.predict(X_test)
    acc.append(accuracy_score(y_test,y_pred))
    y_score = rfc.predict_proba(X_test)[:, 1]
    auc.append(roc_auc_score(y_test,y_score))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  pos_df['target'] = pos_df['genre'].map({'rap': True,'indie': False})


In [45]:
print('Average accuracy: {:.4f}'.format(sum(acc)/len(acc)))
print('Average ROC AUC score: {:.4f}'.format(sum(auc)/len(auc)))

Average accuracy: 0.9462
Average ROC AUC score: 0.9740


The classifier has achieved an accuracy of 94.6%, which is higher than the accuracy obtained for the bag of words features.

## 4 Model Training

Firstly, we will create a dataset with all features (audio features, pos tags e stems) and store it in a .csv file.

In [56]:
train_df = df.copy()
stem_df = stem_df.rename(columns={',': 'COMMA'})
selected_stems = ['COMMA' if x == ',' else x for x in selected_stems] # Replacing the ',' column for 'COMMA' to remove duplication
train_df = train_df.merge(stem_df[['id']+selected_stems],on='id')
train_df = train_df.merge(pos_df[['id']+selected_pos],on='id')
train_df['target'] = train_df['genre'].map({'rap': True,'indie': False})
train_df.to_csv('song_data.csv')

### 4.1 Model Selection

Now, we need to select which model we will use to the classifier. As we did in the previous project, we will test the following models:

- K Nearest Neighbors
- Support Vector Machine
- Random Forest

#### 4.1.1 KNN

In [61]:
audio_features = ['danceability','energy','key','loudness','mode','speechiness','acousticness','instrumentalness','liveness','valence','tempo','time_signature']
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

In [67]:
X = train_df[selected_stems+selected_pos+audio_features]
y = train_df['target']
acc = []
auc = []
X_train, X_test, y_train, y_test = train_test_split(X,y)
scaler = RobustScaler()
X = scaler.fit_transform(X)
knn = KNeighborsClassifier()
param_grid = {'n_neighbors': [1,5,10,30,50,100]}
clf = GridSearchCV(knn,param_grid,cv=10)
clf.fit(X,y)
params = clf.cv_results_['params']
score = list(clf.cv_results_['mean_test_score'])
index = score.index(max(score))
print(params[index],score[index])

{'n_neighbors': 5} 0.887128712871287


The maximum accuracy (88.7%) was reached for KNN algorithm with 5 neighbors.

#### 4.1.2 SVC

In [65]:
from sklearn.svm import SVC

In [69]:
X = train_df[selected_stems+selected_pos+audio_features]
y = train_df['target']
acc = []
auc = []
X_train, X_test, y_train, y_test = train_test_split(X,y)
scaler = RobustScaler()
X = scaler.fit_transform(X)
svc = SVC()
param_grid = {'kernel': ['linear','rbf'], 'C': [0.1,0.5,1,2,5,10,20]}
clf = GridSearchCV(svc,param_grid,cv=10)
clf.fit(X,y)
params = clf.cv_results_['params']
score = list(clf.cv_results_['mean_test_score'])
index = score.index(max(score))
print(params[index],score[index])

{'C': 10, 'kernel': 'linear'} 0.9227722772277227


The SVC model with a C of 10 and a linear kernel reached 92.3% of accuracy, which is considerably higher than the accuracy of the KNN model.

#### 4.1.3 RFC

In [70]:
X = train_df[selected_stems+selected_pos+audio_features]
y = train_df['target']
acc = []
auc = []
X_train, X_test, y_train, y_test = train_test_split(X,y)
scaler = RobustScaler()
X = scaler.fit_transform(X)
rfc = RandomForestClassifier()
param_grid = {'n_estimators': [20,50,100,200,500]}
clf = GridSearchCV(rfc,param_grid,cv=10)
clf.fit(X,y)
params = clf.cv_results_['params']
score = list(clf.cv_results_['mean_test_score'])
index = score.index(max(score))
print(params[index],score[index])

{'n_estimators': 100} 0.9346534653465348


RFC outperformed both KNN and SVC, reaching 93.5% accuracy with a forest of 100 trees each trained on a random slice of the dataset, which is expected because of the results of the previous project.

### 4.2 Performance Evaluation

So, using the Random Forest Classifier with 100 estimators we will check the accuracy and the AUC-ROC metric from the classificator.

In [73]:
X = train_df[selected_stems+selected_pos+audio_features]
y = train_df['target']
acc = []
auc = []
for i in range(10):
    X_train, X_test, y_train, y_test = train_test_split(X,y)
    scaler = RobustScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    rfc = RandomForestClassifier(n_estimators = 100)
    rfc.fit(X_train,y_train)
    y_pred = rfc.predict(X_test)
    acc.append(accuracy_score(y_test,y_pred))
    y_score = rfc.predict_proba(X_test)[:, 1]
    auc.append(roc_auc_score(y_test,y_score))

In [74]:
print('Average accuracy: {:.4f}'.format(sum(acc)/len(acc)))
print('Average ROC AUC score: {:.4f}'.format(sum(auc)/len(auc)))

Average accuracy: 0.9549
Average ROC AUC score: 0.9899


Finally, we got almost 95.5% accuracy and 0.99 of ROC AUC score. This result is significantly better than the result achieved by the RFC trained with audio features only. This is a good example of how NLP can help us to extract useful features from text data with optimized and easy-to-interpret algorithms.