# NLP: CLASSIFICATION OF SONG LYRICS WITH EXPLICIT CONTENT

## Part 3: Modeling

### Vectorize with TF-IDF

Once again, here is our dataset, fully cleaned and lemmatized, with word count and sentiment metrics included. First, we vectorize the words using TF-IDF, so that each word's weight is scaled down by the number of songs it appears in. We then join these vectors with the other numeric columns in our dataset.

In [1]:
import pandas as pd
import pickle
from src.pre_model import *
from src.models import *
import warnings
warnings.filterwarnings("ignore")

In [2]:
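import ast  # literal_eval parses the stored lists without executing arbitrary code

data = pd.read_csv("./data/data.csv", converters={"lemmatized": ast.literal_eval,
                                                  "unique_words": ast.literal_eval})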
data.drop("Unnamed: 0", axis=1, inplace=True)
data.head()

| | song | explicit_label | lemmatized | unique_words | word_count | unique_word_count | lemma_str | sentiment |
|---|---|---|---|---|---|---|---|---|
| 0 | Andante, Andante | 0 | [take, easy, please, touch, gently, summer, ev... | [thousand, butterfly, slow, night, soul, body,... | 119 | 41 | take easy please touch gently summer even bree... | 0.291667 |
| 1 | As Good As New | 0 | [never, know, go, put, lousy, rotten, show, bo... | [take, say, found, another, way, know, thank, ... | 150 | 63 | never know go put lousy rotten show boy tough ... | 0.306381 |
| 2 | Bang-A-Boomerang | 0 | [make, somebody, happy, question, give, take, ... | [show, tool, boomerang, throw, found, boom, kn... | 132 | 56 | make somebody happy question give take learn s... | 0.009459 |
| 3 | Chiquitita | 0 | [chiquitita, tell, wrong, enchain, sorrow, eye... | [shoulder, cry, candle, feather, best, way, so... | 114 | 58 | chiquitita tell wrong enchain sorrow eye hope ... | -0.019366 |
| 4 | Dancing Queen | 0 | [dance, jive, time, life, see, girl, watch, sc... | [light, teaser, leave, night, another, dance, ... | 93 | 48 | dance jive time life see girl watch scene digg... | 0.226238 |


In [3]:
data_vec = vectorize(data)
data_vec.head()

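| | word_count | unique_word_count | sentiment | explicit_label | abandon | ability | able | aboard | absolutely | absurd | ... | yonder | york | young | youth | zappa | zero | zip | zombie | zone | zoo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 119 | 41 | 0.291667 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 150 | 63 | 0.306381 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 132 | 56 | 0.009459 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 114 | 58 | -0.019366 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 93 | 48 | 0.226238 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.151178 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |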


As expected, the vectorized dataset is high-dimensional.

In [4]:
data_vec.shape

(24482, 4284)

### Resample for class imbalance

From EDA, we know that our dataset is imbalanced, with roughly 17 times as many non-explicit songs (23,132) as explicit songs (1,350). To combat this, we upsample the explicit songs to match the number of non-explicit songs. Downsampling is also available, but that line is commented out because, as our results will show, upsampling gives better performance across most models.

In [5]:
data_resampled = resample_data(data_vec, "up")
# data_resampled = resample_data(data_vec, "down")
data_resampled.head()

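| | word_count | unique_word_count | sentiment | explicit_label | abandon | ability | able | aboard | absolutely | absurd | ... | yonder | york | young | youth | zappa | zero | zip | zombie | zone | zoo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 119 | 41 | 0.291667 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 150 | 63 | 0.306381 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 132 | 56 | 0.009459 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 114 | 58 | -0.019366 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 93 | 48 | 0.226238 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.151178 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |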


In [6]:
data_resampled["explicit_label"].value_counts()

1    23132
0    23132
Name: explicit_label, dtype: int64
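
The upsampling idea can be sketched with plain pandas. The helper below is a hypothetical stand-in for `resample_data(df, "up")`, whose real implementation is in `src/pre_model.py`; it assumes the minority class is padded with rows drawn with replacement until both classes have the same size.

```python
import pandas as pd


def upsample_sketch(df, label_col="explicit_label", random_state=0):
    """Hypothetical stand-in for resample_data(df, "up")."""
    counts = df[label_col].value_counts()
    majority_label = counts.idxmax()
    target = counts.max()
    parts = []
    for label, group in df.groupby(label_col):
        if label == majority_label:
            parts.append(group)
        else:
            # Draw extra rows with replacement until this class
            # reaches the size of the majority class.
            extra = group.sample(n=target - len(group), replace=True,
                                 random_state=random_state)
            parts.append(pd.concat([group, extra]))
    return pd.concat(parts)


# Toy data (made up): six non-explicit rows and two explicit rows.
toy = pd.DataFrame({"explicit_label": [0] * 6 + [1] * 2,
                    "word_count": range(8)})
balanced = upsample_sketch(toy)
print(balanced["explicit_label"].value_counts())
```

Rows added this way are exact copies of existing songs, so a random split made afterwards can place copies of the same song on both sides; resampling only the training partition is one way to avoid that.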

### Split into training and test data

We do a 75-25 train-test split on the 46,264 resampled observations, which leaves about 34,700 observations for training. We standardize only `word_count` and `unique_word_count`, as the TF-IDF vectors are already normalized and sentiment scores all range from -1 to 1.



For cross-validation, we further divide the training data into 5 folds and repeat the hyperparameter tuning process so that each fold serves once as the validation set. No test observation is exposed to the models during hyperparameter tuning.

In [7]:
X_train, X_test, y_train, y_test = split(data_resampled, standardize=True)
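
A minimal sketch of a split along these lines, assuming scikit-learn's `train_test_split` and `StandardScaler`. The name `split_sketch`, the stratified split, and the choice to fit the scaler on the training rows only are assumptions, since the actual `split` function is defined in `src/pre_model.py`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def split_sketch(df, label_col="explicit_label",
                 scale_cols=("word_count", "unique_word_count"),
                 test_size=0.25, random_state=0):
    """Hypothetical stand-in for `split(df, standardize=True)`."""
    cols = list(scale_cols)
    X = df.drop(columns=[label_col]).astype({c: float for c in cols})
    y = df[label_col]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=random_state)
    X_train, X_test = X_train.copy(), X_test.copy()
    # Learn mean and standard deviation from the training rows only,
    # then apply the same transformation to the test rows.
    scaler = StandardScaler().fit(X_train[cols])
    X_train[cols] = scaler.transform(X_train[cols])
    X_test[cols] = scaler.transform(X_test[cols])
    return X_train, X_test, y_train, y_test


# Toy data (made up): 40 rows, 20 per class.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "word_count": rng.integers(50, 300, size=40),
    "unique_word_count": rng.integers(20, 100, size=40),
    "sentiment": rng.uniform(-1, 1, size=40),
    "explicit_label": [0, 1] * 20,
})
Xtr, Xte, ytr, yte = split_sketch(toy)
print(Xtr.shape, Xte.shape)
```

The `sentiment` column is left untouched, mirroring the reasoning above that it is already bounded.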

### Build models

Our data is high-dimensional. As such, we pick classification models that cope well with many features, including penalized logistic regression, linear support vector machine (SVM), random forest, gradient boosting, and naïve Bayes. We do not use models such as _k_-nearest neighbors (kNN), a single decision tree, or unregularized logistic regression, as these tend to suffer from the curse of dimensionality.

#### Logistic regression (with L1 penalty)

Logistic regression is a good starting point for a binary classification problem. We apply an L1 penalty to cope with the high dimensionality; because it shrinks some coefficients to exactly zero, it also performs embedded feature selection.

In [8]:
lr_model, lr_train_pred, lr_test_pred = lr(X_train, X_test, y_train, y_test)
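
The `lr` helper is defined in `src/models.py`. The sketch below shows one plausible shape for it on synthetic data: an L1-penalized `LogisticRegression` tuned with 5-fold cross-validation through `GridSearchCV`. The grid of `C` values, the `liblinear` solver, and the synthetic features are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data (made up): 30 features, only the first two carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

# 5-fold cross-validation over the inverse regularization strength C.
search = GridSearchCV(
    LogisticRegression(penalty="l1", solver="liblinear", random_state=0),
    param_grid={"C": [0.01, 0.1, 1.0]},
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

best = search.best_estimator_
n_zero = int((best.coef_ == 0).sum())
print("best C:", search.best_params_["C"])
print("coefficients set exactly to zero:", n_zero, "of", best.coef_.size)
```

Coefficients that the L1 penalty sets exactly to zero correspond to features the model ignores, which is the embedded feature selection mentioned above.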

#### SVM with linear kernel

SVM tends to do well with high-dimensional data, as the decision boundary depends only on the support vectors. We use a linear kernel to keep the model parsimonious.

In [9]:
svm_model, svm_train_pred, svm_test_pred = svm(X_train, X_test, y_train, y_test)
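
To illustrate the point about support vectors, the sketch below fits scikit-learn's `SVC` with a linear kernel on two synthetic clusters and counts how many training points end up as support vectors. The data and the value `C=1.0` are assumptions, and the project's `svm` helper may use a different estimator such as `LinearSVC`.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic data (made up): two well-separated 2-D clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-2.0, size=(100, 2)),
               rng.normal(loc=2.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

model = SVC(kernel="linear", C=1.0).fit(X, y)

# Indices of the training points that define the decision boundary.
n_sv = len(model.support_)
print("support vectors:", n_sv, "of", len(X))
print("training accuracy:", model.score(X, y))
```

Only the points listed in `model.support_` influence the fitted boundary; the remaining points could be removed without changing it.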

#### Random forest

The random forest uses the number of trees that gives the best out-of-bag (OOB) score among a few candidates ranging from 1 to 1,000 trees. We cap `max_depth` at 20 levels and, at each split, consider a random subset of features whose size is the square root of the total number of features.

In [10]:
rf_model, rf_train_pred, rf_test_pred, oob_score, rf_fi = rf(X_train, X_test, y_train, y_test)

Number of trees: 500
Out-of-bag score: 0.9454435414144907
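
A hedged sketch of this selection loop, assuming scikit-learn's `RandomForestClassifier` with `oob_score=True`. The candidate list here is deliberately short and the data are synthetic so that the example runs quickly; the actual candidates and logic are in the `rf` helper in `src/models.py`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (made up) standing in for the training set.
X, y = make_classification(n_samples=300, n_features=25,
                           n_informative=5, random_state=0)

candidates = (25, 50, 100)  # assumed candidates, smaller than in the text
best_n, best_oob, best_model = None, -1.0, None
for n_trees in candidates:
    model = RandomForestClassifier(
        n_estimators=n_trees,
        max_depth=20,          # cap on tree depth, as described above
        max_features="sqrt",   # sqrt(n_features) considered at each split
        oob_score=True,        # score each row using trees that did not see it
        random_state=0,
    ).fit(X, y)
    if model.oob_score_ > best_oob:
        best_n, best_oob, best_model = n_trees, model.oob_score_, model

print("Number of trees:", best_n)
print("Out-of-bag score:", best_oob)
importances = best_model.feature_importances_  # one value per feature
```

Because each tree is trained on a bootstrap sample, the OOB score gives a validation-style estimate without holding out additional data.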


#### Gradient boosting

Similar to the random forest, the gradient boosting model uses the number of trees that gives the best accuracy among a few candidates ranging from 1 to 1,000 trees. We again cap `max_depth` at 20 levels and consider the square root of the number of features at each split.

In [11]:
gb_model, gb_train_pred, gb_test_pred, gb_fi = gb(X_train, X_test, y_train, y_test)

Number of trees: 200
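
One way to compare several tree counts without refitting is `staged_predict`, which yields the predictions of a fitted `GradientBoostingClassifier` after each boosting stage. The sketch below is an assumption about how such a search could look, using a held-out validation split and a short candidate list on synthetic data; it is not necessarily how the `gb` helper works.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data (made up) and a held-out validation split.
X, y = make_classification(n_samples=300, n_features=25,
                           n_informative=5, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25,
                                            random_state=0)

candidates = (10, 50, 100)  # assumed candidates, smaller than in the text
gb_clf = GradientBoostingClassifier(
    n_estimators=max(candidates),
    max_depth=20,
    max_features="sqrt",
    random_state=0,
).fit(X_tr, y_tr)

# staged_predict yields validation predictions after 1, 2, ... trees.
acc_by_n = {}
for n_trees, y_pred in enumerate(gb_clf.staged_predict(X_val), start=1):
    if n_trees in candidates:
        acc_by_n[n_trees] = accuracy_score(y_val, y_pred)

best_n = max(acc_by_n, key=acc_by_n.get)
print("Number of trees:", best_n)
```

A single fit with the largest candidate covers all smaller candidates, since boosting adds trees sequentially.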


#### Naïve Bayes

The naïve Bayes model assumes that the features are conditionally independent given the class. This rarely holds exactly for text, where words co-occur and follow an order. However, our vectorization does not preserve word order, and word order is not usually a determining factor for explicit content. Naïve Bayes is also very fast to train, as it needs no hyperparameter tuning here.

In [12]:
nb_model, nb_train_pred, nb_test_pred = nb(X_train, X_test, y_train, y_test)
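
The variant of naïve Bayes used by the `nb` helper is not stated. Since the feature matrix contains negative values (sentiment scores and standardized counts), the sketch below assumes the Gaussian variant, `GaussianNB`, and fits it on synthetic data with default settings and no tuning.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic data (made up): two classes whose features differ in mean.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-1.0, size=(100, 3)),
               rng.normal(loc=1.0, size=(100, 3))])
y = np.array([0] * 100 + [1] * 100)

# Each feature is modeled independently within a class, so fitting only
# requires per-class means and variances.
nb_clf = GaussianNB().fit(X, y)
pred = nb_clf.predict(X)
print("training accuracy:", (pred == y).mean())
```

Fitting reduces to computing a mean and a variance per feature and class, which is why training is fast even with thousands of columns.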

#### Save models and results

In [13]:
def save_pickle(obj, path):
    """Serialize obj to path with pickle."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)

# Train/test splits
for name, obj in {"X_train": X_train, "X_test": X_test,
                  "y_train": y_train, "y_test": y_test}.items():
    save_pickle(obj, f"./data/{name}.pickle")

# Fitted models
for name, obj in {"lr": lr_model, "svm": svm_model, "rf": rf_model,
                  "gb": gb_model, "nb": nb_model}.items():
    save_pickle(obj, f"./results/models/{name}_model.pickle")

# Train and test predictions, plus feature importances for the tree ensembles
for name, obj in {"lr_train_pred": lr_train_pred, "svm_train_pred": svm_train_pred,
                  "rf_train_pred": rf_train_pred, "gb_train_pred": gb_train_pred,
                  "nb_train_pred": nb_train_pred,
                  "lr_test_pred": lr_test_pred, "svm_test_pred": svm_test_pred,
                  "rf_test_pred": rf_test_pred, "gb_test_pred": gb_test_pred,
                  "nb_test_pred": nb_test_pred,
                  "rf_fi": rf_fi, "gb_fi": gb_fi}.items():
    save_pickle(obj, f"./results/predictions/{name}.pickle")