# Modelling

In [1]:
import time
import pandas as pd
from sklearn.svm import SVC
from sklearn.dummy import DummyClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
import warnings
warnings.filterwarnings("ignore")

In [2]:
X_train = pd.read_csv('../data/X_train.csv', index_col=0)
y_train = pd.read_csv('../data/y_train.csv', index_col=0)
X_test = pd.read_csv('../data/X_train.csv', index_col=0)
y_test = pd.read_csv('../data/y_train.csv', index_col=0)

### Modelling

My initial idea for this project was to implement deep learning for modelling. However, after doing more research, it seems that deep learning models fair similarly to dictionary based and traditional machine learning methods. Therefore, for this project, I will go the more traditional route. This is to simply save processing time as the results would likely be similar to using a deep learning frame work.

I have selected 3 models to test plus one dummy model as a baseline:

- Naive Bayes
- SVM
- Logitic Regression

### Dummy Model

For the dummy model, I will use the most frequent label to make all the classifications.

In [3]:
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
y_predict_dummy = dummy.predict(X_test)
dummy_report = classification_report(y_test, y_predict_dummy)
print(dummy_report)

              precision    recall  f1-score   support

           0       0.85      1.00      0.92     16047
           1       0.00      0.00      0.00      2892

    accuracy                           0.85     18939
   macro avg       0.42      0.50      0.46     18939
weighted avg       0.72      0.85      0.78     18939



### Naive Bayes

In [4]:
start_time = time.time()
nb = MultinomialNB(alpha=0.1)
nb.fit(X_train, y_train)
y_predict_nb = nb.predict(X_test)
end_time = time.time()

In [5]:
nb_report = classification_report(y_test, y_predict_nb)
print(nb_report)
print("Execution time: %s min" % ((end_time - start_time)/60))

              precision    recall  f1-score   support

           0       0.91      0.98      0.94     16047
           1       0.83      0.43      0.57      2892

    accuracy                           0.90     18939
   macro avg       0.87      0.71      0.76     18939
weighted avg       0.89      0.90      0.89     18939

Execution time: 0.24596641858418783 min


### Logistic Regression

In [6]:
start_time = time.time()
lr = LogisticRegression()
lr.fit(X_train, y_train)
y_predict_lr = lr.predict(X_test)
end_time = time.time()

In [7]:
lr_report = classification_report(y_test, y_predict_lr)
print(lr_report)
print("Execution time: %s min" % ((end_time - start_time)/60))

              precision    recall  f1-score   support

           0       0.89      0.99      0.94     16047
           1       0.91      0.30      0.45      2892

    accuracy                           0.89     18939
   macro avg       0.90      0.65      0.69     18939
weighted avg       0.89      0.89      0.86     18939

Execution time: 1.3989588499069214 min


### Support Vector Machine

In [8]:
start_time = time.time()
svm = SVC()
svm.fit(X_train, y_train)
y_predict_svm = svm.predict(X_test)
end_time = time.time()

In [9]:
svm_report = classification_report(y_test, y_predict_svm)
print(svm_report)
print("Execution time: %s min" % ((end_time - start_time)/60))

              precision    recall  f1-score   support

           0       0.95      1.00      0.97     16047
           1       0.98      0.70      0.82      2892

    accuracy                           0.95     18939
   macro avg       0.97      0.85      0.89     18939
weighted avg       0.95      0.95      0.95     18939

Execution time: 300.7205369830132 min


## Conclusions

It is clear that the best model for the job is SVM. It had the best metrics from all the other models. Additionally, My model is comparable, and in some cases better, to previous related works.

Potential issues:
- Imbalanced dataset
- Sentence structure from song to song varies. Many song will have a single sentence on multiple lines.
- Explicitness can be subjective
- A song could contain only one swear word or phrase and be categorized as explicit
- Some songs from lesser known artists have songs that could be considered explicit, but are truly not.

Future Work:
- Incorporate dictionary based method to the model
- Use more data