# Model Testing

Now that the data is cleaned and a balanced training dataset has been created with SMOTE, we can build and test different models.

### Contents:
* Data Preparation
* Logistic Regression
* Random Forest
* Naive Bayes
* Summary/Conclusion

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import classification_report

pd.options.mode.chained_assignment = None

## Final Data Preparation

Importing the cleaned training datasets and processing the testing dataset so that it is ready for testing.

In [2]:
# Importing unbalanced data + labels
xtrain = pd.read_csv('cleaned_training.csv')
ytrain = pd.read_csv('train_labels.csv')

# Importing balanced data + labels
xtrain_bal = pd.read_csv('balanced_training.csv')
ytrain_bal = pd.read_csv('balanced_labels.csv')

# Importing test data + labels
xtest = pd.read_csv('cleaned_testing.csv')
ytest = pd.read_csv('test_labels.csv')

# Dropping unused columns (class and sms) from xtrain & xtest
xtrain = xtrain.drop(columns = ['class', 'sms'])
xtest = xtest.drop(columns = ['class', 'sms'])
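Because every model below is fit on `xtrain` and scored on `xtest`, the two frames need the same feature columns in the same order. The following is a minimal sketch of one way to enforce that, using toy frames; the names `align_columns`, `train_df`, and `test_df` are hypothetical and not part of this notebook.

```python
import pandas as pd

def align_columns(train_df: pd.DataFrame, test_df: pd.DataFrame) -> pd.DataFrame:
    """Return test_df with exactly the columns of train_df, in the same order.

    Columns missing from the test frame are filled with 0 (the token never
    appeared there); columns that were not seen in training are dropped.
    """
    return test_df.reindex(columns=train_df.columns, fill_value=0)

# Toy stand-ins for the vectorized SMS features (hypothetical values).
train_df = pd.DataFrame({"char_length": [158, 24], "free": [0.0, 0.5], "win": [0.3, 0.0]})
test_df = pd.DataFrame({"win": [0.2], "char_length": [110], "unseen": [0.9]})

aligned = align_columns(train_df, test_df)
print(aligned)
```

If the cleaned CSV files were already written with identical columns, this step changes nothing and only acts as a safeguard.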

In [3]:
xtrain.head()

Unnamed: 0,char_length,008704050406,0089my,0121,01223585236,01223585334,0125698789,02,0207,02070836089,...,åômorrow,ìll,ìï,ìïll,ûªm,ûªt,ûªve,ûï,ûïharry,ûò
0,158,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,24,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,148,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,110,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,143,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


# Creating Models

I will be testing three different classification models to see how they compare to one another:
- Logistic Regression
- Random Forest
- Naive Bayes

For each model, we will collect not only its accuracy but also a confusion matrix and a classification report, to get a better understanding of where the model is right and where it is wrong.
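The three measurements described above can be bundled into one helper. This is only a sketch: the function name `evaluate` is hypothetical, and the data is a synthetic, imbalanced stand-in generated with `make_classification`, not the SMS dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

def evaluate(model, x_test, y_test):
    """Return accuracy, confusion matrix, and text report for a fitted classifier."""
    pred = model.predict(x_test)
    return {
        "accuracy": accuracy_score(y_test, pred),
        "confusion_matrix": confusion_matrix(y_test, pred),
        "report": classification_report(y_test, pred, zero_division=0),
    }

# Synthetic, imbalanced stand-in data (roughly 87% class 0, 13% class 1).
X, y = make_classification(n_samples=500, n_features=20, weights=[0.87, 0.13], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)

results = evaluate(LogisticRegression(max_iter=1000).fit(X_tr, y_tr), X_te, y_te)
print(results["accuracy"])
print(results["confusion_matrix"])
print(results["report"])
```

The cells below perform the same three steps by hand for each model, which keeps every intermediate output visible.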

In [4]:
# Collecting data for comparison afterwards
model_acc = []
model_acc_bal = []

# Logistic Regression

In [5]:
from sklearn.linear_model import LogisticRegression

lg = LogisticRegression(max_iter = 1000)
lg_bal = LogisticRegression(max_iter = 1000)

lg.fit(xtrain, ytrain['spam'])
lg_bal.fit(xtrain_bal, ytrain_bal['spam'])

LogisticRegression(max_iter=1000)

### Imbalanced Dataset Results

In [6]:
# Testing Accuracy
lg_pred = lg.predict(xtest)
model_acc.append(metrics.accuracy_score(ytest, lg_pred))
metrics.accuracy_score(ytest, lg_pred)


0.9619358346927678

In [7]:
# Confusion Matrix
metrics.confusion_matrix(ytest, lg_pred)

array([[1582,   15],
       [  55,  187]])
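The matrix above can be read cell by cell. scikit-learn places true classes on the rows and predicted classes on the columns, with labels in sorted order, so here row and column 0 are ham and row and column 1 are spam. A short sketch of the arithmetic, using the numbers printed above:

```python
import numpy as np

# Confusion matrix for Logistic Regression trained on the imbalanced data.
# Rows are true classes, columns are predicted classes (0 = ham, 1 = spam).
cm = np.array([[1582, 15],
               [55, 187]])

# Flattening a 2x2 matrix row by row gives TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
spam_precision = tp / (tp + fp)  # of messages flagged as spam, the share that was spam
spam_recall = tp / (tp + fn)     # of actual spam messages, the share that was flagged

print(round(accuracy, 4), round(spam_precision, 2), round(spam_recall, 2))
```

These values agree with the accuracy computed earlier and with the spam row (class 1) of the classification report for this model.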

In [8]:
# Classification Report (remember that 0 represents Ham and 1 represents Spam)
print(classification_report(ytest,lg_pred))

              precision    recall  f1-score   support

           0       0.97      0.99      0.98      1597
           1       0.93      0.77      0.84       242

    accuracy                           0.96      1839
   macro avg       0.95      0.88      0.91      1839
weighted avg       0.96      0.96      0.96      1839



### Balanced Dataset Results

In [9]:
lg_pred_bal = lg_bal.predict(xtest)
model_acc_bal.append(metrics.accuracy_score(ytest, lg_pred_bal))
metrics.accuracy_score(ytest, lg_pred_bal)

0.9684611201740077

In [10]:
metrics.confusion_matrix(ytest, lg_pred_bal)

array([[1558,   39],
       [  19,  223]])

In [11]:
print(classification_report(ytest,lg_pred_bal))

              precision    recall  f1-score   support

           0       0.99      0.98      0.98      1597
           1       0.85      0.92      0.88       242

    accuracy                           0.97      1839
   macro avg       0.92      0.95      0.93      1839
weighted avg       0.97      0.97      0.97      1839



# Random Forest

In [12]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
rf_bal = RandomForestClassifier()

rf.fit(xtrain, ytrain['spam'])
rf_bal.fit(xtrain_bal, ytrain_bal['spam'])

RandomForestClassifier()

### Imbalanced Dataset Results

In [13]:
rf_pred = rf.predict(xtest)
model_acc.append(metrics.accuracy_score(ytest, rf_pred))
metrics.accuracy_score(ytest, rf_pred)

0.9755301794453507

In [14]:
metrics.confusion_matrix(ytest, rf_pred)

array([[1597,    0],
       [  45,  197]])

In [15]:
print(classification_report(ytest,rf_pred))

              precision    recall  f1-score   support

           0       0.97      1.00      0.99      1597
           1       1.00      0.81      0.90       242

    accuracy                           0.98      1839
   macro avg       0.99      0.91      0.94      1839
weighted avg       0.98      0.98      0.97      1839



### Balanced Dataset Results

In [16]:
rf_pred_bal = rf_bal.predict(xtest)
model_acc_bal.append(metrics.accuracy_score(ytest, rf_pred_bal))
metrics.accuracy_score(ytest, rf_pred_bal)

0.9782490483958673

In [17]:
metrics.confusion_matrix(ytest, rf_pred_bal)

array([[1595,    2],
       [  38,  204]])

In [18]:
print(classification_report(ytest,rf_pred_bal))

              precision    recall  f1-score   support

           0       0.98      1.00      0.99      1597
           1       0.99      0.84      0.91       242

    accuracy                           0.98      1839
   macro avg       0.98      0.92      0.95      1839
weighted avg       0.98      0.98      0.98      1839



# Naive Bayes

In [19]:
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()
nb_bal = MultinomialNB()

nb.fit(xtrain, ytrain['spam'])
nb_bal.fit(xtrain_bal, ytrain_bal['spam'])

MultinomialNB()

### Imbalanced Dataset Results

In [20]:
nb_pred = nb.predict(xtest)
model_acc.append(metrics.accuracy_score(ytest, nb_pred))
metrics.accuracy_score(ytest, nb_pred)

0.8689505165851006

In [21]:
metrics.confusion_matrix(ytest, nb_pred)

array([[1597,    0],
       [ 241,    1]])

In [22]:
print(classification_report(ytest,nb_pred))

              precision    recall  f1-score   support

           0       0.87      1.00      0.93      1597
           1       1.00      0.00      0.01       242

    accuracy                           0.87      1839
   macro avg       0.93      0.50      0.47      1839
weighted avg       0.89      0.87      0.81      1839



### Balanced Dataset Results

In [23]:
nb_pred_bal = nb_bal.predict(xtest)
model_acc_bal.append(metrics.accuracy_score(ytest, nb_pred_bal))
metrics.accuracy_score(ytest, nb_pred_bal)

0.9336595976073954

In [24]:
metrics.confusion_matrix(ytest, nb_pred_bal)

array([[1480,  117],
       [   5,  237]])

In [25]:
print(classification_report(ytest, nb_pred_bal))

              precision    recall  f1-score   support

           0       1.00      0.93      0.96      1597
           1       0.67      0.98      0.80       242

    accuracy                           0.93      1839
   macro avg       0.83      0.95      0.88      1839
weighted avg       0.95      0.93      0.94      1839



# Summary/Conclusion

Let's take a look at our findings!

In [26]:
# Comparing accuracies of models trained on the imbalanced and balanced datasets
model_labels = ['Logistic Regression', 'Random Forest', 'Naive Bayes']

pd.DataFrame(list(zip(model_acc, model_acc_bal)), index = model_labels, columns = ['Accuracy', 'Balanced Accuracy'])

| | Accuracy | Balanced Accuracy |
|---|---|---|
| Logistic Regression | 0.961936 | 0.968461 |
| Random Forest | 0.97553 | 0.978249 |
| Naive Bayes | 0.868951 | 0.93366 |
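Accuracy alone hides how each model treats the spam class. The sketch below recomputes spam precision and recall for all six runs directly from the confusion matrices printed earlier in this notebook; the dictionary `matrices` and the column names are illustrative choices.

```python
import numpy as np
import pandas as pd

# Confusion matrices copied from the cells above (rows = true, columns = predicted).
matrices = {
    ("Logistic Regression", "imbalanced"): [[1582, 15], [55, 187]],
    ("Logistic Regression", "balanced"): [[1558, 39], [19, 223]],
    ("Random Forest", "imbalanced"): [[1597, 0], [45, 197]],
    ("Random Forest", "balanced"): [[1595, 2], [38, 204]],
    ("Naive Bayes", "imbalanced"): [[1597, 0], [241, 1]],
    ("Naive Bayes", "balanced"): [[1480, 117], [5, 237]],
}

rows = []
for (model, train_set), cm in matrices.items():
    tn, fp, fn, tp = np.array(cm).ravel()
    rows.append({
        "model": model,
        "training set": train_set,
        "spam precision": tp / (tp + fp),
        "spam recall": tp / (tp + fn),
        "ham flagged as spam": fp,
        "spam missed": fn,
    })

summary = pd.DataFrame(rows).set_index(["model", "training set"]).round(2)
print(summary)
```

Reading the two count columns side by side makes the trade-off concrete: for example, balancing the data for Naive Bayes cuts missed spam from 241 to 5 but raises ham flagged as spam from 0 to 117.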


## Accuracy 

The Random Forest model performed best on both the imbalanced and balanced datasets, with accuracies of about 97.6% and 97.8%. This was surprising, as many online sources mention that Naive Bayes models are typically well suited to high-dimensional datasets such as this one.
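To put these accuracies in context, it helps to compare them with a trivial classifier that labels every message as ham. The sketch below uses the test-set class counts from the "support" column of the classification reports and the imbalanced-dataset accuracies from the comparison table.

```python
# Class counts in the test set, taken from the "support" column above.
ham_count, spam_count = 1597, 242

# Accuracy of a trivial classifier that labels every message as ham.
baseline_accuracy = ham_count / (ham_count + spam_count)

# Accuracies reported for the models trained on the imbalanced dataset.
reported = {
    "Logistic Regression": 0.961936,
    "Random Forest": 0.975530,
    "Naive Bayes": 0.868951,
}

print(f"always-ham baseline: {baseline_accuracy:.4f}")
for name, acc in reported.items():
    print(f"{name}: {acc - baseline_accuracy:+.4f} over baseline")
```

The baseline is about 0.8684, so Naive Bayes trained on the imbalanced dataset is barely better than always predicting ham, which matches its confusion matrix (only one message predicted as spam).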


## Imbalanced vs Balanced

Every model improved when trained on the balanced dataset, but the degree of improvement varied. Logistic Regression and Random Forest each gained less than 1 percentage point of accuracy, while Naive Bayes gained about 6.5 points (from 0.869 to 0.934). One possible explanation is that, because the balanced dataset corrected the large imbalance between spam and ham samples, the probabilities estimated by the Naive Bayes model became more representative of the differences between spam and ham, which in turn produced a more accurate model.
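One piece of this can be quantified. By default, `MultinomialNB` estimates class priors from the class frequencies in the training data, and the prior enters the decision as an additive log term. The sketch below computes that term under a loudly stated assumption: the class ratio of the imbalanced training split is not printed in this notebook, so the test-set ratio (242 spam, 1597 ham) is used as a stand-in. The helper name `log_prior_odds` is hypothetical.

```python
import math

def log_prior_odds(n_spam: int, n_ham: int) -> float:
    """Log-odds of spam vs ham implied by class frequencies alone."""
    return math.log(n_spam / n_ham)

# Assumption: the imbalanced training split has roughly the same class ratio
# as the test set (242 spam, 1597 ham); the notebook does not print it.
imbalanced = log_prior_odds(242, 1597)

# After SMOTE both classes have the same number of rows, so the prior is neutral.
balanced = log_prior_odds(1000, 1000)  # any pair of equal counts gives 0.0

print(round(imbalanced, 2), balanced)
```

Under that assumption the imbalanced prior pushes every prediction toward ham by roughly 1.9 nats before any word is considered. The prior is unlikely to be the whole story, since resampling also changes the per-class feature likelihoods, but it points in the same direction as the results above.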


## Precision vs Recall

While accuracy is important, we should also consider precision and recall.

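Looking at precision for the spam class, most of the models score high: Random Forest reaches 1.00 on the imbalanced dataset and 0.99 on the balanced one, and Logistic Regression reaches 0.93 and 0.85. Naive Bayes trained on the balanced dataset is the lowest at 0.67, and the 1.00 reported for Naive Bayes on the imbalanced dataset is misleading because that model predicted spam only once. Precision usually trades off against recall: a model that flags a message as spam only when it is very confident tends to produce more false negatives, meaning that more spam messages are predicted as ham.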

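Looking at recall for the spam class, the results vary much more. Naive Bayes trained on the balanced dataset is the highest at 0.98, followed by Logistic Regression on the balanced dataset at 0.92, Random Forest at 0.84 (balanced) and 0.81 (imbalanced), and Logistic Regression on the imbalanced dataset at 0.77. Naive Bayes on the imbalanced dataset caught only 1 of the 242 spam messages. The 1.00 recall shared by both Random Forest models and the imbalanced Naive Bayes model is the recall for the ham class, not for spam. Higher spam recall tends to come with more false positives, meaning that more ham messages are predicted as spam.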

When it comes to our scenario, spam vs ham SMS, we should prioritize precision over recall. A model tuned for high recall catches more spam, but it has a higher chance of labelling a ham message as spam, which could cause us to miss an important message. A high-precision model may let some spam messages through labelled as ham, but fewer of our ham messages will be labelled as spam, which prevents us from losing valuable information.
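One common way to favor precision without retraining is to raise the probability threshold at which a message is flagged as spam. The sketch below illustrates the idea on synthetic, imbalanced data from `make_classification` (not the SMS dataset); the thresholds are arbitrary examples, and this is not something the notebook above actually did.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (roughly 87% class 0, 13% class 1).
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.87, 0.13], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
spam_proba = clf.predict_proba(X_te)[:, 1]  # estimated probability of class 1

flagged_counts = []
for threshold in (0.3, 0.5, 0.7, 0.9):
    # Flag a sample as positive only if its probability clears the threshold.
    pred = (spam_proba >= threshold).astype(int)
    flagged_counts.append(int(pred.sum()))
    print(
        f"threshold={threshold:.1f}",
        f"precision={precision_score(y_te, pred, zero_division=0):.2f}",
        f"recall={recall_score(y_te, pred, zero_division=0):.2f}",
        f"flagged={flagged_counts[-1]}",
    )
```

A higher threshold can only flag the same or fewer messages, so recall cannot go up as the threshold rises; precision usually improves, although that is not guaranteed at every step.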

Great Resource I used for Precision vs Recall: https://towardsdatascience.com/precision-vs-recall-386cf9f89488

### Thanks for checking out my SMS Classification Project!