# spam-filter-gridsearch

This series of notebooks is used to develop a full NLP pipeline (tokenization, lemmatization, vectorization, cross-validation/testing) for creating a spam filter. The models are trained and tested using a dataset available on Kaggle: https://www.kaggle.com/uciml/sms-spam-collection-dataset.

This notebook goes through the hyperparameter-tuning process for the machine learning models. For exploratory data analysis and data cleaning, see the *spam-filter* notebook; for vectorization and ML model training/testing, see the *spam-filter-ml* notebook.

### Preamble and repeated cells from previous notebook (vectorization and splitting data into training/testing)

In [18]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import string
import nltk
import time
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support as scores, precision_score, recall_score, make_scorer

import warnings
warnings.filterwarnings('ignore', category=DeprecationWarning)
warnings.filterwarnings('ignore', category=FutureWarning)

In [7]:
df = pd.read_csv('spam_updated.csv')
pd.set_option('display.max_colwidth', 100)
df.head()

Unnamed: 0,label,text_len,text_punct,text_cap,raw_text
0,ham,92,0.098,0.033,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g..."
1,ham,24,0.25,0.083,Ok lar... Joking wif u oni...
2,spam,128,0.047,0.078,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
3,ham,39,0.154,0.051,U dun say so early hor... U c already then say...
4,ham,49,0.041,0.041,"Nah I don't think he goes to usf, he lives around here though"


In [8]:
# Define clean_text function to be used as the analyser in the tfidf vectorizer
def clean_text(text):
    wnl = nltk.WordNetLemmatizer()
    stopwords = nltk.corpus.stopwords.words('english')
    text = ''.join([char.lower() for char in text if char not in string.punctuation])
    tokenized_list = re.split(r'\W+', text)  # raw string avoids an invalid-escape warning for \W
    text = [wnl.lemmatize(word) for word in tokenized_list if word not in stopwords and word != '']
    return text
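
To see what the cleaning steps do to a single message, here is a minimal standard-library sketch. It mirrors the lower-casing, punctuation stripping, splitting and stopword filtering of `clean_text`, but it skips lemmatization and uses a tiny made-up stopword set (`DEMO_STOPWORDS`, an assumption for illustration) in place of the NLTK English list, so its output will not match the real function exactly.

```python
import re
import string

# Hypothetical stand-in for nltk.corpus.stopwords.words('english')
DEMO_STOPWORDS = {'in', 'a', 'to'}

def clean_text_sketch(text):
    # Drop punctuation characters and lower-case everything else
    text = ''.join(ch.lower() for ch in text if ch not in string.punctuation)
    # Split on runs of non-word characters (only whitespace remains at this point)
    tokens = re.split(r'\W+', text)
    # Remove empty strings and stopwords; lemmatization is omitted in this sketch
    return [tok for tok in tokens if tok and tok not in DEMO_STOPWORDS]

tokens = clean_text_sketch('Free entry in 2 a wkly comp to win FA Cup!')
print(tokens)
```

Because the function returns a list of tokens rather than a string, it can be passed as the `analyzer` of the vectorizer, which is how `clean_text` is used in the next cells.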

In [9]:
# Split data into features (X) and labels (y), then split into training/testing sets
X = df[['raw_text', 'text_len', 'text_punct', 'text_cap']]
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [10]:
# Initialise the vectorizer, fit using raw text in training data
tfidf = TfidfVectorizer(analyzer=clean_text)
tfidf_fit = tfidf.fit(X_train['raw_text'])

In [11]:
# Use vectorizer to transform both training and testing set; returns a sparse matrix, need to use .toarray() method to read
tfidf_train = pd.DataFrame(tfidf_fit.transform(X_train['raw_text']).toarray())
tfidf_test = pd.DataFrame(tfidf_fit.transform(X_test['raw_text']).toarray())

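# Rename vectorized columns with names of tokens
# (get_feature_names() was removed in scikit-learn 1.2; newer versions provide get_feature_names_out() instead)
tfidf_train.columns = tfidf_fit.get_feature_names()
tfidf_test.columns = tfidf_fit.get_feature_names()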

# Concatenate vectorized text and other features into training and testing dataframes
X_train_vect = pd.concat([X_train[['text_len', 'text_punct', 'text_cap']].reset_index(drop=True), tfidf_train], axis=1)
X_test_vect = pd.concat([X_test[['text_len', 'text_punct', 'text_cap']].reset_index(drop=True), tfidf_test], axis=1)
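
The weighting applied by the vectorizer can be sketched in plain Python. The version below assumes scikit-learn's documented defaults (raw term counts, smoothed IDF of the form ln((1 + n) / (1 + df)) + 1, and L2 normalisation of each row); the function name `tfidf_vectors` and the three toy documents are made up for illustration and are not part of the notebook.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, rows) for a list of already-tokenized documents."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each token
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed inverse document frequency (assumed scikit-learn default)
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        # L2-normalise so every non-empty row has unit length
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

# Toy, already-cleaned messages (hypothetical data)
docs = [['free', 'win', 'prize'], ['ok', 'see', 'you'], ['free', 'call', 'now']]
vocab, rows = tfidf_vectors(docs)
weights = dict(zip(vocab, rows[0]))
print(round(weights['free'], 3), round(weights['win'], 3))
```

In the first toy message, 'win' gets a larger weight than 'free' because 'free' also appears in another message, which is the behaviour that makes rare, distinctive tokens stand out as features.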

In [12]:
# Train using X_train_vect, y_train; make predictions on X_test_vect and compare them against y_test

RFC = RandomForestClassifier(n_jobs=-1) # Run jobs in parallel; use default hyperparameters for now

fit_start = time.time()
RFC_model = RFC.fit(X_train_vect, y_train) 
fit_end = time.time()
fit_time = fit_end - fit_start

pred_start = time.time()
RFC_y_pred = RFC_model.predict(X_test_vect)
pred_end = time.time()
pred_time = pred_end - pred_start

# Calculate precision/recall/f1 scores by comparing predictions and test data; use 'spam' as the positive label
precision, recall, f1, support = scores(y_test, RFC_y_pred, pos_label='spam', average='binary')
print('Fit time: {}s, Pred time: {}s | Precision: {}, Recall: {}, F1: {}'.format(
    round(fit_time, 3), round(pred_time, 3), round(precision, 3), round(recall, 3), round(f1, 3)))

Fit time: 0.759s, Pred time: 0.241s | Precision: 1.0, Recall: 0.841, F1: 0.914


In [13]:
# Repeat process using gradient boosted trees

GBT = GradientBoostingClassifier() # Use default hyperparameters for now (this estimator has no n_jobs option, so fitting is not parallelised)

fit_start = time.time()
GBT_model = GBT.fit(X_train_vect, y_train) 
fit_end = time.time()
fit_time = fit_end - fit_start

pred_start = time.time()
GBT_y_pred = GBT_model.predict(X_test_vect)
pred_end = time.time()
pred_time = pred_end - pred_start

# Calculate precision/recall/f1 scores by comparing predictions and test data; use 'spam' as the positive label
precision, recall, f1, support = scores(y_test, GBT_y_pred, pos_label='spam', average='binary')
print('Fit time: {}s, Pred time: {}s | Precision: {}, Recall: {}, F1: {}'.format(
    round(fit_time, 3), round(pred_time, 3), round(precision, 3), round(recall, 3), round(f1, 3)))

Fit time: 36.019s, Pred time: 0.139s | Precision: 0.992, Recall: 0.89, F1: 0.938
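
As a quick consistency check on the two result lines above, F1 is the harmonic mean of precision and recall. The helper name `f1_from` is hypothetical; the inputs are the rounded precision and recall values printed by the two cells.

```python
def f1_from(precision, recall):
    # Harmonic mean of precision and recall; defined as 0 when both are 0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

rfc_f1 = f1_from(1.0, 0.841)   # random forest output above
gbt_f1 = f1_from(0.992, 0.89)  # gradient boosted trees output above
print(round(rfc_f1, 3), round(gbt_f1, 3))  # 0.914 0.938
```

Both values agree with the printed F1 scores. The random forest's perfect precision does not lift its F1 above that of the gradient boosted trees because the harmonic mean is pulled towards the lower of the two inputs, here the recall.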


## Tuning hyperparameters using GridSearchCV

Note: these grid searches were performed without specifying the scoring method, so GridSearchCV falls back on each estimator's default `score` method (mean accuracy for classifiers). Since this is a spam filter, it would be ideal to minimise the number of false positives (i.e. legitimate messages/ham misclassified as spam), as these have negative consequences in a real-life business situation. Therefore, the precision score of the model is arguably the most important and should be maximised. The recall score takes into account the number of false negatives (i.e. spam misclassified as ham); ideally, this should also be maximised, but it is less of a priority than the precision score for this problem. Grid searches in which the scoring method is explicitly defined are done in the next section.
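
The trade-off described above can be made concrete by counting false positives and false negatives directly. This is a minimal sketch on ten made-up labels; the function `precision_recall` and the two toy filters (`cautious`, `aggressive`) are illustrative assumptions, not models from this notebook.

```python
def precision_recall(y_true, y_pred, pos_label='spam'):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)  # spam caught
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)  # ham flagged as spam
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)  # spam let through
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = ['ham'] * 6 + ['spam'] * 4
# Cautious filter: never flags ham, but misses half of the spam
cautious = ['ham'] * 6 + ['spam', 'spam', 'ham', 'ham']
# Aggressive filter: catches all spam, but also flags one ham message
aggressive = ['ham'] * 5 + ['spam'] + ['spam'] * 4

print(precision_recall(y_true, cautious))    # (1.0, 0.5)
print(precision_recall(y_true, aggressive))  # (0.8, 1.0)
```

Under the priorities stated above, the cautious filter would be preferred: it lets some spam through, but no legitimate message is lost.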

### Random Forest Classifier

In [9]:
# RFC hyperparameter tuning

RFC = RandomForestClassifier(n_jobs=-1)
RFC_params = {'n_estimators': [10, 50, 100, 200, 300, 500],
             'max_depth': [20, 40, 60, 80, 100, None]}

RFC_GS = GridSearchCV(RFC, RFC_params, cv=5, n_jobs=-1) # Perform 5-fold cross-validation, conduct jobs in parallel
RFC_GS_fit = RFC_GS.fit(X_train_vect, y_train)
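
To get a feel for the cost of this search, the grid can be enumerated by hand. The sketch below assumes that candidates are generated with the parameter names sorted alphabetically and the last name varying fastest; that assumption is consistent with the row indices in the results table that follows (for example, index 19 is `max_depth=80`, `n_estimators=50`), but it is not verified against the library here.

```python
from itertools import product

rfc_params = {'n_estimators': [10, 50, 100, 200, 300, 500],
              'max_depth': [20, 40, 60, 80, 100, None]}
cv_folds = 5

# Sort parameter names, then take the Cartesian product of their value lists
keys = sorted(rfc_params)
combos = [dict(zip(keys, values))
          for values in product(*(rfc_params[k] for k in keys))]

n_fits = len(combos) * cv_folds  # one fit per candidate per fold
print(len(combos), n_fits)       # 36 180
print(combos[19])                # {'max_depth': 80, 'n_estimators': 50}
```

With the default `refit=True`, one further fit of the best candidate on the full training set is added on top of these 180.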

In [50]:
RFC_results = pd.DataFrame(RFC_GS_fit.cv_results_)
RFC_results.sort_values('mean_test_score', ascending=False)[:5]

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_max_depth,param_n_estimators,params,split0_test_score,split1_test_score,split2_test_score,...,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,split3_train_score,split4_train_score,mean_train_score,std_train_score
19,2.694666,0.125765,0.378199,0.248654,80.0,50,"{'max_depth': 80, 'n_estimators': 50}",0.974215,0.969731,0.976431,...,0.974871,0.003287,1,1.0,0.999439,0.99972,0.999159,0.999159,0.999495,0.000327
34,22.479615,2.0772,3.987035,0.38905,,300,"{'max_depth': None, 'n_estimators': 300}",0.975336,0.971973,0.978676,...,0.974422,0.003202,2,1.0,1.0,1.0,1.0,1.0,1.0,0.0
32,4.266202,0.585318,0.923006,0.047902,,100,"{'max_depth': None, 'n_estimators': 100}",0.974215,0.969731,0.974186,...,0.974198,0.002548,3,1.0,1.0,1.0,1.0,1.0,1.0,0.0
29,36.83829,1.897558,2.574735,1.242099,100.0,500,"{'max_depth': 100, 'n_estimators': 500}",0.975336,0.969731,0.977553,...,0.974198,0.003816,3,1.0,1.0,1.0,1.0,1.0,1.0,0.0
26,4.095117,0.603194,0.875162,0.092967,100.0,100,"{'max_depth': 100, 'n_estimators': 100}",0.975336,0.966368,0.975309,...,0.973749,0.003717,5,1.0,1.0,1.0,0.99972,1.0,0.999944,0.000112


### Gradient Boosted Trees

In [22]:
# GBT hyperparameter tuning

GBT = GradientBoostingClassifier()
GBT_params = {'n_estimators': [100, 300, 500], # GBT tends to work better with many estimators that are shallower than those used in RFC
             'max_depth': [5, 10, None],
             'learning_rate': [0.01, 0.1, 1]}

GBT_GS = GridSearchCV(GBT, GBT_params, cv=5, n_jobs=-1) # Perform 5-fold cross-validation, conduct jobs in parallel
GBT_GS_fit = GBT_GS.fit(X_train_vect, y_train)

In [49]:
GBT_results = pd.DataFrame(GBT_GS_fit.cv_results_)
GBT_results.sort_values('mean_test_score', ascending=False)[:5]

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_learning_rate,param_max_depth,param_n_estimators,params,split0_test_score,split1_test_score,...,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,split3_train_score,split4_train_score,mean_train_score,std_train_score
11,471.033025,3.002779,0.217539,0.026598,0.1,5,500,"{'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 500}",0.979821,0.966368,...,0.97689,0.005329,1,1.0,1.0,1.0,1.0,1.0,1.0,0.0
14,801.674959,54.059547,0.238761,0.015251,0.1,10,500,"{'learning_rate': 0.1, 'max_depth': 10, 'n_estimators': 500}",0.980942,0.964126,...,0.976442,0.006761,2,1.0,1.0,1.0,1.0,1.0,1.0,0.0
13,582.187606,3.391342,0.218129,0.019305,0.1,10,300,"{'learning_rate': 0.1, 'max_depth': 10, 'n_estimators': 300}",0.979821,0.966368,...,0.975544,0.005178,3,1.0,1.0,1.0,1.0,1.0,1.0,0.0
10,285.606467,1.023179,0.181514,0.002442,0.1,5,300,"{'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 300}",0.977578,0.965247,...,0.974647,0.004938,4,1.0,1.0,1.0,1.0,1.0,1.0,0.0
12,216.391659,4.214193,0.174543,0.008357,0.1,10,100,"{'learning_rate': 0.1, 'max_depth': 10, 'n_estimators': 100}",0.9787,0.960762,...,0.974422,0.007514,5,1.0,1.0,1.0,1.0,1.0,1.0,0.0


### Multilayer Perceptron

In [27]:
# MLP hyperparameter tuning

MLP = MLPClassifier()
MLP_params = {'hidden_layer_sizes': [10, 50, 100, 200], # Each integer is the number of units in a single hidden layer
             'activation': ['relu', 'tanh', 'logistic'],
             'learning_rate': ['constant', 'invscaling', 'adaptive']}

MLP_GS = GridSearchCV(MLP, MLP_params, cv=5, n_jobs=-1) # Perform 5-fold cross-validation, conduct jobs in parallel
MLP_GS_fit = MLP_GS.fit(X_train_vect, y_train)

In [48]:
MLP_results = pd.DataFrame(MLP_GS_fit.cv_results_)
MLP_results.sort_values('mean_test_score', ascending=False)[:5]

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_activation,param_hidden_layer_sizes,param_learning_rate,params,split0_test_score,split1_test_score,...,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,split3_train_score,split4_train_score,mean_train_score,std_train_score
21,332.182882,30.403524,0.220609,0.008088,tanh,200,constant,"{'activation': 'tanh', 'hidden_layer_sizes': 200, 'learning_rate': 'constant'}",0.9787,0.980942,...,0.984294,0.003812,1,1.0,1.0,1.0,1.0,1.0,1.0,0.0
34,678.082324,59.744037,0.217719,0.015229,logistic,200,invscaling,"{'activation': 'logistic', 'hidden_layer_sizes': 200, 'learning_rate': 'invscaling'}",0.9787,0.979821,...,0.983621,0.003852,2,1.0,1.0,1.0,1.0,1.0,1.0,0.0
15,221.636807,10.963911,0.129759,0.003409,tanh,50,constant,"{'activation': 'tanh', 'hidden_layer_sizes': 50, 'learning_rate': 'constant'}",0.979821,0.980942,...,0.983397,0.003118,3,1.0,1.0,1.0,1.0,1.0,1.0,0.0
33,596.520698,72.511875,0.222519,0.006347,logistic,200,constant,"{'activation': 'logistic', 'hidden_layer_sizes': 200, 'learning_rate': 'constant'}",0.979821,0.980942,...,0.983397,0.002776,3,1.0,1.0,1.0,1.0,1.0,1.0,0.0
29,347.985155,24.730214,0.136041,0.00552,logistic,50,adaptive,"{'activation': 'logistic', 'hidden_layer_sizes': 50, 'learning_rate': 'adaptive'}",0.980942,0.979821,...,0.983397,0.003118,3,0.999719,1.0,0.99972,1.0,1.0,0.999888,0.000137


### Support Vector Classifier

In [30]:
# Support Vector Classifier hyperparameter tuning

SVM = SVC()
SVM_params = {'C': [0.01, 0.1, 1, 10, 100]}

SVM_GS = GridSearchCV(SVM, SVM_params, cv=5, n_jobs=-1) # Perform 5-fold cross-validation, conduct jobs in parallel
SVM_GS_fit = SVM_GS.fit(X_train_vect, y_train)

In [31]:
SVM_results = pd.DataFrame(SVM_GS_fit.cv_results_)
SVM_results.sort_values('mean_test_score', ascending=False)[:5]

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_C,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,...,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,split3_train_score,split4_train_score,mean_train_score,std_train_score
4,75.127635,15.896474,16.831728,4.836345,100.0,{'C': 100},0.88565,0.892377,0.912458,0.912458,...,0.900606,0.010702,1,0.906592,0.903506,0.900168,0.901851,0.902412,0.902906,0.002135
3,81.057478,5.496299,17.681929,1.715388,10.0,{'C': 10},0.872197,0.877803,0.901235,0.895623,...,0.883105,0.012963,2,0.890603,0.884712,0.880538,0.88166,0.888671,0.885237,0.003892
2,73.213032,1.36658,16.51009,0.174032,1.0,{'C': 1},0.866592,0.876682,0.863075,0.863075,...,0.866951,0.005051,3,0.887518,0.880505,0.862591,0.862591,0.885025,0.875646,0.010894
0,66.559258,1.107112,16.058674,0.056827,0.01,{'C': 0.01},0.862108,0.862108,0.863075,0.863075,...,0.862688,0.000474,4,0.862833,0.862833,0.862591,0.862591,0.862591,0.862688,0.000119
1,70.511421,1.90348,16.282591,0.24129,0.1,{'C': 0.1},0.862108,0.862108,0.863075,0.863075,...,0.862688,0.000474,4,0.862833,0.862833,0.862591,0.862591,0.862591,0.862688,0.000119
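
In the last two rows (C = 0.01 and C = 0.1), the mean train and test accuracy are both 0.862688, with almost no variation between folds. That pattern is consistent with a model that labels every message as ham, although the notebook does not confirm this. The sketch below uses a made-up label distribution (863 ham, 137 spam) to show why accuracy alone can hide such behaviour.

```python
# Hypothetical label distribution: 863 ham and 137 spam out of 1000 messages
y_true = ['ham'] * 863 + ['spam'] * 137
y_pred = ['ham'] * len(y_true)  # a "classifier" that never predicts spam

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
spam_caught = sum(t == 'spam' and p == 'spam' for t, p in zip(y_true, y_pred))
spam_recall = spam_caught / y_true.count('spam')

print(accuracy, spam_recall)
```

An accuracy near the share of the majority class, paired with zero recall for spam, is one reason to score the search on precision and recall rather than on the default accuracy.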


### Logistic Regression

In [46]:
# Logistic Regression hyperparameter tuning

LogReg = LogisticRegression()
LogReg_params = {'C': [0.01, 0.1, 1, 10, 100]}

LogReg_GS = GridSearchCV(LogReg, LogReg_params, cv=5, n_jobs=-1) # Perform 5-fold cross-validation, conduct jobs in parallel
LogReg_GS_fit = LogReg_GS.fit(X_train_vect, y_train)

In [47]:
LogReg_results = pd.DataFrame(LogReg_GS_fit.cv_results_)
LogReg_results.sort_values('mean_test_score', ascending=False)[:5]

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_C,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,...,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,split3_train_score,split4_train_score,mean_train_score,std_train_score
4,1.160531,0.064991,0.066622,0.008425,100.0,{'C': 100},0.976457,0.973094,0.98092,0.984287,...,0.978461,0.003838,1,1.0,1.0,1.0,1.0,1.0,1.0,0.0
3,1.031799,0.052681,0.066923,0.004351,10.0,{'C': 10},0.976457,0.970852,0.976431,0.979798,...,0.976217,0.00295,2,0.999158,0.998878,0.998317,0.998878,0.998598,0.998766,0.000286
2,0.877978,0.026604,0.056938,0.004665,1.0,{'C': 1},0.952915,0.945067,0.955107,0.960718,...,0.954005,0.005145,3,0.971108,0.970266,0.970836,0.969994,0.970275,0.970496,0.000411
1,0.93882,0.067158,0.070013,0.026883,0.1,{'C': 0.1},0.881166,0.877803,0.8844,0.878788,...,0.879515,0.003059,4,0.885273,0.884432,0.885586,0.879697,0.886708,0.884339,0.002433
0,0.851371,0.054368,0.057052,0.002634,0.01,{'C': 0.01},0.857623,0.858744,0.852974,0.859708,...,0.85663,0.002635,5,0.856942,0.856942,0.857824,0.858104,0.856983,0.857359,0.000502


## Tuning hyperparameters using GridSearchCV, with precision_score

This section tunes hyperparameters for the random forest classifier, gradient boosted trees, and multilayer perceptron. GridSearchCV is used to find the hyperparameters that maximise the precision score for each model; the recall score is recorded alongside it for comparison.

In [19]:
# Define scoring methods to be used in GridSearchCV
precision_scorer = make_scorer(precision_score, pos_label='spam')
recall_scorer = make_scorer(recall_score, pos_label='spam')
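
Conceptually, a scorer wraps a metric into a callable that takes an estimator, features and true labels, so that the grid search can evaluate each fitted candidate. The following is a simplified, hypothetical stand-in written in plain Python to show that idea; `make_simple_scorer`, `simple_precision` and the toy `LengthRule` estimator are illustrative names, not scikit-learn's implementation.

```python
def make_simple_scorer(metric, **metric_kwargs):
    """Hypothetical, simplified stand-in for sklearn.metrics.make_scorer."""
    def scorer(estimator, X, y_true):
        # Predict with the fitted estimator, then apply the metric
        y_pred = estimator.predict(X)
        return metric(y_true, y_pred, **metric_kwargs)
    return scorer

def simple_precision(y_true, y_pred, pos_label):
    # Of the messages flagged as pos_label, what fraction really are pos_label?
    flagged = [t for t, p in zip(y_true, y_pred) if p == pos_label]
    return flagged.count(pos_label) / len(flagged) if flagged else 0.0

class LengthRule:
    """Toy estimator: flags any message longer than 100 characters as spam."""
    def predict(self, X):
        return ['spam' if length > 100 else 'ham' for length in X]

toy_scorer = make_simple_scorer(simple_precision, pos_label='spam')
lengths = [128, 24, 150, 39, 120]              # made-up message lengths
labels = ['spam', 'ham', 'spam', 'ham', 'ham']  # made-up true labels
score = toy_scorer(LengthRule(), lengths, labels)
print(round(score, 3))
```

The extra keyword argument plays the same role as `pos_label='spam'` in the cell above: the labels here are strings, so the metric has to be told which class counts as positive.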

### Random Forest Classifier

In [25]:
RFC = RandomForestClassifier(n_jobs=-1)
RFC_params = {'n_estimators': [10, 50, 100, 200, 300, 500],
             'max_depth': [80, 100, None]}

RFC_GS = GridSearchCV(RFC, RFC_params, cv=5, n_jobs=-1, scoring={'precision_scorer': precision_scorer, 'recall_scorer': recall_scorer}, refit='precision_scorer') # Perform 5-fold cross-validation, conduct jobs in parallel
RFC_GS_fit = RFC_GS.fit(X_train_vect, y_train)

In [40]:
RFC_results = pd.DataFrame(RFC_GS_fit.cv_results_)
cols = ['mean_fit_time', 'mean_score_time', 'params', 'mean_test_precision_scorer', 'mean_test_recall_scorer', 'rank_test_precision_scorer', 'rank_test_recall_scorer']
RFC_results[cols].sort_values(['mean_test_precision_scorer', 'mean_test_recall_scorer'], ascending=[False, False])[:5]

Unnamed: 0,mean_fit_time,mean_score_time,params,mean_test_precision_scorer,mean_test_recall_scorer,rank_test_precision_scorer,rank_test_recall_scorer
5,34.720276,4.393556,"{'max_depth': 80, 'n_estimators': 500}",1.0,0.812287,1,1
8,4.369905,2.474578,"{'max_depth': 100, 'n_estimators': 100}",1.0,0.810633,1,2
10,19.683171,6.218031,"{'max_depth': 100, 'n_estimators': 300}",1.0,0.810633,1,2
4,21.091032,6.592858,"{'max_depth': 80, 'n_estimators': 300}",1.0,0.805647,1,5
2,5.645418,2.992563,"{'max_depth': 80, 'n_estimators': 100}",1.0,0.802328,1,6
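
All five rows above tie at a precision of 1.0, so the second sort key, recall, decides their order. The same tie-breaking can be sketched without pandas, using four of the rows from the table; the short key names `precision` and `recall` are shorthand for the `mean_test_precision_scorer` and `mean_test_recall_scorer` columns.

```python
rows = [
    {'params': {'max_depth': 80, 'n_estimators': 100}, 'precision': 1.0, 'recall': 0.802328},
    {'params': {'max_depth': 100, 'n_estimators': 100}, 'precision': 1.0, 'recall': 0.810633},
    {'params': {'max_depth': 80, 'n_estimators': 500}, 'precision': 1.0, 'recall': 0.812287},
    {'params': {'max_depth': 80, 'n_estimators': 300}, 'precision': 1.0, 'recall': 0.805647},
]

# Sort by precision first and by recall second, both descending
ranked = sorted(rows, key=lambda r: (r['precision'], r['recall']), reverse=True)
best = ranked[0]['params']
print(best)  # {'max_depth': 80, 'n_estimators': 500}
```

Since precision cannot separate these candidates, recall and fit time become the practical criteria: `max_depth=100` with 100 estimators gives nearly the same recall as the top row at a fraction of the mean fit time shown in the table.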


### Gradient Boosted Trees

In [26]:
# GBT hyperparameter tuning

GBT = GradientBoostingClassifier()
GBT_params = {'n_estimators': [400, 450, 500, 550, 600], # GBT tends to work better with many estimators that are shallower than those used in RFC
             'max_depth': [5, 6, 7, 8, 9, 10]}

GBT_GS = GridSearchCV(GBT, GBT_params, cv=5, n_jobs=-1, scoring={'precision_scorer': precision_scorer, 'recall_scorer': recall_scorer}, refit='precision_scorer') # Perform 5-fold cross-validation, conduct jobs in parallel
GBT_GS_fit = GBT_GS.fit(X_train_vect, y_train)

In [36]:
GBT_results = pd.DataFrame(GBT_GS_fit.cv_results_)
cols = ['mean_fit_time', 'mean_score_time', 'params', 'mean_test_precision_scorer', 'mean_test_recall_scorer', 'rank_test_precision_scorer', 'rank_test_recall_scorer']
GBT_results[cols].sort_values(['mean_test_precision_scorer', 'mean_test_recall_scorer'], ascending=[False, False])[:5]

Unnamed: 0,mean_fit_time,mean_score_time,params,mean_test_precision_scorer,mean_test_recall_scorer,rank_test_precision_scorer,rank_test_recall_scorer
4,608.522695,0.483408,"{'max_depth': 5, 'n_estimators': 600}",0.984607,0.840548,1,30
3,539.037942,0.52294,"{'max_depth': 5, 'n_estimators': 550}",0.983009,0.845534,2,22
13,740.086748,0.466765,"{'max_depth': 7, 'n_estimators': 550}",0.982944,0.842214,3,26
14,806.249868,0.509747,"{'max_depth': 7, 'n_estimators': 600}",0.979404,0.843892,4,23
5,494.682198,0.429304,"{'max_depth': 6, 'n_estimators': 400}",0.979403,0.852186,5,6


### Multilayer Perceptron

In [27]:
# MLP hyperparameter tuning

MLP = MLPClassifier()
MLP_params = {'hidden_layer_sizes': [10, 50, 100, 200], # Each integer is the number of units in a single hidden layer
             'activation': ['relu', 'tanh', 'logistic'],
             'learning_rate': ['constant', 'invscaling', 'adaptive']}

MLP_GS = GridSearchCV(MLP, MLP_params, cv=5, n_jobs=-1, scoring={'precision_scorer': precision_scorer, 'recall_scorer': recall_scorer}, refit='precision_scorer') # Perform 5-fold cross-validation, conduct jobs in parallel
MLP_GS_fit = MLP_GS.fit(X_train_vect, y_train)

In [38]:
MLP_results = pd.DataFrame(MLP_GS_fit.cv_results_)
cols = ['mean_fit_time', 'mean_score_time', 'params', 'mean_test_precision_scorer', 'mean_test_recall_scorer', 'rank_test_precision_scorer', 'rank_test_recall_scorer']
MLP_results[cols].sort_values(['mean_test_precision_scorer', 'mean_test_recall_scorer'], ascending=[False, False])[:5]

Unnamed: 0,mean_fit_time,mean_score_time,params,mean_test_precision_scorer,mean_test_recall_scorer,rank_test_precision_scorer,rank_test_recall_scorer
2,128.752426,0.239063,"{'activation': 'relu', 'hidden_layer_sizes': 10, 'learning_rate': 'adaptive'}",0.988483,0.838835,1,34
9,311.175,0.428066,"{'activation': 'relu', 'hidden_layer_sizes': 200, 'learning_rate': 'constant'}",0.983454,0.873783,2,31
11,323.391,0.376402,"{'activation': 'relu', 'hidden_layer_sizes': 200, 'learning_rate': 'adaptive'}",0.983452,0.873712,3,32
6,189.118396,0.307976,"{'activation': 'relu', 'hidden_layer_sizes': 100, 'learning_rate': 'constant'}",0.981644,0.885433,4,18
33,627.535976,0.42766,"{'activation': 'logistic', 'hidden_layer_sizes': 200, 'learning_rate': 'constant'}",0.979926,0.890335,5,10
