# spam-filter-ml

This series of notebooks develops a full NLP pipeline (tokenization, lemmatization, vectorization, cross-validation/testing) for a spam filter. The model is trained and tested on the SMS spam collection dataset available on Kaggle: https://www.kaggle.com/uciml/sms-spam-collection-dataset.

This notebook covers the machine learning parts of the pipeline; for exploratory data analysis and data cleaning, see the *spam-filter* notebook.

In [35]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import string
import nltk
import time
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support as scores


## Vectorization

The next step is to vectorize each text message so that the data can be fed to a machine learning model for training. I will be using the TF-IDF vectorizer (term frequency - inverse document frequency) from scikit-learn. It combines two quantities to determine the weight of each token: 1) how often the token appears in a given message (term frequency), and 2) how rare the token is across all messages (inverse document frequency), so that tokens appearing in most messages are down-weighted. The vectorizer accepts a custom analyser (the `analyzer` argument), for which I pass a cleaning function that combines all the previous steps. The dataset is split into a training and a testing set. The vectorizer is fit on the training set only, and the fitted vectorizer is then used to transform both the training and the testing set.
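To make the weighting concrete, here is a minimal sketch of the arithmetic on a toy corpus. The three tokenized messages and the helper names `idf` and `tfidf` are hypothetical; the formula is the smoothed IDF with L2 normalisation that scikit-learn applies by default (`smooth_idf=True`, `norm='l2'`), written out by hand rather than taken from the library.

```python
import math
from collections import Counter

# Toy corpus of already-tokenized messages (hypothetical, not from the dataset)
docs = [
    ["free", "entry", "win"],
    ["ok", "free", "later"],
    ["ok", "ok", "later"],
]

n_docs = len(docs)
# Document frequency: number of messages that contain each token at least once
doc_freq = Counter(token for doc in docs for token in set(doc))

def idf(token):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq[token])) + 1

def tfidf(doc):
    counts = Counter(doc)  # term frequency within this message
    raw = {tok: cnt * idf(tok) for tok, cnt in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))  # L2 normalisation
    return {tok: w / norm for tok, w in raw.items()}

weights = tfidf(docs[0])
print(weights)
```

In the first message, "entry" and "win" each appear in only one message, so they receive a higher weight than "free", which appears in two.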

In [2]:
df = pd.read_csv('spam_updated.csv')
pd.set_option('display.max_colwidth', 100)
df.head()

Unnamed: 0,label,text_len,text_punct,text_cap,raw_text
0,ham,92,0.098,0.033,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g..."
1,ham,24,0.25,0.083,Ok lar... Joking wif u oni...
2,spam,128,0.047,0.078,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
3,ham,39,0.154,0.051,U dun say so early hor... U c already then say...
4,ham,49,0.041,0.041,"Nah I don't think he goes to usf, he lives around here though"


In [3]:
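# Define clean_text function to be used as the analyser in the tfidf vectorizer
def clean_text(text):
    wnl = nltk.WordNetLemmatizer()
    stopwords = nltk.corpus.stopwords.words('english')
    # Lowercase the text and strip punctuation
    text = ''.join([char.lower() for char in text if char not in string.punctuation])
    # Split on runs of non-word characters (raw string avoids an invalid escape sequence warning)
    tokenized_list = re.split(r'\W+', text)
    # Drop stopwords and empty strings, then lemmatize the remaining tokens
    text = [wnl.lemmatize(word) for word in tokenized_list if word not in stopwords and word != '']
    return text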

# Check that the clean_text function works
df_test_clean = df['raw_text'].apply(clean_text)
df_test_clean.head()

0    [go, jurong, point, crazy, available, bugis, n, great, world, la, e, buffet, cine, got, amore, wat]
1                                                                         [ok, lar, joking, wif, u, oni]
2    [free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv...
3                                                          [u, dun, say, early, hor, u, c, already, say]
4                                                      [nah, dont, think, go, usf, life, around, though]
Name: raw_text, dtype: object

In [4]:
# Split data into features (X) and labels (y), then split into training/testing sets
X = df[['raw_text', 'text_len', 'text_punct', 'text_cap']]
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

X_train.head()

Unnamed: 0,raw_text,text_len,text_punct,text_cap
2239,Every day i use to sleep after &lt;#&gt; so only.,40,0.15,0.025
1005,Give me a sec to think think about it,29,0.0,0.034
5136,There are some nice pubs near here or there is Frankie n Bennys near the warner cinema?,71,0.014,0.042
4714,S:)8 min to go for lunch:),21,0.19,0.048
5491,U studying in sch or going home? Anyway i'll b going 2 sch later.,52,0.058,0.038


In [5]:
# Initialise the vectorizer, fit using raw text in training data
tfidf = TfidfVectorizer(analyzer=clean_text)
tfidf_fit = tfidf.fit(X_train['raw_text'])
tfidf_fit

TfidfVectorizer(analyzer=<function clean_text at 0x000001135597D8C8>,
        binary=False, decode_error='strict', dtype=<class 'numpy.int64'>,
        encoding='utf-8', input='content', lowercase=True, max_df=1.0,
        max_features=None, min_df=1, ngram_range=(1, 1), norm='l2',
        preprocessor=None, smooth_idf=True, stop_words=None,
        strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [10]:
# Use the fitted vectorizer to transform both the training and testing sets
# transform() returns a sparse matrix, so .toarray() is needed to build a readable DataFrame
tfidf_train = pd.DataFrame(tfidf_fit.transform(X_train['raw_text']).toarray())
tfidf_test = pd.DataFrame(tfidf_fit.transform(X_test['raw_text']).toarray())

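# Rename vectorized columns with names of tokens
# get_feature_names() was removed in scikit-learn 1.2; on versions older than 1.0 use get_feature_names() instead
tfidf_train.columns = tfidf_fit.get_feature_names_out()
tfidf_test.columns = tfidf_fit.get_feature_names_out()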
tfidf_train.head()

Unnamed: 0,0,008704050406,0089my,0121,01223585236,01223585334,02,020603,0207,02070836089,...,ìïll,ó,ö,û,ûthanks,ûªm,ûªt,ûï,ûò,ûówell
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [11]:
# Concatenate vectorized text and other features into training and testing dataframes
X_train_vect = pd.concat([X_train[['text_len', 'text_punct', 'text_cap']].reset_index(drop=True), tfidf_train], axis=1)
X_test_vect = pd.concat([X_test[['text_len', 'text_punct', 'text_cap']].reset_index(drop=True), tfidf_test], axis=1)
X_train_vect.head()

Unnamed: 0,text_len,text_punct,text_cap,0,008704050406,0089my,0121,01223585236,01223585334,02,...,ìïll,ó,ö,û,ûthanks,ûªm,ûªt,ûï,ûò,ûówell
0,40,0.15,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,29,0.0,0.034,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,71,0.014,0.042,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,21,0.19,0.048,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,52,0.058,0.038,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Machine Learning

The dataset is now ready to be used to train and test various machine learning models. For this part of the project I will compare five algorithms: random forest, gradient boosted trees, a support vector machine, a multilayer perceptron, and logistic regression, all with near-default hyperparameters. Tuning the hyperparameters with a grid search is left to a later notebook. Common metrics for classification tasks include precision, recall, accuracy, and the F1 score. However, given that this ML model is used for a spam filter, it is ideal to minimise the false positives (i.e. it is more important that 'real' texts/emails are not incorrectly classified as spam; spam that is misclassified as ham has a lesser impact in a real-world setting). Precision measures exactly this: of all messages flagged as spam, the fraction that really are spam. Therefore, precision is arguably the most important performance metric for this problem.
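As a small illustration of how these metrics relate, the sketch below computes precision, recall, and F1 by hand with 'spam' as the positive label. The function name `precision_recall_f1` and the eight labels are made up for the example; the notebook itself relies on scikit-learn's `precision_recall_fscore_support`.

```python
def precision_recall_f1(y_true, y_pred, pos_label="spam"):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)  # spam caught
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)  # ham flagged as spam
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)  # spam missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels: a cautious filter that flags only one of three spam messages
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
y_pred = ["spam", "ham", "ham", "ham", "ham", "ham", "ham", "ham"]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

Here no ham is flagged as spam, so precision is perfect, but two of the three spam messages are missed, so recall is only one third.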

In [27]:
# Train using X_train_vect, y_train; make predictions on X_test_vect and compare them against y_test

RFC = RandomForestClassifier(n_jobs=-1) # Run jobs in parallel; use default hyperparameters for now

fit_start = time.time()
RFC_model = RFC.fit(X_train_vect, y_train) 
fit_end = time.time()
fit_time = fit_end - fit_start

pred_start = time.time()
RFC_y_pred = RFC_model.predict(X_test_vect)
pred_end = time.time()
pred_time = pred_end - pred_start

# Calculate precision/recall/f1 scores by comparing predictions and test data; use 'spam' as the positive label
precision, recall, f1, support = scores(y_test, RFC_y_pred, pos_label='spam', average='binary')
print('Fit time: {}s, Pred time: {}s | Precision: {}, Recall: {}, F1: {}'.format(
    round(fit_time, 3), round(pred_time, 3), round(precision, 3), round(recall, 3), round(f1, 3)))

Fit time: 0.481s, Pred time: 0.184s | Precision: 0.992, Recall: 0.812, F1: 0.893


In [28]:
# Repeat process using gradient boosted trees

GBT = GradientBoostingClassifier() # Use default hyperparameters for now (no n_jobs option; trees are built sequentially)

fit_start = time.time()
GBT_model = GBT.fit(X_train_vect, y_train) 
fit_end = time.time()
fit_time = fit_end - fit_start

pred_start = time.time()
GBT_y_pred = GBT_model.predict(X_test_vect)
pred_end = time.time()
pred_time = pred_end - pred_start

# Calculate precision/recall/f1 scores by comparing predictions and test data; use 'spam' as the positive label
precision, recall, f1, support = scores(y_test, GBT_y_pred, pos_label='spam', average='binary')
print('Fit time: {}s, Pred time: {}s | Precision: {}, Recall: {}, F1: {}'.format(
    round(fit_time, 3), round(pred_time, 3), round(precision, 3), round(recall, 3), round(f1, 3)))

Fit time: 35.718s, Pred time: 0.107s | Precision: 0.992, Recall: 0.819, F1: 0.897


In [48]:
# Repeat process using support vector machine

SVM = SVC(kernel='linear') # Linear kernel; other hyperparameters left at their defaults for now

fit_start = time.time()
SVM_model = SVM.fit(X_train_vect, y_train) 
fit_end = time.time()
fit_time = fit_end - fit_start

pred_start = time.time()
SVM_y_pred = SVM_model.predict(X_test_vect)
pred_end = time.time()
pred_time = pred_end - pred_start

# Calculate precision/recall/f1 scores by comparing predictions and test data; use 'spam' as the positive label
precision, recall, f1, support = scores(y_test, SVM_y_pred, pos_label='spam', average='binary')
print('Fit time: {}s, Pred time: {}s | Precision: {}, Recall: {}, F1: {}'.format(
    round(fit_time, 3), round(pred_time, 3), round(precision, 3), round(recall, 3), round(f1, 3)))

Fit time: 461.301s, Pred time: 22.981s | Precision: 0.993, Recall: 0.881, F1: 0.934


In [34]:
# Repeat process using multilayer perceptron

MLP = MLPClassifier() # Use default hyperparameters for now

fit_start = time.time()
MLP_model = MLP.fit(X_train_vect, y_train) 
fit_end = time.time()
fit_time = fit_end - fit_start

pred_start = time.time()
MLP_y_pred = MLP_model.predict(X_test_vect)
pred_end = time.time()
pred_time = pred_end - pred_start

# Calculate precision/recall/f1 scores by comparing predictions and test data; use 'spam' as the positive label
precision, recall, f1, support = scores(y_test, MLP_y_pred, pos_label='spam', average='binary')
print('Fit time: {}s, Pred time: {}s | Precision: {}, Recall: {}, F1: {}'.format(
    round(fit_time, 3), round(pred_time, 3), round(precision, 3), round(recall, 3), round(f1, 3)))

Fit time: 67.321s, Pred time: 0.094s | Precision: 0.993, Recall: 0.906, F1: 0.948


In [37]:
# Repeat process using logistic regression

LogReg = LogisticRegression() # Use default hyperparameters for now

fit_start = time.time()
LogReg_model = LogReg.fit(X_train_vect, y_train) 
fit_end = time.time()
fit_time = fit_end - fit_start

pred_start = time.time()
LogReg_y_pred = LogReg_model.predict(X_test_vect)
pred_end = time.time()
pred_time = pred_end - pred_start

# Calculate precision/recall/f1 scores by comparing predictions and test data; use 'spam' as the positive label
precision, recall, f1, support = scores(y_test, LogReg_y_pred, pos_label='spam', average='binary')
print('Fit time: {}s, Pred time: {}s | Precision: {}, Recall: {}, F1: {}'.format(
    round(fit_time, 3), round(pred_time, 3), round(precision, 3), round(recall, 3), round(f1, 3)))

Fit time: 0.581s, Pred time: 0.062s | Precision: 0.992, Recall: 0.775, F1: 0.87


It can be seen that all models have a high precision score of 99.2% to 99.3%, meaning that very few ham messages are misclassified as spam. Their recall scores are generally lower, ranging from 77.5% to 90.6%, meaning that there is still some spam that the models are unable to identify correctly. The results depend heavily on how the dataset was split by train_test_split, so different runs may return different scores (this issue can be tackled by performing cross-validation, which is incorporated into GridSearchCV).
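The idea behind k-fold cross-validation can be sketched in plain Python: the data is partitioned into k folds, and each fold serves once as the test set while the rest is used for training, so the reported score no longer hinges on a single random split. The helper `k_fold_indices` below is a hypothetical illustration, not the splitter that GridSearchCV uses internally.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample lands in exactly one test fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]  # deal the indices into k folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(10, k=5))
for train_idx, test_idx in splits:
    print("train:", sorted(train_idx), "test:", sorted(test_idx))
```

Averaging a metric such as precision over the k test folds gives a more stable estimate than a single train/test split.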

There is a general tradeoff between training time and predictive performance. For example, the logistic regression model is one of the fastest to train (0.581s), but it has the lowest recall (77.5%). Conversely, the support vector classifier has precision and recall scores that are above average, but it takes by far the longest to train (461s). The multilayer perceptron takes a moderate amount of time to train (67s) and gives the best scores overall (recall 90.6%, F1 0.948). Also note that the gradient boosting classifier takes much longer to fit than the random forest classifier (35.7s compared with 0.481s), as the decision trees in GBT are constructed sequentially and are not independent of each other. Given these reasons, I will choose to further investigate the RFC, GBT, and MLP models.
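Because the same fit/predict/timing pattern is repeated for every model, it could be factored into a single helper. The sketch below is one possible way to do that; `timed_fit_predict` and the `MajorityClassifier` stand-in are hypothetical names, and the helper only assumes the estimator exposes `fit` and `predict`, as scikit-learn classifiers do.

```python
import time

def timed_fit_predict(model, X_train, y_train, X_test):
    """Fit the model, predict on X_test, and return (y_pred, fit_time, pred_time) in seconds."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_time = time.perf_counter() - start

    start = time.perf_counter()
    y_pred = model.predict(X_test)
    pred_time = time.perf_counter() - start
    return y_pred, fit_time, pred_time

class MajorityClassifier:
    """Hypothetical stand-in estimator: always predicts the most common training label."""
    def fit(self, X, y):
        labels = list(y)
        self.label_ = max(set(labels), key=labels.count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]

y_pred, fit_time, pred_time = timed_fit_predict(
    MajorityClassifier(), [[0], [1], [2]], ["ham", "ham", "spam"], [[3], [4]]
)
print(y_pred, round(fit_time, 3), round(pred_time, 3))
```

With such a helper, each model comparison reduces to one call, for instance passing `RandomForestClassifier(n_jobs=-1)` together with `X_train_vect`, `y_train`, and `X_test_vect`.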

The next step is to perform a grid search on these three models to find the hyperparameters which optimise the overall performance of the spam filtering system, which will be done in a separate notebook *spam-filter-gridsearch*.