# Spam Classifier

In this notebook, we demonstrate an end-to-end machine learning pipeline, from data preparation to model selection, for building a spam classifier for text messages.

## Import packages

In [1]:
import re
import string
import time
import warnings

import nltk
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import precision_recall_fscore_support as score

pd.set_option('display.max_colwidth', 100)
warnings.filterwarnings("ignore", category=DeprecationWarning)

## Read message file


In [2]:
data = pd.read_csv("SMSSpamCollection.tsv", sep='\t')
data.columns = ['label', 'body_text']

data

|   | label | body_text |
|---|-------|-----------|
| 0 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... |
| 1 | ham | Nah I don't think he goes to usf, he lives around here though |
| 2 | ham | Even my brother is not like to speak with me. They treat me like aids patent. |
| 3 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! |
| 4 | ham | As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your call... |
| 5 | spam | WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To c... |
| 6 | spam | Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with came... |
| 7 | ham | I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried ... |
| 8 | spam | SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, ... |
| 9 | spam | URGENT! You have won a 1 week FREE membership in our £100,000 Prize Jackpot! Txt the word: CLAIM... |


## Data preparation

Stemming: process of reducing inflected (or sometimes derived) words to their word stem or root

Lemmatization: process of grouping together the inflected forms of a word so they can be analyzed as a single term, identified by the word's lemma

Example:

| **Words** | **Stemming** | **Lemmatization** |
|-----------|--------------|-------------------|
| mean      | mean         | mean              |
| means     | mean         | mean              |
| meaning   | mean         | meaning           |
| meanness  | mean         | meanness          |
| goose     | goos         | goose             |
| geese     | gees         | goose             |

### Tradeoffs between lemmatization and stemming

The goal of lemmatization and stemming is to condense derived words into their base forms. However, the two approaches trade off accuracy against speed.

Stemming is typically **faster** but less **accurate**, as it simply chops off the end of a word using heuristics, without any understanding of the context in which the word is used.

Lemmatizing is typically more **accurate** but **slower** as it uses vocabulary analysis to reduce groups of words with similar meaning to the dictionary form of words. 
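
The contrast can be illustrated with a toy sketch. The suffix list and lemma dictionary below are hypothetical stand-ins, far smaller than what `nltk.PorterStemmer` and `nltk.WordNetLemmatizer` actually use, so the outputs will not match NLTK's in general.

```python
# Toy illustration only: SUFFIXES and LEMMAS are made-up examples,
# not the rules used by NLTK's PorterStemmer or WordNetLemmatizer.
SUFFIXES = ("ness", "ing", "s")
LEMMAS = {"means": "mean", "geese": "goose"}


def toy_stem(word):
    """Chop off the first matching suffix, with no knowledge of the vocabulary."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in a dictionary of known inflected forms."""
    return LEMMAS.get(word, word)


for word in ["mean", "means", "meaning", "meanness", "goose", "geese"]:
    print(f"{word:10} stem={toy_stem(word):8} lemma={toy_lemmatize(word)}")
```

The suffix rules need no vocabulary and collapse "meaning" and "meanness" into "mean", but they cannot connect "geese" to "goose". The dictionary lookup can, at the cost of maintaining and searching a vocabulary.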

In [3]:
stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()
wn = nltk.WordNetLemmatizer()

def clean_text(text):
    # 1. Remove punctuation & lowercase text (iterates character by character)
    text = "".join([char.lower() for char in text if char not in string.punctuation])

    # 2. Tokenization: split on runs of non-word characters
    tokens = re.split(r'\W+', text)

    # 3.a. Remove stopwords & stem the remaining tokens
    text = [ps.stem(word) for word in tokens if word not in stopwords]

    # 3.b. Alternative to 3.a: remove stopwords & lemmatize
    # text = [wn.lemmatize(word) for word in tokens if word not in stopwords]

    return text
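
One detail of the tokenization step is worth checking: `re.split` returns an empty string when the text starts or ends with a non-word character, and an empty token is not in the stopword list. The sketch below reproduces steps 1 and 2 with the standard library only (no stemming) and shows `re.findall` as one possible alternative. The helper names and the sample message are hypothetical.

```python
import re
import string


def strip_and_lower(text):
    """Step 1: drop punctuation characters and lowercase the rest."""
    return "".join(ch.lower() for ch in text if ch not in string.punctuation)


def tokenize_split(text):
    """Step 2 as in clean_text: split on runs of non-word characters."""
    return re.split(r"\W+", strip_and_lower(text))


def tokenize_findall(text):
    """Alternative: collect runs of word characters, which never yields ''."""
    return re.findall(r"\w+", strip_and_lower(text))


sample = "WINNER!! Claim now... "  # made-up message with a trailing space
print(tokenize_split(sample))    # includes a trailing empty string
print(tokenize_findall(sample))  # only real tokens
```

Whether the empty token matters depends on the data; if it does, filtering out empty strings or switching to `re.findall` are both reasonable options.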

## Feature engineering

Create new features or transform existing features to add dimensions to the feature set. In this case, we are creating 2 new features: length of text and punctuation percentage.

In [4]:
# Compute the percentage of non-space characters in a string that are punctuation
def count_punct(text):
    count = sum([1 for char in text if char in string.punctuation])
    return round(count/(len(text) - text.count(" ")), 3)*100

# Create new features: length of text and punctuation percentage
data['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(" "))
data['punct%'] = data['body_text'].apply(lambda x: count_punct(x))
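
To make the two features concrete, here is a worked example on one message from the sample shown earlier. It is a standalone sketch: the function names are hypothetical, and the zero-length guard is an addition that the cell above does not have.

```python
import string


def non_space_len(text):
    """Number of characters in text, not counting spaces."""
    return len(text) - text.count(" ")


def punct_percent(text):
    """Percentage of non-space characters that are punctuation."""
    length = non_space_len(text)
    if length == 0:  # guard added for this sketch; not in the cell above
        return 0.0
    count = sum(1 for char in text if char in string.punctuation)
    return round(100 * count / length, 1)


message = "I HAVE A DATE ON SUNDAY WITH WILL!!"
print(non_space_len(message), punct_percent(message))
```

The message has 28 non-space characters, 2 of which are punctuation, so its punctuation percentage is about 7.1.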

## Train/Test split

Split the data into a training set (80%) and a test set (20%).

In [5]:
X_train, X_test, y_train, y_test = train_test_split(
    data[['body_text', 'body_len', 'punct%']],
    data['label'],
    test_size=0.2)
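
As a rough picture of what an 80/20 split does, the sketch below shuffles row indices with a fixed seed and holds out 20% of each class. It uses only the standard library, and the function name and sample labels are hypothetical. In scikit-learn, `train_test_split` accepts `random_state` and `stratify` arguments for the same purposes; the cell above sets neither, so its split differs from run to run.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), holding out test_size of each class."""
    rng = random.Random(seed)  # fixed seed makes the split repeatable
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = ["ham"] * 40 + ["spam"] * 10  # made-up class balance
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Holding out the same fraction of each class keeps the spam rate in the test set close to the rate in the full data, which can matter when one class is much rarer than the other.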

## Vectorization

Vectorization is the process of encoding text as numbers to create feature vectors. There are many text vectorization techniques, including [count vectorization](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer), [n-gram vectorization](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn-feature-extraction-text-countvectorizer) (see `ngram_range`), and [term frequency-inverse document frequency (TF-IDF) weighting](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer).

- Count vectorization: creates a document-term matrix where the entry of each cell will be a count of the number of times that word occurred in that document

  - Example:

![vectorization_example.jpg](./doc/vectorization_example.jpg)

- N-gram vectorization: creates a document-term matrix where counts still occupy the cell but instead of the columns representing single terms, they represent all combinations of adjacent words of length n in your text.

  - Example: "NLP is an interesting topic"

| n | Name      | Tokens                                                     |
|---|-----------|------------------------------------------------------------|
| 2 | bigram    | ["nlp is", "is an", "an interesting", "interesting topic"] |
| 3 | trigram   | ["nlp is an", "is an interesting", "an interesting topic"] |
| 4 | four-gram | ["nlp is an interesting", "is an interesting topic"]       |

- Term frequency-inverse document frequency (TF-IDF): creates a document-term matrix where the columns represent single unique terms (unigrams), but each cell holds a weight reflecting how important a word is to a document relative to the rest of the corpus.

![tf-idf.jpg](./doc/tf-idf.jpg)
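
The three schemes can be sketched in plain Python. This is an illustration under stated assumptions, not scikit-learn's implementation: tokens are split on whitespace, the sample documents are made up, and the weight is the textbook `tf * log(N / df)`, whereas `TfidfVectorizer` by default uses a smoothed IDF and L2-normalizes each row.

```python
import math
from collections import Counter


def ngrams(text, n=1):
    """Return every run of n adjacent whitespace-separated tokens."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_matrix(docs, n=1):
    """Document-term matrix of raw counts: one row per document."""
    counts = [Counter(ngrams(doc, n)) for doc in docs]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[term] for term in vocab] for c in counts]


def tfidf_matrix(docs):
    """Weight each unigram count by tf * log(N / df)."""
    vocab, rows = count_matrix(docs)
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs
    doc_freq = [sum(1 for row in rows if row[j] > 0) for j in range(len(vocab))]
    weighted = []
    for row in rows:
        total = sum(row)  # number of tokens in this document
        weighted.append([
            (count / total) * math.log(n_docs / df)
            for count, df in zip(row, doc_freq)
        ])
    return vocab, weighted


docs = [  # made-up sample documents
    "NLP is an interesting topic",
    "spam is an annoying topic",
    "NLP is fun",
]
print(ngrams(docs[0], 2))
vocab, weights = tfidf_matrix(docs)
print(dict(zip(vocab, [round(w, 3) for w in weights[2]])))
```

In this sample, "is" occurs in every document, so its weight is zero everywhere, while "fun" occurs only in the third document and gets the largest weight there.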

For this notebook, we will be using TF-IDF weighting.

In [6]:
tfidf_vect = TfidfVectorizer(analyzer=clean_text)
tfidf_vect_fit = tfidf_vect.fit(X_train['body_text'])

tfidf_train = tfidf_vect_fit.transform(X_train['body_text'])
tfidf_test = tfidf_vect_fit.transform(X_test['body_text'])

X_train_vect = pd.concat([X_train[['body_len', 'punct%']].reset_index(drop=True), 
                         pd.DataFrame(tfidf_train.toarray())], axis=1)
X_test_vect = pd.concat([X_test[['body_len', 'punct%']].reset_index(drop=True), 
                        pd.DataFrame(tfidf_test.toarray())], axis=1)

X_train_vect.head()

Unnamed: 0,body_len,punct%,0,1,2,3,4,5,6,7,...,7206,7207,7208,7209,7210,7211,7212,7213,7214,7215
0,41,2.4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,72,2.8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,123,5.7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,21,9.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,124,6.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Modeling

Random Forest: ensemble learning method that constructs a collection of decision trees and then aggregates the predictions of each tree to determine the final prediction (bagging)

Gradient Boosting: ensemble learning method that takes an iterative approach to combining weak learners to create a strong learner by focusing on mistakes of prior iterations (boosting)
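
The difference between bagging and boosting can be sketched on a toy problem. The code below is an illustration under simplifying assumptions, not what scikit-learn does internally: it solves a 1-D regression task with one-split "stumps" instead of full trees, the data is made up, and the bagging half omits the per-split feature sampling that a random forest adds on top of bagging.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression stump on 1-D data (least squares)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is non-empty
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # every x is identical, so predict the mean
        mean = sum(ys) / len(ys)
        return lambda x: mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def bagging(xs, ys, n_estimators=25, seed=0):
    """Bagging: fit each stump on a bootstrap sample, then average them."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_estimators):
        sample = [rng.randrange(len(xs)) for _ in xs]  # draw with replacement
        stumps.append(fit_stump([xs[i] for i in sample], [ys[i] for i in sample]))
    return lambda x: sum(stump(x) for stump in stumps) / len(stumps)


def boosting(xs, ys, n_estimators=25, learning_rate=0.1):
    """Boosting: fit each stump to the residuals left by the previous ones."""
    base = sum(ys) / len(ys)
    stumps = []

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in stumps)

    for _ in range(n_estimators):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict


xs = list(range(10))
ys = [0.0] * 5 + [1.0] * 5  # made-up step function
forest_like = bagging(xs, ys)
boosted = boosting(xs, ys)
print(round(forest_like(0), 3), round(forest_like(9), 3))
print(round(boosted(0), 3), round(boosted(9), 3))
```

In the bagging half the stumps are independent of each other and could be fit in parallel. In the boosting half each stump depends on the errors of all previous ones, so the fitting is inherently sequential.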

### Comparison of Random Forest and Gradient Boosting

![tradeoff_randomforest_gradientboosting.jpg](./doc/tradeoff_randomforest_gradientboosting.jpg)

In this notebook, we will train both random forest and gradient boosting models for comparison. 

In [7]:
# Gridsearch for random forest classifier
rf = RandomForestClassifier()
param = {
    'n_estimators': [10, 150, 300],
    'max_depth': [30, 60, 90, None]
}

rfclf = GridSearchCV(rf, param, cv=5, n_jobs=-1)
rfclf_fit = rfclf.fit(X_train_vect, y_train)

df_rfclf_cv_results = pd.DataFrame(rfclf_fit.cv_results_) \
    .sort_values('mean_test_score', ascending=False)[0:5]
df_rfclf_cv_results



|   | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_max_depth | param_n_estimators | params | split0_test_score | split1_test_score | split2_test_score | ... | mean_test_score | std_test_score | rank_test_score | split0_train_score | split1_train_score | split2_train_score | split3_train_score | split4_train_score | mean_train_score | std_train_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | 15.866139 | 0.262468 | 0.19987 | 0.009523 | 90.0 | 300 | {'max_depth': 90, 'n_estimators': 300} | 0.967452 | 0.970819 | 0.971942 | ... | 0.972378 | 0.004516 | 1 | 0.999439 | 0.999439 | 0.999158 | 0.999439 | 0.999439 | 0.999382 | 0.000112 |
| 11 | 16.584056 | 0.44137 | 0.190365 | 0.008809 |  | 300 | {'max_depth': None, 'n_estimators': 300} | 0.968575 | 0.970819 | 0.971942 | ... | 0.972154 | 0.003972 | 2 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 7 | 8.026667 | 0.070631 | 0.122778 | 0.010428 | 90.0 | 150 | {'max_depth': 90, 'n_estimators': 150} | 0.968575 | 0.970819 | 0.971942 | ... | 0.971704 | 0.004339 | 3 | 0.999158 | 0.999439 | 0.999439 | 0.999439 | 0.999439 | 0.999382 | 0.000112 |
| 10 | 8.647201 | 0.254553 | 0.121825 | 0.003741 |  | 150 | {'max_depth': None, 'n_estimators': 150} | 0.969697 | 0.970819 | 0.969697 | ... | 0.97148 | 0.003055 | 4 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 4 | 6.62755 | 0.172325 | 0.101679 | 0.004706 | 60.0 | 150 | {'max_depth': 60, 'n_estimators': 150} | 0.964085 | 0.969697 | 0.973064 | ... | 0.971031 | 0.004777 | 5 | 0.995508 | 0.994385 | 0.994947 | 0.993825 | 0.994106 | 0.994554 | 0.000604 |


In [8]:
# Gridsearch for gradient boosting classifier
gb = GradientBoostingClassifier()
param = {
    'n_estimators': [100, 150], 
    'max_depth': [7, 11, 15],
    'learning_rate': [0.1]
}

gbclf = GridSearchCV(gb, param, cv=5, n_jobs=-1)
gbclf_fit = gbclf.fit(X_train_vect, y_train)

df_gbclf_cv_results = pd.DataFrame(gbclf_fit.cv_results_) \
    .sort_values('mean_test_score', ascending=False)[0:5]
df_gbclf_cv_results



## Model evaluation

Refit the random forest and gradient boosting models with the best hyperparameters found by the grid search, then evaluate each on the held-out test data.

In [16]:
best_rfclf_hyperparam = df_rfclf_cv_results.head(1)['params'].values[0]
print(best_rfclf_hyperparam)

rf = RandomForestClassifier(**best_rfclf_hyperparam, n_jobs=-1)

start = time.time()
rf_model = rf.fit(X_train_vect, y_train)
end = time.time()
fit_time = (end - start)

start = time.time()
y_pred = rf_model.predict(X_test_vect)
end = time.time()
pred_time = (end - start)

precision, recall, fscore, train_support = score(y_test, y_pred, pos_label='spam', average='binary')
print('Fit time: {} / Predict time: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(
    round(fit_time, 3),
    round(pred_time, 3),
    round(precision, 3),
    round(recall, 3),
    round((y_pred==y_test).sum()/len(y_pred), 3)))

{'max_depth': 90, 'n_estimators': 300}
Fit time: 5.311 / Predict time: 0.192 ---- Precision: 1.0 / Recall: 0.824 / Accuracy: 0.976


In [17]:
best_gbclf_hyperparam = df_gbclf_cv_results.head(1)['params'].values[0]
print(best_gbclf_hyperparam)

gb = GradientBoostingClassifier(**best_gbclf_hyperparam)

start = time.time()
gb_model = gb.fit(X_train_vect, y_train)
end = time.time()
fit_time = (end - start)

start = time.time()
y_pred = gb_model.predict(X_test_vect)
end = time.time()
pred_time = (end - start)

precision, recall, fscore, train_support = score(y_test, y_pred, pos_label='spam', average='binary')
print('Fit time: {} / Predict time: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(
    round(fit_time, 3),
    round(pred_time, 3),
    round(precision, 3),
    round(recall, 3),
    round((y_pred==y_test).sum()/len(y_pred), 3)))

{'learning_rate': 0.1, 'max_depth': 11, 'n_estimators': 150}
Fit time: 240.042 / Predict time: 0.084 ---- Precision: 0.942 / Recall: 0.85 / Accuracy: 0.972
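
To read these numbers, it helps to see how precision, recall, and accuracy are derived when "spam" is treated as the positive class. The sketch below computes them by hand on made-up labels; the function name is hypothetical, and the notebook itself relies on scikit-learn's `precision_recall_fscore_support`.

```python
def binary_metrics(y_true, y_pred, pos_label="spam"):
    """Precision, recall and accuracy, treating pos_label as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0  # flagged messages that are spam
    recall = tp / (tp + fn) if tp + fn else 0.0     # spam messages that were flagged
    return precision, recall, correct / len(pairs)


# Made-up labels for illustration
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
y_pred = ["spam", "spam", "ham", "ham", "ham", "ham", "ham", "spam"]
print(binary_metrics(y_true, y_pred))
```

Read this way, the random forest result above (precision 1.0, recall 0.824) means no ham message in this test set was flagged as spam, but roughly 18% of spam got through. The gradient boosting model caught slightly more spam (recall 0.85) at the cost of some false alarms (precision 0.942) and a much longer fit time.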
