## Machine Learning Online Class 
##  Exercise 6 - Part 2 | Spam Classification with SVMs
Requires: numpy, pandas, scipy, scikit-learn, nltk <br />
For nltk <br />
pip install nltk <br />
then in a python console : <br />
import nltk <br />
nltk.download() <br />
Choose all packages

### Introduction

Many email services today provide spam filters that are able to classify emails into spam and non-spam email with high accuracy.<br><br>
In this part of the exercise, we will use SVMs to build our own spam filter.<br>
We will be training a classifier to classify whether a given email, x, is spam ($y = 1$) or non-spam ($y = 0$). <br>
In particular, you need to convert each email into a feature vector $x \in \mathbb{R}^n$. <br>
The following parts of the exercise will walk through how such a feature vector can be constructed from an email.<br><br>
The dataset included for this exercise is based on a subset of the SpamAssassin Public Corpus.<br>
For the purpose of this exercise, we will only be using the body of the email (excluding the email headers).

### Python Imports

In [4]:
import warnings
warnings.filterwarnings('ignore')

import string
import numpy as np
import pandas as pd
from scipy.io import loadmat
from sklearn.svm import LinearSVC 
from sklearn.metrics import accuracy_score
import re               # regexp 
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from IPython.display import Image

### Part 1: Email processing 

In [5]:
file_contents = open('emailSample1.txt', 'r').read()

Before starting on a machine learning task, it is usually insightful to take a look at examples from the dataset. <br>
The figure below shows a sample email that contains a URL, an email address (at the end), numbers, and dollar
amounts.

![title](SampleEmail.png)

While many emails would contain similar types of entities (e.g. numbers, other URLs, or other email addresses), the specific entities (e.g., the specific URL or specific dollar amount) will be different in almost every email. <br><br>
Therefore, one method often employed in processing emails is to "normalize" these values, so that all URLs are treated the same, all numbers are treated the same, etc. <br>
For example, we could replace each URL in the email with the unique string "httpaddr" to indicate that a URL was present.
This has the effect of letting the spam classifier make a classification decision based on whether any URL was present, rather than whether a specific URL was present. <br><br>
This typically improves the performance of a spam classifier, since spammers often randomize the URLs, and thus the odds of seeing any particular URL again in a new piece of spam are very small.

In [6]:
def preProcessEmail(email_contents):
    
    #Preprocess email
    
    # Lower case
    email_contents = email_contents.lower()
    
    # Strip all HTML
    # Looks for any expression that starts with < and ends with >,
    # and does not contain any < or > inside the tag, and replaces it with a space
    email_contents = re.sub('<[^<>]+>',' ', email_contents)
    
    # Handle numbers
    # Look for one or more characters between 0-9 and replace by 'number'
    email_contents = re.sub('[0-9]+','number', email_contents)
    
    # Handle URLS
    # Look for strings starting with http:// or https:// and replace by httpaddr
    email_contents = re.sub(r'(http|https)://[^\s]*','httpaddr', email_contents)
    
    # Handle Email Addresses
    # Look for strings with @ in the middle and replace by emailaddr
    email_contents = re.sub(r'[^\s]+@[^\s]+','emailaddr', email_contents)
    
    # Handle $ sign
    email_contents = re.sub('[$]+','dollar', email_contents)
   
    # Get rid of special characters
    email_contents = re.sub('[^a-zA-Z0-9]',' ', email_contents)
    
    return email_contents    

In [7]:
email_preprocessed = preProcessEmail(file_contents)
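
To see the effect of normalization in isolation, the short sketch below applies the same number and URL substitutions to two made-up strings. The example text and the helper name `normalize` are illustrative assumptions, not part of the exercise. Two emails that differ only in their URL and digits collapse to the same text.

```python
import re

def normalize(text):
    # Hypothetical helper: mirrors the lower-casing, number and URL steps of preProcessEmail
    text = text.lower()
    text = re.sub(r'[0-9]+', 'number', text)
    text = re.sub(r'(http|https)://[^\s]*', 'httpaddr', text)
    return text

a = normalize('Visit http://abc123.example.com now for 50 deals')
b = normalize('Visit https://xyz.example.org now for 7 deals')
print(a)       # visit httpaddr now for number deals
print(a == b)  # True
```

Because both strings end up identical, the classifier only sees that a URL and a number were present, not which ones.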

Then we tokenize the email, i.e. split the text into a list of words, and stem the words using the Porter stemmer
(https://en.wikipedia.org/wiki/Stemming).

In [8]:
def tokenizeEmail(email_contents):
    # Tokenize email, ie split the text into a list of words
    # And stem the words (https://en.wikipedia.org/wiki/Stemming)
    
    # NLTK word_tokenize method
    tokenized_email = word_tokenize(email_contents)

    # Stem the words using Porter stemmer
    stemmer = PorterStemmer()
    return [stemmer.stem(word) for word in tokenized_email]
    

In [9]:
email_tokenized = tokenizeEmail(email_preprocessed)

After preprocessing the emails, we have a list of words for each email, for example :<br><br>
![title](PreprocessedSample.png)
<br>
The next step is to choose which words we would like to use in our classifier and which we would want to leave out.<br>
For this exercise, we have chosen only the most frequently occurring words as our set of words considered (the vocabulary list).<br> Since words that occur rarely in the training set are only in a few emails, they might cause the model to overfit our training set.
The complete vocabulary list is in the file <i>vocab.txt</i>
![title](VocabList.png)
<br>
Our vocabulary list was selected by choosing all words which occur at least 100 times in the spam corpus, resulting in a list of 1899 words. <br>
In practice, a vocabulary list with about 10,000 to 50,000 words is often used.<br>
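
A frequency threshold of this kind can be sketched with `collections.Counter`. This is only an illustration on a toy corpus: the function name `build_vocabulary`, the threshold of 2, and the alphabetical ordering are assumptions, and the exercise does not show how `vocab.txt` was actually produced.

```python
from collections import Counter

def build_vocabulary(tokenized_emails, min_count):
    # Count how often each token occurs across all emails
    counts = Counter(tok for email in tokenized_emails for tok in email)
    # Keep tokens that reach the threshold, sorted alphabetically (an assumption)
    return sorted(tok for tok, c in counts.items() if c >= min_count)

toy_corpus = [
    ['click', 'here', 'to', 'win'],
    ['click', 'to', 'unsubscrib'],
    ['meet', 'here', 'to', 'click'],
]
vocab = build_vocabulary(toy_corpus, min_count=2)
print(vocab)  # ['click', 'here', 'to']
```
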
Given the vocabulary list, we can now map each word in the preprocessed emails into a list of word indices that contains the index of the word in the vocabulary list. <br>
![title](WordIndices.png)
<br>
Specifically, in the sample email, the word "anyone" was first normalized to "anyon" and then mapped onto the index 86 in the vocabulary list.<br>

In [10]:
def indexEmail(email_tokenized):
       
    #Load vocabulary
    vocabList = pd.read_csv('vocab.txt', delimiter = '\t' , header = None).values
    
    #Return indices of words contained in vocabulary
    # (.item() replaces np.asscalar, which was removed in NumPy 1.23)
    matches = (np.argwhere(vocabList[:,1] == w) for w in email_tokenized)
    return [m.item() for m in matches if m.size > 0]

In [11]:
word_indices = indexEmail(email_tokenized)
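
The lookup in `indexEmail` scans the whole vocabulary array once per token. A possible alternative, sketched here on a toy vocabulary, builds a dictionary once and then looks each token up directly. The names `index_tokens` and `toy_vocab` are hypothetical. As in `indexEmail`, the returned positions are 0-based row positions, so they would be one less than a 1-based numbering in the vocabulary file.

```python
def index_tokens(tokens, vocab_words):
    # Map each vocabulary word to its 0-based position, built once
    word_to_index = {w: i for i, w in enumerate(vocab_words)}
    # Keep only tokens that are in the vocabulary, in email order
    return [word_to_index[t] for t in tokens if t in word_to_index]

toy_vocab = ['anyon', 'click', 'dollar', 'httpaddr', 'number']
tokens = ['anyon', 'know', 'how', 'much', 'dollar', 'number', 'click', 'httpaddr']
indices = index_tokens(tokens, toy_vocab)
print(indices)  # [0, 2, 4, 1, 3]
```

Tokens that are not in the vocabulary ("know", "how", "much") are simply dropped, which matches the filtering done in `indexEmail`.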

### Part 2: Feature Extraction

We will now implement the feature extraction that converts each email into a vector in $\mathbb{R}^n$.
For this exercise, we will be using $n$ = number of words in the vocabulary list.
Specifically, the feature $x_i \in \{0,1\}$ for an email corresponds to whether the $i$-th word in the dictionary occurs in the email. That is, $x_i = 1$ if the $i$-th word is in the email and $x_i = 0$ if the $i$-th word is not present in the email.

Thus, for a typical email, this feature would look like:
$$ x=
\quad
\begin{bmatrix} 
0 \\
\vdots \\
\vdots  \\
1 \\
0 \\
\vdots\\
1
\end{bmatrix}
\in \mathbb{R}^n
$$

In [12]:
def emailFeatures(word_indices):
    
    #takes in a word_indices vector and produces a feature vector from the word indices
    
    n = 1899 #Total number of words in the dictionary
    
    word_indices = np.array(word_indices)
    
    feat = np.zeros(n)
    
    for i in range(word_indices.size):
        feat[word_indices[i]] = 1
    
    return feat

In [13]:
features = emailFeatures(word_indices)

In [14]:
print('Length of feature vector : {:d}'.format(features.size))
print('Number of non-zero entries : {:d}'.format(np.sum(features > 0)))

Length of feature vector : 1899
Number of non-zero entries : 45
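
A toy version of the same construction, written without NumPy, makes one property easy to see: a word that occurs several times still sets a single entry to 1, so the number of non-zero entries can be smaller than the number of indexed words. The function name `binary_features` and the tiny dimension `n=6` are illustrative assumptions.

```python
def binary_features(word_indices, n):
    # Start with all zeros, then switch on every index that occurs
    feat = [0] * n
    for i in word_indices:
        feat[i] = 1
    return feat

# Index 3 appears twice but only contributes one non-zero entry
x = binary_features([0, 3, 3, 1], n=6)
print(x)       # [1, 1, 0, 1, 0, 0]
print(sum(x))  # 3
```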


### Part 3: Train Linear SVM for Spam Classification

In [15]:
data = loadmat('spamTrain.mat')
X = data['X'] # X is already preformatted with feature vectors of 0 and 1
y = data['y'].ravel()

In [16]:
# declare linear SVM model
model = LinearSVC(tol = 1e-3, C = 0.1)

In [17]:
# fit on training data
model.fit(X,y)

LinearSVC(C=0.1, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.001,
     verbose=0)

In [18]:
print('Training accuracy : {:2.2f} %'.format(accuracy_score(y,model.predict(X))*100))

Training accuracy : 99.98 %


### Part 4: Test Spam Classification
After training the classifier, we can evaluate it on a test set. We have
included a test set in spamTest.mat

In [19]:
data_test = loadmat('spamTest.mat')
Xtest = data_test['Xtest']
ytest = data_test['ytest'].ravel()

In [20]:
print('Test accuracy : {:2.2f} %'.format(accuracy_score(ytest,model.predict(Xtest))*100))

Test accuracy : 99.20 %
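
For reference, the accuracy reported above is the fraction of emails whose predicted label equals the true label. The sketch below computes it by hand on made-up labels; the function name `accuracy` and the label lists are illustrative, not data from the exercise.

```python
def accuracy(y_true, y_pred):
    # Fraction of positions where the prediction equals the label
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

y_true = [1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0]
print('{:2.2f} %'.format(accuracy(y_true, y_pred) * 100))  # 80.00 %
```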


### Part 5: Top Predictors of Spam
Since the model we are training is a linear SVM, we can inspect the
weights learned by the model to understand better how it is determining
whether an email is spam or not. The following code finds the words with
the highest weights in the classifier. Informally, the classifier
'thinks' that these words are the most likely indicators of spam.
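
The reasoning can be made concrete with a toy linear model. A linear classifier scores an email as $w \cdot x + b$ and predicts spam when the score is positive, so each word present in the email adds its weight to the score. The vocabulary, weights and bias below are made up for illustration and are not the values learned in this exercise.

```python
def decision_value(weights, bias, x):
    # Linear score: w . x + b
    return sum(w * xi for w, xi in zip(weights, x)) + bias

vocab = ['click', 'meet', 'dollar', 'thank']
weights = [0.5, -0.3, 0.4, -0.2]   # hypothetical learned weights
bias = -0.1                        # hypothetical intercept

# An email containing "click" and "dollar"
x = [1, 0, 1, 0]
score = decision_value(weights, bias, x)
label = 1 if score > 0 else 0
print(label)  # 1

# Words ranked by weight, largest first
ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
top_words = [vocab[i] for i in ranked[:2]]
print(top_words)  # ['click', 'dollar']
```

Words with large positive weights push the score towards spam, while words with negative weights push it towards non-spam.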


In [21]:
# Sort weights and store associated indices
weights = np.sort(model.coef_[0,:], axis = 0)
idx = np.argsort(model.coef_[0,:], axis = 0)

# Reverse order (so that it is in descending order)
weights = weights[::-1]
idx = idx[::-1]

In [22]:
# Retrieve vocabulary list
vocabList = pd.read_csv('vocab.txt', delimiter = '\t' , header = None).values

In [23]:
# Top 15 predictors of spam
for i in range(15):
    # idx[i] is the vocabulary row of the i-th largest weight (its value is weights[i])
    print('Top {:d} predictor : {}'.format(i+1,vocabList[idx[i],1]))


Top 1 predictor : our
Top 2 predictor : remov
Top 3 predictor : click
Top 4 predictor : basenumb
Top 5 predictor : guarante
Top 6 predictor : visit
Top 7 predictor : bodi
Top 8 predictor : will
Top 9 predictor : numberb
Top 10 predictor : price
Top 11 predictor : dollar
Top 12 predictor : nbsp
Top 13 predictor : below
Top 14 predictor : lo
Top 15 predictor : most


### Part 6: Try Your Own Emails

In [24]:
# Read email
file_contents = open('spamSample1.txt', 'r').read()
print(file_contents)

Do You Want To Make $1000 Or More Per Week?

 

If you are a motivated and qualified individual - I 
will personally demonstrate to you a system that will 
make you $1,000 per week or more! This is NOT mlm.

 

Call our 24 hour pre-recorded number to get the 
details.  

 

000-456-789

 

I need people who want to make serious money.  Make 
the call and get the facts. 

Invest 2 minutes in yourself now!

 

000-456-789

 

Looking forward to your call and I will introduce you 
to people like yourself who
are currently making $10,000 plus per week!

 

000-456-789



3484lJGv6-241lEaN9080lRmS6-271WxHo7524qiyT5-438rjUv5615hQcf0-662eiDB9057dMtVl72




In [25]:
# Transform to a feature vector
email_preprocessed = preProcessEmail(file_contents)
email_tokenized = tokenizeEmail(email_preprocessed)
word_indices = indexEmail(email_tokenized)

x = emailFeatures(word_indices)

# Predict using previously trained model
pred = model.predict(x.reshape(1,-1))[0]

print('Spam Classification : {:d}'.format(pred))
print('(1 indicates spam, 0 indicates not spam)')

Spam Classification : 1
(1 indicates spam, 0 indicates not spam)


## END