# Support Vector Machine Spam Classification

revision: dcfbda7

Many email services today provide spam filters that are able to classify emails
into spam and non-spam email with high accuracy. In this part of the exercise,
you will use SVMs to build your own spam filter. You will be training a classifier to classify whether a given email, $x$, is spam ($y = 1$) or non-spam ($y = 0$). In particular, you need to convert each
email into a feature vector $x \in \mathbb{R}^n$. The following parts of the exercise will walk you through how such a feature vector can be constructed from an email.


*References:* These exercises are based on the Stanford Machine Learning Course [CS229](http://cs229.stanford.edu) of Andrew Ng.

In [None]:
# @formatter:off
# PREAMBLE
import re
import nltk
import numpy as np
import pandas as pd
import scipy.io as si
import seaborn as sns

%matplotlib inline
sns.set_context("notebook", font_scale=1.1)
sns.set_style("ticks")
%load_ext autoreload
%autoreload 2
# @formatter:on

## Preprocessing Emails

Before starting on a machine learning task, it is usually insightful to
take a look at examples from the dataset.

```
> Anyone knows how much it costs to host a web portal ?
>
Well, it depends on how many visitors youre expecting. This can be
anywhere from less than 10 bucks a month to a couple of $100. You
should checkout http://www.rackspace.com/ or perhaps Amazon EC2 if
youre running something big..
To unsubscribe yourself from this mailing list, send an email to:
groupname-unsubscribe@egroups.com
```

This sample email contains a URL, an email address (at the end), numbers, and dollar
amounts. While many emails would contain similar types of entities (e.g.,
numbers, other URLs, or other email addresses), the specific entities (e.g.,
the specific URL or specific dollar amount) will be different in almost every
email. Therefore, one method often employed in processing emails is to
“normalize” these values, so that all URLs are treated the same, all numbers
are treated the same, etc. For example, we could replace each URL in the
email with the unique string “httpaddr” to indicate that a URL was present.
This has the effect of letting the spam classifier make a classification decision
based on whether any URL was present, rather than whether a specific URL
was present. This typically improves the performance of a spam classifier,
since spammers often randomize the URLs, and thus the odds of seeing any
particular URL again in a new piece of spam are very small.
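
As a small illustration of why normalization helps, the sketch below replaces URLs in two made-up messages. The helper name `normalize_urls` and both example messages are hypothetical and exist only for this illustration; the notebook's own `normalizeEmail` handles more cases.

```python
import re


def normalize_urls(text):
    # Replace everything from http:// or https:// up to the next
    # whitespace character with the fixed token 'httpaddr'.
    return re.sub(r'https?://\S*', 'httpaddr', text)


# Two made-up messages that differ only in their (randomized) URL.
spam_a = "buy now at http://cheap-pills.example/x7f3"
spam_b = "buy now at https://another-domain.example/q9z1"

print(normalize_urls(spam_a))  # buy now at httpaddr
print(normalize_urls(spam_b))  # buy now at httpaddr
```

After normalization both messages are identical, so a classifier sees the same feature ("a URL is present") no matter which address the sender used.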

In `normalizeEmail`, we have implemented the following email preprocessing
and normalization steps:

* **Lower-casing**: The entire email is converted into lower case, so
that capitalization is ignored (e.g., `IndIcaTE` is treated the same as
`Indicate`).

* **Stripping HTML**: All HTML tags are removed from the emails.
Many emails come with HTML formatting; we remove all the
HTML tags, so that only the content remains.

* **Normalizing URLs**: All URLs are replaced with the text "`httpaddr`".

* **Normalizing Email Addresses**: All email addresses are replaced
with the text "`emailaddr`".

* **Normalizing Numbers**: All numbers are replaced with the text "`number`".

* **Normalizing Dollars**: All dollar signs ($) are replaced with the text "`dollar`".

* **Word Stemming**: Words are reduced to their stemmed form. For example,
“discount”, “discounts”, “discounted” and “discounting” are all
replaced with “discount”. Sometimes, the stemmer strips off
additional characters from the end, so “include”, “includes”, “included”,
and “including” are all replaced with “includ”.

* **Removal of non-words**: Non-words and punctuation are removed.
All whitespace (tabs, newlines, spaces) is collapsed
to a single space character.

In [None]:
def normalizeEmail(email_contents):
    # the result
    normalized = []

    # lower case
    email_contents = email_contents.lower()

    # strip all HTML
    email_contents = re.sub('<[^<>]+>', ' ', email_contents)

    # Handle numbers
    email_contents = re.sub('[0-9]+', 'number', email_contents)

    # Handle URLs (raw strings keep \s as a regex escape, not a string escape)
    email_contents = re.sub(r'(http|https)://[^\s]*', 'httpaddr', email_contents)

    # Handle email addresses
    email_contents = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', email_contents)

    # handle $ sign
    email_contents = re.sub('[$]+', 'dollar', email_contents)

    # tokenize on any whitespace (spaces, tabs, newlines) and on punctuation
    tokens = re.split(r'[\s' +
                      re.escape("@$/#.-:&*+=[]?!(){},'\">_<;%") + ']',
                      email_contents)

    stemmer = nltk.stem.PorterStemmer()
    for token in tokens:
        token = re.sub('[^a-zA-Z0-9]', '', token)
        token = stemmer.stem(token.strip())
        if len(token) > 0:
            normalized.append(token)

    return normalized

The result of these preprocessing steps is:
```
anyon know how much it cost to host a web portal well it depend on how 
mani visitor your expect thi can be anywher from less than number buck 
a month to a coupl of dollarnumb you should checkout httpaddr or perhap 
amazon ecnumb if your run someth big to unsubscrib yourself from thi 
mail list send an email to emailaddr
```

In [None]:
sample_email = """> Anyone knows how much it costs to host a web portal ?
>
Well, it depends on how many visitors youre expecting. This can be 
anywhere from less than 10 bucks a month to a couple of $100. You 
should checkout http://www.rackspace.com/ or perhaps Amazon EC2 if 
youre running something big..

To unsubscribe yourself from this mailing list, send an email to: 
groupname-unsubscribe@egroups.com"""

normalized_sample = normalizeEmail(sample_email)
print(" ".join(normalized_sample))

While preprocessing has left word fragments and non-words, this form turns out to be
much easier to work with for performing feature extraction.

## Vocabulary List

After preprocessing the emails, we have a list of words for
each email. The next step is to choose which words we would like to use in
our classifier and which we would want to leave out. 

For this exercise, we have chosen only the most frequently occurring words
as our set of words considered (the vocabulary list). Since words that occur
rarely in the training set appear in only a few emails, they might cause the
model to overfit our training set. The complete vocabulary list is in the file
`vocab.txt`.

In [None]:
# read vocab file
def loadVocabulary():
    return pd.read_csv('vocab.txt', sep='\t', header=None).values


vocabList = loadVocabulary()
print(vocabList)

Our vocabulary list was selected by choosing all words that occur at least 100 times in the spam corpus,
resulting in a list of 1899 words. In practice, a vocabulary list with about 10,000 to 50,000 words is often used.
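
The selection rule can be sketched as follows. This is not how `vocab.txt` was actually produced: the helper `build_vocabulary`, the toy corpus, and the threshold of 2 are assumptions made for illustration, as is the alphabetical ordering of the result. The 1-based indices mirror the way the vocabulary is indexed later in this notebook.

```python
from collections import Counter


def build_vocabulary(tokenized_emails, min_count):
    # Count how often each token occurs across the whole corpus.
    counts = Counter(token for email in tokenized_emails for token in email)
    # Keep only tokens that reach the threshold, sorted alphabetically.
    words = sorted(word for word, n in counts.items() if n >= min_count)
    # Assign 1-based indices to the surviving words.
    return {word: index for index, word in enumerate(words, start=1)}


# Toy corpus of already-normalized emails (hypothetical data).
corpus = [
    ["click", "here", "to", "win", "dollar"],
    ["win", "dollar", "now"],
    ["meet", "here", "now"],
]

vocab = build_vocabulary(corpus, min_count=2)
print(vocab)  # {'dollar': 1, 'here': 2, 'now': 3, 'win': 4}
```

Words such as "click" or "meet" occur only once in this toy corpus and are dropped, which is the same effect the threshold of 100 has on rare words in the real corpus.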

Given the vocabulary list, we can now map each word in a preprocessed email to its index in the vocabulary list, which results in
```
[86, 916, 794, 1077, 883, 370, 1699, 790, 1822, 1831, 883, 431, 1171, 794, 1002, 1895, 592, 1676, 238, 688, 945, 1663, 1120, 1062, 1699, 375, 1162, 479, 799, 1182, 1237, 1440, 1547, 181, 1699, 1758, 1896, 688, 1676, 992, 961, 1477, 71, 530, 1699, 531]
```
as the vocabulary index representation of the sample email. 

Your task now is to complete the code in `processEmail` to perform this mapping. For each word in the normalized email, check whether it exists in the vocabulary list `vocabList`. If the word
exists, add its index to the list of word indices. If the word is not in the vocabulary, skip it.
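
The lookup itself can be sketched on a toy vocabulary. The helper `tokens_to_indices`, the dictionary `toy_vocab`, and its index values are hypothetical; the real indices come from `vocab.txt`. The sketch assumes the vocabulary has been turned into a word-to-index dictionary first, which is one possible design and not necessarily the intended solution.

```python
def tokens_to_indices(tokens, vocab):
    # Keep the vocabulary index of every known token, in order of
    # appearance; tokens missing from the vocabulary are skipped.
    return [vocab[token] for token in tokens if token in vocab]


# Toy word-to-index vocabulary (made-up indices).
toy_vocab = {"anyon": 1, "know": 2, "cost": 3, "host": 4}

tokens = ["anyon", "know", "how", "much", "it", "cost", "to", "host"]
print(tokens_to_indices(tokens, toy_vocab))  # [1, 2, 3, 4]
```

A dictionary makes each lookup a constant-time operation, whereas scanning the vocabulary array for every token would take time proportional to the vocabulary size.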

In [None]:
def processEmail(email):
    raise NotImplementedError

In [None]:
print(processEmail(sample_email))

## Extracting Features from Emails

You will now implement the feature extraction that converts each email into
a vector in $\mathbb{R}^d$, where $d$ is the number of words in the vocabulary
list. Specifically, the feature $x_i \in \{0, 1\}$ for an email indicates whether
the $i$-th word in the vocabulary list occurs in the email. That is, $x_i = 1$ if the $i$-th
word is in the email and $x_i = 0$ if it is not.

For the sample email, you should see that the feature vector has length 1899 and 44 non-zero entries.
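
The construction of such a binary vector can be sketched as follows. The helper `indices_to_features` and the toy vocabulary size of 6 are assumptions for illustration; the sketch also assumes 1-based word indices, so index $i$ is stored at array position $i - 1$.

```python
import numpy as np


def indices_to_features(word_indices, vocab_size):
    # Start with an all-zero vector, one entry per vocabulary word.
    x = np.zeros(vocab_size, dtype=int)
    # Mark each word that occurs at least once; repeated words still give 1.
    for index in set(word_indices):
        x[index - 1] = 1  # shift from 1-based index to 0-based position
    return x


x = indices_to_features([2, 5, 2, 1], vocab_size=6)
print(x)                    # [1 1 0 0 1 0]
print(np.count_nonzero(x))  # 3
```

Note that the word with index 2 occurs twice but contributes a single non-zero entry, so the number of non-zero entries equals the number of distinct vocabulary words in the email.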

In [None]:
def emailFeatures(email):
    raise NotImplementedError

In [None]:
sample_feature = emailFeatures(sample_email)
print(f"length: {len(sample_feature)} non-zeros: {np.count_nonzero(sample_feature)}")

## Training SVM for Spam Classification

After you have completed the feature extraction functions, the next step is to load a preprocessed training dataset that will be used to train an SVM classifier.

The training dataset contains 4000 training examples of spam
and non-spam email, while the test dataset contains 1000 test examples. Each
original email was processed using the `emailFeatures` function and converted into a vector 
$x_{i} \in \mathbb{R}^{1899}$.

After loading the dataset, we will proceed to train an SVM to
classify between spam ($y = 1$) and non-spam ($y = 0$) emails. Once the
training completes, you should see that the classifier gets a training accuracy
of about $99.8\%$ and a test accuracy of about $98.9\%$.
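
Accuracy here means the fraction of emails whose predicted label matches the true label. The sketch below computes it by hand on made-up labels; the helper `accuracy` and both label lists are hypothetical and only illustrate the quantity that is reported for the classifier.

```python
import numpy as np


def accuracy(y_true, y_pred):
    # Fraction of positions where the prediction equals the true label.
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))


# Made-up labels: 1 = spam, 0 = non-spam.
y_true = [1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0]

print(f"{accuracy(y_true, y_pred) * 100:.1f}%")  # 80.0%
```

Four of the five predictions agree with the true labels, which gives 80%.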

In [None]:
def loadData():
    data = si.loadmat('spamTrain.mat')
    train = {'X': data['X'], 'y': data['y'].flatten()}

    data = si.loadmat('spamTest.mat')
    test = {'X': data['Xtest'], 'y': data['ytest'].flatten()}

    return train, test


train, test = loadData()

In [None]:
from sklearn.svm import SVC

# train support vector machine
svm = SVC(C=0.1, kernel='linear').fit(train['X'], train['y'])

print("Training set accuracy: %.1f%%" % (svm.score(train['X'], train['y']) * 100))
print("Test set accuracy: %.1f%%" % (svm.score(test['X'], test['y']) * 100))

## Top Predictors for Spam

To better understand how the spam classifier works, we can inspect the
parameters to see which words the classifier thinks are the most predictive
of spam. 

The next step finds the parameters with the largest
positive values in the classifier and displays the corresponding words. A large positive weight pushes the decision towards the spam class, so if an email contains words such as “guarantee”, “remove”, “dollar”,
and “price”, it is likely to be
classified as spam.
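
The ranking step can be sketched on a toy weight vector. The weights and the five-word vocabulary below are invented for illustration and are not values learned by the classifier in this notebook.

```python
import numpy as np

# Invented weights of a linear classifier, one per vocabulary word.
weights = np.array([0.1, -0.4, 0.9, 0.0, 0.5])
words = ["meet", "thank", "guarante", "the", "dollar"]

# argsort sorts ascending, so reverse it to get the largest weights first.
order = np.argsort(weights)[::-1]
top = [words[i] for i in order[:3]]
print(top)  # ['guarante', 'dollar', 'meet']
```

Words with negative weights, such as "thank" in this toy example, end up at the bottom of the ranking because they count as evidence against spam.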

In [None]:
topidx = np.argsort(svm.coef_[0]).tolist()[::-1]
vdict = dict(loadVocabulary())
top = list(map(lambda idx: vdict[idx + 1], topidx))
print(top[0:15])

## Test on some examples

In [None]:
def fileAsString(filename):
    with open(filename, 'r') as fh:
        return fh.read()


sample1 = fileAsString('spamSample1.txt')
sample3 = fileAsString('emailSample1.txt')
sample4 = fileAsString('emailSample2.txt')

# Test the support vector machine on the samples (sample1, sample3, sample4)

In [None]:
features = emailFeatures(sample3)
predict = svm.predict(features.reshape((1, -1))).squeeze()
print(sample3)
print("The email was classified as %s" % ("SPAM" if predict == 1 else "NOT SPAM"))