# Exercise 6-2: Spam Classification with SVMs

In this part of the exercise, we will build a spam filter with a linear SVM.

## Part 1: Email Preprocessing

`process_email()` preprocesses the body of an email and returns a list of word indices. It applies the following preprocessing and normalization steps:

- Lower-casing
- Stripping HTML
- Normalizing URLs
- Normalizing Email Addresses
- Normalizing Numbers
- Normalizing Dollars
- Word Stemming
- Removal of non-words
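
To make the normalization steps above concrete, here is a minimal sketch that applies only the lower-casing and regex substitutions (no stemming, no vocabulary lookup) to a made-up one-line email. The helper name `normalize` and the sample text are assumptions for illustration and are not part of the exercise files.

```python
import re


def normalize(text):
    """Apply the lower-casing and regex normalization steps (no stemming)."""
    text = text.lower()
    text = re.sub(r"<[^<>]+>", " ", text)                      # strip HTML tags
    text = re.sub(r"[0-9]+", "number", text)                   # normalize numbers
    text = re.sub(r"(http|https)://[^\s]*", "httpaddr", text)  # normalize URLs
    text = re.sub(r"[^\s]+@[^\s]+", "emailaddr", text)         # normalize email addresses
    text = re.sub(r"[$]+", "dollar", text)                     # normalize dollar signs
    return text


# Hypothetical sample text, not taken from the exercise data
sample = "<b>Visit</b> http://example.com now, pay $100 or mail bob@example.com"
print(normalize(sample).split())
```

Because numbers are replaced before dollar signs, `$100` first becomes `$number` and then `dollarnumber`, which is why tokens such as `dollarnumb` show up once stemming is applied.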

In [61]:
import re
from nltk import PorterStemmer


def process_email(email_contents):
    """
    Preprocesses the body of an email and returns a list of word indices.

    Parameters
    ----------
    email_contents : string
        The email content.

    Returns
    -------
    list
        A list of word indices.

    """
    vocab_list = get_vocablist()

    # Lower-case the text, then normalize it with re.sub():
    #   <[^<>]+>               HTML tags       -> ' '
    #   [0-9]+                 numbers         -> 'number'
    #   (http|https)://[^\s]*  URLs            -> 'httpaddr'
    #   [^\s]+@[^\s]+          email addresses -> 'emailaddr'
    #   [$]+                   dollar signs    -> 'dollar'
    email_contents = email_contents.lower()
    email_contents = re.sub(r"<[^<>]+>", ' ', email_contents)
    email_contents = re.sub(r"[0-9]+", 'number', email_contents)
    email_contents = re.sub(r"(http|https)://[^\s]*", 'httpaddr', email_contents)
    email_contents = re.sub(r"[^\s]+@[^\s]+", 'emailaddr', email_contents)
    email_contents = re.sub(r"[$]+", 'dollar', email_contents)

    # Split on whitespace and punctuation; each token is then stemmed
    # with the Porter stemmer and looked up in the vocabulary.
    word_indices = []
    words_list = split(""" @$/#.-:&*+=[]?!(){},'">_<;%\n\r""", email_contents)
    stemmer = PorterStemmer()
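    for word in words_list:
        # Skip empty strings produced by consecutive delimiters
        if word == '':
            continue
        word = stemmer.stem(word)
        print(word)
        # Record the 0-based vocabulary index of the token, if it is present
        for i in range(len(vocab_list)):
            if vocab_list[i] == word:
                word_indices.append(i)

    return word_indices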


def split(delimiters, string, maxsplit=0):
    """Splits string on any of the single-character delimiters."""
    pattern = '|'.join(map(re.escape, delimiters))
    return re.split(pattern, string, maxsplit=maxsplit)


def get_vocablist():
    """
    Reads the fixed vocabulary list in vocab.txt and returns a list of the words.

    Returns
    -------
    list
        The vocabulary list.
    """
    vocabulary = []
    with open('vocab.txt') as f:
        for line in f:
            idx, word = line.split('\t')
            vocabulary.append(word.strip())
    return vocabulary
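
The vocabulary lookup in `process_email()` scans the whole list once per token. As a possible alternative, the sketch below builds a word-to-index dictionary once and looks each token up in constant time. The helper names `build_vocab_index` and `lookup_indices` and the toy vocabulary are hypothetical. The result matches the loop-based lookup only under the assumption that every word appears once in `vocab.txt`.

```python
def build_vocab_index(vocab_list):
    """Map each vocabulary word to its 0-based position in vocab_list."""
    return {word: i for i, word in enumerate(vocab_list)}


def lookup_indices(tokens, vocab_index):
    """Return the indices of the tokens found in the vocabulary, in order."""
    return [vocab_index[t] for t in tokens if t in vocab_index]


# Hypothetical toy vocabulary standing in for the contents of vocab.txt
toy_vocab = ["anyon", "cost", "host", "know", "web"]
vocab_index = build_vocab_index(toy_vocab)

tokens = ["anyon", "know", "how", "much", "it", "cost", "web", "web"]
print(lookup_indices(tokens, vocab_index))
```

Tokens that are not in the vocabulary (`how`, `much`, `it` in this toy case) are dropped, and repeated tokens yield repeated indices, just as in the loop-based version.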

Preprocess the raw email:

In [62]:
with open('emailSample1.txt') as f:
    file_contents = f.read().replace('\n', '')

word_indices = process_email(file_contents)

# Print Stats
print ('Word Indices:', word_indices)

anyon
know
how
much
it
cost
to
host
a
web
portal
well
it
depend
on
how
mani
visitor
you
re
expect
thi
can
be
anywher
from
less
than
number
buck
a
month
to
a
coupl
of
dollarnumb
you
should
checkout
httpaddr
or
perhap
amazon
ecnumb
if
your
run
someth
big
to
unsubscrib
yourself
from
thi
mail
list
send
an
email
emailaddr
Word Indices: [85, 915, 793, 1076, 882, 369, 1698, 789, 1821, 1830, 882, 430, 1170, 793, 1001, 1892, 1363, 591, 1675, 237, 161, 88, 687, 944, 1662, 1119, 1061, 1698, 374, 1161, 478, 1892, 1509, 798, 1181, 1236, 809, 1894, 1439, 1546, 180, 1698, 1757, 1895, 687, 1675, 991, 960, 1476, 70, 529, 530]


## Part 2: Feature Extraction

`email_features()` produces a feature vector `x` from the given word indices: `x[i]` is 1 if the i-th vocabulary word occurs in the email and 0 otherwise.

In [63]:
import numpy as np


def email_features(word_indices):
    """
    Takes in a word_indices vector and produces a feature vector from the word indices.

    Parameters
    ----------
    word_indices : array-like
        List of word indices.

    Returns
    -------
    ndarray
        Feature vector from word indices.
    """
    # Total number of words in the dictionary
    n = 1899

    x = np.zeros((n, 1))
    x[word_indices] = 1

    return x


Extracting features from sample email:

In [64]:
features = email_features(word_indices)
print ('Length of feature vector:', len(features))
print ('Number of non-zero entries:', np.sum(features > 0))

Length of feature vector: 1899
Number of non-zero entries: 45
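
The feature vector is binary, so a word that occurs several times still contributes a single 1. The sketch below illustrates this with the word indices printed for the sample email. It uses a plain Python list instead of a NumPy array, a substitution made only to keep the example dependency-free.

```python
# Word indices printed above for emailSample1.txt
word_indices = [85, 915, 793, 1076, 882, 369, 1698, 789, 1821, 1830, 882, 430,
                1170, 793, 1001, 1892, 1363, 591, 1675, 237, 161, 88, 687, 944,
                1662, 1119, 1061, 1698, 374, 1161, 478, 1892, 1509, 798, 1181,
                1236, 809, 1894, 1439, 1546, 180, 1698, 1757, 1895, 687, 1675,
                991, 960, 1476, 70, 529, 530]

n = 1899        # vocabulary size, as in email_features()
x = [0] * n     # plain-list stand-in for np.zeros((n, 1))
for i in word_indices:
    x[i] = 1    # setting the same position twice has no further effect

print('Indices in the list:', len(word_indices))
print('Distinct indices:', len(set(word_indices)))
print('Non-zero entries:', sum(x))
```

The list holds 52 indices but only 45 distinct ones, which accounts for the 45 non-zero entries reported above.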


## Part 3: Train Linear SVM for Spam Classification

Train a linear SVM to classify spam emails:

In [65]:
import scipy.io as sio
import numpy as np
from sklearn import svm
data = sio.loadmat("spamTrain.mat")

# load data from spamTrain.mat
X = data["X"] 
y = data["y"].ravel()

C = 0.1
#Use Linear SVC
clf = svm.LinearSVC(C=C)
clf.fit(X,y)
p = clf.predict(X)

print ('Training Accuracy:', np.mean(p == y) * 100)

Training Accuracy: 99.97500000000001
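
For intuition about what the trained model does at prediction time: a linear classifier labels an email by the sign of a weighted sum of its features plus a bias term. The sketch below shows that rule in plain Python. The function name `linear_predict`, the weights, and the bias are made up for illustration and are not the values learned by `LinearSVC`.

```python
def linear_predict(w, b, x):
    """Return 1 (spam) if w . x + b is positive, otherwise 0 (not spam)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else 0


# Hypothetical weights and bias for a 4-word vocabulary
w = [0.4, 0.3, -0.5, -0.2]
b = -0.1

print(linear_predict(w, b, [1, 1, 0, 0]))  # score = 0.4 + 0.3 - 0.1 = 0.6
print(linear_predict(w, b, [0, 0, 1, 1]))  # score = -0.5 - 0.2 - 0.1 = -0.8
```

With binary features, each word present in the email simply adds its weight to the score, so words with large positive weights push the prediction toward spam.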


## Part 4: Test Spam Classification

Evaluate the trained Linear SVM on a test set:

In [66]:
# load spamTest.mat 
data = sio.loadmat("spamTest.mat")
X_test = data['Xtest']
y_test = data['ytest'].ravel()

print ('Evaluating the trained Linear SVM on a test set...')
p = clf.predict(X_test)

print ('Test Accuracy:', np.mean(p == y_test) * 100)

Evaluating the trained Linear SVM on a test set...
Test Accuracy: 99.2


## Part 5: Top Predictors of Spam

Print the words with the largest positive weights, i.e. the top predictors of spam:

In [67]:
# Flatten the weight vector and sort its indices by descending weight
coef = clf.coef_.ravel()
idx = coef.argsort()[::-1]
vocab_list = get_vocablist()

print ('Top predictors of spam:')
for i in range(15):
    print ("{0:<15s} ({1:f})".format(vocab_list[idx[i]], coef[idx[i]]))

Top predictors of spam:
our             (0.421665)
remov           (0.387173)
click           (0.387060)
basenumb        (0.346617)
guarante        (0.341686)
visit           (0.303028)
bodi            (0.263523)
will            (0.244394)
numberb         (0.238795)
price           (0.234199)
dollar          (0.232315)
nbsp            (0.227081)
below           (0.223199)
lo              (0.219994)
most            (0.214548)
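
The ranking step can be illustrated on a tiny model. The sketch below is a plain-Python counterpart of `coef.argsort()[::-1]` (the two agree whenever no two weights are tied). Three of the weights are copied from the output above; the words `the` and `wrote` and their weights are invented for the example.

```python
# Hypothetical mini-model: "our", "remov" and "click" use the weights printed
# above; "the" and "wrote" and their weights are made up for illustration.
vocab = ["click", "the", "our", "wrote", "remov"]
coef = [0.387060, 0.01, 0.421665, -0.30, 0.387173]

# Indices ordered from the largest weight to the smallest
idx = sorted(range(len(coef)), key=lambda i: coef[i], reverse=True)

print('Top predictors of spam:')
for i in idx[:3]:
    print("{0:<15s} ({1:f})".format(vocab[i], coef[i]))
```

Words with negative weights land at the end of the ordering; they count as evidence against spam.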


## Part 6: Try Your Own Emails

Run the spam classifier over `spamSample1.txt`:

In [69]:
# Load spamSample1.txt and remove the newlines
filename = 'spamSample1.txt'
with open(filename) as f:
    file_contents = f.read().replace('\n', '')
word_indices = process_email(file_contents)
x = email_features(word_indices)
#predict
p = clf.predict(x.T)
print ('Processed', filename, '\nSpam Classification:', p)
print ('(1 indicates spam, 0 indicates not spam)')

do
you
want
to
make
dollarnumb
or
more
per
week
if
you
are
a
motiv
and
qualifi
individu
i
will
person
demonstr
to
you
a
system
that
will
make
you
dollarnumb
number
per
week
or
more
thi
is
not
mlm
call
our
number
hour
pre
record
number
to
get
the
detail
number
number
number
i
need
peopl
who
want
to
make
seriou
money
make
the
call
and
get
the
fact
invest
number
minut
in
yourself
now
number
number
number
look
forward
to
your
call
and
i
will
introduc
you
to
peopl
like
yourself
whoar
current
make
dollarnumb
number
plu
per
week
number
number
numberljgvnumb
numberleannumberlrmsnumb
numberwxhonumberqiytnumb
numberrjuvnumberhqcfnumb
numbereidbnumberdmtvlnumb
Processed spamSample1.txt 
Spam Classification: [1]
(1 indicates spam, 0 indicates not spam)
