# Spam Detection Using a Support Vector Machine

The process of creating a spam detector using a Support Vector Machine is split up into five steps.

  - Create a set of the most common words occurring in spam and ham (i.e. non-spam) emails.
  - Transform every mail into a <em style="color:blue">frequency vector</em>: For every word in the set of most common words, 
    the frequency vector stores the frequency of this word in the respective mail.
  - For every word in the list of most common words, compute the <em style="color:blue">inverse document frequency</em>.
  - Compute the <em style="color:blue">feature matrix</em> by transforming the frequency vectors into vectors that contain the product of the 
    <em style="color:blue">term frequency</em> with the <em style="color:blue">inverse document frequency</em>.
  - Train and test an SVM using this <em style="color:blue">feature matrix</em>.

## Step 1: Create the Set of Common Words

In [1]:
import os
import re
import numpy as np
import math

In [2]:
from collections import Counter

The directory 
https://github.com/karlstroetmann/Artificial-Intelligence/tree/master/Python/EmailData
contains 960 emails that are divided into four subdirectories:

  - `spam-train` contains 350 spam emails for training,
  - `ham-train`  contains 350 non-spam emails for training,
  - `spam-test`  contains 130 spam emails for testing,
  - `ham-test`   contains 130 non-spam emails for testing.

I found this data on the page 
http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex6/ex6.html, which is provided by Andrew Ng.

We declare some variables so that this notebook can be adapted to other data sets.

In [3]:
spam_dir_train = 'EmailData/spam-train/'
ham__dir_train = 'EmailData/ham-train/'
spam_dir_test  = 'EmailData/spam-test/'
ham__dir_test  = 'EmailData/ham-test/'
Directories    = [spam_dir_train, ham__dir_train, spam_dir_test, ham__dir_test]

The function $\texttt{get_words_set}(\texttt{fn})$ takes a filename $\texttt{fn}$ as its argument.  It reads the file and returns a `set` of all words that are found in this file.  The words are transformed to lower case.

In [4]:
def get_words_set(fn):
    with open(fn) as file:
        text = file.read()
        text = text.lower()
        return set(re.findall(r"[\w']+", text))
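
To see what this tokenization produces, the following sketch applies the same regular expression to an in-memory string instead of a file. The helper name `words_of` and the sample text are made up for this illustration and are not part of the notebook.

```python
import re

def words_of(text):
    """Return the set of lower-cased words in text (hypothetical helper)."""
    # Same pattern as in get_words_set: runs of word characters and apostrophes.
    return set(re.findall(r"[\w']+", text.lower()))

sample = "Free offer! Don't miss this FREE offer."
tokens = words_of(sample)
print(sorted(tokens))
```

Because the result is a set, the repeated words "free" and "offer" appear only once, and the apostrophe keeps "don't" together as a single word.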

The function `read_all_files` reads all files contained in those directories that are stored in the list `Directories`. 
It returns a `Counter`.  For every word $w$ this counter contains the number of files that contain $w$. 

In [5]:
def read_all_files():
    Words = Counter()
    for directory in Directories:
        for file_name in os.listdir(directory):
            Words.update(get_words_set(directory + file_name))
    return Words
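
The reason the counter ends up holding document counts rather than total word counts is that it is updated with a set per file. A minimal sketch with three made-up "documents", each already reduced to its set of words, illustrates this:

```python
from collections import Counter

# Three toy documents, each represented by its set of distinct words.
docs = [
    {"free", "money", "now"},
    {"free", "meeting"},
    {"meeting", "agenda", "now"},
]

doc_freq = Counter()
for words in docs:
    # Updating with a set adds 1 per distinct word, so every document
    # contributes at most once to the count of a word.
    doc_freq.update(words)

print(doc_freq["free"])    # number of documents containing "free"
print(doc_freq["agenda"])  # number of documents containing "agenda"
```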

`Common_Words` is a `numpy` array of the 2500 most common words found in all of our emails. 

In [6]:
M            = 2500             # number of the most common words to use
Word_Counter = read_all_files()
Common_Words = np.array(list({ w for w, _ in Word_Counter.most_common(M) }))

## Step 2: Transform Files into Frequency Vectors

`Index_Dict` is a dictionary that maps from the most common words to their index in the array `Common_Words`.

In [7]:
Index_Dict = { w: i for i, w in enumerate(Common_Words) }
Index_Dict

{'ph': 0,
 'are': 1,
 'seminar': 2,
 'instruction': 3,
 'figure': 4,
 'interdisciplinary': 5,
 'argument': 6,
 'worldwide': 7,
 'represent': 8,
 'invite': 9,
 'where': 10,
 'discover': 11,
 'integration': 12,
 'page': 13,
 'server': 14,
 'saturday': 15,
 'jackson': 16,
 'encode': 17,
 'paid': 18,
 'acceptance': 19,
 'europe': 20,
 'global': 21,
 'pic': 22,
 'bibliography': 23,
 'wealthy': 24,
 'classified': 25,
 'merge': 26,
 'removal': 27,
 'lower': 28,
 'overt': 29,
 'fellow': 30,
 'loui': 31,
 'author': 32,
 'install': 33,
 'fairchild': 34,
 'competitor': 35,
 'fresh': 36,
 'hardcore': 37,
 'christmas': 38,
 'satisfaction': 39,
 'below': 40,
 'away': 41,
 'file': 42,
 'information': 43,
 'management': 44,
 'importance': 45,
 'counter': 46,
 'alan': 47,
 'congress': 48,
 'telephone': 49,
 'frequently': 50,
 'locate': 51,
 'mistake': 52,
 'short': 53,
 'overview': 54,
 'create': 55,
 'sometime': 56,
 'teach': 57,
 'answer': 58,
 'accepted': 59,
 'theoretical': 60,
 'procedure': 61,
 ...

The function $\texttt{transform_to_vector}(L)$ takes a list of words $L$ and transforms this list into a vector $\mathbf{v}$.  If 
$\texttt{Common_Words}[i] = w$, then $\mathbf{v}[i]$ specifies the number of times that $w$ occurs in $L$. 

In [8]:
def transform_to_vector(L):
    Result = np.zeros(len(Common_Words))
    for w in L:
        if w in Index_Dict:
            Result[Index_Dict[w]] += 1
    return Result
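
The following sketch shows the same counting scheme on a miniature vocabulary. The names `vocab`, `index` and `count_vector` are hypothetical stand-ins for `Common_Words`, `Index_Dict` and `transform_to_vector`; the words are made up.

```python
import numpy as np

# Hypothetical three-word vocabulary standing in for Common_Words.
vocab = np.array(["free", "money", "meeting"])
index = {w: i for i, w in enumerate(vocab)}

def count_vector(words):
    """Count how often each vocabulary word occurs in the list words."""
    result = np.zeros(len(vocab))
    for w in words:
        if w in index:  # words outside the vocabulary are ignored
            result[index[w]] += 1
    return result

v = count_vector(["free", "free", "lunch", "money"])
print(v)
```

The word "lunch" is not in the vocabulary and therefore does not contribute to any component of the vector.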

The function $\texttt{get_word_vector}(\texttt{fn})$ takes a filename `fn`, reads the specified file and transforms it into a vector of word counts.

In [9]:
def get_word_vector(fn):
    with open(fn) as file:
        text = file.read()
        text = text.lower()
        return transform_to_vector(re.findall(r"[\w']+", text))

## Step 3: Compute the Inverse Document Frequency

In natural language processing, the notion <em style='color:blue;'>term</em> is used as a synonym for <em style='color:blue;'>word</em>.
Given a term $t$ and a document $d$, the <em style='color:blue;'>term frequency</em> $\texttt{tf}(t, d)$ is defined as
$$ \texttt{tf}(t, d) = \frac{d.\texttt{count}(t)}{\texttt{len}(d)}, $$
where $d.\texttt{count}(t)$ counts the number of times $t$ appears in $d$ and $\texttt{len}(d)$ is the length of the list representing $d$.

A <em style='color:blue;'>corpus</em> is a set of documents.  Given a term $t$ and a corpus $\mathcal{C}$, the <em style='color:blue;'>inverse document frequency</em> 
$\texttt{idf}(t,\mathcal{C})$ is defined as
$$ \texttt{idf}(t,\mathcal{C}) = \ln\left(\frac{\texttt{card}(\mathcal{C}) + 1}{\texttt{card}\bigl(\{ d \in \mathcal{C} \mid t \in d \}\bigr) + 1}\right). $$
The addition of $1$ in both numerator and denominator is called <em style="color:blue;">Laplace smoothing</em>.  It prevents a **division by zero** error 
for those terms $t$ that do not occur in any document of the corpus.
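
As a worked example of the two definitions, the sketch below computes $\texttt{tf}$ and $\texttt{idf}$ for a made-up corpus of three short documents. Documents are represented as lists of words, and the function names `tf` and `idf` simply mirror the formulas; none of this code is taken from the notebook itself.

```python
import math

# A made-up corpus of three documents, each a list of words.
corpus = [
    ["free", "money", "free"],
    ["meeting", "today"],
    ["free", "meeting"],
]

def tf(t, d):
    """Term frequency: occurrences of t in d divided by the length of d."""
    return d.count(t) / len(d)

def idf(t, corpus):
    """Smoothed inverse document frequency as defined above."""
    containing = sum(1 for d in corpus if t in d)
    return math.log((len(corpus) + 1) / (containing + 1))

print(tf("free", corpus[0]))   # 2 of the 3 words are "free"
print(idf("free", corpus))     # ln(4/3): "free" occurs in 2 documents
print(idf("money", corpus))    # ln(4/2): "money" occurs in 1 document
print(idf("absent", corpus))   # ln(4/1): no division by zero
```

A term that occurs in many documents gets a small weight, while a rare term gets a large one.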

## Step 4: Compute the Feature Matrix

The function $\texttt{feature_matrix}(\texttt{spam_dir}, \texttt{ham_dir})$ takes two directories that contain spam and ham, respectively.
It computes a matrix $X$ and a vector $Y$, where $X$ is the feature matrix and for
every row $r$ of the feature matrix, $Y[r]$ is 1 if the mail is ham and 0 if it's spam.

The way $X$ is computed is quite inefficient: it would be better to initialize $X$ as a matrix with the shape $(N,M)$, where $N$ is the number of mails and $M$ is the number of common words.

In [10]:
def feature_matrix(spam_dir, ham_dir):
    X = []
    Y = []
    for fn in os.listdir(spam_dir):
        X.append(get_word_vector(spam_dir + fn))
        Y.append(0)
    for fn in os.listdir(ham_dir):    
        X.append(get_word_vector(ham_dir + fn))
        Y.append(+1)
    X = np.array(X)
    Y = np.array(Y)
    return X, Y
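
A possible preallocated variant is sketched below. To keep the example self-contained, it takes already computed count vectors instead of directories; the function name `stack_preallocated` and its signature are assumptions made for this illustration, not the notebook's method.

```python
import numpy as np

def stack_preallocated(spam_vectors, ham_vectors, m):
    """Build X with shape (N, m) and the label vector Y without growing lists.

    spam_vectors and ham_vectors are sequences of count vectors of length m.
    """
    n = len(spam_vectors) + len(ham_vectors)
    X = np.zeros((n, m))
    Y = np.zeros(n, dtype=int)      # label 0 (spam) is the default
    for row, v in enumerate(spam_vectors):
        X[row, :] = v
    offset = len(spam_vectors)
    for k, v in enumerate(ham_vectors):
        X[offset + k, :] = v
        Y[offset + k] = 1           # ham rows get label 1
    return X, Y

# Toy input: one spam vector and two ham vectors over three words.
X, Y = stack_preallocated([[1, 0, 2]], [[0, 3, 0], [1, 1, 1]], m=3)
print(X.shape, Y)
```

For a data set of this size, the difference in running time is presumably negligible; preallocation mainly matters for much larger corpora.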

We convert the training set into a feature matrix.

In [11]:
%%time
X_train, Y_train = feature_matrix(spam_dir_train, ham__dir_train)

Wall time: 751 ms


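Up to now, the feature matrix contains only the raw word counts, i.e. term frequencies that have not been divided by the length of the mail.  Next, we multiply every column with the corresponding inverse document frequency.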

In [12]:
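N, _ = X_train.shape   # number of training mails
IDF  = {}
# Note: Word_Counter counts documents in all four directories (training and
# test), whereas N only counts the training mails.
for w, i in Index_Dict.items():
    IDF[w] = np.log((N + 1) / (Word_Counter[w] + 1))
    X_train[:, i] = X_train[:, i] * IDF[w]   # scale column i by the idf of w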
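
The column-by-column scaling can also be written as a single broadcast multiplication. The sketch below checks on made-up numbers that both forms give the same result; `counts` and `idf` are hypothetical stand-ins for the feature matrix and for the IDF values arranged in column order.

```python
import numpy as np

# Toy count matrix: 2 documents, 3 vocabulary words (made-up numbers).
counts = np.array([[2.0, 0.0, 1.0],
                   [0.0, 3.0, 1.0]])
idf = np.array([0.5, 1.0, 2.0])   # one hypothetical weight per column

# Column-by-column scaling, as in the loop above.
looped = counts.copy()
for i in range(counts.shape[1]):
    looped[:, i] = looped[:, i] * idf[i]

# Broadcasting scales every row by the weight vector in a single step.
broadcast = counts * idf

print(np.array_equal(looped, broadcast))
```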

We build the feature matrix for the test set.

In [13]:
X_test, Y_test = feature_matrix(spam_dir_test, ham__dir_test)

In [14]:
for w, i in Index_Dict.items():
    X_test[:, i] = X_test[:, i] * IDF[w]

## Step 5: Train and Test a Support Vector Machine

In [15]:
import sklearn.svm as svm

Train an SVM and compute the accuracy on the training data.

In [16]:
# Note: M is rebound here; it previously held the number of common words.
M = svm.SVC(kernel='linear', C=100000)
M.fit  (X_train, Y_train)
M.score(X_train, Y_train)

1.0

Compute the accuracy for the test data.

In [17]:
M.score(X_test, Y_test)

0.9884615384615385
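
For a classifier, `score` reports the mean accuracy, i.e. the fraction of mails that are classified correctly. With 260 test mails, the value above corresponds to 257 correct predictions. The sketch below reproduces this arithmetic with made-up labels; the positions of the three errors are hypothetical, since the notebook does not show which mails were misclassified.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and label agree."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# 130 spam mails (label 0) followed by 130 ham mails (label 1).
y_true = [0] * 130 + [1] * 130

# Hypothetical predictions: flip three arbitrarily chosen labels.
y_pred = list(y_true)
for k in (5, 140, 200):
    y_pred[k] = 1 - y_pred[k]

print(accuracy(y_true, y_pred))
```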