This notebook is a Python version of the exercises from Andrew Ng's Stanford course 'Machine Learning', taught on the Coursera platform.
Note: All exercise data and structure are credited to Stanford University.

**Caveat:** The original Octave scripts are modular, but since I'm using Jupyter Notebooks for educational purposes, we will implement each function in the same notebook where we call it.

# Exercise 1 - Generate E-Mail Features

In [2]:
# Import numpy to deal with matrices and vectors
import numpy as np
# Import pandas to read data files
import pandas as pd
# Import matplotlib to plot data
import matplotlib.pyplot as plt
# Import regular expressions library
import re
# Import string helper library
import string

#Import NLTK Tokenizer
from nltk.tokenize import word_tokenize

# Import and load Porter Stemmer
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

# Import math 
import math 

# Import scipy optimization function
from scipy import optimize, io
from scipy.ndimage import rotate

# Import Support Vector Machine
from sklearn.svm import LinearSVC, SVC
# Matplotlib notebook property
%matplotlib inline

One of the many problems that you can solve with machine learning is the classification of spam e-mails.
<br>
We will use an SVM to train this classifier. 
As usual, let's look at the data first:

In [4]:
# Read e-mail contents (the with block closes the file afterwards)
with open("emailSample1.txt", "r") as email_file:
    file_contents = email_file.read()

In [5]:
print(file_contents)

> Anyone knows how much it costs to host a web portal ?
>
Well, it depends on how many visitors you're expecting.
This can be anywhere from less than 10 bucks a month to a couple of $100. 
You should checkout http://www.rackspace.com/ or perhaps Amazon EC2 
if youre running something big..

To unsubscribe yourself from this mailing list, send an email to:
groupname-unsubscribe@egroups.com




How do we process this text into something the SVM can read?
<br> 
We need to turn the words into integers of some form. Let's start by reading a vocabulary list (pre-filtered to keep only the most common words). Each e-mail is mapped to that list after some common Natural Language Processing pre-processing steps, such as:
<br>
- keeping only alphanumeric characters;
- replacing e-mail addresses and URLs with placeholder tokens.
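
As a rough illustration of the pre-processing steps listed above, here is a minimal sketch that uses only the standard library. The function name `normalize` and the sample sentence are made up for this example, and it skips stemming and the NLTK tokenizer, so treat it as an approximation rather than the exact pipeline used in this notebook.

```python
import re

def normalize(text: str) -> str:
    """Lowercase the text and replace numbers, URLs, e-mail addresses and $ signs with tokens."""
    text = text.lower()
    # Strip anything that looks like an HTML tag
    text = re.sub(r'<[^<>]+>', ' ', text)
    # Collapse every run of digits into the token "number"
    text = re.sub(r'[0-9]+', 'number', text)
    # Flag URLs and e-mail addresses with placeholder tokens
    text = re.sub(r'(http|https)://\S*', 'httpaddr', text)
    text = re.sub(r'\S+@\S+', 'emailaddr', text)
    # Spell out dollar signs
    text = re.sub(r'[$]+', 'dollar', text)
    # Keep only alphanumeric tokens, separated by single spaces
    return ' '.join(re.findall(r'[a-z0-9]+', text))

# Hypothetical sample text, not taken from the exercise data
sample = "Visit http://example.com or mail bob@example.com, only $100!"
print(normalize(sample))
# visit httpaddr or mail emailaddr only dollarnumber
```

Note that `$100` becomes the single token `dollarnumber`, because the dollar sign and the digits are adjacent in the input.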

In [9]:
def getVocabList():
    '''
    Generates vocabulary list.
    Maps each (index, 1) tuple key to its word.
    
    Args:
        None
    Returns:
        vocab_dict(dict): Vocabulary list keyed by (index, 1) tuples
    '''
    vocab_dict = {}
    
    with open("vocab.txt", "r") as vocab:
        for line in vocab:
            vocab_dict[int((line.split('\t'))[0]),1] = line.split('\t')[1].replace('\n','')
            
    return vocab_dict
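
For comparison, a vocabulary file in this tab-separated layout could also be loaded as a plain word-to-index dictionary, which makes each lookup a single dictionary access. The sketch below is an alternative, not the function used in this notebook; `load_vocab` and the three-line excerpt are hypothetical and only assume the `<index>\t<word>` layout.

```python
from io import StringIO

# Made-up excerpt that mimics the assumed "<index>\t<word>" layout of vocab.txt
sample_vocab = StringIO("1\taa\n2\tab\n3\tabil\n")

def load_vocab(handle) -> dict:
    """Build a word -> index dictionary from tab-separated lines."""
    word_to_index = {}
    for line in handle:
        index, word = line.rstrip('\n').split('\t')
        word_to_index[word] = int(index)
    return word_to_index

vocab = load_vocab(sample_vocab)
print(vocab.get('abil'))   # 3
print(vocab.get('spam'))   # None, the word is not in this excerpt
```

With this shape, looking up a stemmed token is `vocab.get(token)`, and a missing word simply returns `None`.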

In [10]:
def processEmail(
    email_contents: str
) -> list:
    '''
    Preprocesses e-mail and returns
    word indices according to vocabulary.
    
    Args:
        email_contents(str): Content of the e-mail 
    Return:
        word_indices(list): List of word indexes.
    
    '''

    vocabList = getVocabList()
    
    word_indices = []

    #Lowercase all e-mail contents 
    email_contents = email_contents.lower()
    
    #Replace newlines with spaces
    email_contents = email_contents.replace('\n',' ')

    #Strip HTML tags
    email_contents = re.sub(r'<[^<>]+>', ' ', email_contents)
    
    #Replace numbers 
    email_contents = re.sub(r'[0-9]+', 'number', email_contents)
    
    #Handle URLs
    email_contents = re.sub(r'(http|https)://[^\s]*', 'httpaddr', email_contents)
    
    #Handle e-mail addresses
    email_contents = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', email_contents)
    
    #Handle $ sign
    email_contents = re.sub('[$]+', 'dollar', email_contents)
    
    email_contents = word_tokenize(email_contents)
    
    process_email_contents = []
    
    for el in email_contents:
        # Remove punctuation
        element = (el.translate(str.maketrans('', '', string.punctuation)))
        # Retain only alphanumeric
        element = re.sub(r'\W+', '', element)
        if len(element)>=1:
            process_email_contents.append(stemmer.stem(element))

    # Loop through each element and find corresponding vocab integer value
    for el in process_email_contents:
        try:
            word_indices.append([k for k,v in vocabList.items() if v == el][0][0])
        except IndexError:
            # Word is not in the vocabulary, skip it
            pass
        
    return word_indices

In [11]:
# Generate word indices for the processed e-mail
word_indices = processEmail(file_contents)

In [12]:
def emailFeatures(
    word_indices: list
) ->np.array:
    '''
    Returns vectorized version of the e-mail using word 
    indexes.
    Each array element is mapped to an array 
    consisting of 0's and 1's where 1's are the
    presence of the word at index n in the e-mail.
    
    Args:
        word_indices(list): List of word indexes
    Returns:
        vectorized_features(np.array): Word vector.
    '''
    
    vocabList = getVocabList()
    
    vectorized_features = np.zeros(len(vocabList))
    # Vocabulary indices start at 1, array positions start at 0
    for i in word_indices:
        vectorized_features[i - 1] = 1
    
    return vectorized_features

In [13]:
features = emailFeatures(word_indices)

In [14]:
print('Length of feature vector is {}'.format(len(features)))
print('Length of non-zero elements is {}'.format(features.sum()))

Length of feature vector is 1899
Length of non-zero elements is 45.0
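
To make the mapping from word indices to a binary vector concrete, here is a small sketch with plain Python lists. The helper `binary_features` and the toy indices are hypothetical, and the sketch assumes that vocabulary indices start at 1 while list positions start at 0.

```python
def binary_features(word_indices, vocab_size):
    """Return a 0/1 list where position i - 1 is 1 if vocabulary index i occurs."""
    features = [0] * vocab_size
    for idx in set(word_indices):
        # Shift by one: vocabulary indices are 1-based, list positions are 0-based
        features[idx - 1] = 1
    return features

# Toy vocabulary of 6 words; index 2 occurs twice but is only counted once
toy = binary_features([2, 5, 2], vocab_size=6)
print(toy)        # [0, 1, 0, 0, 1, 0]
print(sum(toy))   # 2
```

Repeated words do not increase the count, which is why the number of non-zero elements can be smaller than the number of tokens in the e-mail.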


# Exercise 2 - Load Pre-Computed Features and Train SVM

In [15]:
# Use scipy.io to load the matrix object with exercise data
spam_file = io.loadmat('spamTrain.mat')
X = np.array(spam_file['X'])
y = np.array(spam_file['y'])

We have loaded the pre-computed feature matrices for all the spam e-mails, built with the vocab list above.
<br>
This matrix object was provided in Andrew Ng's class, so we don't need to compute anything.

**As in the first part of exercise 6, we are going to train a Linear SVM and assess the results.**

In [16]:
def svmTrain(
    X: np.array, 
    y: np.array, 
    C: float,
    max_iter:int
) -> SVC:
    
    '''
    Trains a Support Vector Machine Classifier using sklearn
    library. 
    
    Args:
        X(np.array): Array of original features.
        y(np.array): Array of target values.
        C(float): Penalty of the Support Vector Machine
        max_iter(int): Number of iterations (currently not
        passed on to the classifier)
        
    Returns:
        svm_classifier(sklearn.svm.SVC): trained
        classifier.
    '''
    
    svm_classifier = SVC(C=C, kernel='linear', probability=True)
    # Flatten y from a column vector to the 1-D shape sklearn expects
    svm_classifier.fit(X, y.ravel())     
    
    return svm_classifier

In [17]:
# Train Model with a 0.1 penalty
C = 0.1
model = svmTrain(X,y,C,100)

In [18]:
# Predict if spam/not spam based on model - we'll use the sklearn predict method 
p = model.predict(X)

In [19]:
print('Model accuracy is {}'.format((p.reshape(len(p),1)==y).sum()/len(y)*100))

Model accuracy is 99.825
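
The accuracy above is the share of e-mails whose prediction matches the label. The `reshape` in the cell is there because `predict` returns a 1-D array while `y` is loaded as a column, and comparing the two directly would broadcast to an n-by-n matrix. A minimal sketch of the same calculation with plain lists, where `accuracy_percent` and the toy values are hypothetical:

```python
def accuracy_percent(predictions, labels):
    """Percentage of positions where prediction and label agree."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    matches = sum(int(p == t) for p, t in zip(predictions, labels))
    return matches / len(labels) * 100

# Toy values: three of the four predictions agree with the labels
toy_pred = [1, 0, 0, 1]
toy_true = [1, 0, 1, 1]
print(accuracy_percent(toy_pred, toy_true))   # 75.0
```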


**Accuracy is really high on the training set.
<br>
Let's check the performance on the test set:**

In [20]:
# Use scipy.io to load the matrix object with exercise test data
spam_file = io.loadmat('spamTest.mat')
X_test = np.array(spam_file['Xtest'])
y_test = np.array(spam_file['ytest'])

In [21]:
# Predict if spam/not spam based on model - we'll use the sklearn predict method 
p_test = model.predict(X_test)

In [22]:
print('Model accuracy is {}'.format((p_test.reshape(len(p_test),1)==y_test).sum()/len(y_test)*100))

Model accuracy is 98.9


Model accuracy on the test set is also really good.
<br>
**Let's take a look at the weights the model assigns to the features to see how much each one influences the target variable.**
<br>
**Let's look at the top predictors for spam, that is, the words that weigh the most in the spam/not spam classification:**

In [24]:
vocabList = getVocabList()

# Rely on the coefficients of the model to obtain the variable influence
weights = model.coef_[0]
weights = dict(np.hstack((np.arange(1,1900).reshape(1899,1),weights.reshape(1899,1))))

# Sort Weights in Dictionary
weights = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)

# Printing the top predictors of spam
for i in weights[:15]:
    print({v for k,v in vocabList.items() if k[0] == i[0]})

{'our'}
{'click'}
{'remov'}
{'guarante'}
{'visit'}
{'basenumb'}
{'dollar'}
{'will'}
{'price'}
{'pleas'}
{'most'}
{'nbsp'}
{'lo'}
{'ga'}
{'hour'}
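
The ranking step can also be written without building an intermediate dictionary of weights. The sketch below uses toy values only: `top_words`, `toy_weights` and `toy_vocab` are hypothetical, and it assumes that position j of the weight vector belongs to vocabulary index j + 1.

```python
def top_words(weights, index_to_word, n=3):
    """Return the n words with the largest weights, highest first."""
    # Pair each weight with its 1-based vocabulary index, then sort by weight
    ranked = sorted(enumerate(weights, start=1), key=lambda kv: kv[1], reverse=True)
    return [index_to_word[idx] for idx, _ in ranked[:n]]

# Toy weights and vocabulary, not taken from the trained model
toy_weights = [0.1, 0.5, -0.3, 0.4]
toy_vocab = {1: 'hello', 2: 'click', 3: 'thank', 4: 'price'}
print(top_words(toy_weights, toy_vocab))   # ['click', 'price', 'hello']
```

Large positive weights push the linear SVM towards the spam class, so the words at the top of this ranking are the strongest spam indicators.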
