Programming Exercise 6: Support Vector Machines
============

In this exercise, you will be using support vector machines (SVMs) to build a spam classifier. Before starting on the programming exercise, we strongly recommend watching the video lectures and completing the review questions for the associated topics. The files included in this Exercise are:  

- spamTrain.mat - Spam training set
- spamTest.mat - Spam test set
- emailSample1.txt - Sample email 1
- emailSample2.txt - Sample email 2
- spamSample1.txt - Sample spam 1
- spamSample2.txt - Sample spam 2
- vocab.txt - Vocabulary list
- getVocabList.m - Load vocabulary list
- porterStemmer.m - Stemming function
- readFile.m - Reads a file into a character string

You only have to implement these two functions: 

- emailFeatures.m - Feature extraction from emails
- processEmail.m - Email preprocessing

The contained files are found in File ==> Open or in the readonly section of Assignment6b in the home page. We highly recommend that you take a look at them as you make progress in this exercise. 

### NOTE:

You will find cells which contain the comment % GRADED FUNCTION: functionName. Do not edit that comment. Those cells will be used to grade your assignment. Each block of code with that comment should only have the function. 

Instructions will be provided as needed in the exercise. 


#### After submitting your assignment, you can [check your grades here](https://www.coursera.org/learn/machine-learning/programming/jbLrz/svm-on-spam-email). 

Spam Classification
===================

Many email services today provide spam filters that are able to classify
emails into spam and non-spam email with high accuracy. In this exercise, you will use SVMs to build your own spam filter!

You will be training a classifier to classify whether a given email,
$x$, is spam ($y=1$) or non-spam ($y=0$). In particular, you need to
convert each email into a feature vector $x \in \mathbb{R}^n$. The
following parts of the exercise will walk you through how such a feature
vector can be constructed from an email.

The dataset included for this exercise is based on
a subset of the SpamAssassin Public Corpus. For the purpose of this
exercise, you will only be using the body of the email (excluding the
email headers).

Preprocessing Emails
--------------------

<table border = "0" width = "600"><tr><td> 

> Anyone knows how much it costs to host a web portal ? <br 
/> > <br
/>Well, it depends on how many visitors you're expecting.
This can be anywhere from less than 10 bucks a month to a couple of $100. 
You should checkout http://www.rackspace.com/ or perhaps Amazon EC2 
if you’re running something big.. <br 
/> <br

/> To unsubscribe yourself from this mailing list, send an email to:
groupname-unsubscribe@egroups.com

<caption><center>Sample Email</center></caption>

</td></tr></table>

Before starting on a machine learning task, it is usually insightful to take a look at examples from the dataset. The email displayed above shows a sample email that contains a URL, an email address (at the end), numbers, and dollar amounts. While many emails would contain similar types of entities (e.g., numbers, other URLs, or other email addresses), the specific entities (e.g., the specific URL or specific dollar amount) will be different in almost every email. Therefore, one method often employed in processing emails is to “normalize” these values, so that all URLs are treated the same, all numbers are treated the same, etc. For example, we could replace each URL in the email with the unique string “httpaddr” to indicate that a URL was present. This has the effect of letting the spam classifier make a classification
decision based on whether *any* URL was present, rather than whether a
specific URL was present. This typically improves the performance of a
spam classifier, since spammers often randomize the URLs, and thus the
odds of seeing any particular URL again in a new piece of spam are very
small.

In **processEmail**, we have implemented the following email
preprocessing and normalization steps:

-   **Lower-casing:** The entire email is converted into lower case, so
    that capitalization is ignored (e.g., IndIcaTE is treated the
    same way as Indicate).

-   **Stripping HTML:** All HTML tags are removed from the emails. Many
    emails come with HTML formatting; we remove all the HTML tags,
    so that only the content remains.

-   **Normalizing URLs:** All URLs are replaced with the text
    “*httpaddr*”.

-   **Normalizing Email Addresses:** All email addresses are replaced
    with the text “*emailaddr*”.

-   **Normalizing Numbers:** All numbers are replaced with the text
    “*number*”.

-   **Normalizing Dollars:** All dollar signs (\$) are replaced with the
    text “*dollar*”.

-   **Word Stemming:** Words are reduced to their stemmed form. For
    example, “discount”, “discounts”, “discounted” and “discounting” are
    all replaced with “*discount*”. Sometimes, the Stemmer actually
    strips off additional characters from the end, so “include”,
    “includes”, “included”, and “including” are all replaced with
    “*includ*”.

-   **Removal of non-words:** Non-words and punctuation have been
    removed. All white space (tabs, newlines, spaces) has been
    trimmed to a single space character.

The result of these preprocessing steps is shown below.
While preprocessing has left word fragments and non-words, this form turns out
to be much easier to work with for performing feature extraction.

<table border = "0" width = "600"><tr><td> 

anyon know how much it cost to host a web portal well it depend on how mani 
visitor you re expect thi can be anywher from less than number buck a month to 
a coupl of dollarnumb you should checkout httpaddr or perhap amazon ecnumb if 
your run someth big to unsubscrib yourself from thi mail list send an email 
to emailaddr  

<caption><center>Processed Sample Email</center></caption>

</td></tr></table>
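
As a minimal sketch of the normalization steps listed above, the snippet below applies the same regular expressions that appear in **processEmail** to a short string. The input string is a made-up example, not part of the dataset, and stemming and punctuation removal are left out because they depend on the provided porterStemmer function.

```octave
% Hypothetical one-line email, used only for illustration
s = 'Visit http://example.com NOW for $100, or mail me@example.com';

s = lower(s);                                           % lower-casing
s = regexprep(s, '<[^<>]+>', ' ');                      % strip HTML tags (none here)
s = regexprep(s, '[0-9]+', 'number');                   % normalize numbers
s = regexprep(s, '(http|https)://[^\s]*', 'httpaddr');  % normalize URLs
s = regexprep(s, '[^\s]+@[^\s]+', 'emailaddr');         % normalize email addresses
s = regexprep(s, '[$]+', 'dollar');                     % normalize dollar signs

disp(s);
```

This should print `visit httpaddr now for dollarnumber, or mail emailaddr`. Numbers are replaced before dollar signs, which is why "$100" ends up as the single token "dollarnumber" (later stemmed to "dollarnumb" in the processed email).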

Vocabulary List
--------------------
<table border = "0" width = "75" ><tr><td> 

1 aa <br
/> 2 ab <br
/> 3 abil <br
/> ... <br
/> 86 anyon <br
/> ... <br
/> 916 know <br
/> ... <br
/> 1898 zero <br
/> 1899 zip
<caption><center>Vocab List</center></caption>

</td></tr></table>


After preprocessing the emails, we have a list of words (e.g., the processed email above) for each email. The next step is to choose which words we would like to use in our classifier and which we would want to leave out.
For this exercise, we have chosen only the most frequently occurring words as our set of words to be considered (the vocabulary list). Since words that occur rarely in the training set appear in only a few emails, they might cause the model to overfit our training set. The complete vocabulary list is in the file **vocab.txt**, and the Vocab List above shows what it looks like. Our vocabulary list was selected by choosing all words which occur at least 100 times in the spam corpus, resulting in a list of 1899 words. In practice, a vocabulary list with about 10,000 to 50,000 words is often used.
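
To illustrate the frequency-threshold idea, here is a rough sketch that builds a vocabulary from a tiny corpus. The corpus, the variable names, and the threshold of 2 are all assumptions chosen for illustration; this is not the script that produced **vocab.txt**.

```octave
% Toy corpus of already-preprocessed emails (hypothetical data)
emails = {'buy now buy cheap', 'cheap meds now', 'meeting now'};
minCount = 2;   % the exercise used 100; 2 suits this tiny corpus

tokens = strsplit(strjoin(emails, ' '), ' ');   % all words in one cell array
[words, ~, j] = unique(tokens);                 % distinct words, sorted
counts = accumarray(j(:), 1);                   % occurrences of each distinct word
vocab = words(counts >= minCount);              % keep only the frequent words

disp(vocab);
```

Here "buy", "cheap" and "now" survive, while "meds" and "meeting" (one occurrence each) are dropped.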

<table border = "0" width = "160" ><tr><td> 

86 916 794 1077 883 370 1699 790 1822 1831 883 431 1171 794 1002 1893 1364 592 1676 238 162 89 688 945 1663 1120 1062 1699 375 1162 479 1893 1510 799 1182 1237 810 1895 1440 1547 181 1699 1758 1896 688 1676 992 961 1477 71 530 1699 531 <br 
/> 

<caption><center>Word Indices for Sample Email</center></caption>

</td></tr></table>

Given the vocabulary list, we can now map each word in the preprocessed emails (such as the one shown above) into a list of word indices that contains the index of the word in the vocabulary list. *Word Indices for Sample Email* shows the mapping for the sample email. Specifically, in the sample email, the word “anyone” was first normalized to “anyon” and then mapped onto the index 86 in the vocabulary list.
Your task now is to complete the code in *processEmail* to perform this mapping. 

In the code below, you are given a string **str** which is a single word from the processed email. You should look up the word in the vocabulary list, **vocabList**, and check whether it exists there. If the word exists, you should add the index of the word into the word indices variable. If the word does not exist, and is therefore not in the vocabulary, you can skip the word.
Once you have implemented processEmail, you will run your code on the email sample, and you should see the processed email along with its word indices. 

#### Implementation

To use an SVM to classify emails into spam vs. non-spam, you first need to convert each email into a vector of features. In this part, you will implement the preprocessing steps for each email. You should complete the code in the function below to produce a word indices vector for a given email. 

In Octave/MATLAB, you can compare two strings with the strcmp function. For example, strcmp(str1, str2) will return 1 only when both strings are equal. In the provided starter code, vocabList is a “cell-array” containing the words in the vocabulary. In Octave/MATLAB, a cell-array is just like a normal array (i.e., a vector), except that its elements can also be strings (which they can’t in a normal Octave/MATLAB matrix/vector), and you index into them using curly braces instead of parentheses. Specifically, to get the word at index i, you can use vocabList{i}. You can also use length(vocabList) to get the number of words in the vocabulary.
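
The following sketch shows these operations on a three-word toy vocabulary. The toy list and the variable names are assumptions for illustration; the real vocabList is loaded by getVocabList.

```octave
vocabList = {'aa'; 'ab'; 'abil'};            % toy vocabulary (assumption)

thirdWord = vocabList{3};                    % curly braces return the string 'abil'
numWords  = length(vocabList);               % number of words: 3
same      = strcmp('abil', thirdWord);       % 1, because the strings are equal

% strcmp also accepts a whole cell array and compares element by element
match     = strcmp('ab', vocabList);         % logical vector [0; 1; 0]
idx       = find(match);                     % position of 'ab' in the list: 2
notFound  = find(strcmp('zzz', vocabList));  % empty when the word is absent
```

Comparing against the whole cell array at once avoids an explicit loop over the vocabulary, although a loop with vocabList{i} works equally well.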


word_indices = **processEmail**(email_contents) preprocesses the body of an email and returns a list of indices of the words contained in the email.  

Fill in this function (in the YOUR CODE HERE section) to add the index of str to word_indices if it is in the vocabulary. At this point of the code, you have a stemmed word from the email in the variable str. You should look up str in the vocabulary list (vocabList). If a match exists, you should add the index of the word to the word_indices vector. Concretely, if str = 'action', then you should look up the vocabulary list to find where in vocabList 'action' appears. For example, if vocabList{18} = 'action', then you should add 18 to the word_indices vector (e.g., word_indices = [word_indices ; 18]; ). **vocabList{idx}** returns the word with index idx in the
vocabulary list. You can use strcmp(str1, str2) to compare two strings (str1 and
str2). It will return 1 only if the two strings are equivalent.
   




In [33]:
% GRADED FUNCTION: processEmail
function word_indices = processEmail(email_contents)
%PROCESSEMAIL preprocesses the body of an email and
%returns a list of word_indices 
%   word_indices = PROCESSEMAIL(email_contents) preprocesses 
%   the body of an email and returns a list of indices of the 
%   words contained in the email. 
%

% Load Vocabulary
vocabList = getVocabList();

% Init return value
word_indices = [];

% ========================== Preprocess Email ===========================

% Find the Headers ( \n\n and remove )
% Uncomment the following lines if you are working with raw emails with the
% full headers

% hdrstart = strfind(email_contents, ([char(10) char(10)]));
% email_contents = email_contents(hdrstart(1):end);

% Lower case
email_contents = lower(email_contents);

% Strip all HTML
% Looks for any expression that starts with < and ends with >, does not
% contain any < or > inside the tag, and replaces it with a space
email_contents = regexprep(email_contents, '<[^<>]+>', ' ');

% Handle Numbers
% Look for one or more characters between 0-9
email_contents = regexprep(email_contents, '[0-9]+', 'number');

% Handle URLS
% Look for strings starting with http:// or https://
email_contents = regexprep(email_contents, ...
                           '(http|https)://[^\s]*', 'httpaddr');

% Handle Email Addresses
% Look for strings with @ in the middle
email_contents = regexprep(email_contents, '[^\s]+@[^\s]+', 'emailaddr');

% Handle $ sign
email_contents = regexprep(email_contents, '[$]+', 'dollar');


% ========================== Tokenize Email ===========================

% Output the email to screen as well
fprintf('\n==== Processed Email ====\n\n');

% Process file
l = 0;

while ~isempty(email_contents)

    % Tokenize and also get rid of any punctuation
    [str, email_contents] = ...
       strtok(email_contents, ...
              [' @$/#.-:&*+=[]?!(){},''">_<;%' char(10) char(13)]);
   
    % Remove any non alphanumeric characters
    str = regexprep(str, '[^a-zA-Z0-9]', '');

    % Stem the word 
    % (the porterStemmer sometimes has issues, so we use a try catch block)
    try str = porterStemmer(strtrim(str)); 
    catch str = ''; continue;
    end;

    % Skip the word if it is too short
    if length(str) < 1
       continue;
    end

    % Look up the word in the dictionary and add to word_indices if
    % found
    % ====================== YOUR CODE HERE ======================
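    match = strcmp(str, vocabList);   % logical vector, 1 where the word matches
    if any(match)
        % find(match, 1) yields a single index even if a word were listed twice
        word_indices(end+1,1) = find(match, 1);
    end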
%     for i = 1:size(vocabList,1),
%       if strcmp(vocabList{i},str) == 1,
%         word_indices = [word_indices; i];
%       end
%     end 



    % =============================================================


    % Print to screen, ensuring that the output lines are not too long
    if (l + length(str) + 1) > 78
        fprintf('\n');
        l = 0;
    end
    fprintf('%s ', str);
    l = l + length(str) + 1;

end

% Print footer
fprintf('\n\n=========================\n');

end

In [34]:
warning('off'); addpath('../../readonly/Assignment6b/');
file_contents = readFile('emailSample1.txt');     % Extract Features
word_indices  = processEmail(file_contents);
word_indices % == [86 916 794 1077 883 370 1699 790 1822 1831 883 431 1171 794 1002 1893 1364 592 1676 238 162 89 688 945 1663 1120 1062 1699 375 1162 479 1893 1510 799 1182 1237 810 1895 1440 1547 181 1699 1758 1896 688 1676 992 961 1477 71 530 1699 531]


==== Processed Email ====

anyon know how much it cost to host a web portal well it depend on how mani 
visitor you re expect thi can be anywher from less than number buck a month 
to a coupl of dollarnumb you should checkout httpaddr or perhap amazon ecnumb 
if your run someth big to unsubscrib yourself from thi mail list send an 
email to emailaddr 

word_indices =

     86
    916
    794
   1077
    883
    370
   1699
    790
   1822
   1831
    883
    431
   1171
    794
   1002
   1893
   1364
    592
   1676
    238
    162
     89
    688
    945
   1663
   1120
   1062
   1699
    375
   1162
    479
   1893
   1510
    799
   1182
   1237
    810
   1895
   1440
   1547
    181
   1699
   1758
   1896
    688
   1676
    992
    961
   1477
     71
    530
   1699
    531



**Expected Output**:

*Word Indices for Sample Email*, as shown in the table above.

Extracting Features from Emails
-------------------------------

You will now implement the feature extraction that converts each email
into a vector in $\mathbb{R}^n$. For this exercise, you will be using
$n=$ \# words in vocabulary list. Specifically, the feature
$x_i \in \{0,1\}$ for an email corresponds to whether the $i$-th word in
the dictionary occurs in the email. That is, $x_i=1$ if the $i$-th word
is in the email and $x_i=0$ if the $i$-th word is not present in the
email.

Thus, for a typical email, this feature would look like:

$$x = \begin{bmatrix} 
0 \\
\vdots \\ 
1 \\
0 \\
\vdots \\ 
1  \\
0 \\ 
\vdots \\ 
0 
\end{bmatrix} \in \mathbb{R}^n.$$

You should now complete the code in **emailFeatures** to generate a
feature vector for an email, given the *word_indices*. Once you have implemented **emailFeatures**, we will run your code on the email sample. 

**Implementation**: 

x = emailFeatures(word_indices) takes in a word_indices vector and produces a feature vector from the word indices. 

Fill in the function to return a feature vector for the given email (word_indices). To help make it easier to process the emails, we have already pre-processed each email and converted each word in the email into an index in a fixed dictionary (of 1899 words). The variable word_indices contains the list of indices of the words which occur in one email.

For example, if an email has the text:

 - The quick brown fox jumped over the lazy dog.

Then, the word_indices vector for this text might look like:
              
 - 60  100   33   44   10     53  60  58   5

where we have mapped each word onto a number, for example:

- the   -- 60
- quick -- 100
- ...

(note: the above numbers are just an example and are not the actual mappings).

Your task is to take one such word_indices vector and construct a binary feature vector that indicates whether a particular word occurs in the email. That is, x(i) = 1 when word i is present in the email. Concretely, if the word 'the' (say, index 60) appears in the email, then x(60) = 1. The feature vector should look like:

- x = [ 0 0 0 0 1 0 0 0 ... 0 0 0 0 1 ... 0 0 0 1 0 ..];






In [54]:
% GRADED FUNCTION: emailFeatures
function x = emailFeatures(word_indices)

n = 1899;                   % Total number of words in the dictionary

x = zeros(n, 1);            % You need to return the following variable correctly.

% ====================== YOUR CODE HERE ======================
% for w = 1:length(word_indices)
%     word = word_indices(w);
%     x(word,1) = x(word,1) + 1;
% end
x(word_indices) = 1;

% =============================================================
end

In [55]:
file_contents = readFile('emailSample1.txt');      % Extract Features
word_indices  = processEmail(file_contents);
features      = emailFeatures(word_indices);

% Print Stats
fprintf('Length of feature vector: %d\n', length(features));
fprintf('Number of non-zero entries: %d\n', sum(features > 0));



==== Processed Email ====

anyon know how much it cost to host a web portal well it depend on how mani 
visitor you re expect thi can be anywher from less than number buck a month 
to a coupl of dollarnumb you should checkout httpaddr or perhap amazon ecnumb 
if your run someth big to unsubscrib yourself from thi mail list send an 
email to emailaddr 

Length of feature vector: 1899
Number of non-zero entries: 45


**Expected Output**:

Length of feature vector: 1899

Number of non-zero entries: 45
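
Note that the sample email produced 53 word indices but only 45 non-zero features: a word that appears several times still sets just one entry of the binary vector. A small sketch with a hypothetical 10-word vocabulary (the sizes and indices are made up for illustration) shows the same effect:

```octave
n = 10;                             % hypothetical vocabulary size
word_indices = [3; 7; 3; 9; 7];     % words 3 and 7 each appear twice

x = zeros(n, 1);
x(word_indices) = 1;                % repeated indices set the same entry to 1

numIndices  = length(word_indices);           % 5 word occurrences
numNonZero  = sum(x > 0);                     % 3 distinct words present
numDistinct = length(unique(word_indices));   % also 3

fprintf('%d indices, %d non-zero features\n', numIndices, numNonZero);
```

The number of non-zero entries therefore equals the number of distinct vocabulary words in the email, not the number of tokens.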

Training SVM for Spam Classification
------------------------------------

After you have completed the feature extraction functions, the next step
loads a preprocessed training dataset that
will be used to train an SVM classifier. **spamTrain.mat** contains
4000 training examples of spam and non-spam email, while
**spamTest.mat** contains 1000 test examples. Each original email was
processed using the **processEmail** and **emailFeatures**
functions and converted into a vector $x^{(i)} \in \mathbb{R}^{1899}$.

After loading the dataset, we will proceed to train an
SVM to classify between spam ($y=1$) and non-spam ($y=0$) emails. This might take 1 to 2 minutes. 

In [47]:
load('spamTrain.mat');     % Load the spam email dataset; X and y are now in your environment

C = 0.1;                   % Cost 
model = svmTrain(X, y, C, @linearKernel);

p = svmPredict(model, X);

Accuracy = mean(double(p == y)) * 100


Training ......................................................................
...............................................................................
...............................................................................
... Done! 

Accuracy =  99.825


**Expected Output**: 

Accuracy $\approx$  99.8 %

Now that we have trained our classifier, we will proceed to evaluate it on a test set. This might take 1 to 2 minutes. 

In [48]:
load('spamTest.mat');          % You will have Xtest, ytest in your environment

p = svmPredict(model, Xtest);

Accuracy =  mean(double(p == ytest)) * 100


Accuracy =  98.900


**Expected Output**

Accuracy $\approx$  98%
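
The accuracy reported in both cells is simply the percentage of predictions that agree with the labels. As a small sketch with made-up prediction and label vectors (not taken from the dataset):

```octave
p = [1; 0; 1; 1; 0];    % hypothetical predictions
y = [1; 0; 0; 1; 0];    % hypothetical true labels

% p == y is 1 where the prediction is correct; the mean is the fraction correct
accuracy = mean(double(p == y)) * 100;

wrong = find(p ~= y);   % indices of the misclassified examples

fprintf('Accuracy: %.1f%%\n', accuracy);
```

Four of the five predictions match, so this gives 80%. Inspecting `wrong` on the real test set is one way to see which emails the classifier misjudges.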


Top Predictors for Spam
-----------------------

<table border = "0" width = "550" ><tr><td> 
<center>
 our click  remov  guarante visit  basenumb dollar will price pleas most lo nbsp ga da </center> <br /> 
<caption><center>Top Predictors for Spam Email</center></caption>
</td></tr></table>

To better understand how the spam classifier works, we can inspect the
parameters to see which words the classifier thinks are the most
predictive of spam. The next step below finds the
parameters with the largest positive values in the classifier and
displays the corresponding words in the box above. Thus, if an
email contains words such as “guarantee”, “remove”, “dollar”, and
“price” (the top predictors above), it is
likely to be classified as spam.

**Implementation** 

Since the model we are training is a linear SVM, we can inspect the 
weights learned by the model to understand better how it is determining 
whether an email is spam or not. The following code finds the words with 
the highest weights in the classifier. Informally, the classifier 
'thinks' that these words are the most likely indicators of spam.




In [49]:
[weight, idx] = sort(model.w, 'descend');           % Sort the weights, then obtain the vocabulary list
vocabList = getVocabList();

fprintf('\nTop predictors of spam: \n');
for i = 1:15
    fprintf(' %-15s (%f) \n', vocabList{idx(i)}, weight(i));
end


Top predictors of spam: 
 our             (0.503380) 
 click           (0.467845) 
 remov           (0.416106) 
 guarante        (0.384495) 
 visit           (0.367830) 
 basenumb        (0.349906) 
 dollar          (0.322995) 
 will            (0.271084) 
 price           (0.267594) 
 nbsp            (0.260645) 
 pleas           (0.259834) 
 most            (0.259631) 
 lo              (0.256661) 
 ga              (0.247400) 
 hour            (0.242709) 
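
To see why large positive weights point to spam, the sketch below scores one email with a linear decision function of the usual form $w^T x + b$. It is an illustration under stated assumptions only: the four-word vocabulary, the weights, and the intercept are invented, and the provided svmPredict function is not reproduced here.

```octave
vocab = {'click'; 'meet'; 'price'; 'thank'};   % hypothetical 4-word vocabulary
w = [0.47; -0.20; 0.27; -0.35];                % hypothetical learned weights
b = -0.10;                                     % hypothetical intercept

x = [1; 0; 1; 0];                % email containing 'click' and 'price'
score = w' * x + b;              % 0.47 + 0.27 - 0.10 = 0.64
isSpam = score >= 0;             % a non-negative score means "spam" here

[~, order] = sort(w, 'descend'); % same sorting idea as in the cell above
topWords = vocab(order(1:2));    % the two strongest spam indicators
```

Each word present in the email adds its weight to the score, so words with the largest positive weights push the decision toward spam most strongly.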


Optional (ungraded) exercise: Try your own emails
-------------------------------------------------

Now that you have trained a spam classifier, you can start trying it out
on your own emails. In the starter code, we have included two email
examples (*emailSample1.txt* and *emailSample2.txt*) and two
spam examples (*spamSample1.txt* and *spamSample2.txt*). The
last part below runs the spam classifier over the first
spam example and classifies it using the learned SVM. You should now try
the other examples we have provided and see if the classifier gets them
right. You can also try your own emails by replacing the examples (plain
text files) with your own emails.

To add them, click File ==> Open, and upload them. 


The following code reads in one of these examples and then uses your learned SVM classifier to determine whether the email is spam or not spam.

In [50]:
% Set the file to be read in (change this to spamSample2.txt,
% emailSample1.txt or emailSample2.txt to see different predictions on
% different email types). Try your own emails as well!
filename = 'spamSample1.txt';

% Read and predict
file_contents = readFile(filename);
word_indices  = processEmail(file_contents);
x             = emailFeatures(word_indices);
p = svmPredict(model, x);

fprintf('\nProcessed %s\n\nSpam Classification: %d\n', filename, p);
fprintf('(1 indicates spam, 0 indicates not spam)\n\n');


==== Processed Email ====

do you want to make dollarnumb or more per week if you ar a motiv and qualifi 
individu i will person demonstr to you a system that will make you dollarnumb 
number per week or more thi is not mlm call our number hour pre record number 
to get the detail number number number i need peopl who want to make seriou 
monei make the call and get the fact invest number minut in yourself now 
number number number look forward to your call and i will introduc you to 
peopl like yourself who ar current make dollarnumb number plu per week number 
number number numberljgvnumb numberleannumberlrmsnumb 
numberwxhonumberqiytnumb numberrjuvnumberhqcfnumb numbereidbnumberdmtvlnumb 


Processed spamSample1.txt

Spam Classification: 1
(1 indicates spam, 0 indicates not spam)



Optional (ungraded) exercise: Build your own dataset
----------------------------------------------------

In this exercise, we provided a preprocessed training set and test set.
These datasets were created using the same functions
*processEmail* and *emailFeatures* that you now have
completed. For this optional (ungraded) exercise, you will build your
own dataset using the original emails from the SpamAssassin Public
Corpus.

Your task in this optional (ungraded) exercise is to download the
original files from the public corpus and extract them. After extracting
them, you should run the processEmail and
emailFeatures functions on each email to extract a feature vector
from each email. This will allow you to build a dataset $X$, $y$ of
examples. You should then randomly divide up the dataset into a training
set, a cross validation set and a test set.
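
One possible way to do the random split is sketched below. The 60/20/20 proportions, the stand-in data, and all variable names are assumptions for illustration; the exercise does not prescribe particular ratios.

```octave
m = 10;                               % hypothetical number of examples
X = rand(m, 4);                       % stand-in feature matrix
y = double(rand(m, 1) > 0.5);         % stand-in labels

order  = randperm(m);                 % random ordering of the example indices
nTrain = round(0.6 * m);              % 60% for training (assumed ratio)
nVal   = round(0.2 * m);              % 20% for cross validation (assumed ratio)

trainIdx = order(1:nTrain);
valIdx   = order(nTrain+1 : nTrain+nVal);
testIdx  = order(nTrain+nVal+1 : end);    % the remainder is the test set

Xtrain = X(trainIdx, :);  ytrain = y(trainIdx);
Xval   = X(valIdx, :);    yval   = y(valIdx);
Xtest  = X(testIdx, :);   ytest  = y(testIdx);
```

Shuffling the indices once and slicing them keeps the three sets disjoint while ensuring every example is used exactly once.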

While you are building your own dataset, we also encourage you to try
building your own vocabulary list (by selecting the high frequency words
that occur in the dataset) and adding any additional features that you
think might be useful.

**Note:**  The original emails will have email headers that you might wish to leave out. We have included code in *processEmail* that will help you remove these headers.
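
As a rough sketch of that header-removal idea, the snippet below looks for the first blank line (two consecutive newline characters) and keeps only what follows. The raw email string is invented for illustration. Unlike the commented-out lines in *processEmail*, this version also skips the two newline characters themselves, and it assumes plain LF line endings.

```octave
nl  = char(10);   % newline character
raw = ['From: sender@example.com' nl 'Subject: hello' nl nl 'Body text here'];

hdrEnd = strfind(raw, [nl nl]);      % positions of blank lines
if ~isempty(hdrEnd)
    body = raw(hdrEnd(1)+2 : end);   % text after the first blank line
else
    body = raw;                      % no headers found: keep everything
end

disp(body);
```

Emails that use CRLF line endings would need the search pattern adjusted accordingly.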

In [None]:
% Your code below - Optional
