
# Spam Detection Using the Naive Bayes Algorithm

The process of creating a spam detector using the naive Bayes algorithm is split up into four steps.

  - Create a set of the most common words occurring in spam and ham (i.e. non-spam) emails.
  - For every word in this set, compute the conditional probability that the word occurs in an email, given that the email is spam, and likewise given that it is ham.
  - Create a function that takes an email and, using the conditional probabilities computed before, computes the probability
    that the given email is spam.
  - Evaluate the <em style='color:blue;'>precision</em> and the <em style='color:blue;'>recall</em> of the spam classifier.

## Step 1: Create Word Dictionary

We need the module `os` for reading directories, the module `math` for computing logarithms, and the module `re` for 
<em style='color:blue;'>regular expressions</em>.

In [2]:
import os
import re
import numpy as np
import math

An object of class <a href='https://docs.python.org/2/library/collections.html#counter-objects'>`Counter`</a> is a special form of a `dictionary` that is used for counting.  We need a counter to figure out what the most common words are.

In [3]:
from collections import Counter
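As a minimal sketch of how a `Counter` behaves when it is updated with sets of words, consider the following example.  The three word sets are made up for illustration and do not come from the email data.  Because each email is represented as a set, every word is counted at most once per email, so the counter ends up storing the number of emails containing a word.

```python
from collections import Counter

# Hypothetical word sets of three emails, for illustration only.
emails = [{'free', 'money'}, {'free', 'meeting'}, {'meeting', 'agenda', 'free'}]

counter = Counter()
for words in emails:
    counter.update(words)      # each word in the set is counted once per email

print(counter['free'])         # 3
print(counter['unknown'])      # 0, missing keys count as zero
print(counter.most_common(1))  # [('free', 3)]
```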

The directory 
https://github.com/karlstroetmann/Artificial-Intelligence/tree/master/Python/EmailData
contains 960 emails that are divided into four subdirectories:

  - `spam-train` contains 350 spam emails for training,
  - `ham-train`  contains 350 non-spam emails for training,
  - `spam-test`  contains 130 spam emails for testing,
  - `ham-test`   contains 130 non-spam emails for testing.

This data was originally collected by **Ion Androutsopoulos**.  I found it on the page 
http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex6/ex6.html provided by Andrew Ng.

We declare some variables so this notebook can be adapted to other data sets.

In [4]:
spam_dir_train = 'EmailData/spam-train/'
ham__dir_train = 'EmailData/ham-train/'
spam_dir_test  = 'EmailData/spam-test/'
ham__dir_test  = 'EmailData/ham-test/'
Directories    = [spam_dir_train, ham__dir_train, spam_dir_test, ham__dir_test]

In order to compute the <em style='color:blue;'>prior probability</em> that an email is ham or spam we need to count the number of spam and ham emails.

In [5]:
no_spam    = len(os.listdir(spam_dir_train))
no_ham     = len(os.listdir(ham__dir_train))
spam_prior = no_spam / (no_spam + no_ham)
ham__prior = no_ham  / (no_spam + no_ham)
spam_prior, ham__prior

(0.5, 0.5)

I have checked that the proportion of spam and ham emails in the test directory is also $1:1$.  If the proportion of spam and ham emails in real life is different from $1:1$, then we would have to use the real-life proportion as the prior probabilities in the spam filter to be developed.
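The following sketch shows how the priors would change for a different proportion.  The helper `priors` and the counts `300` and `100` are assumptions made for illustration; only the `350`/`350` split is taken from the training data.

```python
def priors(no_spam, no_ham):
    """Return the prior probabilities of spam and ham as a pair."""
    total = no_spam + no_ham
    return no_spam / total, no_ham / total

print(priors(350, 350))  # (0.5, 0.5), the training data used here
print(priors(300, 100))  # (0.75, 0.25), a hypothetical 3:1 proportion
```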

The function $\texttt{get_words}(\texttt{fn})$ takes a filename $\texttt{fn}$ as its argument.  It reads the file and returns a set of all words that are found in this file.  The words are transformed to lower case.

In [6]:
def get_words(fn):
    # Use a context manager so that the file is closed after reading.
    with open(fn) as file:
        text = file.read()
    text = text.lower()
    return set(re.findall(r"[\w']+", text))

Let us test this function with a small example mail.

In [7]:
get_words('EmailData/ham-train/3-380msg4.txt')

{'anyone',
 'article',
 'berkeley',
 'book',
 'consonant',
 'edu',
 'english',
 'garnet',
 'hard',
 'helpful',
 'hi',
 'interest',
 'irish',
 'laurel',
 'm',
 'modern',
 'palatal',
 'phonetics',
 'posting',
 'project',
 'recommend',
 'slender',
 'source',
 'specifically',
 'sutton',
 'thank',
 'too',
 'work'}
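To see what the regular expression does without reading a file, the same pattern can be applied to a string.  The sentence below is made up for illustration.  The character class `[\w']` matches letters, digits, underscores, and apostrophes, so punctuation separates words while a contraction stays in one piece.

```python
import re

# Hypothetical text, for illustration only.
text  = "Hi, I'm looking for a BOOK on Irish phonetics; the book should be modern."
words = set(re.findall(r"[\w']+", text.lower()))

print(sorted(words))
print("i'm" in words)   # True, the apostrophe is part of the character class
print(len(words))       # 13, 'book' occurs twice but is stored once
```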

The function `read_all_files` reads all files contained in those directories that are stored in the list `Directories`. 
It returns a `Counter`.  For every word $w$ this counter contains the number of files that contain $w$. 

In [8]:
def read_all_files():
    Words = Counter()
    for directory in Directories:
        for file_name in os.listdir(directory):
            Words.update(get_words(directory + file_name))
    return Words

`Common_Words` is a set of the 2500 most common words found in all of our emails.

In [9]:
N            = 2500             # number of the most common words to use
Word_Counter = read_all_files()
Word_Counter

Counter({'eminent': 9,
         'earn': 69,
         'experience': 123,
         'through': 155,
         'phd': 22,
         'prestige': 9,
         'increase': 69,
         'grant': 23,
         'effort': 75,
         'mba': 8,
         'choice': 51,
         'here': 259,
         'short': 86,
         'field': 117,
         'part': 131,
         'personal': 102,
         'programs': 21,
         'base': 134,
         'ba': 13,
         'phone': 202,
         'power': 52,
         'necessary': 55,
         'degree': 41,
         'further': 154,
         'detail': 143,
         'call': 347,
         'advance': 81,
         'require': 131,
         'nonaccredit': 8,
         'award': 20,
         'present': 142,
         'knowledge': 72,
         'money': 187,
         'university': 307,
         'diploma': 10,
         'ma': 37,
         'cost': 147,
         'entire': 45,
         'conference': 138,
         'grab': 9,
         'week': 173,
         'receive': 283,
         'start': 

In [10]:
Common_Words = { w for w, _ in Word_Counter.most_common(N) }
Common_Words

{'load',
 'comprise',
 'familiar',
 'teen',
 'massive',
 'gamble',
 'none',
 'implementation',
 'majority',
 'cgibin',
 'rejection',
 'smaller',
 'launch',
 'lee',
 'database',
 'food',
 'window',
 'transfer',
 'candidate',
 'delay',
 'frank',
 'multus',
 'late',
 'engage',
 'work',
 'government',
 'newest',
 'call',
 'vulgarity',
 'material',
 'organisation',
 'affect',
 'hundreds',
 'propose',
 'john',
 'campus',
 'competition',
 'view',
 'penny',
 'currency',
 'gender',
 'class',
 'santa',
 'andrew',
 'bonus',
 'refinance',
 'organization',
 'eric',
 'site',
 'ongo',
 'italy',
 'sprachwissenschaft',
 'operator',
 'little',
 'appear',
 'ms',
 'actually',
 'perform',
 'monthly',
 'opposite',
 'latex',
 'job',
 'forum',
 'correct',
 'install',
 'miss',
 'local',
 'remain',
 'chain',
 'music',
 'ready',
 'hundr',
 'dinner',
 'bill',
 'singapore',
 'option',
 'multimedium',
 'dialect',
 'translation',
 'most',
 'different',
 'literature',
 'unite',
 'sit',
 'sun',
 'desirous',
 'bear',
 

## Computing the Conditional Probabilities

Having computed the most common words, we are now ready to compute the conditional probability that a given word occurs in a spam email.

The function $\texttt{get_common_words}(\texttt{fn})$ takes a filename $\texttt{fn}$ 
as its argument.  It reads the file and returns the set of all words in `Common_Words` that are found in the given file.  

In [11]:
def get_common_words(fn):
    return get_words(fn) & Common_Words

We test this function for a small email.

In [12]:
get_common_words('EmailData/ham-train/3-380msg4.txt')

{'anyone',
 'article',
 'berkeley',
 'book',
 'consonant',
 'edu',
 'english',
 'hard',
 'helpful',
 'hi',
 'interest',
 'm',
 'modern',
 'phonetics',
 'project',
 'recommend',
 'source',
 'specifically',
 'thank',
 'too',
 'work'}

The function `count_commmon_words` takes a string specifying a `directory`.  It returns a 
`Counter` that, for every word in `Common_Words`, counts the number of files in `directory` that contain this word.

In [13]:
def count_commmon_words(directory):
    Words = Counter()
    for file_name in os.listdir(directory):
        Words.update(get_common_words(directory + file_name))
    return Words

Next, we compute counters that store, for every common word, the number of spam and ham training emails that contain this word.

In [14]:
Spam_Counter = count_commmon_words(spam_dir_train)
Spam_Counter

Counter({'earn': 51,
         'experience': 63,
         'through': 75,
         'phd': 6,
         'increase': 39,
         'grant': 12,
         'effort': 42,
         'choice': 23,
         'here': 146,
         'short': 38,
         'field': 33,
         'part': 50,
         'personal': 67,
         'programs': 15,
         'base': 42,
         'ba': 8,
         'phone': 93,
         'power': 30,
         'necessary': 25,
         'degree': 9,
         'further': 51,
         'detail': 55,
         'call': 132,
         'advance': 20,
         'require': 64,
         'award': 13,
         'present': 27,
         'knowledge': 30,
         'money': 140,
         'university': 15,
         'diploma': 7,
         'ma': 13,
         'cost': 99,
         'entire': 29,
         'conference': 6,
         'week': 104,
         'receive': 157,
         'start': 106,
         'our': 223,
         'delete': 39,
         'po': 27,
         'old': 40,
         'mailer': 15,
         'financial':

In [15]:
Ham__Counter = count_commmon_words(ham__dir_train)
Ham__Counter

Counter({'range': 29,
         'comprise': 4,
         'through': 33,
         'future': 20,
         'lab': 9,
         'practice': 11,
         'coordinate': 7,
         'language': 241,
         'international': 76,
         'research': 116,
         'promise': 5,
         'area': 72,
         'broad': 10,
         'www': 116,
         'fund': 12,
         'identify': 30,
         'pari': 15,
         'canada': 28,
         'work': 99,
         'sunday': 9,
         'call': 119,
         'umontreal': 7,
         'follow': 130,
         'assess': 9,
         'therefore': 16,
         'syntax': 65,
         'israel': 8,
         'modify': 8,
         'present': 79,
         'ca': 38,
         'outside': 10,
         'tag': 7,
         'view': 33,
         'usa': 53,
         'current': 32,
         'state': 57,
         'researcher': 40,
         'face': 13,
         'together': 43,
         'programme': 35,
         'morphology': 36,
         'provide': 86,
         'html': 56,
     

For every common word $w$ we compute the probability that $w$ occurs in a spam or ham email.  The formula for spam is:
$$ P(w \in\texttt{Spam}) = \frac{\mbox{number of spam emails containing $w$}}{\mbox{number of all spam emails}} $$
The formula for ham is similar:
$$ P(w \in\texttt{Ham}) = \frac{\mbox{number of ham emails containing $w$}}{\mbox{number of all ham emails}} $$
However, if we used these formulas, then a common word $w$ that, for some reason, has not yet occurred in any spam email would have a 
probability of $0$ of occurring in a spam email.  Hence, our classifier would never classify an email containing the word $w$ as spam.  As this
cannot be right, we assume that there is one further spam email that contains every common word.  This 
<em style='color:blue;'>Laplace smoothing</em> assumption changes the formula for $P(w \in\texttt{Spam})$ as follows:
$$ P(w \in\texttt{Spam}) = \frac{\mbox{number of spam emails containing $w$} + 1}{\mbox{number of all spam emails} + 1} $$
The formula for $P(w \in\texttt{Ham})$ is changed in the same way.
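The effect of the smoothing can be checked with a few numbers.  The helper `smoothed_probability` is a name introduced here for illustration.  The values `350` (number of spam training emails) and `140` (number of spam training emails containing `'money'`) are taken from the outputs shown above.

```python
def smoothed_probability(count, no_emails):
    """Estimate assuming one extra email that contains every common word."""
    return (count + 1) / (no_emails + 1)

no_spam = 350                               # number of spam training emails

print(0 / no_spam)                          # 0.0, the unsmoothed estimate
print(smoothed_probability(0, no_spam))     # 1/351, roughly 0.00285
print(smoothed_probability(140, no_spam))   # 141/351, roughly 0.40
```

A word that never occurs in a spam training email therefore receives the small but positive probability $1/351$ instead of $0$.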

In [16]:
Spam_Probability = {}
Ham__Probability = {}
for w in Common_Words:
    Spam_Probability[w] = (Spam_Counter[w] + 1) / (no_spam + 1) 
    Ham__Probability[w] = (Ham__Counter[w] + 1) / (no_ham  + 1) 
Spam_Probability

{'load': 0.037037037037037035,
 'comprise': 0.011396011396011397,
 'familiar': 0.022792022792022793,
 'teen': 0.045584045584045586,
 'massive': 0.02564102564102564,
 'gamble': 0.04843304843304843,
 'none': 0.019943019943019943,
 'implementation': 0.002849002849002849,
 'majority': 0.05128205128205128,
 'cgibin': 0.02564102564102564,
 'rejection': 0.005698005698005698,
 'smaller': 0.002849002849002849,
 'launch': 0.03418803418803419,
 'lee': 0.011396011396011397,
 'database': 0.07977207977207977,
 'food': 0.022792022792022793,
 'window': 0.09686609686609686,
 'transfer': 0.02564102564102564,
 'candidate': 0.017094017094017096,
 'delay': 0.02849002849002849,
 'frank': 0.03418803418803419,
 'multus': 0.05128205128205128,
 'late': 0.039886039886039885,
 'engage': 0.011396011396011397,
 'work': 0.35327635327635326,
 'government': 0.039886039886039885,
 'newest': 0.03133903133903134,
 'call': 0.3789173789173789,
 'vulgarity': 0.022792022792022793,
 'material': 0.07122507122507123,
 'organisa

According to our computation, the probability that a spam email contains the word `'consonant'` is about $0.28\%$, while the probability that this word occurs in a ham email is about $2.56\%$.

In [17]:
Spam_Probability['consonant'], Ham__Probability['consonant']

(0.002849002849002849, 0.02564102564102564)

For the word `'dollar'` the probability that a spam email contains this word is about $21.1\%$, while the probability that this word occurs in a ham email is about $1.99\%$.

In [18]:
Spam_Probability['dollar'], Ham__Probability['dollar']

(0.21082621082621084, 0.019943019943019943)

## Deciding whether an Email is Spam

Given a file name `fn`, the function `spam_probability` returns the probability that the message contained in the given file is spam.  

When implementing the formula 
$$\arg\max\limits_{C \in \mathcal{C}}  \left(\prod\limits_{i=1}^m P(f_i \;|\; C)\right) \cdot P(C) $$
we have to be careful, because a naive implementation will evaluate the product
$$\prod\limits_{i=1}^m P(f_i \;|\; C)$$
as the number $0$ due to numerical underflow.  The trick to compute this product is to remember that
$$ \ln(a \cdot b) = \ln(a) + \ln(b) $$
and therefore transform the product into a sum of logarithms:
$$ \prod\limits_{i=1}^m P(f_i \;|\; C) = \exp\left(\alpha + \sum\limits_{i=1}^m \ln\bigl(P(f_i \;|\; C)\bigr) \right) \cdot \exp(-\alpha)$$
Here, the constant $\alpha$ has to be chosen such that the application of the function `exp` to the value
$$ \alpha + \sum\limits_{i=1}^m \ln\bigl(P(f_i \;|\; C)\bigr) $$
does not lead to an underflow error.
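The underflow is easy to reproduce.  The sketch below assumes, purely for illustration, that each of the 2500 factors equals $0.01$; the real factors differ from word to word.

```python
import math

# Hypothetical values: one factor of 0.01 for each of the 2500 common words.
probabilities = [0.01] * 2500

product = math.prod(probabilities)                   # naive product
log_sum = sum(math.log(p) for p in probabilities)    # sum of logarithms

print(product)   # 0.0, since 1e-5000 is far below the smallest positive float
print(log_sum)   # about -11512.9
```

The sum of logarithms stays well within the range of floating point numbers, whereas the product itself is rounded to $0$.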

As we want to compute a probability, we have to be aware that the term
$$ \left(\prod\limits_{i=1}^m P(f_i \;|\; C)\right) \cdot P(C) $$
is not the probability that the object is of class $C$ but rather is only *proportional* to this probability.  The fact that the probability
of an email being spam and the probability of it being ham must add up to $1$ enables us to compute the probability.
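A minimal sketch of this normalization, assuming the two log scores already include the logarithm of the prior: the helper `posterior` and the numbers passed to it are hypothetical.  Choosing $\alpha$ as the negative of the larger log score shifts that score to $0$, so at least one of the two exponentials equals $1$ and the division is well defined.

```python
import math

def posterior(log_score_spam, log_score_ham):
    """Turn two unnormalized log scores into the probability of spam."""
    alpha  = -max(log_score_spam, log_score_ham)   # shift the larger score to 0
    p_spam = math.exp(log_score_spam + alpha)
    p_ham  = math.exp(log_score_ham  + alpha)
    return p_spam / (p_spam + p_ham)

print(posterior(-11510.0, -11515.0))   # about 0.9933
print(posterior(-2000.0, -1000.0))     # 0.0, ham is overwhelmingly more likely
```

Without the shift, both `math.exp(-11510.0)` and `math.exp(-11515.0)` would evaluate to `0.0` and the division would fail.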

In [19]:
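def spam_probability(fn):
    log_p_spam = 0.0
    log_p_ham  = 0.0
    words = get_common_words(fn)
    # Every common word contributes a factor: P(w | C) if it occurs in the
    # email and 1 - P(w | C) if it does not.
    for w in Common_Words:
        if w in words:
            log_p_spam += math.log(Spam_Probability[w])
            log_p_ham  += math.log(Ham__Probability[w])
        else:
            log_p_spam += math.log(1.0 - Spam_Probability[w])
            log_p_ham  += math.log(1.0 - Ham__Probability[w])
    # Both sums are at most 0.  Adding alpha shifts the larger one to 0, so
    # that at least one of the exponentials below does not underflow.
    alpha  = abs(max(log_p_spam, log_p_ham))
    p_spam = math.exp(log_p_spam + alpha) * spam_prior
    p_ham  = math.exp(log_p_ham  + alpha) * ham__prior
    # Normalize so that the spam and ham probabilities add up to 1.
    return p_spam / (p_spam + p_ham)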

Let us test this with a ham email.

In [20]:
spam_probability('EmailData/ham-train/3-430msg1.txt')

6.289803980920058e-29

Ok, we got this one right.  Let us check the general performance.

## Evaluate Precision and Recall

In order to evaluate the performance of this algorithm, we need to define two new concepts: <em style='color:blue;'>precision</em> and 
<em style='color:blue;'>recall</em>.  Let us call the ham emails the <em style='color:blue;'>positives</em>, while the spam emails are called the
<em style='color:blue;'>negatives</em>.  Then we define

  - <em style='color:blue;'>true positives</em>: ham emails that are classified as ham,
  - <em style='color:blue;'>false positives</em>: spam emails that are classified as ham,
  - <em style='color:blue;'>true negatives</em>: spam emails that are classified as spam,
  - <em style='color:blue;'>false negatives</em>: ham emails that are classified as spam.
  
The <em style='color:blue;'>precision</em> of the spam classifier is then defined as
$$ \texttt{precision} = \frac{\mbox{number of true positives}}{\mbox{number of true positives} + \mbox{number of false positives}} $$
Therefore, the **precision** is the fraction of the emails classified as ham that are indeed ham.
The <em style='color:blue;'>recall</em> of the spam classifier is defined as
$$ \texttt{recall} = \frac{\mbox{number of true positives}}{\mbox{number of true positives} + \mbox{number of false negatives}} $$
Therefore, the **recall** is the fraction of all ham emails that are indeed classified as ham.  

Usually, it is very important that the recall is high, as we don't want to lose a ham email because our classifier has incorrectly classified it as spam.  
On the other hand, having a high precision is not that important.  After all, if $10\%$ of the emails offered to us as ham are, in fact, spam, we might tolerate this.  However, we would certainly not tolerate losing $10\%$ of our ham emails because they are incorrectly classified as spam.
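The definitions can be tried out on a small example.  The helper `metrics` and the four counts are hypothetical and chosen only so that the arithmetic is easy to follow.  The accuracy, i.e. the fraction of all emails that are classified correctly, is included for comparison.

```python
def metrics(TP, FP, TN, FN):
    """Compute precision, recall and accuracy, with ham as the positive class."""
    precision = TP / (TP + FP)
    recall    = TP / (TP + FN)
    accuracy  = (TP + TN) / (TP + FP + TN + FN)
    return precision, recall, accuracy

# Hypothetical counts for 100 ham emails and 100 spam emails.
precision, recall, accuracy = metrics(TP=90, FP=30, TN=70, FN=10)
print(precision, recall, accuracy)   # 0.75 0.9 0.8
```

In this made-up example, 10 of the 100 ham emails are lost (recall $90\%$), while 30 of the 120 emails classified as ham are in fact spam (precision $75\%$).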

The function `precission_recall` takes two directories as arguments: `spam_dir` is supposed to contain spam emails, while `ham_dir` contains ham emails.  It computes the **precision**, the **recall**, and the **accuracy** of our spam classifier with respect to these data, where the accuracy is the fraction of all emails that are classified correctly.

In [21]:
def precission_recall(spam_dir, ham_dir):
    TN = 0 # true negatives
    FP = 0 # false positives
    for email in os.listdir(spam_dir):
        if spam_probability(spam_dir + email) > 0.5:
            TN += 1
        else:
            FP += 1
    FN = 0 # false negatives
    TP = 0 # true positives
    for email in os.listdir(ham_dir):
        if spam_probability(ham_dir + email) > 0.5:
            FN += 1
        else:
            TP += 1
    precision = TP / (TP + FP)
    recall    = TP / (TP + FN)
    accuracy  = (TN + TP) / (TN + TP + FN + FP)
    return precision, recall, accuracy

In [22]:
precission_recall(spam_dir_train, ham__dir_train)

(0.8495145631067961, 1.0, 0.9114285714285715)

In [23]:
precission_recall(spam_dir_test, ham__dir_test)

(0.7791411042944786, 0.9769230769230769, 0.85)
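Since the test directories contain 130 spam and 130 ham emails, the three numbers above determine how many emails fall into each of the four categories.  The counts in the following sketch are not printed by the notebook; they are inferred from the reported values and the directory sizes, and the sketch merely checks that they are consistent.

```python
no_ham_test, no_spam_test = 130, 130   # sizes of the test directories

TP, FN = 127, 3    # ham emails classified as ham / as spam (inferred)
FP, TN = 36, 94    # spam emails classified as ham / as spam (inferred)

# The inferred counts have to add up to the directory sizes.
assert TP + FN == no_ham_test and FP + TN == no_spam_test

print(TP / (TP + FP))                    # precision, about 0.779
print(TP / (TP + FN))                    # recall, about 0.977
print((TP + TN) / (TP + FP + TN + FN))   # accuracy, 0.85
```

Under this reading, only 3 of the 130 ham test emails are lost, but 36 of the 130 spam test emails are classified as ham.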