In [None]:
from IPython.display import HTML
HTML(open("../style.css", "r").read())

If `nb-mypy` isn't installed, run:
```
!pip install nb-mypy 
```

In [None]:
%load_ext nb_mypy
%nb_mypy On

# Spam Detection Using the Naive Bayes Algorithm

The process of creating a spam detector using the naive Bayes algorithm is split up into four steps.

  - Create a set of the most common words occurring in spam and ham (i.e. non-spam) emails.
  - For every word occurring in this set, compute the 
  probability that this word occurs in a spam or ham email.
  - Create a function that takes an email and the conditional probabilities computed before and that then computes the probability
    that the given email is spam.
  - Evaluate the <em style='color:blue;'>precision</em> and the <em style='color:blue;'>recall</em> of the spam classifier.

## Step 1: Create Word Dictionary

We need the module `os` for reading directories, the module `re` for 
<em style='color:blue;'>regular expressions</em>, and the module `math` for computing logarithms.

In [None]:
import os
import re
import math

An object of class <a href='https://docs.python.org/2/library/collections.html#counter-objects'>`Counter`</a> is a special form of a `dictionary` that is used for counting.  We need a counter to figure out what the most common words are.

In [None]:
from collections import Counter

The constructor `Counter` can be called with an iterable (e.g. a list, a set, or a string) as its argument.
It returns a dictionary where the values are the number of occurrences.  The example below counts characters in a string.

In [None]:
Cntr = Counter("abracadabra")
Cntr

An important method of the class `Counter` is `update`.  If `Cntr` is a counter and `S` is an iterable, then `Cntr.update(S)` adds the elements of `S` to `Cntr`. If an element `e` occurs `k` times in `S`, then 
the count of `e` in `Cntr` is incremented by `k`.

In [None]:
Cntr.update("abba")
Cntr
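
Since the counter is needed to find the most common words, the following small sketch shows the method `most_common` of the class `Counter`.  The variable names `letters` and `top_two` are only used for illustration and do not occur elsewhere in this notebook.

```python
from collections import Counter

# Count the characters of a string, then add more characters via update.
letters = Counter("abracadabra")
letters.update("abba")

# most_common(n) returns the n entries with the highest counts,
# ordered from the most frequent to the least frequent.
top_two = letters.most_common(2)
print(top_two)
```

The result is a list of pairs of the form `(element, count)`.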

### The Email Data

The directory 
https://github.com/karlstroetmann/Artificial-Intelligence/tree/master/Python/6%20Classification/EmailData
contains 960 emails that are divided into four subdirectories:

  - `spam-train` contains 350 spam emails for training,
  - `ham-train`  contains 350 ham (i.e. non-spam) emails for training,
  - `spam-test`  contains 130 spam emails for testing,
  - `ham-test`   contains 130 ham  emails for testing.

Originally, this data was collected by **Ion Androutsopoulos**.  I found this data on a now defunct 
*open classroom* page on https://online.stanford.edu/free-courses provided by Andrew Ng.

We declare some variables so this notebook can be adapted to other data sets.

In [None]:
spam_dir_train: str = 'EmailData/spam-train/'
ham__dir_train: str = 'EmailData/ham-train/'
spam_dir_test:  str = 'EmailData/spam-test/'
ham__dir_test:  str = 'EmailData/ham-test/'
Directories: list[str] = [spam_dir_train, ham__dir_train, spam_dir_test, ham__dir_test]

In order to compute the <em style='color:blue;'>prior probability</em> that an email is ham or spam we need to count the number of spam and ham emails.

In [None]:
no_spam:    int   = len(os.listdir(spam_dir_train)) # number of spam mails
no_ham:     int   = len(os.listdir(ham__dir_train)) # number of ham  mails
spam_prior: float = no_spam / (no_spam + no_ham)   # probability of a spam mail
ham__prior: float = no_ham  / (no_spam + no_ham)   # probability of a ham mail
spam_prior, ham__prior

I have checked that the proportion of spam and ham emails in the test directory is also $1:1$.  If the proportion of spam and ham emails in real life is different from $1:1$, then we would have to use this proportion in the spam filter to be developed.

The function $\texttt{get_words}(\texttt{fn})$ takes a filename $\texttt{fn}$ as its argument.  It reads the file, converts the text to lower case, and returns the set of all words that are found in this file.  
Since we use a set, a word occurring multiple times in a file is only counted once.

In [None]:
def get_words(fn: str) -> set[str]:
    with open(fn, 'r') as file:
        text: str = file.read().lower()
        return set(re.findall(r"[\w']+", text))
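
To see what the regular expression does without touching the file system, here is a minimal sketch that applies the same pattern to a string.  The function `words_in_text` and the sample sentence are made up for this illustration; they are not part of the notebook.

```python
import re

def words_in_text(text: str) -> set[str]:
    """Hypothetical in-memory variant of get_words: same regex, no file I/O."""
    return set(re.findall(r"[\w']+", text.lower()))

sample = "Don't miss out: EARN money, earn it NOW!"
tokens = words_in_text(sample)
print(sorted(tokens))
```

The pattern `[\w']+` keeps the apostrophe inside `don't`, drops punctuation, and, because of the set, the word `earn` is contained only once.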

Let us test this function with a small example mail.

In [None]:
!cat EmailData/ham-train/3-380msg4.txt || type EmailData/ham-train/3-380msg4.txt

In [None]:
get_words('EmailData/ham-train/3-380msg4.txt')

The function `read_all_files` reads all files contained in those directories that are stored in the list `Directories`. 
It returns a `Counter`.  For every word $w$ this counter contains the number of files that contain the word $w$. 

In [None]:
def read_all_files(Directories: list[str]) -> Counter:
    Words: Counter = Counter()
    for directory in Directories:
        for file_name in os.listdir(directory):
            Words.update(get_words(directory + file_name))
    return Words
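
Because `get_words` returns a set, the counter counts emails rather than single occurrences.  The following sketch contrasts both ways of counting on three made-up emails that are given as lists of words; the names `emails`, `term_freq`, and `doc_freq` are assumptions of this example.

```python
from collections import Counter

# Three toy "emails", each already reduced to a list of words (made-up data).
emails = [
    ["free", "money", "free", "free"],
    ["meeting", "money"],
    ["free", "meeting"],
]

term_freq: Counter = Counter()  # counts every single occurrence of a word
doc_freq:  Counter = Counter()  # counts every word at most once per email
for words in emails:
    term_freq.update(words)
    doc_freq.update(set(words))

print(term_freq["free"], doc_freq["free"])
```

The word `free` occurs four times in total, but only two emails contain it.  The function `read_all_files` computes the second kind of count.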

`Word_Counter` is a `Counter` that maps every word to the number of emails containing this word.

In [None]:
Word_Counter: Counter = read_all_files(Directories)
Word_Counter

The emails contain 22770 different words. 

In [None]:
len(Word_Counter)

It wouldn't make sense to use a word as a feature if it only occurs in a single email.
Let's check how many such words exist.

In [None]:
SingleWords = { w for w in Word_Counter if Word_Counter[w] <= 1 }
SingleWords

In [None]:
len(SingleWords)

`Common_Words` is the set of those words that occur in at least 10 of our emails.
The number `min_occurrences` is a *hyperparameter* that should be validated via a validation set. 

In [None]:
min_occurrences = 10
Common_Words: set[str] = { w for w in Word_Counter if Word_Counter[w] >= min_occurrences }
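
To get a feeling for the effect of the hyperparameter, the following sketch shows how the vocabulary shrinks when the threshold grows.  The counter `toy_counter` with its made-up counts and the helper function `vocabulary` are assumptions of this example; a real validation would additionally measure the quality of the classifier on a separate validation set for every threshold.

```python
from collections import Counter

# Made-up email counts standing in for Word_Counter.
toy_counter = Counter({"the": 40, "free": 12, "meeting": 10, "offer": 9, "zyzzyva": 1})

def vocabulary(counter: Counter, min_occurrences: int) -> set[str]:
    """Return the words that occur in at least min_occurrences emails."""
    return {w for w, n in counter.items() if n >= min_occurrences}

for threshold in (1, 10, 20):
    print(threshold, len(vocabulary(toy_counter, threshold)))
```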

There are 2653 words that occur in at least `min_occurrences` emails.

In [None]:
len(Common_Words)

## Computing the Conditional Probabilities

Having computed the most common words, we are now ready to compute the conditional probability that a given word occurs in a spam email.

The function $\texttt{get_common_words}(\texttt{fn})$ takes a filename $\texttt{fn}$ 
as its argument.  It reads the file and returns the set of all words in `Common_Words` that are found in the given file.  

In [None]:
def get_common_words(fn: str) -> set[str]:
    return get_words(fn) & Common_Words

We test this function for a small email.

In [None]:
get_common_words('EmailData/ham-train/3-380msg4.txt')

The function `count_common_words` takes a string specifying a `directory`.  It returns a 
`Counter` that stores, for every word in `Common_Words`, the number of files in `directory` that contain this word.

In [None]:
def count_common_words(directory: str) -> Counter:
    Words: Counter = Counter()
    for file_name in os.listdir(directory):
        Words.update(get_common_words(directory + file_name))
    return Words

Next, we compute `Counter`s that store, for every common word, the number of spam emails and the number of ham emails containing it.

In [None]:
Spam_Counter: Counter = count_common_words(spam_dir_train)
Spam_Counter

In [None]:
Ham__Counter: Counter = count_common_words(ham__dir_train)
Ham__Counter


For every common word $w$  we compute the probability that $w$ occurs in a spam or ham email.  The formula for spam is:
 
 $$ P(w \in\texttt{Spam}) = \frac{\mbox{number of spam emails containing $w$}}{\mbox{number of all spam emails}} $$
 
 The formula for ham is similar:
 
 $$ P(w \in\texttt{Ham}) = \frac{\mbox{number of ham emails containing $w$}}{\mbox{number of all ham emails}} $$
 
 However, if we used this formula, then a common word $w$ that, for some reason, hasn't yet occurred in any spam email would have a probability of $0$ of occurring in a spam email.  Hence, our classifier would never classify an email containing the word $w$ as spam.  Since this cannot be right, we assume that there is one additional spam email that contains every common word.  
This approach is called
*Laplace smoothing* and it changes the formula for $P(w \in\texttt{Spam})$ as follows:
 
 $$ P(w \in\texttt{Spam}) = \frac{\mbox{number of spam emails containing $w$ + 1}}{\mbox{number of all spam emails + 1}} $$
 
Of course, the formula for $P(w \in\texttt{Ham})$ is changed in a similar way:

$$ P(w \in\texttt{Ham}) = \frac{\mbox{number of ham emails containing $w$ + 1}}{\mbox{number of all ham emails + 1}} $$

In [None]:
Spam_Probability: dict[str, float] = {}
Ham__Probability: dict[str, float] = {}
for w in Common_Words:
    Spam_Probability[w] = (Spam_Counter[w] + 1) / (no_spam + 1) 
    Ham__Probability[w] = (Ham__Counter[w] + 1) / (no_ham  + 1) 
Spam_Probability
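
The effect of the smoothing can be checked with a few numbers.  The helper `smoothed` and the count of 52 emails are made up for this illustration; the number of 350 spam training emails is the one given above.

```python
def smoothed(count: int, total: int) -> float:
    """Compute (count + 1) / (total + 1), the smoothing used in this notebook."""
    return (count + 1) / (total + 1)

n_spam = 350                      # number of spam training emails

p_raw    = 0 / n_spam             # unsmoothed estimate for an unseen word
p_unseen = smoothed(0, n_spam)    # smoothed estimate for the same word
p_seen   = smoothed(52, n_spam)   # hypothetical word found in 52 spam emails

print(p_raw, p_unseen, p_seen)
```

The unseen word now gets the small but positive probability $1/351$, so its logarithm is defined, while the estimate for a frequent word barely changes.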

Let us list those common words whose probability of occurring in a spam mail is more than $10\%$, while their probability of occurring in a ham mail is less than $1\%$.

In [None]:
{ w for w in Common_Words if Spam_Probability[w] > 0.1 and Ham__Probability[w] < 0.01 }

For example, the probability that a spam email contains the word `'earn'` is about $15\%$, while the probability that it occurs in a ham email is $0.55\%$.

In [None]:
Spam_Probability['earn'], Ham__Probability['earn']

For the word `'dollar'` the probability that a spam email contains this word is about $21\%$, while the probability that this word occurs in a ham email is less than $2\%$.

In [None]:
Spam_Probability['dollar'], Ham__Probability['dollar']

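Given a file name `fn`, the function `log_probabilities` defined below returns a pair of numbers $(p_1, p_2)$ such that $p_1$ measures how likely it is that the email stored in `fn` is ham, while $p_2$ measures how likely it is that this email is spam.  

In order to compute whether a mail is spam or ham we have to compute

$$\arg\max\limits_{C \in \{\text{Spam},\, \text{Ham}\}}  \left(\prod\limits_{i=1}^m P(f_i \;|\; C)\right) \cdot P(C), $$

where $f_1, \cdots, f_m$ are the common words occurring in the mail.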

Therefore, we have to check, whether

$$ \left(\prod\limits_{i=1}^m P(f_i \;|\; \texttt{Spam})\right) \cdot P(\texttt{Spam}) > \left(\prod\limits_{i=1}^m P(f_i \;|\; \texttt{Ham})\right) \cdot P(\texttt{Ham}) $$

holds. When implementing the formula 
$$\arg\max\limits_{C \in \mathcal{C}}  \left(\prod\limits_{i=1}^m P(f_i \;|\; C)\right) \cdot P(C) $$
we have to be careful, because a naive implementation will evaluate the product
$$\prod\limits_{i=1}^m P(f_i \;|\; C)$$
as the number $0$ due to numerical underflow.  The trick to compute this product is to remember that
$$ \ln(a \cdot b) = \ln(a) + \ln(b) $$
and to transform the product into a sum of logarithms.  As the logarithm is a strictly increasing function, we have 

$$ \begin{array}{llcl}
  & \left(\prod\limits_{i=1}^m P(f_i \;|\; \texttt{Spam})\right) \cdot P(\texttt{Spam}) & > & \left(\prod\limits_{i=1}^m P(f_i \;|\; \texttt{Ham})\right) \cdot P(\texttt{Ham}) \\
  \Leftrightarrow \qquad &
  \sum\limits_{i=1}^m \ln\bigl(P(f_i \;|\; \texttt{Spam})\bigr) + \ln\bigl(P(\texttt{Spam})\bigr) & > & \sum\limits_{i=1}^m \ln\bigl(P(f_i \;|\; \texttt{Ham})\bigr) + \ln\bigl(P(\texttt{Ham}) \bigr)
  \end{array}
$$
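
The numerical underflow mentioned above is easy to reproduce.  In the following sketch the probability `p` and the number of features `m` are made up; in the real classifier every feature has its own conditional probability.

```python
import math

p = 1e-3   # made-up conditional probability, the same for every feature
m = 200    # made-up number of features

# Naive product: the exact value 1e-600 is far below the smallest positive float.
product = 1.0
for _ in range(m):
    product *= p

# The same quantity in logarithmic space is an ordinary negative number.
log_sum = sum(math.log(p) for _ in range(m))

print(product)
print(log_sum)
```

The product is reported as `0.0` for both classes, so a comparison would be meaningless, whereas the sums of logarithms can still be compared.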

The function `log_probabilities(fn)` takes a filename `fn` as its argument and returns the pair 
$$ \left(\sum\limits_{i=1}^m \ln\bigl(P(f_i \;|\; \texttt{Ham})\bigr) + \ln\bigl(P(\texttt{Ham})\bigr),\quad
         \sum\limits_{i=1}^m \ln\bigl(P(f_i \;|\; \texttt{Spam})\bigr) + \ln\bigl(P(\texttt{Spam}) \bigr) \right)$$
as its result, i.e. the value for ham comes first.  It should be noted that these numbers are not really the logarithms of probabilities.  The reason is that
the formula for the probability of a class $C$ is
$$
\frac{\prod\limits_{i=1}^m P(f_i \;|\; C)}{P(f_1 \wedge \cdots \wedge f_m)} \cdot P(C)
$$
and we are not computing the denominator $P(f_1 \wedge \cdots \wedge f_m)$.  However, this denominator is the same for spam and ham and hence we don't need it when comparing the respective probabilities.  Therefore, what the function `log_probabilities` is really computing are pairs of *relative logarithmic probabilities*.
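
If an actual probability is wanted, the missing denominator can be recovered afterwards, because the probabilities of spam and ham have to add up to $1$.  The following sketch is not used in the rest of this notebook; the function `posterior_spam` is a hypothetical helper that normalizes a pair of relative logarithmic probabilities.

```python
import math

def posterior_spam(log_p_ham: float, log_p_spam: float) -> float:
    """Turn two relative logarithmic probabilities into P(Spam | features).

    Subtracting the larger value before exponentiating avoids underflow.
    """
    top = max(log_p_ham, log_p_spam)
    e_ham  = math.exp(log_p_ham  - top)
    e_spam = math.exp(log_p_spam - top)
    return e_spam / (e_ham + e_spam)

print(posterior_spam(-250.0, -200.0))
print(posterior_spam(-10.0, -10.0))
```

Equal values yield a probability of one half, while a difference of $50$ in the logarithms already yields a probability that is practically $1$.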

In [None]:
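def log_probabilities(fn: str) -> tuple[float, float]:
    # start with the logarithms of the prior probabilities
    log_p_spam: float = math.log(spam_prior)
    log_p_ham:  float = math.log(ham__prior)
    words: set[str] = get_common_words(fn)
    # add the logarithm of the conditional probability of every common word in the mail
    for w in Common_Words:
        if w in words:
            log_p_spam += math.log(Spam_Probability[w])
            log_p_ham  += math.log(Ham__Probability[w])
    # note the order: the value for ham comes first
    return (log_p_ham, log_p_spam)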

Let us test this function with a ham email.  Due to privacy concerns, the emails have been reduced to lists of words, which have then been permuted.

In [None]:
!cat EmailData/ham-train/3-430msg1.txt || type EmailData/ham-train/3-430msg1.txt 

In [None]:
log_probabilities('EmailData/ham-train/3-430msg1.txt')

Clearly, the ham probability is bigger than the spam probability.  Next, we check the general performance of our approach.

## Evaluate Precision and Recall

In order to evaluate the performance of this algorithm, we need to define two new concepts: *precision* and *recall*.  Let us call the ham emails the *positives*, while the spam emails are called the *negatives*.  Then we define

- *true positives*: ham emails that are correctly classified as ham,
- *false positives*: spam emails that are falsely classified as ham,
- *true negatives*: spam emails that are correctly classified as spam,
- *false negatives*: ham emails that are falsely classified as spam.

The *precision* of the spam classifier is then defined as
$$ \texttt{precision} = \frac{\mbox{number of true positives}}{\mbox{number of true positives} + \mbox{number of false positives}} $$
Therefore, the *precision* measures the percentage of the ham emails in the set of all emails that are classified as ham. The *recall* of the spam classifier is defined as
$$ \texttt{recall} = \frac{\mbox{number of true positives}}{\mbox{number of true positives} + \mbox{number of false negatives}} $$
Therefore, the *recall* measures the percentage of those ham emails that are indeed classified as ham.

Usually, it is very important that the recall is high as we don't want to miss a ham email because our classifier has incorrectly classified it as a spam email.
On the other hand, having a high precision is not that important. After all, if $10\%$ of the emails offered to us as ham are, in fact, spam, we might tolerate this.  However, we would certainly not tolerate missing $10\%$ of our ham emails because they are incorrectly classified as spam.
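
The two definitions can be checked with a small calculation.  The function `precision_and_recall` and the counts are made up for this illustration.

```python
def precision_and_recall(true_pos: int, false_pos: int, false_neg: int) -> tuple[float, float]:
    """Compute precision and recall from the counts defined above."""
    precision = true_pos / (true_pos + false_pos)
    recall    = true_pos / (true_pos + false_neg)
    return precision, recall

# Made-up counts: 90 ham emails are kept, 10 spam emails slip through as ham,
# and 30 ham emails are lost because they are classified as spam.
precision, recall = precision_and_recall(true_pos=90, false_pos=10, false_neg=30)
print(precision, recall)
```

In this made-up scenario $90\%$ of the emails shown to us are ham, but we only get to see $75\%$ of our ham emails, which is the kind of result we want to avoid.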

The function `precission_recall` takes two directories as arguments: `spam_dir` is supposed to contain spam emails, while `ham_dir` contains ham emails.  It computes the **precision**, the **recall**, and the **accuracy** of our spam classifier with respect to these test data.  

Since misclassifying a valid email as spam is quite bad, while we can tolerate an occasional spam email that gets through our filter, we add a third parameter $\vartheta$.  We will only classify an email as spam if
$$
                \sum\limits_{i=1}^m \ln\bigl(P(f_i \;|\; \texttt{Spam})\bigr) + \ln\bigl(P(\texttt{Spam})\bigr) \; > \; 
\vartheta \cdot \left(\sum\limits_{i=1}^m \ln\bigl(P(f_i \;|\; \texttt{Ham})\bigr) + \ln\bigl(P(\texttt{Ham}) \bigr)\right)
$$
The third parameter $\vartheta$ is the logarithmic threshold.  Since both sides are sums of logarithms of probabilities and hence negative, a threshold $\vartheta < 1$ moves the right-hand side closer to $0$ and thereby makes it harder to classify an email as spam.  Later, we will use a logarithmic threshold of $0.8$.
To make things concrete, assume that

* the relative logarithmic probability $p_1$ of a mail being ham is $p_1 = -250$, while
* the relative logarithmic probability $p_2$ of this mail being spam is $p_2 = -200$.

As we have 
$$p_1 = -250 < -200 = p_2$$
we would classify this mail as spam if we just compared the relative logarithmic probabilities.
However, if we use the logarithmic threshold of $0.8$ we have 
$$\vartheta \cdot p_1 = 0.8 \cdot (-250) = -200 = p_2 $$ 
and hence the strict inequality
$$\vartheta \cdot p_1 < p_2$$
would not be valid.  Therefore, we would conservatively classify the email as ham. 

In [None]:
ϑ = 0.8
log_p_ham  = -250
log_p_spam = -200
log_p_spam  > log_p_ham

In [None]:
log_p_ham * ϑ 

In [None]:
log_p_spam  > ϑ * log_p_ham

In [None]:
def precission_recall(spam_dir: str, ham_dir: str, ϑ: float) -> tuple[float, float, float]:
    TN: int = 0 # true negatives
    FP: int = 0 # false positives
    for email in os.listdir(spam_dir):
        log_p_ham, log_p_spam = log_probabilities(spam_dir + email)
        if log_p_spam > ϑ * log_p_ham:
            TN += 1
        else:
            print(email, log_p_ham, log_p_spam)
            FP += 1
    FN: int = 0 # false negatives
    TP: int = 0 # true positives
    for email in os.listdir(ham_dir):
        log_p_ham, log_p_spam = log_probabilities(ham_dir + email)
        if log_p_spam > ϑ * log_p_ham:
            FN += 1
            print(email, log_p_ham, log_p_spam)
        else:
            TP += 1
    precision: float = TP / (TP + FP)
    recall:    float = TP / (TP + FN)
    accuracy:  float = (TN + TP) / (TN + TP + FN + FP)
    return precision, recall, accuracy

If we use $\vartheta = 0.8$, then we have a *precision* of $95\%$, and a *total recall* of $100\%$ on the training set.

In [None]:
precission_recall(spam_dir_train, ham__dir_train, 0.80)

For the test set, we still have a *recall* of $100\%$, but the *precision* drops to $92\%$, which is still acceptable.

In [None]:
precission_recall(spam_dir_test, ham__dir_test, 0.80)