In this notebook, a Naïve Bayes classifier is applied to a dataset of emails labeled as spam or non-spam (`ham`).
Naïve Bayes is a classification technique based on Bayes' theorem with an assumption of independence among predictors.

For clarity, Bayes' theorem is stated mathematically as the following equation:

$$P(A | B) = \frac{P(B|A)P(A)}{P(B)},$$

where

 - $P(A| B)$ is a conditional probability: the likelihood of event $A$ occurring given that $B$ is true.
 - $P(B|A)$ is also a conditional probability: the likelihood of event $B$ occurring given that $A$ is true.
 - $P(A)$ and $P(B)$ are the probabilities of observing events $A$ and $B$.
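
As a quick numerical illustration of the formula, the sketch below computes $P(A|B)$ where $A$ stands for "the email is spam" and $B$ for "the email contains a given word". The function name `bayes_posterior` and all the numbers are assumptions chosen for illustration; they are not taken from the dataset.

```python
def bayes_posterior(p_b_given_a, p_a, p_b):
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b


# Hypothetical probabilities (not estimated from the dataset)
p_spam = 0.2              # P(A): an email is spam
p_word_given_spam = 0.5   # P(B|A): the word appears in a spam email
p_word_given_ham = 0.05   # the word appears in a non-spam email

# P(B) via the law of total probability over the two classes
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

posterior = bayes_posterior(p_word_given_spam, p_spam, p_word)
print(round(posterior, 3))
```

With these made-up numbers, seeing the word raises the probability of spam from the prior of 0.2 to roughly 0.71.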

In simple terms, a Naïve Bayes classifier assumes that the presence of a particular feature in a *class* is unrelated to the presence of any other feature.
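
Written out for a class $C$ and features $x_1, \dots, x_n$ (symbols introduced here only for illustration), this independence assumption lets the likelihood factor into a product of per-feature terms, so Bayes' theorem takes the form:

$$P(C | x_1, \dots, x_n) = \frac{P(C)\prod_{i=1}^{n} P(x_i | C)}{P(x_1, \dots, x_n)} \propto P(C)\prod_{i=1}^{n} P(x_i | C).$$

Because the denominator is the same for every class, it is enough to compare the numerators across classes.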

## Index:

- [Part 1 - Importing the Dataset and Exploratory Data Analysis (EDA)](#part1)
- [Part 2 - Shuffling and Splitting the Messages](#part2)
- [Part 3 - Building a Simple Naïve Bayes Classifier From Scratch](#part3)
- [Part 4 - Explaining the Code Provided in Part 3](#part4)
- [Part 5 - Training the `train` Classifier](#part5)
- [Part 6 - Exploring the Performance of the `train` Classifier](#part6)
- [Part 7 - Training the `train2` Classifier](#part7)

## The Naïve Bayes Algorithm

You have now learned about the **Naïve Bayes algorithm** for classification. 

The pseudo-algorithm for Naïve Bayes can be summarized as follows: 
1. Load the training and test data.
2. Shuffle the messages and split them.
3. Build a simple Naïve Bayes classifier from scratch.
4. Train the classifier and explore the performance.

[Back to top](#Index:) 

<a id='part1'></a>

### Part 1: Importing the Dataset and Exploratory Data Analysis (EDA)

The dataset is imported with `pandas`: the *library* is imported first, and the file is then read with the `read_csv()` *function*, passing the name of the dataset as a *string*.

Because the fields in each row are separated by a tab (`\t`), the delimiter is specified through the `sep` *argument* of `read_csv()`; the default value is `,`. Additionally, the list of column names to use is specified (`"label"` and `"email"`).

In [1]:
import pandas as pd

emails = pd.read_csv('EmailSpamCollection', sep = '\t', names = ["label", "email"])

Before performing any algorithm on the *dataframe*, it is always good practice to perform exploratory data analysis.

Begin by displaying the first five rows of the `emails` *dataframe* using the `head()` *method*. By default, `head()` displays the first five rows of a *dataframe*, so `head(5)` and `head()` are equivalent.

In [2]:
emails.head(5)

|   | label | email |
|---|-------|-------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


Next, more information about the *dataframe* is retrieved using the `shape` and `columns` *attributes* and the `describe()` *method*.

Here's a brief description of what each of them does:
- `shape`: returns a *tuple* representing the dimensionality of the *dataframe*.
- `columns`: returns the column labels of the *dataframe*.
- `describe()`: returns summary statistics of the columns in the *dataframe*. For numeric columns these include the count, mean, and standard deviation; for text columns such as the ones here, they are the count, the number of unique values, the most frequent value (`top`), and its frequency (`freq`).
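
To see what these three tools return on text columns, the sketch below applies them to a tiny hand-made *dataframe*. The variable `toy` and its contents are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    "label": ["ham", "spam", "ham", "ham"],
    "email": ["see you soon", "win a prize now", "see you soon", "call me later"],
})

print(toy.shape)           # (number of rows, number of columns)
print(list(toy.columns))   # column labels

# For text columns, describe() reports count, unique, top, and freq
summary = toy.describe()
print(summary)
```

In this toy example, `top` for the `label` column is `ham` with a `freq` of 3, which mirrors how the real output is read.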

In [3]:
emails.shape

(5572, 2)

In [4]:
emails.columns

Index(['label', 'email'], dtype='object')

In [5]:
emails.describe()

|   | label | email |
|---|-------|-------|
| count | 5572 | 5572 |
| unique | 2 | 5169 |
| top | ham | Sorry, I'll call later |
| freq | 4825 | 30 |


[Back to top](#Index:) 

<a id='part2'></a>

### Part 2: Shuffling and Splitting the Messages

In the second part, the emails are shuffled and then split into a training set (2,500 messages), a validation set (1,000 messages), and a test set (all remaining messages).

Shuffling is done using the `sample()` *method* of the `pandas` *dataframe*.

`frac` denotes the proportion of the *dataframe* to sample (`frac = 1` returns every row, in random order), and `random_state` is a random seed that ensures reproducibility of your results.

After shuffling, each row keeps its original index label. The *index* of `emails` is therefore reset with the `reset_index()` *method*; passing `drop = True` discards the old index instead of adding it as a new column.

The documentation for [`reset_index()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html) describes the available *arguments*.

In [6]:
emails = emails.sample(frac = 1, random_state = 0).reset_index(drop = True)
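
The effect of shuffling and then resetting the index can be checked on a small example. The sketch below uses a made-up five-row *dataframe* named `toy`; only `sample()` and `reset_index()` come from the cell above.

```python
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({"label": ["ham", "spam", "ham", "ham", "spam"]})

# Shuffle all rows; the original index labels travel with the rows
shuffled = toy.sample(frac=1, random_state=0)
print(list(shuffled.index))

# drop=True discards the old index instead of keeping it as a column
reset = shuffled.reset_index(drop=True)
print(list(reset.index))

# The same seed reproduces the same order
same_order = toy.sample(frac=1, random_state=0).index.equals(shuffled.index)
print(same_order)
```

Without `drop=True`, `reset_index()` would add the old index as an extra column named `index`.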

Run the code cell below to visualize the updated *dataframe*.

In [7]:
emails.head()

|   | label | email |
|---|-------|-------|
| 0 | ham | Storming msg: Wen u lift d phne, u say "HELLO"... |
| 1 | spam | \<Forwarded from 448712404000>Please CALL 08712... |
| 2 | ham | And also I've sorta blown him off a couple tim... |
| 3 | ham | Sir Goodmorning, Once free call me. |
| 4 | ham | All will come alive.better correct any good lo... |


Next, emails are split into a training set (the first 2,500 emails in the *dataframe*), a validation set (the next 1,000 emails), and a testing set (the remaining emails). These three sets are defined as `trainingEmail`, `valEmail`, and `testingEmail`.

In the code cell below, the emails and their corresponding labels are extracted as *lists*. The emails are then split into the three sets.

In [8]:
email = list(emails.email) 
lbls =list(emails.label) 
trainingEmail = email[:2500] 
valEmail = email[2500:3500] 
testingEmail = email[3500:]

The cell below splits the labels into a training set, a validation set, and a test set.

In [9]:
traininglbls = lbls[:2500] 
vallbls = lbls[2500:3500] 
testinglbls = lbls[3500:]
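
The slicing used in the two cells above can be wrapped in a small helper that makes the set sizes easy to check. This is only a sketch: `split_three` is a hypothetical helper, and the list of integers stands in for the 5,572 emails.

```python
def split_three(items, n_train, n_val):
    """Split a list by position into training, validation, and test parts."""
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


# Stand-in data with the same length as the dataset (5,572 rows)
items = list(range(5572))
train, val, test = split_three(items, 2500, 1000)
print(len(train), len(val), len(test))
```

Applying the same helper to the emails and to the labels keeps the two lists aligned, because both are split at the same positions.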

[Back to top](#Index:) 

<a id='part3'></a>

### Part 3: Building a Simple Naïve Bayes Classifier From Scratch

Python's scikit-learn *library* provides several [Naïve Bayes classifiers](https://scikit-learn.org/stable/modules/naive_bayes.html), but they all expect numerical feature arrays (`GaussianNB`, for instance, models continuous features), so raw text would first have to be converted into numbers.

Here is a simple Naïve Bayes classifier built from scratch.

In [10]:
import numpy as np


class NaiveBayesForSpam:
    def train(self, nonSpamMessages, spamMessages):
        # Vocabulary: every whitespace-separated token seen in the training messages
        self.words = set(' '.join(nonSpamMessages + spamMessages).split())
        # priors[0] = P(ham), priors[1] = P(spam)
        self.priors = np.zeros(2)
        self.priors[0] = float(len(nonSpamMessages)) / (len(nonSpamMessages) + len(spamMessages))
        self.priors[1] = 1.0 - self.priors[0]
        self.likelihoods = []
        for i, w in enumerate(self.words):
            # Smoothed fraction of messages in each class that contain w (substring match)
            prob1 = (1.0 + len([m for m in nonSpamMessages if w in m])) / len(nonSpamMessages)
            prob2 = (1.0 + len([m for m in spamMessages if w in m])) / len(spamMessages)
            # Cap at 0.95 so that 1 - likelihood stays positive
            self.likelihoods.append([min(prob1, 0.95), min(prob2, 0.95)])
        # Shape (2, number of words): row 0 is ham, row 1 is spam
        self.likelihoods = np.array(self.likelihoods).T

    def predict(self, message):
        posteriors = np.copy(self.priors)
        for i, w in enumerate(self.words):
            if w in message.lower():  # convert to lower-case
                posteriors *= self.likelihoods[:, i]
            else:
                posteriors *= np.ones(2) - self.likelihoods[:, i]
            # Rescale by the Euclidean (L2) norm after every word
            posteriors = posteriors / np.linalg.norm(posteriors)
        if posteriors[0] > 0.5:
            return ['ham', posteriors[0]]
        return ['spam', posteriors[1]]

    def score(self, messages, labels):
        # confusion[i, j]: i = predicted class, j = true class (0 = ham, 1 = spam)
        confusion = np.zeros(4).reshape(2, 2)
        for m, l in zip(messages, labels):
            predicted = self.predict(m)[0]  # call predict once per message
            if predicted == 'ham' and l == 'ham':
                confusion[0, 0] += 1
            elif predicted == 'ham' and l == 'spam':
                confusion[0, 1] += 1
            elif predicted == 'spam' and l == 'ham':
                confusion[1, 0] += 1
            elif predicted == 'spam' and l == 'spam':
                confusion[1, 1] += 1
        return (confusion[0, 0] + confusion[1, 1]) / float(confusion.sum()), confusion

[Back to top](#Index:) 

<a id='part4'></a>

### Part 4: Explaining the Code Provided in Part 3

Before explaining the code that was provided in Part 3, it is important to have some level of intuition as to what a spam email message might look like. Usually they have words designed to catch your eye and, in some sense, to tempt you to open them. Also, spam emails tend to have words written in all capital letters and use a lot of exclamation marks.

The `train` *method* calculates and stores the prior probabilities and the per-word likelihoods from the training dataset. In Naïve Bayes, this is all the training phase does.
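
The two quantities stored during training can be reproduced by hand on a few messages. The sketch below follows the same formulas as the class in Part 3 (add-one count, cap at 0.95), but the messages and the helper name `likelihood` are made up for illustration.

```python
# Toy training messages, made up for illustration
ham = ["see you at lunch", "call me later", "lunch tomorrow"]
spam = ["win a free prize", "free entry now"]

# Priors: the share of each class in the training data
prior_ham = len(ham) / (len(ham) + len(spam))
prior_spam = 1.0 - prior_ham


def likelihood(word, messages, cap=0.95):
    """Smoothed fraction of messages that contain `word`, capped at `cap`."""
    count = len([m for m in messages if word in m])
    return min((1.0 + count) / len(messages), cap)


print(prior_ham, prior_spam)
print(likelihood("free", ham), likelihood("free", spam))
```

Here "free" appears in both spam messages, so the uncapped value would be (1 + 2) / 2 = 1.5; the cap brings it back to 0.95, which keeps the complement used for absent words positive.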

The `predict` *method* applies Bayes' theorem once for every word in the vocabulary built during training, updating the posterior according to whether the word appears in the email, and then classifies the email as `ham` (non-spam) or `spam` based on the final posterior. The `score` *method* calls `predict` on multiple emails and compares the outcomes with the supplied ground-truth labels, thus evaluating the classifier. It returns the accuracy together with a confusion matrix whose rows correspond to the predicted class and whose columns correspond to the true class.
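
The word-by-word update can be traced on a two-word vocabulary. This is a minimal sketch with made-up priors and likelihoods. One deliberate difference from the class in Part 3 is that it divides by the sum instead of the Euclidean norm, so the two entries can be read directly as probabilities.

```python
import numpy as np

# Made-up values for illustration: [P(ham), P(spam)]
priors = np.array([0.6, 0.4])
words = ["free", "lunch"]
# likelihoods[c, i] = P(word i appears | class c); row 0 = ham, row 1 = spam
likelihoods = np.array([[0.05, 0.40],
                        [0.60, 0.05]])


def posterior(message):
    """Update the class probabilities once per vocabulary word."""
    post = priors.copy()
    for i, w in enumerate(words):
        if w in message.lower():
            post *= likelihoods[:, i]          # word present
        else:
            post *= 1.0 - likelihoods[:, i]    # word absent
        post /= post.sum()  # assumption: normalize so the entries sum to 1
    return post


p_spammy = posterior("FREE entry today")
p_hammy = posterior("Lunch at noon?")
print(p_spammy, p_hammy)
```

The first message contains "free" but not "lunch", so the spam entry dominates; the second message shows the opposite pattern.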

[Back to top](#Index:) 

<a id='part5'></a>

### Part 5: Training the `train` Classifier

Looking at the definition of the `train` *method* in Part 3, you can see that it requires the `non-spam` and `spam` emails to be passed separately.

The `non-spam` emails can be passed using the code given below:

In [11]:
nonSpamEmails = [m for (m, l) in zip(trainingEmail, traininglbls) if 'ham' in l]

The cell below selects the spam emails.

In [12]:
spamEmails = [m for (m, l) in zip(trainingEmail, traininglbls) if 'spam' in l]

Run the cell below to see how many `non-spam` and `spam` emails there are.

In [13]:
print(len(nonSpamEmails))
print(len(spamEmails))

2170
330


Their sum equals 2,500 as expected. 

Next, the classifier is created by instantiating the `NaiveBayesForSpam` *class*.

After that, the classifier is trained on `nonSpamEmails` and `spamEmails` using the `train` *method*.

In [14]:
clf = NaiveBayesForSpam()
clf.train(nonSpamEmails, spamEmails)

[Back to top](#Index:) 

<a id='part6'></a>

### Part 6: Exploring the Performance of the `train` Classifier

Explore the performance of the classifier on the *validation set* by using the `score()` *method*.

**IMPORTANT NOTE: Results in the following sections depend on the random shuffling and will differ for each shuffle. To keep the results reproducible, keep `random_state` fixed in the `sample` *method* when shuffling the data in [Part 2: Shuffling and Splitting the Messages](#part2).**

In [15]:
score, confusion = clf.score (valEmail, vallbls)

The code below prints the score and the confusion matrix.

In [16]:
print("The overall performance is:", score)

The overall performance is: 0.977


In [17]:
print("The confusion matrix is:\n", confusion)

The confusion matrix is:
 [[864.  20.]
 [  3. 113.]]
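
Beyond the overall accuracy, the confusion matrix also gives per-class measures. The sketch below recomputes the accuracy and derives precision and recall for the spam class from the matrix printed above, assuming the layout used by `score` (rows are predicted classes, columns are true classes, index 0 is `ham` and index 1 is `spam`). The variable names are chosen for illustration.

```python
import numpy as np

# Validation confusion matrix from the output above
# rows = predicted class, columns = true class; 0 = ham, 1 = spam
confusion = np.array([[864.0, 20.0],
                      [3.0, 113.0]])

accuracy = np.trace(confusion) / confusion.sum()
# Of the emails predicted as spam, how many really are spam?
spam_precision = confusion[1, 1] / confusion[1, :].sum()
# Of the true spam emails, how many were caught?
spam_recall = confusion[1, 1] / confusion[:, 1].sum()

print(round(accuracy, 3), round(spam_precision, 3), round(spam_recall, 3))
```

The spam precision is about 0.974 (113 of 116), while the spam recall is about 0.850 (113 of 133): most errors are spam emails that slip through as `ham`.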


The data is not equally divided into the two *classes*. As a baseline, let’s see what the success rate would be if you always guessed non-spam.

Run the code cell below to *print* the baseline score (labeled `new_score` in the output).

In [18]:
print('new_score', len([1 for l in vallbls if 'ham' in l]) / float (len ( vallbls)))

new_score 0.867


Comparing the two, the classifier's validation accuracy (0.977) is clearly better than the baseline (0.867).

The in-sample error can also be estimated by calculating the score and the confusion matrix on the training set.

In [19]:
#Note: this cell may take a LONG time to run!
score, confusion = clf.score (trainingEmail, traininglbls)

In [20]:
print("The overall performance is:", score)

The overall performance is: 0.9796


In [21]:
print("The confusion matrix is:\n", confusion)

The confusion matrix is:
 [[2.169e+03 5.000e+01]
 [1.000e+00 2.800e+02]]


[Back to top](#Index:) 

<a id='part7'></a>

### Part 7: Training the `train2` Classifier

In this section, a second classifier, `train2`, is defined and its performance is compared to that of the `train` classifier defined above.

The `train2` classifier is defined in the code cell below.

In [22]:
class NaiveBayesForSpam:
    def train2(self, nonSpamMessages, spamMessages):
        self.words = set(' '.join(nonSpamMessages + spamMessages).split())
        self.priors = np.zeros(2)
        self.priors[0] = float(len(nonSpamMessages)) / (len(nonSpamMessages) + len(spamMessages))
        self.priors[1] = 1.0 - self.priors[0]
        self.likelihoods = []
        spamkeywords = []
        for i, w in enumerate(self.words):
            prob1 = (1.0 + len([m for m in nonSpamMessages if w in m])) / len(nonSpamMessages)
            prob2 = (1.0 + len([m for m in spamMessages if w in m])) / len(spamMessages)
            # Keep only words more than 20 times as likely in spam as in non-spam
            if prob1 * 20 < prob2:
                self.likelihoods.append([min(prob1, 0.95), min(prob2, 0.95)])
                spamkeywords.append(w)
        # Restrict the vocabulary to the selected spam keywords
        self.words = spamkeywords
        self.likelihoods = np.array(self.likelihoods).T
            
    def predict (self, message):
        posteriors = np.copy (self.priors)
        for i, w in enumerate (self.words):
            if w in message.lower():  # convert to lower-case
                posteriors *= self.likelihoods[:,i]
            else:                                   
                posteriors *= np.ones (2) - self.likelihoods[:,i]
            posteriors = posteriors / np.linalg.norm (posteriors)  # normalise
        if posteriors[0] > 0.5:
            return ['ham', posteriors[0]]
        return ['spam', posteriors[1]]    

    def score(self, messages, labels):
        # confusion[i, j]: i = predicted class, j = true class (0 = ham, 1 = spam)
        confusion = np.zeros(4).reshape(2, 2)
        for m, l in zip(messages, labels):
            predicted = self.predict(m)[0]  # call predict once per message
            if predicted == 'ham' and l == 'ham':
                confusion[0, 0] += 1
            elif predicted == 'ham' and l == 'spam':
                confusion[0, 1] += 1
            elif predicted == 'spam' and l == 'ham':
                confusion[1, 0] += 1
            elif predicted == 'spam' and l == 'spam':
                confusion[1, 1] += 1
        return (confusion[0, 0] + confusion[1, 1]) / float(confusion.sum()), confusion

Next, the classifier is updated by instantiating the redefined `NaiveBayesForSpam` *class*. The cell below creates the `clf` classifier.

It is then trained on `nonSpamEmails` and `spamEmails` using the `train2` *method*.

In [23]:
clf = NaiveBayesForSpam()
clf.train2(nonSpamEmails, spamEmails)

The score and the confusion matrix are then re-computed on the training set using the updated classifier.

In [24]:
#Again, this cell may take a long time to run!
score_2, confusion_2 = clf.score(trainingEmail, traininglbls)

Running the code cells below gets the updated values.

In [25]:
print("The overall performance is: ", score_2)

The overall performance is:  0.9804


In [26]:
print("The confusion matrix is:\n", confusion_2)

The confusion matrix is:
 [[2163.   42.]
 [   7.  288.]]


**Takeaways**

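The `train2` classifier is faster than the `train` classifier and slightly more accurate on the training set (0.9804 versus 0.9796).

Both `train` and `train2` determine whether an email is spam or normal. However, `train2` contains an `if` statement that keeps only the words whose likelihood in spam emails is more than 20 times their likelihood in non-spam emails. Prediction then loops over this shorter list of spam keywords, which reduces the running time and, here, also improves the accuracy.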
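
The effect of the `if` statement can be seen in isolation on a handful of words. In the sketch below, the dictionary of likelihoods is made up for illustration; only the condition `prob1 * 20 < prob2` is taken from `train2`.

```python
# Hypothetical smoothed likelihoods: word -> (P(word | ham), P(word | spam))
likelihoods = {
    "free": (0.010, 0.400),
    "call": (0.100, 0.300),
    "prize": (0.002, 0.150),
    "lunch": (0.050, 0.001),
}

# Same test as in train2: keep a word only if it is more than
# 20 times as likely in spam as in non-spam emails
spam_keywords = [
    w for w, (p_ham, p_spam) in likelihoods.items() if p_ham * 20 < p_spam
]
print(spam_keywords)
```

In this example, "call" is dropped even though it is more frequent in spam, because it is only three times as likely there; "free" and "prize" pass the threshold.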