# Introduction
Naive Bayes is a supervised learning classifier that can be trained to predict the label of future observations (e.g., health outcome, spam detection, credit risk). The Naive Bayes equation is based on conditional probability, whereby predictions are made from dependent events:

$$P(A \cap B)=P(A | B) P(B)=P(B | A)P(A)$$

Solving for $P(A|B)$, we get Bayes' theorem, the equation behind Naive Bayes:

$$P(A|B)=\dfrac{P(B | A) P(A)}{P(B)}$$

To avoid confusion, I will use the following variables:

- *X*: The x-variables in the dataset organized in a matrix, where every row is a unique observation and every column a unique characteristic/feature (synonyms: predictors, explanatory variable, features, input variables, conditions, independent variable)
- *y*: The y-variables in the dataset organized in a vector (synonyms: response variable, outcome/output variable, label, dependent variable)

## Assumptions
The following assumptions hold for Naive Bayes:

- *X* features/columns are independent of one another, given the label *y* (this is the 'naive' part of Naive Bayes)

- Similar x-values will be observed in future predictions

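- Smoothing (e.g., Laplace) is necessary when a feature value never occurs (or always occurs) with a label in the training data, because the estimated conditional probability would otherwise sit at the minimum (0) or maximum (1) and dominate the prediction.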

- Some programs ([IBM](https://www.ibm.com/support/knowledgecenter/en/SS3RA7_15.0.0/com.ibm.spss.modeler.help/dbmining_oracle_data.htm), [Microsoft](https://docs.microsoft.com/en-us/sql/analysis-services/data-mining/microsoft-naive-bayes-algorithm)) require binning of all continuous values prior to using Naive Bayes.
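
To illustrate the smoothing point above, here is a minimal sketch of additive (Laplace) smoothing. The counts are made up for illustration: a binary feature value that was seen 0 times among 7 training examples of a class. The function name `smoothed_probability` and its parameters are my own and not part of any library.

```python
from fractions import Fraction

def smoothed_probability(count, class_total, n_values, alpha=1):
    """Estimate P(feature value | class) with additive smoothing.

    count       -- times the feature value was seen with this class
    class_total -- number of training examples in this class
    n_values    -- number of distinct values the feature can take
    alpha       -- smoothing strength (alpha=1 is Laplace smoothing)
    """
    return Fraction(count + alpha, class_total + alpha * n_values)

# Without smoothing, an unseen value gets probability 0, which zeroes out
# the whole product of conditional probabilities for that class.
unsmoothed = smoothed_probability(0, 7, 2, alpha=0)
smoothed = smoothed_probability(0, 7, 2, alpha=1)

print(unsmoothed)  # 0
print(smoothed)    # 1/9
```

With smoothing, the unseen value keeps a small non-zero probability, so the other features can still influence the prediction.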

There are different Naive Bayes equations for different types of data:

1. **Multinomial:** Categorical or continuous data that can be described by counts

2. **Bernoulli:** Binary data

3. **Gaussian:** Normally distributed data

If the data does not fit any of the methods above (e.g., contains continuous and categorical data), then the continuous variables can be binned (e.g., by percentile) prior to splitting the data into training and testing. This effectively makes all the data categorical and, once each category is encoded as a binary indicator column, prepares it for Bernoulli Naive Bayes.
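
As a rough sketch of percentile binning, the snippet below splits a continuous feature into quartile bins using only the Python standard library. The values in `word_freq` and the choice of four bins are assumptions made for illustration; they are not taken from any dataset used here.

```python
import bisect
import statistics

# Hypothetical continuous feature (e.g., a word frequency per e-mail)
word_freq = [0.0, 0.1, 0.2, 0.3, 0.5, 0.8, 1.2, 2.0]

# Three cut points split the data into four quartile bins
cut_points = statistics.quantiles(word_freq, n=4)

# Map each value to the index (0-3) of the bin it falls into
binned = [bisect.bisect_right(cut_points, value) for value in word_freq]

print(binned)  # [0, 0, 1, 1, 2, 2, 3, 3]
```

Each bin index can then be treated as a category in place of the original continuous value.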


## Non-Code Example

15 e-mail subject lines are recorded and labeled as spam depending on the content of the e-mail. Given the frequency table below, what is the likelihood of an e-mail being spam if it contains special characters in the subject line? Use Naive Bayes to calculate your prediction.

| Condition | No (spam)   |  Yes (spam)   |
|:------|------|------|
|   > 10 words  | 2 |  1 |
|   Special character(s)  | 5 |  4 |
|   Capital letter(s)  | 0 |  3 |
|   All observations  | 7 |  8 |

$$P(Yes \space | \space Special \space characters)=\dfrac{P(Special \space characters \space |  \space Yes) \times P(Yes)}{P(Special \space characters)}$$

Plug in the values from the frequency table:

$$P(Yes| \space Special \space characters)=\dfrac{\dfrac{4}{8}\times\dfrac{8}{15}}{\dfrac{9}{15}}$$

Calculate the solution:

$$P(Yes| \space Special \space characters)\approx 0.44$$

The likelihood that an e-mail will be spam if it contains special characters in the subject line is approximately 0.44. It is therefore more likely that such an e-mail will **not** be spam (1-0.444..=0.555..).
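
The arithmetic above can be double-checked with exact fractions before turning to scikit-learn. This is only a sketch of the hand calculation; the variable names are my own, and the counts are read directly from the frequency table.

```python
from fractions import Fraction

# Counts taken from the frequency table above
spam_total = 8          # "Yes (spam)" e-mails
all_total = 15          # all e-mails
special_and_spam = 4    # spam e-mails with special characters
special_total = 5 + 4   # all e-mails with special characters

p_special_given_spam = Fraction(special_and_spam, spam_total)  # likelihood
p_spam = Fraction(spam_total, all_total)                       # prior
p_special = Fraction(special_total, all_total)                 # evidence

# Bayes' theorem: posterior = likelihood * prior / evidence
p_spam_given_special = p_special_given_spam * p_spam / p_special

print(p_spam_given_special)                       # 4/9
print(round(float(p_spam_given_special), 3))      # 0.444
print(round(float(1 - p_spam_given_special), 3))  # 0.556
```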

Let's see the code equivalent (0=no, 1=yes):

In [456]:
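from sklearn.naive_bayes import BernoulliNB

# One row per e-mail; columns: [> 10 words, special character(s), capital letter(s)]
X_example = [[1,0,0],[1,0,0],
             [1,0,0],
             [0,1,0],[0,1,0],[0,1,0],[0,1,0],[0,1,0],
             [0,1,0],[0,1,0],[0,1,0],[0,1,0],
             [0,0,1],[0,0,1],[0,0,1]]
# Labels: 0 = not spam, 1 = spam
y_example = [0,0,1,0,0,0,0,0,1,1,1,1,1,1,1]
model_example = BernoulliNB(fit_prior=True)
model_example.fit(X_example, y_example)
# score() returns the accuracy on rows 3-11 (the special-character e-mails).
# The model predicts 0 for all of them, so this equals the share labeled not spam.
print("The likelihood of not being spam is approx: %.3f" % model_example.score(X_example[3:12],y_example[3:12]))
# predict() returns an array, so take its single element before formatting
print("A new subject line with special characters is predicted to be: %d" % model_example.predict([[0,1,0]])[0])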

The likelihood of not being spam is approx: 0.556
A new subject line with special characters is predicted to be: 0


# Detecting Spam E-mails using Naive Bayes and Python
Now let's use a (larger) publicly available dataset from the University of California Irvine to detect spam e-mails based on the content of the e-mail rather than special characters in the subject line. This setup requires Python 3.

In [457]:
import numpy as np
import scipy.stats as stats
from urllib.request import urlopen

from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

In [458]:
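url = "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data"
raw_data = urlopen(url)
dataset = np.loadtxt(raw_data, delimiter=",")

# Columns 0-47 hold the word frequencies; the last column holds the spam label
X = dataset[1:, 0:48]
print(X)
y = dataset[1:, -1]

# Reserve the first row to simulate a new, unseen e-mail later on
x_simulation = dataset[0, 0:48]
y_simulation = dataset[0, -1]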

[[ 0.21  0.28  0.5  ...,  0.    0.    0.  ]
 [ 0.06  0.    0.71 ...,  0.06  0.    0.  ]
 [ 0.    0.    0.   ...,  0.    0.    0.  ]
 ..., 
 [ 0.3   0.    0.3  ...,  1.2   0.    0.  ]
 [ 0.96  0.    0.   ...,  0.32  0.    0.  ]
 [ 0.    0.    0.65 ...,  0.65  0.    0.  ]]


The dataset contains 4,601 rows/unique e-mails from one user. Each column ($n_{columns}$=48) represents the frequency of a searched word: the number of times the word appears divided by the total number of words in that e-mail, expressed as a percentage. The counts of all the words in the dataset are taken out of context of other words (hence the NLP association of 'bag of words' with Naive Bayes).

The first entry of *X* printed above is 0.21, indicating that the word "make" accounts for 0.21% of the words in that e-mail. A list of all 48 words can be found at [University of California Irvine Spambase Archives](https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.names). The last column of the dataset holds the *y* label, where 1 indicates spam and 0 indicates not spam.

Since this dataset describes different 'documents' as e-mails and counts of words, we can use all types of Naive Bayes methods. With some tweaks of course! 

This setup differs from the non-code example above where the column values are the frequency of spam/no spam labels. The [Scikit-learn Naive Bayes](http://scikit-learn.org/stable/modules/naive_bayes.html#gaussian-naive-bayes) library will take care of counting the frequencies and organizing the data for us depending on the Naive Bayes equation used (e.g., Bernoulli, multinomial, or Gaussian).

**Note:** Since we don't have extra data, we'll reserve the first row of the data for 'simulation'. We'll use *x_simulation* and *y_simulation* in our last step.

## Estimating Naive Bayes Accuracy

In the non-code example we simply generated a prediction with the 15 observations available without evaluating how accurate that prediction is likely to be. Now we'll take advantage of the fact that we have more observations available ($n_{rows}$=4,601) to generate predictions on a subset of the data and calculate the rate of prediction accuracy. Using this method, we can choose the best Naive Bayes equation and get a general idea of the equation's accuracy.

**Note:** You may choose not to test and compare the accuracy of every Naive Bayes equation and instead directly use the equation that works with your non-altered data. However, as mentioned above, calculating an equation's accuracy based on a training and testing dataset is still valuable. Some methods may require altering the data: Bernoulli Naive Bayes may require reformatting to binary form, and Gaussian Naive Bayes may require a log transformation to make the data approximately normally distributed. Each alteration of the data limits the generalizability and inference of the data and should therefore be done with caution.

Cross-validation can also be used to obtain more robust accuracy scores.
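
As a hedged sketch of what cross-validation could look like, the snippet below runs 5-fold cross-validation with scikit-learn's `cross_val_score`. It uses small synthetic data (`X_demo`, `y_demo`) rather than the spam dataset, and the number of folds and the noise level are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB

# Synthetic stand-in data: 200 samples, 5 binary features (not the spam dataset)
rng = np.random.default_rng(17)
X_demo = rng.integers(0, 2, size=(200, 5))
# The label follows the first feature, with roughly 10% of labels flipped as noise
flip = rng.random(200) < 0.1
y_demo = np.where(flip, 1 - X_demo[:, 0], X_demo[:, 0])

# 5-fold cross-validation: train on 4 folds, score on the held-out fold, repeat
scores = cross_val_score(BernoulliNB(), X_demo, y_demo, cv=5, scoring="accuracy")

print("Accuracy per fold:", np.round(scores, 2))
print("Mean accuracy: %.2f (std %.2f)" % (scores.mean(), scores.std()))
```

Reporting the mean and spread across folds gives a steadier estimate than a single train/test split, at the cost of fitting the model several times.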

In [459]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33, random_state=17)

### Bernoulli Naive Bayes

First we adapt the data to be used with Bernoulli Naive Bayes by setting all values above a threshold to 1 and all other values to 0 (using `binarize=0.1`). Then we compare the prediction results of y_pred to y_expect to calculate accuracy: the number of correctly predicted values divided by all predicted values.

In [460]:
# Values above 0.1 become 1, all other values become 0
BernNB = BernoulliNB(binarize=0.1)
BernNB.fit(X_train, y_train)
print(BernNB)

# Compare the predictions with the true labels of the test set
y_expect = y_test
y_pred = BernNB.predict(X_test)
print("Accuracy is: %.2f" % accuracy_score(y_expect, y_pred))
print("Precision is: %.2f" % precision_score(y_expect, y_pred))
print("Recall/sensitivity is: %.2f" % recall_score(y_expect, y_pred))
# Rows are true labels, columns are predicted labels: [[TN, FP], [FN, TP]]
print(confusion_matrix(y_expect, y_pred))

BernoulliNB(alpha=1.0, binarize=0.1, class_prior=None, fit_prior=True)
Accuracy is: 0.89
Precision is: 0.88
Recall/sensitivity is: 0.82
[[857  64]
 [109 488]]
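
To make the effect of `binarize=0.1` concrete, the sketch below applies the same kind of thresholding by hand to a few made-up frequency values: anything strictly greater than the threshold becomes 1, everything else becomes 0. The helper name `binarize_row` and the sample values are hypothetical.

```python
def binarize_row(row, threshold=0.1):
    """Map each value to 1 if it is strictly above the threshold, else 0."""
    return [1 if value > threshold else 0 for value in row]

# Made-up word frequencies for a single e-mail
sample_row = [0.0, 0.06, 0.1, 0.21, 1.2]

print(binarize_row(sample_row))  # [0, 0, 0, 1, 1]
```

Note that a value exactly equal to the threshold maps to 0, so very rare mentions of a word are treated the same as no mention at all.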


### Multinomial Naive Bayes 

Since the data we have consists of non-negative word frequencies, which behave like scaled counts, we can use the multinomial equation directly.

In [461]:
MultiNB = MultinomialNB()
MultiNB.fit(X_train, y_train)
print(MultiNB)

y_pred=MultiNB.predict(X_test)
print("Accuracy is: %.2f" % accuracy_score(y_expect, y_pred))
print("Precision is: %.2f" % precision_score(y_expect, y_pred))
print("Recall/sensitivity is: %.2f" % recall_score(y_expect, y_pred))
print(confusion_matrix(y_expect, y_pred))

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
Accuracy is: 0.86
Precision is: 0.77
Recall/sensitivity is: 0.94
[[751 170]
 [ 36 561]]


### Gaussian Naive Bayes

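Gaussian Naive Bayes assumes that we have a normal distribution, so let's first test for a normal distribution in each of the columns/words. The null hypothesis of the test is that a column is normally distributed, and we will use a $p \space value<0.05$ as our threshold for rejecting it, i.e., for identifying columns that are non-normal. If you print the result of the normal test, you will see that all columns are $<0.05$, meaning that none of the columns is normally distributed. For the sake of comparison, we will still apply the Gaussian Naive Bayes equation without any transformation of the data, keeping in mind that its normality assumption is violated.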

**Note:** Here we chose 0.05 as the significance level for the normality test. To choose the right significance level for your data, find the 'standard' levels used in your research domain. I look for this information in related studies/full papers within blind peer-reviewed (non-open-source) journal articles or conference proceedings. Typically 0.01 is used for 'high risk' research (e.g., health-related studies) and 0.05 for low risk research.

In [462]:
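# Null hypothesis: each column is normally distributed (one p-value per column)
statistic, p_values = stats.normaltest(X_train)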

GausNB = GaussianNB()
GausNB.fit(X_train, y_train)
print(GausNB)

y_pred = GausNB.predict(X_test)
print("Accuracy is: %.2f" % accuracy_score(y_expect, y_pred))
print("Precision is: %.2f" % precision_score(y_expect, y_pred))
print("Recall/sensitivity is: %.2f" % recall_score(y_expect, y_pred))
print(confusion_matrix(y_expect, y_pred))

GaussianNB(priors=None)
Accuracy is: 0.80
Precision is: 0.67
Recall/sensitivity is: 0.95
[[642 279]
 [ 29 568]]
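
As a reminder of how to read the output of `stats.normaltest`: its null hypothesis is that the sample is normally distributed, so a small p-value is evidence *against* normality. The sketch below shows this on two synthetic samples (not the spam data); the sample sizes, seed, and distributions are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two synthetic samples: one bell-shaped, one heavily right-skewed
normal_sample = rng.normal(loc=0.0, scale=1.0, size=500)
skewed_sample = rng.exponential(scale=1.0, size=500)

alpha = 0.05  # significance level
for name, sample in [("normal", normal_sample), ("skewed", skewed_sample)]:
    statistic, p_value = stats.normaltest(sample)
    # A p-value below alpha means we reject the hypothesis of normality
    verdict = "reject normality" if p_value < alpha else "cannot reject normality"
    print("%s sample: p=%.3g -> %s" % (name, p_value, verdict))
```

The skewed sample produces a p-value far below 0.05, while a truly normal sample will usually, though not always, stay above it.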


## Comparing Naive Bayes Models

### Accuracy
Based on the tests above, the two best equations are Bernoulli Naive Bayes (altered data, $accuracy=0.89$) and Multinomial Naive Bayes (non-altered data, $accuracy=0.86$). 

### Recall & Precision
Although Multinomial Naive Bayes had a higher recall (0.94) than Bernoulli Naive Bayes (0.82), the confusion matrices show that Multinomial Naive Bayes would also have had more occurrences (n=170) of placing legitimate e-mails in the spam folder (precision=0.77).

### F1-Score
Bernoulli Naive Bayes more often correctly identified both spam/not spam labels, and had a more balanced precision and recall (Bernoulli F1-score=0.85, Multinomial F1-score=0.84). In the instances where Bernoulli mislabeled e-mails, it more often placed a spam e-mail in your inbox (n=109) than a legitimate e-mail in your spam folder (n=64).
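
The precision, recall, and F1 values quoted in this comparison can be recomputed directly from the confusion matrices printed earlier. This is a small sketch with a helper name of my own (`precision_recall_f1`); it assumes scikit-learn's confusion matrix layout of `[[TN, FP], [FN, TP]]`.

```python
def precision_recall_f1(confusion):
    """Compute precision, recall and F1 for the positive (spam) class.

    Expects scikit-learn's layout: [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = confusion
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Confusion matrices printed in the Bernoulli and Multinomial sections
bernoulli_cm = [[857, 64], [109, 488]]
multinomial_cm = [[751, 170], [36, 561]]

for name, cm in [("Bernoulli", bernoulli_cm), ("Multinomial", multinomial_cm)]:
    precision, recall, f1 = precision_recall_f1(cm)
    print("%s: precision=%.2f recall=%.2f F1=%.2f" % (name, precision, recall, f1))
# Bernoulli: precision=0.88 recall=0.82 F1=0.85
# Multinomial: precision=0.77 recall=0.94 F1=0.84
```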

## Generating predictions with Bernoulli Naive Bayes
We will use Bernoulli Naive Bayes trained above (BernNB) as our equation and the *x_simulation* variable we previously reserved to simulate a new entry to be predicted upon. 

**Note:** If we were to run our trained Naive Bayes with our test data, we would get a high prediction accuracy, as this was the data we originally used to select Bernoulli over Multinomial. It would not provide us any additional insights.

In [463]:
new_entry = x_simulation.reshape(1, -1)
y_pred = BernNB.predict(new_entry)
print(y_pred)

[ 1.]


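A prediction of 1 indicates that the e-mail is spam. We can assume that this is likely to be true, as prior testing indicated that the Bernoulli Naive Bayes model had an accuracy of 0.89 and a precision of 0.88 (i.e., about 88% of the e-mails it flagged as spam in testing were truly spam).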

If we look at the true value we see that our prediction was indeed correct! 

In [464]:
print(y_simulation)

1.0
