# Naive Bayes Classifier

In this tutorial, you will learn how to classify e-mails as spam or not spam using a Naive Bayes classifier.

<img src="img/spam-filter.png">

First, let's briefly review the characteristics of Naive Bayes classifiers:

### Pros and cons of Naive Bayes Classifiers
Pros:
* Computationally fast
* Simple to implement
* Works well with small datasets
* Works well with high dimensions
* Performs well even if the naive assumption (that the features are conditionally independent given the class) is not perfectly met. In many cases, the approximation is good enough to build a good classifier.

Cons:
* Correlated features should be removed: the model treats each feature as independent evidence, so correlated features are effectively counted twice, which over-inflates their importance.
* If a categorical variable has a category in the test data set that was not observed in the training data set, the model assigns it a zero probability. Because the class likelihood is a product of per-feature probabilities, a single zero wipes out the whole product and the model cannot make a meaningful prediction. This is often known as the “zero frequency” problem. To solve it, we can use a smoothing technique. One of the simplest is called [Laplace estimation](https://stats.stackexchange.com/questions/108797/in-naive-bayes-why-bother-with-laplace-smoothing-when-we-have-unknown-words-in). `sklearn`'s discrete Naive Bayes classifiers (for example `MultinomialNB`) apply Laplace smoothing by default (`alpha=1.0`); `GaussianNB`, which models continuous features, has a `var_smoothing` parameter for numerical stability instead.
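
To make the zero-frequency problem concrete, here is a minimal sketch of add-one (Laplace) smoothing in plain Python. The word lists, the vocabulary, and the helper `word_likelihoods` are hypothetical and exist only for illustration; they are not part of `sklearn` or of the dataset used below.

```python
from collections import Counter

def word_likelihoods(words, vocabulary, alpha=0.0):
    """Estimate P(word | class) from the words observed in one class.

    alpha=0 gives the raw frequency estimate;
    alpha=1 gives Laplace (add-one) smoothing.
    """
    counts = Counter(words)
    total = len(words) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / total for w in vocabulary}

# Hypothetical toy data: words seen in spam e-mails during training.
spam_words = ["free", "money", "free", "offer"]
# "meeting" is in the vocabulary but never appears in a spam e-mail.
vocabulary = ["free", "money", "offer", "meeting"]

raw = word_likelihoods(spam_words, vocabulary, alpha=0.0)
smoothed = word_likelihoods(spam_words, vocabulary, alpha=1.0)

print(raw["meeting"])       # 0.0: one unseen word zeroes the whole product
print(smoothed["meeting"])  # 0.125: small, but no longer zero
```

With smoothing, every word keeps a small non-zero probability, and the estimates still sum to 1 over the vocabulary.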

### Popular use cases for Naive Bayes models

* Spam Detection
* Customer classification
* Loan Classification
* Health Risk Prediction

In [None]:
import numpy as np
import pandas as pd
import urllib
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

We will load the dataset containing the spam information:

In [None]:
dataset = np.loadtxt("data/spambase.data", delimiter=",")
dataset[0]

Print the number of rows and attributes contained in the dataset:

In [None]:
# add your code here

The file `data/spambase_doc.txt` describes the dataset and its attributes.

For our purposes, we will use the following attributes:
* The first 48 attributes as training features. These are continuous values ranging from 0 to 100, each representing the percentage of words in the e-mail that match one of 48 different keywords usually associated with spam or non-spam e-mails.
* The last attribute as the target. It is a binary class attribute indicating whether the e-mail was considered spam (1) or not (0).

Extract the features and target matrices by using `numpy` slicing operators.
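
As a reminder of how `numpy` slicing works on a 2-D array, here is a small sketch on a made-up array. The array `toy` and the names `toy_x` and `toy_y` are hypothetical; the real dataset has a different number of columns, so the column indices need to be adapted.

```python
import numpy as np

# Hypothetical toy dataset: 4 rows, 3 feature columns followed by 1 target column.
toy = np.array([
    [0.1, 0.0, 2.5, 1.0],
    [0.0, 0.3, 0.0, 0.0],
    [1.2, 0.0, 0.7, 1.0],
    [0.0, 0.0, 0.1, 0.0],
])

toy_x = toy[:, :3]  # all rows, first 3 columns (features)
toy_y = toy[:, -1]  # all rows, last column only (target)

print(toy.shape)    # (4, 4): rows, columns
print(toy_x.shape)  # (4, 3)
print(toy_y.shape)  # (4,)
```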

In [None]:
x = ...
y = ...

Split the dataset into train and test sets with a test size of 0.33 and `random_state=17`. For this, use `sklearn`'s function `train_test_split`, whose documentation can be found at https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
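
The following sketch shows the call pattern of `train_test_split` on made-up arrays. The data and the parameter values (`test_size=0.25`, `random_state=0`) are chosen only for illustration and deliberately differ from the ones requested above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 12 samples with 2 features each, and binary labels.
toy_x = np.arange(24).reshape(12, 2)
toy_y = np.array([0, 1] * 6)

# The function returns four arrays, in this order.
a_train, a_test, b_train, b_test = train_test_split(
    toy_x, toy_y, test_size=0.25, random_state=0
)

print(a_train.shape, a_test.shape)  # (9, 2) (3, 2)
print(b_train.shape, b_test.shape)  # (9,) (3,)

# Fixing random_state makes the shuffle reproducible across calls.
a_train_again, _, _, _ = train_test_split(
    toy_x, toy_y, test_size=0.25, random_state=0
)
print(np.array_equal(a_train, a_train_again))  # True
```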

In [None]:
x_train, x_test, y_train, y_test = ...

Build a [Gaussian Naive Bayes Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html) for the dataset. For this, you can use the `GaussianNB` implementation provided by `sklearn`.

After this, train the model using the training data.
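
As a rough illustration of the `sklearn` estimator workflow (create the estimator, then call `.fit()`), here is a sketch on two made-up, well-separated clusters. The data and the name `toy_model` are hypothetical and unrelated to the spam dataset.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data: class 0 clusters near (1, 1), class 1 near (5, 5).
toy_x = np.array([
    [1.0, 1.2], [0.8, 1.0], [1.1, 0.9],
    [5.0, 5.2], [4.8, 5.1], [5.2, 4.9],
])
toy_y = np.array([0, 0, 0, 1, 1, 1])

toy_model = GaussianNB()     # create the estimator
toy_model.fit(toy_x, toy_y)  # estimate per-class means and variances

print(toy_model.classes_)      # class labels seen during training
print(toy_model.class_prior_)  # estimated prior probability of each class
print(toy_model.theta_)        # per-class mean of each feature

# Sanity check on two new points, one close to each cluster.
toy_pred = toy_model.predict(np.array([[1.0, 1.0], [5.0, 5.0]]))
print(toy_pred)  # [0 1]
```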

In [None]:
GaussNB = ...
... # fit your model

print(GaussNB)

Using the trained model, make predictions for the test features. Recall that you can use the method `.predict()` for this purpose.

In [None]:
y_predict = ...
y_predict

Use the `accuracy_score` function to compare the predictions against the real target values (`y_test`). Print the accuracy obtained.
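
To show what `accuracy_score` computes, here is a small sketch on made-up label arrays (the names `true_labels` and `predicted_labels` are hypothetical). Accuracy is simply the fraction of predictions that match the true labels.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: 4 of the 5 predictions are correct.
true_labels = np.array([1, 0, 1, 1, 0])
predicted_labels = np.array([1, 0, 0, 1, 0])

toy_acc = accuracy_score(true_labels, predicted_labels)

# Equivalent manual computation: mean of the element-wise matches.
manual_acc = np.mean(true_labels == predicted_labels)

print(toy_acc)     # 0.8
print(manual_acc)  # 0.8
```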

In [None]:
acc = ...
acc

### Compare with Logistic regression

Let's make a quick comparison with a logistic regression model trained on the same data:

In [None]:
log_reg = LogisticRegression()
log_reg.fit(x_train, y_train)
print(log_reg)
y_predict = log_reg.predict(x_test)
acc = accuracy_score(y_test, y_predict)
print(acc)

----

Well done! We have implemented a spam detector that classifies e-mails with ~0.81 accuracy. A couple of remarks:
1. Since we chose the best model based on the test results, the test set has in reality acted as a **validation** set. For a strict evaluation of the model's quality, we should test the chosen algorithm against a fresh batch of data.
2. We chose accuracy as the performance metric. However, depending on the problem statement, other metrics may be more appropriate, such as precision, recall, F1 score, or area under the ROC curve.
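
To illustrate the second remark, the sketch below computes a few of these metrics with `sklearn.metrics` on made-up, imbalanced labels. The arrays `toy_true` and `toy_pred` are hypothetical; they are chosen so that accuracy looks acceptable while recall is poor.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical imbalanced labels: 3 spam (1) and 7 non-spam (0) e-mails.
toy_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
toy_pred = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 1])

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(toy_true, toy_pred)
print(cm)  # [[6 1]
           #  [2 1]]

acc = accuracy_score(toy_true, toy_pred)     # (TP + TN) / all = 7 / 10
prec = precision_score(toy_true, toy_pred)   # TP / (TP + FP) = 1 / 2
rec = recall_score(toy_true, toy_pred)       # TP / (TP + FN) = 1 / 3
f1 = f1_score(toy_true, toy_pred)            # harmonic mean of the two = 0.4

print(acc, prec, rec, f1)
```

Here the classifier is right 70% of the time, yet it catches only one of the three spam e-mails, which accuracy alone does not reveal.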