Copyright 2020 Vasile Rus, Andrew M. Olney and made available under [CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0) for text and [Apache-2.0](http://www.apache.org/licenses/LICENSE-2.0) for code.


# Naïve Bayes

Naïve Bayes is a supervised data science method typically used for **classification/categorization** tasks as exemplified before in, for instance, the logistic regression notebook.
For that reason, it can be viewed as estimating the probabilities of a number of outcome variable values, e.g., the probabilities of categories in classification.
To classify a particular object or instance $X$, the class with the highest probability among all possible classes $1$ to $C$ is taken as shown below:

$$class (X) = argmax_{c \in (1..C)} P(c_i|X)$$ 
        
While naïve Bayes is quite successful in classification tasks, the actual estimated probabilities for each class are not very reliable.
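
As a minimal sketch of the decision rule above, the snippet below picks the class with the highest probability. The class names and probability values are made up for illustration only.

```python
# Hypothetical posterior probabilities P(c_i | X) for a single instance X.
posteriors = {"not_recom": 0.10, "priority": 0.65, "spec_prior": 0.25}

# class(X) is the class whose posterior probability is largest (the argmax).
predicted_class = max(posteriors, key=posteriors.get)
print(predicted_class)  # priority
```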

In this notebook, we focus on multinomial, hard classification tasks.

### What you will learn

In this notebook, you will learn about naïve Bayes, a data science paradigm used primarily for classification tasks, and how naïve Bayes based classifiers can be inferred from labeled/annotated data.  
We will study the following:

- The basics of naïve Bayes
- The meaning of “naïve” in naïve Bayes
- Details about how naïve Bayes models are trained
- Evaluation of performance for naïve Bayes classifiers


### When to use naïve Bayes

Naïve Bayes classifiers are useful when you have a categorical response/outcome variable and there are multiple features/predictors that can be used to predict the correct value of the outcome variable. 
The ultimate goal is to automatically build a probabilistic model that predicts the correct value of the outcome variable for a new instance described by the set of predictors/features. 
Naïve Bayes outputs a probability distribution over the values of the outcome variable, so a probability value is generated for each class. 
The category with the highest probability is typically chosen as the correct/most-likely category for the corresponding instance. 
Naïve Bayes has the advantage of being simple and highly accurate for classification when features can be treated independently, comparable to logistic regression but more easily extended to many categories.

## Mathematical Foundations of Naïve Bayes for Multinomial, Hard Classification

We briefly review in this section the mathematical formulation of the naïve Bayes method for multinomial, hard classification problems. 
That is, we assume the outcome for one instance or object can be one and only one category out of $C$ possible categories.

The naïve Bayes method relies on Bayes' Theorem shown below:

$$P (Y|X) = \frac{P(Y)P(X|Y)}{P(X)}$$

The term $P (Y|X) $ is called the posterior, the term $P(Y)$ is called the prior, and the term $P(X|Y)$ is called the likelihood.

In a classification case, Y can take as value any of the classes $c \in (1..C)$ and X is described as a set of features/predictors $X=(x_1,..,x_P)$. 
Then Bayes' Theorem becomes:

$$P (Y=c_i| (x_1,..,x_P)) = \frac{P(Y=c_i)P(x_1,..,x_P|Y=c_i)}{P(x_1,..,x_P)}$$

The naïve Bayes method takes this theorem and, based on the naive assumption that the predictors $x_j$ are independent of one another given the class, i.e., that $P(x_1,..,x_P|Y=c_i)$ can be approximated by $\prod \limits _{j=1} ^P P(x_j|c_i)$, rewrites it in the following form:

$$P (Y=c_i| (x_1,..,x_P)) = \frac{P(Y=c_i) \prod \limits _{j=1} ^P P(x_j|c_i)}{P(x_1,..,x_P)}$$

This naive formulation of the theorem is more manageable in terms of estimating the parameters of the distributions involved and in particular of the likelihood probability.

## Training a Naïve Bayes Classifier

Training a naïve Bayes classifier implies deriving the prior and likelihood distributions from training data based on the naive formulation of Bayes' Theorem.

The prior $P(c_i)$ is derived using the following expression:

$$ P(c_i)= \frac{{\#} c_i}{N}$$

where ${\#} c_i$ is the number of training instances labeled with class $i$ and $N$ is the total number of training instances.

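The likelihood $P (X | Y=c_i) = \prod \limits _{j=1} ^P P(x_j|c_i) = P(x_1|c_i)P(x_2|c_i)... P(x_P|c_i)$ is derived by multiplying the individual conditional distributions for each predictor $x_j$, each of which is estimated as shown below:

$$ P(x_j|c_i) = \frac{{\#} (x_j, c_i)}{{\#} c_i}$$

where ${\#} (x_j, c_i)$ is the number of training instances labeled with class $c_i$ in which predictor $j$ takes the value $x_j$.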
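
The counting estimates for the prior and the likelihood can be sketched with `pandas` as follows. The tiny training set is hypothetical and only borrows level names from the nursery data; `value_counts` and `crosstab` are one possible way to do the counting, not necessarily how a library implements it.

```python
import pandas as pd

# Hypothetical toy training set: two nominal predictors and a class label.
train = pd.DataFrame({
    "finance": ["convenient", "inconv", "convenient", "inconv", "convenient", "convenient"],
    "health": ["recommended", "priority", "priority", "recommended", "recommended", "priority"],
    "rank": ["priority", "spec_prior", "priority", "spec_prior", "priority", "spec_prior"],
})

# Prior: P(c_i) = (number of instances with class c_i) / N
prior = train["rank"].value_counts(normalize=True)

# Likelihood for one predictor: each row is a class, each column a predictor value.
# normalize="index" divides every count by the number of instances in that class.
likelihood_finance = pd.crosstab(train["rank"], train["finance"], normalize="index")

print(prior)
print(likelihood_finance)
```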

Once the prior and likelihood distributions are derived, to predict the most likely class for a new instance $X=(x_1, ..., x_P)$ we apply the naïve Bayes formula:

$$class (X) = argmax_{c \in (1..C)} {P(c_i|X)} = argmax_{c \in (1..C)} P (Y=c_i| (x_1,..,x_P)) = argmax_{c \in (1..C)} \frac{P(Y=c_i) P(x_1|c_i)P(x_2|c_i)... P(x_P|c_i)}{P(x_1,..,x_P)} $$

Since the denominator does not depend on $c_i$, the argument of argmax, we can ignore the denominator.
Then the most likely class can be simply obtained using this formula:

$$class (X) = argmax_{c \in (1..C)} P(c_i|X) = argmax_{c \in (1..C)} P(Y=c_i) P(x_1|c_i) P(x_2|c_i) ... P(x_P|c_i)$$ 

That is, the most likely class is the class corresponding to the highest posterior probability, estimated based on the above naive formulation of Bayes' Theorem.
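
To make the prediction step concrete, here is a small from-scratch sketch that scores each class with the prior times the product of per-predictor likelihoods and skips the denominator. The data, the helper names `score` and `classify`, and the use of plain counting without any smoothing are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy training data: (predictor values, class label).
training = [
    (("convenient", "recommended"), "priority"),
    (("convenient", "priority"), "priority"),
    (("inconv", "recommended"), "priority"),
    (("inconv", "priority"), "spec_prior"),
    (("inconv", "priority"), "spec_prior"),
    (("convenient", "priority"), "spec_prior"),
]

class_counts = Counter(label for _, label in training)

# value_counts[(j, label)][value] = instances of class `label` where predictor j equals `value`
value_counts = defaultdict(Counter)
for features, label in training:
    for j, value in enumerate(features):
        value_counts[(j, label)][value] += 1


def score(features, label):
    """Unnormalized posterior: P(c_i) times the product of P(x_j | c_i)."""
    result = class_counts[label] / len(training)
    for j, value in enumerate(features):
        result *= value_counts[(j, label)][value] / class_counts[label]
    return result


def classify(features):
    # The denominator P(x_1,..,x_P) is identical for every class, so it is skipped.
    return max(class_counts, key=lambda label: score(features, label))


new_instance = ("inconv", "priority")
print({label: score(new_instance, label) for label in class_counts})
print(classify(new_instance))  # spec_prior
```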

<!-- NOTE: this has already been covered -->
<!-- ## Performance Evaluation for Classification Methods including Naïve Bayes

The typical performance metrics for classifiers are accuracy, precision, and recall. These are typically derived by comparing the predicted output to the golden or actual output/categories in the expert-labelled dataset.

For a binary classification case, we denote category 1 as the positive category and category 0 as the negative category. Using this new terminology, when comparing the predicted categories to the actual categories we may end up with the following cases:
* True Positives (TP): instances predicted as belonging to the positive category and which in fact do belong to the positive category
* True Negatives (TN): instances predicted as belonging to the negative category and which in fact do belong to the negative category
* False Positives (FP): instances predicted as belonging to the positive category and which in fact do belong to the negative category
* False Negatives (FN): instances predicted as belonging to the negative category and which in fact do belong to the positive category

From these categories, we define the following metrics:

$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$

$Precision = \frac{TP}{TP + FP}$

$Recall = \frac{TP}{TP + FN}$

Classification methods that have a high accuracy are preferred in general, although in some cases maximizing precision or recall may be preferred. For instance, a high recall is highly recommended when making a medical diagnosis since it is preferable to err on the side of mis-diagnosing someone as having cancer as opposed to missing someone who indeed has cancer, i.e., the method should try not to miss anyone who may indeed have cancer. 

In general, there is a trade-off between precision and recall: increasing one tends to decrease the other. Total recall (100% recall) is achievable by always predicting the positive class, i.e., labeling all instances as positive, in which case precision will be very low. -->

## Example: Naïve Bayes

The data we will use is the `nursery` dataset, which ranks applications for nursery schools in Slovenia during the 1980s.
Because the original dataset is a fair bit larger, we've randomly sampled 2000 rows.

The goal is to predict `rank`.

| Variable | Type    | Description                                        |
|:----------|:---------|:----------------------------------------------------|
| parents  | Nominal | usual, pretentious, great_pret                     |
| has_nurs | Nominal | proper, less_proper, improper, critical, very_crit |
| form     | Nominal | complete, completed, incomplete, foster            |
| children | Nominal | 1, 2, 3, more                                      |
| housing  | Nominal | convenient, less_conv, critical                    |
| finance  | Nominal | convenient, inconv                                 |
| social   | Nominal | non-prob, slightly_prob, problematic               |
| health   | Nominal | recommended, priority, not_recom                   |
| rank    | Nominal | not_recom, recommend, very_recom, priority, spec_prior   |

<div style="text-align:center;font-size: smaller">
 <b>Source:</b> This dataset was taken from the <a href="https://archive.ics.uci.edu/ml/datasets/Nursery">UCI Machine Learning Repository library
    </a></div>
<br>


### Load data

Import `pandas` so we can load a dataframe:

- `import pandas as pd`

Load the dataframe:

- Create `dataframe` and set it to `with pd do read_csv using` a list containing
    - `"datasets/nursery.csv"`
- `dataframe`

### Explore data

Let's check the data makes sense with the five figure summary:

- `with dataframe do describe using` 

What did we get?
It's not a five figure summary because our variables are nominal, and there's no such thing as mean, median, etc., with nominal data.
Instead we have the number of *unique* levels of each variable, the *top* or most frequent level, and the *freq*uency of that level.
Moreover, the count is 2000 across all variables, indicating there are no NaNs.
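
For reference, the behavior of `describe` on nominal columns can be reproduced on a small stand-in dataframe. The data below is hypothetical; in the notebook itself the dataframe comes from `pd.read_csv("datasets/nursery.csv")`.

```python
import pandas as pd

# Hypothetical stand-in for the nursery data, with nominal columns only.
toy = pd.DataFrame({
    "finance": ["convenient", "inconv", "convenient", "convenient"],
    "health": ["recommended", "priority", "not_recom", "priority"],
})

# For nominal columns, describe reports count, unique, top, and freq
# instead of mean, quartiles, and so on.
summary = toy.describe()
print(summary)
```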

Because the variables are nominal, many of our standard tools won't work.
For example, we can't use a correlation matrix/heatmap, because correlation isn't defined for nominal variables.
There is something called [Cramer's V](https://towardsdatascience.com/the-search-for-categorical-correlation-a1cf7f1888c9) that is close, but it requires some custom coding that's a bit out of scope for us.
One thing we can do is plot each variable separately.

First import `plotly.express` to do plots:

- `import plotly.express as px`

Next create an empty histogram figure:

- Create `fig` and set it to `with px do histogram using`

Create all the figures in a loop:

- `for i in from dataframe get columns`
    - Set `fig` to `with px do histogram using` a list containing
        - `dataframe`
        - freestyle `x=i` (`i` will take on the name of a column on each loop)
    - Empty freestyle followed by `with fig do show using`

Interestingly, all the variables have levels that occur with about the same frequency, except for `rank`, which is very imbalanced with respect to `very_recom`.
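
If you prefer numbers to plots, the same kind of imbalance can be checked with `value_counts`. This is only a sketch on made-up labels, not the actual `rank` column.

```python
import pandas as pd

# Hypothetical class labels standing in for the `rank` column.
rank = pd.Series(["priority", "spec_prior", "not_recom", "priority",
                  "spec_prior", "not_recom", "priority", "very_recom"])

# Relative frequency of each level; a much smaller share signals an imbalanced level.
shares = rank.value_counts(normalize=True)
print(shares)
```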

### Prepare train/test sets

Let's separate our predictors (`X`) from our class label (`Y`), putting each into its own dataframe:

- Create `X` and set to `with dataframe do drop using` a list containing
    - freestyle `columns=["rank"]`
- Create `Y` and set to `dataframe [ ]` containing a list with `"rank"` inside

The model we will use is Bernoulli naive Bayes.
This model needs numeric predictors, but can have nominal class labels.
So we need to get dummies for `X` only:

- Set `X` to `with pd do get_dummies using` a list containing
    - `X`
- `X` (so you can see what happened)
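
In plain Python, the predictor/label separation and dummy coding look roughly like the sketch below. The three-row dataframe is a hypothetical miniature of the nursery data.

```python
import pandas as pd

# Hypothetical miniature of the nursery dataframe.
dataframe = pd.DataFrame({
    "finance": ["convenient", "inconv", "convenient"],
    "health": ["recommended", "priority", "not_recom"],
    "rank": ["priority", "spec_prior", "not_recom"],
})

X = dataframe.drop(columns=["rank"])  # predictors only
Y = dataframe[["rank"]]               # class label, kept as a one-column dataframe

# One indicator column per level of each nominal predictor.
X = pd.get_dummies(X)
print(X.columns.tolist())
```

Each nominal predictor is replaced by one column per level, named `predictor_level`, which is why the column names seen later look like `parents_great_pret`.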


We're now ready to split the data into train/test sets, which requires `sklearn.model_selection`:

- `import sklearn.model_selection as model_selection`

And do the actual split:

- Create `splits` and set to `with model_selection do train_test_split using` a list containing
    - `X`
    - `Y`
    - freestyle `random_state=1`
    
Setting `random_state` makes the random split reproducible, so everyone's results will match.
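
The split can be sketched in Python as shown below, using hypothetical toy data. Note that `train_test_split` returns its pieces in the order Xtrain, Xtest, Ytrain, Ytest; the block lists count from 1, whereas Python indexes from 0.

```python
import pandas as pd
import sklearn.model_selection as model_selection

# Hypothetical toy predictors and labels (8 rows).
X = pd.DataFrame({"finance_inconv": [0, 1, 0, 1, 0, 1, 0, 1]})
Y = pd.DataFrame({"rank": ["a", "b", "a", "b", "a", "b", "a", "b"]})

# Returns a list in the order: Xtrain, Xtest, Ytrain, Ytest.
splits = model_selection.train_test_split(X, Y, random_state=1)
Xtrain, Xtest, Ytrain, Ytest = splits

# By default, 25% of the rows are held out for testing.
print(len(Xtrain), len(Xtest))  # 6 2

# The same random_state gives the same split again.
repeat = model_selection.train_test_split(X, Y, random_state=1)
print(list(repeat[0].index) == list(Xtrain.index))  # True
```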

### Fit model

We need to import libraries for:

- Naïve Bayes
- Metrics
- Ravel

So do the following imports:

- `import sklearn.naive_bayes as naive_bayes`
- `import sklearn.metrics as metrics`
- `import numpy as np`

Create the naive Bayes model:

- Create variable `naiveBayes` and set it to `with naive_bayes create BernoulliNB using` 

Train the model by calling `fit` on it:

-  `with naiveBayes do fit using` a list containing
    - `in list splits get # 1` (this is Xtrain)
    - `with np do ravel using` a list containing
        - `in list splits get # 3` (this is Ytrain)

And finally, get predictions:

- Create `predictions`
- Set it to `with naiveBayes do predict using` a list containing
    - `in list splits get # 2` (this is `Xtest`)
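
A Python sketch of the fit and predict steps is shown below. The dummy-coded training and test data are hypothetical; in the notebook they come from `splits`.

```python
import numpy as np
import pandas as pd
import sklearn.naive_bayes as naive_bayes

# Hypothetical dummy-coded predictors and a one-column label dataframe.
Xtrain = pd.DataFrame({
    "finance_convenient": [1, 1, 0, 0],
    "finance_inconv": [0, 0, 1, 1],
})
Ytrain = pd.DataFrame({"rank": ["priority", "priority", "spec_prior", "spec_prior"]})
Xtest = pd.DataFrame({"finance_convenient": [1, 0], "finance_inconv": [0, 1]})

naiveBayes = naive_bayes.BernoulliNB()

# ravel flattens the (n, 1) label dataframe into the 1-D array that fit expects.
naiveBayes.fit(Xtrain, np.ravel(Ytrain))

predictions = naiveBayes.predict(Xtest)
print(predictions)
```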

### Evaluate the model

Get the accuracy:

- `with metrics do accuracy_score using` a list containing
    - `in list splits get # 4`  (this is `Ytest`)
    - `predictions`

And get the recall and precision:

- `print with metrics do classification_report using` a list containing
    - `in list splits get # 4`  (this is `Ytest`)
    - `predictions`
    

Performance is surprisingly good except for the infrequent class, `very_recom`.
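
The two metric calls can be sketched in Python as follows. The true labels and predictions here are invented to keep the example self-contained, so the numbers say nothing about the nursery model.

```python
import sklearn.metrics as metrics

# Hypothetical true labels and predictions for six test instances.
Ytest = ["priority", "priority", "spec_prior", "spec_prior", "not_recom", "not_recom"]
predictions = ["priority", "spec_prior", "spec_prior", "spec_prior", "not_recom", "not_recom"]

# Fraction of instances whose predicted class equals the true class (5 of 6 here).
accuracy = metrics.accuracy_score(Ytest, predictions)
print(accuracy)

# Per-class precision, recall, and F1, plus averages.
report = metrics.classification_report(Ytest, predictions)
print(report)
```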

## Visualizing

### Feature importance

Extracting feature importance from naive Bayes models in `sklearn` is a bit more work than for other models.

We need to create a dataframe of the probabilities of predictors given the class label, i.e. the likelihoods, then give that dataframe correct row/column names, and finally exponentiate it (because the default output is the natural log of the probabilities):

- Create variable `output`
- Set it to `with pd create DataFrame using` a list containing
    - freestyle `naiveBayes.feature_log_prob_`
- Freestyle `output.index = naiveBayes.classes_`
- Freestyle `output.columns = X.columns`
- Freestyle `output = np.exp(output)`
- `output` (to display)

Each column in this table shows the probability of that column's predictor given the classes shown in each row.
For example, in the `parents_great_pret` column the largest value is in the `spec_prior` row, so this predictor value is more probable under `spec_prior` than under any other class.
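
As a hedged illustration on hypothetical data, the sketch below recovers the likelihoods from a fitted `BernoulliNB`. Because `feature_log_prob_` stores natural logarithms, `np.exp` is the inverse that turns them back into probabilities.

```python
import numpy as np
import pandas as pd
import sklearn.naive_bayes as naive_bayes

# Hypothetical dummy-coded training data.
X = pd.DataFrame({
    "finance_convenient": [1, 1, 0, 0],
    "finance_inconv": [0, 0, 1, 1],
})
y = ["priority", "priority", "spec_prior", "spec_prior"]

naiveBayes = naive_bayes.BernoulliNB().fit(X, y)

# feature_log_prob_ holds natural-log likelihoods: one row per class, one column per predictor.
output = pd.DataFrame(naiveBayes.feature_log_prob_,
                      index=naiveBayes.classes_, columns=X.columns)
output = np.exp(output)  # back to probabilities between 0 and 1
print(output)
```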

It's a bit hard to read these, so we can also plot them in a loop:

- `for i in from output get columns`
    - Set `fig` to `with px do bar using` a list containing
        - `output`
        - freestyle `x=i` (`i` will take on the name of a column on each loop)
    - Empty freestyle followed by `with fig do show using`

Quite a few of these predictors make one class label much more likely than the other labels.
This suggests that the naive Bayes assumption of independent predictors is reasonable for this dataset.