# Naive Bayes

*Author: Dr. Vasile Rus (vrus@memphis.edu)*

Naive Bayes is a supervised Data Science method typically used for **classification/categorization** tasks, like those illustrated earlier in, for instance, the Logistic Regression notebook. In such tasks, it can be viewed as estimating the probabilities of a number of outcome variable values, e.g., the probabilities of categories in classification. To classify a particular object or instance, the class with the highest probability among all possible classes $1$ to $C$ is chosen, as shown below:

$$class (X) = \arg\max_{c_i,\, i \in (1..C)} P(c_i|X)$$
        
While Naive Bayes is quite successful in classification tasks, the actual probabilities it estimates for each class are not very reliable.
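As a minimal sketch of the decision rule above, the snippet below picks the class with the highest probability. The class names and probability values are hypothetical, chosen only for illustration.

```python
# Hypothetical posterior probabilities P(c_i | X) for one instance (illustrative values only)
posteriors = {"setosa": 0.10, "versicolor": 0.65, "virginica": 0.25}

# The predicted class is the one with the highest posterior probability
predicted_class = max(posteriors, key=posteriors.get)
print(predicted_class)  # versicolor
```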

In this notebook, we focus on multinomial, hard classification tasks.

## Mathematical Foundations of Naive Bayes for Multinomial, Hard Classification

We briefly review in this section the mathematical formulation of the Naive Bayes method for multinomial, hard classification problems. That is, we assume the outcome for one instance or object can be one and only one category out of $C$ possible categories.

The Naive Bayes method relies on Bayes' Theorem shown below:

$$P (Y|X) = \frac{P(Y)P(X|Y)}{P(X)}$$

The term $P (Y|X) $ is called the posterior, the term $P(Y)$ is called the prior, and the term $P(X|Y)$ is called the likelihood.
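To make the roles of these terms concrete, here is a small numeric sketch of Bayes' Theorem for two classes. All numbers and class names are assumptions made up for illustration; the denominator $P(X)$ (often called the evidence) is obtained by summing prior times likelihood over the classes.

```python
# Illustrative numbers (assumptions, not taken from any dataset)
prior = {"spam": 0.3, "ham": 0.7}        # P(Y)
likelihood = {"spam": 0.8, "ham": 0.1}   # P(X | Y) for one observed X

# Evidence P(X), via the law of total probability
evidence = sum(prior[c] * likelihood[c] for c in prior)

# Posterior P(Y | X) for each class
posterior = {c: prior[c] * likelihood[c] / evidence for c in prior}
print(posterior)
```

Although the prior favors `ham`, the much larger likelihood under `spam` makes `spam` the class with the higher posterior in this example.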

In a classification case, $Y$ can take as value any of the classes $c_i$, $i \in (1..C)$, and $X$ is described as a set of features/predictors $X=(x_1,..,x_P)$. Bayes' Theorem becomes:

$$P (Y=c_i| (x_1,..,x_P)) = \frac{P(Y=c_i)P(x_1,..,x_P|Y=c_i)}{P(x_1,..,x_P)}$$

The Naive Bayes method takes this theorem and, based on the naive assumption that the predictors $x_j$ are independent of each other given the class, i.e., that $P(x_1,..,x_P|Y=c_i)$ can be approximated by $\prod \limits _{j=1} ^P P(x_j|c_i)$, rewrites it in the following form:

$$P (Y=c_i| (x_1,..,x_P)) = \frac{P(Y=c_i) \prod \limits _{j=1} ^P P(x_j|c_i)}{P(x_1,..,x_P)}$$

This naive formulation of the theorem is more manageable in terms of estimating the parameters of the distributions involved, in particular those of the likelihood.
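One way to see why the naive formulation is more manageable is to count the free parameters of the likelihood. The sketch below assumes, purely for illustration, that every predictor is binary; the function names are hypothetical.

```python
def full_joint_params(num_predictors: int, num_classes: int) -> int:
    """Free parameters of an unrestricted P(x_1..x_P | c) over binary predictors."""
    return num_classes * (2 ** num_predictors - 1)


def naive_params(num_predictors: int, num_classes: int) -> int:
    """Free parameters when each binary predictor gets its own P(x_j | c)."""
    return num_classes * num_predictors


# Compare the two counts for 3 classes and a growing number of predictors
for p in (4, 10, 20):
    print(p, full_joint_params(p, 3), naive_params(p, 3))
```

Under this assumption the unrestricted likelihood grows exponentially with the number of predictors, while the naive one grows linearly.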

## Training a Naive Bayes Classifier

Training a Naive Bayes classifier implies deriving the prior and likelihood distributions from training data based on the naive formulation of Bayes' Theorem.

The prior $P(c_i)$ is derived using the following expression:

$$ P(c_i)= \frac{{\#} c_i}{N}$$

where ${\#} c_i$ is the number of training instances labeled with class $c_i$ and $N$ is the total number of training instances.

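The likelihood $P (X | Y=c_i) = \prod _{j=1} ^P P(x_j|c_i) = P(x_1|c_i)P(x_2|c_i)... P(x_P|c_i)$ is derived by multiplying the individual conditional distributions of the predictors $x_j$, each of which is estimated as shown below:

$$ P(x_j|c_i) = \frac{{\#} (x_j, c_i)}{{\#} c_i}$$

where ${\#} (x_j, c_i)$ is the number of training instances labeled with class $c_i$ in which predictor $j$ takes the value $x_j$.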
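The count-based estimates above can be sketched in a few lines of plain Python. The tiny training set, its predictor names (`outlook`, `windy`), and its class labels (`play`, `stay`) are hypothetical and serve only to illustrate the counting.

```python
from collections import Counter, defaultdict

# Hypothetical training set: each instance is (predictor values, class label)
training_data = [
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "stay"),
    ({"outlook": "rainy", "windy": "yes"}, "stay"),
    ({"outlook": "rainy", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "no"}, "play"),
]

# Prior: P(c_i) = #c_i / N
class_counts = Counter(label for _, label in training_data)
n_instances = len(training_data)
priors = {c: count / n_instances for c, count in class_counts.items()}

# Count how often each predictor value occurs within each class
value_counts = defaultdict(Counter)  # keyed by (predictor name, class label)
for features, label in training_data:
    for name, value in features.items():
        value_counts[(name, label)][value] += 1

# Likelihood: P(x_j | c_i) = #(x_j, c_i) / #c_i, one table per (predictor, class)
likelihoods = {
    key: {value: count / class_counts[key[1]] for value, count in counter.items()}
    for key, counter in value_counts.items()
}

print(priors)
print(likelihoods[("outlook", "play")])
```

Values never observed with a class are simply absent from these tables, which amounts to a zero probability; in practice a smoothing scheme such as add-one (Laplace) smoothing is commonly applied to avoid that.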

Once the prior and likelihood distributions are derived, to predict the most likely class for a new instance $X=(x_1, ..., x_P)$ we apply the Naive Bayes formula:

$$class (X) = \arg\max_{c_i,\, i \in (1..C)} P(c_i|X) = \arg\max_{c_i,\, i \in (1..C)} P (Y=c_i| (x_1,..,x_P)) = \arg\max_{c_i,\, i \in (1..C)} \frac{P(Y=c_i) P(x_1|c_i)P(x_2|c_i)... P(x_P|c_i)}{P(x_1,..,x_P)} $$

Since the denominator does not depend on $c_i$, the variable over which argmax is taken, the most likely class can be obtained simply by using this formula:

$$class (X) = \arg\max_{c_i,\, i \in (1..C)} P(c_i|X) = \arg\max_{c_i,\, i \in (1..C)} P(Y=c_i) P(x_1|c_i) P(x_2|c_i) ... P(x_P|c_i)$$

That is, the most likely class is the class corresponding to the highest posterior probability, estimated based on the above naive formulation of Bayes' Theorem.
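The prediction step can be sketched as follows. The prior and likelihood tables are hypothetical, already-estimated values, and the helper `predict` is an illustrative name. Summing logarithms instead of multiplying probabilities is a common variation that avoids numerical underflow with many predictors; it leaves the argmax unchanged because the logarithm is monotonic.

```python
import math

# Hypothetical, already-estimated distributions (illustrative values only)
priors = {"play": 0.6, "stay": 0.4}
likelihoods = {
    "play": {"outlook": {"sunny": 2 / 3, "rainy": 1 / 3}, "windy": {"no": 0.9, "yes": 0.1}},
    "stay": {"outlook": {"sunny": 0.5, "rainy": 0.5}, "windy": {"no": 0.2, "yes": 0.8}},
}


def predict(instance):
    """Return the class maximizing P(c) * prod_j P(x_j | c), computed in log space."""
    scores = {}
    for c, prior in priors.items():
        log_score = math.log(prior)
        for name, value in instance.items():
            log_score += math.log(likelihoods[c][name][value])
        scores[c] = log_score
    return max(scores, key=scores.get), scores


label, scores = predict({"outlook": "sunny", "windy": "yes"})
print(label)  # stay
```

For this instance the unnormalized scores are $0.6 \cdot \frac{2}{3} \cdot 0.1 = 0.04$ for `play` and $0.4 \cdot 0.5 \cdot 0.8 = 0.16$ for `stay`, so `stay` is predicted without ever computing the denominator.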

# Performance Evaluation for Classification Methods including Naive Bayes

The typical performance metrics for classifiers are accuracy, precision, and recall. These are typically derived by comparing the predicted output to the gold or actual output/categories in the expert-labeled dataset.

For a binary classification case, we denote category 1 as the positive category and category 0 as the negative category. Using this terminology, when comparing the predicted categories to the actual categories we may end up with the following cases:
* True Positives (TP): instances predicted as belonging to the positive category and which in fact do belong to the positive category
* True Negatives (TN): instances predicted as belonging to the negative category and which in fact do belong to the negative category
* False Positives (FP): instances predicted as belonging to the positive category and which in fact do belong to the negative category
* False Negatives (FN): instances predicted as belonging to the negative category and which in fact do belong to the positive category

From these categories, we define the following metrics:

$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$

$Precision = \frac{TP}{TP + FP}$

$Recall = \frac{TP}{TP + FN}$
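As an illustration of these definitions, the sketch below counts the four cases for a hypothetical set of actual and predicted labels and then computes the three metrics. The label lists are made up for the example.

```python
# Hypothetical actual and predicted labels for a binary task (1 = positive, 0 = negative)
actual = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
predicted = [1, 0, 1, 0, 1, 0, 0, 1, 0, 0]

# Count the four possible outcomes
tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(accuracy, precision, recall)  # 0.8 0.75 0.75
```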

Classification methods that have a high accuracy are preferred in general, although in some cases maximizing precision or recall may be preferred. For instance, a high recall is highly recommended when making a medical diagnosis since it is preferable to err on the side of mis-diagnosing someone as having cancer as opposed to missing someone who indeed has cancer, i.e., the method should try not to miss anyone who may indeed have cancer.

In general, there is a trade-off between precision and recall: increasing one tends to lower the other. Total recall (100% recall) is achievable by always predicting the positive class, i.e., labeling all instances as positive, in which case precision drops to the proportion of instances that are actually positive, which is typically low.
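The extreme case of always predicting the positive class can be checked with a short sketch. The labels below are hypothetical, with 2 positives out of 10 instances.

```python
# Hypothetical actual labels: 2 positives out of 10 instances
actual = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
# A trivial classifier that labels every instance as positive
predicted = [1] * len(actual)

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)

recall = tp / (tp + fn)
precision = tp / (tp + fp)
print(recall, precision)  # 1.0 0.2
```

Recall is perfect because no positive instance is missed, while precision equals the share of positives in the data.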


# DataWhys Example - Iris Classifier

The purpose of this notebook is to demonstrate `Blockly` integration using `scikit-learn` and the `iris` dataset.

## Load data

In [19]:
import seaborn as sns


In [20]:
iris = sns.load_dataset('iris')


## Display data

### Tabular

In [21]:
iris.head(100)


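| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| ... | ... | ... | ... | ... | ... |
| 95 | 5.7 | 3.0 | 4.2 | 1.2 | versicolor |
| 96 | 5.7 | 2.9 | 4.2 | 1.3 | versicolor |
| 97 | 6.2 | 2.9 | 4.3 | 1.3 | versicolor |
| 98 | 5.1 | 2.5 | 3.0 | 1.1 | versicolor |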


### Plot

In [None]:
# import sys as sys
# import matplotlib.pyplot as plt


In [23]:
# Use seaborn's "ticks" theme for all plots
sns.set(style="ticks", color_codes=True)

# Pairwise scatter plots of the four numeric features
g = sns.pairplot(iris)

# Alternative: color the points by class
# g = sns.pairplot(iris, hue="species", height=1.5)

## Model

### Prepare data

In [27]:
# Separate the predictors (features) from the outcome (class label)
X_iris = iris.drop("species", axis=1)
y_iris = iris["species"]

## Fitting the Model

In [28]:
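import sklearn.model_selection as model_selection

# Hold out 25% of the instances (the default) for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = model_selection.train_test_split(X_iris, y_iris, random_state=1)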
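The sketch below shows what the default split produces in terms of sizes. As an assumption made to keep the example self-contained, it uses scikit-learn's bundled copy of iris (arrays with integer labels) rather than the seaborn DataFrame loaded above. The `stratify=y` call is an optional variation, not something this notebook uses, that keeps class proportions similar in the training and test parts.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled iris data stands in for the seaborn DataFrame
X, y = load_iris(return_X_y=True)

# Default split: 25% of the 150 instances are held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(len(X_train), len(X_test))  # 112 38

# Optional variation: stratify on the labels to balance the classes in the test part
_, _, _, y_test_strat = train_test_split(X, y, random_state=1, stratify=y)
print(Counter(y_test_strat))
```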



In [34]:
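import sklearn.naive_bayes as naive_bayes

# Gaussian Naive Bayes models each continuous predictor with a per-class normal distribution
model = naive_bayes.GaussianNB()
model.fit(X_train, y_train)

# Predict the class of each held-out instance
y_model = model.predict(X_test)

y_model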

array(['setosa', 'versicolor', 'versicolor', 'setosa', 'virginica',
       'versicolor', 'virginica', 'setosa', 'setosa', 'virginica',
       'versicolor', 'setosa', 'virginica', 'versicolor', 'versicolor',
       'setosa', 'versicolor', 'versicolor', 'setosa', 'setosa',
       'versicolor', 'versicolor', 'virginica', 'setosa', 'virginica',
       'versicolor', 'setosa', 'setosa', 'versicolor', 'virginica',
       'versicolor', 'virginica', 'versicolor', 'virginica', 'virginica',
       'setosa', 'versicolor', 'setosa'], dtype='<U10')
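Because the iris predictors are continuous, `GaussianNB` does not use the count-based likelihood described earlier; instead it models each $P(x_j|c_i)$ as a normal distribution with a per-class mean and variance. The sketch below recomputes the predictions by hand under that assumption and compares them with scikit-learn's. It uses scikit-learn's bundled iris arrays (integer labels) instead of the seaborn DataFrame, and the helper name `log_posterior_scores` is hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Assumption: scikit-learn's bundled iris data stands in for the seaborn DataFrame
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

classes = np.unique(y_train)
# Per-class prior, and per-class mean and variance of every predictor
priors = np.array([(y_train == c).mean() for c in classes])
means = np.array([X_train[y_train == c].mean(axis=0) for c in classes])
variances = np.array([X_train[y_train == c].var(axis=0) for c in classes])
# Small stabilizing term, mirroring GaussianNB's default var_smoothing of 1e-9
variances += 1e-9 * X_train.var(axis=0).max()


def log_posterior_scores(X):
    """log P(c_i) + sum_j log N(x_j; mean_ij, var_ij), one column per class."""
    scores = []
    for prior, mean, var in zip(priors, means, variances):
        log_lik = -0.5 * (np.log(2 * np.pi * var) + (X - mean) ** 2 / var).sum(axis=1)
        scores.append(np.log(prior) + log_lik)
    return np.column_stack(scores)


# argmax over the unnormalized log posteriors gives the predicted class
manual_pred = classes[log_posterior_scores(X_test).argmax(axis=1)]
sklearn_pred = GaussianNB().fit(X_train, y_train).predict(X_test)
print((manual_pred == sklearn_pred).all())
```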

## Model Evaluation

In [30]:
import sklearn.metrics as metrics

# Fraction of test instances whose predicted class matches the actual class
print(metrics.accuracy_score(y_test, y_model))


0.9736842105263158
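Accuracy alone does not show which classes get confused with each other. As a sketch, the confusion matrix and the per-class precision and recall defined earlier can be obtained from `sklearn.metrics`. The example assumes scikit-learn's bundled iris arrays (integer labels) rather than the seaborn DataFrame, so its numbers are not guaranteed to match the cell above.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Assumption: scikit-learn's bundled iris data stands in for the seaborn DataFrame
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
y_pred = GaussianNB().fit(X_train, y_train).predict(X_test)

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)

# average=None returns one value per class instead of a single aggregate
per_class_precision = precision_score(y_test, y_pred, average=None)
per_class_recall = recall_score(y_test, y_pred, average=None)
print(per_class_precision)
print(per_class_recall)
```

The diagonal of the matrix holds the correctly classified instances, so its sum divided by the total number of test instances gives the accuracy.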


## Make Predictions

In [36]:
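# Column order: sepal_length, sepal_width, petal_length, petal_width.
# Note: the model was fit on raw measurements, so a new instance should be on the same scale;
# the values below look standardized (some are negative), so treat this prediction with caution.
predicted_label = model.predict([[-1.62 , 1.35, -1.73, -1.45]])
print(predicted_label)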

['virginica']
