In [1]:
# general imports usually needed
import numpy as np
import matplotlib as mpl
import matplotlib.animation
import matplotlib.pyplot as plt
import seaborn as sbn
import pandas as pd

In [2]:
# libraries more specific to this lecture notebook
import os.path
import sys
sys.path.append('../../src')
from ml_python_class.config import DATA_DIR

from sklearn import preprocessing
from sklearn.naive_bayes import GaussianNB
from sklearn import datasets
from sklearn.model_selection import train_test_split

In [3]:
# notebook wide settings to make plots more readable and visually better to understand
np.set_printoptions(suppress=True)

#%matplotlib widget
#%matplotlib inline

plt.rc('axes', labelsize=14)
plt.rc('xtick', labelsize=12)
plt.rc('ytick', labelsize=12)
plt.rc('figure', titlesize=18)
plt.rc('legend', fontsize=14)
plt.rcParams['figure.figsize'] = (12.0, 8.0) # default figure size if not specified in plot
plt.style.use('seaborn-darkgrid')


# Naive Bayes Classification

This notebook is based on the following tutorials:

- https://www.datacamp.com/community/tutorials/naive-bayes-scikit-learn

Suppose you are a product manager who wants to classify customer reviews as positive or negative.
Or, as a loan manager, you want to identify which loan applicants are safe and which are risky. As a healthcare analyst,
you want to predict which patients are likely to develop diabetes. All of these examples are the same kind of
problem: a classification task that assigns a label based on reviews or other customer features.

Naive Bayes is one of the simplest and fastest classification algorithms, and it scales well to large amounts of
data. Naive Bayes classifiers have been successfully used in many applications such as spam filtering, text classification,
sentiment analysis, and recommender systems. It uses Bayes' theorem to predict the class of an unseen example.

In this lecture notebook, you are going to learn about all of the following:


- Classification Workflow
- What is a Naive Bayes classifier?
- How does a Naive Bayes classifier work?
- Classifier building in Scikit-learn
- Zero Probability Problem
- Its advantages and disadvantages


## Classification Workflow

Whenever you perform classification, the first step is to understand the problem and identify potential features and the target label.
Features are the characteristics or attributes that influence the label. For example, when deciding whether to grant a loan,
bank managers look at the customer’s occupation, income, age, location, previous loan history, transaction history, credit score, etc.
These characteristics are the features that help the model classify customers.

Classification has two phases: a learning phase and an evaluation phase. In the learning phase, the classifier trains its model
on a given dataset. In the evaluation phase, the classifier's performance is tested on data it has not seen. Performance is evaluated
using metrics such as accuracy, error, precision, and recall.
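
As a minimal sketch of how these evaluation metrics are computed, the following uses scikit-learn's metric functions on a small set of
hypothetical labels. The values in `y_true` and `y_pred` are made up for illustration and do not come from any dataset in this notebook.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical true and predicted labels (1 = positive class), for illustration only
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)    # fraction of all predictions that are correct
precision = precision_score(y_true, y_pred)  # of the predicted positives, how many are truly positive
recall = recall_score(y_true, y_pred)        # of the truly positive examples, how many were found
error = 1 - accuracy                         # fraction of predictions that are wrong

print(accuracy, precision, recall, error)
```

With one false positive and one false negative among eight examples, accuracy, precision, and recall all come out to 0.75 here.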

## What is a Naive Bayes Classifier?

Naive Bayes is a statistical classification technique based on Bayes' theorem. It is one of the simplest supervised learning algorithms.
A Naive Bayes classifier is fast and, in many applications, reasonably accurate and reliable, and it retains its speed on large datasets.

The Naive Bayes classifier assumes that, within a class, the effect of a particular feature is independent of the other features. For example,
whether a loan applicant is desirable depends on their income, previous loan and transaction history, age, and location. Even
if these features are interdependent, a naive Bayes classifier still treats each of them independently.
This assumption, called class conditional independence, simplifies computation, and it is the reason the method is called naive.
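
Written out as a formula, the assumption says that the likelihood of all features given a class factorizes into a product of per-feature terms.
The symbols $x_1, \dots, x_n$ for the individual feature values and $h$ for the class (hypothesis) are notation assumed here for illustration:

\begin{equation}
P(x_1, x_2, \dots, x_n | h) = \prod_{i=1}^{n} P(x_i | h)
\end{equation}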

The standard specification of the conditional probability at the heart of Bayesian classification is given as:

\begin{equation}
P(h|D) = \frac{P(D|h) P(h)}{P(D)}
\end{equation}

Where

- $P(h)$: the probability of hypothesis $h$ being true (regardless of the evidence or data $D$).  This is known as the prior probability of $h$.
- $P(D)$: the probability of the data (regardless of the hypothesis).  This is known as the evidence, or marginal probability of the data.
- $P(h|D)$: the probability of hypothesis $h$ given the data $D$.  This is known as the posterior probability.
- $P(D|h)$: the probability of data $D$ given that the hypothesis $h$ was true.  This is known as the likelihood.
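
The formula translates directly into a few lines of Python. This is only a sketch: the function name `bayes_posterior` and the
input numbers are hypothetical and chosen purely to show how the three quantities combine.

```python
def bayes_posterior(likelihood, prior, evidence):
    """Return P(h|D) = P(D|h) * P(h) / P(D)."""
    if evidence <= 0:
        raise ValueError("P(D) must be positive")
    return likelihood * prior / evidence

# Hypothetical numbers, for illustration only
p_h_given_d = bayes_posterior(likelihood=0.8, prior=0.1, evidence=0.2)
print(round(p_h_given_d, 4))
```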

## How Naive Bayes Classifier Works

Let’s understand how Naive Bayes works through an example. We will use an example of weather
conditions and playing sports. We want to calculate the probability of playing sports given the weather
conditions, and then use it to classify whether the players will play or not.


### First Approach (In case of a single feature)

Naive Bayes classifier calculates the probability of an event in the following steps:

1. Calculate the prior probability for each class label
2. Find the likelihood of each attribute value for each class
3. Put these values into Bayes' formula and calculate the posterior probabilities
4. See which class has the higher posterior probability; the input is assigned to that class

To simplify the calculation of the prior and posterior probabilities, we can use two kinds of tables: frequency and likelihood tables.
The frequency table contains the number of occurrences of each label for every feature value. There are two likelihood tables.
Likelihood Table 1 shows the prior probabilities of the labels, and
Likelihood Table 2 shows the likelihood of each weather condition given the label.

![naive Bayes figure 1](../../figures/naive-bayes-1.png)
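
Tables like these can be built with pandas. The sketch below uses the same 14 weather and play observations that are defined later in this
notebook. The variable names `df`, `freq`, `prior`, and `likelihood` are introduced here only for illustration.

```python
import pandas as pd

# The 14 observations used later in this notebook
weather = ['Sunny', 'Sunny', 'Overcast', 'Rainy', 'Rainy', 'Rainy', 'Overcast', 'Sunny', 'Sunny',
           'Rainy', 'Sunny', 'Overcast', 'Overcast', 'Rainy']
play = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']

df = pd.DataFrame({"weather": weather, "play": play})

# Frequency table: how often each weather value occurs with each label
freq = pd.crosstab(df["weather"], df["play"])

# Prior probabilities of the labels, P(Yes) and P(No)
prior = df["play"].value_counts(normalize=True)

# Likelihoods P(weather | play): each column is normalized to sum to 1
likelihood = pd.crosstab(df["weather"], df["play"], normalize="columns")

print(freq)
print(prior.round(2))
print(likelihood.round(2))
```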

Now suppose you want to calculate the probability of playing when the weather is overcast.

Probability of playing:

\begin{equation}
P(Yes | Overcast) = \frac{P(Overcast | Yes) P(Yes)}{P(Overcast)}
\end{equation}

From the frequency table we get the marginal probability of overcast weather and the prior probability of playing:

\begin{equation}
P(Overcast) = \frac{4}{14} = 0.29 \\
P(Yes) = \frac{9}{14} = 0.64
\end{equation}

The likelihood is (from Likelihood Table 2):

\begin{equation}
P(Overcast | Yes) = \frac{4}{9} = 0.44
\end{equation}

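And finally we can calculate what we really want, the probability of playing when the weather is overcast. The result falls slightly
short of 1 only because the inputs were rounded to two decimals; with exact fractions, $\frac{4}{9} \times \frac{9}{14} / \frac{4}{14} = 1$.

\begin{equation}
P(Yes | Overcast) = \frac{0.44 \times 0.64}{0.29} \approx 0.97
\end{equation}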

We can also calculate the probability of not playing given that the sky is overcast.  Though
here, because we always made the decision to play (given this data) on overcast days, the
likelihood $P(Overcast | No)$ is 0, and thus the posterior probability of not playing also ends up being 0:

Probability of not playing:

\begin{equation}
P(Overcast) = \frac{4}{14} = 0.29 \\
P(No) = \frac{5}{14} = 0.36
\end{equation}

\begin{equation}
P(Overcast | No) = \frac{0}{5} = 0
\end{equation}

\begin{equation}
P(No | Overcast) = \frac{0 \times 0.36}{0.29} = 0
\end{equation}


In conclusion, given the information that the weather is overcast, we calculate the conditional probability of a Yes or No decision to
play as shown. Since the conditional probability of a Yes decision is higher, we determine that the most likely decision for
a new overcast day is Yes, we will play.
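
The single-feature calculation can be reproduced without rounding by counting in the raw data and using exact fractions. This is a sketch
based on the 14 observations defined later in this notebook; the helper name `posterior_from_counts` is hypothetical.

```python
from fractions import Fraction

# The 14 observations used later in this notebook
weather = ['Sunny', 'Sunny', 'Overcast', 'Rainy', 'Rainy', 'Rainy', 'Overcast', 'Sunny', 'Sunny',
           'Rainy', 'Sunny', 'Overcast', 'Overcast', 'Rainy']
play = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']

def posterior_from_counts(label, condition):
    """P(label | condition) via Bayes' theorem, using exact fractions."""
    n = len(play)
    n_label = play.count(label)
    n_both = sum(1 for w, p in zip(weather, play) if w == condition and p == label)
    prior = Fraction(n_label, n)                      # P(label)
    likelihood = Fraction(n_both, n_label)            # P(condition | label)
    evidence = Fraction(weather.count(condition), n)  # P(condition)
    return likelihood * prior / evidence

p_yes = posterior_from_counts("Yes", "Overcast")
p_no = posterior_from_counts("No", "Overcast")
print(p_yes, p_no)
```

Because every overcast day in this data is a Yes day, the exact posterior for Yes is 1 and for No is 0. The same helper can be applied to
other conditions, for example `posterior_from_counts("Yes", "Sunny")`.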

### Second Approach (In case of multiple features)


Now suppose we want to calculate the probability of playing when the weather is overcast and the temperature is mild:

![naive Bayes figure 2](../../figures/naive-bayes-2.png)

Probability of Playing:

\begin{equation}
P(Yes | Weather=Overcast, Temp=Mild) = \frac{P(Weather=Overcast, Temp=Mild | Yes) P(Yes)}{P(Weather=Overcast, Temp=Mild)}
\end{equation}

By the class conditional independence assumption, we can multiply the per-feature conditional probabilities to get

\begin{equation}
P(Weather=Overcast, Temp=Mild | Yes) = P(Overcast | Yes) P(Mild | Yes)
\end{equation}

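From the table above, we can read off all of the values we need to determine the probability of playing and of not playing when the weather is
overcast and the temperature is mild. Note that in this small dataset the joint probabilities below are counted directly (exactly one of the 14 days
is both overcast and mild, and it is a Yes day) rather than approximated by the product above.

\begin{equation}
P(Weather=Overcast, Temp=Mild) = \frac{1}{14} = 0.07 \\
P(Yes) = \frac{9}{14} = 0.64 \\
P(Weather=Overcast, Temp=Mild | Yes) = \frac{1}{9} = 0.1111 \\
P(Yes | Weather=Overcast, Temp=Mild) = \frac{0.1111 \times 0.64 }{0.07} \approx 1.0
\end{equation}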

As before, we do not show the full calculation again, but since we always play when the weather is overcast, the probability of
not playing given that the weather is overcast and the temperature is mild is 0 for this data.
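
To see the naive product rule at work, the sketch below scores each label as prior times the product of the per-feature likelihoods and then
normalizes the scores so they sum to 1. It uses the 14 observations defined later in this notebook and exact fractions; the function name
`naive_posterior` is hypothetical, and the sketch assumes at least one label has a nonzero score.

```python
from fractions import Fraction

# The 14 observations used later in this notebook
weather = ['Sunny', 'Sunny', 'Overcast', 'Rainy', 'Rainy', 'Rainy', 'Overcast', 'Sunny', 'Sunny',
           'Rainy', 'Sunny', 'Overcast', 'Overcast', 'Rainy']
temp = ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool', 'Mild', 'Cool', 'Mild', 'Mild', 'Mild', 'Hot', 'Mild']
play = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']

def naive_posterior(w, t):
    """Return {label: P(label | weather=w, temp=t)} under the naive independence assumption."""
    scores = {}
    for label in ("Yes", "No"):
        idx = [i for i, p in enumerate(play) if p == label]
        prior = Fraction(len(idx), len(play))                        # P(label)
        p_w = Fraction(sum(weather[i] == w for i in idx), len(idx))  # P(w | label)
        p_t = Fraction(sum(temp[i] == t for i in idx), len(idx))     # P(t | label)
        scores[label] = prior * p_w * p_t
    total = sum(scores.values())  # normalizing constant, plays the role of the evidence
    return {label: s / total for label, s in scores.items()}

print(naive_posterior("Overcast", "Mild"))
print(naive_posterior("Sunny", "Cool"))
```

For an overcast and mild day the No score is zero, so the normalized probability of Yes is 1. For a sunny and cool day both labels have nonzero
scores, and the decision is much closer.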


## Classifier Building in Scikit-Learn

In this example, we will continue to use the dummy dataset with three columns: weather, temperature, and play.
The first two (weather, temperature) are the features and the third (play) is the label.

In [4]:
# Assigning features and label variables
weather=['Sunny','Sunny','Overcast','Rainy','Rainy','Rainy','Overcast','Sunny','Sunny',
'Rainy','Sunny','Overcast','Overcast','Rainy']
temp=['Hot','Hot','Hot','Mild','Cool','Cool','Cool','Mild','Cool','Mild','Mild','Mild','Hot','Mild']

play=['No','No','Yes','Yes','Yes','No','Yes','No','Yes','Yes','Yes','Yes','Yes','No']

### Encoding Features

First, we need to convert these string labels into numbers, for example 'Overcast', 'Rainy', 'Sunny' as 0, 1, 2.
This is known as label encoding. Scikit-learn provides the LabelEncoder class for encoding labels with a value between
0 and one less than the number of discrete classes.

In [5]:
# Import LabelEncoder
from sklearn import preprocessing

#creating labelEncoder
le = preprocessing.LabelEncoder()
# Converting string labels into numbers.
weather_encoded=le.fit_transform(weather)
print(weather_encoded)

[2 2 0 1 1 1 0 2 2 1 2 0 0 1]
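
To check which number was assigned to which string, a fitted `LabelEncoder` exposes the sorted class names in `classes_` and can map codes
back with `inverse_transform`. The sketch below uses a dedicated encoder per column (the names `weather_le` and `codes` are introduced here
for illustration), since refitting one shared encoder on another column discards the earlier mapping.

```python
from sklearn import preprocessing

weather = ['Sunny', 'Sunny', 'Overcast', 'Rainy', 'Rainy', 'Rainy', 'Overcast', 'Sunny', 'Sunny',
           'Rainy', 'Sunny', 'Overcast', 'Overcast', 'Rainy']

# A dedicated encoder for the weather column
weather_le = preprocessing.LabelEncoder()
codes = weather_le.fit_transform(weather)

# classes_ is sorted; the position of each name is its integer code
print(weather_le.classes_)

# Map integer codes back to the original strings
print(weather_le.inverse_transform([0, 1, 2]))
```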


Similarly, we can encode the other columns as well using this simple encoding:

In [6]:
temp_encoded=le.fit_transform(temp)
label=le.fit_transform(play)
print("Temp:",temp_encoded)
print("Play:",label)

Temp: [1 1 1 2 0 0 0 2 0 2 2 2 1 2]
Play: [0 0 1 1 1 0 1 0 1 1 1 1 1 0]


Now we combine our features (weather and temp) into a single two-column NumPy array, with one row per observation:

In [7]:
# Combining weather and temp into a single 2D array of shape (n_samples, 2)
features = np.zeros( (len(label), 2) )
features[:,0] = weather_encoded
features[:,1] = temp_encoded
print(features.shape)

(14, 2)
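
An equivalent and slightly more compact way to build the same array is `np.column_stack`. This is a sketch that starts from the encoded
values printed above; the names `weather_codes`, `temp_codes`, and `features_alt` are introduced here for illustration. One difference to be
aware of is that this version keeps the integer dtype, whereas filling an `np.zeros` array produces floats.

```python
import numpy as np

# Encoded values as printed by the cells above
weather_codes = np.array([2, 2, 0, 1, 1, 1, 0, 2, 2, 1, 2, 0, 0, 1])
temp_codes = np.array([1, 1, 1, 2, 0, 0, 0, 2, 0, 2, 2, 2, 1, 2])

# Stack the two 1D arrays as columns of one 2D array
features_alt = np.column_stack((weather_codes, temp_codes))
print(features_alt.shape, features_alt.dtype)
```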


## Generating Model

Now we will generate a model using a naive Bayes classifier from the sklearn library.

In [8]:
#Import Gaussian Naive Bayes model
from sklearn.naive_bayes import GaussianNB

#Create a Gaussian Classifier
model = GaussianNB()

# Train the model using the training sets
model.fit(features,label)

#Predict Output
predicted= model.predict([[0,2]]) # 0:Overcast, 2:Mild
print("Predicted Value:", predicted)

Predicted Value: [1]


In [9]:
model.predict_proba([[0,2]])

array([[0.00770751, 0.99229249]])
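
`GaussianNB` treats the integer codes as continuous values with a normal distribution per class. For categorical features like these,
scikit-learn also provides `CategoricalNB`, which estimates per-category probabilities and applies additive (Laplace) smoothing through its
`alpha` parameter. The sketch below is an alternative offered for comparison, not the model used in this notebook; it shows how smoothing
avoids the zero probability seen earlier for not playing on an overcast day.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Encoded values as printed by the cells above (0: Overcast, 1: Rainy, 2: Sunny; 0: Cool, 1: Hot, 2: Mild)
weather_codes = [2, 2, 0, 1, 1, 1, 0, 2, 2, 1, 2, 0, 0, 1]
temp_codes = [1, 1, 1, 2, 0, 0, 0, 2, 0, 2, 2, 2, 1, 2]
play_codes = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0]

X = np.column_stack((weather_codes, temp_codes))
y = np.array(play_codes)

# alpha=1.0 adds one pseudo-count to every category, so no likelihood is exactly zero
cat_model = CategoricalNB(alpha=1.0)
cat_model.fit(X, y)

cat_pred = cat_model.predict([[0, 2]])         # 0: Overcast, 2: Mild
cat_proba = cat_model.predict_proba([[0, 2]])  # columns follow cat_model.classes_
print(cat_pred, cat_proba.round(4))
```

With smoothing, the probability of No for an overcast and mild day is small but no longer exactly zero.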

## Naive Bayes with Multiple Labels

Naive Bayes can also be used for multiclass classification tasks, where there are more than two labels.  So far you have seen Naive Bayes
classification with binary labels. Now you will learn about multiclass classification with Naive Bayes, for example classifying a news article
as being about technology, entertainment, politics, or sports.

In the model building part, we will use the wine dataset, which is a well-known multiclass classification problem.
This dataset is the result of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars.

Let's first load the required wine dataset from scikit-learn:

In [10]:
#Import scikit-learn dataset library
from sklearn import datasets

#Load dataset
wine = datasets.load_wine()

### Exploring Data

We can print the target and feature names to make sure we have the right dataset:

In [11]:
# print the names of the 13 features
print("Features: ", wine.feature_names)

# print the label type of wine(class_0, class_1, class_2)
print("Labels: ", wine.target_names)

Features:  ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
Labels:  ['class_0' 'class_1' 'class_2']


It's a good idea to always explore your data a bit, so you know what you're working with. Here, you can see the first
five rows of the dataset are printed, as well as the target variable for the whole dataset.

In [12]:
# print data(feature)shape
wine.data.shape


(178, 13)

In [13]:
# print the wine data features (top 5 records)
wine.data[0:5]

array([[  14.23,    1.71,    2.43,   15.6 ,  127.  ,    2.8 ,    3.06,
           0.28,    2.29,    5.64,    1.04,    3.92, 1065.  ],
       [  13.2 ,    1.78,    2.14,   11.2 ,  100.  ,    2.65,    2.76,
           0.26,    1.28,    4.38,    1.05,    3.4 , 1050.  ],
       [  13.16,    2.36,    2.67,   18.6 ,  101.  ,    2.8 ,    3.24,
           0.3 ,    2.81,    5.68,    1.03,    3.17, 1185.  ],
       [  14.37,    1.95,    2.5 ,   16.8 ,  113.  ,    3.85,    3.49,
           0.24,    2.18,    7.8 ,    0.86,    3.45, 1480.  ],
       [  13.24,    2.59,    2.87,   21.  ,  118.  ,    2.8 ,    2.69,
           0.39,    1.82,    4.32,    1.04,    2.93,  735.  ]])

In [14]:
wine.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2])
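
For a more readable overview, the arrays can be wrapped in a pandas DataFrame. This is a sketch; the name `wine_df` is introduced here for
illustration. It prints summary statistics per feature and the number of samples in each class.

```python
import pandas as pd
from sklearn import datasets

wine = datasets.load_wine()

# One column per feature, plus the class label as a final column
wine_df = pd.DataFrame(wine.data, columns=wine.feature_names)
wine_df["target"] = wine.target

# Summary statistics per column (features differ widely in scale)
print(wine_df.describe().T[["mean", "std", "min", "max"]])

# Number of samples per class
class_counts = wine_df["target"].value_counts().sort_index()
print(class_counts)
```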

### Splitting Data

Let's split the data into train and test data sets as usual, so we can evaluate our classifier's
performance on data it was not trained on.

In [15]:
# Import train_test_split function
from sklearn.model_selection import train_test_split

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.3,random_state=109) # 70% training and 30% test
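
Since the three classes are not equally frequent, it can be worth checking how they are distributed across the two sets. The sketch below is a
variation, not the split used in the rest of this notebook: it passes `stratify=wine.target` so that both sets keep roughly the original class
proportions. The `Xs_`/`ys_` variable names are chosen here to avoid overwriting the split above.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

wine = datasets.load_wine()

# Same 70/30 split, but stratified on the class label
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=109, stratify=wine.target)

print(Xs_train.shape, Xs_test.shape)

# Samples per class in each set
print(np.bincount(ys_train), np.bincount(ys_test))
```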


### Model Generation

After splitting, we create a naive Bayes model on the training set and perform prediction on test set features.

In [16]:
#Import Gaussian Naive Bayes model
from sklearn.naive_bayes import GaussianNB

#Create a Gaussian Classifier
gnb = GaussianNB()

#Train the model using the training sets
gnb.fit(X_train, y_train)

#Predict the response for test dataset
y_pred = gnb.predict(X_test)
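
After fitting, a `GaussianNB` model stores what it has learned in a few attributes: `class_prior_` holds the prior probability of each class
and `theta_` holds the per-class mean of every feature. The sketch below repeats the split and fit so that it runs on its own, then inspects
these attributes and the predicted class probabilities for the first three test samples.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

wine = datasets.load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=109)

gnb = GaussianNB().fit(X_train, y_train)

# Prior probability of each class, estimated from the training labels
print(gnb.class_prior_.round(3))

# Per-class feature means: one row per class, one column per feature
print(gnb.theta_.shape)

# Posterior probability of each class for the first three test samples
proba = gnb.predict_proba(X_test[:3])
print(proba.round(3))
```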

### Evaluating the Model

After model generation let's check the accuracy using actual and predicted values.

In [17]:
#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics

# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.9074074074074074
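
Accuracy alone does not show which classes are being confused. As a sketch of a fuller evaluation, the confusion matrix and the
classification report from `sklearn.metrics` break the result down per class. The block repeats the split and fit so that it runs on its own.

```python
import numpy as np
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

wine = datasets.load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=109)
y_pred = GaussianNB().fit(X_train, y_train).predict(X_test)

# Rows are true classes, columns are predicted classes
cm = metrics.confusion_matrix(y_test, y_pred)
print(cm)

# Precision, recall and F1 score for each class
print(metrics.classification_report(y_test, y_pred, target_names=list(wine.target_names)))
```

The diagonal of the confusion matrix counts the correct predictions, so its sum divided by the total number of test samples equals the accuracy.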
