<img src="AV_Logo.png" style="width: 200px;height: 75px"/>

Table of Contents:
-------------
* [The Naive Bayes Classifier](#The-Naive-Bayes-Classifier)
* [Bayes' Theorem](#Bayes'-Theorem)
* [The Naive Bayes algorithm](#The-Naive-Bayes-algorithm)
* [How Naive Bayes algorithm works?](#How-Naive-Bayes-algorithm-works?)
* [Types of Naive Bayes Classifier](#Types-of-Naive-Bayes-Classifier)
* [Benefits of using Naive Bayes algorithm](#Benefits-of-using-Naive-Bayes-algorithm)
* [Implementation of Naive Bayes](#Implementation-of-Naive-Bayes)
* [Limitations of the Naive Bayes algorithm](#Limitations-of-the-Naive-Bayes-algorithm)

### The Naive Bayes Classifier

It is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For example, a fruit may be considered an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or on the existence of other features, the classifier treats each of them as contributing independently to the probability that the fruit is an apple, which is why it is known as ‘Naive’.

A Naive Bayes model is easy to build and particularly useful for very large datasets. Despite its simplicity, Naive Bayes can sometimes outperform more sophisticated classification methods.

The Naive Bayes classifier is built directly on Bayes' Theorem, so let's look at the theorem first.

### Bayes' Theorem

Bayes' Theorem describes the probability of an event based on prior knowledge of conditions that might be related to the event. If we know a conditional probability P(x|c), we can use Bayes' rule to find the reverse probability P(c|x).

Bayes theorem provides a way of calculating posterior probability P(c|x) from P(c), P(x) and P(x|c). Look at the equations below:

<img src="bayes.png" style="width: 500px;height: 200px">
<img src="bayes2.png" style="width: 250px;height: 50px">

The above is the general form of Bayes' Theorem.

### The Naive Bayes algorithm

Applied to classification, Bayes' Theorem gives the posterior probability P(c|x) of a class c given the predictors x, computed from P(c), P(x) and P(x|c). Look at the equation below:

<img src="bayes3.png" style="width: 400px;height: 200px">

Above,

* P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
* P(c) is the prior probability of class.
* P(x|c) is the likelihood, i.e. the probability of the predictor given the class.
* P(x) is the prior probability of the predictor (also called the evidence).
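
For reference, the relationship between these four quantities can also be written out in text form. The first line is the standard statement of Bayes' Theorem; the second is the factorisation that the "naive" independence assumption allows. The symbols x_1, ..., x_n for the individual predictors are notation introduced here for illustration, and the figure above may use slightly different notation.

```latex
P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)}

P(c \mid x_1, \dots, x_n) \propto P(c) \prod_{i=1}^{n} P(x_i \mid c)
```

The denominator P(x) is the same for every class, so it can be dropped when we only need to know which class has the highest posterior.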

### How Naive Bayes algorithm works?

Let’s understand it using an example. Below is a training dataset of weather conditions and the corresponding target variable ‘Play’ (whether or not a game was played). We need to classify whether players will play based on the weather condition. Let’s follow the steps below.

Step 1: Convert the dataset into a frequency table

Step 2: Create a likelihood table by computing the probabilities, for example P(Overcast) = 4/14 = 0.29 and P(Yes) = 9/14 = 0.64.

<img src="bayes4.png" style="width: 550px;height: 250px">

Step 3: Now, use the Naive Bayes equation to calculate the posterior probability for each class. The class with the highest posterior probability is the predicted outcome.
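
Steps 1 and 2 can be sketched in a few lines of plain Python. This is only an illustration: the 14 rows below are an assumed reconstruction of the dataset, chosen to match the totals quoted in this section (14 days, 9 "Yes", 4 Overcast, 5 Sunny of which 3 are "Yes"); the remaining per-row values and all variable names are hypothetical.

```python
from collections import Counter

# Assumed reconstruction of the 14-row weather dataset: only the totals quoted
# in the text come from the article; the other per-row values are illustrative.
rows = (
    [("Sunny", "Yes")] * 3 + [("Sunny", "No")] * 2
    + [("Overcast", "Yes")] * 4
    + [("Rainy", "Yes")] * 2 + [("Rainy", "No")] * 3
)

# Step 1: frequency table, i.e. counts of each (weather, play) pair.
frequency = Counter(rows)
weather_totals = Counter(weather for weather, _ in rows)
play_totals = Counter(play for _, play in rows)
n = len(rows)

# Step 2: likelihood table, i.e. marginal and conditional probabilities.
p_weather = {w: count / n for w, count in weather_totals.items()}
p_play = {p: count / n for p, count in play_totals.items()}
p_weather_given_play = {
    (w, p): frequency[(w, p)] / play_totals[p]
    for w in weather_totals
    for p in play_totals
}

print(round(p_weather["Overcast"], 2))  # 0.29
print(round(p_play["Yes"], 2))          # 0.64
```

A `Counter` returns 0 for pairs that never occur, so a combination such as ("Overcast", "No") simply gets a likelihood of 0 in this reconstruction.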

Let's try to understand using the below problem statement. 

**Problem**: Players will play if the weather is sunny. Is this statement correct?

We can solve it using the posterior probability method discussed above.

P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)

Here we have P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P(Yes) = 9/14 = 0.64

Now, P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60 (exactly 3/5 when the fractions are not rounded). Since P(No | Sunny) = 1 - 0.60 = 0.40, ‘Yes’ has the higher posterior probability, so the prediction is that players will play.
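
The same arithmetic can be checked with exact fractions. The sketch below uses only the counts quoted in this example; the `posterior` helper and the variable names are introduced here for illustration. The counts for "No" are derived from the quoted totals (14 - 9 = 5 "No" days, 5 - 3 = 2 of them Sunny).

```python
from fractions import Fraction

# Counts taken from the worked example: 14 days, 9 "Yes", 5 Sunny, 3 Sunny-and-Yes.
total, yes, sunny, sunny_and_yes = 14, 9, 5, 3
no = total - yes                      # 5 "No" days
sunny_and_no = sunny - sunny_and_yes  # 2 Sunny days without play


def posterior(likelihood, prior, evidence):
    """Bayes' rule: P(c | x) = P(x | c) * P(c) / P(x)."""
    return likelihood * prior / evidence


p_sunny = Fraction(sunny, total)
p_yes_given_sunny = posterior(Fraction(sunny_and_yes, yes), Fraction(yes, total), p_sunny)
p_no_given_sunny = posterior(Fraction(sunny_and_no, no), Fraction(no, total), p_sunny)

print(p_yes_given_sunny, float(p_yes_given_sunny))  # 3/5 0.6
print(p_no_given_sunny, float(p_no_given_sunny))    # 2/5 0.4
```

Working with fractions avoids the small rounding error that appears when 0.33, 0.64 and 0.36 are multiplied directly.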

Naive Bayes uses a similar method to predict the probability of each class from several attributes at once. This algorithm is mostly used in text classification and in problems with multiple classes.

### Types of Naive Bayes Classifier

There are mainly three types of Naive Bayes models:

* **Gaussian**: It is used when the features are continuous, and it assumes that the features follow a normal distribution within each class.

* **Multinomial**: It is used for discrete counts. Take a text classification problem as an example: instead of recording only whether a word occurs in the document (a Bernoulli trial), we go one step further and count how often the word occurs in the document. You can think of it as “number of times outcome number x_i is observed over the n trials”.

* **Bernoulli**: The Bernoulli model is useful if your feature vectors are binary (i.e. zeros and ones). One application would be text classification with a ‘bag of words’ model where the 1s & 0s represent “word occurs in the document” and “word does not occur in the document” respectively.

Choose the model that matches the feature types in your dataset.
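
As a rough illustration of how the three variants map to feature types, the sketch below fits each scikit-learn class on a tiny toy dataset. The arrays and labels are made up for illustration and carry no meaning beyond showing which kind of input each model expects.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])  # toy labels, made up for illustration

# Continuous measurements -> GaussianNB
X_continuous = np.array([[1.0, 2.1], [0.9, 1.9], [3.2, 4.0], [3.0, 4.2]])
# Word counts per document -> MultinomialNB
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])
# Word present (1) or absent (0) -> BernoulliNB
X_binary = (X_counts > 0).astype(int)

predictions = {}
for name, model, X in [
    ("Gaussian", GaussianNB(), X_continuous),
    ("Multinomial", MultinomialNB(), X_counts),
    ("Bernoulli", BernoulliNB(), X_binary),
]:
    model.fit(X, y)
    predictions[name] = model.predict(X)
    print(name, predictions[name])
```

Note that the binary matrix is derived from the count matrix by thresholding, which is the same relationship between the Bernoulli and Multinomial views of a document described above.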

### Benefits of using Naive Bayes algorithm

* It is easy and fast to predict the class of a test dataset. It also performs well in multi-class prediction.
* When the assumption of independence holds, a Naive Bayes classifier can perform better than other models such as logistic regression, and it needs less training data.
* It performs well with categorical input variables compared to numerical variables. For numerical variables, a normal distribution is assumed (bell curve, which is a strong assumption).

### Implementation of Naive Bayes

In [2]:
import pandas as pd

from sklearn.naive_bayes import GaussianNB, BernoulliNB
from sklearn.linear_model import LogisticRegression

In [3]:
data = pd.read_csv('winequality.csv')

In [4]:
data.head()

| | ID | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | W0001 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.001 | 3.0 | 0.45 | 8.8 | 2 |
| 1 | W0002 | 6.3 | 0.3 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.994 | 3.3 | NaN | 9.5 | 2 |
| 2 | W0003 | 8.1 | 0.28 | 0.4 | 6.9 | 0.05 | 30.0 | 97.0 | 0.9951 | 3.26 | NaN | 10.1 | 2 |
| 3 | W0004 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 2 |
| 4 | W0005 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 2 |


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4898 entries, 0 to 4897
Data columns (total 13 columns):
ID                      4898 non-null object
fixed acidity           4898 non-null float64
volatile acidity        4898 non-null float64
citric acid             4165 non-null float64
residual sugar          4898 non-null float64
chlorides               4898 non-null float64
free sulfur dioxide     4898 non-null float64
total sulfur dioxide    4898 non-null float64
density                 4898 non-null float64
pH                      4054 non-null float64
sulphates               4175 non-null float64
alcohol                 4898 non-null float64
quality                 4898 non-null int64
dtypes: float64(11), int64(1), object(1)
memory usage: 497.5+ KB


In [6]:
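# Relabel class 2 as 0, then fill missing values with the numeric column means
data.loc[(data.quality == 2), 'quality'] = 0
data.fillna(data.mean(numeric_only=True), inplace=True)
data.head(10)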

| | ID | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | W0001 | 7.0 | 0.270 | 0.360000 | 20.70 | 0.045 | 45.0 | 170.0 | 1.00100 | 3.000000 | 0.450000 | 8.800000 | 0 |
| 1 | W0002 | 6.3 | 0.300 | 0.340000 | 1.60 | 0.049 | 14.0 | 132.0 | 0.99400 | 3.300000 | 0.490158 | 9.500000 | 0 |
| 2 | W0003 | 8.1 | 0.280 | 0.400000 | 6.90 | 0.050 | 30.0 | 97.0 | 0.99510 | 3.260000 | 0.490158 | 10.100000 | 0 |
| 3 | W0004 | 7.2 | 0.230 | 0.320000 | 8.50 | 0.058 | 47.0 | 186.0 | 0.99560 | 3.190000 | 0.400000 | 9.900000 | 0 |
| 4 | W0005 | 7.2 | 0.230 | 0.320000 | 8.50 | 0.058 | 47.0 | 186.0 | 0.99560 | 3.190000 | 0.400000 | 9.900000 | 0 |
| 5 | W0006 | 8.1 | 0.280 | 0.400000 | 6.90 | 0.050 | 30.0 | 97.0 | 0.99510 | 3.260000 | 0.440000 | 10.100000 | 0 |
| 6 | W0007 | 6.2 | 0.320 | 0.334031 | 7.00 | 0.045 | 30.0 | 136.0 | 0.99490 | 3.188762 | 0.470000 | 9.600000 | 0 |
| 7 | W0008 | 7.0 | 0.270 | 0.360000 | 20.70 | 0.045 | 45.0 | 170.0 | 1.00100 | 3.000000 | 0.450000 | 8.800000 | 0 |
| 8 | W0009 | 6.3 | 0.300 | 0.340000 | 1.60 | 0.049 | 14.0 | 132.0 | 0.99400 | 3.300000 | 0.490158 | 9.500000 | 0 |
| 9 | W0010 | 8.1 | 0.220 | 0.430000 | 1.50 | 0.044 | 28.0 | 129.0 | 0.99380 | 3.220000 | 0.450000 | 11.000000 | 0 |


In [7]:
X = data.drop(['ID', 'quality'], axis=1)
y = data.quality

In [8]:
y.value_counts()

0    3258
1    1640
Name: quality, dtype: int64

In [9]:
nb = BernoulliNB()

In [10]:
nb.fit(X, y)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [11]:
from sklearn.metrics import accuracy_score
accuracy_score(y, nb.predict(X))

0.66619028174765216

In [12]:
logReg = LogisticRegression()
accuracy_score(y, logReg.fit(X, y).predict(X))

0.75091874234381384

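Here we see that logistic regression has a better training accuracy (about 0.75) than Bernoulli Naive Bayes (about 0.67). Both scores are computed on the same data that was used for fitting. Note also that BernoulliNB binarizes every feature at the threshold `binarize=0.0`, so these mostly positive continuous features nearly all become 1, and its accuracy ends up close to the share of the majority class (3258/4898 ≈ 0.665).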

In [13]:
gnb = GaussianNB()

In [14]:
gnb.fit(X,y)

GaussianNB(priors=None)

In [15]:
from sklearn.metrics import accuracy_score
accuracy_score(y, gnb.predict(X))

0.70641077991016743

The Gaussian Naive Bayes score (about 0.71) is better than the Bernoulli score (about 0.67), but it is still not as good as logistic regression (about 0.75). Of the two Naive Bayes variants, Gaussian Naive Bayes is better suited to this problem, since the features are continuous.
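
All of the scores above are measured on the same rows the models were fitted on, which tends to be optimistic. A minimal sketch of a held-out comparison follows. Because `winequality.csv` is not bundled here, the example uses synthetic data from `make_classification` as a stand-in; the sample size, class weights and `max_iter` value are assumptions chosen for illustration, not settings from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the wine data (11 numeric features, roughly 2:1 classes).
X, y = make_classification(
    n_samples=2000, n_features=11, n_informative=5,
    weights=[0.67, 0.33], random_state=0,
)

# Hold out 25% of the rows; stratify keeps the class ratio in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

scores = {}
for name, model in [
    ("GaussianNB", GaussianNB()),
    ("LogisticRegression", LogisticRegression(max_iter=1000)),
]:
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

print(scores)
```

With the real dataset, the same pattern would apply to the `X` and `y` built in the cells above; the numbers would differ from the training accuracies reported there.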

**Exercise**:

Q1. Apply your learnings on [Loan Prediction practice problem](https://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction-iii/)

### Limitations of the Naive Bayes algorithm

* If a categorical variable has a category in the test dataset that was not observed in the training dataset, the model assigns it a probability of 0 (zero). Because the class probabilities are products of such terms, the whole product becomes zero and the model is unable to make a sensible prediction. This is often known as “Zero Frequency”. To solve this, we can use a smoothing technique. One of the simplest smoothing techniques is called Laplace estimation.

* Another limitation of Naive Bayes is the assumption of independent predictors. In real life, it is almost impossible to get a set of predictors that are completely independent.
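
To make the “Zero Frequency” problem and its fix concrete, here is a minimal sketch of additive (Laplace) smoothing. The function name and the counts are hypothetical and chosen only for illustration. The formula adds `alpha` to every category count and `alpha * k` to the class total, where `k` is the number of categories.

```python
def smoothed_likelihoods(counts, alpha=1.0):
    """Return P(category | class) with additive (Laplace) smoothing.

    counts maps each category to how often it occurred within one class.
    """
    total = sum(counts.values())
    k = len(counts)  # number of distinct categories
    return {cat: (n + alpha) / (total + alpha * k) for cat, n in counts.items()}


# Hypothetical counts of weather values among "No" days; "Overcast" was never seen.
counts_for_no = {"Sunny": 2, "Overcast": 0, "Rainy": 3}

print(smoothed_likelihoods(counts_for_no, alpha=0))  # unsmoothed: Overcast gets 0.0
print(smoothed_likelihoods(counts_for_no))           # smoothed: every category > 0
```

The `alpha=1.0` shown in the `BernoulliNB` output earlier in this notebook is scikit-learn's setting for this kind of additive smoothing.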


That's all for today!
----------------
-------------------------------
<img src="AV_Datafest_logo.png" style="width: 200px;height: 200px"/>
[www.analyticsvidhya.com](www.analyticsvidhya.com)

DATAFEST 2017