## Naive Bayes example (binary classification)

In this example, we are going to use a dummy dataset with three columns:
> `weather`, `temperature`, `play`

The first two (`weather`, `temperature`) are features and the third (`play`) is the label.

In [6]:
# Assigning features and label variables
weather=['Sunny','Sunny','Overcast','Rainy','Rainy','Rainy','Overcast','Sunny','Sunny',
'Rainy','Sunny','Overcast','Rainy']

temp=['Hot','Hot','Hot','Mild','Cool','Cool','Cool','Mild','Cool','Mild','Mild','Hot','Mild']

play=['No','No','Yes','Yes','Yes','No','Yes','No','Yes','Yes','Yes','Yes','No']

In [7]:
import pandas as pd
df=pd.DataFrame({'weather':weather, 'temp':temp, 'play':play})
df


     weather  temp play
0      Sunny   Hot   No
1      Sunny   Hot   No
2   Overcast   Hot  Yes
3      Rainy  Mild  Yes
4      Rainy  Cool  Yes
5      Rainy  Cool   No
6   Overcast  Cool  Yes
7      Sunny  Mild   No
8      Sunny  Cool  Yes
9      Rainy  Mild  Yes
10     Sunny  Mild  Yes
11  Overcast   Hot  Yes
12     Rainy  Mild   No


**Encoding Features**: 
First, we need to convert the string values into numbers (e.g., with label encoding).

In [8]:
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
weather_encoded=le.fit_transform(weather)

print(weather_encoded)

[2 2 0 1 1 1 0 2 2 1 2 0 1]


We also encode the remaining feature (`temp`) and the label (`play`).

In [9]:
# Converting string labels into numbers
temp_encoded=le.fit_transform(temp)
label=le.fit_transform(play)

print("Temp:{}".format(temp_encoded))
print("Play:{}".format(label))

Temp:[1 1 1 2 0 0 0 2 0 2 2 1 2]
Play:[0 0 1 1 1 0 1 0 1 1 1 1 0]
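The cells above reuse a single `LabelEncoder`, so each new `fit_transform` call overwrites the previous mapping. The sketch below is one possible variation, not part of the original notebook: it keeps one encoder per column (the `encoders` and `encoded` dictionaries are names chosen here for illustration) so that every integer code can be inspected and decoded later.

```python
from sklearn.preprocessing import LabelEncoder

weather = ['Sunny', 'Sunny', 'Overcast', 'Rainy', 'Rainy', 'Rainy', 'Overcast',
           'Sunny', 'Sunny', 'Rainy', 'Sunny', 'Overcast', 'Rainy']
temp = ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool', 'Mild', 'Cool',
        'Mild', 'Mild', 'Hot', 'Mild']
play = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes',
        'Yes', 'Yes', 'No']

# Illustrative containers (assumed names): one fitted encoder per column
encoders = {}
encoded = {}
for name, values in {"weather": weather, "temp": temp, "play": play}.items():
    enc = LabelEncoder()
    encoded[name] = enc.fit_transform(values)
    encoders[name] = enc
    # classes_ is sorted; the position of each string is its integer code
    mapping = {str(label): code for code, label in enumerate(enc.classes_)}
    print(name, mapping)

# Decode an integer label back to its original string
decoded = encoders["play"].inverse_transform([1])
print(decoded)
```

Because `LabelEncoder` sorts the class names, the codes follow alphabetical order, which is why `Overcast` maps to 0 and `Mild` maps to 2 in the prediction further down.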


Now we combine both features (`weather`, `temp`) into a single variable (a DataFrame with one column per feature).

In [10]:
import pandas as pd 

features=pd.DataFrame({'weather':weather_encoded, 'temp':temp_encoded})
print(features)

    weather  temp
0         2     1
1         2     1
2         0     1
3         1     2
4         1     0
5         1     0
6         0     0
7         2     2
8         2     0
9         1     2
10        2     2
11        0     1
12        1     2


**Generating Model**: 
Train a model using the Gaussian Naive Bayes classifier.

In [11]:
from sklearn.naive_bayes import GaussianNB

#Create a Gaussian Classifier
model = GaussianNB()

# Train the model using the training sets
model.fit(features.values,label)

#Predict Output
predicted= model.predict([[0,2]]) # 0:Overcast, 2:Mild
print("Predicted Value:{}".format(predicted))

Predicted Value:[1]


Prediction: `1` is the code for `Yes`, so the players can play.
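`GaussianNB` treats the encoded integers as continuous measurements, although `weather` and `temp` are categories. As a hedged alternative, not the approach used in this notebook, scikit-learn also provides `CategoricalNB`, which estimates one probability per category. The sketch below copies the encoded values from the cells above; the names `cat_model`, `cat_pred`, and `cat_proba` are introduced here for illustration only.

```python
from sklearn.naive_bayes import CategoricalNB

# Encoded data copied from the outputs above
weather_encoded = [2, 2, 0, 1, 1, 1, 0, 2, 2, 1, 2, 0, 1]
temp_encoded = [1, 1, 1, 2, 0, 0, 0, 2, 0, 2, 2, 1, 2]
label = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0]

# One row per observation: [weather, temp]
X = [list(row) for row in zip(weather_encoded, temp_encoded)]

# alpha=1.0 applies add-one smoothing to the per-category counts
cat_model = CategoricalNB(alpha=1.0)
cat_model.fit(X, label)

query = [[0, 2]]  # 0: Overcast, 2: Mild
cat_pred = cat_model.predict(query)
cat_proba = cat_model.predict_proba(query)

print("Predicted Value:{}".format(cat_pred))
print("Class probabilities (No, Yes):{}".format(cat_proba[0]))
```

For this query the categorical model should agree with the Gaussian one and predict `Yes`. On other inputs the two models may disagree, because they make different assumptions about the features.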

## Multiclass Naive Bayes classification (more than two labels)
For this example we are going to use the wine dataset (already seen in a previous practical exercise), which has three classes. As in the binary example, the classifier is `GaussianNB`.

In [None]:
from sklearn import datasets
wine = datasets.load_wine()

In [None]:
print("Features: {}".format(wine.feature_names))
print("Labels: {}".format(wine.target_names))

Features: ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
Labels: ['class_0' 'class_1' 'class_2']
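Before splitting the data it can help to check how many samples and features there are and how the classes are distributed. The following is a minimal sketch using NumPy's `bincount`; the variable names `n_samples`, `n_features`, and `class_counts` are chosen here for illustration.

```python
import numpy as np
from sklearn import datasets

wine = datasets.load_wine()

# Rows are wines, columns are the 13 chemical measurements
n_samples, n_features = wine.data.shape
print("Samples: {}, features: {}".format(n_samples, n_features))

# Number of samples per class, in the order of wine.target_names
class_counts = np.bincount(wine.target)
for name, count in zip(wine.target_names, class_counts):
    print("{}: {}".format(name, count))
```

The classes are not perfectly balanced, which is one reason to read the per-class rows of the classification report rather than accuracy alone.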


In [None]:
from sklearn.model_selection import train_test_split

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.5,random_state=109)

In [None]:
from sklearn.naive_bayes import GaussianNB

#Create a Gaussian Classifier
gnb = GaussianNB()

#Train the model using the training sets
gnb.fit(X_train, y_train)

#Predict the response for test dataset
y_pred = gnb.predict(X_test)

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred, target_names=wine.target_names))

              precision    recall  f1-score   support

     class_0       0.97      0.97      0.97        31
     class_1       0.97      0.91      0.94        32
     class_2       0.93      1.00      0.96        26

    accuracy                           0.96        89
   macro avg       0.95      0.96      0.96        89
weighted avg       0.96      0.96      0.95        89
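The report summarizes performance per class. To see which classes are confused with each other, a confusion matrix can be computed as well. The sketch below repeats the split and training steps from the cells above so that it runs on its own; `cm` and `acc` are names introduced here.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

wine = datasets.load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.5, random_state=109)

gnb = GaussianNB().fit(X_train, y_train)
y_pred = gnb.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)

print(cm)
print("Accuracy: {:.2f}".format(acc))
```

The diagonal of `cm` counts the correctly classified samples, so the accuracy equals the sum of the diagonal divided by the number of test samples.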



### Zero probability problem

Suppose the training set contains no tuple for `class_2`. The estimated probability of that class is then zero, and because Naive Bayes multiplies probabilities, the posterior probability (`P(c|x)`) is zero as well, so the model can never predict that class. This is known as the **zero probability** problem, because the class occurs zero times in the training data.

The usual solution is the **Laplacian correction** (also called **Laplace smoothing**), one of several *smoothing* techniques. It assumes that the dataset is large enough that adding one tuple for each class makes a negligible difference to the estimated probabilities, while ensuring that no probability is exactly zero.

*For Example*: Suppose that there are 1000 training tuples in the training database. In this database, the label column has 
- 0 tuples for `class_2`, 
- 990 tuples for `class_1`, 
- 10 tuples for `class_0`. 

The probabilities of these events, without the Laplacian correction, are 
- $\frac{0}{1000}=0$ for `class_2`,  
- $\frac{990}{1000}=0.990$ for `class_1`, 
- $\frac{10}{1000}=0.010$ for `class_0`

Applying a Laplacian correction to the given dataset, we add 1 more tuple for each class, so the total grows from 1000 to 1003. The probabilities of these events will be:
- $\frac{1}{1003}=0.001$ for `class_2`,  
- $\frac{991}{1003}=0.988$ for `class_1`, 
- $\frac{11}{1003}=0.011$ for `class_0`
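The arithmetic of the example can be checked with a few lines of Python. This is a minimal sketch: the helper `class_probabilities` and its `alpha` argument are hypothetical names introduced here, where `alpha=0` means no correction and `alpha=1` adds one tuple per class.

```python
# Class counts from the example: 1000 training tuples in total
counts = {"class_2": 0, "class_1": 990, "class_0": 10}

def class_probabilities(counts, alpha=0):
    """Estimate class probabilities, adding `alpha` pseudo-tuples per class."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {c: (n + alpha) / (total + alpha * n_classes)
            for c, n in counts.items()}

raw = class_probabilities(counts)                 # no correction
smoothed = class_probabilities(counts, alpha=1)   # Laplacian correction

for c in counts:
    print("{}: {:.3f} -> {:.3f}".format(c, raw[c], smoothed[c]))
```

The smoothed values still sum to 1, and the probability for `class_2` is small but no longer zero. In scikit-learn, `MultinomialNB` and `CategoricalNB` expose the same kind of additive smoothing, applied to feature counts, through their `alpha` parameter.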

