<img style="width:450px;" src="https://durhamcollege.ca/wp-content/uploads/ai-hub-header.jpg" alt="DC Logo"/>

# LESSON 10a - Gaussian Naive Bayes & Pre-Processing

## <span style="color: green">OVERVIEW</span>

- ref.: http://dataaspirant.com/2017/02/20/gaussian-naive-bayes-classifier-implementation-python/
- data: https://archive.ics.uci.edu/ml/datasets/Adult

<b>We have provided the data for you so there is no need to download it!</b>

<hr />

>**Section 1:** <a href="#Conditional-Probability-Examples">Conditional Probability Examples</a>

>**Section 2:** <a href="#Gaussian-Naive-Bayes">Gaussian Naive Bayes</a>

>**Section 3:** <a href="#Census-Income-Dataset">Census Income Dataset</a>

>**Section 4:** <a href="#Data-Dictionary">Data Dictionary</a>

>**Section 5:** <a href="#Import-Required-Libraries">Import Required Libraries</a>

>**Section 6:** <a href="#Import-Data">Import Data</a>

>**Section 7:** <a href="#Handling-Missing-Data">Handling Missing Data</a>

<hr />

In this lesson, we are going to implement the Naive Bayes classifier in Python using the machine learning library scikit-learn. We will then use the trained Naive Bayes model (a supervised classification model) to predict the income bracket of people.

Let's quickly go over the basics of Naive Bayes:

Bayes’ theorem is based on <b>conditional probability</b>. Conditional probability helps us calculate the probability that something will happen, given that something else has already happened. If that is not clear yet, a few examples should help.

<hr />

### <span style="color:#27ae60">Conditional Probability Examples</span>

Below are a few examples that help clarify the definition of conditional probability.

- Purchasing a MacBook when you have already purchased an iPhone.
- Having a refreshing drink when you are in the movie theater.
- Buying peanuts when you have bought a chilled soft drink.



<img style="width:350px;" src="https://i2.wp.com/dataaspirant.com/wp-content/uploads/2017/02/Conditional_probability.jpg?w=500" alt="Conditional Probability" />
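
The first example in the list can be turned into numbers. The sketch below uses a small, made-up list of purchase records (the data and the variable names are assumptions for illustration only) and estimates the probability of buying a MacBook overall, and then given that an iPhone was already purchased.

```python
# Hypothetical purchase records: (bought_iphone, bought_macbook)
purchases = [
    (True, True), (True, True), (True, False), (True, True),
    (False, False), (False, True), (False, False), (False, False),
    (True, False), (False, False),
]

# Count the customers in each group
n_iphone = sum(1 for iphone, _ in purchases if iphone)
n_macbook = sum(1 for _, macbook in purchases if macbook)
n_both = sum(1 for iphone, macbook in purchases if iphone and macbook)

# P(MacBook): share of all customers who bought a MacBook
p_macbook = n_macbook / len(purchases)

# P(MacBook | iPhone): share of iPhone buyers who also bought a MacBook
p_macbook_given_iphone = n_both / n_iphone

print(p_macbook)               # 0.4
print(p_macbook_given_iphone)  # 0.6
```

In this toy data, knowing that a customer already owns an iPhone raises the estimated probability of a MacBook purchase from 0.4 to 0.6.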

Using Bayes' theorem, the naive Bayes classifier assumes that all the features are independent of each other (given the class), even if the features actually depend on each other or on the existence of other features.

In the MacBook example, the naive Bayes classifier treats every property of a user (such as already owning an iPhone) as contributing independently to the probability that the user will buy a MacBook.
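
Written out, the independence assumption looks roughly as follows. The symbols are notation chosen here for illustration: $y$ is the class (for example, "buys a MacBook") and $x_1, \dots, x_n$ are the features. Bayes' theorem gives

$$P(y \mid x_1, \dots, x_n) = \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}$$

The naive assumption replaces the joint likelihood with a product of per-feature terms:

$$P(x_1, \dots, x_n \mid y) = \prod_{i=1}^{n} P(x_i \mid y)$$

Since the denominator is the same for every class, the predicted class is the one that maximizes the numerator:

$$\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)$$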

To learn the key concepts related to Naive Bayes, you can read the Introduction to Naive Bayes article on dataaspirant:

(https://dataaspirant.com/2017/02/06/naive-bayes-classifier-machine-learning/)

That article covers three popular Naive Bayes algorithms:

- Gaussian Naive Bayes
- Multinomial Naive Bayes
- Bernoulli Naive Bayes

We are going to implement **Gaussian Naive Bayes** on a “Census Income” dataset.

### <span style="color:#27ae60">Gaussian Naive Bayes</span>
Gaussian Naive Bayes is a special type of Naive Bayes (NB) algorithm. It is used when the features have continuous values, and it assumes that, within each class, every feature follows a Gaussian (i.e., normal) distribution.
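
To make the idea concrete, here is a minimal from-scratch sketch of a Gaussian Naive Bayes classifier in plain Python. It is not scikit-learn's implementation (which, among other things, adds variance smoothing), and the toy samples, the function names and the feature choice (age, hours per week) are assumptions made up for illustration.

```python
import math
from statistics import mean, pvariance

# Hypothetical toy data: (age, hours_per_week) grouped by income label
samples = {
    "<=50K": [(25, 35), (30, 40), (22, 30), (28, 38)],
    ">50K": [(45, 50), (50, 55), (48, 60), (52, 45)],
}

def gaussian_log_pdf(x, mu, var):
    """Log of the normal density with mean mu and variance var at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def fit(samples):
    """Estimate a prior plus a per-feature mean and variance for each class."""
    total = sum(len(rows) for rows in samples.values())
    model = {}
    for label, rows in samples.items():
        columns = list(zip(*rows))  # one tuple of values per feature
        model[label] = {
            "prior": len(rows) / total,
            "means": [mean(col) for col in columns],
            "vars": [pvariance(col) for col in columns],
        }
    return model

def predict(model, x):
    """Return the class with the highest log prior plus summed log likelihoods."""
    scores = {}
    for label, params in model.items():
        score = math.log(params["prior"])
        for value, mu, var in zip(x, params["means"], params["vars"]):
            score += gaussian_log_pdf(value, mu, var)
        scores[label] = score
    return max(scores, key=scores.get)

model = fit(samples)
prediction_young = predict(model, (26, 36))
prediction_older = predict(model, (49, 52))
print(prediction_young, prediction_older)
```

Summing log densities instead of multiplying raw densities is a common design choice here: it avoids numerical underflow when many small probabilities are combined.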

### <span style="color:#27ae60">Census Income Dataset</span>
The Census Income dataset is used to predict whether the income of a person is either: 
><p>&gt;\$50K/yr (greater than \$50K/yr)</p>
><p>or</p>
><p>&lt;=\$50K/yr (less than or equal to \$50K/yr)</p>

<i>The data was collected by Barry Becker from the 1994 Census dataset.</i>

This dataset was contributed to the UCI Machine Learning Repository and is openly available at the link provided at the top of this lesson. The dataset consists of 15 columns containing a mix of discrete and continuous data.

### <span style="color:#27ae60">Data Dictionary</span>
#### <span style="color: red">Please ensure that you understand the data fully before continuing the lesson.</span>

<table style="height: 296px; width: 602px; border-color: #000000; background-color: #ffffff;" border="1 px">
<tbody>
<tr style="height: 24px;">
<td style="width: 37px; height: 24px; text-align: center;"></td>
<td style="width: 146px; height: 24px; text-align: center;"><strong>Variable Name</strong></td>
<td style="width: 425px; height: 24px; text-align: center;"><strong>Variable Range</strong></td>
</tr>
<tr style="height: 29px;">
<td style="width: 37px; height: 29px; text-align: center;">1.</td>
<td style="width: 146px; height: 29px; text-align: center;">age</td>
<td style="width: 425px; height: 29px;">&nbsp;[17 – 90]</td>
</tr>
<tr style="height: 87px;">
<td style="width: 37px; height: 87px; text-align: center;">2.</td>
<td style="width: 146px; height: 87px; text-align: center;">workclass</td>
<td style="width: 425px; height: 87px;">[‘State-gov’, ‘Self-emp-not-inc’, ‘Private’, ‘Federal-gov’, ‘Local-gov’, ‘Self-emp-inc’, ‘Without-pay’, ‘Never-worked’]</td>
</tr>
<tr style="height: 30px;">
<td style="width: 37px; height: 30px; text-align: center;">3.</td>
<td style="width: 146px; height: 30px; text-align: center;">fnlwgt</td>
<td style="width: 425px; height: 30px;">[77516- 257302]</td>
</tr>
<tr style="height: 12px;">
<td style="width: 37px; height: 12px; text-align: center;">4.</td>
<td style="width: 146px; height: 12px; text-align: center;">education</td>
<td style="width: 425px; height: 12px;">
<p class="">[‘Bachelors’, ‘HS-grad’, ’11th’, ‘Masters’, ‘9th’, ‘Some-college’, ‘Assoc-acdm’, ‘Assoc-voc’, ‘7th-8th’, ‘Doctorate’, ‘Prof-school’, ‘5th-6th’, ’10th’, ‘1st-4th’, ‘Preschool’, ’12th’]</p>
</td>
</tr>
<tr style="height: 31px;">
<td style="width: 37px; height: 31px; text-align: center;">5.</td>
<td style="width: 146px; height: 31px; text-align: center;">education_num</td>
<td style="width: 425px; height: 31px;">[1 – 16]</td>
</tr>
<tr style="height: 24px;">
<td style="width: 37px; height: 24px; text-align: center;">6.</td>
<td style="width: 146px; height: 24px; text-align: center;">marital_status</td>
<td style="width: 425px; height: 24px;">
<p class="">[‘Never-married’, ‘Married-civ-spouse’, ‘Divorced’, ‘Married-spouse-absent’, ‘Separated’, ‘Married-AF-spouse’, ‘Widowed’]</p>
</td>
</tr>
<tr style="height: 24px;">
<td style="width: 37px; height: 24px; text-align: center;">7.</td>
<td style="width: 146px; height: 24px; text-align: center;">occupation</td>
<td style="width: 425px; height: 24px;">
<p class="">[‘Adm-clerical’, ‘Exec-managerial’, ‘Handlers-cleaners’, ‘Prof-specialty’, ‘Other-service’, ‘Sales’, ‘Craft-repair’, ‘Transport-moving’, ‘Farming-fishing’, ‘Machine-op-inspct’, ‘Tech-support’, ‘Protective-serv’, ‘Armed-Forces’, ‘Priv-house-serv’]</p>
</td>
</tr>
<tr style="height: 24px;">
<td style="width: 37px; height: 24px; text-align: center;">8.</td>
<td style="width: 146px; height: 24px; text-align: center;">relationship</td>
<td style="width: 425px; height: 24px;">
<p class="">[‘Not-in-family’, ‘Husband’, ‘Wife’, ‘Own-child’, ‘Unmarried’, ‘Other-relative’]</p>
</td>
</tr>
<tr style="height: 24px;">
<td style="width: 37px; height: 24px; text-align: center;">9.</td>
<td style="width: 146px; height: 24px; text-align: center;">race</td>
<td style="width: 425px; height: 24px;">
<p class="">[‘White’, ‘Black’, ‘Asian-Pac-Islander’, ‘Amer-Indian-Eskimo’, ‘Other’]</p>
</td>
</tr>
<tr style="height: 24px;">
<td style="width: 37px; height: 24px; text-align: center;">10.</td>
<td style="width: 146px; height: 24px; text-align: center;">sex</td>
<td style="width: 425px; height: 24px;">
<p class="">[‘Male’, ‘Female’]</p>
</td>
</tr>
<tr style="height: 29px;">
<td style="width: 37px; height: 29px; text-align: center;">11.</td>
<td style="width: 146px; height: 29px; text-align: center;">capital_gain</td>
<td style="width: 425px; height: 29px;">[0 – 99999]</td>
</tr>
<tr style="height: 28px;">
<td style="width: 37px; height: 28px; text-align: center;">12.</td>
<td style="width: 146px; height: 28px; text-align: center;">capital_loss</td>
<td style="width: 425px; height: 28px;">[0 –&nbsp;4356]</td>
</tr>
<tr style="height: 28px;">
<td style="width: 37px; height: 28px; text-align: center;">13.</td>
<td style="width: 146px; height: 28px; text-align: center;">hours_per_week</td>
<td style="width: 425px; height: 28px;">[1 – 99]</td>
</tr>
<tr style="height: 24px;">
<td style="width: 37px; height: 24px; text-align: center;">14.</td>
<td style="width: 146px; height: 24px; text-align: center;">native_country</td>
<td style="width: 425px; height: 24px;">
<p class="">[‘United-States’, ‘Cuba’, ‘Jamaica’, ‘India’, ‘Mexico’, ‘South’, ‘Puerto-Rico’, ‘Honduras’, ‘England’, ‘Canada’, ‘Germany’, ‘Iran’, ‘Philippines’, ‘Italy’, ‘Poland’, ‘Columbia’, ‘Cambodia’, ‘Thailand’, ‘Ecuador’, ‘Laos’, ‘Taiwan’, ‘Haiti’, ‘Portugal’, ‘Dominican-Republic’, ‘El-Salvador’, ‘France’, ‘Guatemala’, ‘China’, ‘Japan’, ‘Yugoslavia’, ‘Peru’, ‘Outlying-US(Guam-USVI-etc)’, ‘Scotland’, ‘Trinadad&amp;Tobago’, ‘Greece’, ‘Nicaragua’, ‘Vietnam’, ‘Hong’, ‘Ireland’, ‘Hungary’, ‘Holand-Netherlands’]</p>
</td>
</tr>
<tr style="height: 13.5938px;">
<td style="width: 37px; height: 13.5938px; text-align: center;">15.</td>
<td style="width: 146px; height: 13.5938px; text-align: center;">income</td>
<td style="width: 425px; height: 13.5938px;">
<p class="">&nbsp;[‘&lt;=50K’, ‘&gt;50K’]</p>
</td>
</tr>
</tbody>
</table>

><b>The final target variable consists of two values: ‘&lt;=50K’ &amp; ‘&gt;50K’.</b>

>*This is the variable we are trying to predict based on the others.*


<hr />

### <span style="color:#27ae60">Import Required Libraries</span>

We need to import the pandas, numpy and sklearn libraries.

The <b>train_test_split</b> function is for splitting the dataset into training and testing sets.

The <b>accuracy_score</b> function will be used for calculating the accuracy of our Gaussian Naive Bayes algorithm.

From sklearn, we also need an imputer. Older scikit-learn releases provided **Imputer** in sklearn.preprocessing; it was removed in version 0.22 in favour of **SimpleImputer** in sklearn.impute, which is what we import here. The imputer helps to fill in (impute) the missing values that cannot be processed.

In [44]:
# Required Python Machine learning Packages
import pandas as pd
import numpy as np

# For preprocessing the data
# (SimpleImputer replaces the Imputer class removed in scikit-learn 0.22)
from sklearn.impute import SimpleImputer
from sklearn import preprocessing

# To split the dataset into train and test datasets
from sklearn.model_selection import train_test_split

# To model the Gaussian Naive Bayes classifier
from sklearn.naive_bayes import GaussianNB

# To calculate the accuracy score of the model
from sklearn.metrics import accuracy_score
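
To see how train_test_split, GaussianNB and accuracy_score fit together before we touch the census data, here is a minimal sketch on synthetic data. The two-cluster dataset, the 70/30 split and the random seeds are assumptions chosen for illustration; they are not part of the lesson's pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Synthetic data: two well-separated Gaussian clusters, 100 points each
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 2)),  # class 0
    rng.normal(5.0, 1.0, size=(100, 2)),  # class 1
])
y = np.array([0] * 100 + [1] * 100)

# Hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit the classifier on the training rows only
model = GaussianNB()
model.fit(X_train, y_train)

# Score the predictions against the held-out labels
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)
```

Because the clusters barely overlap, the accuracy on this toy data should be close to 1.0; real data such as the census set will be much less clean.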


<hr />

<img style="width:375px;" src="http://s2.quickmeme.com/img/a7/a78187c76f9d27752fe5494e00c269095000e643328edfe59dfba42047276d7b.jpg" />

### <span style="color:#27ae60">Import Data</span>
For importing the census data, we are using the pandas <b>read_csv()</b> function. It is a simple and fast way to import data.

We are passing four parameters. 

The **filename parameter** value is *'adult.data.txt'*. 

The **header parameter** tells pandas whether the first row of the data consists of headers or not. In our dataset, there is no header, so we pass *None*.

The **delimiter parameter** tells pandas which character separates the data values. Here, we are using the *comma delimiter ( , )*. Note that a plain comma does not delete the spaces before and after the data values; a regular-expression delimiter that also matches the surrounding spaces does, which is very helpful when there is inconsistency in the spaces used with data values.

The **engine parameter** selects the parser that read_csv uses. The *'python'* engine is slower than the default C engine but more feature-complete; for example, it supports regular-expression delimiters.

In [45]:
# Load the Dataset into a Dataframe
adult_df = pd.read_csv('adult.data.txt', header=None, delimiter=',', engine='python')

# View the shape of the dataframe 
# to understand and verify the size of the dataset
adult_df.shape

(32561, 15)
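
The effect of the header and delimiter settings is easiest to see on a tiny example. The sketch below reads two made-up rows from an in-memory string instead of the lesson's file (the rows and variable names are assumptions for illustration) and compares a plain comma delimiter with the skipinitialspace option of read_csv.

```python
import io
import pandas as pd

# Two made-up rows in the same "comma plus space" style, with no header row
raw = "39, State-gov, 77516, <=50K\n50, ?, 83311, >50K\n"

# Plain comma delimiter: the space after each comma stays in the text values
plain = pd.read_csv(io.StringIO(raw), header=None, delimiter=',')

# skipinitialspace=True drops the spaces that follow the delimiter
stripped = pd.read_csv(io.StringIO(raw), header=None, delimiter=',',
                       skipinitialspace=True)

print(plain.shape)                # (2, 4)
print(repr(plain.iloc[1, 1]))     # ' ?'
print(repr(stripped.iloc[1, 1]))  # '?'
```

With header=None, pandas numbers the columns 0, 1, 2, ... itself. The leading space matters later: a comparison against "?" would not match the value ' ?'.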

<hr />

Let’s add headers to our dataframe. The code snippet below performs this task.

In [46]:
# Adding Headers
adult_df.columns = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
                    'marital_status', 'occupation', 'relationship',
                    'race', 'sex', 'capital_gain', 'capital_loss',
                    'hours_per_week', 'native_country', 'income']

<hr />

### <span style="color:#27ae60">Handling Missing Data</span>
Let’s test whether there are any null values in our dataset.
>We can do this using the isnull() method

>In combination with the sum() method

In [47]:
# Find the Null values in the dataframe and sum them
adult_df.isnull().sum()

age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
income            0
dtype: int64

*The above output should show that there are no “null” values in our dataset.*
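
A zero from isnull().sum() only means that pandas found no NaN/None entries. The sketch below, on a hypothetical toy dataframe (not the census data), shows that a placeholder such as "?" is an ordinary string and is therefore not counted as null.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one real NaN and one "?" placeholder
toy_df = pd.DataFrame({
    "workclass": ["Private", "?", np.nan, "State-gov"],
    "age": [39, 50, 38, 53],
})

# isnull() flags only true missing values; sum() counts them per column
null_counts = toy_df.isnull().sum()
print(null_counts)
```

Here the workclass column reports one null value (the NaN), even though a second entry, the "?", is also effectively missing.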

<hr />

**Let’s test whether any categorical attribute contains a “?”. At times, a “?” or ” ” appears in place of missing values.**

Using the starter code snippet below, we are going to test whether our adult_df dataframe contains categorical variables with values equal to “?”.

In [48]:
#for value in ['workclass', 'education', 'marital_status', 'occupation', 
#              'relationship','race', 'sex', 'native_country', 'income']:
#    print(#?)


<hr />

The output of the above code, once completed, should show that there are 1836 missing values in the 'workclass' attribute, 1843 missing values in the 'occupation' attribute and 583 missing values in the 'native_country' attribute.
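
One possible way to approach a check like this is sketched below on a hypothetical toy dataframe rather than on adult_df, so the names and values are assumptions for illustration. It strips surrounding spaces, counts the "?" placeholders per column, and then converts them to NaN so that isnull() can see them.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with "?" placeholders (one has a leading space)
toy_df = pd.DataFrame({
    "workclass": ["Private", " ?", "?", "State-gov"],
    "occupation": ["Sales", "?", "Tech-support", "Sales"],
    "sex": ["Male", "Female", "Female", "Male"],
})

# Count the "?" entries per column, ignoring surrounding whitespace
placeholder_counts = {
    column: int((toy_df[column].str.strip() == "?").sum())
    for column in toy_df.columns
}
print(placeholder_counts)

# Strip whitespace, then turn the placeholders into real missing values
cleaned_df = toy_df.apply(lambda col: col.str.strip()).replace("?", np.nan)
print(cleaned_df.isnull().sum())
```

Once the placeholders are real NaN values, the usual pandas and scikit-learn tools for missing data can be applied to them.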

<hr />

### <span style="color:ForestGreen;padding-left:50px;">Continue Pre-Processing your Data in Lesson 10b</span>