#### Gaussian Naive Bayes
This kernel shows an implementation of a Gaussian Naive Bayes classifier in Python. It is meant to be educational and to illustrate the principles of the Naive Bayes classifier on a real-life example.

#### Dataset introduction
The task is to predict whether a person's income exceeds \$50K/yr based on census data. The dataset is also known as the "Census Income" dataset.

[Census Income dataset page at UCI](https://archive.ics.uci.edu/ml/datasets/Adult)


In [7]:
#############################
# Gaussian Naive Bayes
#############################
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_color_codes()


# Dataset import
import requests
from tqdm import tqdm

r = requests.get("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data", stream=True)
total_size = int(r.headers["Content-Length"])
chunk_size = 1024*1024

with open("data", "wb") as handle:
    for data in tqdm(r.iter_content(chunk_size),
                     total=int(np.ceil(total_size / chunk_size)),  # true division, so the last partial chunk is counted
                     unit="Mb",
                     unit_scale=True):
        handle.write(data)

4.00Mb [00:03, 1.06s/Mb]                                                                                               


In [22]:
data = pd.read_csv("data", header=None)

print("Top of the file")
print(data.head())

Top of the file
   0                  1       2           3   4                    5   \
0  39          State-gov   77516   Bachelors  13        Never-married   
1  50   Self-emp-not-inc   83311   Bachelors  13   Married-civ-spouse   
2  38            Private  215646     HS-grad   9             Divorced   
3  53            Private  234721        11th   7   Married-civ-spouse   
4  28            Private  338409   Bachelors  13   Married-civ-spouse   

                   6               7       8        9     10  11  12  \
0        Adm-clerical   Not-in-family   White     Male  2174   0  40   
1     Exec-managerial         Husband   White     Male     0   0  13   
2   Handlers-cleaners   Not-in-family   White     Male     0   0  40   
3   Handlers-cleaners         Husband   Black     Male     0   0  40   
4      Prof-specialty            Wife   Black   Female     0   0  40   

               13      14  
0   United-States   <=50K  
1   United-States   <=50K  
2   United-States   <=50K  


For the sake of simplicity, I will restrict the data to only three columns: one continuous variable, one nominal variable and the label. I choose age and sex (columns 0 and 9), which are named Age and Gender below; the label is in column 14.

In [23]:
data = data.iloc[:, [0, 9, 14]]

data = data.rename(columns={0: "Age", 9: "Gender", 14: "Income"})

print("The columns are now: {}".format(data.columns))

The columns are now: Index(['Age', 'Gender', 'Income'], dtype='object')


In [30]:
print(data.describe())
print(data["Gender"].value_counts())
print(data["Income"].value_counts())

                Age
count  32561.000000
mean      38.581647
std       13.640433
min       17.000000
25%       28.000000
50%       37.000000
75%       48.000000
max       90.000000
 Male      21790
 Female    10771
Name: Gender, dtype: int64
 <=50K    24720
 >50K      7841
Name: Income, dtype: int64


The dataset is not balanced: the number of people whose income is at most 50K is around 3.2 times higher than the number of people who earn more than 50K (24720 vs. 7841).
Moreover, there are about twice as many males as females in the dataset. The mean age is around 39 years, with the minimum (17) and maximum (90) within the expected range.

There are no missing values in the three selected columns.

#### Splitting the dataset
To show how Naive Bayes works on this dataset, I will split it into training and testing parts. The training part keeps the name `data` and the testing part is named `test`.

In [69]:
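# Hold out 20% of the rows for testing; the fixed seed makes the split reproducible.
test = data.sample(frac=0.2, random_state=13)
# The drop is in place, so re-running this cell shrinks `data` again.
data.drop(index=test.index, inplace=True)
print("Length of test {} and length of training dataset {}".format(len(test), len(data)))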

Length of test 5210 and length of training dataset 20839


#### Bayes Theorem
This kernel follows [a great explanation of NB inner workings](https://shuzhanfan.github.io/2018/06/understanding-mathematics-behind-naive-bayes/).

My goal is to implement functions that calculate the likelihoods and the prior probabilities of a class and of a predictor. That is the first step. The second step is to calculate the probability of a case belonging to a class. Let us start with the equation we will be working with.

\begin{equation*}
P(C_k | X) = \frac{\prod_{i=1}^{n} P(x_i | C_k) \, P(C_k)}{P(X)}, \quad \text{for } k = 1, 2, \dots, K
\end{equation*}

where $C_k$ is a class and $X$ is a vector of features (in our case, the two variables chosen from the Census dataset). I will call $P(C_k | X)$ the posterior probability, $\prod_{i=1}^{n} P(x_i | C_k)$ the likelihood, $P(C_k)$ the prior probability of a class and $P(X)$ the prior probability of a predictor.
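
For the two features kept here, the general formula can be written out explicitly. This is only a restatement under the naive (conditional independence) assumption; the symbols $x_{age}$ and $x_{gender}$ are introduced for illustration and are not used elsewhere in this kernel.

\begin{equation*}
P(C_k | x_{age}, x_{gender}) = \frac{P(x_{age} | C_k) \, P(x_{gender} | C_k) \, P(C_k)}{P(x_{age}, x_{gender})}
\end{equation*}

Since the denominator does not depend on $k$, the class with the largest numerator is also the class with the largest posterior probability.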

Let us start with a simple function that calculates the prior class probability. Since it does not change during learning, we will assign it to a variable right away.

In [36]:
def prior_class(class_vector):
    # Relative frequency of each class label in the given vector
    return dict(class_vector.value_counts() / len(class_vector))

prior_probas = prior_class(data["Income"])
print(prior_probas)

{' <=50K': 0.7591904425539756, ' >50K': 0.2408095574460244}
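
The printed priors can be checked by hand from the class counts shown earlier. The snippet below is a small sanity check, not part of the classifier; the counts are copied from the `value_counts()` output of the full dataset, and the names `counts` and `priors` are introduced only for this illustration.

```python
# Class counts copied from the value_counts() output of the full dataset.
counts = {" <=50K": 24720, " >50K": 7841}

total = sum(counts.values())

# Prior of a class = its share of all rows.
priors = {label: n / total for label, n in counts.items()}
print(priors)
```

The two values sum to one by construction, which is a cheap check worth keeping whenever the class vector changes.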


The next function calculates the prior predictor probability given a vector of features. I will assume that age follows a Gaussian distribution in this population, so I will use the following density to calculate $P(age)$:

\begin{equation*}
P_N(x | u, \sigma^2) = \frac {1}{\sqrt {2 \pi \sigma^2}}e^{-{\frac{(x-u)^2}{2 \sigma^2}}}
\end{equation*}

In [61]:
def normal_pdf(x, u, sigma_sq):
    return 1/(np.sqrt(2 * np.pi * sigma_sq)) * np.e ** ((-(x - u)**2)/2/sigma_sq)

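def prior_predictor(feature_vector):
    # Marginal Gaussian density of age, estimated from `data`
    age_mean = np.mean(data["Age"])
    age_var = np.var(data["Age"])
    age_prob = normal_pdf(feature_vector["Age"], age_mean, age_var)

    # Marginal relative frequency of the gender value
    gender_probas = dict(data["Gender"].value_counts() / len(data))
    gender_prob = gender_probas[feature_vector["Gender"]]

    # P(X) taken as the product of the two marginals
    return age_prob * gender_prob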
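
As a quick check of the density formula, the sketch below compares a standard-library version of `normal_pdf` with Python's built-in `statistics.NormalDist`. The parameters are the overall mean and standard deviation of Age printed by `describe()` earlier and are used purely for illustration (`describe()` reports the sample standard deviation, while `np.var` defaults to the population variance; the difference is negligible at this sample size).

```python
import math
from statistics import NormalDist


def normal_pdf(x, u, sigma_sq):
    """Gaussian density with mean u and variance sigma_sq."""
    return math.exp(-((x - u) ** 2) / (2 * sigma_sq)) / math.sqrt(2 * math.pi * sigma_sq)


# Illustrative parameters taken from the describe() output for Age.
mean, std = 38.581647, 13.640433

ours = normal_pdf(39, mean, std ** 2)
reference = NormalDist(mu=mean, sigma=std).pdf(39)
print(ours, reference)
```

Note that the result is a density, not a probability: it is only meaningful relative to other densities and can exceed 1 when the variance is small.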

The following function calculates the likelihood.

In [62]:
def likelihood(class_, feature_vector):
    # Keep only the rows that belong to the given class
    subset = data.loc[data["Income"] == class_, :]

    # Class-conditional Gaussian density of age, P(age | class)
    age_mean = np.mean(subset["Age"])
    age_var = np.var(subset["Age"])
    age_prob = normal_pdf(feature_vector["Age"], age_mean, age_var)

    # Class-conditional relative frequency of gender, P(gender | class)
    gender_probas = dict(subset["Gender"].value_counts() / len(subset))
    gender_prob = gender_probas[feature_vector["Gender"]]

    # Naive assumption: features are independent given the class
    return age_prob * gender_prob

Putting it all together, we can create a function that calculates the posterior probability.

In [65]:
def posterior(class_, feature_vector):
    return likelihood(class_, feature_vector) * prior_probas[class_] / prior_predictor(feature_vector)

Let us try out our function by calculating the posterior probability of the true class of the first person in the dataset.

In [67]:
posterior(data.iloc[0, 2], data.iloc[0, :2])

0.6673929968532522
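
The value above uses $P(X)$ computed as the product of the marginal age density and the marginal gender frequency. A common alternative, sketched here as an assumption rather than as this kernel's method, obtains the denominator from the law of total probability, $P(X) = \sum_k P(X | C_k) P(C_k)$, so that the posteriors of all classes sum to one. The function name `normalised_posteriors` is hypothetical and the likelihood values are invented for illustration; only the priors are copied from the output earlier.

```python
def normalised_posteriors(likelihoods, priors):
    """Return P(C_k | X) for every class, with P(X) = sum_k P(X | C_k) * P(C_k)."""
    joint = {c: likelihoods[c] * priors[c] for c in priors}
    evidence = sum(joint.values())
    return {c: joint[c] / evidence for c in joint}


# Priors copied from the prior_class output; likelihoods are made-up numbers.
priors = {" <=50K": 0.7591904425539756, " >50K": 0.2408095574460244}
likelihoods = {" <=50K": 0.016, " >50K": 0.030}

posteriors = normalised_posteriors(likelihoods, priors)
print(posteriors)
```

With this normalisation the class ranking is unchanged, because every class is divided by the same number.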

#### Bayes classifier
Thus, I have created all the tools needed for the Bayes classifier to work. What is left is to calculate the predicted class for the training and test datasets, compare the accuracies and maybe plot an ROC curve.
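
The remaining prediction step could look like the following minimal sketch. It uses only the standard library and a handful of invented rows so that it runs on its own; `fit`, `predict` and `accuracy` are hypothetical names, not functions defined in this kernel. Because $P(X)$ is the same for every class, the prediction only compares likelihood times prior.

```python
import math
from collections import Counter, defaultdict


def normal_pdf(x, u, sigma_sq):
    """Gaussian density with mean u and variance sigma_sq."""
    return math.exp(-((x - u) ** 2) / (2 * sigma_sq)) / math.sqrt(2 * math.pi * sigma_sq)


def fit(rows):
    """Estimate per-class parameters once from (age, gender, label) tuples."""
    by_class = defaultdict(list)
    for age, gender, label in rows:
        by_class[label].append((age, gender))

    model = {}
    for label, members in by_class.items():
        ages = [a for a, _ in members]
        mean = sum(ages) / len(ages)
        # Population variance (divide by n), matching the np.var default.
        var = sum((a - mean) ** 2 for a in ages) / len(ages)
        genders = Counter(g for _, g in members)
        model[label] = {
            "prior": len(members) / len(rows),
            "age_mean": mean,
            "age_var": var,
            "gender": {g: n / len(members) for g, n in genders.items()},
        }
    return model


def predict(model, age, gender):
    """Return the class with the largest likelihood * prior."""
    def score(label):
        p = model[label]
        age_prob = normal_pdf(age, p["age_mean"], p["age_var"])
        gender_prob = p["gender"].get(gender, 0.0)  # unseen value -> probability 0
        return age_prob * gender_prob * p["prior"]
    return max(model, key=score)


def accuracy(model, rows):
    """Share of rows whose predicted class equals the true label."""
    hits = sum(predict(model, age, gender) == label for age, gender, label in rows)
    return hits / len(rows)


# Invented toy rows, not taken from the Census data.
train = [
    (25, "Female", "<=50K"), (30, "Male", "<=50K"), (22, "Male", "<=50K"),
    (28, "Female", "<=50K"), (45, "Male", ">50K"), (52, "Male", ">50K"),
    (48, "Female", ">50K"),
]
model = fit(train)
print(predict(model, 50, "Male"), accuracy(model, train))
```

Fitting the parameters once and reusing them avoids recomputing the class means and variances for every row, which the `likelihood` function above does on each call.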