# Naive Bayes Classifier

This notebook works through a simple example task: given height, weight,
and foot size, classify a person as either male or female.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## First step: let's create some (arbitrary) data about men and women

In [2]:
# Create an empty dataframe
data = pd.DataFrame()

# Create our target variable
data['Gender'] = ['male','male','male','male','female','female','female','female']

# Create our feature variables
data['Height'] = [6,5.92,5.58,5.92,5,5.5,5.42,5.75]
data['Weight'] = [180,190,170,165,100,150,130,150]
data['Foot_Size'] = [12,11,12,10,6,8,7,9]

# View the data
data

   Gender  Height  Weight  Foot_Size
0    male    6.00     180         12
1    male    5.92     190         11
2    male    5.58     170         12
3    male    5.92     165         10
4  female    5.00     100          6
5  female    5.50     150          8
6  female    5.42     130          7
7  female    5.75     150          9

## Second step: the data about a person of unknown gender

In [3]:
# Create an empty dataframe
person = pd.DataFrame()

# Create some feature values for this single row
person['Height'] = [6]
person['Weight'] = [130]
person['Foot_Size'] = [8]

# View the data 
person

   Height  Weight  Foot_Size
0       6     130          8

### Gaussian Naive Bayes Classifier

#### Calculate prior probabilities

In [12]:
# Number of males
n_male = data['Gender'][data['Gender'] == 'male'].count()

# Number of females
n_female = data['Gender'][data['Gender'] == 'female'].count()

# Total rows
total_ppl = data['Gender'].count()

In [14]:
# Number of males divided by the total rows
P_male = n_male/total_ppl

# Number of females divided by the total rows
P_female = n_female/total_ppl
print("prior probability of male:", P_male)
print("prior probability of female:", P_female)

prior probability of male: 0.5
prior probability of female: 0.5


Indeed, in our training data 50% of people were female and 50% male.
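
The same priors can also be read off in a single step. The following is a minimal sketch, assuming the same `Gender` values as in the training data above; the names `gender` and `priors` are introduced here only for illustration.

```python
import pandas as pd

# Same target values as the 'Gender' column of the training data
gender = pd.Series(['male'] * 4 + ['female'] * 4, name='Gender')

# normalize=True returns relative frequencies instead of raw counts
priors = gender.value_counts(normalize=True)

print("prior probability of male:", priors['male'])
print("prior probability of female:", priors['female'])
```

This scales to any number of classes without a separate count per class.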

#### Calculate likelihood

The next task is to calculate the likelihood of each individual
characteristic given the class.  For instance, how likely is a height
of 5.5 if the person is female?

We do it by approximating the probability density with a normal
distribution:
p(height = 5.5 | female) = normalPDF(5.5, mean = female mean height,
                                     variance = female height variance)

In order to be able to do this, we have to compute all group-specific
means and variances.
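
For reference, the model behind these steps can be written out explicitly. This is the standard Gaussian naive Bayes formulation; the symbols $\mu_{c,j}$ and $\sigma_{c,j}^2$ (mean and variance of feature $j$ within class $c$) are notation introduced here, not names used elsewhere in this notebook.

$$
p(x_j \mid c) = \frac{1}{\sqrt{2\pi\sigma_{c,j}^2}}
\exp\left(-\frac{(x_j - \mu_{c,j})^2}{2\sigma_{c,j}^2}\right),
\qquad
P(c \mid x) \propto P(c) \prod_{j} p(x_j \mid c)
$$

The product over features is the "naive" part: within each class, the features are treated as independent of each other.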

In [6]:
# Group the data by gender and calculate the means of each feature
data_means = data.groupby('Gender').mean()

# View the values
data_means

        Height  Weight  Foot_Size
Gender                           
female  5.4175  132.50       7.50
male    5.8550  176.25      11.25

In [7]:
# Group the data by gender and calculate the variance of each feature
data_variance = data.groupby('Gender').var()

# View the values
data_variance

          Height      Weight  Foot_Size
Gender                                 
female  0.097225  558.333333   1.666667
male    0.035033  122.916667   0.916667
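
One detail worth keeping in mind: pandas `.var()` computes the sample variance (dividing by n - 1), whereas `np.var` divides by n unless told otherwise. The sketch below illustrates the difference on the female heights from the training data; the variable names are introduced here for illustration.

```python
import numpy as np

# Female heights from the training data
female_heights = np.array([5, 5.5, 5.42, 5.75])

# ddof=1 divides by n - 1 and matches the pandas result in the table above
sample_var = np.var(female_heights, ddof=1)

# NumPy's default (ddof=0) divides by n and gives a smaller value
population_var = np.var(female_heights)

print("sample variance:", sample_var)
print("population variance:", population_var)
```

With only four observations per class the two conventions differ noticeably, so results from different libraries may not agree exactly.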

#### Create all the necessary variables using the tables above

In [8]:
# Means for male
male_height_mean = data_means['Height'][data_means.index == 'male'].values[0]
male_weight_mean = data_means['Weight'][data_means.index == 'male'].values[0]
male_footsize_mean = data_means['Foot_Size'][data_means.index == 'male'].values[0]

# Variance for male
male_height_variance = data_variance['Height'][data_variance.index == 'male'].values[0]
male_weight_variance = data_variance['Weight'][data_variance.index == 'male'].values[0]
male_footsize_variance = data_variance['Foot_Size'][data_variance.index == 'male'].values[0]

# Means for female
female_height_mean = data_means['Height'][data_means.index == 'female'].values[0]
female_weight_mean = data_means['Weight'][data_means.index == 'female'].values[0]
female_footsize_mean = data_means['Foot_Size'][data_means.index == 'female'].values[0]

# Variance for female
female_height_variance = data_variance['Height'][data_variance.index == 'female'].values[0]
female_weight_variance = data_variance['Weight'][data_variance.index == 'female'].values[0]
female_footsize_variance = data_variance['Foot_Size'][data_variance.index == 'female'].values[0]

#### Calculate probability density

With the means and variances in hand, we can insert them
into the corresponding normal density function.

In [9]:
# Create a function that calculates p(x | y):
# note: you may also use scipy.stats.norm.pdf() instead of this function
def p_x_given_y(x, mean_y, variance_y):

    # Evaluate the normal density with the given mean and variance at x
    p = 1/(np.sqrt(2*np.pi*variance_y)) * np.exp((-(x-mean_y)**2)/(2*variance_y))

    return p
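
If you switch to `scipy.stats.norm.pdf()` as the comment above suggests, note that its `scale` argument is the standard deviation, not the variance. The following sketch cross-checks the two on the female height statistics from the tables above; it assumes SciPy is installed, and the names `manual` and `from_scipy` are introduced here for illustration.

```python
import numpy as np
from scipy.stats import norm

def p_x_given_y(x, mean_y, variance_y):
    # Same normal density as the function defined above
    return 1 / np.sqrt(2 * np.pi * variance_y) * np.exp(-(x - mean_y) ** 2 / (2 * variance_y))

# Female height mean and variance, taken from the tables above
mean, variance = 5.4175, 0.097225

manual = p_x_given_y(6, mean, variance)

# scale expects the standard deviation, hence the square root
from_scipy = norm.pdf(6, loc=mean, scale=np.sqrt(variance))

print(manual, from_scipy)
```

Passing the variance directly as `scale` would run without error but return a different density.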

#### Apply classifier to the new data point

Note: in practice, instead of multiplying the probabilities, it is usually better to add log-probabilities, because the product of many small densities can underflow to zero.
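
The log-probability variant mentioned in the note can be sketched as follows. This is an illustration under stated assumptions rather than the notebook's own method: it recomputes the class statistics with NumPy directly from the training data, and the names `male_X`, `female_X`, `x_new`, and `log_numerator` are hypothetical helpers introduced here.

```python
import numpy as np

# Training data from above; columns are height, weight, foot size
male_X = np.array([[6, 180, 12], [5.92, 190, 11], [5.58, 170, 12], [5.92, 165, 10]])
female_X = np.array([[5, 100, 6], [5.5, 150, 8], [5.42, 130, 7], [5.75, 150, 9]])

# The person of unknown gender
x_new = np.array([6, 130, 8])

def log_numerator(x, samples, prior):
    """Log of prior times the product of per-feature normal densities."""
    mean = samples.mean(axis=0)
    variance = samples.var(axis=0, ddof=1)  # sample variance, as pandas .var() uses
    log_pdf = -0.5 * np.log(2 * np.pi * variance) - (x - mean) ** 2 / (2 * variance)
    # Summing logs replaces multiplying densities
    return np.log(prior) + log_pdf.sum()

log_pm = log_numerator(x_new, male_X, 0.5)
log_pf = log_numerator(x_new, female_X, 0.5)

# Normalise in log space (log-sum-exp) to get the posterior for female
posterior_female = np.exp(log_pf - np.logaddexp(log_pf, log_pm))

print("predicted:", "female" if log_pf > log_pm else "male")
print("posterior probability of female:", posterior_female)
```

The comparison `log_pf > log_pm` gives the same prediction as comparing the products, since the logarithm is monotonic.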

In [15]:
# Numerator of the posterior if the unclassified observation is a male
PM = P_male * \
p_x_given_y(person['Height'][0], male_height_mean, male_height_variance) * \
p_x_given_y(person['Weight'][0], male_weight_mean, male_weight_variance) * \
p_x_given_y(person['Foot_Size'][0], male_footsize_mean,
            male_footsize_variance)
PM

6.197071843878078e-09

In [16]:
# Numerator of the posterior if the unclassified observation is a female
PF = P_female * \
p_x_given_y(person['Height'][0], female_height_mean, female_height_variance) * \
p_x_given_y(person['Weight'][0], female_weight_mean, female_weight_variance) * \
p_x_given_y(person['Foot_Size'][0], female_footsize_mean,
            female_footsize_variance)
PF

0.0005377909183630018

Because the numerator of the posterior is greater for female than for male, we predict that the person is female.

In [17]:
## We can also calculate the posterior probability that the person
## is female (not just the predicted class):

PF/(PF + PM)

0.9999884769336502

## Your turn:

Now assume you get a bit more data: in addition to the training set
above, you also learn that

gender  height  weight  foot
male      5.92     202     9
female    4.44     120     9

Use the amended training set to classify the following persons:

height weight foot
4.8     140    8
5.1     150    8
6.1     200    12
