# Naive Bayes Classifier By Hand

We're going to build our own Naive Bayes classifier that predicts whether a person is male or female from their height, weight, and age.

For this, we'll use a custom dataset called `people.csv`, which you'll find in the `datasets/` folder.

In [1]:
# Standard scientific Python imports
import matplotlib.pyplot as plt

# This gives us dataframes which will allow us to build our custom
# Naive Bayes Classifier
import pandas as pd

# Standard numeric library that gives us optimized arrays and vectors
import numpy as np

In [2]:
dataset = "datasets/people.csv"

# We can read in a properly formatted CSV using this helper function.
# This reads it in as a pandas dataframe object, which gives us a lot
# of the same functionality we get in Excel like filtering, sorting, etc
df = pd.read_csv(dataset)

# head() returns the first 5 rows, which Jupyter notebook
# formats nicely for us.
df.head()

| | Height | Weight | Age | Gender |
|---|---|---|---|---|
| 0 | 151.765 | 47.825606 | 63.0 | male |
| 1 | 139.7 | 36.485807 | 63.0 | female |
| 2 | 136.525 | 31.864838 | 65.0 | female |
| 3 | 156.845 | 53.041915 | 41.0 | male |
| 4 | 145.415 | 41.276872 | 51.0 | female |


## Training Set and Test Set

Now, we're going to split our dataset into a training set and a test set. We want 70% of our data to go to training and 30% to testing.


In [3]:
# This is how many examples we have.
n_samples = df.shape[0]
print(n_samples)

# TODO: Compute train_size and test_size
train_size = int(0.7 * n_samples)
test_size = n_samples - train_size

print("Training set size: {}".format(train_size))
print("Test set size: {}".format(test_size))


543
Training set size: 380
Test set size: 163


In [4]:
# Create our test and train sets

# NOTE: This training set is a dataframe that includes both X and y.
# Our Naive Bayes classifier needs our training set to be like this, unlike scikit-learn's NB
df_train = df[:train_size]


# We create a temporary variable for our test set. This includes all features (Height, Weight, Age),
# as well as label (Gender). We want to split those out.
df_test_temp = df[train_size:]

# The test set's X just needs the following 3 columns
df_test_X = df_test_temp[["Height", "Weight", "Age"]]

# The test set's y just needs the Gender column
df_test_y = df_test_temp[["Gender"]]

print("Test X:")
print(df_test_X.head())

print("\nTest y:")
print(df_test_y.head())

# Now we can delete the temp dataframe we made earlier.
del df_test_temp

Test X:
      Height     Weight   Age
380   67.945   7.966209   1.0
381  135.890  27.215520  15.0
382  158.115  47.485413  45.0
383   85.090  10.801160   3.0
384   93.345  14.004653   3.0

Test y:
     Gender
380  female
381  female
382    male
383    male
384  female
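
The slices above take the first 70% of rows in file order. If `people.csv` happens to be sorted in some way (an assumption; the source does not say), the training and test sets could differ systematically. A minimal sketch of shuffling before splitting, using a small made-up dataframe in place of the real file; the names `toy`, `toy_train`, and `toy_test` are hypothetical and the `random_state` value is arbitrary:

```python
import pandas as pd

# Toy stand-in for people.csv (values are made up for illustration)
toy = pd.DataFrame({
    "Height": [150.0, 160.0, 170.0, 140.0, 155.0, 165.0, 145.0, 175.0, 158.0, 162.0],
    "Gender": ["female", "male"] * 5,
})

# frac=1 returns every row in random order; random_state makes it repeatable
shuffled = toy.sample(frac=1, random_state=0).reset_index(drop=True)

# Same 70/30 slicing as before, applied to the shuffled rows
train_size = int(0.7 * len(shuffled))
toy_train = shuffled[:train_size]
toy_test = shuffled[train_size:]

print(len(toy_train), len(toy_test))
```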



## Recap: Bayes Theorem

Bayes' theorem relates the probability of A given B to the probability of B given A, which lets us update a prediction based on data:

$$
P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}
$$

Specifically, we're trying to figure out the class (i.e. "male" or "female") of an observation _given_ its data:

$$
p(class \mid \mathbf{data}) = \frac{p(\mathbf{data} \mid class) * p(class)}{p(\mathbf{data})}
$$


where:
- class is a particular class (i.e. "male" or "female")
- $\mathbf{data}$ is an observation's data (the features)
- $p(class \mid \mathbf{data})$ is called the posterior
- $p(\mathbf{data} \mid class)$ is called the likelihood
- $p(class)$ is called the prior
- $p(\mathbf{data})$ is called the marginal probability
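
To make these terms concrete, here is a small worked example with made-up numbers (the likelihood values are hypothetical, not taken from `people.csv`). It shows that the marginal probability is the sum of likelihood times prior over all classes, which is why the posteriors add up to 1:

```python
# Hypothetical priors and likelihoods for one observation
prior = {"male": 0.45, "female": 0.55}
likelihood = {"male": 0.02, "female": 0.01}  # p(data | class), made up

# Marginal probability: sum over classes of likelihood * prior
marginal = sum(likelihood[c] * prior[c] for c in prior)

# Posterior for each class via Bayes' theorem
posterior = {c: likelihood[c] * prior[c] / marginal for c in prior}

for c, p in posterior.items():
    print("p({} | data) = {:.3f}".format(c, p))
```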

### Bayes Theorem Applied to Predicting "Male" or "Female"

$$
p(person\:is\:male \mid \mathbf{person's\:data}) = \frac{
p(\mathbf{person's\:data}\mid person\:is\:male) * p(person\:is\:male)
}{
p(\mathbf{person's\:data})
}
$$

#### More Specifically:
Let's factor in height, weight, and age:

$$
posterior(male) = \frac{
p(height \mid male)\,p(weight \mid male)\,p(age \mid male)\,p(male)
}{
\mathit{marginal\;probability}
}
$$


__Two things to note:__

1. We assume the features are independent of one another given the class. This conditional independence assumption is what makes Naive Bayes "naive". It may not hold in the real world (height and weight are clearly related), but let's stick with it and see what happens.

2. We assume that, within each class, the value of each feature (e.g. the heights of the women, the weights of the women) is normally (Gaussian) distributed. This means that $p(height \mid female)$ is calculated by plugging the required parameters into the probability density function of the normal distribution:

__WARNING__: Very mathy, but we'll just have one helper function do this for us.

$$
p(height \mid female) = \frac{1}{\sqrt{2\pi(\text{variance of female height in data})}}
\cdot e^{-\frac{
 (\text{observation's height} - \text{average height of females in the data})^2
}{
2*(\text{variance of female height in data})
}
}
$$




In [5]:
def p_x_given_y(x, mean_y, variance_y):
    """Calculate p(x | y): the normal density at x, given the mean and
    variance of that feature within class y."""

    # Plug the arguments into the normal probability density function
    p = 1/(np.sqrt(2*np.pi*variance_y)) * np.exp((-(x-mean_y)**2)/(2*variance_y))

    return p
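
As a sanity check, the formula can be compared against the normal density in Python's standard library. This sketch re-declares the same formula with `math` (under the hypothetical name `gaussian_pdf`) so that it runs on its own, and compares it with `statistics.NormalDist`, which takes a standard deviation rather than a variance. The input values are illustrative only:

```python
import math
from statistics import NormalDist

def gaussian_pdf(x, mean, variance):
    """Normal probability density at x for the given mean and variance."""
    return 1 / math.sqrt(2 * math.pi * variance) * math.exp(-(x - mean) ** 2 / (2 * variance))

# Example inputs (illustrative values)
x, mean, variance = 150.0, 145.0, 700.0

ours = gaussian_pdf(x, mean, variance)
# NormalDist expects sigma (standard deviation), so take the square root
reference = NormalDist(mu=mean, sigma=math.sqrt(variance)).pdf(x)

print(math.isclose(ours, reference))
```

Note that the result is a density, not a probability, so only its relative size between classes matters for the comparison.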

In [6]:
# Number of males
n_male = df_train["Gender"][df_train["Gender"] == "male"].count()
print("Number of males: {}".format(n_male))

# TODO: Get the number of females
n_female = df_train["Gender"][df_train["Gender"] == "female"].count()
print("Number of females: {}".format(n_female))

total_ppl = df_train["Gender"].count()
print("Total Population: {}".format(total_ppl))

Number of males: 173
Number of females: 207
Total Population: 380


In [7]:
# TODO: Calculate the priors p(male) and p(female) using the values above.
# This is the ratio of males to everyone, and females to everyone respectively.

p_male = n_male / total_ppl
print("p(male) = {}".format(p_male))

p_female = n_female / total_ppl
print("p(female) = {}".format(p_female))

p(male) = 0.45526315789473687
p(female) = 0.5447368421052632


In [8]:
# To use the function p_x_given_y(), we must compute the means and
# variances for each attribute for each class.

# Group the data by gender and calculate the mean of each feature
# by gender.
df_means = df_train.groupby("Gender").mean()

# View the values
df_means.head()

| Gender | Height | Weight | Age |
|---|---|---|---|
| female | 137.104113 | 34.44033 | 31.452174 |
| male | 145.095276 | 39.714644 | 32.237861 |


## Pandas Documentation
Refer to [pandas docs](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.var.html) for this next part.

In [9]:
# TODO: Group the data by gender and calculate the variance of each
# feature by gender.
df_vars = df_train.groupby("Gender").var()

# TODO: View the values
df_vars.head()

| Gender | Height | Weight | Age |
|---|---|---|---|
| female | 501.967466 | 156.951771 | 450.678527 |
| male | 729.129243 | 237.809556 | 452.875317 |
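
One detail worth knowing: pandas' `var()` defaults to the sample variance (`ddof=1`, dividing by n - 1) rather than the population variance (dividing by n). With a few hundred rows per class the difference is small. A short illustration of the two definitions using the standard library's `statistics` module on made-up heights:

```python
import statistics

heights = [150.0, 160.0, 155.0, 165.0, 170.0]  # made-up values

sample_var = statistics.variance(heights)       # divides by n - 1, like pandas' default
population_var = statistics.pvariance(heights)  # divides by n

print(sample_var, population_var)
```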


In [10]:
# Now we'll extract the mean and variance of each attribute from the
# above two tables. It looks like a ton of code but it's really just
# pulling from those tables.

# Means for male
male_height_mean = df_means['Height'][df_means.index == 'male'].values[0]
male_weight_mean = df_means['Weight'][df_means.index == 'male'].values[0]
male_age_mean = df_means['Age'][df_means.index == 'male'].values[0]

# Means for female
female_height_mean = df_means['Height'][df_means.index == 'female'].values[0]
female_weight_mean = df_means['Weight'][df_means.index == 'female'].values[0]
female_age_mean = df_means['Age'][df_means.index == 'female'].values[0]

In [11]:
# TODO: Compute the variances for males and females, respectively, for
# each attribute.

# Variances for male
male_height_var = df_vars['Height'][df_vars.index == 'male'].values[0]
male_weight_var = df_vars['Weight'][df_vars.index == 'male'].values[0]
male_age_var = df_vars['Age'][df_vars.index == 'male'].values[0]

# Variances for female
female_height_var = df_vars['Height'][df_vars.index == 'female'].values[0]
female_weight_var = df_vars['Weight'][df_vars.index == 'female'].values[0]
female_age_var = df_vars['Age'][df_vars.index == 'female'].values[0]
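
The same lookups can be written more compactly with label-based indexing. A sketch using `DataFrame.loc` on a small stand-in table (`toy_means` is a hypothetical name, and its numbers are rounded from the means table above):

```python
import pandas as pd

# Stand-in for df_means, indexed by class label
toy_means = pd.DataFrame(
    {"Height": [137.10, 145.10], "Weight": [34.44, 39.71], "Age": [31.45, 32.24]},
    index=pd.Index(["female", "male"], name="Gender"),
)

# .loc[row_label, column_label] returns the single value directly
male_height_mean = toy_means.loc["male", "Height"]

# .loc[row_label] returns the whole row, which unpacks in column order
female_height_mean, female_weight_mean, female_age_mean = toy_means.loc["female"]

print(male_height_mean, female_age_mean)
```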

## Apply Bayes Classifier to New Data Point

Now, when we get a new data point, all we have to do is extract its features, compute the numerator of Bayes' theorem for each label, and pick the label with the higher value.

In [12]:
first_X_df = df_test_X.iloc[0]
print(first_X_df)
print("\n=======================\n")

# Get the values out of the Series as a NumPy array
first_X = first_X_df.values
print("\nHeight\t\tWeight\tAge\n")
print(first_X)

# Unpack the 3 fields from that array.
height, weight, age = first_X

Height    67.945000
Weight     7.966209
Age        1.000000
Name: 380, dtype: float64



Height		Weight	Age

[67.945      7.9662095  1.       ]


In [13]:
first_y_df = df_test_y.iloc[0]
print(first_y_df)
print("\n=======================\n")


# Get the values out of the Series as a NumPy array
first_y = first_y_df.values
print("Gender")

actual_y = first_y[0]
print(actual_y)

Gender    female
Name: 380, dtype: object


Gender
female


In [14]:
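# Compute the conditional probabilities.
# NOTE: despite the variable names, these are the likelihoods
# p(feature | class), e.g. p(height | male), not p(male | height).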
p_male_given_height = p_x_given_y(height, male_height_mean, male_height_var)
p_male_given_weight = p_x_given_y(weight, male_weight_mean, male_weight_var)
p_male_given_age = p_x_given_y(age, male_age_mean, male_age_var)


# TODO: Compute p_female_given_height, p_female_given_weight, p_female_given_age
p_female_given_height = p_x_given_y(height, female_height_mean, female_height_var)
p_female_given_weight = p_x_given_y(weight, female_weight_mean, female_weight_var)
p_female_given_age = p_x_given_y(age, female_age_mean, female_age_var)

In [15]:
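# Now, we just need to compare the numerators of Bayes' theorem for male
# and female. The marginal probability is the same for both classes, so
# we can skip dividing by it.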

# TODO: Multiply out the 4 parts of the numerator here, separately for male and female
p_male_given_data =  p_male_given_height * p_male_given_weight * p_male_given_age * p_male
p_female_given_data = p_female_given_height * p_female_given_weight * p_female_given_age * p_female

print("p_male_given_data: {}".format(p_male_given_data))
print("p_female_given_data: {}".format(p_female_given_data))

p_male_given_data: 2.2519559927205e-09
p_female_given_data: 1.8975093485306007e-09
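
Products of many small densities can underflow to zero as the number of features grows. A common remedy is to compare sums of logarithms instead, since the logarithm is monotonic and preserves which class scores higher. A sketch with made-up values (not the ones computed in this notebook):

```python
import math

# Hypothetical per-feature likelihoods, followed by the prior, for each class
male_terms = [0.004, 0.003, 0.015, 0.455]
female_terms = [0.002, 0.004, 0.016, 0.545]

# Direct products
male_product = math.prod(male_terms)
female_product = math.prod(female_terms)

# Sums of logs give the same ordering without multiplying tiny numbers
male_log_score = sum(math.log(t) for t in male_terms)
female_log_score = sum(math.log(t) for t in female_terms)

print((male_product > female_product) == (male_log_score > female_log_score))
```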


In [16]:
prediction = ""
if p_male_given_data > p_female_given_data:
    prediction = "male"
else:
    prediction = "female"
    
print("Prediction: {}".format(prediction))
print("Actual: {}".format(actual_y))
print(prediction == actual_y)

Prediction: male
Actual: female
False
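
The two numbers compared here are only the numerators of Bayes' theorem, so they do not sum to 1. Because the marginal probability is the same for both classes, dividing each numerator by their sum recovers the actual posteriors. A sketch using the two values printed for this test row:

```python
# Numerators printed for the first test row
male_numerator = 2.2519559927205e-09
female_numerator = 1.8975093485306007e-09

# The marginal is the sum of the numerators over both classes
marginal = male_numerator + female_numerator

posterior_male = male_numerator / marginal
posterior_female = female_numerator / marginal

print("p(male | data) = {:.3f}".format(posterior_male))
print("p(female | data) = {:.3f}".format(posterior_female))
```

The posteriors come out at roughly 0.54 and 0.46, so this incorrect prediction was a close call rather than a confident one.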


In [17]:
# TODO: Now, write a loop that loops over all rows in df_test_X and df_test_y and
# creates predictions like we did above, and track how many times we predict correctly.

# HINT: Store "True" or "False" for whether your prediction was right in a list, then count
# the "True"s and print out what percentage were correct.


matches = []


for idx in range(df_test_X.shape[0]):
    feats = df_test_X.iloc[idx].values
    height, weight, age = feats
    actual_y = df_test_y.iloc[idx].values[0]
    
    # Compute the likelihoods p(feature | class) for males and females
    p_male_given_height = p_x_given_y(height, male_height_mean, male_height_var)
    p_male_given_weight = p_x_given_y(weight, male_weight_mean, male_weight_var)
    p_male_given_age = p_x_given_y(age, male_age_mean, male_age_var)

    p_female_given_height = p_x_given_y(height, female_height_mean, female_height_var)
    p_female_given_weight = p_x_given_y(weight, female_weight_mean, female_weight_var)
    p_female_given_age = p_x_given_y(age, female_age_mean, female_age_var)
    
    p_male_given_data =  p_male_given_height * p_male_given_weight * p_male_given_age * p_male
    p_female_given_data = p_female_given_height * p_female_given_weight * p_female_given_age * p_female

    prediction = ""
    if p_male_given_data > p_female_given_data:
        prediction = "male"
    else:
        prediction = "female"

    match = prediction == actual_y
    # We can turn a boolean into an integer 1 if True or 0 if False using the 
    # int() function
    match_int = int(match)
    matches.append(match_int)
    
    
print(sum(matches) / len(matches))
    

0.6196319018404908
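
The loop body can also be packaged as a reusable function. The sketch below is one way to do it, with hypothetical names (`gaussian_pdf`, `predict`, `params`, `priors`) and statistics rounded from the tables and priors computed earlier; it is not the notebook's own code:

```python
import math

def gaussian_pdf(x, mean, variance):
    """Normal probability density at x."""
    return 1 / math.sqrt(2 * math.pi * variance) * math.exp(-(x - mean) ** 2 / (2 * variance))

# Per-class (mean, variance) for each feature, rounded from the tables above
params = {
    "male": {"Height": (145.10, 729.13), "Weight": (39.71, 237.81), "Age": (32.24, 452.88)},
    "female": {"Height": (137.10, 501.97), "Weight": (34.44, 156.95), "Age": (31.45, 450.68)},
}
# Priors rounded from p(male) and p(female)
priors = {"male": 0.455, "female": 0.545}

def predict(features):
    """Return the class with the largest likelihood * prior."""
    scores = {}
    for label, stats in params.items():
        score = priors[label]
        for name, value in features.items():
            mean, variance = stats[name]
            score *= gaussian_pdf(value, mean, variance)
        scores[label] = score
    return max(scores, key=scores.get)

# First test row from earlier in the notebook
print(predict({"Height": 67.945, "Weight": 7.966209, "Age": 1.0}))
```

Keeping the statistics in a dictionary keyed by class means a third feature or class can be added without writing new variables for each combination.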
