## C S 329E HW 7

# Naive Bayes 

## Jeremy Ulfohn, Pair 32

For this week's homework we are going to explore one new classification technique:

  - Naive Bayes

We are reusing the version of the Melbourne housing data set from HW6, to predict the housing type as one of three possible categories:

  - 'h' house
  - 'u' duplex
  - 't' townhouse

In addition to building our own Naive Bayes classifier, we are going to compare the performance of our classifier to the [Gaussian Naive Bayes Classifier](https://scikit-learn.org/stable/modules/naive_bayes.html#gaussian-naive-bayes) available in the scikit-learn library. 

In [1]:
# These are the libraries you will use for this assignment
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
import calendar
from sklearn.naive_bayes import GaussianNB # The only thing in scikit-learn you can use this assignment

# Starting off loading a training set and setting a variable for the target column, "Type"
df_melb = pd.read_csv('https://gist.githubusercontent.com/yanyanzheng96/81b236aecee57f6cf65e60afd865d2bb/raw/56ddb53aa90c26ab1bdbfd0b8d8229c8d08ce45a/melb_data_train.csv')
target_col = 'Type'


# helper function for Q1
# INPUT: year, month = y, m
# OUTPUT: number of days in that month
def getDays(y, m):
    # monthrange returns (weekday of the first day, number of days in the month)
    return calendar.monthrange(y, m)[1]

## Q1 - Fix a column of data to be numeric
If we inspect our dataframe, `df_melb`, using the `dtypes` attribute, we see that the column "Date" is an object.  However, we think this column might contain useful information, so we want to convert it to [seconds since epoch](https://en.wikipedia.org/wiki/Unix_time). Use only the existing imported libraries to create a new column "unixtime". Be careful, the date strings in the file might have some non-uniform formatting that you have to fix first.  Print out the min and max epoch time to check your work.  Drop the original "Date" column. 

THESE ARE THE EXACT SAME INSTRUCTIONS FROM HW6! Please take this opportunity to reuse your code (if you got it right last time). 
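As a point of comparison, here is a minimal sketch of a date-to-epoch conversion that uses only `calendar`, which is already imported. The helper name `date_to_unixtime` is hypothetical. The day/month/year order and the handling of two-digit years (read as 20YY) are assumptions about the file's formatting, not something the assignment states.

```python
import calendar

def date_to_unixtime(d):
    """Hypothetical helper: 'D/M/YYYY' or 'D/M/YY' string -> seconds since epoch (UTC)."""
    day, month, year = (int(part) for part in d.strip().split("/"))
    if year < 100:  # assumption: a two-digit year means 20YY
        year += 2000
    # timegm interprets the tuple (year, month, day, hour, minute, second) as UTC
    return calendar.timegm((year, month, day, 0, 0, 0))

print(date_to_unixtime("1/1/1970"))   # midnight on the epoch date
print(date_to_unixtime("3/12/16") == date_to_unixtime("3/12/2016"))
```

Because the result depends only on the date string, the min and max of the new column stay the same no matter when the notebook is run.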

In [2]:
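# standardize_date accepts a date string as shown in the df_melb 'Date' column
# (day/month/year) and returns a numeric timestamp in seconds
def standardize_date(d):
    # get the current time in seconds since the epoch
    ticks = time.time()

    datelist = d.split("/") # split on slashes: [day, month, year]
    # day of the month plus the number of days in that month
    total_days = int(datelist[0]) + getDays(int(datelist[2]), int(datelist[1]))
    # subtract that many days (in seconds) from the current time
    # NOTE: the result depends on when this cell is run, so it is not a true epoch timestamp
    fixed_time = ticks - total_days * 3600 * 24
    return fixed_time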


In [3]:
#df_melb['Date'] = df_melb['Date'].apply( standardize_date )
#df_melb['unixtime'] = # your code here
#df_melb = df_melb.drop(columns="Date")

# version from HWK6 that does the same thing:
df_melb['Date'] = df_melb['Date'].apply( lambda x : standardize_date(x))
df_melb = df_melb.rename(columns={'Date': 'unixtime'}) # effectively drop 'Date' col


## Q2 Calculating the prior probabilities
Calculate the prior probabilities for each possible "Type" in `df_melb` and populate a dictionary, `dict_priors`, where the key is the possible "Type" values and the value is the prior probabilities. Show the dictionary. Do not hardcode the possible values of "Type".  Don't forget about [value counts](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html). 

In [4]:
# create dict of prior probabilities for the 3 'type' values
dict_priors = df_melb.groupby('Type').size().div(len(df_melb)).to_dict()

# show dict
dict_priors

{'h': 0.452, 't': 0.13, 'u': 0.418}
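To check the idea by hand, the sketch below computes priors on a small made-up label list using only the standard library. The `toy_labels` values are hypothetical and are not taken from `df_melb`. The same counts-divided-by-total logic is what `value_counts(normalize=True)` does for a pandas Series.

```python
from collections import Counter

# Hypothetical toy labels, not taken from df_melb
toy_labels = ['h', 'h', 'u', 't', 'h', 'u']

counts = Counter(toy_labels)  # number of rows per class
toy_priors = {label: count / len(toy_labels) for label, count in counts.items()}

print(toy_priors)
print(sum(toy_priors.values()))  # priors should sum to 1 (up to rounding)
```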

## Q3 Create a model for the distribution of all of the numeric attributes
For each class, and for each attribute calculate the sample mean and sample standard deviation.  You should store the model in a nested dictionary, `dict_nb_model`, such that `dict_nb_model['h']['Rooms']` is a tuple containing the mean and standard deviation for the target Type 'h' and the attribute 'Rooms'.  Show the model using the `display` function. You should ignore entries that are `NaN` in the mean and [standard deviation](https://pandas.pydata.org/docs/reference/api/pandas.Series.std.html) calculation. 
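For one attribute of one class, the calculation can be sketched with the standard library as follows. The helper name `mean_std_ignore_nan` and the toy values are hypothetical. `statistics.stdev` divides by n - 1, which matches the default of pandas `Series.std`.

```python
import math
import statistics

def mean_std_ignore_nan(values):
    """Hypothetical helper: sample mean and sample std (n - 1) over the non-NaN values."""
    clean = [v for v in values if not math.isnan(v)]
    return statistics.mean(clean), statistics.stdev(clean)

# Toy column with one missing entry
mu, sigma = mean_std_ignore_nan([2.0, 4.0, float('nan'), 6.0])
print(mu, sigma)
```

The missing entry is dropped before either statistic is computed, so it affects neither the sum nor the count.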

In [5]:
dict_nb_model = dict() # dict to which we add one inner dict per class

for target in dict_priors.keys():
    # rows of this class only, without the target column
    subset = df_melb[df_melb[target_col] == target].drop(columns=[target_col])
    # NOTE: mean() and std() for pd.Series objects skip NaN by default
    dict_nb_model[target] = {col: (subset[col].mean(), subset[col].std()) for col in subset}

display(dict_nb_model['t']['BuildingArea'])


(138.66666666666666, 53.498637054290135)

In [6]:
display(dict_nb_model)

{'h': {'Bathroom': (1.5619469026548674, 0.6720871086493075),
  'BuildingArea': (156.24339622641511, 54.62662837301434),
  'Car': (1.7777777777777777, 0.932759177140425),
  'Distance': (12.086725663716809, 7.397501132737295),
  'Landsize': (932.9646017699115, 3830.7934157687164),
  'Postcode': (3103.8982300884954, 98.35750345419703),
  'Price': (1189022.3451327435, 586296.5794417895),
  'Rooms': (3.269911504424779, 0.7258264201127756),
  'YearBuilt': (1954.900826446281, 32.461876347154686),
  'unixtime': (1630525969.6574244, 718336.6498583388)},
 't': {'Bathroom': (1.8461538461538463, 0.565430401076506),
  'BuildingArea': (138.66666666666666, 53.498637054290135),
  'Car': (1.6923076923076923, 0.5280588545286915),
  'Distance': (10.766153846153845, 4.870455475462387),
  'Landsize': (268.18461538461537, 276.57700624711265),
  'Postcode': (3121.6153846153848, 100.01588816090862),
  'Price': (1000169.2307692308, 421822.5363389935),
  'Rooms': (2.9076923076923076, 0.6052653582075831),
  'Yea

## Q4 Write a function that calculates the probability of a Gaussian
Given the mean ($\mu$), standard deviation ($\sigma$), and an observed point, `x`, return the probability density at `x`.  
Use the formula $p(x) = \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2}$ ([wiki](https://en.wikipedia.org/wiki/Normal_distribution)).  You should use [numpy's exp](https://numpy.org/doc/stable/reference/generated/numpy.exp.html) function in your solution. 

In [7]:
# Gaussian density: exp(-0.5 * ((x - mu) / sigma)^2) / (sigma * sqrt(2 * pi))
def get_p( mu, sigma, x):
    return np.exp(-0.5 * (((x - mu)/sigma)) ** 2) / (sigma * np.sqrt(2*np.pi))

In [46]:
# Test it
p = get_p( 0, 2, 0.5)
p

0.19333405840142465
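The test value above can be reproduced with the standard library alone. The sketch below uses a hypothetical `gaussian_pdf` that mirrors the formula from Q4. It also shows that the result is a density rather than a probability, since it can exceed 1 when $\sigma$ is small.

```python
import math

def gaussian_pdf(mu, sigma, x):
    """Hypothetical stdlib version of the Q4 formula."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z ** 2) / (sigma * math.sqrt(2 * math.pi))

print(gaussian_pdf(0, 2, 0.5))   # about 0.1933, as in the test cell
print(gaussian_pdf(0, 0.1, 0))   # greater than 1: a density, not a probability
```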

## Q5 Write the Naive Bayes classifier function
The Naive Bayes classifier function, `nb_class`, should take as parameters the prior probability dictionary, `dict_priors`, the dictionary containing all of the Gaussian distribution information for each attribute, `dict_nb_model`, and a single observation row (a series generated from iterrows) of the test dataframe. It should return a single target classification. For this problem, all of our attributes are numeric and modeled as Gaussians, so we don't worry about categorical data. Make sure to skip attributes that do not have a value in the observation.  Do not hardcode the possible classification types. 

In [47]:
def nb_class( dict_priors, dict_nb_model, observation):
    # for X in observation, calculate P(h|X), P(t|X), P(u|X) and go with the maximum
    # naïve assumption ex: P(X | h) = P(price | h) * ... * P(yearBuilt | h)
    # to not hardcode: build a new dict keyed by the classes found in dict_priors,
    # with every value initialized to 1 (a new dict, so dict_priors is not overwritten)
    dict_class_results = {key: 1.0 for key in dict_priors}

    # example: dict_nb_model['h']['Rooms'] = (mu, std)

    obs_nan = observation.isna()

    for idx, value in observation.items():
        if obs_nan[idx]: # skip attributes with no value in this observation
            continue

        for label in dict_class_results.keys(): # do this for each class
            mu, sigma = dict_nb_model[label][idx] # (mean, std) tuple
            dict_class_results[label] *= get_p(mu, sigma, value)

    for label in dict_class_results.keys(): # multiply the current P(X|C)'s by P(C)
        dict_class_results[label] *= dict_priors[label]

    # dict_class_results is now proportional to P(C|X); return the key with the largest value
    prediction = max(dict_class_results, key=dict_class_results.get)
    return prediction
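Multiplying many small densities can underflow to zero when there are many attributes. A common variant, which is not what the assignment asks for, sums log-densities instead. The sketch below assumes a plain dictionary as the observation and a hypothetical two-class toy model. The names `log_gaussian` and `nb_class_log` are illustrative only.

```python
import math

def log_gaussian(mu, sigma, x):
    """Log of the Gaussian density from Q4."""
    return -math.log(sigma * math.sqrt(2 * math.pi)) - 0.5 * ((x - mu) / sigma) ** 2

def nb_class_log(priors, model, observation):
    """Hypothetical log-space classifier: argmax of log P(C) + sum of log p(x_i | C)."""
    scores = {}
    for label, prior in priors.items():
        score = math.log(prior)
        for attr, value in observation.items():
            if isinstance(value, float) and math.isnan(value):
                continue  # skip attributes with no value
            mu, sigma = model[label][attr]
            score += log_gaussian(mu, sigma, value)
        scores[label] = score
    return max(scores, key=scores.get)

# Toy model: equal priors, one attribute, same spread in both classes
toy_priors = {'h': 0.5, 'u': 0.5}
toy_model = {'h': {'Rooms': (3.0, 1.0)}, 'u': {'Rooms': (1.0, 1.0)}}

print(nb_class_log(toy_priors, toy_model, {'Rooms': 2.8}))  # 'h': 2.8 is closer to the 'h' mean
```

Since the logarithm is monotonic, the class with the largest log-score is also the class with the largest product, so the prediction is unchanged whenever the product does not underflow.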


    

    

## Q6 Calculate the accuracy using Naive Bayes classifier function on the test set
Load the test set from file, convert date to unix time and drop the date column, classify each row using your `nb_class`, and then show the accuracy on the test set. 

In [10]:
df_test = pd.read_csv('https://gist.githubusercontent.com/yanyanzheng96/c3d53303cebbd986b166591d19254bac/raw/94eb3b2d500d5f7bbc0441a8419cd855349d5d8e/melb_data_test.csv')
#df_test['Date'] = df_test['Date'].apply( standardize_date )
#df_test['unixtime'] = # your code here
#df_test = df_test.drop(columns="Date")

df_test['Date'] = df_test['Date'].apply( lambda x : standardize_date(x))
df_test = df_test.rename(columns={'Date': 'unixtime'}) # effectively drop 'Date' col, as above/in HWK6
# create y_Test, then remove 'Type' from df_test

y_test = df_test[target_col]
df_test = df_test.drop(columns=[target_col]) # or use inplace=True
X_test = df_test # for later use


In [48]:
predictions = []

for (indx,row) in df_test.iterrows(): # iterate through observations (single rows)
    predictions.append(nb_class(dict_priors, dict_nb_model, row))
    

In [53]:
# Objective: compare predictions to y_test
total = len(predictions)

# count the predictions that match the true label, pairing them up in row order
correct = sum(pred == actual for pred, actual in zip(predictions, y_test))

acc = correct / total

In [54]:
print('Accuracy is {:.2f}%'.format(acc*100))

Accuracy is 48.00%
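For reference, accuracy is the fraction of predictions that equal the true label. A minimal sketch on made-up labels follows. The `accuracy` helper and the toy lists are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    pairs = list(zip(y_true, y_pred))
    return sum(truth == pred for truth, pred in pairs) / len(pairs)

# Toy example: three of the four predictions are correct
toy_true = ['h', 'u', 't', 'h']
toy_pred = ['h', 'u', 'h', 'h']
print('Accuracy is {:.2f}%'.format(accuracy(toy_true, toy_pred) * 100))
```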


## Use scikit-learn to do the same thing!

Now that we understand the inner workings of the Naive Bayes algorithm, let's compare our results to [scikit-learn's Naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html) implementation. Use the [GaussianNB](https://scikit-learn.org/stable/modules/naive_bayes.html#gaussian-naive-bayes) to train using the `df_melb` dataframe and test using the `df_test` dataframe. Remember to split `df_melb` into a `df_X` with the numerical attributes, and a `s_y` with the target column. On the `df_melb` frame you will have to fill the empty attributes via imputation since the scikit-learn library cannot handle missing values.  Use the same method you used in the last homework (filling the training data with the mean of the non-NaN values). 

Answer the following in a markdown cell: do you think imputation hurt or helped the classifier?
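The imputation step can be sketched on a single toy column as follows. The helper names and values are hypothetical. The point the sketch illustrates is that the fill value is computed from the training data only and then reused for the test data.

```python
import math

def column_mean(values):
    """Mean of the non-NaN values in a column."""
    clean = [v for v in values if not math.isnan(v)]
    return sum(clean) / len(clean)

def fill_missing(values, fill_value):
    """Replace NaN entries with fill_value."""
    return [fill_value if math.isnan(v) else v for v in values]

# Toy train and test columns, each with one missing entry
toy_train = [100.0, float('nan'), 140.0]
toy_test = [float('nan'), 90.0]

train_mean = column_mean(toy_train)              # computed from the training data only
train_filled = fill_missing(toy_train, train_mean)
test_filled = fill_missing(toy_test, train_mean)  # the test data reuses the training mean

print(train_filled, test_filled)
```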

In [16]:
# fill in df_melb missing values with HW6 imputation
# ... since scikit-learn cannot handle missing values

# Imputation training: create and populate dict
dict_imputation = dict()
for col in list(df_melb.columns):
    if col == 'Type':
        continue
    else:
        dict_imputation[col] = df_melb[col].mean()

    # in same loop, fill in NaN with calculated mean (using .fillna() )
    df_melb = df_melb.fillna({col:dict_imputation[col]})


In [17]:
# Separate the attributes from the target_col
# split df_melb (TRAIN) into df_X and s_y. this is already done for test (X_test, y_test)
s_y = df_melb[target_col]
df_X = df_melb.drop(columns=[target_col])
    
# Imputation - now apply it on the test data (called X_test)
# we use the means from the TRAIN data for the test imputation, avoiding data leakage
for col in list(X_test.columns):
    X_test = X_test.fillna({col:dict_imputation[col]})

In [18]:
gnb = GaussianNB()
y_pred = gnb.fit(df_X, s_y).predict(X_test) # fit w train, predict X_test

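In [19]:
acc = (y_test == y_pred).sum()/X_test.shape[0] # fraction of predictions that match the true label
print('Accuracy is {:.2f}%'.format(acc*100))

Accuracy is 45.00%


With mean imputation, scikit-learn's GaussianNB classifies 45% of the test set correctly, compared with the 48% reported above for our own classifier, which skips missing attributes instead of imputing them. The cell originally compared with `!=`, so the 55% it printed was the share of misclassified rows rather than the accuracy. A 3-point gap matters when dealing with a large number of individual residential structures (as in the `df_melb` case), where it could amount to hundreds of misclassifications. On these numbers, I would say that imputation hurt rather than helped.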