# DASC 521 - Introduction to Machine Learning

In this homework, we will try to implement a Naive Bayes classifier for an image dataset which contains faces of different people to predict the gender of given person. First we will begin with importing necessary libraries.

#### Importing Libraries

In [149]:
import pandas as pd
import numpy as np
from sklearn.metrics import confusion_matrix

Now let's import image data set and label data from csv files. We will use Pandas' read_csv function.

#### Reading data from file

In [34]:
data_set = pd.read_csv('./hw01_images.csv', header=None)
labels = pd.read_csv('./hw01_labels.csv', header=None)

#### Obserbing and Preparing Data

Let's take a look to out dataset. 

In [35]:
data_set

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,4086,4087,4088,4089,4090,4091,4092,4093,4094,4095
0,0.294118,0.325490,0.325490,0.290196,0.317647,0.298039,0.294118,0.250980,0.235294,0.250980,...,0.160784,0.160784,0.156863,0.152941,0.149020,0.152941,0.160784,0.164706,0.156863,0.149020
1,0.431373,0.423529,0.470588,0.498039,0.509804,0.556863,0.635294,0.662745,0.670588,0.650980,...,0.149020,0.141176,0.145098,0.141176,0.133333,0.145098,0.152941,0.137255,0.129412,0.145098
2,0.301961,0.294118,0.254902,0.207843,0.192157,0.196078,0.184314,0.168627,0.188235,0.250980,...,0.149020,0.149020,0.156863,0.160784,0.149020,0.145098,0.145098,0.145098,0.149020,0.145098
3,0.188235,0.207843,0.227451,0.215686,0.223529,0.203922,0.176471,0.133333,0.078431,0.113725,...,0.603922,0.639216,0.658824,0.686275,0.694118,0.694118,0.694118,0.698039,0.705882,0.701961
4,0.474510,0.454902,0.466667,0.466667,0.470588,0.498039,0.552941,0.552941,0.552941,0.549020,...,0.172549,0.168627,0.168627,0.180392,0.156863,0.188235,0.172549,0.184314,0.156863,0.164706
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
395,0.380392,0.400000,0.407843,0.372549,0.364706,0.364706,0.372549,0.388235,0.403922,0.403922,...,0.160784,0.160784,0.145098,0.160784,0.160784,0.145098,0.172549,0.172549,0.156863,0.129412
396,0.349020,0.345098,0.345098,0.349020,0.345098,0.341176,0.345098,0.333333,0.317647,0.325490,...,0.454902,0.419608,0.380392,0.337255,0.349020,0.403922,0.454902,0.482353,0.505882,0.521569
397,0.474510,0.466667,0.443137,0.447059,0.462745,0.474510,0.490196,0.505882,0.529412,0.549020,...,0.168627,0.160784,0.156863,0.156863,0.152941,0.164706,0.164706,0.160784,0.176471,0.180392
398,0.203922,0.192157,0.200000,0.215686,0.239216,0.266667,0.286275,0.286275,0.286275,0.270588,...,0.501961,0.474510,0.458824,0.466667,0.501961,0.533333,0.552941,0.560784,0.564706,0.572549


Now let's take a look at labels.

In [143]:
labels

Unnamed: 0,4096
0,2
1,2
2,2
3,2
4,2
...,...
395,2
396,2
397,2
398,2


For convenient usage, we will change our column header as "4096" so that we can concatenate data_set and labels.

In [36]:
labels = labels.rename(columns = {0: 4096})
concatted_df = pd.concat([data_set, labels], axis = 1)

Now we need to split data frame into two parts: train and test. We will assign first half of total entry for training, and last half for testing.

In [37]:
train = concatted_df[0:200]
test = concatted_df[200:400]

#### Estimating Parameters and Creating Classification Model

After creating train and test dataset, we will predict our parameters by using train dataset.

In [85]:
mean = pd.DataFrame([train[range(0, 4096)][train[4096] == 1].mean(), train[range(0, 4096)][train[4096] == 2].mean()])
std_dev = pd.DataFrame([train[range(0, 4096)][train[4096] == 1].std(), train[range(0, 4096)][train[4096] == 2].std()])
prior_prob = pd.DataFrame([train[4096][train[4096] == 1].size/train[4096].size, train[4096][train[4096] == 2].size/train[4096].size])

Now let's take a look at mean, deviation and prior probabilities.

In [144]:
mean

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,4086,4087,4088,4089,4090,4091,4092,4093,4094,4095
0,0.379608,0.398235,0.411961,0.428824,0.428235,0.421961,0.430392,0.447843,0.46549,0.488431,...,0.245686,0.237843,0.233137,0.239216,0.243529,0.24,0.242353,0.245294,0.250392,0.256667
1,0.390218,0.394423,0.400871,0.406427,0.410828,0.416819,0.424597,0.433355,0.43878,0.443072,...,0.265664,0.26342,0.259891,0.256122,0.258758,0.260937,0.26939,0.273682,0.280109,0.27841


In [145]:
std_dev

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,4086,4087,4088,4089,4090,4091,4092,4093,4094,4095
0,0.148031,0.152713,0.164403,0.177348,0.187419,0.19443,0.19263,0.187073,0.186642,0.188731,...,0.179291,0.179453,0.179927,0.18582,0.192234,0.193772,0.189207,0.186118,0.185473,0.184296
1,0.171043,0.173346,0.175414,0.177754,0.180978,0.184735,0.187504,0.188161,0.187698,0.183371,...,0.135867,0.136483,0.135303,0.131609,0.13493,0.140403,0.146043,0.149788,0.156379,0.156128


In [146]:
prior_prob

Unnamed: 0,0
0,0.1
1,0.9


Now it's time to create a classification model by using estimated parameters. To do this, we will derive score functions with estimated parameters. Then we will check for scores for each dataset in each class. After that, we will classify data points by higher score value. Higher one will be the estimated label.

#### Results for Train Data

In [147]:
score_val_1 = -0.5 * np.log(2 * np.pi * std_dev.iloc[0,:]**2) -\
    0.5 * (train[range(0, 4096)] - mean.iloc[0,:])**2 / std_dev.iloc[0,:]**2

score_val_2 = -0.5 * np.log(2 * np.pi * std_dev.iloc[1,:]**2) -\
    0.5 * (train[range(0, 4096)] - mean.iloc[1,:])**2 / std_dev.iloc[1,:]**2

score_val = pd.concat([score_val_1.sum(axis=1), score_val_2.sum(axis=1)], axis=1)
score_val[0] = score_val[0].add(np.log(prior_prob.iloc[0,0]))
score_val[1] = score_val[1].add(np.log(prior_prob.iloc[1,0]))

Let's observe score values.

In [148]:
score_val

Unnamed: 0,0,1
0,2231.396994,2099.151800
1,2399.915351,1586.537111
2,2224.616661,2126.652747
3,-2208.243098,-54.885141
4,2021.643892,1769.523364
...,...,...
195,3067.881940,3525.608345
196,2260.250446,2981.503206
197,2486.463765,3010.621384
198,2641.158568,3351.400566


We are incrementing labels by one since they are 0 and 1. Our original dataset contains labels as 1 and 2.

In [150]:
print(confusion_matrix(train[4096], score_val.idxmax(axis=1)+1))

[[ 18   2]
 [ 24 156]]


Results look okay for train data. But label 1 has so little amount of data points.

#### Results for Test Data

In [151]:
score_val_1 = -0.5 * np.log(2 * np.pi * std_dev.iloc[0,:]**2) -\
    0.5 * (test[range(0, 4096)] - mean.iloc[0,:])**2 / std_dev.iloc[0,:]**2

score_val_2 = -0.5 * np.log(2 * np.pi * std_dev.iloc[1,:]**2) -\
    0.5 * (test[range(0, 4096)] - mean.iloc[1,:])**2 / std_dev.iloc[1,:]**2

score_val = pd.concat([score_val_1.sum(axis=1), score_val_2.sum(axis=1)], axis=1)
score_val[0] = score_val[0].add(np.log(prior_prob.iloc[0,0]))
score_val[1] = score_val[1].add(np.log(prior_prob.iloc[1,0]))

In [152]:
score_val

Unnamed: 0,0,1
200,81.004370,1093.799936
201,1967.603868,2116.542405
202,515.577316,1857.860382
203,855.984386,1190.477241
204,1162.669812,1692.962567
...,...,...
395,3329.641899,3590.547325
396,1230.543526,2056.829729
397,3145.815208,3426.560522
398,-233.723361,1039.774535


In [153]:
print(confusion_matrix(train[4096], score_val.idxmax(axis=1)+1))

[[ 16   4]
 [ 19 161]]


Also for test data, there are not many label 1 data points. But for label 2, classification looks okay. But dataset is not so healthy for consideration.