# Bayesian Model on MNIST data

<br>

- In this tutorial, we take the MNIST dataset and call it D0.

<br>

- We apply a 9-dimensional PCA projection to D0 and call the result D1.

<br>

- We apply a 9-dimensional Fisher discriminant (FDA) projection to D0 and call the result D2.

<br>

- We then build Bayesian classifiers on D1 (a single Gaussian per class):
  - with the full covariance matrix
  - with a diagonal covariance matrix (i.e. off-diagonal entries set to zero)

<br>

- We build the same two Bayesian classifiers on D2 (a single Gaussian per class, with full and with diagonal covariance).

- Finally, we compare the train and test accuracies of the four classifiers.
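
All the classifiers below apply the same decision rule. As a sketch in notation that does not appear in the notebook itself (here $x$ is a 9-dimensional projected sample, $P(c)$ the prior of class $c$, and $\mu_c$, $\Sigma_c$ the mean and covariance estimated from the training samples of class $c$), the predicted label is the class with the largest prior-weighted Gaussian density:

```latex
\hat{c}(x) = \arg\max_{c \in \{0,\dots,9\}} \; P(c)\,\mathcal{N}(x \mid \mu_c, \Sigma_c),
\qquad
\mathcal{N}(x \mid \mu, \Sigma) =
\frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}
\exp\!\left(-\tfrac{1}{2}(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\right),
\quad d = 9 .
```

The full and diagonal variants differ only in $\Sigma_c$: the diagonal variant keeps the per-feature variances and sets every off-diagonal entry to zero.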

## Import Packages 

In [1]:
import pandas as pd 
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.model_selection import train_test_split  # splitting dataset in train test data
from scipy.stats import multivariate_normal # used in gaussian pdf computation on mnist data.
from IPython.display import display

## Read MNIST data (D0 dataset)

In [2]:
D0 = pd.read_csv("train.csv") # import digits data
print("Top few rows of dataset:\n")
display(D0)

Top few rows of dataset:



Unnamed: 0,label,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41995,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
41996,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
41997,7,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
41998,6,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Implement PCA (D1 Dataset)

In [3]:
# Perform PCA on the MNIST data: project the 784 pixel features onto 9 principal components.

label_in_data = D0['label']
dataval = D0.drop(["label"], axis=1)

# Standardize the features, then compute the eigenvalues and eigenvectors of the covariance matrix.
scaledfeature = StandardScaler(copy=True, with_mean=True).fit_transform(dataval)  # shape is (42000, 784)
cvmat = np.cov(scaledfeature, rowvar=False, bias=False)  # 784 x 784 covariance matrix
eigenva, eigenve = np.linalg.eig(cvmat)  # eig() expects a square array; here it is the covariance matrix
# Pair each eigenvalue with its eigenvector (column k of eigenve) and sort by decreasing eigenvalue.
pairs_eigva_eigve = [(np.abs(eigenva[k]), eigenve[:, k]) for k in range(len(eigenva))]
sorting_eig_pairs = sorted(pairs_eigva_eigve, key=lambda pair: pair[0], reverse=True)

# Stack the 9 leading eigenvectors as the columns of a (784, 9) projection matrix.
stackedcomp = np.column_stack([sorting_eig_pairs[k][1] for k in range(9)])

# Project the scaled data onto the 9 components.
projecteddata = scaledfeature.dot(stackedcomp)

colnames = ["projection" + str(i) for i in range(9)]

D1 = pd.DataFrame(data=projecteddata, columns=colnames)
D1['label'] = label_in_data  # append the label column to the projected data frame

print("\nD1 dataset after PCA implementation:\n")
display(D1)  # 9 projections, with the label as the last column




D1 dataset after PCA implementation:



Unnamed: 0,projection0,projection1,projection2,projection3,projection4,projection5,projection6,projection7,projection8,label
0,-5.140478,-5.226445,3.887001,-0.901512,-4.929111,-2.035413,-4.706946,4.767184,-0.230958,1
1,19.292332,6.032996,1.308148,-2.383294,-3.095188,1.791095,3.772790,-0.153865,4.115192,0
2,-7.644503,-1.705813,2.289326,2.241135,-5.094426,4.152058,1.012004,-1.732559,-0.436261,1
3,-0.474207,5.836139,2.008617,4.271106,-2.377777,-2.179913,-4.398030,0.353712,-0.992308,4
4,26.559574,6.024818,0.933179,-3.012645,-9.489179,2.331195,6.149597,1.783637,4.123302,0
...,...,...,...,...,...,...,...,...,...,...
41995,13.678849,-1.350366,-3.957336,-5.379672,-10.875898,5.105523,-0.071920,5.084014,4.253677,0
41996,-8.869582,-1.187360,2.323167,1.528830,-5.798988,2.821950,0.351780,-0.529810,-0.992204,1
41997,0.495391,7.076277,-12.089700,-3.223278,-0.618203,-0.330449,2.128035,-10.535164,2.225962,7
41998,2.307240,-4.344513,0.699848,10.011222,5.586478,5.494875,-0.189789,-5.450360,-2.181693,6
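
The covariance matrix is symmetric, so one common alternative to `np.linalg.eig` is `np.linalg.eigh`, which returns real eigenvalues in ascending order. The sketch below shows that variant on a small synthetic matrix (the data, the sizes, and `k = 3` are made up for illustration; this is not the notebook's MNIST projection) and checks that the projected features are uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the scaled feature matrix: 500 samples, 6 correlated features.
X = rng.normal(size=(500, 6)) @ rng.normal(size=(6, 6))
X = X - X.mean(axis=0)

cov = np.cov(X, rowvar=False)
# eigh is intended for symmetric matrices: real eigenvalues, returned in ascending order.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]        # indices sorted by decreasing eigenvalue
k = 3
components = eigvecs[:, order[:k]]       # (6, k) projection matrix
projected = X @ components               # (500, k) projected data

# The projected features are uncorrelated; their variances are the top-k eigenvalues.
proj_cov = np.cov(projected, rowvar=False)
print(np.round(proj_cov, 6))
print(np.round(eigvals[order[:k]], 6))
```

Note that eigenvectors are only defined up to sign, so switching solvers can flip the sign of individual projection columns without changing the classifier built on them.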


## Implement Fisher Discriminant Analysis - FDA (D2 Dataset)

In [4]:
# FDA projection of the MNIST data: construct dataframe D2, a 9-dimensional projection of D0.

label_in_data = D0['label']  # labels
dataval = D0.drop(["label"], axis=1)  # features

digits_labels = np.array(label_in_data)  # array of class labels

# 'svd' is LDA's default solver. It does not compute the covariance matrix explicitly, which helps
# avoid a 'LinAlgError: Singular matrix' error while computing eigenvalues and eigenvectors.
# With 10 classes, LDA can produce at most 10 - 1 = 9 discriminant components.
classifier_fda = LDA(solver='svd', shrinkage=None, n_components=9)

fda_data = classifier_fda.fit_transform(dataval, digits_labels)

colnames = ["projection" + str(i) for i in range(9)]

D2 = pd.DataFrame(data=fda_data, columns=colnames)
D2['label'] = label_in_data  # append the label column to the projected data frame
print("\n")
print("D2 dataframe:\n")
display(D2)  # 9 projections, with the label as the last column



D2 dataframe:



Unnamed: 0,projection0,projection1,projection2,projection3,projection4,projection5,projection6,projection7,projection8,label
0,0.706982,3.702191,-0.546160,1.083590,-1.282057,-0.640238,-0.161646,0.711746,0.098052,1
1,-4.753373,-3.257093,-2.983682,-1.244001,-1.880934,-0.898564,0.114414,-1.097409,1.235790,0
2,0.426475,5.168707,-0.215028,0.248895,-3.737808,0.168903,0.546867,0.164058,-0.314798,1
3,-0.978410,-0.555503,1.147945,-0.324528,-0.997568,-0.858390,0.979497,1.846044,-0.207963,4
4,-4.878184,-3.244367,-4.723876,-0.850046,-1.923177,-2.093587,0.166724,-2.228554,0.999328,0
...,...,...,...,...,...,...,...,...,...,...
41995,-2.846328,-1.655570,-3.661467,0.658924,-1.528190,-0.680945,0.511808,-1.606290,-1.122931,0
41996,1.776876,4.591485,-0.370188,0.200660,-2.247352,-0.005172,1.093822,-0.204246,-0.427021,1
41997,3.078790,-2.807856,-3.082087,-2.657304,-1.939725,2.647739,-2.012444,0.094553,-0.028155,7
41998,-2.901593,-1.216585,4.178510,-1.426573,-1.557683,2.061918,1.035576,0.932003,-0.135465,6
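
The choice of 9 dimensions for FDA is not arbitrary: with C classes the between-class scatter matrix has rank at most C - 1, so 10 digit classes give at most 9 discriminant directions. A minimal numerical check of that bound, on synthetic data with made-up sizes (4 classes, 6 features) rather than MNIST:

```python
import numpy as np

rng = np.random.default_rng(1)
n_classes, n_features, n_per_class = 4, 6, 50

# Synthetic labelled data: each class is a Gaussian blob around its own random centre.
centres = rng.normal(scale=3.0, size=(n_classes, n_features))
X = np.vstack([rng.normal(loc=centres[c], size=(n_per_class, n_features))
               for c in range(n_classes)])
y = np.repeat(np.arange(n_classes), n_per_class)

overall_mean = X.mean(axis=0)
S_B = np.zeros((n_features, n_features))   # between-class scatter matrix
for c in range(n_classes):
    X_c = X[y == c]
    diff = (X_c.mean(axis=0) - overall_mean).reshape(-1, 1)
    S_B += len(X_c) * (diff @ diff.T)      # weighted outer product of the class-mean offset

rank = np.linalg.matrix_rank(S_B, tol=1e-8)
print(rank)  # never more than n_classes - 1
```

The bound follows because the weighted class-mean offsets sum to zero, so only C - 1 of them can be linearly independent.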


## Build a Bayesian classifier on D1 dataset (with full covariance matrix)


#### Data Preparation 

Split the D1 data into train and test sets.

In [5]:

X_gnb_D1 = D1.drop('label', axis=1) # entire data excluding class labels.
y_gnb_D1 =  D1['label'] # entire data with class labels only.
X_gnb_D1_train, X_gnb_D1_test, y_gnb_D1_train, y_gnb_D1_test = train_test_split(X_gnb_D1, y_gnb_D1, test_size=0.3, random_state= 11915048, shuffle = True)
print("shape of X_gnb_D1_train (predictors in train data):", X_gnb_D1_train.shape)
print("shape of X_gnb_D1_test (predictors in test data):", X_gnb_D1_test.shape)
print("shape of y_gnb_D1_train (response in train data):", y_gnb_D1_train.shape)
print("shape of y_gnb_D1_test (response in test data):", y_gnb_D1_test.shape)


"convert train and test datasets to arrays "
X_gnb_D1_train_arr = np.array(X_gnb_D1_train) # convert train data to array
y_gnb_D1_train_arr = np.array(y_gnb_D1_train)

X_gnb_D1_test_arr = np.array(X_gnb_D1_test)
y_gnb_D1_test_arr = np.array(y_gnb_D1_test)


shape of X_gnb_D1_train (predictors in train data): (29400, 9)
shape of X_gnb_D1_test (predictors in test data): (12600, 9)
shape of y_gnb_D1_train (response in train data): (29400,)
shape of y_gnb_D1_test (response in test data): (12600,)


#### Calculate prior probabilities

Compute the prior probability P(c) of each class label c (0 to 9) as its proportion in the training data.

In [6]:
c = y_gnb_D1_train.value_counts().to_dict() # we want to compute proportion of each class label in train data
print("Class labels data size in train data:\n", c.items())
#print(c[0], c[1])  # c[0] = 0, c[1] = 1..  which are class labels.
prior_prob = np.ones(10) # 10 since we have 10 class labels in our data.
for i in range(10):
    prior_prob[i] = c[i]/y_gnb_D1_train.shape[0]  #proportion of each class in training data(in ascending order).

prior_prob_list = list(prior_prob)
print("\nPrior Probabilities are:\n", prior_prob_list) # these are the prior probabilities of each class label . i.e. p(c1), p(c2).. p(c10) since we have 10 classes.
#prior_prob.sum()  # sum should be equal to 1.0

Class labels data size in train data:
 dict_items([(1, 3274), (3, 3061), (7, 3054), (2, 2968), (0, 2902), (9, 2874), (4, 2872), (6, 2864), (8, 2849), (5, 2682)])

Prior Probabilities are:
 [0.09870748299319727, 0.11136054421768707, 0.10095238095238095, 0.1041156462585034, 0.09768707482993197, 0.09122448979591836, 0.09741496598639456, 0.10387755102040816, 0.09690476190476191, 0.09775510204081633]
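
The same proportions can be computed without a dictionary lookup. The sketch below uses `np.bincount` on a hypothetical toy label vector (a stand-in for `y_gnb_D1_train`, not the real labels); position k of the result is the prior of label k, so the ordering does not depend on how `value_counts()` happens to sort.

```python
import numpy as np

# Hypothetical toy label vector standing in for the training labels.
y_train = np.array([0, 1, 1, 2, 2, 2, 0, 1, 2, 2])

counts = np.bincount(y_train, minlength=3)   # counts[k] = number of samples with label k
priors = counts / counts.sum()               # priors[k] = P(c = k)
print(priors)  # [0.2 0.3 0.5]
```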


#### Function to form the Gaussian distribution per class

In [7]:
# Estimate the mean and covariance of one class and build its Gaussian distribution.
# The mean vector of each class has 9 dimensions.
# The covariance matrix of each class has shape 9 x 9.
# The function returns a frozen scipy distribution holding the mean and covariance of that class.
def normdist(data):  # data: array of shape (n_samples, 9) for a single class
    var = np.cov(data.T)  # np.cov expects one variable per row, so transpose the data
    ma = np.mean(data, axis=0)  # per-feature mean, shape (9,)
    distribution = multivariate_normal(ma, var)
    return distribution
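
Densities in 9 dimensions can become very small, so a common variation is to work with log-densities (the frozen scipy distribution also offers a `logpdf` method). The function below is a minimal NumPy sketch of the Gaussian log-density, written for illustration; the name `gaussian_logpdf` is hypothetical and it is not part of the notebook.

```python
import numpy as np

def gaussian_logpdf(X, mean, cov):
    """Log-density of N(mean, cov) evaluated at each row of X (shape (n, d))."""
    d = mean.shape[0]
    diff = X - mean
    # Solve cov @ z = diff^T instead of forming the explicit inverse of cov.
    solved = np.linalg.solve(cov, diff.T).T
    mahalanobis_sq = np.sum(diff * solved, axis=1)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + mahalanobis_sq)

# Check against a closed-form value: a 2-D standard normal has density 1 / (2*pi)
# at the origin, so its log-density there is -log(2*pi).
X = np.array([[0.0, 0.0], [1.0, 0.0]])
logp = gaussian_logpdf(X, mean=np.zeros(2), cov=np.eye(2))
print(logp)
```

Adding the log of the prior to this value gives a score whose argmax over classes matches the argmax of prior times density.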

#### Predictions on train and test data

In [8]:
labels = sorted(list(c.keys()))
# 'c' is a dictionary
# keys: class labels (0 to 9); values: sample counts per class

predictedvalues_traindata = pd.DataFrame()  # empty dataframes to collect the per-class scores
predictedvalues_testdata = pd.DataFrame()

for i in range(len(labels)):  # one iteration per class label (not per column of D1)
    X_class = np.array(X_gnb_D1_train[y_gnb_D1_train == labels[i]])  # training rows of this class
    norm_dist = normdist(X_class)  # call normdist() to get this class's distribution
    predictions_class = prior_prob_list[i]*norm_dist.pdf(X_gnb_D1_train_arr)  # prior * likelihood on train data
    predictions_class1 = prior_prob_list[i]*norm_dist.pdf(X_gnb_D1_test_arr)  # prior * likelihood on test data
    predictedvalues_traindata = pd.concat([predictedvalues_traindata, 
                                           pd.DataFrame(predictions_class)], axis = 1)
    predictedvalues_testdata = pd.concat([predictedvalues_testdata, 
                                          pd.DataFrame(predictions_class1)], axis = 1)

# assign column names as class labels (0 to 9) 
predictedvalues_traindata.columns = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
predictedvalues_testdata.columns = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"] 

# for each row, take the column name (class label) with the largest score
predictedvalues_traindata = predictedvalues_traindata.idxmax(axis=1, skipna=True)

print("predicted class labels for observations in train data are::")
display(predictedvalues_traindata)

# for each row, take the column name (class label) with the largest score
predictedvalues_testdata = predictedvalues_testdata.idxmax(axis=1, skipna=True) 
print("predicted class labels for observations in test data are:")
display(predictedvalues_testdata)


    

predicted class labels for observations in train data are::


0        2
1        3
2        3
3        9
4        5
        ..
29395    7
29396    2
29397    2
29398    1
29399    1
Length: 29400, dtype: object

predicted class labels for observations in test data are:


0        4
1        6
2        7
3        5
4        3
        ..
12595    8
12596    1
12597    2
12598    3
12599    0
Length: 12600, dtype: object
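
The predictions have dtype object because the column names are strings, which is why the accuracy cell later converts them with astype(int). A possible alternative, sketched here on a hypothetical 3-class score matrix (the numbers are invented), is to keep the scores in a NumPy array and index an integer label array with the argmax:

```python
import numpy as np

labels = np.array([0, 1, 2])   # class labels, in the same order as the score columns
# Hypothetical unnormalised posteriors (prior * likelihood):
# one row per sample, one column per class.
scores = np.array([
    [0.02, 0.10, 0.01],
    [0.30, 0.05, 0.20],
    [0.00, 0.01, 0.04],
])

predicted = labels[np.argmax(scores, axis=1)]   # label of the highest-scoring class per row
print(predicted)  # [1 0 2]

# Dividing each row by its sum gives posterior probabilities without changing the argmax.
posteriors = scores / scores.sum(axis=1, keepdims=True)
print(posteriors.sum(axis=1))
```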

#### Model accuracy on train and test data

In [9]:
# Accuracy measurements

#train set accuracy:
trainaccuracy_gnb_D1 = pd.DataFrame(list(zip(list(predictedvalues_traindata),
                                             list(y_gnb_D1_train))), 
                                    columns =['predictedval_traindata', 'actuallabel_traindata'])

trainaccuracy_gnb_D1['predictedval_traindata'] = trainaccuracy_gnb_D1['predictedval_traindata'].astype(int)

# Keep only the rows where the predicted and actual labels agree.
trueevents = trainaccuracy_gnb_D1[trainaccuracy_gnb_D1.apply(lambda x: min(x) == max(x), 1)]
traindata_accuracy = (trueevents.shape[0]/y_gnb_D1_train.shape[0])*100
print("Accuracy on train data is:")
print(str(traindata_accuracy) + " %")


#test set accuracy-
testaccuracy_gnb_D1 = pd.DataFrame(list(zip(list(predictedvalues_testdata), 
                                            list(y_gnb_D1_test))), 
                                   columns =['predictedval_testdata', 'actuallabel_testdata'])

testaccuracy_gnb_D1['predictedval_testdata'] = testaccuracy_gnb_D1['predictedval_testdata'].astype(int)

# Keep only the rows where the predicted and actual labels agree.
trueevents1 = testaccuracy_gnb_D1[testaccuracy_gnb_D1.apply(lambda rowval: min(rowval) == max(rowval), 1)] 

testdata_accuracy = (trueevents1.shape[0]/y_gnb_D1_test.shape[0])*100

print("\nAccuracy on test data is:")
print(str(testdata_accuracy) + " %")

Accuracy on train data is:
84.5986394557823 %

Accuracy on test data is:
84.19047619047619 %
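
The row-wise min/max comparison above counts the rows where prediction and label agree. With integer arrays the same percentage can be obtained from an element-wise comparison; a small sketch on invented labels (not the MNIST results):

```python
import numpy as np

# Invented labels and predictions, for illustration only.
y_true = np.array([0, 1, 2, 2, 1, 0, 2, 1])
y_pred = np.array([0, 1, 2, 1, 1, 0, 2, 2])

n_correct = int(np.sum(y_pred == y_true))    # number of matching positions
accuracy = np.mean(y_pred == y_true) * 100   # percentage of matching labels
print(n_correct, accuracy)  # 6 75.0
```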


## Build a Bayesian classifier on D1 dataset (with diagonal covariance matrix)

#### Data preparation

In [10]:
X_gnb_D1 = D1.drop('label', axis=1) # entire data excluding class labels.
y_gnb_D1 =  D1['label'] # entire data with class labels only.
X_gnb_D1_train, X_gnb_D1_test, y_gnb_D1_train, y_gnb_D1_test = train_test_split(X_gnb_D1, y_gnb_D1, test_size=0.3, random_state= 11915048, shuffle = True)
print("shape of X_gnb_D1_train (predictors in train data):", X_gnb_D1_train.shape)
print("shape of X_gnb_D1_test (predictors in test data):", X_gnb_D1_test.shape)
print("shape of y_gnb_D1_train (response in train data):", y_gnb_D1_train.shape)
print("shape of y_gnb_D1_test (response in test data):", y_gnb_D1_test.shape)


"convert train and test datasets to arrays"
X_gnb_D1_train_arr = np.array(X_gnb_D1_train) # convert train data to array
y_gnb_D1_train_arr = np.array(y_gnb_D1_train)

X_gnb_D1_test_arr = np.array(X_gnb_D1_test)
y_gnb_D1_test_arr = np.array(y_gnb_D1_test)


shape of X_gnb_D1_train (predictors in train data): (29400, 9)
shape of X_gnb_D1_test (predictors in test data): (12600, 9)
shape of y_gnb_D1_train (response in train data): (29400,)
shape of y_gnb_D1_test (response in test data): (12600,)


#### Calculate prior probabilities

In [11]:
# we want to compute proportion of each class label in train data
c = y_gnb_D1_train.value_counts().to_dict() 

print("Class labels data size in train data:")
print(c.items())

prior_prob = np.ones(10) # 10 since we have 10 class labels in our data.
for i in range(10):
    prior_prob[i] = c[i]/y_gnb_D1_train.shape[0]  #proportion of each class in training data(in ascending order).

prior_prob_list = list(prior_prob)

# these are the prior probabilities of each class label 
#. i.e. p(c1), p(c2).. p(c10) since we have 10 classes.
#prior_prob.sum()  # sum should be equal to 1.0
print("\n")
print("Prior Probabilities are:")
print(prior_prob_list) 

Class labels data size in train data:
dict_items([(1, 3274), (3, 3061), (7, 3054), (2, 2968), (0, 2902), (9, 2874), (4, 2872), (6, 2864), (8, 2849), (5, 2682)])


Prior Probabilities are:
[0.09870748299319727, 0.11136054421768707, 0.10095238095238095, 0.1041156462585034, 0.09768707482993197, 0.09122448979591836, 0.09741496598639456, 0.10387755102040816, 0.09690476190476191, 0.09775510204081633]


#### Function to form the Gaussian distribution per class

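To obtain the diagonal covariance matrix, we apply np.diag() twice: the inner call extracts the diagonal of the covariance matrix (the per-feature variances), and the outer call builds a matrix with those variances on the diagonal and zeros everywhere else.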
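
As a small illustration on a made-up 3 x 3 covariance matrix (not taken from the MNIST data), np.diag() behaves differently for 2-D and 1-D input:

```python
import numpy as np

# Made-up symmetric covariance matrix, for illustration only.
cov = np.array([[4.0, 1.5, 0.2],
                [1.5, 3.0, 0.7],
                [0.2, 0.7, 2.0]])

variances = np.diag(cov)        # 2-D input: extracts the diagonal -> [4., 3., 2.]
diag_cov = np.diag(variances)   # 1-D input: builds a matrix with that diagonal, zeros elsewhere
print(diag_cov)
```

With a diagonal covariance the 9-dimensional Gaussian factorises into a product of independent one-dimensional Gaussians, which is the Gaussian naive Bayes assumption.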

In [12]:
# Estimate the mean and covariance of one class and build its Gaussian distribution.
# The mean vector of each class has 9 dimensions.
# The covariance matrix of each class has shape 9 x 9, with zeros off the diagonal.
# The function returns a frozen scipy distribution holding the mean and covariance of that class.

def normdist(data):  # data: array of shape (n_samples, 9) for a single class
    var = np.cov(data.T)  # np.cov expects one variable per row, so transpose the data
    var = np.diag(np.diag(var))  # keep the variances, set the off-diagonal entries to zero
    ma = np.mean(data, axis=0)  # per-feature mean, shape (9,)
    distribution = multivariate_normal(ma, var)
    return distribution

#### Predictions on train and test data

In [13]:
labels = sorted(list(c.keys()))
# 'c' is a dictionary.
# keys: class labels (0 to 9); values: sample counts per class

predictedvalues_traindata = pd.DataFrame()  # collects the per-class scores for the train data
predictedvalues_testdata = pd.DataFrame()   # collects the per-class scores for the test data
for i in range(len(labels)):  # one iteration per class label (not per column of D1)
    # subset the training data for this class label
    X_class = np.array(X_gnb_D1_train[y_gnb_D1_train == labels[i]])
    norm_dist = normdist(X_class)  # call normdist() to get the class distribution
    predictions_class = prior_prob_list[i]*norm_dist.pdf(X_gnb_D1_train_arr)  # prior * likelihood on train data
    predictions_class1 = prior_prob_list[i]*norm_dist.pdf(X_gnb_D1_test_arr)  # prior * likelihood on test data
    predictedvalues_traindata = pd.concat([predictedvalues_traindata, 
                                           pd.DataFrame(predictions_class)], axis = 1)
    predictedvalues_testdata = pd.concat([predictedvalues_testdata,
                                          pd.DataFrame(predictions_class1)], axis = 1)
    

# assign column names as class labels in our data
predictedvalues_traindata.columns = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"] 

# for each row, take the column name (class label) with the largest score
predictedvalues_traindata = predictedvalues_traindata.idxmax(axis=1, skipna=True) 

print("Train Set Predictions:")
display(predictedvalues_traindata)

predictedvalues_testdata.columns = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"] 

predictedvalues_testdata = predictedvalues_testdata.idxmax(axis=1, skipna=True) 
print("\n")
print("Test Set Predictions:")
display(predictedvalues_testdata)



Train Set Predictions:


0        3
1        3
2        3
3        7
4        3
        ..
29395    7
29396    2
29397    6
29398    1
29399    1
Length: 29400, dtype: object



Test Set Predictions:


0        4
1        6
2        7
3        5
4        3
        ..
12595    8
12596    1
12597    2
12598    3
12599    0
Length: 12600, dtype: object

#### Model accuracy on train and test data

In [14]:
#train set accuracy
trainaccuracy_gnb_D1 = pd.DataFrame(list(zip(list(predictedvalues_traindata), 
                                             list(y_gnb_D1_train))), 
                                    columns =['predictedval_traindata', 'actuallabel_traindata'])
trainaccuracy_gnb_D1['predictedval_traindata'] = trainaccuracy_gnb_D1['predictedval_traindata'].astype(int)


trueevents = trainaccuracy_gnb_D1[trainaccuracy_gnb_D1.apply(lambda x: min(x) == max(x), 1)] 
traindata_accuracy = (trueevents.shape[0]/y_gnb_D1_train.shape[0])*100
print("Accuracy on train data is:")
print(str(traindata_accuracy) + " %")

#test set accuracy
testaccuracy_gnb_D1 = pd.DataFrame(list(zip(list(predictedvalues_testdata), 
                                            list(y_gnb_D1_test))), 
                                   columns =['predictedval_testdata', 'actuallabel_testdata'])

testaccuracy_gnb_D1['predictedval_testdata'] = testaccuracy_gnb_D1['predictedval_testdata'].astype(int)
trueevents1 = testaccuracy_gnb_D1[testaccuracy_gnb_D1.apply(lambda rowval: min(rowval) == max(rowval), 1)]  # keep only the rows where the predicted and actual labels agree
testdata_accuracy = (trueevents1.shape[0]/y_gnb_D1_test.shape[0])*100
print("\nAccuracy on test data is:")
print(str(testdata_accuracy) + " %")


Accuracy on train data is:
75.58503401360545 %

Accuracy on test data is:
75.51587301587301 %


## Build a Bayesian classifier on D2 dataset (with full covariance matrix)


#### Data preparation

Split the D2 data (the 9-dimensional FDA projection of the MNIST digits data D0) into train and test sets for the Bayesian model.

In [15]:
X_gnb_D2 = D2.drop('label', axis=1) # entire data excluding class labels.
y_gnb_D2 =  D2['label'] # entire data with class labels only.
X_gnb_D2_train, X_gnb_D2_test, y_gnb_D2_train, y_gnb_D2_test = train_test_split(X_gnb_D2, y_gnb_D2, test_size=0.3, random_state= 11915048, shuffle = True)
print("shape of X_gnb_D2_train (predictors in train data):", X_gnb_D2_train.shape)
print("shape of X_gnb_D2_test (predictors in test data):", X_gnb_D2_test.shape)
print("shape of y_gnb_D2_train (response in train data):", y_gnb_D2_train.shape)
print("shape of y_gnb_D2_test (response in test data):", y_gnb_D2_test.shape)

## convert D2 train data to array
X_gnb_D2_train_arr = np.array(X_gnb_D2_train) 
y_gnb_D2_train_arr = np.array(y_gnb_D2_train)

X_gnb_D2_test_arr = np.array(X_gnb_D2_test)
y_gnb_D2_test_arr = np.array(y_gnb_D2_test)

shape of X_gnb_D2_train (predictors in train data): (29400, 9)
shape of X_gnb_D2_test (predictors in test data): (12600, 9)
shape of y_gnb_D2_train (response in train data): (29400,)
shape of y_gnb_D2_test (response in test data): (12600,)


#### Calculate prior probabilities 

In [16]:
# we want to compute proportion of each class label in D2 train data
c = y_gnb_D2_train.value_counts().to_dict() 

print("Class labels data size in train data:")
print(c.items())
#print(c[0], c[1])  # c[0] = 0, c[1] = 1..  which are class labels.
prior_prob = np.ones(10) # 10 since we have 10 class labels in our data.
for i in range(10):
    prior_prob[i] = c[i]/y_gnb_D2_train.shape[0] 
    #proportion of each class in training data(in ascending order).

prior_prob_list = list(prior_prob)
print("\n")
print("Prior Probabilities are:")
print(prior_prob_list) # these are the prior probabilities of each class label . i.e. p(c1), p(c2).. p(c10) since we have 10 classes.
#prior_prob.sum()  # sum should be equal to 1.0

Class labels data size in train data:
dict_items([(1, 3274), (3, 3061), (7, 3054), (2, 2968), (0, 2902), (9, 2874), (4, 2872), (6, 2864), (8, 2849), (5, 2682)])


Prior Probabilities are:
[0.09870748299319727, 0.11136054421768707, 0.10095238095238095, 0.1041156462585034, 0.09768707482993197, 0.09122448979591836, 0.09741496598639456, 0.10387755102040816, 0.09690476190476191, 0.09775510204081633]


#### Function to form the Gaussian distribution per class

In [17]:
# Estimate the mean and covariance of one class and build its Gaussian distribution.
# The mean vector of each class has 9 dimensions.
# The covariance matrix of each class has shape 9 x 9.
# The function returns a frozen scipy distribution holding the mean and covariance of that class.

def normdist(data):  # data: array of shape (n_samples, 9), the 9 FDA projections of a single class
    var = np.cov(data.T)  # np.cov expects one variable per row, so transpose the data
    ma = np.mean(data, axis=0)  # per-feature mean, shape (9,)
    distribution = multivariate_normal(ma, var)
    return distribution

#### Predictions on train and test data

In [18]:
labels = sorted(list(c.keys())) 
# 'c' is a dictionary with keys and values.
# keys: class labels(0 to 9); value: sample size counts per class

predictedvalues_traindata = pd.DataFrame() # store the predicted values from all the classes
predictedvalues_testdata = pd.DataFrame()

for i in range(len(labels)):  # one iteration per class label (10 classes)
    # subset on basis of class label
    X_class = np.array(X_gnb_D2_train[y_gnb_D2_train == labels[i]]) 
    norm_dist = normdist(X_class) # call normdist() function to get class distribution
    predictions_class = prior_prob_list[i]*norm_dist.pdf(X_gnb_D2_train_arr) # array of train data pred.
    predictions_class1 = prior_prob_list[i]*norm_dist.pdf(X_gnb_D2_test_arr) # array of test data pred.
    predictedvalues_traindata = pd.concat([predictedvalues_traindata, 
                                           pd.DataFrame(predictions_class)], axis = 1)
    predictedvalues_testdata = pd.concat([predictedvalues_testdata, 
                                          pd.DataFrame(predictions_class1)], axis = 1)
    
    

# assign column names = class labels in our data
predictedvalues_traindata.columns = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"] 
predictedvalues_traindata = predictedvalues_traindata.idxmax(axis=1, skipna=True) 

print("Predictions on train data are:")
display(predictedvalues_traindata)

predictedvalues_testdata.columns = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"] 

predictedvalues_testdata = predictedvalues_testdata.idxmax(axis=1, skipna=True) 

print("\nPredictions on test data are:")
display(predictedvalues_testdata)

Predictions on train data are:


0        2
1        3
2        3
3        9
4        3
        ..
29395    7
29396    2
29397    2
29398    1
29399    1
Length: 29400, dtype: object


Predictions on test data are:


0        4
1        6
2        7
3        5
4        3
        ..
12595    3
12596    1
12597    2
12598    3
12599    0
Length: 12600, dtype: object

#### Model accuracy on train and test data

In [19]:
trainaccuracy_gnb_D2 = pd.DataFrame(list(zip(list(predictedvalues_traindata), 
                                             list(y_gnb_D2_train))), 
                                    columns =['predictedval_traindata', 'actuallabel_traindata'])
trainaccuracy_gnb_D2['predictedval_traindata'] = trainaccuracy_gnb_D2['predictedval_traindata'].astype(int)

trueevents = trainaccuracy_gnb_D2[trainaccuracy_gnb_D2.apply(lambda rowval: min(rowval) == max(rowval), 1)]  # keep only the rows where the predicted and actual labels agree
traindata_accuracy = (trueevents.shape[0]/y_gnb_D2_train.shape[0])*100

print("Accuracy on D2 train data:")
print(str(traindata_accuracy) + " %")


testaccuracy_gnb_D2 = pd.DataFrame(list(zip(list(predictedvalues_testdata), 
                                            list(y_gnb_D2_test))), 
                                   columns =['predictedval_testdata', 'actuallabel_testdata'])
testaccuracy_gnb_D2['predictedval_testdata'] = testaccuracy_gnb_D2['predictedval_testdata'].astype(int)
trueevents1 = testaccuracy_gnb_D2[testaccuracy_gnb_D2.apply(lambda rowval: min(rowval) == max(rowval), 1)]  # keep only the rows where the predicted and actual labels agree
testdata_accuracy = (trueevents1.shape[0]/y_gnb_D2_test.shape[0])*100
print("\n")
print("Accuracy on D2 test data:")
print(str(testdata_accuracy) + " %")

Accuracy on D2 train data:
89.62244897959184 %


Accuracy on D2 test data:
89.9126984126984 %


## Build a Bayesian classifier on D2 dataset (with diagonal covariance matrix)

#### Data preparation

In [20]:
X_gnb_D2 = D2.drop('label', axis=1) # entire data excluding class labels.
y_gnb_D2 =  D2['label'] # entire data with class labels only.
X_gnb_D2_train, X_gnb_D2_test, y_gnb_D2_train, y_gnb_D2_test = train_test_split(X_gnb_D2, y_gnb_D2, test_size=0.3, random_state= 11915048, shuffle = True)
print("shape of X_gnb_D2_train (predictors in train data):", X_gnb_D2_train.shape)
print("shape of X_gnb_D2_test (predictors in test data):", X_gnb_D2_test.shape)
print("shape of y_gnb_D2_train (response in train data):", y_gnb_D2_train.shape)
print("shape of y_gnb_D2_test (response in test data):", y_gnb_D2_test.shape)

X_gnb_D2_train_arr = np.array(X_gnb_D2_train) # convert D2 train data to array
y_gnb_D2_train_arr = np.array(y_gnb_D2_train)

X_gnb_D2_test_arr = np.array(X_gnb_D2_test)
y_gnb_D2_test_arr = np.array(y_gnb_D2_test)



shape of X_gnb_D2_train (predictors in train data): (29400, 9)
shape of X_gnb_D2_test (predictors in test data): (12600, 9)
shape of y_gnb_D2_train (response in train data): (29400,)
shape of y_gnb_D2_test (response in test data): (12600,)


#### Calculate prior probabilities

In [21]:
c = y_gnb_D2_train.value_counts().to_dict() # we want to compute proportion of each class label in D2 train data
print("Class labels data size in train data:\n", c.items())
#print(c[0], c[1])  # c[0] = 0, c[1] = 1..  which are class labels.
prior_prob = np.ones(10) # 10 since we have 10 class labels in our data.
for i in range(10):
    prior_prob[i] = c[i]/y_gnb_D2_train.shape[0]  #proportion of each class in training data(in ascending order).

prior_prob_list = list(prior_prob)
# these are the prior probabilities of each class label
#. i.e. p(c1), p(c2).. p(c10) since we have 10 classes.
#prior_prob.sum()  # sum should be equal to 1.0
print("\n")
print("Prior Probabilities are:\n", prior_prob_list) 


Class labels data size in train data:
 dict_items([(1, 3274), (3, 3061), (7, 3054), (2, 2968), (0, 2902), (9, 2874), (4, 2872), (6, 2864), (8, 2849), (5, 2682)])


Prior Probabilities are:
 [0.09870748299319727, 0.11136054421768707, 0.10095238095238095, 0.1041156462585034, 0.09768707482993197, 0.09122448979591836, 0.09741496598639456, 0.10387755102040816, 0.09690476190476191, 0.09775510204081633]


#### Function to form the Gaussian distribution per class

In [22]:
def normdist(data):  # data: array of shape (n_samples, 9), the 9 FDA projections of a single class
    var = np.cov(data.T)  # np.cov expects one variable per row, so transpose the data
    var = np.diag(np.diag(var))  # keep the variances, set the off-diagonal entries to zero
    ma = np.mean(data, axis=0)  # per-feature mean, shape (9,)
    distribution = multivariate_normal(ma, var)
    return distribution


#### Predictions on train and test data


In [23]:
labels = sorted(list(c.keys())) 
# 'c' is a dictionary whose keys are the class labels (0 to 9) and whose values are the sample counts per class

predictedvalues_traindata = pd.DataFrame() # store the predicted values from all the classes
predictedvalues_testdata = pd.DataFrame()

for i in range(len(labels)):  # one iteration per class label (10 classes)
    X_class = np.array(X_gnb_D2_train[y_gnb_D2_train == labels[i]]) # subset on  basis of class label
    norm_dist = normdist(X_class) # call normdist() function to get class distribution
    predictions_class = prior_prob_list[i]*norm_dist.pdf(X_gnb_D2_train_arr) # array of train data
    predictions_class1 = prior_prob_list[i]*norm_dist.pdf(X_gnb_D2_test_arr) # array of test data
    predictedvalues_traindata = pd.concat([predictedvalues_traindata, pd.DataFrame(predictions_class)],
                                          axis = 1)
    predictedvalues_testdata = pd.concat([predictedvalues_testdata, pd.DataFrame(predictions_class1)],
                                         axis = 1)
    

predictedvalues_traindata.columns = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"] 

predictedvalues_traindata = predictedvalues_traindata.idxmax(axis=1, skipna=True) 
print("Predictions on train data are:")
display(predictedvalues_traindata)

predictedvalues_testdata.columns = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]  # column names = class labels in our data
predictedvalues_testdata = predictedvalues_testdata.idxmax(axis=1, skipna=True)  # for each row, the class label with the largest score
print("\n")
print("Predictions on test data are:")
display(predictedvalues_testdata)

Predictions on train data are:


0        2
1        3
2        3
3        9
4        3
        ..
29395    7
29396    2
29397    2
29398    1
29399    1
Length: 29400, dtype: object



Predictions on test data are:


0        4
1        6
2        7
3        5
4        3
        ..
12595    3
12596    1
12597    2
12598    3
12599    0
Length: 12600, dtype: object

#### Model accuracy on train and test data

In [24]:
#train set accuracy

trainaccuracy_gnb_D2 = pd.DataFrame(list(zip(list(predictedvalues_traindata), 
                                             list(y_gnb_D2_train))),
                                    columns =['predictedval_traindata', 'actuallabel_traindata'])
trainaccuracy_gnb_D2['predictedval_traindata'] = trainaccuracy_gnb_D2['predictedval_traindata'].astype(int)
trueevents = trainaccuracy_gnb_D2[trainaccuracy_gnb_D2.apply(lambda rowval: min(rowval) == max(rowval), 1)]  # keep only the rows where the predicted and actual labels agree
traindata_accuracy = (trueevents.shape[0]/y_gnb_D2_train.shape[0])*100
print("Accuracy on D2 train data is:")
print(str(traindata_accuracy) + " %")


#test set accuracy

testaccuracy_gnb_D2 = pd.DataFrame(list(zip(list(predictedvalues_testdata), 
                                            list(y_gnb_D2_test))), 
                                   columns =['predictedval_testdata', 'actuallabel_testdata'])
testaccuracy_gnb_D2['predictedval_testdata'] = testaccuracy_gnb_D2['predictedval_testdata'].astype(int)
trueevents1 = testaccuracy_gnb_D2[testaccuracy_gnb_D2.apply(lambda rowval: min(rowval) == max(rowval), 1)]  # keep only the rows where the predicted and actual labels agree
testdata_accuracy = (trueevents1.shape[0]/y_gnb_D2_test.shape[0])*100
print("\nAccuracy on D2 test data is:")
print(str(testdata_accuracy) + " %")

Accuracy on D2 train data is:
88.38095238095238 %

Accuracy on D2 test data is:
88.61111111111111 %


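#### After PCA, accuracy results for the Bayesian classifier on D1

Bayesian classifier on D1 (full covariance matrix):
- train data accuracy: 84.5986394557823 %
- test data accuracy: 84.19047619047619 %

Bayesian classifier on D1 (diagonal covariance matrix):
- train data accuracy: 75.58503401360545 %
- test data accuracy: 75.51587301587301 %

<br>

#### After FDA, accuracy results for the Bayesian classifier on D2

Bayesian classifier on D2 (full covariance matrix):
- train data accuracy: 89.62244897959184 %
- test data accuracy: 89.9126984126984 %

Bayesian classifier on D2 (diagonal covariance matrix):
- train data accuracy: 88.38095238095238 %
- test data accuracy: 88.61111111111111 %

Accuracy is lower with a diagonal covariance matrix because setting the off-diagonal entries to zero discards the correlations between features within each class. The density contours of each class then become axis-aligned ellipsoids (not circles, since the per-feature variances still differ), which fit the data less closely than the freely oriented ellipsoids of the full covariance matrix. The drop is much larger on D1 (about 8.7 points on test data) than on D2 (about 1.3 points), likely because the FDA projection already accounts for within-class scatter. Both projections were fitted on the full dataset before the train/test split, so the test accuracies, especially on D2 whose projection uses the labels, are probably somewhat optimistic.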
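
One possible reading of the large drop on the PCA features is that PCA decorrelates the pooled data, while the features within a single class can remain correlated, and a diagonal covariance ignores exactly that. The sketch below illustrates the effect on synthetic two-class data; the means, covariance, and sample sizes are invented for illustration and nothing here is measured on MNIST.

```python
import numpy as np

rng = np.random.default_rng(42)
within_cov = np.array([[1.0, 0.8],
                       [0.8, 1.0]])
# Two synthetic classes with the same correlated shape but different centres.
X0 = rng.multivariate_normal(mean=[-2.0, 0.0], cov=within_cov, size=2000)
X1 = rng.multivariate_normal(mean=[2.0, 0.0], cov=within_cov, size=2000)
X = np.vstack([X0, X1])
X = X - X.mean(axis=0)

# PCA on the pooled data: rotate onto the eigenvectors of the pooled covariance.
_, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
Z = X @ eigvecs
Z0 = Z[:2000]   # projected rows that belong to the first class

pooled_offdiag = np.cov(Z, rowvar=False)[0, 1]    # essentially zero by construction
class_offdiag = np.cov(Z0, rowvar=False)[0, 1]    # clearly non-zero within the class
print(pooled_offdiag, class_offdiag)
```

The pooled off-diagonal term vanishes up to rounding, but the within-class term does not, so a per-class diagonal covariance still throws away real structure after PCA.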