# Homework Assignment 2 - A Probabilistic Naive Bayes Classifier

CSE 575 Section C Fall 2024 Luo

## Description

**This is individual work.** The project focuses on a subset of the MNIST dataset containing images of the digits "0" and "1". It involves four tasks: feature extraction, parameter calculation, implementation of Naïve Bayes classifiers, and prediction of labels for the test data using the classifiers. Finally, you will calculate the accuracy of the predictions.

## What packages are allowed / prohibited

You ARE allowed to use fundamental math/stat operators in numpy and math, such as numpy.var, numpy.std, and numpy.mean.

You are <font color="red">**NOT allowed**</font> to use functions that directly return Gaussian-distribution PDF values or that directly provide an NB classifier, e.g. scipy.stats.norm, numpy.random.normal, the sklearn library as a whole, or the like. If we find any of these in your source code, your submission will be desk-rejected (receiving a 0).

## Deliverables

- (7 pts) Your source code in this **HW2.ipynb** that contains all the proper implementations
- (3 pts) A one-page **pdf** report, excluding the cover page if you have one. **You will report all output values (from Step 2 and Step 3) for all the 3 cases given to you**, and record any difficulty or problem you have encountered during the process.

## Evaluation

The ground truths for the given 3 cases will be revealed on Canvas **by Tuesday, Feb 13th**. During grading, we will further assess your program and see if it can pass 2 additional hidden test cases. 

The error range for the PDF values in Step 2 is ±0.2, and the error range for the accuracies in Step 3 is ±0.005.
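
To self-check your output once the ground truths are posted, a comparison along the following lines may help. This is only a sketch: it assumes the error ranges are absolute differences applied to each value separately, the helper name `within_tolerance` is hypothetical, and the numbers in the usage lines are made up for illustration rather than taken from any ground truth.

```python
def within_tolerance(reported, truth, tol):
    """Return True when every reported value is within +/- tol of its ground truth."""
    assert len(reported) == len(truth), "lists must have the same length"
    return all(abs(r - t) <= tol for r, t in zip(reported, truth))


# Made-up values, for illustration only
step2_ok = within_tolerance([44.15, 113.59], [44.10, 113.70], tol=0.2)
step3_ok = within_tolerance([0.917, 0.923], [0.917, 0.930], tol=0.005)
print(step2_ok, step3_ok)  # True False
```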

## Deadline

23:59 on Tuesday, Feb 20th (the day of Midterm 1). No late submissions will be accepted.

In [2]:
import numpy as np
import scipy.io
import math

case = 2  # training case to load: 0, 1, or 2 (the outputs below come from case 2)

# Local path to the data folder on the author's machine. Change it to match your setup.
data_dir = '/home/mannvora/Desktop/CSE-575-SML/Assignments/ASSIGNMENT2/data'

# Training files are selected by `case`. This assumes cases 0 and 1 follow the same
# train{case}/digit*_stu_train{case}.mat naming as case 2. The test files are shared.
Numpyfile0 = scipy.io.loadmat(f'{data_dir}/train{case}/digit0_stu_train{case}.mat')
Numpyfile1 = scipy.io.loadmat(f'{data_dir}/train{case}/digit1_stu_train{case}.mat')
Numpyfile2 = scipy.io.loadmat(f'{data_dir}/test/digit0_testset.mat')
Numpyfile3 = scipy.io.loadmat(f'{data_dir}/test/digit1_testset.mat')

train0 = Numpyfile0.get('target_img') # Training samples of digit 0
train1 = Numpyfile1.get('target_img') # Training samples of digit 1

test0 = Numpyfile2.get('target_img') # Test samples of digit 0
test1 = Numpyfile3.get('target_img') # Test samples of digit 1

# 1. Feature Extraction - Transforming the Raw Images into 2D

The images of MNIST in their raw form are all 28x28 greyscale pixel values, or 1x784 if flattened. That is way too many dimensions! How can we find a way to represent the images but with far fewer dimensions?

For this project, we will transform all image samples into 2D vectors. You need to first extract features from your original training sets, **i.e. train0 and train1**, by converting the original data arrays to 2D arrays. The two features of any image x are defined as follows:

**Feature1 / x_f1:** The average brightness of each image (average all pixel values within a whole image array). 

**Feature2 / x_f2:** The standard deviation of the brightness of each image (standard deviation of all pixel brightness values within a whole image array). 

We assume that these two features are independent of each other and that, within each digit class, each feature is drawn from a Gaussian distribution across all images. Below is a function template you may make use of.
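
As an aside, the same two features can be computed for a whole stack of images at once. The sketch below assumes the images arrive as an array shaped (N, 28, 28); the name `feat_extract_batch` and the synthetic `demo_images` array are hypothetical and only serve the illustration.

```python
import numpy as np


def feat_extract_batch(images):
    """Return an (N, 2) array holding [mean, std] of brightness for each image."""
    images = np.asarray(images, dtype=np.float64)
    assert images.ndim == 3, "expected a stack shaped (N, H, W)"
    f1 = images.mean(axis=(1, 2))  # Feature1: average brightness per image
    f2 = images.std(axis=(1, 2))   # Feature2: population std (ddof=0) per image
    return np.column_stack((f1, f2))


# Synthetic stand-in for a training set (not real MNIST data)
demo_images = (np.arange(5 * 28 * 28).reshape(5, 28, 28) % 256).astype(np.float64)
demo_features = feat_extract_batch(demo_images)
print(demo_features.shape)  # (5, 2)
```

Reducing over axes 1 and 2 gives the same numbers as calling `np.mean` and `np.std` on each image in a loop, without the Python-level iteration.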

In [3]:
# Step 1 Template (you are free to modify it at will)

# Function to extract the two features from a single 28x28 image
def feat_extract(x):
    assert x.shape == (28, 28)
    
    # Calculate average brightness
    x_f1 = np.mean(x)
    
    # Calculate standard deviation of brightness
    x_f2 = np.std(x)
    
    return x_f1, x_f2


##Extracting the features for digit 0

num_samples = train0.shape[0]
features_train0 = np.zeros((num_samples, 2))  

for i in range(num_samples):
    x = train0[i]
    features_train0[i] = feat_extract(x)

print(features_train0)

## Doing the same to extract the features for Digit1 

num_samples = train1.shape[0]
features_train1 = np.zeros((num_samples, 2)) 

for i in range(num_samples):
    x = train1[i]
    features_train1[i] = feat_extract(x)
    
print(features_train1)

[[ 54.09566327  97.5900018 ]
 [ 52.67984694  96.57544794]
 [ 48.31377551  93.12555171]
 ...
 [ 31.06377551  75.10089142]
 [ 38.66836735  84.32684964]
 [ 62.8877551  103.05217201]]
[[19.19387755 63.13439797]
 [15.95153061 56.93856515]
 [17.25510204 59.17016471]
 ...
 [21.45790816 65.21079058]
 [ 9.52423469 41.79007211]
 [13.57142857 52.35572986]]


# 2. The NB Likelihood Parameters are Distributed as Gaussian PDFs

You need to calculate all the parameters for our two-class Naive Bayes (NB) classifier, based on the 2-D featurized data you obtained in Step 1.

**Assuming the two priors P(Y = 0) = P(Y = 1) = 0.5**, we now need to figure out **4 sets of Gaussian PDFs** for the 4 likelihood parameters in order to make predictions. **Remember, you obtain the parameters only from the training sets**:  

1. Mean of Feature1 for digit0 
2. Variance of Feature1 for digit0 
3. Mean of Feature2 for digit0 
4. Variance of Feature2 for digit0 
5. Mean of Feature1 for digit1
6. Variance of Feature1 for digit1 
7. Mean of Feature2 for digit1 
8. Variance of Feature2 for digit1 

**At the end of this step, you need to print out a list that contains these 8 values in the above order.**

Hint: Double-check the NB classifier formula. What exactly are the 4 likelihood parameters?
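
The ordering of the 8 values can be illustrated on toy data. This is a minimal sketch, not the required solution: the helper name `fit_gaussian_params` and the two tiny feature arrays are hypothetical, and it assumes the population variance (NumPy's default, `ddof=0`) is the one wanted.

```python
import numpy as np


def fit_gaussian_params(features):
    """Return [mean_f1, var_f1, mean_f2, var_f2] for one class's (N, 2) features."""
    features = np.asarray(features, dtype=np.float64)
    means = features.mean(axis=0)
    variances = features.var(axis=0)  # population variance, ddof=0
    return [float(means[0]), float(variances[0]),
            float(means[1]), float(variances[1])]


# Toy feature arrays, one row per image: [Feature1, Feature2]
toy_digit0 = np.array([[1.0, 10.0], [3.0, 14.0]])
toy_digit1 = np.array([[0.0, 1.0], [4.0, 3.0]])

params = fit_gaussian_params(toy_digit0) + fit_gaussian_params(toy_digit1)
print(params)  # [2.0, 1.0, 12.0, 4.0, 2.0, 4.0, 2.0, 1.0]
```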

In [4]:
def gaussian_pdf(mu, var, x_f):
    # Compute Gaussian PDF
    exponent = -((x_f - mu)**2) / (2 * var)
    pdf = (1 / np.sqrt(2 * np.pi * var)) * np.exp(exponent)
    return pdf

# Calculate the mean and variance for each feature for digit 0
mean_f1_digit0 = np.mean(features_train0[:, 0])  # Feature 1 (average brightness)
var_f1_digit0 = np.var(features_train0[:, 0])

mean_f2_digit0 = np.mean(features_train0[:, 1])  # Feature 2 (standard deviation of brightness)
var_f2_digit0 = np.var(features_train0[:, 1])

# Calculate the mean and variance for each feature for digit 1
mean_f1_digit1 = np.mean(features_train1[:, 0])  # Feature 1 (average brightness)
var_f1_digit1 = np.var(features_train1[:, 0])

mean_f2_digit1 = np.mean(features_train1[:, 1])  # Feature 2 (standard deviation of brightness)
var_f2_digit1 = np.var(features_train1[:, 1])

print([mean_f1_digit0, var_f1_digit0,
       mean_f2_digit0, var_f2_digit0,
       mean_f1_digit1, var_f1_digit1,
       mean_f2_digit1, var_f2_digit1])

[44.15345433673469, 113.58645671141707, 87.39560564946916, 100.33375184004153, 19.379585969387758, 31.234096625264932, 61.3637481389197, 82.25698654654198]


# 3. NB Predictions and Accuracies out of the Test Sets

Once Step 1 and Step 2 are finished, you have everything you need to generate predictions for the samples in our test sets. We have two test sets in this project: **test0** of all digit 0's, and **test1** of all digit 1's.

Your prediction using an NB classifier is made by comparing the two posteriors, P(Y=0|X) vs. P(Y=1|X). For a single test set image X in its raw form, you will obtain the two posteriors as follows:

1. Convert X into a feature1-feature2 vector with your feature extractor from Step 1.
2. Use the 4 established Gaussian PDFs from Step 2 to obtain the two posteriors and make a prediction.
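
Written out, the comparison in these two steps looks as follows. The subscripted symbols $\mu_{c,j}$ and $\sigma^2_{c,j}$ (mean and variance of feature $j$ for digit $c$) are notation introduced here for illustration, not taken from the assignment:

$$
P(Y=c \mid X) \;\propto\; P(Y=c)\,\mathcal{N}\!\left(x_{f1};\, \mu_{c,1}, \sigma^2_{c,1}\right)\,\mathcal{N}\!\left(x_{f2};\, \mu_{c,2}, \sigma^2_{c,2}\right), \qquad c \in \{0, 1\}
$$

$$
\mathcal{N}\!\left(x;\, \mu, \sigma^2\right) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
$$

Because both priors equal 0.5 and the normalizing term $P(X)$ is the same for both classes, comparing the two products of Gaussian densities is enough to pick the larger posterior.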

**At the end of this step, you need to print out a two-item list that contains the overall prediction accuracies for test0 and test1. Both accuracy values should be between 0 and 1.**
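
One possible variation is to compare log-posteriors instead of raw products. The logarithm is monotonic, so the winner is unchanged, and sums of logs are less prone to underflow than products of small densities. The sketch below is an illustration under assumptions: `log_gaussian_pdf` and `predict` are hypothetical names, the parameter values are rounded stand-ins, and ties are assigned to digit 0.

```python
import numpy as np


def log_gaussian_pdf(mu, var, x):
    """Log of the Gaussian density, written out by hand (no scipy)."""
    return -0.5 * np.log(2.0 * np.pi * var) - (x - mu) ** 2 / (2.0 * var)


def predict(features, params0, params1, prior0=0.5, prior1=0.5):
    """Return 0/1 labels for an (N, 2) feature array.

    params0 and params1 are [mean_f1, var_f1, mean_f2, var_f2] for each digit.
    """
    features = np.asarray(features, dtype=np.float64)
    f1, f2 = features[:, 0], features[:, 1]
    score0 = (np.log(prior0)
              + log_gaussian_pdf(params0[0], params0[1], f1)
              + log_gaussian_pdf(params0[2], params0[3], f2))
    score1 = (np.log(prior1)
              + log_gaussian_pdf(params1[0], params1[1], f1)
              + log_gaussian_pdf(params1[2], params1[3], f2))
    return (score1 > score0).astype(int)  # ties go to digit 0


# Rounded, illustrative parameters and two made-up feature vectors
params0 = [44.0, 114.0, 87.0, 100.0]
params1 = [19.0, 31.0, 61.0, 82.0]
demo = np.array([[45.0, 88.0], [18.0, 60.0]])
labels = predict(demo, params0, params1)
print(labels)  # [0 1]
```

The accuracy for a test set whose true label is `c` is then `np.mean(labels == c)`.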

In [5]:
# Step 3: Predictions
correct0 = 0  # Counter for correct predictions in test0
correct1 = 0  # Counter for correct predictions in test1

# Predictions for test0 (digit 0)
for x in test0:
    x_f1, x_f2 = feat_extract(x)
    
    # Unnormalized posteriors: the equal priors (0.5) and the shared P(X) term cancel in the comparison
    posterior_digit0 = gaussian_pdf(mean_f1_digit0, var_f1_digit0, x_f1) * gaussian_pdf(mean_f2_digit0, var_f2_digit0, x_f2)
    posterior_digit1 = gaussian_pdf(mean_f1_digit1, var_f1_digit1, x_f1) * gaussian_pdf(mean_f2_digit1, var_f2_digit1, x_f2)
    
    # Make prediction
    if posterior_digit0 > posterior_digit1:
        correct0 += 1

# Accuracy for test0
acc0 = correct0 / len(test0)

# Predictions for test1 (digit 1)
for x in test1:
    x_f1, x_f2 = feat_extract(x)
    
    # Unnormalized posteriors: the equal priors (0.5) and the shared P(X) term cancel in the comparison
    posterior_digit0 = gaussian_pdf(mean_f1_digit0, var_f1_digit0, x_f1) *  gaussian_pdf(mean_f2_digit0, var_f2_digit0, x_f2)
    posterior_digit1 = gaussian_pdf(mean_f1_digit1, var_f1_digit1, x_f1) *  gaussian_pdf(mean_f2_digit1, var_f2_digit1, x_f2)
    
    # Make prediction
    if posterior_digit1 > posterior_digit0:
        correct1 += 1

# Accuracy for test1
acc1 = correct1 / len(test1)

# Print accuracies
print([acc0, acc1])

[0.9173469387755102, 0.9233480176211454]
