## README
In this demostration, I have trained the CIFAR10 classification using both SVM and Random Forest classifiers, as well as eveaulated their respective accuracy and stability. 

Note the data directory is not uploaded to Github due to the large size. Feel free to modify my directory global variable to work on your dataset.

### Dev Env (Python 3.10.5)

numpy==1.26.1

opencv-python==4.9.0.80

matplotlib==3.8.2

scikit-learn==1.4.1.post1

pickleshare==0.7.5

### Imports & Setup
Image directory and image file name can be setup here

In [6]:
import numpy as np
import matplotlib.pyplot as plt
import cv2
import os
import pickle
from sklearn import svm

%matplotlib inline

# Relative dir or absolute dir
PART1_DIR = "part1/"
PART2_IMG_PATH = "part2/Person.png"
CIFAR10_DIR = 'cifar-10-batches-py/'

### Util Functions


In [7]:
def show_images(images, cmap=None):
    """
    Display multiple imgs with a title.
    """
    n_images = len(images)

    # Create a subplot with n_images columns
    fig, axes = plt.subplots(n_images, 1, figsize=(5*n_images, 10))

    # Make sure still works with just one img in the arr
    if n_images == 1:
        axes = [axes]

    for i, img in enumerate(images):
        if cmap:
            axes[i].imshow(img, cmap=cmap)
        else:
            axes[i].imshow(img)
        axes[i].axis('off')

    plt.tight_layout()
    plt.show()

def load_and_convert_images_to_grayscale(dir):
    """
    Loads all images from the specified directory, converts them to grayscale, and returns a list of the grayscale images.
    """
    grayscale_images = []

    # Loop through each file
    for filename in os.listdir(dir):
        # Get the full file path
        file_path = os.path.join(dir, filename)
        
        # Check if the file is an image. (For the assignment's convension in Q2 Part1 I only check for jpg but other format can be added here)
        if file_path.lower().endswith(('.jpg')):
            img = cv2.imread(file_path)
            
            # Convert the image to grayscale
            grayscale_img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
            grayscale_images.append(grayscale_img)

    return grayscale_images

# P1: CIFAR10 Classification using SVM and Random Forest

## 1.1: Resize and Compute HoG
## 1.2: Fit a non-linear SVM classifier with default hyperparameters

In [14]:
Q1_IMG_SIZE = (64, 64)
# For training time on my little machine, I have to set it to 2. (train on batch 1 and 2)
USE_BATCH_UNTIL = 2

# From Doc https://www.cs.toronto.edu/~kriz/cifar.html
def load_cifar10_batch(file):
    with open(file, 'rb') as fo:
        dict = pickle.load(fo, encoding='bytes')
    return dict

def load_data(cifar10_dir):
    # Load all training batches
    train_data = []
    train_labels = []

    for i in range(1, USE_BATCH_UNTIL+1):
        batch = load_cifar10_batch(os.path.join(cifar10_dir, 'data_batch_' + str(i)))
        train_data.append(batch[b'data'])
        train_labels += batch[b'labels']

    # Load test batch
    test_batch = load_cifar10_batch(os.path.join(cifar10_dir, 'test_batch'))
    test_data = test_batch[b'data']
    test_labels = test_batch[b'labels']

    # Convert to numpy arrays and reshape to images
    train_data = np.vstack(train_data).reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)
    test_data = test_data.reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)


    return train_data, train_labels, test_data, test_labels

def resize_and_grayscale(images):
    grayscale_images = []
    for img in images:
        # Resize to Q1_IMG_SIZE
        resized_img = cv2.resize(img, Q1_IMG_SIZE)
        # Convert to grayscale
        grayscale_img = cv2.cvtColor(resized_img, cv2.COLOR_BGR2GRAY)
        grayscale_images.append(grayscale_img)
    return np.array(grayscale_images)

def compute_hog_features(images):
    img_size = (64, 64) # h x w in pixels
    cell_size = (8, 8)  # h x w in pixels
    block_size = (4, 4)  # h x w in cells
    nbins = 4  # number of orientation bins

    # create HoG Object
    # winSize is the size of the image cropped to multiple of the cell size
    # all arguments should be given in terms of number of pixels
    hog_descriptor = cv2.HOGDescriptor(_winSize=(img_size[1] // cell_size[1] * cell_size[1],
                                    img_size[0] // cell_size[0] * cell_size[0]),
                            _blockSize=(block_size[1] * cell_size[1],
                                        block_size[0] * cell_size[0]),
                            _blockStride=(cell_size[1], cell_size[0]),
                            _cellSize=(cell_size[1], cell_size[0]),
                            _nbins=nbins)
    hog_features = []
    for img in images:
        hog_feat = hog_descriptor.compute(img)
        hog_features.append(hog_feat)
    return np.array(hog_features).reshape(len(images), -1)

train_data, train_labels, test_data, test_labels = load_data(CIFAR10_DIR)
# Resize and convert to grayscale
train_images_gray = resize_and_grayscale(train_data)
test_images_gray = resize_and_grayscale(test_data)

# Compute HoG features
train_hog_features = compute_hog_features(train_images_gray)
test_hog_features = compute_hog_features(test_images_gray)

clf = svm.SVC(gamma='auto')
clf.fit(train_hog_features, train_labels)

## 1.3: Predict labels of the test images by feeding the test features


In [15]:
# Evaluate on test data
predicted_labels = clf.predict(test_hog_features)
print("Accuracy:", np.mean(predicted_labels == test_labels))

Accuracy: 0.394


Note: I consider the above accuracy solid because I am only using 2 of the data batches and also, as discussed in the website, the set contain images from 10 classes. So ramdom guess will only have accuracy of 0.1

## 1.4: Tune values of hyperparameters ’gamma’ and ’C’ to observe the accuracy change.
Note I will automate the process of testing and finding values. (Because I can't check my computer every 1 hour and change the values to next)

Under time constaint I have selected 3 values each for GAMMA_VALUES and C_VALUES but I hope the idea is clear -- By looping through the Accuracy result with different gamma and C values, we can find the trend that leads us to a higher Accuracy.

In [16]:
# Define the range of values for 'gamma' and 'C' you want to test
GAMMA_VALUES = [1e-2, 1e-4, 1e-6]
C_VALUES = [1, 10, 100]

# Prepare a list to store the results
results = []

# Loop over all possible combinations of 'gamma' and 'C'
for gamma in GAMMA_VALUES:
    for C in C_VALUES:
        # Initialize and train the SVM model
        clf = svm.SVC(gamma=gamma, C=C)
        clf.fit(train_hog_features, train_labels)
        
        # Predict on the test set and compute accuracy
        predicted_labels = clf.predict(test_hog_features)
        accuracy = np.mean(predicted_labels == test_labels)
        
        # Store the results
        results.append((gamma, C, accuracy))
        print(f"Gamma: {gamma}, C: {C}, Accuracy: {accuracy}")

Gamma: 0.01, C: 1, Accuracy: 0.4904
Gamma: 0.01, C: 10, Accuracy: 0.545
Gamma: 0.01, C: 100, Accuracy: 0.5784
Gamma: 0.0001, C: 1, Accuracy: 0.164
Gamma: 0.0001, C: 10, Accuracy: 0.4215
Gamma: 0.0001, C: 100, Accuracy: 0.4828
Gamma: 1e-06, C: 1, Accuracy: 0.1
Gamma: 1e-06, C: 10, Accuracy: 0.1
Gamma: 1e-06, C: 100, Accuracy: 0.1639


Note Above: As we can see as C increases from 1 to 100 the modles accuracy also increases and as the Gamma increases, accuracy also increases. This tell us the next step is to set C_VALUES to something like [200, 500, 1000] and Gamma values to something like [1e-1, 1, 10] to continue on the fine tuning. For now, I will select C as 100, and Gamma as 0.01.

## 1.5: Random Forest(RF) Classifier

In [17]:
from sklearn.ensemble import RandomForestClassifier
# Params
N_ESTIMATOR = 10
MAX_DEPTH = 5
CRITERION = 'entropy'

rf_clf = RandomForestClassifier(n_estimators=N_ESTIMATOR, max_depth=MAX_DEPTH, criterion=CRITERION)

# Fit the classifier to the training data
rf_clf.fit(train_hog_features, train_labels)

# 1.6: RF Classification Accuracy

In [19]:
# Evaluate on test data
predicted_labels = rf_clf.predict(test_hog_features)
print("Accuracy:", np.mean(predicted_labels == test_labels))

Accuracy: 0.3528


## 1.7: SVM and RF Comparasion
Again, I will automate the process of running SVM and RF with different ramdom states and evauate their mean and standard deviation for finding out the stability

In [22]:
# Feel free to adjust them
RANDOM_STATES = [5, 10, 15, 20, 25]

# SVM params
GAMMA = 0.01
C = 100

# RF params
N_ESTIMATOR = 10
MAX_DEPTH = 5
CRITERION = 'entropy'

svm_accuracies = []
rf_accuracies = []

for state in RANDOM_STATES:
    # Train and evaluate SVM
    svm_clf_rd = svm.SVC(gamma=GAMMA, C=C, random_state=state)
    svm_clf_rd.fit(train_hog_features, train_labels)
    svm_predictions = svm_clf_rd.predict(test_hog_features)
    svm_accuracy = np.mean(svm_predictions == test_labels)
    svm_accuracies.append(svm_accuracy)
    print(f"INFO: SVM accuracy with state {state}: {svm_accuracy}")
    
    # Train and evaluate RF
    rf_clf_rd = RandomForestClassifier(n_estimators=N_ESTIMATOR, max_depth=MAX_DEPTH, criterion=CRITERION, random_state=state)
    rf_clf_rd.fit(train_hog_features, train_labels)
    rf_predictions = rf_clf_rd.predict(test_hog_features)
    rf_accuracy = np.mean(rf_predictions == test_labels)
    rf_accuracies.append(rf_accuracy)
    print(f"INFO: RF accuracy with state {state}: {rf_accuracy}")

# Calculate average accuracies and std dev to evaluate stability
svm_avg_accuracy = np.mean(svm_accuracies)
rf_avg_accuracy = np.mean(rf_accuracies)
svm_std_dev = np.std(svm_accuracies)
rf_std_dev = np.std(rf_accuracies)

print("------------------------------RESULT----------------------------------")
print(f"INFO: SVM - Avg Accuracy: {svm_avg_accuracy}, Std Dev: {svm_std_dev}")
print(f"INFO: RF - Avg Accuracy: {rf_avg_accuracy}, Std Dev: {rf_std_dev}")

INFO: SVM accuracy with state 5: 0.5784
INFO: RF accuracy with state 5: 0.3525
INFO: SVM accuracy with state 10: 0.5784
INFO: RF accuracy with state 10: 0.3579
INFO: SVM accuracy with state 15: 0.5784
INFO: RF accuracy with state 15: 0.3602
INFO: SVM accuracy with state 20: 0.5784
INFO: RF accuracy with state 20: 0.3627
INFO: SVM accuracy with state 25: 0.5784
INFO: RF accuracy with state 25: 0.3605
------------------------------RESULT----------------------------------
INFO: SVM - Avg Accuracy: 0.5784, Std Dev: 0.0
INFO: RF - Avg Accuracy: 0.35876, Std Dev: 0.003480000000000013


## Stability:
As we can see from the result the SVM is more stable which is as we expected because the solution patch is oftern deterministic and there isn't much randoness that comes to play.
RF on the otherhand do have variations in the accuracy because different tree path can be selected and taken.

Inaddition, from the offical doc for svm.SVC, we can see adjusting random_state along will not have any effect on the SVC unless paring with other params.

Doc: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

## Strengths and weaknesses:
### SVM strenth: 
* Flexibility in choosing a kernel function -> Can adapt to different type of data
* Ability to handle large feature spaces
### SVM weakness:
* Slower and more memory hungery as training set grows (I reall feel this in this assignment)
* Parameter selection can be tricky and the accuracy can drop a lot with param not carefully selected
### RF strenth:
* It is robust to overfitting (Due to the ensemble approach of averaging multiple decision trees)
* It can handles large datasets and feature spaces
### RF weakness:
* Has weaker performance on small size training datasets