## **Image Classification of Animals Using CNNs and PCA**
#### `Geraldine Marten-Ellis`, `Dawit Hailu`, `Jan McConnell`, `Aaron J. Smith`
**DS510 Team Project** `Summer 2025`

### **Introduction**


> This project evaluates how well Convolutional Neural Networks (CNNs) classify images of animals and whether applying Principal Component Analysis (PCA) beforehand improves the results. We use a small subset of the High-Resolution Cat-Dog-Bird Image Dataset: 150 grayscale images, divided evenly among cats, dogs, and birds. Our goal is to compare classification accuracy, training time, and loss between models trained on raw images and models trained on PCA-preprocessed images. The project highlights how dimensionality reduction can influence both efficiency and generalization in small-scale image classification tasks. Performance metrics and visualizations, including confusion matrices and loss curves, support the findings.
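
To make the PCA step concrete, here is a minimal sketch of dimensionality reduction on flattened images. It is an illustration under stated assumptions, not the project's pipeline: the data is random stand-in data shaped like 150 flattened 28x28 images (the size used later in this notebook), and the choice of 50 components is hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in data: 150 flattened 28x28 grayscale images (random values, not the real dataset)
rng = np.random.default_rng(42)
X = rng.random((150, 28 * 28))

# Keep 50 principal components (hypothetical choice, not taken from the project)
pca = PCA(n_components=50, random_state=42)
X_reduced = pca.fit_transform(X)                # shape (150, 50)
X_restored = pca.inverse_transform(X_reduced)   # shape (150, 784), back in pixel space

print(X_reduced.shape, X_restored.shape)
print(f"Variance retained: {pca.explained_variance_ratio_.sum():.2%}")
```

In a real comparison, PCA would normally be fitted on the training split only and then applied to the test split, so that no test information leaks into the projection.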

*Setup and Imports*

In [81]:
! pip install torch torchvision matplotlib scikit-learn pandas numpy tqdm






In [82]:
import os
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from tqdm import tqdm
import torch.optim as optim
import torch.nn.functional as F
import matplotlib.pyplot as plt
from IPython.display import clear_output
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, DataLoader, TensorDataset


## 1 **Pure CNN Solution**

#### 1.1 **Load and Preprocess Dataset**

The `DataModule` class converts a folder of class-labelled images into a cached CSV of flattened 28x28 grayscale images, `remaps` the class names to integer labels, and `splits` the data into training and test sets. It exposes the data splits and the class mapping through simple accessors. As written here, it does not `normalize` or `reshape` the pixel data, create a validation split, or build PyTorch `DataLoader` objects; those steps have to be handled by the training workflow.
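
A minimal sketch of the normalization, reshaping, and `DataLoader` steps mentioned above. It assumes flattened rows of 784 pixel values in the 0-255 range and uses random stand-in arrays rather than the project's data; the variable names are illustrative only.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in arrays shaped like flattened CSV rows (random values, not real images)
rng = np.random.default_rng(0)
x = rng.integers(0, 256, size=(120, 784)).astype(np.float32)
y = rng.integers(0, 3, size=120)

# Normalize to [0, 1] and reshape to (N, channels, height, width) for a CNN
x_tensor = torch.from_numpy(x / 255.0).reshape(-1, 1, 28, 28)
y_tensor = torch.from_numpy(y).long()

# Wrap the tensors so batches can be drawn during training
loader = DataLoader(TensorDataset(x_tensor, y_tensor), batch_size=64, shuffle=True)
xb, yb = next(iter(loader))
print(xb.shape, yb.shape)
```

If the pixel values are already stored as floats in [0, 1], the division by 255 should be dropped.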

In [83]:
from PIL import Image  # Pillow is installed as a matplotlib dependency


class DataModule:
    def __init__(self, data_dir="./dataset", dataset_dir="mnist-animals-dataset", batch_size=64, test_size=0.2, random_state=42):
        self.data_dir = data_dir        # root folder with one subdirectory per class
        self.dataset_dir = dataset_dir  # folder where the flattened CSV is cached
        self.batch_size = batch_size
        self.test_size = test_size
        self.random_state = random_state

        self.class_map = None
        # train/test splits, populated by _preprocess()
        self.x_train = None
        self.y_train = None
        self.x_test = None
        self.y_test = None

        self._preprocess()

    def _preprocess(self):
        if not os.path.exists(self.data_dir):
            raise FileNotFoundError(f"Data directory {self.data_dir} does not exist. Please check the path.")

        # Build the CSV only if a preprocessed copy does not already exist
        csv_target = f"{self.dataset_dir}/mnist-animals.csv"
        if not os.path.exists(csv_target):
            # create dataset directory if it does not exist
            os.makedirs(self.dataset_dir, exist_ok=True)
            with open(csv_target, 'w') as f:
                # header: class label followed by 784 pixel columns
                f.write('class,' + ','.join([f'pixel_{i}' for i in range(28 * 28)]) + '\n')
                # This loop assumes the images are categorized in subdirectories,
                # e.g., data_dir/cat, data_dir/dog, data_dir/bird
                for subdir in sorted(os.listdir(self.data_dir)):
                    subdir_path = os.path.join(self.data_dir, subdir)
                    if not os.path.isdir(subdir_path):
                        continue  # skip stray files next to the class folders
                    for file in sorted(os.listdir(subdir_path)):
                        if not file.endswith(('.png', '.jpg')):
                            continue
                        file_path = os.path.join(subdir_path, file)
                        # Convert to grayscale and resample to 28x28 with Pillow
                        # (np.resize would only repeat or truncate values, not interpolate)
                        with Image.open(file_path) as img:
                            pixels = np.asarray(img.convert('L').resize((28, 28)))
                        # one row per image: class label, then 784 pixel values (0-255)
                        f.write(subdir + ',' + ','.join(map(str, pixels.flatten())) + '\n')
            print(f"Preprocessing complete. CSV file saved to {csv_target}")
        else:
            print(f"CSV file already exists at {csv_target}. Skipping preprocessing step.")

        # Now we can read the csv file and split it into train and test sets
        data = pd.read_csv(csv_target)
        # integer index -> class name, in order of first appearance in the CSV
        self.class_map = {i: label for i, label in enumerate(data['class'].unique())}
        # inverse lookup: class name -> integer index
        label_to_index = {label: i for i, label in self.class_map.items()}
        X = data.drop(columns=['class']).values
        y = data['class'].map(label_to_index).values
        self.x_train, self.x_test, self.y_train, self.y_test = train_test_split(X, y, test_size=self.test_size, random_state=self.random_state)

    def get_class_map(self):
        return self.class_map

    def get_train_data(self):
        # no validation split is created in this class, so only the train/test arrays are returned
        return self.x_train, self.y_train, self.x_test, self.y_test

In [84]:
dataset = DataModule(data_dir="./dataset", dataset_dir="mnist-animals", batch_size=64, test_size=0.2, random_state=42)

print(f"Train data shape: {dataset.x_train.shape}, Train labels shape: {dataset.y_train.shape}")
print(f"Test data shape: {dataset.x_test.shape}, Test labels shape: {dataset.y_test.shape}")
print(f"Class map: {dataset.class_map}")

CSV file already exists at mnist-animals/mnist-animals.csv. Skipping preprocessing step.
Train data shape: (120, 784), Train labels shape: (120,)
Test data shape: (30, 784), Test labels shape: (30,)
Class map: {0: 'bird', 1: 'cat', 2: 'dog'}
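
The split above is not stratified, so with only 30 test images the per-class counts can drift away from 10 each. A minimal sketch of a stratified alternative follows, using stand-in arrays with the same layout (150 rows, 50 per class) rather than the real data. Whether the project should adopt it is a design choice, not something the notebook states.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same layout as above: 150 rows, 50 per class
X = np.zeros((150, 784))
y = np.repeat([0, 1, 2], 50)

# stratify=y keeps the class proportions identical in both splits
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Per-class counts in each split
print(np.bincount(y_train), np.bincount(y_test))
```

With balanced classes and `test_size=0.2`, each class contributes 40 training and 10 test images, which keeps per-class accuracy and the confusion matrix comparable across classes.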
