## **Image Classification of Animals Using CNNs and PCA**
#### `Geraldine Marten-Ellis`, `Dawit Hailu`, `Jan McConnell`, `Aaron J. Smith`
**DS510 Team Project** `Summer 2025`

### **Introduction**


> This project looks at how well Convolutional Neural Networks (CNNs) can classify images of animals and examines whether applying Principal Component Analysis (PCA) improves the results, using a small selection from the High-Resolution Cat-Dog-Bird Image Dataset, with 150 grayscale images in total, divided evenly between cats, dogs, and birds. Our goal is to compare classification accuracy, training time, and loss between models trained on raw images and those preprocessed with PCA. The project highlights how dimensionality reduction can influence both efficiency and generalization in small-scale image classification tasks. Performance metrics and visualizations, including confusion matrices and loss curves will support findings.

*Setup and Imports*

In [2]:
! pip install torch torchvision matplotlib scikit-learn pandas numpy

Collecting torch
  Downloading torch-2.7.1-cp311-cp311-win_amd64.whl.metadata (28 kB)
Collecting torchvision
  Downloading torchvision-0.22.1-cp311-cp311-win_amd64.whl.metadata (6.1 kB)
Collecting filelock (from torch)
  Downloading filelock-3.18.0-py3-none-any.whl.metadata (2.9 kB)
Collecting sympy>=1.13.3 (from torch)
  Downloading sympy-1.14.0-py3-none-any.whl.metadata (12 kB)
Collecting networkx (from torch)
  Downloading networkx-3.5-py3-none-any.whl.metadata (6.3 kB)
Collecting jinja2 (from torch)
  Downloading jinja2-3.1.6-py3-none-any.whl.metadata (2.9 kB)
Collecting mpmath<1.4,>=1.1.0 (from sympy>=1.13.3->torch)
  Downloading mpmath-1.3.0-py3-none-any.whl.metadata (8.6 kB)
Collecting MarkupSafe>=2.0 (from jinja2->torch)
  Downloading MarkupSafe-3.0.2-cp311-cp311-win_amd64.whl.metadata (4.1 kB)
Downloading torch-2.7.1-cp311-cp311-win_amd64.whl (216.1 MB)
   ---------------------------------------- 0.0/216.1 MB ? eta -:--:--
   ---------------------------------------- 0.1/216.1 


[notice] A new release of pip is available: 24.0 -> 25.2
[notice] To update, run: C:\Users\davuc\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\python.exe -m pip install --upgrade pip


In [None]:
import os
import numpy as np
import pandas as pd
import torch.nn as nn
from tqdm import tqdm
import torch.optim as optim
import torch.nn.functional as F
import matplotlib.pyplot as plt
from IPython.display import clear_output
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, DataLoader, TensorDataset


## 1 **Data Loading and Preprocessing**

This class handles reading downloaded CSV datasets, `normalizing` and `reshaping` the image data, `splitting` into training, validation, and test sets, and preparing PyTorch `DataLoader` objects for use in model training class workflows. It also manages label remapping and provides access to data splits, tensors, and class mappings.

In [49]:
class DataModule:
    def __init__(self, data_dir="./dataset", dataset_dir = "mnist-animals-dataset", batch_size=64, test_size=0.2, random_state=42):
        self.data_dir = data_dir
        self.dataset_dir = dataset_dir
        self.batch_size = batch_size
        self.test_size = test_size
        self.random_state = random_state

        self.train_loader = None
        self.val_loader = None
        self.test_loader = None
        self.class_map = None
        # training data
        self.x_train = None
        self.y_train = None
        self.x_test = None
        self.y_test = None
        self.x_val = None
        self.y_val = None
        #training tensors
        self.x_train_tensor = None
        self.y_train_tensor = None
        self.x_test_tensor = None
        self.y_test_tensor = None
        self.x_val_tensor = None
        self.y_val_tensor = None

        self._preprocess()

    def _preprocess(self):
        # Resize images to 28x28 if not already
        if not os.path.exists(self.data_dir):
            raise FileNotFoundError(f"Data directory {self.data_dir} does not exist. Please check the path.")
        
        # check if preprocessed data already exists
        csv_target = f"{self.dataset_dir}/mnist-animals.csv"
        if not os.path.exists(csv_target):
            # create dataset directory if it does not exist
            if not os.path.exists(self.dataset_dir):
                os.makedirs(self.dataset_dir)
            # write header to csv
            with open(csv_target, 'w') as f:
                f.write('class,' + ','.join([f'pixel_{i}' for i in range(28 * 28)]) + '\n')
                f.close()
            # This loop assumes the images are categorized in subdirectories
                # e.g., data_dir/cat, data_dir/dog, data_dir/bird
            for subdir in os.listdir(self.data_dir):
                subdir_path = os.path.join(self.data_dir, subdir)
                for file in os.listdir(subdir_path):
                    if file.endswith('.png') or file.endswith('.jpg'):
                        file_path = os.path.join(subdir_path, file)
                        
                        # Here we resize the image to 28x28 and append it to csv
                        image = plt.imread(file_path)
                        if image.shape[0] != 28 or image.shape[1] != 28:
                            resized_image = np.resize(image, (28, 28))
                            # add image class to the resized image
                            class_label = np.array([subdir])
                            resized_image = np.append(class_label, resized_image.flatten())
                            
                            # write resized image to csv
                            with open(csv_target, 'a') as f:
                                f.write(','.join(map(str, resized_image)) + '\n')
                                f.close()
            print(f"Preprocessing complete. CSV file saved to {csv_target}")
        else:
            print(f"CSV files already exist in {self.data_dir}/{self.data_dir}. Skipping preprocessing step.")                  
        
        # Now we can read the csv file and split it into train and test sets
        data = pd.read_csv(csv_target)
        self.class_map = {i: label for i, label in enumerate(data['class'].unique())}
        X = data.drop(columns=['class']).values
        y = data['class'].apply(lambda x: list(self.class_map.keys())[list(self.class_map.values()).index(x)]).values
        x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=self.test_size, random_state=self.random_state, shuffle=True)
        
        self.x_test = x_test
        self.y_test = y_test
        # Split the training set into train/validation
        self.x_train, self.x_val, self.y_train, self.y_val = train_test_split(
            x_train, y_train, test_size=self.test_size, random_state=self.random_state, stratify=y_train
        )
        # Normalize
        x_train = self.x_train.astype('float32') / 255.0
        x_val = self.x_val.astype('float32') / 255.0
        x_test = x_test.astype('float32') / 255.0

        # Reshape to (N, 1, 28, 28)
        x_train = x_train.reshape(-1, 1, 28, 28)
        x_val = x_val.reshape(-1, 1, 28, 28)
        x_test = x_test.reshape(-1, 1, 28, 28)
        
        

        # Convert to tensors
        self.x_train_tensor = torch.from_numpy(self.x_train)
        self.y_train_tensor = torch.from_numpy(self.y_train).long()
        self.x_val_tensor = torch.from_numpy(self.x_val)
        self.y_val_tensor = torch.from_numpy(self.y_val).long()
        self.x_test_tensor = torch.from_numpy(self.x_test)
        self.y_test_tensor = torch.from_numpy(self.y_test).long()

        # Make TensorDatasets and DataLoaders
        train_dataset = TensorDataset(self.x_train_tensor, self.y_train_tensor)
        val_dataset = TensorDataset(self.x_val_tensor, self.y_val_tensor)
        test_dataset = TensorDataset(self.x_test_tensor, self.y_test_tensor)

        self.train_loader = DataLoader(train_dataset, batch_size=self.batch_size, shuffle=True)
        self.val_loader = DataLoader(val_dataset, batch_size=self.batch_size)
        self.test_loader = DataLoader(test_dataset, batch_size=self.batch_size)

    def get_class_map(self):
        return self.class_map
    
    def get_train_data(self):
        return self.x_train, self.y_train, self.x_test, self.y_test, self.x_val, self.y_val
    
    def get_train_tensors(self):
        return self.x_train_tensor, self.y_train_tensor, self.x_test_tensor, self.y_test_tensor, self.x_val_tensor, self.y_val_tensor

    def get_dataloaders(self):
        return self.train_loader, self.val_loader, self.test_loader
            
    

In [50]:
dataset = DataModule(data_dir="./dataset", dataset_dir="mnist-animals", batch_size=64, test_size=0.2, random_state=42)

print(f"Train data shape: {dataset.x_train.shape}, Train labels shape: {dataset.y_train.shape}")
print(f"Test data shape: {dataset.x_test.shape}, Test labels shape: {dataset.y_test.shape}")
print(f"Class map: {dataset.class_map}")

CSV files already exist in ./dataset/./dataset. Skipping preprocessing step.
Train data shape: (96, 784), Train labels shape: (96,)
Test data shape: (30, 784), Test labels shape: (30,)
Class map: {0: 'bird', 1: 'cat', 2: 'dog'}


In [51]:
print(f"Train data shape: {dataset.x_train.shape}, Train labels shape: {dataset.y_train.shape}")

Train data shape: (96, 784), Train labels shape: (96,)


In [52]:
# Initialize and use the data module
data_module = DataModule()
train_loader, val_loader, test_loader = data_module.get_dataloaders()
print(f"Train loader: {len(train_loader)} batches, Val loader: {len(val_loader)} batches, Test loader: {len(test_loader)} batches")
# get training data
x_train, y_train, x_test, y_test, x_val, y_val = data_module.get_train_data()
print(f"Train data shape: {x_train.shape}, Train labels shape: {y_train.shape}")
print(f"Test data shape: {x_test.shape}, Test labels shape: {y_test.shape}")
print(f"Validation data shape: {x_val.shape}, Validation labels shape: {y_val.shape}")
# get training tensors
x_train_tensor, y_train_tensor, x_test_tensor, y_test_tensor, x_val_tensor, y_val_tensor = data_module.get_train_tensors()
print(f"Train tensors shape: {x_train_tensor.shape}, Train labels shape: {y_train_tensor.shape}")
print(f"Test tensors shape: {x_test_tensor.shape}, Test labels shape: {y_test_tensor.shape}")
print(f"Validation tensors shape: {x_val_tensor.shape}, Validation labels shape: {y_val_tensor.shape}")
classifications = data_module.get_class_map()
print(f"Class map: {classifications}")

Preprocessing complete. CSV file saved to mnist-animals-dataset/mnist-animals.csv
Train loader: 2 batches, Val loader: 1 batches, Test loader: 1 batches
Train data shape: (96, 784), Train labels shape: (96,)
Test data shape: (30, 784), Test labels shape: (30,)
Validation data shape: (24, 784), Validation labels shape: (24,)
Train tensors shape: torch.Size([96, 784]), Train labels shape: torch.Size([96])
Test tensors shape: torch.Size([30, 784]), Test labels shape: torch.Size([30])
Validation tensors shape: torch.Size([24, 784]), Validation labels shape: torch.Size([24])
Class map: {0: 'bird', 1: 'cat', 2: 'dog'}
