# Introduction
This project originated as a final project for my 'CSC 351: Machine Learning' class. I worked on this project with one of my classmates, Braydon Johnson. The goal was to make a convolutional neural network that could accurately classify different animal classes when given an image of an animal.

There wasn't much inspiration behind this project beyond a fascination with CNNs. We had previously worked with CNNs in a facial recognition lab, so we thought another CNN project would be fun. We also included three pretrained models (ResNet50, EfficientNetB0, and MobileNetV2) as benchmarks to compare our model against.

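Our model is a plain CNN rather than a residual network. Of the three pretrained models, it is closest in spirit to ResNet50, since a residual network is essentially a CNN with residual (skip-connection) blocks added to it. EfficientNetB0 and MobileNetV2 are not ResNets in the strict sense, but they also use skip connections inside their inverted residual blocks.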
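To make the distinction concrete, here is a minimal sketch of a residual block in PyTorch. It is not part of our model: `ResidualBlock` and its layer sizes are illustrative assumptions, not the exact blocks used inside ResNet50. The only point is the `out + x` skip connection.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualBlock(nn.Module):
    """Hypothetical basic residual block: two 3x3 convs plus a skip connection."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # The skip connection: add the block's input back before the final activation
        return F.relu(out + x)


block = ResidualBlock(channels=32)
x = torch.randn(4, 32, 56, 56)
y = block(x)  # same shape as x, so blocks like this can be stacked
```

A plain CNN layer would return `F.relu(out)` instead; the added `x` gives gradients a direct path through the block.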

In [None]:
# Libraries
import kagglehub
import os
import random
import numpy as np
import pandas as pd
from PIL import Image as PILImage
import matplotlib.pyplot as plt
import torchvision.transforms as transforms
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, random_split
import torchvision.models as models
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
from tqdm import tqdm
import torch.nn.functional as F

# Data Preparation

### About the data

The data we used in this project is the Animals-10 dataset from Kaggle.

Animals-10 Dataset Details:
- Contains ~26k Images
- 10 Different classes:
  - Dog, Horse, Elephant, Butterfly, Chicken, Cat, Cow, Sheep, Squirrel, and Spider
- All images in the dataset were gathered from Google Images.
- Dataset Link: https://www.kaggle.com/datasets/alessiocorrado99/animals10

In [None]:
# Data initialization

# Translation for classes
translate = {
    "cane": "dog", "cavallo": "horse", "elefante": "elephant", "farfalla": "butterfly",
    "gallina": "chicken", "gatto": "cat", "mucca": "cow", "pecora": "sheep",
    "scoiattolo": "squirrel", "ragno": "spider"
}

# Check for GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

In [None]:
# Download data and assign classes

# Download data
path = kagglehub.dataset_download("alessiocorrado99/animals10")
image_dir = os.path.join(path, "raw-img")

# Assign correct classes
translated_images = []
class_counts = {}

for italian_class in os.listdir(image_dir):
    class_path = os.path.join(image_dir, italian_class)
    if os.path.isdir(class_path):
        english_class = translate.get(italian_class, italian_class)
        image_files = [
            os.path.join(class_path, f)
            for f in os.listdir(class_path)
            if f.lower().endswith((".jpg", ".jpeg", ".png"))
        ]
        class_counts[english_class] = len(image_files)
        for img_path in image_files:
            translated_images.append((img_path, english_class))

In [None]:
# Dataset stats
print(f"Total images: {len(translated_images)}")
for class_name, count in sorted(class_counts.items()):
    print(f"  {class_name}: {count} images")

### Image Transformations

We came up with two different transformation sets, `train_transform` and `val_transform`. The `train_transform` set applies minor random transformations to the training images to help mitigate overfitting in our model. The `val_transform` set only resizes and normalizes the images, because this is where we want the model to see the unaltered images so we can measure its true performance.

`train_transform` changes:
- Resize to a slightly larger image (256×256)
- Random crop to 224×224
- Random horizontal flip
- Random rotation (up to 15°)
- Color jitter
- Convert to tensor
- Image normalization*

`val_transform` changes:
- Resize to 224×224, the input size the models see
- Convert to tensor
- Image normalization*

*Normalization uses the ImageNet channel means and standard deviations, which the pretrained models expect.
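As a small worked example of the normalization step, the sketch below applies the per-channel formula `(value - mean) / std` to a single pixel and then inverts it, which is the same arithmetic we use later to denormalize images for plotting. The helper names are made up for illustration; in the notebook, `transforms.Normalize` does this for every pixel of the tensor.

```python
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb):
    """Normalize one RGB pixel whose channels are already scaled to [0, 1]."""
    return [(c - m) / s for c, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)]


def denormalize_pixel(rgb):
    """Invert normalize_pixel, as done before plotting an image."""
    return [c * s + m for c, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)]


pixel = [0.5, 0.5, 0.5]  # mid-gray after ToTensor scaling
normalized = normalize_pixel(pixel)
restored = denormalize_pixel(normalized)
print(normalized)
print(restored)
```

After normalization, the values are centered near zero instead of lying in [0, 1], which is the input range the ImageNet-pretrained weights were trained on.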

In [None]:
# Image transformations

# Define separate transforms for training and validation
train_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(15),
    transforms.ColorJitter(brightness=0.1, contrast=0.1, saturation=0.1, hue=0.1),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

val_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

### Custom Dataset
We define a custom `AnimalDataset` class to handle our images and labels. This allows us to flexibly load images from any list (rather than a fixed folder structure) and apply transformations for data augmentation or normalization.

In [None]:
# Custom dataset class
class AnimalDataset(Dataset):
    def __init__(self, image_label_list, transform=None, class_to_idx=None):
        self.image_label_list = image_label_list
        self.transform = transform
        self.class_to_idx = class_to_idx

    def __len__(self):
        return len(self.image_label_list)

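    def __getitem__(self, idx):
        img_path, label = self.image_label_list[idx]
        image = PILImage.open(img_path).convert("RGB")
        # Apply the transform only if one was provided
        if self.transform:
            image = self.transform(image)
        # Map the class name to its integer index regardless of the transform
        label_idx = self.class_to_idx[label]
        return image, label_idx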

In [None]:
# Class names and mappings
class_names = sorted(set(label for _, label in translated_images))
class_to_idx = {label: idx for idx, label in enumerate(class_names)}
idx_to_class = {idx: label for label, idx in class_to_idx.items()}
num_classes = len(class_names)
print(f"Number of classes: {num_classes}")

### Train, Validation, Test Split

We split the dataset into 80% training, 10% validation, and 10% testing.  
This ensures the model is trained on one portion, validated on unseen data during training, and evaluated on completely untouched data afterward for an honest performance measure.
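The per-class split can be sketched on toy data as follows. This is a simplified stand-in for the cell below, not a replacement for it: the function name `stratified_split`, the `seed` parameter, and the toy file names are assumptions added for illustration. Seeding a local random generator makes the split reproducible between runs.

```python
import random
from collections import defaultdict


def stratified_split(items, train_frac=0.8, val_frac=0.1, seed=42):
    """Split (path, label) pairs per class; the remainder goes to the test set."""
    rng = random.Random(seed)  # local RNG so the split is reproducible
    by_class = defaultdict(list)
    for item in items:
        by_class[item[1]].append(item)

    train, val, test = [], [], []
    for label in sorted(by_class):
        group = by_class[label]
        rng.shuffle(group)
        n_train = int(train_frac * len(group))
        n_val = int(val_frac * len(group))
        train.extend(group[:n_train])
        val.extend(group[n_train:n_train + n_val])
        test.extend(group[n_train + n_val:])
    return train, val, test


# Toy data: 20 "cat" and 10 "dog" file names (made up for illustration)
toy = [(f"cat_{i}.jpg", "cat") for i in range(20)] + [(f"dog_{i}.jpg", "dog") for i in range(10)]
train, val, test = stratified_split(toy)
print(len(train), len(val), len(test))
```

Splitting inside each class keeps the class proportions roughly equal across the three sets, which matters here because the Animals-10 classes are not the same size.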

In [None]:
# Train, Test, Split
random.shuffle(translated_images)

# Split up by class
images_by_class = {}
for img_path, label in translated_images:
    if label not in images_by_class:
        images_by_class[label] = []
    images_by_class[label].append((img_path, label))

train_images = []
val_images = []
test_images = []

# 80% train, 10% validation, 10% test
for label, images in images_by_class.items():
    n_train = int(0.8 * len(images))
    n_val = int(0.10 * len(images))

    train_images.extend(images[:n_train])
    val_images.extend(images[n_train:n_train+n_val])
    test_images.extend(images[n_train+n_val:])

print(f"Train set: {len(train_images)} images")
print(f"Validation set: {len(val_images)} images")
print(f"Test set: {len(test_images)} images")

In [None]:
# Establish a dataset for Training, Validation, and Testing
train_dataset = AnimalDataset(train_images, transform=train_transform, class_to_idx=class_to_idx)
val_dataset = AnimalDataset(val_images, transform=val_transform, class_to_idx=class_to_idx)
test_dataset = AnimalDataset(test_images, transform=val_transform, class_to_idx=class_to_idx)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, num_workers=4, pin_memory=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False, num_workers=4, pin_memory=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False, num_workers=4, pin_memory=True)

# Training and Evaluation

### Training process

This function handles the full model training: running multiple epochs, optimizing the model, tracking loss and accuracy, saving the best model, and adjusting the learning rate with a scheduler.  
We also record history for visualizing training progress later.
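To show how a reduce-on-plateau scheduler reacts to a stalled loss, here is a simplified pure-Python sketch. It mimics the `factor` and `patience` behavior of `ReduceLROnPlateau` in `min` mode but deliberately ignores the real scheduler's threshold and cooldown options; `plateau_schedule` and the loss values are made up for illustration.

```python
def plateau_schedule(losses, lr=1e-3, factor=0.1, patience=3):
    """Return the learning rate in effect after each epoch's loss is reported."""
    best = float("inf")
    bad_epochs = 0
    lrs = []
    for loss in losses:
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        # Reduce only once the loss has failed to improve for more than `patience` epochs
        if bad_epochs > patience:
            lr *= factor
            bad_epochs = 0
        lrs.append(lr)
    return lrs


# Made-up losses: improvement stops after the second epoch
lrs = plateau_schedule([1.0, 0.9, 0.95, 0.95, 0.95, 0.95, 0.8])
print(lrs)
```

With `patience=3`, the learning rate stays at its initial value through three non-improving epochs and drops by a factor of ten on the fourth.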

In [None]:
# Training Process
def train_model(model, train_loader, val_loader, criterion, optimizer, scheduler, num_epochs=10, model_name="Model"):
  model.to(device)
  best_val_acc = 0.0
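  # Clone the tensors: state_dict() returns references to the live parameters
  best_model_wts = {k: v.detach().clone() for k, v in model.state_dict().items()}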
  history = {'train_loss': [], 'train_acc': [], 'val_loss': [], 'val_acc': []}

  for epoch in range(num_epochs):
    print(f'Epoch {epoch+1}/{num_epochs}')
    print('-' * 10)

    model.train()
    running_loss = 0.0
    running_corrects = 0
    num_samples = 0

    # tqdm settings
    loading_bar = tqdm(train_loader, desc=f"{model_name} Training Epoch {epoch+1}/{num_epochs}")

    for inputs, labels in loading_bar:
        inputs = inputs.to(device)
        labels = labels.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        _, preds = torch.max(outputs, 1)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item() * inputs.size(0)
        running_corrects += torch.sum(preds == labels.data)
        num_samples += inputs.size(0)

        loading_bar.set_postfix({'Loss': running_loss / num_samples, 'Accuracy': (running_corrects.double() / num_samples).item()})

    epoch_loss = running_loss / num_samples
    epoch_acc = running_corrects.double() / num_samples
    history['train_loss'].append(epoch_loss)
    history['train_acc'].append(epoch_acc.item())

    print(f'Train Loss: {epoch_loss:.4f} Accuracy: {epoch_acc:.4f}')

    # Validation
    model.eval()
    val_loss = 0.0
    val_corrects = 0

    with torch.no_grad():
        loading_bar = tqdm(val_loader, desc=f"{model_name} Validation Epoch {epoch+1}/{num_epochs}")
        for inputs, labels in loading_bar:
            inputs = inputs.to(device)
            labels = labels.to(device)
            outputs = model(inputs)
            _, preds = torch.max(outputs, 1)
            loss = criterion(outputs, labels)
            val_loss += loss.item() * inputs.size(0)
            val_corrects += torch.sum(preds == labels.data)

            # Report the validation loss here, not the training running_loss
            loading_bar.set_postfix({'Loss': val_loss / len(val_loader.dataset), 'Accuracy': (val_corrects.double() / len(val_loader.dataset)).item()})

    val_loss = val_loss / len(val_loader.dataset)
    val_acc = val_corrects.double() / len(val_loader.dataset)

    history['val_loss'].append(val_loss)
    history['val_acc'].append(val_acc.item())

    print(f'Validation Loss: {val_loss:.4f} Accuracy: {val_acc:.4f}')

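    # Save best model. Clone the tensors: dict.copy() is only a shallow copy, so the
    # saved weights would otherwise keep changing as training continues.
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        best_model_wts = {k: v.detach().clone() for k, v in model.state_dict().items()}
        print(f'Best model saved with accuracy: {best_val_acc:.4f}')
        # Use a per-model file name so one model does not overwrite another's checkpoint
        torch.save(best_model_wts, f'best_model_{model_name}.pth')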

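    # Step the scheduler on the validation loss, so the learning rate drops
    # when generalization stalls rather than when the training loss stalls
    scheduler.step(val_loss)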

  print(f'Best Validation Accuracy: {best_val_acc:.4f}')
  model.load_state_dict(best_model_wts)
  return model, history

### Evaluation

After training, we evaluate the model on the test set.  
We compute the classification report (precision, recall, F1-score), confusion matrix, and collect predictions vs. ground-truth labels for later visualization.
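The relationship between the confusion matrix and the reported metrics can be sketched in plain Python. The helper `per_class_metrics` and the 2-class matrix are made up for illustration; in the notebook, `classification_report` computes these values for us.

```python
def per_class_metrics(cm):
    """Return (precision, recall, f1) per class from a square confusion matrix.

    Rows are true classes and columns are predicted classes, which is the
    layout sklearn's confusion_matrix uses.
    """
    n = len(cm)
    metrics = []
    for i in range(n):
        tp = cm[i][i]
        predicted_as_i = sum(cm[r][i] for r in range(n))  # column sum
        actually_i = sum(cm[i])  # row sum
        precision = tp / predicted_as_i if predicted_as_i else 0.0
        recall = tp / actually_i if actually_i else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics.append((precision, recall, f1))
    return metrics


# Made-up 2-class example: 8 of 10 cats and 9 of 10 dogs classified correctly
cm = [[8, 2],
      [1, 9]]
metrics = per_class_metrics(cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))
print(metrics, accuracy)
```

Precision answers "of everything predicted as this class, how much was right?", while recall answers "of everything that truly is this class, how much did we find?".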

In [None]:
# Evaluation process
def evaluate_model(model, dataloader):
    model.eval()
    all_predictions = []
    all_labels = []

    with torch.no_grad():
        for inputs, labels in tqdm(dataloader, desc="Evaluating..."):
          inputs = inputs.to(device)
          labels = labels.to(device)
          outputs = model(inputs)
          _, preds = torch.max(outputs, 1)
          all_predictions.extend(preds.cpu().numpy())
          all_labels.extend(labels.cpu().numpy())

    report = classification_report(all_labels, all_predictions, target_names = [idx_to_class[i] for i in range(num_classes)], output_dict=True)

    cm = confusion_matrix(all_labels, all_predictions)

    return report, cm, all_predictions, all_labels

In [None]:
# Plot training history
def plot_training(history):
  plt.figure(figsize=(12, 4))

  plt.subplot(1, 2, 1)
  plt.plot(history['train_loss'], label='Training Loss')
  plt.plot(history['val_loss'], label='Validation Loss')
  plt.title('Loss through Epochs')
  plt.xlabel('Epoch')
  plt.ylabel('Loss')
  plt.legend()

  plt.subplot(1, 2, 2)
  plt.plot(history['train_acc'], label='Training Accuracy')
  plt.plot(history['val_acc'], label='Validation Accuracy')
  plt.title('Accuracy through Epochs')
  plt.xlabel('Epoch')
  plt.ylabel('Accuracy')
  plt.legend()

  plt.tight_layout()
  plt.show()

In [None]:
# Confusion Matrix generation
def plot_cm(cm, class_names):
  plt.figure(figsize=(10, 8))
  sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=class_names, yticklabels=class_names)
  plt.xlabel('Predicted')
  plt.ylabel('True')
  plt.title('Confusion Matrix')
  plt.tight_layout()
  plt.show()

In [None]:
# Visualize predictions
def visualize_preds(model, test_dataset, num_samples=16):
    model.eval()
    indices = random.sample(range(len(test_dataset)), num_samples)
    fig, axs = plt.subplots(4, 4, figsize=(12, 12))

    with torch.no_grad():
        for i, idx in enumerate(indices):
            image, label = test_dataset[idx]
            input_tensor = image.unsqueeze(0).to(device)
            output = model(input_tensor)
            _, pred = torch.max(output, 1)

            image = image.cpu().numpy().transpose((1, 2, 0))

            # Denormalize image
            mean = np.array([0.485, 0.456, 0.406])
            std = np.array([0.229, 0.224, 0.225])
            image = std * image + mean
            image = np.clip(image, 0, 1)

            # Plot image
            ax = axs.flat[i]
            ax.imshow(image)
            true_label = idx_to_class[label]
            pred_label = idx_to_class[pred.item()]
            ax.set_title(f"True: {true_label}\nPred: {pred_label}")
            ax.axis('off')

    plt.tight_layout()
    plt.show()

In [None]:
# Output incorrect predictions
def visualize_incorrect_preds(model, test_dataset, num_samples=16):
  model.eval()
  incorrect_samples = []

  # Collect every misclassified test image
  with torch.no_grad():
    for idx in range(len(test_dataset)):
      image, label = test_dataset[idx]
      input_tensor = image.unsqueeze(0).to(device)
      output = model(input_tensor)
      _, pred = torch.max(output, 1)
      if pred.item() != label:
        incorrect_samples.append((image, label, pred.item()))
  if len(incorrect_samples) == 0:
    print("No incorrect predictions found.")
    return

  # Randomly select up to 16 incorrect predictions (the grid is 4x4)
  display_samples = random.sample(incorrect_samples, min(num_samples, 16, len(incorrect_samples)))
  fig, axs = plt.subplots(4, 4, figsize=(12, 12))
  for ax in axs.flat:
    ax.axis('off')  # hide any cells that stay empty
  for i, (image, label, pred) in enumerate(display_samples):
    image = image.cpu().numpy().transpose((1, 2, 0))

    # Denormalize image for better quality
    mean = np.array([0.485, 0.456, 0.406])
    std = np.array([0.229, 0.224, 0.225])
    image = std * image + mean
    image = np.clip(image, 0, 1)

    ax = axs.flat[i]
    ax.imshow(image)

    true_label = idx_to_class[label]
    pred_label = idx_to_class[pred]
    ax.set_title(f"True: {true_label}\nPred: {pred_label}")
    ax.axis('off')

  plt.tight_layout()
  plt.savefig('incorrect_predictions.png')
  plt.show()


## Convolved Images

To better understand what the model is learning, we visualize the feature maps produced after the first convolutional layer. This gives insights into how the model detects basic patterns like edges, textures, and shapes at early stages.
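As a toy illustration of what a single feature map is, the sketch below slides one 3×3 kernel over a tiny grayscale image in plain Python. The Sobel kernel is hand-picked for the example; the filters in our models are learned during training, and `conv2d_valid` is a made-up helper, not the code the models run.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map.

    Like nn.Conv2d, this is technically cross-correlation: the kernel is not flipped.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# 5x5 toy image: two dark columns on the left, three bright columns on the right
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hand-picked vertical-edge kernel (Sobel); a trained layer learns its own weights
sobel_x = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]
feature_map = conv2d_valid(image, sobel_x)
print(feature_map)
```

The output is large where the window covers the dark-to-bright boundary and zero where the window sees a uniform region, which is the kind of edge response the first-layer feature maps tend to show.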

Currently we do this only for the first layer, just to see which patterns it detects. I think it would be better to implement this for multiple layers, to get an even better understanding of how the feature maps change through the network and across epochs.

In [None]:
# Display convolved images and their feature maps
def convolved_images(model, model_name):
    model.eval()

    # Gather all images by class
    class_to_images = {}
    for img, label in test_dataset:
        if label not in class_to_images:
            class_to_images[label] = []
        class_to_images[label].append(img)

    # Randomly select one image per class
    class_examples = {label: random.choice(images) for label, images in class_to_images.items()}

    # Reduced figure size for compactness
    fig, axes = plt.subplots(num_classes, 6, figsize=(15, 2.5 * num_classes))
    fig.suptitle("Original Image + Convolved Outputs per Class", fontsize=16)

    for row_idx, (true_label, img) in enumerate(class_examples.items()):
        img_input = img.unsqueeze(0).to(device)

        # Predict the class
        with torch.no_grad():
            output = model(img_input)
            _, predicted_label = torch.max(output, 1)
            predicted_class_name = idx_to_class[predicted_label.item()]
            true_class_name = idx_to_class[true_label]

        # Pass through only the early layers
        with torch.no_grad():
            if model_name == "Custom":
                features = model.conv1(img_input)
            elif model_name == "ResNet50":
                features = list(model.children())[0](img_input)
            elif model_name == "MobileNetV2":
                features = model.features[0](img_input)
            elif model_name == "EfficientNetB0":
                features = model.features[0][0](img_input)
            else:
                raise ValueError(f"Unsupported model_name: {model_name}")

        features = features.squeeze(0).cpu()

        # Return image back to original for plotting
        image = img.cpu().numpy().transpose((1, 2, 0))
        mean = np.array([0.485, 0.456, 0.406])
        std = np.array([0.229, 0.224, 0.225])
        image = std * image + mean
        image = np.clip(image, 0, 1)

        # Original image
        ax = axes[row_idx, 0]
        ax.imshow(image)
        ax.axis('off')
        ax.set_title(f"True: {true_class_name}\nPred: {predicted_class_name}", fontsize=8)

        # Feature maps
        for i in range(5):
            ax = axes[row_idx, i+1]
            ax.imshow(features[i], cmap='viridis')
            ax.axis('off')
            if row_idx == 0:
                ax.set_title(f'Feature Map {i+1}', fontsize=7)

    # Tighter layout with reduced padding
    plt.tight_layout(rect=[0, 0, 1, 0.97], pad=0.5)
    plt.show()

In [None]:
# Define models

# Custom Model
class CustomModel(nn.Module):
  def __init__(self, num_classes, dropout_rate=0.5):
    super(CustomModel, self).__init__()

    self.conv1 = nn.Conv2d(3, 32, kernel_size=3, stride=1, padding=1)
    self.bn1 = nn.BatchNorm2d(32)
    self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)
    self.conv2 = nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1)
    self.bn2 = nn.BatchNorm2d(64)
    self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)
    self.conv3 = nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1)
    self.bn3 = nn.BatchNorm2d(128)
    self.pool3 = nn.MaxPool2d(kernel_size=2, stride=2)
    self.conv4 = nn.Conv2d(128, 256, kernel_size=3, stride=1, padding=1)
    self.bn4 = nn.BatchNorm2d(256)
    self.pool4 = nn.MaxPool2d(kernel_size=2, stride=2)

    self.feature_size = self._get_feature_size(224)

    self.fc1 = nn.Linear(self.feature_size, 512)
    self.bn5 = nn.BatchNorm1d(512)
    self.dropout1 = nn.Dropout(dropout_rate)
    self.fc2 = nn.Linear(512, 256)
    self.bn6 = nn.BatchNorm1d(256)
    self.dropout2 = nn.Dropout(dropout_rate)
    self.fc3 = nn.Linear(256, num_classes)

  def _get_feature_size(self, input_size):
    size = input_size // 2
    size = size // 2
    size = size // 2
    size = size // 2
    return 256 * size * size

  def forward(self, x):
    x = self.pool1(F.leaky_relu(self.bn1(self.conv1(x)), 0.1))
    x = self.pool2(F.leaky_relu(self.bn2(self.conv2(x)), 0.1))
    x = self.pool3(F.leaky_relu(self.bn3(self.conv3(x)), 0.1))
    x = self.pool4(F.leaky_relu(self.bn4(self.conv4(x)), 0.1))

    x = x.view(x.size(0), -1)

    x = F.leaky_relu(self.bn5(self.fc1(x)), 0.1)
    x = self.dropout1(x)
    x = F.leaky_relu(self.bn6(self.fc2(x)), 0.1)
    x = self.dropout2(x)
    x = self.fc3(x)
    return x

# Resnet Model
model_res = models.resnet50(weights='DEFAULT')
num_ftrs = model_res.fc.in_features
model_res.fc = nn.Linear(num_ftrs, num_classes)
model_res.to(device)

# Efficient Model
model_enet = models.efficientnet_b0(weights='DEFAULT')
num_ftrs = model_enet.classifier[1].in_features
model_enet.classifier[1] = nn.Linear(num_ftrs, num_classes)
model_enet.to(device)

# Mobile Model
model_mnet = models.mobilenet_v2(weights='DEFAULT')
num_ftrs = model_mnet.classifier[1].in_features
model_mnet.classifier[1] = nn.Linear(num_ftrs, num_classes)
model_mnet.to(device)

## Full Training, Testing, and Evaluation

We train each model (Custom CNN, ResNet50, EfficientNetB0, MobileNetV2), evaluate their performance, visualize predictions and misclassifications, extract feature maps, and save the trained weights for future use.

In [None]:
# Full training, testing, and evaluation of each model
# Named model_list so it does not shadow the torchvision `models` import
model_list = [
    ("Custom", CustomModel(num_classes)),
    ("ResNet50", model_res),
    ("EfficientNetB0", model_enet),
    ("MobileNetV2", model_mnet)
]

for model_name, model in model_list:
  criterion = nn.CrossEntropyLoss()
  optimizer = optim.Adam(model.parameters(), lr=0.001)

  # `verbose` is omitted: it is deprecated in recent PyTorch releases
  scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=3)

  # Train model
  num_epochs = 10
  trained_model, history = train_model(model, train_loader, val_loader, criterion, optimizer, scheduler, num_epochs, model_name)
  plot_training(history)

  print("\nEvaluating on test set...")
  report, cm, all_predictions, all_labels = evaluate_model(trained_model, test_loader)

  # Classification report
  print("\nClassification Report:")
  for class_name, metrics in report.items():
    if class_name in ['accuracy']:
      continue
    print(f"{class_name}: Precision: {metrics['precision']:.4f}, Recall: {metrics['recall']:.4f}, F1-Score: {metrics['f1-score']:.4f}")
  print(f"\nOverall Accuracy: {report['accuracy']:.4f}")

  # Confusion Matrix
  print("\nConfusion Matrix:")
  plot_cm(cm, [idx_to_class[i] for i in range(num_classes)])

  # Convolved Images
  convolved_images(trained_model, model_name)

  # Visualize predictions
  visualize_preds(trained_model, test_dataset)
  visualize_incorrect_preds(trained_model, test_dataset)

  # Save model
  torch.save({
        'model_state_dict': trained_model.state_dict(),
        'class_to_idx': class_to_idx,
        'idx_to_class': idx_to_class,
        'model_name': model_name  # store the name string, not the model object
  }, f'final_animal_classifier{model_name}.pth')

print(trained_model, class_to_idx, idx_to_class)

# Final Thoughts

This project taught us a lot about CNNs and residual networks. We learned something at every stage, from building our models to evaluating them.

Ultimately, our custom CNN was too simple and underperformed significantly compared to the pretrained models. It would be worth revisiting our model to find ways to improve it.

We also tried to write our own implementation of a residual network, but it didn't work out well and we were short on time. That may be worth revisiting as well.

End Model Results:

| Model          | Accuracy | Precision | F1-Score | Recall  |
| :-----         | :------: | :-------: | :------: | :-----: |
| Custom         | 0.7568   | 0.7714    | 0.7292   | 0.7150  |
| ResNet50       | 0.9151   | 0.9043    | 0.9080   | 0.9152  |
| EfficientNetB0 | 0.9410   | 0.9413    | 0.9370   | 0.9339  |
| MobileNetV2    | 0.9242   | 0.9268    | 0.9170   | 0.9177  |