# Simpsons Character Classification with CNN

Implementation of a Convolutional Neural Network (CNN) from scratch to classify images of Simpsons characters.

**Project Requirements:**
- Train a CNN entirely from scratch (no pre-trained models allowed).
- Use the hierarchical dataset provided in `characters_train/` where each folder corresponds to a character.
- Ensure reproducibility by setting random seeds.
- Save the trained model to disk after training.

**Goals of this Notebook:**
1. Load and preprocess the dataset.
2. Build a CNN model from scratch.
3. Train the model on the training set and validate its performance.
4. Save the trained model for later inference.

> **Important**
>
> This notebook does not work well with Python `3.14`. It was run on Python `3.12`.

## 1. Setup Environment
Import necessary libraries, configure device (CPU/GPU), and set random seeds for reproducibility.

In [19]:
import random
import numpy as np
import torch

In [20]:
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
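
As a quick sanity check of what seeding buys us, the sketch below reseeds Python's and NumPy's generators and confirms that the same seed yields the same values. The helper name `draw_sample` is made up for this illustration and is not part of the notebook; the same idea applies to `torch.manual_seed`.

```python
import random

import numpy as np


def draw_sample(seed: int) -> tuple[float, float]:
    """Reseed both generators, then draw one value from each."""
    random.seed(seed)
    np.random.seed(seed)
    return random.random(), float(np.random.rand())


first = draw_sample(42)
second = draw_sample(42)
print(first == second)  # True: same seed, same sequence
```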

In [21]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

Using device: cpu


## 2. Dataset Overview

- Dataset is located in `./data/simpsons/archive/characters_train/`.
- Each folder corresponds to a character class.
- Images will be resized to 64x64 pixels for uniformity; Section 3 explains why this size was chosen.
- We'll map each class to a numeric label for training.

In [22]:
import os

DATA_DIR = "./data/simpsons/archive/characters_train"
pic_size = 64  # resize images to 64x64

# Gather all class names
class_names = sorted([d for d in os.listdir(DATA_DIR) if os.path.isdir(os.path.join(DATA_DIR, d))])
class_to_idx = {name: i for i, name in enumerate(class_names)}
idx_to_class = {i: name for name, i in class_to_idx.items()}

In [23]:
print(f"Number of different characters: {len(class_names)}")
print(f"Examples of names: {class_names[:5]}")

Number of different characters: 42
Examples of names: ['abraham_grampa_simpson', 'agnes_skinner', 'apu_nahasapeemapetilon', 'barney_gumble', 'bart_simpson']


We need a numeric representation of each character name because the network outputs class indices, not strings. Fortunately, the number of characters is fixed at 42.

In [24]:
print("Class → Index mapping:")
print(list(class_to_idx.items())[:5])

print("\nIndex mapping -> Class:")
print(list(idx_to_class.items())[:5]) # good to have both mappings

Class → Index mapping:
[('abraham_grampa_simpson', 0), ('agnes_skinner', 1), ('apu_nahasapeemapetilon', 2), ('barney_gumble', 3), ('bart_simpson', 4)]

Index mapping -> Class:
[(0, 'abraham_grampa_simpson'), (1, 'agnes_skinner'), (2, 'apu_nahasapeemapetilon'), (3, 'barney_gumble'), (4, 'bart_simpson')]
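
The two mappings are inverses of each other, which is what lets us turn a predicted index back into a name at inference time. Here is a minimal sketch with three made-up folder names (the real mapping has 42 entries); `folders`, `to_idx` and `to_name` are illustrative names only.

```python
# Hypothetical folder names, deliberately unsorted, as os.listdir may return them.
folders = ["bart_simpson", "agnes_skinner", "homer_simpson"]

names = sorted(folders)  # sorting makes the index assignment deterministic
to_idx = {name: i for i, name in enumerate(names)}
to_name = {i: name for name, i in to_idx.items()}

predicted_index = 2  # e.g. the argmax of the model's output
print(to_name[predicted_index])  # homer_simpson
```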


## 3. Exploring the Dataset

- The dataset contains 42 characters and a total of 16,764 images. The number of images per character varies widely, for example from 3 for `lionel_hutz` to 1,797 for `homer_simpson`.
- Each image is in RGB format, but they vary in dimensions and aspect ratios.
- To feed the images into a CNN, we need to resize them to a uniform size. Common choices are 256×256, 128×128, or 64×64.
- On my setup, resizing to 128×128 resulted in training times exceeding 10 minutes per epoch, which was too long.
- Resizing to 64×64 significantly reduced computation time while still producing reasonable accuracy, making it the practical choice for this project.
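
To see why the image size matters so much, note that doubling the side length quadruples the number of input values, and the cost of the convolutional layers grows roughly in proportion. A small arithmetic sketch (the helper `values_per_image` is made up for this illustration):

```python
def values_per_image(side: int, channels: int = 3) -> int:
    """Number of scalar values in one square RGB image."""
    return side * side * channels


small = values_per_image(64)
large = values_per_image(128)

print(small)          # 12288
print(large)          # 49152
print(large / small)  # 4.0
```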

In [25]:
counts = {}

for character in class_names:
    character_dir = os.path.join(DATA_DIR, character)
    num_images = sum(
        1 for f in os.listdir(character_dir)
         if os.path.isfile(os.path.join(character_dir, f)) and f.lower().endswith((".jpg", ".jpeg", ".png"))
    )
    counts[character] = num_images

print(f"Total number of classes: {len(class_to_idx)}")
print(f"Total number of pictures: {sum(counts.values())}")

print("\nImage counts per class:")
for name, count in counts.items():
    print(f"{name}: {count} images")

Total number of classes: 42
Total number of pictures: 16764

Image counts per class:
abraham_grampa_simpson: 731 images
agnes_skinner: 34 images
apu_nahasapeemapetilon: 499 images
barney_gumble: 85 images
bart_simpson: 1074 images
carl_carlson: 79 images
charles_montgomery_burns: 955 images
chief_wiggum: 789 images
cletus_spuckler: 38 images
comic_book_guy: 376 images
disco_stu: 7 images
edna_krabappel: 366 images
fat_tony: 22 images
gil: 22 images
groundskeeper_willie: 97 images
homer_simpson: 1797 images
kent_brockman: 399 images
krusty_the_clown: 965 images
lenny_leonard: 248 images
lionel_hutz: 3 images
lisa_simpson: 1084 images
maggie_simpson: 103 images
marge_simpson: 1033 images
martin_prince: 57 images
mayor_quimby: 197 images
milhouse_van_houten: 864 images
miss_hoover: 14 images
moe_szyslak: 1162 images
ned_flanders: 1164 images
nelson_muntz: 287 images
otto_mann: 26 images
patty_bouvier: 58 images
principal_skinner: 956 images
professor_john_frink: 52 images
rainier_wolfcast

## 4. Loading Images and Assigning Labels

To train our CNN, we need to load all images from the dataset, resize them, and assign numeric labels for each class.

We define a function `get_data_opencv_with_map()` that:

1. Iterates through each character folder in the dataset.
2. Loads all images using OpenCV.
3. Converts them from BGR to RGB if needed.
4. Resizes all images to a consistent size (`pic_size` × `pic_size`).
5. Stores images and labels as NumPy arrays for easy processing.

In [26]:
import cv2

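def get_data_opencv_with_map(directory: str, BGR: bool = False):
    """
    Load all images folder by folder using OpenCV and assign numeric labels.
    Relies on the module-level `class_names`, `class_to_idx` and `pic_size`.

    Args:
        directory: root folder containing one sub-folder per character
        BGR: if True, convert each image from OpenCV's default BGR channel
            order to RGB; if False, keep the BGR order
    Returns:
        images: np.ndarray of shape (N, H, W, 3)
        labels: np.ndarray of numeric labels, shape (N,)
        num_classes: int
        images_per_class: dict[int, list[np.ndarray]] mapping class index to list of images
    """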
    images = []
    labels = []
    images_per_class = {idx: [] for idx in range(len(class_names))}

    for c in sorted(os.listdir(directory)):
        folder_path = os.path.join(directory, c)
        if not os.path.isdir(folder_path):
            continue

        class_id = class_to_idx[c]

        for f in sorted(os.listdir(folder_path)):
            fpath = os.path.join(folder_path, f)
            if not os.path.isfile(fpath):
                continue
            try:
                img = cv2.imread(fpath)
                if img is None:
                    continue
                if BGR:
                    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
                img = cv2.resize(img, (pic_size, pic_size))
                images.append(img)
                labels.append(class_id)
                images_per_class[class_id].append(img)
            except Exception as e:
                print(f"Failed to load {fpath}: {e}")

    return np.array(images), np.array(labels), len(class_names), images_per_class
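
A note on the colour conversion: `cv2.imread` returns channels in BGR order, and for 3-channel images converting to RGB amounts to reversing the channel axis. The sketch below shows this with plain NumPy on a single hand-made pixel, so it runs without OpenCV; the array is illustrative, not taken from the dataset.

```python
import numpy as np

# One "pure blue" pixel as OpenCV would store it: (B, G, R) = (255, 0, 0).
bgr = np.array([[[255, 0, 0]]], dtype=np.uint8)

# Reversing the last axis swaps the B and R channels.
rgb = bgr[..., ::-1]

print(rgb[0, 0].tolist())  # [0, 0, 255]
```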

Loading and processing all images can take **some time**, especially for a dataset of this size (16,764 images). Be patient while this cell runs. This step only needs to be done once per session; afterwards the images are held in memory, ready for training.

In [27]:
images, labels, num_classes, images_per_class = get_data_opencv_with_map(DATA_DIR, BGR=True)

print(f"Total images: {len(images)}")
print(f"Number of classes: {num_classes}")

Total images: 16764
Number of classes: 42


## 5. Creating Dataset and DataLoader

We need a custom PyTorch dataset to feed our images and labels into the model. This dataset will:
- Store the images and their corresponding labels.
- Convert images to PyTorch tensors.
- Normalize pixel values to [0, 1].
- Support indexing for batching during training.

In [28]:
from torch.utils.data import Dataset

class SimpsonsDataset(Dataset):
    def __init__(self, images: np.ndarray, labels: np.ndarray):
        """
        Args:
            images: numpy array of shape (N, H, W, 3)
            labels: numpy array of numeric labels
        """
        self.images = images
        self.labels = labels

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        # Convert image from HWC to CHW and normalize to [0,1]
        image = torch.tensor(self.images[idx], dtype=torch.float32).permute(2, 0, 1) / 255.0
        label = torch.tensor(self.labels[idx], dtype=torch.long)
        return image, label
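
The reordering in `__getitem__` is needed because OpenCV stores images as height × width × channels, while PyTorch convolutions expect channels first. As a rough NumPy equivalent of the `permute(2, 0, 1) / 255.0` step (a sketch on a synthetic all-white image, not the notebook's own code):

```python
import numpy as np

# Stand-in for one loaded image: 64 x 64 pixels, 3 channels, values 0..255.
hwc = np.full((64, 64, 3), 255, dtype=np.uint8)

# Move the channel axis to the front and scale to [0, 1].
chw = np.transpose(hwc, (2, 0, 1)).astype(np.float32) / 255.0

print(chw.shape)  # (3, 64, 64)
print(chw.max())  # 1.0
```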

We will use 85% of the data for training and 15% for validation.

Stratified splitting keeps each character's share of images approximately the same in both sets.

In [29]:
from torch.utils.data import DataLoader
from sklearn.model_selection import train_test_split

# Split images and labels
X_train, X_val, y_train, y_val = train_test_split(
    images, labels, test_size=0.15, random_state=42, stratify=labels
)

# Create datasets
train_dataset = SimpsonsDataset(X_train, y_train)
val_dataset = SimpsonsDataset(X_val, y_val)

# Create DataLoaders for batching
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)

print(f"Training samples: {len(train_dataset)}, Validation samples: {len(val_dataset)}")

Training samples: 14249, Validation samples: 2515


Batching with `DataLoader` keeps memory usage manageable and lets the model process several images per step. Shuffling the training data every epoch stops the model from seeing samples in a fixed order, which generally helps generalization.
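
The sample counts printed above can be reproduced with a little arithmetic. This sketch assumes the validation size is rounded up, which is consistent with the 2,515 / 14,249 split reported by the cell; it also shows how many batches each loader yields and that the last training batch is smaller than 32.

```python
import math

n_total = 16764
test_fraction = 0.15
batch_size = 32

n_val = math.ceil(test_fraction * n_total)  # assumption: rounded up
n_train = n_total - n_val

train_batches = math.ceil(n_train / batch_size)
val_batches = math.ceil(n_val / batch_size)
last_train_batch = n_train - (train_batches - 1) * batch_size

print(n_train, n_val)              # 14249 2515
print(train_batches, val_batches)  # 446 79
print(last_train_batch)            # 9
```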

## 6. Designing Model

The architecture uses 4 convolutional layers, 2 max pooling layers, dropout for regularization, and fully connected layers to output predictions for 42 classes.

The model definition is also kept in a separate file, [model.py](model.py).

In [30]:
import torch.nn as nn
import torch.nn.functional as F

class CNN4Conv(nn.Module):
    def __init__(self, num_classes):
        super().__init__()

        # --- Convolutional layers ---
        self.conv1 = nn.Conv2d(3, 32, 3, padding=1)   # Input: RGB image, Output: 32 feature maps, Kernel: 3x3
        self.conv2 = nn.Conv2d(32, 32, 3)             # Input: 32 maps, Output: 32 maps, Kernel: 3x3
        self.conv3 = nn.Conv2d(32, 64, 3, padding=1)  # Input: 32 maps, Output: 64 maps, Kernel: 3x3
        self.conv4 = nn.Conv2d(64, 64, 3)             # Input: 64 maps, Output: 64 maps, Kernel: 3x3

        # --- Pooling layer ---
        self.pool = nn.MaxPool2d(2)  # 2x2 Max Pooling reduces H and W by half

        # --- Dropout layers ---
        self.dropout25 = nn.Dropout(0.25)  # Regularization: randomly zero 25% of inputs
        self.dropout50 = nn.Dropout(0.5)   # Regularization: randomly zero 50% of inputs

        # --- Fully connected layers ---
        self.fc1 = nn.Linear(64 * 14 * 14, 512)  # Flattened conv output → 512 neurons
        self.fc2 = nn.Linear(512, num_classes)   # Output layer → number of classes

    def forward(self, x):
        # First convolutional block
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = self.pool(x)
        x = self.dropout25(x)

        # Second convolutional block
        x = F.relu(self.conv3(x))
        x = F.relu(self.conv4(x))
        x = self.pool(x)
        x = self.dropout25(x)

        # Flatten for fully connected layers
        x = x.view(x.size(0), -1)

        # Fully connected layers
        x = F.relu(self.fc1(x))
        x = self.dropout50(x)
        x = self.fc2(x)

        return x

Design Reasoning:
1. Four Convolutional Layers:
    - The first two layers extract low-level features like edges, corners, and textures.
    - The next two layers extract higher-level features, such as facial structures or hair patterns.
    - After experimentation, 4 convolutional layers provided a good balance between accuracy and training time.
2. Kernel Sizes and Padding:
    - `3x3` kernels are standard for capturing local patterns while keeping computation reasonable.
    - Padding is applied selectively to control feature map dimensions, ensuring proper flow into fully connected layers.
3. Pooling Layers:
    - `MaxPool2d(2)` halves the spatial size of the feature maps.
    - This reduces computation and helps the model learn hierarchical features.
4. Dropout Regularization:
    - 25% dropout after each conv block and 50% dropout after the first fully connected layer help reduce overfitting.
    - Dropout rates were tuned experimentally for the best validation performance.
5. Fully Connected Layers:
    - The first FC layer reduces the flattened feature maps to 512 neurons, allowing complex feature combinations.
    - The second FC layer outputs logits for each of the 42 classes.
6. ReLU Activation:
    - ReLU introduces non-linearity, which helps the network learn complex relationships between pixels.
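
The `64 * 14 * 14` input size of `fc1` follows from how each layer changes the feature map size. The sketch below traces a square input through the same sequence of layers using the standard output-size formula for convolutions and floor division for 2×2 pooling; the helper names are made up for this illustration.

```python
def conv_out(size: int, kernel: int = 3, padding: int = 0, stride: int = 1) -> int:
    """Output side length of a convolution on a square input."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size: int, kernel: int = 2) -> int:
    """Output side length of non-overlapping max pooling (rounds down)."""
    return size // kernel


def flattened_features(side: int, channels: int = 64) -> int:
    side = conv_out(side, padding=1)  # conv1 keeps the size
    side = conv_out(side)             # conv2 shrinks it by 2
    side = pool_out(side)             # first pooling halves it
    side = conv_out(side, padding=1)  # conv3 keeps the size
    side = conv_out(side)             # conv4 shrinks it by 2
    side = pool_out(side)             # second pooling halves it
    return channels * side * side


print(flattened_features(64))   # 12544, i.e. 64 * 14 * 14
print(flattened_features(128))  # 57600, i.e. 64 * 30 * 30
```

For a 64×64 input the side length goes 64 → 64 → 62 → 31 → 31 → 29 → 14. With 128×128 inputs, `fc1` would have to be resized to accept 57,600 features.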

---

Justification of the Architecture
- I tested multiple variations: 2–6 convolutional layers, different dropout rates, and different numbers of neurons in the FC layers.
- The current 4-conv + 512-FC design gave the best trade-off between training time and validation accuracy on this dataset.
- Using larger input sizes (128x128) significantly increased computation time without improving accuracy much.
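
To put the trade-off in numbers, the following sketch counts the trainable parameters of the layers defined above (weights plus biases). It is plain arithmetic rather than a call into PyTorch, and the helper names are illustrative.

```python
def conv_params(c_in: int, c_out: int, kernel: int = 3) -> int:
    """Weights plus one bias per output channel."""
    return c_in * c_out * kernel * kernel + c_out


def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus one bias per output neuron."""
    return n_in * n_out + n_out


layer_params = {
    "conv1": conv_params(3, 32),
    "conv2": conv_params(32, 32),
    "conv3": conv_params(32, 64),
    "conv4": conv_params(64, 64),
    "fc1": linear_params(64 * 14 * 14, 512),
    "fc2": linear_params(512, 42),
}
total_params = sum(layer_params.values())

print(total_params)                                 # 6510154
print(f"{layer_params['fc1'] / total_params:.1%}")  # 98.7%
```

Almost all parameters sit in `fc1`, so its size depends heavily on the input resolution and the number of pooling steps.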

## 7. Training the Model

Now that the dataset and model are ready, we train the CNN from scratch.

We ensure reproducibility, use GPU if available, and log training/validation performance for each epoch.

In [31]:
# Use GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Initialize the model and move it to the device
model = CNN4Conv(num_classes).to(device)

# Optimizer: Adam with small learning rate
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)

# Loss function: CrossEntropy for multi-class classification
criterion = nn.CrossEntropyLoss()

Using device: cpu
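
A useful reference point for the loss values: `nn.CrossEntropyLoss` takes raw logits and applies log-softmax internally, so a model that predicts all 42 classes as equally likely has a loss of ln 42. The pure-Python sketch below computes this for a single sample; the function `cross_entropy` is written here for illustration and is not PyTorch's implementation.

```python
import math


def cross_entropy(logits: list[float], target: int) -> float:
    """Negative log of the softmax probability assigned to the target class."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]


uniform_logits = [0.0] * 42
uniform_loss = cross_entropy(uniform_logits, target=0)

print(round(uniform_loss, 4))  # 3.7377
```

A training loss well below this value indicates the model is doing better than uniform guessing.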


### Training Function

The `train_one_epoch` function runs a single pass over the training data.

It puts the model in training mode and, for each batch, computes the loss, backpropagates the gradients, and updates the weights.

It returns the average loss and accuracy for the epoch.

In [32]:
def train_one_epoch(model, loader, optimizer, criterion, device):
    """Train the model for one epoch"""
    model.train()
    total_loss, correct = 0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        total_loss += loss.item() * images.size(0)
        correct += (outputs.argmax(1) == labels).sum().item()

    avg_loss = total_loss / len(loader.dataset)
    accuracy = correct / len(loader.dataset)
    return avg_loss, accuracy
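
The loss is multiplied by `images.size(0)` because `criterion` returns the mean over a batch and the last batch is usually smaller than the others. Weighting by batch size gives a true per-sample average. A small sketch with made-up numbers shows the difference:

```python
# (mean loss, batch size) pairs; the last batch is smaller, as happens when
# the dataset size is not a multiple of the batch size. Values are made up.
batches = [(1.0, 32), (1.0, 32), (4.0, 8)]

n_samples = sum(size for _, size in batches)
per_sample_mean = sum(loss * size for loss, size in batches) / n_samples
mean_of_batch_means = sum(loss for loss, _ in batches) / len(batches)

print(round(per_sample_mean, 4))      # 1.3333
print(round(mean_of_batch_means, 4))  # 2.0
```

Averaging the batch means directly would give the small final batch too much weight.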

### Validation

The `validate` function evaluates the model on the validation set in evaluation mode (dropout disabled).

It computes loss and counts correct predictions without gradient calculations, providing average validation loss and accuracy to measure generalization.

In [33]:
def validate(model, loader, criterion, device):
    """Evaluate the model on validation data"""
    model.eval()
    total_loss, correct = 0, 0
    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            loss = criterion(outputs, labels)
            total_loss += loss.item() * images.size(0)
            correct += (outputs.argmax(1) == labels).sum().item()

    avg_loss = total_loss / len(loader.dataset)
    accuracy = correct / len(loader.dataset)
    return avg_loss, accuracy

### Training Loop


In [34]:
num_epochs = 40

for epoch in range(num_epochs):
    train_loss, train_acc = train_one_epoch(model, train_loader, optimizer, criterion, device)
    val_loss, val_acc = validate(model, val_loader, criterion, device)

    print(f"Epoch {epoch + 1}/{num_epochs} — "
          f"Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.4f}, "
          f"Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.4f}")


Epoch 1/40 — Train Loss: 3.0310, Train Acc: 0.1432, Val Loss: 2.5844, Val Acc: 0.3137
Epoch 2/40 — Train Loss: 2.3877, Train Acc: 0.3533, Val Loss: 2.1058, Val Acc: 0.4406
Epoch 3/40 — Train Loss: 2.0817, Train Acc: 0.4419, Val Loss: 1.9019, Val Acc: 0.4998
Epoch 4/40 — Train Loss: 1.8790, Train Acc: 0.4873, Val Loss: 1.7577, Val Acc: 0.5312
Epoch 5/40 — Train Loss: 1.7001, Train Acc: 0.5290, Val Loss: 1.6079, Val Acc: 0.5698
Epoch 6/40 — Train Loss: 1.5548, Train Acc: 0.5685, Val Loss: 1.4820, Val Acc: 0.6072
Epoch 7/40 — Train Loss: 1.4136, Train Acc: 0.6076, Val Loss: 1.4057, Val Acc: 0.6330
Epoch 8/40 — Train Loss: 1.2966, Train Acc: 0.6356, Val Loss: 1.3367, Val Acc: 0.6485
Epoch 9/40 — Train Loss: 1.1822, Train Acc: 0.6661, Val Loss: 1.2711, Val Acc: 0.6668
Epoch 10/40 — Train Loss: 1.0749, Train Acc: 0.6943, Val Loss: 1.2235, Val Acc: 0.6708
Epoch 11/40 — Train Loss: 0.9953, Train Acc: 0.7139, Val Loss: 1.1715, Val Acc: 0.6954
Epoch 12/40 — Train Loss: 0.9145, Train Acc: 0.7341,

Training all 40 epochs took me about 1 hour on CPU.

**Note:** Training can take some time, especially on CPU.
Using a GPU will speed up each epoch significantly.

Across 40 epochs, training accuracy rose from about 14% to over 95%, while validation accuracy improved from around 31% to roughly 76%. The gap between the two indicates some overfitting, but the falling loss and rising validation accuracy show that the model learned meaningful features for classifying the images.
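
For context, it helps to compare the validation accuracy against two trivial baselines: guessing uniformly at random, and always predicting the most frequent class. The sketch below assumes `homer_simpson` (1,797 images) is the largest class, which is the highest count visible in the listing in Section 3.

```python
n_classes = 42
n_total = 16764
largest_class_count = 1797  # homer_simpson, assumed to be the largest class

chance_accuracy = 1 / n_classes
majority_accuracy = largest_class_count / n_total

print(f"{chance_accuracy:.1%}")    # 2.4%
print(f"{majority_accuracy:.1%}")  # 10.7%
```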

### Saving the Model
We save only the model's weights (its `state_dict`). They can be loaded later for inference without retraining, so the results can be reproduced.

In [35]:
torch.save(model.state_dict(), "simpsons_cnn4conv.pth")
print("Model weights saved to simpsons_cnn4conv.pth")

Model weights saved to simpsons_cnn4conv.pth
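
Loading the weights back is the mirror image of saving them: create a model with the same architecture, then call `load_state_dict`. The sketch below demonstrates the round trip on a tiny stand-in model and a temporary file so that it runs on its own; `build_model` is a hypothetical placeholder, and in practice you would instantiate `CNN4Conv(num_classes)` and load `simpsons_cnn4conv.pth` instead.

```python
import os
import tempfile

import torch
import torch.nn as nn


def build_model() -> nn.Module:
    """Hypothetical stand-in for CNN4Conv(num_classes)."""
    return nn.Linear(4, 2)


trained = build_model()
restored = build_model()  # fresh instance with its own random weights

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "weights.pth")
    torch.save(trained.state_dict(), path)
    state = torch.load(path, map_location="cpu")

restored.load_state_dict(state)
restored.eval()  # switch off dropout before running inference

weights_match = torch.equal(trained.weight, restored.weight)
print(weights_match)  # True
```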


For the continuation, see `inference.ipynb`.
