# Model Training in PyTorch

In this lab, you will explore the core concepts and steps involved in ***building a convolutional neural network (CNN) using PyTorch***, a leading deep learning framework. CNNs are particularly well-suited for image classification tasks due to their ability to automatically learn spatial hierarchies in images.

**Throughout the lab, you will:**

* Define a simple CNN architecture using `torch.nn.Module`.
* Work with image data by loading and transforming it for training.
* Implement the forward pass with convolutional, pooling, and fully connected layers.
* Optimize the model using an optimizer like SGD or Adam.
* Evaluate the model's performance with metrics such as loss and accuracy.

By the end of the lab, you will have a foundational understanding of CNNs in PyTorch and how to train them for image classification tasks. This hands-on experience will serve as a stepping stone for building more advanced, custom CNN architectures.

As in previous labs, `XXXX` marks a place where you have to fill in the correct code. If you are following along outside our course at the University of Rhode Island, you can find the answers in the `12-model-training-with-pytorch-ANSWERKEY.ipynb` file in the repository.

Let’s begin by importing the necessary libraries for your first CNN.

## 1. Set up

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

# for resnet
import torchvision.models as models

## 2. Parameters for Convolutional Layers

There are a few parameters that we have to specify for each convolutional layer that we add. Below is a description of what they are and how to select appropriate values.

#### **`in_channels`: The Number of Input Feature Maps**

* `in_channels` refers to the number of input channels (feature maps) being fed into a convolutional layer.

* It must match the number of output channels from the previous layer (except for the first layer, which depends on the input data).

  **First Convolutional Layer**:

  * If the input is an RGB image (CIFAR-10, ImageNet, etc.), it has 3 channels (R, G, B), so `in_channels`=3.

  * If the input is a grayscale image (MNIST, medical imaging, etc.), it has 1 channel, so `in_channels`=1.

  **Subsequent Layers**:

  * The `in_channels` for each layer is equal to the out_channels of the previous convolutional layer.

#### **`out_channels`: The Number of Output Feature Maps**

* `out_channels` determines how many feature maps (filters) the convolutional layer will output.

* Each filter in a CNN learns to detect different features (edges, textures, shapes, etc.), so increasing `out_channels` allows the model to learn more complex patterns.

  **Typical Design Choices:**

  * Start with a small number of filters (e.g., `out_channels`=16 or `out_channels`=32) to extract low-level patterns.

  * Gradually increase `out_channels` (e.g., 32 → 64 → 128 → 256) as the network goes deeper, capturing more abstract features.
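To make the channel bookkeeping concrete, here is a minimal sketch that chains two convolutional layers and inspects the resulting shapes. The layer names `conv_a` and `conv_b`, the batch size of 4, and the random 32x32 RGB input are illustrative assumptions, not part of the model built later in this lab.

```python
import torch
import torch.nn as nn

# Hypothetical layers for illustration: an RGB input (3 channels) feeds 16 filters,
# and those 16 feature maps feed the next layer's 32 filters.
conv_a = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)
conv_b = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3)  # in_channels == conv_a's out_channels

x = torch.randn(4, 3, 32, 32)  # (batch, channels, height, width)
h = conv_a(x)
y = conv_b(h)

h_shape = tuple(h.shape)
y_shape = tuple(y.shape)
weight_shape = tuple(conv_b.weight.shape)  # (out_channels, in_channels, kernel_h, kernel_w)

print(h_shape)       # (4, 16, 30, 30)
print(y_shape)       # (4, 32, 28, 28)
print(weight_shape)  # (32, 16, 3, 3)
```

If `conv_b` were declared with an `in_channels` other than 16, the second call would raise a shape error, which is why the two values must match.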

#### **`kernel_size`: The x,y dimensions of the filters**
  
  * Start with a smaller filter (e.g., `kernel_size`=3) and work your way to larger filters if needed.

#### **`padding`: The additional pixels added around your image**
  
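  * Defaults to `0` (no padding), in which case each convolution shrinks the feature map slightly.
  * With `kernel_size`=3, setting `padding`=1 keeps the output the same height and width as the input; the model in this lab uses this setting.
  * Add padding if the information at the edges of your images is highly important to your task.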
  
#### **`stride`: The number of pixels we shift our kernel**

* `stride` refers to the size of the shift of the kernel across the image at each step.
* Practitioners often opt for a `stride` of `1` to capture the largest amount of detail. This is the default value, so we can leave it out for now.
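The combined effect of `kernel_size`, `padding`, and `stride` on the output size can be sketched with a small helper. The function name `conv_output_size` is hypothetical, and the formula assumes a dilation of 1 (the PyTorch default).

```python
def conv_output_size(size, kernel_size, padding=0, stride=1):
    """Spatial output size of a convolution or pooling layer, assuming dilation of 1."""
    return (size + 2 * padding - kernel_size) // stride + 1

# A 28-pixel-wide input under a few settings:
print(conv_output_size(28, kernel_size=3))                       # 26: no padding shrinks the map
print(conv_output_size(28, kernel_size=3, padding=1))            # 28: padding=1 preserves the size
print(conv_output_size(28, kernel_size=3, padding=1, stride=2))  # 14: stride=2 halves it
```

The same formula applies to each spatial dimension independently, so a 28x28 input with `kernel_size`=3 and `padding`=1 stays 28x28.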

## 3. Parameters for Pooling Layers

There are a few parameters that we have to specify for each ***pooling*** layer that we add. Below is a description of what they are and how to select appropriate values.

#### **`kernel_size`: The size of the pooling kernel**

* `kernel_size` refers to the size of the pooling kernel that you are applying. A common value to select in practice is `2`. Combined with a `stride` of `2`, this reduces the height and width of the feature map by a factor of 2.

#### **`stride`: The number of pixels we shift our kernel**

* `stride` refers to the size of the shift of the pooling kernel across the image at each step.
* Practitioners often opt for no overlap with their pooling kernel. Hence, if you select `2` for your pooling `kernel_size`, then you should select `2` for the `stride`.
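As a small illustration of non-overlapping max pooling, the sketch below applies a 2x2 pool with a stride of 2 to a tiny 4x4 input and then to a feature map of the size used in this lab. The tensor names and values are assumptions chosen only to make the result easy to check by hand.

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)

# A 4x4 "image" holding the values 0..15, shaped (batch, channels, height, width)
x = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)
pooled = pool(x)
print(pooled.squeeze().tolist())  # [[5.0, 7.0], [13.0, 15.0]]: the max of each 2x2 block

# The same layer halves the height and width of a larger feature map
feature_maps = torch.randn(1, 16, 28, 28)
pooled_shape = tuple(pool(feature_maps).shape)
print(pooled_shape)  # (1, 16, 14, 14)
```

Pooling leaves the number of channels unchanged; only the spatial dimensions shrink.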


## 4. Parameters for Fully Connected Layers

There are a few parameters that we have to specify for our ***fully connected*** layer. We must specify two numbers in `nn.Linear(#, #)`.

* The first number is the number of input features (i.e., the flattened output size of the previous layer).
* The second number is the number of output features; for the final layer, this is the number of classes in your dataset.
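One way to find the first number without doing the arithmetic by hand is to push a dummy input through the convolutional part of the network and read off the flattened size. The sketch below assumes 28x28 grayscale inputs and two conv-plus-pool stages like the ones used in this lab; the names `features` and `classifier` are illustrative.

```python
import torch
import torch.nn as nn

# Assumed feature extractor: two conv + ReLU + pool stages
features = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
)

with torch.no_grad():
    dummy = torch.zeros(1, 1, 28, 28)         # one fake 28x28 grayscale image
    flat = torch.flatten(features(dummy), 1)  # flatten everything except the batch dimension

n_features = flat.shape[1]
print(n_features)  # 1568, i.e. 32 channels * 7 * 7

# The fully connected layer maps the flattened features to 10 class scores
classifier = nn.Linear(n_features, 10)
logits_shape = tuple(classifier(flat).shape)
print(logits_shape)  # (1, 10)
```

Each pooling stage halves the 28x28 map (28 to 14 to 7), and the last conv layer has 32 output channels, which gives 32 * 7 * 7.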

## 5. Let's build your first custom CNN!

In [None]:
# Define the CNN model
class CustomCNN(nn.Module):
    def __init__(self):
        super(CustomCNN, self).__init__()
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.fc1 = nn.Linear(32 * 7 * 7, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        return x
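Before training, it can help to sanity-check the model on a dummy batch and count its parameters. The sketch below repeats the same architecture so that it runs on its own; the batch size of 8 and the variable names are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Same architecture as the lab's CustomCNN, repeated so this snippet is self-contained
class CustomCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.fc1 = nn.Linear(32 * 7 * 7, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))  # 28x28 -> 14x14
        x = self.pool(F.relu(self.conv2(x)))  # 14x14 -> 7x7
        x = torch.flatten(x, 1)
        return self.fc1(x)

check_model = CustomCNN()
with torch.no_grad():
    out_shape = tuple(check_model(torch.randn(8, 1, 28, 28)).shape)
n_params = sum(p.numel() for p in check_model.parameters())

print(out_shape)  # (8, 10): one score per class for each of the 8 images
print(n_params)   # 20490 = 160 (conv1) + 4640 (conv2) + 15690 (fc1)
```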

## 6. Prepare the Data

For this lab, we will be using the MNIST dataset. This is a popular dataset containing 28x28 grayscale images of handwritten digits (0–9).

Before we can train a neural network on images, we usually need to preprocess them into a form the model can use. That is what the `transform` section does: it tells PyTorch how to process each image as it is loaded from the dataset. Think of `Compose` as a pipeline, or a to-do list, of steps applied to each image before it is given to the model. This particular set of transforms does two things: (1) converts the image data to tensors and (2) normalizes (i.e., centers and scales) the pixel values. Here, 0.1307 is the mean and 0.3081 is the standard deviation of the pixel values in the MNIST dataset.
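To see what the normalization step does numerically, the sketch below applies the same arithmetic, `(x - mean) / std`, to a few example pixel values. It uses plain tensors rather than `transforms.Normalize` so it runs without any image data; the example pixel values are assumptions.

```python
import torch

mean, std = 0.1307, 0.3081

# ToTensor scales 8-bit pixel values from [0, 255] to [0.0, 1.0];
# here we take three such values: black, mid-gray, and white
pixels = torch.tensor([0.0, 0.5, 1.0])

# Normalize subtracts the mean and divides by the standard deviation
normalized = (pixels - mean) / std
print(normalized)  # roughly -0.42, 1.20, 2.82
```

After normalization, the dark background pixels that dominate MNIST sit slightly below zero and the bright strokes become positive.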

Check out the documentation for [`DataLoader` here](https://docs.pytorch.org/docs/stable/data.html) to help you create your `trainloader` and `testloader`.
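As a rough illustration of how a `DataLoader` batches a dataset, the sketch below wraps random tensors in a `TensorDataset` instead of downloading MNIST. The names `toy_dataset` and `toy_loader` and the dataset size of 100 are assumptions made for the example.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# A stand-in dataset: 100 fake 28x28 grayscale images with random labels 0..9
images = torch.randn(100, 1, 28, 28)
labels = torch.randint(0, 10, (100,))
toy_dataset = TensorDataset(images, labels)

# shuffle=True reorders the samples every epoch, which is what we want for training
toy_loader = DataLoader(toy_dataset, batch_size=32, shuffle=True)

batch_sizes = [batch_images.shape[0] for batch_images, _ in toy_loader]
print(len(toy_loader))  # 4 batches per epoch
print(batch_sizes)      # [32, 32, 32, 4]: the last batch holds the remainder

first_images, first_labels = next(iter(toy_loader))
print(tuple(first_images.shape), tuple(first_labels.shape))  # (32, 1, 28, 28) (32,)
```

For the test set, `shuffle=False` is the usual choice because the order of samples does not affect evaluation.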

In [None]:
# Load MNIST dataset
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

trainset = torchvision.datasets.MNIST(root='./data', train=True, download=True, transform=transform)
trainloader = DataLoader(trainset, batch_size=32, shuffle=True)

testset = torchvision.datasets.MNIST(root='./data', train=False, download=True, transform=transform)
testloader = DataLoader(testset, batch_size=32, shuffle=False)



## 7. Training the CNN

First, we must specify a few things about our training process:

* what model we will use,
* what loss function we want, and
* what optimizer we want (where we also specify our learning rate).

Check out the documentation for different [loss functions](https://docs.pytorch.org/docs/stable/nn.html#loss-functions) and for different [optimizers](https://docs.pytorch.org/docs/stable/optim.html) to decide what to use.
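If you choose `nn.CrossEntropyLoss` for classification, note that it expects raw, unnormalized scores (logits) and integer class labels, and applies the softmax internally. The sketch below checks this against a manual computation; the logits and targets are made-up values for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

criterion = nn.CrossEntropyLoss()

# Two samples, three classes: raw scores straight from a model's last layer
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])
targets = torch.tensor([0, 2])  # correct class index for each sample

loss = criterion(logits, targets)

# Equivalent manual computation: mean negative log-probability of the correct class
log_probs = F.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(2), targets].mean()

print(torch.isclose(loss, manual))  # tensor(True)
```

This is why the model's `forward` method returns the output of the last linear layer directly, without a softmax.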

In [None]:
# specify that we want gpu if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# send model to device
model = CustomCNN().to(device)

# specify loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

Now, we can create a loop to run through the batches of data and update the weights for each batch.

We will also accumulate the training loss over the batches and report its average at the end of each epoch.
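The three calls at the heart of each update, `zero_grad()`, `backward()`, and `step()`, can be illustrated on a one-parameter problem. The sketch below minimizes `(w - 3)^2` with plain SGD; the parameter `w`, the learning rate of 0.1, and the choice of SGD are assumptions made to keep the arithmetic easy to follow.

```python
import torch
import torch.optim as optim

w = torch.tensor(0.0, requires_grad=True)
optimizer = optim.SGD([w], lr=0.1)

# backward() computes d(loss)/dw = 2 * (w - 3) = -6 at w = 0
loss = (w - 3.0) ** 2
loss.backward()
grad_once = w.grad.item()

# Calling backward() again WITHOUT zero_grad() adds to the stored gradient
loss = (w - 3.0) ** 2
loss.backward()
grad_accumulated = w.grad.item()

# The usual pattern: clear old gradients, recompute, then update the weight
optimizer.zero_grad()
loss = (w - 3.0) ** 2
loss.backward()
optimizer.step()  # w <- w - lr * grad = 0 - 0.1 * (-6)
w_after = w.item()

print(grad_once)         # -6.0
print(grad_accumulated)  # -12.0
print(round(w_after, 4)) # 0.6
```

Because gradients accumulate by default, forgetting `optimizer.zero_grad()` inside the batch loop would mix gradients from earlier batches into each update.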

In [None]:
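# Training loop
epochs = 5
for epoch in range(epochs):
    model.train()  # make sure the model is in training mode
    running_loss = 0.0
    for inputs, labels in trainloader:
        # move the batch to the same device as the model
        inputs, labels = inputs.to(device), labels.to(device)

        optimizer.zero_grad()              # clear gradients from the previous batch
        outputs = model(inputs)            # forward pass
        loss = criterion(outputs, labels)  # compare predictions to labels
        loss.backward()                    # backpropagate to compute gradients
        optimizer.step()                   # update the weights

        running_loss += loss.item()

    # report the average loss per batch for this epoch
    print(f"Epoch {epoch+1}, Loss: {running_loss/len(trainloader):.4f}")

print("Training complete!")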

Epoch 1, Loss: 0.1379
Epoch 2, Loss: 0.0837
Epoch 3, Loss: 0.0791
Epoch 4, Loss: 0.0803
Epoch 5, Loss: 0.0722
Training complete!


**Congratulations**! You've successfully trained your first custom CNN! Try adjusting the training hyperparameters (such as the learning rate, batch size, and number of epochs) to see if you can improve your results. Be sure to look at all of the training metrics we've learned in class!
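The training loss alone does not tell you how well the model generalizes, so it is worth measuring test accuracy as well. Below is a minimal sketch of an evaluation helper; the function name `evaluate` and the `AlwaysZero` stand-in model are hypothetical and exist only so the example runs without MNIST.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def evaluate(model, loader, device):
    """Return classification accuracy (in percent) of `model` over `loader`."""
    model.eval()  # switch to evaluation mode
    correct, total = 0, 0
    with torch.no_grad():  # gradients are not needed for evaluation
        for inputs, labels in loader:
            inputs, labels = inputs.to(device), labels.to(device)
            predicted = model(inputs).argmax(dim=1)  # class with the highest score
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    return 100 * correct / total

# Stand-in model that always predicts class 0, used only to exercise the helper
class AlwaysZero(nn.Module):
    def forward(self, x):
        logits = torch.zeros(x.shape[0], 10, device=x.device)
        logits[:, 0] = 1.0
        return logits

toy_labels = torch.tensor([0, 0, 1, 0])
toy_loader = DataLoader(TensorDataset(torch.randn(4, 1, 28, 28), toy_labels), batch_size=2)
toy_accuracy = evaluate(AlwaysZero(), toy_loader, torch.device("cpu"))
print(toy_accuracy)  # 75.0: three of the four labels are 0
```

In the lab itself, calling `evaluate(model, testloader, device)` at this point would report the custom CNN's test accuracy; do this before `model` is reassigned in Section 8.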

## 8. Transfer Learning with ResNet18

Now that you’ve built and trained your own CNN from scratch, we can move on to another powerful approach called **transfer learning**.

Transfer learning lets us take a model that has already been trained on a large dataset and adapt it for a new task. This is especially useful when:
- You have limited data for your new task
- You want to save time and computational resources
- You want to take advantage of the model’s ability to extract useful features that it already learned

In this section, we will take a pretrained **ResNet18** model (originally trained on ImageNet) and fine-tune it to classify MNIST digits.


## 8.1 Modifying a Pretrained Model

The pretrained ResNet18 model expects color images (3 channels) of size 224x224 and produces outputs for 1000 ImageNet classes.

We will modify:
1. The first convolutional layer to accept grayscale input (1 channel).
2. The final fully connected layer to output 10 classes (digits 0–9).

In [None]:
class ResNet18(nn.Module):
    def __init__(self, num_classes=10):
        super(ResNet18, self).__init__()
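        # Load ImageNet-pretrained weights (the `weights` argument replaces the
        # deprecated `pretrained=True` in torchvision 0.13 and later)
        self.resnet18 = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

        # Adjust first conv layer for grayscale (1 input channel).
        # Note: this new layer is randomly initialized, so the pretrained conv1 weights are discarded.
        self.resnet18.conv1 = nn.Conv2d(
            1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False
        )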

        # Adjust final layer for 10 MNIST classes
        num_ftrs = self.resnet18.fc.in_features
        self.resnet18.fc = nn.Linear(num_ftrs, num_classes)

    def forward(self, x):
        return self.resnet18(x)

## 8.2 Preparing the Data

ResNet18 was trained on ImageNet images of size 224x224, so we will resize MNIST images to match that shape.

We also apply normalization using the standard MNIST mean and standard deviation.

In [None]:
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

trainset = torchvision.datasets.MNIST(root='./data', train=True, download=True, transform=transform)
testset = torchvision.datasets.MNIST(root='./data', train=False, download=True, transform=transform)

trainloader = DataLoader(trainset, batch_size=32, shuffle=True)
testloader = DataLoader(testset, batch_size=32, shuffle=False)

## 8.3 Model, Loss, and Optimizer Setup

We’ll use a smaller learning rate here because we are starting from pretrained weights rather than random initialization.


In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

model = ResNet18().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)


Using device: cuda


## 8.4 Training the Pretrained Model

This training loop is very similar to what you did before.

The key difference is that now we are fine-tuning an existing model rather than training from scratch.


In [None]:
epochs = 2
for epoch in range(epochs):
    model.train()
    running_loss = 0.0

    for inputs, labels in trainloader:
        inputs, labels = inputs.to(device), labels.to(device)

        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

    print(f"Epoch [{epoch+1}/{epochs}], Loss: {running_loss/len(trainloader):.4f}")

print("Training complete!")


Epoch [1/2], Loss: 0.0783
Epoch [2/2], Loss: 0.0420
Training complete!

## 8.5 Evaluating the Fine-Tuned Model

Now let's check how the fine-tuned model performs on the test set.


In [None]:
model.eval()
correct = 0
total = 0

with torch.no_grad():
    for inputs, labels in testloader:
        inputs, labels = inputs.to(device), labels.to(device)
        outputs = model(inputs)
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

accuracy = 100 * correct / total
print(f"Test Accuracy (whole model): {accuracy:.2f}%")

Test Accuracy (whole model): 99.13%


## 8.6 Comparing Models

At this point, you have trained two models on the same dataset:

1. Your **custom CNN**, which you built and trained from scratch.
2. The **pretrained ResNet18**, which you fine-tuned using transfer learning.

Try comparing:
- The training time for each model
- The test accuracy
- How quickly each model starts to perform well

You should notice that the pretrained model reaches good accuracy in fewer epochs, even though it has many more parameters (and each epoch will typically take longer to run). This is because it already knows how to extract useful image features from its earlier training on ImageNet.

This is the core idea of transfer learning: we reuse knowledge from one large task to improve performance on a smaller one.


## 8.7 Optional Experiment: Freezing Early Layers

When we fine-tune a pretrained model, we can choose **how much of the network to update**.

Sometimes we only want to train the final layers and keep the earlier layers fixed. This can make training faster and help prevent overfitting when the dataset is small.

Here, we will try freezing the feature extraction layers of ResNet18 and retraining only the classifier.


In [None]:
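# Freeze every layer of the modified ResNet18. Note that this includes the
# replaced first conv layer, which keeps its random initial weights.
model_frozen = ResNet18().to(device)
for param in model_frozen.resnet18.parameters():
    param.requires_grad = False

# Replace the final layer; a newly created layer has requires_grad=True, so it stays trainable
num_ftrs = model_frozen.resnet18.fc.in_features
model_frozen.resnet18.fc = nn.Linear(num_ftrs, 10).to(device)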

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model_frozen.resnet18.fc.parameters(), lr=0.001)  # only train last layer


Now we train again, but this time only the final layer’s weights are updated.

In [None]:
epochs = 2
for epoch in range(epochs):
    model_frozen.train()
    running_loss = 0.0

    for inputs, labels in trainloader:
        inputs, labels = inputs.to(device), labels.to(device)

        optimizer.zero_grad()
        outputs = model_frozen(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

    print(f"Epoch [{epoch+1}/{epochs}], Loss: {running_loss/len(trainloader):.4f}")

print("Training complete (frozen feature extractor)!")

Epoch [1/2], Loss: 0.3474
Epoch [2/2], Loss: 0.1666
Training complete (frozen feature extractor)!


## 8.8 Evaluating the Frozen Model

Let’s check how the frozen model performs compared to the fully fine-tuned one.

In [None]:
model_frozen.eval()
correct = 0
total = 0

with torch.no_grad():
    for inputs, labels in testloader:
        inputs, labels = inputs.to(device), labels.to(device)
        outputs = model_frozen(inputs)
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

accuracy_frozen = 100 * correct / total
print(f"Test Accuracy (frozen layers): {accuracy_frozen:.2f}%")

Test Accuracy (frozen layers): 95.62%


## 8.9 Discussion

Compare the results:
- How fast did each model train?
- How did accuracy change when we froze the early layers?
- Which approach seems more efficient for MNIST?

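You’ll likely see that even without updating most of the weights, the frozen model still performs reasonably well, though below the fully fine-tuned model (95.62% versus 99.13% test accuracy in the runs above). This is because the pretrained layers of ResNet18 have already learned general-purpose features that transfer well to many visual tasks.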

This experiment highlights the flexibility of transfer learning and why it is widely used in computer vision projects today.
