# Convolutional Neural Networks (CNNs) with PyTorch
Welcome! In this notebook you will learn how convolutional neural networks work and how to implement them in PyTorch. Throughout the notebook, theory cells explain the core concepts, code cells demonstrate them, and **exercises** help you practice.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/hyperskill-content/hyperskill-ml-notebooks/blob/main/DL_internals/pytorch_tensors.ipynb)

## 🚀 Prerequisites <a id="prerequisites"></a>

Make sure you're comfortable with the topics below before starting this notebook:

| # | Topic (clickable links)|
|---|-------|
| 1 | **[Convolutions](https://hyperskill.org/learn/step/39250)**  |
| 2 | **[Pooling](https://hyperskill.org/learn/step/39100)** |
| 3 | **[Padding](https://hyperskill.org/learn/step/39261)** | 
| 4 | **[Batch normalization](https://hyperskill.org/learn/step/41705)** | 
| 5 | **[Default train-validation loop](https://hyperskill.org/learn/step/42852)** | 
| 6 | **[Adam optimizer](https://hyperskill.org/learn/step/35555)** |
| 7 | **[Activation functions](https://hyperskill.org/learn/step/35741)** |
| 8 | **[Visualization of convolutions](https://animatedai.github.io)** | 

*If you're new to any item above, review it quickly, then dive back in here.*

## 📑 Table of Contents  
  
1. [What’s the big idea?](#sec-idea)  
2. [Convolution Layer in Plain Words](#sec-conv)  
3. [Receptive Field — How far can a neuron “see”?](#sec-rf)  
4. [Activation Function](#4-activation-function)  
5. [Pooling](#sec-pool)  
6. [Putting Blocks Together — A Mini CNN](#6-putting-blocks-together--a-mini-cnn)  
7. [Pre-trained Models & Architecture Zoo](#sec-pretrained)  
8. [Training Loop — Putting the CNN to Work](#sec-train-loop)  
9. [1 × 1 Convolutions — Tiny Filters, Big Impact](#sec-1x1)  
10. [Practice Section](#sec-exercises)

## 1. What's the big idea? <a id="sec-idea"></a>

Imagine you're trying to recognize your friend's face in a photo. A regular neural network (called a **Fully Connected Neural Network**, or **FCNN**) would look at EVERY single pixel in the entire image at once - like trying to memorize a phone book by reading all the numbers simultaneously!

That's pretty crazy, right? Your brain doesn't work this way. When you look at a face, you first notice eyes, then nose, then mouth - **piece by piece**.

**CNNs are smarter than FCNNs at image understanding**:

| **FCNN** (Old way) | **CNN** (Smart way) |
|-------------------|-------------------|
| Looks at ALL pixels at once  | Looks at small patches (like 3×3)  |
| Each neuron connects to EVERY pixel | Same filter slides over the whole image |
| Millions of connections = slow  | Fewer connections = fast & efficient  |
| Doesn't understand "spatial relationships" | Knows that nearby pixels are related |

When we feed an image into an FCNN, we need to flatten the pixels into a 1D array, which discards the spatial layout of the image. A **Convolutional Neural Network (CNN)** is basically a smart way to look at photos (and even audio) **piece by piece** instead of all at once.

* **Local view** – filters only look at a small patch (e.g. 3×3)
* **Weight sharing** – the same filter slides over the whole image  
* **Stacked layers** – early layers find edges (low-level features), later layers find more complex patterns (high-level features)
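To make the weight-sharing point concrete, here is a small sketch that counts the learnable parameters of a fully connected layer and of a convolutional layer applied to the same 32×32 RGB image. The sizes (64 output units, 64 filters of 3×3) and the helper `count_params` are illustrative assumptions, not values from a particular architecture.

```python
import torch.nn as nn

def count_params(module):
    # Total number of learnable values (weights + biases) in a module
    return sum(p.numel() for p in module.parameters())

# Fully connected: every output unit connects to every one of the 3*32*32 inputs
fc = nn.Linear(3 * 32 * 32, 64)

# Convolution: 64 filters of size 3x3 over 3 channels, reused at every position
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)

fc_params = count_params(fc)      # 3072 * 64 weights + 64 biases
conv_params = count_params(conv)  # 3 * 64 * 3 * 3 weights + 64 biases

print("fully connected:", fc_params)
print("convolution    :", conv_params)
```

The convolution's parameter count does not depend on the image size at all, which is the practical payoff of sliding one small filter over the whole image.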

## 2. Convolution Layer in Plain Words <a id="sec-conv"></a>

Think of a filter as a sliding window:

* You place it on a 3×3 patch of pixels.  
* Multiply numbers element-wise, add them up → **one output pixel**.  
* Slide right by **stride `S`** steps, repeat.

**Padding** `P` adds a border of zeros so the sliding window can fully cover edge pixels without going out of bounds.
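The sliding-window steps above can be sketched by hand and checked against PyTorch. This is a minimal, unoptimized illustration for a single-channel image without padding; the function name `conv2d_naive` and the toy image and kernel are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def conv2d_naive(image, kernel, stride=1):
    # image: (H, W), kernel: (kh, kw); no padding, single channel
    H, W = image.shape
    kh, kw = kernel.shape
    out_h = (H - kh) // stride + 1
    out_w = (W - kw) // stride + 1
    out = torch.zeros(out_h, out_w)
    for i in range(out_h):
        for j in range(out_w):
            # Take the patch under the window, multiply element-wise, sum up
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = (patch * kernel).sum()
    return out

image = torch.arange(25, dtype=torch.float32).view(5, 5)
kernel = torch.tensor([[1., 0., -1.]] * 3)   # simple horizontal-difference filter

naive = conv2d_naive(image, kernel)
# F.conv2d expects (batch, channels, H, W) and (out_ch, in_ch, kh, kw)
reference = F.conv2d(image[None, None], kernel[None, None])[0, 0]

print(naive)
print("matches F.conv2d:", torch.allclose(naive, reference))
```

In practice you would always use `nn.Conv2d` or `F.conv2d`; the loop version is only there to show what one output pixel is made of.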

Formula for output size (1-D for simplicity; the division is rounded down when the stride doesn't fit evenly):

$$
\text{output\_length} = \left\lfloor \frac{N + 2P - F}{S} \right\rfloor + 1
$$

Where  

* `N` = input size (e.g., 32)  
* `F` = filter size (e.g., 3)  
* `S` = stride  
* `P` = padding
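As a quick sanity check, the formula can be written as a tiny helper. The function name `conv_output_length` is made up for this sketch; it uses integer (floor) division, which is how PyTorch handles a stride that doesn't divide evenly.

```python
def conv_output_length(n, f, s=1, p=0):
    # n: input size, f: filter size, s: stride, p: padding (per side)
    return (n + 2 * p - f) // s + 1

same_size = conv_output_length(32, 3, s=1, p=1)   # 3x3 filter, padding 1
strided   = conv_output_length(32, 3, s=2, p=1)   # same filter, stride 2
no_pad    = conv_output_length(32, 5)             # 5x5 filter, no padding
pooled    = conv_output_length(32, 2, s=2)        # also works for 2x2 pooling

print(same_size, strided, no_pad, pooled)
```

Applying the helper once per spatial dimension gives the output height and width of a 2-D layer.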

In [None]:
# 🔎 Code peek — one conv layer
import torch, torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=8,
                 kernel_size=3, stride=1, padding=1)

x = torch.randn(1, 3, 32, 32)  # fake RGB image
out = conv(x)
print("input :", x.shape)
print("output:", out.shape)   # (1, 8, 32, 32) — same H/W thanks to P=1 (same padding)

## 3. Receptive Field — How far can a neuron "see"? <a id="sec-rf"></a>

Add more conv layers → each output pixel depends on **a larger area**
of the original image.

* Layer 1 (3×3 filter) → sees 3×3 pixels  
* Layer 2 (another 3×3 filter, stride 1) → sees 5×5, since each extra 3×3 layer widens the view by 2 pixels…  
* Keep stacking and the receptive field grows like gossip in a small town.

> **Why stack two 3×3 filters instead of one 5×5?** (this is a pretty common question in ML interviews)

* **Fewer weights** – a single 5×5 filter has **25 parameters** per input/output channel pair, while two
  consecutive 3×3 filters use **9 + 9 = 18**.  
  That's 28 % fewer weights and multiplications.

* **More non-linearity** – you get **two ReLU activations** instead of one,
  letting the network learn richer functions for (almost) the same receptive
  field.

* **Same receptive field** – two 3×3 layers (no padding tricks) end up "seeing"
  a 5×5 area of the input, so you don't lose context.

It's like using two small magnifying glasses 🔍🔍: you cover the same area as a
big lens but carry less weight and get an extra layer of detail.

![Expanding receptive field through stacked convolutional layers](https://www.jeremyjordan.me/content/images/2018/04/Screen-Shot-2018-04-17-at-5.32.45-PM.png)
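The growth of the receptive field can be computed with a short recurrence: each layer adds `(kernel_size - 1)` times the accumulated stride of the layers before it. The helper below is a sketch under that standard rule; the name `receptive_field` and the `(kernel_size, stride)` tuple format are choices made for this example.

```python
def receptive_field(layers):
    # layers: list of (kernel_size, stride) tuples, first layer first
    rf, jump = 1, 1          # jump = distance in input pixels between neighbouring outputs
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

rf_one_5x5   = receptive_field([(5, 1)])
rf_two_3x3   = receptive_field([(3, 1), (3, 1)])
rf_three_3x3 = receptive_field([(3, 1)] * 3)

# Weights per input/output channel pair, ignoring biases
weights_one_5x5 = 5 * 5
weights_two_3x3 = 2 * 3 * 3
saving = 1 - weights_two_3x3 / weights_one_5x5

print(rf_one_5x5, rf_two_3x3, rf_three_3x3)
print(f"weight saving: {saving:.0%}")
```

Two stacked 3×3 layers reach the same 5×5 view as one 5×5 layer, and a third 3×3 layer extends it to 7×7.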

## 4. Activation Function <a id="4-activation-function"></a>

After the convolution, we pass the numbers through an activation function. The most common one is **ReLU**:

$$\operatorname{ReLU}(x) = \max(0,\; x)$$

Why? It keeps positive signals and kills the boring negative ones → helps the
network learn non-linear stuff.

In [None]:
relu = nn.ReLU()
print(relu(torch.tensor([-2., 0., 3.])))

**❓ What is a CNN (or any neural network) without activations?**

A CNN (or any neural network) without activation functions is just a stack of linear operations—essentially, a single big linear transformation. Without non-linear activations like ReLU, the network cannot learn or represent complex, non-linear patterns. It would be no more powerful than a single linear layer, no matter how many layers you stack. Non-linear activations are what give neural networks their expressive power!
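The collapse into a single linear transformation can be verified numerically. The sketch below uses two small `nn.Linear` layers for brevity (the layer sizes are arbitrary assumptions); the same argument applies to convolutions, which are linear operations as well.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lin1 = nn.Linear(4, 8)
lin2 = nn.Linear(8, 3)
x = torch.randn(5, 4)

with torch.no_grad():
    # Two layers stacked with NO activation in between
    stacked = lin2(lin1(x))

    # The equivalent single layer: W = W2 @ W1, b = W2 @ b1 + b2
    W = lin2.weight @ lin1.weight
    b = lin2.weight @ lin1.bias + lin2.bias
    merged = x @ W.T + b

    # With a ReLU in between, the merged layer no longer reproduces the output
    with_relu = lin2(torch.relu(lin1(x)))

print("no activation, same as one layer:", torch.allclose(stacked, merged, atol=1e-5))
print("with ReLU, same as one layer    :", torch.allclose(with_relu, merged, atol=1e-5))
```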

## 5. Pooling <a id="sec-pool"></a>

**Max-pooling** keeps the *strongest* signal in a window (e.g., 2×2).
This:

* Shrinks the feature map (less memory)  
* Adds a bit of translation tolerance (small shifts, same max)

Common choice: `kernel_size=2`, `stride=2` halves H and W.

![Illustration of 2×2 max pooling](https://production-media.paperswithcode.com/methods/MaxpoolSample2.png)
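The translation tolerance mentioned above can be illustrated with a toy feature map that contains a single active pixel. This is only a sketch: the tolerance holds for shifts that stay inside one pooling window, and the tensors `a`, `b`, `c` are made-up examples.

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)

a = torch.zeros(1, 1, 4, 4); a[0, 0, 0, 0] = 1.0   # peak in the top-left corner
b = torch.zeros(1, 1, 4, 4); b[0, 0, 1, 1] = 1.0   # shifted by one pixel, same 2x2 window
c = torch.zeros(1, 1, 4, 4); c[0, 0, 2, 2] = 1.0   # shifted further, lands in another window

same_small_shift = torch.equal(pool(a), pool(b))
same_large_shift = torch.equal(pool(a), pool(c))

print("small shift, same pooled map:", same_small_shift)
print("large shift, same pooled map:", same_large_shift)
```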

In [None]:
pool = nn.MaxPool2d(kernel_size=2, stride=2)
feat = torch.arange(0,16).view(1,1,4,4).float()
print("before:\n", feat.squeeze())
print("after pool:\n", pool(feat).squeeze())

## 6. Putting Blocks Together — A Mini CNN <a id="6-putting-blocks-together--a-mini-cnn"></a>

Conv → ReLU → Pool → Conv → ReLU → Pool → Flatten → Linear → **Logits**

Below, we build a tiny network for 10-class images.

In [None]:
class TinyCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                # 32→16
            nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, padding=1), 
            nn.ReLU(),
            nn.MaxPool2d(2)                 # 16→8
        )
        self.classifier = nn.Linear(32*8*8, 10)

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)   # flatten
        return self.classifier(x)

net = TinyCNN()
fake = torch.randn(4, 3, 32, 32)
print(net(fake).shape)   # (4, 10)
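
The `32*8*8` passed to the classifier comes from tracking shapes by hand. If you change the input size or the layers, one way to avoid that arithmetic is to push a dummy tensor through the feature extractor and read off the flattened size. The sketch below rebuilds the same feature stack so it runs on its own; the variable names are illustrative.

```python
import torch
import torch.nn as nn

features = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
)

with torch.no_grad():
    dummy = torch.zeros(1, 3, 32, 32)          # one fake 32x32 RGB image
    feature_map = features(dummy)

flat_features = feature_map.flatten(1).shape[1]  # channels * height * width
classifier = nn.Linear(flat_features, 10)

print("feature map :", tuple(feature_map.shape))
print("flat size   :", flat_features)
```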

## 7. Pre-trained Models & Architecture Zoo <a id="sec-pretrained"></a>

> 🔑 **Key idea:** in the vast majority of real projects you **don't design a CNN from scratch**.  
> You download a ready-made network that was trained on ImageNet (or similar) and
> either **use it as-is** or **fine-tune** it for your task.

### Why use pre-trained networks?

| Benefit | What it means |
|---------|---------------|
| **Speed** | Skip days (or weeks) of training 🚀 |
| **Accuracy** | Years of research baked in (ResNet, EfficientNet, ViT…) |
| **Data-efficient** | Works even if you only have a few hundred labelled images |
| **Less tuning** | Good defaults for learning-rate, normalisation, etc. |

Popular families:

* **[VGG](https://hyperskill.org/learn/step/39368)** – simple, but large  
* **[ResNet](https://hyperskill.org/learn/step/40068)** – skip connections 🏃‍♂️ (good baseline)  
* **[EfficientNet](https://hyperskill.org/learn/step/43876)** – better accuracy / size trade-off  
* **Vision Transformers (ViT, Swin)** – attention instead of conv layers

### Typical workflow

1. **Load** a pre-trained backbone.  
2. **Freeze** most layers (optional).  
3. **Replace** the final classifier with one matching your number of classes.  
4. Train a few epochs on your data.

In [None]:
from torchvision import models
# Load ResNet-18 pre-trained on ImageNet
model = models.resnet18(weights="IMAGENET1K_V1")

# Freeze all layers (optional; comment out to fine-tune)
for p in model.parameters():
    p.requires_grad = False

# Replace the final layer (ResNet-18 has 512-dim features)
num_classes = 3          # e.g. cat, dog, llama
model.fc = nn.Linear(512, num_classes)

# Ready for your training loop
dummy = torch.randn(2, 3, 224, 224)
print("logits shape:", model(dummy).shape)   # (2, 3)

### When should you build a custom model?

Only when:

* Your input is **radically different** (e.g. 3-D medical scans).  
* You need a **tiny** network for microcontrollers.  
* You're doing research and inventing a new architecture.

For almost everything else, grab a pre-trained model, tweak, and ship. 🚢

## 8. Training Loop — Putting the CNN to Work <a id="sec-train-loop"></a>

So far we've built models and passed dummy data through them.  
Now let's **train** one for real (well, for a single epoch).

Key pieces:

| Piece | PyTorch object | What it does |
|-------|----------------|--------------|
| **Dataset** | `torchvision.datasets.*` | knows how to load images + labels |
| **DataLoader** | `torch.utils.data.DataLoader` | serves mini-batches |
| **Model** | our `TinyCNN` (or any pre-trained net) | predicts logits |
| **Loss** | `nn.CrossEntropyLoss` | compares logits & labels |
| **Optimizer** | `torch.optim.SGD / Adam` | updates weights |

We'll use **CIFAR-10** (tiny 32×32 colour images, 10 classes).  
Feel free to swap in Fashion-MNIST if your internet is slow.

In [None]:
import torch, torch.nn as nn, torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Running on:", device)

# 1. Data
transform = transforms.Compose([
    transforms.ToTensor(),                   # (0..1)
    transforms.Normalize((0.5,0.5,0.5), (0.5,0.5,0.5))  # to [-1,1]
])

train_set = datasets.CIFAR10(root="./data", train=True,
                             download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)

# 2. Model
model = TinyCNN().to(device)

# 3. Loss & Optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# 4. Training loop (1 epoch demo)
for images, labels in train_loader:
    images, labels = images.to(device), labels.to(device)

    # forward
    logits = model(images)
    loss   = criterion(logits, labels)

    # backward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("Finished 1 epoch!  Final mini-batch loss:", loss.item())


### What's happening each iteration?

1. **Forward pass** – model predicts logits from images.  
2. **Loss calculation** – `CrossEntropyLoss` compares logits vs. true labels.  
3. **Zero grads** – clear old gradients (`optimizer.zero_grad()`).  
4. **Backward pass** – `loss.backward()` fills `.grad` for each weight.  
5. **Optimizer step** – `optimizer.step()` nudges weights to reduce loss.
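
The loop above only reports the loss. A sketch of how accuracy could be measured on a held-out set after each epoch is shown below. The function name `evaluate` is hypothetical, and the toy model and random data are stand-ins so the snippet runs on its own; in the notebook you would pass your trained model and a CIFAR-10 test loader instead.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

@torch.no_grad()                      # no gradients needed for evaluation
def evaluate(model, loader, device="cpu"):
    model.eval()                      # switch off dropout / use running batch-norm stats
    correct, total = 0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        preds = model(images).argmax(dim=1)   # class with the highest logit
        correct += (preds == labels).sum().item()
        total += labels.size(0)
    return correct / total

# Stand-in model and data, for illustration only
toy_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
toy_data = TensorDataset(torch.randn(100, 3, 32, 32), torch.randint(0, 10, (100,)))
acc = evaluate(toy_model, DataLoader(toy_data, batch_size=25))

print(f"accuracy: {acc:.2%}")
```

Remember to call `model.train()` again before the next training epoch, since `evaluate` leaves the model in evaluation mode.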

Run more epochs, tweak `lr`, or switch to `Adam` and watch accuracy rise.  
When loss plateaus, you're ready to save the model:

```python
torch.save(model.state_dict(), "tinycnn_cifar10.pth")
```

## 9. 1 × 1 Convolutions — Tiny Filters, Big Impact <a id="sec-1x1"></a>

A **1 × 1 convolution** may look useless (it covers only one pixel!),  
but it's a secret weapon in many famous architectures (ResNet,
MobileNet, EfficientNet).

### What does a 1 × 1 filter actually do?

* Works **across channels**, not across height/width.  
  *Think of it as a mini fully-connected layer for each pixel location.*

* **Mixes features** — combines information from all input channels and
  outputs a new set of channels.

* **Shrinks or expands depth** — great for **bottlenecks** to cut parameter
  count before a bigger 3 × 3 layer.
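
The "mini fully-connected layer for each pixel location" view can be checked directly: copying the weights of a 1 × 1 convolution into an `nn.Linear` and applying it along the channel dimension gives the same result. The channel counts (8 in, 4 out) and tensor sizes are arbitrary choices for this sketch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv1x1 = nn.Conv2d(8, 4, kernel_size=1)
linear = nn.Linear(8, 4)

with torch.no_grad():
    # Conv weight has shape (4, 8, 1, 1); drop the two trailing 1s
    linear.weight.copy_(conv1x1.weight.view(4, 8))
    linear.bias.copy_(conv1x1.bias)

    x = torch.randn(2, 8, 5, 5)                 # (batch, channels, H, W)
    out_conv = conv1x1(x)

    # Move channels last, apply the linear layer per pixel, move channels back
    out_linear = linear(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

same = torch.allclose(out_conv, out_linear, atol=1e-5)
print("1x1 conv == per-pixel linear:", same)
```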

| Example | In-channels → Out-channels | Parameters |
|---------|---------------------------|------------|
| 3×3 conv | 64 → 64 | 64 × 64 × 3 × 3 = 36 864 |
| **1×1 + 3×3 combo** | 64 → **16** (1×1), then 16 → 64 (3×3) | 64 × 16 × 1 × 1 + 16 × 64 × 3 × 3 = **10 240** |

→ **72 % fewer weights** for the same spatial "receptive field".
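
The numbers in the table can be reproduced by counting weights in PyTorch. The sketch uses `bias=False` so the counts match the table, which ignores biases; `count_weights` is a helper defined only for this example.

```python
import torch.nn as nn

def count_weights(module):
    return sum(p.numel() for p in module.parameters())

# Plain 3x3 convolution, 64 -> 64 channels
plain = nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False)

# Bottleneck: 1x1 reduce to 16 channels, then 3x3 back to 64
combo = nn.Sequential(
    nn.Conv2d(64, 16, kernel_size=1, bias=False),
    nn.Conv2d(16, 64, kernel_size=3, padding=1, bias=False),
)

plain_weights = count_weights(plain)
combo_weights = count_weights(combo)
saving = 1 - combo_weights / plain_weights

print(plain_weights, combo_weights, f"{saving:.0%} fewer weights")
```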

In [None]:
import torch, torch.nn as nn

bottleneck = nn.Sequential(
    nn.Conv2d(64, 16, kernel_size=1), nn.ReLU(),   # 1×1 reduce
    nn.Conv2d(16, 64, kernel_size=3, padding=1), nn.ReLU()  # 3×3 process
)

x = torch.randn(1, 64, 32, 32)
print("input shape :", x.shape)
y = bottleneck(x)
print("output shape:", y.shape)

### Practical uses

* **Inception modules** – mix 1 × 1, 3 × 3, 5 × 5 in parallel.
* **ResNet bottleneck blocks** – 1 × 1 reduce → 3 × 3 → 1 × 1 expand.
* **MobileNet and EfficientNet blocks** – "inverted" bottlenecks with depthwise + 1 × 1 pointwise.

Whenever you see a CNN diagram with a skinny → fat → skinny channel pattern,
a 1 × 1 convolution is doing the slimming or bulking.

Add it to your toolbox when you need to:

* Cut GPU memory without shrinking spatial size.  
* Blend information across channels cheaply.  
* Add extra non-linearity between big convolutions.

## 10. Practice Section <a id="sec-exercises"></a>

Now it's your turn! The following exercises will guide you through building a CNN from scratch and then fine-tuning a pre-trained model. This will solidify your understanding of the concepts covered in the theory section.

### Part 1: Build a CNN from Scratch for MNIST

In this part, you will define your own simple CNN, write a training loop, and train it to classify handwritten digits from the famous MNIST dataset.


In [None]:
# 1.1: Load and Visualize MNIST Data
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt
import numpy as np

# Define a transform to normalize the data
# The values (0.1307,) and (0.3081,) are the global mean and standard deviation of the MNIST dataset
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

# Download and load the training and test data
train_dataset = datasets.MNIST('./data', train=True, download=True, transform=transform)
test_dataset = datasets.MNIST('./data', train=False, download=True, transform=transform)

# Create data loaders
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=1000, shuffle=False)

# Visualize a batch of training data
dataiter = iter(train_loader)
images, labels = next(dataiter)

fig = plt.figure(figsize=(15, 5))
for idx in np.arange(20):
    ax = fig.add_subplot(4, 5, idx+1, xticks=[], yticks=[])
    ax.imshow(np.squeeze(images[idx]), cmap='gray')
    ax.set_title(str(labels[idx].item()))
plt.show()


In [None]:
# 1.2 – Define the CNN Architecture  (all ops inside nn.Sequential)

class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()

        # ─── 🔧 YOUR CODE HERE ───────────────────────────────────────────
        # Build ONE nn.Sequential called `self.features`.
        #   Suggested order:  Conv → ReLU → MaxPool → Conv → ReLU → MaxPool
        self.features = nn.Sequential(
            # nn.Conv2d(1, 10, kernel_size=3, padding=1),
            # ...
            
        )
        # For classification you need to flatten the output
        # Then use a linear layer or two with ReLU and finally a linear layer for the output with 10 classes
        self.classifier = nn.Sequential(
            nn.Flatten(),
            # ...
            # 
        )

    def forward(self, x):
        # Simply run the two sequences defined above
        x = self.features(x)
        x = self.classifier(x)
        return x


model_scratch = SimpleCNN()
print(model_scratch)


In [None]:
# 1.3: Write the Training and Validation Loop

def train_and_validate(model, train_loader, test_loader, criterion, optimizer, epochs=5):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    
    for epoch in range(epochs):
        model.train() # Set the model to training mode
        running_loss = 0.0
        for i, (inputs, labels) in enumerate(train_loader):
            inputs, labels = inputs.to(device), labels.to(device)
            
            # YOUR CODE HERE: Implement the training step for one batch.
            # 1. Zero the gradients
            # 2. Forward pass: compute predicted outputs by passing inputs to the model
            # 3. Calculate the loss
            # 4. Backward pass: compute gradient of the loss with respect to model parameters
            # 5. Perform a single optimization step (parameter update)
            
            
            
            running_loss += loss.item()
        
        # Validation Phase
        model.eval() # Set the model to evaluation mode
        correct = 0
        total = 0
        with torch.no_grad():
            for inputs, labels in test_loader:
                inputs, labels = inputs.to(device), labels.to(device)
                outputs = model(inputs)
                _, predicted = torch.max(outputs, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()
        
        print(f"Epoch {epoch+1}/{epochs}.. "
              f"Training Loss: {running_loss/len(train_loader):.3f}.. "
              f"Test Accuracy: {100 * correct / total:.2f}%")
    
    print("Finished Training")
    return model

In [None]:
# 1.4: Train the Model

# Instantiate the model
model_scratch = SimpleCNN()

# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model_scratch.parameters(), lr=0.001)

# Train the model (this will take a few minutes)
trained_model_scratch = train_and_validate(model_scratch, train_loader, test_loader, criterion, optimizer, epochs=5)


In [None]:
# 1.5: Save, Load, and Visualize Model Predictions

# Save the trained model checkpoint
torch.save(trained_model_scratch.state_dict(), "mnist_cnn.pth")
print("Model checkpoint saved as mnist_cnn.pth")

# Load the model checkpoint into a new instance 
loaded_model = SimpleCNN()
loaded_model.load_state_dict(torch.load("mnist_cnn.pth", map_location="cpu"))  # works even if saved from a GPU
loaded_model.eval()

# Get a batch of test images
dataiter = iter(test_loader)
images, labels = next(dataiter)

# Get predictions
with torch.no_grad():
    outputs = loaded_model(images)
    _, preds = torch.max(outputs, 1)

# Plot the first 10 test images, their predicted labels, and true labels
fig = plt.figure(figsize=(15, 4))
plt.subplots_adjust(hspace=0.6)  # Increase vertical space between rows
for idx in np.arange(10):
    ax = fig.add_subplot(2, 5, idx+1, xticks=[], yticks=[])
    ax.imshow(np.squeeze(images[idx]), cmap='gray')
    ax.set_title(f"Pred: {preds[idx].item()}\nTrue: {labels[idx].item()}")
plt.show()

### Part 2: Use pretrained models from the timm library

In this part, you will use a powerful repository called timm (PyTorch Image Models). It provides access to hundreds of pretrained models.

Note: If you don't have a GPU, the code below may take a considerable amount of time to run. To monitor progress, you can print the loss values after each batch (by adjusting your *train_and_validate* function).

In [None]:
!pip install timm

In [None]:
import timm

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 1. Load a pre-trained backbone and adapt it to 10 CIFAR classes
model_timm = timm.create_model("resnet18", pretrained=True, num_classes=10)
model_timm.to(device)

# (Optional) freeze everything except the classifier head
for name, p in model_timm.named_parameters():
    if not name.startswith("fc"):   # ResNet's final layer = "fc"
        p.requires_grad = False

# 2. Data — CIFAR-10 images resized to 224×224 expected by ImageNet models
transform_ft = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
    # ImageNet mean/std, matching the statistics the pre-trained weights were trained with
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))
])

# download=True is a no-op if the files are already in ./data
train_set_ft = datasets.CIFAR10("./data", train=True,  download=True, transform=transform_ft)
test_set_ft  = datasets.CIFAR10("./data", train=False, download=True, transform=transform_ft)

train_loader_ft = DataLoader(train_set_ft, batch_size=64, shuffle=True)
test_loader_ft  = DataLoader(test_set_ft,  batch_size=256, shuffle=False)

# 3. Loss & Optimizer (only train unfrozen params)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(filter(lambda p: p.requires_grad, model_timm.parameters()), lr=1e-3)

# 4. Quick sanity check
imgs, _ = next(iter(train_loader_ft))
print("batch:", imgs.shape, "→ logits:", model_timm(imgs.to(device)).shape)

# YOUR CODE HERE ------------------------------------------------------------------
# Re-use your `train_and_validate` function:
#
trained_timm = train_and_validate(model_timm,
                                  train_loader_ft,
                                  test_loader_ft,
                                  criterion,
                                  optimizer,
                                  epochs=3)


# Try:
#   • Un-freezing more layers and observing accuracy changes.
#   • Different timm models: "tf_efficientnet_b0", "mobilenetv3_small_100", "vit_base_patch16_224", …
#   • Tweaking learning-rate / batch-size / augmentation.

🎉 **Congratulations!**  
You now have 2 working classifiers.

Try experimenting with different architectures, data‑augmentation strategies, optimizers, or learning‑rate schedules to push the accuracy even higher.