📝 **Author:** Amirhossein Heydari - 📧 **Email:** amirhosseinheydari78@gmail.com - 📍 **Linktree:** [linktr.ee/mr_pylin](https://linktr.ee/mr_pylin)

---

# Dependencies

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn.functional as F
from torch import nn
from torchinfo import summary

In [2]:
# set a seed for deterministic results
random_state = 42
torch.manual_seed(random_state)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

In [3]:
# check if cuda is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cuda'

# Convolution vs. Correlation
   - Convolution and correlation are both operations used in signal processing and image analysis

**[Convolution](https://en.wikipedia.org/wiki/Convolution)**:
   - Convolution measures how one function (the kernel) modifies the other function (the signal or image).
   - In the context of image processing, it's used to apply a filter or kernel to an image.
   - Mathematical Formulation (discrete signals):
   $$(f * g)[i] = \sum_{j} f[j] \cdot g[i - j]$$

**[Correlation](https://en.wikipedia.org/wiki/Correlation)**:
   - Correlation measures the similarity between two signals as one is shifted over the other.
   - In image processing, it's used to detect patterns by sliding a filter over an image.
   - Mathematical Formulation (discrete signals):
   $$(f \star g)[i] = \sum_{j} f[j] \cdot g[i + j]$$

<figure style="text-align: center;">
    <img src="../assets/images/original/cnn/correlation-and-convolution.svg" alt="correlation-and-convolution.svg" style="width: 100%;">
    <figcaption>Correlation vs. Convolution</figcaption>
</figure>
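To make the difference between the two formulas concrete, here is a minimal NumPy sketch. The signal `f` and kernel `g` are made-up values, and the first argument of `np.convolve` / `np.correlate` plays the role of the signal being slid over. The kernel is deliberately asymmetric, since a symmetric kernel would give identical results for both operations.

```python
import numpy as np

f = np.array([1, 2, 3, 4, 5])   # signal (illustrative values)
g = np.array([1, 0, -1])        # asymmetric kernel, so flipping matters

# "valid" mode keeps only the positions where the kernel fully overlaps the signal
conv = np.convolve(f, g, mode="valid")    # kernel is flipped before sliding
corr = np.correlate(f, g, mode="valid")   # kernel is slid as-is

# convolution is the same as correlation with the flipped kernel
corr_flipped = np.correlate(f, g[::-1], mode="valid")

print(conv)          # [2 2 2]
print(corr)          # [-2 -2 -2]
print(corr_flipped)  # [2 2 2]
```

The only difference between the two results is the kernel flip, which is why the sign changes for this particular kernel.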

**Basic Concepts**:
   - [Padding](https://medium.com/analytics-vidhya/convolution-padding-stride-and-pooling-in-cnn-13dc1f3ada26#:~:text=1%29%20%E2%88%97%20%28%F0%9D%91%9A%20%E2%88%92%20%F0%9D%91%9B%20%2B%201%29.-,Padding,-There%20are%20two)
      - It refers to adding extra values (usually zeros) around the input tensor (signal or image) before applying the convolution operation
      - Padding is used to control the size of the output and to allow the kernel to process the edges of the input
      - `padding='same'`
         - Pads the input so that the output has the same spatial dimensions (width and height for 2D convolutions, length for 1D convolutions) as the input; with stride `1`, no dilation, and an odd kernel size $k$, the padding on each side is
         $$p = \left\lceil \frac{k - 1}{2} \right\rceil$$
      - `padding='valid'`
         - No padding is applied to the input, so for kernel size $k$ and stride $s$ the output shrinks to
         $$\text{Output Size} = \left\lfloor \frac{\text{Input Size} - k}{s} + 1 \right\rfloor$$

<figure style="text-align: center;">
    <img src="../assets/images/original/cnn/padding.svg" alt="convolution-padding.svg" style="width: 75%;">
    <figcaption>Padding for Convolution</figcaption>
</figure>

   - [Stride](https://medium.com/analytics-vidhya/convolution-padding-stride-and-pooling-in-cnn-13dc1f3ada26#:~:text=in%20this%20case.-,Stride,-left%20image%3A%20stride)
      - It defines how much the kernel moves over the input tensor during the convolution
      - A stride of `1` means the kernel moves one element at a time, so consecutive windows overlap in all but one element
      - A stride of `2` means the kernel moves two elements at a time (skipping every other position), which downsamples the output

   - [Dilation](https://towardsdatascience.com/review-dilated-convolution-semantic-segmentation-9d5a5bd768f5)
      - It introduces gaps between the elements of the kernel, effectively "spreading out" the kernel
      - This allows the kernel to cover a larger area of the input without increasing the number of parameters (kernel size)
      - Dilation is useful for capturing long-range dependencies in the input.

<figure style="text-align: center;">
    <img src="../assets/images/original/cnn/dilation.svg" alt="convolution-dilation.svg" style="width: 75%;">
    <figcaption>Dilation for Convolution</figcaption>
</figure>
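Padding, stride, and dilation together determine the output length. The sketch below uses the general output-size formula given in the PyTorch `Conv1d` documentation and compares it with what `F.conv1d` actually returns. The helper name `conv_output_size` and the test signal are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F


def conv_output_size(n, k, padding=0, stride=1, dilation=1):
    """Output length along one dimension (hypothetical helper)."""
    effective_k = dilation * (k - 1) + 1  # span of the kernel once gaps are inserted
    return (n + 2 * padding - effective_k) // stride + 1


signal = torch.arange(1, 10, dtype=torch.float32).reshape(1, 1, -1)  # length 9
kernel = torch.ones(1, 1, 3)                                         # length 3

# (padding, stride, dilation) combinations to compare against PyTorch
for p, s, d in [(0, 1, 1), (1, 1, 1), (2, 2, 1), (0, 1, 2), (2, 1, 2)]:
    out = F.conv1d(signal, kernel, padding=p, stride=s, dilation=d)
    expected = conv_output_size(9, 3, padding=p, stride=s, dilation=d)
    print(f"padding={p}, stride={s}, dilation={d} -> {out.shape[-1]} (formula: {expected})")
```

With dilation `2`, a kernel of size `3` spans `5` input elements, so a padding of `2` is needed to keep the output length equal to the input length.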

# Convolution in PyTorch
   - Convolution operations (e.g. `nn.Conv1d`, `nn.Conv2d`) in PyTorch (and most deep learning frameworks) technically perform **correlation, not convolution!**
   - Although the operation is named e.g. `Conv2d`, the correlation operation is preferred in practice for a few reasons
      1. **Simplicity**:
         - Correlation is easier to implement and understand since it doesn't require flipping the kernel
      1. **Equivalence in Learning**:
         - In the context of CNNs, the kernel weights are learned during training
         - Since the kernels are learned, whether you use convolution or cross-correlation doesn't matter
         - The network can learn equivalent filters regardless of whether the kernel is flipped or not
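A quick way to check this behaviour is to pass an asymmetric kernel to `F.conv1d` and compare the result with the same call on a flipped kernel. This is only a sketch with made-up values. `torch.flip` reverses the kernel along its last axis, which is what a true convolution would do before sliding.

```python
import torch
import torch.nn.functional as F

signal = torch.tensor([1., 2., 3., 4., 5.]).reshape(1, 1, -1)
kernel = torch.tensor([1., 0., -1.]).reshape(1, 1, -1)

# F.conv1d slides the kernel without flipping it (correlation)
out_corr = F.conv1d(signal, kernel)

# flipping the kernel first turns the same call into a true convolution
out_conv = F.conv1d(signal, torch.flip(kernel, dims=[-1]))

print(out_corr.flatten())  # tensor([-2., -2., -2.])
print(out_conv.flatten())  # tensor([2., 2., 2.])
```

Because a learned kernel can simply converge to the flipped version, the distinction has no effect on what a trained network can represent.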

**Docs**:
   - [pytorch.org/docs/stable/generated/torch.nn.Conv1d.html](https://pytorch.org/docs/stable/generated/torch.nn.Conv1d.html)
   - [pytorch.org/docs/stable/generated/torch.nn.Conv2d.html](https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html)
   - [pytorch.org/docs/stable/generated/torch.nn.Conv3d.html](https://pytorch.org/docs/stable/generated/torch.nn.Conv3d.html)
   - [pytorch.org/docs/stable/generated/torch.nn.functional.conv1d.html](https://pytorch.org/docs/stable/generated/torch.nn.functional.conv1d.html)
   - [pytorch.org/docs/stable/generated/torch.nn.functional.conv2d.html](https://pytorch.org/docs/stable/generated/torch.nn.functional.conv2d.html)
   - [pytorch.org/docs/stable/generated/torch.nn.functional.conv3d.html](https://pytorch.org/docs/stable/generated/torch.nn.functional.conv3d.html)

## 1D Correlation

In [4]:
# create a 1D signal and a kernel
signal_1d = torch.arange(1, 10).reshape(1, 1, -1)      # shape: [1, 1,  9] -> (batch_size, num_channels, signal_length)
kernel_1d = torch.tensor([2, 1, 2]).reshape(1, 1, -1)  # shape: [1, 1,  3]

In [5]:
# convolution using torch.nn.functional.conv1d
conv_1d_1 = F.conv1d(signal_1d, kernel_1d, padding='same')       # applies convolution with "same" padding, output size is the same as input size
conv_1d_2 = F.conv1d(signal_1d, kernel_1d, padding='valid')      # applies convolution with "valid" padding, no padding is added, so the output size is reduced
conv_1d_3 = F.conv1d(signal_1d, kernel_1d, padding=2, stride=2)  # applies convolution with a padding of 2 and a stride of 2, which results in downsampling the output

# log
print(f"conv_1d_1 : {conv_1d_1}")
print(f"conv_1d_2 : {conv_1d_2}")
print(f"conv_1d_3 : {conv_1d_3}")

conv_1d_1 : tensor([[[ 5, 10, 15, 20, 25, 30, 35, 40, 25]]])
conv_1d_2 : tensor([[[10, 15, 20, 25, 30, 35, 40]]])
conv_1d_3 : tensor([[[ 2, 10, 20, 30, 40, 18]]])
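The three outputs above can be reproduced by hand, which may help in seeing what padding and stride do. The sketch below rebuilds them with `Tensor.unfold` (which extracts sliding windows) and `F.pad` (zero padding). It is an illustrative re-derivation, not how PyTorch computes convolutions internally.

```python
import torch
import torch.nn.functional as F

signal = torch.arange(1, 10)        # 1, 2, ..., 9
kernel = torch.tensor([2, 1, 2])

# "valid": every length-3 window of the raw signal, dotted with the kernel
windows = signal.unfold(0, 3, 1)    # shape: [7, 3]
valid = (windows * kernel).sum(dim=1)

# "same": pad one zero on each side first, then slide as before
padded = F.pad(signal, (1, 1))      # shape: [11]
same = (padded.unfold(0, 3, 1) * kernel).sum(dim=1)

# padding=2, stride=2: pad two zeros per side and move the window two steps at a time
strided = (F.pad(signal, (2, 2)).unfold(0, 3, 2) * kernel).sum(dim=1)

print(same)     # tensor([ 5, 10, 15, 20, 25, 30, 35, 40, 25])
print(valid)    # tensor([10, 15, 20, 25, 30, 35, 40])
print(strided)  # tensor([ 2, 10, 20, 30, 40, 18])
```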


In [None]:
# plot
fig, axs = plt.subplots(nrows=1, ncols=4, figsize=(16, 4), layout='compressed')

axs[0].plot(signal_1d.squeeze(), marker='o', label='Original Signal')
axs[0].plot(kernel_1d.squeeze(), marker='o', color='purple', label='Kernel')
axs[0].set_title("Original Signal")
axs[0].legend()
axs[1].plot(conv_1d_1.squeeze(), marker='o', color='orange')
axs[1].set_title("Convolution with \"Same\" Padding")
axs[2].plot(conv_1d_2.squeeze(), marker='o', color='green')
axs[2].set_title("Convolution with \"Valid\" Padding")
axs[3].plot(conv_1d_3.squeeze(), marker='o', color='red')
axs[3].set_title("Convolution with Custom Padding and Stride")

plt.show()

## 2D Correlation

In [7]:
# create a 2D signal (image) and a kernel
signal_2d = torch.arange(1, 26, dtype=torch.float32).reshape(1, 1, 5, 5)                                 # shape: [1, 1, 5, 5] -> (batch_size, num_channels, height, width)
kernel_2d = torch.tensor([[1, 0, -1], [1, 0, -1], [1, 0, -1]], dtype=torch.float32).reshape(1, 1, 3, 3)  # shape: [1, 1, 3, 3]

In [8]:
# convolution using torch.nn.functional.conv2d
conv_2d_1 = F.conv2d(signal_2d, kernel_2d, padding='same')       # applies convolution with "same" padding, output size is the same as input size
conv_2d_2 = F.conv2d(signal_2d, kernel_2d, padding='valid')      # applies convolution with "valid" padding, no padding is added, so the output size is reduced
conv_2d_3 = F.conv2d(signal_2d, kernel_2d, padding=1, stride=2)  # applies convolution with a padding of 1 and a stride of 2, which results in downsampling the output
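Since the 2D results are only plotted, a small self-contained sketch that prints their shapes and the `valid` values may be useful. It rebuilds the same image and kernel under different variable names. The kernel subtracts the right column from the left column of each 3×3 window, and because the image values increase by `1` per column, every window gives `3 × (-2) = -6`.

```python
import torch
import torch.nn.functional as F

image = torch.arange(1, 26, dtype=torch.float32).reshape(1, 1, 5, 5)
kernel = torch.tensor([[1, 0, -1]] * 3, dtype=torch.float32).reshape(1, 1, 3, 3)

out_same = F.conv2d(image, kernel, padding="same")
out_valid = F.conv2d(image, kernel, padding="valid")
out_strided = F.conv2d(image, kernel, padding=1, stride=2)

print(out_same.shape)       # torch.Size([1, 1, 5, 5])
print(out_valid.shape)      # torch.Size([1, 1, 3, 3])
print(out_strided.shape)    # torch.Size([1, 1, 3, 3])
print(out_valid.squeeze())  # a 3x3 tensor filled with -6
```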

In [None]:
# plot
fig, axs = plt.subplots(nrows=1, ncols=5, figsize=(20, 4), layout='compressed')

axs[0].imshow(signal_2d.squeeze(), cmap='gray')
axs[0].set(title="Original Signal (Image)", xticks=range(signal_2d.shape[3]), yticks=range(signal_2d.shape[2]))
axs[1].imshow(kernel_2d.squeeze(), cmap='gray')
axs[1].set(title='Kernel', xticks=range(kernel_2d.shape[3]), yticks=range(kernel_2d.shape[2]))
axs[2].imshow(conv_2d_1.squeeze(), cmap='gray')
axs[2].set(title="Convolution with \"Same\" Padding", xticks=range(conv_2d_1.shape[3]), yticks=range(conv_2d_1.shape[2]))
axs[3].imshow(conv_2d_2.squeeze(), cmap='gray')
axs[3].set(title="Convolution with \"Valid\" Padding", xticks=range(conv_2d_2.shape[3]), yticks=range(conv_2d_2.shape[2]))
axs[4].imshow(conv_2d_3.squeeze(), cmap='gray')
axs[4].set(title="Convolution with Custom Padding and Stride", xticks=range(conv_2d_3.shape[3]), yticks=range(conv_2d_3.shape[2]))

plt.show()

# Convolutional Neural Networks
   - CNNs are a class of deep learning models specifically designed for processing structured grid-like data, such as images, videos, and even certain types of sequential data

**Key Components of CNNs**
   1. Feature Extraction
      - Convolutional Layers
         - This is the core building block of a CNN
         - It involves sliding a filter (kernel) over the input data to produce a feature map
      - Pooling Layers
         - Pooling layers reduce the spatial dimensions of the feature maps, which helps in making the model invariant to small translations and reducing computational load
         - Types:
            - Max Pooling: Takes the maximum value from each patch of the feature map.
            - Average Pooling: Takes the average value from each patch.
   1. Classification
      - After feature extraction, the resulting features are flattened and passed into a series of fully connected layers, forming a [Multi-Layer Perceptron (MLP)](./06_multi-layer-perceptrons.ipynb).
      - This section performs the final classification or regression task based on the features extracted by the previous layers
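The two pooling types described above can be compared on a small hand-written feature map. This is a minimal sketch whose values are chosen so that each 2×2 patch is easy to check by eye.

```python
import torch
from torch import nn

# 4x4 feature map (illustrative values), shaped as (batch, channels, height, width)
feature_map = torch.tensor([
    [ 1.,  2.,  5.,  6.],
    [ 3.,  4.,  7.,  8.],
    [ 9., 10., 13., 14.],
    [11., 12., 15., 16.],
]).reshape(1, 1, 4, 4)

max_pool = nn.MaxPool2d(kernel_size=2)  # stride defaults to the kernel size
avg_pool = nn.AvgPool2d(kernel_size=2)

print(max_pool(feature_map).squeeze())  # tensor([[ 4.,  8.], [12., 16.]])
print(avg_pool(feature_map).squeeze())  # tensor([[ 2.5000,  6.5000], [10.5000, 14.5000]])
```

Both layers halve the height and width here and have no learnable parameters.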

<figure style="text-align: center;">
    <img src="../assets/images/original/cnn/convolutional-neural-networks.svg" alt="convolutional-neural-networks.svg" style="width: 100%;">
    <figcaption>Convolutional Neural Networks Model</figcaption>
</figure>

<table style="margin: 0 auto; text-align:center;">
   <thead>
      <tr>
         <th colspan="4" style="text-align:center;">Feature Extraction</th>
         <th colspan="4" style="text-align:center;">Classification</th>
      </tr>
      <tr>
         <th colspan="2">Convolution<sub>1</sub> parameters</th>
         <th colspan="2">Convolution<sub>2</sub> parameters</th>
         <th colspan="2">hidden<sub>1</sub> parameters</th>
         <th colspan="2">logits parameters</th>
      </tr>
   </thead>
   <tbody>
      <tr>
         <td>Weights</td>
         <td>Biases</td>
         <td>Weights</td>
         <td>Biases</td>
         <td>Weights</td>
         <td>Biases</td>
         <td>Weights</td>
         <td>Biases</td>
      </tr>
      <tr>
         <td>(1 × 3 × 3) × A</td>
         <td>A</td>
         <td>(A × 3 × 3) × B</td>
         <td>B</td>
         <td>C × D</td>
         <td>D</td>
         <td>D × E</td>
         <td>E</td>
      </tr>
   </tbody>
   <tfoot>
      <tr>
         <td colspan="2">(1 × 3 × 3 + 1) × A</td>
         <td colspan="2">(A × 3 × 3 + 1) × B</td>
         <td colspan="2">(C + 1) × D</td>
         <td colspan="2">(D + 1) × E</td>
      </tr>
   </tfoot>
</table>
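The parameter counts in the table can be checked against what PyTorch allocates. In the sketch below, the sizes `A`, `B`, `C`, `D`, `E` and the helper functions are hypothetical and chosen only for illustration. `C = 400` would correspond, for example, to `B = 16` feature maps of size 5×5 after flattening.

```python
from torch import nn


def conv2d_params(in_channels, out_channels, kernel_size):
    """(in_channels * k * k weights + 1 bias) per output channel."""
    return (in_channels * kernel_size * kernel_size + 1) * out_channels


def linear_params(in_features, out_features):
    """(in_features weights + 1 bias) per output unit."""
    return (in_features + 1) * out_features


def count_params(module):
    """Number of learnable scalars PyTorch allocates for a module."""
    return sum(p.numel() for p in module.parameters())


# hypothetical sizes, not taken from any specific model
A, B, C, D, E = 8, 16, 400, 32, 10

layers = {
    "convolution 1": (nn.Conv2d(1, A, kernel_size=3), conv2d_params(1, A, 3)),
    "convolution 2": (nn.Conv2d(A, B, kernel_size=3), conv2d_params(A, B, 3)),
    "hidden 1": (nn.Linear(C, D), linear_params(C, D)),
    "logits": (nn.Linear(D, E), linear_params(D, E)),
}

for name, (layer, expected) in layers.items():
    print(f"{name:13} formula: {expected:6d}  PyTorch: {count_params(layer):6d}")
```

Note that the convolutional parameter counts do not depend on the input image size, whereas `C` (and therefore the first fully connected layer) does.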

**Training a CNN**:
   - Forward Pass: Calculate the output using the current weights and biases.
   - Loss Function: Commonly used loss functions for CNNs include Cross-Entropy Loss for classification tasks and Mean Squared Error for regression tasks.
   - Backward Pass (Backpropagation): Calculate the gradient of the loss function with respect to each weight and bias.
   - Weight Update: Update the weights and biases using an optimization algorithm like Gradient Descent or Adam.
   - Regularization: Techniques like Dropout and Batch Normalization are used to prevent overfitting and stabilize training.

**Applications of CNNs**:
   - Image Classification: Identifying the class label of an input image.
   - Object Detection: Locating objects within an image and identifying their class.
   - Segmentation: Classifying each pixel in an image into a category.
   - Face Recognition: Identifying or verifying a person based on an image of their face.

**[Popular CNN Architectures](./models/CNN/)**
   - **LeNet-5**: One of the earliest CNNs, designed for handwritten digit recognition.
   - **AlexNet**: A deeper CNN that won the ImageNet competition in 2012, popularizing CNNs for large-scale image classification.
   - **VGGNet**: Known for its simplicity and use of very small (3x3) filters, VGGNet showed that depth (more layers) can lead to better performance.
   - **ResNet (Residual Networks)**: Introduces skip connections to combat the vanishing gradient problem, enabling much deeper networks.
   - ...

**Notes**:
   - `torch.nn.Conv2d`
      - loss function : 
         - multi-class classification : `torch.nn.CrossEntropyLoss` = `torch.nn.LogSoftmax` + `torch.nn.NLLLoss`
         - [pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html)
         - [pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html](https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html)
      - activation function for the last layer:
         - when using `torch.nn.CrossEntropyLoss` as a loss function, the output layer doesn't need an activation function
         - `torch.nn.CrossEntropyLoss` calculates `torch.nn.LogSoftmax` and `torch.nn.NLLLoss` internally.
         - [pytorch.org/docs/stable/generated/torch.nn.Softmax.html](https://pytorch.org/docs/stable/generated/torch.nn.Softmax.html)
         - [pytorch.org/docs/stable/generated/torch.nn.LogSoftmax.html](https://pytorch.org/docs/stable/generated/torch.nn.LogSoftmax.html)
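      - Weights
         - Initialized with `torch.nn.init.kaiming_uniform_` using `a=math.sqrt(5)`, a variant of Kaiming/He initialization
         - Uniform Distribution [default]: $W \sim \mathcal{U}\left(-\sqrt{\frac{1}{n_{\text{in}}}}, \sqrt{\frac{1}{n_{\text{in}}}}\right)$, where $n_{\text{in}} = C_{\text{in}} \times k_h \times k_w$ for `groups=1`
         - Standard Kaiming/He initialization for ReLU (available through `torch.nn.init`): uniform $W \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_{\text{in}}}}, \sqrt{\frac{6}{n_{\text{in}}}}\right)$ or normal $W \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}}}\right)$
      - Biases:
         - Initialized from the same default range: $b \sim \mathcal{U}\left(-\sqrt{\frac{1}{n_{\text{in}}}}, \sqrt{\frac{1}{n_{\text{in}}}}\right)$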
      - [pytorch.org/docs/stable/nn.init.html](https://pytorch.org/docs/stable/nn.init.html)
      - Paper: [Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification - He, K. et al. (2015).](https://www.cv-foundation.org/openaccess/content_iccv_2015/papers/He_Delving_Deep_into_ICCV_2015_paper.pdf)
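The relationship between `torch.nn.CrossEntropyLoss`, `torch.nn.LogSoftmax`, and `torch.nn.NLLLoss` noted above can be verified numerically. The logits and targets below are random, made-up values used only for this check.

```python
import torch
from torch import nn

torch.manual_seed(0)

logits = torch.randn(4, 10)            # raw model outputs: 4 samples, 10 classes
targets = torch.tensor([1, 0, 3, 9])   # class indices

# one step: raw logits go straight into CrossEntropyLoss
loss_ce = nn.CrossEntropyLoss()(logits, targets)

# two steps: LogSoftmax over the class dimension, then NLLLoss
log_probs = nn.LogSoftmax(dim=1)(logits)
loss_manual = nn.NLLLoss()(log_probs, targets)

print(torch.allclose(loss_ce, loss_manual))  # True
```

This is why the model's last layer should output raw logits when `torch.nn.CrossEntropyLoss` is used: applying a softmax beforehand would normalize the outputs twice.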

**Playground**:
   - [poloclub.github.io/cnn-explainer](https://poloclub.github.io/cnn-explainer/)
   - [convnetplayground.fastforwardlabs.com](https://convnetplayground.fastforwardlabs.com/)
   - [alexlenail.me/NN-SVG](https://alexlenail.me/NN-SVG/)

## Convolutional Neural Networks Using PyTorch
   - Refer to this [notebook](./projects/02_convolutional-neural-networks.ipynb) for a comprehensive CNN example.

📚 **Tutorials**:
   - Neural Networks: [pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial](https://pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial)
   - Training a Classifier: [pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html](https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html)

In [10]:
class CIFAR10Model(nn.Module):
    def __init__(self, in_channels, output_dim):
        super().__init__()
        self.feature_extractor = nn.Sequential(

            # 3x32x32
            nn.Conv2d(in_channels, out_channels=32, kernel_size=3),
            nn.BatchNorm2d(32),  # StandardScaler along channel axis
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
            # 32x15x15

            nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
            # 64x6x6

            nn.AdaptiveAvgPool2d(output_size=(1, 1))
            # 64x1x1
        )

        self.flatten = nn.Flatten(start_dim=1)

        self.classifier = nn.Sequential(
            nn.Linear(64, output_dim),
        )

    def forward(self, x):
        x = self.feature_extractor(x)
        x = self.flatten(x)
        x = self.classifier(x)
        return x

In [11]:
# example input
batch_size = 3
x = torch.randn(batch_size, 3, 32, 32)
y = torch.tensor([1, 0, 1], dtype=torch.int64)

In [12]:
# initialize the CNN
in_channels = x.shape[1]  # number of input channels (3 for RGB images)
output_dim = 10           # number of classes (CIFAR-10 has 10)

model = CIFAR10Model(in_channels, output_dim)
model

CIFAR10Model(
  (feature_extractor): Sequential(
    (0): Conv2d(3, 32, kernel_size=(3, 3), stride=(1, 1))
    (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
    (3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (4): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1))
    (5): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (6): ReLU()
    (7): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (8): AdaptiveAvgPool2d(output_size=(1, 1))
  )
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (classifier): Sequential(
    (0): Linear(in_features=64, out_features=10, bias=True)
  )
)

In [13]:
summary(model, input_size=x.size(), device='cpu')

Layer (type:depth-idx)                   Output Shape              Param #
CIFAR10Model                             [3, 10]                   --
├─Sequential: 1-1                        [3, 64, 1, 1]             --
│    └─Conv2d: 2-1                       [3, 32, 30, 30]           896
│    └─BatchNorm2d: 2-2                  [3, 32, 30, 30]           64
│    └─ReLU: 2-3                         [3, 32, 30, 30]           --
│    └─MaxPool2d: 2-4                    [3, 32, 15, 15]           --
│    └─Conv2d: 2-5                       [3, 64, 13, 13]           18,496
│    └─BatchNorm2d: 2-6                  [3, 64, 13, 13]           128
│    └─ReLU: 2-7                         [3, 64, 13, 13]           --
│    └─MaxPool2d: 2-8                    [3, 64, 6, 6]             --
│    └─AdaptiveAvgPool2d: 2-9            [3, 64, 1, 1]             --
├─Flatten: 1-2                           [3, 64]                   --
├─Sequential: 1-3                        [3, 10]                   --
│    └─Li

In [14]:
# define a loss function
criterion = nn.CrossEntropyLoss()

# define an optimizer (e.g., Adam)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# training loop
num_epochs = 10  # Number of epochs

for epoch in range(num_epochs):

    # forward pass
    output = model(x)
    
    # compute the loss
    loss = criterion(output, y)
    
    # perform backward propagation automatically
    loss.backward()
    
    # update the weights & zero the gradients
    optimizer.step()
    optimizer.zero_grad()
    
    # log
    print(f'epoch {epoch+1:3}/{num_epochs}  ->  Loss: {loss.item():.4f}')

epoch   1/10  ->  Loss: 2.8423
epoch   2/10  ->  Loss: 1.0186
epoch   3/10  ->  Loss: 0.3171
epoch   4/10  ->  Loss: 0.1235
epoch   5/10  ->  Loss: 0.0634
epoch   6/10  ->  Loss: 0.0399
epoch   7/10  ->  Loss: 0.0284
epoch   8/10  ->  Loss: 0.0215
epoch   9/10  ->  Loss: 0.0168
epoch  10/10  ->  Loss: 0.0133
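After training, predictions are usually made in evaluation mode so that `BatchNorm2d` uses its running statistics instead of batch statistics. The sketch below shows one possible inference step. It uses a small hypothetical stand-in model and random inputs so that it runs on its own, not the `CIFAR10Model` trained above.

```python
import torch
from torch import nn

# hypothetical stand-in model with the same kind of layers as above
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3),
    nn.BatchNorm2d(8),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(output_size=(1, 1)),
    nn.Flatten(start_dim=1),
    nn.Linear(8, 10),
)

x = torch.randn(3, 3, 32, 32)  # random batch of 3 "images"

model.eval()                   # switch BatchNorm (and Dropout, if any) to inference behavior
with torch.no_grad():          # no gradients are needed for prediction
    logits = model(x)                     # shape: [3, 10]
    probs = torch.softmax(logits, dim=1)  # convert logits to class probabilities
    preds = probs.argmax(dim=1)           # predicted class index per sample

print(logits.shape)  # torch.Size([3, 10])
print(preds.shape)   # torch.Size([3])
```

Calling `model.train()` switches the layers back before any further training.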
