# Deep Learning & Applied AI

We recommend going through the notebook using Google Colaboratory.

# Tutorial 7b: Batchnorm and dropout

In this tutorial, we implement Batchnorm and Dropout from scratch.

You are encouraged to try and implement them by yourselves before looking at the solution. At the very least, once you have read and understood the code, try to re-implement it on your own.

Author:

Prof. Emanuele Rodolà

Course:

- Website and notebooks will be available at https://github.com/erodola/DLAI-s2-2025/

# Batchnorm from scratch

This is the 2d version of batchnorm, operating on batches with shape `(b,c,h,w)`. The 1d version is simpler: it takes batches with shape `(b,c,n)` or `(b,n)` and reduces over every dimension except the channel (or feature) one.
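
As a minimal sketch of the 1d case (the tensor sizes are arbitrary choices for illustration), the only change with respect to the 2d version is the set of dimensions being reduced. The check below compares a manual computation against `nn.BatchNorm1d` in training mode:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
eps = 1e-5

# Case (b, c, n): one mean/variance per channel, reduced over batch and length
b, c, n = 4, 3, 5
x3 = torch.randn(b, c, n)
mu3 = x3.mean(dim=(0, 2), keepdim=True)
var3 = x3.var(dim=(0, 2), keepdim=True, unbiased=False)
manual3 = (x3 - mu3) / torch.sqrt(var3 + eps)

# Case (b, n): one mean/variance per feature, reduced over the batch only
x2 = torch.randn(b, n)
mu2 = x2.mean(dim=0, keepdim=True)
var2 = x2.var(dim=0, keepdim=True, unbiased=False)
manual2 = (x2 - mu2) / torch.sqrt(var2 + eps)

# Freshly built layers are in training mode with gamma=1 and beta=0
match3 = torch.allclose(nn.BatchNorm1d(c)(x3), manual3, atol=1e-5)
match2 = torch.allclose(nn.BatchNorm1d(n)(x2), manual2, atol=1e-5)
print(match3, match2)
```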

Key points to keep in mind:

- *Training:*
  - Mean/stdev are accumulated across all pixels of all images in the batch, yielding **one scalar per channel**.
  - Use the biased estimator (`unbiased=False`) for the batch variance.
  - Keep running estimates of mean and variance across batches (an exponential moving average).
  - Introduce trainable $\gamma$ and $\beta$ (again, **one per channel**).

- *Inference:*
  - Apply running stats computed during training.

- *Placement:*
  - Use batchnorm *before* the nonlinearity.
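
The running statistics can be inspected directly on PyTorch's own layer. The sketch below uses arbitrary tensor sizes and relies on the documented behavior of `nn.BatchNorm2d`: the update is `(1 - momentum) * running + momentum * batch`, and the running variance is fed the *unbiased* batch variance even though the normalization itself uses the biased one. Note that this momentum convention weights the new batch, whereas the custom module below weights the old running value.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 3, 4, 4)

bn = nn.BatchNorm2d(3, momentum=0.1)  # 0.1 is PyTorch's default momentum
bn.train()
_ = bn(x)  # one forward pass in training mode updates the running statistics

batch_mu = x.mean(dim=(0, 2, 3))
batch_var_unbiased = x.var(dim=(0, 2, 3), unbiased=True)

# The buffers start at running_mean=0 and running_var=1
expected_mu = (1 - 0.1) * torch.zeros(3) + 0.1 * batch_mu
expected_var = (1 - 0.1) * torch.ones(3) + 0.1 * batch_var_unbiased

mean_ok = torch.allclose(bn.running_mean, expected_mu, atol=1e-6)
var_ok = torch.allclose(bn.running_var, expected_var, atol=1e-6)

# In eval mode the layer normalizes with the running statistics, not the batch ones
bn.eval()
y_eval = bn(x)
manual_eval = (x - bn.running_mean[None, :, None, None]) / torch.sqrt(
    bn.running_var[None, :, None, None] + bn.eps
)
eval_ok = torch.allclose(y_eval, manual_eval, atol=1e-5)
print(mean_ok, var_ok, eval_ok)
```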

In [None]:
# Raw calculations, as a sanity check against nn.BatchNorm2d

import torch
import torch.nn as nn

b, c, h, w = 2, 3, 2, 2

x = torch.arange(b*c*h*w, dtype=torch.float32).reshape((b, c, h, w))

BN = nn.BatchNorm2d(c)

mu = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), keepdim=True, unbiased=False)  # divide by N instead of N-1

torch_bn = BN(x)
my_bn = (x - mu) / torch.sqrt(var + 1e-5)

torch.allclose(torch_bn, my_bn, atol=1e-6)

In [None]:
# Custom batch norm module

import torch
import torch.nn as nn

class MyBatchNorm2d(nn.Module):

  def __init__(self, num_features: int, momentum: float, affine: bool, eps: float = 1e-5):
    """
    Implements BatchNorm2d from scratch.

    Args:
      num_features (int): The number of input features (channels) per data point.
      momentum (float): Weight kept on the previous running statistics; the larger, the more
        emphasis is given to earlier batches (the opposite convention of nn.BatchNorm2d).
      affine (bool): Whether or not to include trainable scale and shift parameters.
      eps (float): Small constant added to the variance to avoid division by zero.
    """
    super().__init__()

    self.momentum = momentum
    self.affine = affine
    self.eps = eps

    # trainable gamma (scale) and beta (shift) to allow learning the identity map
    if self.affine:
      self.gamma = nn.Parameter(torch.ones(num_features))  # includes them in model.parameters(),
      self.beta = nn.Parameter(torch.zeros(num_features))  # thus making them trainable

    # NOTE: the two commented lines below are *wrong*:
    # - plain tensor attributes are not in the state_dict, so they will not be saved/loaded!
    # - they will not be moved to the device when .to(device) is called
    # - it is unclear, since they might be misinterpreted as non-trainable model parameters
    # self.running_mu = torch.zeros(num_features, requires_grad=False)
    # self.running_var = torch.ones(num_features, requires_grad=False)

    self.register_buffer("running_mu", torch.zeros(num_features))
    self.register_buffer("running_var", torch.ones(num_features))

  def forward(self, x: torch.Tensor) -> torch.Tensor:

    if self.training:

      print("training")

      mu = x.mean(dim=(0, 2, 3))
      var = x.var(dim=(0, 2, 3), unbiased=False)

      x = (x - mu[None, :, None, None]) / torch.sqrt(var[None, :, None, None] + self.eps)

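      # update the running stats outside of autograd, otherwise the buffers
      # would keep a reference to the computation graph of every batch
      with torch.no_grad():
        self.running_mu = self.momentum * self.running_mu + (1 - self.momentum) * mu
        self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var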

    else:

      print("evaluation")
      x = (x - self.running_mu[None, :, None, None]) / torch.sqrt(self.running_var[None, :, None, None] + self.eps)

    if self.affine:
      x = x * self.gamma[None, :, None, None] + self.beta[None, :, None, None]

    return x

  def __repr__(self) -> str:
    return f"MyBatchNorm2d({self.momentum=}, {self.affine=})"

# Usage example

b, c, h, w = 32, 5, 28, 28
xb = torch.randn((b, c, h, w))

model = nn.Sequential(
    nn.Conv2d(in_channels=c, out_channels=7, kernel_size=3),
    MyBatchNorm2d(num_features=7, momentum=0.9, affine=True)
)

# model.train()
model.eval()
model(xb).shape
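
As a further sanity check of the affine step, the sketch below applies a per-channel scale and shift by hand and compares the result with `torch.nn.functional.batch_norm` in training mode. The `gamma` and `beta` values and the tensor sizes are made up for illustration; passing `None` for the running statistics assumes we only care about the batch-statistics path.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(4, 3, 5, 5)
gamma = torch.tensor([0.5, 1.0, 2.0])  # hypothetical per-channel scale
beta = torch.tensor([-1.0, 0.0, 1.0])  # hypothetical per-channel shift
eps = 1e-5

# Normalize with batch statistics, then scale and shift each channel
mu = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), keepdim=True, unbiased=False)
x_hat = (x - mu) / torch.sqrt(var + eps)
manual = x_hat * gamma[None, :, None, None] + beta[None, :, None, None]

# Reference: functional batch norm using batch statistics only
reference = F.batch_norm(x, None, None, weight=gamma, bias=beta, training=True, eps=eps)
affine_ok = torch.allclose(manual, reference, atol=1e-5)

# After the affine map, each channel has mean close to beta and stdev close to gamma
out_mean = manual.mean(dim=(0, 2, 3))
out_std = manual.std(dim=(0, 2, 3), unbiased=False)
print(affine_ok, out_mean, out_std)
```

This makes the role of the two parameters concrete: $\beta$ sets the per-channel mean and $\gamma$ the per-channel standard deviation of the output.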

# Dropout from scratch

Key points:

- **The output is rescaled by $\frac{1}{1-p}$** to compensate for the dropped elements. This way, the activations have similar scale at training and inference.

- Zero out the **features** (the activations produced by a layer) rather than the layer's neurons themselves, e.g.

```
x = self.fc1(x)
x = self.relu(x)
x = self.dropout(x)  # after the nonlinearity
x = self.fc2(x)
...
```

- For convolutional layers, there are two variants:
  1. randomly drop pixel-wise features
  2. randomly drop entire feature maps (*spatial dropout*, in PyTorch `nn.Dropout2d`)
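
The difference between the two variants can be seen on a small example. This is a sketch with an all-ones input of arbitrary size, so that dropped entries are easy to spot; both layers are used in their default training mode.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.ones(2, 6, 4, 4)  # (b, c, h, w), all ones so zeros are easy to spot
p = 0.5

elementwise = nn.Dropout(p=p)  # variant 1: drops individual entries
spatial = nn.Dropout2d(p=p)    # variant 2: drops whole feature maps

y_elem = elementwise(x)
y_spat = spatial(x)

# With spatial dropout every (h, w) map is either entirely zero or entirely kept
per_map_min = y_spat.amin(dim=(2, 3))
per_map_max = y_spat.amax(dim=(2, 3))
maps_are_uniform = torch.equal(per_map_min, per_map_max)

# Surviving values are rescaled by 1 / (1 - p) in both variants
print(maps_are_uniform, y_elem.unique(), y_spat.unique())
```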

In [None]:
import torch
import torch.nn as nn

torch.manual_seed(1337)

x = torch.randn((3, 4))
p = 0.8  # probability of dropping each element (fraction dropped on average)

dropout = nn.Dropout(p=p)
dropout(x)

In [None]:
# torch.bernoulli(x) draws 0 or 1 for each element in x,
# according to the probability specified at that element

mask = torch.bernoulli(torch.ones_like(x) * (1 - p)).bool()
mask * x / (1 - p)  # apply mask and rescale
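
The effect of the $\frac{1}{1-p}$ rescaling can be checked empirically. The following sketch uses a large constant input (the size is an arbitrary choice) so that the sample mean is a good estimate of the expected activation:

```python
import torch

torch.manual_seed(0)
p = 0.8
x = torch.ones(1_000_000)  # constant input, so the mean is easy to read

keep = torch.bernoulli(torch.full_like(x, 1 - p))  # 1 = keep, 0 = drop
dropped = x * keep            # no rescaling: the mean shrinks to about 1 - p
rescaled = dropped / (1 - p)  # with rescaling: the mean stays close to 1

print(dropped.mean().item(), rescaled.mean().item())
```

Without the rescaling, the next layer would see activations that are on average $1-p$ times smaller during training than at inference.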

In [None]:
import torch
import torch.nn as nn

class MyDropout(nn.Module):

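  def __init__(self, p: float):
    super().__init__()
    if not 0.0 <= p < 1.0:
      raise ValueError(f"p must be in [0, 1), got {p}")  # p = 1 would divide by zero in forward
    self.p = p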

  def forward(self, x: torch.Tensor) -> torch.Tensor:

    if self.training:
      mask = torch.bernoulli(torch.ones_like(x) * (1 - self.p)).bool()
      return x * mask / (1 - self.p)
    else:
      return x

  def __repr__(self) -> str:
    return f"MyDropout(p={self.p})"

# Usage example

x = torch.rand(10)

model = nn.Sequential(
    nn.Linear(10, 20),
    nn.ReLU(),
    MyDropout(p=0.5)
)

model.eval()
model(x)
