# probly Tutorial — Dropout Transformation (Part A)

*Date:* 2025-11-03

**Audience:** New probly users · **Framework:** PyTorch · **Authors:** Nidhi Jain and Julia

This notebook is meant as a gentle, practical introduction to the **Dropout transformation** in `probly`.
The goal is not to be mathematically perfect, but to give you an intuition you can actually use when you
work on models in PyTorch.

We will slowly build up from the very basic idea of *normal* Dropout to the slightly more advanced idea of
a **Dropout transformation that makes a model uncertainty‑aware**. After that, we look at a tiny PyTorch
example and inspect how the transformation changes the model.

## 1. Concept: What is Dropout (normal) vs Dropout Transformation?

This part answers the following question:

> **“What is Dropout (normal) vs Dropout Transformation?”**  
> **1.1 Normal Dropout (no probly, just PyTorch)**  
> In “normal” deep learning, Dropout is a layer used during training to reduce overfitting.  
> Overfitting = the model memorizes the training data and performs poorly on new data.

Below is a more detailed version of that explanation.

### 1.1 Normal Dropout (standard PyTorch Dropout layer)

When we train a neural network, there is always the risk of **overfitting**. That means the model becomes
very good at the training set but fails on new data, because it has more or less *memorised* patterns that
only appear in the training examples.

**Normal Dropout** is a simple trick to make overfitting less likely. During training, a `Dropout(p)` layer
sets each of its input activations to zero independently with probability `p`, drawing a fresh random mask
on every forward pass. You can imagine this as:

- with probability `p` a neuron is “switched off” for this training step,
- with probability `1 − p` it stays on, and PyTorch scales its activation by `1 / (1 − p)` so that the
  expected value of the layer output is unchanged.

Because different neurons get switched off in every step, the network is forced to **spread the information**
across many neurons instead of relying on a few very strong ones. This usually makes the model **more robust**
and helps it generalise better.

Important detail: in **normal PyTorch usage**

- Dropout is **active only in training mode** (`model.train()`),
- and it is **disabled in evaluation mode** (`model.eval()`).

So at test / inference time, the model behaves like a **deterministic function**: the same input always gives
the same output, and there is no randomness from Dropout anymore. The purpose of normal Dropout is therefore
*only* to improve generalisation during training, not to provide uncertainty information.
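
The difference between the two modes is easy to check directly. The following is a minimal sketch that uses only a standalone `nn.Dropout` layer (no probly involved); the input of all ones and `p = 0.5` are arbitrary choices made for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

p = 0.5
layer = nn.Dropout(p)
x = torch.ones(1, 8)

layer.train()  # training mode: random mask, survivors scaled by 1 / (1 - p)
train_out = layer(x)

layer.eval()  # evaluation mode: Dropout acts as the identity
eval_out = layer(x)

print("train:", train_out)  # each entry is either 0.0 or 2.0 for p = 0.5
print("eval: ", eval_out)   # identical to the input
```

Re-running the training-mode call produces a different mask each time, while the evaluation-mode output never changes.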

### 1.2 Dropout Transformation (probly)

The **Dropout transformation** in `probly` takes this Dropout idea and uses it in a slightly different role.
Instead of treating Dropout purely as a regularisation trick during training, we use it to make the model
**uncertainty‑aware** at prediction time.

Roughly speaking, the transformation does the following:

- It walks through your PyTorch model and finds the relevant linear layers.
- It programmatically inserts Dropout layers in front of those linear layers.
- Crucially, these Dropout layers are meant to stay **active during inference** (Section 3 shows how this
  notebook switches them on), so each forward pass is a bit different.

If we now feed the **same input** through the transformed model multiple times, we do **not** get exactly the
same output each time. Instead we get a *cloud* of slightly different predictions. From this cloud we can:

- compute a mean prediction (what the model “on average” thinks), and
- look at how much the predictions vary (this variation is a proxy for **uncertainty**).

So the Dropout transformation reuses the usual Dropout mechanism, but with a **different goal**:

- normal Dropout: better training, less overfitting, Dropout OFF in eval mode;
- Dropout transformation: keep Dropout ON at inference time to get a distribution of outputs and estimate
  how confident the model is.
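
To make the idea concrete, here is a rough sketch of what "insert Dropout and sample several times" can look like in plain PyTorch. The helper `insert_dropout_before_linear` is hypothetical and written only for this illustration; it is not probly's implementation, and probly's actual rules for which layers receive Dropout may differ (for example, some layers may be skipped).

```python
import torch
from torch import nn

torch.manual_seed(0)


def insert_dropout_before_linear(model: nn.Sequential, p: float) -> nn.Sequential:
    """Hypothetical helper: new Sequential with Dropout in front of every Linear."""
    layers = []
    for layer in model:
        if isinstance(layer, nn.Linear):
            layers.append(nn.Dropout(p))
        layers.append(layer)  # original layer objects are reused, so weights are shared
    return nn.Sequential(*layers)


base = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
stochastic = insert_dropout_before_linear(base, p=0.5)

stochastic.train()  # keep Dropout active so repeated passes differ
x = torch.ones(1, 4)
with torch.no_grad():
    samples = torch.stack([stochastic(x) for _ in range(20)])  # [20, 1, 1]

print(stochastic)
print("mean:", samples.mean().item(), "spread (std):", samples.std().item())
```

The mean plays the role of the prediction, and the standard deviation across the 20 passes is the "cloud" width described above.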

### 1.3 Short side‑by‑side comparison

| Aspect                        | Normal Dropout (PyTorch)                               | Dropout Transformation (probly)                          |
|------------------------------|--------------------------------------------------------|----------------------------------------------------------|
| Where it appears in code     | You explicitly add `nn.Dropout` layers                 | Transformation walks the model and inserts Dropout       |
| When Dropout is active       | Only in `model.train()`                                | Also (and intentionally) at inference time               |
| Main purpose                 | Reduce overfitting / improve generalisation            | Make predictions uncertainty‑aware                       |
| Behaviour at inference       | Deterministic (same input → same output)               | Stochastic (same input → slightly different outputs)     |
| How we use the randomness    | We ignore it at inference                              | We *use* it to measure spread / uncertainty              |

The rest of this notebook now assumes this picture: **“normal” Dropout is a training regulariser, the
Dropout transformation turns the same mechanism into a tool for estimating uncertainty.**

## 2. Quickstart (PyTorch)

Below: build a small MLP, apply `dropout(model, p)`, and inspect the modified architecture.


```python
# If you're running inside the repo's environment, these imports should work directly.
import torch
from torch import nn

from probly.transformation import dropout


def build_mlp(in_dim=10, hidden=32, out_dim=1):
    """Build a small sequential MLP that ends on a Linear layer."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden),
        nn.ReLU(),
        nn.Linear(hidden, hidden),
        nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )


p = 0.2  # dropout probability

model = build_mlp()
print("Original model:\n", model)

# Apply the transformation and print the result to see where Dropout was inserted.
model_do = dropout(model, p)
print(f"\nWith Dropout transformation (p={p:.2f}):\n", model_do)
```


### Notes on the structure
- Expect a Dropout layer **before** each intermediate `nn.Linear`.
- If the last layer is a linear output head, the transform usually **does not** add a Dropout layer in front of it, preserving your final mapping.
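
Rather than reading the printed architecture by eye, you can list the layers programmatically. The sketch below uses a hypothetical helper, `describe_layers`, and a hand-built stand-in model whose layout is an assumption made for illustration; with probly you would pass `model_do` instead and read off the actual positions.

```python
from torch import nn


def describe_layers(model: nn.Module) -> list[str]:
    """Hypothetical helper: list leaf layers in order and mark the Dropout ones."""
    lines = []
    for name, module in model.named_modules():
        if len(list(module.children())) > 0:
            continue  # skip containers, keep leaf layers only
        marker = "  <-- Dropout" if isinstance(module, nn.Dropout) else ""
        lines.append(f"{name}: {module.__class__.__name__}{marker}")
    return lines


# Stand-in for a transformed model (assumed layout, not produced by probly).
example = nn.Sequential(
    nn.Linear(10, 32),
    nn.ReLU(),
    nn.Sequential(nn.Dropout(0.2), nn.Linear(32, 1)),
)
print("\n".join(describe_layers(example)))
```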


## 3. Uncertainty via Monte Carlo (MC) Dropout

To obtain predictive *uncertainty*, we run multiple stochastic forward passes with Dropout **active** and compute the mean and variance of predictions.
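
Written out, with $f_t(x)$ denoting the output of the $t$-th stochastic forward pass for input $x$ and $T$ the number of passes (this notation is introduced here for illustration), the two quantities are:

$$
\hat{\mu}(x) = \frac{1}{T} \sum_{t=1}^{T} f_t(x),
\qquad
\hat{\sigma}^2(x) = \frac{1}{T} \sum_{t=1}^{T} \bigl(f_t(x) - \hat{\mu}(x)\bigr)^2
$$

The $1/T$ normalisation in the variance matches the `unbiased=False` setting used in the code of this section.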

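> **Important:** In PyTorch, an `nn.Dropout` layer is only active in `model.train()` mode. For MC Dropout at inference, this notebook therefore calls `train()` on purpose and disables gradient tracking with `torch.no_grad()`. Note that `train()` also switches layers such as BatchNorm to their training behaviour; the MLP used here has none, so this is harmless.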
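
For models that do contain BatchNorm or similar layers, a common alternative is to put the whole model in evaluation mode and then re-enable only the Dropout layers. The helper `enable_dropout_only` below is a hypothetical sketch of that pattern in plain PyTorch; it is not part of probly, and the small network is a stand-in.

```python
import torch
from torch import nn

DROPOUT_TYPES = (nn.Dropout, nn.Dropout2d, nn.Dropout3d, nn.AlphaDropout)


def enable_dropout_only(model: nn.Module) -> nn.Module:
    """Hypothetical helper: eval mode everywhere, but Dropout layers stay stochastic."""
    model.eval()
    for module in model.modules():
        if isinstance(module, DROPOUT_TYPES):
            module.train()
    return model


net = nn.Sequential(
    nn.Linear(4, 8),
    nn.BatchNorm1d(8),
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(8, 1),
)
enable_dropout_only(net)

print("BatchNorm training:", net[1].training)  # False
print("Dropout training:  ", net[3].training)  # True
```

With this setup BatchNorm keeps using its running statistics, and only the Dropout masks introduce randomness.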


```python
import torch
from torch import nn

# Toy regression data
torch.manual_seed(0)
n = 128
X = torch.randn(n, 10)
true_w = torch.randn(10, 1)
y = X @ true_w + 0.1 * torch.randn(n, 1)

# (Re)build and transform the model
model = build_mlp(in_dim=10, hidden=64, out_dim=1)
model_do = dropout(model, p=0.2)

# Simple training loop (few steps just for illustration)
opt = torch.optim.Adam(model_do.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

model_do.train()
for step in range(200):
    opt.zero_grad()
    pred = model_do(X)
    loss = loss_fn(pred, y)
    loss.backward()
    opt.step()


# MC dropout prediction function
@torch.no_grad()
def mc_predict(model_with_dropout, inputs, T=50):
    """Run T stochastic forward passes and return the per-output mean and variance."""
    was_training = model_with_dropout.training
    model_with_dropout.train()  # activate dropout
    preds = [model_with_dropout(inputs) for _ in range(T)]
    model_with_dropout.train(was_training)  # restore the previous mode
    stacked = torch.stack(preds, dim=0)  # [T, N, out_dim]
    mean = stacked.mean(dim=0)
    var = stacked.var(dim=0, unbiased=False)  # population variance over the T samples
    return mean, var


mean_pred, var_pred = mc_predict(model_do, X[:5], T=100)
print("Predictive mean (first 5):\n", mean_pred.squeeze())
print("\nPredictive variance (first 5):\n", var_pred.squeeze())
```
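
A variance is often easier to read once it is converted into a standard deviation, which has the same units as the prediction. The sketch below shows one possible way to do that and to rank inputs by uncertainty. The tensors are placeholder values standing in for the `mean_pred` and `var_pred` returned by `mc_predict`, not real results, and the band of two standard deviations is a rough visual aid rather than a calibrated confidence interval.

```python
import torch

# Placeholder values standing in for the outputs of `mc_predict` (not real results).
mean_pred = torch.tensor([[0.5], [-1.2], [2.0]])
var_pred = torch.tensor([[0.04], [0.25], [0.01]])

std_pred = var_pred.sqrt()         # same units as the prediction
lower = mean_pred - 2 * std_pred   # rough band: mean minus two standard deviations
upper = mean_pred + 2 * std_pred   # rough band: mean plus two standard deviations

# Indices of the inputs, most uncertain first.
most_uncertain = std_pred.squeeze().argsort(descending=True)
for i in most_uncertain.tolist():
    print(
        f"input {i}: {mean_pred[i].item():+.2f} "
        f"in [{lower[i].item():+.2f}, {upper[i].item():+.2f}]"
    )
```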


## 4. Good practices
- Tune `p` (e.g., 0.1–0.5) based on validation performance.
- Use a reasonable number of MC samples `T` (e.g., 20–200). Larger `T` → smoother uncertainty estimates, but slower.
- Keep your **final layer behavior** in mind when interpreting where Dropout is inserted.
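
To get a feeling for the choice of `T`, you can repeat the MC estimate several times for different values of `T` and look at how much the estimated mean fluctuates. The following sketch uses a small untrained stand-in network with a hand-placed `nn.Dropout` layer (an assumption for illustration, not a probly-transformed model); the exact numbers depend on the seed and the weights.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Small untrained stand-in network, used only to illustrate the effect of T.
net = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Dropout(0.2), nn.Linear(64, 1))
net.train()  # keep Dropout active
x = torch.randn(1, 10)


@torch.no_grad()
def mc_mean(T: int) -> float:
    """Mean prediction over T stochastic forward passes."""
    return torch.stack([net(x) for _ in range(T)]).mean().item()


for T in (5, 20, 100, 500):
    # Repeat the estimate; the spread across repeats shows how noisy it is.
    estimates = torch.tensor([mc_mean(T) for _ in range(20)])
    print(f"T={T:4d}  spread of the MC mean across repeats: {estimates.std().item():.4f}")
```

The spread typically shrinks roughly like `1 / sqrt(T)`, which is why going from 20 to 200 samples helps noticeably while further increases bring diminishing returns.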


## 5. Common errors
- `ValueError: p must be between 0 and 1` — ensure `0 ≤ p ≤ 1`.
- Seeing no Dropout layers? Confirm your model actually contains `nn.Linear` modules where you expect them.
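
Both situations can be caught before the transformation is applied. The function `check_dropout_inputs` below is a hypothetical pre-flight check written in plain PyTorch; it is not part of probly, and its error message is only modelled on the one quoted above.

```python
from torch import nn


def check_dropout_inputs(model: nn.Module, p: float) -> int:
    """Hypothetical pre-flight check; returns the number of nn.Linear layers found."""
    if not 0.0 <= p <= 1.0:
        raise ValueError(f"p must be between 0 and 1, got {p}")
    n_linear = sum(isinstance(m, nn.Linear) for m in model.modules())
    if n_linear == 0:
        print("Warning: no nn.Linear modules found, so Dropout may not be inserted.")
    return n_linear


mlp = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
print(check_dropout_inputs(mlp, p=0.2))  # 2
```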


## 6. Next steps
- Try other architectures (e.g., with Conv blocks feeding into Linear heads).
- Compare models **with vs. without** the transformation using the same training loop.
