# An Introduction To Dropout As A Bayesian Approximation With Probly

This notebook is a short introduction to dropout and its use for uncertainty quantification in neural networks.

## 0. Motivation and Intuition

When training a neural network, we often face a problem called overfitting: the model memorizes the training data so well that it performs poorly on new data. This happens because the network learns to rely too heavily on specific neurons or patterns that don't generalize.
This is where the dropout operation comes in handy.


## 1. What is Dropout?

The idea is to insert layers into a model that have a certain chance to "drop out", or disable, neurons in that layer, so the network produces a slightly different output every time it makes a prediction while dropout is active.
For our purposes, we can use dropout to evaluate how certain a model is about its inference.

![Dropout Image](https://www.baeldung.com/wp-content/uploads/sites/4/2020/05/2-1-2048x745-1.jpg)

As shown above, some of the neurons in the dropout layer have been deactivated, so we obtain a sub-model that gives a different output than the original model for the same input.  
Which neurons get deactivated is decided at random every time we generate an output. The probability that any given neuron is deactivated is called the dropout rate and can be configured manually.
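
To make the mechanism concrete, here is a minimal NumPy sketch of what a dropout layer does to a vector of activations. The function name `inverted_dropout` and its parameters are our own illustration, not part of Probly or PyTorch. The rescaling of the surviving activations by `1 / (1 - rate)` is the same convention that PyTorch's `nn.Dropout` applies in training mode.

```python
import numpy as np


def inverted_dropout(activations, rate, rng):
    """Zero each activation with probability `rate` and rescale the survivors."""
    # Draw an independent keep/drop decision for every neuron.
    keep_mask = rng.random(activations.shape) >= rate
    # Rescale kept activations by 1 / (1 - rate) so the expected value is unchanged.
    return activations * keep_mask / (1.0 - rate)


rng = np.random.default_rng(seed=0)
activations = np.ones(8)

# Each pass draws a fresh mask, so two passes over the same input usually differ.
first_pass = inverted_dropout(activations, rate=0.25, rng=rng)
second_pass = inverted_dropout(activations, rate=0.25, rng=rng)
print(first_pass)
print(second_pass)
```

Every entry of the result is either `0` (dropped) or `1 / 0.75` (kept and rescaled).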

## 2. Dropout and Model Confidence

Usually dropout is used only during training, to counteract overfitting, and is disabled during inference. To quantify uncertainty, however, we keep it active during inference.  
  
If we feed the model the same input for several forward passes, we can measure how much its outputs differ at that point and use that spread to estimate the model's epistemic uncertainty there. The lower the variance of the outputs for the same input, the lower the uncertainty, and vice versa. This approach is often referred to as Monte Carlo Dropout.

Monte Carlo Dropout approximates the behavior of a Bayesian neural network while being significantly cheaper to compute.
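
As a sketch of the quantities involved (the notation here is our own assumption, not taken from Probly): let $f_{\theta_t}(x)$ denote the network output for input $x$ in the $t$-th of $T$ forward passes, each with a freshly sampled dropout mask. The Monte Carlo estimates of the predictive mean and variance are then

$$
\hat{\mu}(x) = \frac{1}{T}\sum_{t=1}^{T} f_{\theta_t}(x),
\qquad
\hat{\sigma}^2(x) = \frac{1}{T}\sum_{t=1}^{T}\bigl(f_{\theta_t}(x) - \hat{\mu}(x)\bigr)^2 .
$$

The variance is written in its population form (division by $T$), which is what `np.var` computes by default.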

## 3. How can Probly help?

The dropout transformation in Probly takes a standard model and turns it into a dropout-enabled model that can be used to quantify uncertainty. It does so by traversing the layers of the original model and inserting a dropout layer before every linear layer, except when that layer is the model's first layer. This dropout-enabled model can then be used for Monte Carlo Dropout as explained above.
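
To illustrate the idea behind this transformation, here is a hand-written sketch in plain PyTorch. It is not Probly's implementation: the helper name `prepend_dropout_sketch` is hypothetical, it only handles `nn.Sequential` models, and it produces a flat layer list whose names may differ from what Probly generates.

```python
from torch import nn


def prepend_dropout_sketch(model: nn.Sequential, p: float = 0.25) -> nn.Sequential:
    """Hypothetical helper: put Dropout before every Linear layer except the model's first layer."""
    layers = []
    for index, layer in enumerate(model):
        # Skip the model's first layer so the raw input is never dropped.
        if isinstance(layer, nn.Linear) and index > 0:
            layers.append(nn.Dropout(p=p))
        # The original layer objects are reused, so weights are shared with `model`.
        layers.append(layer)
    return nn.Sequential(*layers)


sketch_model = prepend_dropout_sketch(
    nn.Sequential(nn.Linear(2, 2), nn.Linear(2, 2), nn.Linear(2, 2))
)
print(sketch_model)
```

The printed model alternates `Linear` and `Dropout` layers, starting and ending with a `Linear` layer.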

## 4. Implementation of Dropout with Probly

### 4.1 Using Probly's `dropout` function to transform a model

Let's implement a simple linear model in PyTorch!

In [1]:
from torch import nn

simple_model = nn.Sequential(
        nn.Linear(2, 2),
        nn.Linear(2, 2),
        nn.Linear(2, 2),
    )

Here we defined a simple model in torch that has three linear layers.  
  
At the moment we can't make any statements about its epistemic uncertainty, but dropout layers will enable us to do so.  
  
With Probly, inserting the dropout layers is easy:
all we have to do is import the transformation function `dropout` and apply it to the model.

In [2]:
from probly.transformation import dropout

dropout_model = dropout(simple_model)

Now our simple model has been converted into a dropout model.
  
We expect a dropout layer to have been added before every linear layer except the first one, with nothing else changed.  

We can print the structure of our new dropout model to verify this:

In [3]:
print(dropout_model)

Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1_0): Dropout(p=0.25, inplace=False)
  (1_1): Linear(in_features=2, out_features=2, bias=True)
  (2_0): Dropout(p=0.25, inplace=False)
  (2_1): Linear(in_features=2, out_features=2, bias=True)
)


As you can see, Probly took care of inserting the dropout layers with one simple function call.

### 4.2 Specifying the Dropout-rate

If we don't specify a dropout rate when transforming our model, Probly gives every dropout layer a default probability of 0.25.

To specify the dropout rate, we can pass it as the second argument to the `dropout` function.

Now let's transform our model again with a custom probability.

In [None]:
custom_dropout_rate = 0.5
dropout_model = dropout(simple_model, custom_dropout_rate)

print(dropout_model)

Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1_0): Dropout(p=0.5, inplace=False)
  (1_1): Linear(in_features=2, out_features=2, bias=True)
  (2_0): Dropout(p=0.5, inplace=False)
  (2_1): Linear(in_features=2, out_features=2, bias=True)
)


As expected, every dropout layer has the specified dropout rate of 0.5.

### 4.3 Other models...

Naturally, this also works for any other model for which a dropout transformation makes sense.  

Let's try it with a convolutional model next.

In [5]:
conv_model = nn.Sequential(
        nn.Conv2d(3, 5, 5),
        nn.ReLU(),
        nn.Flatten(),
        nn.Linear(5, 2),
    )

dropout_conv_model = dropout(conv_model, 0.2)

print(dropout_conv_model)

Sequential(
  (0): Conv2d(3, 5, kernel_size=(5, 5), stride=(1, 1))
  (1): ReLU()
  (2): Flatten(start_dim=1, end_dim=-1)
  (3_0): Dropout(p=0.2, inplace=False)
  (3_1): Linear(in_features=5, out_features=2, bias=True)
)


As with our simple model, Probly inserted the dropout layer into our convolutional model without us needing a different function.

As expected, the dropout layer is only added before the single linear layer.
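
One detail worth spelling out: `nn.Linear(5, 2)` expects exactly 5 input features, so this particular convolutional model only accepts 5x5 images. The sketch below rebuilds the printed structure by hand in plain PyTorch (without Probly) and pushes a batch through it. The input shape and the variable names are our own assumptions for illustration.

```python
import torch
from torch import nn

conv_shape_demo = nn.Sequential(
    nn.Conv2d(3, 5, 5),  # 3 input channels, 5 output channels, 5x5 kernel
    nn.ReLU(),
    nn.Flatten(),
    nn.Dropout(p=0.2),  # placed by hand here, mirroring the printed structure
    nn.Linear(5, 2),
)

# A batch of four 3-channel images of size 5x5.
images = torch.randn(4, 3, 5, 5)
# The 5x5 kernel reduces each 5x5 image to 1x1, so Flatten yields 5 features per image.
logits = conv_shape_demo(images)
print(logits.shape)
```

With larger images, the flattened feature count would grow and the input size of the final linear layer would have to be adjusted accordingly.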

### 4.4 And other Frameworks?

That's cool and all, but what if we wanted to use a framework other than torch?  
  
No problem! Probly aims to be independent of any specific framework or library, so let's see if we can transform some models implemented in flax.

Let's implement a small linear model in flax...

In [14]:
from flax import nnx

flax_rngs = nnx.Rngs(0)

simple_model_flax = nnx.Sequential(
        nnx.Linear(2, 2, rngs=flax_rngs),
        nnx.Linear(2, 2, rngs=flax_rngs),
        nnx.Linear(2, 2, rngs=flax_rngs),
    )


As before with torch, we defined a simple model with three linear layers in flax.  
Now we can try transforming it with Probly's transformation function `dropout`...

In [36]:
dropout_model_flax = dropout(simple_model_flax)

To confirm everything went as expected, we write a short function that prints the layers of a flax model and omits the details that are unimportant to us right now.  
It walks through the structure of the network and prints the type of each layer it contains:

In [None]:
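def print_layers(module, indent=0):
    """Recursively print the name and class of every nnx.Module found in `module`."""
    for name, attr in module.__dict__.items():
        # Only nnx.Module instances are printed; plain functions are skipped.
        if isinstance(attr, nnx.Module):
            print(" " * indent + f"{name}: {attr.__class__.__name__}")
            # Descend into the submodule with a deeper indentation.
            print_layers(attr, indent + 2)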

And now we can look at the result of our transformation:

In [37]:
print_layers(dropout_model_flax)

layers: List
  0: Linear
  1: Dropout
  2: Linear
  3: Dropout
  4: Linear


Now let us try it one more time with a convolutional model implemented in flax.
For that, we define a small convolutional model like the one we built in torch before:

In [38]:
conv_model_flax = nnx.Sequential(
        nnx.Conv(3, 5, (5, 5), rngs=flax_rngs),
        nnx.relu,
        nnx.flatten,
        nnx.Linear(5, 2, rngs=flax_rngs),
    )

dropout_conv_model_flax = dropout(conv_model_flax)
print_layers(dropout_conv_model_flax)

layers: List
  0: Conv
  3: Dropout
  4: Linear


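As anticipated, Probly can work with different kinds of models built with different frameworks without the need for separate functions to handle their differences. Note that `print_layers` only lists `nnx.Module` instances, so the plain functions `nnx.relu` and `nnx.flatten` at positions 1 and 2 do not appear in the output.
Likewise, we can specify the dropout rate by passing it as the second argument.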

### 4.5 A quick look at some output

Finally, we want to look at some output generated by a model that uses dropout as a Bayesian approximation to quantify uncertainty.

Note that these examples use untrained models and are only intended to demonstrate what we could do with the transformed models.  
  
For our example, we will implement a small model with torch and transform it:

In [67]:
demo_model = nn.Sequential(
    nn.Linear(2, 2),
    nn.ReLU(),
    nn.Linear(2, 1)
)

demo_dropout_model = dropout(demo_model, 0.3)

print(demo_dropout_model)

Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): ReLU()
  (2_0): Dropout(p=0.3, inplace=False)
  (2_1): Linear(in_features=2, out_features=1, bias=True)
)


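Now that we have transformed our model, we can proceed with:
1. Generating some random input for the model
2. Setting the model to evaluation mode and then switching only the dropout layers back to training mode, since we want to keep them active
3. Running several forward passes on the same input and analysing the outputs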

In [None]:
from torch import randn, no_grad

# Generate a single random input with two features
# (named `inputs` to avoid shadowing Python's built-in `input`)
inputs = randn(1, 2)
output = []

# Put the whole model into evaluation mode first ...
demo_dropout_model.eval()
# ... then switch only the dropout layers back to training mode to keep them active
for m in demo_dropout_model.modules():
    if isinstance(m, nn.Dropout):
        m.train()


# Pass the same input through the model several times
with no_grad():
    for _ in range(8):
        x = demo_dropout_model(inputs)
        output.append(x.item())

print(output)

[0.312858909368515, 0.27893561124801636, 0.27893561124801636, 0.312858909368515, 0.27893561124801636, 0.27893561124801636, 0.27893561124801636, 0.27893561124801636]


We can see that the output is not always the same across forward passes.  
Now we will calculate the variance of the outputs with numpy:

In [117]:
import numpy as np

variance = np.var(output)
print(f"The output has a variance of {variance}")

The output has a variance of 0.0002157731541322927


This variance gives us information about the model's epistemic uncertainty at that input point.

You might ask why the variance in this example is so low. One likely reason is that the model is untrained, so it is just "uniformly bad" at predicting anything. In addition, the dropout layer acts on only two hidden neurons, so only a handful of distinct dropout masks, and therefore outputs, are possible; this is why the same values repeat in the list above.  

Training the model would probably lead to a larger variance.
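
The procedure from this section can be wrapped into a small reusable helper. The sketch below is written in plain PyTorch and does not depend on Probly. The function name `mc_dropout_statistics` and the stand-in model `mc_model` are hypothetical, and the dropout layer is placed by hand instead of through the `dropout` transformation.

```python
import torch
from torch import nn


def mc_dropout_statistics(model: nn.Module, inputs: torch.Tensor, n_passes: int = 100):
    """Hypothetical helper: mean and variance over repeated stochastic forward passes."""
    model.eval()
    # Re-enable only the dropout layers so all other layers keep their inference behaviour.
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.train()
    with torch.no_grad():
        # Shape: (n_passes, batch_size, output_dim)
        samples = torch.stack([model(inputs) for _ in range(n_passes)])
    mean = samples.mean(dim=0)
    # Population variance (division by n_passes), matching np.var's default.
    variance = ((samples - mean) ** 2).mean(dim=0)
    return mean, variance


# Stand-in model with a manually placed dropout layer.
mc_model = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Dropout(p=0.3), nn.Linear(16, 1))
mean, variance = mc_dropout_statistics(mc_model, torch.randn(4, 2))
print(mean.shape, variance.shape)
```

The helper returns one mean and one variance per input row, so inputs with a high variance can be flagged as points where the model is less certain.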



#### This concludes the introduction to Probly's `dropout` function.

