# Dropout

> "Dropout, simply described, is the concept that if you can learn how to do a task repeatedly whilst drunk, you should be able to do the task even better when sober."

## Seriously, what is it?

- __Widely used regularization method specific to neural networks__
- Zeroes out a random subset of neurons (outputs of a layer) __during the forward pass in training__.
- The remaining neurons have to compensate for the dropped ones and learn to perform the task independently.
- A different set of neurons is randomly dropped during each pass

For a single neuron, the equation looks like this:

$$
O_i = X_ig(\sum_{k=1}^{d_i}w_k x_k + b), P(X_i = 0) = p
$$

In simple terms, `p` is the probability of zeroing out this specific neuron (and `q = 1 - p` is the probability of keeping it).
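A minimal sketch of the training-time equation above, assuming `ReLU` for `g` and made-up tensor sizes. The names `activations` and `drop_mask` are illustrative only.

```python
import torch

torch.manual_seed(0)

p = 0.5      # probability of dropping a neuron
q = 1 - p    # probability of keeping it

# Pretend these are g(w·x + b) for a batch of 4 samples and 6 neurons
activations = torch.relu(torch.randn(4, 6))

# X_i: 1 with probability q (keep), 0 with probability p (drop)
drop_mask = torch.bernoulli(torch.full_like(activations, q))

outputs = drop_mask * activations  # O_i = X_i * g(...)
print(drop_mask)
print(outputs)
```

Every position where the mask is `0` ends up as `0` in `outputs`; the rest pass through unchanged.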

## Train vs test behaviour

Of course, this approach would be wasteful at test time, as:
- It might produce non-reproducible predictions for a single sample
- It would not utilize the whole network

Because of that, the above equation only applies during the training phase.

> __For test (evaluation) we use all of the neurons BUT SCALED BY THE PROBABILITY OF THE NEURON BEING KEPT.__

For a single neuron, the testing equation looks like this:

$$
O_i = qg(\sum_{k=1}^{d_i}w_k x_k + b), q = 1-p
$$
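
A quick sanity check on why `q` is the right factor. This is a sketch that averages over the mask only and treats the activation as fixed; the shorthand $a_i$ is introduced here for brevity and is not part of the notation above.

$$
a_i = g(\sum_{k=1}^{d_i}w_k x_k + b), \quad E[X_i a_i] = P(X_i = 1) \cdot a_i + P(X_i = 0) \cdot 0 = q a_i
$$

So the test-time output equals the expected training-time output of the same neuron.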

## Exercise

Implement `Dropout` layer on your own.
- Inside `__init__`:
    - single argument `p` the probability of neuron being dropped
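    - check whether `p` lies within the `(0, 1)` range and raise `ValueError` with an appropriate message if it doesn't (e.g. `p (probability) has to lie in (0, 1) range!`)
    - Create a `self._distribution = torch.distributions.binomial.Binomial` object (`total_count=1`) whose success probability is the keep probability `1 - p`, so that a sampled `1` means "keep" and the neuron is dropped with probability `p`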
- Inside `forward`:
    - Use `self.training` `bool` value in `forward` to differentiate between test and train behaviour
    - Use the `self._distribution.sample` method to get a binary mask with the same shape as the `inputs` tensor (training)
    - Use `.to(inputs.device)` to move the created tensor to `cuda` (or another device) if needed (training). __Note:__ `torch.distributions` __is not moved to the device together with the module__ as it's not a `torch.nn.Module` instance (see [this issue]() for more on the topic)
    - Multiply the inputs by the binary mask and return the result (training)
    - Multiply the inputs by the keep probability `1 - p` (testing phase)

```python
import torch


class Dropout(torch.nn.Module):
    ...
```
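
One possible solution sketch, in case you get stuck (try it yourself first!). It follows the steps listed above, with one deliberate choice that is an assumption of this sketch: the mask distribution is parameterised with the keep probability `1 - p`, so a sampled `1` means "keep". Storing `self.p` is also just a convenience of this sketch.

```python
import torch


class Dropout(torch.nn.Module):
    def __init__(self, p: float = 0.5):
        super().__init__()
        if not 0 < p < 1:
            raise ValueError("p (probability) has to lie in (0, 1) range!")
        self.p = p
        # The mask must be 1 for kept neurons, so the success probability is 1 - p
        self._distribution = torch.distributions.binomial.Binomial(
            total_count=1, probs=1 - p
        )

    def forward(self, inputs):
        if self.training:
            # One Bernoulli draw per element, moved to the device of the inputs
            mask = self._distribution.sample(inputs.shape).to(inputs.device)
            return inputs * mask
        # Evaluation: keep every neuron, scale by the keep probability
        return inputs * (1 - self.p)
```

Calling `.eval()` on the module flips `self.training` to `False`, which is what switches between the two branches.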

### Test

Run the code below to check for errors.

You should see some values zeroed out during training and no zeroes during testing.

```python
def test_my_dropout(module):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    cpu_input = torch.randn(8, 5)
    gpu_input = torch.randn(8, 5).to(device)

    module(cpu_input)
    print("\n\n------------------- TRAINING -------------------\n\n")
    print(module(gpu_input))
    print("\n\n-------------------- TESTING -------------------\n\n")

    module.eval()
    print(module(gpu_input))


test_my_dropout(Dropout(p=0.5))
```



```
------------------- TRAINING -------------------


tensor([[ 0.0000, -1.0756, -0.6573, -0.0000,  0.0000],
        [ 0.6543,  0.5908, -0.1279,  0.0000,  0.0000],
        [ 0.0000, -0.8477,  0.0000, -0.0000,  0.0000],
        [-0.0000, -0.0318,  0.0000, -0.0000,  0.0000],
        [-0.0000, -0.5612, -0.1582, -0.0000, -0.0000],
        [-0.7603,  0.2506, -0.0000, -0.9221, -0.2294],
        [ 0.0000,  0.0000, -1.5315,  0.0000,  0.0000],
        [ 0.0512,  0.0000,  0.0000,  0.1913, -0.0000]], device='cuda:0')


-------------------- TESTING -------------------


tensor([[ 0.3406, -0.5378, -0.3287, -0.3442,  0.2489],
        [ 0.3271,  0.2954, -0.0639,  0.5538,  0.1177],
        [ 0.5797, -0.4238,  0.1026, -0.0614,  0.2204],
        [-0.1302, -0.0159,  1.1633, -0.1587,  0.4389],
        [-0.2908, -0.2806, -0.0791, -0.4145, -0.2283],
        [-0.3801,  0.1253, -0.0988, -0.4610, -0.1147],
        [ 0.4038,  0.5556, -0.7657,  0.6603,  0.6052],
        [ 0.0256,  0.3206,  0.6579,  0.0957, -0.778
```

## Dropout rationale

### Ensemble

> Dropout works like an ensemble of models

- During each `forward` pass different internal routes are used to propagate information
- During `testing` phase all of the routes are considered but scaled appropriately

> Why this value for scaling?

- On average, each neuron "fires" with probability `1-p` (the expected value of its mask)
- During training the neurons have to output higher values in order to compensate for the signal lost to dropping
- __If we do not scale, values might explode somewhere in the deeper layers__ (especially for deep models and `ReLU` activations) when evaluation is run
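
To get a feel for the last point, here is a small sketch under simplifying assumptions: a bias-free stack of `Linear -> ReLU` blocks with made-up sizes. The names `depth`, `width` and the `evaluate` helper are illustrative. It compares evaluation with and without the `q` scaling.

```python
import torch

torch.manual_seed(0)

p = 0.5
q = 1 - p
depth = 5    # number of (linear -> ReLU -> dropout) blocks, chosen arbitrarily
width = 32

# Bias-free layers keep the example exact: ReLU(c * x) == c * ReLU(x) for c > 0
layers = [torch.nn.Linear(width, width, bias=False) for _ in range(depth)]


def evaluate(x, scale):
    """Evaluation-time forward pass; every block multiplies its output by `scale`."""
    with torch.no_grad():
        for layer in layers:
            x = scale * torch.relu(layer(x))
    return x


x = torch.randn(8, width)
scaled = evaluate(x, scale=q)      # test-time rule from this section
unscaled = evaluate(x, scale=1.0)  # "forgot to scale"

# Each block is off by a factor of 1 / q, so the mismatch compounds with depth
print((unscaled.abs().mean() / scaled.abs().mean()).item())  # (1 / q) ** depth = 32 here
```

With biases or other activations the factor is no longer exact, but the compounding effect is the same idea.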

### Sparsity (most important weights)

- Dropout pushes distributions of activations towards zero
- Neural network focuses more on the important weights and important output neurons
- We get a model that is __easier to reason about__ (not to be confused with easy!)
- __Breaks co-adaptation__ (multiple neurons doing similar tasks, which makes the decision boundary less clear)
- Due to the above, generalization is likely to improve as the most important features are considered (the most important factor according to the original authors)

### Scaling rationale - Monte Carlo sampling

- Randomly save models created during forward passes in training (there are __a lot of models__; `k=50` were used), preferably after some training has passed
- Ask each one to predict on test
- Average their results
- __Results similar to just multiplying activation by expected value (within one standard deviation)!__
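
The comparison above can be sketched on a toy network. Everything here is an assumption made for illustration: the layer sizes, the random input, and sampling fresh masks on a fixed (untrained) network instead of saving models during training. Because the layer after dropout is linear, the Monte Carlo average matches the scaled pass in expectation; with a nonlinearity after dropout the match would only be approximate.

```python
import torch

torch.manual_seed(0)

p, k = 0.5, 50  # drop probability and number of sampled models
q = 1 - p

# Tiny two-layer network with made-up sizes; dropout sits on the hidden layer
hidden = torch.nn.Linear(10, 64)
head = torch.nn.Linear(64, 3)
x = torch.randn(16, 10)

with torch.no_grad():
    h = torch.relu(hidden(x))

    # Monte Carlo: k random "thinned" models predict, then we average them
    predictions = []
    for _ in range(k):
        mask = torch.bernoulli(torch.full_like(h, q))
        predictions.append(head(mask * h))
    mc_average = torch.stack(predictions).mean(dim=0)

    # Shortcut: a single pass with activations multiplied by the expected mask
    scaled = head(q * h)

# Gap between the two approaches; on average it shrinks as k grows
print((mc_average - scaled).abs().max())
```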

### Noise addition

- As we randomly generate masks, we create noise during each forward pass
- Noise is known to improve generalization as it makes the model more reluctant to follow random/uninformative patterns
- __This is called internal representation noise__

## Inverted dropout

Inverted dropout is a closely related variant, and it is the practical implementation that libraries like PyTorch and TensorFlow use.

### Training

Works almost the same, but the output is additionally scaled by the inverse of the keep probability, `1 / q`:

$$
O_i = \frac{1}{1-p}X_ig(\sum_{k=1}^{d_i}w_k x_k + b), P(X_i = 0) = p
$$

- This approach allows the network to adjust to larger inputs (it would multiply outputs by `2` for `p=0.5`)
- Simulates input from other neurons that will be added during testing phase
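
You can see the `1 / q` factor directly in PyTorch's own dropout. A small check using `torch.nn.functional.dropout` on a tensor of ones (the tensor shape is arbitrary):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

p = 0.5
x = torch.ones(2, 8)

# training=True makes the functional version actually apply dropout
out = F.dropout(x, p=p, training=True)
print(out)  # every entry is either 0 or 1 / (1 - p) = 2
```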

### Testing

> In this approach testing phase __is left untouched (simply forward inputs to the layer)!__

__Advantages__:
- Define your model once
- Faster inference (one might even remove this layer completely)
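
A short check of both advantages with `torch.nn.Dropout`. The toy `model` and the `stripped` copy are illustrative; swapping the layer for `torch.nn.Identity` is just one way to "remove" it.

```python
import torch

torch.manual_seed(0)

layer = torch.nn.Dropout(p=0.5)
x = torch.randn(4, 3)

layer.eval()                     # switch to evaluation behaviour
assert torch.equal(layer(x), x)  # inputs pass through unchanged

# Because of that, an inference-only model can swap the layer for a no-op
model = torch.nn.Sequential(torch.nn.Linear(3, 2), torch.nn.Dropout(p=0.5))
model.eval()
stripped = torch.nn.Sequential(model[0], torch.nn.Identity())
assert torch.equal(model(x), stripped(x))
```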

## Exercise

Based on the description, implement `InvertedDropout` (inherit from previously created `Dropout` class).

Change `forward` method appropriately.

```python
class InvertedDropout(Dropout):
    def forward(self, inputs):
        ...
```
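
Again, one possible solution sketch (have a go yourself first). To keep the block runnable on its own it repeats a compact `Dropout` base class; that base is a stand-in for whatever you wrote in the first exercise, and the `self.p` attribute is an assumption of this sketch.

```python
import torch


class Dropout(torch.nn.Module):
    """Compact stand-in for the class from the first exercise."""

    def __init__(self, p: float = 0.5):
        super().__init__()
        if not 0 < p < 1:
            raise ValueError("p (probability) has to lie in (0, 1) range!")
        self.p = p
        self._distribution = torch.distributions.binomial.Binomial(
            total_count=1, probs=1 - p  # probability of keeping a neuron
        )

    def forward(self, inputs):
        if self.training:
            mask = self._distribution.sample(inputs.shape).to(inputs.device)
            return inputs * mask
        return inputs * (1 - self.p)


class InvertedDropout(Dropout):
    def forward(self, inputs):
        if self.training:
            mask = self._distribution.sample(inputs.shape).to(inputs.device)
            # Scale the survivors by 1 / q so the expected output matches the input
            return inputs * mask / (1 - self.p)
        # Evaluation: nothing to do
        return inputs
```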

```python
test_my_dropout(InvertedDropout(p=0.5))
```



```
------------------- TRAINING -------------------


tensor([[-1.1520, -0.0000,  0.0000, -0.8818, -0.0000],
        [ 0.0000,  0.0383,  0.0000,  0.0000, -0.0000],
        [-1.4515, -0.0000, -0.1408,  0.0000,  0.0000],
        [-0.0000, -0.3405, -0.0000, -0.8192,  0.0000],
        [ 0.0000, -0.0000, -0.0000, -1.1061, -0.0000],
        [-0.0000, -3.0036, -0.0000, -0.8002,  0.0000],
        [-0.0000, -1.2126,  0.4333, -0.0000,  0.2125],
        [-0.0000, -0.0000, -0.0000, -0.0000, -0.0000]], device='cuda:0')


-------------------- TESTING -------------------


tensor([[-1.1520, -0.6724,  1.3133, -0.8818, -0.6427],
        [ 0.2690,  0.0383,  0.1700,  0.1068, -1.4302],
        [-1.4515, -2.1973, -0.1408,  0.5725,  0.8788],
        [-0.2042, -0.3405, -0.4493, -0.8192,  1.5596],
        [ 0.1847, -0.1502, -0.0321, -1.1061, -0.1467],
        [-0.8698, -3.0036, -0.6992, -0.8002,  1.6355],
        [-1.7038, -1.2126,  0.4333, -0.3142,  0.2125],
        [-1.8729, -1.6412, -0.5230, -0.2739, -0.550
```

## Usage tips & tricks

> These are mostly anecdotal, always perform validation! It __might be__ worth checking them out

- Use a layer size of `N/q`, where `q = 1 - p` is the keep probability. If you think a layer size of `128` would be good for this problem and you set `p=0.5`, go with `256` neurons instead
- __Use `p=0.5` for internal layers__
- __Use `p=0.2` if Dropout is applied on input__
- __Use with Fully Connected Networks__, that's where this technique is most likely to bring improvements
- __It should not be the first technique you go to__ as others are more popular and usually work better in practice
- __Increase the learning rate when using dropout__ and raise momentum to `0.95-0.99` instead of `0.9`
- `L1` regularization should improve sparsity and force the network to keep only the most valuable connections, __might be a good choice__.
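
A rough sketch of how several of these tips could look together in PyTorch. The input size `20`, the output size `2` and the learning rate are placeholders, not recommendations. The hidden width is the wanted size divided by the keep probability, which for `p=0.5` gives the `256` from the first tip.

```python
import torch

p_input, p_hidden = 0.2, 0.5
wanted_width = 128                                 # size you would pick without dropout
hidden_width = int(wanted_width / (1 - p_hidden))  # widened to 256

model = torch.nn.Sequential(
    torch.nn.Dropout(p=p_input),   # dropout on the input features
    torch.nn.Linear(20, hidden_width),
    torch.nn.ReLU(),
    torch.nn.Dropout(p=p_hidden),  # dropout between linear layers
    torch.nn.Linear(hidden_width, 2),
)

# Higher momentum than the usual 0.9; the learning rate is a placeholder to tune
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.95)
```

As the note above says, treat this as a starting point and validate every choice on your own data.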

## When to use?

- Fully Connected Networks (without batch normalization)
- Between linear layers
- Possibly on input data (as long as it's not an image or text)

## When not to use?

> Possible solutions are outlined in the challenges for you to read!

- When using one of the most popular building blocks for neural networks: **Batch Normalization** (more about that during batch normalization explanation)
    - At least not in the same "block", as dropout changes mean and std of activations
    - Most neural network architectures use Batch Normalization, hence Dropout is not as popular anymore (sometimes it is used on the input, sometimes on linear layers at the very end of the network)
- Convolutional neural networks (as weights are highly correlated and the effect is minuscule, if any)
    - Also, the prediction surface is smoother and we are "un-smoothing" it using standard Dropout
- Recurrent neural networks
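
The batch normalization point about changed statistics can be checked numerically. A small sketch, assuming dropout feeds a batch normalization layer and using standard-normal activations as a stand-in for real ones:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

x = torch.randn(100_000)  # activations with std close to 1

train_out = F.dropout(x, p=0.5, training=True)  # what the next layer sees in training
eval_out = F.dropout(x, p=0.5, training=False)  # what it sees at evaluation

print(train_out.std(), eval_out.std())  # roughly 1.41 vs 1.0
```

Statistics collected by batch normalization during training would therefore not match what the layer receives at evaluation time.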

## Summary

- Dropout is a well-known & battle-tested regularization technique
- Randomly switches off neurons after the activation layer during training
- Leaves all the connections in place during the test phase, but scaled
- Works like an ensemble
- __In practice Inverted Dropout is used__ (test phase is fully untouched)
- Should be used with fully connected networks, rarely with other types of layers (or you need a sound rationale for that)
- PyTorch provides `torch.distributions` module for random data generation

## Challenges

- What is `AlphaDropout`?
- What is `SpatialDropout`?
- What is `DropConnect`? 
- What is `ShakeShake` regularization? (You can also do this one after convolutional neural networks.)