# Misconception - Dropout kills/removes an entire neuron

This is NOT true. Every neuron remains part of the network; units are only disabled temporarily and at random during each training step (the forward pass and its back-propagation).

* [If dropout is going to remove neurons, why are those neurons built?](https://stats.stackexchange.com/a/590808/105137)

> The neurons are only dropped **temporarily during training**. They are not dropped from the network altogether. It is just that it turns out that we get better weights if we randomly set them to zero, temporarily, so the other neurons "think" they cannot "rely" on the other neurons and have to "perform well themselves". The neural network that you get out **at the end contains all the neurons**.

> the neurons that are dropped out are **randomly selected each time the weights are updated**. So while on each iteration only some of the neurons are used and updated, **over the entire training cycle all the neurons are trained**. According to Jason Brownlee's A Gentle Introduction to Dropout for Regularizing Deep Neural Networks, dropout can be thought of as training an ensemble of models in parallel.
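The point that every unit is dropped in some steps and kept in others can be checked with a small simulation. This is only a sketch: the names `p_drop`, `num_units`, `num_steps` and `kept_counts` are made up for illustration, and the mask is drawn with NumPy rather than taken from any particular framework.

```python
import numpy as np

rng = np.random.default_rng(0)

p_drop = 0.5       # probability of dropping a unit at a given step (assumed value)
num_units = 8      # hypothetical hidden layer width
num_steps = 1000   # hypothetical number of training steps

# Count, per unit, in how many steps the unit was kept (not dropped).
kept_counts = np.zeros(num_units, dtype=int)
for _ in range(num_steps):
    mask = rng.random(num_units) >= p_drop  # True = kept in this step
    kept_counts += mask

# Each unit is kept in roughly (1 - p_drop) of the steps, so no unit is
# permanently removed and no unit is permanently active.
print(kept_counts)
```

Because a fresh mask is drawn at every step, each unit takes part in about half of the updates in this setting, which is the sense in which all neurons are trained over the whole cycle.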

As the PyTorch documentation describes, it is **NOT the entire neuron** that is zeroed out, but **randomly sampled elements in each channel** (the D features of, e.g., a **single token embedding vector** in a Transformer).

* [PyTorch Dropout](https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html)

> The zeroed elements are chosen independently for **each forward call and are sampled** from a Bernoulli distribution.
> Each **CHANNEL** will be zeroed out independently on every forward call.
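The behaviour quoted above can be observed directly with `torch.nn.Dropout`. The sketch below assumes PyTorch is installed; the tensor shape `(4, 8)` and `p = 0.5` are arbitrary choices for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

p = 0.5
dropout = nn.Dropout(p=p)
x = torch.ones(4, 8)  # e.g. 4 vectors with 8 features each (illustrative shape)

# Training mode: a new Bernoulli mask is sampled on every forward call.
dropout.train()
y_train_a = dropout(x)
y_train_b = dropout(x)  # a second call draws a separate mask

# Evaluation mode: dropout is the identity, so every element passes through.
dropout.eval()
y_eval = dropout(x)

# Surviving elements are scaled by 1 / (1 - p); dropped elements are 0.
scale = 1.0 / (1.0 - p)
is_zero_or_scaled = ((y_train_a == 0) | (y_train_a == scale)).all().item()

print(y_train_a)
print(y_train_b)
print(torch.equal(y_eval, x))
```

In training mode the zeros are scattered over individual elements of the tensor, and in evaluation mode nothing is zeroed at all.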

## Evidence from the implementation

As shown below, given ```X.shape: (N, D)``` and ```W.shape: (M, D)```, we get ```H.shape: (N, M)```. The elements to be zeroed out are randomly sampled from the whole ```(N, M)``` matrix. Therefore, an entire row of shape ```(M,)``` in ```H```, i.e. the full output vector for one input, will NOT be zeroed out (removed) as a unit.

For instance, in a Transformer, ```M``` is the dimension (number of features) of a token embedding vector. Only some of the ```M``` features of a token embedding vector are zeroed out, so the entire token is NOT killed (zeroed out). The exception is ```M==1```, such as a single pixel in an MNIST digit image.
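The shapes described above can be reproduced with a minimal NumPy sketch of inverted dropout. It is an illustration under assumed sizes (`N = 4`, `D = 16`, `M = 64`) and an assumed keep probability `p_keep = 0.5`, not a copy of any specific course or library code.

```python
import numpy as np

rng = np.random.default_rng(0)

N, D, M = 4, 16, 64   # batch size, input features, output features (assumed)
p_keep = 0.5          # probability of keeping an element (assumed)

X = rng.standard_normal((N, D))
W = rng.standard_normal((M, D))
H = X @ W.T           # shape (N, M)

# Inverted dropout: one independent Bernoulli draw per element of H,
# with kept elements scaled by 1 / p_keep.
mask = (rng.random(H.shape) < p_keep) / p_keep   # entries are 0 or 1 / p_keep
H_dropped = H * mask

# How many rows (full (M,) vectors) were zeroed out entirely?
rows_fully_zeroed = int((mask == 0).all(axis=1).sum())
# What fraction of individual elements was zeroed?
fraction_zeroed = float((mask == 0).mean())

print(H_dropped.shape, rows_fully_zeroed, fraction_zeroed)
```

With `M = 64` the chance that all 64 elements of one row are dropped together is `0.5 ** 64`, so in practice about half of the individual values are zeroed while no row disappears as a whole. With `M = 1` each row has a single element, which is why that case is the exception.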

<img src="../image/cs231n_dropout_summary.png" align="left"/>




The diagram below is **misleading, or entirely incorrect depending on the architecture**, because it gives the impression that neurons are removed from the network.

<img src="../image/incorrect_dropout_concept.png" align="left" width=250/>


* [Why Transformer applies Dropout after Positional Encoding?](https://datascience.stackexchange.com/a/128330/68313)

> Normal dropout does not remove whole tokens, but individual values within the vectors. Therefore, dropout does not remove 10% of the tokens in a sequence, but 10% of the values.