Example of using fraternal dropout with an LSTM language model on the PTB dataset
The architecture used here is a single-layer LSTM with DropConnect (Wan et al. 2013) applied to the RNN hidden-to-hidden matrix, the same architecture used in the ablation studies in the Fraternal Dropout paper. Included below are the hyper-parameters needed to get results equivalent to those in the original paper for a single-layer LSTM.
State-of-the-art results on the Penn Treebank (PTB) and WikiText-2 (WT2) datasets
If you want to replicate the state-of-the-art results from the Fraternal Dropout paper on the Penn Treebank (PTB) or WikiText-2 (WT2) dataset, you have to apply fraternal dropout on top of the 3-layer AWD-LSTM architecture. Training is more time-consuming (approximately one day for PTB and three days for WT2). The code with the hyper-parameters used in the paper can be found in the other branches (PTB or WT2).
These models do not support all the options implemented for the single-layer LSTM. To start training the 3-layer AWD-LSTM model with fraternal dropout, simply run
python main.py
For fine-tuning, run
python finetune.py --save PATH
where PATH is the path to the model that should be fine-tuned (the model will be overwritten, so make a copy if needed). The perplexities from the corresponding branches can be expected to be:
58.0 without fine-tuning and
65.3 without fine-tuning and
Requirements: Python 3 and PyTorch 0.2.
How to run the code
The easiest way to train the FD model (the baseline model enhanced by fraternal dropout with κ=0.15), achieving perplexities of approximately
64.9 (validation / testing) is to run
python main.py --model FD
For comparison, you can try
python main.py --model ELD
or
python main.py --model PM
to train the expectation-linear dropout model (κ=0.25) or the Π-model (κ=0.15), respectively.
If you want to override the default κ value, use --kappa, for example
python main.py --model FD --kappa 0.1
There are a few more hyper-parameters you can try. Run
python main.py --help
to get the full list. For example,
python main.py --model FD --same_mask_w
uses the same dropout mask for the RNN hidden-to-hidden matrix in both networks. This gives slightly better results, i.e.
64.6 (validation / testing).
Using fraternal dropout in other PyTorch models
With this example, it should be easy to apply fraternal dropout to any PyTorch model that uses dropout. However, this example incorporates additional options (such as using the same dropout mask for part of the network, or applying the expectation-linear dropout model instead of fraternal dropout), so a simpler example is provided below.
If you are interested in applying fraternal dropout without the additional options (which are not needed to achieve better results and are implemented only for comparison), a simple modification of your code should be enough. Find and modify the lines of code that calculate the output and the loss. In the simplest, typical case you should find something like this
output = model(data)
loss = criterion(output, targets)
and replace with
output = model(data)
kappa_output = model(data)
loss = 1/2*(criterion(output, targets) + criterion(kappa_output, targets))
loss = loss + kappa * (output - kappa_output).pow(2).mean()
Since a new dropout mask is drawn by default for each forward pass, output and kappa_output are calculated with different masks and hence are not the same. You should average the target loss over both of them (loss = 1/2*(criterion(output, targets) + criterion(kappa_output, targets))) and add a regularization term that reduces the variance of the predictions across different masks (loss = loss + kappa * (output - kappa_output).pow(2).mean()). The variable kappa is the κ hyper-parameter.
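Written as a formula, the loss described above looks roughly as follows. The notation is illustrative and not taken from the repository: z and z' stand for output and kappa_output, y for targets, ℓ for criterion, and N for the total number of elements in output (matching the .mean() call).

```latex
\mathcal{L}_{\mathrm{FD}}
  = \tfrac{1}{2}\bigl(\ell(z, y) + \ell(z', y)\bigr)
  + \kappa \cdot \frac{1}{N} \sum_{i=1}^{N} \bigl(z_i - z'_i\bigr)^2
```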
Because each batch now goes through two forward passes, you may halve the batch size of your model to use about the same amount of memory. It may also improve the final performance.
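For reference, here is a minimal, self-contained sketch of one training step with the modification described above. It is not code from this repository: the toy classifier, the random data, the optimizer settings, and the helper name fraternal_dropout_loss are all assumptions made for illustration, and it is written against a recent PyTorch release rather than PyTorch 0.2.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # reproducible dropout masks for this demo

# Hypothetical toy classifier; any model containing dropout is handled the same way.
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(64, 5),
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
kappa = 0.15  # the value this README uses for the FD model


def fraternal_dropout_loss(model, criterion, data, targets, kappa):
    """Return the fraternal dropout loss and both outputs for one batch."""
    output = model(data)        # first forward pass, first dropout mask
    kappa_output = model(data)  # second pass, independently drawn mask
    # Average the target loss over the two passes.
    target_loss = 0.5 * (criterion(output, targets) + criterion(kappa_output, targets))
    # Penalize disagreement between the two passes.
    reg = kappa * (output - kappa_output).pow(2).mean()
    return target_loss + reg, output, kappa_output


# Random stand-in data (assumption): 16 examples, 20 features, 5 classes.
data = torch.randn(16, 20)
targets = torch.randint(0, 5, (16,))

model.train()  # dropout must be active, otherwise both passes are identical
optimizer.zero_grad()
loss, output, kappa_output = fraternal_dropout_loss(model, criterion, data, targets, kappa)
loss.backward()
optimizer.step()
print(f"fraternal dropout loss: {loss.item():.4f}")
```

Note that the regularizer only has an effect in training mode. After model.eval() the two passes produce identical outputs, so the penalty is zero and the loss reduces to the ordinary target loss.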