# Guide 2: Research projects with PyTorch

![Status](https://img.shields.io/static/v1.svg?label=Status&message=Under%20development&color=red)

* Based on some feedback I got, we will try to summarize tips and tricks on how to set up and structure large research projects in PyTorch, such as your Master's thesis
* Feel free to contribute yourself if you have good ideas

## Setup

### Framework

* Choose the right framework. If you have simple setups like classification, consider going with PyTorch Lightning. If you need to change the default training procedure, go with plain PyTorch and write your own framework
    * Good setup: a `train.py` file which contains the default operations every model needs (training loop, loading/saving the model, setting up the model, etc.), and a `task.py` file which is specific to a certain task. This split also allows you to do multi-task learning
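
One way to lay out this split is sketched below. The names `Task`, `RegressionTask`, and `train` are hypothetical, and both files are shown in one block for brevity. Treat it as a minimal sketch under those assumptions, not a prescribed design.

```python
import torch
from torch import nn


# --- task.py: everything that is specific to one task ---
class Task:
    """Hypothetical interface: a task turns a batch into a scalar loss."""

    def loss(self, model: nn.Module, batch) -> torch.Tensor:
        raise NotImplementedError


class RegressionTask(Task):
    def loss(self, model, batch):
        inputs, targets = batch
        return nn.functional.mse_loss(model(inputs), targets)


# --- train.py: operations every model needs, independent of the task ---
def train(model, tasks, batches, optimizer, checkpoint_path=None):
    """Generic training loop that returns the loss of every step.

    Passing several tasks sums their losses, which is the simplest form of
    multi-task learning (assumption: all tasks consume the same batch).
    """
    model.train()
    losses = []
    for batch in batches:
        optimizer.zero_grad()
        loss = sum(task.loss(model, batch) for task in tasks)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    if checkpoint_path is not None:
        # Saving the state dict is enough to restore the weights later.
        torch.save(model.state_dict(), checkpoint_path)
    return losses
```

A new task then only needs a new `Task` subclass, while the loop and the checkpointing stay untouched.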

## Hyperparameter search 

### Reproducibility

* Everything is about reproducibility. Make sure you can reproduce any training run with the same random values, batches, etc. At some point you will have tried many different approaches without any of them improving on one of your earlier runs. When you then rerun the model with the best hyperparameters, you don't want a bad surprise (believe me, enough people have had this issue, and it can happen to you too). Hence, before starting any grid search, make sure you are able to reproduce runs: run two jobs in parallel on Lisa with the same hyperparameters, seeds, etc., and if you don't get the exact same results, stop and fix that before anything else.
* Another aspect of reproducibility is that saving and loading a model must work without any problems. Before a long training run, make sure you can load a saved model from disk and achieve the exact same test score as you had during training.
* Print your hyperparameters into the SLURM output file (a simple print statement in Python). This helps you identify runs, and you can easily check whether Lisa is executing the job you intended.
* When running a job, automatically copy the job file to your checkpoint folder. This improves reproducibility.
* Besides the SLURM output file, create an output file in which you store the best training, validation, and test scores. This helps when you want to compare runs.
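
The bookkeeping points above (printing hyperparameters, copying the job file, and writing a results file) could be sketched with the standard library alone. The function names `start_run` and `save_results` and the file names `hparams.json` and `results.json` are assumptions chosen for illustration.

```python
import json
import shutil
from pathlib import Path


def start_run(checkpoint_dir, hparams, job_file=None):
    """Record what is needed to identify and rerun this job."""
    checkpoint_dir = Path(checkpoint_dir)
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    # SLURM captures stdout, so this line ends up in the SLURM output file.
    print("Hyperparameters:", json.dumps(hparams, sort_keys=True), flush=True)
    # Keep a machine-readable copy next to the checkpoints as well.
    hparams_path = checkpoint_dir / "hparams.json"
    hparams_path.write_text(json.dumps(hparams, indent=2, sort_keys=True))
    if job_file is not None:
        # Copy the job file so the exact submission can be inspected later.
        shutil.copy(job_file, checkpoint_dir / Path(job_file).name)
    return checkpoint_dir


def save_results(checkpoint_dir, train_score, val_score, test_score):
    """Store the best scores of a run in one small file for later comparison."""
    results = {"train": train_score, "val": val_score, "test": test_score}
    path = Path(checkpoint_dir) / "results.json"
    path.write_text(json.dumps(results, indent=2))
    return path
```

With one `results.json` per checkpoint folder, comparing runs becomes a matter of looping over the folders and loading each file.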

### Seeds

* DL models are noisy. Before running a grid search, try to get a feeling of how noisy your experiments might be. The more noise you expect compared to 
* After finishing the grid search, run another model with the best configuration and a new seed. If the score is still the best, take the model. If not, consider running a few more seeds for the top $k$ models in your grid search. Otherwise you risk taking a suboptimal model which was just lucky to be the best for a specific seed.
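
To get a feeling for seed noise, it helps to fix every source of randomness per run and then look at the spread across seeds. The sketch below assumes a toy `run_experiment` function as a stand-in for a full training run; `set_seed` and `summarize` are hypothetical helper names. If your code also uses NumPy, its generator would need seeding too.

```python
import random
import statistics

import torch
from torch import nn


def set_seed(seed: int) -> None:
    """Seed the random number generators a typical PyTorch run touches."""
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # silently ignored without a GPU
    # Ask cuDNN for deterministic kernels instead of auto-tuned ones.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


def run_experiment(seed: int) -> float:
    """Toy stand-in for a full training run; returns the last training loss."""
    set_seed(seed)
    model = nn.Linear(4, 1)
    inputs = torch.randn(32, 4)
    targets = torch.randn(32, 1)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(10):
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(inputs), targets)
        loss.backward()
        optimizer.step()
    return loss.item()


def summarize(seeds):
    """Mean and standard deviation of the score over several seeds."""
    scores = [run_experiment(seed) for seed in seeds]
    return statistics.mean(scores), statistics.stdev(scores)
```

Calling `run_experiment` twice with the same seed should return the identical value. If the difference between two configurations is smaller than the standard deviation reported by `summarize`, more seeds are probably needed before picking a winner.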

### Learning rate

* Adam: 1e-3 is a good place to start if you do classification. If you have very deep models, consider reducing the learning rate
* The smaller your batch size, the lower the learning rate should be. Consider using gradient accumulation if your batch size is getting too small (PyTorch Lightning supports this).
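
In plain PyTorch, gradient accumulation can be sketched as follows. The function name `train_with_accumulation` and the MSE loss are assumptions for illustration. In PyTorch Lightning the corresponding `Trainer` argument is `accumulate_grad_batches`.

```python
import torch
from torch import nn


def train_with_accumulation(model, batches, optimizer, accumulation_steps):
    """Update the weights only once every `accumulation_steps` micro-batches."""
    model.train()
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(batches, start=1):
        loss = nn.functional.mse_loss(model(inputs), targets)
        # Scale the loss so the summed gradients equal the mean over all
        # micro-batches (exact when the micro-batches have equal size).
        (loss / accumulation_steps).backward()
        if step % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
    # Note: leftover micro-batches at the end are dropped in this sketch.
```

With equally sized micro-batches, two micro-batches of 4 examples and `accumulation_steps=2` give the same update as one batch of 8 examples.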

#### LR scheduler

* For classifiers: a multi-step LR schedule has been shown to work well
* For other models, a smooth decay like exponential decay or cosine scheduler is a good alternative
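
The two kinds of schedule can be compared by recording the learning rate per epoch. The sketch below uses `MultiStepLR` and `CosineAnnealingLR` from `torch.optim.lr_scheduler`. The milestones, the number of epochs, and the helper name `lr_curve` are arbitrary choices for illustration.

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import CosineAnnealingLR, MultiStepLR


def lr_curve(make_scheduler, epochs=10, base_lr=1e-3):
    """Return the learning rate used in each epoch for a given scheduler."""
    model = nn.Linear(2, 2)
    optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)
    scheduler = make_scheduler(optimizer)
    curve = []
    for _ in range(epochs):
        curve.append(optimizer.param_groups[0]["lr"])
        optimizer.step()  # placeholder for one epoch of training
        scheduler.step()  # advance the schedule once per epoch
    return curve


# Step-wise decay: multiply the learning rate by 0.1 at epochs 5 and 8.
multi_step = lr_curve(lambda opt: MultiStepLR(opt, milestones=[5, 8], gamma=0.1))
# Smooth decay: follow a cosine curve over the 10 epochs.
cosine = lr_curve(lambda opt: CosineAnnealingLR(opt, T_max=10))
```

Plotting `multi_step` and `cosine` against the epoch index makes the difference between abrupt and smooth decay visible.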

### Regularization

* Dropout is always a good idea

#### Domain specific regularization

* Computer Vision: image augmentation
* NLP: input dropout of whole words
* Graphs: dropping edges, inputs
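
Two of these regularizers are small enough to sketch directly. The version of word dropout below replaces whole tokens with an unknown-word id, which is one possible reading of "input dropout of whole words". Edges are assumed to be stored as a `2 x num_edges` index tensor. The function names `word_dropout` and `drop_edges` are hypothetical, and both should only be applied during training.

```python
import torch


def word_dropout(token_ids, p, unk_id):
    """Replace each token id with `unk_id` with probability `p`."""
    mask = torch.rand(token_ids.shape, device=token_ids.device) < p
    return token_ids.masked_fill(mask, unk_id)


def drop_edges(edge_index, p):
    """Drop each edge (a column of a 2 x num_edges tensor) with probability `p`."""
    keep = torch.rand(edge_index.size(1), device=edge_index.device) >= p
    return edge_index[:, keep]
```

For an undirected graph stored as two directed edges per connection, this sketch drops each direction independently. Dropping both directions together would need a shared mask.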


### Grid search

#### SLURM 

* SLURM supports you in doing grid search. It is recommended to use job arrays instead of creating N job files, or to use PyTorch Lightning's built-in support for running grid searches on SLURM schedulers.
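
With a job array, SLURM sets the environment variable `SLURM_ARRAY_TASK_ID` for each array task, and the training script can map that index to one hyperparameter configuration. The grid values and the helper name `config_for_task` below are hypothetical. The matching job file would request six tasks with `#SBATCH --array=0-5`.

```python
import itertools
import os

# Hypothetical search space: 3 dropout values x 2 learning rates x 1 seed = 6 runs.
GRID = {
    "lr": [1e-3, 1e-4],
    "dropout": [0.0, 0.1, 0.2],
    "seed": [42],
}


def config_for_task(task_id, grid=GRID):
    """Map an array task index to one point of the grid."""
    keys = sorted(grid)  # fixed key order, so the mapping is stable across runs
    combos = list(itertools.product(*(grid[key] for key in keys)))
    if not 0 <= task_id < len(combos):
        raise IndexError(f"task_id {task_id} is outside the grid of {len(combos)} runs")
    return dict(zip(keys, combos[task_id]))


if __name__ == "__main__":
    # Falls back to the first configuration when run outside of a job array.
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    print("Hyperparameters:", config_for_task(task_id))
```

Because every array task derives its configuration from the same script, a single job file covers the whole grid.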