# Tutorial 1: Working with the Lisa cluster

This tutorial explains how to work with the Lisa cluster for the Deep Learning course. We recommend listening to the presentation by the SURFsara team before going through this tutorial.

## First steps

### How to connect to Lisa

You can log in to Lisa using a secure shell (SSH):

```bash
ssh -X lgpu___@lisa.surfsara.nl
```

Replace `lgpu___` with your username. You will be connected to one of Lisa's login nodes and see a standard Linux system, starting in your home directory. Note that you should use the login node only as an interface, not as a compute unit. Do not run any training on this node: the process will be killed after 15 minutes, and it slows down the communication with Lisa for everyone. Instead, Lisa uses a SLURM scheduler to handle computationally expensive jobs (see below).

If you want to transfer files between Lisa and your local computer, you can use standard Unix commands such as `scp` or `rsync`, or graphical interfaces such as [FileZilla](https://filezilla-project.org/) (use port 22 in FileZilla). 
A copy operation from Lisa to your local computer with `rsync`, started from your local computer, could look as follows:

```bash
rsync -av lgpu___@lisa.surfsara.nl:~/source destination
```

Replace `lgpu___` with your username, `source` with the directory/file on Lisa you want to copy to your local machine, and `destination` with the directory/file it should be copied to. Note that `source` is referenced from your home directory on Lisa. If you want to copy a file from your local computer to Lisa, use:

```bash
rsync -av source lgpu___@lisa.surfsara.nl:~/destination
```

Again, replace `source` with the directory/file on your local computer you want to copy to Lisa, and `destination` with the directory/file on Lisa it should be copied to, referenced from your home directory.
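
If you transfer files often, the two `rsync` commands above can be wrapped in small shell functions so that the username and host are typed only once. The following is a minimal sketch and not part of the course material: the function names `lisa_pull` and `lisa_push` and the variable `LISA_USER` are hypothetical, and `lgpu___` remains a placeholder for your own username.

```bash
# Assumption: LISA_USER holds your own username; lgpu___ is only a placeholder.
LISA_USER="${LISA_USER:-lgpu___}"
LISA_HOST="lisa.surfsara.nl"

# Copy a file or directory from your Lisa home directory to the local machine.
# $1: path relative to your home directory on Lisa, $2: local destination
lisa_pull() {
    rsync -av "${LISA_USER}@${LISA_HOST}:~/$1" "$2"
}

# Copy a local file or directory into your Lisa home directory.
# $1: local source, $2: path relative to your home directory on Lisa
lisa_push() {
    rsync -av "$1" "${LISA_USER}@${LISA_HOST}:~/$2"
}
```

For example, `lisa_pull results ./results` would copy `~/results` from Lisa to `./results` on your machine.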

### Modules

Lisa uses modules to provide you with various pre-installed software packages. This includes plain Python, but also the NVIDIA libraries CUDA and cuDNN, which are necessary to use GPUs in PyTorch. A standard set of modules we use is the following:

```bash
module load 2019
module load Python/3.7.5-foss-2019b
module load CUDA/10.1.243
module load cuDNN/7.6.5.32-CUDA-10.1.243
module load NCCL/2.5.6-CUDA-10.1.243
```

When working on the login node, it is sufficient to load the `2019` software pack and the `Python/...` module. CUDA and cuDNN are only required when you run a job on a compute node.
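
On the login node, the reduced set of modules described above could be loaded with a small helper. This is only a sketch, assuming the module versions listed above are still available on Lisa; `load_login_modules` is a hypothetical name, and `module list` simply prints the modules that are currently loaded so you can check the result.

```bash
# Hypothetical helper: load only what is needed on the login node.
load_login_modules() {
    module purge                          # start from a clean module state
    module load 2019                      # software pack
    module load Python/3.7.5-foss-2019b   # Python from that pack
    module list                           # show what is loaded now
}
```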

### Install the environment

## The SLURM scheduler

Lisa relies on a SLURM scheduler to organize the jobs on the cluster. After logging into Lisa, you cannot just start a Python script with your training; instead, you submit a job to the scheduler. The scheduler decides when and on which node to run your job, based on the number of nodes available and the other jobs submitted.

### Job files

We provide a template for a job file that you can use on Lisa. Create a file with any name you like, for example `template.job`, and start the job by executing the command `sbatch template.job`.

```bash
#!/bin/bash

#SBATCH --partition=gpu_shared_course
#SBATCH --gres=gpu:1
#SBATCH --job-name=ExampleJob
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=3
#SBATCH --time=04:00:00
#SBATCH --mem=32000M
#SBATCH --output=slurm_output_%A.out

module purge
module load 2019
module load Python/3.7.5-foss-2019b
module load CUDA/10.1.243
module load cuDNN/7.6.5.32-CUDA-10.1.243
module load NCCL/2.5.6-CUDA-10.1.243

# Your job starts in the directory where you call sbatch
cd $HOME/...
# Activate your environment
source activate ...
# Run your code
srun python -u ...
```
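
Submitting the job file is a single `sbatch` call. If you want to keep the job ID for later commands, a small wrapper can help. The sketch below relies on the `--parsable` flag of `sbatch`, which prints the job ID (optionally followed by `;` and a cluster name) instead of a full sentence; `submit_job` is a hypothetical helper name.

```bash
# Hypothetical helper: submit a job file and print only its job ID.
# $1: path to the job file, e.g. template.job
submit_job() {
    local job_file=$1
    local submitted
    # --parsable output has the form "jobid" or "jobid;cluster"
    submitted=$(sbatch --parsable "$job_file") || return 1
    # Strip an optional ";cluster" suffix
    echo "${submitted%%;*}"
}
```

After `job_id=$(submit_job template.job)`, you can list your jobs with `squeue -u "$USER"` and, if needed, cancel the job with `scancel "$job_id"`.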

#### Job arguments

You might have to change the `#SBATCH` arguments depending on your needs. We describe the arguments below:

* `partition`: The partition of Lisa on which you want to run your job. As a student, you only have access to the partition `gpu_shared_course`, which gives you access to nodes with GTX1080Ti GPUs (11GB).
* `gres`: Generic resources, which include the GPU that is crucial for deep learning jobs. You can only select one GPU with your account, so there is no need to change it.
* `job-name`: Name of the job that shows up when you list your jobs with `squeue`.
* `ntasks`: Number of tasks to run with the job. In our case, we will always use 1 task.
* `cpus-per-task`: Number of CPUs you request from the node. The `gpu_shared_course` partition restricts you to at most 3 CPUs per job/GPU.
* `time`: Estimated time your job needs to finish. It is no problem if your job finishes earlier than the specified time. However, if your job takes longer, it will be killed immediately once the specified time has passed. Still, don't specify unnecessarily long times, as this causes your job to be scheduled later (you need to wait longer in the queue if other people also want to use the partition). A good rule of thumb is to specify ~20% more than what you would expect.
* `mem`: Amount of RAM you request from the node. Note that this is *not* the GPU memory, but the random access memory of the node. On `gpu_shared_course`, you are restricted to 64GB per job/GPU, which is more than you need for the assignments.
* `output`: Output file to which the SLURM output should be written. The tag "%A" is automatically replaced by the job ID. Note that if you specify the output file to be in a directory that does not exist, no output file will be created.

SLURM allows you to specify many more arguments, but the ones above are the important ones for us. If you are interested in a full list, see [here](https://slurm.schedmd.com/sbatch.html).
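
Two of the points above, the ~20% margin for `time` and the requirement that the `output` directory already exists, can be sketched as small shell helpers. Both function names are hypothetical and not part of SLURM. The directory has to be created before calling `sbatch`, because creating it inside the job file would be too late.

```bash
# Hypothetical helper: add a 20% margin to an expected runtime (in minutes)
# and print it in the HH:MM:SS format accepted by --time.
slurm_time_with_margin() {
    local expected_min=$1
    # Integer arithmetic, rounded up so the margin is never below 20%
    local padded_min=$(( (expected_min * 120 + 99) / 100 ))
    printf '%02d:%02d:00\n' $(( padded_min / 60 )) $(( padded_min % 60 ))
}

# Hypothetical helper: create the directory part of an --output pattern,
# e.g. "slurm_logs/slurm_output_%A.out" -> creates "slurm_logs".
ensure_output_dir() {
    mkdir -p "$(dirname "$1")"
}
```

With an expected runtime of 200 minutes, `slurm_time_with_margin 200` prints `04:00:00`, the value used in the template above.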

#### Scratch 

If you work with a lot of data or a large dataset, we advise copying your data to the `/scratch` directory of the node. Otherwise, reading and writing the data might become the bottleneck of your job. To do this, use your copy operation of choice (`cp`, `rsync`, ...) and copy the data to the directory `$TMPDIR`. Add this command to your job file before calling `srun ...`, and remember to point your code to the copied data. If your job also writes results to the scratch directory, copy them back to your home directory before the job finishes.
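
The copy steps described above could look as follows in a job file. This is a minimal sketch: `stage_in` and `stage_out` are hypothetical helper names, and all paths in the usage comments are placeholders that you need to replace with your own.

```bash
# Hypothetical helper: copy a dataset to the scratch directory of the node
# and print the location of the copy.
# $1: file or directory in your home directory
stage_in() {
    local src=$1
    cp -r "$src" "$TMPDIR/"
    echo "$TMPDIR/$(basename "$src")"
}

# Hypothetical helper: copy results from scratch back to permanent storage.
# $1: directory on scratch, $2: destination directory in your home directory
stage_out() {
    local scratch_dir=$1
    local dest=$2
    mkdir -p "$dest"
    cp -r "$scratch_dir/." "$dest/"
}

# Usage inside the job file (placeholder paths):
#   data_dir=$(stage_in "$HOME/path/to/dataset")    # before srun
#   srun python -u ...                              # point your code to $data_dir
#   stage_out "$TMPDIR/results" "$HOME/path/to/results"   # after srun
```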

### Queueing

### Organizing your jobs

## Troubleshooting

### Lisa is refusing connection

### Slurm output file missing