# Tutorial 1: Working with the Lisa cluster

This tutorial explains how to work with the Lisa cluster for the Deep Learning course. We recommend following the presentation by the SURFsara team before going through this tutorial.

## First steps

### How to connect to Lisa

You can log in to Lisa using a secure shell (SSH):

```bash
ssh -X lgpu___@lisa.surfsara.nl
```

Replace `lgpu___` by your username. You will be connected to one of Lisa's login nodes and see a standard Linux system, starting in your home directory. Note that you should only use the login node as an interface, not as a compute unit. Do not run any training on this node: such processes are killed after 15 minutes, and they slow down the communication with Lisa for everyone. Instead, Lisa uses a SLURM scheduler to handle computationally expensive jobs (see below).

If you want to transfer files between Lisa and your local computer, you can use standard Unix commands such as `scp` or `rsync`, or graphical interfaces such as [FileZilla](https://filezilla-project.org/) (use port 22 in FileZilla) or [WinSCP](https://winscp.net/eng/index.php) (for Windows PC). 
A copy operation from Lisa to your local computer with `rsync`, started from your local computer, could look as follows:

```bash
rsync -av lgpu___@lisa.surfsara.nl:~/source destination
```

Replace `lgpu___` by your username, `source` by the directory/file on Lisa you want to copy to your local machine, and `destination` by the directory/file it should be copied to. Note that `source` is referenced from your home directory on Lisa. If you want to copy a file from your local computer to Lisa, use:

```bash
rsync -av source lgpu___@lisa.surfsara.nl:~/destination
```

Again, replace `source` with the directory/file on your local computer you want to copy to Lisa, and `destination` by the directory/file it should be copied to.

### Modules

Lisa uses modules to provide you with various pre-installed software. This includes plain Python, but also the NVIDIA libraries CUDA and cuDNN that are necessary to use GPUs in PyTorch. A standard set of modules we use is the following:

```bash
module load 2019
module load Python/3.7.5-foss-2019b
module load CUDA/10.1.243
module load cuDNN/7.6.5.32-CUDA-10.1.243
module load NCCL/2.5.6-CUDA-10.1.243
```

When working on the login node, it is sufficient to load the `2019` software pack and the `Python/...` module. CUDA and cuDNN are only required when you run a job on a compute node.
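Besides `load`, the `module` command offers subcommands such as `purge`, `list` and `avail` to inspect what is loaded or installed. A minimal sketch for the login node, using the versions listed above; the function name `load_login_modules` is only an illustration and not part of Lisa:

```bash
# Hypothetical helper: load only what is needed on the login node.
load_login_modules() {
    module purge                          # unload everything to start from a clean state
    module load 2019                      # software pack
    module load Python/3.7.5-foss-2019b   # Python interpreter
    module list                           # print the modules that are now loaded
}

# Usage on Lisa:
#   load_login_modules
#   module avail CUDA    # list the installed CUDA versions
```

Starting with `module purge` avoids surprises from modules that were loaded earlier in the same session.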

### Install the environment

To run the Deep Learning assignments and other code like the notebooks on Lisa, you need to install the [provided environment for Lisa](https://github.com/uvadlc/uvadlc_practicals_2020/blob/master/environment_Lisa.yml). Lisa provides an Anaconda module, which you can load via `module load Anaconda3/2018.12` (remember to load the `2019` module beforehand). Install the environment with the following command: 

```bash
conda env create -f environment_Lisa.yml
```

If you experience issues with the Anaconda module, you can also install Anaconda yourself ([download link](https://docs.anaconda.com/anaconda/install/linux/)) or ask your TA for help.

## The SLURM scheduler

Lisa relies on a SLURM scheduler to organize the jobs on the cluster. After logging into Lisa, you cannot simply start a Python script with your training. Instead, you submit a job to the scheduler, which decides when and on which node to run it, based on the number of available nodes and the other jobs submitted.

### Job files

We provide a template for a job file that you can use on Lisa. Create a file with any name you like, for example `template.job`, and start the job by executing the command `sbatch template.job`.

```bash
#!/bin/bash

#SBATCH --partition=gpu_shared_course
#SBATCH --gres=gpu:1
#SBATCH --job-name=ExampleJob
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=3
#SBATCH --time=04:00:00
#SBATCH --mem=32000M
#SBATCH --output=slurm_output_%A.out

module purge
module load 2019
module load Python/3.7.5-foss-2019b
module load CUDA/10.1.243
module load cuDNN/7.6.5.32-CUDA-10.1.243
module load NCCL/2.5.6-CUDA-10.1.243

# Your job starts in the directory where you call sbatch
cd $HOME/...
# Activate your environment
source activate ...
# Run your code
srun python -u ...
```

#### Job arguments

You might have to change the `#SBATCH` arguments depending on your needs. We describe the arguments below:

* `partition`: The partition of Lisa on which you want to run your job. As a student, you only have access to the partition `gpu_shared_course`, which provides nodes with NVIDIA GTX1080Ti GPUs (11GB).
* `gres`: Generic resources, which include the GPU that is crucial for deep learning jobs. You can only select one GPU with your account, so there is no need to change it.
* `job-name`: Name of the job as shown when you list your jobs with `squeue` (see below).
* `ntasks`: Number of tasks to run with the job. In our case, we will always use 1 task.
* `cpus-per-task`: Number of CPUs you request from the node. The `gpu_shared_course` partition restricts you to at most 3 CPUs per job/GPU.
* `time`: Estimated time your job needs to finish. It is no problem if your job finishes earlier than the specified time. However, if your job takes longer, it is killed as soon as the specified time has passed. Still, don't specify unnecessarily long times, as this causes your job to be scheduled later (you need to wait longer in the queue if other people also want to use the cluster). A good rule of thumb is to specify ~20% more than what you would expect.
* `mem`: RAM of the node you need. Note that this is *not* the GPU memory, but the random access memory of the node. On `gpu_shared_course`, you are restricted to 64GB per job/GPU, which is more than you need for the assignments.
* `output`: Output file to which the slurm output should be written. The tag `%A` is automatically replaced by the job ID. Note that if you specify the output file to be in a directory that does not exist, no output file will be created.

SLURM allows you to specify many more arguments, but the ones above are the important ones for us. If you are interested in a full list, see [here](https://slurm.schedmd.com/sbatch.html).
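Because requests above the course limits leave a job pending, it can help to check a job file before submitting it. The following is a minimal sketch under the limits stated above (1 GPU, 3 CPUs); the helper names `sbatch_option` and `check_course_limits` are hypothetical, and the parsing only handles the `#SBATCH --name=value` form used in the template:

```bash
# Read the value of one "#SBATCH --name=value" line from a job file.
sbatch_option() {
    local jobfile="$1" name="$2"
    sed -n "s/^#SBATCH[[:space:]]*--${name}=//p" "$jobfile" | head -n 1
}

# Return 0 if the CPU and GPU requests stay within the course limits, 1 otherwise.
check_course_limits() {
    local jobfile="$1" cpus gres status=0
    cpus="$(sbatch_option "$jobfile" cpus-per-task)"
    gres="$(sbatch_option "$jobfile" gres)"
    if [ -n "$cpus" ] && [ "$cpus" -gt 3 ]; then
        echo "cpus-per-task=$cpus exceeds the limit of 3"
        status=1
    fi
    if [ -n "$gres" ] && [ "$gres" != "gpu:1" ]; then
        echo "gres=$gres, but only gpu:1 is allowed"
        status=1
    fi
    return "$status"
}

# Usage: submit only if the check passes.
#   check_course_limits template.job && sbatch template.job
```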

#### Scratch 

If you work with a lot of data, or a larger dataset, it is advised to copy your data to the `/scratch` directory of the node. Otherwise, reading and writing the data might become a bottleneck of your job. To do this, simply use your copy operation of choice (`cp`, `rsync`, ...), and copy the data to the directory `$TMPDIR`. You should add this command to your job file before calling `srun ...`. Remember to point your code to this copy of the data when you run it. If you also write files to scratch, you need to copy them back to your home directory before the job finishes.
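The copy steps described above could be sketched as two small shell functions placed in the job file. This is an illustration only: the function names, the project paths, the script name `train.py` and its flags are all assumptions, not part of the course code.

```bash
# Copy a dataset directory from your home directory to the node's scratch space.
stage_in() {
    local source_dir="${1%/}"            # strip a trailing slash, if any
    cp -r "$source_dir" "$TMPDIR"/
}

# Copy a results directory from scratch back into a directory in your home.
stage_out() {
    local name="$1" destination="$2"
    mkdir -p "$destination"
    cp -r "$TMPDIR/$name" "$destination"/
}

# In a job file, around the srun line (paths, script name and flags are hypothetical):
#   stage_in "$HOME/my_project/data"
#   srun python -u train.py --data_dir "$TMPDIR/data" --output_dir "$TMPDIR/results"
#   stage_out results "$HOME/my_project"
```

Note that anything left on scratch is lost once the job ends, so `stage_out` has to run inside the job itself.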

### Starting and organizing jobs

To start a job, you simply have to run `sbatch jobfile` where you replace `jobfile` by the filename of the job. Note that no specific file extension like `.job` is necessary (you can use `.txt` or any other you prefer). After your job has been submitted, it is first placed into a waiting queue. The SLURM scheduler decides when to start your job based on its requested runtime, all other jobs currently running or waiting, and the available nodes.

Besides `sbatch`, you can interact with the SLURM scheduler via the following commands:

* `squeue`: Lists all jobs that are currently submitted to Lisa. This can be a lot of jobs as it includes all partitions. You can make it partition-specific using `squeue -p gpu_shared_course`, or only list the jobs of your account: `squeue -u lgpu___` (again, replace `lgpu___` by your username). See the [slurm documentation](https://slurm.schedmd.com/squeue.html) for details.
* `scancel JOBID`: Cancels and stops a job, independent of whether it is running or pending. The job ID can be found using `squeue`, and is printed when submitting the job via `sbatch`.
* `scontrol show job JOBID`: Shows additional information about a specific job, like the estimated start time.
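Since most of these commands take the job ID, it can be convenient to capture it when submitting. A minimal sketch, assuming `sbatch` prints its usual `Submitted batch job <ID>` line; the helper name `job_id_from_sbatch` is hypothetical:

```bash
# Extract the job ID from the line that sbatch prints,
# e.g. "Submitted batch job 1234567".
job_id_from_sbatch() {
    awk '/^Submitted batch job/ { print $4 }'
}

# Usage on Lisa (template.job is the job file from above):
#   job_id="$(sbatch template.job | job_id_from_sbatch)"
#   squeue -j "$job_id"      # show only this job
#   scancel "$job_id"        # cancel it again
```

Alternatively, `sbatch --parsable` prints the job ID without the surrounding text.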

## Troubleshooting

You may encounter some issues when interacting with Lisa. A short FAQ is provided on the [SURFsara website](https://userinfo.surfsara.nl/systems/lisa/faq), and here we provide a list of common questions/situations we have seen from past students.

### Lisa is refusing connection

It can occasionally happen that Lisa refuses the connection when you try to ssh into it. If this happens, you can first try to log in to a different login node. Specifically, try the following three login nodes:

```bash
ssh -X lgpu___@login3.lisa.surfsara.nl
ssh -X lgpu___@login4.lisa.surfsara.nl
ssh -X lgpu___@login-gpu.lisa.surfsara.nl
```

If none of those work, the connection issue is likely not on your side. The problem usually resolves itself after 2-3 hours, and Lisa lets you log in again. If the problem is not resolved after a couple of hours, please contact your TA, and eventually the SURFsara helpdesk.

### Slurm output file missing

If a job of yours is running, but no slurm output file is created, check whether the path to the output file specified in your job file actually exists. If the specified file points to a non-existing directory, no output file will be created. Note that this does not stop the job from running, but you are running it "blind" without seeing the stdout or stderr channels.
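One way to avoid this is to create the output directory before submitting. The sketch below reads the `--output` path from the job file and creates its directory; the helper names are hypothetical, and it only handles the `#SBATCH --output=path` form used in the template:

```bash
# Print the directory of the --output path declared in a job file.
output_dir_of() {
    local jobfile="$1" path
    path="$(sed -n 's/^#SBATCH[[:space:]]*--output=//p' "$jobfile" | head -n 1)"
    [ -n "$path" ] || return 1            # no --output line found
    dirname "$path"
}

# Create that directory if it is missing. Relative paths are resolved against
# the current directory, so call this from where you run sbatch.
ensure_output_dir() {
    local dir
    dir="$(output_dir_of "$1")" || return 1
    mkdir -p "$dir"
}

# Usage:
#   ensure_output_dir template.job && sbatch template.job
```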

### Slurm output file is empty for a long time

The slurm output file can lag behind in showing the outputs of your running job. If your job has been running for a couple of minutes and you would have expected a few print statements by now, try to flush your stdout stream ([how to flush the output in python](https://stackoverflow.com/questions/230751/how-to-flush-output-of-print-function)).
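The job template above already passes `-u` to Python, which disables output buffering. If your job starts Python indirectly (for example through a wrapper script), setting the `PYTHONUNBUFFERED` environment variable in the job file has the same effect. A minimal sketch:

```bash
# Equivalent to "python -u" for every Python process started afterwards.
export PYTHONUNBUFFERED=1

# Follow the output file while the job runs (replace JOBID by your job ID):
#   tail -f slurm_output_JOBID.out
```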

### All my jobs are pending

With your student account, the SLURM scheduler restricts you to running a single job at a time. However, you can still queue multiple jobs, which are then run in sequence. This restriction exists because with more than 200 students, Lisa could get crowded very quickly if we did not guarantee a fair share of resources. If all of your jobs are pending, you can check the reason for pending in the last column of `squeue`. All reasons are listed in the squeue [documentation](https://slurm.schedmd.com/squeue.html) under *JOB REASON CODES*. The following ones are common:

* `Priority`: There are other jobs on Lisa with a higher priority that are also waiting to be run. This means you just have to be patient. 
* `QOSResourceLimit`: The job is requesting more resources than allowed. Check your job file as you are only allowed to have 1 GPU, 3 CPU cores and 64GB RAM.
* `Resources`: All nodes on Lisa are currently busy. Your job will be scheduled as soon as resources become available.

You can also see the estimated start time of a job by running `scontrol show job JOBID`. However, note that this is the "worst case" scenario for the currently submitted jobs, that is, it assumes that all currently running jobs need their maximum runtime. At the same time, if other people submit jobs with a higher priority, yours can fall back in the queue and get a later start time.
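To look at only your own pending jobs, `squeue` can be restricted by user and state and given a custom output format. A minimal sketch, assuming `$USER` holds your Lisa username; the function names are hypothetical:

```bash
# Print "JOBID REASON" for each of your pending jobs, without a header line.
pending_reasons() {
    squeue -u "$USER" -t PENDING -h -o "%i %r"
}

# Print the start times that the scheduler currently expects for your pending jobs.
pending_start_times() {
    squeue -u "$USER" --start
}

# Usage on Lisa:
#   pending_reasons
#   pending_start_times
```

In the format string, `%i` is the job ID and `%r` the reason code, which you can look up in the list above.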