# Classifying Digits using HPC
---

In the previous exercises, we introduced the Compute Canada ecosystem and set up our Python environment with the Anaconda package manager and the PyTorch deep learning framework. In this document, we'll demonstrate some of the more common job-scheduling commands for running programs on Compute Canada, and provide a practical example by training a deep, 7-layer convolutional neural network (CNN) to classify the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset.

The following commands will be covered:

* [salloc](https://slurm.schedmd.com/salloc.html): obtains a Slurm job allocation; useful for getting an interactive node
* [sbatch](https://slurm.schedmd.com/sbatch.html): submits a batch script to Slurm
* [squeue](https://slurm.schedmd.com/squeue.html): shows job information in the Slurm queue
* [sacct](https://slurm.schedmd.com/sacct.html): shows information on running and recently completed jobs
* [scancel](https://slurm.schedmd.com/scancel.html): cancels jobs you no longer need or want


## About MNIST
---

MNIST is an extremely popular dataset for testing machine learning algorithms. It consists of hand-written digits from 0 to 9, stored as 28 x 28 pixel greyscale images. There are 60,000 training examples and 10,000 testing examples. The goal is to learn to predict the correct digit class when shown an image.

An example of the dataset (64 samples) can be seen below.

<img src="images/mnist.png" width="350">


## About the Model
---

We will be training the following network:

<img src="images/network.png" width="400">

It is configured as follows:

* 2 convolution layers  (10 filters, kernel size 3x3)
* max-pooling layer (2x)
* 2 convolution layers  (10 filters, kernel size 3x3)
* max-pooling layer (2x)
* fully-connected   (64 neurons)
* fully-connected   (64 neurons)
* fully-connected   (10 neurons)
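
To see how the image size changes as it passes through the layers listed above, the arithmetic can be sketched in a few lines of Python. This assumes unpadded convolutions with a stride of 1 and non-overlapping 2x2 pooling. None of these settings is stated above, so the actual values in `main.py` may differ.

```python
def conv_out(size, kernel=3, padding=0, stride=1):
    """Spatial size of a square feature map after one convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, window=2):
    """Spatial size after non-overlapping max-pooling."""
    return size // window


def flattened_features(size=28, filters=10):
    """Trace a square input through conv-conv-pool, conv-conv-pool."""
    for _ in range(2):                   # two conv/conv/pool stages
        size = conv_out(conv_out(size))  # two 3x3 convolutions (assumed unpadded)
        size = pool_out(size)            # 2x max-pooling
    return filters * size * size         # values fed to the first fully-connected layer


print(flattened_features())
```

Under these assumptions the image shrinks from 28 to 24, 12, 8 and finally 4 pixels per side, so the first fully-connected layer receives 10 x 4 x 4 = 160 values per image.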

## More Resources
---

* SHARCNet has an incredible [wiki](https://www.sharcnet.ca/help/index.php/Main_Page) on how to use their resources
* [The Stanford CS231 class](http://cs231n.github.io/) is an excellent resource for learning more about deep learning and convolutional neural networks

# 1. Getting Started
---

__NOTE:__ A reservation was made for several SHARCNet nodes, to ensure all participants would have access to the resources during the workshop period. As such, there are several "reservation=" lines throughout the code that are not normally required to be present. If you are following this guide outside of the workshop hours, be sure to remove these lines or your job submissions may not run.


## 1.1. Task: Clone the Repository
---

Log in to the _graham_ cluster:

```shell
ssh graham.sharcnet.ca
```

Once logged in, download the code repository we'll be using for the workshop. We will try to work from the _scratch_ directory as much as possible. Recall that the scratch directory is __not__ backed up (i.e. if you delete a file, it's gone forever), and files that haven't been used in over 60 days are automatically deleted. 

* If this is your first time logging in to Compute Canada, your scratch directory may not be initialized yet (give it a few hours); instead, work from your _project_ directory.

```shell
cd /scratch/<username>
git clone https://github.com/mveres01/hpc-demo
cd hpc-demo
```


## 1.2. Task: Download the Data
---

From the login nodes, download the data we'll use by running the download script. Note that the login nodes are the only ones with access to the internet; even if you were to transfer data from your own computer to SHARCNet, you would do so through the login nodes. 

```shell
python download.py
```

This will download the MNIST dataset, pre-process it, and store it in a folder called data/. 

### 1.2.1. A Quick Peek into download.py
---

A quick look inside the file download.py shows the following lines of code:

```python
import os
from torchvision import datasets, transforms

data_dir = 'data'
if not os.path.exists(data_dir):
     os.makedirs(data_dir)

data = datasets.MNIST(data_dir, train=True, download=True)
data = datasets.MNIST(data_dir, train=False, download=True)
```

The PyTorch framework that we installed contains pre-built functions for downloading and standardizing data. In this case, it will save the processed data in two files called training.pt and test.pt. Notice that although there are 60,000 images in the training set, it only occupies ~46MB of disk space, which is small enough that we can load everything into memory when we perform our training. There are a variety of other methods that can be used in cases where data won't fit in memory, such as saving each image independently as its own .JPG, and learning from one batch of data at a time. 
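
The size on disk follows from simple arithmetic, assuming each pixel is stored as a single unsigned byte (labels and file headers add a little on top). The sketch below also illustrates the idea of visiting a dataset one batch at a time. `raw_image_bytes` and `batch_indices` are hypothetical helpers, not part of PyTorch or the workshop code, and the batch size of 64 is an arbitrary choice.

```python
def raw_image_bytes(num_images, height=28, width=28, bytes_per_pixel=1):
    """Bytes needed to hold the raw pixel values of the images."""
    return num_images * height * width * bytes_per_pixel


def batch_indices(num_samples, batch_size):
    """Yield (start, stop) index pairs that cover the dataset batch by batch."""
    for start in range(0, num_samples, batch_size):
        yield start, min(start + batch_size, num_samples)


train_bytes = raw_image_bytes(60_000)
print(f"training images: {train_bytes / 1e6:.1f} MB ({train_bytes / 2**20:.1f} MiB)")
print("batches of 64:", sum(1 for _ in batch_indices(60_000, 64)))
```

The raw pixels come to about 47 million bytes, which is consistent with the file size quoted above. Only one `(start, stop)` slice needs to be in memory at a time, which is what makes batch-wise loading attractive for larger datasets.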

# 2. Task: Submitting an Interactive Job
---

Request an interactive node. We'll use a single GPU for 2 hours:


```shell
salloc --time=2:0:0 --gres=gpu:1 --mem=6000
```

You should see the following output if successful:

```
salloc: Pending job allocation 6436997
salloc: job 6436997 queued and waiting for resources
salloc: job 6436997 has been allocated resources
salloc: Granted job allocation 6436997
salloc: Waiting for resource configuration
salloc: Nodes gra984 are ready for job
```

The console prompt should have changed from "@gra-login1" to the node that's been allocated to you (e.g. to "@gra984"). Now that the resource is yours, you can also ssh _into_ the node from another terminal if you want. For example, if you were granted node 984 as in this example, you would type:

```shell
ssh gra984
```

__NOTE__ that you are _not_ able to ssh into nodes that the scheduler has not assigned you. You can try, but it won't work!

## 2.1. Task: Check GPU Resources
---

To verify that you have been granted access to one GPU, enter the following command in the terminal:

```shell
nvidia-smi
```

SHARCNet uses Nvidia graphics cards; the "smi" part of the above command stands for _system management interface_. You should see the following:

<img src="images/nvidia-smi.PNG" width="600">

You can see that we've been allocated a Tesla P100 that's currently drawing 26W of power, and is using 0 bytes of its 12 GB limit. As you run programs on the GPU, the power and memory usage will steadily increase. _Using interactive nodes is a good way to estimate how much memory your models will require, before you go off and submit non-interactive jobs._ 


## 2.2. Task: Check Job Status
---

Once your request for a resource has been granted, the job scheduler will immediately begin counting down your remaining time. For interactive jobs, this means that even while you're not running a program, your time continuously decreases. 

You can view your job's status with the "squeue" command:

```shell
squeue
```

To filter the jobs to only show *your* jobs, pass it the "-u" flag and specify your username:

```shell
squeue -u <username>
```

## 2.3. Task: Running a Model
---

Now that we are on an interactive node, we need to initialize our workspace. Activate the Python environment we wish to use.

```shell
source activate pytorch4
```

Run the "main.py" file. This file contains the specification for our CNN and provides a method for iterating over our dataset.

```
python main.py
```

You should see something like the following:

```
Epoch 0 accuracy: 0.0974, took: 45.3128s
Epoch 1 accuracy: 0.1135, took: 44.1006s
Epoch 2 accuracy: 0.1135, took: 44.1928s
Epoch 3 accuracy: 0.1135, took: 44.1635s
Epoch 4 accuracy: 0.5687, took: 44.1864s
Epoch 5 accuracy: 0.8960, took: 44.2826s
```

Note the time it takes to complete a full training epoch, i.e. one pass in which the network sees every training sample once and updates its weights based on the errors it makes. After _N_ epochs, the network has seen every sample _N_ times. 


## 2.4. Task: Running a Model with Cuda
---

Although we've been given access to a GPU, it turns out the code has not been using it. This could have been diagnosed by running the "nvidia-smi" command in a separate terminal to view the usage while the program was running.

In order to use the GPU, the code also needs to support it. PyTorch does, so converting the code to use the GPU is a matter of a couple of statements that move the network and data to the GPU before the data is passed through the network. How to write code to do this is outside the scope of this workshop -- but here we have enabled GPU capabilities by specifying the --use-cuda flag:

```
python main.py --use-cuda
```

Things should move much faster now.

```
Epoch 0 accuracy: 0.0974, took: 6.6521s
Epoch 1 accuracy: 0.1135, took: 6.5516s
Epoch 2 accuracy: 0.1135, took: 6.6110s
Epoch 3 accuracy: 0.1135, took: 6.5932s
Epoch 4 accuracy: 0.5674, took: 6.5807s
Epoch 5 accuracy: 0.8966, took: 6.6388s
Epoch 6 accuracy: 0.9142, took: 6.5820s
Epoch 7 accuracy: 0.9266, took: 6.6599s
Epoch 8 accuracy: 0.9533, took: 6.5851s
Epoch 9 accuracy: 0.9613, took: 6.6390s
```

## 2.5. Section Takeaway

Running an interactive node is similar to how you would run a model on your local workstation, with the main difference being that you must make a request for resources, and then operate within the provided time and memory constraints. 

It is also important to understand that with deep learning, the time taken to complete a task is some function of:

* The depth of the network
* The hardware you are using
* The amount of data the network sees

# 3. Task: Submitting a Batch Job
---

Interactive jobs work great when we want to interact with code, but there are many times when we just want models to run. Within deep learning, a common situation where this is encountered is in trying to tune the _hyperparameters_ of your model. Here, you want to quickly test out different combinations of parameters, and report the configuration that achieves the best result. 

When we don't need to interact with code, we can submit a request for a non-interactive job to the scheduler. Non-interactive jobs differ from interactive jobs in the following ways (note: non-exhaustive):

1. To request an interactive job, we used the command "salloc" and specified all constraints on a single line. Non-interactive "batch" jobs are submitted using the command "sbatch" in conjunction with a configuration file. 
2. Once an interactive job has been assigned, the time you have left on the resource immediately begins to decrease, even if you are not running any code. A batch job starts executing its script as soon as it is assigned a resource, so none of the allotted time sits idle. 
3. In the interactive job, output was written directly to the screen. In the non-interactive version, output from each job is instead written to a file. 

The configuration file passed to "sbatch" outlines our job (and program) constraints. For example, the submit_job.sh script in the project folder contains the following lines:

```shell
#!/bin/bash
#SBATCH --gres=gpu:1                  # Number of GPUs (per node)
#SBATCH --mem=4000M                   # memory (per node)
#SBATCH --time=0-00:10                # time (DD-HH:MM)
#SBATCH --output=slurm-%j.out         # output filename pattern; j == jobid
source activate pytorch4
python main.py --no-progress
```

If the code or program that you have developed accepts command-line arguments (e.g. the --use-cuda flag from the previous section), they can get passed in by specifying them on the line beginning with "python". To submit a job using this configuration, in the terminal type the following:

```shell
sbatch submit_job.sh
```

You should receive confirmation that your job was submitted. You can check its status using the "squeue -u <username>" command, and after it begins to run, you should see a file called slurm-xxxxxx.out that appears in the current workspace folder. As the program executes, it will periodically write to this file; to get a quick glimpse of its contents, enter the following on the command line:

```shell
cat slurm-*.out
```
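
If you would rather summarise the log than read it in full, its lines can be parsed with a short script. This is a sketch that assumes the job output keeps the `Epoch N accuracy: X, took: Ys` format shown earlier; `parse_epochs` is a hypothetical helper, not part of the repository.

```python
import glob
import re

# Matches lines such as "Epoch 5 accuracy: 0.8960, took: 44.2826s"
EPOCH_LINE = re.compile(r"Epoch (\d+) accuracy: ([\d.]+), took: ([\d.]+)s")


def parse_epochs(text):
    """Return a list of (epoch, accuracy, seconds) tuples found in the text."""
    return [(int(epoch), float(acc), float(sec))
            for epoch, acc, sec in EPOCH_LINE.findall(text)]


# Report the best epoch of every Slurm output file in the current folder
for path in sorted(glob.glob("slurm-*.out")):
    with open(path) as handle:
        epochs = parse_epochs(handle.read())
    if epochs:
        best = max(epochs, key=lambda row: row[1])
        print(f"{path}: best accuracy {best[1]:.4f} at epoch {best[0]}")
```

Run it from the folder that contains the `slurm-*.out` files; files without any matching lines are silently skipped.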

## 3.1. Task: Cancelling a Job
---

The above job was run without CUDA being enabled, so it is occupying a GPU resource without actually running on the GPU. Let's cancel the job. In the terminal, type: 

```shell
squeue -u <username>
```

to find the job id. Once you have it, type:

```shell
scancel <jobid>
```

If successful, when you type the squeue command again, the job should no longer be in the scheduler. 

## 3.2. Task: Submit a Batch Job with CUDA flag

Modify the batch file to submit the job with the '--use-cuda' flag. Sorry - No help this time :)


## 3.3. Email Notifications on Job Updates
---

If you want to be notified when your job starts, you can specify your email address in the configuration file before running the sbatch command. This is preferable to repeatedly running "squeue" to see whether a job has started, as it puts less stress on the job scheduler. To achieve this, the script below adds two lines that enable emails to be sent to &lt;email address&gt;:

```shell
#!/bin/bash
#SBATCH --gres=gpu:1                  # Number of GPUs (per node)
#SBATCH --mem=4000M                   # memory (per node)
#SBATCH --time=0-00:30                # time (DD-HH:MM)
#SBATCH --output=slurm-%j.out         # output filename pattern; j == jobid
#SBATCH --mail-user=<email address>
#SBATCH --mail-type=ALL
python main.py
```

See [here](https://docs.computecanada.ca/wiki/Running_jobs#Monitoring_jobs) for more details. 

<img src="images/slurm_email.PNG" width="900">


# 4. Task: Transferring Data
---

Running the "main.py" file showed that our accuracy was fairly high -- often over 90%. But what do some of the predictions look like? 

The code has been saving a small snapshot of predictions on every epoch to the _images/_ folder. Let's retrieve some of the data.

## 4.1 Windows
---

Start the program _psftp_. On the command line, type:

```shell
open graham.computecanada.ca
```

and enter your username and password. Next, move to the project directory. If you were able to work in the _scratch/_ folder, type the following:

```shell
cd /scratch/<username>/hpc-demo
```

Next, retrieve the images folder. This is done using the "mget" command, and specifying the recursive "-r" flag as follows:

```shell
mget -r images
```

This will download the images to wherever your current working directory for the psftp program is. You can see the contents of your local directory using the command "!dir", and see the contents of the remote directory using the command "dir". 

## 4.2 Linux
---

There are many ways to retrieve files with Linux. 

### 4.2.1. Secure Copy Protocol
---

One of the easiest is to use the secure-copy protocol ("scp"). Open a terminal and type:

```shell
scp -r <username>@graham.sharcnet.ca:/scratch/<username>/hpc-demo/images .
```

Let's break it down:

* scp -- the secure copy program
* -r  -- recursive copy
* &lt;username&gt;@graham.sharcnet.ca -- your username and the remote host
* :/scratch/&lt;username&gt;/hpc-demo/images -- location of the folder on the remote host
* .  -- copy the files to the current directory

### 4.2.2. Rsync

Another common tool for copying files is _rsync_. Try the following:

rsync -r -e ssh &lt;username&gt;@graham.sharcnet.ca:/scratch/&lt;username&gt;/hpc-demo/images .



# 5. Extra Task: Optimizing Network Performance

Try playing with the network a little bit. While you have seen that it can accept the --use-cuda and --no-progress flags, it also accepts a range of other ones, including:

--seed, --epochs, --lr, --momentum, --batch, and --optimizer.

These (minus "seed") are known as model _hyperparameters_. Changing them will change the performance of your trained model. Try either starting an interactive job and changing these values (e.g. python main.py --lr=0.001), or adding them to the configuration file and submitting a batch job. How accurate can you get the network to be?
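
One way to organise such a search is to generate the command line for every combination of a few candidate values. The sketch below only builds and prints the commands. The candidate values are arbitrary illustrations, and `build_commands` is a hypothetical helper that uses only the flags named above.

```python
from itertools import product

# Candidate values to try for each flag (illustrative choices only)
grid = {
    "--lr": [0.1, 0.01, 0.001],
    "--momentum": [0.5, 0.9],
}


def build_commands(grid, base=("python", "main.py", "--use-cuda")):
    """Return one command string per combination of values in the grid."""
    flags = list(grid)
    commands = []
    for values in product(*(grid[flag] for flag in flags)):
        args = [f"{flag}={value}" for flag, value in zip(flags, values)]
        commands.append(" ".join((*base, *args)))
    return commands


for command in build_commands(grid):
    print(command)
```

Each printed line could replace the `python` line of a copy of the batch script, so that every combination runs as its own job.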


# Miscellaneous Tips
---

## Create Checkpoints for your Models While Training
---

* In some cases, your model may not be able to finish training within the allotted time. It is usually a good idea to create _checkpoints_ of your training progress, which are snapshots of the model at certain points in time. Later, you can use these to quickly restore the state of your model when you want to resume training, or want to switch to performing inference.  
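
A minimal sketch of the idea, using only the standard library: the training state is written to a temporary file and then renamed, so an interrupted job never leaves a half-written checkpoint behind. In a PyTorch program you would typically store the model's and optimizer's `state_dict()` with `torch.save` rather than `pickle`. The file name, the contents of `state`, and the `train` loop here are illustrative assumptions, not the workshop's code.

```python
import os
import pickle


def save_checkpoint(state, path="checkpoint.pkl"):
    """Write the state to disk, replacing any older checkpoint atomically."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as handle:
        pickle.dump(state, handle)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


def load_checkpoint(path="checkpoint.pkl"):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as handle:
        return pickle.load(handle)


def train(num_epochs, path="checkpoint.pkl"):
    """Resume from the last checkpoint (if any) and save after every epoch."""
    state = load_checkpoint(path) or {"epoch": 0, "weights": None}
    for epoch in range(state["epoch"], num_epochs):
        # ... one epoch of training would update the weights here ...
        state = {"epoch": epoch + 1, "weights": state["weights"]}
        save_checkpoint(state, path)
    return state
```

If a job running `train(10)` is stopped after epoch 4, submitting it again picks up at epoch 4 instead of starting over.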


## Estimating Resources for Running a Job
---

Each job run on SHARCNet requires (at a minimum) a specification for memory usage and time usage. 

* Over-estimating limits: the scheduler may have difficulty finding a slot to run your code, so your job may wait longer in the queue
* Under-estimating limits: your job may be terminated if it exceeds the memory or time it requested
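
As a rough worked example, a time limit can be derived from an epoch time measured on an interactive node. The 1.5x safety margin is an arbitrary choice rather than a SHARCNet recommendation, the run length of 10 epochs is taken from the GPU log earlier in this guide, and `slurm_time_limit` is a hypothetical helper.

```python
import math


def slurm_time_limit(seconds_per_epoch, epochs, margin=1.5):
    """Format an estimated run time as days-hours:minutes for --time."""
    total_minutes = math.ceil(seconds_per_epoch * epochs * margin / 60)
    days, rest = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(rest, 60)
    return f"{days}-{hours:02d}:{minutes:02d}"


print("CPU:", slurm_time_limit(44.2, 10))  # epoch time from the CPU log
print("GPU:", slurm_time_limit(6.6, 10))   # epoch time from the GPU log
```

The resulting string can be placed after `--time=` in the batch script. Note that it only covers training time; loading the environment and the data adds a little on top.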

## Connecting with Special Flags
---

Interaction with SHARCNet is done through the Linux console. The console is a medium that allows a user to interact with the Linux kernel, while simultaneously allowing the kernel to interact with the user through text-based information. This is quite a different environment than one may be used to on operating systems such as Windows. There are a number of [flags](https://www.freebsd.org/cgi/man.cgi?query=ssh&sektion=1) that can be entered alongside the SSH command to control how the connection is established. A common flag to include is the "-Y" flag, which performs X11 forwarding and allows programs requiring a graphical display to be run -- for example, text editors such as gVim or emacs. On Linux, the command to connect with this functionality is:

```shell
ssh -Y graham.sharcnet.ca
```

To test if the X-window connection was set up properly, try opening the gVim editor by typing 'gvim' on the console. For Windows users, there is an option in PuTTY that needs to be checked, under: Connection > SSH > X11 > Enable X11 Forwarding. Additionally, a windowing application such as Xming must be running to host the windows.

<p float="left">
<img src="images/putty-x11.PNG" width="300"/> <img src="images/putty-x11-xlaunch.PNG" width="300"/> 
</p>

## Interactive Nodes are Useful for Debugging
---

Make sure your code is free of bugs before submitting it as a batch job; an interactive node is a good place to test it. When you submit code to the scheduler, the scheduler finds a suitable time and place for your code to run. If your program then fails because of a bug, your remaining time on that resource is forfeited, and you will have to submit a new job and wait for an available resource again. 

## Screen
---

When you are working in a terminal, you are normally tethered to a single session -- once you close the terminal window, your session usually ends. There are, however, several applications that will allow you to maintain your session even when the window is closed, such as `tmux` and `screen`.
