# Use a GPU in an interactive session

# List of partitions
The following command lists the partitions, their GPU resources (`GRES`) and the nodes in each partition (`NODELIST`):
```sh
sinfo -o "%.10P %.5a %.10l %.6D %.6t %.20N %.10G"

 PARTITION AVAIL  TIMELIMIT  NODES  STATE             NODELIST       GRES
  compute*    up 3-00:00:00      1 drain*              scs0123     (null)
  compute*    up 3-00:00:00      2  down*       scs[0022,0050]     (null)
  compute*    up 3-00:00:00     36    mix scs[0007,0009-0010,0     (null)
  compute*    up 3-00:00:00     35  alloc scs[0001-0006,0008,0     (null)
  compute*    up 3-00:00:00     48   idle scs[0049,0051-0062,0     (null)
  compute*    up 3-00:00:00      1   down              scs0100     (null)
developmen    up      30:00      1 drain*              scs0123     (null)
developmen    up      30:00      2  down*       scs[0022,0050]     (null)
developmen    up      30:00     36    mix scs[0007,0009-0010,0     (null)
developmen    up      30:00     35  alloc scs[0001-0006,0008,0     (null)
developmen    up      30:00     48   idle scs[0049,0051-0062,0     (null)
developmen    up      30:00      1   down              scs0100     (null)
       gpu    up 2-00:00:00      1    mix              scs2003 gpu:v100:2
       gpu    up 2-00:00:00      2  alloc       scs[2001-2002] gpu:v100:2
       gpu    up 2-00:00:00      1   idle              scs2004 gpu:v100:2
  accel_ai    up 2-00:00:00      2    mix       scs[2041,2043] gpu:a100:8
  accel_ai    up 2-00:00:00      3   idle  scs[2042,2044-2045] gpu:a100:8
accel_ai_d    up    2:00:00      2    mix       scs[2041,2043] gpu:a100:8
accel_ai_d    up    2:00:00      3   idle  scs[2042,2044-2045] gpu:a100:8
accel_ai_m    up   12:00:00      1   idle              scs2046 gpu:1g.5gb
s_highmem_    up 3-00:00:00      2   idle       scs[0151-0152]     (null)
s_compute_    up 3-00:00:00      1    mix              scs3001     (null)
s_compute_    up 3-00:00:00      1   idle              scs3003     (null)
s_compute_    up    1:00:00      1    mix              scs3001     (null)
s_compute_    up    1:00:00      1   idle              scs3003     (null)
 s_gpu_eng    up 2-00:00:00      1   idle              scs2021 gpu:v100:4
 ```
In this example I use:
* PARTITION: `accel_ai` (because I have access to it).
    * Check your project memberships here: https://scw.bangor.ac.uk/en/projects/memberships/
* Make sure the `STATE` is `idle` or `mix`, not `drain` or `down`.
* Each `accel_ai` node has 8 NVIDIA A100 40 GB GPUs (`gpu:a100:8`).
* Any of the idle nodes `scs[2042,2044-2045]` can be used.
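Picking a usable node by eye works, but the same check can be sketched in Python. The helper below is a hypothetical illustration, not part of Slurm: it keeps only rows whose `STATE` is exactly `idle` or `mix` and whose `GRES` starts with `gpu:`, and expands simple hostlists such as `scs[2042,2044-2045]` into individual node names. It assumes the seven-column `sinfo -o` format used above and handles only a single bracket group; Slurm's own `scontrol show hostnames` covers the full hostlist syntax. Note that the `%.20N` field width truncates long node lists (see the `compute*` rows), so widen it if you need complete names.

```python
import re

# States this guide treats as usable; "drain", "down" and starred states are skipped
USABLE_STATES = {"idle", "mix"}


def expand_hostlist(hostlist: str) -> list[str]:
    """Expand a simple Slurm hostlist such as 'scs[2042,2044-2045]'.

    Only one bracket group is supported; plain names are returned unchanged.
    """
    m = re.fullmatch(r"([^\[\]]+)\[([^\]]+)\]", hostlist)
    if m is None:
        return [hostlist]  # e.g. 'scs2004'
    prefix, body = m.groups()
    hosts = []
    for part in body.split(","):
        if "-" in part:
            start, end = part.split("-")
            width = len(start)  # keep zero padding such as 0009
            hosts.extend(f"{prefix}{i:0{width}d}" for i in range(int(start), int(end) + 1))
        else:
            hosts.append(prefix + part)
    return hosts


def usable_gpu_nodes(sinfo_text: str) -> dict[str, list[str]]:
    """Map each partition to its idle/mix nodes that advertise a GPU GRES."""
    result: dict[str, list[str]] = {}
    for line in sinfo_text.strip().splitlines()[1:]:  # skip the header row
        cols = line.split()
        if len(cols) != 7:
            continue  # not a data row in the expected format
        partition, _avail, _limit, _nodes, state, nodelist, gres = cols
        if state in USABLE_STATES and gres.startswith("gpu:"):
            result.setdefault(partition, []).extend(expand_hostlist(nodelist))
    return result
```

For the output shown above, `usable_gpu_nodes` should report `scs2041`, `scs2043`, `scs2042`, `scs2044` and `scs2045` for `accel_ai`. The text can be captured with `subprocess.run([...], capture_output=True, text=True).stdout`.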

# Start an interactive session
First, use `salloc` to reserve resources:
```sh
salloc --nodes=1 --account=scw1901 --partition=accel_ai --gres=gpu:1
```
Replace `scw1901` with your own project code (the `--account` value); your projects are listed at https://scw.bangor.ac.uk/en/projects/memberships/
 
As discussed earlier, any node in `accel_ai` will do, so Slurm assigns one that is free. `--gres=gpu:1` requests only 1 GPU.
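If you script your allocations, the flags can be assembled in one place so that only the GPU count changes between runs. This is a minimal sketch; `salloc_command` is a hypothetical helper and uses only the four flags from this guide. The resulting list could be passed to `subprocess.run`, although `salloc` then opens an interactive shell.

```python
import shlex


def salloc_command(account: str, partition: str, gpus: int, nodes: int = 1) -> list[str]:
    """Build the salloc argument list used in this guide (hypothetical helper)."""
    if gpus < 1:
        raise ValueError("request at least one GPU")
    return [
        "salloc",
        f"--nodes={nodes}",
        f"--account={account}",
        f"--partition={partition}",
        f"--gres=gpu:{gpus}",  # number of GPUs per node
    ]


# Print the command as it would be typed in the shell
print(shlex.join(salloc_command("scw1901", "accel_ai", gpus=1)))
```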

```sh
[s.1915438@sl1 experiment]$ salloc --nodes=1 --account=scw1901 --partition=accel_ai --gres=gpu:1
salloc: Granted job allocation 7133017
salloc: Waiting for resource configuration
salloc: Nodes scs2042 are ready for job
```

You can now see your allocation with `squeue`:
```sh
[s.1915438@sl1 experiment]$ squeue --user=s.1915438
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           7133017  accel_ai     bash s.191543  R      14:08      1 scs2042
```
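Inside the shell opened by `salloc`, Slurm also exports environment variables such as `SLURM_JOB_ID` and `SLURM_JOB_NODELIST`, which should match the `JOBID` and `NODELIST` columns of `squeue`. A minimal sketch for reading them from Python; `allocation_info` is a hypothetical helper name.

```python
import os
from typing import Mapping, Optional


def allocation_info(env: Mapping[str, str] = os.environ) -> Optional[dict]:
    """Return the job ID and node list of the current Slurm allocation.

    Returns None when the shell is not inside an allocation.
    """
    job_id = env.get("SLURM_JOB_ID")
    if job_id is None:
        return None
    return {"job_id": job_id, "nodelist": env.get("SLURM_JOB_NODELIST", "")}


if __name__ == "__main__":
    print(allocation_info() or "Not inside a Slurm allocation")
```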

# Loading Anaconda
`module avail` lists the available modules. Load Anaconda with `module load anaconda/3`, or type `module load ana` and press TAB to complete the rest of the name.

Once Anaconda is loaded, the `conda` command is available.
### Create a new Conda env for PyTorch (skip this if you already have one)
Use `conda create --name ml` to create a new Conda environment called `ml`; any name can be used.
### Activate the Conda env
First activate the base Conda env using `source activate`. Then type `conda env list` to see a list of Conda envs. Load the newly created Conda env using `conda activate ml`.
### Install PyTorch
* Go here: https://pytorch.org/get-started/locally/
* Get the command to install a stable PyTorch build with the latest CUDA.
* On 15th March 2022 the latest stable release was 1.11.0.
* Copy the command shown at the bottom of the selection table: `conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch`
* Run it in the `ml` Conda env and wait; installing PyTorch with CUDA 11.3 takes a while.
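Before requesting GPUs, it can be worth confirming that conda installed a CUDA-enabled build rather than a CPU-only one. `torch.version.cuda` holds the CUDA version PyTorch was built with, or `None` for CPU-only builds, so this check does not need a GPU and can run on the login node. A minimal sketch; `describe_build` is a hypothetical helper.

```python
def describe_build(torch_version: str, cuda_version) -> str:
    """Summarise a PyTorch build; cuda_version is None for CPU-only builds."""
    if cuda_version is None:
        return f"PyTorch {torch_version}: CPU-only build, GPUs will not be used"
    return f"PyTorch {torch_version}: built with CUDA {cuda_version}"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        print(describe_build(torch.__version__, torch.version.cuda))
```

With the install above, this should report CUDA 11.3.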

# Run a python file
## Create a file
Write any PyTorch script. For example, I created `gpu.py`, which checks whether CUDA is available and prints the GPU name.

```python
# gpu.py: report the PyTorch version and the GPU it can see
import torch

print(torch.__version__)
print(f"Is available: {torch.cuda.is_available()}")

try:
    print(f"Current Devices: {torch.cuda.current_device()}")
except Exception:  # PyTorch built without CUDA, or no GPU visible
    print('Current Devices: Torch is not compiled for GPU or No GPU')

print(f"No. of GPUs: {torch.cuda.device_count()}")

try:
    print(f"GPU Name:{torch.cuda.get_device_name(0)}")
except Exception:  # no GPU with index 0
    print('GPU Name: No GPU available')
```

## Transfer the Python file to Sunbird
The easiest way is to use SFTP. Here I use FileZilla, an SFTP client.

Open FileZilla and type `sftp://sunbird.swansea.ac.uk` into the Host box. Enter your username (e.g. s.1915438) and your university password in the Username and Password boxes, then transfer the Python script to the directory you want to work in. Next time, use `Server` > `Reconnect` to log in without retyping your details.

![FileZilla connected to Sunbird](attachment:bab0aeae-39a3-4ccd-b2c9-d081d7ab50cd.png)

## Run the python file in the interactive session
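In the Sunbird SSH session, go to the directory where you transferred `gpu.py` and run `srun python gpu.py`. The shell opened by `salloc` still runs on the login node (note `sl1` in the prompt), so `srun` is what launches the script on the allocated compute node.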
```sh
(ml) [s.1915438@sl1 experiment]$ srun python gpu.py 
1.11.0
Is available: True
Current Devices: 0
No. of GPUs: 1
GPU Name:NVIDIA A100-PCIE-40GB
```
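To see the difference `srun` makes, a small script can print where it runs. Run as `python where.py` (a hypothetical file name), it should show the login node's name (something like the `sl1` in the prompt); run as `srun python where.py`, it should show the compute node, such as `scs2042`. `CUDA_VISIBLE_DEVICES` is normally set by Slurm for GPU job steps and is typically unset on the login node. A minimal sketch:

```python
import os
import socket


def where_am_i() -> str:
    """Describe the host this process runs on and the GPUs exposed to it."""
    host = socket.gethostname()
    # Usually set by Slurm inside GPU job steps; typically unset on the login node
    gpus = os.environ.get("CUDA_VISIBLE_DEVICES", "not set")
    return f"host={host} CUDA_VISIBLE_DEVICES={gpus}"


if __name__ == "__main__":
    print(where_am_i())
```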

It took me hours to work out everything written here. When you are finished, type `exit` to end the interactive session and free the node.

# Two GPUs
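Exiting the interactive session also drops the `anaconda/3` module and the `ml` Conda env, because they were loaded inside that session. After allocating the GPUs, reload the `anaconda/3` module and activate the `ml` Conda env again.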

### Allocate 2 GPUs
```sh
[s.1915438@sl1 experiment]$ salloc --nodes=1 --account=scw1901 --partition=accel_ai --gres=gpu:2
salloc: Granted job allocation 7133023
salloc: Waiting for resource configuration
salloc: Nodes scs2042 are ready for job
```

### Modify `gpu.py` (e.g. with `nano`) to print both GPU names
```python
# gpu.py: report the PyTorch version and the first two GPUs it can see
import torch

print(torch.__version__)
print(f"Is available: {torch.cuda.is_available()}")

try:
    print(f"Current Devices: {torch.cuda.current_device()}")
except Exception:  # PyTorch built without CUDA, or no GPU visible
    print('Current Devices: Torch is not compiled for GPU or No GPU')

print(f"No. of GPUs: {torch.cuda.device_count()}")

# Print the names of GPUs 0 and 1, with a fallback if either is missing
for idx in (0, 1):
    try:
        print(f"GPU Name:{torch.cuda.get_device_name(idx)}")
    except Exception:
        print('GPU Name: No GPU available')
```

### Run the python script for 2 GPUs
```sh
(ml) [s.1915438@sl1 experiment]$ srun python gpu.py
1.11.0
Is available: True
Current Devices: 0
No. of GPUs: 2
GPU Name:NVIDIA A100-PCIE-40GB
GPU Name:NVIDIA A100-PCIE-40GB
```
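The script above hardcodes device indices 0 and 1. A sketch that adapts to however many GPUs were requested with `--gres=gpu:N` loops over `torch.cuda.device_count()` instead. It is written to degrade gracefully: without PyTorch, or without a visible GPU, it simply reports zero GPUs. `gpu_names` and `format_report` are hypothetical helper names.

```python
try:
    import torch
except ImportError:  # lets the script still run where PyTorch is missing
    torch = None


def gpu_names() -> list[str]:
    """Return the name of every GPU PyTorch can see (empty list if none)."""
    if torch is None or not torch.cuda.is_available():
        return []
    return [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]


def format_report(names: list[str]) -> str:
    """Format the GPU count followed by one line per device index."""
    lines = [f"No. of GPUs: {len(names)}"]
    lines += [f"GPU {i}: {name}" for i, name in enumerate(names)]
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_report(gpu_names()))
```

Run with `srun python` inside an allocation made with `--gres=gpu:2`, it should list two devices.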
