# Basic information
CSD3 runs Rocky Linux, a community rebuild of Red Hat Enterprise Linux (RHEL) that ships a relatively old Linux kernel. CSD3 has multiple generations of Intel Xeon processors, such as Cascade Lake, Ice Lake and Sapphire Rapids.

```{note}
From 24th March 2025 onwards, you will need to update the queue names in your submission scripts as follows:
* icelake -> ukaea-icl
* icelake-himem -> ukaea-icl-himem
* sapphire -> ukaea-spr
* ampere -> ukaea-amp
* sapphire-hbm -> ukaea-spr-hbm
* sapphire-hbm-flat -> ukaea-spr-hbm-flat

```
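
As a rough sketch, the renaming above could be applied to existing submission scripts with `sed`. The helper name `rename_partitions` is hypothetical, and the sketch assumes partitions are written in the `--partition=<name>` form; scripts using `-p <name>` would need an extra pattern.

```bash
# Hypothetical helper: rewrite old partition names to the new ukaea-* names.
# Only the prefix is replaced, so suffixes such as -himem, -hbm and -hbm-flat
# are carried over unchanged (icelake-himem becomes ukaea-icl-himem).
# Reads the named files, or standard input if no file is given.
rename_partitions() {
  sed -E \
    -e 's/(--partition=)icelake/\1ukaea-icl/' \
    -e 's/(--partition=)sapphire/\1ukaea-spr/' \
    -e 's/(--partition=)ampere/\1ukaea-amp/' \
    "$@"
}
```

For example, `rename_partitions submit.sh > submit.new.sh` writes an updated copy of a (hypothetical) `submit.sh`, leaving the original untouched so the result can be reviewed first.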
## Login nodes
There are multiple login nodes for load balancing: 4 Ice Lake generation login nodes, `login-q-1` to `login-q-4`, and 4 Cascade Lake generation login nodes, `login-p-1` to `login-p-4`.

We can directly ssh into the individual login nodes using the following command:

```bash
ssh username@login-q-1.hpc.cam.ac.uk
```
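
A small helper can make it easier to pick a specific login node. This is only a sketch: the function name `login_host` is hypothetical, and it assumes that the `login-p-*` nodes live under the same `hpc.cam.ac.uk` domain as `login-q-1`.

```bash
# Hypothetical helper: print the hostname of a login node.
#   $1: generation letter, q (Ice Lake) or p (Cascade Lake)
#   $2: node index, 1 to 4
login_host() {
  local gen="$1" idx="$2"
  case "$gen" in
    q|p) ;;
    *) echo "generation must be q (Ice Lake) or p (Cascade Lake)" >&2; return 1 ;;
  esac
  case "$idx" in
    [1-4]) ;;
    *) echo "index must be between 1 and 4" >&2; return 1 ;;
  esac
  printf 'login-%s-%s.hpc.cam.ac.uk\n' "$gen" "$idx"
}
```

It could then be used as `ssh "username@$(login_host q 2)"`.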
# Usage credits
We can check the usage credits using the following command:

```bash
[ir-shar8@login-p-1 ~]$ mybalance
User           Usage |        Account  |    Usage   | Account Limit | Available (hours) 
---------- --------- + ----------------+------------+ --------------+ -----------------
ir-shar8           0 | UKAEA-AP002-CPU | 53,185,628 |    60,808,908 | 7,623,280
ir-shar8           0 | UKAEA-AP002-GPU |   155,208  |       183,328 |   28,120
```

In this example, my own usage is 0 hours. The third column (`Usage`) shows the hours used by all users under the project. The following columns show the account limit and the remaining available hours, both shared across all users of the project.
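
To use the remaining balance in a script, the table can be parsed with `awk`. The sketch below assumes the column layout shown above (fields separated by `|`, thousands separated by commas); the function name `available_hours` is hypothetical.

```bash
# Hypothetical helper: read `mybalance` output on standard input and print
# the available hours for the given account as a plain integer.
#   $1: account name, e.g. UKAEA-AP002-GPU
available_hours() {
  awk -F'|' -v acct="$1" '
    { gsub(/[ \t]/, "", $2) }                           # trim the account column
    $2 == acct { gsub(/[ \t,]/, "", $5); print $5 }     # drop spaces and commas
  '
}
```

For example, `mybalance | available_hours UKAEA-AP002-GPU` would print the GPU hours still available to the project.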

## Partitions
`cpu-p`, `cpu-q` and `cpu-r` represent the Cascade Lake, Ice Lake and Sapphire Rapids nodes respectively. The GPU nodes are represented as `gpu`, and the `ampere` partition name signifies the NVIDIA A100 GPUs. Since we are only interested in machine learning, we will only discuss the `gpu` partition. All partition names with the `-long` suffix have an infinite time limit. One can check all of the above information using the `sinfo` command. The full documentation for the ampere partition is [here](https://docs.hpc.cam.ac.uk/hpc/user-guide/a100.html).
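
As an illustration, `sinfo` can be restricted to one partition and asked for only the relevant columns. The wrapper name `partition_summary` is hypothetical; the format specifiers are standard `sinfo` ones (`%P` partition, `%a` availability, `%l` time limit, `%D` node count, `%G` generic resources such as GPUs).

```bash
# Hypothetical wrapper: summarise a single Slurm partition.
#   $1: partition name, e.g. ampere (ukaea-amp from 24th March 2025 onwards)
partition_summary() {
  sinfo --partition="$1" --format="%P %a %l %D %G"
}
```

Running `partition_summary ampere` on a login node should list the time limit and GPU resources for that partition.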

Here are the specs of the GPU nodes:
* 2x AMD EPYC 7763 64-Core Processor 1.8GHz (128 cores in total)
* 1000 GiB RAM
* 4x NVIDIA A100-SXM-80GB GPUs

All 4 GPUs on a node can be used with an exclusive QOS.

# Jobs
## Interactive jobs
Here is a basic template for an interactive job:

```{note}
From 24th March 2025 onwards, the `ampere` partition will be replaced by the `ukaea-amp` partition.
```

```bash
srun --account=UKAEA-AP002-GPU --partition=ampere --gres=gpu:1 --nodes=1 --time=02:00:00  --pty bash -i
```
This might print an error such as `'abrt-cli status' timed out`:
```sh
[ir-shar8@login-p-1 ~]$ srun --account=UKAEA-AP002-GPU --partition=ampere --gres=gpu:1 --nodes=1 --time=02:00:00 --pty bash -i
srun: job 5096996 queued and waiting for resources
srun: job 5096996 has been allocated resources
'abrt-cli status' timed out
[ir-shar8@gpu-q-8 ~]$ nvidia-smi
Fri Feb 14 00:18:59 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.03             Driver Version: 550.144.03     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  |   00000000:81:00.0 Off |                    0 |
| N/A   39C    P0             67W /  500W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
```
However, if the prompt changes from `login` to `gpu` (here, `gpu-q-8`), the job is running on a GPU node. The output also shows that the GPU driver supports CUDA version 12.4, which is worth keeping in mind when installing PyTorch or TensorFlow.
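
To pick up the CUDA version programmatically, the `nvidia-smi` header can be filtered with `sed`. This is a minimal sketch that assumes the header format shown above; the function name `cuda_version` is hypothetical.

```bash
# Hypothetical helper: read `nvidia-smi` output on standard input and print
# the CUDA version reported in its header line.
cuda_version() {
  sed -n 's/.*CUDA Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}
```

On a GPU node, `nvidia-smi | cuda_version` would print the version, which can then be matched against the CUDA build of the framework being installed.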

## Batch jobs
Here is a basic template for a batch job:

```bash
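#!/bin/bash
# Account to charge and partition to run on
# (use ukaea-amp instead of ampere from 24th March 2025 onwards)
#SBATCH --account=UKAEA-AP002-GPU
#SBATCH --partition=ampere
# Request one node with one GPU
#SBATCH --nodes=1
#SBATCH --gres=gpu:1

# Record a greeting, the compute node's hostname and the GPU status
echo "Hello from CSD3!" > output.txt
hostname >> output.txt
nvidia-smi >> output.txt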
```
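
The template can be submitted with `sbatch` and monitored with `squeue`. The sketch below wraps both steps; the function name `submit_job` and the script name `submit.sh` are hypothetical.

```bash
# Hypothetical helper: submit a batch script and show its queue entry.
#   $1: path to the batch script
submit_job() {
  local script="$1" jobid
  # --parsable makes sbatch print only the job ID (optionally ";cluster")
  jobid=$(sbatch --parsable "$script") || return 1
  jobid=${jobid%%;*}   # strip the cluster name if present
  echo "Submitted job ${jobid}"
  squeue --jobs="$jobid"
}
```

For example, `submit_job submit.sh` submits the script and prints its current state in the queue. Once the job finishes, `output.txt` should appear in the directory the job was submitted from.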


## Interactive job commands

### GPU
```bash
srun --account=UKAEA-AP002-GPU --partition=ukaea-amp --gres=gpu:1 --nodes=1 --time=03:30:00 --pty bash -i
```

### CPU
```bash
srun --account=UKAEA-AP002-CPU --partition=ukaea-icl --nodes=1  --time=03:30:00 --pty bash -i
```
