# HPC intro

## atools

There are many situations in which you want to run an application for (potentially many) different input parameters.  These parameters can be command line options you run your application with, or file names you provide and so on.

Of course, you could submit a job for each of the instances of your problem, but that would result in many jobs.  Moreover, quite some bookkeeping would be required if some instances fail while others succeed.  You typically don't have a convenient way to get an overview of which instances failed and hence have to be redone.

Alternatively, you could simply run all these instances sequentially, looping over all the parameters.  This would result in potentially prohibitively long run times and, more importantly, you would not be exploiting a supercomputer's main feature: executing work in parallel.

atools has been designed to make it easy for you to run many instances of a problem in parallel, and it takes care of the bookkeeping for you as well.  An instance of the problem that you want to compute is called a *work item* in the context of atools.

### Job script

The first step is to make a few modifications to your job script.  As an example, take the script that simply calculates and displays the product of two numbers, which you also used in the [tutorial on jobs](020_jobs.ipynb).

```bash
#!/usr/bin/env -S bash -l
#SBATCH --account=lp_multiscale_physics
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=1g
#SBATCH --time=00:05:00

# actual computation, a bit boring
for i in $(seq 1 10)
do
    for j in $(seq 1 10)
    do
        echo $(( $i * $j ))
    done
done
```

In this job script, you do all computations sequentially, but to speed things up, you would like to do them in parallel as independent jobs.  So you can rewrite the job script such that it only does a single multiplication.

```bash
#!/usr/bin/env -S bash -l
#SBATCH --account=lp_multiscale_physics
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=1g
#SBATCH --time=00:05:00

# actual computation, a bit boring
echo $(( $i * $j ))
```

The job script has been adapted to compute a single work item.

This is where atools comes in.  You can make a few more modifications to this job script to use it.  The values of `i` and `j` will be read from a Comma Separated Value file (CSV file).

The first line of this file lists the names of the variables; each line after that lists the values for one work item.  So for this example, that would look as follows.

```
i,j
1,1
1,2
1,3
...
10,8
10,9
10,10
```

You don't have to type all that: a data file `data.csv` is available in the `021_artefacts` directory, and you can simply copy it.
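If you would rather generate such a file yourself, for instance for a different range of values, the nested loops of the original job script can be reused to write it.  The following is a minimal sketch; the function name `generate_grid_csv` and the file name `my_data.csv` are made up for this illustration and are not part of atools.

```bash
#!/usr/bin/env bash

# hypothetical helper: print a CSV with the header i,j followed by all
# pairs (i, j) for i and j running from 1 to the given upper bound
generate_grid_csv() {
    local n="$1"
    local i j
    echo "i,j"
    for i in $(seq 1 "$n"); do
        for j in $(seq 1 "$n"); do
            echo "$i,$j"
        done
    done
}

# write to a scratch directory so that an existing data.csv is not overwritten
scratch_dir=$(mktemp -d)
generate_grid_csv 10 > "$scratch_dir/my_data.csv"

# show the header and the first two work items
head -n 3 "$scratch_dir/my_data.csv"
```

For an upper bound of 10, the resulting file has 101 lines: the header plus 100 work items.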

As it is, this script would fail since at this point the variables `i` and `j` are not defined.  You have to make sure that atools can do its magic.  For that purpose, you have to make a few more modifications to the job script.

1. Load the `atools` module.
2. Log the start of the work item.
3. Make sure that the variables used in the script are initialized.
4. Log the end of the work item.

```bash
#!/usr/bin/env -S bash -l
#SBATCH --account=lp_multiscale_physics
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=1g
#SBATCH --time=00:05:00

# make sure the module system starts from a clean slate and load the atools module
module purge
module load atools

# log the start of the work item
alog  --state start

# initialize the variables
source <(aenv --data data.csv)

# actual computation, a bit boring
echo $(( $i * $j ))

# log the end of the work item
alog  --state end  --exit $?
```

Now your job script is fully adapted to use atools features.  It is available in the `021_artefacts` directory as `jobscript_parallel.slurm`.   Don't forget to change the credit account name to the one you have access to.
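To make the "magic" a bit less opaque, the sketch below mimics the effect of the `aenv` line in plain shell: it looks up the line of the CSV file that belongs to the current work item and turns it into variable assignments.  This is an illustration only, not how atools is implemented.  The helper `csv_row_to_assignments` is hypothetical, the sketch assumes values without commas or quotes, and it assumes that the work item number is the Slurm array task ID (`SLURM_ARRAY_TASK_ID`).  It uses `eval "$(...)"`, which here has the same effect as `source <(...)`.

```bash
#!/usr/bin/env bash

# hypothetical helper, for illustration only: print shell assignments for
# the work item with the given number, taken from the given CSV file
csv_row_to_assignments() {
    local csv_file="$1"
    local item="$2"
    local header row n_cols col
    header=$(sed -n '1p' "$csv_file")
    # line 1 is the header, so work item N is on line N + 1
    row=$(sed -n "$(( item + 1 ))p" "$csv_file")
    # number of columns = number of commas in the header + 1
    n_cols=$(( $(printf '%s' "$header" | tr -cd ',' | wc -c) + 1 ))
    col=1
    while [ "$col" -le "$n_cols" ]; do
        printf "%s='%s'\n" \
            "$(printf '%s' "$header" | cut -d ',' -f "$col")" \
            "$(printf '%s' "$row" | cut -d ',' -f "$col")"
        col=$(( col + 1 ))
    done
}

# small stand-in for data.csv, so that the sketch runs anywhere
demo_csv=$(mktemp)
printf 'i,j\n1,1\n1,2\n1,3\n' > "$demo_csv"

# in an array job Slurm sets SLURM_ARRAY_TASK_ID; fall back to 3 for the demo
task_id="${SLURM_ARRAY_TASK_ID:-3}"

# define i and j in the current shell, then do the computation
eval "$(csv_row_to_assignments "$demo_csv" "$task_id")"
echo "work item $task_id: $i * $j = $(( i * j ))"
```

In a real job script you would keep using `aenv`; the point of the sketch is only that each array task ends up with its own values of `i` and `j`.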

### Job submission

You can submit an atools job almost the same way as an ordinary job, except that you need to specify the `--array` option for `sbatch`.  If you know the number of work items (100 for the `data.csv` file you are using), you can simply use `--array=1-100`.  Otherwise, atools can help you determine it easily.

First, load the atools module.

In [1]:
module load atools

Next, submit the job as follows.

In [14]:
sbatch  --cluster=wice  --array=$(arange --data data.csv)  jobscript_parallel.slurm

Submitted batch job 60666074 on cluster wice
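For a fresh data file, the range passed to `--array` is simply 1 up to the number of data lines.  As a cross-check, the sketch below derives that range from a CSV file in plain shell.  The helper `csv_array_range` is hypothetical and only covers this simple case; it is not a replacement for `arange`.

```bash
#!/usr/bin/env bash

# hypothetical helper: print the array range 1-N for a CSV file that has
# a header line followed by N work items
csv_array_range() {
    local csv_file="$1"
    local n_lines
    # count non-empty lines, so a trailing blank line is not a work item
    n_lines=$(grep -c . "$csv_file")
    # subtract the header line
    echo "1-$(( n_lines - 1 ))"
}

# small stand-in for data.csv with three work items
demo_csv=$(mktemp)
printf 'i,j\n1,1\n1,2\n1,3\n' > "$demo_csv"
csv_array_range "$demo_csv"   # prints 1-3
```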


When you check the queue, you will notice that
* when your job is not running yet, the job ID is somewhat unusual, `_[1-100]` is appended to it;
* when your job has started to run, you will see many entries where a single number was appended to the job ID; these are the individual work items.

In [15]:
squeue --cluster=wice

CLUSTER: wice
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  60666074_[1-100]     batch jobscrip vsc30032 PD       0:00      1 (Priority)
          60665918 interacti sys/dash vsc30032  R      15:18      1 k28i14


In [16]:
squeue --cluster=wice

CLUSTER: wice
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  60666074_[1-100]     batch jobscrip vsc30032 PD       0:00      1 (Priority)
          60665918 interacti sys/dash vsc30032  R      15:22      1 k28i14


In [27]:
squeue --cluster=wice

CLUSTER: wice
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 60666074_[21-100]     batch jobscrip vsc30032 PD       0:00      1 (Priority)
        60666074_1     batch jobscrip vsc30032  R       0:02      1 m28c20n2
        60666074_2     batch jobscrip vsc30032  R       0:02      1 s28c11n2
        60666074_3     batch jobscrip vsc30032  R       0:02      1 n28c30n1
        60666074_4     batch jobscrip vsc30032  R       0:02      1 n28c30n1
        60666074_5     batch jobscrip vsc30032  R       0:02      1 n28c30n3
        60666074_6     batch jobscrip vsc30032  R       0:02      1 p33c30n1
        60666074_7     batch jobscrip vsc30032  R       0:02      1 p33c30n1
        60666074_8     batch jobscrip vsc30032  R       0:02      1 p33c30n2
        60666074_9     batch jobscrip vsc30032  R       0:02      1 p33c32n2
       60666074_10     batch jobscrip vsc30032  R       0:02      1 p33c32n2
       60666074_11     batch jobscrip vsc30032  R   

In [43]:
squeue --cluster=wice

CLUSTER: wice
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
       60666074_81     batch jobscrip vsc30032  R       0:13      1 m28c20n2
       60666074_82     batch jobscrip vsc30032  R       0:13      1 s28c11n2
       60666074_83     batch jobscrip vsc30032  R       0:13      1 n28c30n1
       60666074_84     batch jobscrip vsc30032  R       0:13      1 n28c30n1
       60666074_85     batch jobscrip vsc30032  R       0:13      1 n28c30n3
       60666074_86     batch jobscrip vsc30032  R       0:13      1 n28c30n3
       60666074_87     batch jobscrip vsc30032  R       0:13      1 n28c30n3
       60666074_88     batch jobscrip vsc30032  R       0:13      1 n28c30n3
       60666074_89     batch jobscrip vsc30032  R       0:13      1 n28c30n3
       60666074_90     batch jobscrip vsc30032  R       0:13      1 n28c30n3
       60666074_91     batch jobscrip vsc30032  R       0:13      1 n28c30n3
       60666074_92     batch jobscrip vsc30032  R     

In [44]:
squeue --cluster=wice

CLUSTER: wice
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          60665918 interacti sys/dash vsc30032  R      20:34      1 k28i14



When the job finishes, you will notice a lot of files, each containing the output of a single work item.

In [45]:
ls slurm-*

slurm-60666074_100.out	slurm-60666074_40.out  slurm-60666074_71.out
slurm-60666074_10.out	slurm-60666074_41.out  slurm-60666074_72.out
slurm-60666074_11.out	slurm-60666074_42.out  slurm-60666074_73.out
slurm-60666074_12.out	slurm-60666074_43.out  slurm-60666074_74.out
slurm-60666074_13.out	slurm-60666074_44.out  slurm-60666074_75.out
slurm-60666074_14.out	slurm-60666074_45.out  slurm-60666074_76.out
slurm-60666074_15.out	slurm-60666074_46.out  slurm-60666074_77.out
slurm-60666074_16.out	slurm-60666074_47.out  slurm-60666074_78.out
slurm-60666074_17.out	slurm-60666074_48.out  slurm-60666074_79.out
slurm-60666074_18.out	slurm-60666074_49.out  slurm-60666074_7.out
slurm-60666074_19.out	slurm-60666074_4.out   slurm-60666074_80.out
slurm-60666074_1.out	slurm-60666074_50.out  slurm-60666074_81.out
slurm-60666074_20.out	slurm-60666074_51.out  slurm-60666074_82.out
slurm-60666074_21.out	slurm-60666074_52.out  slurm-60666074_83.out
slurm-60666074_22.out	slurm-60666074_53.out  slurm-60666074_84.

In [46]:
cat slurm-*_100.out

SLURM_JOB_ID: 60666074
SLURM_JOB_USER: vsc30032
SLURM_JOB_ACCOUNT: lpt2_sysadmin
SLURM_JOB_NAME: jobscript_parallel.slurm
SLURM_CLUSTER_NAME: wice
SLURM_JOB_PARTITION: batch
SLURM_ARRAY_JOB_ID: 60666074
SLURM_ARRAY_TASK_ID: 100
SLURM_NNODES: 1
SLURM_NODELIST: n28c30n3
SLURM_JOB_CPUS_PER_NODE: 1
Date: Thu Aug 31 11:51:21 CEST 2023
Walltime: 00-00:05:00
100


Indeed, work item 100 would be the multiplication of the values 10 and 10.  It is the last line in `data.csv` and hence your last work item.
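With one output file per work item, you may want to gather the results into a single overview.  The sketch below is one possible way to do that.  It assumes that the files follow the `slurm-<job ID>_<work item>.out` naming shown above and that the result is the last line of each file; the helper `collect_results` and the job ID `12345` are made up for the demonstration.

```bash
#!/usr/bin/env bash

# hypothetical helper: print "work item,result" for work items 1..N,
# taking the last line of each output file as the result
collect_results() {
    local out_dir="$1"
    local job_id="$2"
    local n_items="$3"
    local item=1
    local file
    while [ "$item" -le "$n_items" ]; do
        file="$out_dir/slurm-${job_id}_${item}.out"
        if [ -f "$file" ]; then
            echo "$item,$(tail -n 1 "$file")"
        else
            # no output file found for this work item
            echo "$item,missing"
        fi
        item=$(( item + 1 ))
    done
}

# fake output files for two out of three work items, for demonstration
demo_dir=$(mktemp -d)
printf 'SLURM_ARRAY_TASK_ID: 1\n1\n' > "$demo_dir/slurm-12345_1.out"
printf 'SLURM_ARRAY_TASK_ID: 3\n3\n' > "$demo_dir/slurm-12345_3.out"
collect_results "$demo_dir" 12345 3
```

Looping over the work item numbers, rather than over `slurm-*.out`, keeps the overview in numerical order and makes missing output files visible.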

Since this is just a tutorial job, you probably would like to remove these files.

In [47]:
rm slurm-*.out

### Log file

You will remember that you enabled logging using the `alog` commands in `jobscript_parallel.slurm`.  This has resulted in a file that records, for each work item:

  1. its number,
  1. when it started,
  1. the name of the compute node it was executed on,
  1. when it completed, and
  1. the exit status, if there was a failure.
  
The name of this log file is the name of your job, with `.log` appended, followed by the job ID.  The command below shows you the first 20 lines.

In [48]:
head -20 jobscript_parallel.slurm.log*

==> jobscript_parallel.slurm.log60665928 <==
2 started by s28c11n2 at 2023-08-31 11:37:26
2 completed by s28c11n2 at 2023-08-31 11:37:27
26 started by n28c30n3 at 2023-08-31 11:37:27
26 completed by n28c30n3 at 2023-08-31 11:37:27
1 started by m28c20n2 at 2023-08-31 11:37:27
1 completed by m28c20n2 at 2023-08-31 11:37:27
9 started by n28c30n1 at 2023-08-31 11:37:27
20 started by n28c30n1 at 2023-08-31 11:37:27
9 completed by n28c30n1 at 2023-08-31 11:37:27
20 completed by n28c30n1 at 2023-08-31 11:37:27
40 started by p33c20n1 at 2023-08-31 11:37:27
86 started by p33c20n1 at 2023-08-31 11:37:27
40 completed by p33c20n1 at 2023-08-31 11:37:27
10 started by n28c30n1 at 2023-08-31 11:37:27
86 completed by p33c20n1 at 2023-08-31 11:37:28
10 completed by n28c30n1 at 2023-08-31 11:37:28
37 started by p33c20n1 at 2023-08-31 11:37:28
37 completed by p33c20n1 at 2023-08-31 11:37:28
14 started by n28c30n1 at 2023-08-31 11:37:28
88 started by p33c30n1 at 2023-08-31 11:37:28

==> jobscript_parallel

You could of course eyeball this file to determine whether all work items completed successfully, but atools has a command to simplify that considerably.

In [49]:
arange  --data data.csv  --log jobscript_parallel.slurm.log*  --summary

Summary:
  items completed: 100
  items failed: 0
  items to do: 0
