
Gadi (NCI) Useful links, commands, and workflows



Know-how acquired so far

Project's budget

Use the nci_account command to check the amount of KSUs granted, used, etc.
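For example (the -P flag for selecting a project is an assumption based on NCI's usual command conventions; a99 is a placeholder project code):

# Report granted/used KSUs for your default project
nci_account

# Report for a specific project (flag assumed; a99 is a placeholder)
nci_account -P a99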

Disk quota

It is important to check whether the number of files and the total disk space used by the project members are close to the disk quota.

The commands to analyse file usage on the /scratch and /g/data filesystems are:

nci-files-report

lquota

For the /home filesystem, the following command should work:

quota -s
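Putting the three commands together, a quick quota check from a login node looks like this:

# /scratch and /g/data: project-level quota and usage
lquota

# /scratch and /g/data: per-user file counts and sizes
nci-files-report

# /home: personal quota and usage
quota -s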

Using the Julia installation on the cluster

Simply use the following commands:

module unload intel-mkl
module load julia

After executing these, you should be able to run, e.g., the julia --version command successfully.

[am6349@gadi-login-01 ~]$ julia --version
julia version 1.6.1
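The same two module commands can also go in the body of a batch job script (see the serial workflow below), so that the cluster-provided Julia is used at run time; a minimal sketch, where my_test.jl is a hypothetical script name:

# Load the cluster's Julia module (same commands as above)
module unload intel-mkl
module load julia

# Run a (hypothetical) script with the module-provided julia
julia my_test.jl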

Downloading Julia on the cluster

To install Julia in your own directory on the cluster, first log in to Gadi. Open a terminal and run: ssh abc123@gadi.nci.org.au

Change to the directory where you would like to install Julia (installing it under your /home directory is recommended) and execute the command: wget https://julialang-s3.julialang.org/bin/linux/x64/1.4/julia-1.4.2-linux-x86_64.tar.gz

This link is obtained from https://julialang.org/downloads/ (look here if a different Julia version is required).

Finally, untar it: tar xvzf julia-1.4.2-linux-x86_64.tar.gz
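Putting the above steps together, a complete installation session might look as follows (the final verification line is an addition, assuming the archive extracts to julia-1.4.2/):

# Install under /home, as recommended above
cd ~
wget https://julialang-s3.julialang.org/bin/linux/x64/1.4/julia-1.4.2-linux-x86_64.tar.gz
tar xvzf julia-1.4.2-linux-x86_64.tar.gz

# Verify the installation (path assumes the archive's default layout)
~/julia-1.4.2/bin/julia --version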

Workflow for Gridap.jl (serial computations)

In this section, the steps for running a serial Julia process non-interactively on a single Gadi node are described.

Job script

To submit any job to the cluster, whether serial or parallel, we must write a shell script (which we will refer to as the job script) detailing the important specifications of our job. This is mainly so the cluster's job scheduler (PBS) can appropriately allocate the required resources. A template for this job_script.sh is shown below:

#!/bin/bash
#PBS -P a99
#PBS -q normal 
#PBS -l walltime=00:30:00
#PBS -l ncpus=1
#PBS -l mem=4gb
#PBS -N my_test.jl 
#PBS -l software=Gridap.jl
#PBS -o /scratch/a99/abc123/stdout.txt
#PBS -e /scratch/a99/abc123/stderr.txt 
#PBS -l wd

dir=/scratch/a99/abc123/<PATH_WHERE_YOU_WANT_TO_KEEP_DATA>
cd $dir

<PATH_TO_YOUR_INSTALLATION_OF_JULIA> <PATH_TO_YOUR_JULIA_SCRIPT>

The script is divided into two parts, the header and the body. The header lines are prefixed with #PBS and include the specifications for configuring the job request:

  • #PBS -P a99 With -P we specify the project ID, a99 in this example.

  • #PBS -q normal With -q we specify the queue we would like to enter. The option normal is used here for a regular priority job.

  • #PBS -l walltime=00:30:00 With this option we specify the maximum (wall-clock) time the job may run. Only the walltime actually used is charged to the project; however, if the job exceeds the limit specified here, it is terminated.

  • #PBS -l ncpus=1 #PBS -l mem=4gb With these options we specify the number of CPUs and the amount of memory required. For a serial job, we only require 1 CPU. Each CPU comes with 4GB of memory, which is enough for most serial jobs. However, if we require more memory, we can specify it as:

    #PBS -l ncpus=1
    #PBS -l mem=8gb
    

    Note that, in this case, we will be charged for the use of 2 CPUs (8GB) even if we use less than 4GB of memory. In contrast to the walltime, these resources are charged based on the header information and not on what we actually use (see the worked example after this list). A guide on how resources are charged is given here: https://opus.nci.org.au/display/Help/Preparing+for+Gadi#PreparingforGadi-JobCharging-Examples

  • #PBS -N my_test.jl With -N we give the job a name.

  • #PBS -l software=Gridap.jl With -l software we declare the software the job uses, so that Gadi can account for software usage.

  • #PBS -o /scratch/a99/abc123/stdout.txt #PBS -e /scratch/a99/abc123/stderr.txt The output of the program (messages printed to screen) and any error messages are written to files. With -o and -e we set their locations. a99 and abc123 should be replaced with your project ID and user ID, respectively.

  • #PBS -l wd This option sets the working directory to that from which the job was submitted.
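As a worked example of the charging rule mentioned above: a job requesting ncpus=1 and mem=8gb is charged as if it used 2 CPUs. Assuming the normal queue's charge rate of 2 SUs per CPU-hour, a job that runs for 30 minutes then costs 2 CPUs × 2 SU/CPU-hour × 0.5 h = 2 SUs, regardless of how much memory it actually consumed.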

The body of the job script, in contrast, is a regular Unix shell script. In this particular example:

  • dir=/scratch/a99/abc123/<PATH_WHERE_YOU_WANT_TO_KEEP_DATA> cd $dir Here we use standard Unix shell commands to change the working directory again, if desired. a99 and abc123 should be replaced with the project ID and user ID, respectively.

  • <PATH_TO_YOUR_INSTALLATION_OF_JULIA> <PATH_TO_YOUR_JULIA_SCRIPT> Finally, we execute the Julia script. This line should look something like: /home/565/abc123/julia-1.4.2/bin/julia /scratch/a99/abc123/my_test.jl

Job script submission

After writing the job script, we submit it on Gadi.

To log in, open a terminal and run: ssh abc123@gadi.nci.org.au

Once logged into the cluster, submit the freshly written job script using the command: qsub job_script.sh

To check on the progress of the job, use the command: qstat
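A typical submit-and-monitor session might look as follows; the job ID below is made up, and the -swx flags for a detailed per-job query are an assumption (plain qstat is sufficient):

# Submit the job; qsub prints the assigned job ID (made up here)
qsub job_script.sh
# -> 12345678.gadi-pbs

# List your queued and running jobs
qstat

# Detailed status of a single job (flags assumed)
qstat -swx 12345678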

Workflow for GridapDistributed.jl (parallel computations)

Work in progress ...

Known issues

OpenMPI 4.x.x

The (currently) newest OpenMPI releases (4.x.x) experience issues when calling the HCOLL library, which is used for collective communications. This problem will probably be fixed in OpenMPI 5.0.0+, but until then some workarounds are useful to avoid problems.

As a first option, one can try setting the following environment variables:

export HCOLL_ML_DISABLE_SCATTERV=1
export HCOLL_ML_DISABLE_BCAST=1

or, as a last resort, disable the library completely by running mpiexec with the flag -mca coll_hcoll_enable 0.
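For concreteness, both workarounds are sketched below; the process count and the script name my_parallel_test.jl are hypothetical placeholders:

# Workaround 1: disable only the problematic HCOLL collectives
export HCOLL_ML_DISABLE_SCATTERV=1
export HCOLL_ML_DISABLE_BCAST=1
mpiexec -n 48 julia my_parallel_test.jl

# Workaround 2 (last resort): disable the HCOLL library completely
mpiexec -mca coll_hcoll_enable 0 -n 48 julia my_parallel_test.jl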