# Using HIP on Setonix

## Access to Setonix

```bash
ssh -Y username@setonix.pawsey.org.au
```

### SSH config

```text
Host setonix
    Hostname setonix.pawsey.org.au
    IdentityFile <private_key_file>
    User <username>
    ForwardX11 yes
    ForwardAgent yes
    ServerAliveInterval 300
    ServerAliveCountMax 2
    TCPKeepAlive no
```



## Hardware environment

<figure style="margin: 1em; margin-left:auto; margin-right:auto; width:100%;">
    <img src="../images/MI250x.png">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">AMD Instinct<span>&trade;</span> MI250X compute architecture. Image credit: <a href="https://hc34.hotchips.org/">AMD Instinct<span>&trade;</span> MI200 Series Accelerator and Node Architectures | Hot Chips 34</a></figcaption>
</figure>



| Compute device | Theoretical FP32 processing power (TFlop/s) |
| :--- | ---: |
| AMD EPYC 7763 | 1.3 |
| AMD Instinct MI250X | 47.9 |

| Computer | CPU | Base clock frequency (GHz) | Cores | Hardware threads | L1 cache (KB) | L2 cache (KB) | L3 cache (MB) | FP SIMD width (bits) | TFlop/s (FP32 calculated) |
|:----:|:----:|-----:| -----: | -----: | :----: | :----: | :----: | :----: | :----: |
| Magnus | Intel Xeon E5-2690 v3 | 2.6 | 12 | 24 | 12x32 | 12x256 | 30 | 256 | 0.25 |
| Setonix | AMD EPYC 7763 | 2.45 | 64 | 128 | 64x32 | 64x512 | 8x32 | 256 | 1.3 |

| Card | Boost clock (GHz) | Compute units | FP32 processing elements | FP64 processing elements (equivalent compute capacity) | L1 cache (KB) | L2 cache (KB) | Device memory (GB) | Peak TFlop/s (FP32) | Peak TFlop/s (FP64) |
|:----:|:-----| :----- | :----- | :---- | :---- | :---- | :---- | :---- | :---- |
| NVIDIA Tesla V100 | 1.530 | 80 | 5120 | 2560 | 80x96 | 6144 | 16 | 15.7 | 7.8 |
| NVIDIA A100 | 1.410 | 108 | 6912 | 3456 | 108x164 | 40960 | 40 | 19.5 | 9.7 |
| AMD Instinct MI250 | 1.7 | 208 | 13312 | 13312 | 208x16 | 16000 | 128 | 45.3 | 45.3 |
| AMD Instinct MI250X | 1.7 | 220 | 14080 | 14080 | 220x16 | 16000 | 128 | 47.9 | 47.9 |
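
The peak figures in the tables above appear to follow from simple arithmetic. As a rough sketch, assuming (this is inferred from the table values, not stated in the source) that the CPU figure counts one FP32 operation per SIMD lane per clock cycle and that the GPU figure counts a fused multiply-add as two operations per processing element per cycle, the numbers can be reproduced as follows. The function names are hypothetical.

```bash
#!/usr/bin/env bash

# Peak CPU throughput in TFlop/s, assuming one FP32 operation per SIMD lane per cycle.
# Arguments: cores, clock frequency (GHz), SIMD width (bits).
cpu_peak_tflops() {
    awk -v cores="$1" -v ghz="$2" -v simd="$3" \
        'BEGIN { printf "%.2f\n", cores * ghz * (simd / 32) / 1000 }'
}

# Peak GPU throughput in TFlop/s, assuming a fused multiply-add counts as two operations.
# Arguments: processing elements, boost clock (GHz).
gpu_peak_tflops() {
    awk -v pes="$1" -v ghz="$2" \
        'BEGIN { printf "%.1f\n", pes * ghz * 2 / 1000 }'
}

cpu_peak_tflops 64 2.45 256   # AMD EPYC 7763
gpu_peak_tflops 14080 1.7     # AMD Instinct MI250X
```

Under these assumptions the script prints 1.25 and 47.9, which agree with the table entries for the EPYC 7763 and the MI250X after rounding.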


## Job queues

On Setonix the following queues are available for general use:

| Queue | Max time limit | Processing elements (CPU) | Sockets | Cores per socket | Processing elements per CPU core | Host memory (GB) | Number of GPUs | Memory per GPU (GB) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| work | 24 hours | 256 | 2 | 64 | 2 | 256 | 0 | 0 |
| long | 96 hours | 256 | 2 | 64 | 2 | 256 | 0 | 0 |
| debug | 1 hour | 256 | 2 | 64 | 2 | 256 | 0 | 0 |
| himem | 24 hours | 256 | 2 | 64 | 2 | 1000 | 0 | 0 |
| gpu | 24 hours | 128 | 1 | 64 | 2 | 128 | 4 | 128 |

## Interactive jobs on GPU nodes

```bash
salloc --account ${PAWSEY_PROJECT} --ntasks 1 --mem 4GB --cpus-per-task 1 --time 1:00:00 --partition gpu
```
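
The command above selects the gpu partition but does not explicitly ask for a GPU. A minimal sketch, assuming GPUs on Setonix are requested with Slurm's generic-resource option `--gres=gpu:<n>` (check the current Pawsey documentation, which may also require a GPU-specific account name), is a small helper that prints the command for review before you run it. The helper name `gpu_salloc_cmd` and the fallback account `myproject` are hypothetical.

```bash
#!/usr/bin/env bash

# Hypothetical helper: print (do not run) an salloc command for the gpu partition.
# Arguments: Slurm account, number of GPUs (default 1), time limit (default 1:00:00).
gpu_salloc_cmd() {
    local account="$1" ngpus="${2:-1}" walltime="${3:-1:00:00}"
    echo "salloc --account ${account} --partition gpu --nodes 1 --gres=gpu:${ngpus} --time ${walltime}"
}

# Print the command for a one-GPU, one-hour interactive session.
gpu_salloc_cmd "${PAWSEY_PROJECT:-myproject}" 1 1:00:00
```

Copy the printed line into the terminal once the account and resources look right.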

## The Pawsey software environment

### Compiler and MPI environment

There are three main programming environments available. Each provides C/C++ and Fortran compilers that build software with knowledge of the MPI libraries available on Setonix. The **PrgEnv-gnu** programming environment uses the GNU compilers, **PrgEnv-aocc** uses the AMD AOCC optimising compilers to get the best performance from the AMD CPUs on Setonix, and **PrgEnv-cray** uses the compilers from Cray. You can use these commands to find which module to load.

| Programming environment | Command to use |
| :--- | :--- |
| AMD | ```module avail PrgEnv-aocc``` |
| Cray | ```module avail PrgEnv-cray``` |
| GNU | ```module avail PrgEnv-gnu``` |

Once a programming environment is loaded, the following compiler wrappers are available for compiling source files:

| Command | Explanation |
| :--- | :--- |
| cc | C compiler |
| CC | C++ compiler |
| ftn | Fortran compiler |

To use an MPI library that is also aware of the GPUs, you also need to load the **craype-accel-amd-gfx90a** module. To see which version to load, run this command:

```bash
module avail craype-accel-amd-gfx90a
```

Load the **craype-accel-amd-gfx90a** module, then enable GPU support in Cray MPICH by setting this environment variable:

```bash
export MPICH_GPU_SUPPORT_ENABLED=1
```

### Compiling software with HIP and MPI

According to this [documentation](https://docs.amd.com/bundle/HIP-Programming-Guide-v5.0/page/Transitioning_from_CUDA_to_HIP.html), the AMD compiler wrapper **hipcc** can be used for compiling HIP source files and is the suggested linker for program objects. ROCm tools like **hipcc** and the debugger **rocgdb** are available with the **rocm** module. To see which **rocm** module to load, run this command:

```bash
module avail rocm
```

#### Compiling and linking with the **hipcc** compiler wrapper

When using **hipcc** to compile HIP source files or link code objects, you can use these flags to bring in the MPI headers and libraries and to make sure **hipcc** compiles kernels for the MI250X architecture on Setonix.

| Function | Flags |
| :--- | :--- |
| Compile | ```-I${MPICH_DIR}/include --offload-arch=gfx90a``` |
| Link | ```-L${MPICH_DIR}/lib -lmpi ${PE_MPICH_GTL_DIR_amd_gfx90a} ${PE_MPICH_GTL_LIBS_amd_gfx90a}``` |
| Debug (compile and link) | ```-ggdb``` |
| OpenMP (compile and link)| ```-fopenmp``` |
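
As a sketch of how the flags in the table above fit together, the following script compiles and then links a single source file with **hipcc**. It assumes the **rocm** and **craype-accel-amd-gfx90a** modules are loaded so that `MPICH_DIR` and the `PE_MPICH_GTL_*` variables are defined. The file name `my_app.cpp` is hypothetical.

```bash
#!/usr/bin/env bash

# Flags copied from the table above. MPICH_DIR and the PE_MPICH_GTL_* variables
# are expected to be set by the Cray programming environment modules.
HIP_COMPILE_FLAGS="-I${MPICH_DIR}/include --offload-arch=gfx90a"
HIP_LINK_FLAGS="-L${MPICH_DIR}/lib -lmpi ${PE_MPICH_GTL_DIR_amd_gfx90a} ${PE_MPICH_GTL_LIBS_amd_gfx90a}"

# Hypothetical source file, used for illustration only.
SRC=my_app.cpp

if command -v hipcc >/dev/null 2>&1 && [ -f "${SRC}" ]; then
    # The flag variables are left unquoted on purpose so they split into separate arguments.
    hipcc ${HIP_COMPILE_FLAGS} -c "${SRC}" -o "${SRC%.cpp}.o"       # compile
    hipcc "${SRC%.cpp}.o" -o "${SRC%.cpp}.exe" ${HIP_LINK_FLAGS}    # link
else
    echo "Skipping build: hipcc or ${SRC} not found" >&2
fi
```

Add `-ggdb` or `-fopenmp` to both flag variables if you need debugging symbols or OpenMP.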

#### Compiling and linking with the Cray compiler wrappers 

If you are using the Cray compiler wrappers you can use these flags to compile and link HIP code.

| Function | Flags |
| :--- | :--- |
| Compile | ```-std=c++11 -D__HIP_ROCclr__ -D__HIP_ARCH_GFX90A__=1 --rocm-path=${ROCM_PATH} --offload-arch=gfx90a -x hip``` |
| Link | ```--rocm-path=${ROCM_PATH} -L${ROCM_PATH}/lib -lamdhip64``` |
| Debug (compile and link) | ```-ggdb``` |
| OpenMP (compile and link)| ```-fopenmp``` |
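
The same two-step build can be sketched with the Cray **CC** wrapper, using the flags from the table above. This assumes a Cray programming environment and a **rocm** module are loaded so that `ROCM_PATH` is defined. The file name `my_app.cpp` is hypothetical. Because the wrapper already knows about MPI, no MPI flags are added here.

```bash
#!/usr/bin/env bash

# Flags copied from the table above. ROCM_PATH is expected to be set by the rocm module.
CRAY_HIP_COMPILE_FLAGS="-std=c++11 -D__HIP_ROCclr__ -D__HIP_ARCH_GFX90A__=1 --rocm-path=${ROCM_PATH} --offload-arch=gfx90a -x hip"
CRAY_HIP_LINK_FLAGS="--rocm-path=${ROCM_PATH} -L${ROCM_PATH}/lib -lamdhip64"

# Hypothetical source file, used for illustration only.
SRC=my_app.cpp

if command -v CC >/dev/null 2>&1 && [ -n "${ROCM_PATH}" ] && [ -f "${SRC}" ]; then
    # "-x hip" must come before the source file so that it is treated as HIP code.
    CC ${CRAY_HIP_COMPILE_FLAGS} -c "${SRC}" -o "${SRC%.cpp}.o"      # compile
    CC "${SRC%.cpp}.o" -o "${SRC%.cpp}.exe" ${CRAY_HIP_LINK_FLAGS}   # link
else
    echo "Skipping build: CC, ROCM_PATH or ${SRC} not available" >&2
fi
```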

#### Mixing hipcc and Cray compilation

According to this [documentation](https://docs.amd.com/bundle/HIP-Programming-Guide-v5.0/page/Transitioning_from_CUDA_to_HIP.html), it is important that all code links against the same C++ standard libraries. The command ```hipconfig --cpp_config``` generates extra compile flags that might be useful to include in your build process.
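
One way to use that output, sketched here under the assumption that a **rocm** module is loaded, is to capture it in a shell variable and pass it to whichever compiler builds your HIP sources. The variable name `HIP_CPP_FLAGS` is hypothetical.

```bash
#!/usr/bin/env bash

# Capture the extra compile flags reported by hipconfig, if it is available.
if command -v hipconfig >/dev/null 2>&1; then
    HIP_CPP_FLAGS="$(hipconfig --cpp_config)"
else
    HIP_CPP_FLAGS=""
    echo "hipconfig not found: load a rocm module first" >&2
fi

echo "Extra HIP compile flags: ${HIP_CPP_FLAGS}"

# The variable can then be added to a compile line, for example:
#   CC ${HIP_CPP_FLAGS} --offload-arch=gfx90a -x hip -c my_app.cpp -o my_app.o
```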

## Batch jobs on GPU nodes
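
A minimal sketch of a batch job, assuming the same partition and modules described above and assuming that GPUs are requested with `--gres=gpu:<n>` (check the current Pawsey documentation for the recommended options). The script name `hello_devices.slurm`, the fallback account `myproject`, and the resource values are hypothetical. The snippet writes the batch script to a file so that `${PAWSEY_PROJECT}` is expanded before Slurm reads the `#SBATCH` lines.

```bash
#!/usr/bin/env bash

# Write a small batch script for the gpu partition.
# The heredoc is unquoted so the account name is filled in now.
cat > hello_devices.slurm <<EOF
#!/bin/bash --login
#SBATCH --account=${PAWSEY_PROJECT:-myproject}
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
#SBATCH --time=00:10:00

# Load default versions; use "module avail" to choose specific ones.
module load craype-accel-amd-gfx90a rocm
export MPICH_GPU_SUPPORT_ENABLED=1

srun ./hello_devices.exe
EOF

echo "Wrote hello_devices.slurm; submit it with: sbatch hello_devices.slurm"
```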

## Exercise: compiling your first HIP application with MPI 

The file [hello_devices.cpp](hello_devices.cpp) is a complete HIP application that reports the size of on-device memory and the maximum buffer size that fits within that memory.

* **Step 1.** From the Jupyter launcher, start a Terminal and use **cd** to navigate to the **src/L1_Introduction** directory in the course material.

```bash
cd src/L1_Introduction
```

* **Step 2.** It helps to know where the HIP runtime library and HIP header files are located, even though **hipcc** finds them for you. With a **rocm** module loaded, the locations are as follows:

| File | Directory |
| :--- | :--- |
| HIP runtime library (libamdhip64.so) | ${ROCM_PATH}/lib |
| HIP C++ headers directory (hip) | ${ROCM_PATH}/include |


In the Terminal use **ls** to list the contents of these directories and locate the **hip** directory in which the HIP header files are located.

* **Step 3.** Compile the application source file **hello_devices.cpp** using the **hipcc** compiler wrapper and the MPI flags listed earlier. The compilation command should look like this:

```bash
hipcc -g -O2 -I${MPICH_DIR}/include -I../include --offload-arch=gfx90a hello_devices.cpp\
    -o hello_devices.exe -L${MPICH_DIR}/lib -lmpi ${PE_MPICH_GTL_DIR_amd_gfx90a} ${PE_MPICH_GTL_LIBS_amd_gfx90a}
```

On Linux you can add the MPI include directory to your **CPATH** environment variable, and the MPI library directory to both your **LIBRARY_PATH** and **LD_LIBRARY_PATH** environment variables. Then you won't need to explicitly tell the compiler where the MPI resources are.

```bash
hipcc -g -O2 -I../include --offload-arch=gfx90a hello_devices.cpp\
    -o hello_devices.exe -lmpi ${PE_MPICH_GTL_DIR_amd_gfx90a} ${PE_MPICH_GTL_LIBS_amd_gfx90a}
```

* **Step 4.** Now run the application:

```bash
./hello_devices.exe
```

You should see at least one device printed with the name and memory sizes. Now that you know how to let the compiler know about HIP you can use the **make** command within that directory to compile the example. 

```bash
make clean; make
```

```text
rm -r *.exe
g++ -std=c++11 -g -O2 -fopenmp -I/usr/include -I../include -L/usr/lib64 hello_devices.cpp\
	-o hello_devices.exe -lOpenCL -lomp
In file included from hello_devices.cpp:2:0:
../include/cl_helper.hpp: In function ‘_cl_command_queue** h_create_command_queues(_cl_device_id**, _cl_context**, cl_uint, cl_uint, cl_bool, cl_bool)’:
         );
         ^
In file included from /usr/include/CL/opencl.h:24:0,
                 from ../include/cl_helper.hpp:15,
                 from hello_devices.cpp:2:
/usr/include/CL/cl.h:1906:1: note: declared here
 clCreateCommandQueue(cl_context                     context,
 ^~~~~~~~~~~~~~~~~~~~
```


```bash
rocminfo -l
```

ROCk module is loaded
HSA System Attributes    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             

HSA Agents               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen 7 6800H with Radeon Graphics
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen 7 6800H with Radeon Graphics
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                   