# Introduction to HIP

HIP stands for **Heterogeneous-computing Interface for Portability**. HIP comes as part of the [ROCm](https://www.amd.com/en/graphics/servers-solutions-rocm) bundle of packages. ROCm is AMD's competitor to CUDA, and provides both a programming and computing environment for running applications across multiple cores and across multiple devices. HIP is a software library to accelerate computing on devices that are of the **same type** (i.e. GPUs), and from one vendor at a time.

## GPUs for scientific computing?

Graphics processing units (GPUs) were originally designed to compute the values of pixels in graphical applications such as 3D rendering. As this process is readily parallelisable, the rendering calculations were offloaded to specialised hardware pipelines to be performed in parallel. Eventually this specialised hardware became generalised and programmable, and GPUs became capable of other tasks like performing scientific calculations. Commercial pressure to achieve the best frame rates in games led to GPU designs that incorporate high bandwidth memory and the ability to parallelise calculations over thousands of discrete processing elements. These days GPUs have floating point performance and memory bandwidth that exceed those of CPUs by as much as an order of magnitude. Below is a table of the estimated peak FP32 performance of the compute devices on Setonix.

| Compute device | Peak FP32 processing power (TFLOP/s) |
| :--- | ---: |
| AMD EPYC 7763 | 1.8 |
| AMD Radeon Instinct MI250X | 2 x 23.95 |

Like CUDA or OpenCL, HIP is a software framework that provides a way to harness the compute power of modern GPUs.

## HIP features from a distance

HIP aims to achieve code portability primarily over AMD and CUDA platforms, and there is also an effort to make HIP available for Intel platforms through the [CHIP-SPV](https://github.com/CHIP-SPV/chip-spv) project. HIP does this by having function calls and data types that produce **very similar** behaviour to their equivalents in CUDA. For this reason HIP can serve as a **very thin** layer over CUDA and can utilise either HIP-Clang (AMD) or CUDA (NVIDIA) compute backends. When the CUDA backend is in use it enables all the benefits of using a CUDA platform, such as highly-tuned performance and the ability to use CUDA performance measurement and debugging tools. Likewise the HIP-Clang backend allows for the use of AMD hardware natively along with AMD performance and debugging tools. The C++ programming environment for HIP permits compilation of device code and host code where the sources for both are in **a single source file**.

## Is HIP right for your project?

In supercomputing there is a welcome increase in the diversity of available hardware options. This means software must adapt to run on compute devices from different vendors. HIP is an ideal choice when performance and portability are required across both AMD and NVIDIA devices. It provides many of the benefits of using CUDA, including:

* Similar/identical programming environment to CUDA, with the opportunity to inherit best practices and useful tips from CUDA literature and previous experience.
* Can make use of a CUDA backend and all the nice NVIDIA development tools that go with it.
* Can make use of an AMD backend and all the nice AMD development tools that go with it.
* Easy to port code from CUDA to HIP due to their similarity.
* Easy to write code that works across AMD and NVIDIA backends.
* Optimal performance whatever the backend in use.
* Avoids the need to explicitly compile kernels (unlike OpenCL).
* Development on the HIP/CUDA API is led by companies that make hardware (unlike OpenCL).

Some challenges include:

* The number of devices that are officially supported in ROCm is quite low. Many more AMD devices work unofficially with ROCm but it is difficult to get issues addressed on unsupported hardware. 
* ROCm on AMD is still stabilising, and you may encounter bugs, undefined behaviour, and frustration with features that don't work properly.
* Very few Linux distributions or AMD graphics cards are officially supported. The rest have unknown or unreported levels of support or compatibility with HIP.
* HIP functions are at the mercy of CUDA API changes (portability can be fragile). In order to keep portability the HIP library needs to keep up with API changes in CUDA. For example I installed CUDA 12 and it was not compatible with HIP on ROCm 5.4.1. I had to use CUDA 11.8 instead.
* Not all CUDA abilities are replicated in HIP, and not all HIP-Clang abilities are replicated in HIP.
* Only one vendor's backend is supported at a time, so a program can only use one type of compute device.

<figure style="margin: 1em; margin-left:auto; margin-right:auto; width:70%;">
    <img src="../images/hip_clang_cuda.svg">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Feature list of HIP-Clang, CUDA, and HIP itself. Portability means that not every feature is supported across both platforms.</figcaption>
</figure>

Whether or not to maintain portability in HIP is a choice. If you are willing to target specific platforms using preprocessor directives, then you can make use of platform-specific code at the cost of portability.
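As a minimal sketch of targeting a platform with preprocessor directives, the snippet below branches on the `__HIP_PLATFORM_AMD__` and `__HIP_PLATFORM_NVIDIA__` macros that a HIP build defines for the AMD and NVIDIA backends. The `Backend` enum and the functions `hip_backend` and `backend_name` are hypothetical names chosen for illustration, and the fallback branch only exists so that the code also compiles with an ordinary C++ compiler.

```C++
#include <cassert>

// Hypothetical enum naming the backends that this sketch distinguishes
enum class Backend { amd, nvidia, none };

// Resolve the backend at compile time from the HIP platform macros
constexpr Backend hip_backend() {
#if defined(__HIP_PLATFORM_AMD__)
    return Backend::amd;     // AMD-specific code could go in this branch
#elif defined(__HIP_PLATFORM_NVIDIA__)
    return Backend::nvidia;  // NVIDIA-specific code could go in this branch
#else
    return Backend::none;    // not compiled as a HIP program
#endif
}

// Human-readable name for a backend
const char* backend_name(Backend b) {
    switch (b) {
        case Backend::amd:    return "amd";
        case Backend::nvidia: return "nvidia";
        case Backend::none:   return "none";
    }
    assert(false && "unhandled backend");
    return "";
}
```

Calling `backend_name(hip_backend())` then reports which branch was compiled in. Code placed inside one branch is invisible to the other backend, so anything that must stay portable belongs outside the guarded regions.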

## How does HIP work?

### Kernels in software threads

HIP is a framework to support running lightweight pieces of code, called kernels, **in parallel** over the available cores of a compute device. Below is an example kernel to compute the floating point absolute value of a single element in an array of floating point numbers.

```C
__global__ void vec_fabs(
        // Memory allocations that are on the compute device
        float *src, 
        float *dst,
        // Number of elements in the memory allocations
        size_t length) {

    // Get our position in the array
    size_t offset = blockIdx.x * blockDim.x + threadIdx.x;

    // Store the absolute value, skipping threads that fall beyond the end of the array
    if (offset < length) {
        dst[offset] = fabs(src[offset]);
    }
}
```

In order to take the absolute value of every element we need to run this kernel at every point in the array. A **software thread** can be thought of as the execution of a sequence of compute instructions independently from other threads. In that sense a kernel is **run** in a software thread.
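To make the idea of running the kernel at every point concrete, here is a minimal host-only sketch in plain C++ (not HIP) that emulates the kernel instances with two nested loops. The functions `vec_fabs_instance` and `emulate_vec_fabs` are hypothetical stand-ins: on a GPU the instances would run in parallel, whereas here they simply run one after another.

```C++
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Host-side stand-in for one kernel instance. The index arguments play the
// role of blockIdx.x, blockDim.x and threadIdx.x in the kernel above.
void vec_fabs_instance(const float* src, float* dst, std::size_t length,
                       std::size_t block_idx, std::size_t block_dim,
                       std::size_t thread_idx) {
    // Position of this instance in the array
    std::size_t offset = block_idx * block_dim + thread_idx;
    // Instances that fall beyond the end of the array do nothing
    if (offset < length) {
        dst[offset] = std::fabs(src[offset]);
    }
}

// Run one instance for every thread of every block, one after another
std::vector<float> emulate_vec_fabs(const std::vector<float>& src,
                                    std::size_t block_dim) {
    assert(block_dim > 0);
    std::vector<float> dst(src.size(), 0.0f);
    // Enough blocks to cover the whole array, rounding up
    std::size_t num_blocks = (src.size() + block_dim - 1) / block_dim;
    for (std::size_t b = 0; b < num_blocks; ++b) {
        for (std::size_t t = 0; t < block_dim; ++t) {
            vec_fabs_instance(src.data(), dst.data(), src.size(),
                              b, block_dim, t);
        }
    }
    return dst;
}
```

For an array of 5 elements and a block size of 4, two blocks (8 instances) are needed to cover the array, and the bounds check makes the last 3 instances do nothing.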

### Hardware threads

A **hardware thread** is a pipeline of physical machinery that executes the instructions in a software thread. Compute devices have a number of cores to manage memory and execute software threads. In AMD terminology these cores are called **Compute Units**. Every compute unit makes available to the OS a number of hardware threads for running kernels. 

#### CPU specifics

In CPUs each compute unit makes available to the OS a number of hardware threads - usually 2-4. CPU hardware threads have access to SIMD math units to perform vector math operations in parallel, however this hardware is only accessed through special vector instructions that the compiler conservatively generates *if it deems it is safe to do so*.

#### GPU specifics

GPUs use a SIMT (Single Instruction Multiple Threads) processing model, where instructions are executed on the **Compute Unit**, by a hardware thread, in lock-step over a team of math units. These units are known as **shader cores** (AMD) or **CUDA cores** (NVIDIA) and can be thought of as the physical machinery that executes the (math) instructions in a kernel. Unlike the SIMD units in a CPU, the SIMT units in a GPU each have a measure of freedom in that they have access to their own data and can execute a code path independently of other units. However since all math units in a team are operating in lock-step, the *whole team* must follow every code path. For AMD GPUs the team is usually 64 *lanes* wide and is called a **wavefront**. For NVIDIA GPUs a team is usually 32 lanes wide and is called a **warp**. There are many thousands of math units in a GPU, and this feature, along with greater memory bandwidth, is responsible for the significant performance advantage that GPUs have over CPUs.
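The cost of lock-step execution can be illustrated with a toy model, sketched below in plain C++ rather than HIP. It assumes a kernel containing a single `if`/`else`, and counts how many code paths a team must step through and how many lane-slots are masked off while doing so. The names `BranchCost` and `simt_if_else` are hypothetical, and real hardware scheduling is more involved than this model suggests.

```C++
#include <cassert>
#include <cstddef>
#include <vector>

// Result of stepping one team (wavefront or warp) through an if/else
struct BranchCost {
    int passes;              // code paths the whole team steps through
    std::size_t idle_slots;  // lane-slots masked off across those passes
};

// takes_if[i] is true when lane i wants the `if` path, false for `else`
BranchCost simt_if_else(const std::vector<bool>& takes_if) {
    assert(!takes_if.empty());
    std::size_t n_if = 0;
    for (bool lane : takes_if) {
        if (lane) ++n_if;
    }
    std::size_t n_else = takes_if.size() - n_if;

    BranchCost cost{0, 0};
    // The `if` path is executed when at least one lane needs it,
    // and the lanes that want `else` sit idle during that pass
    if (n_if > 0)   { cost.passes += 1; cost.idle_slots += n_else; }
    // Likewise for the `else` path
    if (n_else > 0) { cost.passes += 1; cost.idle_slots += n_if; }
    return cost;
}
```

In this model a 64-lane wavefront in which every lane takes the same branch makes 1 pass with no idle lanes, whereas one in which 16 lanes take the `if` and 48 take the `else` makes 2 passes and wastes 64 lane-slots in total.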

The figure below shows the layout of an AMD MI250X GPU processor. Each processor contains two GPU dies; each die contains 8 shader engines; and each shader engine contains ~14 compute units, for a total of 110 Compute Units per die. Every compute unit commands a wavefront of 64 shader cores (lanes), therefore on this processor there are two unique compute devices, each with $110\times64 = 7040$ available shader cores for use in compute applications.

<figure style="margin: 1em; margin-left:auto; margin-right:auto; width:100%;">
    <img src="../images/MI250x.png">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">AMD Instinct<span>&trade;</span> MI250X compute architecture. Image credit: <a href="https://hc34.hotchips.org/">AMD Instinct<span>&trade;</span> MI200 Series Accelerator and Node Architectures | Hot Chips 34</a></figcaption>
</figure>
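As a rough consistency check against the table of peak performance, assume (this is not stated above) a peak engine clock of about 1.7 GHz and one fused multiply-add, counted as 2 floating point operations, per shader core per cycle. Under those assumptions the peak FP32 rate of one die works out as:

$$
110 \times 64 \times 2\ \mathrm{FLOP/cycle} \times 1.7\times10^{9}\ \mathrm{cycles/s} \approx 23.9\ \mathrm{TFLOP/s}
$$

This is close to the 23.95 TFLOP/s per die quoted for the MI250X, and the two dies together provide $2 \times 7040 = 14080$ shader cores.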

### Blocks and Threads as part of the Grid

A HIP implementation is a way to map kernel instances (software threads) to the available hardware threads  on a compute device. The implementation also provides the means to **upload** and **download** memory to and from compute devices. We specify how many kernel instances we want at runtime by defining a 3D execution space called a **Grid**, and specifying its size at kernel launch. After launch every point in the Grid is "visited" by exactly one instance of the kernel. In HIP and CUDA terminology an instance of a kernel is called a **Thread**. In OpenCL it is called a *work-item*.

<figure style="margin-left:auto; margin-right:auto; width:70%;">
    <img style="vertical-align:middle" src="../images/grid.svg">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Three-dimensional Grid with Threads and Blocks.</figcaption>
</figure>

Threads are grouped into **Blocks**, otherwise known as **Workgroups** in AMD and OpenCL terminology. All threads in a block have access to resources such as shared memory. There may be more than one wavefront in a block, and it is good practice to make the block size a multiple of the wavefront size so that the block contains a whole number of wavefronts.

In the example above, the grid is of size (10, 8, 2) and each block is of size (5, 4, 1). The number of blocks in each dimension is then (2, 2, 2). Every thread has access to device memory that it can use exclusively (**private** and **local** memory); access to memory that all threads in the block can use (**shared memory**); and access to memory that threads in other blocks can use (**global**, **constant**, and **texture** memory). During execution every thread can query its location within the **Grid** and use that position as a reference to access allocated memory on the compute device at an appropriately calculated offset.

<figure style="margin-left:auto; margin-right:auto; width:70%;">
    <img style="vertical-align:middle" src="../images/mem_access.svg">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Using the location within the Grid to access memory within a memory allocation on a GPU compute device.</figcaption>
</figure>
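The arithmetic behind the Grid example can be checked on the host with plain C++. The sketch below uses hypothetical helpers (`Dim3`, `ceil_div`, `blocks_in_grid`, `wavefronts_per_block`, `linear_offset`) and assumes that the x dimension varies fastest in memory, which is a common convention rather than something HIP enforces. Note that `hipLaunchKernelGGL` is given the number of blocks in each dimension rather than the total number of threads, so a calculation like `blocks_in_grid` is typically done before launch.

```C++
#include <cassert>
#include <cstddef>

// Hypothetical 3D size or position, similar in spirit to HIP's dim3
struct Dim3 { std::size_t x, y, z; };

// Integer division that rounds up
std::size_t ceil_div(std::size_t n, std::size_t d) {
    assert(d > 0);
    return (n + d - 1) / d;
}

// Number of blocks needed in each dimension to cover a grid of threads
Dim3 blocks_in_grid(Dim3 grid_threads, Dim3 block) {
    return { ceil_div(grid_threads.x, block.x),
             ceil_div(grid_threads.y, block.y),
             ceil_div(grid_threads.z, block.z) };
}

// Number of wavefronts (or warps) needed to hold the threads of one block
std::size_t wavefronts_per_block(Dim3 block, std::size_t wavefront_size) {
    return ceil_div(block.x * block.y * block.z, wavefront_size);
}

// Offset into a flat allocation for a thread at position pos in the grid,
// assuming x varies fastest, then y, then z
std::size_t linear_offset(Dim3 pos, Dim3 grid_threads) {
    assert(pos.x < grid_threads.x && pos.y < grid_threads.y &&
           pos.z < grid_threads.z);
    return (pos.z * grid_threads.y + pos.y) * grid_threads.x + pos.x;
}
```

For a grid of (10, 8, 2) threads and blocks of (5, 4, 1), `blocks_in_grid` gives (2, 2, 2). Each block then holds only 20 threads, so with 64-lane wavefronts `wavefronts_per_block` gives 1 and 44 of the 64 lanes would sit idle, which is one reason larger blocks are usually preferred.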

The above concepts form the core ideas surrounding HIP. Everything that follows in forthcoming modules is supporting information on how to prepare compute devices, manage memory, invoke kernels, and how best to use these concepts together to get the best performance out of your compute devices. 

### Elements of an accelerated application

In every accelerated application there is the concept of a host computer with one or more **compute devices**. The host usually has the *largest memory space available* and the compute device usually has the *most compute power* and memory bandwidth. This is why we say the application is *accelerated* by the compute device/s.

At compilation the kernel code and the host code are separated out. The host code gets compiled to machine code, and the kernel code gets compiled either to a binary for a specific device architecture or to an intermediate representation for further compilation at runtime.

At runtime, the host executes the application and the HIP runtime either selects a binary kernel or uses Just In Time (JIT) compilation techniques to ready the kernel for the compute device. Therefore the first run of the kernel can *take longer than other runs*. The host program manages memory allocations on the compute device/s and launches kernels on the compute device. In instances where the compute device is a CPU, the host CPU and the compute device are the same thing.

Accelerated applications follow the same logical progression of steps: 

1. Compute resources discovered
1. Memory allocated on compute device/s
1. Memory is copied from the host to the compute devices
1. Kernels are JIT compiled for first use and then run on the compute device/s. In HIP and CUDA this can take place in the background without intervention from the programmer.
1. The host waits for kernels to finish
1. Memory is copied back from the compute device/s to the host
1. Repeat steps 3 - 6 as many times as necessary
1. Clean up resources and exit

We now discuss the HIP components that make these steps possible.

### Taxonomy of a HIP application

Below is a representation of the core software components that are available to a HIP application.

<figure style="margin-left:auto; margin-right:auto; width:70%;">
    <img style="vertical-align:middle" src="../images/hip_components.svg">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Components of a HIP application.</figcaption>
</figure>

The first is the **Platform**. This is a software representation of a HIP implementation. A platform provides access to all **devices** that the platform supports. In OpenCL more than one platform is accessible, and as part of device discovery the available platforms are queried and devices are discovered. In HIP it appears only one platform is available at any one time.

A **Device** provides a way to query the capabilities of the compute device and provides a foundation to build a **Context**. A context can be thought of as a process that is associated with a compute device and provides a space in which resources (i.e. kernels and memory allocations) can be managed on the device. Every device has at least one context, the default or primary context. More than one context can be created for a device, however current best practice relies on using the primary context for each device and selecting which device to use. Therefore in HIP programs a device and its context are **largely synonymous**.

Within the control of the Context are **Buffers**. Buffers are memory allocations that exist on the compute device and are managed by the context. The word `Buffer` is not HIP terminology, but is borrowed from OpenCL.

A **Kernel** is a snippet of code that is run by **Threads** within **Blocks** of the **Grid**. The host uses the macro **hipLaunchKernelGGL** to launch kernels and handle kernel arguments. At compilation, kernels are compiled (either to an intermediate form or to binary) for each architecture. At runtime the appropriate kernel is selected (and possibly compiled) for the device in use.

Once a context has been created and devices are known, then **Streams** can be constructed for each device. A stream is a queue to submit work to, very much like a command queue in OpenCL. There is a default stream for compute devices, and this stream is used when no other stream is available. Multiple streams can facilitate either concurrent IO or compute on the device. 

An **Event** is a way to keep track of how work in streams is progressing, and provides a way to time kernel executions as well as establish dependencies between streams.

In summary we have the following components:

* **Platform**: provides access to devices (transparent to the programmer)
* **Device**: represents a way to access the compute device and to query device capabilities
* **Context**: provides a way to manage resources and track kernel executions on compute devices
* **Buffer**: a memory allocation on a device or the host
* **Stream**: provides a place to send work, such as memory copy commands and kernel executions
* **Kernel**: is code that executes within a software thread over a hardware thread of a compute device
* **Event**: provides a way to keep track of work submitted to a stream

## HIP installation

HIP is bundled with the AMD toolset ROCm, and you can find out how to install ROCm [here](https://docs.amd.com/). I would recommend using one of the supported operating systems to avoid pain with installation. If you would like to use the NVIDIA backend for HIP programs you also need to install CUDA. There might be compatibility issues between HIP and CUDA if CUDA has recently had a major release, so prefer an established version of CUDA over the bleeding edge one. Here are some other HIP implementations that might be useful.

* HIP CPU runtime [https://github.com/ROCm-Developer-Tools/HIP-CPU](https://github.com/ROCm-Developer-Tools/HIP-CPU). This is a header-only library that allows for HIP application development using a CPU backend.
* CHIP-SPV runtime for running HIP over OpenCL and Intel Level Zero backends [https://github.com/CHIP-SPV/chip-spv](https://github.com/CHIP-SPV/chip-spv).

## Getting help for HIP

### API Documentation

Documentation for HIP is sparse, and at the time of writing there are few books on the subject. The book [**Accelerated Computing with HIP**](https://amzn.asia/d/3OcHFGU) has been a helpful resource, and the AMD documentation portal [docs.amd.com](https://docs.amd.com/) is held up as the best resource for documentation on HIP versions 5.0 and above. Some specific documentation on the HIP API (version 5) is [here](https://docs.amd.com/bundle/HIP_API_5/page/group___a_p_i.html). The legacy site [rocmdocs.amd.com](https://rocmdocs.amd.com/en/latest/) has a wealth of older (and possibly outdated) information for ROCm and HIP.

One of the best sources of documentation on the HIP API that I have found so far is the Doxygen-generated documentation within the ROCm distribution itself. It can usually be found at the location `share/doc/hip` within the ROCm installation directory (e.g. `/opt/rocm`).

### Simple examples

Some simple examples for using HIP are located in the `share/hip/samples` directory of a ROCm installation. This is a go-to resource if you need to see a quick example of something basic in HIP. You can also find this resource on GitHub at the following location.

[https://github.com/ROCm-Developer-Tools/HIP/tree/main/samples](https://github.com/ROCm-Developer-Tools/HIP/tree/main/samples)

### Supported API compatibility with CUDA

There is not much concept of a specification roadmap in HIP. It appears to be a process of "catch up" with the CUDA API functionality. The documentation pages in the [HIPIFY repository](https://github.com/ROCm-Developer-Tools/HIPIFY/blob/master/doc/markdown/CUDA_Runtime_API_functions_supported_by_HIP.md) track the addition and deprecation of functions in CUDA as well as the state of support in HIP.

### Low level kernel instructions

Documentation for the assembly instructions in HIP kernels can be found here:

* [MI100 Instruction Set](https://www.amd.com/system/files/TechDocs/instinct-mi100-cdna1-shader-instruction-set-architecture%C2%A0.pdf)
* [MI200 Instruction Set](https://www.amd.com/system/files/TechDocs/instinct-mi200-cdna2-instruction-set-architecture.pdf)

## Compiling HIP programs

In a similar fashion to CUDA's compiler wrapper `nvcc`, HIP uses a compiler wrapper script `hipcc` to compile programs in two stages. One stage is for host code and another is for kernel code. The kernel code is either built as a binary or is compiled to an intermediate code for further compilation at runtime. All kernel code is included in the application binary. 

All that a C/C++ program needs to do is include the header file **hip/hip_runtime.h**.

```C++
#include <hip/hip_runtime.h>
```

and then compile the source code using the **hipcc** compiler wrapper. For example there is a program called [hello_devices.cpp](hello_devices.cpp) in this folder. The program polls all compute devices on the machine and prints the total amount of memory found on each compute device. Log into the compute resource or interactive job, change directory and compile.


```bash
cd course_material/L1_Introduction
hipcc -I../include hello_devices.cpp -o hello_devices.exe
```

### Switching between AMD and NVIDIA backends

The **HIP_PLATFORM** environment variable determines which backend is used. If you specify

```bash
export HIP_PLATFORM=amd
```

then it will use the AMD backend, but if you specify

```bash
export HIP_PLATFORM=nvidia
```

then it will use the NVIDIA backend. 


> **Note:** 
> When the NVIDIA backend has had a recent major version change, it is advisable to not use the latest CUDA toolkit, as there can be API changes (such as deprecations) that HIP has yet to catch up with.





## Summary

In this section we have covered the fundamentals of HIP, what it is, and how it applies to your workloads in the context of the hardware capabilities of CPU and GPU compute devices. We reviewed the elements of a hardware accelerated application, and how HIP ensures a kernel is launched in a software thread for every element in a compute space called the **grid**. Then we covered the practicalities of HIP, such as installation and getting help with the API. Finally, we covered the basics of compiling HIP software for both AMD and NVIDIA platforms.

<address>
Written by Dr. Toby Potter of <a href="https://www.pelagos-consulting.com">Pelagos Consulting and Education</a> for the <a href="https://pawsey.org.au">Pawsey Supercomputing Centre</a>.<br>
</address>