# Introduction to HIP

HIP stands for Heterogeneous-computing Interface for Portability. It is AMD's counterpart to CUDA, and provides both a runtime and a C++ programming environment for running applications across multiple cores and across multiple devices from the **same vendor**. HIP is an open source project and is actively maintained by AMD.

## Why use GPUs?

## HIP features at a glance

HIP aims to make code portable primarily across AMD and NVIDIA platforms, and there is also an effort to make it available on Intel platforms through the [CHIP-SPV](https://github.com/CHIP-SPV/chip-spv) project. HIP does this by providing function calls and data types that produce **very similar** behaviour to their equivalents in CUDA. HIP can use either the HIP-Clang (AMD) or the CUDA (NVIDIA) compute backend. Not every CUDA function has a HIP equivalent; however, because the two APIs are so similar, HIP functions can be a **very thin** layer over CUDA when the CUDA backend is in use. This brings all the benefits of a CUDA platform, such as highly-tuned performance and the ability to use CUDA performance measurement and debugging tools. Likewise, the HIP-Clang backend allows the use of AMD hardware along with AMD performance and debugging tools. The C++ programming environment for HIP permits compilation of device code and host code where the sources for both are in **a single source file**. Some benefits of the HIP approach include:

* Easy to port code from CUDA to HIP
* Similar/identical programming environment to CUDA, with the opportunity to inherit best practices from CUDA literature.
* Easy to write code that works across AMD and NVIDIA backends
* Optimal performance on AMD and NVIDIA backends
* Ability to use performance and debugging tools for whatever backend is in use.
* Avoids the need to explicitly compile kernels (unlike OpenCL)
* Development is led by a company that makes hardware (unlike OpenCL).

Some challenges include:

* HIP functions are at the mercy of CUDA API changes (slightly fragile). To remain portable, the HIP library needs to keep up with API changes in CUDA. For example, I installed CUDA 12 and it was not compatible with HIP on ROCm 5.4.1.
* Not all CUDA abilities are replicated in HIP, and not all HIP-Clang abilities are replicated in HIP. 

<figure style="margin: 1em; margin-left:auto; margin-right:auto; width:70%;">
    <img src="../images/hip_clang_cuda.svg">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Feature list of HIP-Clang, CUDA, and HIP itself. Portability means that not every feature is supported across both platforms.</figcaption>
</figure>


## How does HIP work?

In order to explore how a HIP implementation works, we need to start thinking about how hardware and software interact. A software thread can be thought of as the execution of a sequence of compute instructions independently from other threads. A hardware thread is a pipeline of physical machinery that executes the instructions. In every compute device there are a number of cores to manage memory and the execution of software threads. In AMD and OpenCL terminology these cores are called **Compute Units**. Every compute unit makes available a number of hardware threads and has access to a number of specialised floating point and integer units for performing math operations in parallel. These special units are called **shader cores** for AMD GPUs, **CUDA cores** for NVIDIA GPUs, and SIMD vector units for CPUs.

GPUs use a SIMT (Single Instruction Multiple Threads) processing model, where instructions are executed by the **Compute Unit** over teams of threads that operate in *lock-step* with each other and *in parallel*. For AMD GPUs the team is 64 threads wide and is called a **wavefront**. For NVIDIA GPUs a team is 32 threads wide and is called a **warp**. The shader cores (CUDA cores) enable the hardware threads to perform math operations in parallel.

In CPUs each compute unit also makes available a number of hardware threads. These threads are more "independent" than their GPU equivalents and are not constrained to operate in lock-step. Hardware threads have access to SIMD vector units to perform vector math operations. However, this hardware is only accessed through special vector instructions, which the compiler generates conservatively and only when it deems it safe to do so.

The figure below shows a graphical layout of an AMD MI250X GPU processor. Each processor contains two GPU dies; each die contains 8 shader engines; and each shader engine contains ~14 compute units, for a total of 110 compute units per die. Every compute unit commands a wavefront of 64 hardware threads, so in this example there are $2\times110\times64 = 14080$ hardware threads available for use in compute applications.
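The arithmetic for the MI250X can be written as a small host-side helper. This is only a sketch: `hardware_threads` and its parameter names are hypothetical and are not part of any HIP API.

```cpp
#include <cassert>
#include <cstddef>

// Hardware threads = dies x compute units per die x threads per wavefront.
// The function and parameter names are illustrative, not part of HIP.
std::size_t hardware_threads(std::size_t dies,
                             std::size_t compute_units_per_die,
                             std::size_t wavefront_width) {
    assert(dies > 0 && compute_units_per_die > 0 && wavefront_width > 0);
    return dies * compute_units_per_die * wavefront_width;
}
```

Calling `hardware_threads(2, 110, 64)` reproduces the figure of 14080 quoted for the MI250X.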

<figure style="margin: 1em; margin-left:auto; margin-right:auto; width:100%;">
    <img src="../images/MI250x.png">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">AMD Instinct<span>&trade;</span> MI250X compute architecture. Image credit: <a href="https://hc34.hotchips.org/">AMD Instinct<span>&trade;</span> MI200 Series Accelerator and Node Architectures | Hot Chips 34</a></figcaption>
</figure>

During execution of a HIP program, compute units execute a user-specified piece of compiled code called a **kernel**. A kernel is a set of lightweight compute instructions executed within a software thread. Below is an example kernel to compute the absolute value of a single element in an array of floating point numbers.

```C
__global__ void vec_fabs(
        // Memory allocations that are on the compute device
        float *src, 
        float *dst,
        // Number of elements in the memory allocations
        size_t length) {

    // Get our position in the array
    size_t offset = blockIdx.x * blockDim.x + threadIdx.x;

    // Get the absolute value of the element, provided we are within the array
    if (offset < length) {
        dst[offset] = fabs(src[offset]);
    }
}
```

We want to run this kernel at every point in the array. A HIP implementation is a way to map kernel instances (software threads) to the available hardware threads on a compute device. The implementation also provides the means to **upload** and **download** memory to and from compute devices. We specify how many kernel instances we want at runtime by defining a 3D execution space called a **Grid** and specifying its size at kernel launch. Following launch, every point in the Grid is "visited" by exactly one instance of the kernel. In HIP and CUDA terminology an instance of a kernel is called a **Thread**. In OpenCL it is called a *work-item*.
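To see how a launch covers the array, here is a host-only emulation in plain C++ that needs no GPU. It is a sketch, not how a HIP runtime schedules work: `vec_fabs_thread` and `emulate_vec_fabs` are hypothetical helpers, and the "threads" run one after another in a loop rather than in parallel. The grid is rounded up to a whole number of blocks, which is why the bounds check in the kernel matters.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Host-only stand-in for the vec_fabs kernel body: one call plays the role
// of one thread. block_idx, block_dim and thread_idx mimic blockIdx.x,
// blockDim.x and threadIdx.x.
void vec_fabs_thread(const float* src, float* dst, std::size_t length,
                     std::size_t block_idx, std::size_t block_dim,
                     std::size_t thread_idx) {
    std::size_t offset = block_idx * block_dim + thread_idx;
    // Threads that fall past the end of the array do nothing
    if (offset < length) {
        dst[offset] = std::fabs(src[offset]);
    }
}

// Hypothetical helper that "launches" the emulated kernel serially over a
// 1D grid rounded up to a whole number of blocks.
std::vector<float> emulate_vec_fabs(const std::vector<float>& src,
                                    std::size_t block_dim) {
    assert(block_dim > 0);
    std::vector<float> dst(src.size(), 0.0f);
    std::size_t num_blocks = (src.size() + block_dim - 1) / block_dim;
    for (std::size_t b = 0; b < num_blocks; ++b) {
        for (std::size_t t = 0; t < block_dim; ++t) {
            vec_fabs_thread(src.data(), dst.data(), src.size(),
                            b, block_dim, t);
        }
    }
    return dst;
}
```

With five elements and a block size of four, two blocks (eight emulated threads) are launched. The last three threads fail the bounds check and leave memory untouched.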

<figure style="margin-left:auto; margin-right:auto; width:70%;">
    <img style="vertical-align:middle" src="../images/grid.svg">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Three-dimensional Grid with Threads and Blocks.</figcaption>
</figure>

Threads are executed in teams called **Blocks**. In the figure above, the grid is of size (10,8,2) and each block is of size (5,4,1). The number of blocks in each dimension is then (2,2,2). Every thread has access to device memory that it can use exclusively (**registers** and **local memory**); access to memory the team can use (**shared memory**); and access to memory that other teams use (**global**, **constant**, and **texture** memory). Every thread can query its location within the **Grid** and use that position as a reference to access allocated memory on the compute device at an appropriately calculated offset.
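The block count for the example can be checked with a short host-side sketch. Here `Dim3` is a stand-in alias for a triple of sizes and `blocks_per_dim` is a hypothetical helper; neither is the HIP `dim3` type or a HIP function. The grid size is taken, as in the text, to be the total number of threads in each dimension, and the division rounds up so that a grid that is not an exact multiple of the block size is still fully covered.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Stand-in for a 3D size; this is not the HIP dim3 type.
using Dim3 = std::array<std::size_t, 3>;

// Number of blocks needed in each dimension to cover the grid,
// rounding up when the grid is not a multiple of the block size.
Dim3 blocks_per_dim(const Dim3& grid, const Dim3& block) {
    Dim3 n{};
    for (std::size_t d = 0; d < 3; ++d) {
        assert(block[d] > 0);
        n[d] = (grid[d] + block[d] - 1) / block[d];
    }
    return n;
}
```

For a grid of (10,8,2) and blocks of (5,4,1) this gives (2,2,2), matching the figure. A block size of (4,4,1) would need (3,2,2) blocks, with the extra threads in the first dimension falling outside the grid.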

<figure style="margin-left:auto; margin-right:auto; width:70%;">
    <img style="vertical-align:middle" src="../images/mem_access.svg">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Using the location within the Grid to access memory within a memory allocation on a GPU compute device.</figcaption>
</figure>
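One common way to turn a position in the Grid into a memory offset is sketched below in host-only C++. The layout is an assumption, with x varying fastest, then y, then z; the text does not prescribe one. Both `global_index` and `linear_offset` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>

// Position of a thread along one dimension of the Grid, using the same
// formula as the vec_fabs kernel (blockIdx * blockDim + threadIdx).
std::size_t global_index(std::size_t block_idx, std::size_t block_dim,
                         std::size_t thread_idx) {
    assert(thread_idx < block_dim);
    return block_idx * block_dim + thread_idx;
}

// Offset into a flat allocation for position (x, y, z) in a grid of
// extent (nx, ny, nz). Assumed layout: x varies fastest, then y, then z.
std::size_t linear_offset(std::size_t x, std::size_t y, std::size_t z,
                          std::size_t nx, std::size_t ny) {
    assert(x < nx && y < ny);
    return (z * ny + y) * nx + x;
}
```

For the (10,8,2) grid used earlier, the thread at position (7,3,1) maps to offset 117, and the last thread at (9,7,1) maps to offset 159, the final element of a 160-element allocation.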

The above concepts form the core ideas surrounding HIP. Everything that follows in the forthcoming modules is supporting information on how to prepare compute devices, manage memory, invoke kernels, and combine these concepts to get the best performance out of your compute devices.

### Elements of an accelerated application

In every accelerated application there is the concept of a host computer with one or more **compute devices**. The host usually has the largest memory space available and the compute device usually has the most compute power and memory bandwidth. This is why we say the application is "accelerated" by the compute device.

At runtime, the host executes the application. The HIP runtime uses a Just In Time (JIT) strategy to compile kernels for available compute devices on-demand. The host program manages memory allocations on the compute device/s and launches  kernels on the compute device. For instances where the compute device is a CPU the host CPU and the compute device are the same thing.

Accelerated applications follow the same logical progression of steps: 

1. Compute resources discovered
1. Kernels compiled for compute device/s
1. Memory allocated on compute device/s
1. Memory is copied from the host to the compute device/s
1. Kernels run on the compute device/s
1. The host waits for kernels to finish
1. Memory is copied back from the compute device/s to the host
1. Repeat steps 3 - 7 as many times as necessary
1. Clean up resources and exit

We now discuss the HIP components that make these steps possible.

### Taxonomy of a HIP application

Below is a representation of the core software components that are available to a HIP application.

<figure style="margin-left:auto; margin-right:auto; width:50%;">
    <img style="vertical-align:middle" src="../images/HIP_components.svg">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Components of a HIP application.</figcaption>
</figure>

The first is the **Platform**. This is a software representation of a vendor's implementation. A platform provides access to all **devices** that the platform supports. During device discovery, available platforms are queried first. A platform provides access to one or more compute devices and possibly even a mixture of accelerator devices from the same vendor.

A **Device** provides a way to query the capabilities of the compute device and provides a foundation to build a **Context**.

Surrounding the devices is a **Context**. A Context is like a registry that keeps track of everything (e.g. kernel executions and memory allocations) that is happening on the compute device/s. A Context is constructed using both a platform and one or more devices on the platform. There are some benefits (such as memory copies) that could be obtained by encapsulating one or more devices under the same context. However, this assumes that the devices belong to the same platform, an assumption which may not be true. A simpler and more general design is to create a unique context for every compute device.

Within the control of the Context are **Buffers**. Buffers are memory allocations managed under the context, and may exist on either the host or the compute device. At runtime memory is migrated to where it is needed, but you can have some control over where a Buffer "lives".

At runtime, source code for the kernels is collated into **Programs**. This is repeated for every utilised context. In a subsequent step programs are built for every utilised compute device in a context.

Once a context has been created and devices are known, then one can create one or more **Command queue/s** for each device. A command queue is a place to submit work, such as kernel invocations and memory copies. 

A **Kernel** is a component of a compiled **Program**. At runtime we set the arguments of compiled kernels and then submit kernels to command queues for execution. We can keep track of the status of a command submitted to the command queue using an **Event**.

In summary we have the following components:

* **Platform**: provides access to devices
* **Device**: represents a way to access the compute device and to query device capabilities
* **Context**: provides a way to create Buffers and keep track of what is happening on compute devices
* **Buffer**: provides a way to allocate memory on devices
* **Program**: provides a way to aggregate kernels for each context and then build those kernels for each compute device in the context
* **Command queue**: provides a place to send work such as memory copy commands and kernel executions
* **Kernel**: provides a way to do work on a compute device
* **Event**: provides a way to keep track of work submitted to a command queue

## Specification Roadmap



<table>
    
<tr>
<th>Specification</th>
<th>Release year</th>
<th>Specifics</th>
</tr>

<tr>
    <td>1.0</td>
    <td>2008</td>
    <td>Initial implementation</td>
</tr>
    
</table>

### Vendor implementations

All of the major vendors have HIP implementations at varying levels of support for the HIP specification. The table below shows the latest known level of support for each version of the specification, along with links to the vendor's HIP developer page.

|Vendor| 1.2 | 2.0 | 2.1 | 2.2 | 3.0 |
| :- | :- | :- | :- | :- | :- |
| [AMD](https://rocmdocs.amd.com/en/latest/Programming_Guides/HIP-programming-guide.html) | Y | Y | Y | Some | N |
| [Apple](https://developer.apple.com/HIP) | Y | N | N | N | N |
| [ARM](https://developer.arm.com/solutions/graphics-and-gaming/apis/HIP) | Y | Y | Y | N | Y |
| [Intel](https://www.intel.com/content/www/us/en/developer/tools/HIP-sdk/overview.html) | Y | Y | Y | Some | Y |
| [NVIDIA](https://developer.nvidia.com/HIP) | Y | N | N | N | Y |
| [Portable HIP](http://portablecl.org) | Y | Some | N | N | N |

**[Apple](https://developer.apple.com/HIP)** was the original vendor for HIP and it comes baked into the MacOS operating system. However, the company has since moved on to its proprietary framework **Metal** and hasn't invested in HIP beyond specification 1.2. Support for HIP is built into **[NVIDIA](https://developer.nvidia.com/HIP)'s** CUDA toolkit, though after an initial flurry of development activity up to version 1.2, development stalled until version 3.0. Support for HIP with **[AMD](https://rocmdocs.amd.com/en/latest/Programming_Guides/HIP-programming-guide.html)** is part of the **[ROCm](https://rocmdocs.amd.com/en/latest/Programming_Guides/HIP-programming-guide.html)** suite. **[Intel](https://www.intel.com/content/www/us/en/developer/tools/HIP-sdk/overview.html)** strongly supports HIP development for CPUs and GPUs with its [oneAPI](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html?operatingsystem=linux&distributions=aptpackagemanager) toolkit. The CPU implementation also works for AMD CPUs, which is really good! **[ARM](https://developer.arm.com/solutions/graphics-and-gaming/apis/HIP)** has solid support for HIP on its Mali GPUs. The open source [POCL (Portable HIP)](http://portablecl.org/) implementation has a CPU implementation as well as support for HIP on CUDA and HIP on MacOS.

#### Conformance

A conformant HIP implementation is an implementation of HIP that has passed Khronos' [test suite](https://github.com/KhronosGroup/HIP-CTS). The list of vendors with conformant implementations is evolving. Click [here](https://www.khronos.org/conformance/adopters/conformant-products/HIP) to see the latest conformant implementations.

## Getting help for HIP

The best source of help for HIP is the [Khronos HIP registry](https://www.khronos.org/registry/HIP/). There you can find excellent documentation on the latest specification that your vendor supports. As an exercise, download the latest **API specification** in PDF format and have it ready as reference material.

### Exercise: 

Download from the Khronos HIP registry the latest HIP API and C language specifications to your computer.

## Is HIP right for you?

This is sometimes a difficult question to answer. Researchers often have diverse computing needs; in such cases HIP is a good fit, as it will provide a solid foundation for your research tools. However, if you are looking for the best possible performance and can live with vendor lock-in, then vendor-specific tools are likely to serve you better.

**Drawbacks to using HIP**

* Have to separately call functions in vendor libraries for device-specific hardware (e.g. tensor or matrix cores).
* When vendors have their own accelerator libraries it creates a financial incentive to prioritise development and performance of their libraries over their HIP implementation.
* Buffer allocations are sometimes limited to $1/4$ or more of available device memory (vendor specific)
* Lots of code required to set up the computation, increased potential for error
* Paucity of vendor-supported tools for debugging and profiling

**Benefits of using HIP**

* Straightforward well-defined C API with good documentation
* Ability to use a wide variety of hardware
* Data types to facilitate consistent precision across implementations
* Consistent math across implementations
* Support for vectors of up to 16 elements
* Open standard - the standard is not (explicitly) contingent on the wellbeing of a single vendor
* Mature, production quality HIP implementations

## Compiling HIP programs

To avoid confusion, note that there are two compilation steps for HIP applications:

1. Compiling the application itself before execution
2. Compiling kernels from within an application during execution

During program execution, kernels are combined into programs and the programs are compiled for each compute device using the vendor's kernel compiler. Thankfully, when compiling a HIP application prior to execution (Step 1), we don't need to link against every available implementation. We just need to link against a single library file called the **Installable Client Driver (ICD)** that may be provided by any vendor. At runtime, calls to HIP functions are **routed** to the vendor's implementation for that function.

The ICD has the name (**HIP.dll**) on Windows and (**libHIP.so**) on Linux. Accompanying the ICD are header files (**HIP.h** for C and **cl.hpp** for C++) that must be "included" from the C/C++ source code. The ICD takes care of intercepting all HIP library calls and routing them to the appropriate vendor implementation. The routing process happens transparently to the user. 

## Switching between AMD and NVIDIA backends

The **HIP_PLATFORM** environment variable determines which backend is used. If you specify

```bash
export HIP_PLATFORM=amd
```

then it will use the AMD backend, but if you specify

```bash
export HIP_PLATFORM=nvidia
```

then it will use the NVIDIA backend. 


> **Note:** 
> When the NVIDIA backend has had a recent major version change, it is advisable to not use the latest CUDA toolkit, as there can be API changes (such as deprecations) that HIP has yet to catch up with.





## Exercise: compiling your first HIP application

At the location [hello_devices.cpp](hello_devices.cpp) is a complete HIP application to obtain the size of on-device memory and the maximum Buffer size that is possible within that memory. 

* **Step 1.** From the Jupyter launcher start a Terminal and use cd to navigate to the src/L1_Introduction directory in the course material

```bash
cd src/L1_Introduction
```

* **Step 2.** You need to know where the HIP ICD loader and HIP header files are located. For this particular example the locations are as follows:

| File | Directory |
| :--- | :--- |
| ICD loader (libHIP.so) | /usr/lib/x86_64-linux-gnu |
| HIP C++ headers directory (CL) | /usr/include |


In the Terminal use **ls** to list the contents of these directories and locate the **CL** directory in which the HIP header files are located. 

* **Step 3.** Compile the application source file **hello_devices.cpp** using the **g++** compiler. The compilation command should look like this:

```bash
g++ -g -O2 -I/usr/include -I../include -L/usr/lib/x86_64-linux-gnu hello_devices.cpp\
    -o hello_devices.exe -lHIP
```

On Linux you can add the location of the **CL** directory to your **CPATH** environment variable, and the location of **libHIP.so** to both your **LIBRARY_PATH** and **LD_LIBRARY_PATH** environment variables. Then you won't need to explicitly tell the compiler where the HIP resources are.

```bash
g++ -g -O2 -I../include hello_devices.cpp -o hello_devices.exe -lHIP
```

* **Step 4.** Now run the application

```bash
./hello_devices.exe
```

You should see at least one device printed with the name and memory sizes. Now that you know how to let the compiler know about HIP you can use the **make** command within that directory to compile the example. 

In [1]:
!make clean; make

rm -r *.exe
g++ -std=c++11 -g -O2 -fopenmp -I/usr/include -I../include -L/usr/lib64 hello_devices.cpp\
	-o hello_devices.exe -lOpenCL -lomp
In file included from hello_devices.cpp:2:0:
../include/cl_helper.hpp: In function ‘_cl_command_queue** h_create_command_queues(_cl_device_id**, _cl_context**, cl_uint, cl_uint, cl_bool, cl_bool)’:
         );
         ^
In file included from /usr/include/CL/opencl.h:24:0,
                 from ../include/cl_helper.hpp:15,
                 from hello_devices.cpp:2:
/usr/include/CL/cl.h:1906:1: note: declared here
 clCreateCommandQueue(cl_context                     context,
 ^~~~~~~~~~~~~~~~~~~~


This application is rather rudimentary; however, there is a far more sophisticated HIP query application called **clinfo**. You can use it to query a great deal of information on the available devices. Here we use clinfo to query available platforms and devices.

In [1]:
!clinfo -l

Platform #0: Intel(R) FPGA Emulation Platform for OpenCL(TM)
 `-- Device #0: Intel(R) FPGA Emulation Device
Platform #1: Intel(R) OpenCL
 `-- Device #0: AMD EPYC 7571
Platform #2: AMD Accelerated Parallel Processing


## Resources

<address>
Written by Dr. Toby Potter of <a href="https://www.pelagos-consulting.com">Pelagos Consulting and Education</a> for the Pawsey Supercomputing Centre<br>
</address>