# Porting CUDA programs to HIP

HIP API calls are designed to closely match their CUDA equivalents. This enables HIP to function as a thin layer over CUDA and allows for reasonably easy porting of CUDA code to HIP code. Often it is just a matter of replacing **cuda -> hip** in the function calls, but not always. The ROCm suite provides two tools, **hipify-perl** and **hipify-clang**, to help with the porting process. The **hipify-perl** tool is robust and uses Perl to perform an intelligent search and replace of CUDA calls with HIP calls (intelligent meaning that it is aware of the CUDA and HIP API differences). The **hipify-clang** tool uses the clang front end to parse the CUDA source and is intended to produce a high quality port. The Perl-based method is better for quick ports of small codes, while the clang-based method is intended for ports of large codebases. The hipify-clang tool is more **fragile** though, and fails easily unless it has access to all the header files used in the compilation of the CUDA code.
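
To make the "not always" concrete, here is a toy sketch of a table-driven rename. It is not how **hipify-perl** is implemented; the function name `toy_hipify` and the tiny table are made up for illustration. The table entries follow renames that appear later in this notebook, including the case where two distinct CUDA error types both become `hipError_t`, which a plain prefix swap would not capture.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <regex>
#include <string>

// Toy illustration only: rename whole identifiers using a small lookup table.
// The real hipify tools use far larger tables and handle many more cases.
std::string toy_hipify(const std::string& src) {
    static const std::map<std::string, std::string> table = {
        {"cudaMalloc", "hipMalloc"},
        {"cudaFree", "hipFree"},
        {"cudaMemcpy", "hipMemcpy"},
        {"cudaMemcpyHostToDevice", "hipMemcpyHostToDevice"},
        {"cudaError_t", "hipError_t"},
        {"CUresult", "hipError_t"},  // not a simple cuda -> hip prefix swap
    };
    static const std::regex identifier("[A-Za-z_][A-Za-z0-9_]*");

    std::string out;
    std::size_t pos = 0;  // first character of src not yet copied to out
    for (std::sregex_iterator it(src.begin(), src.end(), identifier), end;
         it != end; ++it) {
        const std::size_t start = static_cast<std::size_t>(it->position());
        out.append(src, pos, start - pos);  // copy text between identifiers
        const std::string token = it->str();
        const auto hit = table.find(token);
        out += (hit != table.end()) ? hit->second : token;
        pos = start + token.size();
    }
    assert(pos <= src.size());
    out.append(src, pos, std::string::npos);  // copy whatever follows the last identifier
    return out;
}
```

Because only whole identifiers are looked up, a user-defined name such as `my_cudaMalloc_wrapper` is left untouched by this sketch.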

## Supported APIs

The hipify tools will port a majority of CUDA calls as well as calls to CUDA libraries like **cuBLAS**. The tables at [this GitHub page](https://github.com/ROCm-Developer-Tools/HIPIFY/blob/amd-staging/docs/supported_apis.md) provide some guidance as to what is supported.

## Documentation on porting

The documentation from AMD at this [site](https://rocm.docs.amd.com/projects/HIP/en/develop/user_guide/hip_porting_guide.html) provides in-depth knowledge about the porting process and preprocessor directives. The guide below is intended to provide complementary information to the official documentation.

## Setup and installation

[This source](https://sep5.readthedocs.io/en/latest/Programming_Guides/HIP-porting-guide.html) recommends attempting the port on a machine that has access to both CUDA and HIP libraries. This usually means doing the port on a machine with an NVIDIA GPU and access to CUDA. Then one can port portions of the code at a time and compare results. For best results during porting you need a version of CUDA that is compatible with your installed version of hipify-clang. The table at [this resource](https://rocm.docs.amd.com/projects/HIPIFY/en/latest/hipify-clang.html) provides information on which version of **hipify-clang** is compatible with which version of CUDA.

Once the code is ported and compiles with the HIP compiler, it is important to be aware that HIP functions may try to access ROCm libraries on the backend without prior warning of this dependency. It is then a good idea to make sure you have a complete installation of ROCm, in addition to having a version of ROCm that is API compatible with your code's version of CUDA. 

It is also important to be aware that some HIP libraries like **hipBLAS** are built to use the corresponding library from ROCm by default. If you need to use these libraries with a CUDA backend you might need to recompile those libraries for use with CUDA.

The command below shows which version of hipify-clang you are using.

In [1]:
!hipify-clang --version

AMD LLVM version 17.0.0git
  Optimized build.


The [HIPIFY Documentation](https://rocm.docs.amd.com/projects/HIPIFY/en/latest/hipify-clang.html) describes compatibility between CUDA and your version of hipify-clang. 

## General porting process

The general porting process proceeds as follows:

1. Compile with CUDA to verify that the program compiles.
1. Run a hipify tool to convert sources.
    * Use the flag `-hip-kernel-execution-syntax` to convert CUDA kernel launch syntax to HIP kernel launch syntax.
1. Adjust the compilation environment to use **hipcc**.
    * This has to be manually ported!
1. Fix compilation errors.
    * It can be the most difficult part of porting!
    * Use environment variable `HIP_PLATFORM=nvidia` or `HIP_PLATFORM=amd` to switch between backends.
    * Preprocessor directives can separate CUDA code from hip-clang code.
        * Use the directive `__HIP_PLATFORM_NVIDIA__` for CUDA-specific code.
        * Use the directive `__HIP_PLATFORM_AMD__` for separate AMD-specific code.
1. Verify code correctness.
    * Test on CUDA and AMD architectures to ensure portability
1. Re-tune optimisations for new architecture, but only once you know everything works!
1. Document the changes made.

The step of running the **hipify** tool is the **easiest part** of the process. If the code uses simple and well-supported CUDA API calls, then this has the greatest chance of succeeding. If the codebase contains CUDA-specific complexity, or relies on functionality that is no longer supported by recent versions of CUDA, then the level of difficulty in porting can increase **dramatically**. Adjusting the compilation environment for **hipcc** often requires knowledge of build tools, which usually means a working knowledge of **make** or **cmake**.
Familiarity with C++, and with what the compiler warnings and errors mean, is then crucial to massaging the codebase to accept the new compiler.


## Example setup

In this example we are going to follow the steps above to port a CUDA version of the matrix multiplication code to use HIP. The CUDA version is located in the subdirectory **cuda_mat_mult**, and in the subdirectory **hip_mat_mult** is the corresponding HIP version for reference. On an NVIDIA system (it won't work without CUDA) you can change directory to **cuda_mat_mult** and run `make` to build the software. 

In [2]:
!cd cuda_mat_mult; make clean; make; ./mat_mult.exe

rm -r *.exe
rm: cannot remove '*.exe': No such file or directory
make: *** [Makefile:33: clean] Error 1
nvcc -g -O2 -Xcompiler -fopenmp -x cu mat_mult.cpp -o mat_mult.exe -lcuda -lgomp
make: nvcc: No such file or directory
make: *** [Makefile:29: mat_mult.exe] Error 127
/bin/bash: line 1: ./mat_mult.exe: No such file or directory


If you list the sources in the directory you can see the C++ file `mat_mult.cpp`.

In [3]:
!ls cuda_mat_mult

cuda_helper.hpp  Makefile  mat_helper.hpp  mat_mult.cpp  mat_size.hpp


Ordinarily CUDA source files would need to end in `.cu`, otherwise the `nvcc` compiler won't interpret them as CUDA source. However, because the `-x cu` flag is set in the makefile, `nvcc` treats them as CUDA source.

Let's now make a temporary copy of this directory for conversion purposes.

In [4]:
!mkdir -p temp_mat_mult; cp -r cuda_mat_mult/* temp_mat_mult/ 

## Porting techniques with hipify tools

### Port a single file

The **hipify-perl** command can port a single file to use the HIP API. We use it to port the file **mat_mult.cpp** in the directory **temp_mat_mult**. The flag `-hip-kernel-execution-syntax` changes the kernel launch syntax from the CUDA-style triple chevron `<<< >>>` to the ANSI C++ compliant **hipLaunchKernelGGL** macro. The following command dumps the output to the command line, but you can use the `-o` flag to specify an output file.

In [5]:
!cd temp_mat_mult; hipify-perl -hip-kernel-execution-syntax mat_mult.cpp

#include "hip/hip_runtime.h"
/* Code to perform a Matrix multiplication using cuda
Written by Dr Toby M. Potter
*/

// Setup headers
#include <cassert>
#include <cmath>
#include <iostream>

// Bring in the size of the matrices
#include "mat_size.hpp"

// Bring in a library to manage matrices on the CPU
#include "mat_helper.hpp"

// Bring in helper header to manage boilerplate code
#include "cuda_helper.hpp"

// standard matrix multiply kernel 
__global__ void mat_mult (
        float* A, 
        float* B, 
        float* C, 
        size_t N1_A, 
        size_t N0_C,
        size_t N1_C) { 
            
    // A is of size (N0_C, N1_A)
    // B is of size (N1_A, N1_C)
    // C is of size (N0_C, N1_C)   
    
    // i0 and i1 represent the coordinates in Matrix C 
    // We use row-major ordering for the matrices
    
    size_t i0 = blockIdx.y * blockDim.y + threadIdx.y;
    size_t i1 = blockIdx.x * blockDim.x + threadIdx.x;
    
    // Scratch variable
    float temp=0.0f; 

    // G

If we use the `-inplace` flag, **hipify-perl** copies the file [mat_mult.cpp](temp_mat_mult/mat_mult.cpp) first to [mat_mult.cpp.prehip](temp_mat_mult/mat_mult.cpp.prehip) **if that file doesn't already exist**. Then it performs the conversion from [mat_mult.cpp.prehip](temp_mat_mult/mat_mult.cpp.prehip) to [mat_mult.cpp](temp_mat_mult/mat_mult.cpp). You can make changes to `*.prehip` files and run the conversion as many times as you wish.

In [6]:
!cd temp_mat_mult; hipify-perl -inplace -print-stats -hip-kernel-execution-syntax mat_mult.cpp


[HIPIFY] info: file 'mat_mult.cpp' statistics:
  CONVERTED refs count: 15
  TOTAL lines of code: 190
[HIPIFY] info: CONVERTED refs by names:
  cudaDeviceSynchronize => hipDeviceSynchronize: 1
  cudaFree => hipFree: 3
  cudaGetLastError => hipGetLastError: 1
  cudaMalloc => hipMalloc: 3
  cudaMemcpy => hipMemcpy: 3
  cudaMemcpyDeviceToHost => hipMemcpyDeviceToHost: 1
  cudaMemcpyHostToDevice => hipMemcpyHostToDevice: 2


Subsequent edits to [mat_mult.cpp.prehip](temp_mat_mult/mat_mult.cpp.prehip) will be propagated across to [mat_mult.cpp](temp_mat_mult/mat_mult.cpp). This allows for an iterative porting process. Use the `--help` flag on hipify commands for more porting options.

### Examine a directory structure for porting potential

We use the scripts **hipexamine-perl.sh** or **hipexamine.sh** to recursively search through a directory and examine the potential for porting a code. Note there is a summary produced for each file, showing what API calls were converted.

In [7]:
!hipexamine-perl.sh cuda_mat_mult -exclude-dirs=".ipynb_checkpoints"

/opt/rocm-6.0.2/bin/hipexamine-perl.sh: line 13: /opt/rocm-6.0.2/bin/../../libexec/hipify/findcode.sh: No such file or directory


If we try the hipify-clang version (**hipexamine.sh**) we see that it doesn't handle preprocessor directives very well. The following errors with `_aligned_malloc` are due to it not picking up the Windows-specific `#define` clauses.

In [8]:
!cd cuda_mat_mult; hipexamine.sh . -I.

/opt/rocm-6.0.2/bin/hipexamine.sh: line 23: /opt/rocm-6.0.2/bin/../../libexec/hipify/findcode.sh: No such file or directory

[HIPIFY] error: Must specify at least 1 positional argument for source file


### Porting a directory structure inplace

Both the **hipconvertinplace-perl.sh** and **hipconvertinplace.sh** scripts have the ability to convert a code tree inplace. The additional option **-hip-kernel-execution-syntax** replaces CUDA triple chevron kernel launches with equivalent calls to the **hipLaunchKernelGGL** macro.
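
As a rough picture of what that rewrite looks like, the sketch below launches a small kernel with **hipLaunchKernelGGL** and shows the equivalent triple chevron launch in a comment. The kernel `scale` and the helpers `blocks_for` and `scale_vector` are made-up names, not part of the matrix multiplication example. The device path is guarded by `__HIPCC__`, which I am assuming is defined when building with **hipcc**; without it the function falls back to a plain host loop so the file still compiles with an ordinary C++ compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

#if defined(__HIPCC__)
#include <hip/hip_runtime.h>

// Hypothetical kernel: multiply every element of x by a.
__global__ void scale(float* x, float a, std::size_t n) {
    std::size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}
#endif

// Number of blocks needed so that grid * block covers all n elements.
inline unsigned int blocks_for(std::size_t n, unsigned int block) {
    assert(block > 0);
    return static_cast<unsigned int>((n + block - 1) / block);
}

// Scale x in place. Returns false if any device call reports an error.
bool scale_vector(std::vector<float>& x, float a) {
    const std::size_t n = x.size();
    if (n == 0) return true;
#if defined(__HIPCC__)
    const unsigned int block = 256;
    const unsigned int grid = blocks_for(n, block);
    const std::size_t nbytes = n * sizeof(float);

    float* d_x = nullptr;
    if (hipMalloc(&d_x, nbytes) != hipSuccess) return false;
    bool ok = hipMemcpy(d_x, x.data(), nbytes, hipMemcpyHostToDevice) == hipSuccess;
    if (ok) {
        // CUDA syntax would be: scale<<<grid, block, 0, 0>>>(d_x, a, n);
        hipLaunchKernelGGL(scale, dim3(grid), dim3(block), 0, 0, d_x, a, n);
        ok = hipGetLastError() == hipSuccess &&
             hipMemcpy(x.data(), d_x, nbytes, hipMemcpyDeviceToHost) == hipSuccess;
    }
    (void)hipFree(d_x);
    return ok;
#else
    // Host fallback used when HIP is not available.
    for (float& v : x) v *= a;
    return true;
#endif
}
```

The arguments after the kernel name are the grid size, the block size, the dynamic shared memory in bytes and the stream, followed by the kernel's own arguments.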

#### Porting inplace with hipify-perl

In [9]:
!hipconvertinplace-perl.sh temp_mat_mult -hip-kernel-execution-syntax

/opt/rocm-6.0.2/bin/hipconvertinplace-perl.sh: line 18: /opt/rocm-6.0.2/bin/../../libexec/hipify/findcode.sh: No such file or directory


#### Porting inplace with hipify-clang

Here is the same port with **hipify-clang**. I have commented it out because it doesn't produce a port without access to all the necessary include files.

In [10]:
#!hipconvertinplace.sh temp_mat_mult -hip-kernel-execution-syntax

### Building the ported code

If we examine the source tree we see that every source file that has been hipified has been first copied to a file with suffix `*.prehip`. Then the converted code is overwritten in place of the old file.

In [11]:
!ls -l temp_mat_mult

total 60
-rw-r--r-- 1 tpotter unix_users 24629 May  7 16:15 cuda_helper.hpp
-rw-r--r-- 1 tpotter unix_users   738 May  7 16:15 Makefile
-rw-r--r-- 1 tpotter unix_users  4497 May  7 16:15 mat_helper.hpp
-rw-r--r-- 1 tpotter unix_users  5975 May  7 16:15 mat_mult.cpp
-rw-r--r-- 1 tpotter unix_users  5944 May  7 16:15 mat_mult.cpp.prehip
-rw-r--r-- 1 tpotter unix_users   107 May  7 16:15 mat_size.hpp


Try making the ported code with hipcc.

In [12]:
!cd temp_mat_mult; make clean; make CXX="hipcc"

rm -r *.exe
rm: cannot remove '*.exe': No such file or directory
make: *** [Makefile:33: clean] Error 1
hipcc -g -O2 -fopenmp mat_mult.cpp -o mat_mult.exe 
In file included from mat_mult.cpp:18:
./cuda_helper.hpp:31:10: fatal error: 'cuda.h' file not found
#include <cuda.h>
         ^~~~~~~~
1 error generated when compiling for gfx906.
make: *** [Makefile:29: mat_mult.exe] Error 1


In the original file **cuda_mat_mult/cuda_helper.hpp** we had overloaded the **h_errchk** function to accept error codes of both type **CUresult** and **cudaError_t**. Following conversion to HIP both types are replaced with just **hipError_t**, so the two overloads end up with identical signatures and the compiler reports a redefinition. Therefore we need to manually delete the duplicate **h_errchk** function in **[temp_mat_mult/cuda_helper.hpp.prehip](temp_mat_mult/cuda_helper.hpp.prehip)**. Then rerun the conversion and the make.

In [13]:
# Run this after modifying cuda_helper.hpp.prehip
!cd temp_mat_mult; hipify-perl -inplace -hip-kernel-execution-syntax cuda_helper.hpp
!cd temp_mat_mult; make CXX="hipcc"; ./mat_mult.exe

hipcc -g -O2 -fopenmp mat_mult.cpp -o mat_mult.exe 
In file included from mat_mult.cpp:18:
./cuda_helper.hpp:54:6: error: redefinition of 'h_errchk'
void h_errchk(hipError_t errcode, const char* message) {
     ^
./cuda_helper.hpp:39:6: note: previous definition is here
void h_errchk(hipError_t errcode, const char* message) {
     ^
1 error generated when compiling for gfx906.
make: *** [Makefile:29: mat_mult.exe] Error 1
/bin/bash: line 1: ./mat_mult.exe: No such file or directory


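The output above comes from a run in which the duplicate **h_errchk** definition was still present. Once it has been removed from the `*.prehip` file and the conversion rerun, we should have a successful port of the CUDA code to HIP!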
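
For reference, a single merged error checker might look something like the sketch below. The signature matches the one in the compiler output above, but the body is my own assumption: the real helper in **cuda_helper.hpp** may print and exit rather than throw. When HIP is not available, a stand-in `hipError_t` and `hipGetErrorString` (not the real HIP definitions) let the sketch compile with an ordinary C++ compiler.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

#if defined(__HIPCC__)  // assumed to be defined when building with hipcc
#include <hip/hip_runtime.h>
#else
// Stand-ins for illustration only, so the sketch compiles without HIP.
enum hipError_t { hipSuccess = 0, hipErrorInvalidValue = 1 };
inline const char* hipGetErrorString(hipError_t e) {
    return e == hipSuccess ? "no error" : "invalid argument";
}
#endif

// After hipify, CUresult and cudaError_t have both become hipError_t, so the
// two original overloads share one signature. Only one definition may remain.
// Assumption: this sketch throws on failure instead of exiting the program.
void h_errchk(hipError_t errcode, const char* message) {
    assert(message != nullptr);
    if (errcode != hipSuccess) {
        throw std::runtime_error(std::string(message) + ": " +
                                 hipGetErrorString(errcode));
    }
}
```

A call such as `h_errchk(hipMalloc(&ptr, nbytes), "hipMalloc")` then covers what previously needed two overloads.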

## Case studies in porting codes

Apart from the toy example above, it is informative to try porting some real-world applications and see what learnings can be derived from the process. Here are a few dot points that stood out.

### Visualising shipwrecks

According to [this resource](https://blogs.nvidia.com/blog/2022/11/18/3d-shipwrecks-perth/), researchers at Curtin University are using open source visualisation software to render 3D environments of shipwrecks. Some feedback from their work was:

* Fragility between the CUDA and HIP APIs. They had to have a version of ROCm that closely matches the CUDA library.
* Some CUDA functions were just wrappers that call other CUDA functions and thus were unable to be ported.
* HIP functions would call ROCm functions on an AMD backend. It was a trial and error process to work out what ROCm libraries were required.

### FiCoS

From [this GitLab repository](https://gitlab.com/andrea-tango/ficos), FiCoS is a simulator for biochemical networks. It uses CUDA to solve Ordinary Differential Equations (ODEs) in parallel over a GPU. As the project is quite small the **hipconvertinplace-perl.sh** script was used to easily port the code. During compilation I encountered numerous errors of the following type:

* Arrays of the data type **hipDoubleComplex** were incompatible with required input arguments of type **hipblasDoubleComplex** for hipBLAS routines when trying to use the NVIDIA backend. This was not an issue with the AMD backend.
* Dynamic parallelism (kernels launching kernels) is a design feature that is employed in this code. Dynamic parallelism is not supported on the AMD backend, which means that the codebase must be refactored in order to use AMD.

### CURC

[CURC](https://github.com/BioinfoSZU/CURC) is a CUDA-based bioinformatics tool to compress and decompress genome information from FASTQ files. 

* The port itself proceeded smoothly, however there were a lot of warnings about texture reference API calls, which are no longer supported in CUDA 12 in favour of texture object API calls.
* The CMake build system was CUDA-specific and required modification to use HIP.

### Miluphcuda

From their [GitHub page](https://github.com/christophmschaefer/miluphcuda), Miluphcuda is a CUDA-based Smoothed Particle Hydrodynamics code for modelling astrophysical impacts. Porting this source tree was a **complete success** as it did not use any deprecated or CUDA-specific features, and required only minor syntactical changes. One flag that was required during compilation was `-fgpu-rdc`, because some source files relied on constant memory whose symbols were uploaded in another source file. The relocatable device code flag allows compilation to proceed even though those symbols are absent from the file being compiled. The ported source tree is available [here](https://github.com/drtpotter/miluph-hip).

## Learnings from the porting process

### Code complexity can be an enemy of progress!

The quest for greater performance often comes at the price of greater complexity. Porting efforts in the case studies were often thwarted because I would encounter assembly code or esoteric CUDA features like texture reference calls that are no longer supported by either CUDA or HIP. When developing codes it is important to weigh up the tradeoff between small increases in efficiency and the developer cost of maintaining this complexity.

### Naming conventions

Vendor diversity is good for business. This means that vendors will change, as will the functions used to access the accelerator hardware. Therefore incorporating the words `cuda` or `hip`, or even `gpu`, in function or class names isn't going to port well to new platforms, and may confuse future porting tools.

### Hardware differences between GPUs

#### Thread team size

From [Tsai et al. (2020)](https://www.researchgate.net/publication/342464640_Preparing_Ginkgo_for_AMD_GPUs_--_A_Testimonial_on_Porting_CUDA_Code_to_HIP), the primary architectural difference between AMD and NVIDIA is AMD's use of 64 threads in a thread team versus 32 threads for NVIDIA devices.
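
The host-side arithmetic below shows why a block size tuned for teams of 32 may not suit teams of 64. The helper names `teams_per_block` and `idle_lanes` are made up for this sketch, and the team sizes are passed in as parameters rather than queried from a device.

```cpp
#include <cassert>
#include <cstddef>

// Number of thread teams (warps or wavefronts) needed to hold one block.
inline std::size_t teams_per_block(std::size_t block_size, std::size_t team_size) {
    assert(team_size > 0);
    return (block_size + team_size - 1) / team_size;
}

// Lanes left without a thread in the last, partially filled team.
inline std::size_t idle_lanes(std::size_t block_size, std::size_t team_size) {
    return teams_per_block(block_size, team_size) * team_size - block_size;
}
```

For example, a block of 96 threads fills three teams of 32 exactly, but needs two teams of 64 with 32 lanes idle. In device code, using the built-in `warpSize` variable instead of a hard-coded 32 avoids baking the team size into the kernel.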

#### Available registers at peak occupancy

We saw from Lesson 7 that the MI250X GPU may have fewer available registers per kernel thread in order to maintain peak occupancy. Combined with differences in compiler and runtime maturity, this may mean that a kernel running on AMD hardware does not achieve the same occupancy as it did on NVIDIA hardware. The tool Omniperf is good at showing the occupancy of your kernels. If you see reduced occupancy in ported code then see some of the tips in Lesson 7 on <a href="../L7_Kernel_Optimisation/Optimisation.ipynb">optimising kernels</a> to try and reduce register pressure.

### Software differences between CUDA and hip-clang

CUDA has the notion of a driver API and a runtime API. HIP fuses the two into one API and then supports a subset of the fused API. If your CUDA code provides functionality for using both APIs then you might encounter some redundant code in the overlap. This [table](https://rocm.docs.amd.com/projects/HIP/en/latest/user_guide/faq.html#what-apis-and-features-does-hip-support) has the most up-to-date listing of features that are supported and not supported in HIP. In particular, here are some notable points of difference between CUDA and HIP.

* Launching kernels from kernels (dynamic parallelism) is not supported in HIP. Avoid using this design pattern when building cross-platform code.
* Context management is deprecated in HIP. CUDA code that uses contexts can be migrated to HIP's simpler method of using primary contexts and switching devices from threads.
* Graphics interoperability with OpenGL is not yet supported.
* The CUDA API has undergone some major changes such as the removal (since CUDA 12) of the texture reference API in favour of the texture object API.
* Any inline `PTX` assembly instructions for CUDA kernels will need to be ported across to `hsaco` assembly in AMD kernels. Use the preprocessor directives `__HIP_PLATFORM_NVIDIA__` and `__HIP_PLATFORM_AMD__` to enclose vendor specific code.
* CUDA doesn't appear to support math operations on vector types. Use preprocessor directives to index into individual vector components when using the CUDA backend.
* The memory types `cudaMemoryType*` evaluate to different enum values when porting from CUDA to HIP. Hopefully these are not hardcoded into your CUDA code, but it is something to be aware of.
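
As an illustration of the vector type point above, the sketch below adds two `float4` values using the overloaded operator on the AMD backend and component by component everywhere else. The names `HOST_DEVICE`, `add4` and `check_add4` are made up, and I am assuming that one of the platform macros is already defined before the first include, as happens when the compiler driver sets it. With neither macro defined, a stand-in `float4` struct lets the code compile on the host only.

```cpp
#include <cassert>

#if defined(__HIP_PLATFORM_AMD__) || defined(__HIP_PLATFORM_NVIDIA__)
#include <hip/hip_runtime.h>
#define HOST_DEVICE __host__ __device__
#else
// Stand-in for illustration only, so the sketch compiles without HIP.
struct float4 { float x, y, z, w; };
#define HOST_DEVICE
#endif

// Add two float4 values in a way that works on either backend.
HOST_DEVICE inline float4 add4(const float4& a, const float4& b) {
#if defined(__HIP_PLATFORM_AMD__)
    // The AMD backend provides arithmetic operators on vector types.
    return a + b;
#else
    // CUDA backend (and host fallback): work on individual components.
    float4 c;
    c.x = a.x + b.x;
    c.y = a.y + b.y;
    c.z = a.z + b.z;
    c.w = a.w + b.w;
    return c;
#endif
}

// Small host-side check that add4 behaves as expected.
inline void check_add4() {
    float4 a, b;
    a.x = 1.0f; a.y = 2.0f; a.z = 3.0f; a.w = 4.0f;
    b.x = 10.0f; b.y = 20.0f; b.z = 30.0f; b.w = 40.0f;
    const float4 c = add4(a, b);
    assert(c.x == 11.0f && c.y == 22.0f && c.z == 33.0f && c.w == 44.0f);
}
```

The component-wise branch is valid on both backends, so in practice you could use it everywhere; the split is shown only to demonstrate how vendor-specific code can be fenced off.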

## Miscellaneous porting tips

### Relocatable device code

According to [this source](https://docs.amd.com/projects/HIP/en/latest/user_guide/hip_porting_driver_api.html), the `-fgpu-rdc` option allows kernels to call functions that are compiled in different translation units. This is useful for instances where a kernel might not be aware of things like allocations in constant memory where symbols are uploaded in another file. At the [Pawsey P'Con 23 Hackathon](https://pawsey.org.au/event/pacer-conference-2023-pcon23-registration/) a team found that the use of this flag generated excessively long link times though.

### Preprocessor directives

During compilation the preprocessor directive `__HIP_PLATFORM_NVIDIA__` is defined when using an NVIDIA backend, and the preprocessor directive `__HIP_PLATFORM_AMD__` is defined when using an AMD backend.

### Switching between backends

The environment variable `HIP_PLATFORM` controls what backend to use when compiling HIP source. Set `HIP_PLATFORM=nvidia` to use the CUDA backend and set `HIP_PLATFORM=amd` or leave it unset to use the AMD backend.

## Alternative approaches to porting

Maintaining a codebase to work across both AMD and NVIDIA backends is a difficult task, especially if you are trying to ensure the code is both performant and portable between platforms. An easier task might be to use abstraction to separate the science from the hardware access. Then you can construct vendor-specific libraries to access the hardware and do the math transformations, and choose between them at compile time. For an example of this see [this code](https://github.com/pelahi/profile_util) from Pawsey staff member Pascal Elahi. Alternatively, you can use libraries like [kokkos](https://github.com/kokkos/kokkos) or [raja](https://computing.llnl.gov/projects/raja-managing-application-portability-next-generation-platforms) to manage hardware access at the cost of transparency.
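
A minimal sketch of such an abstraction layer is shown below. It is not taken from the linked code; the namespace `accel`, its functions, and the build flags `USE_HIP_BACKEND` and `USE_CUDA_BACKEND` are all made-up names. The science code calls only the `accel` functions, and the backend is chosen at compile time. With neither flag set, it falls back to host memory so the same source can be built and tested without a GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

#if defined(USE_HIP_BACKEND)      // hypothetical build flag
#include <hip/hip_runtime.h>
#elif defined(USE_CUDA_BACKEND)   // hypothetical build flag
#include <cuda_runtime.h>
#endif

namespace accel {

// Allocate nbytes of accelerator memory; returns nullptr on failure.
inline void* alloc(std::size_t nbytes) {
#if defined(USE_HIP_BACKEND)
    void* p = nullptr;
    return hipMalloc(&p, nbytes) == hipSuccess ? p : nullptr;
#elif defined(USE_CUDA_BACKEND)
    void* p = nullptr;
    return cudaMalloc(&p, nbytes) == cudaSuccess ? p : nullptr;
#else
    return std::malloc(nbytes);
#endif
}

// Release memory obtained from alloc.
inline void release(void* p) {
#if defined(USE_HIP_BACKEND)
    (void)hipFree(p);
#elif defined(USE_CUDA_BACKEND)
    (void)cudaFree(p);
#else
    std::free(p);
#endif
}

// Copy nbytes from host memory to accelerator memory.
inline bool upload(void* dst, const void* src, std::size_t nbytes) {
    assert(dst != nullptr && src != nullptr);
#if defined(USE_HIP_BACKEND)
    return hipMemcpy(dst, src, nbytes, hipMemcpyHostToDevice) == hipSuccess;
#elif defined(USE_CUDA_BACKEND)
    return cudaMemcpy(dst, src, nbytes, cudaMemcpyHostToDevice) == cudaSuccess;
#else
    std::memcpy(dst, src, nbytes);
    return true;
#endif
}

// Copy nbytes from accelerator memory back to host memory.
inline bool download(void* dst, const void* src, std::size_t nbytes) {
    assert(dst != nullptr && src != nullptr);
#if defined(USE_HIP_BACKEND)
    return hipMemcpy(dst, src, nbytes, hipMemcpyDeviceToHost) == hipSuccess;
#elif defined(USE_CUDA_BACKEND)
    return cudaMemcpy(dst, src, nbytes, cudaMemcpyDeviceToHost) == cudaSuccess;
#else
    std::memcpy(dst, src, nbytes);
    return true;
#endif
}

}  // namespace accel
```

Keeping vendor names out of the public function names follows the naming advice given earlier: only this one header needs to change when a new backend is added.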

<address>
Written by Dr. Toby Potter of <a href="https://www.pelagos-consulting.com">Pelagos Consulting and Education</a> for the <a href="https://pawsey.org.au">Pawsey Supercomputing Research Centre</a> with significant input from the Pawsey team. All trademarks mentioned are the property of their respective owners.
</address>