# Porting CUDA programs to HIP

HIP API calls are designed to closely match their CUDA equivalents. This lets HIP function as a thin layer over CUDA and makes porting CUDA code to HIP reasonably easy. Often it is just a matter of replacing **cuda -> hip** in the function calls. ROCm provides two tools, **hipify-perl** and **hipify-clang**, to help with the porting process. **hipify-perl** is a robust Perl script that performs an intelligent search and replace of CUDA calls with HIP calls, while **hipify-clang** uses the Clang front end to parse the CUDA source and produce a high-quality port. The Perl-based method is better for quick ports of small codes, while the Clang-based method is intended for large codebases. hipify-clang is much pickier, though, and fails easily unless it has access to all the header files used to compile the CUDA code.

## Setup and installation

The [HIP porting guide](https://sep5.readthedocs.io/en/latest/Programming_Guides/HIP-porting-guide.html) recommends porting on a machine that has access to both the CUDA and HIP libraries, which usually means a machine with an NVIDIA GPU. You can then port the code one portion at a time and compare results as you go. For best results with hipify-clang, your installed version of CUDA must be compatible with your installed version of hipify-clang.

In [1]:
!hipify-clang --version

AMD LLVM version 16.0.0git
  Optimized build.


Here is a page which describes compatibility between CUDA and hipify-clang.

[HIPIFY Documentation](https://rocm.docs.amd.com/projects/HIPIFY/en/latest/hipify-clang.html)

## Trial setup

There are two sub-directories in this module:

* cuda_mat_mult
* hip_mat_mult

The directory **cuda_mat_mult** contains a CUDA version of the HIP matrix multiplication code in **hip_mat_mult**. It was manually ported from HIP to CUDA. We are going to use the HIP tools to port the CUDA code back to HIP.

## Porting techniques

### Examine the code for porting potential

We use the scripts **hipexamine-perl.sh** or **hipexamine.sh** to recursively search a directory and assess how much of the code can be ported. Note the summary that is produced for each file.

In [18]:
!hipexamine-perl.sh cuda_mat_mult -exclude-dirs=cuda_mat_mult/.ipynb_checkpoints


[HIPIFY] info: file 'cuda_mat_mult/cuda_helper.cu' statistics:
  CONVERTED refs count: 55
  TOTAL lines of code: 789
[HIPIFY] info: CONVERTED refs by names:
  CUDA_SUCCESS => hipSuccess: 4
  CUresult => hipError_t: 4
  cuGetErrorString => hipDrvGetErrorString: 1
  cuInit => hipInit: 1
  cuda.h => hip/hip_runtime.h: 2
  cudaDevAttrManagedMemory => hipDeviceAttributeManagedMemory: 1
  cudaDeviceGetAttribute => hipDeviceGetAttribute: 1
  cudaDeviceProp => hipDeviceProp_t: 2
  cudaDeviceReset => hipDeviceReset: 1
  cudaDeviceSynchronize => hipDeviceSynchronize: 1
  cudaError_t => hipError_t: 4
  cudaEventCreate => hipEventCreate: 2
  cudaEventDestroy => hipEventDestroy: 2
  cudaEventElapsedTime => hipEventElapsedTime: 1
  cudaEventRecord => hipEventRecord: 3
  cudaEventSynchronize => hipEventSynchronize: 2
  cudaEvent_t => hipEvent_t: 3
  cudaGetDevice => hipGetDevice: 1
  cudaGetDeviceCount => hipGetDeviceCount: 2
  cudaGetDeviceProperties => hipGetDeviceProperties: 2
  cudaGetErrorStrin

If we try the hipify-clang version we see that it doesn't handle preprocessor directives very well. The following errors involving `SYSTEM_INFO` and `_aligned_malloc` arise because it does not pick up the Windows-specific `#define` clauses.

In [3]:
!hipexamine.sh ./cuda_mat_mult 

error: unsupported architecture 'nvptx64' for host compilation
/tmp/cuda_helper.cu-1a3a21.hip:95:5: error: unknown type name 'SYSTEM_INFO'
    SYSTEM_INFO sys_info;
    ^
/tmp/cuda_helper.cu-1a3a21.hip:381:20: error: use of undeclared identifier '_aligned_malloc'; did you mean 'aligned_alloc'?
    void* buffer = _aligned_malloc(nbytes, alignment);
                   ^~~~~~~~~~~~~~~
                   aligned_alloc
/usr/include/stdlib.h:592:14: note: 'aligned_alloc' declared here
extern void *aligned_alloc (size_t __alignment, size_t __size)
             ^
/tmp/cuda_helper.cu-1a3a21.hip:383:11: error: redefinition of 'buffer'
    void* buffer = aligned_alloc(alignment, nbytes);
          ^
/tmp/cuda_helper.cu-1a3a21.hip:381:11: note: previous definition is here
    void* buffer = _aligned_m

### Porting in place

Both the **hipconvertinplace-perl.sh** and **hipconvertinplace.sh** scripts can convert a code tree in place. The additional option **-hip-kernel-execution-syntax** replaces CUDA triple-chevron kernel launches with the equivalent call to the **hipLaunchKernelGGL** macro.

#### Porting in place with hipify-perl

In [21]:
!rm -rf temp_mat_mult; cp -r cuda_mat_mult temp_mat_mult 
!hipconvertinplace-perl.sh temp_mat_mult -exclude-dirs=temp_mat_mult/.ipynb_checkpoints -hip-kernel-execution-syntax


[HIPIFY] info: file 'temp_mat_mult/mat_mult.cu' statistics:
  CONVERTED refs count: 16
  TOTAL lines of code: 193
[HIPIFY] info: CONVERTED refs by names:
  cudaDeviceSynchronize => hipDeviceSynchronize: 1
  cudaFree => hipFree: 3
  cudaGetLastError => hipGetLastError: 1
  cudaLaunchKernel => hipLaunchKernel: 1
  cudaMalloc => hipMalloc: 3
  cudaMemcpy => hipMemcpy: 3
  cudaMemcpyDeviceToHost => hipMemcpyDeviceToHost: 1
  cudaMemcpyHostToDevice => hipMemcpyHostToDevice: 2

[HIPIFY] info: file 'temp_mat_mult/cuda_helper.cu' statistics:
  CONVERTED refs count: 56
  TOTAL lines of code: 789
[HIPIFY] info: CONVERTED refs by names:
  CUDA_SUCCESS => hipSuccess: 4
  CUresult => hipError_t: 4
  cuGetErrorString => hipDrvGetErrorString: 1
  cuInit => hipInit: 1
  cuda.h => hip/hip_runtime.h: 2
  cudaDevAttrManagedMemory => hipDeviceAttributeManagedMemory: 1
  cudaDeviceGetAttribute => hipDeviceGetAttribute: 1
  cudaDeviceProp => hipDeviceProp_t: 2
  cudaDeviceReset => hipDeviceReset: 1
  cudaD

#### Porting in place with hipify-clang

Here is the same port with **hipify-clang**.

In [17]:
!rm -rf temp_mat_mult; cp -r cuda_mat_mult temp_mat_mult 
!hipconvertinplace.sh temp_mat_mult -hip-kernel-execution-syntax

error: unsupported architecture 'nvptx64' for host compilation

[HIPIFY] info: file 'temp_mat_mult/mat_mult.cu' statistics:
  CONVERTED refs count: 15
  UNCONVERTED refs count: 0
  CONVERSION %: 100.0
  REPLACED bytes: 277
  TOTAL bytes: 6060
  CHANGED lines of code: 13
  TOTAL lines of code: 193
  CODE CHANGED (in bytes) %: 4.6
  CODE CHANGED (in lines) %: 6.7
  TIME ELAPSED s: 0.55
[HIPIFY] info: CONVERTED refs by type:
  error: 1
  device: 1
  memory: 9
  numeric_literal: 3
  kernel_launch: 1
[HIPIFY] info: CONVERTED refs by API:
  CUDA RT API: 15
[HIPIFY] info: CONVERTED refs by names:
  cudaDeviceSynchronize: 1
  cudaFree: 3
  cudaGetLastError: 1
  cudaLaunchKernel: 1
  cudaMalloc: 3
  cudaMemcpy: 3
  cudaMemcpyDeviceToHost: 1
  cudaMemcpyHostToDevice: 2
/tmp/cuda_helper.cu-7d4a31.hip:95:5: error: unknown type name 'SYSTEM_INFO'
    SYSTEM_INFO sys_info;
    ^
/tmp/cuda_helper.cu-7d4a31.hip:381:20: error: use of un

#### Building the ported code

If we examine the source tree we see that every source file that has been hipified was first copied to a backup file with the suffix `.prehip`. The converted code then overwrites the original file.

In [22]:
!ls -l temp_mat_mult

total 2376
-rw-rw-r-- 1 toby toby  262144 Sep 14 17:39 array_A.dat
-rw-rw-r-- 1 toby toby  262144 Sep 14 17:39 array_B.dat
-rw-rw-r-- 1 toby toby  262144 Sep 14 17:39 array_C.dat
-rw-rw-r-- 1 toby toby   23835 Sep 14 17:40 cuda_helper.cu
-rw-rw-r-- 1 toby toby   24629 Sep 14 17:39 cuda_helper.cu.prehip
-rw-rw-r-- 1 toby toby     273 Sep 14 17:39 Makefile
-rw-rw-r-- 1 toby toby    4497 Sep 14 17:39 mat_helper.hpp
-rw-rw-r-- 1 toby toby    4497 Sep 14 17:39 mat_helper.hpp.prehip
-rw-rw-r-- 1 toby toby    6090 Sep 14 17:39 mat_mult.cu
-rw-rw-r-- 1 toby toby    6060 Sep 14 17:39 mat_mult.cu.prehip
-rwxrwxr-x 1 toby toby 1545672 Sep 14 17:39 mat_mult.exe
-rw-rw-r-- 1 toby toby     107 Sep 14 17:39 mat_size.hpp
-rw-rw-r-- 1 toby toby     107 Sep 14 17:39 mat_size.hpp.prehip


Now try building the ported code with hipcc.

In [23]:
!cd temp_mat_mult; make clean; make CXX="hipcc"

rm -r *.exe
hipcc -g -O2  mat_mult.cu -o mat_mult.exe -lcuda


In the original file **cuda_mat_mult/cuda_helper.cu** we had overloaded the **h_errchk** function to accept error codes of both type **CUresult** and type **cudaError_t**. The conversion to HIP replaces both types with **hipError_t**, so the two overloads end up with identical signatures. Therefore we need to manually delete the duplicate **h_errchk** function.

In [24]:
!cd temp_mat_mult; ./mat_mult.exe

Device id: 0
	name:                                    NVIDIA GeForce RTX 3060 Laptop GPU
	global memory size:                      6226 MB
	available registers per block:           65536 
	maximum shared memory size per block:    49 KB
	maximum pitch size for memory copies:    2147 MB
	max block size:                          (1024,1024,64)
	max threads in a block:                  1024
	max Grid size:                           (2147483647,65535,65535)
Maximum error (infinity norm) is: 1.52588e-05


Now we have a successful port of the CUDA code to HIP!

## API differences between CUDA and HIP

CUDA has the notion of a driver API and a runtime API. HIP combines the two into one API and supports a subset of each.