# Porting CUDA programs to HIP

HIP API calls are designed to closely match their CUDA equivalents. This lets HIP act as a thin layer over CUDA on NVIDIA hardware and makes porting CUDA code to HIP reasonably easy. Often it is just a matter of replacing **cuda** with **hip** in the function calls. ROCm provides two tools, **hipify-perl** and **hipify-clang**, to help with the porting process. **hipify-perl** is robust and uses Perl to perform a pattern-based search and replace of CUDA calls with HIP calls, while **hipify-clang** uses the Clang compiler front end to parse the CUDA source and produce a higher-quality port. The Perl-based method is better for quick ports of small codes, while the Clang-based method is intended for large codebases. hipify-clang is much stricter, though, and fails easily unless it has access to all the header files used to compile the CUDA code.
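To make the search-and-replace idea concrete, here is a toy sketch in Python. It is not how hipify-perl is implemented; it only illustrates a table-driven rename. The name pairs are copied from the hipify statistics shown later in this module, while `toy_hipify` and the variables `A_d`, `A_h` and `nbytes` are hypothetical names chosen for illustration.

```python
import re

# A few CUDA -> HIP name pairs copied from the hipify statistics in this module.
# The real tools carry a far larger table; this subset is for illustration only.
CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaFree": "hipFree",
    "cudaDeviceProp": "hipDeviceProp_t",  # not a plain prefix swap
}

# Match whole identifiers only, so cudaMemcpy does not fire inside
# cudaMemcpyHostToDevice or inside a user-defined wrapper name.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, CUDA_TO_HIP)) + r")\b")


def toy_hipify(source: str) -> str:
    """Replace every known CUDA identifier in source with its HIP name."""
    return _PATTERN.sub(lambda m: CUDA_TO_HIP[m.group(1)], source)


# Hypothetical line of CUDA host code.
cuda_line = "cudaMemcpy(A_d, A_h, nbytes, cudaMemcpyHostToDevice);"
hip_line = toy_hipify(cuda_line)
print(hip_line)
```

The `cudaDeviceProp` entry shows why a lookup table is needed rather than a blind prefix swap: some HIP names differ from their CUDA counterparts by more than the prefix.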

## Setup and installation

[This source](https://sep5.readthedocs.io/en/latest/Programming_Guides/HIP-porting-guide.html) recommends attempting the port on a machine that has access to both the CUDA and HIP libraries. This usually means doing the port on a machine with an NVIDIA GPU. One can then port portions of the code at a time and compare results. For best results with hipify-clang, you need a version of CUDA that is compatible with your installed version of hipify-clang.
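One possible way to compare results between the CUDA build and a partially ported build is to diff their binary output files. The sketch below assumes the outputs are raw arrays of 32-bit floats, which is an assumption about the file layout and not something this module states. The helper names `load_floats` and `max_abs_diff` are hypothetical.

```python
from array import array
from pathlib import Path


def load_floats(path) -> array:
    """Read a raw binary file as 32-bit floats (an assumed layout)."""
    values = array("f")
    values.frombytes(Path(path).read_bytes())
    return values


def max_abs_diff(path_a, path_b) -> float:
    """Return the largest element-wise absolute difference between two files."""
    a, b = load_floats(path_a), load_floats(path_b)
    if len(a) != len(b):
        raise ValueError("outputs differ in length")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


# Hypothetical usage, assuming both builds write an array_C.dat file:
# print(max_abs_diff("cuda_mat_mult/array_C.dat", "temp_mat_mult/array_C.dat"))
```

A small non-zero difference is not necessarily a porting bug, since floating-point results can vary with the order of operations; choosing a tolerance is left to the reader.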

In [1]:
!hipify-clang --version

AMD LLVM version 16.0.0git
  Optimized build.


The following page describes compatibility between CUDA and hipify-clang versions.

[HIPIFY Documentation](https://rocm.docs.amd.com/projects/HIPIFY/en/latest/hipify-clang.html)

## Trial setup

There are two sub-directories in this module:

* cuda_mat_mult
* hip_mat_mult

The directory **cuda_mat_mult** contains a CUDA version of the HIP matrix multiplication code in **hip_mat_mult**. It was manually ported from HIP to CUDA. We are going to use the HIP tools to try to port the CUDA code back to HIP.

## Porting techniques

### Examine the code for porting potential

We use the scripts **hipexamine-perl.sh** or **hipexamine.sh** to recursively search a directory and assess how much of the code can be ported. Note the summary that is produced for each file.

In [15]:
!hipexamine-perl.sh cuda_mat_mult -exclude-dirs=cuda_mat_mult/.ipynb_checkpoints


[HIPIFY] info: file 'cuda_mat_mult/cuda_helper.cu' statistics:
  CONVERTED refs count: 55
  TOTAL lines of code: 789
[HIPIFY] info: CONVERTED refs by names:
  CUDA_SUCCESS => hipSuccess: 4
  CUresult => hipError_t: 4
  cuGetErrorString => hipDrvGetErrorString: 1
  cuInit => hipInit: 1
  cuda.h => hip/hip_runtime.h: 2
  cudaDevAttrManagedMemory => hipDeviceAttributeManagedMemory: 1
  cudaDeviceGetAttribute => hipDeviceGetAttribute: 1
  cudaDeviceProp => hipDeviceProp_t: 2
  cudaDeviceReset => hipDeviceReset: 1
  cudaDeviceSynchronize => hipDeviceSynchronize: 1
  cudaError_t => hipError_t: 4
  cudaEventCreate => hipEventCreate: 2
  cudaEventDestroy => hipEventDestroy: 2
  cudaEventElapsedTime => hipEventElapsedTime: 1
  cudaEventRecord => hipEventRecord: 3
  cudaEventSynchronize => hipEventSynchronize: 2
  cudaEvent_t => hipEvent_t: 3
  cudaGetDevice => hipGetDevice: 1
  cudaGetDeviceCount => hipGetDeviceCount: 2
  cudaGetDeviceProperties => hipGetDeviceProperties: 2
  cudaGetErrorStrin

In [16]:
!hipexamine.sh ./cuda_mat_mult 

error: unsupported architecture 'nvptx64' for host compilation
/tmp/cuda_helper.cu-059629.hip:95:5: error: unknown type name 'SYSTEM_INFO'
    SYSTEM_INFO sys_info;
    ^
/tmp/cuda_helper.cu-059629.hip:381:20: error: use of undeclared identifier '_aligned_malloc'; did you mean 'aligned_alloc'?
    void* buffer = _aligned_malloc(nbytes, alignment);
                   ^~~~~~~~~~~~~~~
                   aligned_alloc
/usr/include/stdlib.h:592:14: note: 'aligned_alloc' declared here
extern void *aligned_alloc (size_t __alignment, size_t __size)
             ^
/tmp/cuda_helper.cu-059629.hip:383:11: error: redefinition of 'buffer'
    void* buffer = aligned_alloc(alignment, nbytes);
          ^
/tmp/cuda_helper.cu-059629.hip:381:11: note: previous definition is here
    void* buffer = _aligned_m

### Porting in place

Both the **hipconvertinplace-perl.sh** and **hipconvertinplace.sh** scripts can convert a code tree in place. The additional option **-hip-kernel-execution-syntax** replaces CUDA triple-chevron kernel launches with equivalent calls to the **hipLaunchKernelGGL** macro.
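The shape of that rewrite can be sketched with a toy Python function. This is an illustration of the argument mapping only, not the tools' implementation, and their exact output may differ (for example in how the grid and block arguments are wrapped). The kernel name `mat_mult` and its arguments are hypothetical, and the pattern assumes a simple launch with no nested commas or template arguments.

```python
import re

# Toy pattern for a simple launch: name<<<grid, block[, shmem[, stream]]>>>(args);
# It assumes no nested commas or template arguments, which real code may have.
_LAUNCH = re.compile(r"(\w+)\s*<<<(.*?)>>>\s*\((.*?)\)\s*;")


def to_launch_macro(line: str) -> str:
    """Rewrite a triple-chevron launch as a hipLaunchKernelGGL call."""
    def rewrite(match):
        kernel, config, args = match.groups()
        config = [c.strip() for c in config.split(",")]
        # Shared-memory size and stream default to 0 when omitted in the chevrons.
        config += ["0"] * (4 - len(config))
        parts = [kernel] + config + ([args.strip()] if args.strip() else [])
        return "hipLaunchKernelGGL(" + ", ".join(parts) + ");"
    return _LAUNCH.sub(rewrite, line)


# Hypothetical kernel launch, for illustration only.
launch = to_launch_macro("mat_mult<<<grid_nblocks, block_size>>>(A_d, B_d, C_d, N);")
print(launch)
```

The macro takes the kernel first, then the grid size, block size, dynamic shared-memory size and stream, followed by the kernel arguments.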

In [38]:
!rm -rf temp_mat_mult; cp -r cuda_mat_mult temp_mat_mult 
!hipconvertinplace-perl.sh temp_mat_mult -exclude-dirs=temp_mat_mult/.ipynb_checkpoints -hip-kernel-execution-syntax


[HIPIFY] info: file 'temp_mat_mult/mat_mult.cu' statistics:
  CONVERTED refs count: 16
  TOTAL lines of code: 193
[HIPIFY] info: CONVERTED refs by names:
  cudaDeviceSynchronize => hipDeviceSynchronize: 1
  cudaFree => hipFree: 3
  cudaGetLastError => hipGetLastError: 1
  cudaLaunchKernel => hipLaunchKernel: 1
  cudaMalloc => hipMalloc: 3
  cudaMemcpy => hipMemcpy: 3
  cudaMemcpyDeviceToHost => hipMemcpyDeviceToHost: 1
  cudaMemcpyHostToDevice => hipMemcpyHostToDevice: 2

[HIPIFY] info: file 'temp_mat_mult/cuda_helper.cu' statistics:
  CONVERTED refs count: 56
  TOTAL lines of code: 789
[HIPIFY] info: CONVERTED refs by names:
  CUDA_SUCCESS => hipSuccess: 4
  CUresult => hipError_t: 4
  cuGetErrorString => hipDrvGetErrorString: 1
  cuInit => hipInit: 1
  cuda.h => hip/hip_runtime.h: 2
  cudaDevAttrManagedMemory => hipDeviceAttributeManagedMemory: 1
  cudaDeviceGetAttribute => hipDeviceGetAttribute: 1
  cudaDeviceProp => hipDeviceProp_t: 2
  cudaDeviceReset => hipDeviceReset: 1
  cudaD

If we examine the source tree, we see that every hipified source file was first copied to a file with the suffix `.prehip`. The converted code then overwrites the original file.

In [40]:
!ls -l temp_mat_mult

total 2380
-rw-rw-r-- 1 toby toby  262144 Sep 14 15:34 array_A.dat
-rw-rw-r-- 1 toby toby  262144 Sep 14 15:34 array_B.dat
-rw-rw-r-- 1 toby toby  262144 Sep 14 15:34 array_C.dat
-rw-rw-r-- 1 toby toby   24660 Sep 14 15:34 cuda_helper.cu
-rw-rw-r-- 1 toby toby   24629 Sep 14 15:34 cuda_helper.cu.prehip
-rw-rw-r-- 1 toby toby     273 Sep 14 15:34 Makefile
-rw-rw-r-- 1 toby toby    4497 Sep 14 15:34 mat_helper.hpp
-rw-rw-r-- 1 toby toby    4497 Sep 14 15:34 mat_helper.hpp.prehip
-rw-rw-r-- 1 toby toby    6090 Sep 14 15:34 mat_mult.cu
-rw-rw-r-- 1 toby toby    6060 Sep 14 15:34 mat_mult.cu.prehip
-rwxrwxr-x 1 toby toby 1545672 Sep 14 15:34 mat_mult.exe
-rw-rw-r-- 1 toby toby     107 Sep 14 15:34 mat_size.hpp
-rw-rw-r-- 1 toby toby     107 Sep 14 15:34 mat_size.hpp.prehip


We try copying the Makefile from hip_mat_mult to see if the conversion has worked.

In [41]:
!cd temp_mat_mult; make clean; make CXX="hipcc"

rm -r *.exe
hipcc -g -O2  mat_mult.cu -o mat_mult.exe -lcuda
cuda_helper.cu(54): error: function "h_errchk" has already been defined
  void h_errchk(hipError_t errcode, const char* message) {
       ^

1 error detected in the compilation of "mat_mult.cu".
make: *** [Makefile:16: mat_mult.exe] Error 2


In [37]:
!cd temp_mat_mult; ./mat_mult.exe

Device id: 0
	name:                                    NVIDIA GeForce RTX 3060 Laptop GPU
	global memory size:                      6226 MB
	available registers per block:           65536 
	maximum shared memory size per block:    49 KB
	maximum pitch size for memory copies:    2147 MB
	max block size:                          (1024,1024,64)
	max threads in a block:                  1024
	max Grid size:                           (2147483647,65535,65535)
Error, cuda runtime api call failed at mat_mult.cu:148, error string is: unknown error


In the original file **cuda_helper.cu** we had overloaded the **h_errchk** function to accept error codes of type **CUresult** and **cudaError_t**. The conversion to HIP replaces both types with **hipError_t**, so the two overloads now have identical signatures and the compiler reports a redefinition. Therefore we need to manually delete the duplicate **h_errchk** function.
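The collision can be reproduced on paper by applying the two type mappings to the function signatures. In the sketch below the mappings come from the hipify statistics in this module, and the converted signature matches the one in the compiler error. The two pre-conversion signatures are an assumed reconstruction, since the original source is not shown here.

```python
# The two type mappings reported by the hipify statistics in this module.
TYPE_MAP = {"CUresult": "hipError_t", "cudaError_t": "hipError_t"}

# Assumed shape of the two overloads in cuda_helper.cu before conversion.
cuda_signatures = [
    "void h_errchk(CUresult errcode, const char* message)",
    "void h_errchk(cudaError_t errcode, const char* message)",
]


def convert(signature: str) -> str:
    """Apply every CUDA -> HIP type rename to a function signature."""
    for cuda_type, hip_type in TYPE_MAP.items():
        signature = signature.replace(cuda_type, hip_type)
    return signature


hip_signatures = [convert(s) for s in cuda_signatures]
# Two distinct CUDA overloads collapse into one HIP signature.
duplicates = len(hip_signatures) - len(set(hip_signatures))
print(hip_signatures[0])
print("duplicate definitions:", duplicates)
```

Any pair of overloads that differ only in a driver API type and its runtime API counterpart is a candidate for the same problem after conversion.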

The available porting tools are **hipify-perl**, a Perl-based search and replace tool, and **hipify-clang**, a tool built on the Clang compiler front end. The tools and their wrapper scripts are:


* hipexamine.sh
* hipexamine-perl.sh
* hipconvertinplace.sh
* hipconvertinplace-perl.sh
* hipify-perl
* hipify-clang

### Learnings

* hipify-clang needs access to all the header files used to compile the CUDA code


### Porting with hipify-perl

### Porting with hipify-inplace

## API differences between CUDA and HIP

CUDA has the notion of a driver API and a runtime API.