# A complete Hipfort application

In the **Fortran Refresher** section we covered the essentials of the Fortran language: how to use subroutines, functions, pointers, and modules, as well as how to call C code from Fortran. If any of this is unfamiliar, it may be useful to review that section first.

Recall from the **GPU Computing Fundamentals** section that every accelerated application has the same basic design:

1. At program launch compute devices are discovered and initialized.
2. Memory spaces are allocated on the compute device.
3. Kernels are prepared.
4. Memory is copied from the host to the compute device.
5. Kernels are run to perform whatever compute operation is required.
6. The output from kernel runs is copied back from the compute device to the host. IO may then occur before the next iteration.
  
**Steps 4-6** are repeated as many times as necessary until the program is done. Then, at completion of the program:

7. Memory is deallocated.
8. Resources are released and the program exits.

## Tensor addition math

In this section we are going to walk through each of these steps as part of a complete example with Hipfort, using 2D tensor addition as the basic algorithm. For 2D tensors **A**, **B**, and **C**, each of size (M,N), the following relationship holds true at each index (i,j) in the tensors.

$$
A(i,j)+B(i,j)=C(i,j)
$$

In the prior **Fortran Refresher** section we used CPU code in Fortran and C to compute the answer $C(i)$ for 1D tensor addition. In this example we are going to use a HIP Kernel on the GPU to compute the answer $C(i,j)$ at every location in **C**.
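As a point of comparison for the GPU result, the relationship above can be written as a host-only reference in C++. This is a minimal sketch and not code from the example files: the name `tensoradd_reference` and the use of `std::vector` are assumptions made for illustration.

```C++
#include <cassert>
#include <cstddef>
#include <vector>

// Host-only reference: C = A + B for M x N tensors stored as flat arrays.
// Element-wise addition does not depend on the storage order, so a single
// loop over all M*N elements is enough.
std::vector<float> tensoradd_reference(
        const std::vector<float>& A,
        const std::vector<float>& B,
        std::size_t M,
        std::size_t N) {

    // Both inputs must hold exactly M*N elements
    assert(A.size() == M * N && B.size() == M * N);

    std::vector<float> C(M * N);
    for (std::size_t idx = 0; idx < M * N; ++idx) {
        C[idx] = A[idx] + B[idx];
    }
    return C;
}
```

A result computed this way on the host gives a benchmark answer that the output of the GPU kernel can be compared against.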


## Example applications

In HIP we need a way to get a handle on the memory allocations that are on the compute device. Hipfort can use either a C pointer (`type(c_ptr)`) or a Fortran `pointer` as a handle to the memory allocations on the GPU. The methods of working with each type are subtly different though. In the applications 

* [tensoradd_hip_cptr.f90](tensoradd_hip_cptr.f90)
* [tensoradd_hip_fptr.f90](tensoradd_hip_fptr.f90)

we use C pointers and Fortran pointers to perform 2D tensor addition. It will be helpful to have **both files open** at the same time for comparison.

## Use the Hipfort API

Access to all Hipfort functions is via the `hipfort` and `hipfort_check` modules. We bring those modules in along with others at the beginning of the program.

```Fortran
    ! HIP modules
    use hipfort
    use hipfort_check
```

## Check HIP API calls

Most Hipfort functions return a status code that we can check to make sure the call succeeded. If the status is **not checked**, a failed call can go unnoticed and the program will carry on after a **silent failure**. It is therefore **best practice** to **always** check the return value of HIP calls. The `hipfort_check` module defines a subroutine called `hipcheck` that we can wrap around a HIP API call. It checks the returned status and exits the program if there has been an error. For example, we wrap a `hipmalloc` call with `hipcheck` as follows:

```Fortran
call hipcheck(hipmalloc(A_d, M, N))
```

## Code validation

It is important to make sure that the output of the computation is accurate for every element in the output. A wrong answer can be computed very quickly, but it is of no use! The file [math_utils.f90](math_utils.f90) contains a function called `check_tensor_addition_2D` that iterates over every point in the output tensor $C(i,j)$ and checks that each point is within an error margin of $A(i,j)+B(i,j)$. The function has the following signature, where **A**, **B**, and **C** are arrays on the host:

```Fortran
function check_tensor_addition_2D(A, B, C, eps_mult) result(success)
            !! Function to check the outcome of tensor addition
            !! only check the host arrays

            real(kind=c_float), dimension(:,:), intent(in), pointer :: A, B, C
        
            real, intent(in) :: eps_mult
                !! Epsilon multiplier, how many floating point spacings
                !! can the computed answer be from our benchmark answer
```

and we import it as the function `check` within the two programs

```Fortran
! Maths check
    use math_utils, only : check => check_tensor_addition_2D
```
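The body of `check_tensor_addition_2D` is not shown here. The following host-only C++ sketch shows one way such a check could work, assuming the tolerance is `eps_mult` times the floating point spacing at the expected value. The names `float_spacing` and `check_tensor_addition` are hypothetical and the logic is not taken from [math_utils.f90](math_utils.f90).

```C++
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Distance from |x| to the next larger representable float,
// comparable in spirit to Fortran's spacing() intrinsic
float float_spacing(float x) {
    float ax = std::fabs(x);
    return std::nextafter(ax, std::numeric_limits<float>::infinity()) - ax;
}

// Return true if every C[idx] lies within eps_mult spacings of A[idx] + B[idx]
bool check_tensor_addition(
        const std::vector<float>& A,
        const std::vector<float>& B,
        const std::vector<float>& C,
        float eps_mult) {

    // All three tensors must have the same number of elements
    assert(A.size() == B.size() && B.size() == C.size());

    for (std::size_t idx = 0; idx < C.size(); ++idx) {
        float expected = A[idx] + B[idx];
        float tolerance = eps_mult * float_spacing(expected);
        if (std::fabs(C[idx] - expected) > tolerance) {
            return false;
        }
    }
    return true;
}
```

Scaling the tolerance by the spacing at the expected value keeps the check meaningful for both small and large numbers, which a single fixed absolute tolerance would not.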

## Fortran interface to kernel launch function

Hipfort doesn't yet have a way to launch kernels. However, passing pointers from Fortran to C/C++ functions is straightforward, and from C/C++ code we can launch kernels. The file [kernel_code.cpp](kernel_code.cpp) contains a function called `launch_kernel_hip` that does the job of launching the kernel. It has the following signature:

```C++
    void launch_kernel_hip(
            float_type* A, 
            float_type* B,
            float_type* C,
            int M,
            int N) {
```

and `float_type` is defined as `float` by a `typedef` earlier in the file.

```C++
typedef float float_type;
```

In order to call this function from Fortran we define an `interface` to the function within the programs of [tensoradd_hip_cptr.f90](tensoradd_hip_cptr.f90) and [tensoradd_hip_fptr.f90](tensoradd_hip_fptr.f90) as follows:

```Fortran
    interface
        ! A C function with void return type
        ! is regarded as a subroutine in Fortran 
        subroutine launch_kernel_hip(A, B, C, M, N) bind(C)
            use iso_c_binding
            ! Fortran passes arguments by reference as the default
            ! Arguments must have the "value" option present to pass by value
            ! Otherwise launch_kernel will receive pointers of type void**
            ! instead of void*
            type(c_ptr), intent(in), value :: A, B, C
            integer(c_int), intent(in), value :: M, N
        end subroutine
        
    end interface

```

Note the presence of the `value` attribute on the arguments. It makes Fortran pass each argument by **value** instead of by **reference** (the default). Without the `value` attribute the C function would receive the address of each variable instead of a copy of it. In the case of `launch_kernel_hip`, **A** would then arrive as a pointer to a pointer (`void**`) instead of a plain pointer (`void*`).
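Seen from the C++ side, the difference looks like this. The two functions below are hypothetical and exist only to illustrate the extra level of indirection. They are not part of [kernel_code.cpp](kernel_code.cpp).

```C++
#include <cassert>

// What the C side sees when the Fortran interface declares "value":
// a copy of the pointer and a copy of the integer
float first_element_by_value(float* A, int M) {
    assert(M > 0);
    return A[0];
}

// What the C side would see without "value": the address of each
// Fortran variable, so one extra level of indirection per argument
float first_element_by_reference(float** A, int* M) {
    assert(*M > 0);
    return (*A)[0];
}
```

If the Fortran interface and the C signature disagree about that level of indirection, the C code may well treat the address of a pointer as if it were the data itself, which typically ends in a crash or garbage output.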

## Select and manage a HIP device

Each HIP compute device has a resource manager called a `primary context` that keeps track of all the resources allocated on that device. Host threads share access to the primary contexts in a way that is (or at least is intended to be!) thread safe. This means that each host thread in an application is **free to choose** which device to use. Usually the HIP runtime is initialised (primary contexts are created) and a host thread is **connected** to the first available device (device 0) when that host thread makes its first call to a HIP function. For environments with multiple devices it is **good practice** to explicitly initialize the HIP API and be specific about which device the host thread should connect to.

The file [hip_utils.f90](hip_utils.f90) contains two subroutines, `init_gpu` and `reset_gpu`, that provide a way to choose a GPU and to reset (release all resources in) the selected device's primary context. The first statement after the variable declarations in [tensoradd_hip_cptr.f90](tensoradd_hip_cptr.f90) and [tensoradd_hip_fptr.f90](tensoradd_hip_fptr.f90) initializes HIP and chooses the GPU.

```Fortran
    ! Find and set the GPU device. Use device 0 by default
    call init_gpu(0)   
```

The argument to `init_gpu` is the index of the device that we'd like to use. Device indices start at 0, and in this instance we select the first available GPU, which has index 0.

Inside the subroutine `init_gpu` we initialize the HIP API using a call to `hipinit`.

```Fortran
call hipcheck(hipinit(0))
```

A call to `hipinit` only needs to be done once, so we have a variable `acquired` within the module to make sure of this. 

Next, we call `hipgetdevicecount` to poll the number of valid devices. If the desired index (the input argument to `init_gpu`) falls within the range of valid devices, then we call `hipsetdevice` to select that device. Any subsequent HIP calls from the host thread will then use the selected GPU.

```Fortran
 ! Get the number of compute devices
        call hipcheck(hipgetdevicecount(ndevices))
            
        if ((dev_id .ge. 0) .and. (dev_id .lt. ndevices)) then
            ! Choose a compute device
            call hipcheck(hipsetdevice(dev_id))
        else
            write(error_unit,*) 'Error, dev_id was not inside the range of available devices.'
            stop 1
        end if
```

The subroutine `reset_gpu` in [hip_utils.f90](hip_utils.f90) calls `hipdevicesynchronize` to make sure the selected GPU device is finished with all pending activity, then it calls `hipdevicereset` to release all resources in the primary context.

```Fortran
        ! Release all resources on the gpu
        if (acquired) then
            ! Make sure the GPU is finished
            ! with all pending activity
            call hipcheck(hipdevicesynchronize())

            ! Now free all resources on the primary context
            ! of the selected GPU
            call hipcheck(hipdevicereset())
        end if
```

It is **best practice** to reset the compute device at the end of the computation, but make sure that no other threads are using resources on that GPU when you do it!

## Memory on the device

### Standard data types on the host

Next, we allocate memory for the tensors on both the host and the compute device. Fortran compilers can change, with a compiler flag, how many bytes are used for the default `real` and `integer` types. HIP kernels need fixed data types, so when you allocate arrays that will interact with device allocations, it is **best practice** in Fortran code to use array data types whose number of bytes **does not change**. We use the `c_float` kind from the `iso_c_binding` module to make sure the host arrays have the data type that corresponds to `float` in C code.

```Fortran
real(kind=c_float), dimension(:,:), pointer :: A_h, B_h, C_h

! Allocate memory on host 
allocate(A_h(M,N), B_h(M, N), C_h(M,N))
```

### Variable naming convention

Notice the `_h` suffix on variable names. In this module we choose to put a `_h` suffix on memory allocations that reside on the host and a `_d` suffix for memory allocations that reside on the compute device. It is a variable naming convention that makes it easier to see what memory is allocated where.

### C pointers and Fortran pointers

Both C pointers (`type(c_ptr)`) and Fortran pointers can be used as handles to memory allocations on the compute device. C pointers are flexible but not very safe. Fortran pointers are also not very safe, but they additionally retain information on the shape, data type, and size of the allocation.

In [tensoradd_hip_cptr.f90](tensoradd_hip_cptr.f90) we use C pointers for memory allocations to tensors **A**, **B**, and **C** on the compute device

```Fortran
    ! C Pointers to memory allocations on the device
    type(c_ptr) :: A_d, B_d, C_d
```

and in [tensoradd_hip_fptr.f90](tensoradd_hip_fptr.f90) we use Fortran pointers.

```Fortran
    ! Fortran pointers to memory allocations on the device
    real(kind=c_float), dimension(:,:), pointer :: A_d, B_d, C_d
```

### Allocate device memory

The `hipmalloc` function allocates memory in the **global** memory space on the compute device. This memory is the largest (and slowest) memory on the compute device. Memory allocated with `hipmalloc` is accessible from every kernel that runs on the compute device but not from the host. 

When using `hipmalloc` with **C pointers** we need to specify how many bytes to reserve. The `sizeof` function (a compiler extension, available in gfortran for example, rather than standard Fortran) returns the number of bytes occupied by the array that a Fortran pointer points to. In [tensoradd_hip_cptr.f90](tensoradd_hip_cptr.f90) we use the number of bytes allocated for the host arrays as the size argument when allocating **A_d**, **B_d**, and **C_d**.

```Fortran
    ! Allocate tensors on the GPU
    call hipcheck(hipmalloc(A_d, sizeof(A_h)))
    call hipcheck(hipmalloc(B_d, sizeof(B_h)))
    call hipcheck(hipmalloc(C_d, sizeof(C_h)))
```

Fortran pointers need **elements** (not bytes) as the input argument for allocation with `hipmalloc`. In [tensoradd_hip_fptr.f90](tensoradd_hip_fptr.f90) we specify the size of the arrays to allocate in elements along each dimension.

```Fortran
    ! Allocate memory on the GPU
    call hipcheck(hipmalloc(A_d, M, N))
    call hipcheck(hipmalloc(B_d, M, N))
    call hipcheck(hipmalloc(C_d, M, N))
```

There are additional ways to allocate memory with Fortran pointers. For example, we could have used the specific function `hipmalloc_r4_2_c_size_t` to allocate the 2D arrays, with each element using 4 bytes and with integer variables of kind `c_size_t` specifying the dimensions.


```Fortran
    ! Could have also done this for the allocate instead
    call hipcheck(hipmalloc_r4_2_c_size_t(A_d, int(M_in, c_size_t), int(N_in, c_size_t)))
    call hipcheck(hipmalloc_r4_2_c_size_t(B_d, int(M_in, c_size_t), int(N_in, c_size_t)))
    call hipcheck(hipmalloc_r4_2_c_size_t(C_d, int(M_in, c_size_t), int(N_in, c_size_t)))
```
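The bytes-versus-elements distinction is an easy source of mistakes: the C pointer calls above take a size in bytes, while the Fortran pointer calls take element counts. The host-only C++ sketch below spells out the arithmetic. The function names are hypothetical, and the 4-byte assumption mirrors the `r4` in the Hipfort function name.

```C++
#include <cassert>
#include <cstddef>
#include <limits>

typedef float float_type;

// Compile-time guard: this sketch assumes 4-byte floats
static_assert(sizeof(float_type) == 4, "float_type must be 4 bytes");

// Number of elements in an M x N tensor,
// the quantity that size(A_h) reports in Fortran
std::size_t tensor_elements(std::size_t M, std::size_t N) {
    // Guard against overflow of the product M*N
    assert(N == 0 || M <= std::numeric_limits<std::size_t>::max() / N);
    return M * N;
}

// Number of bytes in an M x N tensor of float_type,
// the quantity needed when working with C pointers
std::size_t tensor_bytes(std::size_t M, std::size_t N) {
    return tensor_elements(M, N) * sizeof(float_type);
}
```

For a tensor of size (4,3), this gives 12 elements and 48 bytes. Passing one where the other is expected would either request four times too much or four times too little.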

### De-allocate device memory

When device memory is no longer needed, the `hipfree` function deallocates it. The same call works for both C and Fortran pointers.

```Fortran
    ! Free allocations on the GPU
    call hipcheck(hipfree(A_d))
    call hipcheck(hipfree(B_d))
    call hipcheck(hipfree(C_d))
```

It is **best practice** to make sure pointers are set to null when they no longer point to something. For Fortran pointers ([tensoradd_hip_fptr.f90](tensoradd_hip_fptr.f90)) we use the `nullify` statement

```Fortran
    ! It is best practice to nullify all pointers 
    ! once we are done with them 
    nullify(A_h, B_h, C_h, A_d, B_d, C_d)
```

and for C pointers ([tensoradd_hip_cptr.f90](tensoradd_hip_cptr.f90)) we set them to `c_null_ptr`.

```Fortran
    ! Set C pointers to null as well
    A_d = c_null_ptr
    B_d = c_null_ptr
    C_d = c_null_ptr
```

## Memory copies between host and device

Memory can be copied between host and device allocations, or between device allocations. After filling arrays **A_h** and **B_h** we proceed to copy them to the device allocations **A_d** and **B_d**.

### Copy from host to device

The `hipmemcpy` function can use either C pointers or Fortran pointers. Here is the code to copy from host to device using C pointers.

```Fortran
    ! Copy memory from the host to the device 
    call hipcheck(hipmemcpy(A_d, c_loc(A_h), sizeof(A_h), hipmemcpyhosttodevice))
    call hipcheck(hipmemcpy(B_d, c_loc(B_h), sizeof(B_h), hipmemcpyhosttodevice))
```

Each `hipmemcpy` call has an additional flag to specify the direction of the copy. There are five options available:

* `hipmemcpyhosttohost`
* `hipmemcpyhosttodevice`
* `hipmemcpydevicetohost`
* `hipmemcpydevicetodevice`
* `hipmemcpydefault`

The `hipmemcpydefault` option tries to infer the direction of transfer from the memory spaces of the input pointers. It makes the intent of the copy less obvious to the reader, however.

The `hipmemcpy` function also works with Fortran pointers, though the size to copy is then specified in **elements** instead of **bytes**! Notice the use of `size` instead of `sizeof` in the calls below.

```Fortran
    call hipcheck(hipmemcpy(A_d, A_h, size(A_h), hipmemcpyhosttodevice))
    call hipcheck(hipmemcpy(B_d, B_h, size(B_h), hipmemcpyhosttodevice))
```

In the case of Fortran pointers we could also have used `hipmemcpy` functions that are specific to the arrays in question. For example, we could have done this:

```Fortran
    ! Could also have done this for the copy instead
    !call hipcheck(hipmemcpy_r4_2_c_size_t(A_d, A_h, &
    !    int(size(A_h), c_size_t), hipmemcpyhosttodevice))
    !call hipcheck(hipmemcpy_r4_2_c_size_t(B_d, B_h, &
    !    int(size(B_h), c_size_t), hipmemcpyhosttodevice))
```

### Copy from device to host

After running the kernel, we copy **C_d** back to **C_h**, using either C pointers,

```Fortran
    ! Copy memory from the device to the host
    call hipcheck(hipmemcpy(c_loc(C_h), C_d, sizeof(C_h), hipmemcpydevicetohost))
```

or Fortran pointers

```Fortran
    ! Copy from the device result back to the host
    call hipcheck(hipmemcpy(C_h, C_d, size(C_d), hipmemcpydevicetohost))
```

## Kernel source and launch

### Call the kernel launch function

Since the Hipfort API doesn't have the functionality to define and launch kernels, we use the C function `launch_kernel_hip` to launch the kernel. Its input arguments are C pointers to **A**, **B**, and **C** on the device and the integers **M** and **N** for the array sizes. In [tensoradd_hip_cptr.f90](tensoradd_hip_cptr.f90) we can pass the pointers **A_d**, **B_d**, and **C_d** directly, while taking care to convert the integer arguments to the kind required by the function.

```Fortran
    ! Call the C function that launches the kernel
    call launch_kernel_hip( &
        A_d, &
        B_d, &
        C_d, &
        int(M, c_int), &
        int(N, c_int) &
    )
```

In [tensoradd_hip_fptr.f90](tensoradd_hip_fptr.f90) we must use `c_loc` to get the C pointer that underlies the Fortran pointers.

```Fortran
    ! Call the C function that launches the kernel
    call launch_kernel_hip( &
        c_loc(A_d), &
        c_loc(B_d), &
        c_loc(C_d), &
        int(M, c_int), &
        int(N, c_int) &
    )
```

### Kernel launch function

Let's examine the file [kernel_code.cpp](kernel_code.cpp). 

#### C linkage

The kernel launch function `launch_kernel_hip` is wrapped in an `extern "C"` code block to ensure the function is compiled with C linkage, meaning its name doesn't get mangled and is accessible from Fortran.

```C++
// C function to call the tensoradd_2D kernel
extern "C" {

    void launch_kernel_hip(
            float_type* A, 
            float_type* B,
            float_type* C,
            int M,
            int N) {
```

#### Role of the kernel launch function

From the **GPU Computing Fundamentals** section we have the following diagram of a Grid that is made up of blocks.

<figure style="margin: 1em; margin-left:auto; margin-right:auto; width:90%;">
    <img src="../images/Grid.svg">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">A Grid in the context of GPU computing. Grids are made up of Blocks and Blocks are made up of Threads</figcaption>
</figure>

It is the job of the kernel launch function to: 

* Pass arguments to the kernel 
* Determine the block size, (number of threads along each dimension of the block)
* Determine the grid size, (number of blocks along each dimension of the grid)
* Launch the kernel and examine launch errors
* Optionally synchronize the device

#### Define block size and grid size

The `dim3` structure (with fields `x`, `y`, `z`) is used to specify the block size and the number of blocks per dimension.

```C++       
        // Grid size
        dim3 global_size = {
            (uint32_t)(M), 
            (uint32_t)(N)
        }; 
        
        // Block size, 
        dim3 block_size = {8,8,1};
        
        // Number of blocks in each dimension
        dim3 nblocks = {
            global_size.x/block_size.x,
            global_size.y/block_size.y,
            1
        };
```

We must always have an integer number of blocks along the grid. Sometimes this means making a grid that is larger than we need. This is ok provided we build protection into the kernel so we don't run off the end of the arrays. 
    
```C++
        // Make sure there are enough blocks
        if (global_size.x % block_size.x) nblocks.x += 1;
        if (global_size.y % block_size.y) nblocks.y += 1;
        if (global_size.z % block_size.z) nblocks.z += 1;
```
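The rounding-up logic above can be isolated in a small helper. This is a minimal host-only sketch. The name `blocks_needed` is hypothetical and does not appear in [kernel_code.cpp](kernel_code.cpp).

```C++
#include <cassert>
#include <cstdint>

// Smallest number of blocks of width block_size that covers global_size
// (integer ceiling division)
std::uint32_t blocks_needed(std::uint32_t global_size, std::uint32_t block_size) {
    assert(block_size > 0);
    std::uint32_t nblocks = global_size / block_size;
    // Add one more block when the division leaves a remainder
    if (global_size % block_size) nblocks += 1;
    return nblocks;
}
```

For example, a dimension of length 10 with a block size of 8 needs 2 blocks. Those 2 blocks provide 16 threads, so 6 of them fall outside the array and must do nothing inside the kernel.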

#### Shared memory

HIP provides the ability to define a small amount of **shared memory** that is available to all threads in a block. This memory is fast and can be used as a small scratch space. We don't need shared memory for this example so we specify `0` as the number of bytes to allocate for shared memory.


```C++
        // Decide on the number of bytes to allocate for shared memory
        size_t sharedMemBytes = 0;
```

#### Kernel launch with hipLaunchKernelGGL

Finally, we get to launch the kernel itself. There are a few ways to do this. Here we use the **hipLaunchKernelGGL** macro to launch the kernel `tensoradd_2D` with the specified block and grid size along with the kernel arguments. A `stream` in HIP can be thought of as a work queue to which we submit work. We use stream 0, which is the default or null stream.

```C++
        // Launch the kernel
        hipLaunchKernelGGL(
                // Kernel name
                tensoradd_2D,
                // Number of blocks per dimension
                nblocks,
                // Number of threads along each dimension of the block
                block_size,
                // Number of bytes dynamically allocated for shared memory
                sharedMemBytes,
                // Stream to use (0 is the default or null stream)
                0,
                // Kernel arguments
                A, B, C,
                M, N);
```
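The source of the `tensoradd_2D` kernel is not reproduced in this section. To illustrate what a launch over a padded grid amounts to, the following host-only C++ sketch emulates the grid serially on the CPU. It assumes the kernel computes column-major offsets (to match Fortran array storage) and guards against indices outside the tensor. The names `tensoradd_2D_thread` and `emulate_launch` are hypothetical.

```C++
#include <cassert>
#include <cstddef>

// Stand-in for the body of a 2D addition kernel, executed by one "thread"
// at grid coordinates (i, j). Column-major offsets match Fortran storage.
void tensoradd_2D_thread(
        const float* A, const float* B, float* C,
        int M, int N, std::size_t i, std::size_t j) {
    // Guard: threads in the padded part of the grid do nothing
    if (i < (std::size_t)M && j < (std::size_t)N) {
        std::size_t offset = i + j * (std::size_t)M;
        C[offset] = A[offset] + B[offset];
    }
}

// Serial emulation of a launch with block_x * block_y threads per block.
// Returns the number of threads "launched", including those that the
// guard masks out.
std::size_t emulate_launch(
        const float* A, const float* B, float* C,
        int M, int N,
        std::size_t block_x, std::size_t block_y) {

    assert(M > 0 && N > 0 && block_x > 0 && block_y > 0);

    // Enough blocks to cover the tensor, rounding up
    std::size_t nblocks_x = ((std::size_t)M + block_x - 1) / block_x;
    std::size_t nblocks_y = ((std::size_t)N + block_y - 1) / block_y;

    std::size_t launched = 0;
    for (std::size_t by = 0; by < nblocks_y; ++by)
    for (std::size_t bx = 0; bx < nblocks_x; ++bx)
    for (std::size_t ty = 0; ty < block_y; ++ty)
    for (std::size_t tx = 0; tx < block_x; ++tx) {
        // The same arithmetic a HIP thread would do with
        // blockIdx, blockDim and threadIdx
        std::size_t i = bx * block_x + tx;
        std::size_t j = by * block_y + ty;
        tensoradd_2D_thread(A, B, C, M, N, i, j);
        ++launched;
    }
    return launched;
}
```

On a GPU the four loops disappear, because every (block, thread) combination runs as its own hardware thread. The guard inside the kernel body is what makes the padded grid safe.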

#### Kernel launch with CUDA syntax

One can also use the CUDA-like triple-chevron syntax to launch a kernel. This is not ANSI C++ compliant though, which isn't too much of a problem because only compilers (hipcc, nvcc) that understand triple chevrons will be used to compile this source file anyway.

```C++
        // The triple-chevron (non C++ compliant) way of launching kernels
        tensoradd_2D<<<nblocks, block_size, sharedMemBytes, 0>>>(A, B, C, M, N);
```

#### Check kernel launch

We use the `hipGetLastError` function to see if there were any problems arising from the kernel launch. The macro `HIPCHECK` is defined earlier in the file [kernel_code.cpp](kernel_code.cpp) and behaves similarly to the Fortran subroutine `hipcheck` from `hipfort_check`.

```C++
        // Make sure the kernel launch went ok
        HIPCHECK(hipGetLastError());
```
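The definition of `HIPCHECK` is not shown in this section. The sketch below illustrates the general pattern of such a checking macro without depending on HIP. The type `mock_error_t` is a hypothetical stand-in for `hipError_t`, and `MOCK_CHECK` is a hypothetical stand-in for `HIPCHECK`. Unlike the behaviour described above, this sketch throws an exception instead of exiting, so that the failure path can be exercised in a test.

```C++
#include <cassert>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for hipError_t so that the sketch builds without HIP
enum mock_error_t { mock_success = 0, mock_error_invalid_value = 1 };

// Check a status code and report where the failing call was made.
// A real checker would typically print the message and exit the program.
inline void mock_check(mock_error_t err, const char* file, int line) {
    assert(file != nullptr);
    if (err != mock_success) {
        std::ostringstream msg;
        msg << "Error code " << (int)err << " at " << file << ":" << line;
        throw std::runtime_error(msg.str());
    }
}

// Macro wrapper so that the file and line of the call site
// are recorded automatically
#define MOCK_CHECK(call) mock_check((call), __FILE__, __LINE__)
```

Using a macro rather than a plain function is what allows `__FILE__` and `__LINE__` to refer to the location of the failing call instead of the location of the checker.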

#### Synchronize the compute device

Finally, we use the `hipDeviceSynchronize` function to make sure that the kernel is finished before continuing. This step is not strictly necessary because the subsequent copy of **C_d** to **C_h** will use the same (null) stream and will block until the kernel is finished. 

```C++
        // Wait for the kernel to finish
        HIPCHECK(hipDeviceSynchronize());
    }
}
```

## Code validation

## Resource cleanup

## Use object oriented types for memory safety