# A complete Hipfort application

In the **Fortran Refresher** section we covered the essentials of the Fortran language: how to use `subroutines`, `functions`, `pointers`, and `modules`, as well as how to call C code from Fortran. If any of this is unfamiliar, it might be useful to review the material in that section first.

Recall from the **GPU Computing Fundamentals** section that every accelerated application has the same basic design:

1. At program launch compute devices are discovered and initialized.
2. Memory spaces are allocated on the compute device.
3. Kernels are prepared.
4. Memory is copied from the host to the compute device.
5. Kernels are run to perform whatever compute operation is required.
6. The output from kernel runs is copied back from the compute device to the host. IO may then occur before the next iteration.
  
**Steps 4-6** are repeated as many times as necessary until the program is done. Then, at completion of the program:

7. Memory is deallocated.
8. Resources are released and the program exits.

## Tensor addition

In this section we are going to walk through each of these steps as part of a complete example with Hipfort, using 2D tensor addition as the basic algorithm. For 2D tensors **A**, **B**, and **C**, each of size (M,N), the following relationship holds true at each index (i,j) in the tensors.

$$
A(i,j)+B(i,j)=C(i,j)
$$

In the prior **Fortran Refresher** section we used CPU code in Fortran and C to compute the answer $C(i)$ for 1D tensor addition. In this example we are going to use a HIP kernel on the GPU to compute the answer $C(i,j)$ at every location in **C**.
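As a point of reference, the per-element operation can be sketched on the CPU in plain Fortran. This is a minimal sketch and not code from the example applications: the module `tensoradd_cpu_sketch` and the subroutine `tensoradd_cpu` are hypothetical names chosen for illustration.

```Fortran
module tensoradd_cpu_sketch
    use iso_c_binding, only : c_float
    implicit none
contains
    subroutine tensoradd_cpu(A, B, C)
        !! CPU reference: C(i,j) = A(i,j) + B(i,j) at every index
        real(c_float), dimension(:,:), intent(in) :: A, B
        real(c_float), dimension(:,:), intent(out) :: C
        integer :: i, j

        ! The inner loop runs over the first index because
        ! Fortran stores arrays in column-major order
        do j = 1, size(C, 2)
            do i = 1, size(C, 1)
                C(i,j) = A(i,j) + B(i,j)
            end do
        end do
    end subroutine tensoradd_cpu
end module tensoradd_cpu_sketch
```

On the CPU the same result is available from the whole-array expression `C = A + B`. The explicit loop is written out because its body is the work that a GPU kernel typically assigns to one thread for one (i,j) pair.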


## Example applications

In HIP we need a way to get a handle on the memory allocations that are on the compute device. Hipfort can use either a C pointer (`type(c_ptr)`) or a Fortran `pointer` as a handle to the memory allocations on the GPU. The methods of working with each type are subtly different though. In the applications 

* [tensoradd_hip_cptr.f90](tensoradd_hip_cptr.f90)
* [tensoradd_hip_fptr.f90](tensoradd_hip_fptr.f90)

we use C pointers and Fortran pointers to perform 2D tensor addition. It will be helpful to have **both files open** at the same time for comparison.

## Use the Hipfort API

Access to all Hipfort functions is via the `hipfort` and `hipfort_check` modules. We bring those modules in along with others at the beginning of the program.

```Fortran
    ! HIP modules
    use hipfort
    use hipfort_check
```

## Checks on API calls

Hipfort functions usually return a **status code** that we can check to make sure everything worked. If these checks are **not performed**, a failed call can go unnoticed and the program will continue after a **silent failure**. It is therefore **best practice** to **always** check the return code from HIP calls. The `hipfort_check` module defines a subroutine called `hipcheck` that we can wrap around a HIP API call. It checks the return code and exits the program if there has been an error. For example, we wrap a `hipmalloc` call with `hipcheck` as follows:

```Fortran
call hipCheck(hipmalloc(A_d, M, N))
```
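Written out by hand, the check that the wrapper performs looks roughly like the sketch below. It assumes that the status code is compared against the `hipSuccess` constant exported by `hipfort` and that `A_d` is a 2D Fortran pointer. The module `hipcheck_sketch` and the subroutine `alloc_checked` are hypothetical names, and the real `hipcheck` may report the error differently.

```Fortran
module hipcheck_sketch
    use iso_c_binding, only : c_float, c_int
    use iso_fortran_env, only : error_unit
    use hipfort
    implicit none
contains
    subroutine alloc_checked(A_d, M, N)
        !! Allocate an (M,N) array on the device and stop on failure
        real(c_float), pointer, dimension(:,:), intent(inout) :: A_d
        integer(c_int), intent(in) :: M, N
        integer(kind(hipSuccess)) :: ierr

        ! Keep the status code instead of discarding it
        ierr = hipmalloc(A_d, M, N)

        ! Anything other than hipSuccess is treated as fatal
        if (ierr /= hipSuccess) then
            write(error_unit, *) 'hipmalloc failed, HIP error code = ', ierr
            error stop 1
        end if
    end subroutine alloc_checked
end module hipcheck_sketch
```

Wrapping the call in `hipcheck` gives the same behaviour in one line, which is why the example applications use it around every HIP call.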

## Validation check

It is important to make sure that the output of the computation is accurate for every element in the output. A wrong answer can be computed very quickly, but it is of no use! In the file [math_utils.f90](math_utils.f90) is a function called `check_tensor_addition_2D` that iterates over every point in the output tensor $C(i,j)$ and checks that each point is within an error margin of $A(i,j)+B(i,j)$. The function has the following signature, where **A**, **B**, and **C** are arrays on the host:

```Fortran
function check_tensor_addition_2D(A, B, C, eps_mult) result(success)
            !! Function to check the outcome of tensor addition
            !! only check the host arrays

            real(kind=c_float), dimension(:,:), intent(in), pointer :: A, B, C
        
            real, intent(in) :: eps_mult
                !! Epsilon multiplier, how many floating point spacings
                !! can the computed answer be from our benchmark answer
```

and we import it as the function `check` within the two programs:

```Fortran
    ! Maths check
    use math_utils, only : check => check_tensor_addition_2D
```

## Fortran interface to kernel launch function

Hipfort doesn't yet have a way to launch kernels. However, passing pointers from Fortran to C/C++ functions is straightforward, and from C/C++ code we can launch kernels. In the file [kernel_code.cpp](kernel_code.cpp) is a function called `launch_kernel_hip` that does the job of launching kernels. It has the following signature:

```C++
    void launch_kernel_hip(
            float_type* A, 
            float_type* B,
            float_type* C,
            int M,
            int N) {
```

where `float_type` is defined as `float` by a `typedef` earlier in the file:

```C++
typedef float float_type;
```

In order to call this function from Fortran we define an `interface` to the function within the programs of [tensoradd_hip_cptr.f90](tensoradd_hip_cptr.f90) and [tensoradd_hip_fptr.f90](tensoradd_hip_fptr.f90) as follows:

```Fortran
    interface
        ! A C function with void return type
        ! is regarded as a subroutine in Fortran 
        subroutine launch_kernel_hip(A, B, C, M, N) bind(C)
            use iso_c_binding
            ! Fortran passes arguments by reference as the default
            ! Arguments must have the "value" option present to pass by value
            ! Otherwise launch_kernel_hip will receive pointers of type void**
            ! instead of void*
            type(c_ptr), intent(in), value :: A, B, C
            integer(c_int), intent(in), value :: M, N
        end subroutine
        
    end interface

```

Note the presence of the `value` attribute on the input arguments. It makes Fortran pass the arguments by **value** instead of by **reference** (the default). Without the `value` attribute the C function would receive a reference (a pointer to each variable) instead of a copy of the variable. In the case of `launch_kernel_hip`, omitting `value` from the interface means that A would arrive as type `void**` instead of `void*`.
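The difference is easiest to see with a scalar argument. The sketch below is not part of the example applications: it defines two hypothetical C-interoperable subroutines in Fortran, `bump_by_value` and `bump_by_reference`, and the comments give the C prototype each one corresponds to.

```Fortran
module value_sketch
    use iso_c_binding, only : c_int
    implicit none
contains
    ! Equivalent C prototype: void bump_by_value(int n);
    subroutine bump_by_value(n) bind(C)
        integer(c_int), value :: n
        ! n is a private copy, the caller's variable is untouched
        n = n + 1
    end subroutine bump_by_value

    ! Equivalent C prototype: void bump_by_reference(int* n);
    subroutine bump_by_reference(n) bind(C)
        integer(c_int), intent(inout) :: n
        ! n refers to the caller's variable, so the change is visible
        n = n + 1
    end subroutine bump_by_reference
end module value_sketch
```

Calling `bump_by_value(m)` leaves `m` unchanged in the caller, while `bump_by_reference(m)` increments it. The same rule applies to `type(c_ptr)` arguments: with `value` the C side sees the address held in the pointer, and without it the C side sees the address of the pointer variable itself.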

## Select and manage a HIP device

Each GPU has a resource manager called a `primary context` that keeps track of all the resources allocated on that device. Host threads share access to the primary contexts in a way that is (or at least is intended to be!) thread safe. This means that each host thread in an application is **free to choose** which GPU to use. Usually the HIP runtime is initialized (primary contexts are created) and a host thread is **connected** to the first available GPU (device 0) whenever that host thread makes its first call to a HIP function. For environments where there are multiple GPUs it is **good practice** to be specific about which GPU you would like the host thread to connect to, and to explicitly initialize the HIP runtime.

In the file [hip_utils.f90](hip_utils.f90) are two subroutines, `init_gpu` and `reset_gpu`, that provide a way to choose a GPU and to reset (release all resources in) the selected device's primary context. The first statement after the variable declarations in [tensoradd_hip_cptr.f90](tensoradd_hip_cptr.f90) and [tensoradd_hip_fptr.f90](tensoradd_hip_fptr.f90) initializes HIP and chooses the GPU.

```Fortran
    ! Find and set the GPU device. Use device 0 by default
    call init_gpu(0)   
```

Inside the subroutine `init_gpu` we initialize the HIP API using a call to `hipinit`.

```Fortran
call hipcheck(hipinit(0))
```

A call to `hipinit` only needs to be done once, so we have a variable `acquired` within the module to make sure of this. 

Integers are used to select HIP devices, starting at 0 for the first available HIP device. After initialization, `init_gpu` calls `hipgetdevicecount` to poll the number of valid devices. If the desired index (the input argument to `init_gpu`) falls within the range of valid devices, then we call `hipsetdevice` to set the HIP device. Any subsequent HIP calls from that host thread will then use the selected GPU.

The subroutine `reset_gpu` in [hip_utils.f90](hip_utils.f90) calls `hipdevicereset` to release all resources in the primary context for the selected GPU. It is **best practice** to reset the compute device at the end of the computation, but make sure that no other threads are using that GPU's resources when you do it!
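The device selection and reset logic described above can be sketched as a small module. This is an illustration under stated assumptions and not the contents of `hip_utils.f90`: the names `hip_utils_sketch`, `init_gpu_sketch` and `reset_gpu_sketch` are hypothetical, and stopping the program on an invalid device index is an assumed choice.

```Fortran
module hip_utils_sketch
    use iso_c_binding, only : c_int
    use iso_fortran_env, only : error_unit
    use hipfort
    use hipfort_check
    implicit none

    ! Set once hipinit has been called (assumed role of "acquired")
    logical :: acquired = .false.

contains

    subroutine init_gpu_sketch(dev_id)
        integer(c_int), intent(in) :: dev_id
        integer(c_int) :: ndevices

        ! Initialize the HIP runtime only once
        if (.not. acquired) then
            call hipcheck(hipinit(0))
            acquired = .true.
        end if

        ! Poll the number of valid devices
        call hipcheck(hipgetdevicecount(ndevices))

        ! Valid device indices run from 0 to ndevices-1
        if (dev_id >= 0 .and. dev_id < ndevices) then
            call hipcheck(hipsetdevice(dev_id))
        else
            ! Assumption: an invalid index is treated as fatal
            write(error_unit, *) 'Invalid device index ', dev_id, &
                ', devices found: ', ndevices
            error stop 1
        end if
    end subroutine init_gpu_sketch

    subroutine reset_gpu_sketch()
        ! Release all resources in the primary context of the
        ! currently selected device
        call hipcheck(hipdevicereset())
    end subroutine reset_gpu_sketch

end module hip_utils_sketch
```

A program would call `init_gpu_sketch(0)` before any other HIP call and `reset_gpu_sketch()` as its last HIP-related statement, once all device memory has been freed.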

## Memory allocation and de-allocation

### C pointers and Fortran pointers

## Memory copies

## Kernel source and launch

## Resource cleanup

## Object oriented types for memory safety