# Exercise: Fill the chessboard with a HIP kernel

To consolidate the topics covered in this module, it helps to fill in the missing components of a Hipfort program. Below is a standard 8x8 chessboard:

<figure style="margin: 1em; margin-left:auto; margin-right:auto; width:70%;">
    <img src="../images/Chess_board.svg">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">A chess board of size 8x8.</figcaption>
</figure>

You may have already completed the CPU version of the chessboard exercise. In this exercise the goal is to use a HIP kernel to fill the chessboard. 

## The exercise (TLDR version)

The Fortran source is in [chessboard_GPU.f90](chessboard_GPU.f90), and the C++ source containing the `fill_chessboard` kernel is in [kernel_code.cpp](kernel_code.cpp). Both source files have the basics already filled in. Your task is to insert the required Hipfort machinery to make all the pieces work. The steps required are:

0. Finish implementing the `fill_chessboard` kernel in [kernel_code.cpp](kernel_code.cpp).
1. Initialize the GPU in [chessboard_GPU.f90](chessboard_GPU.f90).
2. Allocate memory for the chessboard on the compute device at Fortran pointer `B_d`.
3. Launch the `fill_chessboard` kernel.
4. Copy memory from `B_d` on the compute device to `B_h` on the host.
5. Release memory for Fortran pointer `B_d`.
6. Reset the compute device.

## Choose your own adventure!

Each task may be skipped by uncommenting the `include` statement for the shortcut solution. For example, in the file [kernel_code.cpp](kernel_code.cpp), the shortcut solution may be included by uncommenting the following line of text:

```C++
    // Uncomment this for the shortcut solution to Step 0.
    //#include "step0_kernel.h"
```

In this way you can choose which parts of the exercise you want to complete, whether that is one part or all of them. The choice is yours! If you do use a shortcut, though, try to understand what the code in the `.h` file is doing.

## The exercise (step by step)

### Step 0 - Finish the kernel

Fill out the missing pieces of the `fill_chessboard` kernel in [kernel_code.cpp](kernel_code.cpp). You will need to: 

* Ensure a guard is in place to prevent the GPU running off the end of the array, if the grid happens to be larger than the chessboard.
* Use multidimensional indexing to compute an offset into `B` at coordinates (`i0`, `i1`). Remember that we are using column-major ordering to compute the index into the allocation `B`.
* Modulo arithmetic can help with the math. If we define `k` as an integer such that:

```C++
int k = (i0 + i1) % 2;
```
and `light` and `dark` contain floating point values for light and dark cells, then we can use this formula to compute the value inside a chessboard:

```C++
float_type scratch = ((k+1)%2)*light + (k%2)*dark;
```

### Step 1 - Initialize the compute device

In this step we need to initialize HIP and choose the GPU compute device. Open the source file [chessboard_GPU.f90](chessboard_GPU.f90) and look for the comment for `Step 1`. You may either use `init_device` from the `hip_utils` module or implement your own solution.

### Step 2 - Allocate device memory for the chessboard

Here we need to allocate memory for the chessboard and make it available through the Fortran pointer `B_d`. You can use `hipmalloc` for this call.

### Step 3 - Call the C function to launch the kernel

This step is similar to the example. In [kernel_code.cpp](kernel_code.cpp) there is a C function called `launch_kernel_hip` with the following signature:

```C++
    void launch_kernel_hip(
            float_type* B, 
            float_type light,
            float_type dark,
            int M,
            int N) {
```

Because this function is compiled with external C linkage, and we have an interface to it in [chessboard_GPU.f90](chessboard_GPU.f90), we can call it from Fortran. Use the `c_loc` function to get the C pointer from `B_d` and pass it to `launch_kernel_hip`.

### Step 4 - Copy the chessboard back from the compute device

In this step the task is to use `hipmemcpy` to copy the chessboard from `B_d` on the device to `B_h` on the host. You can use either the C or Fortran pointer methods for this task, keeping in mind that if using C pointers you need to specify **bytes** and if using Fortran pointers you need to specify **elements**.

### Step 5 - Free the device allocation

After the memory is copied, the buffer `B_d` on the compute device is no longer needed. You can use `hipfree` to release the memory allocation on the GPU.

### Step 6 - Reset the compute device

When the program finishes, best practice is to reset the compute device and release all resources that have been allocated. You can either use `hipdevicesynchronize` combined with `hipdevicereset` to reset the compute device's primary context, or use the `reset_device` subroutine from `hip_utils`.

## Compile and run the exercise

The code below compiles, installs, and runs the `chessboard_GPU` program. Until all the pieces are in place, the code will not produce meaningful output.

`!source ../../env; ../../install.sh; chessboard_GPU`

-- hip::amdhip64 is SHARED_LIBRARY
-- Configuring done
-- Generating done
-- Build files have been written to: /home/toby/Pelagos/Projects/Hipfort_Course/build
Scanning dependencies of target memcpy_bench
[  4%] Built target memcpy_bench
Scanning dependencies of target tensoradd_simple
[  8%] Built target tensoradd_simple
Scanning dependencies of target tensoradd_allocatable
[ 13%] Built target tensoradd_allocatable
Scanning dependencies of target tensoradd_pointer
[ 17%] Built target tensoradd_pointer
Scanning dependencies of target tensoradd_function
[ 21%] Built target tensoradd_function
Scanning dependencies of target tensoradd_module
Consolidate compiler generated dependencies of target tensoradd_module
[ 30%] Built target tensoradd_module
Scanning dependencies of target tensoradd_cfun
Consolidate compiler generated dependencies of target tensoradd_cfun
[ 39%] Built

## Compile and run the answer

The file [chessboard_answer.f90](chessboard_answer.f90) contains a simple solution to the problem. You're welcome to consult it for any help you might need.

`!source ../../env; ../../install.sh; chessboard_GPU_answer`

-- hip::amdhip64 is SHARED_LIBRARY
-- Configuring done
-- Generating done
-- Build files have been written to: /home/toby/Pelagos/Projects/Hipfort_Course/build
Scanning dependencies of target memcpy_bench
[  4%] Built target memcpy_bench
Scanning dependencies of target tensoradd_simple
[  8%] Built target tensoradd_simple
Scanning dependencies of target tensoradd_allocatable
[ 13%] Built target tensoradd_allocatable
Scanning dependencies of target tensoradd_pointer
[ 17%] Built target tensoradd_pointer
Scanning dependencies of target tensoradd_function
[ 21%] Built target tensoradd_function
Scanning dependencies of target tensoradd_module
Consolidate compiler generated dependencies of target tensoradd_module
[ 30%] Built target tensoradd_module
Scanning dependencies of target tensoradd_cfun
Consolidate compiler generated dependencies of target tensoradd_cfun
[ 39%] Built