# Exercise - Hadamard (elementwise) matrix multiplication

In this exercise we are going to solidify our understanding of HIP applications using Hadamard matrix multiplication. Hadamard multiplication is elementwise multiplication. The values in matrices **D** and **E** at coordinates (i0,i1) are multiplied together to set the value at coordinates (i0,i1) in matrix **F**.

<figure style="margin-left:auto; margin-right:auto; width:80%;">
    <img style="vertical-align:middle" src="../../images/elementwise_multiplication.svg">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Elementwise multiplication of matrices D and E to get F.</figcaption>
</figure>
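
As a compact statement of the operation described above, the Hadamard product can be written as follows. The sizes $N_0$ and $N_1$ are labels introduced here for the number of rows and columns; they are not taken from the source code.

$$
F_{i_0,i_1} = D_{i_0,i_1}\,E_{i_0,i_1}, \qquad 0 \le i_0 < N_0, \quad 0 \le i_1 < N_1
$$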

The source code is located in [exercise_elementwise.cpp](exercise_elementwise.cpp) and is similar to the matrix multiplication example <a href="../../L3_Matrix_Multiplication/mat_mult.cpp">mat_mult.cpp</a> in almost every aspect. The steps are: 

1. Device discovery and selection.
1. Matrices **D_h** and **E_h** are allocated on the host and filled with random numbers.
1. Matrices **D_d**, **E_d**, and **F_d** are allocated on the compute device.
1. Matrices **D_h** and **E_h** are uploaded to device allocations **D_d** and **E_d**.
1. The kernel **mat_elementwise** is run on the device to compute **F_d** from **D_d** and **E_d**.
1. **F_d** is copied to **F_h** and compared with the solution **F_answer_h** from sequential CPU code.
1. Memory and device cleanup.
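
The sequential CPU reference mentioned in the steps above might look something like the sketch below. It is not the code from the exercise: the function name `hadamard_cpu`, the use of `float`, and the row-major layout (element (i0,i1) stored at offset `i0*N1 + i1`) are all assumptions made for illustration.

```C++
#include <cstddef>
#include <vector>

// Sequential Hadamard product on the host (hypothetical helper).
// Assumes row-major storage: element (i0, i1) lives at offset i0*N1 + i1.
std::vector<float> hadamard_cpu(const std::vector<float>& D,
                                const std::vector<float>& E,
                                std::size_t N0,
                                std::size_t N1) {
    std::vector<float> F(N0 * N1);
    for (std::size_t i0 = 0; i0 < N0; ++i0) {
        for (std::size_t i1 = 0; i1 < N1; ++i1) {
            std::size_t offset = i0 * N1 + i1;
            // Each output element depends only on the matching input elements
            F[offset] = D[offset] * E[offset];
        }
    }
    return F;
}
```

Because every output element is independent of the others, the two loops map naturally onto a 2D grid of GPU threads, with one thread per (i0,i1) pair.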

## Import the environment

The command below brings the `run` and `build` commands within reach of the Jupyter notebook.

In [5]:
import os
os.environ['PATH'] = f"{os.environ['PATH']}:../../install/bin"

# At a Bash terminal you need to do this instead from this directory
# source ../../env

## Compile and run the answer

We compile and run the solution [exercise_elementwise_answer.cpp](exercise_elementwise_answer.cpp) as shown below. We are using a really small matrix (8,4) so that the exercise may be done from the command line. 

In [7]:
!build exercise_elementwise.exe; run exercise_elementwise_answer.exe

[ 50%] Built target hip_helper
[100%] Built target exercise_elementwise.exe
Install the project...
-- Install configuration: "RELEASE"
Device id: 0
	name:                                    AMD Radeon VII
	global memory size:                      17163 MB
	available registers per block:           65536 
	max threads per SM or CU:                2560 
	maximum shared memory size per block:    65 KB
	maximum shared memory size per SM or CU: 65 KB
	maximum pitch size for memory copies:    2147 MB
	max block size:                          (1024,1024,1024)
	max threads in a block:                  1024
	max Grid size:                           (2147483647,65536,65536)
The output array F_h (as computed with HIP) is
----
|  4.50e-01  1.70e-01  2.77e-01  2.21e-02 |
|  2.46e-02  3.48e-02  4.41e-02  2.05e-01 |
|  7.57e-01  4.06e-03  3.90e-01  2.74e-01 |
|  3.16e-01  3.38e-05  9.45e-02  9.03e-01 |
|  1.60e-02  6.24e-03  9.69e-02  4.00e-01 |
|  4.89e-01  4.12e-01  8.46e-01  8.93e-02 |
|  

This is what we expect to see. The residual (last matrix shown) is the result of subtracting the HIP output (**F_h**) from the CPU output (**F_answer_h**) and is 0 everywhere.
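
One way to reduce the residual to a single number is sketched below. The helper `max_abs_residual` is hypothetical and is not part of the exercise code; it assumes both arrays hold `float` values and have the same length.

```C++
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest absolute entry of the residual (F_answer - F).
// Assumes F_answer and F have the same number of elements.
float max_abs_residual(const std::vector<float>& F_answer,
                       const std::vector<float>& F) {
    float worst = 0.0f;
    for (std::size_t i = 0; i < F_answer.size(); ++i) {
        worst = std::max(worst, std::fabs(F_answer[i] - F[i]));
    }
    return worst;
}
```

A return value of zero corresponds to a residual that is 0 everywhere. In practice a small tolerance may be more appropriate than an exact comparison, since floating point arithmetic on the host and the device is not guaranteed to round identically.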

## Compile and run the exercise

The file [exercise_elementwise.cpp](exercise_elementwise.cpp) is missing a number of crucial pieces of code. If we compile and run it we see a non-zero residual. To run it from a terminal instead, copy the command below to the command line (without the leading `!`).

In [9]:
!build exercise_elementwise.exe; run exercise_elementwise.exe

[ 50%] Built target hip_helper
[100%] Built target exercise_elementwise.exe
Install the project...
-- Install configuration: "RELEASE"
Device id: 0
	name:                                    AMD Radeon VII
	global memory size:                      17163 MB
	available registers per block:           65536 
	max threads per SM or CU:                2560 
	maximum shared memory size per block:    65 KB
	maximum shared memory size per SM or CU: 65 KB
	maximum pitch size for memory copies:    2147 MB
	max block size:                          (1024,1024,1024)
	max threads in a block:                  1024
	max Grid size:                           (2147483647,65536,65536)
The output array F_h (as computed with HIP) is
----
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  

As you can see, the output **F_h** is zero everywhere and the residual is non-zero, which means that the output is incorrect. The challenge is to edit [exercise_elementwise.cpp](exercise_elementwise.cpp) so that it produces the same result as the solution [exercise_elementwise_answer.cpp](exercise_elementwise_answer.cpp).

## Choose your own coding adventure!

As it stands the exercise [exercise_elementwise.cpp](exercise_elementwise.cpp) is missing a number of **crucial** steps. Each step is clearly marked out in the code using quadruple slash `////` comments.

1. Kernel code to perform the actual elementwise multiplication.
    * Make sure you have a guard statement in place so that you don't overrun the bounds of F. 
    * Use multi-dimensional indexing as shown in the Survival C++ Lesson to index into arrays.
    * If you get stuck you can just uncomment the include statement, similar to the following:
    <br><br>
    ```C++
    // Uncomment for the shortcut answer
    // #include "step1_kernel.cpp"
    ```
    <br>
1. Initialise HIP and set the compute device.
    * Call **hipInit** to initialise HIP.
    * Get the device count with **hipGetDeviceCount** and perform sanity checking on the device index.
    * Call **hipSetDevice** to set the compute device.
    * Always check the return code of API calls with **H_ERRCHK**.
1. Allocate memory for arrays **D_d**, **E_d** and **F_d** on the compute device.
    * Call **hipMalloc** to allocate memory.
1. Upload memory from arrays **D_h** and **E_h** on the host to **D_d** and **E_d** on the device.
    * Call **hipMemcpy** to copy memory from host to device.
1. Launch the kernel and wait for it to complete
    * Use the **hipLaunchKernelGGL** macro.
    * Use **hipDeviceSynchronize** to wait for the kernel.
1. Copy the solution **F_d** on the compute device back to **F_h** on the host.
    * Call **hipMemcpy** to copy memory from device to host.
1. Free memory for **D_d**, **E_d**, and **F_d** on the compute device.
    * Call **hipFree** to free up device memory.
1. Reset the compute device and destroy context
    * Call **hipDeviceSynchronize** to wait for devices.
    * Call **hipDeviceReset** to release the primary context on each device.

As a HIP developer your job is to provide source code to fill in the missing steps, using either the template code in <a href="../../L3_Matrix_Multiplication/mat_mult.cpp">mat_mult.cpp</a> or by peeking at the solution in [exercise_elementwise_answer.cpp](exercise_elementwise_answer.cpp). You can make these tasks as easy or as challenging as you wish. Each of the steps has a **shortcut solution** that you can access by uncommenting the code snippet that has the answer. For example see these lines for step 2 in [exercise_elementwise.cpp](exercise_elementwise.cpp).

```C++
    //// Step 2. Discover resources //// 
    //// Call hipInit to initialise HIP ////
    //// Call hipGetDeviceCount to fill num_devices ////
    //// Make sure dev_index is sane ////
    //// Call hipSetDevice to set the compute device ////

    // Uncomment for the shortcut answer
    // #include "step2_resources.hpp"
        
    //// End code: ////
```

If you get stuck, uncomment the **#include** statement to bring in the solution for that step and move on to another. For example, uncommenting this line brings in the solution for resource discovery:


```C++
    //// Step 2. Discover resources //// 
    //// Call hipInit to initialise HIP ////
    //// Call hipGetDeviceCount to fill num_devices ////
    //// Make sure dev_index is sane ////
    //// Call hipSetDevice to set the compute device ////

    // Uncomment for the shortcut answer
    #include "step2_resources.hpp"
        
    //// End code: ////
```

The goal is to become familiar with looking up and implementing **HIP** API calls as well as familiarising yourself with best practices. For your exercise you can choose to focus on as little as a **single task** or try to implement **all the tasks** yourself. It is up to you and how much you can accomplish within the time allotted. Some steps may depend on others, so if you skip a step just make sure you uncomment the include for the shortcut. Each time you make changes to the code you can run the following to test it out.

In [11]:
!build exercise_elementwise.exe; run exercise_elementwise.exe

[ 50%] Built target hip_helper
[100%] Built target exercise_elementwise.exe
Install the project...
-- Install configuration: "RELEASE"
Device id: 0
	name:                                    AMD Radeon VII
	global memory size:                      17163 MB
	available registers per block:           65536 
	max threads per SM or CU:                2560 
	maximum shared memory size per block:    65 KB
	maximum shared memory size per SM or CU: 65 KB
	maximum pitch size for memory copies:    2147 MB
	max block size:                          (1024,1024,1024)
	max threads in a block:                  1024
	max Grid size:                           (2147483647,65536,65536)
The output array F_h (as computed with HIP) is
----
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  

Have fun!

<address>
Written by Dr. Toby Potter of <a href="https://www.pelagos-consulting.com">Pelagos Consulting and Education</a> for the Pawsey Supercomputing Centre.
</address>