# Exercise - Hadamard (elementwise) matrix multiplication

In this exercise we solidify our understanding of the OpenCL workflow using a sister example to matrix multiplication: Hadamard multiplication. Hadamard multiplication is elementwise multiplication. The values in matrices **D** and **E** at coordinates (i0,i1) are multiplied together to set the value at coordinates (i0,i1) in matrix **F**, so all three matrices must have the same shape.

<figure style="margin-left:auto; margin-right:auto; width:80%;">
    <img style="vertical-align:middle" src="../images/elementwise_multiplication.svg">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Elementwise multiplication of matrices D and E to get F.</figcaption>
</figure>
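
As a minimal sketch of the operation itself, a sequential CPU version might look like the following. It assumes row-major storage with `N0_F` rows and `N1_F` columns. The function name `hadamard_cpu` and the size names are illustrative assumptions and are not taken from the course sources.

```C++
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the course sources):
// elementwise (Hadamard) product F = D * E.
// Matrices are assumed to be stored in row-major order
// with N0_F rows and N1_F columns.
std::vector<float> hadamard_cpu(
        const std::vector<float>& D,
        const std::vector<float>& E,
        std::size_t N0_F,
        std::size_t N1_F) {

    // All three matrices must have the same shape
    assert(D.size() == N0_F * N1_F);
    assert(E.size() == N0_F * N1_F);

    std::vector<float> F(N0_F * N1_F);
    for (std::size_t i0 = 0; i0 < N0_F; i0++) {
        for (std::size_t i1 = 0; i1 < N1_F; i1++) {
            // Position of element (i0, i1) in the flat allocation
            std::size_t offset = i0 * N1_F + i1;
            F[offset] = D[offset] * E[offset];
        }
    }
    return F;
}
```

Every element of **F** depends only on the matching elements of **D** and **E**, so each (i0,i1) pair can be handled independently, which is what makes the problem a natural fit for one work-item per element.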

The source code is located in [mat_elementwise.cpp](mat_elementwise.cpp) and is similar to the matrix multiplication example <a href="../L3_Matrix_Multiplication/mat_mult.cpp">mat_mult.cpp</a> in almost every aspect. The steps are: 

1. Device discovery and selection
1. Command queues created
1. Matrices **D_h** and **E_h** allocated on the host and filled with random numbers
1. Matrices **D_d** and **E_d** allocated on the compute device
1. Programs built, kernels created and kernel arguments set
1. Matrices **D_h** and **E_h** uploaded to device allocations **D_d** and **E_d**
1. The kernel **mat_elementwise** is run on the device to compute **F_d** from **D_d** and **E_d**
1. **F_d** is copied to **F_h** and compared with the solution **F_answer_h** from sequential CPU code
1. Memory and device cleanup

## Compile and run the answer

We compile and run the solution [mat_elementwise_answer.cpp](mat_elementwise_answer.cpp) as shown below. We use a very small matrix of size (8,4) so that the exercise may be done from the command line.

In [1]:
!make clean; make; ./mat_elementwise_answer.exe

rm -r *.exe
CC -g -fopenmp -O2 -I../include mat_elementwise.cpp -o mat_elementwise.exe -lOpenCL
CC -g -fopenmp -O2 -I../include mat_elementwise_answer.cpp -o mat_elementwise_answer.exe -lOpenCL
	               name: gfx1035 
	     Device version: OpenCL 2.0  
	 global memory size: 536 MB
	    max buffer size: 456 MB
	     max local size: (1024,1024,1024)
	     max work-items: 256
The output array F_h (as computed with OpenCL) is
----
|  4.50e-01  1.70e-01  2.77e-01  2.21e-02 |
|  2.46e-02  3.48e-02  4.41e-02  2.05e-01 |
|  7.57e-01  4.06e-03  3.90e-01  2.74e-01 |
|  3.16e-01  3.38e-05  9.45e-02  9.03e-01 |
|  1.60e-02  6.24e-03  9.69e-02  4.00e-01 |
|  4.89e-01  4.12e-01  8.46e-01  8.93e-02 |
|  3.23e-01  3.19e-02  2.84e-01  4.18e-01 |
|  2.02e-02  3.38e-01  2.30e-01  1.49e-01 |
----
The CPU solution (F_answer_h) is 
----
|  4.50e-01  1.70e-01  2.77e-01  2.21e-02 |
|  2.46e-02  3.48e-02  4.41e-02  2.05e-01 |
|  7.57e-01  4.06e-03  3.90e-01  2.74e-01 |
|  3.16e-01  3.38e-05  9.45e-02  9

This is what we expect to see. The residual (last matrix shown) is the result of subtracting the OpenCL output (**F_h**) from the CPU output and is 0 everywhere.
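
A residual check of this kind can be sketched as follows. The helper names `residual` and `max_abs_residual` are hypothetical and are not part of the course sources; the sketch only assumes that **F_h** and **F_answer_h** are flat arrays of the same length.

```C++
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: residual R = F_answer_h - F_h, element by element.
std::vector<float> residual(
        const std::vector<float>& F_answer_h,
        const std::vector<float>& F_h) {
    assert(F_answer_h.size() == F_h.size());
    std::vector<float> R(F_h.size());
    for (std::size_t i = 0; i < R.size(); i++) {
        R[i] = F_answer_h[i] - F_h[i];
    }
    return R;
}

// Hypothetical helper: largest absolute value in the residual.
// A value of 0 means the two arrays agree exactly.
float max_abs_residual(const std::vector<float>& R) {
    float max_abs = 0.0f;
    for (float r : R) {
        max_abs = std::max(max_abs, std::fabs(r));
    }
    return max_abs;
}
```

Reducing the residual to a single number makes it easy to test in a script. For kernels that do more arithmetic than a single multiplication, comparing against a small tolerance is usually more appropriate than expecting exactly zero.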

## Compile and run the exercise

The exercise [mat_elementwise.cpp](mat_elementwise.cpp) is missing a number of lines of code. If we compile and run it we see a non-zero residual. Copy this command to the command line (without the !) and run it as follows:

In [2]:
!make; ./mat_elementwise.exe

make: Nothing to be done for 'all'.
	               name: gfx1035 
	     Device version: OpenCL 2.0  
	 global memory size: 536 MB
	    max buffer size: 456 MB
	     max local size: (1024,1024,1024)
	     max work-items: 256
The output array F_h (as computed with OpenCL) is
----
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
----
The CPU solution (F_answer_h) is 
----
|  4.50e-01  1.70e-01  2.77e-01  2.21e-02 |
|  2.46e-02  3.48e-02  4.41e-02  2.05e-01 |
|  7.57e-01  4.06e-03  3.90e-01  2.74e-01 |
|  3.16e-01  3.38e-05  9.45e-02  9.03e-01 |
|  1.60e-02  6.24e-03  9.69e-02  4.00e-01 |
|  4.89e-01  4.12e-01  8.46e-01  8.93e-02 |
|  3.23e-01  3.19e-02  2.84e-01  4.18e-01 |
|  2.02e-02  3.3

As you can see, the residual is non-zero, which means that the output is incorrect. The challenge is to edit the sources [mat_elementwise.cpp](mat_elementwise.cpp) and [kernels_elementwise.c](kernels_elementwise.c) so that the program **mat_elementwise.exe** produces the same result as the solution in [mat_elementwise_answer.cpp](mat_elementwise_answer.cpp) and [kernels_elementwise_answer.c](kernels_elementwise_answer.c).

## Choose your own coding adventure!

As it stands the exercise [mat_elementwise.cpp](mat_elementwise.cpp) is missing a number of **crucial** steps. Each step is clearly marked out in the code using quadruple `////` comments. It may also be helpful to download the latest [OpenCL C specification](https://www.khronos.org/registry/OpenCL/specs/3.0-unified/pdf/OpenCL_C.pdf) and have it ready before you start.

1. Finish the kernel code to perform the actual elementwise multiplication in [kernels_elementwise.c](kernels_elementwise.c).
    * Make sure you have a guard statement in place so that you don't overrun the bounds of F. 
    * Use multi-dimensional indexing as shown in the Survival C++ Lesson to index into arrays.
    * If you get stuck you can just copy the kernel from the answer in [kernels_elementwise_answer.c](kernels_elementwise_answer.c).
1. Create OpenCL buffers for matrices **D_d**, **E_d**, and **F_d**.
    * Call **clCreateBuffer** to create the buffer.
    * Don't forget to check the result of API calls with H_ERRCHK.
1. Create the kernel and set kernel arguments.
    * Call **clCreateKernel** to create a kernel from **program**.
    * Use **clSetKernelArg** to set kernel arguments.
1. Upload memory from arrays **D_h** and **E_h** on the host to **D_d** and **E_d** on the device.
    * Call **clEnqueueWriteBuffer** to copy memory from host to device.
    * Pass the variable **command_queue** as the command queue argument.
1. Launch the kernel and wait for it to complete.
    * Use **clEnqueueNDRangeKernel** to enqueue the kernel.
    * Use **clWaitForEvents** to wait on the kernel event.
1. Copy the solution **F_d** on the compute device back to **F_h** on the host.
    * Call **clEnqueueReadBuffer** to copy memory from device to host.
1. Release the buffers for **D_d**, **E_d**, and **F_d** on the compute device.
    * Call **clReleaseMemObject** to release OpenCL buffers.
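
The guard statement and indexing from step 1 can be explored without a compute device. The sketch below emulates on the CPU what each work-item might do, assuming row-major storage and a global size that has been rounded up to a multiple of the local size. The names `work_item`, `round_up`, and `emulate_launch` are hypothetical and do not appear in the course sources. In a real OpenCL C kernel, `i0` and `i1` would come from calls to `get_global_id` instead of loop counters.

```C++
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU stand-in for the body of one work-item.
void work_item(
        const float* D, const float* E, float* F,
        std::size_t N0_F, std::size_t N1_F,
        std::size_t i0, std::size_t i1) {

    // Guard: work-items that fall in the padded part of the
    // global domain must not read or write anything.
    if ((i0 < N0_F) && (i1 < N1_F)) {
        // Multi-dimensional indexing into a flat, row-major allocation
        std::size_t offset = i0 * N1_F + i1;
        F[offset] = D[offset] * E[offset];
    }
}

// Round n up to the nearest multiple of local (local must be > 0).
std::size_t round_up(std::size_t n, std::size_t local) {
    assert(local > 0);
    return ((n + local - 1) / local) * local;
}

// Emulate a 2D launch by visiting every index in the padded global domain.
void emulate_launch(
        const std::vector<float>& D, const std::vector<float>& E,
        std::vector<float>& F,
        std::size_t N0_F, std::size_t N1_F,
        std::size_t local0, std::size_t local1) {

    std::size_t global0 = round_up(N0_F, local0);
    std::size_t global1 = round_up(N1_F, local1);

    for (std::size_t i0 = 0; i0 < global0; i0++) {
        for (std::size_t i1 = 0; i1 < global1; i1++) {
            work_item(D.data(), E.data(), F.data(), N0_F, N1_F, i0, i1);
        }
    }
}
```

With a (3,5) matrix and a local size of (2,4) the padded global size is (4,8), so more than half of the emulated work-items lie outside **F**. Without the guard, an index such as (0,5) would silently overwrite element (1,0), and indices in the last padded row would run past the end of the allocation.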

As an OpenCL developer your job is to provide source code to fill in the missing steps, using either the template code in <a href="../L3_Matrix_Multiplication/mat_mult.cpp">mat_mult.cpp</a> or by peeking at the solution in [mat_elementwise_answer.cpp](mat_elementwise_answer.cpp). You can make these tasks as easy or as challenging as you wish. Each of the steps has a **shortcut solution** that you can access by uncommenting the code snippet that has the answer. For example see these lines for step 2 in [mat_elementwise.cpp](mat_elementwise.cpp).

```C++
    //// Step 2. Use clCreateBuffer to allocate OpenCL buffers
    //// for arrays D_d, E_d, and F_d. ////

    // Uncomment for the shortcut answer
    // #include "step2_create_buffers.cpp"

    //// End code: ////
```

If you get stuck you can just uncomment the **#include** statement to bring in the solution for that step and move on to another. For example, uncommenting this line brings in the shortcut solution to create buffers:


```C++
    //// Step 2. Use clCreateBuffer to allocate OpenCL buffers
    //// for arrays D_d, E_d, and F_d. ////

    // Uncomment for the shortcut answer
    #include "step2_create_buffers.cpp"

    //// End code: ////
```

The goal is to become familiar with looking up and implementing **OpenCL** API calls as well as solidifying the understanding achieved over the past few modules. For your exercise you can choose to focus on as little as a **single task** or try to implement **all the tasks** yourself. It is up to you and how much you can accomplish within the time allotted. Some steps may depend on others, so if you skip a step just make sure you uncomment the shortcut. Each time you make changes to the code you can run the following to test it out.

In [4]:
!make; ./mat_elementwise.exe

make: Nothing to be done for 'all'.
	               name: gfx1035 
	     Device version: OpenCL 2.0  
	 global memory size: 536 MB
	    max buffer size: 456 MB
	     max local size: (1024,1024,1024)
	     max work-items: 256
The output array F_h (as computed with OpenCL) is
----
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
----
The CPU solution (F_answer_h) is 
----
|  4.50e-01  1.70e-01  2.77e-01  2.21e-02 |
|  2.46e-02  3.48e-02  4.41e-02  2.05e-01 |
|  7.57e-01  4.06e-03  3.90e-01  2.74e-01 |
|  3.16e-01  3.38e-05  9.45e-02  9.03e-01 |
|  1.60e-02  6.24e-03  9.69e-02  4.00e-01 |
|  4.89e-01  4.12e-01  8.46e-01  8.93e-02 |
|  3.23e-01  3.19e-02  2.84e-01  4.18e-01 |
|  2.02e-02  3.3

Have fun!

<address>
Written by Dr. Toby Potter of <a href="https://www.pelagos-consulting.com">Pelagos Consulting and Education</a> for the <a href="https://pawsey.org.au">Pawsey Supercomputing Research Centre</a>. All trademarks mentioned are the property of their respective owners.
</address>