# Exercise - Hadamard (elementwise) matrix multiplication

In this exercise we are going to solidify our understanding of the OpenCL workflow using a sister example to matrix multiplication: Hadamard matrix multiplication. Hadamard multiplication is elementwise multiplication: the values in matrices **D** and **E** at coordinates (i0,i1) are multiplied together to set the value at coordinates (i0,i1) in matrix **F**.

<figure style="margin-left:auto; margin-right:auto; width:80%;">
    <img style="vertical-align:middle" src="../images/elementwise_multiplication.svg">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Elementwise multiplication of matrices D and E to get F.</figcaption>
</figure>
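
As a point of reference, the operation in the figure can be sketched as sequential C++. This is a minimal sketch and not the code from the exercise files: the function name `hadamard_cpu` is hypothetical, and it assumes the matrices are stored as flat row-major arrays of shape (N0, N1).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential Hadamard product F = D * E (elementwise) for matrices of
// shape (N0, N1). The function name and the row-major layout are
// assumptions made for this illustration.
std::vector<float> hadamard_cpu(const std::vector<float>& D,
                                const std::vector<float>& E,
                                std::size_t N0, std::size_t N1) {
    assert(D.size() == N0 * N1 && E.size() == N0 * N1);
    std::vector<float> F(N0 * N1);
    for (std::size_t i0 = 0; i0 < N0; ++i0) {
        for (std::size_t i1 = 0; i1 < N1; ++i1) {
            // Row-major layout: element (i0, i1) lives at i0*N1 + i1.
            const std::size_t offset = i0 * N1 + i1;
            F[offset] = D[offset] * E[offset];
        }
    }
    return F;
}
```

Each output element depends only on the two input elements at the same coordinates, which is why the operation maps naturally onto one work-item per element.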

The source code is located in [mat_elementwise.cpp](mat_elementwise.cpp) and is similar to [mat_mult.cpp](mat_mult.cpp) in almost every aspect. The program performs the following steps:

1. Device discovery and selection.
1. Command queues created.
1. Matrices **D_h** and **E_h** allocated on the host and filled with random numbers.
1. Matrices **D_d** and **E_d** allocated on the compute device.
1. Programs built and kernel arguments set.
1. Matrices **D_h** and **E_h** uploaded to device allocations **D_d** and **E_d**.
1. The kernel **mat_elementwise** is run on the device to compute **F_d** from **D_d** and **E_d**.
1. **F_d** is copied to **F_h** and compared with the solution **F_answer_h** from sequential CPU code.
1. Memory and device cleanup.

The code is missing some elements:

* The source code in [mat_elementwise.cpp](mat_elementwise.cpp) is missing the OpenCL machinery to upload memory from arrays **D_h** and **E_h** on the host to arrays **D_d** and **E_d** on the compute device.

* In addition, the source code for the kernel **mat_elementwise** in [kernels_elementwise.c](kernels_elementwise.c) is missing the code that performs the actual elementwise multiplication.

As an OpenCL developer your task is to fill in the necessary source to enable the program to work correctly.

## Constructing the inputs and solution

We compile and run the program as shown below. The program uses a small matrix of shape (8,4) and pretty-prints the matrices, so the exercise can be run entirely from the command line.

In [1]:
!make clean; make; ./mat_elementwise.exe

rm -r *.exe
g++ -std=c++11 -g -O2 -fopenmp -I/usr/include -I../include -L/usr/lib/x86_64-linux-gnu mat_mult.cpp\
	-o mat_mult.exe -lOpenCL
In file included from mat_mult.cpp:18:
../include/cl_helper.hpp: In function ‘_cl_command_queue** h_create_command_queues(_cl_device_id**, _cl_context**, cl_uint, cl_uint, cl_bool, cl_bool)’:
  337 |         command_queues[n] = clCreateCommandQueue(
      |                             ~~~~~~~~~~~~~~~~~~~~^
  338 |             contexts[n % num_devices],
      |             ~~~~~~~~~~~~~~~~~~~~~~~~~~
  339 |             devices[n % num_devices],
      |             ~~~~~~~~~~~~~~~~~~~~~~~~~
  340 |             queue_properties,
      |             ~~~~~~~~~~~~~~~~~
  341 |             &errcode
      |

Because matrices **D_h** and **E_h** are never uploaded to **D_d** and **E_d**, the device buffers contain whatever values happened to be in them at allocation. The kernel also lacks the code to compute **F_d** from **D_d** and **E_d**. The output array **F_d** is therefore incorrect and does not match the CPU solution.

## The desired answer

The source code [mat_elementwise_answer.cpp](mat_elementwise_answer.cpp) contains the full solution. By all means take a peek at the source code if you get stuck. If we run the solution and check the result we get no residual anywhere in the matrix **F**.

In [5]:
!make; ./mat_elementwise_answer.exe

make: Nothing to be done for 'all'.
	               name: NVIDIA GeForce RTX 3060 Laptop GPU 
	 global memory size: 6226 MB
	    max buffer size: 1556 MB
	     max local size: (1024,1024,64)
	     max work-items: 1024
The output array F_h (as computed with OpenCL) is
----
|  3.05e-02  5.43e-02  5.59e-02  2.57e-02 |
|  2.27e-03  3.22e-01  6.93e-01  6.88e-01 |
|  2.82e-03  2.59e-03  1.37e-01  2.94e-01 |
|  1.19e-01  2.73e-01  1.54e-04  6.40e-02 |
|  8.55e-02  1.83e-01  5.86e-01  7.41e-01 |
|  2.61e-02  3.98e-01  1.62e-01  2.39e-01 |
|  3.91e-02  3.85e-01  4.58e-01  8.27e-02 |
|  2.38e-01  3.10e-01  1.26e-01  1.16e-01 |
----
The CPU solution (F_answer_h) is 
----
|  3.05e-02  5.43e-02  5.59e-02  2.57e-02 |
|  2.27e-03  3.22e-01  6.93e-01  6.88e-01 |
|  2.82e-03  2.59e-03  1.37e-01  2.94e-01 |
|  1.19e-01  2.73e-01  1.54e-04  6.40e-02 |
|  8.55e-02  1.83e-01  5.86e-01  7.41e-01 |
|  2.61e-02  3.98e-01  1.62e-01  2.39e-01 |
|  3.91e-02  3.85e-01  4.58e-01  8.27e-02 |
|  2.38e-01  3.10e-01  
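
The check for "no residual" between **F_h** and **F_answer_h** can be sketched as follows. This is an illustration only: the helper name `max_abs_residual` is hypothetical and is not taken from the course's helper library, which may report the comparison differently.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest absolute difference between the OpenCL result and the CPU
// reference. The function name is an assumption for this sketch.
float max_abs_residual(const std::vector<float>& F_h,
                       const std::vector<float>& F_answer_h) {
    assert(F_h.size() == F_answer_h.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < F_h.size(); ++i) {
        const float r = std::fabs(F_h[i] - F_answer_h[i]);
        if (r > worst) {
            worst = r;
        }
    }
    return worst;
}
```

A value of zero means the two arrays agree at every element. In practice it is common to compare against a small tolerance rather than exactly zero, since device and host arithmetic can differ slightly.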

## Tasks

In this set of tasks the aim is to solidify some of the understanding developed in the walkthrough of the code. We are going to read through the documentation of a function and implement some very simple kernel code.

1. In the source file [mat_elementwise.cpp](mat_elementwise.cpp) (line 156), re-enable the memory copies from **D_h** to **D_d** and from **E_h** to **E_d** using the function [clEnqueueWriteBuffer](https://www.khronos.org/registry/OpenCL/sdk/3.0/docs/man/html/clEnqueueWriteBuffer.html). Read the [documentation](https://www.khronos.org/registry/OpenCL/sdk/3.0/docs/man/html/clEnqueueWriteBuffer.html) for that function and implement the copies. It may also be helpful to download the latest [OpenCL API specification](https://www.khronos.org/registry/OpenCL/specs/3.0-unified/pdf/OpenCL_API.pdf), which describes host functions such as this one, and find the function there.
1. Complete the kernel source code in [kernels_elementwise.c](kernels_elementwise.c) so that a new value in buffer F at coordinates (i0,i1) is constructed from the corresponding values in buffers D and E.
    * Make sure you have a guard statement in place so that you don't overrun the bounds of buffer F. See the source code in [kernels_mat_mult.c](kernels_mat_mult.c) for an example.
    * Use multi-dimensional indexing as shown in the <a href="../L2_Survival_C++/Lesson - Survival C++.ipynb">Survival C++</a> Lesson to index into arrays.
    * If you get stuck, you can use the kernel from the answer in [kernels_elementwise_answer.c](kernels_elementwise_answer.c). Change line 124 of [mat_elementwise.cpp](mat_elementwise.cpp) so that it reads the kernel from that source file.
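
To see why the guard statement matters, the sketch below emulates a kernel launch in plain C++, with one function call standing in for each work-item. It is a host-side illustration, not OpenCL kernel code. All names are hypothetical, the row-major layout is an assumption, and it uses elementwise addition rather than multiplication so that the exercise itself is left for you. The global size is padded up to a whole number of work-groups, so some work-items fall outside the matrix and must do nothing.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Round n up to the next multiple of m, as is commonly done when choosing
// a global size that is a whole number of work-groups.
std::size_t round_up(std::size_t n, std::size_t m) {
    return ((n + m - 1) / m) * m;
}

// What a single work-item at grid position (i0, i1) would do.
// Returns true if the work-item passed the guard and wrote to F.
bool work_item(std::size_t i0, std::size_t i1,
               std::size_t N0, std::size_t N1,
               const std::vector<float>& D,
               const std::vector<float>& E,
               std::vector<float>& F) {
    // Guard: work-items in the padded region must not touch memory.
    if (i0 >= N0 || i1 >= N1) {
        return false;
    }
    // Row-major layout (assumed): element (i0, i1) is at i0*N1 + i1.
    const std::size_t offset = i0 * N1 + i1;
    F[offset] = D[offset] + E[offset];  // addition, not the exercise answer
    return true;
}

// Emulate a launch over a (G0, G1) grid; returns how many work-items wrote.
std::size_t emulate_launch(std::size_t G0, std::size_t G1,
                           std::size_t N0, std::size_t N1,
                           const std::vector<float>& D,
                           const std::vector<float>& E,
                           std::vector<float>& F) {
    assert(D.size() == N0 * N1 && E.size() == N0 * N1 && F.size() == N0 * N1);
    std::size_t writes = 0;
    for (std::size_t i0 = 0; i0 < G0; ++i0) {
        for (std::size_t i1 = 0; i1 < G1; ++i1) {
            if (work_item(i0, i1, N0, N1, D, E, F)) {
                ++writes;
            }
        }
    }
    return writes;
}
```

For the (8,4) matrix used here and a hypothetical work-group size of (8,16), the padded grid is (8,16), which is 128 work-items, of which only 32 pass the guard. Without the guard, the other 96 would read and write beyond the end of the buffers.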

In [7]:
!make; ./mat_elementwise.exe

make: Nothing to be done for 'all'.
	               name: NVIDIA GeForce RTX 3060 Laptop GPU 
	 global memory size: 6226 MB
	    max buffer size: 1556 MB
	     max local size: (1024,1024,64)
	     max work-items: 1024
The output array F_h (as computed with OpenCL) is
----
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
----
The CPU solution (F_answer_h) is 
----
|  4.55e-01  6.62e-01  2.14e-01  5.35e-01 |
|  2.26e-01  2.06e-01  5.21e-01  3.82e-01 |
|  2.95e-02  5.09e-02  7.98e-01  7.77e-01 |
|  2.32e-01  1.04e-01  3.40e-01  2.14e-01 |
|  7.33e-01  2.11e-02  4.83e-01  8.52e-01 |
|  4.99e-01  5.21e-01  5.49e-02  3.71e-01 |
|  1.30e-01  1.03e-02  2.98e-03  3.74e-02 |
|  1.31e-01  5.39e-02  

<address>
Written by Dr. Toby Potter of <a href="https://www.pelagos-consulting.com">Pelagos Consulting and Education</a> for the Pawsey Supercomputing Centre
</address>