# Exercise - Using rectangular copies for Hadamard matrix multiplication

Hadamard (elementwise) matrix multiplication takes the values at coordinates (i0,i1) in matrices **D** and **E** and multiplies them together to set the value at coordinates (i0,i1) in matrix **F**.

<figure style="margin-left:auto; margin-right:auto; width:80%;">
    <img style="vertical-align:middle" src="../images/elementwise_multiplication.svg">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Elementwise multiplication of matrices D and E to get F.</figcaption>
</figure>
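
As a point of reference, the elementwise product can be sketched as a sequential CPU loop. This is a minimal sketch that assumes row-major storage of single-precision values; the helper name `hadamard_cpu` is hypothetical and is not taken from the exercise source.

```cpp
#include <cstddef>
#include <vector>

// Sequential Hadamard (elementwise) product of two N0 x N1 matrices.
// Assumption: matrices are stored row-major in flat arrays of floats.
// hadamard_cpu is a hypothetical name used only for this illustration.
std::vector<float> hadamard_cpu(
        const std::vector<float>& D,
        const std::vector<float>& E,
        std::size_t N0,
        std::size_t N1) {

    std::vector<float> F(N0 * N1);

    for (std::size_t i0 = 0; i0 < N0; i0++) {
        for (std::size_t i1 = 0; i1 < N1; i1++) {
            // Row-major offset of coordinates (i0, i1)
            std::size_t offset = i0 * N1 + i1;
            F[offset] = D[offset] * E[offset];
        }
    }
    return F;
}
```

On a compute device each (i0,i1) pair would typically be handled by its own work-item rather than by a loop iteration.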

The steps are: 

1. Device discovery and selection.
1. Command queues created.
1. Matrices **D_h** and **E_h** allocated on the host and filled with random numbers.
1. Matrices **D_d** and **E_d** allocated on the compute device.
1. Programs built, kernels created and kernel arguments set.
1. Matrices **D_h** and **E_h** uploaded to device allocations **D_d** and **E_d**.
1. The kernel **mat_elementwise** is run on the device to compute **F_d** from **D_d** and **E_d**.
1. **F_d** is copied to **F_h** and compared with the solution **F_answer_h** from sequential CPU code.
1. Memory and device cleanup.
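
The comparison of **F_h** with **F_answer_h** in the steps above could be sketched as below. This is only an illustration: the helper name `max_abs_difference` is hypothetical, both arrays are assumed to have the same length, and the exercise code may check the result differently.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Largest absolute difference between the device result and the CPU answer.
// Assumption: F_h and F_answer_h hold the same number of elements.
// max_abs_difference is a hypothetical helper, not part of the course code.
float max_abs_difference(
        const std::vector<float>& F_h,
        const std::vector<float>& F_answer_h) {

    float max_diff = 0.0f;
    for (std::size_t i = 0; i < F_h.size(); i++) {
        float diff = std::fabs(F_h[i] - F_answer_h[i]);
        max_diff = std::fmax(max_diff, diff);
    }
    return max_diff;
}
```

A result of zero, or a value close to floating point round-off, suggests that the copy back to the host delivered the expected data.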

Using rectangular copies is an important skill to master, especially when you are decomposing your problem into sections that are to be handled by different devices. In this exercise we are going to enable the elementwise matrix multiplication code to use a **rectangular copy** to copy the memory allocation **F_d** back to the host (**F_h**). The source code to edit is located in [mat_elementwise.cpp](mat_elementwise.cpp) and the kernel is in [kernels_elementwise.c](kernels_elementwise.c). Your task is to make the necessary changes so that the copy back from **F_d** uses a **rectangular** copy ([clEnqueueReadBufferRect](https://www.khronos.org/registry/OpenCL/sdk/3.0/docs/man/html/clEnqueueReadBufferRect.html)) instead of the normal contiguous copy.
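
To make the idea of a rectangular copy concrete, the host-only sketch below emulates the addressing that the OpenCL specification describes for rectangular copies. The first element of each origin and of the region is measured in bytes, the remaining elements count rows and slices, and a pitch of zero falls back to a tightly packed layout. The function `copy_rect_emulated` is a hypothetical helper for illustration only; it does not call the OpenCL API and is not part of the exercise code.

```cpp
#include <cstddef>
#include <cstring>

// Host-only emulation of rectangular copy addressing (no OpenCL calls).
// origin[0] and region[0] are in bytes; origin[1], region[1] count rows;
// origin[2], region[2] count slices. Pitches are in bytes.
// copy_rect_emulated is a hypothetical name used only for this sketch.
void copy_rect_emulated(
        const void* src, void* dst,
        const std::size_t src_origin[3],
        const std::size_t dst_origin[3],
        const std::size_t region[3],
        std::size_t src_row_pitch, std::size_t src_slice_pitch,
        std::size_t dst_row_pitch, std::size_t dst_slice_pitch) {

    // A pitch of zero means "tightly packed", derived from the region
    if (src_row_pitch == 0) src_row_pitch = region[0];
    if (dst_row_pitch == 0) dst_row_pitch = region[0];
    if (src_slice_pitch == 0) src_slice_pitch = region[1] * src_row_pitch;
    if (dst_slice_pitch == 0) dst_slice_pitch = region[1] * dst_row_pitch;

    const unsigned char* s = static_cast<const unsigned char*>(src);
    unsigned char* d = static_cast<unsigned char*>(dst);

    for (std::size_t z = 0; z < region[2]; z++) {
        for (std::size_t y = 0; y < region[1]; y++) {
            // Byte offset of the start of this row in each allocation
            std::size_t s_off = (src_origin[2] + z) * src_slice_pitch
                              + (src_origin[1] + y) * src_row_pitch
                              + src_origin[0];
            std::size_t d_off = (dst_origin[2] + z) * dst_slice_pitch
                              + (dst_origin[1] + y) * dst_row_pitch
                              + dst_origin[0];
            // Each row of the region is a contiguous run of region[0] bytes
            std::memcpy(d + d_off, s + s_off, region[0]);
        }
    }
}
```

For a whole matrix of floats with N0 rows and N1 columns, a region of `{N1*sizeof(float), N0, 1}` with both origins at zero and row pitches of `N1*sizeof(float)` describes the same bytes as a contiguous copy, which is why the output of the exercise should not change when the copy is made rectangular.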

## Run the exercise code

As it stands the code produces the right answer, but it is using a standard contiguous copy to copy **F_d** back to **F_h**.

In [1]:
!make clean; make ./mat_elementwise.exe; ./mat_elementwise.exe

rm -rf *.exe
CC -g -fopenmp -O2 -I../include mat_elementwise.cpp -o mat_elementwise.exe -lOpenCL
	               name: gfx1035 
	     Device version: OpenCL 2.0  
	 global memory size: 536 MB
	    max buffer size: 456 MB
	     max local size: (1024,1024,1024)
	     max work-items: 256
The output array F_h (as computed with OpenCL) is
----
|  4.50e-01  1.70e-01  2.77e-01  2.21e-02 |
|  2.46e-02  3.48e-02  4.41e-02  2.05e-01 |
|  7.57e-01  4.06e-03  3.90e-01  2.74e-01 |
|  3.16e-01  3.38e-05  9.45e-02  9.03e-01 |
|  1.60e-02  6.24e-03  9.69e-02  4.00e-01 |
|  4.89e-01  4.12e-01  8.46e-01  8.93e-02 |
|  3.23e-01  3.19e-02  2.84e-01  4.18e-01 |
|  2.02e-02  3.38e-01  2.30e-01  1.49e-01 |
----
The CPU solution (F_answer_h) is 
----
|  4.50e-01  1.70e-01  2.77e-01  2.21e-02 |
|  2.46e-02  3.48e-02  4.41e-02  2.05e-01 |
|  7.57e-01  4.06e-03  3.90e-01  2.74e-01 |
|  3.16e-01  3.38e-05  9.45e-02  9.03e-01 |
|  1.60e-02  6.24e-03  9.69e-02  4.00e-01 |
|  4.89e-01  4.12e-01  8.46e-01  8.93e-02 |

## Tasks

1. Load up the documentation for [clEnqueueReadBufferRect](https://www.khronos.org/registry/OpenCL/sdk/3.0/docs/man/html/clEnqueueReadBufferRect.html).
1. In [mat_mult_local.cpp:190](mat_mult_local.cpp) there is an example of a rectangular copy using [clEnqueueWriteBufferRect](https://www.khronos.org/registry/OpenCL/sdk/3.0/docs/man/html/clEnqueueWriteBufferRect.html). Copy that code into [mat_elementwise.cpp](mat_elementwise.cpp) and modify it so that it reads from **F_d** into **F_h**.

### Answer

You can always look at the answer in [mat_elementwise_answer.cpp](mat_elementwise_answer.cpp) and run the code, but then try to understand why the solution works.

In [2]:
!make mat_elementwise_answer.exe; ./mat_elementwise_answer.exe

CC -g -fopenmp -O2 -I../include mat_elementwise_answer.cpp -o mat_elementwise_answer.exe -lOpenCL
	               name: gfx1035 
	     Device version: OpenCL 2.0  
	 global memory size: 536 MB
	    max buffer size: 456 MB
	     max local size: (1024,1024,1024)
	     max work-items: 256
The output array F_h (as computed with OpenCL) is
----
|  4.50e-01  1.70e-01  2.77e-01  2.21e-02 |
|  2.46e-02  3.48e-02  4.41e-02  2.05e-01 |
|  7.57e-01  4.06e-03  3.90e-01  2.74e-01 |
|  3.16e-01  3.38e-05  9.45e-02  9.03e-01 |
|  1.60e-02  6.24e-03  9.69e-02  4.00e-01 |
|  4.89e-01  4.12e-01  8.46e-01  8.93e-02 |
|  3.23e-01  3.19e-02  2.84e-01  4.18e-01 |
|  2.02e-02  3.38e-01  2.30e-01  1.49e-01 |
----
The CPU solution (F_answer_h) is 
----
|  4.50e-01  1.70e-01  2.77e-01  2.21e-02 |
|  2.46e-02  3.48e-02  4.41e-02  2.05e-01 |
|  7.57e-01  4.06e-03  3.90e-01  2.74e-01 |
|  3.16e-01  3.38e-05  9.45e-02  9.03e-01 |
|  1.60e-02  6.24e-03  9.69e-02  4.00e-01 |
|  4.89e-01  4.12e-01  8.46e-01  8.93e-02 |

<address>
Written by Dr. Toby Potter of <a href="https://www.pelagos-consulting.com">Pelagos Consulting and Education</a> for the <a href="https://pawsey.org.au">Pawsey Supercomputing Research Centre</a>. All trademarks mentioned are the property of their respective owners.
</address>