# Exercise - Hadamard matrix multiplication timing!

In this exercise we use OpenCL event profiling to time the kernel execution for the Hadamard (elementwise) multiplication problem, where the values at coordinates (i0,i1) in matrices **D** and **E** are multiplied together to set the value at coordinates (i0,i1) in matrix **F**.

<figure style="margin-left:auto; margin-right:auto; width:80%;">
    <img style="vertical-align:middle" src="../images/elementwise_multiplication.svg">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Elementwise multiplication of matrices D and E to get F.</figcaption>
</figure>
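
In index notation the operation can be written as below. The symbols $N_0$ and $N_1$ are assumed names for the number of rows and columns; they are not taken from the source code.

$$
F_{i_0,i_1} = D_{i_0,i_1}\,E_{i_0,i_1}, \qquad 0 \le i_0 < N_0, \quad 0 \le i_1 < N_1
$$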

The source code is located in [mat_elementwise.cpp](mat_elementwise.cpp) and the kernel is in [kernels_elementwise.c](kernels_elementwise.c). Matrices **D** and **E** are created on the host and are filled with random numbers before upload to the compute device and computation of the solution. The steps are:

1. Parse program arguments
1. Discover resources and choose a compute device
1. Construct matrices **D_h** and **E_h** on the host and fill them with random numbers
1. Allocate memory for arrays **D_d**, **E_d**, and **F_d** on the compute device
1. Upload matrices **D_h** and **E_h** from the host to **D_d** and **E_d** on the device
1. Run the kernel to compute **F_d** from **D_d** and **E_d** on the device
1. Copy the buffer for matrix **F_d** on the device back to **F_h** on the host
1. Test the computed matrix **F_h** against a known answer
1. Write the contents of matrices **D_h**, **E_h**, and **F_h** to disk
1. Clean up memory allocations and release resources
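
The step that tests **F_h** against a known answer can be sketched on the host as follows. This is a minimal illustration, not the code used in [mat_elementwise.cpp](mat_elementwise.cpp): the function names `hadamard_reference` and `max_abs_error` are hypothetical, and the element type `float` and the flat array storage are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Host-side reference for the Hadamard (elementwise) product.
// Matrices are assumed to be stored as flat arrays of N0*N1 floats.
// Because the operation is elementwise, the memory layout does not matter
// as long as D, E, and F all use the same one.
std::vector<float> hadamard_reference(const std::vector<float>& D,
                                      const std::vector<float>& E) {
    assert(D.size() == E.size());
    std::vector<float> F(D.size());
    for (std::size_t i = 0; i < D.size(); ++i) {
        F[i] = D[i] * E[i];
    }
    return F;
}

// Largest absolute difference between a computed result and the reference.
float max_abs_error(const std::vector<float>& computed,
                    const std::vector<float>& reference) {
    assert(computed.size() == reference.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < computed.size(); ++i) {
        worst = std::max(worst, std::fabs(computed[i] - reference[i]));
    }
    return worst;
}
```

A caller would compute the reference from **D_h** and **E_h**, then compare `max_abs_error` of the downloaded **F_h** against a small tolerance chosen for the data type.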

Your task is to measure how long the kernel takes using the OpenCL event profiling interface.

## Run the answer

The code [mat_elementwise_answer.cpp](mat_elementwise_answer.cpp) uses OpenCL Events to time both the memory copies from **D_h** to **D_d** and **E_h** to **E_d** and the kernel execution.

!make clean; make mat_elementwise_answer.exe; ./mat_elementwise_answer.exe

rm -r *.exe
g++ -std=c++11 -g -O2 -fopenmp -I/usr/include -I../include -L/usr/lib/x86_64-linux-gnu mat_elementwise_answer.cpp\
	-o mat_elementwise_answer.exe -lOpenCL
In file included from mat_elementwise_answer.cpp:15:
../include/cl_helper.hpp: In function ‘_cl_command_queue** h_create_command_queues(_cl_device_id**, _cl_context**, cl_uint, cl_uint, cl_bool, cl_bool)’:
  337 |         command_queues[n] = clCreateCommandQueue(
      |                             ~~~~~~~~~~~~~~~~~~~~^
  338 |             contexts[n % num_devices],
      |             ~~~~~~~~~~~~~~~~~~~~~~~~~~
  339 |             devices[n % num_devices],
      |             ~~~~~~~~~~~~~~~~~~~~~~~~~
  340 |             queue_properties,
      |             ~~~~~~~~~~~~~~~~~
  341 |

Notice that timing information for the memory copies and the kernel execution has been printed.

## Run the problem

The code in [mat_elementwise.cpp](mat_elementwise.cpp) has not been instrumented for timing kernel and copy events.

!make mat_elementwise.exe; ./mat_elementwise.exe

g++ -std=c++11 -g -O2 -fopenmp -I/usr/include -I../include -L/usr/lib/x86_64-linux-gnu mat_elementwise.cpp\
	-o mat_elementwise.exe -lOpenCL
In file included from mat_elementwise.cpp:15:
../include/cl_helper.hpp: In function ‘_cl_command_queue** h_create_command_queues(_cl_device_id**, _cl_context**, cl_uint, cl_uint, cl_bool, cl_bool)’:
  337 |         command_queues[n] = clCreateCommandQueue(
      |                             ~~~~~~~~~~~~~~~~~~~~^
  338 |             contexts[n % num_devices],
      |             ~~~~~~~~~~~~~~~~~~~~~~~~~~
  339 |             devices[n % num_devices],
      |             ~~~~~~~~~~~~~~~~~~~~~~~~~
  340 |             queue_properties,
      |             ~~~~~~~~~~~~~~~~~
  341 |             &errcode
      |

As you can see, there currently is no way to tell how long the kernel ran for.

## Tasks

Your task is to time the kernel execution using OpenCL events and the command queue's ability to profile events.

* Set the **profiling** flag to **CL_TRUE** for the call to the helper function **h_create_command_queues**. This enables profiling in the command queues.
* You may use the helper function **h_get_event_time_ms** to print out the kernel execution time (in milliseconds).
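
For orientation, the arithmetic behind an event time in milliseconds can be sketched as below. OpenCL reports event timestamps as `cl_ulong` counters in nanoseconds, queried with `clGetEventProfilingInfo` on a queue that has profiling enabled. The function `elapsed_ms` is a hypothetical stand-in, not the implementation of **h_get_event_time_ms**, and it uses `std::uint64_t` in place of `cl_ulong` so that it compiles without the OpenCL headers.

```cpp
#include <cassert>
#include <cstdint>

// Convert a pair of OpenCL profiling timestamps to elapsed milliseconds.
// In a real program the two values would be obtained, once the event has
// completed (for example after clWaitForEvents), with
//   clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, ...)
//   clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, ...)
// Both are nanosecond counters; std::uint64_t stands in for cl_ulong here.
double elapsed_ms(std::uint64_t start_ns, std::uint64_t end_ns) {
    assert(end_ns >= start_ns);
    // 1 millisecond = 1e6 nanoseconds
    return static_cast<double>(end_ns - start_ns) / 1.0e6;
}
```

For example, timestamps of 1,000,000 ns and 3,500,000 ns correspond to an elapsed time of 2.5 ms.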

### Bonus tasks

* Use the helper function **h_get_event_time_ms** to measure the time and IO rate of the uploads and downloads to the compute device. Define an OpenCL Event of type **cl_event** to track IO events.
* Use any one of the profiling tools discussed to find out how long the kernel took.
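
The IO rate in the first bonus task follows from the number of bytes transferred and the measured event time. A minimal sketch, assuming the rate is reported in GB/s with 1 GB = 10^9 bytes; the function name `io_rate_gb_per_s` is hypothetical and not part of the helper library.

```cpp
#include <cassert>
#include <cstddef>

// Transfer rate in GB/s (1 GB = 1e9 bytes) for nbytes moved in
// elapsed_ms milliseconds.
// bytes / (ms * 1e-3 s/ms) / (1e9 bytes/GB) = bytes / (ms * 1e6)
double io_rate_gb_per_s(std::size_t nbytes, double elapsed_ms) {
    assert(elapsed_ms > 0.0);
    return static_cast<double>(nbytes) / (elapsed_ms * 1.0e6);
}
```

For example, uploading a 1000 x 1000 matrix of 4-byte values (4,000,000 bytes) in 2 ms corresponds to 2 GB/s.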

<address>
Written by Dr. Toby Potter of <a href="https://www.pelagos-consulting.com">Pelagos Consulting and Education</a> for the Pawsey Supercomputing Centre
</address>