# Exercise - Timing Hadamard matrix multiplication

In this exercise we are going to use HIP events to time the kernel execution for the Hadamard (elementwise) multiplication problem, where the values in matrices **D** and **E** at coordinates (i0,i1) are multiplied together to set the value at coordinates (i0,i1) in matrix **F**.

<figure style="margin-left:auto; margin-right:auto; width:80%;">
    <img style="vertical-align:middle" src="../../images/elementwise_multiplication.svg">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Elementwise multiplication of matrices D and E to get F.</figcaption>
</figure>
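As a point of comparison for the device kernel, the elementwise product can be sketched on the host in a few lines of plain C++. The function names `hadamard_host` and `max_abs_error`, the row-major flat layout, and the use of `float` are assumptions made for illustration; they are not taken from the exercise source. The error measure here is the largest absolute elementwise difference, which is one common reading of the "infinity norm" reported in the program output.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Host reference (hypothetical helper): F(i0,i1) = D(i0,i1) * E(i0,i1).
// Matrices are flat arrays with N0 rows and N1 columns in row-major order,
// which is an assumption of this sketch.
std::vector<float> hadamard_host(const std::vector<float>& D,
                                 const std::vector<float>& E,
                                 std::size_t N0, std::size_t N1) {
    assert(D.size() == N0 * N1 && E.size() == N0 * N1);
    std::vector<float> F(N0 * N1);
    for (std::size_t i0 = 0; i0 < N0; ++i0) {
        for (std::size_t i1 = 0; i1 < N1; ++i1) {
            // Offset of element (i0,i1) in the flat array
            std::size_t offset = i0 * N1 + i1;
            F[offset] = D[offset] * E[offset];
        }
    }
    return F;
}

// Largest absolute elementwise difference between two arrays of equal size
// (hypothetical helper, used to compare a computed answer with a reference).
float max_abs_error(const std::vector<float>& A, const std::vector<float>& B) {
    assert(A.size() == B.size());
    float err = 0.0f;
    for (std::size_t i = 0; i < A.size(); ++i) {
        err = std::max(err, std::fabs(A[i] - B[i]));
    }
    return err;
}
```

Calling `max_abs_error` on the host reference and on the result copied back from the device gives a single number that should be zero, or very close to it, when the kernel is correct.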

The source code is located in [exercise_profiling.cpp](exercise_profiling.cpp). Matrices **D** and **E** are created on the host and filled with random numbers, then uploaded to the compute device, where the solution is computed. The steps are:

1. Parse program arguments
1. Discover resources and choose a compute device
1. Construct matrices **D_h** and **E_h** on the host and fill them with random numbers
1. Allocate memory for arrays **D_d**, **E_d**, and **F_d** on the compute device
1. Upload matrices **D_h** and **E_h** from the host to **D_d** and **E_d** on the device
1. Run the kernel to compute **F_d** from **D_d** and **E_d** on the device
1. Copy the buffer for matrix **F_d** on the device back to **F_h** on the host
1. Test the computed matrix **F_h** against a known answer
1. Write the contents of matrices **D_h**, **E_h**, and **F_h** to disk
1. Clean up memory allocations and release resources

Your task is to measure how long it takes to execute the kernel by instrumenting the code with HIP events.

## Import the environment

The command below brings the `run` and `build` commands within reach of the Jupyter notebook.

In [1]:
import os
os.environ['PATH'] = f"{os.environ['PATH']}:../../install/bin"

# At a Bash terminal, run this from this directory instead:
# source ../../env

## Running the answer

The code [exercise_profiling_answer.cpp](exercise_profiling_answer.cpp) uses HIP events to time both the host-to-device memory copies (**D_h** to **D_d** and **E_h** to **E_d**) and the kernel execution.

In [3]:
!build exercise_profiling_answer.exe; run exercise_profiling_answer.exe

[ 66%] Built target hip_helper
[100%] Built target exercise_profiling_answer.exe
Install the project...
-- Install configuration: "RELEASE"
Device id: 0
	name:                                    AMD Radeon VII
	global memory size:                      17163 MB
	available registers per block:           65536 
	max threads per SM or CU:                2560 
	maximum shared memory size per block:    65 KB
	maximum shared memory size per SM or CU: 65 KB
	maximum pitch size for memory copies:    2147 MB
	max block size:                          (1024,1024,1024)
	max threads in a block:                  1024
	max Grid size:                           (2147483647,65536,65536)
Time for event "memcpy": 1.844 ms (2274.97 MB/s)
Time for event "mat_hadamard": 0.415 ms
Maximum error (infinity norm) is: 0


## Running the exercise

The code [exercise_profiling.cpp](exercise_profiling.cpp) contains the exercise. It computes the solution correctly; however, it lacks timing measurements.

In [4]:
!build exercise_profiling.exe; run exercise_profiling.exe

[ 50%] Built target hip_helper
[100%] Built target exercise_profiling.exe
Install the project...
-- Install configuration: "RELEASE"
Device id: 0
	name:                                    AMD Radeon VII
	global memory size:                      17163 MB
	available registers per block:           65536 
	max threads per SM or CU:                2560 
	maximum shared memory size per block:    65 KB
	maximum shared memory size per SM or CU: 65 KB
	maximum pitch size for memory copies:    2147 MB
	max block size:                          (1024,1024,1024)
	max threads in a block:                  1024
	max Grid size:                           (2147483647,65536,65536)
Maximum error (infinity norm) is: 0


As you can see, the output does not yet report how long the kernel ran.

## Tasks

Your task is to time the kernel execution using HIP events.

* Create and initialise two events of type **hipEvent_t**.
* Use **hipEventRecord** to insert the events into the default stream **just before** and **just after** the kernel.
* Call the helper function **h_get_event_time_ms** to print out the kernel execution time (in milliseconds).

### Bonus tasks

* Use the events and the helper function **h_get_event_time_ms** to measure the time and IO rate of the uploads and downloads to the compute device.
* Use **rocprof** (or nvprof) to get a trace of the application execution and view it with [Perfetto](https://ui.perfetto.dev/). From this you can determine the kernel runtime and the memory upload rate.
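
For the IO rate in the first bonus task, the arithmetic is simply bytes transferred divided by elapsed time. The sketch below shows one way to do the conversion once an elapsed time in milliseconds is available from a pair of events. The function name `transfer_rate_MBps` is hypothetical, and the choice of 1 MB = 10^6 bytes is an assumption; the course helper may use a different convention.

```cpp
#include <cassert>
#include <cstddef>

// Convert a byte count and an elapsed time in milliseconds to MB/s.
// Hypothetical helper; assumes 1 MB = 1e6 bytes.
double transfer_rate_MBps(std::size_t nbytes, double elapsed_ms) {
    assert(elapsed_ms > 0.0);
    // bytes per millisecond equals kilobytes per second,
    // so dividing by a further 1e3 gives megabytes per second.
    return static_cast<double>(nbytes) / (elapsed_ms * 1.0e3);
}
```

When timing several copies between the same pair of events, pass the total number of bytes moved by all of them, otherwise the reported rate will be too low.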

<address>
Written by Dr. Toby Potter of <a href="https://www.pelagos-consulting.com">Pelagos Consulting and Education</a> for the Pawsey Supercomputing Centre
</address>