# Exercise - Hadamard (elementwise) matrix multiplication

In this exercise we solidify our understanding of HIP applications with a sister example to matrix multiplication: Hadamard matrix multiplication. Hadamard multiplication is elementwise multiplication. The values in matrices **D** and **E** at coordinates (i0,i1) are multiplied together to give the value at coordinates (i0,i1) in matrix **F**.

<figure style="margin-left:auto; margin-right:auto; width:80%;">
    <img style="vertical-align:middle" src="../images/elementwise_multiplication.svg">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Elementwise multiplication of matrices D and E to get F.</figcaption>
</figure>

The source code is located in [mat_elementwise.cpp](mat_elementwise.cpp) and is similar to [mat_mult.cpp](mat_mult.cpp) in almost every aspect. The program performs the following steps:

1. Device discovery and selection.
1. Matrices **D_h** and **E_h** are allocated on the host and filled with random numbers.
1. Matrices **D_d** and **E_d** are allocated on the compute device.
1. Matrices **D_h** and **E_h** are uploaded to device allocations **D_d** and **E_d**.
1. The kernel **mat_elementwise** is run on the device to compute **F_d** from **D_d** and **E_d**.
1. **F_d** is copied to **F_h** and compared with the solution **F_answer_h** from sequential CPU code.
1. Memory and device cleanup.

The code is missing some elements:

* The source code in [mat_elementwise.cpp](mat_elementwise.cpp) is missing the HIP machinery to upload memory from arrays **D_h** and **E_h** on the host to arrays **D_d** and **E_d** on the compute device.

* In addition, the kernel source **mat_elementwise** is missing some code to perform the actual elementwise multiplication.

As a HIP developer your task is to fill in the necessary source to enable the program to work correctly.

## Constructing the inputs and solution

We compile and run the program as shown below. The code uses a small matrix of shape (8,4) and pretty-prints the matrices, so the exercise can also be run from the command line.

In [11]:
!make clean; make; ./mat_elementwise.exe

rm -r *.exe
hipcc -g -O2 -fopenmp -I/usr/include -I../include -L/usr/lib/x86_64-linux-gnu mat_mult.cpp\
	-o mat_mult.exe 
hipcc -g -O2 -fopenmp -I/usr/include -I../include -L/usr/lib/x86_64-linux-gnu mat_elementwise.cpp\
	-o mat_elementwise.exe 
hipcc -g -O2 -fopenmp -I/usr/include -I../include -L/usr/lib/x86_64-linux-gnu mat_elementwise_answer.cpp\
	-o mat_elementwise_answer.exe 
Device id: 0
	name:                                    
	global memory size:                      536 MB
	available registers per block:           65536 
	maximum shared memory size per block:    65 KB
	maximum pitch size for memory copies:    536 MB
	max block size:                          (1024,1024,1024)
	max threads in a block:                  1024
	max Grid size:                           (2147483647,2147483647,2147483647)
The output array F_h (as computed with HIP) is
----
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  2.37e-38  2.37e-38  9.55e-38  9.55e-38 

Because **D_h** and **E_h** are never uploaded from the host, the device allocations **D_d** and **E_d** just contain whatever values happened to be in memory at allocation. Furthermore, the kernel lacks the code to compute **F_d** from **D_d** and **E_d**. The output array is therefore full of errors, as seen above.

## The desired answer

The source code [mat_elementwise_answer.cpp](mat_elementwise_answer.cpp) contains the full solution. By all means take a peek at the source code if you get stuck. If we run the solution and check the result we get no residual anywhere in the matrix **F**.

In [9]:
!make; ./mat_elementwise_answer.exe

hipcc -g -O2 -fopenmp -I/usr/include -I../include -L/usr/lib/x86_64-linux-gnu mat_elementwise_answer.cpp\
	-o mat_elementwise_answer.exe 
Device id: 0
	name:                                    
	global memory size:                      536 MB
	available registers per block:           65536 
	maximum shared memory size per block:    65 KB
	maximum pitch size for memory copies:    536 MB
	max block size:                          (1024,1024,1024)
	max threads in a block:                  1024
	max Grid size:                           (2147483647,2147483647,2147483647)
The output array F_h (as computed with HIP) is
----
|  9.92e-03  5.07e-03  1.02e-03  2.81e-03 |
|  3.59e-01  4.77e-02  1.46e-01  4.86e-01 |
|  7.96e-02  2.32e-01  4.60e-02  3.45e-01 |
|  4.19e-01  3.30e-01  3.29e-03  3.37e-02 |
|  4.78e-01  3.64e-01  5.47e-02  1.33e-01 |
|  2.43e-01  9.09e-01  3.42e-01  2.07e-01 |
|  1.56e-01  1.04e-01  8.95e-02  4.61e-01 |
|  3.95e-01  1.40e-01  2.75e-01  4.09e-02 |
----
The CPU solution (F

## Tasks

In this set of tasks the aim is to solidify some of the understanding developed in the walkthrough of the code. We are going to read through the documentation of a function and implement some very simple kernel code.

1. In the source file [mat_elementwise.cpp](mat_elementwise.cpp) (line 85), re-enable the memory copy from **D_h** and **E_h** on the host to **D_d** and **E_d** using the function **hipMemcpy**. Have a look at the source code for [mat_mult.cpp](mat_mult.cpp) for inspiration.
1. Complete the kernel source code at line 37 of [mat_elementwise.cpp](mat_elementwise.cpp) so that the value in matrix **F** at coordinates (i0,i1) is the product of the corresponding values in matrices **D** and **E**.
    * Make sure you have a guard statement in place so that you don't overrun the bounds of F. See the source code in [mat_mult.cpp](mat_mult.cpp) for an example.
    * Use multi-dimensional indexing as shown in the <a href="../L2_Survival_C++/Lesson - Survival C++.ipynb">Survival C++</a> Lesson to index into arrays.
    * If you get stuck you can just use the kernel from the answer in [mat_elementwise_answer.cpp](mat_elementwise_answer.cpp).



In [2]:
!make; ./mat_elementwise.exe

make: Nothing to be done for 'all'.
Device id: 0
	name:                                    
	global memory size:                      536 MB
	available registers per block:           65536 
	maximum shared memory size per block:    65 KB
	maximum pitch size for memory copies:    536 MB
	max block size:                          (1024,1024,1024)
	max threads in a block:                  1024
	max Grid size:                           (2147483647,2147483647,2147483647)
The output array F_h (as computed with HIP) is
----
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00      -nan |
----
The CPU solution (F_answer_h) is 
----
|  2.73e-04  6.29e-01  2.80e-02  4.72e-02 |
|  2.33e-01  9.43e-02  2.41e-01  5.87e

<address>
Written by Dr. Toby Potter of <a href="https://www.pelagos-consulting.com">Pelagos Consulting and Education</a> for the Pawsey Supercomputing Centre
</address>