# Exercise - Using rectangular copies for Hadamard matrix multiplication

In Hadamard (elementwise) matrix multiplication, the values at coordinates (i0,i1) in matrices **D** and **E** are multiplied together to give the value at coordinates (i0,i1) in matrix **F**.

<figure style="margin-left:auto; margin-right:auto; width:80%;">
    <img style="vertical-align:middle" src="../../images/elementwise_multiplication.svg">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Elementwise multiplication of matrices D and E to get F.</figcaption>
</figure>
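As a point of reference, the elementwise product can be sketched in plain C++ on the host. This is only an illustration and not the code from the exercise: the function name `hadamard` is hypothetical, and the sketch assumes the matrices are stored in row-major order with N0 rows and N1 columns.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: elementwise (Hadamard) product of two N0 x N1
// matrices, F(i0,i1) = D(i0,i1) * E(i0,i1).
// Assumption: row-major storage, so i1 is the fastest-varying index.
std::vector<float> hadamard(const std::vector<float>& D,
                            const std::vector<float>& E,
                            std::size_t N0, std::size_t N1) {
    assert(D.size() == N0 * N1 && E.size() == N0 * N1);
    std::vector<float> F(N0 * N1);
    for (std::size_t i0 = 0; i0 < N0; ++i0) {
        for (std::size_t i1 = 0; i1 < N1; ++i1) {
            // Same offset in all three matrices
            std::size_t offset = i0 * N1 + i1;
            F[offset] = D[offset] * E[offset];
        }
    }
    return F;
}
```

On the device, each kernel invocation would typically handle a single (i0,i1) pair instead of looping over all of them.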

The steps are: 

1. Parse command line arguments.
1. Device discovery and selection.
1. Allocate matrices **D_h**, **E_h**, and **F_h** on the host. Fill matrices **D_h** and **E_h** with random numbers.
1. Allocate matrices **D_d**, **E_d** and **F_d** on the compute device.
1. Upload matrices **D_h** and **E_h** to device allocations **D_d** and **E_d**.
1. Run the kernel **mat_elementwise** on the device to compute **F_d** from **D_d** and **E_d**.
1. Copy **F_d** on the device to **F_h** on the host.
1. Compare **F_h** with the solution **F_answer_h** from sequential CPU code.
1. Write **D_h**, **E_h** and **F_h** to disk.
1. Release resources.

Using rectangular copies is an important skill to master, especially when you are decomposing your problem into sections that are to be handled by different devices. In this exercise we are going to enable the elementwise matrix multiplication code to use a **rectangular copy** to copy the memory allocation **F_d** back to the host (**F_h**). The source code to edit is located in [exercise_rectcopy.cpp](exercise_rectcopy.cpp).

The code works as it stands; however, it only uses a 1D memory copy through a call to **hipMemcpy**. For teaching purposes, your task is to modify this program so that it copies the device memory allocation **F_d** to **F_h** with the rectangular copy function **hipMemcpy3D**.
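To build intuition for what a rectangular copy does, here is a host-only sketch in plain C++. It is not the HIP API and does not touch the device; the function name `rect_copy` and its parameters are hypothetical. The idea it illustrates is that a rectangular copy moves a fixed number of bytes per row, while the source and destination may each have their own pitch (the distance in bytes between the starts of consecutive rows).

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical host-only stand-in for a 2D rectangular copy.
// Copies `height` rows of `width_bytes` bytes each. Rows in the source
// start `src_pitch` bytes apart; rows in the destination start
// `dst_pitch` bytes apart.
void rect_copy(void* dst, std::size_t dst_pitch,
               const void* src, std::size_t src_pitch,
               std::size_t width_bytes, std::size_t height) {
    // A row of the region cannot be wider than either allocation's pitch
    assert(width_bytes <= src_pitch && width_bytes <= dst_pitch);
    auto* d = static_cast<unsigned char*>(dst);
    const auto* s = static_cast<const unsigned char*>(src);
    for (std::size_t row = 0; row < height; ++row) {
        // One contiguous copy per row
        std::memcpy(d + row * dst_pitch, s + row * src_pitch, width_bytes);
    }
}
```

When the pitch of both allocations equals the row width, as is the case for compact allocations, the rectangular copy moves exactly the same bytes as a single contiguous copy.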

## Source the path

This command adds the tools for building and running the exercise to the path.

In [1]:
import os
os.environ['PATH'] = f"{os.environ['PATH']}:../../install/bin"

# At a Bash terminal you need to do this instead
# source ../env

## Run the exercise code

As it stands, the code produces the right answer, but it is using a standard contiguous copy to copy **F_d** back to **F_h**.

In [2]:
!build exercise_rectcopy.exe; exercise_rectcopy.exe

[ 50%] Built target hip_helper
[100%] Built target exercise_rectcopy.exe
Install the project...
-- Install configuration: "RELEASE"
Device id: 0
	name:                                    AMD Radeon VII
	global memory size:                      17163 MB
	available registers per block:           65536 
	max threads per SM or CU:                2560 
	maximum shared memory size per block:    65 KB
	maximum shared memory size per SM or CU: 65 KB
	maximum pitch size for memory copies:    2147 MB
	max block size:                          (1024,1024,1024)
	max threads in a block:                  1024
	max Grid size:                           (2147483647,65536,65536)
The output array F_h (as computed with HIP) is
----
|  4.50e-01  1.70e-01  2.77e-01  2.21e-02 |
|  2.46e-02  3.48e-02  4.41e-02  2.05e-01 |
|  7.57e-01  4.06e-03  3.90e-01  2.74e-01 |
|  3.16e-01  3.38e-05  9.45e-02  9.03e-01 |
|  1.60e-02  6.24e-03  9.69e-02  4.00e-01 |
|  4.89e-01  4.12e-01  8.46e-01  8.93e-02 |
|  3.2

## Tasks

1. Load up the documentation for [hipMemcpy3D](https://rocm.docs.amd.docm).
1. In <a href="../../L6_Memory_Management/mat_mult_pitched_mem.cpp">mat_mult_pitched_mem.cpp:197</a> there is an example of a rectangular copy using **hipMemcpy3D**. Copy that code into [exercise_rectcopy.cpp](exercise_rectcopy.cpp) and modify it from there.

### Bonus task

Experiment with the 3D copy. Can you change it so that it copies back only a rectangular region inside **F_d**, leaving a border 1 cell wide all the way around?
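As a starting point for the arithmetic, the sketch below works out the offsets and sizes of an interior region in plain C++. The names `Region` and `interior` are hypothetical and are not part of HIP. It assumes a row-major N0 x N1 matrix of floats, and it assumes that for plain (non-array) memory the x offset and the width of a 3D copy are given in bytes while the y offset and height are given in rows; please check that assumption against the **hipMemcpy3D** documentation.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical description of a rectangular region inside an N0 x N1
// row-major matrix of floats.
struct Region {
    std::size_t x_bytes;      // offset along a row, in bytes
    std::size_t y_rows;       // offset from the first row, in rows
    std::size_t width_bytes;  // bytes to copy per row
    std::size_t height_rows;  // number of rows to copy
};

// Region that skips `border` cells on every side of the matrix.
Region interior(std::size_t N0, std::size_t N1, std::size_t border) {
    // The matrix must be large enough to have an interior
    assert(N0 > 2 * border && N1 > 2 * border);
    return Region{border * sizeof(float),
                  border,
                  (N1 - 2 * border) * sizeof(float),
                  N0 - 2 * border};
}
```

For example, a 6 x 4 matrix with a border of 1 has an interior of 4 rows by 2 columns, starting one row down and one float across. The same offsets apply to both source and destination if the region should land in the matching place in **F_h**.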

### Answer

You can always look at the answer in [exercise_rectcopy_answer.cpp](exercise_rectcopy_answer.cpp) and run the code, but then try to understand why the solution works.

In [3]:
!build exercise_rectcopy_answer.exe; exercise_rectcopy_answer.exe

[ 50%] Built target hip_helper
[100%] Built target exercise_rectcopy_answer.exe
Install the project...
-- Install configuration: "RELEASE"
Device id: 0
	name:                                    AMD Radeon VII
	global memory size:                      17163 MB
	available registers per block:           65536 
	max threads per SM or CU:                2560 
	maximum shared memory size per block:    65 KB
	maximum shared memory size per SM or CU: 65 KB
	maximum pitch size for memory copies:    2147 MB
	max block size:                          (1024,1024,1024)
	max threads in a block:                  1024
	max Grid size:                           (2147483647,65536,65536)
The output array F_h (as computed with HIP) is
----
|  4.50e-01  1.70e-01  2.77e-01  2.21e-02 |
|  2.46e-02  3.48e-02  4.41e-02  2.05e-01 |
|  7.57e-01  4.06e-03  3.90e-01  2.74e-01 |
|  3.16e-01  3.38e-05  9.45e-02  9.03e-01 |
|  1.60e-02  6.24e-03  9.69e-02  4.00e-01 |
|  4.89e-01  4.12e-01  8.46e-01  8.93e-02 |

<address>
Written by Dr. Toby Potter of <a href="https://www.pelagos-consulting.com">Pelagos Consulting and Education</a> for the <a href="https://pawsey.org.au">Pawsey Supercomputing Research Centre</a>. All trademarks mentioned are the property of their respective owners.
</address>