# Exercise - building a layer cake

In this exercise the goal is to optimise an iterative code so that copies are performed concurrently with compute. The code [exercise_layer_cake.cpp](exercise_layer_cake.cpp) is a synchronous application in which, at each iteration **n**, a simple kernel called **fill_plane** fills an on-device allocation in **U_ds** with the value **n**.

After the kernel completes, **hipMemcpy3D** copies the plane to **out_h**, a 3D array of output planes. We skip copying plane 1 to work around a bug in the AMD implementation of **hipMemcpy3D**. The bug is present in ROCm 5.7.3 on Setonix and is known to be fixed in ROCm 6.0.2 and later.

If we build and run the application, we can see that it correctly copies all planes other than plane 1.

## Add the path

The following cell adds the directory containing the commands for building and running the exercise to the `PATH`.

In [1]:
import os
os.environ['PATH'] = f"{os.environ['PATH']}:../../install/bin"

# At a Bash terminal you need to do this instead
# source ../env

## Compile and run the exercise

The code [exercise_layer_cake.cpp](exercise_layer_cake.cpp) compiles and runs, but it is only a synchronous solution. More performance can be gained by implementing an asynchronous IO solution.

In [2]:
!build exercise_layer_cake.exe; exercise_layer_cake.exe

[ 50%] Built target hip_helper
[100%] Built target exercise_layer_cake.exe
Install the project...
-- Install configuration: "RELEASE"
Device id: 0
	name:                                    NVIDIA GeForce RTX 3060 Laptop GPU
	global memory size:                      6219 MB
	available registers per block:           65536 
	max threads per SM or CU:                1536 
	maximum shared memory size per block:    49 KB
	maximum shared memory size per SM or CU: 0 KB
	maximum pitch size for memory copies:    2147 MB
	max block size:                          (1024,1024,64)
	max threads in a block:                  1024
	max Grid size:                           (2147483647,65535,65535)
The synchronous calculation took 0.217000 milliseconds.
----
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
----
----
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  

## Your task

Using any or all of the IO synchronisation techniques in this lesson, your task is to modify the synchronous solution so that the copy of a filled plane to **out_h** can proceed concurrently with the kernel that fills another plane.

## A concurrent solution

One concurrent solution is located in the file [exercise_layer_cake_answers.cpp](exercise_layer_cake_answers.cpp). If you get stuck, you are welcome to use this solution as inspiration for your own. There are multiple ways (streams, events, or a combination of both) in which the necessary synchronisation can be implemented.
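
Whichever mechanism is chosen, the dependencies to respect are the same, and they can be prototyped on the host before touching the HIP API. The sketch below is a CPU-only analogy and is not the contents of the answers file: `std::async` stands in for a copy stream, each `std::future` stands in for an event recorded after a copy, and the names `layer_cake_sync` and `layer_cake_overlapped`, the default of two buffers, and the flat layout of `out_h` are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>

// Host stand-in for the fill_plane kernel: write the value n to every element.
void fill_plane(std::vector<float>& plane, float n) {
    std::fill(plane.begin(), plane.end(), n);
}

// Host stand-in for the device-to-host copy: store a plane in slot n of out_h.
void copy_plane(const std::vector<float>& plane, std::vector<float>& out_h,
                std::size_t n) {
    std::copy(plane.begin(), plane.end(), out_h.data() + n * plane.size());
}

// Synchronous baseline: each copy finishes before the next fill starts.
std::vector<float> layer_cake_sync(std::size_t nplanes, std::size_t plane_size) {
    std::vector<float> out_h(nplanes * plane_size, -1.0f);
    std::vector<float> U(plane_size);
    for (std::size_t n = 0; n < nplanes; ++n) {
        fill_plane(U, static_cast<float>(n));
        copy_plane(U, out_h, n);
    }
    return out_h;
}

// Overlapped version (assumed design): rotate through nbuffers planes so that
// the fill of one buffer can run while earlier buffers are still being copied.
std::vector<float> layer_cake_overlapped(std::size_t nplanes,
                                         std::size_t plane_size,
                                         std::size_t nbuffers = 2) {
    if (nbuffers == 0) nbuffers = 1;  // guard against modulo by zero
    std::vector<float> out_h(nplanes * plane_size, -1.0f);
    std::vector<std::vector<float>> U_ds(nbuffers,
                                         std::vector<float>(plane_size));
    // One "event" per buffer, signalled when the copy out of it has finished.
    std::vector<std::future<void>> copy_done(nbuffers);

    for (std::size_t n = 0; n < nplanes; ++n) {
        const std::size_t b = n % nbuffers;
        // Dependency 1: do not overwrite buffer b until its last copy is done.
        if (copy_done[b].valid()) copy_done[b].get();
        fill_plane(U_ds[b], static_cast<float>(n));
        // Dependency 2: the copy starts only after the fill above returned.
        // It runs on another thread, so the next fill can overlap with it.
        copy_done[b] = std::async(std::launch::async, [&U_ds, &out_h, b, n] {
            copy_plane(U_ds[b], out_h, n);
        });
    }
    // Final synchronisation: wait for every outstanding copy.
    for (auto& f : copy_done) {
        if (f.valid()) f.get();
    }
    return out_h;
}
```

Calling `layer_cake_overlapped(5, 16)` should return the same stack of planes as `layer_cake_sync(5, 16)`. In a HIP version the two waits would typically become event or stream synchronisation calls, and the copy would be issued asynchronously on its own stream; which calls to use is the point of the exercise.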

In [3]:
!build exercise_layer_cake_answers.exe; exercise_layer_cake_answers.exe

[ 50%] Built target hip_helper
[100%] Built target exercise_layer_cake_answers.exe
Install the project...
-- Install configuration: "RELEASE"
Device id: 0
	name:                                    NVIDIA GeForce RTX 3060 Laptop GPU
	global memory size:                      6219 MB
	available registers per block:           65536 
	max threads per SM or CU:                1536 
	maximum shared memory size per block:    49 KB
	maximum shared memory size per SM or CU: 0 KB
	maximum pitch size for memory copies:    2147 MB
	max block size:                          (1024,1024,64)
	max threads in a block:                  1024
	max Grid size:                           (2147483647,65535,65535)
The asynchronous calculation took 0.132000 milliseconds.
Layer - 0
:----
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
----
Layer - 1
:----
|  0.00e+00  0.00e+00  0.0

<address>
Written by Dr. Toby Potter of <a href="https://www.pelagos-consulting.com">Pelagos Consulting and Education</a> for the Pawsey Supercomputing Centre
</address>