# Exercise - building a layer cake

In this exercise the goal is to optimise an iterative code so that memory copies overlap with computation. The code [layer_cake.cpp](layer_cake.cpp) is a synchronous application where, at each iteration **n**, a simple kernel called **fill_plane** fills an on-device allocation in **U_ds** with the value **n**.

After the kernel completes, **hipMemcpy3D** is used to copy the plane to **out_h**, a 3D array of output planes. We skip the copy of plane 1 to work around a bug in the AMD implementation of **hipMemcpy3D**. Hopefully this bug will be fixed in a future release.

If we build and run the application, we can see that it correctly copies every plane except plane 1, which is printed as zeros.

In [1]:
import subprocess, os

In [2]:
!make
subprocess.run([os.path.join(os.getcwd(),"layer_cake.exe")])

make: Nothing to be done for 'all'.
Device id: 0
	name:                                    
	global memory size:                      536 MB
	available registers per block:           65536 
	maximum shared memory size per block:    65 KB
	maximum pitch size for memory copies:    402 MB
	max block size:                          (1024,1024,1024)
	max threads in a block:                  1024
	max Grid size:                           (2147483647,2147483647,2147483647)
The synchronous calculation took 0.523000 milliseconds.
----
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
----
----
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
----
----
|  2.00e+00  2.00e+00  2.00e+00  2.00e+00 |
|  2.00e+00  2.00e+00  2.00e+00  2.00e+00 |
|  2.00e+

CompletedProcess(args=['/home/toby/Pelagos/Projects/HIP_Course/course_material/L8_IO_Optimisation/layer_cake.exe'], returncode=0)

## Your task

Using any or all of the IO synchronisation techniques in this lesson, your task is to modify the synchronous solution so that copying a completed plane to **out_h** overlaps with the kernel that fills another plane.
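
To build intuition for what overlap can buy, the sketch below models the fill-then-copy schedule with an idealised timeline. It is not part of the course code: the function `pipeline_time` and its parameters (`n_planes`, `t_kernel`, `t_copy`, `n_buffers`) are hypothetical. It assumes that every plane costs a fixed kernel time and a fixed copy time, that at most one kernel and one copy run at any moment, and that a buffer cannot be refilled until its previous copy has finished.

```python
def pipeline_time(n_planes, t_kernel, t_copy, n_buffers):
    """Finish time of an idealised fill-then-copy pipeline (hypothetical model)."""
    if n_buffers < 1:
        raise ValueError("n_buffers must be at least 1")
    kernel_end = []  # time at which the kernel for plane n finishes
    copy_end = []    # time at which the copy of plane n finishes
    for n in range(n_planes):
        # The kernel for plane n waits for the previous kernel...
        ready = kernel_end[n - 1] if n >= 1 else 0.0
        # ...and for the copy that last used the same buffer.
        if n >= n_buffers:
            ready = max(ready, copy_end[n - n_buffers])
        kernel_end.append(ready + t_kernel)
        # The copy of plane n waits for its own kernel and the previous copy.
        prev_copy = copy_end[n - 1] if n >= 1 else 0.0
        copy_end.append(max(kernel_end[n], prev_copy) + t_copy)
    return copy_end[-1] if copy_end else 0.0


if __name__ == "__main__":
    # Arbitrary illustrative numbers, not measurements.
    serial = pipeline_time(8, 1.0, 1.0, n_buffers=1)
    overlapped = pipeline_time(8, 1.0, 1.0, n_buffers=2)
    print(f"one buffer: {serial}, two buffers: {overlapped}")
```

In this model a single buffer serialises every step (16 time units for 8 planes with unit costs), while two buffers let the copy of one plane hide behind the kernel for the next (9 time units). Real timings will differ, because launch and synchronisation overheads are ignored.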

## A concurrent solution

One concurrent solution is located in the file [layer_cake_answers.cpp](layer_cake_answers.cpp). If you get stuck, you are welcome to use this solution as inspiration for your own. There are, however, multiple ways (streams, events, or a combination of both) to implement the necessary synchronisation.

In [3]:
!make
subprocess.run([os.path.join(os.getcwd(),"layer_cake_answers.exe")])

make: Nothing to be done for 'all'.
Device id: 0
	name:                                    
	global memory size:                      536 MB
	available registers per block:           65536 
	maximum shared memory size per block:    65 KB
	maximum pitch size for memory copies:    402 MB
	max block size:                          (1024,1024,1024)
	max threads in a block:                  1024
	max Grid size:                           (2147483647,2147483647,2147483647)
The asynchronous calculation took 0.409000 milliseconds.
Layer - 0
:----
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
----
Layer - 1
:----
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
|  0.00e+00  0.00e+00  0.00e+00  0.00e+00 |
----
Layer - 2
:----
|  2.00e+00  2.00e+00  2.00e+00  2.00e+00 |
|  2.00e+00  2.00e+

CompletedProcess(args=['/home/toby/Pelagos/Projects/HIP_Course/course_material/L8_IO_Optimisation/layer_cake_answers.exe'], returncode=0)
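
The two runs above report 0.523000 milliseconds for the synchronous version and 0.409000 milliseconds for the asynchronous one. A minimal sketch for extracting and comparing such timings follows. The helper `elapsed_ms` is hypothetical, and it assumes the executables keep printing a line of the form `... calculation took <number> milliseconds.`

```python
import re


def elapsed_ms(output_text):
    """Return the first 'took <x> milliseconds' value in the text, or None."""
    match = re.search(r"took\s+([0-9]*\.?[0-9]+)\s+milliseconds", output_text)
    return float(match.group(1)) if match else None


# Timing lines copied from the two runs shown in this notebook.
sync_line = "The synchronous calculation took 0.523000 milliseconds."
async_line = "The asynchronous calculation took 0.409000 milliseconds."

t_sync = elapsed_ms(sync_line)
t_async = elapsed_ms(async_line)
speedup = t_sync / t_async
print(f"speedup: {speedup:.2f}x")
```

In the notebook, the text could instead come from `subprocess.run(..., capture_output=True, text=True).stdout`. For these two runs the ratio is roughly 1.28. Timings this short are likely to vary from run to run, so repeated measurements would give a more reliable comparison.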

<address>
Written by Dr. Toby Potter of <a href="https://www.pelagos-consulting.com">Pelagos Consulting and Education</a> for the Pawsey Supercomputing Centre
</address>