# Optimising compute with concurrent IO in HIP

With many iterative processes there is a need to get information **off** the device at regular intervals. Up to this point we have been transferring data off the compute device **after** the kernel is finished. Furthermore, the routines to transport memory between device and host have thus far been used in a blocking manner, meaning the code running on the host *pauses* while the transfer occurs. Most compute devices have the ability to transfer data **while** kernels are running. This means IO transfers can take place during compute, and in some instances may **take place entirely** while the kernel runs. For the cost of additional programming complexity, significant compute savings can be obtained, as the following diagram illustrates:

<figure style="margin-bottom 3em; margin-top: 2em; margin-left:auto; margin-right:auto; width:100%">
    <img style="vertical-align:middle" src="../images/optimising_io.svg"> <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Figure: The difference between sequential and concurrent IO.</figcaption>
</figure>
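
The saving in the diagram can be estimated with a rough timing model. The sketch below is illustrative only: the function names and the per-step timings are hypothetical, and it assumes perfect overlap between a copy and the following kernel, with no launch or synchronisation overhead.

```python
def sequential_time(nsteps, t_kernel, t_copy):
    """Every step runs the kernel, then blocks until the copy is done."""
    return nsteps * (t_kernel + t_copy)


def concurrent_time(nsteps, t_kernel, t_copy):
    """Idealised two-stage pipeline: the copy for one step overlaps the
    kernel of the next step, so the slower stage sets the pace."""
    return t_kernel + (nsteps - 1) * max(t_kernel, t_copy) + t_copy


# Made-up timings (arbitrary units), purely for illustration
print(sequential_time(100, 1.0, 0.5))
print(concurrent_time(100, 1.0, 0.5))
```

When the copy is shorter than the kernel, the concurrent estimate approaches the pure compute time, which is the case where IO is hidden entirely behind compute.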

## How to enable concurrent IO

### Use multiple streams

A stream is a queue to which work, such as a kernel launch or an IO operation, is submitted. When no stream is specified, or stream 0 is specified, work goes to a default stream called the **null stream**. Thus far we have used only the null stream for both compute and IO. GPUs have the ability to run streams that do IO **at the same time** as the streams devoted to compute. If a kernel launch does not keep all hardware threads on the device busy, then streams also allow multiple kernels to run at the same time. In either case **there is a performance advantage to be gained** by using multiple streams.

Streams are initialised using the commands **hipStreamCreate** and **hipStreamCreateWithFlags**. The latter command allows one to create streams where the null stream can optionally wait for work in the stream to complete (implicitly synchronise) before running its own work. The command **h_create_streams** from <a href="../include/hip_helper.hpp">hip_helper.hpp</a> is used to create additional streams with the option of passing a flag to implicitly synchronise with the null stream. Below is the code for **h_create_streams**:

```C++
/// Create a number of streams
hipStream_t* h_create_streams(size_t nstreams, int synchronise) {
    // synchronise is treated as a boolean:
    // 0 == do not synchronise with the null stream, non-zero == synchronise
    assert(nstreams>0);

    unsigned int flag = hipStreamDefault;

    // If synchronise is 0 then set the NonBlocking flag,
    // meaning we don't implicitly synchronise with the null stream
    if (synchronise == 0) {
        flag = hipStreamNonBlocking;
    }

    // Make the streams
    hipStream_t* streams = (hipStream_t*)calloc(nstreams, sizeof(hipStream_t));

    for (size_t i=0; i<nstreams; i++) {
        H_ERRCHK(hipStreamCreateWithFlags(&streams[i], flag));
    }

    return streams;
}
```

If streams operate concurrently, then there must be synchronisation controls. HIP events may be inserted into streams, and the event will reach a complete status after all prior work in the stream is done. Both event and stream synchronisation commands can help establish dependencies between streams. 

### Use pinned memory and asynchronous IO calls 

IO functions such as **hipMemcpy** and **hipMemcpy3D** are blocking, meaning they wait until the copy is complete before returning. In HIP these functions have asynchronous siblings, such as **hipMemcpyAsync** and **hipMemcpy3DAsync**, that take a stream as an extra argument. They return immediately **only** if the host memory involved in the copy is **pinned**. In managed memory allocations the synchronisation occurs in the background and there is no need for an explicit synchronisation call.

## Example with the 2D wave equation

The [scalar wave equation](https://en.wikipedia.org/wiki/Wave_equation) adequately describes a number of wavelike phenomena. If **U** is a 2D grid storing the amplitude of the wave at every location (the wavefield), **V** is a 2D grid storing velocity, and **t** is time, then 2D waves propagate according to the formula,

$$\frac{\partial^2 \textbf{U}}{{\partial t}^2}=\textbf{V}^2 \left (\frac{\partial^2 \textbf{U}}{{\partial x_{0}}^2}+\frac{\partial^2 \textbf{U}}{{\partial x_{1}}^2} \right)+f(t)$$

where $x_0$ and $x_1$ are spatial directions and $f(t)$ is a forcing term. If $\Delta t$ is the time step, a second-order finite-difference approximation to the time derivative is given in terms of the amplitudes $\textbf{U}_{0}, \textbf{U}_{1}$ and $\textbf{U}_{2}$ at three consecutive timesteps:

$$\frac{\partial^2 \textbf{U}}{{\partial t}^2} \approx \frac{1}{\Delta t^2} \left ( \textbf{U}_{0} -2 \textbf{U}_{1}+\textbf{U}_{2} \right ) $$

Substituting $\frac{1}{\Delta t^2} \left( \textbf{U}_{0} -2 \textbf{U}_{1}+\textbf{U}_{2} \right )$ for $\frac{\partial^2 \textbf{U}}{{\partial t}^2}$ in the wave equation, evaluating the spatial derivatives at the present timestep, and solving for $\textbf{U}_{2}$ gives

$$\textbf{U}_{2} \approx 2 \textbf{U}_{1} - \textbf{U}_{0} + \Delta t^2\textbf{V}^2 \left (\frac{\partial^2 \textbf{U}_{1}}{{\partial x_{0}}^2}+\frac{\partial^2 \textbf{U}_{1}}{{\partial x_{1}}^2} \right)+f_{1}$$

The equation above is now an iterative formula to generate the amplitude at the next timestep $\textbf{U}_2$ if we know the present amplitude $\textbf{U}_{1}$ and past amplitude $\textbf{U}_{0}$. Here $f_{1}$ denotes the forcing term at the present timestep, with the factor of $\Delta t^2$ absorbed into it. We also use finite difference approximations for the spatial derivatives, and express the spatial derivatives as a matrix multiplied by $\textbf{U}_{1}$, but this complexity is unnecessary to show here. All we need to know is that the next timestep is a function ${\textbf{F}}$ of the present and past timesteps, the velocity, and the forcing term.

$$\textbf{U}_{2}(t)=\textbf{F}(\textbf{U}_0(t), \textbf{U}_1(t), \textbf{V}, f_{1}(t))$$
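
To make the update rule concrete, here is a minimal sketch of one timestep, reduced to one spatial dimension and a second-order spatial stencil for brevity. It is not the course code: the function name `next_wavefield`, the list-based storage, and the choice to leave boundary cells at their present values are all assumptions made for illustration.

```python
def next_wavefield(U0, U1, v, dt, dx, f1=None):
    """Return U2 from the past (U0) and present (U1) amplitudes.

    Interior cells follow U2 = 2*U1 - U0 + dt^2 * v^2 * d2U1/dx2 + f1.
    Boundary cells are simply copied from U1 (an assumption).
    """
    U2 = list(U1)
    for i in range(1, len(U1) - 1):
        # Second-order central difference for the spatial second derivative
        d2x = (U1[i - 1] - 2.0 * U1[i] + U1[i + 1]) / dx**2
        U2[i] = 2.0 * U1[i] - U0[i] + dt**2 * v**2 * d2x
        if f1 is not None:
            U2[i] += f1[i]
    return U2


# U(x, t) = x^2 + v^2 t^2 satisfies the 1D wave equation exactly,
# so it makes a convenient check of a single step.
v, dt, dx = 2.0, 0.1, 0.5
x = [i * dx for i in range(6)]
U_past = [xi**2 for xi in x]                      # t = 0
U_present = [xi**2 + (v * dt) ** 2 for xi in x]   # t = dt
U_next = next_wavefield(U_past, U_present, v, dt, dx)
```

The interior of `U_next` should match the exact solution at $t=2\Delta t$ to rounding error, because central differences are exact for quadratics.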

> In geophysics we usually use a [Ricker Wavelet](https://wiki.seg.org/wiki/Dictionary:Ricker_wavelet) for the forcing term $f$, and usually inject that wavelet into one cell within the grid as time progresses.

### Kernel implementation

In [wave2d_sync.cpp](wave2d_sync.cpp), [wave2d_async_streams.cpp](wave2d_async_streams.cpp), and [wave2d_async_events.cpp](wave2d_async_events.cpp) is a kernel called **wave2d_4o** that implements the function **F** above. HIP device allocations store $\textbf{U}_{0}, \textbf{U}_{1}, \textbf{U}_{2}$, and $\textbf{V}$ on the compute device. Here is the kernel code:

```C++
// Kernel to solve the wave equation with fourth-order accuracy in space
__global__ void wave2d_4o (
        // Arguments
        float_type* U0,
        float_type* U1,
        float_type* U2,
        float_type* V,
        size_t N0,
        size_t N1,
        float dt2,
        float inv_dx02,
        float inv_dx12,
        // Position, frequency, and time for the
        // wavelet injection
        size_t P0,
        size_t P1,
        float pi2fm2t2) {    

    // U2, U1, U0, V is of size (N0, N1)
    size_t i0 = blockIdx.y * blockDim.y + threadIdx.y;
    size_t i1 = blockIdx.x * blockDim.x + threadIdx.x;
    
    // Required padding and coefficients for spatial finite difference
    const int pad_l=2, pad_r=2, ncoeffs=5;
    float coeffs[ncoeffs] = {-0.083333336f, 1.3333334f, -2.5f, 1.3333334f, -0.083333336f};
    
    // Limit i0 and i1 to the region of U2 within the padding
    i0=min(i0, (size_t)(N0-1-pad_r));
    i1=min(i1, (size_t)(N1-1-pad_r));
    i0=max((size_t)pad_l, i0);
    i1=max((size_t)pad_l, i1);
    
    // Position within the grid as a 1D offset
    long offset=i0*N1+i1;
    
    // Temporary storage
    float temp0=0.0f, temp1=0.0f;
    float tempV=V[offset];
    
    // Calculate the Laplacian
    for (long n=0; n<ncoeffs; n++) {
        // Stride in dim0 is N1        
        temp0+=coeffs[n]*U1[offset+(n*(long)N1)-(pad_l*(long)N1)];
        // Stride in dim1 is 1
        temp1+=coeffs[n]*U1[offset+n-pad_l];
    }
    
    // Calculate the wavefield U2 at the next timestep
    U2[offset]=(2.0f*U1[offset])-U0[offset]+((dt2*tempV*tempV)*(temp0*inv_dx02+temp1*inv_dx12));
    
    // Inject the forcing term at coordinates (P0, P1)
    if ((i0==P0) && (i1==P1)) {
        U2[offset]+=(1.0f-2.0f*pi2fm2t2)*exp(-pi2fm2t2);
    }
    
}
```
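
For checking the kernel on small grids, a NumPy reference of the same update can be handy. The sketch below is an assumption-laden illustration rather than part of the course material: the name `wave2d_4o_reference` is hypothetical, the wavelet injection is omitted, and cells inside the padding are left at zero.

```python
import numpy as np

# Fourth-order coefficients for the second derivative; these are the exact
# fractions behind the rounded literals hard-coded in wave2d_4o.
COEFFS = np.array([-1.0 / 12.0, 4.0 / 3.0, -2.5, 4.0 / 3.0, -1.0 / 12.0])
PAD = 2  # same as pad_l and pad_r in the kernel


def wave2d_4o_reference(U0, U1, V, dt2, inv_dx02, inv_dx12):
    """Compute U2 on the interior of the grid, without the forcing term."""
    N0, N1 = U1.shape
    M0, M1 = N0 - 2 * PAD, N1 - 2 * PAD
    lap = np.zeros((M0, M1), dtype=U1.dtype)
    for n, c in enumerate(COEFFS):
        # Neighbours along dimension 0 (stride N1 in the kernel)
        lap += c * U1[n:n + M0, PAD:N1 - PAD] * inv_dx02
        # Neighbours along dimension 1 (stride 1 in the kernel)
        lap += c * U1[PAD:N0 - PAD, n:n + M1] * inv_dx12

    inner = (slice(PAD, N0 - PAD), slice(PAD, N1 - PAD))
    U2 = np.zeros_like(U1)
    U2[inner] = 2.0 * U1[inner] - U0[inner] + dt2 * V[inner] ** 2 * lap
    return U2


# Quadratic test field: its Laplacian is exactly 4 when the grid spacing is 1
i0, i1 = np.meshgrid(np.arange(8.0), np.arange(8.0), indexing="ij")
U = i0**2 + i1**2
U2 = wave2d_4o_reference(U, U, np.ones_like(U), 0.25, 1.0, 1.0)
```

With `dt2=0.25`, unit velocity, and `U0 == U1`, every interior cell of `U2` should equal the corresponding cell of `U` plus one.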

### Problem setup

For this problem we create the 2D grid as a square box of size $(N0,N1)=(256,256)$. The velocity is uniform at 343 m/s, which is approximately the speed of sound in air. Then we use a Ricker wavelet as a forcing term $f(t)$ to 'let off a firework' in the middle of the box and run a number of timesteps to see how a sound wave propagates in the box.

<figure style="margin-bottom 3em; margin-top: 2em; margin-left:auto; margin-right:auto; width:80%">
    <img style="vertical-align:middle" src="../images/wave2d_problem.svg"> <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Figure: Problem setup for the 2D wave equation.</figcaption>
</figure>

The programs set up a velocity and wavefield of size (N0, N1). At each timestep the kernel **wave2d_4o** is used to update the solution and a Ricker wavelet is injected into the middle of the box. Enough timesteps are allotted so that the wave propagates through the medium and reflects off the walls. Wavefield arrays that are no longer needed are recycled for efficiency.
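
The injected amplitude follows the Ricker wavelet expression used inside the kernel, $(1-2\pi^2 f_m^2 t^2)\,e^{-\pi^2 f_m^2 t^2}$. A small host-side sketch is below; the function name `ricker` is hypothetical, and `t` is assumed to be the already shifted time that the time loop computes.

```python
import math


def ricker(t, fm):
    """Ricker wavelet amplitude at shifted time t for dominant frequency fm."""
    pi2fm2t2 = (math.pi * fm * t) ** 2
    return (1.0 - 2.0 * pi2fm2t2) * math.exp(-pi2fm2t2)


fm = 34.3  # dominant frequency reported in the program output
# The wavelet peaks at t = 0 and first crosses zero at t = 1/(pi*fm*sqrt(2))
t_zero = 1.0 / (math.pi * fm * math.sqrt(2.0))
print(ricker(0.0, fm), ricker(t_zero, fm))
```

The wavelet is symmetric about `t = 0`, which is why the time loop shifts time before evaluating it.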

The Python code below readies the framework for plotting the answer:

In [2]:
%matplotlib widget

import os
import sys
import numpy as np
import subprocess
from ipywidgets import widgets
from matplotlib import pyplot as plt
from matplotlib import animation, rc
from IPython.display import HTML

sys.path.insert(0, os.path.abspath("../common"))

import py_helper

float_type = np.float32

defines=py_helper.load_defines("mat_size.hpp")

### Sequential (synchronous) IO solution

In [wave2d_sync.cpp](wave2d_sync.cpp) we use an array of three HIP device allocations to represent the wavefield at timesteps (0,1,2). The null stream (stream 0) is used for both kernel execution and IO.

<figure style="margin-bottom 3em; margin-top: 2em; margin-left:auto; margin-right:auto; width:100%">
    <img style="vertical-align:middle" src="../images/sequential_io.svg"> <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Figure: Sequential IO solution.</figcaption>
</figure>

After setting up the wavefields, the synchronous solution loops through timesteps. At each timestep:

1. The kernel is launched to compute the next wavefield $\textbf{U}_{2}$ from $\textbf{U}_{0}$, $\textbf{U}_{1}$ and $\textbf{V}$.
1. The past wavefield $\textbf{U}_{0}$ is copied back to the output grid **out_h** as a plane in the 3D stack.

The code for the central time loop in [wave2d_sync.cpp](wave2d_sync.cpp) is below:

```C++
    for (int n=0; n<NT; n++) {
        // Get the wavefields
        U0_d = U_ds[n%nscratch];
        U1_d = U_ds[(n+1)%nscratch];
        U2_d = U_ds[(n+2)%nscratch];
        
        // Shifted time
        t = n*dt-2.0*td;
        pi2fm2t2 = pi*pi*fm*fm*t*t;
        
        // Launch the kernel using hipLaunchKernelGGL method
        // Use 0 when choosing the default (null) stream
        hipLaunchKernelGGL(wave2d_4o, 
            grid_nblocks, block_size, sharedMemBytes, 0,
            U0_d, U1_d, U2_d, V_d,
            N0, N1, dt2,
            inv_dx02, inv_dx12,
            P0, P1, pi2fm2t2
        );
                           
        // Check the status of the kernel launch
        H_ERRCHK(hipGetLastError());
          
        // Copy the wavefield back to the host
        // hipMemcpy is a barrier
        if (n>1 && n<NT-1) { // For consistency with the async solution
            H_ERRCHK(
                hipMemcpy(
                    (void*)&out_h[n*N0*N1],
                    U0_d,
                    nbytes_U,
                    hipMemcpyDeviceToHost
                )
            );
        }
    }
```

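Notice that with the **hipMemcpy** call above, we used an address located within the host memory allocation **out_h** as the destination of the copy. As discussed later in this module, this technique produced faulty copies when it was tried with asynchronous copies, so it is **not used** in the concurrent solutions.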
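
The destination address `&out_h[n*N0*N1]` is the start of plane `n` in the output stack. The sketch below spells out that arithmetic, assuming **out_h** is a contiguous, C-ordered stack of shape (NT, N0, N1), which is consistent with how the output file is reshaped for plotting. The helper names are hypothetical.

```python
def plane_offset(n, N0, N1):
    """Element offset of the first cell of plane n in an (NT, N0, N1) stack."""
    return n * N0 * N1


def linear_index(n, i0, i1, N0, N1):
    """Element offset of cell (i0, i1) of plane n in the same stack."""
    return (n * N0 + i0) * N1 + i1


N0, N1 = 256, 256
itemsize = 4  # bytes in a 32-bit float
print(plane_offset(3, N0, N1), plane_offset(3, N0, N1) * itemsize)
```

Cell (0, 0) of plane `n` sits exactly at `plane_offset(n, N0, N1)`, so copying `N0*N1` elements to that address fills one plane.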

## Import the environment

The command below brings the `run` and `build` commands within reach of the Jupyter notebook.

In [3]:
import os
os.environ['PATH'] = f"{os.environ['PATH']}:../install/bin"

# At a Bash terminal you need to do this instead
# source ../env

#### Make and run the application.

In [4]:
!build wave2d_sync.exe; build wave2d_async_events.exe; build wave2d_async_streams.exe

[ 50%] Built target hip_helper
[100%] Built target wave2d_sync.exe
Install the project...
-- Install configuration: "RELEASE"
[ 50%] Built target hip_helper
[100%] Built target wave2d_async_events.exe
Install the project...
-- Install configuration: "RELEASE"
[ 50%] Built target hip_helper
[100%] Built target wave2d_async_streams.exe
Install the project...
-- Install configuration: "RELEASE"


In [6]:
subprocess.run(["wave2d_sync.exe"])

Device id: 0
	name:                                    AMD Radeon VII
	global memory size:                      17163 MB
	available registers per block:           65536 
	max threads per SM or CU:                2560 
	maximum shared memory size per block:    65 KB
	maximum shared memory size per SM or CU: 65 KB
	maximum pitch size for memory copies:    2147 MB
	max block size:                          (1024,1024,1024)
	max threads in a block:                  1024
	max Grid size:                           (2147483647,65536,65536)
dt=0.001166, Vmax=343.000000
dt=0.00116618, fm=34.3, Vmax=343, dt2=1.35998e-06
The synchronous calculation took 72.224000 milliseconds.


CompletedProcess(args=['wave2d_sync.exe'], returncode=0)

#### Plot the output wavefield

After the final iteration, the wavefield at every timestep is written to the binary file **array_out.dat**. We read in this wavefield and plot it below.

In [7]:
# Read the output file back in for display
output_sync=np.fromfile("array_out.dat", dtype=float_type)
nimages_sync=int(output_sync.size//(defines["N0_U"]*defines["N1_U"]))
images_sync=output_sync.reshape(nimages_sync, defines["N0_U"], defines["N1_U"])

py_helper.plot_slices(images_sync)


#### Application trace

The script **make_traces.sh** produces traces in the **rocprof_trace** folder for each of the synchronous and concurrent IO solutions.

In [9]:
!./make_traces.sh

RPL: on '240507_161254' from '/opt/rocm-6.0.2' in '/nethome/tpotter/Pelagos/Projects/HIP_Course/course_material/L8_IO_Optimisation'
RPL: profiling '"wave2d_sync.exe"'
RPL: input file ''
RPL: output dir '/tmp/rpl_data_240507_161254_1533010'
RPL: result dir '/tmp/rpl_data_240507_161254_1533010/input_results_240507_161254'
ROCtracer (1533034):
ROCProfiler: input from "/tmp/rpl_data_240507_161254_1533010/input.xml"
  0 metrics
    HSA-trace(*)
    HSA-activity-trace()
    HIP-trace(*)
Device id: 0
	name:                                    AMD Radeon VII
	global memory size:                      17163 MB
	available registers per block:           65536 
	max threads per SM or CU:                2560 
	maximum shared memory size per block:    65 KB
	maximum shared memory size per SM or CU: 65 KB
	maximum pitch size for memory copies:    2147 MB
	max block size:                          (1024,1024,1024)
	max threads in a block:                  1024
	max Grid size:                           (2

Then you can go to the address [https://ui.perfetto.dev](https://ui.perfetto.dev) to open the tracing utility. If you load the file **rocprof_trace/trace_sync.json** you should see something like this. Each IO operation occurs after a kernel execution, in the same stream.

<figure style="margin-bottom 3em; margin-top: 2em; margin-left:auto; margin-right:auto; width:100%">
    <img style="vertical-align:middle" src="../images/synchronous_io.png"> <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Figure: Sequential IO solution. IO occurs after compute.</figcaption>
</figure>

### Concurrent (asynchronous) IO solutions

In [wave2d_async_streams.cpp](wave2d_async_streams.cpp) and [wave2d_async_events.cpp](wave2d_async_events.cpp) are two solutions for concurrent IO. The goal is to use **multiple streams** so that while one stream is executing a kernel, the others are working on IO.

<figure style="margin-bottom 3em; margin-top: 2em; margin-left:auto; margin-right:auto; width:100%">
    <img style="vertical-align:middle" src="../images/concurrent_io.svg"> <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Figure: Concurrent IO solution.</figcaption>
</figure>

Both solutions use one stream for compute and a number of streams for IO. The difference between them is that [wave2d_async_streams.cpp](wave2d_async_streams.cpp) mainly uses synchronisation on streams to establish the necessary dependencies between IO and compute, while [wave2d_async_events.cpp](wave2d_async_events.cpp) mainly uses HIP events to achieve the same synchronisation. Both accomplish the same task of moving data while compute is taking place.

#### Concurrent access to buffers from the host and device

It is a **race condition** to read from a device allocation (from another stream) at the same time as a kernel is writing to that allocation. Furthermore, while it is technically possible to asynchronously read from an array that is also being read by a kernel, it can lead to undefined behaviour, as memory can be moved around. Therefore, it is **recommended** to perform IO only on buffers that we **know for sure** are not being used by a kernel. Our kernel needs access to the wavefields $\textbf{U}_{0}, \textbf{U}_{1}, \textbf{U}_{2}$, so they are active and **not safe** to copy, but wavefields at earlier timesteps, e.g. $\textbf{U}_{-2}, \textbf{U}_{-1}$, **are** inactive and safe to copy.

<figure style="margin-bottom 3em; margin-top: 2em; margin-left:auto; margin-right:auto; width:30%">
    <img style="vertical-align:middle" src="../images/wavefields.svg"> <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Figure: Active (not okay to copy) and inactive (okay to copy) wavefields.</figcaption>
</figure>

In this instance, the solution to enable concurrent IO and avoid race conditions is to have an array of at least four memory allocations on the device to represent the wavefield. Then the kernel can operate on the active wavefields, while the IO streams copy from inactive wavefields. We use an array of **nscratch=5** allocations to allow extra leeway for copies to finish.
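
The buffer bookkeeping can be checked with a few lines of Python. This is a sketch of the index arithmetic only, using hypothetical helper names: at iteration `n` the kernel touches slots `n`, `n+1` and `n+2` (modulo `nscratch`), and the copy reads slot `n-1`.

```python
def active_slots(n, nscratch):
    """Slots holding U0, U1 and U2 for the kernel at iteration n."""
    return {(n + k) % nscratch for k in range(3)}


def copy_slot(n, nscratch):
    """Slot that the IO stream copies from at iteration n."""
    return (n - 1) % nscratch


def is_copy_safe(nscratch, nsteps=50):
    """True if the copied slot is never one of the kernel's active slots."""
    return all(
        copy_slot(n, nscratch) not in active_slots(n, nscratch)
        for n in range(1, nsteps)
    )


def iterations_until_overwrite(nscratch):
    """Later iterations that pass before a kernel writes to the copied slot."""
    return nscratch - 3


for nscratch in (3, 4, 5):
    print(nscratch, is_copy_safe(nscratch), iterations_until_overwrite(nscratch))
```

With three allocations the copy would read a slot that the kernel is writing. Four is the minimum, and five gives each copy two iterations to finish before its slot is overwritten.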

During construction of these examples I found it **difficult to maintain synchronisation** when using multiple streams for compute, therefore a **stable solution** used one stream for compute to ensure there were no issues. The concurrent IO solutions use an array of five streams for IO and one stream for compute. Also allocated is an array of five events to demonstrate the use of events in workflow dependencies.

#### Output array converted to pinned memory

In [wave2d_async_streams.cpp](wave2d_async_streams.cpp) and [wave2d_async_events.cpp](wave2d_async_events.cpp) the 3D output array **out_h** on the host is allocated as pinned memory using **hipHostMalloc** to enable asynchronous copies. I encountered **silent failures** when trying to use addresses **within the pinned memory allocation** as inputs to **hipMemcpyAsync**. Therefore I had to use **hipMemcpy3DAsync** to perform the copies asynchronously and pass in the host pointer returned by **hipHostMalloc**.

#### Stream-based synchronisation

In [wave2d_async_streams.cpp](wave2d_async_streams.cpp) we use explicit stream-based synchronisation. During each iteration **n** of the time loop:

1. We use **hipStreamSynchronize** to make the host wait for all streams associated with past copies, as well as the compute stream.
1. Submit the kernel to compute stream to solve for U[(n+2)%nscratch], then record an event (at index **n**) into the compute stream.
1. The wavefield at (**n**-1) (which we term the *copy_index*) is only safe to copy once the compute stream (from the previous iteration) is done with it. Use **hipStreamWaitEvent** to make the IO stream at (**n**-1) wait on event at (**n**-1). Then use the IO stream and **hipMemcpy3DAsync** to asynchronously copy the wavefield at (**n**-1) to the stack of output images. 

The following diagram shows how the dependencies play out with stream-based synchronisation.

<figure style="margin-bottom 3em; margin-top: 2em; margin-left:auto; margin-right:auto; width:100%">
    <img style="vertical-align:middle" src="../images/wavefields_concurrent_streams.svg"> <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Figure: Workflow of stream-based synchronisation during an iteration.</figcaption>
</figure>

The time loop for stream-based iteration is replicated below:

```C++
    for (int n=0; n<NT; n++) {
        
        // Explicitly wait for all relevant streams to finish
        H_ERRCHK(hipStreamSynchronize(streams[(n+2)%nscratch]));
        H_ERRCHK(hipStreamSynchronize(streams[(n+1)%nscratch]));
        H_ERRCHK(hipStreamSynchronize(streams[n%nscratch]));
        H_ERRCHK(hipStreamSynchronize(compute_stream));
        
        // Get the wavefields
        U0_d = U_ds[n%nscratch];
        U1_d = U_ds[(n+1)%nscratch];
        U2_d = U_ds[(n+2)%nscratch];
        
        // Shifted time
        t = n*dt-2.0*td;
        pi2fm2t2 = pi*pi*fm*fm*t*t;
        
        // Launch the kernel using hipLaunchKernelGGL method
        // Submit it to the dedicated compute stream, not the null stream
        hipLaunchKernelGGL(wave2d_4o, 
            grid_nblocks, block_size, sharedMemBytes, compute_stream,
            U0_d, U1_d, U2_d, V_d,
            N0, N1, dt2,
            inv_dx02, inv_dx12,
            P0, P1, pi2fm2t2
        );
                           
        // Check the status of the kernel launch
        H_ERRCHK(hipGetLastError());
          
        // Insert event n%nscratch into the compute stream
        // It will complete after the kernel does
        H_ERRCHK(hipEventRecord(events[n%nscratch], compute_stream));   
        
        // Read memory from the buffer to the host in an asynchronous manner
        if (n>2) {
            size_t copy_index=n-1;
            
            // Insert a wait for the IO stream on the compute event
            H_ERRCHK(
                hipStreamWaitEvent(
                    streams[copy_index%nscratch], 
                    events[copy_index%nscratch],
                    0
                )
            );
            
            // Then asynchronously copy a wavefield back
            // using the IO stream.
            
            // Only change what is necessary in copy_parms
            copy_parms.srcPtr.ptr = U_ds[copy_index%nscratch];
            
            // Z positions of 1 don't seem to work on AMD platforms?!?!
            copy_parms.dstPos.z = copy_index;
            
            // Copy memory asynchronously
            H_ERRCHK(
                hipMemcpy3DAsync(
                    &copy_parms,
                    streams[copy_index%nscratch]
                )
            );
        }
    }
```

Notice that we used **hipMemcpy3DAsync** with the host pointer **out_h** to perform the asynchronous copy of a wavefield to the output array on the host.
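
To see why these dependencies are sufficient, the loop can be replayed as a simple timing model on the host. The sketch below is Python, not HIP: the function names are hypothetical, kernel and copy durations are fixed made-up numbers, and host calls are assumed to cost no time unless they block.

```python
def simulate_stream_sync(nsteps, t_kernel, t_copy, nscratch=5):
    """Replay the stream-synchronised time loop and return
    (kernels, copies) as lists of (slots or slot, start, end)."""
    host = 0.0                      # current host time
    compute_free = 0.0              # time at which the compute stream drains
    io_free = [0.0] * nscratch      # time at which each IO stream drains
    event_done = [0.0] * nscratch   # completion time of each event
    kernels, copies = [], []

    for n in range(nsteps):
        slots = [(n + k) % nscratch for k in range(3)]
        # hipStreamSynchronize on three IO streams and the compute stream
        host = max([host, compute_free] + [io_free[s] for s in slots])

        # The kernel launch returns at once; the kernel runs on the compute stream
        start = max(host, compute_free)
        compute_free = start + t_kernel
        kernels.append((set(slots), start, compute_free))

        # hipEventRecord: the event completes when the kernel does
        event_done[n % nscratch] = compute_free

        if n > 2:
            s = (n - 1) % nscratch
            # hipStreamWaitEvent followed by hipMemcpy3DAsync on IO stream s
            copy_start = max(host, io_free[s], event_done[s])
            io_free[s] = copy_start + t_copy
            copies.append((s, copy_start, io_free[s]))
    return kernels, copies


def has_conflict(kernels, copies):
    """True if a copy overlaps in time with a kernel using the same slot."""
    return any(
        s in slots and c0 < k1 and k0 < c1
        for (s, c0, c1) in copies
        for (slots, k0, k1) in kernels
    )


kernels, copies = simulate_stream_sync(10, t_kernel=1.0, t_copy=0.5)
print(has_conflict(kernels, copies), max(end for _, _, end in kernels))
```

In this model a copy that takes longer than a kernel does not cause a conflict; it simply delays the kernel that next needs the same slot.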

##### Make and run the solution

In [10]:
subprocess.run(["wave2d_async_streams.exe"])

Device id: 0
	name:                                    AMD Radeon VII
	global memory size:                      17163 MB
	available registers per block:           65536 
	max threads per SM or CU:                2560 
	maximum shared memory size per block:    65 KB
	maximum shared memory size per SM or CU: 65 KB
	maximum pitch size for memory copies:    2147 MB
	max block size:                          (1024,1024,1024)
	max threads in a block:                  1024
	max Grid size:                           (2147483647,65536,65536)
dt=0.001166, Vmax=343.000000
dt=0.00116618, fm=34.3, Vmax=343, dt2=1.35998e-06
The asynchronous calculation took 61.448000 milliseconds.


CompletedProcess(args=['wave2d_async_streams.exe'], returncode=0)

If we examine the time elapsed for the asynchronous calculation (61.4 ms), it is about 15% lower than for the synchronous one (72.2 ms). Plotting the result shows that we get the same answer as with the synchronous solution.

In [12]:
output_async=np.fromfile("array_out.dat", dtype=float_type)
nimages_async=int(output_async.size//(defines["N0_U"]*defines["N1_U"]))
images_async=output_async.reshape(nimages_async, defines["N0_U"], defines["N1_U"])

py_helper.plot_slices(images_async)

# Use the absolute difference so that negative deviations are not missed
print(f"Maximum residual between results is {np.max(np.abs(images_async-images_sync))}")


Maximum residual between results is 0.0


#### Event-based synchronisation

In the previous solution the host **explicitly waits** for each IO stream from previous iterations before the kernel works on active wavefields. We can accomplish the same synchronisation by recording an event after each copy and having the compute stream wait for all IO related events before working on the kernel. In [wave2d_async_events.cpp](wave2d_async_events.cpp) the time loop takes the form:


1. We use **hipStreamWaitEvent** to make the compute stream wait for copies (from previous iterations) of wavefields that are now at (**n**), (**n**+1) and (**n**+2). 
1. Submit the kernel to compute stream to solve for U[(n+2)%nscratch], then record an event (at index **n**) into the compute stream.
1. The wavefield at (**n**-1) (which we call the *copy_index*) is safe to copy once the compute stream is done with it. We use **hipEventSynchronize** to make the host wait on the event at (**n**-1), which was recorded into the compute stream during the previous iteration. This also prevents a backlog of work from accumulating on the compute stream.
1. Then use the IO stream at (**n**-1) and **hipMemcpy3DAsync** to asynchronously copy the wavefield at (**n**-1) to the stack of output images. 
1. Finally, an event at (**n**-1) is recorded to the stream at (**n**-1) so it can be waited on at step 1 for the next iteration.

The following diagram shows how the dependencies play out with events and streams during a single iteration.

<figure style="margin-bottom 3em; margin-top: 2em; margin-left:auto; margin-right:auto; width:100%">
    <img style="vertical-align:middle" src="../images/wavefields_concurrent_events.svg"> <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Figure: Workflow of event-based synchronisation during an iteration.</figcaption>
</figure>

The code for the iterations in [wave2d_async_events.cpp](wave2d_async_events.cpp) is reproduced here.

```C++
    for (int n=0; n<NT; n++) {
        
        // Make the compute stream wait on events from previous copies
        H_ERRCHK(hipStreamWaitEvent(compute_stream, events[(n+2)%nscratch], 0));
        H_ERRCHK(hipStreamWaitEvent(compute_stream, events[(n+1)%nscratch], 0));
        H_ERRCHK(hipStreamWaitEvent(compute_stream, events[n%nscratch], 0));
        
        // Get the wavefields
        U0_d = U_ds[n%nscratch];
        U1_d = U_ds[(n+1)%nscratch];
        U2_d = U_ds[(n+2)%nscratch];
        
        // Shifted time
        t = n*dt-2.0*td;
        pi2fm2t2 = pi*pi*fm*fm*t*t;
        
        // Launch the kernel using hipLaunchKernelGGL method
        // Submit it to the dedicated compute stream, not the null stream
        hipLaunchKernelGGL(wave2d_4o, 
            grid_nblocks, block_size, sharedMemBytes, compute_stream,
            U0_d, U1_d, U2_d, V_d,
            N0, N1, dt2,
            inv_dx02, inv_dx12,
            P0, P1, pi2fm2t2
        );
                           
        // Check the status of the kernel launch
        H_ERRCHK(hipGetLastError());
          
        // Insert event n%nscratch into the compute stream
        // It will complete after the kernel does
        H_ERRCHK(hipEventRecord(events[n%nscratch], compute_stream));   
        
        // Read memory from the buffer to the host in an asynchronous manner
        if (n>2) {
            size_t copy_index=n-1;
            
            // Explicitly wait for compute operation from previous iteration to finish
            // before initiating a copy
            H_ERRCHK(
                hipEventSynchronize(events[copy_index%nscratch])
            );
            
            // Then asynchronously copy a wavefield back
            // using an IO stream
            
            // Only change what is necessary in copy_parms
            copy_parms.srcPtr.ptr = U_ds[copy_index%nscratch];
            
            // Z positions of 1 don't seem to work on AMD platforms?!?!
            copy_parms.dstPos.z = copy_index;
            
            // Copy memory asynchronously
            H_ERRCHK(
                hipMemcpy3DAsync(
                    &copy_parms,
                    streams[copy_index%nscratch]
                )
            );
            // Record the event to the IO stream
            H_ERRCHK(
                hipEventRecord(
                    events[copy_index%nscratch],
                    streams[copy_index%nscratch]
                )
            );
        }
    }
```



##### Make and run the solution

In [13]:
subprocess.run(["wave2d_async_events.exe"])

Device id: 0
	name:                                    AMD Radeon VII
	global memory size:                      17163 MB
	available registers per block:           65536 
	max threads per SM or CU:                2560 
	maximum shared memory size per block:    65 KB
	maximum shared memory size per SM or CU: 65 KB
	maximum pitch size for memory copies:    2147 MB
	max block size:                          (1024,1024,1024)
	max threads in a block:                  1024
	max Grid size:                           (2147483647,65536,65536)
dt=0.001166, Vmax=343.000000
dt=0.00116618, fm=34.3, Vmax=343, dt2=1.35998e-06
The asynchronous calculation took 60.638000 milliseconds.


CompletedProcess(args=['wave2d_async_events.exe'], returncode=0)

If we check the time elapsed we find that the concurrent IO solution with event-based synchronisation (60.6 ms) takes slightly less time than stream-based synchronisation (61.4 ms), and both take less time than sequential IO (72.2 ms). A trace of HIP activity in **rocprof_trace/wave2d_async_streams.json** shows that IO is taking place during compute.

<figure style="margin-bottom 3em; margin-top: 2em; margin-left:auto; margin-right:auto; width:100%">
    <img style="vertical-align:middle" src="../images/asynchronous_io.png"> <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Figure: Concurrent IO solution. IO can occur at the same time as compute.</figcaption>
</figure>
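
The relative savings can be worked out from the elapsed times printed by the three programs above. The sketch below uses those single-run timings as they stand, so the percentages are indicative only; the helper name `percent_saved` is hypothetical.

```python
def percent_saved(t_reference_ms, t_new_ms):
    """Percentage reduction in elapsed time relative to the reference."""
    return 100.0 * (t_reference_ms - t_new_ms) / t_reference_ms


# Elapsed times (milliseconds) reported by the runs in this notebook
timings_ms = {
    "sync": 72.224,
    "async_streams": 61.448,
    "async_events": 60.638,
}

for name in ("async_streams", "async_events"):
    saved = percent_saved(timings_ms["sync"], timings_ms[name])
    print(f"{name}: {saved:.1f}% less time than sync")
```

Timings from a single run vary, so repeating each program several times would give a more reliable comparison between the stream-based and event-based solutions.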


### Plot the wavefield and explore results

In [14]:
# Read the output file back in for display
output_async=np.fromfile("array_out.dat", dtype=float_type)
nimages_async=int(output_async.size//(defines["N0_U"]*defines["N1_U"]))
images_async=output_async.reshape(nimages_async, defines["N0_U"], defines["N1_U"])

py_helper.plot_slices(images_async)

# Use the absolute difference so that negative deviations are not missed
print(f"Maximum residual between results is {np.max(np.abs(images_async-images_sync))}")


Maximum residual between results is 0.0


## Problems encountered with hipMemcpy3D and asynchronous copies

In the concurrent examples you might notice the use of **hipMemcpy3DAsync** to copy wavefields as planes back to the 3D host memory allocation **out_h**. This is because using **hipMemcpyAsync** to copy device memory to an address within a pinned memory allocation **failed silently** and produced faulty copies. Therefore it is **not recommended** to perform asynchronous copies with pointers derived from *within* a pinned memory allocation. Only use the pointer returned by **hipHostMalloc** with asynchronous copy functions.

Problems were also encountered with **hipMemcpy3D** and **hipMemcpy3DAsync**. For some bizarre reason, during construction of these examples I found that on AMD platforms a value of **copy_parms.dstPos.z=1** resulted in an error for calls to either **hipMemcpy3D** or **hipMemcpy3DAsync**. This strange behaviour was not present with the NVIDIA backend. I have worked around this issue by only copying planes when the plane index z is at least 2. Hopefully this issue is addressed in a future version of ROCm.

## Summary of learnings

In this module we explored how IO can take place at the same time as a kernel using multiple streams. The concurrent IO solutions were faster than the sequential IO solution. With concurrent IO it is a safety measure to avoid accessing memory allocations that are being used by a kernel, unless the allocation is managed. Both stream-based and event-based synchronisation can establish and enforce dependencies between activity that occurs across multiple streams.

<address>
Written by Dr. Toby Potter of <a href="https://www.pelagos-consulting.com">Pelagos Consulting and Education</a> for the Pawsey Supercomputing Centre
</address>