# Learning Objectives

We will learn about the following in this lab:

* Concept of overlapping computation with memory transfers
* CUDA Streams overview and implementation
* CUDA Events overview and implementation
* Synchronization primitives in CUDA for the whole device, stream, event, etc.

# Improving Application Performance

### Analysis

The $(i+1)^{th}$ Jacobi iteration on any GPU cannot begin until all memory operations between all GPUs at the end of the $i^{th}$ iteration are complete. Each GPU sits idle once its own memory and compute operations have completed, as is visible in the profiler output below. The white space between the blue device kernel and the orange/green/pink memory operations is when the GPU is idle.

![memcpy_gpu_util](../../images/memcpy_gpu_util.png)

Let us quantify the time loss from the profiler output. 

![memcpy_util_selection](../../images/memcpy_util_selection.png)

On average, one iteration of `jacobi_kernel` takes about 600$\mu$s. The copy operations take about 50$\mu$s. The total time between Jacobi iterations is about 450$\mu$s. So the idle time is about $450-50=400\mu$s. 

We cannot recover all of the idle time as we are currently only considering the device timeline. Launching device kernels and copy operations has host-side overhead as well. Still, there is a significant opportunity to improve performance by minimizing the idle time.
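The arithmetic above can be captured in a small helper. This is only a sketch: the function names `idle_time_us` and `idle_fraction` are hypothetical, and treating one iteration period as the kernel time plus the gap between kernels is our simplifying assumption, not something the profiler reports directly.

```c
/* Idle time = gap between consecutive kernels minus the time spent copying. */
double idle_time_us(double gap_us, double copy_us) {
    return gap_us - copy_us;
}

/* Fraction of one iteration period during which the GPU is idle.
   Assumption: the period is simply kernel time plus the gap between kernels. */
double idle_fraction(double kernel_us, double gap_us, double copy_us) {
    return idle_time_us(gap_us, copy_us) / (kernel_us + gap_us);
}
```

With the profiler values quoted above, `idle_time_us(450.0, 50.0)` gives 400 and `idle_fraction(600.0, 450.0, 50.0)` is roughly 0.38, so under this assumption the GPU is idle for more than a third of each iteration period.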

### Optimization

Notice that the copy operations take place serially after the Jacobi iteration. The kernel computation must be complete before copying the updated halos from the GPU of interest (source) to its neighbours (destination).

However, we can perform the copy operation from the neighbouring GPUs (source) to the GPU of interest (destination) concurrently with the kernel computation as it will only be required in the next iteration.

An important optimization is to overlap computation and communication so that these operations can take place concurrently, whenever possible. We also need to keep track of dependencies so that the $(i+1)^{th}$ iteration on a GPU cannot begin until it sends and receives halos to and from its neighbours at the end of $i^{th}$ iteration.


## CUDA Concepts: Part 3

A CUDA device has multiple "engines" that can concurrently manage kernel execution(s) and data transfer(s). That is, we can overlap computation and communication in our application by utilizing these engines. This requires the use of CUDA Streams.

### Streams

A stream in CUDA is a sequence of operations that execute on the device in the order in which they are issued by the host code. While operations within a stream are guaranteed to execute in the prescribed order, operations in different streams can be interleaved and, when possible, they can even run concurrently.

#### The default stream

All device operations (kernels and data transfers) in CUDA run in a stream. When no stream is specified, the default stream (also called the “null stream”) is used. All of our code so far has implicitly used the default stream.

The default stream is different from other streams because it is a synchronizing stream with respect to operations on the device: no operation in the default stream will begin until all previously issued operations in any stream on the device have completed, and an operation in the default stream must complete before any other operation (in any stream on the device) will begin.

We need to use non-default streams to achieve concurrency as showcased in the image below.

![cuda_streams_overview](../../images/cuda_streams_overview.png)

#### Non-default streams

Let us first learn to create and destroy non-default CUDA streams:

```c
cudaStream_t stream1;
cudaError_t result;
result = cudaStreamCreate(&stream1);
result = cudaStreamDestroy(stream1);
```

To issue a data transfer to a non-default stream we use the `cudaMemcpyAsync()` function, which takes a stream identifier as an optional fifth argument.

```c
result = cudaMemcpyAsync(TopNeighbour, myTopRow, size, cudaMemcpyDeviceToDevice, stream1);
```

To issue a kernel to a non-default stream, we specify the stream identifier as the fourth execution configuration parameter. The third configuration parameter is the number of bytes of dynamically allocated shared memory per block; pass 0 when none is needed.

```c
jacobi_kernel<<<dim_grid, dim_block, 0, stream1>>>(...);
```

#### Synchronization

We have already encountered the `cudaDeviceSynchronize()` function, which blocks the host code until all previously issued operations on the device have completed. There are more fine-grained ways to synchronize code that uses streams.

The function `cudaStreamSynchronize(stream)` can instead be used to block the host until all previously issued operations in the specified stream have completed.
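Putting the pieces above together, the following is a minimal sketch of issuing a kernel and a copy to two different non-default streams and then synchronizing each stream individually. The kernel `add_one`, the function `run_two_stream_demo`, the `CUDA_TRY` macro, and all buffer names are hypothetical stand-ins, not part of the lab's source code. Whether the two operations actually run concurrently depends on the device and its available engines.

```c
#include <cuda_runtime.h>
#include <stdio.h>

/* Print the failing call and bail out (buffers are not freed on the error path). */
#define CUDA_TRY(call)                                                  \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s failed: %s\n", #call,                   \
                    cudaGetErrorString(err_));                          \
            return 1;                                                   \
        }                                                               \
    } while (0)

/* Hypothetical stand-in for jacobi_kernel: adds 1 to every element. */
__global__ void add_one(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

/* Returns 0 on success, 1 on a CUDA error or an unexpected result. */
int run_two_stream_demo(void) {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *d_work, *d_src, *d_dst, first = 0.0f;
    cudaStream_t compute_stream, copy_stream;

    CUDA_TRY(cudaMalloc(&d_work, bytes));
    CUDA_TRY(cudaMalloc(&d_src, bytes));
    CUDA_TRY(cudaMalloc(&d_dst, bytes));
    CUDA_TRY(cudaMemset(d_work, 0, bytes));
    CUDA_TRY(cudaMemset(d_src, 0, bytes));
    CUDA_TRY(cudaDeviceSynchronize()); /* initialization is done before streams are used */

    CUDA_TRY(cudaStreamCreate(&compute_stream));
    CUDA_TRY(cudaStreamCreate(&copy_stream));

    /* Independent work in different streams: these two operations may overlap. */
    add_one<<<(n + 255) / 256, 256, 0, compute_stream>>>(d_work, n);
    CUDA_TRY(cudaGetLastError());
    CUDA_TRY(cudaMemcpyAsync(d_dst, d_src, bytes, cudaMemcpyDeviceToDevice, copy_stream));

    /* Block the host on each stream separately instead of on the whole device. */
    CUDA_TRY(cudaStreamSynchronize(compute_stream));
    CUDA_TRY(cudaStreamSynchronize(copy_stream));

    CUDA_TRY(cudaMemcpy(&first, d_work, sizeof(float), cudaMemcpyDeviceToHost));

    CUDA_TRY(cudaStreamDestroy(compute_stream));
    CUDA_TRY(cudaStreamDestroy(copy_stream));
    CUDA_TRY(cudaFree(d_work));
    CUDA_TRY(cudaFree(d_src));
    CUDA_TRY(cudaFree(d_dst));
    return first == 1.0f ? 0 : 1;
}
```

To try it, call `run_two_stream_demo()` from a `main` function and compile the file with `nvcc`. Profiling the resulting binary with `nsys` shows whether the kernel and the copy overlapped on your system.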

## Implementation Exercise: Part 3

Now, let's implement CUDA streams in our application. Open the [jacobi_streams.cu](../../source_code/cuda/jacobi_streams.cu) file.

Alternatively, you can navigate to `CFD/English/C/source_code/cuda/` directory in Jupyter's file browser in the left pane. Then, click to open the `jacobi_streams.cu` file.

Note that we create 3 streams for each GPU: `compute_stream`, `push_top_stream`, and `push_bottom_stream`. We will compute the Jacobi iteration and perform the GPU-local L2 norm copy operation on `compute_stream`. Each GPU will perform its top and bottom halo copy operations to its neighbours using `push_top_stream` and `push_bottom_stream`, respectively.

Now, within the iterative Jacobi loop (the `while` loop), implement the following marked as `TODO: Part 3-`:

1. Synchronize `push_top_stream` and `push_bottom_stream` streams to ensure "top" and "bottom" neighbours have shared updated halos from the previous iteration.
2. Launch the device kernel on `compute_stream` with the correct device arrays as function arguments.
3. Asynchronously copy the GPU-local L2 norm back to the CPU on `compute_stream`.
4. Ensure the computation is complete by synchronizing `compute_stream` before copying the updated halos to the neighbours.
5. Implement top and bottom halo exchanges on the correct stream.

Review the topic above on Non-default streams if in doubt. Recall the utility of using separate `for` loops for launching device kernels and initiating copy operations.
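The ordering of the five steps above can be sketched on a single GPU. This is not the lab's solution: the kernel `step_kernel`, the function `run_streams_loop_sketch`, the `CUDA_TRY` macro, and the buffers `top_halo` and `bottom_halo` (which stand in for memory owned by neighbouring GPUs) are all hypothetical, and the per-device `for` loops of the real code are omitted.

```c
#include <cuda_runtime.h>
#include <stdio.h>

#define CUDA_TRY(call)                                                  \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s failed: %s\n", #call,                   \
                    cudaGetErrorString(err_));                          \
            return 1;                                                   \
        }                                                               \
    } while (0)

/* Hypothetical stand-in for jacobi_kernel: out = in + 1, norm set to a dummy value. */
__global__ void step_kernel(const float *in, float *out, float *norm, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1.0f;
    if (i == 0) *norm = 1.0f;
}

/* Returns 0 if three iterations produced the expected values, 1 otherwise. */
int run_streams_loop_sketch(void) {
    const int n = 1 << 16, halo = 256;
    float *a, *a_new, *top_halo, *bottom_halo, *d_norm, *h_norm, first = 0.0f;
    cudaStream_t compute_stream, push_top_stream, push_bottom_stream;

    CUDA_TRY(cudaMalloc(&a, n * sizeof(float)));
    CUDA_TRY(cudaMalloc(&a_new, n * sizeof(float)));
    CUDA_TRY(cudaMalloc(&top_halo, halo * sizeof(float)));
    CUDA_TRY(cudaMalloc(&bottom_halo, halo * sizeof(float)));
    CUDA_TRY(cudaMalloc(&d_norm, sizeof(float)));
    CUDA_TRY(cudaMallocHost(&h_norm, sizeof(float))); /* pinned host memory for the async copy */
    CUDA_TRY(cudaMemset(a, 0, n * sizeof(float)));
    CUDA_TRY(cudaDeviceSynchronize());
    CUDA_TRY(cudaStreamCreate(&compute_stream));
    CUDA_TRY(cudaStreamCreate(&push_top_stream));
    CUDA_TRY(cudaStreamCreate(&push_bottom_stream));

    for (int iter = 0; iter < 3; ++iter) {
        /* 1. Halo copies issued in the previous iteration must have finished. */
        CUDA_TRY(cudaStreamSynchronize(push_top_stream));
        CUDA_TRY(cudaStreamSynchronize(push_bottom_stream));
        /* 2. Compute on compute_stream. */
        step_kernel<<<(n + 255) / 256, 256, 0, compute_stream>>>(a, a_new, d_norm, n);
        CUDA_TRY(cudaGetLastError());
        /* 3. Asynchronous norm copy, ordered after the kernel in the same stream. */
        CUDA_TRY(cudaMemcpyAsync(h_norm, d_norm, sizeof(float),
                                 cudaMemcpyDeviceToHost, compute_stream));
        /* 4. The computation must be complete before the halos are pushed. */
        CUDA_TRY(cudaStreamSynchronize(compute_stream));
        /* 5. Halo pushes on their own streams; these two copies may overlap. */
        CUDA_TRY(cudaMemcpyAsync(top_halo, a_new, halo * sizeof(float),
                                 cudaMemcpyDeviceToDevice, push_top_stream));
        CUDA_TRY(cudaMemcpyAsync(bottom_halo, a_new + n - halo, halo * sizeof(float),
                                 cudaMemcpyDeviceToDevice, push_bottom_stream));
        float *tmp = a; a = a_new; a_new = tmp; /* swap input and output arrays */
    }

    CUDA_TRY(cudaDeviceSynchronize());
    CUDA_TRY(cudaMemcpy(&first, a, sizeof(float), cudaMemcpyDeviceToHost));
    int ok = (first == 3.0f) && (*h_norm == 1.0f);

    CUDA_TRY(cudaStreamDestroy(compute_stream));
    CUDA_TRY(cudaStreamDestroy(push_top_stream));
    CUDA_TRY(cudaStreamDestroy(push_bottom_stream));
    CUDA_TRY(cudaFreeHost(h_norm));
    CUDA_TRY(cudaFree(a)); CUDA_TRY(cudaFree(a_new)); CUDA_TRY(cudaFree(d_norm));
    CUDA_TRY(cudaFree(top_halo)); CUDA_TRY(cudaFree(bottom_halo));
    return ok ? 0 : 1;
}
```

Note that every dependency in this sketch is enforced by blocking the host with `cudaStreamSynchronize`. In the multi-GPU exercise the same ordering applies per device, with the halo destinations living on the neighbouring GPUs.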

After implementing these, let's compile the code:

In [None]:
!cd ../../source_code/cuda && make clean && make jacobi_streams

Validate the implementation by running the binary:

In [None]:
!cd ../../source_code/cuda && srun --partition=gpu -n1 --gres=gpu:2 ./jacobi_streams -p2p

Without P2P, on a DGX system with 8 A100 GPUs, we got the following output:

```bash
Num GPUs: 8.
16384x16384: 1 GPU:   3.3205 s, 8 GPUs:   0.6481 s, speedup:     5.12, efficiency:    64.04 
```
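The speedup and efficiency columns in this output are related to the two timings in the usual way: speedup is the single-GPU time divided by the multi-GPU time, and efficiency is the speedup divided by the number of GPUs, expressed as a percentage. The helper names below are hypothetical, and we are assuming the binary computes its figures this way, which is consistent with the printed values.

```c
/* Speedup of the multi-GPU run relative to the single-GPU run. */
double speedup(double t_single_s, double t_multi_s) {
    return t_single_s / t_multi_s;
}

/* Parallel efficiency in percent: speedup divided by the number of GPUs. */
double efficiency_percent(double t_single_s, double t_multi_s, int num_gpus) {
    return 100.0 * speedup(t_single_s, t_multi_s) / num_gpus;
}
```

For the run above, `speedup(3.3205, 0.6481)` is about 5.12 and `efficiency_percent(3.3205, 0.6481, 8)` is about 64.04.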

Now, enable P2P in our current program by passing the `-p2p` runtime flag. On the DGX system with 8 A100 GPUs, the efficiency increases to 84 percent. Your efficiency numbers and improvement in performance may differ depending on the system topology, GPU type, etc.

### Profiling

Now, profile the P2P-enabled version of the program with `nsys`:

In [None]:
!cd ../../source_code/cuda/ && srun --partition=gpu -n1 --gres=gpu:2 nsys profile --trace=cuda,nvtx --stats=true -o jacobi_streams_p2p_report --force-overwrite true ./jacobi_streams -p2p

To view the profiler report, download and save the report file by holding down <mark>Shift</mark> and <mark>Right-Clicking</mark> [Here](../../source_code/cuda/jacobi_streams_p2p_report.nsys-rep) and choosing Save Link As. Once done, open the report via the GUI. The total time between two Jacobi iterations is shown below.

![streams_util_selection](../../images/streams_util_selection.png)

The total time between two Jacobi iterations is now about $200\mu$s, and the copy operations take the same time as before, about $50\mu$s. Thus, the idle time is $200-50=150\mu$s. Compare this with the idle time for the non-streams version of the application, which in our case is about $400\mu$s. Concurrency improves GPU utilization and, consequently, speedup and efficiency.

**Solution:** The solution for this exercise is present in the `source_code/cuda/solutions` directory: [jacobi_streams.cu](../../source_code/cuda/solutions/jacobi_streams.cu)

#### Analysis

Can we improve our program further? Yes! Can you think of any bottleneck that we have mentioned implicitly but haven't addressed yet? 

Recall that `cudaStreamSynchronize` function blocks the "host" until all previously issued operations in the specified stream have completed. Do we need to block the host?

The utility of this function in our application is that it ensures the dependencies between iterations and between computation and communication are respected. We don't need to block the host for this purpose. 

## CUDA Concepts: Part 4

### CUDA Events

CUDA Events are synchronization markers that provide a mechanism to signal when operations have occurred in a stream. They allow fine-grained synchronization within a stream and also inter-stream synchronization, e.g. letting a stream wait for an event in another stream.

Let us first learn to create and destroy CUDA events:

```c
cudaEvent_t event1;
cudaError_t result;
result = cudaEventCreate(&event1);
result = cudaEventDestroy(event1);
```

#### Recording Events

Events have a boolean state: Occurred or Not Occurred. The default state is Occurred. We record an event as follows:

```c
cudaEventRecord(event1, stream1);
```

This function sets the state of `event1` to Not Occurred and enqueues `event1` into the queue of `stream1`. The state is set to Occurred when the event reaches the front of that queue.

#### Synchronizing Stream with Events

`cudaEventSynchronize` acts similarly to `cudaStreamSynchronize` and blocks the host until the recorded event has Occurred. But we do not wish to block the host thread. Thus, we use `cudaStreamWaitEvent`:

```c
cudaStreamWaitEvent(stream1, event1, 0);
```

This function blocks the stream until `event1` has Occurred, and it does not block the host. It works even if the event is recorded in a different stream or on a different device.

Thus, fine-grained synchronization that doesn't block the host is achieved by first using `cudaEventRecord` on the independent operation, for example, halo copy from GPU 0 to GPU 1 at the end of $i^{th}$ iteration. Then, before issuing the dependent operation, for example, Jacobi computation for $(i+1)^{th}$ iteration on GPU 1, we block the stream using `cudaStreamWaitEvent`.  
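The record-then-wait pattern described above can be sketched with two streams on a single GPU. The kernels `fill` and `add_one`, the function `run_event_demo`, the `CUDA_TRY` macro, and the stream and event names are hypothetical. The copy in `producer_stream` plays the role of the halo copy, and the kernel in `consumer_stream` plays the role of the dependent Jacobi computation.

```c
#include <cuda_runtime.h>
#include <stdio.h>

#define CUDA_TRY(call)                                                  \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s failed: %s\n", #call,                   \
                    cudaGetErrorString(err_));                          \
            return 1;                                                   \
        }                                                               \
    } while (0)

__global__ void fill(float *data, float value, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = value;
}

__global__ void add_one(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

/* Returns 0 if the dependent kernel saw the copied data, 1 otherwise. */
int run_event_demo(void) {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    const int blocks = (n + 255) / 256;
    float *d_src, *d_dst, first = 0.0f;
    cudaStream_t producer_stream, consumer_stream;
    cudaEvent_t copy_done;

    CUDA_TRY(cudaMalloc(&d_src, bytes));
    CUDA_TRY(cudaMalloc(&d_dst, bytes));
    CUDA_TRY(cudaStreamCreate(&producer_stream));
    CUDA_TRY(cudaStreamCreate(&consumer_stream));
    CUDA_TRY(cudaEventCreate(&copy_done));

    /* Independent operation: produce data and copy it, then mark completion. */
    fill<<<blocks, 256, 0, producer_stream>>>(d_src, 2.0f, n);
    CUDA_TRY(cudaGetLastError());
    CUDA_TRY(cudaMemcpyAsync(d_dst, d_src, bytes, cudaMemcpyDeviceToDevice, producer_stream));
    CUDA_TRY(cudaEventRecord(copy_done, producer_stream));

    /* Dependent operation: consumer_stream waits on the device; the host is not blocked. */
    CUDA_TRY(cudaStreamWaitEvent(consumer_stream, copy_done, 0));
    add_one<<<blocks, 256, 0, consumer_stream>>>(d_dst, n);
    CUDA_TRY(cudaGetLastError());

    /* Host-side synchronization is needed here only to read back the result. */
    CUDA_TRY(cudaStreamSynchronize(consumer_stream));
    CUDA_TRY(cudaMemcpy(&first, d_dst, sizeof(float), cudaMemcpyDeviceToHost));

    CUDA_TRY(cudaEventDestroy(copy_done));
    CUDA_TRY(cudaStreamDestroy(producer_stream));
    CUDA_TRY(cudaStreamDestroy(consumer_stream));
    CUDA_TRY(cudaFree(d_src));
    CUDA_TRY(cudaFree(d_dst));
    return first == 3.0f ? 0 : 1;
}
```

Without the `cudaStreamWaitEvent` call, `add_one` could start before the copy into `d_dst` has finished, because operations in different non-default streams are not ordered with respect to each other.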

## Implementation Exercise: Part 4

Let's implement CUDA Events with Streams in our application. Open the [jacobi_streams_events.cu](../../source_code/cuda/jacobi_streams_events.cu) file.

Alternatively, you can navigate to `CFD/English/C/source_code/cuda/` directory in Jupyter's file browser in the left pane. Then, click to open the `jacobi_streams_events.cu` file.

Note that we create 5 events for each device: `compute_done`, `push_top_done[0]`, `push_top_done[1]`, `push_bottom_done[0]`, and `push_bottom_done[1]`. We need 2 events for each halo on every device:

1. To synchronize "top" and "bottom" neighbour's `push_bottom_stream` and `push_top_stream` copy operations of $(i-1)^{th}$ iteration, respectively, before computing $i^{th}$ Jacobi iteration in `compute_stream`.
2. To record current device's `push_top_stream` and `push_bottom_stream` copy operations at the end of $i^{th}$ iteration.

Now, within the iterative Jacobi loop (the `while` loop), implement the following marked as `TODO: Part 4-`:

* Block `compute_stream` as long as the top and bottom halos from the neighbours are not copied to `dev_id`. The `push_top_done` and `push_bottom_done` events are to be monitored for the `bottom` and `top` neighbours, respectively, for the previous iteration denoted by `iter % 2`. Note that there should be 2 distinct `cudaStreamWaitEvent` function calls.
* Record that Jacobi computation on `compute_stream` is done by using `cudaEventRecord` for `compute_done` event for `dev_id`.
* Wait for the Jacobi computation of `dev_id` to complete by using the `compute_done` event on `push_top_stream` so that the top halo isn't copied to the neighbour before computation is done.
* Record completion of the top halo copy from `dev_id` to its neighbour, to be used in the next iteration. Record the `push_top_done` event of `dev_id` on `push_top_stream`, using the index for the next iteration, which is `(iter+1) % 2`.
* Repeat the same procedure as described in previous two points for bottom halo copy with `push_bottom_stream` and `push_bottom_done` event.
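The alternation between the two event slots used in the steps above can be checked with a small host-side sketch. The helper names `wait_slot` and `record_slot` are hypothetical; only the index expressions `iter % 2` and `(iter+1) % 2` come from the exercise.

```c
/* Slot of push_top_done / push_bottom_done waited on before computing iteration `iter`. */
int wait_slot(int iter) {
    return iter % 2;
}

/* Slot recorded after the halo copy of iteration `iter`, for use by iteration iter + 1. */
int record_slot(int iter) {
    return (iter + 1) % 2;
}
```

For any iteration, `record_slot(iter)` equals `wait_slot(iter + 1)`, so the event recorded at the end of one iteration is exactly the one the next iteration waits on. It also differs from `wait_slot(iter)`, so an iteration never re-records the slot it waited on.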

After implementing these, compile the code:

In [None]:
!cd ../../source_code/cuda && make clean && make jacobi_streams_events

Validate the implementation by running the binary with and without P2P:

In [None]:
!cd ../../source_code/cuda && srun --partition=gpu -n1 --gres=gpu:2 ./jacobi_streams_events 

We tested the code on a DGX system with 8 A100 16GB GPUs for the binary without using P2P, and we got the following output:

```bash
Num GPUs: 8.
16384x16384: 1 GPU:   3.3163 s, 8 GPUs:   0.5535 s, speedup:     5.99, efficiency:    74.89 
```

With P2P enabled, the efficiency increases marginally:

```bash
Num GPUs: 8.
16384x16384: 1 GPU:   3.3208 s, 8 GPUs:   0.5412 s, speedup:     6.14, efficiency:    76.70  
```

Let us profile the code to verify that using events indeed overlaps computation with communication within each GPU.

### Profiling

Profile the binary with P2P enabled using `nsys`:

In [None]:
! cd ../../source_code/cuda/ && srun --partition=gpu -n1 --gres=gpu:2 nsys profile --trace=cuda,nvtx --stats=true -o jacobi_streams_events_p2p_report --force-overwrite true ./jacobi_streams_events -p2p

To view the profiler report, download and save the report file by holding down <mark>Shift</mark> and <mark>Right-Clicking</mark> [Here](../../source_code/cuda/jacobi_streams_events_p2p_report.nsys-rep) and choosing Save Link As. Once done, open the report via the GUI.

![jacobi_memcpy_streams_events_p2p_report](../../images/jacobi_memcpy_streams_events_p2p_report.png)

Observe that the computation is now overlapped with communication within each GPU. Moreover, we have decreased the total time between two Jacobi iterations to about $175\mu$s. Therefore, the GPU idle time is $175-50=125\mu$s, which is less than the $150\mu$s idle time achieved using just streams.

**Solution:** The solution for this exercise is present in the `source_code/cuda/solutions` directory: [jacobi_streams_events.cu](../../source_code/cuda/solutions/jacobi_streams_events.cu)

We have now covered implementing computation and communication overlap using CUDA Streams and then fine-tuning it using CUDA Events. Note that all of our codes currently are confined to a single node. We would like to scale our codes across nodes.

Therefore, let us learn about multi-node multi-GPU programming with MPI. Click below to access the next lab:

# [Next: Multi-Node programming with MPI](../mpi/multi_node_intro.ipynb)

Here's a link to the home notebook through which all other notebooks are accessible:

# [HOME](../../../start_here.ipynb)

---
## Links and Resources

* [Programming Concepts: CUDA Streams and Concurrency](https://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf)
* [Programming Concepts: CUDA Events and Performance Monitoring](https://developer.nvidia.com/blog/how-implement-performance-metrics-cuda-cc/)
* [Programming: CUDA Streams Programming Guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#streams-cdp)
* [Concepts: Overlapping Computation and Communication](https://developer.nvidia.com/blog/how-overlap-data-transfers-cuda-cc/)
* [Documentation: CUDA Stream Management API](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html)
* [Documentation: CUDA Events Management API](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EVENT.html)
* [Code: Multi-GPU Programming Models](https://github.com/NVIDIA/multi-gpu-programming-models)
* [Code: GPU Bootcamp](https://github.com/gpuhackathons-org/gpubootcamp/)

Don't forget to check out additional [OpenACC Resources](https://www.openacc.org/resources) and join our [OpenACC Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.


## Licensing
Copyright © 2022 OpenACC-Standard.org.  This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). These materials may include references to hardware and software developed by other entities; all applicable licensing and copyrights apply.
