# Learning Objectives

The goals of this lab are to:

* Review the scientific problem for which the Jacobi solver application has been developed.
* Understand the single-GPU code of the application.
* Learn about the NVIDIA Nsight Systems profiler and how to use it to analyze our application.

# The Application

This section provides an overview of the scientific problem we focus on and the solver we employ. Then, we execute the single-GPU version of the application.

### Laplace Equation

The Laplace equation is a well-studied linear partial differential equation that governs steady-state heat conduction, irrotational fluid flow, and many other phenomena.

In this lab, we will consider the 2D Laplace equation on a rectangle with [Dirichlet boundary conditions](https://en.wikipedia.org/wiki/Dirichlet_boundary_condition) on the left and right boundaries and periodic boundary conditions on the top and bottom boundaries. We wish to solve the following equation:

$\Delta u(x,y) = 0\;\forall\;(x,y)\in\Omega\setminus\partial\Omega$
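
For reference, this equation is commonly discretized with the five-point finite-difference stencil. The sketch below assumes a uniform grid with spacing $h$ in both directions, where $u_{i,j}$ denotes the value at grid point $(i,j)$; these symbols are introduced here for illustration and do not appear in the lab's source code:

$\Delta u(x_i,y_j) \approx \dfrac{u_{i+1,j} + u_{i-1,j} + u_{i,j+1} + u_{i,j-1} - 4u_{i,j}}{h^2} = 0$

Solving for the center value gives an averaging rule that an iterative solver can apply at every interior point, with $k$ as the iteration index:

$u_{i,j}^{(k+1)} = \dfrac{1}{4}\left(u_{i+1,j}^{(k)} + u_{i-1,j}^{(k)} + u_{i,j+1}^{(k)} + u_{i,j-1}^{(k)}\right)$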

### Jacobi Method

The Jacobi method is an iterative algorithm for solving a strictly diagonally dominant system of linear equations. The governing Laplace equation is discretized and converted to a linear system amenable to a Jacobi-based solver. The pseudocode for the Jacobi iterative process is shown in the diagram below:

![jacobi_algo](../../images/jacobi_algo.jpg)


The outer loop runs until a stopping criterion is met: either the maximum number of iterations is reached or the [L2 Norm](https://link.springer.com/referenceworkentry/10.1007%2F978-0-387-73003-5_1070) falls below a given tolerance.
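
To make the iteration and its stopping criterion concrete, here is a minimal CPU-only sketch in C++. It is not the lab's implementation: the names `jacobi_cpu` and `JacobiResult` are hypothetical, the grid is assumed to be stored row-major, the norm is assumed to be the L2 norm of the change between successive iterates, and the periodic direction is handled with wrap-around indexing instead of halo rows.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

// Iterations performed and the L2 norm of the last update (hypothetical type).
struct JacobiResult {
    int iterations;
    float l2_norm;
};

// CPU reference sketch (hypothetical helper, not part of the lab's source).
// The grid has ny rows and nx columns in row-major order. Columns 0 and
// nx - 1 hold fixed Dirichlet values; rows wrap around periodically.
JacobiResult jacobi_cpu(std::vector<float>& a, int nx, int ny, float tol, int iter_max) {
    assert(nx >= 3 && ny >= 1);
    assert(a.size() == static_cast<std::size_t>(nx) * ny);

    std::vector<float> a_new(a);  // the copy keeps the boundary columns intact
    float l2_norm = std::numeric_limits<float>::infinity();
    int iter = 0;

    // Outer loop: stop on convergence or after iter_max iterations.
    while (l2_norm > tol && iter < iter_max) {
        double sum = 0.0;
        for (int iy = 0; iy < ny; ++iy) {
            const int up = (iy + ny - 1) % ny;  // periodic neighbor above
            const int down = (iy + 1) % ny;     // periodic neighbor below
            for (int ix = 1; ix < nx - 1; ++ix) {
                // Average of the four neighbors from the previous iterate.
                const float v = 0.25f * (a[iy * nx + ix - 1] + a[iy * nx + ix + 1] +
                                         a[up * nx + ix] + a[down * nx + ix]);
                const float diff = v - a[iy * nx + ix];
                sum += static_cast<double>(diff) * diff;
                a_new[iy * nx + ix] = v;
            }
        }
        l2_norm = static_cast<float>(std::sqrt(sum));
        std::swap(a, a_new);  // the newest iterate is now in a
        ++iter;
    }
    return {iter, l2_norm};
}
```

With the left column fixed at 0 and the right column fixed at 4 on a 5-column grid, the iterates approach a profile that is linear in the column index, which gives a quick sanity check for the sketch.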


### The Code

The GPU processing flow in general follows 3 key steps:

1. Copy data from CPU to GPU
2. Launch GPU Kernel
3. Copy processed data back to CPU from GPU

![gpu_programming_process](../../images/gpu_programming_process.png)

We follow the same 3 steps in our code. Let's understand the single-GPU code first. 

The source code file, [jacobi.cu](../../source_code/single_gpu/jacobi.cu) (click to open), is located in the `CFD/English/C/source_code/single_gpu/` directory.

Alternatively, you can navigate to the `CFD/English/C/source_code/single_gpu/` directory in Jupyter's file browser in the left pane. Then, click to open the `jacobi.cu` file as shown below:

![jupyter_lab_navigation](../../images/jupyter_lab_navigation.png)

Similarly, have a look at the [Makefile](../../source_code/single_gpu/Makefile).

Refer to the `single_gpu(...)` function. The important steps at each iteration of the Jacobi solver inside the `while` loop are:
1. The norm is set to 0 using `cudaMemset`.
2. The device kernel `jacobi_kernel` is called to update the interior points.
3. The norm is copied back to the host using `cudaMemcpy` (DtoH).
4. The periodic boundary conditions are re-applied for the next iteration using `cudaMemcpy` (DtoD).

```
    while (l2_norm > tol && iter < iter_max) {
        // Reset the device-side norm accumulator
        cudaMemset(l2_norm_d, 0, sizeof(float));

        // Compute grid points for this iteration
        jacobi_kernel<<<dim_grid, dim_block>>>(a_new, a, l2_norm_d, iy_start, iy_end, nx);

        // Copy the accumulated norm back to the host (DtoH)
        cudaMemcpy(l2_norm_h, l2_norm_d, sizeof(float), cudaMemcpyDeviceToHost);

        // Apply periodic boundary conditions
        cudaMemcpy(a_new, a_new + (iy_end - 1) * nx, nx * sizeof(float), cudaMemcpyDeviceToDevice);
        cudaMemcpy(a_new + iy_end * nx, a_new + iy_start * nx, nx * sizeof(float), cudaMemcpyDeviceToDevice);

        cudaDeviceSynchronize();
        l2_norm = *l2_norm_h;
        l2_norm = std::sqrt(l2_norm);

        // Report the norm every 100 iterations, starting at iteration 0
        if ((iter % 100) == 0) printf("%5d, %0.6f\n", iter, l2_norm);
        std::swap(a_new, a);
        iter++;
    }
```
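
The pointer arithmetic in the two device-to-device copies can be checked on the host. The sketch below is a CPU analogue that uses `std::copy` in place of `cudaMemcpy`; the helper name `apply_periodic_rows` is hypothetical, and it assumes that rows `iy_start` through `iy_end - 1` are interior rows while row 0 and row `iy_end` act as halo rows.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side analogue of the two periodic-boundary copies (hypothetical helper).
// Assumption: rows iy_start .. iy_end - 1 are interior rows, and row 0 and
// row iy_end are the halo rows that mirror the opposite edge of the grid.
void apply_periodic_rows(std::vector<float>& a_new, int nx, int iy_start, int iy_end) {
    assert(nx > 0 && iy_start >= 1 && iy_end > iy_start);
    assert(a_new.size() >= static_cast<std::size_t>(iy_end + 1) * nx);

    float* base = a_new.data();

    // Top halo row (row 0) receives the last interior row (row iy_end - 1).
    std::copy(base + (iy_end - 1) * nx, base + iy_end * nx, base);

    // Bottom halo row (row iy_end) receives the first interior row (row iy_start).
    std::copy(base + iy_start * nx, base + (iy_start + 1) * nx, base + iy_end * nx);
}
```

For example, on a grid with 5 rows, `iy_start = 1`, and `iy_end = 4`, row 0 ends up holding a copy of row 3 and row 4 ends up holding a copy of row 1.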

Note that we run the Jacobi solver for 1000 iterations over the grid.

### Compilation and Execution

Let's first get an overview of the CUDA driver version and the GPUs running on the server by executing the `nvidia-smi` command below. Highlight the cell below by clicking on it and then either hit `Ctrl+Enter` on the keyboard or click on the `Run` button on the toolbar above. The output will be visible below the cell.

In [None]:
!srun --partition=gpu -n1 --gres=gpu:1 nvidia-smi

We will now compile the code:

In [None]:
!cd ../../source_code/single_gpu && make clean && make

Now, let us execute the program: 

In [None]:
!cd ../../source_code/single_gpu && srun --partition=gpu -n1 --gres=gpu:1 ./jacobi

The output reports the norm value every 100 iterations and the total execution time of the Jacobi Solver. The expected output is:

```
Single GPU jacobi relaxation: 1000 iterations on 16384 x 16384 mesh
    0, 31.999022
  100, 0.897983
  200, 0.535684
  300, 0.395651
  400, 0.319039
  500, 0.269961
  600, 0.235509
  700, 0.209829
  800, 0.189854
  900, 0.173818
16384x16384: 1 GPU:   4.4512 s
```

The execution time may differ depending on the GPU, but the norm value after every 100 iterations should be the same. The program accepts the `-nx` and `-ny` flags to change the grid size (preferably a power of 2) and the `-niter` flag to change the number of iterations.


# Profiling

While the program in our labs reports the execution time in its output, it may not always be convenient to time the execution from within the program. Moreover, timing the execution alone does not directly reveal the bottlenecks. For that purpose, we profile the program with `nsys`, the command-line interface (CLI) of the NVIDIA Nsight Systems profiler.

### NVIDIA Nsight Systems

The Nsight Systems profiler offers system-wide performance analysis: it visualizes an application's execution timeline and helps identify optimization opportunities on a system with multiple CPUs and GPUs.

#### Timeline

![Nsight Systems timeline](../../images/nsys_overview.png)

The highlighted portions are identified as follows:
* <span style="color:red">Red</span>: The CPU row provides thread-level core utilization data. 
* <span style="color:blue">Blue</span>: The CUDA HW row displays GPU kernel and memory transfer activities and API calls.
* <span style="color:orange">Orange</span>: The Threads row gives a detailed view of each CPU thread's activity, including activity from OS runtime libraries, MPI, NVTX, etc.

#### `nsys` CLI

We will profile the application using the `nsys` CLI. Here's a typical `nsys` command to profile a program:

`nsys profile --trace=cuda,nvtx --stats=true -o jacobi_report --force-overwrite true ./jacobi`

The `--trace` flag specifies that we want to trace CUDA and NVTX APIs (in addition to baseline tracing), `--stats` specifies that we want to generate a statistics summary after profiling, and `-o` allows us to name the report file (which will include the `.nsys-rep` extension). The `--force-overwrite` flag allows us to overwrite an existing report (of the same name).

Note that we can always run `nsys --help` to learn more about these and other available options.

### Viewing the Report

The profiling report can be viewed in the Nsight Systems GUI. Note that this requires the CUDA toolkit and a GUI application of the same version as the CLI. Follow these steps:
* Open Nsight Systems GUI application.
* Click on _File $\rightarrow$ Open_.
* Browse and select the `.nsys-rep` file.

Alternatively, we can enable the `--stats` flag to display profiling data on the terminal (refer to the image below).

![nsys cli sample output](../../images/nsys_cli_sample_output.png)

### NVIDIA Tools Extension (NVTX)

NVTX is a C-based API for annotating events in applications. It is useful for profiling both specific events and large code blocks. We will routinely use the NVTX APIs to instrument our application for the `nsys` profiler. This helps `nsys` collect relevant information and improves the readability of the application timeline.

To use NVTX, follow these steps:
* `#include <nvToolsExt.h>` in the code file
* Insert `nvtxRangePush("myCodeBlock");` just before the code block begins and `nvtxRangePop();` just after it ends.
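
One way to keep every push paired with a pop is a small RAII wrapper, sketched below. This is an illustration and not code from the lab: `ScopedRange`, `annotated_sum`, and the `USE_NVTX` build flag are hypothetical names, and the stub functions in the `#else` branch are stand-ins that exist only so the sketch compiles on a machine without the CUDA toolkit.

```cpp
#include <cassert>

#ifdef USE_NVTX  // hypothetical build flag: define it when NVTX is available
#include <nvToolsExt.h>
#else
// Stand-in stubs (assumption: NVTX is not installed). They only track the
// nesting depth so that the push/pop pairing can be checked.
static int g_range_depth = 0;
inline int nvtxRangePush(const char* /*name*/) { return g_range_depth++; }
inline int nvtxRangePop() { return --g_range_depth; }
#endif

// RAII wrapper (hypothetical): pushes a named range on construction and
// pops it when the enclosing scope ends, including on early returns.
struct ScopedRange {
    explicit ScopedRange(const char* name) { nvtxRangePush(name); }
    ~ScopedRange() { nvtxRangePop(); }
    ScopedRange(const ScopedRange&) = delete;
    ScopedRange& operator=(const ScopedRange&) = delete;
};

// Example workload standing in for the code block to be annotated.
inline double annotated_sum(int n) {
    ScopedRange range("myCodeBlock");
    double sum = 0.0;
    for (int i = 1; i <= n; ++i) sum += i;
    return sum;
}
```

When building against the real NVTX header included above, the program typically also needs to be linked with the NVTX library, for example with `-lnvToolsExt`.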

Now, go back to the [jacobi.cu](../../source_code/single_gpu/jacobi.cu) source code file and correlate the "Jacobi solve" annotated event visible on both the `nsys` CLI statistics and the GUI-based timeline to its use in the source code.

### Improving performance

Any code can be a candidate for optimization. We will follow a cyclic process to optimize our code and obtain the best scaling results across multiple GPUs:

* **Analyze** the code using profilers to identify bottlenecks and hotspots.
* **Parallelize** the routines where most of the time in the code is spent.
* **Optimize** the parallel code by analyzing first for opportunities, applying optimizations, verifying our gains, and repeating the process.

### Metrics of Interest

To quantify the performance gain, we denote the single-GPU execution time as $T_s$ and multi-GPU execution time for $P$ GPUs as $T_p$. Using this, we obtain the figures-of-merit:
* Speedup $S = T_s/T_p$ (optimal is $P$), and 
* Efficiency $E = S/P$ (optimal is $1$). 
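
As a small illustration, the two figures of merit can be computed from measured times as follows. The type `ScalingMetrics` and the function `scaling_metrics` are hypothetical names chosen for this sketch.

```cpp
#include <cassert>

// Speedup S and efficiency E for a multi-GPU run (hypothetical type).
struct ScalingMetrics {
    double speedup;
    double efficiency;
};

// t_single is T_s, t_multi is T_p, and num_gpus is P.
inline ScalingMetrics scaling_metrics(double t_single, double t_multi, int num_gpus) {
    assert(t_single > 0.0 && t_multi > 0.0 && num_gpus > 0);
    const double speedup = t_single / t_multi;  // S = T_s / T_p
    return {speedup, speedup / num_gpus};       // E = S / P
}
```

For example, with made-up timings of $T_s = 4.0$ s and $T_p = 1.25$ s on $P = 4$ GPUs, this gives $S = 3.2$ and $E = 0.8$.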

### Analyzing the code

Let's profile the single-GPU code:

In [None]:
!cd ../../source_code/single_gpu/ && srun --partition=gpu -n1 --gres=gpu:1 nsys profile --trace=cuda,nvtx --stats=true -o jacobi_report --force-overwrite true ./jacobi

To view the profiler report, download and save the report file by holding down <mark>Shift</mark> and <mark>right-clicking</mark> [here](../../source_code/single_gpu/jacobi_report.nsys-rep), then choosing Save Link As. Once done, open the report in the GUI. This is the analysis step. Right-click on the NVTX tab and select the Events View.

![nsys single_gpu_analysis](../../images/nsys_single_gpu_analysis.png)

Clearly, we need to parallelize the "Jacobi Solve" routine, which is essentially the iterative Jacobi solver loop. Click on the link to continue to the next lab where we parallelize the code using `cudaMemcpy` and understand concepts like Peer-to-Peer Memory Access.

# [Next: CUDA Memcpy and Peer-to-Peer Memory Access](../cuda/memcpy.ipynb)

Here's a link to the home notebook through which all other notebooks are accessible:

# [HOME](../../../start_here.ipynb)

---

## Links and Resources

* [Science: Laplace Equation](https://mathworld.wolfram.com/LaplacesEquation.html)
* [Science: Jacobi Method](https://en.wikipedia.org/wiki/Jacobi_method)
* [Programming: CUDA C/C++ Programming Guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html)
* [Programming: NVTX Documentation](https://docs.nvidia.com/nsight-visual-studio-edition/2020.1/nvtx/index.html)
* [Tools: NVIDIA NSight Systems profiler](https://developer.nvidia.com/nsight-systems)
* [Code: Multi-GPU Programming Models](https://github.com/NVIDIA/multi-gpu-programming-models)
* [Code: GPU Bootcamp](https://github.com/gpuhackathons-org/gpubootcamp/)

Don't forget to check out additional [Open Hackathons Resources](https://www.openhackathons.org/s/technical-resources) and join our [OpenACC and Hackathons Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.

## Licensing
Copyright © 2022 OpenACC-Standard.org.  This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). These materials may include references to hardware and software developed by other entities; all applicable licensing and copyrights apply.
