Before we begin, run the cell below to display information about the NVIDIA® CUDA® driver and the GPUs on the server using the `nvidia-smi` command. To execute the cell, click on it and press Ctrl+Enter, or press the play button in the toolbar above. You should see some output returned below the grey cell.

In [None]:
!nvidia-smi
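
The same information can also be collected from a script. The following is a minimal sketch, assuming `nvidia-smi` is on the `PATH` and supports its `--query-gpu` and `--format=csv,noheader` options; the helper names `parse_gpu_csv` and `query_gpus` are made up for illustration.

```python
import shutil
import subprocess

# Fields requested from nvidia-smi, in the order they appear in each CSV row.
QUERY_FIELDS = ["name", "driver_version", "memory.total"]


def parse_gpu_csv(text):
    """Turn `nvidia-smi --format=csv,noheader` output into a list of dicts."""
    gpus = []
    for line in text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        # Skip rows that do not have exactly one value per requested field.
        if len(values) == len(QUERY_FIELDS):
            gpus.append(dict(zip(QUERY_FIELDS, values)))
    return gpus


def query_gpus():
    """Return one dict per GPU, or an empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return parse_gpu_csv(result.stdout) if result.returncode == 0 else []


if __name__ == "__main__":
    for gpu in query_gpus():
        print(gpu)
```

On a machine without a GPU or driver, `query_gpus()` simply returns an empty list.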

## Learning objectives
The **goal** of this lab is to:

- Get an overview of the NVIDIA Nsight™ Systems tool
- Learn how to view and interpret your profiling report via NVIDIA Nsight Systems
- Learn how to profile your application using the Nsight Systems command line interface (CLI) with the NVIDIA Tools Extension SDK (NVTX) application programming interface (API)
- Learn how to find the performance limiters from the **Timeline** view of the tool

We do not intend to cover:

- Advanced optimization techniques in detail

### Introduction to Nsight Systems 
Nsight Systems offers system-wide performance analysis in order to visualize an application’s algorithms, help identify optimization opportunities, and improve the performance of applications running on a system consisting of multiple CPUs and GPUs.

#### Nsight Systems Timeline

When your profiling run with Nsight Systems is finished and you open the report in the Nsight Systems graphical user interface (GUI), the first thing you see is the **Timeline** view. It consists of several timeline rows that show how the application interacts with the various system resources over time. The profile shown in the image below is from a deep learning application using TensorFlow, so the scenario focuses on GPU activity and on what might prevent the GPU from running at full load. The intention of all NVIDIA tools is to get the most performance from your GPU system.

<img src="images/nsight_sys_tags.png">

Nsight Systems is packed with many features. A few of the features are highlighted in the above screenshot:

- API tracing of CUDA libraries and deep learning frameworks
- CPU utilization, CPU thread states and thread migration as well as CPU callstack sampling
- Operating system (OS) runtime library calls
- GPU activities (kernels and memory copies) as well as GPU metrics

In the following section, we briefly go over some of the Nsight Systems features. To read more, review https://docs.nvidia.com/nsight-systems/UserGuide/index.html.

- **CPU Cores Workload:** CPU rows help locate the CPU cores' idle times. Each row shows how the process' threads utilize the CPU cores. Each core is shown in a different call, and the average utilization can be seen on each subrow.

<img src="images/cpu_row.png">

- **CPU Thread Activity:** Thread rows show a detailed view of each thread's activity including OS runtime libraries usage, CUDA API calls, NVTX time ranges and events (if integrated into the application).

<img src="images/thread_row.png">

- **CUDA API:** This row shows traces of CUDA API calls on the OS thread. You can:
    - See when kernels are dispatched
    - See when memory operations are initiated
    - Locate the corresponding CUDA workload on GPU
 
<img src="images/cuda_api.png"> 

- **GPU Utilization:** CUDA workloads rows display kernel and memory transfer activities. 

<img src="images/cuda_row_0.png"> 

You can zoom in and see the locations where the GPU is underutilized.

<img src="images/cuda_row_1.png">  


You can also see the CPU-GPU correlation by clicking on a CUDA API call to see the correlation with the underlying GPU activity on the CUDA row (highlighted in teal):

<img src="images/correlation.png">  

### Profiling using Command Line Interface (CLI)
To profile your application, you can use either the graphical user interface (GUI) or the command line interface (CLI). During this lab, we will profile a mini-application using the CLI.

The Nsight Systems CLI, referred to as `nsys`, provides several different commands. A basic profiling session can be started with `nsys profile ./app`. Below are some of the useful switches for CLI profiling:

- API tracing: `-t, --trace=cuda,nvtx,osrt,opengl`, other options are `cublas`,`cusparse`,`cudnn`,`mpi`,`oshmem`,`ucx`,`openacc`,`openmp`,`vulkan`,`none`.
- Overwrite existing report: `-f, --force-overwrite=[true|false]`
- Summary statistics (profile output on command line): `--stats=[true|false]`
- Report file name: `-o, --output=report#(patterns for hostname, PID and environment variables)`
- Callstack sampling: `-s, --sample=[cpu|none]`, `--sampling-period=number of CPU Instructions Retired events`, `-b, --backtrace=[lbr|fp|dwarf|none]`, `--samples-per-backtrace={1..12}` (the number of CPU IP samples collected for every CPU IP sample backtrace collected)

*Note:* CPU sampling depends on the kernel's paranoid level, which can be set with: `sudo sh -c 'echo 1 >/proc/sys/kernel/perf_event_paranoid'`

- CUDA memory usage: `--cuda-memory-usage=[true|false]` , tracks the GPU memory usage by CUDA kernels. This is applicable only when CUDA tracing is enabled.
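
Before changing the paranoid level mentioned in the note above, it can be useful to check its current value. The following is a minimal sketch for Linux; the helper names are illustrative, and the threshold of 1 is simply the value used in the note, not a requirement confirmed for every Nsight Systems version.

```python
from pathlib import Path

# Kernel setting that controls unprivileged access to performance counters.
PARANOID_PATH = Path("/proc/sys/kernel/perf_event_paranoid")


def read_paranoid_level(path=PARANOID_PATH):
    """Return the current perf_event_paranoid level, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def sampling_allowed(level, max_level=1):
    """True if the level is at or below max_level (assumed to be 1, as in the note)."""
    return level is not None and level <= max_level


if __name__ == "__main__":
    level = read_paranoid_level()
    print("perf_event_paranoid =", level)
    print("CPU sampling likely allowed:", sampling_allowed(level))
```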

**Note**: You do not need to memorize the profiler options. You can always run `nsys --help` or `nsys [specific command] --help` from the command line to look up the options or profiler arguments you need. There is also a special option to help with the transition from the legacy NVIDIA nvprof tool: calling `nsys nvprof [options]` will provide the best available translation of `nvprof [options]`. For more information on Nsight Systems and NVTX, please see the __[Profiler documentation](https://docs.nvidia.com/nsight-systems/)__.

An example of typical command line invocation: `nsys profile -t openacc,nvtx --stats=true --force-overwrite true -o laplace ./laplace`

where command switch options used are:
- `profile`: start a profiling session
- `-t`: Selects the APIs to be traced (`nvtx` and `openacc` in this example)
- `--stats`: if true, it generates a summary of statistics after the collection
- `--force-overwrite`: if true, it overwrites the existing generated report
- `-o`: name of the report file created at the end of the collection (`.nsys-rep`, or `.qdrep` in older Nsight Systems releases)
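
When the same invocation is reused across several runs, it can be assembled in a small script. The following is a minimal sketch that uses only the switches listed above; `build_nsys_command` is a hypothetical helper name, and the script only prints the command rather than running it.

```python
import shlex


def build_nsys_command(app, output, traces=("openacc", "nvtx"), stats=True, overwrite=True):
    """Assemble an `nsys profile` argument list from the switches described above.

    `app` is the application command as a list, e.g. ["./laplace"].
    """
    cmd = [
        "nsys", "profile",
        "-t", ",".join(traces),
        "--stats=" + ("true" if stats else "false"),
        "--force-overwrite=" + ("true" if overwrite else "false"),
        "-o", output,
    ]
    cmd.extend(app)
    return cmd


if __name__ == "__main__":
    cmd = build_nsys_command(["./laplace"], "laplace")
    # Print the command in a form that can be pasted into a shell.
    print(shlex.join(cmd))
```

To launch the profiler from Python, the returned list can be passed to `subprocess.run`.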

<!--
*CLI Profiling - MPI Program:*

- Single Node: `nsys profile [nsys_args] mpirun [mpirun_args] your_executable`. The command will create one report file.
- Multiple Nodes: `mpirun [mpirun_args] nsys profile [nsys_args] your_executable`, you can set output report name with `-o report_name_%q{OMPI_COMM_WORLD_RANK}`. (For OpenMPI, PMI_RANK for MPICH and SLURM_PROCID for Slurm). The command will create one report file per MPI rank.

You can also profile only specific ranks: 
```
#!/bin/bash
# OMPI_COMM_WORLD_LOCAL_RANK for node local rank
if [ $OMPI_COMM_WORLD_RANK -eq 0 ]; then
    nsys profile -t mpi "$@"
else
    "$@"
fi
```
-->
#### How to View the Report
<a name="gui-report"></a>
When using CLI to profile the application, there are two ways to view the profiler's report. 

1) On the terminal using the `--stats` option: With the `--stats` switch, profiling results are displayed on the console after the profiling data is collected. The collected data includes CUDA API calls, kernels and memory operations (by time and by size), OS runtime calls, and NVTX ranges.

<img src="images/laplas3.png" width="80%" height="80%">

2) NVIDIA Nsight Systems GUI: After the profiling session ends, a `*.nsys-rep` file is created. This file can be loaded into the Nsight Systems GUI using *File -> Open*. To view the report on your local machine, the local system needs the same version of the CUDA toolkit installed, and the Nsight Systems GUI version should match the CLI version. Details on where to download NVIDIA Nsight Systems can be found under **Links and Resources** at the end of this page.

To view the profiler report, simply open the file from the GUI (File > Open).

<img src="images/nsight_open.png" width="80%" height="80%">
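
Before opening the GUI, it can help to locate the report files that a run produced and, if needed, reprint the summary on the terminal. The following is a minimal sketch, assuming reports use the `.nsys-rep` extension (or `.qdrep` from older releases) and that the `nsys stats` command is available in your version; `find_reports` and `stats_command` are illustrative names.

```python
from pathlib import Path

# Report extensions: current releases and older releases, respectively.
REPORT_SUFFIXES = (".nsys-rep", ".qdrep")


def find_reports(directory="."):
    """Return the report files in `directory`, newest first."""
    reports = [
        p for p in Path(directory).iterdir()
        if p.is_file() and p.name.endswith(REPORT_SUFFIXES)
    ]
    return sorted(reports, key=lambda p: p.stat().st_mtime, reverse=True)


def stats_command(report):
    """Build an `nsys stats` call that prints summary tables for a report."""
    return ["nsys", "stats", str(report)]


if __name__ == "__main__":
    for report in find_reports():
        print(" ".join(stats_command(report)))
```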

### Using NVIDIA Tools Extension (NVTX) 
<a name="nvtx"></a>
NVIDIA Tools Extension (NVTX) is a C-based Application Programming Interface (API) for annotating events, time ranges and resources in applications. NVTX brings the profiled application’s logic into the Profiler, making the Profiler’s displayed data easier to analyze and enabling correlation of the displayed data to profiled application’s actions.  

During this lab, we profile the application using the Nsight Systems CLI and collect the timeline. We will also trace the NVTX APIs, which are already integrated into the application. NVTX is a powerful mechanism that allows users to manually instrument their application. NVIDIA Nsight Systems can then collect the information and present it on the timeline. It is particularly useful for tracing CPU events and time ranges, and it greatly improves the timeline's readability. NVTX also provides a means to correlate the profile data with the application code. When profiling, you need to add `nvtx` to the tracing options of `nsys profile`. Example: `nsys profile -t nvtx ./app`

**Using NVTX with C/C++**: For C/C++ code, add `#include "nvtx3/nvToolsExt.h"` to your source code and wrap the parts of your code whose events you want to capture with calls to the NVTX API functions. For example, try adding `nvtxRangePush("main")` at the beginning of your `main()` function and `nvtxRangePop()` just before the return statement at the end. For more information, read https://github.com/NVIDIA/NVTX.

The sample code snippet below shows the use of markers and range events. The resulting NVTX markers can be viewed in the Nsight Systems **Timeline** view.

```cpp
#include <nvtx3/nvToolsExt.h>
...
nvtxMark("Point in time");
...
nvtxRangePush("Name of your code region");
// your code goes here
nvtxRangePop();
```

**Using NVTX with Fortran**: The NVIDIA HPC SDK Fortran compiler provides NVTX bindings (`libnvhpcwrapnvtx.[a|so]` has to be linked). Wrap parts of your code with `nvtxStartRange` and `nvtxEndRange`, and add `-lnvhpcwrapnvtx` when compiling and linking. Documentation can be found here: https://docs.nvidia.com/hpc-sdk/compilers/fortran-cuda-interfaces/index.html#cfnvtx-runtime
    
```fortran
use nvtx
...
call nvtxStartRange("YourRange")
! some Fortran code
call nvtxEndRange
```

For other compilers, you can write your own Fortran NVTX bindings or use existing ones, e.g. https://raw.githubusercontent.com/maxcuda/NVTX_example/master/nvtx.f90

**Using NVTX with Python**: Python developers can use either the decorator `@nvtx.annotate()` or the context manager `with nvtx.annotate(..)`. To get the NVTX Python module, use `python -m pip install nvtx`.


```python
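import nvtx

# Decorator form: annotates every call to f() as one range
@nvtx.annotate("f()", color="purple")
def f():
    for i in range(5):
        # Context-manager form: annotates each loop iteration
        with nvtx.annotate("loop", color="red"):
            # Python code goes here
            pass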
```
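
The two styles can be combined in one script. The following is a minimal sketch with hypothetical function names (`load_data`, `process`); it also assumes you may want the script to keep running when the `nvtx` package is not installed, so it falls back to a no-op stand-in in that case.

```python
import contextlib

try:
    import nvtx  # python -m pip install nvtx
    annotate = nvtx.annotate
except ImportError:
    class annotate(contextlib.ContextDecorator):
        """No-op stand-in so the script still runs without the nvtx package."""

        def __init__(self, message=None, color=None):
            self.message = message
            self.color = color

        def __enter__(self):
            return self

        def __exit__(self, *exc):
            return False  # never swallow exceptions


@annotate("load_data", color="blue")
def load_data(n):
    # Decorator form: the whole function call becomes one range.
    return list(range(n))


def process(data):
    total = 0
    # Context-manager form: only this block becomes a range.
    with annotate("process", color="green"):
        for x in data:
            total += x * x
    return total


def main():
    with annotate("main", color="purple"):
        return process(load_data(1000))


if __name__ == "__main__":
    print(main())
```

Running the script under `nsys profile -t nvtx` should show the `main`, `load_data`, and `process` ranges nested on the timeline when the `nvtx` package is installed.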

PyTorch CUDA provides NVTX bindings:

```python
from torch.cuda import nvtx

nvtx.range_push("YourCode")
# your Python code
nvtx.range_pop()
```
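
Because every push needs a matching pop, it can be convenient to wrap the pair in a context manager so the range is closed even when the code raises an exception. The following is a minimal sketch; `nvtx_range` is a hypothetical helper that receives the push and pop functions as arguments, so it runs here with stand-in callables and without PyTorch or a GPU.

```python
from contextlib import contextmanager


@contextmanager
def nvtx_range(name, push, pop):
    """Open a range with push(name) and always close it with pop()."""
    push(name)
    try:
        yield
    finally:
        # Runs on normal exit and on exceptions, keeping push/pop balanced.
        pop()


if __name__ == "__main__":
    # Stand-in callables that record what would be sent to NVTX.
    log = []
    with nvtx_range("YourCode", log.append, lambda: log.append("<pop>")):
        log.append("work")
    print(log)
```

With PyTorch, the `range_push` and `range_pop` functions from `torch.cuda.nvtx` shown above would be passed as `push` and `pop`.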

<img src="images/nvtx.PNG" width="80%" height="80%">

You can learn more at https://nvtx.readthedocs.io/en/latest/index.html and https://developer.nvidia.com/blog/nvidia-tools-extension-api-nvtx-annotation-tool-for-profiling-code-in-python-and-c-c/.


Detailed NVTX documentation can be found under the __[CUDA Profiler user guide](https://docs.nvidia.com/cuda/profiler-users-guide/index.html#nvtx)__.


In addition to tracing NVTX and CUDA (CUDA API trace and workload), Nsight Systems for Linux x86_64 and Power targets can capture information about OpenMP and OpenACC execution in the profiled process.

-----

# <div style="text-align: center ;border:3px; border-style:solid; border-color:#FF0000  ; padding: 1em">[HOME](introduction.ipynb#steps)</div>

-----

# Links and Resources


[NVIDIA Nsight Systems](https://docs.nvidia.com/nsight-systems/)


**NOTE**: To be able to see the Nsight Systems profiler output, please download the latest version of Nsight Systems from [here](https://developer.nvidia.com/nsight-systems).

Don't forget to check out additional [Open Hackathons Resources](https://www.openhackathons.org/s/technical-resources) and join our [OpenACC and Hackathons Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.

--- 

## Licensing 

Copyright © 2022 OpenACC-Standard.org.  This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). These materials may include references to hardware and software developed by other entities; all applicable licensing and copyrights apply.