Before we begin, run the cell below to display information about the NVIDIA® CUDA® driver and the GPUs available on the server using the `nvidia-smi` command. To run it, click on the cell and press Ctrl+Enter, or press the play button in the toolbar above. The output should appear below the grey cell.

```
!nvidia-smi
```


## Learning objectives
The **goals** of this lab are to:

- Learn how to profile a kernel using NVIDIA Nsight™ Compute command line interface
- Learn how to navigate different sections of the Nsight Compute profiling tool
- Learn how to profile a kernel and find the performance limiters

We do not intend to cover:

- Advanced optimization techniques in detail


### Introduction to Nsight Compute
Nsight Compute is an interactive kernel profiler for GPU applications that provides detailed performance metrics and application programming interface (API) debugging via a user interface (UI) and a command line tool. The NVIDIA Nsight Compute command line interface (CLI), `ncu`, provides a non-interactive way to profile applications from the command line and can print the results directly on the command line or store them in a report file.

Results can then be imported to the graphical user interface (GUI) version for inspection. With the command line profiler, you can instrument the target API, and collect profile results for either specified kernels or all of them.

<img src="images/compute.png" >

- **Navigating the report via GUI**
The Nsight Compute UI consists of a header with general information, as well as controls to switch between report pages or individual collected kernel launches. By default, the profile report opens on the *Details* page. You can switch between the report pages with the dropdown labeled *Page* at the top-left of the page.

<img src="images/page-compute.png" >

A report can contain any number of results from kernel launches. The *Launch* dropdown allows switching between the different results in the report.

<img src="images/launch-compute.png" >

Below is the overall look of the profiler report opened inside the UI.

<img src="images/compute_tags.png">

- **Sections and Sets**

Nsight Compute uses section sets to decide which metrics are collected. By default, a relatively small number of metrics is collected, covering SOL (speed of light: a comparison against the best possible behavior), launch statistics, and occupancy analysis. You can optionally select which sections are collected and displayed with command-line parameters. To collect more sections when profiling from the command line, use the flag `--set detailed` or `--set full`. In the later sections, you will learn how to collect these metrics. To read more about different sections in Nsight Compute, review the documentation: http://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#sections-and-rules

<img src="images/sections-compute.png" >


The below screenshots show a close-up view of example sections in the Nsight Compute profiler. You can expand each section by clicking on it. Under each section, there is a description explaining what it shows (some of these sections are not collected by default). 

<img src="images/header-compute.png" >

Various sections have a triangle with an exclamation mark in front of them. Follow the warning sign/icon to understand what the bottleneck is and for guidance on how you can improve it.

<img src="images/warning-compute.png" >

Some of the sections have one or more bodies with additional charts or tables. You can click on the triangle expander icon in the top-left corner of each section to show or hide those. If a section has multiple bodies, a dropdown in the top-right corner allows you to switch between them. As shown in the example screenshot below, you can switch between different bodies in the SOL section and choose to view *SOL Chart*, *SOL breakdown*, *SOL Rooflines*, or all together.

<img src="images/expand-compute.png" >


Let's have a look at some of these sections:

The _**GPU Speed Of Light Roofline**_ Chart section contains a Roofline chart that is helpful for visualizing kernel performance. More information on how to use and read this chart can be found in the [*Roofline Charts*](#roofline) section.

<img src="images/roofline-compute.png" >

The _**Memory Workload Analysis**_ section contains a Memory chart that visualizes data transfers, cache hit rates, instructions, and memory requests. More information on how to use and read this chart can be found in the [*Memory Charts*](#memory) section.

<img src="images/memory-compute.png" >

_**Source Counters**_ can contain source hotspot tables that indicate the **N** highest or lowest values of one or more metrics in the kernel source code. In other words, it depicts performance problems in the source code.

<img src="images/source-compute.png" >

You can select the location links to navigate directly to this location in the *Source Page*, which displays metrics that can be correlated with source code. If you hover the mouse over a value, you can see which metrics contribute to it. Note that for the correlation of SASS and source code to work, the source code needs to be compiled with the `-lineinfo` flag. By default, the source is not embedded in the report, so the UI needs local or remote access to the source files to resolve them. To avoid this, pass `--import-source 1` when profiling: the correlated CUDA source files, if available from `-lineinfo`, are then permanently imported into the report.
 
<img src="images/sass-compute.png" >

The *View* dropdown can be used to select different code (correlation) options. This includes SASS, parallel thread execution (PTX) and Source (CUDA-C), as well as their combinations. The availability of these options depends on the source information embedded into the executable.


Some of the pre-defined source metrics include: 
- Live Registers: number of registers that need to be kept valid by the compiler. A high value indicates that many registers are required at this code location, potentially increasing the register pressure and the maximum number of registers required by the kernel.
- Instructions Executed: number of times the source (instruction) was executed per individual warp, independent of the number of participating threads within each warp.
- Divergent Branches: number of divergent branch targets, incremented whenever two or more active threads have divergent targets. Divergent branches can lead to warp stalls due to resolving the branch or instruction cache misses.

Moreover, you can also track register dependencies in the SASS view of the Source page. When a register is read, all the potential addresses where it could have been written are found. The links between these lines are drawn in the view. All dependencies for registers, predicates, uniform registers and uniform predicates are shown in their respective columns.

<img src="images/sass-reg-dependency.png" >

The picture above shows some dependencies for a simple kernel. On the first row, which is line 69 of the SASS code, we can see writes on registers R4 and R5, represented by filled triangles pointing to the left. On the same row, we can also see reads on registers R7 and R28, represented by regular triangles pointing to the right. When both types of triangles appear on the same line, a read and a write occurred for the same register; this can be seen on line 71 for registers R4 and R5. You can hover the mouse over each triangle to see the name of the register. To learn more, review https://docs.nvidia.com/nsight-compute/NsightCompute/index.html#profiler-report-source-page.

<!--
Section Name | Description |
-|-
Compute Workload Analysis |	Detailed analysis of the compute resources of the streaming multiprocessors (SM), including the achieved instructions per clock (IPC) and the utilization of each available pipeline. Pipelines with very high utilization might limit the overall performance.
InstructionStats (Instruction Statistics) |Statistics of the executed low-level assembly instructions (SASS). The instruction mix provides insight into the types and frequency of the executed instructions. A narrow mix of instruction types implies a dependency on few instruction pipelines, while others remain unused. Using multiple pipelines allows hiding latencies and enables parallel execution.
LaunchStats (Launch Statistics) | 	Summary of the configuration used to launch the kernel. The launch configuration defines the size of the kernel grid, the division of the grid into blocks, and the GPU resources needed to execute the kernel. Choosing an efficient launch configuration maximizes device utilization.
Memory Workload Analysis  |	Detailed analysis of the memory resources of the GPU. Memory can become a limiting factor for the overall kernel performance when fully utilizing the involved hardware units (Mem Busy), exhausting the available communication bandwidth between those units (Max Bandwidth), or by reaching the maximum throughput of issuing memory instructions (Mem Pipes Busy). Depending on the limiting factor, the memory chart and tables allow to identify the exact bottleneck in the memory system.
Nvlink (Nvlink)| High-level summary of NVLink utilization. It shows the total received and transmitted (sent) memory, as well as the overall link peak utilization.
Occupancy (Occupancy)| Occupancy is the ratio of the number of active warps per multiprocessor to the maximum number of possible active warps. Another way to view occupancy is the percentage of the hardware's ability to process warps that is actively in use. Higher occupancy does not always result in higher performance, however, low occupancy always reduces the ability to hide latencies, resulting in overall performance degradation. Large discrepancies between the theoretical and the achieved occupancy during execution typically indicates highly imbalanced workloads.
SchedulerStats (Scheduler Statistics) |	Summary of the activity of the schedulers issuing instructions. Each scheduler maintains a pool of warps that it can issue instructions for. The upper bound of warps in the pool (Theoretical Warps) is limited by the launch configuration. On every cycle each scheduler checks the state of the allocated warps in the pool (Active Warps). Active warps that are not stalled (Eligible Warps) are ready to issue their next instruction. From the set of eligible warps, the scheduler selects a single warp from which to issue one or more instructions (Issued Warp). On cycles with no eligible warps, the issue slot is skipped and no instruction is issued. Having many skipped issue slots indicates poor latency hiding.
SourceCounters  |	Source metrics, including branch efficiency and sampled warp stall reasons. Sampling Data metrics are periodically sampled over the kernel runtime. They indicate when warps were stalled and couldn't be scheduled. See the documentation for a description of all stall reasons. Only focus on stalls if the schedulers fail to issue every cycle.
SpeedOfLight |	High-level overview of the utilization for compute and memory resources of the GPU. For each unit, the Speed Of Light (SOL) reports the achieved percentage of utilization with respect to the theoretical maximum. On Volta+ GPUs, it reports the breakdown of SOL SM and SOL Memory to each individual sub-metric to clearly identify the highest contributor.
WarpStateStats (Warp State Statistics) | 	Analysis of the states in which all warps spent cycles during the kernel execution. The warp states describe a warp's readiness or inability to issue its next instruction. The warp cycles per instruction define the latency between two consecutive instructions. The higher the value, the more warp parallelism is required to hide this latency. For each warp state, the chart shows the average number of cycles spent in that state per issued instruction. Stalls are not always impacting the overall performance nor are they completely avoidable. Only focus on stall reasons if the schedulers fail to issue every cycle.
-->


- **Comparing multiple results**
With the Nsight Compute GUI, you can create a baseline and compare results against each other. On the *Details* page, press the *Add Baseline* button to make the current result the baseline for all other results from this report and from any other report opened in the same instance of Nsight Compute. When a baseline is set, every element on the *Details* page shows two values: the current value of the result in focus, and either the corresponding value of the baseline or the percentage of change from that baseline value.
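The percentage of change shown next to a baseline can be reproduced by hand. The snippet below is a minimal sketch of that arithmetic; the metric labels and values are made up for illustration and are not taken from a real report.

```python
def percent_change(current, baseline):
    """Relative change of `current` with respect to `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("percentage change is undefined for a zero baseline")
    return (current - baseline) / baseline * 100.0


# Hypothetical values for two results of the same kernel (not real measurements).
baseline_result = {"duration_ms": 2.40, "achieved_occupancy_pct": 40.0}
current_result = {"duration_ms": 1.80, "achieved_occupancy_pct": 50.0}

for metric, base_value in baseline_result.items():
    change = percent_change(current_result[metric], base_value)
    print(f"{metric}: {current_result[metric]} ({change:+.1f}% vs. baseline)")
```

A negative change is an improvement for metrics such as duration, and a regression for metrics such as occupancy, so the sign has to be read per metric.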


<img src="images/baseline-compute.png" >

- **Applying Rules**
Sections on the *Details* page may provide rules. By pressing the *Apply Rules* button on the top of the page, all available rules for the current report are executed. 


<img src="images/rule-compute.png" >


### Roofline Charts 
<a name="roofline"></a>
Once the high-performance software code is written, you need to understand how well the application performs on the available hardware. Whether the architecture consists of CPUs, GPUs, or something else, different platforms have different hardware limitations, such as available memory bandwidth and theoretical compute limits. The Roofline performance model visualizes achieved performance and helps you understand how well your application is using the available hardware resources and find the performance limiters.

Kernel performance is not only dependent on the operational speed of the GPU. Since a kernel requires data to work on, performance is also dependent on the rate at which the GPU can feed data to the kernel. A typical Roofline chart combines the peak performance and memory bandwidth of the GPU with a metric called *arithmetic intensity* (a ratio between Work and Memory Traffic) into a single chart, to represent the achieved performance of the profiled kernel more realistically.

With arithmetic intensity and floating point operations per second (FLOP/s), you can plot a kernel on a graph that includes rooflines and ceilings of performance limits and visualize how your kernel is affected by them.

- *Arithmetic intensity*: The ratio between compute work (FLOPs) and data movement (bytes)
- *FLOP/s*: Floating-point operations per second
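As a worked illustration of these two quantities, the sketch below computes them for a hypothetical SAXPY-style kernel (`y[i] = a * x[i] + y[i]`). The element count and runtime are assumed values, not measurements, and the byte count ignores caching effects.

```python
# Hypothetical SAXPY-style kernel on n single-precision elements.
# All numbers are illustrative assumptions, not measurements.
n = 1 << 24                  # number of elements (assumed)
flops = 2 * n                # one multiply and one add per element
bytes_moved = 3 * 4 * n      # read x, read y, write y; 4 bytes per float
elapsed_s = 0.5e-3           # assumed kernel runtime in seconds

arithmetic_intensity = flops / bytes_moved   # FLOP per byte
achieved_flops = flops / elapsed_s           # FLOP/s

print(f"Arithmetic intensity: {arithmetic_intensity:.3f} FLOP/byte")
print(f"Achieved performance: {achieved_flops / 1e9:.2f} GFLOP/s")
```

Note that the arithmetic intensity depends only on the algorithm and its data traffic, while FLOP/s also depends on how fast the kernel actually ran.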


Nsight Compute collects and displays roofline analysis data in the Roofline chart. This chart is part of the *Speed of Light (SOL)* section.


<img src="images/roofline-overview.png">


This chart actually shows two different Rooflines. However, the following components can be identified for each:

- *Vertical Axis* represents floating point operations per second (FLOP/s). (Note: For GPUs this number can get quite large, so this axis is rendered using a logarithmic scale to better accommodate the range.)
- *Horizontal Axis* represents arithmetic intensity, which is the ratio between Work (expressed in FLOP/s) and Memory Traffic (expressed in bytes per second). The resulting unit is floating point operations per byte. This axis is also shown using a logarithmic scale.
- *Memory Bandwidth Boundary* is the sloped part of the Roofline. By default, this slope is determined entirely by the memory transfer rate of the GPU, but it can also be customized.
- *Peak Performance Boundary* is the flat part of the Roofline. By default, this value is determined entirely by the peak performance of the GPU, but it can be customized too.
- *Ridge Point* is the point at which the memory bandwidth boundary meets the peak performance boundary (a useful reference when analyzing kernel performance).
- *Achieved Value* represents the performance of the profiled kernel.
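The components above follow from two hardware limits. The sketch below computes the roofline ceiling and the ridge point under the basic roofline model; the peak values are round, assumed numbers and not the specification of any particular GPU.

```python
def attainable_flops(ai, peak_flops, peak_bandwidth):
    """Roofline ceiling in FLOP/s at arithmetic intensity `ai` (FLOP/byte).

    The sloped part is ai * peak_bandwidth; the flat part is peak_flops.
    """
    return min(peak_flops, ai * peak_bandwidth)


# Assumed hardware limits for illustration only.
peak_flops = 7.0e12        # FLOP/s
peak_bandwidth = 900.0e9   # bytes/s

# The ridge point is where the sloped and flat boundaries meet.
ridge_point = peak_flops / peak_bandwidth
print(f"Ridge point: {ridge_point:.2f} FLOP/byte")

# Sample the ceiling at logarithmically spaced intensities, as on the chart axes.
for ai in (0.1, 1.0, 10.0, 100.0):
    ceiling = attainable_flops(ai, peak_flops, peak_bandwidth)
    print(f"AI = {ai:6.1f} FLOP/byte -> ceiling = {ceiling / 1e9:8.1f} GFLOP/s")
```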

To learn more about customizing Nsight Compute tools, read the Nsight Compute Customization Guide: https://docs.nvidia.com/nsight-compute/2021.2/CustomizationGuide/index.html#abstract

#### Roofline Analysis

The Roofline chart can be very helpful in guiding performance optimization efforts for a particular kernel.

<img src="images/roofline-analysis.png">

As shown here, the ridge point partitions the Roofline chart into two regions. The area shaded in blue under the sloped *Memory Bandwidth Boundary* is the *Memory Bound* region, while the area shaded in green under the *Peak Performance Boundary* is the *Compute Bound* region. The region in which the achieved value falls determines the current limiting factor of kernel performance.

The distance from the achieved value to the respective Roofline boundary (shown in this figure as a dotted white line), represents the opportunity for performance improvement. The closer the achieved value is to the Roofline boundary, the more optimal its performance. An achieved value that lies on the *Memory Bandwidth Boundary* but is not yet at the height of the ridge point would indicate that any further improvements in overall FLOP/s are only possible if the arithmetic intensity is increased at the same time. 
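This reading of the chart can be sketched numerically. The helper below, with a hypothetical name and assumed peak values, reports which region a kernel falls in and what fraction of its roofline boundary it reaches; the two example kernels are invented for illustration.

```python
def roofline_position(ai, achieved_flops, peak_flops, peak_bandwidth):
    """Return (region, ceiling, fraction of ceiling reached) for one kernel."""
    ridge_point = peak_flops / peak_bandwidth
    region = "memory bound" if ai < ridge_point else "compute bound"
    ceiling = min(peak_flops, ai * peak_bandwidth)
    return region, ceiling, achieved_flops / ceiling


# Assumed hardware limits and made-up kernels, for illustration only.
PEAK_FLOPS = 7.0e12        # FLOP/s
PEAK_BANDWIDTH = 900.0e9   # bytes/s
kernels = [("kernel_a", 0.5, 300.0e9), ("kernel_b", 20.0, 3.5e12)]

for name, ai, achieved in kernels:
    region, ceiling, fraction = roofline_position(ai, achieved, PEAK_FLOPS, PEAK_BANDWIDTH)
    print(f"{name}: {region}, reaches {fraction:.0%} of its "
          f"{ceiling / 1e9:.0f} GFLOP/s ceiling")
```

For a memory-bound kernel close to its ceiling, the remaining option is to raise the arithmetic intensity, which moves the point to the right where the ceiling is higher.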

If you hover your mouse over the achieved value, you can see the achieved performance in FLOP/s (as shown in the example below).

<img src="images/roofline-achieved.png">

Using the baseline feature in combination with Roofline charts is a good way to track optimization progress over a number of kernel executions. As shown in the example below, the Roofline chart also contains an achieved value for each baseline. The outline color of each plotted achieved value point identifies the baseline it came from. In this example, the achieved value points are outlined in light blue and green.

<img src="images/roofline-baseline.png">

### Memory Charts 
<a name="memory"></a>
The *Memory Workload Analysis* section shows a detailed analysis of the memory resources of the GPU. Memory can become a limiting factor for the overall kernel performance when fully utilizing the involved hardware units (Mem Busy), exhausting the available communication bandwidth between those units (Max Bandwidth), or by reaching the maximum throughput of issuing memory instructions (Mem Pipes Busy). Depending on the limiting factor, the memory chart and tables allow you to identify the exact bottleneck in the memory system.

Below is a memory chart of an NVIDIA V100 Tensor Core GPU:

<img src="images/compute-memory.png" >

*Logical units* (e.g., Kernel, Global memory) are shown in green and *physical units* (e.g., L2 Cache, Device Memory, System memory) are shown in blue. Since not all GPUs have all units, the exact set of shown units may vary for a specific GPU architecture.

*Links* between *Kernel* and other logical units represent the number of executed instructions (Inst) targeting the respective unit. For example, the link between Kernel and Global represents the instructions loading from or storing to the global memory space. 

Links between logical units (green) and physical units (blue) represent the number of requests (Req) issued as a result of their respective instructions. For example, the link going from L1/TEX Cache to Global shows the number of requests generated due to global load instructions.

The color of each link represents the percentage of peak utilization of the corresponding communication path. The color legend to the right of the chart shows the applied color gradient from unused (0%) to operating at peak performance (100%). Triangle markers to the left of the legend correspond to the links in the chart. 


Colored rectangles inside the units, located at the incoming and outgoing links, represent port utilization. Units often share a common data port for incoming and outgoing traffic. Ports use the same color gradient as the data links. The screenshot below shows an example of how the peak values reported in the memory tables map to the ports in the memory chart:

<img src="images/compute-memtable.png">

Moreover, if you hover your mouse over any of the memory table cells in the UI, you can see the background documentation and metric calculations for that cell.

<img src="images/compute-memtable-hover.png">

Memory tables show detailed metrics for the various memory hardware units such as device memory. To learn more, please read the profiling guide: https://docs.nvidia.com/nsight-compute/2021.2/ProfilingGuide/index.html#memory-tables



### Profiling Using Command Line Interface 
To profile the application, you can either use the GUI or command line interface (CLI). During this lab, we will profile the applications using CLI. The Nsight Compute command line executable is named `ncu`. To collect the default set of data for all kernel launches in the application, run:

```
ncu -o output ./rdf
```

For all kernel invocations in the application code, details page data will be gathered and displayed and the results are written to `output.ncu-rep`. 

<img src="images/compute-cli-1.png"  width="80%" height="80%">

As seen in the above screenshot, each output from the compute profiler starts with `==PROF==`. The other lines are output from the application itself. For each profiled kernel, the name of the kernel function and the progress of data collection are shown. In the example screenshot, the kernel function name starts with `_Z16pair_gpu_183_gpuPKdS0_S0_...`.
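When profiling from a script, this prefix makes it easy to separate profiler messages from the application's own output. The sketch below does that on captured text; the sample lines are made up to resemble the screenshot and are not a real capture, and `split_profiler_output` is a hypothetical helper name.

```python
def split_profiler_output(text, prefix="==PROF=="):
    """Split captured console text into profiler lines and application lines."""
    profiler_lines, application_lines = [], []
    for line in text.splitlines():
        if line.startswith(prefix):
            profiler_lines.append(line)
        else:
            application_lines.append(line)
    return profiler_lines, application_lines


# Illustrative text only; the exact wording of real profiler messages may differ.
sample = """==PROF== Connected to process 1234
application output line 1
==PROF== Profiling "pair_gpu" - 1: 0%....50%....100%
application output line 2
==PROF== Disconnected from process 1234
"""

profiler_lines, application_lines = split_profiler_output(sample)
print(f"{len(profiler_lines)} profiler lines, {len(application_lines)} application lines")
```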


<img src="images/compute-cli-2.png"  width="80%" height="80%">

The example screenshot above shows the major sections (annotated in green) for SOL (speed of light: a comparison against the best possible behavior), launch statistics, and occupancy analysis for the example kernel function `pair_gpu`. You can optionally select which of these sections are collected and displayed with command-line parameters. Run `ncu --list-sets` from the terminal to see the list of available sets.


<img src="images/compute-sets.png" width="80%" height="80%"> 


To see the list of currently available sections, use `--list-sections`.


<img src="images/compute-sections.png" width="80%" height="80%"> 

To collect the full set of sections when profiling your application with Nsight Compute, add `--set full` to the command line. In addition to the default sections, this collects Memory and Compute Workload Analysis, scheduler, warp state, and instruction statistics, and adds all of them to the profiling report.

**Note**: The more sections and metrics you collect, the longer profiling takes and the larger the output report becomes.


There are also options available to specify for which kernel launches data should be collected. Below is a typical command line invocation that collects the default set of data for a single launch of one specific kernel, skipping its first launch:

`ncu -k _Z16pair_gpu_183_gpuPKdS0_S0_Pyiidddi --launch-skip 1 --launch-count 1 -f -o output ./rdf`

where command switch options used for this lab are:
- `-c` or `--launch-count`: specifies the number of kernel launches to collect
- `-s` or `--launch-skip`: specifies the number of kernel launches to skip before collection starts
- `-k` or `--kernel-name`: specifies the kernel name to match
- `-f`: overwrites an existing report file with the same name
- `-o`: name for the intermediate result file, created at the end of the collection (.nsight-cuprof-report or .ncu-rep filename)
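If you launch the profiler from a notebook or a script, the command line can be assembled programmatically. The sketch below uses only the options listed above plus `--set`; `build_ncu_command` is a hypothetical helper written for this lab, not part of Nsight Compute, and the snippet only prints the command rather than running it.

```python
import shlex


def build_ncu_command(app, output, kernel_name=None, launch_skip=None,
                      launch_count=None, section_set=None, overwrite=False):
    """Assemble an `ncu` invocation as an argument list (hypothetical helper)."""
    cmd = ["ncu"]
    if kernel_name is not None:
        cmd += ["-k", kernel_name]                     # kernel name to match
    if launch_skip is not None:
        cmd += ["--launch-skip", str(launch_skip)]     # launches to skip first
    if launch_count is not None:
        cmd += ["--launch-count", str(launch_count)]   # launches to collect
    if section_set is not None:
        cmd += ["--set", section_set]                  # e.g. "full"
    if overwrite:
        cmd.append("-f")                               # overwrite existing report
    cmd += ["-o", output, app]
    return cmd


cmd = build_ncu_command("./rdf", "output",
                        kernel_name="_Z16pair_gpu_183_gpuPKdS0_S0_Pyiidddi",
                        launch_skip=1, launch_count=1, overwrite=True)
print(shlex.join(cmd))
```

On a system where `ncu` and the application are available, the resulting list could be passed to `subprocess.run(cmd, check=True)` to start the profiling run.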

**Customizing data collection**: How do you decide how many kernel launches to skip and how many to collect? Since data is collected per kernel launch, it makes sense to collect more than one launch if the launches have different behavior or performance characteristics. Skip the launches whose performance metrics you do not need, and collect only those you want to analyze.

You can also profile the kernel from inside NVIDIA Nsight Systems or you can copy the command line options for the specific kernel you want to profile. To achieve this, you would need to right click on the kernel in the timeline view from inside Nsight Systems. 

<img src="images/nsys-compute-command.png" width="80%" height="80%">

Then click on "Analyze the selected Kernel with NVIDIA Nsight Compute". 

<img src="images/nsys-compute-command1.png" width="50%" height="50%">

Then choose "Display the command line to use NVIDIA Nsight Compute CLI". Then, you copy the command and run it on the target system to analyze the selected kernel.

<img src="images/nsys-compute-command2.png" width="50%" height="50%">


**Note**: You do not need to memorize the profiler options. You can always run `ncu --help` from the command line and use the necessary options or profiler arguments. For more info on Nsight Compute CLI, please read the __[documentation](https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html)__.


### How to view the report
The profiler report contains all the information collected during profiling for each kernel launch. When using CLI to profile the application, there are two ways to view the profiler's report. 

1) On the Terminal: By default, a temporary file is used to store profiling results, and data is printed to the command line. You can also use `--print-summary per-kernel` option to view the summary of each kernel type on the terminal. To read more about console output options, review the guide at https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html#command-line-options-console-output .


2) NVIDIA Nsight Compute UI: To permanently store the profiler report, use `-o` to specify the output filename. After the profiling session ends, a `*.nsight-cuprof-report` or `*.ncu-rep` file will be created. This file can be loaded into the Nsight Compute UI using *File -> Open*. To view the report on your local machine, the local system needs the same version of the CUDA toolkit installed, and the Nsight Compute UI version should match the CLI version. More details on where to download the CUDA toolkit can be found in the “Links and Resources” at the end of this page.

To view the profiler report, simply open the file from the GUI (File > Open).

<img src="images/compute-open.png">

**NOTE**: Example screenshots are for reference only, and you may not get an identical profiler report.

-----

# <div style="text-align: center ;border:3px; border-style:solid; border-color:#FF0000  ; padding: 1em">[HOME](introduction.ipynb#steps)</div>

-----

# Links and Resources


[NVIDIA Nsight Compute](https://developer.nvidia.com/nsight-compute)


**NOTE**: To be able to see the Nsight Compute profiler output, please download the latest version of Nsight Compute from [here](https://developer.nvidia.com/nsight-compute).

Don't forget to check out additional [Open Hackathons Resources](https://www.openhackathons.org/s/technical-resources) and join our [OpenACC and Hackathons Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.

--- 

## Licensing 

Copyright © 2022 OpenACC-Standard.org.  This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). These materials may include references to hardware and software developed by other entities; all applicable licensing and copyrights apply.