Before we begin, run the cell below to display information about the CUDA driver and the GPUs on the server using the `nvidia-smi` command. Give the cell focus by clicking on it, then press Ctrl-Enter or the play button in the toolbar above. If all goes well, you should see some output returned below the grey cell.

!nvidia-smi

## Learning objectives
The **goal** of this lab is to:

- Learn about tracing of MPI, OpenSHMEM, NVSHMEM and NCCL
- Learn how to do multi-process profiling

We do not intend to cover:

- How to use NVSHMEM, NCCL, and MPI

### Nsight Systems 
The Nsight Systems tool offers system-wide performance analysis to visualize an application's algorithms, help identify optimization opportunities, and improve the performance of applications running on a system consisting of multiple CPUs and GPUs. Nsight Systems is packed with many features. A few of them are highlighted in the above screenshot:

- API tracing of CUDA libraries and deep learning frameworks
- CPU utilization, CPU thread states and thread migration as well as CPU callstack sampling
- OS runtime library calls
- GPU activities (kernels and memory copies) as well as GPU metrics

This section briefly explores other features of Nsight Systems that were not covered as part of the other labs.

#### GPU Metrics Sampling

Nsight Systems has a GPU Metrics feature that is used to identify performance limiters in applications that use the GPU for computation. It periodically samples performance metrics and detailed timing statistics associated with different GPU hardware units. Specialized hardware captures this data in a single pass with minimal overhead. These metrics provide an overview of GPU efficiency over time for compute and input/output (IO) activities:

- IO throughputs: PCIe, NVLink, and DRAM
- SM utilization: SM activity, tensor core activity, instructions issued, warp occupancy (including unallocated slots)

It can help users answer common questions such as:

- Is my GPU idle?
- Is my instruction rate low (possibly IO bound)?
- Is my GPU full? Are the kernel grid sizes and the number of streams sufficient? Are my SMs and warp slots full?
- Can I see GPU Direct RDMA/Storage or other transfers?
- Am I using Tensor Cores?
- Am I possibly blocked on IO, on the number of warps, etc.?

Nsight Systems GPU Metrics require the NVIDIA Turing architecture or newer. Learn more about GPU metrics at https://docs.nvidia.com/nsight-systems/UserGuide/index.html#gpu-metric-sampling

<img src="images/gpu_metrics.png">
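As a rough sketch, GPU Metrics sampling might be switched on from the command line as shown below. This assumes the `--gpu-metrics-device` option of `nsys profile` (some newer releases spell it `--gpu-metrics-devices`, so check `nsys profile --help` for your version). The report name `gpu_metrics_report`, the helper `gpu_metrics_cmd`, and `./myprogram` are hypothetical placeholders. The helper only prints the command so that it can be inspected before it is run.

```shell
#!/bin/bash
# Build an nsys command line that enables GPU Metrics sampling.
# Assumptions: option name --gpu-metrics-device (verify with `nsys profile --help`);
# the report name and the application path are placeholders.
gpu_metrics_cmd() {
    local devices="$1"   # "all" or a comma-separated list of GPU ids, e.g. "0,1"
    shift
    echo nsys profile --trace=cuda,nvtx "--gpu-metrics-device=${devices}" \
         -o gpu_metrics_report "$@"
}

# Print the command for sampling all GPUs while running a hypothetical ./myprogram
gpu_metrics_cmd all ./myprogram
```

Copying the printed line into a terminal on a machine with a Turing or newer GPU should produce a report that contains the GPU Metrics rows in the timeline.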


#### Tracing of MPI, OpenSHMEM, NVSHMEM and NCCL

Nsight Systems supports MPI as well as OpenSHMEM tracing. You can record MPI communication parameters and track the MPI communicators. So, if you want to follow the data and see which MPI ranks are communicating with each other and how much data is transferred, you can see this information in the report. 

<img src="images/mpi_comm.png">

In the above screenshot you see the tooltip for an `MPI_Irecv` (bottom right) with its MPI tag, number of bytes that have been received, the sender of the data and also the communicator. In the case of MPI, you can also trace MPI for Fortran applications. 

Besides MPI and OpenSHMEM, where we intercept the library calls, Nsight Systems can also trace calls into the NVSHMEM and NCCL libraries. This tracing is based on NVTX, whose usage was explained in previous labs. The result looks similar to what is shown in the above screenshot, but with the function names of the respective API and the respective row labels, e.g. NCCL and NVSHMEM rows instead of the MPI and UCX rows in the above screenshot.

In the screenshot, you see the execution timeline of a very short range of an MPI program that triggers some `MPI_Isend` and `MPI_Irecv` calls. For each MPI call you get the communication parameters, and you also get the UCX API calls triggered by the MPI implementation, which in this example is Open MPI.
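The split described above (MPI, OpenSHMEM and UCX traced by intercepting library calls, NCCL and NVSHMEM traced through NVTX) can be sketched as a small helper that assembles the `--trace` argument for `nsys profile`. The helper names `trace_token` and `trace_arg` are hypothetical. The trace tokens `mpi`, `oshmem`, `ucx` and `nvtx` are assumed to match your Nsight Systems version; `nsys profile --help` lists the accepted values.

```shell
#!/bin/bash
# Map a communication library to the nsys --trace token that records it.
# NCCL and NVSHMEM are traced via their NVTX annotations, hence "nvtx".
trace_token() {
    case "$1" in
        mpi)          echo mpi ;;
        openshmem)    echo oshmem ;;
        ucx)          echo ucx ;;
        nccl|nvshmem) echo nvtx ;;
        *)            echo "unknown library: $1" >&2; return 1 ;;
    esac
}

# Join the tokens for several libraries into one --trace argument (no duplicates).
trace_arg() {
    local tokens="" tok lib
    for lib in "$@"; do
        tok=$(trace_token "$lib") || return 1
        case ",$tokens," in
            *",$tok,"*) ;;                              # token already present
            *) tokens="${tokens:+$tokens,}$tok" ;;
        esac
    done
    echo "--trace=$tokens"
}

trace_arg mpi ucx nccl nvshmem   # prints --trace=mpi,ucx,nvtx
```

The printed argument can then be passed to `nsys profile` in front of the application.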


#### UCX API Tracing

Unified Communication X (UCX) is an open-source communication framework that acts as a common library and API for several higher-level communication libraries, e.g. Open MPI (including its OpenSHMEM implementation) and MPICH. If UCX library trace is selected, Nsight Systems will trace the subset of functions of the UCX protocol layer (UCP) that are most likely to be involved in performance bottlenecks. If OpenSHMEM library trace is selected, Nsight Systems will trace the subset of OpenSHMEM API functions that are most likely to be involved in performance bottlenecks.

<img src="images/ucx.png">

In the above screenshot, both `MPI_Isend` and `MPI_Irecv` calls trigger UCP API calls (you only see the `MPI_Isend` calls, because the `MPI_Irecv` calls are very short in this particular example). The bottom row in the timeline shows the processing of transfers from non-blocking UCP communication operations. So in the UCX row you see the submit functions, and in the row below you see when the processing of the transfers starts (if we were to zoom out, we could also see when the processing ends).

#### NIC Performance Metrics
NVIDIA ConnectX smart network interface cards (smart NICs) offer advanced hardware offloads and accelerations for network operations. Viewing smart NIC metrics on the Nsight Systems timeline enables developers to better understand their application's network usage and to use this information to optimize the application's performance.

<img src="images/NIC.png">

The performance counters are displayed on the Nsight Systems timeline, letting you know when the application is sending and receiving data. There are also counters that indicate network congestion, such as the `IB Send Wait` counter that you see in the above screenshot.

### Nsight Systems Multi-Process Profiling

On compute clusters, where you have to use a workload manager or want to run across multiple nodes, place the `nsys profile` command after the launcher and directly before the application. With this, a report file is generated for each process. If you can launch your application on a single node without a workload manager, e.g. with `mpirun`, you can instead place `nsys profile` before `mpirun`, and only a single report including all processes is generated.

- **Single Node**: `nsys profile [nsys_args] mpirun [mpirun_args] your_executable`. The command will create one report file.
- **Multiple Nodes**: `mpirun [mpirun_args] nsys profile [nsys_args] your_executable`. The command will create one report file per MPI rank. You can set the output report name with `-o report_name_%q{OMPI_COMM_WORLD_RANK}`, where `OMPI_COMM_WORLD_RANK` is the rank variable for Open MPI (use `PMI_RANK` for MPICH and `SLURM_PROCID` for Slurm).
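Because the rank variable differs between launchers, a job script that has to work with more than one of them could look up the rank as sketched below. The helper name `get_rank` and the fallback to rank 0 for non-MPI runs are assumptions made for illustration. The three environment variables are the ones named above.

```shell
#!/bin/bash
# Print the global rank of this process for common launchers.
# Checks Open MPI first, then MPICH (PMI), then Slurm; falls back to 0
# when none of the variables is set (assumption: a plain, non-MPI run).
get_rank() {
    echo "${OMPI_COMM_WORLD_RANK:-${PMI_RANK:-${SLURM_PROCID:-0}}}"
}

# Build a per-rank report name from it
echo "report_name_$(get_rank)"
```

When the launcher is known in advance, the `%q{...}` form is simpler, since `nsys` expands the environment variable itself.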

You can also profile only specific ranks: 

```
#!/bin/bash
# Wrapper script: profile only rank 0 and run all other ranks unprofiled.
# Use OMPI_COMM_WORLD_LOCAL_RANK instead to select by node-local rank.
if [ "$OMPI_COMM_WORLD_RANK" = "0" ]; then
    nsys profile -t mpi "$@"
else
    "$@"
fi
```
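The same idea extends to profiling several ranks at once. The following is a minimal sketch, not a tested production wrapper: the `PROFILE_RANKS` variable, the `should_profile` helper and the `report_rank_<N>` naming are hypothetical, and the rank is read from the Open MPI variable only.

```shell
#!/bin/bash
# Return success if rank $1 appears in the comma-separated list $2.
should_profile() {
    case ",$2," in
        *",$1,"*) return 0 ;;
        *)        return 1 ;;
    esac
}

PROFILE_RANKS="${PROFILE_RANKS:-0}"   # hypothetical variable, e.g. "0,3"
RANK="${OMPI_COMM_WORLD_RANK:-0}"     # Open MPI; adapt for other launchers

# Only act when an application command was passed to the wrapper.
if [ "$#" -gt 0 ]; then
    if should_profile "$RANK" "$PROFILE_RANKS"; then
        exec nsys profile -t mpi -o "report_rank_${RANK}" "$@"
    else
        exec "$@"
    fi
fi
```

Saved as, say, `wrapper.sh`, it might be launched as `PROFILE_RANKS=0,3 mpirun -np 4 ./wrapper.sh ./myprogram`, so that only ranks 0 and 3 write a report.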

Below is an example command that was run on two nodes of a compute facility that uses the *SLURM* workload manager.

```
srun [SRUN_ARGS] nsys profile -t mpi,ucx -s none --nic-metrics=true -o ./report_mpi.%q{SLURM_PROCID} -f true ./myprogram [PROGRAM_ARGS]
```

where,

- `nsys profile`: starts a profiling session
- `-t, --trace=...`: sets the APIs to be traced; in this example, MPI and UCX
- `-s, --sample=[cpu|none]`: controls CPU instruction pointer (IP) sampling; `none` disables it
- `--nic-metrics=[true|false]`: controls collection of network interface card (NIC) metrics
- `-o, --output=...`: output report file path; `%q{SLURM_PROCID}` expands to the value of that environment variable, giving one report per rank
- `-f, --force-overwrite=[true|false]`: overwrites the output report file if it already exists

To learn more about other switches to use with `nsys profile`, you can type `nsys profile --help` on the command line or read the [online docs](https://docs.nvidia.com/nsight-systems/UserGuide/index.html#cli-profiling).

### Nsight Compute Multi-Process Profiling

On a single-node submission, one Nsight Compute instance can profile all launched child processes and data for all processes is stored in one report file.

`ncu --target-processes all -o <singlereport-name> <app> <args>`

On multi-node submissions, one tool instance can be used per node. Make sure instances don’t write to the same report file on a shared disk. 

`ncu -o report_%q{OMPI_COMM_WORLD_RANK} <app> <args>`
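One way to keep report names unique on a shared disk is to include both the host name and the rank. The sketch below assumes the `%h` (host name) and `%q{VAR}` (environment variable) file macros of the Nsight Compute CLI; the helper name `ncu_cmd` and `./myprogram` are hypothetical. It only prints the command, so nothing is profiled until the printed line is run.

```shell
#!/bin/bash
# Print an ncu command whose report name is unique per host and per rank.
# %h and %q{VAR} are expanded by ncu itself, so they are kept in single
# quotes to stop the shell from interpreting them.
ncu_cmd() {
    local rank_var="$1"   # e.g. OMPI_COMM_WORLD_RANK, PMI_RANK or SLURM_PROCID
    shift
    echo ncu -o 'report_%h_%q{'"$rank_var"'}' "$@"
}

# Open MPI example with a hypothetical application
ncu_cmd OMPI_COMM_WORLD_RANK ./myprogram
```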

Similar to Nsight Systems, consider profiling only a single rank, e.g. using a wrapper script (see the example below):

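```
#!/bin/bash
# Wrapper script: profile only rank 3 with Nsight Compute; other ranks run unprofiled.
# The ncu path is site-specific; adjust it to your installation.
if [[ "$OMPI_COMM_WORLD_RANK" == "3" ]] ; then
    /sw/cluster/cuda/11.1/nsight-compute/ncu -o report_${OMPI_COMM_WORLD_RANK} --target-processes all "$@"
else
    "$@"
fi
```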

-----

# <div style="text-align: center ;border:3px; border-style:solid; border-color:#FF0000  ; padding: 1em">[HOME](../_start_profiling.ipynb#steps)</div>

-----

# Links and Resources


[NVIDIA Nsight Systems](https://docs.nvidia.com/nsight-systems/)

[NVIDIA Nsight Compute](https://docs.nvidia.com/nsight-compute/index.html)


**NOTE**: To be able to see the Nsight Systems profiler output, please download the latest version of Nsight Systems from [here](https://developer.nvidia.com/nsight-systems).

Don't forget to check out additional [OpenACC Resources](https://www.openacc.org/resources) and join our [OpenACC Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.

--- 

## Licensing 

This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0).