## Profiling Tutorial

### Learning objectives
Learn how to profile your application with NVIDIA Nsight Systems and NVTX API calls to find performance limiters and bottlenecks and apply incremental parallelization strategies. In this lab, you will:

- Learn to follow a cyclical process (analyze, parallelize, optimize) to identify the portions of the code that would benefit from GPU acceleration, then apply parallelization strategies and optimization techniques to gain further speedups
- Understand what a profiler is and which NVIDIA Nsight tool to choose in order to profile your application
- Profile a sequential weather modeling application (integrated with NVIDIA Tools Extension (NVTX) APIs) with NVIDIA Nsight Systems to capture and trace CPU events and time ranges
- Understand how to use the NVIDIA Nsight Systems profiler's report to detect the hotspots that are worth porting to the GPU
- Learn how to use Nsight Systems to identify issues such as an underutilized GPU and unnecessary data movement in the application, and how to apply optimization strategies step by step to expose more parallelism and make use of both the CPU and the GPU
- Learn how to use occupancy to address performance limitations
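
The NVTX time ranges mentioned in the objectives above can be sketched as follows. This is a minimal illustration, not the lab's code: the `USE_NVTX` build flag, the `RANGE_PUSH`/`RANGE_POP` macros, and the `sum_of_squares` function are hypothetical names chosen for this example. Only `nvtxRangePushA` and `nvtxRangePop` come from the NVTX C API.

```c
#include <assert.h>

/* NVTX is optional in this sketch: build with -DUSE_NVTX (with the NVTX
   headers on the include path) to emit ranges. Otherwise the macros
   expand to nothing and the code runs unchanged. */
#ifdef USE_NVTX
#include <nvtx3/nvToolsExt.h>
#define RANGE_PUSH(name) nvtxRangePushA(name)
#define RANGE_POP()      nvtxRangePop()
#else
#define RANGE_PUSH(name) ((void)0)
#define RANGE_POP()      ((void)0)
#endif

/* Hypothetical stand-in for one phase of a simulation step.
   Returns 1^2 + 2^2 + ... + n^2. */
double sum_of_squares(int n)
{
    assert(n >= 0);
    double total = 0.0;

    RANGE_PUSH("sum_of_squares");   /* open a named range on this thread */
    for (int i = 1; i <= n; i++) {
        total += (double)i * (double)i;
    }
    RANGE_POP();                    /* close the most recently opened range */

    return total;
}
```

When the ranges are enabled and the program is run under Nsight Systems with NVTX tracing turned on, each push/pop pair should appear as a named interval on the timeline, which makes it easier to attribute CPU time to a specific phase.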


In these labs, we will be optimizing a serial application written in the C programming language. Labs 1-5 optimize the weather simulation application using the OpenACC programming model. The optional Lab 6 uses CUDA to optimize an application based on the Jacobi iterative method.


### Tutorial Outline
- [Introduction](jupyter_notebook/introduction.ipynb)
    - Overview of Nsight profiler tools ([Nsight Systems](jupyter_notebook/nsight_systems.ipynb) and [Nsight Compute](jupyter_notebook/nsight_compute.ipynb))
    - Overview of [Mini Weather application](jupyter_notebook/miniweather.ipynb)
    - Optimization steps for parallel programming with OpenACC
- [Lab 1](jupyter_notebook/profiling_lab1.ipynb)
    - How to compile a serial application with NVIDIA HPC compiler
    - How to profile a serial application with Nsight Systems and NVTX APIs
    - How to use profiler's report to find hotspots
    - Scaling and Amdahl's law and why it matters
- [Lab 2](jupyter_notebook/profiling_lab2.ipynb) 
    - Parallelize the serial application using OpenACC compute directives
    - How to compile a parallel application with NVIDIA HPC compiler
    - What does the compiler feedback tell us
    - Profile with Nsight Systems
    - Finding bottlenecks from Nsight Systems report
- [Lab 3](jupyter_notebook/profiling_lab3.ipynb)
    - How to combine the knowledge from compiler feedback and profiler to optimize the application
    - What is occupancy
    - Demystifying Gangs, Workers, and Vectors
    - Apply collapse clause to optimize the application further
- [Lab 4](jupyter_notebook/profiling_lab4.ipynb) 
    - Inspect data movement from the profiler's report
    - Data management with OpenACC
    - Apply incremental parallelization strategies and use profiler's report for the next step
- [Lab 5](jupyter_notebook/profiling_lab5.ipynb)
    - When and how to use Nsight Compute
    - What the profiler tells us and where the bottleneck is
    - How to use baselines with Nsight Compute
- **Optional**
    - [Lab 6](jupyter_notebook/profiling_lab6.ipynb)
        - Performance analysis of an application using Nsight Systems and Nsight Compute (CUDA example)
    - [Advanced](jupyter_notebook/nsight_advanced.ipynb)
        - What are GPU Metrics in Nsight Systems
        - How to profile multiple processes
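
Lab 1 in the outline above covers scaling and Amdahl's law. As a rough illustration of why it matters, the sketch below computes the overall speedup `1 / ((1 - p) + p / s)`, where `p` is the fraction of runtime that can be parallelized and `s` is the speedup of that fraction. The function and parameter names are assumptions made for this example and are not taken from the lab material.

```c
#include <assert.h>

/* Amdahl's law: overall speedup when a fraction p of the runtime
   (parallel_fraction) is accelerated by a factor s (section_speedup)
   and the remaining (1 - p) stays serial. */
double amdahl_speedup(double parallel_fraction, double section_speedup)
{
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(section_speedup > 0.0);

    double serial_part   = 1.0 - parallel_fraction;
    double parallel_part = parallel_fraction / section_speedup;
    return 1.0 / (serial_part + parallel_part);
}
```

For example, `amdahl_speedup(0.9, 100.0)` gives roughly 9.2: even if 90% of the runtime becomes 100 times faster, the remaining serial 10% caps the overall gain. This is why the profiler's hotspot report is the starting point for deciding what to parallelize.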
    

### Tutorial Duration
The lab material will be presented in a 2.5-hour session. A link to download the material is available at the end of each lab.

### Content Level
Beginner, Intermediate

### Target Audience and Prerequisites
The target audience for this lab is researchers, graduate students, and developers who are interested in getting hands-on experience with NVIDIA Nsight Systems by profiling a real-life parallel application.

While Labs 1-5 do not assume any CUDA expertise, basic knowledge of OpenACC programming (e.g., compute constructs), GPU architecture, and programming experience with C/C++ is desirable.

The optional Lab 6 requires basic knowledge of CUDA programming, GPU architecture, and programming experience with C/C++.
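
As a rough indication of the OpenACC level assumed, the sketch below shows a compute construct with the `collapse` clause applied to a doubly nested loop. The `scale_grid` function and its parameters are hypothetical and are not part of the mini weather application. When the code is built without OpenACC support, the pragma is ignored and the loops run serially.

```c
#include <assert.h>
#include <stddef.h>

/* Multiply every element of a row-major nx-by-nz grid by factor.
   collapse(2) asks the compiler to merge both loops into a single
   iteration space, and copy(...) moves the array to the device before
   the loop and back to the host afterwards. */
void scale_grid(double *grid, int nx, int nz, double factor)
{
    assert(grid != NULL && nx >= 0 && nz >= 0);

    #pragma acc parallel loop collapse(2) copy(grid[0:nx*nz])
    for (int k = 0; k < nz; k++) {
        for (int i = 0; i < nx; i++) {
            grid[k * nx + i] *= factor;
        }
    }
}
```

If a loop like this feels familiar, Labs 2-4 should be approachable. The compiler feedback and the profiler report are then used to check how such a directive was actually mapped to the GPU.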

### Additional Resources
- Please install [NVIDIA Nsight Compute](https://docs.nvidia.com/nsight-compute/index.html) and [NVIDIA Nsight Systems](https://docs.nvidia.com/nsight-systems/) on your local machine.

- [Nsight Developer Tools Training Contents](https://github.com/NVIDIA/nsight-training)

--- 

## Licensing 

This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.