This lab gives an overview of the NVIDIA Nsight tools and walks through profiling an application with the Nsight Systems command line interface and the NVTX API. You will learn how to integrate NVTX markers into your application to trace CPU events when profiling with the Nsight tools.
 
In this lab, we will optimize a weather simulation application written in Fortran (if you prefer to use C++, click [this link](../../C/jupyter_notebook/profiling-c.ipynb)).

Let's display information about the GPUs on the server by running the `pgaccelinfo` command, which ships with the PGI compiler that we will be using. To do this, execute the cell below by giving it focus (clicking on it with your mouse) and pressing Ctrl-Enter, or by pressing the play button in the toolbar above. If all goes well, you should see some output below the grey cell.

`!pgaccelinfo`

# NVIDIA Profiler

### What is profiling
Profiling is the first step in optimizing and tuning your application. Profiling an application helps you understand where most of the execution time is spent. You gain an understanding of its performance characteristics and can identify the parts of the code that present opportunities for improvement. Finding hotspots and bottlenecks in your application helps you decide where to focus your optimization efforts.

### NVIDIA Nsight Tools
NVIDIA offers the Nsight tools (Nsight Systems, Nsight Compute, Nsight Graphics), a collection of applications that enable developers to debug and profile the performance of CUDA, OpenACC, or OpenMP applications.

Your profiling workflow will change to reflect the individual Nsight tools. Start with Nsight Systems to get a system-level overview of the workload, eliminate any system-level bottlenecks such as unnecessary thread synchronization or data movement, and improve the system-level parallelism of your algorithms. Once you have done that, proceed to Nsight Compute or Nsight Graphics to optimize the most significant CUDA kernels or graphics workloads, respectively. Periodically return to Nsight Systems to make sure you remain focused on the largest bottleneck. Otherwise, the bottleneck may have shifted and your kernel-level optimizations may not achieve as large an improvement as expected.

- **Nsight Systems**: analyze application algorithms system-wide
- **Nsight Compute**: debug and optimize CUDA kernels
- **Nsight Graphics**: debug and optimize graphics workloads

<img src="images/Nsight Diagram.png" width="80%" height="80%">
*The data flows between the NVIDIA Nsight tools.*

In this lab, we focus only on Nsight Systems to get system-wide, actionable insights for eliminating bottlenecks.

### Introduction to Nsight Systems 
The Nsight Systems tool offers system-wide performance analysis. It visualizes an application's algorithms, helps identify optimization opportunities, and helps improve the performance of applications running on a system with multiple CPUs and GPUs.

#### Nsight Systems Timeline
- CPU rows help locate idle time on the CPU cores. Each row shows how the process's threads utilize the CPU cores.
<img src="images/cpu.png" width="80%" height="80%">

- Thread rows show a detailed view of each thread's activity, including OS runtime library usage, CUDA API calls, and NVTX time ranges and events (if integrated in the application).
<img src="images/thread.png" width="80%" height="80%">

- CUDA Workloads rows display kernel and memory transfer activities.
<img src="images/cuda.png" width="80%" height="80%">

### Profiling using command line interface 
To profile your application, you can use either the Graphical User Interface (GUI) or the Command Line Interface (CLI). During this lab, we will profile the mini application using the CLI.

The Nsight Systems command line interface is named `nsys`. Below is a typical command line invocation:

`nsys profile -t openacc,nvtx --stats=true --force-overwrite true -o miniWeather ./miniWeather`

where the options used in this lab are:
- `profile`: starts a profiling session
- `-t`: selects the APIs to be traced (`nvtx` and `openacc` in this example)
- `--stats`: if true, generates a summary of statistics after the collection
- `--force-overwrite`: if true, overwrites an existing generated report
- `-o`: name for the intermediate result file, created at the end of the collection (`.qdrep` filename)

**Note**: You do not need to memorize the profiler options. You can always run `nsys --help` or `nsys [specific command] --help` from the command line to look up the options and arguments you need.
For more information on the Nsight profiler and NVTX, please see the __[Profiler documentation](https://docs.nvidia.com/nsight-systems/)__.
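
As a minimal sketch, the invocation above can be wrapped in a small shell script that keeps the options in one place. The helper name `build_nsys_cmd` and the dry-run fallback are illustrative assumptions, not part of `nsys` or the lab. Only the `nsys` options already listed above are used.

```shell
#!/bin/sh
# Sketch only: build_nsys_cmd is a hypothetical helper, not part of nsys.
# It prints the profiling command for a given report name and executable.
build_nsys_cmd() {
  report=$1
  app=$2
  echo "nsys profile -t openacc,nvtx --stats=true --force-overwrite true -o $report $app"
}

cmd=$(build_nsys_cmd miniWeather ./miniWeather)

if command -v nsys >/dev/null 2>&1 && [ -x ./miniWeather ]; then
  # Both the profiler and the binary are available: run the session.
  # $cmd is left unquoted on purpose so it splits into separate words;
  # this assumes the report name and path contain no spaces.
  $cmd
else
  # Otherwise just show the command that would be run.
  echo "dry run: $cmd"
fi
```

The dry-run branch makes the script safe to try on a machine where `nsys` or the compiled binary is not yet available.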

### How to view the report
When using the CLI to profile the application, there are two ways to view the profiler's report.

1) On the terminal, using the `--stats` option: with `--stats=true`, the profiling results are printed to the console after the profiling data is collected.

<img src="images/laplas3.png" width="100%" height="100%">

2) In the NVIDIA Nsight Systems GUI: after the profiling session ends, a `*.qdrep` file is created. To view it on your local machine, the local system needs the same version of the CUDA toolkit installed, and the Nsight Systems GUI version should match the CLI version. Details on where to download the CUDA toolkit can be found under “Links and Resources” at the end of this page.

To view the profiler report, open the file from the GUI (*File > Open*).

<img src="images/nsight_open.png" width="80%" height="80%">
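
Before opening the GUI, it can help to confirm that the report file was actually written. The following is a minimal sketch: `find_report` is a hypothetical helper, and checking for a `.nsys-rep` extension alongside `.qdrep` is an assumption that covers newer Nsight Systems releases, which use that extension for reports.

```shell
#!/bin/sh
# Sketch only: find_report is a hypothetical helper. It prints the first
# existing report file for a given base name, trying both extensions.
find_report() {
  for ext in qdrep nsys-rep; do
    if [ -f "$1.$ext" ]; then
      echo "$1.$ext"
      return 0
    fi
  done
  # No report with either extension was found.
  return 1
}

if report=$(find_report miniWeather); then
  echo "open $report in the Nsight Systems GUI (File > Open)"
else
  echo "no report found; run nsys profile first"
fi
```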

# Using NVIDIA Tools Extension (NVTX) 
NVIDIA Tools Extension (NVTX) is a C-based Application Programming Interface (API) for annotating events, time ranges, and resources in applications. NVTX brings the profiled application’s logic into the profiler, which makes the profiler’s displayed data easier to analyze and lets you correlate that data with the application’s actions.

During this lab, we profile the application using the Nsight Systems command line interface and collect the timeline. We will also trace the NVTX API calls that are already integrated into the application. NVTX is a powerful mechanism that lets users manually instrument their application. NVIDIA Nsight Systems can then collect the information and present it on the timeline. It is particularly useful for tracing CPU events and time ranges, and it greatly improves the timeline's readability.

**How to use NVTX**: To use NVTX in Fortran code, you have to use the `nvtx` module. The module uses the Fortran ISO C Binding module to create an interface to the NVTX C functions. Add `use nvtx` to your source code and wrap the parts of your code whose events you want to capture with calls to the NVTX API functions. For example, try adding `call nvtxStartRange("main")` at the beginning of your main program and `call nvtxEndRange` just before its end.

The sample code snippet below shows the use of range events. The resulting NVTX markers can be viewed in the Nsight Systems timeline view.

```fortran
  !Outermost range: it stays open until the final nvtxEndRange at the end of this snippet
  call nvtxStartRange("init")
  call init()

  !Output the initial state
  call output(state,etime)

  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
  !! MAIN TIME STEP LOOP
  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
  call system_clock(t1)
  call nvtxStartRange("while")
  do while (etime < sim_time)
    !If the time step leads to exceeding the simulation time, shorten it for the last step
    if (etime + dt > sim_time) dt = sim_time - etime
    !Perform a single time step
    call nvtxStartRange("perform_timestep")
    call perform_timestep(state,state_tmp,flux,tend,dt)
    call nvtxEndRange
    !Inform the user
    write(*,*) 'Elapsed Time: ', etime , ' / ' , sim_time
    !Update the elapsed time and output counter
    etime = etime + dt
    output_counter = output_counter + dt
    !If it's time for output, reset the counter, and do output
    if (output_counter >= output_freq) then
      output_counter = output_counter - output_freq
      call output(state,etime)
    endif
  enddo
  call nvtxEndRange
  call system_clock(t2,rate)
  write(*,*) "CPU Time: ",dble(t2-t1)/dble(rate)

  !Deallocate
  call finalize()
  !Closes the outermost range opened before init()
  call nvtxEndRange
```

<img src="images/fortran_nvtx.png" width="80%" height="80%">

Detailed NVTX documentation can be found under the __[CUDA Profiler user guide](https://docs.nvidia.com/cuda/profiler-users-guide/index.html#nvtx)__.

### Steps to follow
To obtain the best performance from the GPU and make full use of the hardware, follow a cyclical process (analyze, parallelize, optimize).

- **Analyze**: In this step, you first identify the portions of your code that contain most of the computation and where most of the execution time is spent. From here, you find the hotspots, evaluate the bottlenecks, and start investigating GPU acceleration.

- **Parallelize**: Now that we have identified the bottlenecks, we use OpenACC compute constructs to parallelise the routines where most of the time is spent.

- **Optimize**: To further improve performance, apply optimization strategies step by step in an iterative process: identify an optimization opportunity, apply and test the optimization, verify the result, and repeat.

Note: The optimizations above are applied incrementally, after investigating the profiler output.

We will follow the optimization cycle for porting and improving the code performance.

<img src="images/Optimization_Cycle.jpg" width="80%" height="80%">

# Getting Started 
In the following sections, we parallelise and optimize the serial [mini weather application](miniweather.ipynb) following the above steps. The next section comprises 5 exercises, each of which guides you through the steps to detect performance limiters and overcome them. For each exercise, inspect the code, compile it, and profile it. Then investigate the profiler’s report to identify the bottlenecks and spot the optimization opportunities. At each step, locate problem areas in the application and make improvements iteratively to increase performance.

Each exercise follows the optimization cycle method. For each exercise, build the code with a simple `make` by running the cell, then profile it with `nsys`.
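
The build-then-profile step can be sketched as a small shell function. This is only an illustration: `profile_exercise` and `report_name` are hypothetical helpers, and the per-exercise report naming (`miniWeather_<n>`) is an assumption made here so that reports from different exercises do not overwrite each other.

```shell
#!/bin/sh
# Sketch only: both helpers and the report naming scheme are assumptions.

# Print the report name for exercise number $1.
report_name() {
  echo "miniWeather_$1"
}

# Build the code, then profile it; the exercise number is $1.
profile_exercise() {
  n=$1
  # Build first; stop if compilation fails so a stale binary is never profiled.
  make || return 1
  nsys profile -t openacc,nvtx --stats=true --force-overwrite true \
    -o "$(report_name "$n")" ./miniWeather
}

# Example: show the report name that exercise 2 would use.
report_name 2
```

Calling `profile_exercise 2` from the exercise directory would then build the code and, if the build succeeds, write the report under that name.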


**NOTE**: Example screenshots are for reference only, and you may not get an identical profiler report. In other words, some **screenshots represent the profiler report for the values of 400,200,1500.**

-----

# <div style="text-align: center ;border:3px; border-style:solid; border-color:#FF0000  ; padding: 1em">[NEXT](profiling-fortran-lab1.ipynb)</div>

-----

# Links and Resources

[OpenACC API Guide](https://www.openacc.org/sites/default/files/inline-files/OpenACC%20API%202.6%20Reference%20Guide.pdf)

[NVIDIA Nsight Systems](https://docs.nvidia.com/nsight-systems/)

[CUDA Toolkit Download](https://developer.nvidia.com/cuda-downloads)

**NOTE**: To view the Nsight Systems profiler output, please download the latest version of Nsight Systems from [here](https://developer.nvidia.com/nsight-systems).

Don't forget to check out additional [OpenACC Resources](https://www.openacc.org/resources) and join our [OpenACC Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.

--- 

## Licensing 

This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0).