&emsp;&emsp;&emsp;&emsp;&emsp;
&emsp;&emsp;&emsp;&emsp;&emsp;
&emsp;&emsp;&emsp;&emsp;&emsp;
&emsp;&emsp;&emsp;&emsp;&emsp;
&emsp;&emsp;&emsp;&emsp;&emsp;
&emsp;&ensp;
[Home Page](Start_Here.ipynb)
    
    
[Previous Notebook](Introduction_to_Performance_analysis.ipynb)
&emsp;&emsp;&emsp;&emsp;&emsp;
&emsp;&emsp;&emsp;&emsp;&emsp;
&emsp;&emsp;&emsp;&emsp;&emsp;
&emsp;&emsp;&emsp;&emsp;&ensp;
[1](Introduction_to_Performance_analysis.ipynb)
[2]
[3](Performance_Analysis_using_NSight_systems_Continued.ipynb)
&emsp;&emsp;&emsp;&emsp;&emsp;
&emsp;&emsp;&emsp;&emsp;&emsp;
&emsp;&emsp;&emsp;&emsp;&emsp;
&emsp;&emsp;&emsp;&emsp;&emsp;
[Next Notebook](Performance_Analysis_using_NSight_systems_Continued.ipynb)

# Performance analysis using Nsight Systems


In this notebook, we will use Nsight Systems to profile the DeepStream application and improve its throughput.


- [Using NSight Systems to generate a report and finding bottlenecks to solve](#Using-NSight-Systems-to-generate-a-report-and-finding-bottlenecks-to-solve) 
    - [Streammux parameters](#Streammux-parameters)
    - [Batch size across cascaded networks](#Batch-size-across-cascaded-networks)
    - [NVInfer](#NVInfer)
    - [NVTracker](#NVTracker)
- [Summary](#Summary)


## NVIDIA Profiler

### What is profiling
Profiling is the first step in optimizing and tuning your application. Profiling an application helps you understand where most of the execution time is spent. You gain an understanding of its performance characteristics and can identify the parts of the code that present opportunities for improvement. Finding the hotspots and bottlenecks in your application helps you decide where to focus your optimization efforts.

### NVIDIA Nsight Tools
NVIDIA offers Nsight tools (Nsight Systems, Nsight Compute, Nsight Graphics), a collection of applications that enable developers to debug and profile the performance of CUDA, OpenACC, or OpenMP applications.

Your profiling workflow will change to reflect the individual Nsight tools. Start with Nsight Systems to get a system-level overview of the workload and eliminate any system-level bottlenecks, such as unnecessary thread synchronization or data movement, and improve the system-level parallelism of your algorithms. Once you have done that, proceed to Nsight Compute or Nsight Graphics to optimize the most significant CUDA kernels or graphics workloads, respectively. Periodically return to Nsight Systems to ensure that you remain focused on the largest bottleneck. Otherwise, the bottleneck may have shifted and your kernel-level optimizations may not achieve as large an improvement as expected.

- **Nsight Systems**: analyze application algorithms system-wide
- **Nsight Compute**: debug and optimize CUDA kernels
- **Nsight Graphics**: debug and optimize graphics workloads

<img src="images/Nsight Diagram.png" width="80%" height="80%">
*The data flows between the NVIDIA Nsight tools.*

In this lab, we only focus on Nsight Systems to get the system-wide actionable insights to eliminate bottlenecks.

### Introduction to Nsight Systems 
The Nsight Systems tool offers system-wide performance analysis to visualize an application’s algorithms, help identify optimization opportunities, and improve the performance of applications running on a system consisting of multiple CPUs and GPUs.

#### Nsight Systems Timeline
- CPU rows help locate the idle times of the CPU cores. Each row shows how the process's threads utilize the CPU cores.
<img src="images/cpu.png" width="80%" height="80%">

- Thread rows show a detailed view of each thread's activity, including OS runtime library usage, CUDA API calls, and NVTX time ranges and events (if integrated in the application).
<img src="images/thread.png" width="80%" height="80%">

- CUDA Workloads rows display kernel and memory transfer activities.
<img src="images/cuda.png" width="80%" height="80%">


### Profiling using command line interface 
To profile your application, you can use either the Graphical User Interface (GUI) or the Command Line Interface (CLI). During this lab, we will profile the application using the CLI.

The Nsight Systems command line interface is named `nsys`. Below is a typical command line invocation:

`nsys profile -t cuda,nvtx --stats=true --force-overwrite true -o deepstream ./exe`

where the command switch options used for this lab are:
- `profile`: starts a profiling session
- `-t`: selects the APIs to be traced (`cuda` and `nvtx` in this example)
- `--stats`: if true, generates a summary of statistics after the collection
- `--force-overwrite`: if true, overwrites an existing generated report
- `-o`: name for the intermediate result file, created at the end of the collection (`.qdrep` filename)

**Note**: You do not need to memorize the profiler options. You can always run `nsys --help` or `nsys [specific command] --help` from the command line and use the necessary options or profiler arguments.
For more info on Nsight profiler and NVTX, please see the __[Profiler documentation](https://docs.nvidia.com/nsight-systems/)__.

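As an illustration, the invocation above can also be assembled programmatically, which can be handy when several configurations are profiled from a notebook. The following is a minimal sketch: the helper name `build_nsys_command` is hypothetical, and it only uses the switches described above.

```python
import shlex


def build_nsys_command(app_cmd, output, trace=("cuda", "nvtx"),
                       stats=True, force_overwrite=True):
    """Assemble an `nsys profile` command line as a list of arguments.

    Hypothetical helper for illustration; it only uses the switches
    described above (-t, --stats, --force-overwrite, -o).
    """
    cmd = ["nsys", "profile"]
    if trace:
        # APIs to trace, comma separated (e.g. cuda,nvtx)
        cmd += ["-t", ",".join(trace)]
    cmd.append(f"--stats={'true' if stats else 'false'}")
    cmd += ["--force-overwrite", "true" if force_overwrite else "false"]
    cmd += ["-o", output]
    # The application to profile, with its own arguments, goes last
    cmd += list(app_cmd)
    return cmd


cmd = build_nsys_command(["./exe"], output="deepstream")
print(shlex.join(cmd))
```

On a machine where `nsys` is installed, the resulting list could be passed to `subprocess.run(cmd, check=True)`.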
### How to view the report
When using CLI to profile the application, there are two ways to view the profiler's report. 

1) On the Terminal using `--stats` option: By using `--stats` switch option, profiling results are displayed on the console terminal after the profiling data is collected.

<img src="images/laplas3.png" width="100%" height="100%">

2) NVIDIA Nsight Systems GUI: After the profiling session ends, a `*.qdrep` file is created. This file can be loaded into the Nsight Systems GUI using *File -> Open*. To view it on your local machine, the local system needs the same version of the CUDA toolkit installed, and the Nsight Systems GUI version should match the CLI version. More details on where to download the CUDA toolkit can be found in the “Links and Resources” at the end of this page.

To view the profiler report, simply open the file from the GUI (File > Open).

<img src="images/nsight_open.png" width="80%" height="80%">

### Using NVIDIA Tools Extension (NVTX) 
NVIDIA Tools Extension (NVTX) is a C-based Application Programming Interface (API) for annotating events, time ranges and resources in applications. NVTX brings the profiled application’s logic into the Profiler, making the Profiler’s displayed data easier to analyse and making it possible to correlate the displayed data with the profiled application’s actions.

The DeepStream framework instruments most stages of the pipeline with NVTX. During this lab, we will profile the application using the Nsight Systems command line interface, collect the timeline, and trace the NVTX APIs (already integrated into the DeepStream pipeline). NVTX is a powerful mechanism that allows users to manually instrument their application. NVIDIA Nsight Systems can then collect the information and present it on the timeline. It is particularly useful for tracing CPU events and time ranges, and it greatly improves the timeline's readability.

- An example of NVTX domain being shown by default for GSTNVInfer:

<img src="images/nvtx_domain.png" width="80%" height="80%">

Detailed NVTX documentation can be found under the __[CUDA Profiler user guide](https://docs.nvidia.com/cuda/profiler-users-guide/index.html#nvtx)__.

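The DeepStream plugins emit their own NVTX ranges, but your own Python code (for example a probe callback) can be annotated in a similar way. The sketch below assumes the optional `nvtx` Python package is installed (`pip install nvtx`) and falls back to a no-op when it is not; the function names `nvtx_range` and `process_frames` are hypothetical stand-ins, not part of DeepStream.

```python
import time
from contextlib import nullcontext

try:
    import nvtx  # optional NVTX Python bindings; an assumption, not required by DeepStream
except ImportError:
    nvtx = None


def nvtx_range(message, color="green"):
    """Return a context manager that marks an NVTX range,
    or a no-op context manager when `nvtx` is not installed."""
    if nvtx is None:
        return nullcontext()
    return nvtx.annotate(message=message, color=color)


def process_frames(num_frames):
    """Stand-in for application work; each frame gets its own range."""
    processed = 0
    for i in range(num_frames):
        with nvtx_range(f"frame {i}"):
            time.sleep(0.001)  # placeholder for real per-frame work
            processed += 1
    return processed


with nvtx_range("process_frames", color="blue"):
    print(process_frames(5))
```

When such a script is profiled with `-t nvtx`, the named ranges should appear on the thread rows of the timeline alongside the ranges emitted by the pipeline.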
### Steps to follow
To get the best performance from the GPU and make full use of the hardware, follow a cyclical process of analyzing and optimizing.

- **Analyze**: In this step, you first identify the element in the pipeline that does most of the computation and where most of the execution time is spent. From here, you find the hotspots, evaluate the bottlenecks, and start investigating GPU acceleration.

- **Optimize**: Improve the performance by applying optimization strategies step by step in an iterative process: identify an optimization opportunity, apply and test the optimization method, verify the result, and repeat.

Note: The above optimization is done incrementally after investigating the profiler output.

We will follow this cycle to improve the performance of the pipeline.

### Nsight System Tips

- After running the DeepStream application, the Nsight timeline view has many rows. Each row represents the execution of a separate thread, which may be launched by the GStreamer engine for different stages of the pipeline. It is helpful to pin the important rows, as shown in the image below.
<img src="images/pinning_row.png" width="80%" height="80%">



## Using NSight Systems to generate a report and finding bottlenecks to solve

Let us now profile the pipeline that we optimized in the previous notebook.

In [None]:
!nsys profile --force-overwrite true -o ../source_code/reports/report1 python3 ../source_code/utils/deepstream-no-osd-queue.py --num-sources 3 --prof True

Download and save the report file by holding down <mark>Shift</mark> and <mark>Right-Clicking</mark> [Here](../source_code/reports/report1.qdrep) .

Opening the report in Nsight Systems, we can see that the stream muxer is the key bottleneck: the pipeline is waiting for the muxer to finish.

![batch_size](images/batch_size.PNG)

With this information, let us now look at the different Streammux parameters to optimize the pipeline in the next section.

#### Streammux parameters 

In the previous section, we noticed that Streammux cannot provide buffers quickly enough for NVInfer to process, which creates a bottleneck.

We can fix this by tweaking control parameters that we can find in the [nvstreammux documentation](https://docs.nvidia.com/metropolis/deepstream/dev-guide/text/DS_plugin_gst-nvstreammux.html). 

![img](images/nvstreamux-control.png)

Of these, the most obvious ones to change are the width, height, and batch size.

The current Streammux configuration is as follows:

```python
# Set Input Width , Height and Batch Size 
streammux.set_property('width', 1920)
streammux.set_property('height', 1080)
streammux.set_property('batch-size', 1)
```

We set the width and height to the original video resolution so that Streammux does not have to scale the frames.

```python
# Set Input Width , Height  
streammux.set_property('width', 1280)
streammux.set_property('height', 720)
```

Let us also set the batch size equal to the number of input sources so that Streammux works at full capacity.

```python
# Set Batch Size
streammux.set_property('batch-size', num_sources)
```
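The three settings above can be gathered in one place so that they stay consistent when the number of sources changes. This is only a sketch: `streammux_properties`, `apply_properties`, and `FakeElement` are hypothetical names, and `FakeElement` merely stands in for the real GStreamer element so that the example runs without DeepStream installed.

```python
def streammux_properties(num_sources, source_width, source_height):
    """Return the nvstreammux properties discussed above as a dict.

    The output resolution matches the source resolution (no scaling)
    and the batch size equals the number of input sources.
    """
    if num_sources < 1:
        raise ValueError("num_sources must be at least 1")
    return {
        "width": source_width,
        "height": source_height,
        "batch-size": num_sources,
    }


def apply_properties(element, properties):
    """Apply each property through the element's set_property method."""
    for name, value in properties.items():
        element.set_property(name, value)


class FakeElement:
    """Stand-in for a GStreamer element; records the properties it is given."""

    def __init__(self):
        self.props = {}

    def set_property(self, name, value):
        self.props[name] = value


streammux = FakeElement()
apply_properties(streammux, streammux_properties(3, 1280, 720))
print(streammux.props)
```

In the real application, `streammux` would be the element created by the pipeline code, and `apply_properties` would be called on it in the same way.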

Let us now benchmark and profile our application. The code with these changes is bundled in a Python file [here](../source_code/utils/deepstream-no-osd-queue-streammux.py).

In [None]:
!python3 ../source_code/utils/deepstream-no-osd-queue-streammux.py --num-sources 3

The performance is better than before, so we can fit more streams. Let us now run with 16 streams.

In [None]:
!python3 ../source_code/utils/deepstream-no-osd-queue-streammux.py --num-sources 16

Appending this to the table from previous notebook. 

|Pipeline|Relative Time(V100)|Relative Time(A100)|
|---|----|---|
|Default Pipeline|baseline|baseline|
|With Queues|~3x|~3.1x|
|Without OSD |~3.1x|10x|
|With Queues and without OSD|~3.15x|~10.12x|
|Streammux - Optimization|~4x|~12.5x|

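The relative figures in the table are ratios against the default pipeline. Assuming they are derived from the measured throughput (FPS) of each run, the arithmetic can be sketched as follows. The FPS values used here are made-up placeholders, not measurements from this lab, and `relative_speedup` is a hypothetical helper.

```python
def relative_speedup(fps_by_pipeline, baseline="Default Pipeline"):
    """Express each pipeline's FPS as a multiple of the baseline's FPS."""
    base = fps_by_pipeline[baseline]
    if base <= 0:
        raise ValueError("baseline FPS must be positive")
    return {name: fps / base for name, fps in fps_by_pipeline.items()}


# Placeholder numbers for illustration only (not real measurements)
measured_fps = {
    "Default Pipeline": 50.0,
    "With Queues": 150.0,
    "Streammux - Optimization": 200.0,
}

for name, ratio in relative_speedup(measured_fps).items():
    print(f"{name}: {ratio:.2f}x")
```

Replacing the placeholder values with the FPS printed by your own runs gives the corresponding row for your GPU.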

Let us now profile this application to further optimize it.

In [None]:
!nsys profile --force-overwrite true -o ../source_code/reports/report2 python3 ../source_code/utils/deepstream-no-osd-queue-streammux.py --num-sources 16 --prof True

Download and save the report file by holding down <mark>Shift</mark> and <mark>Right-Clicking</mark> [Here](../source_code/reports/report2.qdrep)

Let us now open this in Nsight Systems and view it.

![Batch size](images/batch_size_nvinfer.png)


We can see that NVInfer takes a long time to process one batch of buffers, and the secondary inference takes even longer. Let us follow the same cycle and optimize the NVInfer parameters in the next section to increase the throughput.

#### Batch size across cascaded networks


We can refer to NVInfer control parameters [here](https://docs.nvidia.com/metropolis/deepstream/dev-guide/text/DS_plugin_gst-nvinfer.html). 

Just like for Streammux, let us now set the batch size for the primary and secondary inference. The code is available [here](../source_code/utils/deepstream-no-osd-queue-streammux-nvinfer.py).

For the primary inference, let us set the batch size equal to the number of sources, and let us try four different values for the secondary inference.


In [None]:
!python3 ../source_code/utils/deepstream-no-osd-queue-streammux-nvinfer.py --num-sources 3 --sgie-batch-size 3 

In [None]:
!python3 ../source_code/utils/deepstream-no-osd-queue-streammux-nvinfer.py --num-sources 3 --sgie-batch-size 9

In [None]:
!python3 ../source_code/utils/deepstream-no-osd-queue-streammux-nvinfer.py --num-sources 3 --sgie-batch-size 27

In [None]:
!python3 ../source_code/utils/deepstream-no-osd-queue-streammux-nvinfer.py --num-sources 3 --sgie-batch-size 81

Summarising the above results, the FPS increases as we increase the batch size, because processing buffers in batches makes the pipeline more efficient. However, when we raise the secondary inference batch size to a much higher value, the FPS decreases: the secondary inference stalls while it waits for enough buffers to fill the larger batch. The secondary batch size therefore needs to be tuned per application. For example, setting it to the average number of cars per frame is a reasonable estimate for this application.

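One way to turn that rule of thumb into a starting value is sketched below. The helper `estimate_sgie_batch_size` and the car counts are hypothetical. Multiplying the per-frame average by the number of sources is an assumption on my part (the secondary inference sees the objects from every frame in a batch), so treat the result as a first guess to benchmark rather than a recommended setting.

```python
import math
from statistics import mean


def estimate_sgie_batch_size(objects_per_frame, num_sources=1):
    """Estimate a secondary-inference batch size from object counts.

    objects_per_frame: sampled number of detected objects (cars) per frame.
    num_sources: frames batched together by the primary inference
                 (assumption: objects from all of them reach the SGIE at once).
    """
    if not objects_per_frame:
        raise ValueError("need at least one sample")
    per_batch = mean(objects_per_frame) * num_sources
    # Round up and never go below a batch size of 1
    return max(1, math.ceil(per_batch))


# Hypothetical car counts sampled from a few frames
car_counts = [7, 9, 10, 8, 11, 9]
print(estimate_sgie_batch_size(car_counts, num_sources=3))
```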
Adding this to our previous table : 

|Pipeline|Relative Time(V100)|Relative Time(A100)|
|---|----|---|
|Default Pipeline|baseline|baseline|
|With Queues|~3x|~3.1x|
|Without OSD |~3.1x|10x|
|With Queues and without OSD|~3.15x|~10.12x|
|Streammux - Optimization|~4x|~12.5x|
|NVInfer - Set Primary & Secondary batch size|~6x|~20.5x|

Let us run this for a higher number of streams and profile it for further optimization.

In [None]:
!nsys profile --force-overwrite true -o ../source_code/reports/report3 python3 ../source_code/utils/deepstream-no-osd-queue-streammux-nvinfer.py --num-sources 16 --sgie-batch-size 27 --prof True

Download and save the report file by holding down <mark>Shift</mark> and <mark>Right-Clicking</mark> [Here](../source_code/reports/report3.qdrep)

Let us now download and open the NSight systems report to review it.

![img](images/inference.png)

We can notice that we have significantly reduced the time taken to process per batch by setting Batch sizes for both the primary and secondary inferences. 

#### NVInfer 

Let us now try to reduce it further using the [configuration parameters](https://docs.nvidia.com/metropolis/deepstream/dev-guide/text/DS_plugin_gst-nvinfer.html#gst-nvinfer-file-configuration-specifications) available to us in NVInfer.

Some of the optional parameters that we can change are as follows. 

```txt
# Optional properties for detectors:
#   cluster-mode(Default=Group Rectangles), interval(Primary mode only, Default=0)
#   custom-lib-path,
#   parse-bbox-func-name


# Other optional properties:
#   net-scale-factor(Default=1), network-mode(Default=0 i.e FP32),
#   model-color-format(Default=0 i.e. RGB) model-engine-file, labelfile-path,
#   mean-file, gie-unique-id(Default=0), offsets, process-mode (Default=1 i.e. primary),
#   custom-lib-path, network-mode(Default=0 i.e FP32)
```

One crucial inference parameter here is the `network-mode`. We can use the `INT8` Quantized network to make our inference faster. 

The accepted `network-mode` values are as follows:

```txt
Integer 
0: FP32 
1: INT8 
2: FP16
```
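As a sketch of how this value could be switched from a script, the snippet below rewrites `network-mode` in an INI-style configuration using Python's standard `configparser`. The sample text is illustrative rather than the lab's actual file, the `[property]` section name is assumed from the nvinfer configuration format, and `set_network_mode` is a hypothetical helper. Note that `configparser` drops comments when writing, and that INT8 generally also needs calibration data for the model (see the nvinfer documentation linked above).

```python
import configparser
import io
from enum import IntEnum


class NetworkMode(IntEnum):
    """Values of `network-mode` as listed above."""
    FP32 = 0
    INT8 = 1
    FP16 = 2


def set_network_mode(config_text, mode):
    """Return config_text with network-mode in [property] set to `mode`."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key names exactly as written
    parser.read_string(config_text)
    parser["property"]["network-mode"] = str(int(mode))
    out = io.StringIO()
    parser.write(out, space_around_delimiters=False)
    return out.getvalue()


# Illustrative configuration text, not the lab's real config file
sample = "[property]\nbatch-size=3\nnetwork-mode=0\n"
print(set_network_mode(sample, NetworkMode.INT8))
```

Editing the file by hand works just as well; the point is only that `network-mode=1` selects INT8 and `network-mode=2` selects FP16.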

#### NVTracker

One more important parameter that we can use to make our inference faster is `interval`. It sets the number of consecutive batches to be skipped for inference.

We will then use NVTracker to keep track of our object's location. For NVTracker, we can set the low-level tracker as per our applications.

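To get a feel for how much work `interval` removes, the sketch below lists which batches would still be inferred. It assumes that inference runs on the first batch and then again after every `interval` skipped batches; the helper name `inferred_batches` is hypothetical. On the batches in between, the tracker is what keeps the object locations up to date.

```python
def inferred_batches(num_batches, interval):
    """Indices of the batches on which inference runs.

    Assumption: batch 0 is inferred, then `interval` consecutive batches
    are skipped before the next inference.
    """
    if interval < 0:
        raise ValueError("interval must be non-negative")
    return list(range(0, num_batches, interval + 1))


print(inferred_batches(10, 0))  # interval=0: every batch is inferred
print(inferred_batches(10, 2))  # interval=2: one batch in three is inferred
```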
![img](images/nvtracker.png)

We will use the Lightweight IOU tracker for our application. We can set the same using the following line in our tracker config file.

**In DeepStream 5.0** :
```
ll-lib-file=/opt/nvidia/deepstream/deepstream-5.0/lib/libnvds_mot_iou.so
```

**In DeepStream 6.0** :

```
ll-lib-file=/opt/nvidia/deepstream/deepstream/lib/libnvds_nvmultiobjecttracker.so
ll-config-file=config_tracker_NvDCF_perf.yml
```
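A minimal sketch of how such settings might be read from a config file and applied to the tracker element is shown below. The `[tracker]` section name, the helper `load_tracker_properties`, and the `FakeTracker` class are assumptions for illustration; `FakeTracker` only stands in for the real element so that the example runs without DeepStream.

```python
import configparser


def load_tracker_properties(config_text):
    """Parse the [tracker] section of an INI-style config into a dict."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    return dict(parser["tracker"])


class FakeTracker:
    """Stand-in for the tracker element; records the properties it is given."""

    def __init__(self):
        self.props = {}

    def set_property(self, name, value):
        self.props[name] = value


# Same two keys as in the DeepStream 6.0 snippet above
sample = """
[tracker]
ll-lib-file=/opt/nvidia/deepstream/deepstream/lib/libnvds_nvmultiobjecttracker.so
ll-config-file=config_tracker_NvDCF_perf.yml
"""

tracker = FakeTracker()
for key, value in load_tracker_properties(sample).items():
    tracker.set_property(key, value)
print(tracker.props)
```

Keeping the low-level library and its config in a file like this makes it possible to switch trackers without touching the pipeline code.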

Let us now run our application and benchmark it. The code is bundled in a single Python file [here](../source_code/utils/deepstream-no-osd-queue-streammux-nvinfer-nvtracker.py).

In [None]:
!python3 ../source_code/utils/deepstream-no-osd-queue-streammux-nvinfer-nvtracker.py --num-sources 3 --sgie-batch-size 27 

We can now append this to our previous table and compare with the previous results.


|Pipeline|Relative Time(V100)|Relative Time(A100)|
|---|----|---|
|Default Pipeline|baseline|baseline|
|With Queues|~3x|~3.1x|
|Without OSD |~3.1x|10x|
|With Queues and without OSD|~3.15x|~10.12x|
|Streammux - Optimization|~4x|~12.5x|
|NVInfer - Set Primary & Secondary batch size|~6x|~20.5x|
|NVTracker + INT8| 6.2x |~28.2x|


Let us now try to run multiple streams with this configuration.

In [None]:
!python3 ../source_code/utils/deepstream-no-osd-queue-streammux-nvinfer-nvtracker.py --num-sources 30 --sgie-batch-size 120

### Summary

We started with the baseline version of the pipeline and increased the FPS by almost 6x. The optimizations discussed in the first and second notebooks are parameters that are easy to tweak to get the most performance out of our DeepStream pipeline.


In the upcoming notebook, let us take another example and try to improve the application step-by-step.

## Licensing
  
This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0).
