<p> <center> <a href="../start_here.ipynb">Home Page</a> </center> </p>

 
<div>
    <span style="float: left; width: 51%; text-align: right;">
        <a href="01_introduction.ipynb">1</a>
        <a href="02_pytorch_mnist.ipynb">2</a>
        <a href="03_data_transfer.ipynb">3</a>
        <a >4</a>
        <a href="05_summary.ipynb">5</a>
    </span>
    <span style="float: right; width: 49%; text-align: right;"><a href="05_summary.ipynb">Next Notebook</a></span>
</div>

# Part 1: Tensor Cores 
---

The goal of this notebook is to show how to enable mixed precision (FP32/FP16) on Tensor Cores to further optimize our application.

## Tensor Core Usage

Tensor Cores are specialized processing units that accelerate tensor/matrix multiplication. They enable mixed-precision computing, dynamically adapting calculations to accelerate throughput while preserving accuracy. Our application runs on an `NVIDIA® DGX™ A100` system, whose GPUs are based on the `NVIDIA Ampere architecture`. You can also run the application on other GPU architectures that provide Tensor Cores, for example the `NVIDIA Turing™ architecture`.


<center><img src=images/tensor_cores.jpg height="60%" width="60%" ></center>

The screenshot below shows a table of NVIDIA GPU architectures and supported Tensor Core precisions.

<center><img src=images/architecture_tensor_cores.jpg height="70%" width="70%"> </center>
<center><a href="https://www.nvidia.com/en-us/data-center/tensor-cores/"> Source: NVIDIA website</a></center>

## Analyze the Profile Report
To verify whether the application uses Tensor Cores, we will use a feature of NVIDIA Nsight™ Systems: **GPU performance metrics sampling**. Recall that in the previous notebook, when profiling the application after the second optimization, we passed the `--gpu-metrics-device=all` option to the Nsight Systems CLI. This option enables GPU metrics sampling, which measures the utilization of different GPU subsystems: hardware counters within the GPU are read periodically and used to generate performance metrics.

Let's analyze the application's Tensor Core usage by examining the report (`secondOptimization.nsys-rep`) in the Nsight Systems GUI. Scroll down to the bottom of the timeline until you see the rows for GPU metrics. Expand the `SM instructions` timeline row to see the `Tensor Active` metric, which is the ratio of `cycles the SM tensor pipes or FP16x2 pipes were active issuing tensor instructions` to `the number of cycles in the sample period`, expressed as a percentage.

<img src=images/report_activate_tensor.jpg>

As shown in the screenshot above, `Tensor Active` has an `average of 5.7%` and a `maximum of 45%`, so the application already uses the Tensor Cores on the A100 GPU. This is not the case on other architectures, however. For example, in the `secondOptimization.nsys-rep` report collected on an NVIDIA Turing™ GPU, the `Tensor Active / FP16 Active` row stays at zero. On such GPUs, Tensor Core utilization has to be enabled explicitly, for example with `automatic mixed precision (AMP)`.


<img src=images/TensorCoreUsage.jpg>
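
The `Tensor Active` percentage described above is a ratio computed once per sample period, and the GUI summarizes the samples as an average and a maximum. The following is only a minimal sketch of that arithmetic: the function name and the sample values are hypothetical and are not taken from either report.

```python
from statistics import mean


def tensor_active_percent(active_cycles: int, period_cycles: int) -> float:
    """Percentage of cycles in one sample period with the tensor pipes active."""
    if period_cycles <= 0:
        raise ValueError("period_cycles must be positive")
    return 100.0 * active_cycles / period_cycles


# Hypothetical (active_cycles, period_cycles) samples, for illustration only.
samples = [(0, 1000), (450, 1000), (50, 1000), (100, 1000)]

percentages = [tensor_active_percent(a, p) for a, p in samples]
average_percent = mean(percentages)
peak_percent = max(percentages)

print(f"average: {average_percent:.1f}%, maximum: {peak_percent:.1f}%")
```

A row that stays at zero for every sample, as on the Turing GPU before enabling AMP, means no tensor instructions were issued during the capture.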


## Automatic Mixed Precision (AMP)

Mixed precision is the combined use of different numerical formats `(single- and half-precision computation)` in the training of a deep neural network:
- single precision: FP32
- half precision: FP16
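
To make the difference between the two formats concrete, the sketch below uses Python's standard `struct` module (format codes `f` for single precision and `e` for half precision) to compare storage size and rounding. It is an illustration only: the helper `round_trip` and the chosen values are assumptions, and PyTorch is not involved.

```python
import struct

FP32, FP16 = "<f", "<e"  # IEEE 754 single and half precision


def round_trip(value: float, fmt: str) -> float:
    """Store a Python float in the given format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


# Storage cost per value: FP16 needs half the bytes of FP32.
fp32_bytes = struct.calcsize(FP32)
fp16_bytes = struct.calcsize(FP16)

# Precision: FP16 keeps fewer significant digits.
pi = 3.14159265
pi_fp32 = round_trip(pi, FP32)
pi_fp16 = round_trip(pi, FP16)

# Range: a very small value survives in FP32 but underflows to zero in FP16.
tiny = 1e-8
tiny_fp32 = round_trip(tiny, FP32)
tiny_fp16 = round_trip(tiny, FP16)

print(f"bytes per value: FP32={fp32_bytes}, FP16={fp16_bytes}")
print(f"pi stored as FP32={pi_fp32!r}, FP16={pi_fp16!r}")
print(f"1e-8 stored as FP32={tiny_fp32!r}, FP16={tiny_fp16!r}")
```

The smaller storage size is where the memory savings come from, while the reduced precision and range are the reason mixed precision keeps some computations in FP32.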

Mixed precision is available on NVIDIA GPU architectures such as `Ampere`, `Volta™`, and `Turing`. The benefits include:

- faster math-intensive operations, which run on Tensor Cores,
- lower memory bandwidth requirements, which speeds up data transfer operations, and
- lower memory requirements, which makes it possible to train and deploy larger neural networks.

AMP automates the process of training with mixed precision in deep neural network (DNN) frameworks. PyTorch has the [Automatic Mixed Precision (AMP)](https://pytorch.org/docs/stable/amp.html) package, which provides a simple way for users to convert existing FP32 training scripts to mixed FP32 and FP16 precision. This unlocks faster computation with Tensor Cores on NVIDIA GPUs. In the screenshot below you will see the code changes made `(framed in green)` to use the AMP package in PyTorch.

<img src=images/amp.jpg height="50%" width="50%">
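
The screenshot above shows the changes made in this tutorial's code. As a rough, self-contained sketch of the same general pattern, the example below wraps the forward pass in `torch.autocast` and scales the loss with `torch.cuda.amp.GradScaler` (recent PyTorch releases also expose the scaler as `torch.amp.GradScaler`). The model, the random input data, and the `train_step` function are hypothetical stand-ins and may differ from what `main_opt3.py` actually does.

```python
import torch
from torch import nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
use_amp = device.type == "cuda"  # fall back to plain FP32 when no GPU is present

# Hypothetical stand-in model; not the network used in the notebook.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# The scaler multiplies the loss so small FP16 gradients do not underflow.
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)


def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """Run one optimizer step, using mixed precision when AMP is enabled."""
    optimizer.zero_grad(set_to_none=True)
    # Inside autocast, eligible operations run in reduced precision.
    with torch.autocast(device_type=device.type, enabled=use_amp):
        loss = loss_fn(model(images), labels)
    scaler.scale(loss).backward()  # backward pass on the scaled loss
    scaler.step(optimizer)         # unscales gradients, then steps the optimizer
    scaler.update()                # adjusts the scale factor for the next step
    return loss.item()


# Random MNIST-shaped batch, for illustration only.
images = torch.randn(64, 1, 28, 28, device=device)
labels = torch.randint(0, 10, (64,), device=device)
loss_value = train_step(images, labels)
print(f"loss after one step: {loss_value:.4f}")
```

With `enabled=False`, both `autocast` and `GradScaler` become no-ops, so the same training loop runs unchanged on machines without a CUDA GPU.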

Profile again and verify the code change addresses Tensor Core usage on the Turing GPU. 

In [None]:
!nsys profile --trace cuda,osrt,nvtx \
--capture-range cudaProfilerApi \
--gpu-metrics-device=all \
--output ../reports/thirdOptimization_env \
--force-overwrite true \
python3 ../source_code/main_opt3.py

Open the report generated by the command above (`thirdOptimization_env.nsys-rep`) in the GUI. Scroll down to view the `Tensor Active / FP16 Active` timeline row.

<img src=images/Optimization3.jpg>

Now we can see Tensor Core usage on the Turing GPU. The main contribution of AMP is that it reduces kernel time by running kernels on Tensor Cores, thereby achieving a speedup.

## Compare the Performance Before and After the Optimizations
Now that we have addressed three different performance problems, we will time the application [main_opt3.py](../source_code/main_opt3.py).

In [None]:
!cd ../source_code && time python3 main_opt3.py

**Expected output on A100 GPUs**:

```text
Train Epoch: 1 [0/60000 (0%)]	Loss: 2.308961

Test set: Average loss: 0.1024, Accuracy: 9683/10000 (97%)

Train Epoch: 2 [0/60000 (0%)]	Loss: 0.154755

Test set: Average loss: 0.0608, Accuracy: 9814/10000 (98%)

Train Epoch: 3 [0/60000 (0%)]	Loss: 0.110753

Test set: Average loss: 0.0535, Accuracy: 9827/10000 (98%)

----------------------------------------------------------

Train Epoch: 9 [0/60000 (0%)]	Loss: 0.059646

Test set: Average loss: 0.0370, Accuracy: 9865/10000 (99%)

Train Epoch: 10 [0/60000 (0%)]	Loss: 0.055018

Test set: Average loss: 0.0365, Accuracy: 9865/10000 (99%)


real	1m24.619s
user	2m35.069s
sys	 0m7.676s

```


Comparing the time taken to run our baseline code [main_baseline.py](../source_code/main_baseline_nvtx.py) from [notebook 2](02_pytorch_mnist.ipynb) with the time taken by the code after applying all three optimizations, [main_opt3.py](../source_code/main_opt3.py), we see that the overall run time has decreased, as shown in the table below.


|Training code|Time|Speedup|
|--|--|--|
|baseline|113s|-|
|optimized|~85s|1.3x|
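
The speedup in the table is the baseline time divided by the optimized time. As a small sketch of that arithmetic, the snippet below converts the `real` value from the expected output to seconds and computes the ratio; the helper `parse_time` is a hypothetical name introduced for illustration.

```python
def parse_time(value: str) -> float:
    """Convert a shell `time` value such as '1m24.619s' to seconds."""
    minutes, seconds = value.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


baseline_seconds = 113.0                      # baseline time from the table
optimized_seconds = parse_time("1m24.619s")   # `real` time from the expected output
speedup = baseline_seconds / optimized_seconds

print(f"optimized: {optimized_seconds:.1f}s, speedup: {speedup:.1f}x")
```

Timings vary from run to run and between GPUs, so substitute the values measured on your own system.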


## Links and Resources


[NVIDIA Nsight Systems](https://docs.nvidia.com/nsight-systems/)


**NOTE**: To be able to see the profiler output, please download the latest version of NVIDIA Nsight Systems from [here](https://developer.nvidia.com/nsight-systems).


You can also find resources on the [Open Hackathons technical resource page](https://www.openhackathons.org/s/technical-resources).


--- 

## Licensing 

Copyright © 2022 OpenACC-Standard.org. This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). These materials may include references to hardware and software developed by other entities; all applicable licensing and copyrights apply.

<div>
    <span style="float: left; width: 51%; text-align: right;">
        <a href="01_introduction.ipynb">1</a>
        <a href="02_pytorch_mnist.ipynb">2</a>
        <a href="03_data_transfer.ipynb">3</a>
        <a >4</a>
        <a href="05_summary.ipynb">5</a>
    </span>
    <span style="float: left; width: 49%; text-align: right;"><a href="05_summary.ipynb">Next Notebook</a></span>
</div>

<br>
<p> <center> <a href="../start_here.ipynb">Home Page</a> </center> </p>