<p> <center> <a href="../start_here.ipynb">Home Page</a> </center> </p>

 
<div>
    <span style="float: left; width: 51%; text-align: right;">
        <a href="01_introduction.ipynb">1</a>
        <a href="02_pytorch_mnist.ipynb">2</a>
        <a href="03_data_transfer.ipynb">3</a>
        <a >4</a>
        <a href="tb05_summary.ipynb">5</a>
    </span>
    <span style="float: right; width: 49%; text-align: right;"><a href="tb05_summary.ipynb">Next Notebook</a></span>
</div>

# Part 2: Tensor Cores 
---

The goal of this notebook is to show how to enable mixed precision (FP32/FP16) on Tensor Cores to further optimize our application.

## Tensor Core Usage

Tensor Cores are specialized processing units designed to accelerate tensor/matrix multiplication. They enable mixed-precision computing, dynamically adapting calculations to accelerate throughput while preserving accuracy. Our application runs on a DGX A100, whose GPUs are based on the `Ampere architecture`. You can also run the application on other GPUs that have Tensor Cores, for example those based on the `Turing GPU architecture`.


<center><img src=images/tensor_cores.jpg height="60%" width="60%" ></center>

The screenshot below shows a table of GPU architectures and the Tensor Core precisions each one supports.

<center><img src=images/architecture_tensor_cores.jpg height="70%" width="70%"> </center>
<center><a href="https://www.nvidia.com/en-us/data-center/tensor-cores/"> view source here</a></center>

## Analyze the TensorBoard visualization

To verify Tensor Core usage, check the `GPU Summary` frame. You can also verify it in the `Kernel View` by selecting `GPU Kernel` in the `Views` dropdown. In the screenshots from the previous notebook shown below, `Kernel Time using Tensor Cores` is `30.7%`, and the `Kernel View` shows the corresponding `Tensor Core Utilization`. We can see this because Tensor Core utilization is automatic on Ampere architecture GPUs such as those in the DGX A100. However, this may not be the case if you are running the lab on other GPU architectures. Our aim now is to introduce `Automatic Mixed Precision (AMP)` for Tensor Core operations.

<img src=images/profile_summary_opt22_tensor_core.jpg>
<img src=images/tensor_core_util_opt22.jpg>

## Automatic Mixed Precision (AMP)

Mixed precision is the combined use of different numerical formats `(single and half precision computation)` in the training of a deep neural network.
- single precision: FP32
- half precision: FP16
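
To get a feel for what the two formats can represent, here is a minimal sketch that uses only the Python standard library (the `struct` format codes `f` and `e` correspond to IEEE 754 single and half precision). The helper names `to_fp32` and `to_fp16` are made up for this illustration and are not part of PyTorch or the lab code.

```python
import struct

def to_fp32(x):
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def to_fp16(x):
    """Round a Python float (FP64) to the nearest FP16 value."""
    return struct.unpack("e", struct.pack("e", x))[0]

# FP16 keeps 11 significant bits, so not every integer above 2048 is representable.
print(to_fp32(2049.0))      # 2049.0
print(to_fp16(2049.0))      # 2048.0

# Very small values (for example tiny gradients) underflow to zero in FP16.
print(to_fp32(1e-8) > 0.0)  # True
print(to_fp16(1e-8))        # 0.0

# The largest finite FP16 value is 65504.
print(to_fp16(65504.0))     # 65504.0
```

The narrower range and precision of FP16 is the reason mixed precision keeps some values in FP32 instead of converting everything to half precision.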


Mixed precision can be used on GPU architectures such as `Ampere`, `Volta`, and `Turing`. The benefits include:
- math-intensive operations are sped up by Tensor Cores
- less memory bandwidth is required, so data transfer operations are faster
- less memory is required, so larger neural networks can be trained and deployed.
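
As a rough way to check whether a GPU belongs to one of these architectures, you can look at its CUDA compute capability: Volta reports 7.0, Turing 7.5, and the A100 (Ampere) 8.0. The sketch below is only a heuristic, and the helper name `likely_has_tensor_cores` is made up for this illustration. On a machine with PyTorch and a CUDA GPU, the `(major, minor)` tuple can be obtained from `torch.cuda.get_device_capability()`.

```python
def likely_has_tensor_cores(capability):
    """Heuristic: Tensor Cores were introduced with compute capability 7.0 (Volta)."""
    major, _minor = capability
    return major >= 7

# Compute capabilities of a few data center GPUs.
examples = {
    "P100 (Pascal)": (6, 0),
    "V100 (Volta)": (7, 0),
    "T4 (Turing)": (7, 5),
    "A100 (Ampere)": (8, 0),
}

for name, capability in examples.items():
    print(f"{name}: {likely_has_tensor_cores(capability)}")
```

Treat the result as a hint rather than proof: some consumer Turing GPUs (the GTX 16 series) report 7.5 but have no Tensor Cores, so the profiler views remain the reliable check.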

Automatic Mixed Precision (AMP) automates the process of training with mixed precision through DNN frameworks. PyTorch has an [Automatic Mixed Precision (AMP)](https://pytorch.org/docs/stable/amp.html) package, which provides a simple way for users to convert existing FP32 training scripts to mixed FP32 and FP16 precision. This unlocks faster computation with Tensor Cores on NVIDIA GPUs.

Before we introduce `AMP` into our application code, our first step is to search for `fp16` operations running on the Tensor Cores in the `Kernel View`. The search returns no matching kernels, so no `fp16` operations are present yet.

<img src=images/fp16_opt22.jpg>


The screenshot below shows the code changes (in the green frame) made to use the AMP package in PyTorch.

<img src=images/amp.jpg height="50%" width="50%">
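
In text form, an AMP change of this kind typically looks like the sketch below. It is not the lab's actual code: the tiny model, the random batch, and the function name `train_step_amp` are placeholders invented for this illustration. It assumes a PyTorch version that provides `torch.autocast` and `torch.cuda.amp.GradScaler` (newer releases also offer `torch.amp.GradScaler`), and it falls back to `bfloat16` autocast without gradient scaling when no CUDA GPU is present.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

device_type = "cuda" if torch.cuda.is_available() else "cpu"
use_cuda = device_type == "cuda"
# FP16 is the usual choice on Tensor Core GPUs; CPU autocast uses bfloat16.
amp_dtype = torch.float16 if use_cuda else torch.bfloat16

# Placeholder model and optimizer (not the MNIST network used in the lab).
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10)).to(device_type)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# GradScaler scales the loss so that small FP16 gradients do not underflow.
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

def train_step_amp(data, target):
    optimizer.zero_grad()
    # Ops inside autocast run in lower precision where it is considered safe.
    with torch.autocast(device_type=device_type, dtype=amp_dtype):
        output = model(data)
        loss = F.cross_entropy(output, target)
    # Backward pass on the scaled loss, then unscale and step the optimizer.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()

# Random stand-in for one batch of 28x28 images with 10 classes.
data = torch.randn(64, 1, 28, 28, device=device_type)
target = torch.randint(0, 10, (64,), device=device_type)
print(f"loss: {train_step_amp(data, target):.4f}")
```

Only the forward pass and the loss computation sit inside the `autocast` block; the backward pass and optimizer step go through the scaler, which is why the change to an existing FP32 training loop stays small.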

Let's profile again and verify whether the code change enables `fp16` ops on the Tensor Cores.

In [None]:
!python3 ../source_code/tb_main_opt2.py

Next, run the cell below to visualize the profile in TensorBoard. If you are working on a remote machine, remember to `port-forward` as described in the previous notebooks before opening the browser at `localhost:6006/`.

In [None]:
!tensorboard --logdir=../log

<img src=images/fp16_opt33.jpg>


Now you can see `fp16` ops under the `Name` column, and the `Yes` under the `Tensor Cores Used` column confirms that these ops run on the Tensor Cores. The impact on our application is a reduction in the time spent on computation, because math-intensive operations were sped up by `AMP`. This is verifiable in the `GPU Summary` frame shown below.

<img src=images/profile_summary_opt33.jpg>

<img src=images/step_time_breakdown_opt33.jpg>

<img src=images/trace_view_opt33.jpg>

The following changes were observed after `AMP` was activated:

- GPU usage increased from `37%` to `52%`.
- Time spent by the DataLoader was further reduced from `4,638µs` (~27.9%) to `948µs` (10.7%).

## Compare the Performance Before and After the Optimizations
Now that we have addressed three different performance problems, let's time the application [tb_main_opt3.py](../source_code/tb_main_opt3.py).

In [None]:
!cd ../source_code && time python3 main_opt3.py

**Expected output on DGX A100**:

```text
Train Epoch: 1 [0/60000 (0%)]	Loss: 2.308961

Test set: Average loss: 0.1024, Accuracy: 9683/10000 (97%)

Train Epoch: 2 [0/60000 (0%)]	Loss: 0.154755

Test set: Average loss: 0.0608, Accuracy: 9814/10000 (98%)

Train Epoch: 3 [0/60000 (0%)]	Loss: 0.110753

Test set: Average loss: 0.0535, Accuracy: 9827/10000 (98%)

----------------------------------------------------------

Train Epoch: 9 [0/60000 (0%)]	Loss: 0.059646

Test set: Average loss: 0.0370, Accuracy: 9865/10000 (99%)

Train Epoch: 10 [0/60000 (0%)]	Loss: 0.055018

Test set: Average loss: 0.0365, Accuracy: 9865/10000 (99%)


real	1m24.619s
user	2m35.069s
sys	 0m7.676s

```


Comparing the time taken to run our baseline code [main_baseline.py](../source_code/tb_main_baseline_nvtx.py) from [notebook 2](tb02_pytorch_mnist.ipynb) with the time taken after applying the three optimizations so far [main_opt3.py](../source_code/main_opt3.py), we see that the overall run time has been reduced, as shown in the table below.


|Training code|Time|Speedup|
|--|--|--|
|baseline|113s|-|
|optimized|~85s|1.3x|
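
The speedup column is simply the baseline time divided by the optimized time. As a quick check, the sketch below converts the `real` value reported by `time` in the expected output above into seconds and computes the ratio. The helper name `parse_time` is made up for this illustration.

```python
def parse_time(value):
    """Convert a `time`-style duration such as '1m24.619s' into seconds."""
    minutes, seconds = value.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)

baseline_s = 113.0                      # baseline time from the table
optimized_s = parse_time("1m24.619s")   # `real` time of the optimized run

speedup = baseline_s / optimized_s
print(f"optimized: {optimized_s:.1f}s, speedup: {speedup:.1f}x")
# optimized: 84.6s, speedup: 1.3x
```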


## Links and Resources


[NVIDIA Nsight Systems](https://docs.nvidia.com/nsight-systems/)


**NOTE**: To view the Nsight Systems profiler output, please download the latest version of Nsight Systems from [here](https://developer.nvidia.com/nsight-systems).

You can also find resources on the [Open Hackathons technical resource page](https://www.openhackathons.org/s/technical-resources).

---
## Licensing
  
This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0).
