<p> <center> <a href="../start_here.ipynb">Home Page</a> </center> </p>

 
<div>
    <span style="float: left; width: 51%; text-align: right;">
        <a href="tb01_introduction.ipynb">1</a>
        <a>2</a>
        <a href="tb03_data_transfer.ipynb">3</a>
        <a href="tb04_tensor_core_util.ipynb">4</a>
        <a href="05_summary.ipynb">5</a>
    </span>
    <span style="float: right; width: 49%; text-align: right;"><a href="tb03_data_transfer.ipynb">Next Notebook</a></span>
</div>

# Part 2: Optimizing PyTorch's MNIST Training Program
---

This notebook is focused on optimizing a deep neural network (DNN) training program using the Modified National Institute of Standards and Technology (MNIST) dataset. 


## Running the Application

The MNIST database consists of normalized, fixed-size images of handwritten digits. It includes 60k training examples and 10k test examples. Click [here](http://yann.lecun.com/exdb/mnist/) to learn more about the dataset (`LeCun et al., 1998`). In this lab, the MNIST database is used to train a DNN that recognizes handwritten digits. Our training program is adapted from the [PyTorch GitHub](https://github.com/pytorch/examples/tree/master/mnist) examples and is written using the PyTorch framework. The image below shows an example of normalized digits from the testing set.

<center><img src="images/mnist.jpg"></center>
<center><a href="http://yann.lecun.com/exdb/publis/pdf/lecun-90c.pdf"> View source here</a> </center>
    
Run the cell below to execute the baseline training program.

In [None]:
!cd ../source_code && time python3 main_baseline.py

**Expected output on the NVIDIA® DGX™ A100**:

```python

Train Epoch: 1 [0/60000 (0%)]	Loss: 2.308954

Test set: Average loss: 0.1049, Accuracy: 9668/10000 (97%)

Train Epoch: 2 [0/60000 (0%)]	Loss: 0.166372

Test set: Average loss: 0.0629, Accuracy: 9805/10000 (98%)

Train Epoch: 3 [0/60000 (0%)]	Loss: 0.116849

Test set: Average loss: 0.0555, Accuracy: 9826/10000 (98%)

---------------------------------------------------------

Train Epoch: 9 [0/60000 (0%)]	Loss: 0.057983

Test set: Average loss: 0.0373, Accuracy: 9862/10000 (99%)

Train Epoch: 10 [0/60000 (0%)]	Loss: 0.055643

Test set: Average loss: 0.0368, Accuracy: 9869/10000 (99%)


real	1m53.085s
user	2m20.216s
sys	 0m6.212s

```
It takes approximately `2 minutes` to execute the `10 epochs` in the training program.
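
As a quick sanity check on that figure, the sketch below converts the `real`, `user`, and `sys` values reported by `time` into seconds and derives an average wall-clock time per epoch. The helper name `to_seconds` is hypothetical, and the numbers are copied from the expected output above; your own run will differ.

```python
def to_seconds(stamp):
    """Convert a `time`-style stamp such as '1m53.085s' to seconds (hypothetical helper)."""
    minutes, seconds = stamp.strip().rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)

# Values copied from the expected output of the baseline run.
real = to_seconds("1m53.085s")
user = to_seconds("2m20.216s")
sys_time = to_seconds("0m6.212s")
epochs = 10

# Average wall-clock time per epoch. This includes start-up, data loading,
# and the test pass after each epoch, so it is only a rough figure.
per_epoch = real / epochs

# A ratio above 1 means CPU time was spent on more than one core.
cpu_to_wall = (user + sys_time) / real

print(f"wall-clock total : {real:.3f} s")
print(f"average per epoch: {per_epoch:.3f} s")
print(f"CPU time / wall  : {cpu_to_wall:.2f}")
```

The per-epoch average is a convenient baseline to compare against after each optimization step.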

## Profile the Application

Recall that in Part 1 of the lab we summarized the train function in [main_baseline.py](../source_code/main_baseline.py) as follows: 
- Data is copied from CPU to the GPU (device),
- Forward pass runs on the GPU, and
- Backward pass runs on the GPU.

The first step in our optimization workflow is to create the Profiler object `prof` by passing a `profiler.schedule` and a `tensorboard_trace_handler` to the PyTorch Profiler. The section of the application code to be profiled is then enclosed between calls to `prof.start()` and `prof.stop()`. The code below previews this, and the `green frame` in the screenshot of our PyTorch MNIST `main` method shows where it is placed.
  


```python
import torch.profiler as profiler

# ... (rest of the program elided) ...

# Build the Profiler object: the schedule decides which steps are traced,
# and the trace handler writes the results for TensorBoard.
prof = profiler.profile(
    schedule=profiler.schedule(wait=1, warmup=1, active=3, repeat=2),
    on_trace_ready=profiler.tensorboard_trace_handler('../log/mnist'),
    record_shapes=True,
    with_stack=True,
)
prof.start()
# ...
# code to profile
# ...
prof.stop()
```

<img src=images/tb_main.jpg width=70%>


Let's briefly review the profiling commands used.

- `schedule=profiler.schedule(wait=1, warmup=1, active=3, repeat=2)`: schedules when profiling starts. This command was explained in the previous notebook. The Profiler skips the first step, warms up on the next step, and records the profiling trace for the following three steps. `repeat=2` means the profiling cycle runs twice.
- `on_trace_ready=profiler.tensorboard_trace_handler('../log/mnist')`: handles the profile traces produced under `profiler.schedule` and saves them in `../log/mnist` for visualization in TensorBoard
- `record_shapes=True`: records the shapes of the operators' inputs
- `with_stack=True`: records source information (file and line number) for the ops
- `prof.start()`: starts profiling
- `prof.stop()`: stops profiling
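
To make the schedule concrete, the sketch below models which phase each training step falls into for `wait=1, warmup=1, active=3, repeat=2`. It is a simplified pure-Python model written for illustration, not PyTorch's own implementation; the function name `schedule_action` and the phase labels are assumptions.

```python
def schedule_action(step, wait=1, warmup=1, active=3, repeat=2):
    """Simplified model of the profiling cycle (illustrative only)."""
    cycle = wait + warmup + active
    # Once the requested number of cycles has finished, nothing more is traced.
    if repeat > 0 and step // cycle >= repeat:
        return "idle"
    position = step % cycle
    if position < wait:
        return "wait"      # step is skipped
    if position < wait + warmup:
        return "warmup"    # profiler is running, results are discarded
    return "record"        # step is traced

actions = [schedule_action(step) for step in range(12)]
for step, action in enumerate(actions):
    print(step, action)
```

Under this model, each cycle covers five steps, three of which are traced, so the two cycles together trace six steps out of the first ten.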


Our focus is on the training process, so the profiler object `prof` is passed to the train method. There, `prof.step()` is called once per training step so that the trace is captured at the active steps specified in `profiler.schedule`.
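
A minimal sketch of this pattern follows, assuming a tiny linear model and random tensors in place of the MNIST network and dataset, and a temporary directory in place of `../log/mnist`. These stand-ins are assumptions chosen to keep the example self-contained; only the placement of `prof.start()`, `prof.step()`, and `prof.stop()` mirrors the text above.

```python
import os
import tempfile

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.profiler as profiler


def train(model, device, batches, optimizer, prof):
    """One pass over `batches`, signalling the profiler after every step."""
    model.train()
    losses = []
    for data, target in batches:
        data, target = data.to(device), target.to(device)  # copy to device
        optimizer.zero_grad()
        output = F.log_softmax(model(data), dim=1)          # forward pass
        loss = F.nll_loss(output, target)
        loss.backward()                                     # backward pass
        optimizer.step()
        prof.step()  # tell the profiler that one training step has finished
        losses.append(loss.item())
    return losses


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Stand-in model and data (assumptions, not the lab's MNIST network).
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
batches = [(torch.randn(8, 1, 28, 28), torch.randint(0, 10, (8,)))
           for _ in range(10)]

log_dir = tempfile.mkdtemp()  # stand-in for '../log/mnist'
prof = profiler.profile(
    schedule=profiler.schedule(wait=1, warmup=1, active=3, repeat=2),
    on_trace_ready=profiler.tensorboard_trace_handler(log_dir),
    record_shapes=True,
    with_stack=True,
)
prof.start()
losses = train(model, device, batches, optimizer, prof)
prof.stop()

print("steps run:", len(losses))
print("trace files:", os.listdir(log_dir))
```

Ten batches are used so that both five-step profiling cycles can complete within the loop.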


<img src=images/tb_train.png width=60%>

Please run the cell below to profile the application.

In [None]:
!python3 ../source_code/tb_main_baseline_profiler.py

After profiling, run the cell below to visualize the profile trace in the `TensorBoard`.

In [None]:
!tensorboard --logdir=../log

There are two ways to view the visualization. First, if you are running this on your local machine, open the Google Chrome browser and type `localhost:6006/`

<img src=images/browse_port.jpg width=90%>

Second, if you are running on a remote machine, for example on the DGX A100, you have to do `port forwarding` on port `6006` as shown in the screenshot below. Next, open the Google Chrome browser and type `localhost:6006/`

<img src=images/port_forwarding.jpg width=80%>

Now we can analyze the profile using the TensorBoard visualization.

## Analyze the Profile

An overview is shown below.

<img src=images/profile_summary.jpg>

From the `GPU Summary` frame, the `GPU utilization` is under 8%, which is very low. The `kernel` row of the `Execution Summary` panel shows the same. The most noticeable entry is the `DataLoader`, which took ~89% of the time. In the `Step Time Breakdown` panel below, the `DataLoader` is also seen to consume most of the step time (in microseconds).

<img src=images/step_time_breakdown.jpg>
<img src=images/performance_recommendation.jpg>

Finally, the `Performance Recommendation` panel suggests setting the number of workers (`num_workers`) on the `DataLoader` and enabling multi-process data loading. There are two steps to do this:
- Increase the value of `num_workers` (if already set)
- Enable `pinned memory`, because transferring data from `pageable memory` on the host to the device adds to the CPU bottleneck (data transfers between host and device)
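
The two steps above can be sketched as `DataLoader` keyword arguments. The helper name `build_loader_kwargs` and the batch size of 64 are assumptions for illustration; `num_workers`, `pin_memory`, `shuffle`, and `batch_size` are documented `torch.utils.data.DataLoader` parameters.

```python
def build_loader_kwargs(batch_size, use_cuda, num_workers=2, pin_memory=True):
    """Hypothetical helper: assemble keyword arguments for a DataLoader."""
    kwargs = {"batch_size": batch_size}
    if use_cuda:
        kwargs.update({
            "num_workers": num_workers,  # worker subprocesses that load data
            "pin_memory": pin_memory,    # page-locked host memory for transfers
            "shuffle": True,
        })
    return kwargs


train_kwargs = build_loader_kwargs(batch_size=64, use_cuda=True)
print(train_kwargs)

# The dictionary would then be unpacked into the loader, for example:
#   train_loader = torch.utils.data.DataLoader(dataset, **train_kwargs)
```

Keeping the settings in one dictionary makes it easy to change a single value, such as `num_workers`, between profiling runs.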

## Optimize Code to Address the CPU Bottleneck
Let's inspect the data loader [torch.utils.data.DataLoader](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) used in our application. From the code in `tb_main_baseline_nvtx.py` shown below, a single worker subprocess is asynchronously loading the data.

<img src=images/NumberOfWorkers.jpg width=50%>

To increase the overlap between data loading and training on the GPU, the `num_workers` parameter should be increased. Run the following cell to see the code changes made to tune this parameter.

In [None]:
!diff -d --color=always ../source_code/tb_main_baseline_profiler.py ../source_code/tb_main_opt1.py

Depending on the number of CPU cores available on the target system, we can increase `num_workers` to `2` or more, up to the total number of CPU cores returned by `multiprocessing.cpu_count()`, to improve the overlap. In rare situations, setting `num_workers` to `multiprocessing.cpu_count()` prompts the following warning:

```python
/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py:487: UserWarning: This DataLoader will create 256 worker processes in total. 
Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. 
Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
```

In that case, you may have to lower the value to the recommended maximum number of workers. Let's increase the number of workers to 2 by setting `'num_workers': 2`. 
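
One plausible reason for such a warning is that `multiprocessing.cpu_count()` reports every core on the machine, while the process may only be allowed to run on a few of them (for example, inside a container). The sketch below caps a requested worker count at the number of cores the process can actually use. The helper names `usable_cores` and `choose_num_workers` are hypothetical, and this is an assumption about a sensible cap rather than PyTorch's own logic.

```python
import multiprocessing
import os


def usable_cores():
    """Cores this process may run on; falls back to the total core count."""
    if hasattr(os, "sched_getaffinity"):  # not available on every platform
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def choose_num_workers(requested):
    """Cap the requested worker count at the usable core count."""
    return max(0, min(requested, usable_cores()))


print("cpu_count       :", multiprocessing.cpu_count())
print("usable cores    :", usable_cores())
print("num_workers used:", choose_num_workers(2))
```

The resulting value could then be passed as `num_workers` when the `DataLoader` is created.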

## Profile to Verify Optimization
Profile again by executing the cell below to verify if the code change addresses the bottleneck.

In [None]:
!python3 ../source_code/tb_main_opt1.py

After profiling, run the cell below to visualize the profile trace in the `TensorBoard`.

In [None]:
!tensorboard --logdir=../log

If you are working on a remote machine, remember to do `port-forwarding` as described above before opening the browser at `localhost:6006/`.

<img src=images/profile_summary_opt11.jpg>
<img src=images/trace_view_opt11.jpg>


Now we can see that the `GPU utilization` increases from `~8%` to `38%` in the `GPU Summary` and `Trace view` panels. Meanwhile, the `DataLoader` running time on the CPU has dropped from `~89%` to `48.34%`, as shown in the `Execution Summary` panel. The `Step Time Breakdown` likewise reflects these changes, especially in the third and fourth steps.

<img src=images/step_time_breakdown_opt11.jpg>

Proceed to the next notebook to implement the second proposed memory optimization.

## Links and Resources


[NVIDIA Nsight™ Systems](https://docs.nvidia.com/nsight-systems/)


**NOTE**: To be able to see the profiler output, please download the latest version of NVIDIA Nsight Systems from [here](https://developer.nvidia.com/nsight-systems).


You can also get resources from [Open Hackathons technical resource page](https://www.openhackathons.org/s/technical-resources)


--- 

## Licensing 

Copyright © 2022 OpenACC-Standard.org. This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). These materials may include references to hardware and software developed by other entities; all applicable licensing and copyrights apply.
