<p> <center> <a href="../start_here.ipynb">Home Page</a> </center> </p>

 
<div>
    <span style="float: left; width: 51%; text-align: right;">
        <a href="01_introduction.ipynb">1</a>
        <a href="02_pytorch_mnist.ipynb">2</a>
        <a>3</a>
        <a href="04_tensor_core_util.ipynb">4</a>
        <a href="05_summary.ipynb">5</a>
    </span>
    <span style="float: right; width: 49%; text-align: right;"><a href="04_tensor_core_util.ipynb">Next Notebook</a></span>
</div>

# Part 1: Data Transfers Between Host (CPU) and Device (GPU)
---

The objective of this notebook is to optimize data transfer between Host and Device. 


## Analyze the Report

Let's analyze the data transfers between the host and the graphics processing unit (GPU) recorded in the report `firstOptimization.nsys-rep` from the first optimization step. Open the report in the NVIDIA® Nsight™ Systems graphical user interface (GUI). Expand the NVIDIA CUDA® device row by clicking the small triangle in front of it. Select the `Memory` row, right-click it, and choose the `Show in Events View` option, as shown below.

<img src="images/report_show_in_events_view.jpg" height="30%" width="30%">


This populates the `Events View` window with the memory operations listed in chronological order. Click the `Duration` column header to sort the table by duration so that the longest memory operation appears first. Right-click the first entry in the table and select `Show Current on Timeline`, as illustrated below.

<img src=images/report_show_current_timeline.jpg>


This zooms into the event on the timeline and the teal highlights help you find the CUDA API call, `cudaMemcpyAsync`, that initiated the memory operation on the GPU (see the image below). Note: You may have to zoom out and/or scroll up to find the CUDA API call on the CPU thread.

<img src=images/report_api_call.jpg>


The timeline shows the following:
- All Host-to-Device (HtoD) memcopies use pageable memory, which:
    - is slower to transfer, and
    - causes the `cudaMemcpyAsync` API call on the CPU thread to block until the operation completes on the GPU.
- The longest memcpy operation takes ~385 µs to complete on the GPU.
- The CUDA API call (`cudaMemcpyAsync`) corresponding to the longest memcpy operation takes almost 0.5 ms.

## Optimize the Application to Use Pinned Memory

Host (CPU) memory allocations are pageable by default. The GPU cannot access data directly from pageable host memory. When a data transfer is invoked from pageable host memory to device memory, the CUDA driver must first allocate a temporary page-locked (or “pinned”) host array, copy the host data to the pinned array, and then transfer the data from the pinned array to the device memory. The pinned memory acts as a staging area for transfers from the host to the device. By allocating our host data directly in pinned memory, we can avoid this extra step and its overhead. See this [blog post](https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/) for more details.

<img src=images/PageableVsPinned.jpg width=50%>
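
The difference between the two paths can be observed directly from PyTorch. The following is a minimal sketch, not part of the tutorial's source code: the helper `time_h2d_copy` and the batch shape are hypothetical, and the timings depend on the system. It times the same host-to-device copy from a pageable tensor and from a pinned copy of it, using CUDA events.

```python
import torch


def time_h2d_copy(host_tensor: torch.Tensor, repeats: int = 10) -> float:
    """Return the mean elapsed time (ms) per host-to-device copy (hypothetical helper)."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    host_tensor.to("cuda")  # warm-up copy so CUDA initialization is not timed
    torch.cuda.synchronize()
    start.record()
    for _ in range(repeats):
        host_tensor.to("cuda", non_blocking=True)
    end.record()
    torch.cuda.synchronize()  # wait until the end event has been reached
    return start.elapsed_time(end) / repeats


# An ordinary host allocation is pageable (assumed shape: one batch of MNIST-sized images).
pageable = torch.randn(64, 1, 28, 28)
print("pageable.is_pinned():", pageable.is_pinned())

if torch.cuda.is_available():
    # pin_memory() returns a copy of the tensor in page-locked host memory.
    pinned = pageable.pin_memory()
    print("pinned.is_pinned():", pinned.is_pinned())
    print(f"pageable HtoD: {time_h2d_copy(pageable):.3f} ms per copy")
    print(f"pinned   HtoD: {time_h2d_copy(pinned):.3f} ms per copy")
else:
    print("No CUDA device found; skipping the timing comparison.")
```

On a machine without a GPU the sketch only reports that the default allocation is not pinned.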


The settings used for the data loader [torch.utils.data.DataLoader](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) in our application rely on the default value of `pin_memory: False`. Execute the cell below to see the code change (the added line, marked with `+` and shown in green) that enables pinned memory.

In [1]:
!diff -U4 --color=always ../source_code/main_opt1.py ../source_code/main_opt2.py

--- ../source_code/main_opt1.py	2022-08-18 03:51:54.178624632 +0900
+++ ../source_code/main_opt2.py	2022-08-19 23:25:01.750840799 +0900
@@ -160,8 +160,9 @@
     test_kwargs = {'batch_size': args.test_batch_size}
     if use_cuda:
         #multiprocessing.cpu_count()
         cuda_kwargs = {'num_workers': 2,
+                       'pin_memory': True,
                        'shuffle': True}
         train_kwargs.update(cuda_kwargs)
         test_kwargs.update(cuda_kwargs)
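
For context, here is a minimal, self-contained sketch of how such keyword arguments reach the `DataLoader` and how the resulting batches are moved to the device. It is an illustration under assumptions, not the tutorial's script: the dataset is synthetic, `num_workers` is left at its default to keep the example simple, and whether `main_opt2.py` passes `non_blocking=True` when copying is not shown in the diff.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# Synthetic stand-in for MNIST (hypothetical data, not the tutorial's dataset).
images = torch.randn(256, 1, 28, 28)
labels = torch.randint(0, 10, (256,))
dataset = TensorDataset(images, labels)

train_kwargs = {"batch_size": 64}
if use_cuda:
    # Same idea as the diff above: only request pinned memory when a GPU is used.
    train_kwargs.update({"pin_memory": True, "shuffle": True})

loader = DataLoader(dataset, **train_kwargs)

num_batches = 0
for data, target in loader:
    # When the batches come from pinned memory, non_blocking=True allows the copy
    # to be issued without blocking the CPU thread until it completes.
    data = data.to(device, non_blocking=True)
    target = target.to(device, non_blocking=True)
    num_batches += 1

print(f"moved {num_batches} batches to {device}")
```

Guarding `pin_memory` behind `use_cuda` mirrors the structure of the diff and avoids requesting page-locked memory on a CPU-only machine.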
 


## Profile Again to Verify Optimization
Profile the application again by executing the cell below to verify that setting `pin_memory: True` addresses the problem with host-to-device memory transfers.

In [None]:
!nsys profile --trace cuda,osrt,nvtx \
--capture-range cudaProfilerApi \
--gpu-metrics-device=all \
--output ../reports/secondOptimization \
--force-overwrite true \
python3 ../source_code/main_opt2.py
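
The `--capture-range cudaProfilerApi` option limits collection to the part of the run that the application delimits with the CUDA profiler start and stop calls. How `main_opt2.py` marks its region is not shown in this notebook, so the following is only a sketch of one way to do it from PyTorch. The context manager `profiled_region` and the range name are hypothetical; `torch.cuda.profiler.start()`, `torch.cuda.profiler.stop()`, and the `torch.cuda.nvtx` range functions are existing PyTorch calls.

```python
import contextlib

import torch


@contextlib.contextmanager
def profiled_region(name: str):
    """Delimit a capture range and an NVTX range around a block (hypothetical helper)."""
    if not torch.cuda.is_available():
        yield  # nothing to capture without a GPU
        return
    torch.cuda.profiler.start()       # wraps cudaProfilerStart
    torch.cuda.nvtx.range_push(name)  # named range for the NVTX trace
    try:
        yield
    finally:
        torch.cuda.synchronize()      # let queued GPU work finish inside the range
        torch.cuda.nvtx.range_pop()
        torch.cuda.profiler.stop()    # wraps cudaProfilerStop


device = "cuda" if torch.cuda.is_available() else "cpu"
with profiled_region("sample_steps"):
    x = torch.randn(64, 784, device=device)
    w = torch.randn(784, 10, device=device)
    y = x @ w

print(tuple(y.shape))
```

Restricting the capture range this way keeps start-up work such as data download and CUDA initialization out of the report.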

Open the report (`secondOptimization.nsys-rep`) in the GUI. As before, expand the `CUDA device` row, select the `Memory` row, and right-click it to choose `Show in Events View`. Sort the table in the `Events View` by duration to list the longest memory operation first, then right-click the topmost event and select `Show Current on Timeline`. You should see a view like the one shown below.

<img src="images/report_pinned_memory.jpg" >


In the profile collected after optimization, we observe that:
- All HtoD memcopies now use pinned memory,
- The longest memcpy now takes only `183µs`, compared to `~385µs` before optimization, and
- The `cudaMemcpyAsync` API call corresponding to the longest memcpy is reduced from `490µs` to `~36µs`.
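
To put these numbers in perspective, the short calculation below turns them into speedup factors. It uses only the approximate durations quoted above, so the results are rough.

```python
# Durations reported in the two profiles, in microseconds (approximate values).
memcpy_before_us, memcpy_after_us = 385, 183
api_before_us, api_after_us = 490, 36

memcpy_speedup = memcpy_before_us / memcpy_after_us
api_speedup = api_before_us / api_after_us

print(f"longest HtoD memcpy:  {memcpy_speedup:.1f}x faster")   # 2.1x
print(f"cudaMemcpyAsync call: {api_speedup:.1f}x shorter")     # 13.6x
```

The API call shrinks far more than the transfer itself because, with pinned memory, the CPU thread no longer waits for the copy to complete on the GPU.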

Now that we have addressed a bottleneck with memory transfers, let's identify the next performance bottleneck by clicking on the `Next Notebook` link below.

## Links and Resources


[NVIDIA Nsight Systems](https://docs.nvidia.com/nsight-systems/)


**NOTE**: To be able to see the profiler output, please download the latest version of NVIDIA Nsight Systems from [here](https://developer.nvidia.com/nsight-systems).


You can also get resources from the [Open Hackathons technical resource page](https://www.openhackathons.org/s/technical-resources).


--- 

## Licensing 

Copyright © 2022 OpenACC-Standard.org. This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). These materials may include references to hardware and software developed by other entities; all applicable licensing and copyrights apply.

<div>
    <span style="float: left; width: 51%; text-align: right;">
        <a href="01_introduction.ipynb">1</a>
        <a href="02_pytorch_mnist.ipynb">2</a>
        <a >3</a>
        <a href="04_tensor_core_util.ipynb">4</a>
        <a href="05_summary.ipynb">5</a>
    </span>
    <span style="float: left; width: 49%; text-align: right;"><a href="04_tensor_core_util.ipynb">Next Notebook</a></span>
</div>

<br>
<p> <center> <a href="../start_here.ipynb">Home Page</a> </center> </p>