In this lab, we will optimize the weather simulation application written in Fortran (if you prefer to use C++, click [this link](../../C/jupyter_notebook/profiling-c.ipynb)). 

Let's display information about the GPUs available on the server by running the `pgaccelinfo` command, which ships with the PGI compiler that we will be using. To do this, execute the cell below by giving it focus (clicking on it with your mouse) and pressing Ctrl-Enter, or by pressing the play button in the toolbar above. If all goes well, you should see some output returned below the grey cell.

In [None]:
!pgaccelinfo

## Exercise 4

### Learning objectives
Learn how to improve the performance of the application by managing data movement and reducing unnecessary data transfers. In this exercise you will:

- Learn about unified memory and how it automatically migrates data between the CPU and the GPU
- Learn how to enable it with the PGI compiler's managed option, and how to profile managed memory
- Learn how to identify redundant memory copies with Nsight Systems
- Learn how to improve efficiency by reducing extra data copies with the OpenACC `data` directive
- Learn how to use PGI compiler feedback as guidance on where to insert the OpenACC `data` directives
- Apply `data` directives to the parallel application, then benchmark and profile it

Let's inspect the profiler report from the previous exercise. In the "timeline view" in the top pane, double-click "CUDA" in the function table on the left to expand it. Zoom in on the timeline and you should see a pattern similar to the screenshot below. The blue boxes are the compute kernels, and each group of kernels is surrounded by purple and teal boxes (annotated in red) that represent data movement.

This timeline shows that we are moving a lot of data between the GPU and the CPU.
    
<img src="images/nsys_data_mv.png">

The compiler feedback we collected earlier tells us quite a bit about data movement too. If we look again at the compiler feedback from above, we see the following.

<img src="images/ffeedback3.png" width="90%" height="90%">

The compiler feedback is telling us that the compiler has inserted data movement around our parallel region at line 277. It copies the `hy_dens_cell`, `hy_dens_theta_cell`, and `state` arrays in and out of GPU memory, and it also copies the `flux` array out.

The compiler can only work with the information we provide. It knows we need the `hy_dens_cell`, `hy_dens_theta_cell`, `state`, and `flux` arrays on the GPU for the accelerated section within the `compute_tendencies_x` function, but we didn't tell the compiler anything about what happens to the data outside of those sections. Without this knowledge, the compiler has to copy the full arrays to the GPU and back to the CPU for each accelerated section, which results in many unnecessary data transfers.

Ideally, we would move the data (for example, the `hy_dens_cell`, `hy_dens_theta_cell`, and `state` arrays) to the GPU once at the beginning, and transfer it back to the CPU only at the end, if needed. The `flux` array in this example does not need to be copied in either direction, so we only need to allocate space for it on the device (GPU).

We need to give the compiler information about how to reduce the extra and unnecessary data movement. By adding the OpenACC `data` directive to a structured code block, we tell the compiler how to manage data according to the clauses we specify. For information on the `data` directive clauses, please see the [OpenACC 3.0 Specification](https://www.openacc.org/sites/default/files/inline-images/Specification/OpenACC.3.0.pdf).
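As a rough illustration of the idea above, the following minimal sketch wraps a time-step loop in a structured `data` region. It is not the miniWeather code: the array sizes, the loop bodies, and the `nsteps` value are hypothetical, and only the array names are borrowed from this lab. It uses `copyin` for read-only input, `copy` for data needed on both sides, and `create` for device-only scratch space.

```fortran
program data_region_sketch
  implicit none
  ! Hypothetical sizes, chosen only for illustration
  integer, parameter :: n = 1000, nsteps = 10
  real(8) :: state(n), hy_dens_cell(n), flux(n)
  integer :: i, step

  hy_dens_cell = 1.0d0
  state = 0.0d0

  ! One data region around the whole time loop:
  !   copyin : read-only input, copied host -> device once at region entry
  !   copy   : copied in at region entry and back out at region exit
  !   create : allocated on the device only, never transferred
  !$acc data copyin(hy_dens_cell) copy(state) create(flux)
  do step = 1, nsteps
    ! The arrays are already on the device, so no transfer happens here
    !$acc parallel loop
    do i = 1, n
      flux(i) = 0.5d0 * hy_dens_cell(i)
    end do
    !$acc parallel loop
    do i = 1, n
      state(i) = state(i) + flux(i)
    end do
  end do
  !$acc end data

  ! Self-check: each step adds 0.5, so after 10 steps state(i) should be 5
  if (abs(state(1) - 5.0d0) > 1.0d-12) error stop "unexpected result"
  print '(a,f6.2)', 'state(1) = ', state(1)
end program data_region_sketch
```

Because the `!$acc` lines are ordinary comments to a compiler that is not given an OpenACC flag, the same file also builds and runs on the CPU, which can be handy for checking correctness before profiling on the GPU.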

Now, add `data` directives to the code, save the file, re-compile via `make`, and profile it again.

Click on the <b>[miniWeather_openacc.f90](../source_code/lab4/miniWeather_openacc.f90)</b> and <b>[Makefile](../source_code/lab4/Makefile)</b> links and modify `miniWeather_openacc.f90` and `Makefile`. Remember to **SAVE** your code after making changes and before running the cells below.

In [None]:
!cd ../source_code/lab4 && make clean && make

Let us start by inspecting the compiler feedback to see whether it applied the optimizations. Below is a screenshot of the expected compiler feedback after adding the `data` directives. Starting at line 104, the compiler manages data according to the provided clauses. In other words, it assumes the data is already present on the GPU and copies data to the GPU only if it is not already there.

<img src="images/ffeedback4.png" width="90%" height="90%">
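To make the "already present" behavior more concrete, here is a small hedged sketch. The module name `tend_sketch`, the subroutine `add_flux`, and the array sizes are hypothetical and are not taken from `miniWeather_openacc.f90`. The `present` clause states that the arrays were placed on the device by an enclosing `data` region, so the subroutine itself does not request any transfers.

```fortran
module tend_sketch
  implicit none
contains
  ! Hypothetical helper: adds flux to state on the device
  subroutine add_flux(state, flux, n)
    integer, intent(in) :: n
    real(8), intent(inout) :: state(n)
    real(8), intent(in) :: flux(n)
    integer :: i
    ! present: the caller's data region already put these arrays on the
    ! device, so no copy is requested for this kernel
    !$acc parallel loop present(state, flux)
    do i = 1, n
      state(i) = state(i) + flux(i)
    end do
  end subroutine add_flux
end module tend_sketch

program present_sketch
  use tend_sketch
  implicit none
  integer, parameter :: n = 8
  real(8) :: state(n), flux(n)

  state = 1.0d0
  flux = 2.0d0

  ! Data moves once here, no matter how many times add_flux is called
  !$acc data copy(state) copyin(flux)
  call add_flux(state, flux, n)
  call add_flux(state, flux, n)
  !$acc end data

  ! Self-check: 1 + 2 + 2 = 5 for every element
  if (any(abs(state - 5.0d0) > 1.0d-12)) error stop "unexpected result"
  print '(a,f6.2)', 'state(1) = ', state(1)
end program present_sketch
```

If `add_flux` were called outside a `data` region in an OpenACC build, the `present` clause would be expected to fail at run time, which makes missing data regions easier to spot than silent extra copies.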

Now, **Profile** your code with the Nsight Systems command-line tool, `nsys`.

In [None]:
!cd ../source_code/lab4 && nsys profile -t nvtx,openacc --stats=true --force-overwrite true -o miniWeather_5 ./miniWeather

[Download the profiler output](../source_code/lab4/miniWeather_5.qdrep) and open it via the GUI. Have a look at the example expected output below:

<img src="images/nsys_fast_mv.png">

Have a look at the data movements annotated in red and compare them with the previous versions. We have accelerated the application and reduced the execution time by eliminating the unnecessary data transfers between CPU and GPU.

**Note**: The next exercise, which is optional, gives an introduction to the Nsight Compute tool.

## Post-Lab Summary

If you would like to download this lab for later viewing, it is recommended that you go to your browser's File menu (not the Jupyter notebook File menu) and save the complete web page. This will ensure the images are copied down as well. You can also execute the following cell block to create a zip file of the files you've been working on, and download it with the link below.

In [None]:
%%bash
cd ..
rm -f openacc_profiler_files.zip
zip -r openacc_profiler_files.zip *

**After** executing the above zip command, you should be able to download the zip file [here](../openacc_profiler_files.zip).

-----

# <p style="text-align:center;border:3px; border-style:solid; border-color:#FF0000  ; padding: 1em"> <a href=../../profiling_start.ipynb>HOME</a>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="float:center"> <a href=profiling-fortran-lab5.ipynb>NEXT</a></span> </p>

-----

# Links and Resources

[OpenACC API Guide](https://www.openacc.org/sites/default/files/inline-files/OpenACC%20API%202.6%20Reference%20Guide.pdf)

[NVIDIA Nsight Systems](https://docs.nvidia.com/nsight-systems/)

[CUDA Toolkit Download](https://developer.nvidia.com/cuda-downloads)

**NOTE**: To be able to see the Nsight Systems profiler output, please download the latest version of Nsight Systems from [here](https://developer.nvidia.com/nsight-systems).

Don't forget to check out additional [OpenACC Resources](https://www.openacc.org/resources) and join our [OpenACC Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.

--- 

## Licensing 

This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0).