Before we begin, run the cell below to display information about the NVIDIA® CUDA® driver and the GPUs available on the server using the `nvidia-smi` command. To execute the cell, click on it and press Ctrl+Enter, or press the play button in the toolbar above. You should see output appear below the grey cell.

In [None]:
!nvidia-smi

## Exercise 2 

### Learning objectives

The **goals** of this lab are to:
- Implement OpenACC parallelism using parallel directives to parallelize the serial application
- Learn how to compile your parallel application with the NVIDIA HPC compiler
- Benchmark and compare the parallel version of the application with the serial version
- Learn how to interpret NVIDIA HPC compiler feedback to ensure the applied optimization was successful

We do not intend to cover:

- The OpenACC programming model
- Advanced optimization techniques in detail

After inspecting the profiler report from the terminal, we noticed that most of the computation is done in `perform_timestep`. So, we ported the code to the GPU using the OpenACC programming model by adding OpenACC compute directives (`#pragma acc parallel`) around the expensive routines (loops) in the code. Open <b>[miniWeather_openacc.cpp](../source_code/lab2/miniWeather_openacc.cpp)</b> and <b>[Makefile](../source_code/lab2/Makefile)</b> and inspect the code before running the cells below. 
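To make the idea of wrapping a loop in a compute directive concrete, here is a minimal sketch. It is not taken from `miniWeather_openacc.cpp`: the function `scaled_add` and its arguments are hypothetical and only illustrate the pattern. It assumes the code is built with managed memory (as with the `-ta=tesla:managed` flag discussed in this lab), so no explicit data clauses are written. A compiler without OpenACC support simply ignores the pragma and runs the loop serially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of miniWeather): computes y[i] = a * x[i] + y[i].
void scaled_add(double a, const std::vector<double>& x, std::vector<double>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    const double* xp = x.data();
    double* yp = y.data();

    // Every iteration is independent of the others, so the loop can be
    // parallelized as written. Assumption: the build uses managed memory,
    // so the runtime migrates the data and no data clauses are needed here.
    #pragma acc parallel loop
    for (std::size_t i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}
```

Calling `scaled_add(2.0, x, y)` with `x = {1, 2, 3}` and `y = {10, 20, 30}` leaves `y` holding `{12, 24, 36}`, whether or not the loop is offloaded.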

Once done, compile the code with `make`. The NVIDIA HPC compiler feedback (enabled by the `-Minfo=accel` flag) is printed during compilation; review it for the OpenACC code. It provides useful information about the optimizations the compiler applied.

In [None]:
# compile the C/C++ code
!cd ../source_code/lab2 && make clean && make

Let's inspect part of the compiler feedback and see what it's telling us (the lines in the compiler feedback might be slightly different for you).

<img src="images/cfeedback1_.png">

- The `-ta=tesla:managed` flag instructs the compiler to build for an NVIDIA GPU using "CUDA Managed Memory".
- The `-Minfo` command-line option prints all compiler feedback. In this example, we use `-Minfo=accel` to see only the feedback related to the accelerator (in this case an NVIDIA GPU).
- The line starting with `compute_tendencies_x` tells us which function the subsequent feedback refers to.
- The line starting with 278 shows that the compiler created a parallel OpenACC loop. This loop is made up of gangs (a grid of blocks in CUDA terms) and vector parallelism (threads in CUDA terms), with a vector size of 128 per gang: `278, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */`
- The rest of the information concerns data movement. The compiler detected the possible need to move data and handled it for us. We will get into this later in this lab.
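As a rough mental model of the `gang, vector(128)` feedback, assume each gang corresponds to one CUDA thread block and each vector lane handles one loop iteration. Under that assumption, the number of gangs needed to cover a loop is the iteration count divided by the vector length, rounded up. The helper `gangs_needed` below is hypothetical and only illustrates the arithmetic; the actual launch configuration is chosen by the compiler and runtime and may differ.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of gangs (thread blocks) needed so that
// `iterations` loop iterations are covered when each gang provides
// `vector_length` lanes. This is a ceiling division.
std::size_t gangs_needed(std::size_t iterations, std::size_t vector_length) {
    assert(vector_length > 0);
    return (iterations + vector_length - 1) / vector_length;
}
```

For example, a loop of 400 iterations with a vector length of 128 needs 4 gangs under this model, with some lanes in the last gang left idle.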

It is very important to inspect the feedback to make sure the compiler is doing what you have asked of it.

Now, let's **profile** the application for smaller values of `nx_glob`, `nz_glob`, and `sim_time`: **40, 20, 100**.

In [None]:
!cd ../source_code/lab2 && nsys profile -t nvtx,openacc --stats=true --force-overwrite true -o miniWeather_2 ./miniWeather 40 20 100

You can see that the changes actually slowed the code down: it now runs slower than the non-accelerated, CPU-only version. Let's review the profiler's report. Download and save the report file by holding down <mark>Shift</mark> and <mark>right-clicking</mark> [Here](../source_code/lab2/miniWeather_2.nsys-rep), then choosing <mark>Save Link As</mark>. Once done, open the report locally in the NVIDIA Nsight™ Systems user interface (UI). 

The timeline of the application is shown below.

<img src="images/1_timeline_full.png" width=90%>

Hovering over the blue chart in the CUDA device row, we see that the CUDA kernel coverage on the GPU is about 80-90% throughout. This means that the GPU is idle for the remaining 10-20% of the time.

<img src="images/1_gpu_row.png" width=90%>

**Let's zoom into the timeline to see what's going on.** Press the Ctrl key while moving the mouse scroll wheel up or down to zoom into or out of the area around the mouse pointer. Another way to zoom in is to select the region you want to zoom into and press *Shift*+*Z* keys.

<img src="images/1_timeline.png" width=90%>

Zoom into the OpenACC row. Nsight Systems is capable of capturing information about OpenACC execution in the profiled process. Under the CPU rows in the timeline tree, each thread that uses OpenACC will show OpenACC trace information. You can click on an OpenACC application programming interface (API) call to see the correlation with the underlying CUDA API calls (highlighted in teal). If the OpenACC API results in GPU work, that will also be highlighted:

<img src="images/1_correlation.png" width=90%>

If you hover over a particular OpenACC construct, it will bring up a tooltip with details about that construct:

<img src="images/1_openacc_row.png" width=90%>

In the "Timeline view" in the top pane, double-click "CUDA" in the function table on the left and expand it. Zoom in on the timeline and you will see a pattern similar to the screenshot below. There is a clear repeating pattern in which the GPU is idle for some time, followed by a burst of kernel and memory operations. The blue boxes are the compute kernels, and each group of kernels is surrounded by purple and teal boxes (annotated in red) representing data movement. **The screenshots show the profiler report for the values 400, 200, 200.**

<img src="images/nsys_slow.png" width=90%>

Hover your mouse over the kernels (blue boxes) one by one in each row and review the information provided.

<img src="images/occu-1.png" width=90% >


**Note**: In the next two exercises, we will start optimizing the application by improving occupancy and reducing data movement.

## Post-Lab Summary

If you would like to download this lab for later viewing, it is recommended you go to your browser's file menu (not the Jupyter notebook file menu) and save the complete web page.  This will ensure the images are copied down as well. You can also execute the following cell block to create a zip file of the files you have been working on, and download it with the link below.

In [None]:
%%bash
cd ..
rm -f _profiler_files.zip
zip -r _profiler_files.zip *

**After** executing the above zip command, you should be able to download and save the zip file by holding down <mark>Shift</mark> and <mark>right-clicking</mark> [Here](../_profiler_files.zip), then choosing <mark>Save Link As</mark>. 

-----

# <p style="text-align:center; border:3px; border-style:solid; border-color:#FF0000  ; padding: 1em"> <a href=../_start_profiling.ipynb>HOME</a>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="float:center"> <a href=profiling_lab3.ipynb>NEXT</a></span> </p>

-----

# Links and Resources

[OpenACC API Guide](https://www.openacc.org/sites/default/files/inline-files/OpenACC%20API%202.6%20Reference%20Guide.pdf)

[NVIDIA Nsight Systems](https://docs.nvidia.com/nsight-systems/)

[CUDA Toolkit Download](https://developer.nvidia.com/cuda-downloads)

**NOTE**: To view the Nsight Systems profiler output, please download the latest version of Nsight Systems from [here](https://developer.nvidia.com/nsight-systems).

Don't forget to check out additional [Open Hackathons Resources](https://www.openhackathons.org/s/technical-resources) and join our [OpenACC and Hackathons Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.

--- 

## Licensing 

Copyright © 2022 OpenACC-Standard.org.  This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). These materials may include references to hardware and software developed by other entities; all applicable licensing and copyrights apply.