In this lab, we will optimize the weather simulation application written in Fortran (if you prefer to use C++, click [this link](../../profiling-c.ipynb)). 

Let's start by displaying information about the GPUs on the server with the `pgaccelinfo` command, which ships with the PGI compiler that we will be using. To run the cell below, give it focus (click on it with your mouse) and press Ctrl-Enter, or press the play button in the toolbar above. If all goes well, you should see some output returned below the grey cell.

In [None]:
!pgaccelinfo

## Exercise 2 

### Learning objectives
Learn how to identify and parallelise the computationally expensive routines in your application using OpenACC compute constructs (a compute construct is a `parallel`, `kernels`, or `serial` construct). In this exercise you will:

- Implement OpenACC parallelism using parallel directives to parallelise the serial application
- Learn how to compile your parallel application with the PGI compiler
- Benchmark and compare the parallel version of the application with the serial version
- Learn how to interpret PGI compiler feedback to ensure the applied optimizations were successful

Before you start modifying the serial code, set `nx_glob`, `nz_glob`, and `sim_time` in the serial code to the default values `nx_glob = 40`, `nz_glob = 20`, and `sim_time = 10`. **NOTE:** We validate against these smaller values of `nx_glob`, `nz_glob`, and `sim_time`: **40, 20, 10**.

From the top menu, click on *File*, then *Open*, and select `miniWeather_serial.f90` from the `profiling/Fortran` directory. Remember to **SAVE** your code after making changes and before running the cells below.

Next, **Compile** and **Run** it.

In [None]:
!make

In [None]:
!./miniWeather

Now, let's copy the output of the serial code `reference.nc` to the `checker` folder for later use (see section [Validating Output](../../profiling-fortran.ipynb#Getting-Started)).

In [None]:
!cp reference.nc ../../checker/reference.nc

From the top menu, click on *File*, then *Open*, and select `miniWeather_openacc.f90` and `Makefile` from the `profiling/Fortran/lab2` directory. Inspect the code before running the cells below. We have already added OpenACC compute directives (`!$acc parallel loop`) around the expensive routines (loops) in the code.

Once done, compile the code with `make`. Then review the PGI compiler feedback (enabled by adding the `-Minfo=accel` flag), which provides useful information about the optimizations the compiler applied.

In [None]:
!make

Let's inspect part of the compiler feedback and see what it's telling us.

<img src="../../../images/ffeedback1-0.png">

- The `-ta=tesla:managed` option instructs the compiler to build for an NVIDIA Tesla GPU using "CUDA Managed Memory".
- The `-Minfo` command-line option shows all output from the compiler. In this example, we use `-Minfo=accel` to see only the output corresponding to the accelerator (in this case an NVIDIA GPU).
- The first line of the output, `compute_tendencies_x`, tells us which function the following information refers to.
- The lines starting with 247 and 252 show that the compiler created a parallel OpenACC loop. This loop is made up of gangs (a grid of blocks in CUDA terms) and vector parallelism (threads in CUDA terms), with a vector size of 128 per gang.
- The lines starting with 249 and 252, which read `Loop is parallelizable`, tell us that the compiler found loops to accelerate on these lines of the source code.
- The rest of the information concerns data movement. The compiler detected a possible need to move data and handled it for us. We will get into this later in this lab.
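
To make the gang and vector terminology concrete, the short Python sketch below estimates how many gangs would be needed to cover a loop nest when each gang runs 128 vector lanes. It is only an illustration under simplifying assumptions: the function name `gangs_needed` is hypothetical, the loop nest is assumed to be collapsed into a single iteration space of 20 × 40 iterations (the validation grid size), and each iteration is assumed to occupy one vector lane. The real launch configuration is chosen by the compiler and runtime and may differ.

```python
import math


def gangs_needed(iterations, vector_length=128):
    """Gangs required if every iteration gets its own vector lane (assumption)."""
    return math.ceil(iterations / vector_length)


# Hypothetical loop nest: nz_glob * nx_glob iterations on the validation grid.
iterations = 20 * 40
gangs = gangs_needed(iterations)
lanes = gangs * 128

print(f"{iterations} iterations -> {gangs} gangs of 128 lanes "
      f"({lanes - iterations} lanes idle)")
```

Under these assumptions, a small grid yields only a handful of gangs, which is worth keeping in mind when reading the profiler output later in this exercise.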

It is very important to inspect the feedback to make sure the compiler is doing what you have asked of it.

Now, **Run** the application in default mode (without entering any arguments) and **Validate** the output to verify that we are getting the correct result. We have already made a copy of `reference.nc` (the output of the serial code). Rename the current output to `new.nc`, copy it to the *checker* folder, and compare it to the “correct” output (`reference.nc`) by running `checker.py`, a simple script written to “check” the output as we make changes to the code (including offloading computation to the accelerator or applying optimizations). The `checker.py` script looks for the largest error and the largest “difference”, and computes the % difference.
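
The kind of comparison described above can be sketched in a few lines of plain Python. This is not the lab's `checker.py`: the function name `compare_fields` and the sample values are hypothetical, and the sketch assumes the two outputs have already been read into flat lists of numbers rather than loaded from the `.nc` files.

```python
def compare_fields(reference, new):
    """Return (largest absolute difference, largest percent difference)."""
    if len(reference) != len(new):
        raise ValueError("fields must have the same number of values")
    max_abs_diff = 0.0
    max_pct_diff = 0.0
    for ref, val in zip(reference, new):
        abs_diff = abs(val - ref)
        max_abs_diff = max(max_abs_diff, abs_diff)
        # Skip the percentage when the reference value is zero
        # to avoid dividing by zero.
        if ref != 0.0:
            max_pct_diff = max(max_pct_diff, 100.0 * abs_diff / abs(ref))
    return max_abs_diff, max_pct_diff


# Hypothetical sample values standing in for data from the two files.
reference = [300.0, 301.5, 299.0]  # would come from reference.nc
new = [300.0, 301.5, 299.5]        # would come from new.nc

abs_diff, pct_diff = compare_fields(reference, new)
print(f"largest absolute difference: {abs_diff:.3e}")
print(f"largest percent difference:  {pct_diff:.3e} %")
```

Small differences between the serial and accelerated runs are plausible because floating-point operations may be reordered on the GPU, so a check like this usually compares against a tolerance instead of requiring exact equality.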

**NOTE:** We validate against smaller values of `nx_glob`, `nz_glob`, and `sim_time`: **40, 20, 10** (make sure to modify the file accordingly).

In [None]:
!./miniWeather

Now, copy the output to the *checker* folder and **Validate** the output.

In [None]:
!cp reference.nc ../../checker/new.nc
!ipython ../../checker/checker.py

Once you have verified the output to ensure that the application is correct and its functionality is not broken, **Profile** it with the Nsight Systems command-line tool `nsys`.

In [None]:
!nsys profile -t nvtx --stats=true --force-overwrite true -o miniWeather_3 ./miniWeather

You can see that the changes actually slowed down the code: it runs slower than the non-accelerated, CPU-only version. Let's check out the profiler's report. [Download the profiler output](miniWeather_3.qdrep) and open it via the GUI.

In the "timeline view" in the top pane, double-click on "CUDA" in the function table on the left and expand it. Zoom in on the timeline and you can see a pattern similar to the screenshot below. The blue boxes are the compute kernels, and each of these groupings of kernels is surrounded by purple and teal boxes (annotated in red) representing data movements.

<img src="../../../images/nsys_slow.png" width="80%" height="80%">

Hover your mouse over the kernels (blue boxes) one by one from each row and check out the information provided.

<img src="../../../images/occu-1.png" width="60%" height="60%">

**Note**: In the next two exercises, we start optimizing the application by improving the occupancy and reducing data movements.

## Post-Lab Summary

If you would like to download this lab for later viewing, it is recommended that you go to your browser's File menu (not the Jupyter notebook File menu) and save the complete web page. This will ensure the images are copied down as well. You can also execute the following cell block to create a zip file of the files you've been working on, and download it with the link below.

In [None]:
%%bash
rm -f openacc_profiler_files.zip
zip -r openacc_profiler_files.zip *

**After** executing the above zip command, you should be able to download the zip file [here](files/openacc_profiler_files.zip).

-----

# <p style="text-align:center;border:3px; border-style:solid; border-color:#FF0000  ; padding: 1em"> <a href=../../../START_profiling.ipynb>HOME</a>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="float:center"> <a href=../lab3/profiling-fortran-lab3.ipynb>NEXT</a></span> </p>

-----

# Links and Resources

[OpenACC API Guide](https://www.openacc.org/sites/default/files/inline-files/OpenACC%20API%202.6%20Reference%20Guide.pdf)

[NVIDIA Nsight Systems](https://docs.nvidia.com/nsight-systems/)

[CUDA Toolkit Download](https://developer.nvidia.com/cuda-downloads)

**NOTE**: To be able to see the Nsight Systems profiler output, please download Nsight Systems version 2020.1 from [here](https://developer.nvidia.com/nsight-systems).

Don't forget to check out additional [OpenACC Resources](https://www.openacc.org/resources) and join our [OpenACC Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.