In this lab, we will optimize the weather simulation application written in C++ (if you prefer to use Fortran, click [this link](../../Fortran/jupyter_notebook/profiling-fortran.ipynb)). 

Let's start by displaying information about the GPUs on the server with the `pgaccelinfo` command, which ships with the PGI compiler that we will be using. To run it, give the cell below focus (click on it with your mouse), then press Ctrl-Enter or the play button in the toolbar above. If all goes well, you should see some output returned below the grey cell.

In [None]:
!pgaccelinfo

## Exercise 1 

###  Learning objectives
Learn how to assess your serial application, compile it, profile it with Nsight Systems, and find the hotspots. In this exercise you will:

- Learn how to compile your serial application with the PGI compiler
- Learn how to benchmark and profile the serial code using NVIDIA Nsight Systems
- Learn how to identify routines responsible for the bulk of the execution time via NVTX markers shown on the Nsight Systems timeline
- Learn about scaling and Amdahl’s law

To identify opportunities for parallelizing the code, it is very important to first understand its structure.

**Understand and analyze** the code present at:
 
[Serial Code](../source_code/lab1/miniWeather_serial.cpp) 

[Makefile](../source_code/lab1/Makefile)

Open the downloaded file for inspection.

**Compile** the code with the PGI compiler by running `make`. You can get compiler feedback by adding the `-Minfo` flag. Some of the available options are:

- `accel` – Print compiler operations related to the accelerator
- `all` – Print all compiler output
- `intensity` – Print loop intensity information

Example usage: `-Minfo=accel`

In [None]:
!cd ../source_code/lab1 && make clean && make

Next, we **profile** the serial code with the Nsight Systems command line (see the example command below) and download the report to investigate the serial code further.

`nsys profile -t nvtx --stats=true --force-overwrite true -o miniWeather_1 ./miniWeather`

For the example command above, we download the profiler output (`miniWeather_1.qdrep`) and open it in the Nsight Systems UI. In the timeline view, check out the NVTX markers displayed as part of the threads. **Why are we using NVTX?** Please see the section on [Using NVIDIA Tools Extension (NVTX)](profiling-c.ipynb#Using-NVIDIA-Tools-Extension-(NVTX))

<img src="images/e1-nvtx_gui.png">

You can also check out the NVTX statistics in the terminal console once the profiling session has ended. The NVTX statistics show that most of the execution time is spent in `perform_timestep`. This is a function worth checking out.

<img src="images/e1-nvtx_terminal.png">
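The named ranges in the NVTX statistics come from markers placed around regions of the code. The following is a minimal sketch of that idea, not the code used in `miniWeather_serial.cpp`: `ScopedRange`, the `USE_NVTX` macro, and `sum_squares` are hypothetical names introduced for illustration. It assumes the NVTX header is available as `nvToolsExt.h` when `USE_NVTX` is defined; the header location and link requirements depend on the NVTX version installed. Without `USE_NVTX`, the markers compile away and the code builds with any C++ compiler.

```cpp
// Sketch only: ScopedRange, USE_NVTX and sum_squares are hypothetical names,
// not taken from miniWeather_serial.cpp.
#ifdef USE_NVTX
#include <nvToolsExt.h>  // assumed header location; varies with the NVTX version
#endif

// RAII wrapper: opens a named NVTX range on construction and closes it on
// destruction, so every exit path of the enclosing scope ends the range.
class ScopedRange {
public:
    explicit ScopedRange(const char* name) {
#ifdef USE_NVTX
        nvtxRangePushA(name);
#else
        (void)name;  // markers are a no-op when NVTX is disabled
#endif
    }
    ~ScopedRange() {
#ifdef USE_NVTX
        nvtxRangePop();
#endif
    }
    ScopedRange(const ScopedRange&) = delete;
    ScopedRange& operator=(const ScopedRange&) = delete;
};

// Toy stand-in for a routine such as perform_timestep.
long sum_squares(int n) {
    ScopedRange range("sum_squares");  // one range per call when NVTX is enabled
    long total = 0;
    for (int i = 1; i <= n; ++i) {
        total += static_cast<long>(i) * i;
    }
    return total;
}
```

When a program built with the markers enabled is profiled with `-t nvtx`, each call to the wrapped routine should appear as a range with the given name, which is what makes the per-routine breakdown possible for serial CPU code.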

#### Scaling and Amdahl's law
To plan an incremental parallelization strategy after identifying the routines responsible for the bulk of the execution time, it is important to know how the application can scale. The performance an application gains from running on a GPU depends on the extent to which it can be parallelized. Code that cannot be sufficiently parallelized should run on the host, unless doing so would result in excessive transfers between the host and the device. It is also very important to understand the relation between problem size and computational performance, as this can determine how much speedup and benefit you would get by parallelizing on the GPU.

We can **profile** the application again and run the executable with different values for `nx_glob`, `nz_glob`, and `sim_time`.

**Note:** You can provide input values for `nx_glob`, `nz_glob`, and `sim_time`, where:

* `nx_glob` and `nz_glob` are the total numbers of cells in the x and z directions
* `sim_time` is the simulation time in seconds

The total number of cells in the x-direction must be twice the total number of cells in the z-direction. The default values are `nx_glob` = 40, `nz_glob` = 20, and `sim_time` = 10 seconds.
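The constraint on the grid dimensions can be checked before launching a run. The sketch below is illustrative only: `GridSize`, `is_valid_grid`, and `grid_from_nz` are hypothetical helpers that are not part of the lab's source code, and they only encode the rule stated above.

```cpp
// Sketch only: GridSize, is_valid_grid and grid_from_nz are hypothetical
// helpers, not functions from miniWeather_serial.cpp.
struct GridSize {
    int nx_glob;  // total cells in the x-direction
    int nz_glob;  // total cells in the z-direction
};

// True when both dimensions are positive and nx_glob is exactly twice nz_glob.
bool is_valid_grid(int nx_glob, int nz_glob) {
    return nz_glob > 0 && nx_glob == 2 * nz_glob;
}

// Builds a grid that satisfies the constraint from the z-dimension alone.
// Assumes nz_glob is small enough that 2 * nz_glob does not overflow int.
GridSize grid_from_nz(int nz_glob) {
    return GridSize{2 * nz_glob, nz_glob};
}
```

With these helpers, the default grid (40, 20) passes the check, while a square grid such as (40, 40) does not.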

Now, we profile the code again and open the example expected output via the Nsight Systems UI.

In the "timeline view", take a closer look at the "NVTX" markers in the function table on the left side of the top pane and compare them with the timeline from the previous report. You can now see that the most time-consuming part of the application is the initialization.

<img src="images/e1-nvtx.png">

Due to the small problem size (`nx_glob`, `nz_glob`, and `sim_time` in this example), the run time is dominated by the initialization, and there is not enough computation to make the application suitable for the GPU.

According to *Amdahl's law*, the speedup achieved by accelerating portions of an application is limited by the code sections that are not accelerated. Before parallelizing an application, it is important to know that the overall performance improvement gained by optimizing a portion of the code is limited by the fraction of time spent in that portion. In other words, you may speed up a portion of the code by a factor of N, but if only a small fraction of the run time is spent there, the overall performance will not improve substantially.

So, in this example, changing the problem size can hide the initialization part of the code and make it a better candidate for the GPU. Now that you have determined what the most important bottleneck is, modify the application to make this problem more appropriate for the GPU.

## Post-Lab Summary

If you would like to download this lab for later viewing, it is recommended that you go to your browser's File menu (not the Jupyter notebook File menu) and save the complete web page. This will ensure the images are copied as well. You can also execute the following cell block to create a zip file of the files you've been working on, and download it with the link below.

In [None]:
%%bash
cd ..
rm -f openacc_profiler_files.zip
zip -r openacc_profiler_files.zip *

**After** executing the above zip command, you should be able to download the zip file [here](../openacc_profiler_files.zip).

-----

# <p style="text-align:center; border:3px; border-style:solid; border-color:#FF0000  ; padding: 1em"> <a href=../../profiling_start.ipynb>HOME</a>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="float:center"> <a href=profiling-c-lab2.ipynb>NEXT</a></span> </p>

-----

# Links and Resources

[OpenACC API Guide](https://www.openacc.org/sites/default/files/inline-files/OpenACC%20API%202.6%20Reference%20Guide.pdf)

[NVIDIA Nsight Systems](https://docs.nvidia.com/nsight-systems/)

[CUDA Toolkit Download](https://developer.nvidia.com/cuda-downloads)

**NOTE**: To be able to see the Nsight Systems profiler output, please download the latest version of Nsight Systems from [here](https://developer.nvidia.com/nsight-systems).

Don't forget to check out additional [OpenACC Resources](https://www.openacc.org/resources) and join our [OpenACC Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.

--- 

## Licensing 

This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0).