# Profiling your code to find what to offload

## Requirements

- [Get started](./Get_started.ipynb)
- [Data management](./Data_management.ipynb)

## Development cycle

When you port your code with OpenACC you have to find the hotspots which can benefit from offloading.

That's the first part of the development cycle (Analyze).
This part should be done with a profiler since it helps a lot to find the hotspots.

Once you have found the most time-consuming part, you can add the OpenACC directives.
Then you find the next hotspot, manage memory transfers and so on.

<img src="../../pictures/workflow_en.png" style="float:none" width="25%"/>

## Quick description of the code

The code used as an example in this chapter generates a picture and then applies a blurring filter.

Each pixel of the blurred picture has a color that is the weighted average of its corresponding pixel on the original picture and its 24 neighbors.

<img src="../../pictures/stencil_tp_blur.png" style="float:none" width="25%"/>

It will generate 2 pictures that look like:

<img src="../../pictures/blur_500x500.png" style="float:none" width="50%"/>

## Profiling CPU code

The first task you have to achieve when porting your code with OpenACC is to find the most demanding loops in your CPU code.
You can use your favorite profiling tool:

- [gprof](https://sourceware.org/binutils/docs/gprof/)
- [ARM MAP](https://www.arm.com/products/development-tools/server-and-hpc/forge/map)
- [Nsight Systems](https://developer.nvidia.com/nsight-systems)

Here we will use Nsight Systems.

The first step is to generate the executable file. Run the following cell, which compiles the code in blur.f90 and creates 2 files:

- blur.f90 (the content of the cell)
- blur.f90.exe (the executable)


This lets us introduce the command to run an already existing file `%idrrunfile filename`.

In [None]:
%idrrunfile --profile ../../examples/Fortran/blur.f90

Now you can run the UI by executing the following cell and choosing the right reportxx.qdrep file (here it should be report1.qdrep).

Please also write down the time taken (should be around 0.3 s on 1 Cascade Lake core).

In [None]:
%%bash
module load nvidia-nsight-systems/2021.2.1
nsys-ui $PWD/report1.qdrep
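
When no graphical session is available, a report can also be recorded and summarized from a terminal. The sketch below is an assumption about a roughly equivalent manual workflow, not what `%idrrunfile --profile` runs internally. The executable and report names are the ones used in this chapter, and the `.qdrep` extension matches the Nsight Systems version loaded above (more recent releases may use a different extension).

```shell
#!/bin/sh
# Names taken from this chapter; adjust them to your own build.
EXE=./blur.f90.exe
REPORT=report1

if command -v nsys >/dev/null 2>&1 && [ -x "$EXE" ]; then
    # Record a profile of the run; nsys adds the report extension itself.
    nsys profile --output="$REPORT" "$EXE"
    # Print text summaries of the recorded report in the terminal.
    nsys stats "$REPORT.qdrep"
    STATUS=profiled
else
    # Lets the sketch run harmlessly on a machine without the tools.
    echo "nsys or $EXE not found: nothing was profiled"
    STATUS=skipped
fi
echo "status: $STATUS"
```

The guard only exists so that the script does nothing on a machine where `nsys` or the executable is missing; on a compute node you would normally run the two `nsys` commands directly.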

## The graphical profiler

The graphical user interface of Nsight Systems (version 2021.2.1) looks like this:

<img src="../../pictures/NSight-global.png" style="float:none"/>

### The timeline

The timeline is probably the most important part:

<img src="../../pictures/NSight-timeline.png" style="float:none"/>

It shows what happened, and when, during the execution of your code.

You can select a portion of the timeline by holding the left button of the mouse (when the mouse is set up for right-handed people) and dragging the cursor.

<img src="../../pictures/NSight-timeline_select.png" style="float:none"/>

You can then zoom into the selection (Shift+Z, or right-click and choose "Zoom into selection"):

<img src="../../pictures/NSight-timeline_select_zoomed.png" style="float:none"/>

### Profile

To see a summary of the time taken by each function, select "Bottom-up View" in the panel below the timeline.
You can expand the functions to get the complete call hierarchy.

<img src="../../pictures/NSight-bottom-up.png" style="float:none"/>

#### Analysis

Here we see that most of the time is spent in the `weight` function. You can open the [blur.f90 file](../../examples/Fortran/blur.f90) to see what this function does.

The work is done by this double loop, which computes the value of the blurred pixel:

```fortran
do i = 0, 4
    do j = 0, 4
        ! accumulate the weighted contribution of neighbor (i, j)
        pix = pix + pic((x+i-2)*3*cols+y*3+l-2+j) * coefs(i,j)
    enddo
enddo
```


Parallelizing this loop will not give optimal performance. Why?

The iteration space is only 25 (5 × 5). Offloading it means launching one kernel per pixel of the picture, each with far too few threads to keep a GPU busy.

As a reminder, an NVIDIA V100 can run up to 5,120 threads at the same time.

You also have to remember that every kernel launch has an overhead.

So the advice is:

- __Give work to the GPU__ by having large kernels with a lot of computation
- __Avoid launching too many kernels__ to reduce overhead
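
To put rough numbers on this reasoning, here is a small back-of-the-envelope calculation in shell. It is a sketch under simplifying assumptions: the 500 x 500 picture size is taken from the name of the image shown earlier, one loop iteration is mapped to one thread, and color channels and borders are ignored. All variable names are illustrative.

```shell
#!/bin/sh
# Assumed picture size (from the blur_500x500.png file name).
ROWS=500
COLS=500
# Figures quoted in the text.
ITERATIONS_PER_KERNEL=25   # 5x5 stencil
V100_THREADS=5120

# Offloading only the inner double loop: one kernel launch per pixel.
KERNEL_LAUNCHES=$((ROWS * COLS))
echo "kernel launches: $KERNEL_LAUNCHES"

# Threads left without work while one such kernel runs.
IDLE_THREADS=$((V100_THREADS - ITERATIONS_PER_KERNEL))
echo "idle threads per kernel: $IDLE_THREADS of $V100_THREADS"

# Offloading the loop over the pixels instead: a single kernel,
# with this many pixels to process per available thread.
PIXELS_PER_THREAD=$((ROWS * COLS / V100_THREADS))
echo "pixels per thread with one big kernel: $PIXELS_PER_THREAD"
```

With these assumptions the small kernels leave almost all of the GPU idle and pay the launch overhead 250,000 times, while a single kernel over the pixels has enough iterations to occupy every thread.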

We have to find another way to parallelize this code! The `weight` function is called by `blur` which is a loop over the pixels.

As an exercise, you can add the directives to offload `blur`. Once you are done you can run the profiler again.

In [None]:
%idrrunfile -a ../../examples/Fortran/blur.f90

## Profiling GPU code: other tools

Other tools available for profiling GPU codes include:

- [ARM MAP](https://www.arm.com/products/development-tools/server-and-hpc/forge/map)
- Environment variables NVCOMPILER_ACC_TIME and NVCOMPILER_ACC_NOTIFY

The runtime itself can produce profiling information when you set one of two environment variables, NVCOMPILER_ACC_TIME and NVCOMPILER_ACC_NOTIFY. They provide a fast and easy way of profiling without the need for a GUI.

Warning: disable NVCOMPILER_ACC_TIME (`export NVCOMPILER_ACC_TIME=0`) if using another profiler.
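
One way to avoid leaving the variable enabled by accident is to set it for a single command only. The sketch below illustrates this with a stand-in command that just prints the value it sees, so that it runs without a GPU; `PROGRAM`, `TIMED_RUN` and `AFTER_RUN` are illustrative names, and on a real system you would launch your own executable instead.

```shell
#!/bin/sh
# Start from a clean state so the behaviour below is reproducible.
unset NVCOMPILER_ACC_TIME

# Stand-in for the real program: prints the value it sees.
# Replace it with your own executable (e.g. ./blur.f90.exe) on a GPU node.
PROGRAM='echo "program sees NVCOMPILER_ACC_TIME=${NVCOMPILER_ACC_TIME:-unset}"'

# Option 1: set the variable for one command only.
TIMED_RUN=$(NVCOMPILER_ACC_TIME=1 sh -c "$PROGRAM")
echo "$TIMED_RUN"

# The calling shell is unaffected, so a later profiler run is safe.
AFTER_RUN=${NVCOMPILER_ACC_TIME:-unset}
echo "shell has NVCOMPILER_ACC_TIME=$AFTER_RUN"

# Option 2: export it for the whole session, then disable it explicitly.
export NVCOMPILER_ACC_TIME=1
export NVCOMPILER_ACC_TIME=0
echo "after disabling: NVCOMPILER_ACC_TIME=$NVCOMPILER_ACC_TIME"
```

Option 1 keeps the setting local to one run; option 2 is closer to an interactive session, where the explicit reset is what protects a later Nsight Systems run.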

## NVCOMPILER_ACC_NOTIFY

Additional profiling information can be collected with the variable NVCOMPILER_ACC_NOTIFY. Each value below enables reporting for one type of GPU operation:

- 1: kernel launches
- 2: data transfers
- 4: region entry/exit
- 8: wait operations or synchronizations
- 16: device memory allocates and deallocates
  
For example, to get output for both kernel launches and data transfers, set NVCOMPILER_ACC_NOTIFY to 3 (1 + 2).
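
Since each value is a distinct power of two, the settings combine like bit flags. A minimal shell sketch of building the value from named flags follows; the flag variable names are illustrative and only the numeric values come from the list above.

```shell
#!/bin/sh
# Flag values copied from the list above.
KERNEL_LAUNCHES=1
DATA_TRANSFERS=2
REGION_ENTRY_EXIT=4
WAIT_OPERATIONS=8
DEVICE_MEMORY=16

# Kernel launches and data transfers: 1 | 2 = 3
NVCOMPILER_ACC_NOTIFY=$((KERNEL_LAUNCHES | DATA_TRANSFERS))
export NVCOMPILER_ACC_NOTIFY
echo "NVCOMPILER_ACC_NOTIFY=$NVCOMPILER_ACC_NOTIFY"

# Every event type at once: 1 + 2 + 4 + 8 + 16 = 31
ALL_EVENTS=$((KERNEL_LAUNCHES | DATA_TRANSFERS | REGION_ENTRY_EXIT | WAIT_OPERATIONS | DEVICE_MEMORY))
echo "all events: $ALL_EVENTS"
```

Any program launched from the same shell after the `export` inherits the setting.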