# Profiling and Parallelizing with OpenACC

Lab written by Jeff Larkin

The following timer counts down to a five-minute warning before the lab instance shuts down.  You should get a pop-up at the five-minute warning reminding you to save your work!  If you are about to run out of time, please see the [Post-Lab](#Post-Lab-Summary) section for saving this lab to view offline later.

<iframe id="timer" src="timer/timer.html" width="100%" height="120px"></iframe>

---
Before we begin, let's verify [WebSockets](http://en.wikipedia.org/wiki/WebSocket) are working on your system.  To do this, execute the cell block below by giving it focus (clicking on it with your mouse), and hitting Ctrl-Enter, or pressing the play button in the toolbar above.  If all goes well, you should see some output returned below the grey cell.  If not, please consult the [Self-paced Lab Troubleshooting FAQ](https://developer.nvidia.com/self-paced-labs-faq#Troubleshooting) to debug the issue.

In [1]:
print("The answer should be three: " + str(1+2))

The answer should be three: 3


Let's execute the cell below to display information about the GPUs running on the server.

In [2]:
!nvidia-smi

Wed Nov  8 20:19:11 2017       
+------------------------------------------------------+                       
| NVIDIA-SMI 352.68     Driver Version: 352.68         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  GRID K520           On   | 0000:00:03.0     Off |                  N/A |
| N/A   27C    P8    18W / 125W |     11MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage    

### Connecting to Your Lab Instance

You are required to connect to the lab instance over SSH - much like you would when working on a real system.  You can do this by:
* Using a local SSH client, connect to **ec2-54-92-248-21.compute-1.amazonaws.com** with username **ubuntu** and password **FcN7Dy6BtVs**
* If you don't have an SSH client installed, you can use the provided <a href="ssh" target="_blank">browser-based client</a> with the same username and password as above.
  * **NOTE**: If you right-click in the browser-based client, you can select "Paste from browser" to easily copy & paste in the password.

Once connected, please proceed!

## Introduction

In this lab you will profile the provided application using either NVIDIA
nvprof or gprof and the PGI compiler. After profiling the application, you will
use OpenACC to express the parallelism in the 3 most time-consuming routines.
You will use CUDA Unified Memory and the PGI "managed" option to manage host
and device memories for you. You may use either the `kernels` or `parallel loop` 
directives to express the parallelism in the code. Versions of the code
have been provided in C99 (directory `/home/ubuntu/c99`) and Fortran 90 (directory `/home/ubuntu/f90`). 
The `nano`, `vim`, and `emacs` file editors are all available to use during this lab. 
If you are not experienced with Linux text editors, `nano` is the simplest choice.
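
To make the difference between the two directives concrete, here is a minimal sketch on a simple scaling loop. The function names `scale_kernels` and `scale_parallel_loop` are hypothetical and are not part of the lab sources. The sketch uses `__restrict`, a common compiler extension, in place of the `restrict` keyword used in the lab code, and it assumes the lab's managed-memory build, so no data clauses are shown. A compiler without OpenACC enabled will simply ignore the pragmas and run the loops serially.

```cpp
#include <cassert>

// Hypothetical helper (not from the lab sources): y[i] = a * x[i].
// With `kernels`, the compiler analyzes the region and decides
// which loops are safe to parallelize.
void scale_kernels(int n, double a,
                   const double *__restrict x, double *__restrict y) {
  assert(n >= 0);
#pragma acc kernels
  {
    for (int i = 0; i < n; i++) {
      y[i] = a * x[i];
    }
  }
}

// Hypothetical helper (not from the lab sources): same computation.
// With `parallel loop`, the programmer asserts that the iterations
// of the loop are independent and may run in parallel.
void scale_parallel_loop(int n, double a,
                         const double *__restrict x, double *__restrict y) {
  assert(n >= 0);
#pragma acc parallel loop
  for (int i = 0; i < n; i++) {
    y[i] = a * x[i];
  }
}
```

Both functions compute the same result; the difference is who takes responsibility for deciding that the loop is safe to parallelize.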


![Lecture 2 steps: Identify and Express Parallelism](files/Lecture-2-Steps.png)

As discussed in the associated lecture, this lab will focus solely on *Identifying Parallelism* in the 
code by profiling the application and *Expressing Parallelism* using OpenACC. We will use CUDA Unified Memory 
to allow the data used on the GPU to be automatically migrated to and from the GPU as needed. Please be
aware that you may see an application slowdown until you have completed each step of this lab. This is expected
behavior due to the need to migrate data between the CPU and GPU memories.

**Important:** You should repeat Steps 2 and 3 for each function identified in Step 1,
in order of function importance. Gather a new GPU profile each time and observe
how the profile changes after each step.

## Step 0 - Building the code

Makefiles have been provided for building both the C and Fortran versions of the code. Change directory to your language of choice and run the `make` command to build the code.

### C/C++

```
$ cd ~/c99
$ make
```
    
### Fortran

```
$ cd ~/f90
$ make
```
    
This will build an executable named `cg` that you can run with the `./cg` command. You may change the options passed to the compiler by modifying the `CFLAGS` variable in `c99/Makefile` or `FCFLAGS` in `f90/Makefile`. You should not need to modify anything in the Makefile except these compiler flags.

## Step 1 - Identify Parallelism

In this step, use the command-line `nvprof` profiler, or your preferred performance analysis
tool, to identify the important routines in the application and examine the
loops within these routines to determine whether they are candidates for
acceleration. Run the command below to gather a CPU profile.

```
~/c99$ nvprof --cpu-profiling on --cpu-profiling-mode top-down ./cg
Rows: 8120601, nnz: 218535025
Iteration: 0, Tolerance: 4.0067e+08
Iteration: 10, Tolerance: 1.8772e+07
Iteration: 20, Tolerance: 6.4359e+05
Iteration: 30, Tolerance: 2.3202e+04
Iteration: 40, Tolerance: 8.3565e+02
Iteration: 50, Tolerance: 3.0039e+01
Iteration: 60, Tolerance: 1.0764e+00
Iteration: 70, Tolerance: 3.8360e-02
Iteration: 80, Tolerance: 1.3515e-03
Iteration: 90, Tolerance: 4.6209e-05
Total Iterations: 100 Total Time: 39.722421s

======== CPU profiling result (top down):
99.83% main
| 75.08% matvec(matrix const &, vector const &, vector const &)
| 18.19% waxpby(double, vector const &, double, vector const &, vector const &)
| 4.25% dot(vector const &, vector const &)
| 2.29% allocate_3d_poission_matrix(matrix&, int)
| 0.02% free_matrix(matrix&)
|   0.02% munmap
0.15% __c_mset8
0.02% dot(vector const &, vector const &)

======== Data collected at 100Hz frequency
```

We see from the above output that the `matvec`, `waxpby`, and `dot` routines take up the majority of the runtime of this application. We will focus our effort on accelerating these functions.

***NOTE:*** The `allocate_3d_poission_matrix` routine is an initialization
routine that can be safely ignored.

Documentation for nvprof can be found in the [Profiler User's Guide](http://docs.nvidia.com/cuda/profiler-users-guide/index.html#nvprof-overview).
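
When examining the loops in a hot routine, the key question is whether each iteration can run independently of the others. The sketch below contrasts the two cases. Both functions are hypothetical illustrations and do not appear in the lab sources.

```cpp
#include <cassert>

// Hypothetical example: independent iterations.
// Each c[i] depends only on a[i] and b[i], so the iterations can
// run in any order, which makes this loop a candidate for acceleration.
void add_arrays(int n, const double *a, const double *b, double *c) {
  assert(n >= 0);
  for (int i = 0; i < n; i++) {
    c[i] = a[i] + b[i];
  }
}

// Hypothetical example: loop-carried dependency.
// out[i] needs out[i-1] from the previous iteration, so the
// iterations cannot simply be run in parallel as written.
void running_sum(int n, const double *in, double *out) {
  assert(n >= 0);
  for (int i = 0; i < n; i++) {
    out[i] = (i == 0 ? 0.0 : out[i - 1]) + in[i];
  }
}
```

Loops of the first kind are the ones to look for inside the routines that the profile highlights.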


## Step 2 - Express Parallelism

Within each of the routines identified above, express the available parallelism
to the compiler using either the `acc kernels` or `acc parallel loop`
directive. As an example, here's the OpenACC code to add to the `matvec` routine.

```
void matvec(const matrix& A, const vector& x, const vector &y) {

  unsigned int num_rows=A.num_rows;
  unsigned int *restrict row_offsets=A.row_offsets;
  unsigned int *restrict cols=A.cols;
  double *restrict Acoefs=A.coefs;
  double *restrict xcoefs=x.coefs;
  double *restrict ycoefs=y.coefs;

#pragma acc kernels
  {
    for(int i=0;i<num_rows;i++) {
      double sum=0;
      int row_start=row_offsets[i];
      int row_end=row_offsets[i+1];
      for(int j=row_start;j<row_end;j++) {
        unsigned int Acol=cols[j];
        double Acoef=Acoefs[j];
        double xcoef=xcoefs[Acol];
        sum+=Acoef*xcoef;
      }
      ycoefs[i]=sum;
    }
  }
}
```

Add the necessary directives to each routine **one at a time** in order of importance. After adding the directive, recompile the code, check that the answers have remained the same, and note the performance difference from your
change.

```
$ make
pgc++ -fast -acc -ta=tesla:managed -Minfo=accel main.cpp -o cg

matvec(const matrix &, const vector &, const vector &):
      8, include "matrix_functions.h"
          15, Generating copyout(ycoefs[:num_rows])
              Generating copyin(xcoefs[:],Acoefs[:],cols[:],row_offsets[:num_rows+1])
          16, Loop is parallelizable
              Accelerator kernel generated
              Generating Tesla code
              16, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
          20, Loop is parallelizable
```

The performance may slow down while you are working on this step. Be sure
to read the compiler feedback to understand how the compiler parallelizes the
code for you. If you are doing the C/C++ lab, it may be necessary to declare
some pointers as `restrict` in order for the compiler to parallelize the loops
that use them. You will know this is necessary if the compiler feedback lists
a "complex loop carried dependency."
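
As a rough illustration of the `restrict` advice, the sketch below shows a vector update written with restrict-qualified pointers. The name `waxpby_sketch`, its plain-pointer signature, and the assumption that the routine computes `w = alpha*x + beta*y` are all hypothetical; the lab's own `waxpby` takes `vector` arguments instead. The sketch uses the `__restrict` extension rather than the `restrict` keyword shown in `matvec`, assuming your compiler accepts it.

```cpp
#include <cassert>

// Hypothetical sketch (not the lab's waxpby): w = alpha*x + beta*y.
// The restrict qualifiers are a promise from the programmer that the
// arrays do not overlap, which lets the compiler prove that the loop
// iterations are independent.
void waxpby_sketch(unsigned int n,
                   double alpha, const double *__restrict x,
                   double beta, const double *__restrict y,
                   double *__restrict w) {
  assert(x != nullptr && y != nullptr && w != nullptr);
#pragma acc kernels
  {
    for (unsigned int i = 0; i < n; i++) {
      w[i] = alpha * x[i] + beta * y[i];
    }
  }
}
```

Without the qualifiers, the compiler has to assume that writing `w[i]` might change a later `x[i]` or `y[i]`, which is the kind of situation it reports as a loop carried dependency.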

## Step 3 - Re-Profile Application

Once you have added the OpenACC directives to your code, you should obtain a
new profile of the application. For this step, use the NVIDIA Visual Profiler
to obtain a GPU timeline and see how the GPU computation and data movement
from CUDA Unified Memory interact.

- If you are doing this lab via qwikLABs, launch the NVIDIA Visual Profiler by following these steps:
 1. First connect to the Ubuntu remote desktop. There are a few ways to do this:
   * Using the <a href="/vnc" target="_blank">browser-based VNC client</a> (easiest but lowest performance of the options)
   * Connecting with a local VNC client to **ec2-54-92-248-21.compute-1.amazonaws.com** using password **FcN7Dy6BtVs**
   * Connecting with NoMachine 4.x or 5.x client to **ec2-54-92-248-21.compute-1.amazonaws.com** with username **ubuntu** and password **FcN7Dy6BtVs** on port 4000 - the NX protocol (best performance of the options)
 2. Once you're connected to the Ubuntu remote desktop, click the Ubuntu icon in the upper-left of the desktop, and type *nvvp* in the search box and hit enter.
 3. After a short time, you will see the NVIDIA Visual Profiler application.
- If you are doing this lab on your own machine, either launch Visual Profiler
  from its application link or via the `nvvp` command.

Once Visual Profiler has started, create a new session by selecting *File -> New
Session*. Then select the executable that you built by pressing the *Browse*
button next to *File*, browse to `/home/ubuntu/c99` or `/home/ubuntu/f90`, 
select `cg`,  and then press *Next*. On the next screen ensure that
*Enable unified memory profiling* is checked and press *Finish*. The result
should look like the image below. Experiment with Visual Profiler to see what
information you can learn from it.

![Image of NVIDIA Visual Profiler after completing lab 2 with the kernels
directive](https://github.com/NVIDIA-OpenACC-Course/nvidia-openacc-course-sources/raw/master/labs/lab2/visual_profiler_lab2.png)


## Conclusion

After completing the above steps for each of the 3 important routines, your application should show a speed-up over the unaccelerated version. You can verify this by removing the `-ta` flag from your compiler options, rebuilding, and comparing the reported total time.

If your code resembles what is in the `solution.kernels` or `solution.parallel` directories, you should see a roughly 14% speed-up over the CPU version on this instance.  On a GPU such as a K40, rather than the K520 in this g2.2xlarge instance, the run time drops to roughly 8.5 seconds!  Here's a table showing the run times on different CPUs and GPUs:

| Processor      | Time (s)  |
| -------------- | --------- |
| Haswell CPU    | 30.519176 |
| K40 GPU        | 8.460459  |
| g2.2xlarge CPU | 36.647187 |
| g2.2xlarge GPU | 32.084089 |
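
The 14% figure can be reproduced from the table by dividing the CPU time by the GPU time. The helper names below are hypothetical and exist only to show the arithmetic.

```cpp
#include <cassert>

// Hypothetical helper: speed-up factor of an accelerated run
// relative to a baseline run, both measured in seconds.
double speedup(double baseline_seconds, double accelerated_seconds) {
  assert(accelerated_seconds > 0.0);
  return baseline_seconds / accelerated_seconds;
}

// Hypothetical helper: the same comparison expressed as a percentage.
double percent_faster(double baseline_seconds, double accelerated_seconds) {
  return (speedup(baseline_seconds, accelerated_seconds) - 1.0) * 100.0;
}

// Using the times from the table:
//   speedup(36.647187, 32.084089) is roughly 1.14 (g2.2xlarge GPU vs. its CPU)
//   speedup(30.519176, 8.460459)  is roughly 3.6  (K40 GPU vs. Haswell CPU)
```

Keep in mind that these comparisons are only meaningful between runs of the same problem size and iteration count.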

In the next lecture and lab we will replace CUDA Unified Memory with explicit memory management using OpenACC and then further optimize the loops using the OpenACC `loop` directive.

## Bonus Task

1. If you used the `kernels` directive to express the parallelism in the code,
try again with the `parallel loop` directive. Remember, you will need to take
responsibility for identifying any reductions in the code. If you used
`parallel loop`, try using `kernels` instead and observe the differences in both
developer effort and performance.
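
As a hedged illustration of what identifying a reduction looks like, the sketch below marks the accumulation in a dot product with a `reduction` clause. The name `dot_sketch` and its plain-pointer signature are hypothetical; the lab's own `dot` routine takes `vector` arguments. The sketch assumes the lab's managed-memory build, so no data clauses are shown.

```cpp
#include <cassert>

// Hypothetical sketch (not the lab's dot): returns the sum of x[i]*y[i].
// Every iteration updates `sum`, so with `parallel loop` the programmer
// must tell the compiler to combine the partial sums with
// reduction(+:sum).
double dot_sketch(unsigned int n,
                  const double *__restrict x, const double *__restrict y) {
  assert(x != nullptr && y != nullptr);
  double sum = 0.0;
#pragma acc parallel loop reduction(+:sum)
  for (unsigned int i = 0; i < n; i++) {
    sum += x[i] * y[i];
  }
  return sum;
}
```

With `kernels`, the compiler can often detect this pattern on its own, which is one of the differences in developer effort worth observing.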

## Post-Lab Summary

If you would like to download this lab for later viewing, it is recommended that you go to your browser's File menu (not the Jupyter notebook File menu) and save the complete web page.  This will ensure the images are copied down as well.

You can also execute the following cell block to create a zip-file of the files you've been working on, and download it with the link below.

In [None]:
%%bash
rm -f openacc_files.zip
zip -r openacc_files.zip ~/c99/* ~/f90/*

**After** executing the above zip command, you should be able to download the zip file [here](files/openacc_files.zip).