# Get started with OpenACC

What will you learn here?

1. Open a parallel region with `#pragma acc parallel`
2. Activate loop parallelism with `#pragma acc loop`
3. Open a structured data region with `#pragma acc data`
4. Compile code with OpenACC support

## OpenACC directives

If you have CPU code and want to run parts of it on the GPU, you can add OpenACC directives to it.

A directive has the following structure:

<img alt="OpenACC directive" src="../../pictures/directive_acc.png" style="float:none" width="30%"/>

If we break it down, we have these elements:

- The sentinel is a special instruction for the compiler. It tells the compiler that what follows must be interpreted as OpenACC.
- The directive is the action to perform. In the example, _parallel_ opens a parallel region that will be offloaded to the GPU.
- The clauses are "options" of the directive. In the example, we want to copy some data to the GPU.
- The clause arguments give more details for the clause. In the example, we give the names of the variables to be copied.
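
To make these four elements concrete, here is a minimal sketch. The function name `scale_in_place` and its parameters are made up for this illustration and are not part of the session material. If you build it without OpenACC support, the pragmas are simply ignored and the loop runs on the CPU.

```c
/* Hypothetical helper: multiplies the n elements of v by factor. */
void scale_in_place(double *v, int n, double factor)
{
    /* sentinel:        #pragma acc
     * directive:       parallel
     * clause:          copy
     * clause argument: v[0:n], i.e. n elements starting at index 0 */
    #pragma acc parallel copy(v[0:n])
    {
        #pragma acc loop
        for (int i = 0; i < n; ++i)
            v[i] *= factor;
    }
}
```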

Some directives must be placed immediately before the code block they apply to.
```c
#pragma acc parallel
{
    // code block opened with '{' and closed by '}'
}
```

### A short example

With this example you can get familiar with how to run code cells during this session.
`%%idrrun` has to be present at the top of a code cell to compile and execute the code written inside the cell.

The content has to be valid code; otherwise you will get errors.
In C, if you want to run the code, you need to define the `main` function:
```c
int main(void)
{
//
}
```

or:
```c
int main(int argc, char** argv)
{
//
}
```

The example initializes an array of integers.

Example stored in: `../../examples/C/Get_started_init_array_exercise.c`

In [None]:
%%idrrun
#include <stdio.h>
int main(void)
{
    int size = 100000;
    int array[size];
    for (int i=0; i<size; ++i)
        array[i] = 2 * i;
    printf("%d", array[21]);
}


Now we add OpenACC support with the `-a` option of `idrrun`.

To offload the computation to the GPU, you have to open a parallel region with the directive `acc parallel` and define the code block it applies to.

Modify the cell below to perform this action. No clauses are needed here.

Example stored in: `../../examples/C/Get_started_init_array_exercise_acc.c`

In [None]:
%%idrrun -a
#include <stdio.h>
int main(void)
{
    int size = 100000;
    int array[size];
    // Modifications from here
    for (int i=0; i<size; ++i)
        array[i] = 2 * i;
    printf("%d", array[12]);
}


### Solution

Example stored in: `../../examples/C/Get_started_init_array_solution_acc.c`

In [None]:
%%idrrun -a
#include <stdio.h>
int main(void)
{
    int size = 100000;
    int array[size];
    #pragma acc parallel
    {
    for (int i=0; i<size; ++i)
        array[i] = 2 * i;
    }
    printf("%d", array[12]);
}

We can compare this with a version in which the `acc loop` directive is written explicitly instead of being added implicitly by the compiler:

Example stored in: `../../examples/C/Get_started_init_array_solution_acc_2.c`

In [None]:
%%idrrun -a
#include <stdio.h>
int main(void)
{
    int size = 100000;
    int array[size];
    #pragma acc parallel
    {
    #pragma acc loop
    for (int i=0; i<size; ++i)
        array[i] = 2 * i;
    }
    printf("%d", array[12]);
}

### Let's analyze what happened

The following steps are printed:

1. the compiler command to generate the executable
2. the output of the command (displayed on red background)
3. the command line to execute the code
4. the output/error of the execution

We activated the verbose mode of the NVIDIA compilers to get information about optimizations and OpenACC (compiler option `-Minfo=all`) and __strongly recommend that you do the same in your developments__.

The compiler found a __kernel__ in the `main` function (this is the name given to a code block offloaded to the GPU) and was able to generate GPU code for it.
The line refers to the directive `acc parallel` included in the code.

By default, NVIDIA compilers (formerly PGI) analyze the parallel region and try to find:

- loops that can be parallelized
- data transfers needed
- operations like reductions
- etc

This might result in unexpected behavior, since we did not explicitly write the directives that perform those actions.
Nevertheless, we decided to keep this feature on during the session since it is the default.
This is why you can see that the compiler implicitly added two things to our code: an `acc loop` directive (used to activate loop parallelism on the GPU) and a data transfer with `copyout`.
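
As a sketch of what a fully explicit version might look like, the function below spells out both operations: the loop directive and the `copyout` transfer. The wrapper name `init_array` is an assumption made for this example, and the exact clauses reported by your compiler may differ.

```c
/* Hypothetical wrapper around the loop of the example above. */
void init_array(int *array, int size)
{
    /* Explicit data transfer: the array is only written on the GPU,
     * so it is allocated there and copied back to the host at the end. */
    #pragma acc parallel copyout(array[0:size])
    {
        /* Explicit loop parallelism instead of the implicit one. */
        #pragma acc loop
        for (int i = 0; i < size; ++i)
            array[i] = 2 * i;
    }
}
```

With both directives written out, the compiler output should no longer need to report these operations as implicit.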

## Loop parallelism

Most of the parallelism in OpenACC (hence performance) comes from the loops in your code and especially from loops with __independent iterations__.
Iterations are independent when the results do not depend on the order in which the iterations are done.
Small differences caused by the non-associativity of operations in limited precision are usually OK.
You just have to be aware of that problem and decide if it is critical.

Another condition is that the number of iterations must be known when the loop starts.
So keep incrementing integers!
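
For instance, a loop that advances a floating-point variable until it passes a bound carries a dependency from one iteration to the next, and its iteration count is awkward to determine up front. A possible rewrite, sketched here with hypothetical names (`sample_square`, `x0`, `dx`), derives the position from an integer counter instead:

```c
/* Hypothetical example: sample f(x) = x*x at n points x0, x0+dx, ...
 *
 * Hard to parallelize:
 *     for (double x = x0; x < x1; x += dx) { ... }
 * because each x is computed from the previous one.
 */
void sample_square(double *y, int n, double x0, double dx)
{
    #pragma acc parallel loop copyout(y[0:n])
    for (int i = 0; i < n; ++i)
    {
        double x = x0 + i * dx;   /* position computed from the counter */
        y[i] = x * x;
    }
}
```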

### Directive

The directive to parallelize loops is:
```c
#pragma acc loop
```

### Non-independent loops

Here are some cases where the iterations are not independent:

- Loops without a known number of iterations (potentially infinite)
```c
while(error > tolerance)
{
    //compute error
}
```

- Current iteration reads values computed by previous iterations
```c
array[0] = 0;
array[1] = 1;
for (int i = 2; i<size; ++i)
    array[i] = array[i-1]+array[i-2];
```

- Current iteration reads values that will be changed by subsequent iterations
```c
for (int i=0; i< size-1; ++i)
    array[i] = array[i+1] + 1;
```

- Current iteration writes values that will be read by subsequent iterations
```c
for (int i = 0; i<size-1; ++i)
{
    array[i]++;
    array[i+1] = array[i]+2;
}
```

These kinds of loops can be offloaded to the GPU but might not give correct results if they are not run in sequential mode.
You can try to modify the algorithm to transform them into independent loops:

- Use temporary arrays
- Modify the order of the iterations
- etc
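
As an illustration of the temporary-array approach, here is one way the loop `array[i] = array[i+1] + 1` shown above could be rewritten. This is a sketch under assumptions: the function name `shift_plus_one` and its error convention are invented for the example. The idea is to take a snapshot of the old values first, so that the second loop only reads data that no iteration modifies.

```c
#include <stdlib.h>

/* Hypothetical rewrite of
 *     for (i = 0; i < size-1; ++i) array[i] = array[i+1] + 1;
 * Returns 0 on success, -1 if the temporary buffer cannot be allocated. */
int shift_plus_one(int *array, int size)
{
    if (size <= 1)
        return 0;                      /* nothing to update */

    int *tmp = malloc((size_t)size * sizeof *tmp);
    if (tmp == NULL)
        return -1;

    /* Step 1: snapshot the old values; iterations are independent. */
    #pragma acc parallel loop copyin(array[0:size]) copyout(tmp[0:size])
    for (int i = 0; i < size; ++i)
        tmp[i] = array[i];

    /* Step 2: each iteration reads only the snapshot. */
    #pragma acc parallel loop copyin(tmp[0:size]) copy(array[0:size])
    for (int i = 0; i < size - 1; ++i)
        array[i] = tmp[i + 1] + 1;

    free(tmp);
    return 0;
}
```

The price is extra memory and one more pass over the data, which is often acceptable compared with running the loop sequentially.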

## Managing data in compute regions

While porting your code, the data you work on in the _compute regions_ might have to go back and forth between the host and the GPU.
It is important to minimize the number of data transfers because these operations are costly.

For each _compute region_ (i.e. `acc parallel` directive or _kernel_) a _data region_ is created.
OpenACC gives you several clauses to manage the transfers efficiently.

```c
#pragma acc parallel copy(var1[first_index:num_elements]) copyin(var2[first_index_i:num_elements_i][first_index_j:num_elements_j], var3) copyout(var4, var5)
```

| clause      | effect when entering the region                                               | effect when leaving the region                                          |
|-------------|-------------------------------------------------------------------------------|-------------------------------------------------------------------------|
| create      | Allocate the memory needed on the GPU                                         | Free the memory on the GPU                                              |
| copyin      | Allocate the memory and initialize the variable with the values it has on CPU | Free the memory on the GPU                                              |
| copyout     | Allocate the memory needed on the GPU                                         | Copy the values from the GPU to the CPU then free the memory on the GPU |
| copy        | Allocate the memory and initialize the variable with the values it has on CPU | Copy the values from the GPU to the CPU then free the memory on the GPU |
| present     | Check if data is present: an error is raised if it is not the case            | None                                                                    |

<img alt="Data clauses" src="../../pictures/data_clauses.png" style="float:none" width="30%"/>

To choose the right data clause you need to answer the following questions:

- Does the kernel need the values computed on the host beforehand? (Before)
- Are the values computed inside the kernel needed on the host afterwards? (After)

|                   | Needed after       | Not needed after  |
|-------------------|--------------------|-------------------|
| Needed before     | copy(var1, ...)    | copyin(var2, ...) |
| Not needed before | copyout(var3, ...) | create(var4, ...) |
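
The sketch below applies this decision table to a made-up kernel that touches four arrays, one for each cell of the table. All names (`accumulate_squares`, `total`, `in`, `out`, `scratch`) are hypothetical.

```c
/* Hypothetical kernel touching four arrays of n elements:
 *   total   : needed before and after              -> copy
 *   in      : needed before, not needed after      -> copyin
 *   out     : not needed before, needed after      -> copyout
 *   scratch : temporary, never needed on the host  -> create
 */
void accumulate_squares(double *total, const double *in, double *out,
                        double *scratch, int n)
{
    #pragma acc parallel loop copy(total[0:n]) copyin(in[0:n]) \
                              copyout(out[0:n]) create(scratch[0:n])
    for (int i = 0; i < n; ++i)
    {
        scratch[i] = in[i] * in[i];
        total[i] += scratch[i];
        out[i] = 2.0 * total[i];
    }
}
```

Note that with `create`, the host copy of `scratch` is not updated when the code runs on the GPU, so the host must not rely on its content afterwards.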

Usually it is not mandatory to specify the clauses.
The compiler will analyze your code to guess the best solution and will tell you which operations were done implicitly.
As a good practice, we recommend making all implicit operations explicit.
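
When several compute regions work on the same array, the transfers can be stated once, explicitly, in a structured data region opened with `#pragma acc data` (the third item of the list at the top of this page). The following is a minimal sketch; the function `add_then_double` is invented for the illustration.

```c
/* Hypothetical two-step update: v = (v + 1) * 2, done in two kernels. */
void add_then_double(double *v, int n)
{
    /* One transfer to the GPU on entry, one back to the host on exit. */
    #pragma acc data copy(v[0:n])
    {
        /* present: the data is already on the GPU, no transfer here. */
        #pragma acc parallel loop present(v[0:n])
        for (int i = 0; i < n; ++i)
            v[i] += 1.0;

        #pragma acc parallel loop present(v[0:n])
        for (int i = 0; i < n; ++i)
            v[i] *= 2.0;
    }
}
```

Without the data region, each `parallel loop` would need its own `copy(v[0:n])` clause, and the array would travel between host and GPU twice.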

## Exercise: Gaussian blurring filter

In this exercise, we read a picture, load it on the GPU, and then apply a blur filter. For each pixel, the new value is the weighted sum of the pixel itself and its 24 neighbors, using the stencil shown below.

The original picture is stored in the pictures directory. We have to convert it to RAW before loading it.

In [None]:
import os
picture = os.path.join("..", "..", "pictures", "midris.jpg")
from idrcomp import convert_jpg_to_raw
convert_jpg_to_raw(picture, "pic.rgb")

<img alt="Stencil for Gaussian Blur" src="../../pictures/stencil_tp_blur.png" style="float:none"/>

Note: In Fortran the weights are adjusted because we do not have unsigned integers.
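
The weights in the `coefs` array of the C code below add up to 256, which is why the weighted sum is divided by 256 before being stored. The small sketch here checks this on the CPU. The helper names `weight_sum` and `blur_uniform_patch` are assumptions introduced for the example.

```c
/* Weights copied from the coefs array of the exercise code. */
static const unsigned char WEIGHTS[5][5] = { {1,  4,  6,  4,  1},
                                             {4, 16, 24, 16,  4},
                                             {6, 24, 36, 24,  6},
                                             {4, 16, 24, 16,  4},
                                             {1,  4,  6,  4,  1} };

/* Sum of the 25 weights, i.e. the normalization factor. */
unsigned int weight_sum(void)
{
    unsigned int sum = 0;
    for (int i = 0; i < 5; ++i)
        for (int j = 0; j < 5; ++j)
            sum += WEIGHTS[i][j];
    return sum;
}

/* Hypothetical check: blur one channel of a 5x5 patch in which every
 * pixel has the same value. The result must be that same value. */
unsigned char blur_uniform_patch(unsigned char value)
{
    unsigned int pix = 0;
    for (int i = 0; i < 5; ++i)
        for (int j = 0; j < 5; ++j)
            pix += value * WEIGHTS[i][j];
    return (unsigned char)(pix / 256);
}
```

The largest possible accumulated value is 255 * 256 = 65280, which does not fit in an `unsigned char`. This is why the exercise accumulates into an `unsigned int`.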

Your job is to offload the blur function.
Make sure that you use the correct data clauses for "pic" and "blurred" variables.

The original picture is 2232x4000 pixels.
We need one value for each RGB channel, which means that the actual size of the matrix is 2232x12000 (3x4000 values per row).
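
The picture is stored as one flat array, and the exercise code addresses a value with the expression `i*3*cols + j*3 + l`. As a reading aid, the same expression can be wrapped in a helper. The name `pixel_index` is hypothetical, and the R, G, B order of the channels is an assumption based on the `.rgb` file extension.

```c
#include <stddef.h>

/* Hypothetical helper mirroring the index expression of the exercise:
 * row i, column j, channel l (assumed 0 = R, 1 = G, 2 = B). */
size_t pixel_index(size_t i, size_t j, size_t l, size_t cols)
{
    return i * 3 * cols + j * 3 + l;
}
```

Moving one row down therefore skips `3*cols` values, and moving one pixel to the right skips 3 values.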

Example stored in: `../../examples/C/blur_simple_exercise.c`

In [None]:
%%idrrun -a
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <math.h>
/**
 * Apply a gaussian blurring filter to a picture read from a file
 *
 * List of functions:
 *   - void blur(unsigned char* pic, unsigned char* blurred, size_t rows, size_t cols, int passes)
 *     the actual filter
 *   - void out_pic(unsigned char* pic, char* name, size_t rows, size_t cols)
 *     create a .rgb file
 *   - void read_matrix_from_file(char *filename, unsigned char *pic, int rows, int cols)
 *     read the original picture
 */

void blur(unsigned char* pic,  unsigned char* blurred, size_t rows, size_t cols, int passes)
{
    /**
     * Perform the blurring of the picture
     * @ param pic(in): a pointer to the original picture
     * @ param blurred(out): a pointer to the blurred picture
     */
   size_t i, j, l, i_c, j_c;
   unsigned char *temp;
   unsigned int pix;
   unsigned char coefs[5][5] = { {1,  4,  6,  4,  1},
                                 {4, 16, 24, 16,  4},
                                 {6, 24, 36, 24,  6},
                                 {4, 16, 24, 16,  4},
                                 {1,  4,  6,  4,  1}};
   for (int pass = 0; pass < passes; ++pass){
      for (i=2; i<rows-2; ++i)
         for (j=2; j<cols-2; ++j)
            for (l=0; l<3; ++l)
            {
               pix = 0;
               for (i_c=0; i_c<5; ++i_c)
                  for (j_c=0; j_c<5; ++j_c)
                     pix += (pic[(i+i_c-2)*3*cols+(j+j_c-2)*3+l]
                              *coefs[i_c][j_c]);

               blurred[i*3*cols+j*3+l] = (unsigned char)(pix/256);
            }
      temp = pic;
      pic = blurred;
      blurred = temp;
   }
}

void out_pic(unsigned char* pic, char* name, size_t rows, size_t cols)
{
    /**
     * Output of the picture into a sequence of pixel
     * Use show_rgb(filepath, rows, cols) to display
     * @param rows(in) the number of rows in the picture
     * @param cols(in) the number of columns in the picture
     */
   FILE* f = fopen(name, "wb");
   fwrite(pic, sizeof(unsigned char), rows*3*cols, f);
   fclose(f);
}

void read_matrix_from_file(char *filename, unsigned char *pic, int rows, int cols)
{
   /**
    * @brief Reads a 3D matrix from a binary file.
    *
    * This function reads a binary file and stores the data in a 3D matrix.
    * The data is assumed to be stored in binary format and is read in one pass.
    *
    * @param filename The name of the file to read from.
    * @param pic A pointer to the buffer of unsigned char that will receive
    * the rows*cols*3 values of the picture.
    * @param rows The number of rows in the matrix.
    * @param cols The number of columns in the matrix.
    */
   FILE *file = fopen(filename, "rb");

   if (file == NULL)
   {
      printf("Could not open file\n");
      return;
   }

   size_t total_size = (size_t)rows * cols * 3;
   size_t read_size = fread(pic, sizeof(unsigned char), total_size, file);
   if (read_size != total_size)
   {
      printf("Could not read all values from file\n %zu instead of %zu\n", read_size, total_size);
      fclose(file);   // do not leak the file handle on error
      return;
   }

   fclose(file);
}

int main(void)
{
   size_t rows,cols;

   rows = 2232;
   cols = 4000;

   printf("Size of picture is %zu x %zu\n", rows, cols);
   unsigned char* pic = (unsigned char*) malloc(rows*3*cols*sizeof(unsigned char));
   unsigned char* blurred_pic = (unsigned char*) malloc(rows*3*cols*sizeof(unsigned char));

   // Reads the original picture
   read_matrix_from_file("pic.rgb", pic, rows, cols);

   // Apply the blurring filter
   int passes = 40;
   blur(pic, blurred_pic, rows, cols, passes);

   out_pic(pic, "pic_read.rgb", rows, cols);
   out_pic(blurred_pic, "blurred.rgb", rows, cols);

   free(pic);
   free(blurred_pic);

   return 0;
}


In [None]:
from idrcomp import compare_rgb
"""compare the original and blurred pictures.
It is possible to display a cropped version of the images for better visualization.
For example (0.0,1.0,0.0,1.0) will display the whole image.
and (0.5,1.0,0.5,1.0) will display the bottom right part of the pictures"""
compare_rgb("pic.rgb", "blurred.rgb", (0.0, 1.0, 0.0, 1.0), 2232, 4000)

### Solution

In [None]:
import os
picture = os.path.join("..", "..", "pictures", "midris.jpg")
from idrcomp import convert_jpg_to_raw
convert_jpg_to_raw(picture, "pic.rgb")

Example stored in: `../../examples/C/blur_simple_solution.c`

In [None]:
%%idrrun -a
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <math.h>
/**
 * Apply a gaussian blurring filter to a picture read from a file
 *
 * List of functions:
 *   - void blur(unsigned char* pic, unsigned char* blurred, size_t rows, size_t cols, int passes)
 *     the actual filter
 *   - void out_pic(unsigned char* pic, char* name, size_t rows, size_t cols)
 *     create a .rgb file
 *   - void read_matrix_from_file(char *filename, unsigned char *pic, int rows, int cols)
 *     read the original picture
 */

void blur(unsigned char* pic,  unsigned char* blurred, size_t rows, size_t cols, int passes)
{
    /**
     * Perform the blurring of the picture
     * @ param pic(in): a pointer to the original picture
     * @ param blurred(out): a pointer to the blurred picture
     */
   size_t i, j, l, i_c, j_c;
   unsigned char *temp;
   unsigned int pix;
   unsigned char coefs[5][5] = { {1,  4,  6,  4,  1},
                                 {4, 16, 24, 16,  4},
                                 {6, 24, 36, 24,  6},
                                 {4, 16, 24, 16,  4},
                                 {1,  4,  6,  4,  1}};
   for (int pass = 0; pass < passes; ++pass){
      #pragma acc parallel loop copyin(pic[0:rows*3*cols],coefs[:5][:5]) copyout(blurred[0:rows*3*cols])
      for (i=2; i<rows-2; ++i)
         for (j=2; j<cols-2; ++j)
            for (l=0; l<3; ++l)
            {
               pix = 0;
               for (i_c=0; i_c<5; ++i_c)
                  for (j_c=0; j_c<5; ++j_c)
                     pix += (pic[(i+i_c-2)*3*cols+(j+j_c-2)*3+l]
                              *coefs[i_c][j_c]);

               blurred[i*3*cols+j*3+l] = (unsigned char)(pix/256);
            }
      temp = pic;
      pic = blurred;
      blurred = temp;
   }
}

void out_pic(unsigned char* pic, char* name, size_t rows, size_t cols)
{
    /**
     * Output of the picture into a sequence of pixel
     * Use show_rgb(filepath, rows, cols) to display
     * @param rows(in) the number of rows in the picture
     * @param cols(in) the number of columns in the picture
     */
   FILE* f = fopen(name, "wb");
   fwrite(pic, sizeof(unsigned char), rows*3*cols, f);
   fclose(f);
}

void read_matrix_from_file(char *filename, unsigned char *pic, int rows, int cols)
{
   /**
    * @brief Reads a 3D matrix from a binary file.
    *
    * This function reads a binary file and stores the data in a 3D matrix.
    * The data is assumed to be stored in binary format and is read in one pass.
    *
    * @param filename The name of the file to read from.
    * @param pic A pointer to the buffer of unsigned char that will receive
    * the rows*cols*3 values of the picture.
    * @param rows The number of rows in the matrix.
    * @param cols The number of columns in the matrix.
    */
   FILE *file = fopen(filename, "rb");

   if (file == NULL)
   {
      printf("Could not open file\n");
      return;
   }

   size_t total_size = (size_t)rows * cols * 3;
   size_t read_size = fread(pic, sizeof(unsigned char), total_size, file);
   if (read_size != total_size)
   {
      printf("Could not read all values from file\n %zu instead of %zu\n", read_size, total_size);
      fclose(file);   // do not leak the file handle on error
      return;
   }

   fclose(file);
}

int main(void)
{
   size_t rows,cols;

   rows = 2232;
   cols = 4000;

   printf("Size of picture is %zu x %zu\n", rows, cols);
   unsigned char* pic = (unsigned char*) malloc(rows*3*cols*sizeof(unsigned char));
   unsigned char* blurred_pic = (unsigned char*) malloc(rows*3*cols*sizeof(unsigned char));

   // Reads the original picture
   read_matrix_from_file("pic.rgb", pic, rows, cols);

   // Apply the blurring filter
   int passes = 40;
   blur(pic, blurred_pic, rows, cols, passes);

   out_pic(pic, "pic_read.rgb", rows, cols);
   out_pic(blurred_pic, "blurred.rgb", rows, cols);

   free(pic);
   free(blurred_pic);

   return 0;
}

Now we can compare the original picture with its blurred version:

In [None]:
from idrcomp import compare_rgb
"""compare the original and blurred pictures.
It is possible to display a cropped version of the images for better visualization.
For example (0.0,1.0,0.0,1.0) will display the whole image.
and (0.5,1.0,0.5,1.0) will display the bottom right part of the pictures"""
compare_rgb("pic.rgb", "blurred.rgb", (0.0, 1.0, 0.0, 1.0), 2232, 4000)

## Reductions with OpenACC

Your code performs a reduction when a loop updates the same variable at each iteration.

For example, if you compute the sum of all elements in an array:

```c
for (int i=0; i<size_array; ++i)
    sum += array[i];
```


If you run your code sequentially, no problem occurs.
However, we are here to use a massively parallel device to accelerate the computation.

In this case we have to be careful, since several threads can read and write the same variable simultaneously.
The result is no longer guaranteed to be correct because we have a race condition.

For some operations, OpenACC offers an efficient mechanism if you use the _reduction(operation:var1,var2,...)_ clause which is available for the directives:
- `#pragma acc loop reduction(op:var1)` 
- `#pragma acc parallel reduction(op:var1)` 
- `#pragma acc kernels loop reduction(op:var1)`
- `#pragma acc serial reduction(op:var1)`

__Important__: Please note that in many cases, the NVIDIA compiler (formerly PGI) is able to detect that a reduction is needed and will add it implicitly.
We advise you to make all implicit operations explicit for code readability and maintenance.
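
Applied to the sum shown at the start of this section, an explicit reduction could look like the sketch below. The function name `array_sum` is an assumption made for this example.

```c
/* Hypothetical helper: returns the sum of the size_array elements. */
double array_sum(const double *array, int size_array)
{
    double sum = 0.0;
    /* Each thread accumulates a private partial sum; the partial sums
     * are combined into sum at the end of the loop. */
    #pragma acc parallel loop copyin(array[0:size_array]) reduction(+:sum)
    for (int i = 0; i < size_array; ++i)
        sum += array[i];
    return sum;
}
```

Because the partial sums are combined in an unspecified order, the floating-point result may differ slightly from the sequential one.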

### Available operations

The set of operations is limited. We give here the most common:

| Operator   | Operation    | Syntax                     |
|------------|--------------|----------------------------|
| +          | sum          | `reduction(+:var1, ...)`   |
| *          | product      | `reduction(*:var2, ...)`   |
| max        | find maximum | `reduction(max:var3, ...)` |
| min        | find minimum | `reduction(min:var4,...)`  |

Other operators are available, please refer to the OpenACC specification for a complete list.

#### Reduction on several variables

If you perform a reduction with the same operation on several variables, you can give a comma-separated list after the colon:
```c
#pragma acc parallel loop reduction(+:var1, var2,...)
```


If you perform reductions with different operators then you have to specify a _reduction_ clause for each operator:
```c
#pragma acc parallel reduction(+:var1, var2) reduction(max:var3) reduction(*: var4)
```
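
Here is a small sketch that combines two different operators on the same loop. The function `sum_and_max` and its output parameters are hypothetical names chosen for the illustration.

```c
/* Hypothetical statistics on an array of n integers (n >= 1). */
void sum_and_max(const int *v, int n, long *sum_out, int *max_out)
{
    long sum = 0;
    int vmax = v[0];
    /* One reduction clause per operator. */
    #pragma acc parallel loop copyin(v[0:n]) reduction(+:sum) reduction(max:vmax)
    for (int i = 0; i < n; ++i)
    {
        sum += v[i];
        if (v[i] > vmax)
            vmax = v[i];
    }
    *sum_out = sum;
    *max_out = vmax;
}
```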

### Exercise

Let's do some statistics on the exponential function.
The goal is to compute

- the integral of the function between 0 and $\pi$ using the trapezoidal method
- the maximum value
- the minimum value
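
For reference, the trapezoidal rule as used in the code of this exercise can be written as follows, with $N$ standing for `nsteps` and $h$ for `step_l`:

$$
\int_{0}^{\pi} e^{x}\,dx \;\approx\; h \sum_{i=0}^{N-1} \frac{e^{x_i} + e^{x_{i+1}}}{2},
\qquad x_i = i\,h, \qquad h = \frac{\pi}{N}
$$

The exact value, which the last `printf` of the code uses to compute the difference, is:

$$
\int_{0}^{\pi} e^{x}\,dx = e^{\pi} - e^{0} = e^{\pi} - 1
$$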

You have to:

- Run the following example on the CPU. How much time does it take to run?
- Add the directives necessary to create one kernel for the loop that will run on the GPU
- Run the computation on the GPU. How much time does it take?

Your solution is considered correct if no implicit operation is reported by the compiler.

Example stored in: `../../examples/C/reduction_exponential_exercise.c`

In [None]:
%%idrrun -a
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <float.h>
int main(void)
{
    // current position and value
    double x,y,x_p;
    // Number of divisions of the function
    int nsteps = 1e9;
    // x min
    double begin = 0.;
    // x max
    double end = M_PI;
    // Sum of elements
    double sum = 0.;
    // Length of the step
    double step_l = (end-begin)/nsteps;

    double dmin = DBL_MAX;
    double dmax = -DBL_MAX;  // DBL_MIN is the smallest positive double, not the most negative
    for (int i=0 ; i < nsteps ; ++i )
    {
        x = i*step_l;
        x_p = (i+1)*step_l;
        y = (exp(x)+exp(x_p))/2;
        sum += y;
        if (y < dmin)
            dmin = y;
        if (y > dmax)
            dmax = y;
    }
    // Print the stats
    printf("The MINimum value of the function is: %f\n",dmin);
    printf("The MAXimum value of the function is: %f\n",dmax);
    printf("The integral of the function on [%f,%f] is: %f\n",begin,end,sum*step_l);
    printf("   difference is: %5.2e",exp(end)-exp(begin)-sum*step_l);
}


#### Solution

Example stored in: `../../examples/C/reduction_exponential_solution.c`

In [None]:
%%idrrun -a
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <float.h>
int main(void)
{
    // current position and value
    double x,y,x_p;
    // Number of divisions of the function
    int nsteps = 1e9;
    // x min
    double begin = 0.;
    // x max
    double end = M_PI;
    // Sum of elements
    double sum = 0.;
    // Length of the step
    double step_l = (end-begin)/nsteps;

    double dmin = DBL_MAX;
    double dmax = -DBL_MAX;  // DBL_MIN is the smallest positive double, not the most negative
#pragma acc parallel loop reduction(+:sum) reduction(min:dmin) reduction(max:dmax)
    for (int i=0 ; i < nsteps ; ++i )
    {
        x = i*step_l;
        x_p = (i+1)*step_l;
        y = (exp(x)+exp(x_p))/2;
        sum += y;
        if (y < dmin)
            dmin = y;
        if (y > dmax)
            dmax = y;
    }
    // Print the stats
    printf("The MINimum value of the function is: %f\n",dmin);
    printf("The MAXimum value of the function is: %f\n",dmax);
    printf("The integral of the function on [%f,%f] is: %f\n",begin,end,sum*step_l);
    printf("   difference is: %5.2e",exp(end)-exp(begin)-sum*step_l);
}

### Important Notes

- A special kernel is created for the reduction. With the NVIDIA compiler, its name is the name of the "parent" kernel with \_red appended.
- You may be tempted to use other directives to "emulate" the behavior of a reduction (this is possible with _atomic_ operations).
  We strongly discourage you from doing this. The _reduction_ clause is much more efficient.
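
To show what such an emulation looks like, the sketch below puts both variants side by side. The function names are invented for the example, and the atomic version is given only for comparison, not as a recommendation.

```c
/* Recommended: let OpenACC handle the reduction. */
double sum_with_reduction(const double *a, int n)
{
    double sum = 0.0;
    #pragma acc parallel loop copyin(a[0:n]) reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += a[i];
    return sum;
}

/* Discouraged: every iteration updates the same memory location,
 * so the atomic updates are serialized. */
double sum_with_atomic(const double *a, int n)
{
    double sum = 0.0;
    #pragma acc parallel loop copyin(a[0:n]) copy(sum)
    for (int i = 0; i < n; ++i)
    {
        #pragma acc atomic update
        sum += a[i];
    }
    return sum;
}
```

Both functions return the same value for inputs whose sum is exact in floating point, but the second one makes all iterations compete for the same variable.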