# Using OpenACC in modular programming

---
**Requirements:**

- [Get started](./Get_started.ipynb)
- [Data management](./Data_management.ipynb)
- [Loop configuration](./Loop_configuration.ipynb)

---

Most modern codes use modular programming to make them easier to read and maintain.
You will have to deal with it in your own code and make sure that every function is available wherever it is called, including on the GPU.

If you call a function inside a kernel, you need to tell the compiler to generate a GPU version of it.
In OpenACC, this is done with the `acc routine` directive.

In Fortran, you also have to take care of variables declared inside modules and use `acc declare create`.

## `acc routine <max_level_of_parallelism>`

This directive tells the compiler to generate a version of the function for the GPU in addition to the CPU version.
Because a GPU version exists, you can call the function inside a kernel.

When you use this directive you sign a contract with the compiler (normally no soul selling, but check it twice!):
you promise that the function will only be called from a section of code in which work sharing at this level, and at the levels below it, is not yet activated.
The available clauses are:

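- gang: the function may activate the _gang_, _worker_ and _vector_ levels of parallelism
- worker: the function may activate the _worker_ and _vector_ levels of parallelism
- vector: the function may activate the _vector_ level of parallelism only
- seq: the function is executed sequentially by one GPU thread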

The directive is added before the function definition or declaration:
```c
#include <stddef.h>

#pragma acc routine seq
double mean_value(double* array, size_t array_size)
{
    double sum = 0.0;
    for (size_t i = 0; i < array_size; ++i)
        sum += array[i];
    return sum / array_size;
}
```
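
As a minimal sketch of how such a routine might be called, the hypothetical function `row_means` below (not part of the original material) creates one kernel and calls `mean_value` once per row of a row-major matrix. The data clauses are assumptions about where the arrays live; adapt them to your own data management.

```c
#include <stddef.h>

// Sequential device routine: each call is executed by a single GPU thread.
#pragma acc routine seq
double mean_value(const double* array, size_t array_size)
{
    double sum = 0.0;
    for (size_t i = 0; i < array_size; ++i)
        sum += array[i];
    return sum / (double)array_size;
}

// Hypothetical caller: each iteration handles one row of the matrix.
void row_means(const double* table, double* means, size_t num_rows, size_t num_cols)
{
    #pragma acc parallel loop copyin(table[0:num_rows*num_cols]) copyout(means[0:num_rows])
    for (size_t i = 0; i < num_rows; ++i)
        means[i] = mean_value(&table[i * num_cols], num_cols);
}
```

Since the routine is `seq`, the caller is free to share the outer loop over any level of parallelism.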

### Wrong examples

Since this can be a bit tricky, here are some wrong examples with explanations.

This example is wrong because `acc parallel loop worker` already activates work sharing at the _worker_ level of parallelism.
The `acc routine worker` directive indicates that the function may activate the _worker_ and _vector_ levels of parallelism, and the same level cannot be activated twice.
```c
#pragma acc routine worker
void my_worker_func(){...}

...
#pragma acc parallel loop worker
for (int i=0; i<size; ++i)
    my_worker_func();

```

For a similar reason, this is forbidden, since `acc loop gang` already activates the _gang_ level that the routine may use:
```c
#pragma acc routine gang
void my_gang_func(){...}

...
#pragma acc parallel
{
    #pragma acc loop gang
    for (int i=0; i<size; ++i)
        my_gang_func();
}
```
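
For contrast, here is a sketch of call sites that respect the contract. The function bodies and the names `scale_row`, `fill_all` and `fill_and_scale` are illustrative assumptions: the _worker_ routine is called from a loop that only uses the _gang_ level, and the _gang_ routine is called directly in a parallel region, outside of any `acc loop`.

```c
#include <stddef.h>

// May activate the worker and vector levels internally.
#pragma acc routine worker
void scale_row(double* row, size_t n, double factor)
{
    #pragma acc loop worker vector
    for (size_t j = 0; j < n; ++j)
        row[j] *= factor;
}

// May activate the gang, worker and vector levels internally.
#pragma acc routine gang
void fill_all(double* data, size_t n, double value)
{
    #pragma acc loop gang vector
    for (size_t i = 0; i < n; ++i)
        data[i] = value;
}

// Hypothetical caller showing two valid call sites.
void fill_and_scale(double* data, size_t num_rows, size_t num_cols)
{
    // No loop directive around the call: the gang level is still free.
    #pragma acc parallel copyout(data[0:num_rows*num_cols])
    {
        fill_all(data, num_rows * num_cols, 1.0);
    }

    // Only the gang level is used here, so worker and vector remain available.
    #pragma acc parallel loop gang copy(data[0:num_rows*num_cols])
    for (size_t i = 0; i < num_rows; ++i)
        scale_row(&data[i * num_cols], num_cols, 2.0);
}
```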

The next example is wrong because it breaks the promise you made to the compiler:
a vector routine cannot contain loops at the _gang_ or _worker_ levels of parallelism.
```c
#pragma acc routine vector
void my_wrong_routine()
{
    #pragma acc loop gang worker
    for (int i=0; i<size; ++i)
        // some loop stuff
}
```
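
A version that keeps the promise only activates the _vector_ level inside the routine. In the sketch below, the name `my_vector_routine`, its parameters and the loop body are assumptions added to make the example complete.

```c
#include <stddef.h>

#pragma acc routine vector
void my_vector_routine(double* data, size_t size)
{
    // Only the vector level is activated, as announced by the routine directive.
    #pragma acc loop vector
    for (size_t i = 0; i < size; ++i)
        data[i] += 1.0;
}
```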

## Named `acc routine(name) <max_level_of_parallelism>`

The named form of the `acc routine` directive can appear anywhere a function prototype is allowed.
It has to appear *before* the definition of the function and before any use of it in that scope.

```c
#pragma acc routine(beautiful_name) seq
...

char* beautiful_name(char* name)
{
    // Do something
}
```

or:

```c
#pragma acc routine(beautiful_name) seq
...
#pragma acc routine
int another_brick(char* name)
{
    char* beauty = beautiful_name(name);
    // Integers are beauty
    ...
    return int_beauty;
}
```
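
A self-contained sketch of the named form could look as follows. The function names `square` and `sum_of_squares` are hypothetical, and the prototype is there so that the name is already in scope when the directive refers to it.

```c
#include <stddef.h>

// Prototype first, so that the name is known when the directive refers to it.
double square(double x);
#pragma acc routine(square) seq

// Hypothetical caller using the routine inside a kernel.
double sum_of_squares(const double* values, size_t n)
{
    double total = 0.0;
    #pragma acc parallel loop reduction(+:total) copyin(values[0:n])
    for (size_t i = 0; i < n; ++i)
        total += square(values[i]);
    return total;
}

// The definition comes later in the file; the named directive already applies to it.
double square(double x)
{
    return x * x;
}
```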

## Directives inside an `acc routine`

Routines declared with `acc routine` must not contain directives that create kernels (_parallel_, _serial_, _kernels_).
Consider the body of the function as being already inside a kernel.

```c
#pragma acc routine vector
void init(int* array, size_t size){
    // Only a loop directive here: the kernel is created by the caller
    #pragma acc loop
    for (size_t i=0; i<size; ++i)
        array[i] = (int)i;
}
```
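
To illustrate, a routine like `init` could be called from a kernel as sketched below. The caller `init_rows` and its data clause are assumptions: the kernel is created by the caller, while the routine only contains an `acc loop`.

```c
#include <stddef.h>

#pragma acc routine vector
void init(int* array, size_t size)
{
    // No parallel/serial/kernels here: this code already runs inside a kernel.
    #pragma acc loop
    for (size_t i = 0; i < size; ++i)
        array[i] = (int)i;
}

// Hypothetical caller: creates the kernel and distributes the rows over gangs.
void init_rows(int* table, size_t num_rows, size_t num_cols)
{
    #pragma acc parallel loop gang copyout(table[0:num_rows*num_cols])
    for (size_t r = 0; r < num_rows; ++r)
        init(&table[r * num_cols], num_cols);
}
```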

## Exercise

In this exercise, you have to compute the mean value of each row of a matrix.
The value is computed by a function `mean_value` working on one row at a time.
This function can use parallelism.

To get correct results, you need to make the variable `local_mean` private to each thread.
To achieve this, use the _private(vars, ...)_ clause of the `acc loop` directive.
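
As a reminder of the syntax, here is a sketch of the _private_ clause on an unrelated, hypothetical function (`clamp_all` and `tmp` are not part of the exercise). The scalar `tmp` is declared outside the loop, so the clause makes it explicit that each thread works on its own copy.

```c
#include <stddef.h>

// Hypothetical example: limit every value of the array to at most `limit`.
void clamp_all(double* values, size_t n, double limit)
{
    double tmp;  // declared outside the loop: one copy per thread is required
    #pragma acc parallel loop private(tmp) copy(values[0:n])
    for (size_t i = 0; i < n; ++i)
    {
        tmp = values[i];
        if (tmp > limit)
            tmp = limit;
        values[i] = tmp;
    }
}
```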

Example stored in: `../../examples/C/Modular_programming_mean_value_exercise.c`

In [None]:
%%idrrun -a
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

double mean_value(double* array, size_t array_size){
    double sum = 0.0;
    for(size_t i=0; i<array_size; ++i)
        sum += array[i];
    return sum/array_size;
}

void rand_init(double* array, size_t array_size)
{
     srand((unsigned) 12345900);
     for (size_t i=0; i<array_size; ++i)
         array[i] = 2.*((double)rand()/RAND_MAX -0.5);
}

void iterate(double* array, size_t array_size, size_t cell_size)
{
    double local_mean;
    for (size_t i = cell_size/2; i< array_size-cell_size/2; ++i)
    {
        local_mean = mean_value(&array[i-cell_size/2], cell_size);
        if (local_mean < 0.)
            array[i] += 0.1;
        else if (local_mean > 0.)
            array[i] -= 0.1;
    }
}

int main(void){
    size_t num_cols = 500000;
    size_t num_rows = 3000;

    double* table = (double*) malloc(num_rows*num_cols*sizeof(double)); 
    double* mean_values = (double*) malloc(num_rows*sizeof(double));
    // We initialize the first row with random values between -1 and 1
    rand_init(table, num_cols);

    for (size_t i=1; i<num_rows; ++i)
       iterate(&table[i*num_cols], num_cols, 32); 
    
    for (size_t i=0; i<num_rows; ++i) 
    {
        mean_values[i] = mean_value(&(table[i*num_cols]), num_cols);
    }

    for (size_t i=0; i<10; ++i)
        printf("Mean value of row %6zu=%10.5f\n", i, mean_values[i]);
    printf("...\n");
    for (size_t i=num_rows-10; i<num_rows; ++i)
        printf("Mean value of row %6zu=%10.5f\n", i, mean_values[i]);
    return 0;
}

### Solution

Example stored in: `../../examples/C/Modular_programming_mean_value_solution.c`

In [None]:
%%idrrun -a
#include <stdio.h>
#include <stdlib.h>
#pragma acc routine vector
double mean_value(double* array, size_t array_size){
    double sum = 0.0;
    #pragma acc loop vector reduction(+:sum)
    for(size_t i=0; i<array_size; ++i)
        sum += array[i];
    return sum/array_size;
}

void rand_init(double* array, size_t array_size)
{
     srand((unsigned) 12345900);
     for (size_t i=0; i<array_size; ++i)
         array[i] = 2.*((double)rand()/RAND_MAX -0.5);
}

void iterate(double* array, size_t array_size, size_t cell_size)
{
    double local_mean;
    #pragma acc parallel loop private(local_mean) present(array[:array_size])
    for (size_t i = cell_size/2; i< array_size-cell_size/2; ++i)
    {
        local_mean = mean_value(&array[i-cell_size/2], cell_size);
        if (local_mean < 0.)
            array[i] += 0.1;
        else if (local_mean > 0.)
            array[i] -= 0.1;
    }
}

int main(void){
    size_t num_cols = 1000000;
    size_t num_rows = 3000;

    double* table = (double*) malloc(num_rows*num_cols*sizeof(double)); 
    double* mean_values = (double*) malloc(num_rows*sizeof(double));
    // We initialize the first row with random values between -1 and 1
    rand_init(table, num_cols);
    #pragma acc enter data copyin(table[0:num_rows*num_cols])

    for (size_t i=1; i<num_rows; ++i)
       iterate(&table[i*num_cols], num_cols, 32); 
    
    #pragma acc parallel loop gang present(table[0:num_rows*num_cols]) copyout(mean_values[0:num_rows])
    for (size_t i=0; i<num_rows; ++i) 
    {
        mean_values[i] = mean_value(&(table[i*num_cols]), num_cols);
    }

    for (size_t i=0; i<10; ++i)
        printf("Mean value of row %6zu=%10.5f\n", i, mean_values[i]);
    printf("...\n");
    for (size_t i=num_rows-10; i<num_rows; ++i)
        printf("Mean value of row %6zu=%10.5f\n", i, mean_values[i]);
    return 0;
}