# Advanced loop configuration

---
**Requirements:**

- [Get started](./Get_started.ipynb)
- [Variables_status](./Variables_status.ipynb)
- [Data management](./Data_management.ipynb)

---

Different levels of parallelism are generated by the _gang_, _worker_ and _vector_ clauses.
The _loop_ directive is responsible for sharing the parallelism across the different levels.

The degree of parallelism at a given level is determined by the number of gangs, the number of workers and the vector length. By default these values are chosen by the implementation, and the choice depends not only on the target architecture but also on the portion of code to which the parallelism is applied.
Modifying this default behavior is not recommended, as it usually gives good performance.

It is however possible to specify the number of gangs, the number of workers and the vector length with the _num_gangs_, _num_workers_ and _vector_length_ clauses. These clauses are allowed on the _parallel_ and _kernels_ constructs. You might want to use them in order to:

- debug: restrict the execution to a single gang (without restricting the workers and vectors, as the _serial_ construct would do), or vary the degree of parallelism in order to expose a race condition
- limit the number of gangs to lower the memory occupancy when you have to privatize arrays
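
Both use cases can be sketched in a few lines. The names `private_bytes` and `scale` below are hypothetical and only serve the illustration; the estimate assumes that each gang holds one full copy of a privatized array, as the _private_ clause on a _parallel_ construct implies.

```c
#include <stddef.h>

/* Hypothetical helper: bytes of device memory taken by one private array
   of n doubles when the region runs on ngangs gangs (one copy per gang). */
size_t private_bytes(size_t ngangs, size_t n)
{
    return ngangs * n * sizeof(double);
}

/* Hypothetical kernel: y[i] = a * x[i], executed on ngangs gangs.
   With ngangs == 1 the loop runs on a single gang but stays vectorized. */
void scale(int ngangs, int n, double a, const double *x, double *y)
{
    #pragma acc parallel loop gang vector num_gangs(ngangs) copyin(x[0:n]) copyout(y[0:n])
    for (int i = 0; i < n; ++i)
    {
        y[i] = a * x[i];
    }
}
```

Calling `scale(1, n, a, x, y)` gives the single-gang debugging configuration, and changing the first argument varies the degree of parallelism without touching the loop. As an order of magnitude, `private_bytes(500, 50000)` is 200 MB with 8-byte doubles, against 400 kB for a single gang.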

## Syntax

The clauses that specify the number of gangs, the number of workers and the vector length are _num_gangs_, _num_workers_ and _vector_length_.

The argument of each clause can be either a literal number or a variable. A literal must be a positive integer, and a variable must be a scalar integer variable.

```c
#pragma acc parallel num_gangs(3500) vector_length(256)
{
    #pragma acc loop gang
    for(int i = 0 ; i < size_i ; ++i)
    {
        #pragma acc loop vector
        for(int j = 0 ; j < size_j ; ++j)
        {
            // A Fabulous calculation
        }
    }
}

#pragma acc parallel loop gang num_gangs(size_i/2) vector_length(256)
for(int i = 0 ; i < size_i ; ++i)
{
    #pragma acc loop vector
    for(int j = 0 ; j < size_j ; ++j)
    {
        // A Fabulous calculation
    }
}
```

## Restrictions

The restrictions described here are for NVIDIA architectures.

- The number of gangs is limited to 2³¹-1 (65535 if the compute capability is lower than 3.0).
- The product num_workers x vector_length cannot exceed 1024 (512 if the compute capability is lower than 2.0).
- For good performance, set vector_length to a multiple of 32 (up to 1024).
- Using routines with a _vector_ level of parallelism or higher sets the _vector_length_ to 32 (compiler limitation).

These restrictions can vary with the architecture. For future implementations, refer to the "CUDA C Programming Guide" (Section G, "Features and Technical Specifications").
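
The limits listed above can be checked on the host before a configuration is passed to the clauses. This is a minimal sketch with hypothetical names (`config_is_valid`, `round_vector_length`); the constants are the values quoted above for compute capability 3.0 or higher and would have to be adapted for another architecture.

```c
#include <stdbool.h>

/* Values quoted above for NVIDIA GPUs (compute capability >= 3.0). */
enum { MAX_THREADS_PER_GANG = 1024, VECTOR_GRANULARITY = 32 };

/* True if num_workers x vector_length respects the hard limit. */
bool config_is_valid(int num_workers, int vector_length)
{
    if (num_workers < 1 || vector_length < 1)
    {
        return false;
    }
    return (long)num_workers * (long)vector_length <= MAX_THREADS_PER_GANG;
}

/* Round a requested vector length up to a multiple of 32, capped at 1024. */
int round_vector_length(int requested)
{
    if (requested < 1)
    {
        return VECTOR_GRANULARITY;
    }
    if (requested >= MAX_THREADS_PER_GANG)
    {
        return MAX_THREADS_PER_GANG;
    }
    return ((requested + VECTOR_GRANULARITY - 1) / VECTOR_GRANULARITY) * VECTOR_GRANULARITY;
}
```

For instance, 4 workers with a vector length of 256 exactly reach the limit of 1024, whereas 8 workers with the same vector length exceed it.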

## Example

Example stored in: `../../examples/C/Loop_configuration_example.c`

In [None]:
%%idrrun -a
#include <stdio.h>
#include <stdlib.h>
#include <openacc.h>
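int main(void)
{
    int n = 200;
    int ngangs = 1, nworkers = 2, nvectors = 32;
    size_t total = (size_t)n*n*n;
    // 8 million elements: allocate on the heap, this is too large for a typical stack
    size_t *table = malloc(total * sizeof *table);
    if (table == NULL) return EXIT_FAILURE;

    #pragma acc parallel loop gang num_gangs(ngangs) num_workers(nworkers) vector_length(nvectors) copyout(table[0:total])
    for (int i=0; i<n; ++i)
    {
        #pragma acc loop worker
        for (int j=0; j<n; ++j)
        {
            #pragma acc loop vector
            for (int k=0; k<n; ++k) table[i*n*n + j*n + k] = k + 1000*j + 1000*1000*i;
        }
    }
    // table holds size_t values, so they are printed with %zu
    printf("%zu %zu\n",table[0],table[total-1]);
    free(table);
    return 0;
}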

## Exercise

A simple exercise is to modify the value of the _num_gangs_ clause (and to vary the vector length) and then compare the execution times.
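
One way to make that comparison is to wrap a kernel in a wall-clock timer. The sketch below is only an illustration: `wall_seconds` and `timed_kernel` are hypothetical names, the kernel is an arbitrary double sum, and the timer relies on the C11 function `timespec_get`.

```c
#include <time.h>

/* Wall-clock time in seconds, based on the C11 timespec_get function. */
static double wall_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* Hypothetical kernel: sums i*j over an n x n domain with the requested
   number of gangs and vector length; the elapsed time goes to *elapsed. */
double timed_kernel(int ngangs, int vlen, int n, double *elapsed)
{
    double sum = 0.0;
    double start = wall_seconds();

    #pragma acc parallel loop gang num_gangs(ngangs) vector_length(vlen) reduction(+:sum)
    for (int i = 0; i < n; ++i)
    {
        #pragma acc loop vector reduction(+:sum)
        for (int j = 0; j < n; ++j)
        {
            sum += (double)i * (double)j;
        }
    }

    *elapsed = wall_seconds() - start;
    return sum;
}
```

Calling `timed_kernel` in a loop over a few values of `ngangs` and `vlen` and printing `elapsed` gives the comparison. The first call usually includes the device initialization, so it is safer to discard it or to run a warm-up call first.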

For a change, we will do an exercise that does not make sense physically. It can however come in handy, especially if you try the practical work on HYDRO. In this exercise, you will have to:

- parallelize a few lines of code
- make sure that the result reproduces the CPU behavior
- manually modify the number of gangs
- observe that the number of gangs is limited by the system's size in this code
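
To check the second point, the result can be compared with a closed form: the double sum of (i+j) over 0 ≤ i, j < n equals n²(n-1). The helpers below are a hypothetical sketch (`expected_result` and `close_enough` are not part of the exercise files); since every partial sum in this code is an integer that a double represents exactly, the CPU and GPU results should agree with this value.

```c
#include <stdbool.h>
#include <stddef.h>

/* Closed form of sum_{i<n} sum_{j<n} (i + j) = n * n * (n - 1). */
double expected_result(size_t n)
{
    if (n == 0)
    {
        return 0.0;
    }
    return (double)n * (double)n * (double)(n - 1);
}

/* True if value matches reference within the relative tolerance rel_tol. */
bool close_enough(double value, double reference, double rel_tol)
{
    double diff = value - reference;
    double scale = reference < 0.0 ? -reference : reference;
    if (diff < 0.0)
    {
        diff = -diff;
    }
    return diff <= rel_tol * scale;
}
```

For the size used in this exercise, `expected_result(50000)` is 124997500000000, and a test such as `close_enough(res, expected_result(size), 1e-12)` can be added before the final `printf`.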

Example stored in: `../../examples/C/Loop_configuration_exercise.c`

In [None]:
%%idrrun --cliopts "500"
#include <stdio.h>
#include <stdlib.h>
#include <openacc.h>

int main(int argc, char** argv)
{
    size_t size=50000;
    double table[size];
    double sum_val;
    double res;

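    // Number of gangs read from the command line (defaults to 1 if no argument is given)
    unsigned int ngangs = (argc > 1) ? (unsigned int) atoi(argv[1]) : 1;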

    res = 0.0;
    for (size_t i=0; i<size; ++i)
    {
        for(size_t j=0; j<size; ++j)
        {
            table[j] = (i+j);
        }
        sum_val = 0.0;
        for(size_t j=0; j<size; ++j)
        {
            sum_val += table[j];
        }
        res += sum_val;
    }
    printf("result: %lf \n",res);
    return 0;
}

### Solution

Example stored in: `../../examples/C/Loop_configuration_solution.c`

In [None]:
%%idrrun -a --cliopts "500"
#include <stdio.h>
#include <stdlib.h>
#include <openacc.h>

int main(int argc, char** argv)
{
    size_t size=50000;
    double array[size];
    double table[size];
    double sum_val;
    double res;

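    // Number of gangs read from the command line (defaults to 1 if no argument is given)
    unsigned int ngangs = (argc > 1) ? (unsigned int) atoi(argv[1]) : 1;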

    res = 0.0;
#pragma acc parallel num_gangs(ngangs) copyout(array[0:size]) private(table[0:size])
{
    #pragma acc loop gang reduction(+:res)
    for (size_t i=0; i<size; ++i)
    {
        #pragma acc loop vector
        for(size_t j=0; j<size; ++j)
        {
            table[j] = (i+j);
        }
        sum_val = 0.0;
        #pragma acc loop vector reduction(+:sum_val)
        for(size_t j=0; j<size; ++j)
        {
            sum_val += table[j];
        }
        array[i] = sum_val;
        res += sum_val;
    }
}

    printf("result: %lf\n",res);
    return 0;
}