# Compute constructs

## Giving more freedom to the compiler: `acc kernels`

This training course focuses on the `acc parallel` compute construct because it gives the developer almost full control.

The OpenACC standard also lets you give the compiler more freedom with the `acc kernels` compute construct.
Its behavior is different: several kernels might be created from one `acc kernels` region.
One kernel is generated for each loop nest.

### Syntax

The following example generates 2 kernels (if reductions are present, additional kernels are generated to handle them):

```c
#pragma acc kernels
{
    // 1st kernel generated
    #pragma acc loop
    for(int i=0; i<size_i; ++i)
    {
        for(int j=0; j<size_j; ++j)
        {
            // Perform some computation
        }
    }

    // 2nd kernel generated
    #pragma acc loop
    for(int i=0; i<size_i; ++i)
    {
        // Some more computation
    }
}
```

It is almost equivalent to this example:

```c
#pragma acc data <data clauses>
{
    // 1st kernel generated
    #pragma acc parallel loop
    for(int i=0; i<size_i; ++i)
    {   
        for(int j=0; j<size_j; ++j)
        {
            // Perform some computation
        }
    } 

    // 2nd kernel generated
    #pragma acc parallel loop
    for(int i=0; i<size_i; ++i)
    {   
        // Some more computation
    } 

}
```

The main difference is the status of the scalar variables used in the compute construct.
With `acc kernels` they are shared, whereas with `acc parallel` they are private at the gang level.

The configuration of each kernel (number of gangs, number of workers and vector length) can also be different.
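
As a rough illustration of per-kernel configuration, the sketch below sets a different configuration on each `acc parallel loop` inside one data region. The function `fill_arrays` and the values 256, 128 and 32 are hypothetical and chosen only for the example; they are not tuning advice.

```c
// Hypothetical helper: fills a matrix and a vector inside one data region.
// Each compute construct carries its own configuration.
void fill_arrays(float *mat, float *vec, int size_i, int size_j)
{
    #pragma acc data copyout(mat[0:size_i*size_j], vec[0:size_i])
    {
        // 1st kernel: many gangs and long vectors for the 2D loop nest
        #pragma acc parallel loop gang num_gangs(256) vector_length(128) present(mat[0:size_i*size_j])
        for (int i = 0; i < size_i; ++i)
        {
            #pragma acc loop vector
            for (int j = 0; j < size_j; ++j)
            {
                mat[i * size_j + j] = (float)(i + j);
            }
        }

        // 2nd kernel: smaller configuration for the 1D loop
        #pragma acc parallel loop num_gangs(32) vector_length(32) present(vec[0:size_i])
        for (int i = 0; i < size_i; ++i)
        {
            vec[i] = 2.0f * (float)i;
        }
    }
}
```

With `acc kernels` the compiler would normally pick these values itself for each kernel it generates.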

### Independent loops

The compiler is a very prudent piece of software. If it detects that parallelizing your loops could produce wrong results, it runs them sequentially.
Have a look at the compilation report to see whether the compiler struggles with some loops.

However, it might be a bit too prudent. If you know that parallelizing your loops is safe, you can tell the compiler so with the *independent* clause of the `acc loop` directive.

```c
#pragma acc kernels
{
    #pragma acc loop independent
    for (int i=0; i<size; ++i)
    {
        // A very safe loop
    }
}
```
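
A typical case where the compiler may be too prudent is a loop over pointer arguments, since it usually cannot prove that the arrays do not overlap. The sketch below assumes a hypothetical function `scale` whose caller guarantees that `in` and `out` are distinct arrays; that guarantee is what makes `independent` safe here.

```c
// Hypothetical helper: out[i] = factor * in[i].
// Assumption (not checked here): `in` and `out` do not overlap.
void scale(float *out, const float *in, float factor, int size)
{
    #pragma acc kernels copyin(in[0:size]) copyout(out[0:size])
    {
        // The pointers might alias as far as the compiler knows,
        // so we state explicitly that the iterations are independent.
        #pragma acc loop independent
        for (int i = 0; i < size; ++i)
        {
            out[i] = factor * in[i];
        }
    }
}
```

If the assumption is wrong (for example `out` points into `in`), the clause hides a real dependency and the results may be incorrect.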

## Running sequentially on the GPU? The `acc serial` compute construct

GPUs are not very efficient at running sequential code; however, there are 2 cases where it can be useful:

- Debugging code
- Avoiding some data transfers

The OpenACC standard gives you the `acc serial` directive for this purpose.

It is equivalent to having a parallel kernel which uses only one thread.

### Syntax

```c
#pragma acc serial <clauses>
{
    // My sequential kernel
}
```

which is equivalent to:

```c
#pragma acc parallel num_gangs(1) num_workers(1) vector_length(1)
{
    // My sequential kernel
}
```
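
The data-transfer case can be sketched as follows: a small sequential step runs on the GPU between parallel kernels, so the array does not have to travel to the CPU and back. The function `double_and_clamp_ends` is hypothetical and assumes `size >= 1`.

```c
// Hypothetical helper: doubles every element, then resets the two boundary values.
// Assumption: size >= 1.
void double_and_clamp_ends(float *a, int size)
{
    #pragma acc data copy(a[0:size])
    {
        // Parallel kernel working on the whole array
        #pragma acc parallel loop present(a[0:size])
        for (int i = 0; i < size; ++i)
        {
            a[i] = 2.0f * a[i];
        }

        // Tiny sequential step executed on the GPU:
        // the array stays on the device, no transfer is needed
        #pragma acc serial present(a[0:size])
        {
            a[0] = 0.0f;
            a[size - 1] = 0.0f;
        }
    }
}
```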

## Data region associated with compute constructs

You can manage your data transfers with data clauses:

| clause    | effect when entering the region | effect when leaving the region |
|-----------|---------------------------------|---------------------------------|
| create    | **If the variable is not already present on the GPU**: allocate the memory needed on the GPU | **If the variable is not in another active data region**: free the memory on the GPU |
| copyin    | **If the variable is not already present on the GPU**: allocate the memory and initialize the variable with the values it has on CPU| **If the variable is not in another active data region**: free the memory on the GPU |
| copyout   | **If the variable is not already present on the GPU**: allocate the memory needed on the GPU | **If the variable is not in another active data region**: copy the values from the GPU to the CPU then free the memory on the GPU |
| copy      | **If the variable is not already present on the GPU**: allocate the memory and initialize the variable with the values it has on CPU | **If the variable is not in another active data region**: copy the values from the GPU to the CPU then free the memory on the GPU |
| present   | None: the variable must already be present on the GPU (otherwise a runtime error occurs) | None |
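
A minimal sketch of how these clauses might be combined on a single compute construct is shown below. The function `shift_and_square` and its arguments are hypothetical: `in` is only read, `tmp` is scratch space that never needs to reach the CPU, and `out` only carries results back.

```c
// Hypothetical helper: out[i] = (in[i] + 1)^2, with a temporary array kept on the GPU.
// Assumption (not checked here): the three arrays do not overlap.
void shift_and_square(const float *in, float *tmp, float *out, int size)
{
    // in : read only    -> copyin  (CPU to GPU when entering)
    // tmp: scratch      -> create  (allocated on the GPU, never copied)
    // out: results only -> copyout (GPU to CPU when leaving)
    #pragma acc kernels copyin(in[0:size]) create(tmp[0:size]) copyout(out[0:size])
    {
        #pragma acc loop independent
        for (int i = 0; i < size; ++i)
        {
            tmp[i] = in[i] + 1.0f;
        }

        #pragma acc loop independent
        for (int i = 0; i < size; ++i)
        {
            out[i] = tmp[i] * tmp[i];
        }
    }
}
```

After the region, only `out` is guaranteed to hold up-to-date values on the CPU; the CPU copy of `tmp` should not be relied upon.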

*IMPORTANT*: If your `acc kernels` construct is enclosed in another data region, be careful: you cannot use the data clauses to update data that is already present on the GPU.
You need to use `acc update` for data already in another data region.
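
The situation described above can be sketched like this. The function `increment_twice` is hypothetical: the array is already inside an enclosing data region, so the CPU and GPU copies are synchronized explicitly with `acc update`.

```c
// Hypothetical helper: increments a[] twice on the GPU.
// Between the two kernels, the CPU reads a[0] and overwrites it.
// Returns the value of a[0] seen by the CPU after the first kernel.
float increment_twice(float *a, int size)
{
    float snapshot = 0.0f;

    #pragma acc data copy(a[0:size])
    {
        #pragma acc kernels present(a[0:size])
        {
            #pragma acc loop independent
            for (int i = 0; i < size; ++i)
            {
                a[i] += 1.0f;
            }
        }

        // a is already present: a copy/copyout clause on the kernels above
        // would not transfer anything. Refresh the CPU copy explicitly.
        #pragma acc update self(a[0:1])
        snapshot = a[0];

        // Modify the value on the CPU and push it back to the GPU
        a[0] = 10.0f;
        #pragma acc update device(a[0:1])

        #pragma acc kernels present(a[0:size])
        {
            #pragma acc loop independent
            for (int i = 0; i < size; ++i)
            {
                a[i] += 1.0f;
            }
        }
    }

    return snapshot;
}
```

Without the two `acc update` directives, the CPU would read a stale `a[0]` and the GPU would never see the value written by the CPU.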