# The most important directives and clauses

## Directive syntax

<img alt="OpenACC directive" src="../../pictures/directive_acc.png" style="float:none" width="30%"/>

Broken down, a directive is made of the following elements:

- The sentinel is a special marker for the compiler. It tells the compiler that what follows must be interpreted as an OpenACC directive
- The directive is the action to perform. In the example, _parallel_ opens a parallel region that will be offloaded to the GPU
- The clauses are "options" of the directive. In the example, we want to copy some data to the GPU
- The clause arguments give more details for the clause. In the example, we give the names of the variables to be copied
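The four elements listed above can be pointed out on a small C function. This is only a sketch: the function name `scale` and its arguments are made up for the illustration, and the C sentinel `#pragma acc` is used (Fortran uses `!$acc`). A compiler without OpenACC support ignores the pragmas and runs the loop on the CPU.

```c
/* Hypothetical example: multiply every element of x by alpha. */
void scale(int n, double alpha, double *x)
{
    /* sentinel:        #pragma acc
       directive:       parallel
       clause:          copy
       clause argument: x[0:n] (n elements of x starting at index 0) */
    #pragma acc parallel copy(x[0:n])
    {
        #pragma acc loop
        for (int i = 0; i < n; ++i) {
            x[i] *= alpha;
        }
    }
}
```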

## Creating kernels: Compute constructs

<!--| Language |      Sentinel |                               Directive | Action                                                         |
|----------|---------------|-----------------------------------------|----------------------------------------------------------------|
|    C/C++ | `#pragma acc` | `parallel` <br/> `kernels`<br/> `serial`| Create one kernels for the enclosed source code. Developer has full control. <br/> Create one kernel for each loop nest. Compiler has control <br/> Run sequentially the enclosed source code.|
|  Fortran | `!$acc`       | `parallel` <br/> `kernels`<br/> `serial`| Create one kernels for the enclosed source code. Developer has full control. <br/> Create one kernel for each loop nest. Compiler has control <br/> Run sequentially the enclosed source code.|
-->
| Directive | Number of kernels created | Who's in charge? | Comment |
|-----------|---------------------------|------------------|---------|
| [`acc parallel`](./Get_started.ipynb) | One for the enclosed region | The developer | |
| [`acc kernels`](./Compute_constructs.ipynb)      | One for each loop nest in the enclosed region | The compiler | |
| [`acc serial`](./Compute_constructs.ipynb)        | One for the enclosed region | The developer | Only one thread is used. Mainly useful for debugging |
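As a rough illustration of the table, the sketch below uses the three compute constructs in one hypothetical function, `compute_constructs_demo` (not part of the course material). Data movement is deliberately naive here. Each construct copies what it needs, which would be wasteful in real code.

```c
/* Hypothetical example: fill a and b, add them into c, then turn c into a prefix sum. */
void compute_constructs_demo(int n, float *a, float *b, float *c)
{
    /* kernels: the compiler analyses the region and creates one kernel per loop nest (two here). */
    #pragma acc kernels copyout(a[0:n], b[0:n])
    {
        for (int i = 0; i < n; ++i) {
            a[i] = (float)i;
        }
        for (int i = 0; i < n; ++i) {
            b[i] = 2.0f * (float)i;
        }
    }

    /* parallel: one kernel for the whole region; the developer states that the loop is parallel. */
    #pragma acc parallel copyin(a[0:n], b[0:n]) copyout(c[0:n])
    {
        #pragma acc loop
        for (int i = 0; i < n; ++i) {
            c[i] = a[i] + b[i];
        }
    }

    /* serial: one kernel run by a single thread, suitable for this loop-carried dependency. */
    #pragma acc serial copy(c[0:n])
    {
        for (int i = 1; i < n; ++i) {
            c[i] += c[i - 1];
        }
    }
}
```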

### Clauses

| Clause                                       | Available for                   | Effect                                                                                                                |
|----------------------------------------------|---------------------------------|-----------------------------------------------------------------------------------------------------------------------|
| num\_gangs(#gangs)                           | `parallel`, `kernels`           | Set the **number of gangs** used by the kernel(s)                                                                     |
| num\_workers(#workers)                       | `parallel`, `kernels`           | Set the **number of workers** used by the kernel(s)                                                                   |
| vector\_length(#length)                      | `parallel`, `kernels`           | Set the **vector length**, i.e. the number of threads in a worker                                                     |
| reduction(op:vars, ...)                      | `parallel`, `kernels`, `serial` | Perform a reduction of _op_ kind on _vars_                                                                            |
| private(vars, ...)                           | `parallel`, `serial`            | Make _vars_ private at _gang_ level                                                                                   |
| firstprivate(vars, ...)                      | `parallel`, `serial`            | Make _vars_ private at _gang_ level and initialize the copies with the values the variables have on the host when entering the construct |
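A minimal sketch of how several of these clauses might be combined on one `parallel` construct. The function `scaled_dot` is hypothetical, and the values 64 and 128 are arbitrary placeholders, not tuning advice.

```c
/* Hypothetical example: return alpha * sum(x[i] * y[i]). */
double scaled_dot(int n, double alpha, const double *x, const double *y)
{
    double sum = 0.0;

    /* num_gangs / vector_length: launch configuration (arbitrary values here).
       firstprivate(alpha): each gang gets a copy initialized with the host value.
       reduction(+:sum): partial sums are combined into sum at the end. */
    #pragma acc parallel num_gangs(64) vector_length(128) \
                         firstprivate(alpha) copyin(x[0:n], y[0:n]) reduction(+:sum)
    {
        #pragma acc loop gang vector reduction(+:sum)
        for (int i = 0; i < n; ++i) {
            sum += alpha * x[i] * y[i];
        }
    }
    return sum;
}
```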

## Managing data

### Data regions

| Region              | Directive                                                                  |
|---------------------|----------------------------------------------------------------------------|
| Program lifetime    | [`acc enter data` & `acc exit data`](./Data_management.ipynb) |
| Function/Subroutine | [`acc declare`](./Data_management.ipynb)                                 |
| Structured          | [`acc data`](./Data_management.ipynb)                                      |
| Kernels             | Compute constructs directives                                              |
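The sketch below illustrates a data region that is not tied to a code block, using `acc enter data` and `acc exit data`. The `field_*` helpers are hypothetical names. The `delete` clause on `exit data` is standard OpenACC but is not listed on this page. The fetch helper uses `acc update self`, which appears in the "Updating data" section.

```c
#include <stdlib.h>

/* Allocate n doubles on the host and create the matching device array. */
double *field_alloc(int n)
{
    double *u = malloc((size_t)n * sizeof *u);
    if (u == NULL) {
        return NULL;
    }
    #pragma acc enter data create(u[0:n])
    return u;
}

/* Fill the device copy; present() states that u is already on the GPU. */
void field_fill(double *u, int n, double value)
{
    #pragma acc parallel loop present(u[0:n])
    for (int i = 0; i < n; ++i) {
        u[i] = value;
    }
}

/* Bring the device values back to the host. */
void field_fetch(double *u, int n)
{
    #pragma acc update self(u[0:n])
}

/* Release the device copy, then the host memory. */
void field_free(double *u, int n)
{
    #pragma acc exit data delete(u[0:n])
    free(u);
}
```

The device copy lives from `field_alloc` to `field_free`, across as many function calls as needed.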

### Data clauses

To choose the right data clause you need to answer the following questions:

- Does the kernel need the values computed on the host (CPU) beforehand? (Before)
- Are the values computed inside the kernel needed on the host (CPU) afterwards? (After)

|                  | Needed after        | Not needed after  |
|------------------|---------------------|-------------------|
|Needed before     |  copy(var1, ...)    | copyin(var2, ...) |
|Not needed before |  copyout(var3, ...) | create(var4, ...) |
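As an illustration of the decision table, the following sketch puts one array in each cell. The function and variable names are hypothetical. `tmp` is a scratch buffer provided by the caller whose host values are never needed.

```c
/* Hypothetical example: out[i] = in[i]^2 + 1 and total[i] += in[i]^2. */
void square_and_accumulate(int n, const double *in, double *out,
                           double *total, double *tmp)
{
    /* in:    needed before, not after  -> copyin
       out:   not needed before, after  -> copyout
       total: needed before and after   -> copy
       tmp:   needed neither way        -> create */
    #pragma acc data copyin(in[0:n]) copyout(out[0:n]) copy(total[0:n]) create(tmp[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i) {
            tmp[i] = in[i] * in[i];
        }

        #pragma acc parallel loop
        for (int i = 0; i < n; ++i) {
            out[i] = tmp[i] + 1.0;
            total[i] += tmp[i];
        }
    }
}
```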

#### Effects

| clause      | effect when entering the region                                                                                      | effect when leaving the region                                                                  |
|-------------|----------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------|
| create      | **If not already present on the GPU**: allocate the memory needed on the GPU                                         | **If not in another active data region**: free the memory on the GPU                            |
| copyin      | **If not already present on the GPU**: allocate the memory and initialize the variable with the values it has on CPU | **If not in another active data region**: free the memory                                       |
| copyout     | **If not already present on the GPU**: allocate the memory needed on the GPU                                         | **If not in another active data region**: copy the values from the GPU to the CPU, then free the memory on the GPU |
| copy        | **If not already present on the GPU**: allocate the memory and initialize the variable with the values it has on CPU | **If not in another active data region**: copy the values from the GPU to the CPU, then free the memory on the GPU |
| present     | Check if data is present: an error is raised if it is not the case                                                   | None                                                                                            |

<img alt="Data clauses" src="../../pictures/data_clauses.png" style="float:none" width="30%"/>

### Updating data

| What to update   | Directive                   |
|------------------|-----------------------------|
| The host (CPU)   | `acc update self(vars, ...)`   |
| The device (GPU) | `acc update device(vars, ...)` |
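A small sketch of both update directions inside a data region. The function `update_demo` is hypothetical and assumes `n >= 1`. The host changes the two boundary values, pushes them to the GPU, lets the GPU work on the whole array, then pulls the result back.

```c
/* Hypothetical example: reset the boundaries on the host, add 1 everywhere on the GPU. */
void update_demo(int n, double *u)
{
    #pragma acc data copyin(u[0:n])
    {
        /* Host-side modification: the device copy is now out of date. */
        u[0] = 0.0;
        u[n - 1] = 0.0;
        #pragma acc update device(u[0:1], u[n-1:1])

        #pragma acc parallel loop present(u[0:n])
        for (int i = 0; i < n; ++i) {
            u[i] += 1.0;
        }

        /* copyin does not copy back at the end of the region, so fetch explicitly. */
        #pragma acc update self(u[0:n])
    }
}
```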

## Managing loops

### Combined constructs

The `acc loop` directive can be combined with the compute construct directives if there is only one loop nest in the compute region:

- `acc parallel loop <union of clauses>`
- `acc kernels loop <union of clauses>`
- `acc serial loop <union of clauses>`
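To make the equivalence concrete, here is a sketch of the same computation written with the combined form and with the two separate directives. Both function names are hypothetical.

```c
/* Combined construct: the clauses of parallel and loop go on one line. */
void saxpy_combined(int n, float a, const float *x, float *y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

/* Equivalent form with a separate compute construct and loop directive. */
void saxpy_separate(int n, float a, const float *x, float *y)
{
    #pragma acc parallel copyin(x[0:n]) copy(y[0:n])
    {
        #pragma acc loop
        for (int i = 0; i < n; ++i) {
            y[i] = a * x[i] + y[i];
        }
    }
}
```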

### Loop clauses

Here are some clauses for the [`acc loop`](./Loop_configuration.ipynb) directive:

| Clause                                       | Effect                                                              |
|----------------------------------------------|---------------------------------------------------------------------|
| [gang](./Loop_configuration.ipynb)           | The loop activates work distribution over gangs                      |
| [worker](./Loop_configuration.ipynb)         | The loop activates work distribution over workers                    |
| [vector](./Loop_configuration.ipynb)         | The loop activates work distribution over the threads of the workers |
| [seq](./Loop_configuration.ipynb)            | The loop is run sequentially                                        |
| [auto](./Loop_configuration.ipynb)           | Let the compiler decide what to do (default)                        |
| [independent](./Compute_constructs.ipynb)               | For `acc kernels`: tell the compiler the loop iterations are independent |
| [collapse(n)](./Compute_constructs.ipynb)               | The _n_ tightly nested loops are fused into one iteration space     |
| [reduction(op:vars, ...)](./Get_started.ipynb) | Perform a reduction of _op_ kind on _vars_                          |
| [tile(sizes ...)](./Loop_tiling.ipynb)       | Create tiles in the iteration space                                 |
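The sketch below shows where some of these clauses could go on three small, hypothetical functions working on row-major `n` x `n` matrices. The tile size 16 is an arbitrary placeholder. Which mapping performs best depends on the hardware and the loop sizes.

```c
/* gang / vector / seq: rows over gangs, columns over vector lanes, k loop sequential. */
void matmul(int n, const double *a, const double *b, double *c)
{
    #pragma acc parallel copyin(a[0:n*n], b[0:n*n]) copyout(c[0:n*n])
    {
        #pragma acc loop gang
        for (int i = 0; i < n; ++i) {
            #pragma acc loop vector
            for (int j = 0; j < n; ++j) {
                double sum = 0.0;
                #pragma acc loop seq
                for (int k = 0; k < n; ++k) {
                    sum += a[i * n + k] * b[k * n + j];
                }
                c[i * n + j] = sum;
            }
        }
    }
}

/* collapse(2): the two tightly nested loops form a single iteration space. */
void scale_matrix(int n, double alpha, double *m)
{
    #pragma acc parallel loop collapse(2) copy(m[0:n*n])
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            m[i * n + j] *= alpha;
        }
    }
}

/* tile(16, 16): the iteration space is split into 16 x 16 tiles. */
void transpose(int n, const double *in, double *out)
{
    #pragma acc parallel loop tile(16, 16) copyin(in[0:n*n]) copyout(out[0:n*n])
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            out[j * n + i] = in[i * n + j];
        }
    }
}
```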

## GPU routines

You can write a device routine with the [`acc routine <max_level>`](./Routines.ipynb) directive.
**max_level** is the maximum level of parallelism used inside the routine, including in the functions it calls.
It can be _gang_, _worker_, _vector_ or _seq_.
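A minimal sketch with a sequential device routine called from a parallel loop. The function names `square` and `square_all` are hypothetical.

```c
/* seq: the routine contains no parallelism of its own,
   so it can be called from any level (gang, worker or vector). */
#pragma acc routine seq
double square(double v)
{
    return v * v;
}

void square_all(int n, double *x)
{
    #pragma acc parallel loop copy(x[0:n])
    for (int i = 0; i < n; ++i) {
        x[i] = square(x[i]);
    }
}
```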

## Asynchronous behavior

You can run several streams at the same time on the device using the [_async(queue)_ and _wait_ clauses or the `acc wait` directive](./Asynchronism.ipynb).

| Directive          | _async(queue)_ | _wait(queues,...)_ |
|--------------------|----------------|--------------------|
| `acc parallel`     | X              | X                  |
| `acc kernels`      | X              | X                  |
| `acc serial`       | X              | X                  |
| `acc enter data`   | X              | X                  |
| `acc exit data`    | X              | X                  |
| `acc wait`         | X              |                    |

For the _async_ clause, _queue_ is an integer specifying the stream on which the directive is enqueued.
If it is omitted, a default stream is used.
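The sketch below shows one possible use of two queues. The function name and the queue numbers 1 and 2 are arbitrary choices for the illustration. The first two kernels are independent and may overlap. The third one is enqueued on queue 2 and waits for queue 1 because it reads `a`.

```c
/* Hypothetical example: a[i] *= 2, b[i] += 1, then b[i] += a[i]. */
void two_queues(int n, double *a, double *b)
{
    #pragma acc data copy(a[0:n], b[0:n])
    {
        /* Independent kernels on different queues. */
        #pragma acc parallel loop async(1)
        for (int i = 0; i < n; ++i) {
            a[i] *= 2.0;
        }

        #pragma acc parallel loop async(2)
        for (int i = 0; i < n; ++i) {
            b[i] += 1.0;
        }

        /* Runs on queue 2 after the previous kernel, and only once queue 1 is done. */
        #pragma acc parallel loop async(2) wait(1)
        for (int i = 0; i < n; ++i) {
            b[i] += a[i];
        }

        /* Block the host until all queues are empty, before the data is copied back. */
        #pragma acc wait
    }
}
```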

## Using data on the GPU with GPU aware libraries

To get a pointer to the device memory of a variable, use [`acc host_data use_device(data)`](./MultiGPU.ipynb).
This is useful for:

- Using GPU libraries (e.g. CUDA)
- Using CUDA-aware MPI, to avoid unnecessary data transfers
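The pattern can be sketched as follows. `gpu_library_call` is a hypothetical stand-in for a real GPU-aware routine (a CUDA library function or a CUDA-aware MPI call). It only records the address it receives and never dereferences it, so the sketch stays runnable without any GPU library.

```c
#include <stddef.h>

/* Address last handed to the stand-in library call (illustration only). */
static const void *last_buffer = NULL;

/* Hypothetical placeholder for a GPU-aware library routine. */
void gpu_library_call(const double *buf, int n)
{
    (void)n;
    last_buffer = buf;
}

const void *pass_device_pointer(int n, double *x)
{
    #pragma acc data copyin(x[0:n])
    {
        /* Inside host_data, x designates the device address of the array. */
        #pragma acc host_data use_device(x)
        {
            gpu_library_call(x, n);
        }
    }
    return last_buffer;
}
```

When the code is built without OpenACC, the pragmas are ignored and the library call simply receives the host address.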

## Atomic construct

To make sure that only one thread at a time performs a read/write on a variable, use the [`acc atomic <operation>`](./Atomic_operations.ipynb) directive.

_operation_ is one of the following:

- read
- write
- update (read + write)
- capture (update + saving to another variable)
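Two of these operations are sketched below with hypothetical functions. `histogram` uses `atomic update` and assumes non-negative values. `keep_positive` uses `atomic capture` to reserve a unique output slot. On a GPU the order of the kept elements is not deterministic.

```c
/* atomic update: several threads may increment the same bin. */
void histogram(int n, const int *values, int nbins, int *bins)
{
    #pragma acc parallel loop copyin(values[0:n]) copy(bins[0:nbins])
    for (int i = 0; i < n; ++i) {
        int b = values[i] % nbins;
        #pragma acc atomic update
        bins[b] += 1;
    }
}

/* atomic capture: increment count and save its previous value in slot. */
int keep_positive(int n, const double *in, double *out)
{
    int count = 0;
    #pragma acc parallel loop copyin(in[0:n]) copy(out[0:n]) copy(count)
    for (int i = 0; i < n; ++i) {
        if (in[i] > 0.0) {
            int slot;
            #pragma acc atomic capture
            slot = count++;
            out[slot] = in[i];
        }
    }
    return count;
}
```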