# Atomic operations

---
**Requirements:**

- [Get started](./Get_started.ipynb)
- [Data management](./Data_management.ipynb)

---

The `acc atomic` directive generalizes the concept of reduction that we saw in [Get started](../Get_started.ipynb).
However, the mechanism is different and less efficient than the one used for reductions.
If you have the choice, use a _reduction_ clause.
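
To make the comparison concrete, here is a minimal sketch that computes the same sum twice, once with a _reduction_ clause and once with `acc atomic update`. The function names `sum_with_reduction` and `sum_with_atomic` are hypothetical and chosen only for this illustration. Both versions should return the same value; the atomic version is expected to be slower because every iteration updates the same shared variable.

```c
#include <stddef.h>

// Hypothetical helper: sum with a reduction clause.
// Partial sums are accumulated independently and combined at the end.
int sum_with_reduction(const int *data, size_t n)
{
    int total = 0;
    #pragma acc parallel loop copyin(data[0:n]) reduction(+:total)
    for (size_t i = 0; i < n; ++i)
        total += data[i];
    return total;
}

// Hypothetical helper: same sum with an atomic update.
// total is shared, so each addition has to be protected.
int sum_with_atomic(const int *data, size_t n)
{
    int total = 0;
    #pragma acc parallel loop copyin(data[0:n]) copy(total)
    for (size_t i = 0; i < n; ++i)
    {
        #pragma acc atomic update
        total += data[i];
    }
    return total;
}
```

When the code is compiled without OpenACC support, the directives are ignored and both functions run sequentially with the same result.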

The idea is to make sure that only one thread at a time can perform a read and/or write operation on a **shared** variable.

The syntax of the directive depends on the clause you use.

## Syntax

### _read_, _write_, _update_
```c
#pragma acc atomic <clause>
// One atomic operation
```


The clauses _read_, _write_ and _update_ apply only to the statement immediately following the directive.

### _capture_

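The _capture_ clause can apply to a structured block that pairs an update of a variable with a read of its value:
```c
#pragma acc atomic capture
{
// An update of x paired with a read of x into v
}
```
In C it can also apply to a single capture statement immediately following the directive.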

```c
#pragma acc atomic capture
// One capture operation
```

## Restrictions

The complete list of restrictions is available in the OpenACC specification.

We need the following information to understand the restrictions for each clause:

- **v** and **x** are scalar values
- _binop_: a binary operator (+, -, \*, /, &, ^, |, <<, >>); in the forms below it also stands for the increment and decrement operators ++ and --
- _expr_ is an expression that reduces to a scalar and must have precedence over _binop_

### _read_

The expression must be of the form:

```c
#pragma acc atomic read
v = x;
```

### _write_

The expression must have the form:

```c
#pragma acc atomic write
x = expr;
```
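
As an illustration of the _write_ form, the sketch below sets a shared flag when a value is found in an array. The function name `contains_value` is hypothetical. Several threads may store into `found` at the same time, so the store is protected with `atomic write`; since they all store the same value, the result does not depend on which thread writes last.

```c
#include <stddef.h>

// Hypothetical helper: returns 1 if target appears in data, 0 otherwise.
int contains_value(const int *data, size_t n, int target)
{
    int found = 0;
    #pragma acc parallel loop copyin(data[0:n]) copy(found)
    for (size_t i = 0; i < n; ++i)
    {
        if (data[i] == target)
        {
            // Matches the form "x = expr;"
            #pragma acc atomic write
            found = 1;
        }
    }
    return found;
}
```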

### _update_

Several forms are available:

```c
//x = x _binop_ expr;
#pragma acc atomic update
x = x + (3*10);

//x_binop_;
#pragma acc atomic update
x++;

//_binop_x
#pragma acc atomic update
--x;

//x _binop_= expr
#pragma acc atomic update
x += 30;
```
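
The _update_ forms are not limited to arithmetic. The following sketch uses the `x binop= expr` form with a bitwise OR to record which digits occur in an array. The function name `digits_seen` is hypothetical, and the code assumes that every entry lies between 0 and 31 so that the shift stays within an `unsigned`.

```c
#include <stddef.h>

// Hypothetical helper: returns a bitmask where bit d is set
// if the digit d occurs at least once in digits.
// Assumption: every entry of digits is in the range 0..31.
unsigned digits_seen(const int *digits, size_t n)
{
    unsigned mask = 0u;
    #pragma acc parallel loop copyin(digits[0:n]) copy(mask)
    for (size_t i = 0; i < n; ++i)
    {
        // Matches the form "x binop= expr" with binop = |
        #pragma acc atomic update
        mask |= 1u << digits[i];
    }
    return mask;
}
```

For the input `{0, 3, 3, 9}` the bits 0, 3 and 9 are set, which gives 1 + 8 + 512 = 521.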

### _capture_

A capture updates a variable and stores its value (before or after the update, depending on the form) in another variable:
```c
//v = x = x _binop_ expr;
#pragma acc atomic capture
v = x = x + (3*10);

//v = x_binop_;
#pragma acc atomic capture
v = x++;

//v = _binop_x
#pragma acc atomic capture
v = --x;

//v = x _binop_= expr
#pragma acc atomic capture
v = x += 30;
```
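
A typical use of _capture_ is to hand out unique positions in an output array. The sketch below copies the strictly positive entries of an array into a second one. The function name `keep_positive` is hypothetical. Each thread captures the old value of the shared counter `next` while incrementing it, so no two threads get the same slot. On an accelerator the order of the kept values is not guaranteed.

```c
#include <stddef.h>

// Hypothetical helper: copies the strictly positive entries of in to out
// and returns how many were kept. Only the first "count" entries of out
// are meaningful, and their order may differ from the input order.
int keep_positive(const int *in, size_t n, int *out)
{
    int next = 0;
    #pragma acc parallel loop copyin(in[0:n]) copyout(out[0:n]) copy(next)
    for (size_t i = 0; i < n; ++i)
    {
        if (in[i] > 0)
        {
            int slot;
            // Matches the form "v = x++": slot receives the old value of next
            #pragma acc atomic capture
            slot = next++;
            out[slot] = in[i];
        }
    }
    return next;
}
```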

## Exercise

Let's check if the default random number generator provided by the standard library gives good results.

In the example we generate an array of integers randomly set from 0 to 9.
The purpose is to check if we have a uniform distribution.

We cannot perform the initialization on the GPU since the `rand()` function cannot be called from OpenACC device code.

You have to:

- Create a kernel for the integer counting
- Make sure that the results are correct (you should have around 10% for each number)

Example stored in: `../../examples/C/atomic_exercise.c`

```c
%%idrrun -a
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
    // Histogram allocation and initialization
    int histo[10];
    for (int i=0; i<10; ++i)
        histo[i] = 0;
    size_t nshots = (size_t) 1e9;

    // Allocate memory for the random numbers
    int* shots = (int*) malloc(nshots*sizeof(int));
    if (shots == NULL)
    {
        fprintf(stderr, "Allocation of shots failed\n");
        return 1;
    }

    srand((unsigned) 12345900);

    // Fill the array on the CPU (rand is not available on GPU with Nvidia Compilers)
    for (size_t i=0; i< nshots; ++i)
    {
        shots[i] = (int) rand() % 10;
    }

    // Count the number of times each number was drawn
    for (size_t i=0; i<nshots; ++i)
    {
        histo[shots[i]]++;
    }

    // Print results
    for (int i=0; i<10; ++i)
        printf("%3d: %10d (%5.3f)\n", i, histo[i], (double) histo[i]/nshots);

    free(shots);
    return 0;
}
```


### Solution

Example stored in: `../../examples/C/atomic_solution.c`

```c
%%idrrun -a
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
    // Histogram allocation and initialization
    int histo[10];
    for (int i=0; i<10; ++i)
        histo[i] = 0;
    size_t nshots = (size_t) 1e9;

    // Allocate memory for the random numbers
    int* shots = (int*) malloc(nshots*sizeof(int));
    if (shots == NULL)
    {
        fprintf(stderr, "Allocation of shots failed\n");
        return 1;
    }

    srand((unsigned) 12345900);

    // Fill the array on the CPU (rand is not available on GPU with Nvidia Compilers)
    for (size_t i=0; i< nshots; ++i)
    {
        shots[i] = (int) rand()%10;
    }

    // Count the number of times each number was drawn
    // histo uses copy (not copyout) so that the zeros set on the CPU reach the GPU
    #pragma acc parallel loop copyin(shots[0:nshots]) copy(histo[0:10])
    for (size_t i=0; i<nshots; ++i)
    {
        #pragma acc atomic update
        histo[shots[i]]++;
    }
    // Print results
    for (int i=0; i<10; ++i)
        printf("%3d: %10d (%5.3f)\n", i, histo[i], (double) histo[i]/nshots);

    free(shots);
    return 0;
}
```

#### Important Note

With recent NVIDIA compilers you can use a reduction on arrays. It will be more efficient than using atomic operations.
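
As a rough sketch of what this could look like for the histogram, the counting loop can be written with a reduction on the array instead of an atomic update. The function name `count_digits` is hypothetical, and the code assumes a compiler that accepts an array section in the _reduction_ clause; check the documentation of your compiler version before relying on it.

```c
#include <stddef.h>

// Hypothetical helper: fills histo[0..9] with the number of occurrences
// of each digit in shots.
// Assumption: the compiler supports array sections in reduction clauses,
// and every entry of shots is in the range 0..9.
void count_digits(const int *shots, size_t nshots, int histo[10])
{
    for (int i = 0; i < 10; ++i)
        histo[i] = 0;

    #pragma acc parallel loop copyin(shots[0:nshots]) reduction(+:histo[0:10])
    for (size_t i = 0; i < nshots; ++i)
    {
        // No atomic needed: the reduction combines the partial histograms
        histo[shots[i]]++;
    }
}
```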