## **3 - Memory Model and Data-Synchronization**

In any multithreaded programming environment, the memory model is crucial. It is a contract between the programmer and the system: as long as the program follows the rules, the system guarantees that it behaves as the programmer expects.

The OpenMP memory model has two components: a global memory shared by all threads, and a temporary view of that memory for each thread. Every read and write operates on the thread's temporary view, which may or may not be consistent with the global memory.

Consistency is guaranteed only after a *flush* operation. A flush makes the writes in a thread's temporary view visible in the global memory, and makes data flushed by other threads available for reading in that thread's temporary view.

OpenMP performs a flush implicitly at certain points, including on entry to and exit from a parallel region and on entry to and exit from a critical section.

The "temporary view" is only an abstract description of how the system is guaranteed to behave. It does not refer to any specific hardware or software component. In practice, the temporary view covers several mechanisms, including (but not limited to):
1. The compiler optimizations that are applied.
2. The way the CPU's cache memory system works.

For instance, when you modify shared data in your code, the compiler could opt to retain the present value within CPU registers until a flush is required. Alternatively, the CPU might choose to maintain the current value in the L1 cache until flushing becomes necessary.
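As a minimal sketch of how an implied flush makes a write visible, the function below lets one thread write a shared variable and has every thread read it afterwards. The name `publish_then_read` is hypothetical and not part of the course sources. The sketch relies on the OpenMP rule that a barrier (here, the implicit barrier at the end of `single`) implies a flush, just like entering or leaving a parallel region.

```c
/* Hypothetical helper: returns 1 if every thread saw the published value. */
int publish_then_read(void) {
    int data = 0;      /* shared: written by one thread, read by all */
    int observers = 0; /* shared: number of threads that saw data == 42 */
    int threads = 0;   /* shared: number of threads that took part */

    #pragma omp parallel shared(data, observers, threads)
    {
        /* Exactly one thread writes. The implicit barrier at the end of
           single implies a flush, so the write becomes visible to all. */
        #pragma omp single
        data = 42;

        /* After the barrier, each thread's temporary view holds 42. */
        int seen = data;

        /* The counters are written by several threads, so they are
           updated inside a critical section. */
        #pragma omp critical
        {
            threads += 1;
            if (seen == 42) {
                observers += 1;
            }
        }
    }
    return observers == threads;
}
```

If the `single` construct were given a `nowait` clause, the barrier (and its flush) would disappear, and the other threads would no longer be guaranteed to see the new value.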

Assume that we run the following code with **two** threads.

```
int a = 0;
#pragma omp parallel
{
    #pragma omp critical
    {
        a += 1;
    }
}
```

If thread 0 enters the critical section first, the timeline of the global memory and the temporary views of the threads are as follows:

<table>
<tr><th>Memory</th><th>Thread 0</th><th>Thread 1</th><th>Explanation</th></tr>
<tr><td>a = ?</td><td>a = ?</td><td>-</td><td>-</td></tr>
<tr><td>a = ?</td><td>a = 0</td><td>-</td><td>Thread 0 sets a = 0</td></tr>
<tr><td>a = 0</td><td>a = 0</td><td>a = 0</td><td>Threads 0 and 1 enter parallel region, both flush and everybody has a consistent view of everything</td></tr>
<tr><td>a = 0</td><td>a = 0</td><td>a = 0</td><td>Thread 0 enters critical section, flush (that does nothing)</td></tr>
<tr><td>a = ?</td><td>a = 1</td><td>a = ?</td><td>Thread 0 modifies a, we do not know if the change is visible in the global memory or in the temporary view of thread 1</td></tr>
<tr><td>a = 1</td><td>a = 1</td><td>a = ?</td><td>Thread 0 leaves critical section, flush, modification visible in the global memory, but thread 1 has not flushed its temporary view yet</td></tr>
<tr><td>a = 1</td><td>a = 1</td><td>a = 1</td><td>Thread 1 enters critical section, flush, the latest value of a is now visible in its temporary view</td></tr>
<tr><td>a = ?</td><td>a = ?</td><td>a = 2</td><td>Thread 1 modifies a, we do not know if the change is visible in the global memory or in the temporary view of thread 0</td></tr>
<tr><td>a = 2</td><td>a = ?</td><td>a = 2</td><td>Thread 1 leaves critical section, flush, modification visible in the global memory, but thread 0 has not flushed its temporary view yet</td></tr>
<tr><td>a = 2</td><td>a = 2</td><td>a = 2</td><td>Threads 0 and 1 leave parallel region, flush, now also thread 0 sees the latest value</td></tr>
<tr><td>a = 2</td><td>a = 2</td><td>-</td><td>Only thread 0 running, it can now read a and see what we would expect</td></tr>
</table>


Each shared data element *X* can be categorized into different classes:

1. **Only One Thread**: Within the parallel region, only one thread accesses *X*, whether for reading or writing. No synchronization is needed: the value of *X* stays in that thread's temporary view throughout the parallel region and is flushed at the end of the region.

2. **Read-Only Variable**: Within the parallel region, *X* is never written; all accesses are reads. No synchronization is needed.

3. **Other Scenarios**: *X* is read and written by different threads. This requires synchronization: every access to *X* must be placed inside a critical section or protected by another synchronization construct.
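The three classes can be sketched in one small function. The function name `classify_demo` and its variables are hypothetical, chosen only for illustration: `touched_once` is accessed by a single thread, `scale` is only read, and `total` is updated by several threads and therefore protected by a critical section.

```c
/* Hypothetical helper illustrating the three classes of shared data. */
long classify_demo(int n) {
    int touched_once = 0; /* class 1: accessed by exactly one thread */
    int scale = 3;        /* class 2: read-only inside the parallel region */
    long total = 0;       /* class 3: read and written by several threads */

    #pragma omp parallel shared(touched_once, scale, total)
    {
        /* Class 1: only the thread that executes the single block uses it. */
        #pragma omp single nowait
        touched_once = 1;

        long local = 0;   /* private to each thread */

        #pragma omp for nowait
        for (int i = 0; i < n; ++i) {
            /* Class 2: concurrent reads of scale need no synchronization. */
            local += (long)scale * i;
        }

        /* Class 3: updates from different threads must be synchronized. */
        #pragma omp critical
        total += local;
    }
    /* The flush at the end of the parallel region makes both values visible. */
    return total + touched_once;
}
```

For `n = 10` the loop contributes 3 * (0 + 1 + ... + 9) = 135, so the function returns 136 regardless of the number of threads.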


### **3.1 - Critical Operations**

Critical regions prevent race conditions by forcing threads to execute a specific code segment one at a time. In OpenMP, the ```#pragma omp critical``` directive defines such a region: it marks a code block that only one thread may execute at a time.

When this pragma is used, a thread waits at the beginning of a *critical* section until no other thread in the team is executing a *critical* section with the same name. All unnamed *critical* sections are treated as having the same name. Therefore, with *N* threads, every thread executes the region, but only one at a time: while one thread is inside the region, the others wait at its entry until that thread leaves.
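The role of the name can be sketched as follows. The function `count_parity` and the section names `even_counter` and `odd_counter` are hypothetical. Because the two critical sections have different names, a thread updating one counter does not block a thread updating the other.

```c
/* Hypothetical helper: counts even and odd indices in [0, n).
   Each counter is protected by its own named critical section. */
void count_parity(int n, int *evens_out, int *odds_out) {
    int evens = 0;
    int odds = 0;

    #pragma omp parallel for shared(evens, odds)
    for (int i = 0; i < n; ++i) {
        if (i % 2 == 0) {
            /* Only excludes other threads inside "even_counter". */
            #pragma omp critical(even_counter)
            evens += 1;
        } else {
            /* Only excludes other threads inside "odd_counter". */
            #pragma omp critical(odd_counter)
            odds += 1;
        }
    }

    *evens_out = evens;
    *odds_out = odds;
}
```

If both sections were left unnamed, the result would be the same, but every update would compete for the same lock.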

The following example shows functions that are executed sequentially (```a``` and ```z```), in parallel (```b``` and ```c```), and in a critical region (```d```). The image below the example depicts the execution of the code with four threads.

```
a();
#pragma omp parallel
{
    b();
    #pragma omp for
    for (int i = 0; i < 10; ++i) {
        c(i);
    }
    #pragma omp critical
    {
        d();
    }
}
z();
```

<img src="../imgs/parallel_for_critical.png" alt="alt text" width="800" height="250" class="blog-image">


#### **Example 1**

The following code uses ```#pragma omp critical``` to ensure that only one thread at a time updates the ```sum_shared``` variable.

```
#include <stdio.h>
#include <omp.h>

int main() {
    /* shared variable */
    int sum_shared = 0;
    omp_set_num_threads(4);
    #pragma omp parallel
    {
        /* private variable */
        int sum_local = 0;
        #pragma omp for nowait
        for (int i = 0; i < 10; ++i) {
            sum_local += i;
        }
        printf("Thread %d, local_sum = %d\n", omp_get_thread_num(), sum_local);
        /* critical section to update the shared variable */
        #pragma omp critical
        {
            sum_shared += sum_local;
        }
    }
    printf("total sum: %d\n", sum_shared);
    return 0;
}
```

Take some time to understand the code, then run the cell below to see the output of the [critical_region](../src/introduction/data/critical_region.c) code. You can change the number of threads that are created to see how the execution behaves.

In [None]:
!cd ../src/introduction/data/ && gcc critical_region.c -o critical_region -fopenmp && ./critical_region

In the example above, what happens if you remove the ```#pragma omp critical``` directive? Modify the source code and re-run the cell above.

### **3.2 - Atomic Operations**

The ```omp atomic``` directive ensures that a specific memory location is accessed atomically. This avoids race conditions between concurrent threads that read or write that location. Because it protects a single memory operation rather than a whole code block, the ```omp atomic``` directive lets you write more efficient concurrent algorithms with fewer locks.

When atomic operations are employed, the following syntax is used:

```
/* parallel region */
    ...
    #pragma omp atomic clause
        operation to be performed atomically
    ...
/* end of parallel region */

```

Four clauses can be considered:

- **update**: the value of a variable is updated atomically. Only one thread at a time updates the shared variable, which avoids errors from simultaneous writes to the same variable. When no clause is present, ```update``` is the default.
- **read**: the value is read atomically, which avoids reading an intermediate value of the variable while a concurrent thread is modifying it.
- **write**: the value of a variable is written atomically.
- **capture**: the value of a variable is updated atomically while its original or final value is captured.

For more information regarding each clause, please visit this [page](https://www.ibm.com/docs/en/zos/2.2.0?topic=SSLTBW_2.2.0/com.ibm.zos.v2r2.cbclx01/prag_omp_atomic.html) from IBM.
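To make the ```update``` form concrete, here is a minimal sketch of a histogram in which several threads may increment the same bin. The function `histogram` and the constant `NUM_BINS` are hypothetical names, and the sketch assumes that all input values are non-negative.

```c
#define NUM_BINS 4

/* Hypothetical helper: counts how many values fall into each bin.
   Assumes every entry of values is non-negative. */
void histogram(const int *values, int n, int bins[NUM_BINS]) {
    for (int b = 0; b < NUM_BINS; ++b) {
        bins[b] = 0;
    }

    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        int b = values[i] % NUM_BINS;
        /* Several threads may hit the same bin, so the increment of this
           single memory location is performed atomically. */
        #pragma omp atomic update
        bins[b] += 1;
    }
}
```

A critical section would also be correct here, but it would serialize all increments, even those that touch different bins. The atomic directive only protects the one memory location being updated.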


#### **Example 2**

The code below performs the atomic update operation:

```
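    /* x, y and index are assumed to be defined elsewhere. */
    extern float x[], y;
    extern int index[];

    void atomic_updates(int n) {
        float *p = x;   /* p aliases x */
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            /* Protect against race conditions among multiple updates. */
            #pragma omp atomic
            x[index[i]] += y;

            /* Protect against race conditions with updates through x. */
            #pragma omp atomic
            p[i] -= 1.0f;
        }
    }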
```

#### **Example 3**

The code below performs the atomic read, write, and update operations:

```
    extern int x[10], f(int);
    int temp[10];

    for(int i = 0; i < 10; i++){
        #pragma omp atomic read
        temp[i] = x[f(i)];

        #pragma omp atomic write
        x[i] = temp[i]*2;

        #pragma omp atomic update
        x[i] *= 2;
    }
```

#### **Example 4**

The code below performs the atomic capture operation:

```
    extern int x[10], f(int);
    int temp[10];

    for(int i = 0; i < 10; i++){
        #pragma omp atomic capture
        temp[i] = x[f(i)]++;

        #pragma omp atomic capture
        {
            temp[i] = x[f(i)];  // the two occurrences of x[f(i)] must evaluate to the
            x[f(i)] -= 3;       // same memory location, otherwise behavior is undefined
        }
        }
    }
```
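A common use of ```capture``` is handing out unique values from a shared counter. The sketch below is one way to do this; the function `take_tickets` is a hypothetical name, and the caller is assumed to provide a `tickets` array with room for `n` entries.

```c
/* Hypothetical helper: gives each loop iteration a unique ticket in [0, n).
   tickets must point to an array with at least n elements. */
int take_tickets(int n, int *tickets) {
    int next = 0;   /* shared ticket counter */

    #pragma omp parallel for shared(next)
    for (int i = 0; i < n; ++i) {
        int mine;
        /* Read the old value and increment the counter as one
           indivisible step, so no two iterations get the same ticket. */
        #pragma omp atomic capture
        mine = next++;

        tickets[i] = mine;
    }
    return next;    /* equals n once the parallel region has ended */
}
```

Which iteration receives which ticket depends on the thread schedule, but every ticket from 0 to `n - 1` is handed out exactly once. With a plain ```atomic update``` followed by a separate read, two threads could observe the same counter value.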


### **3.3 - Exercises**

#### **3.3.1 - Atomic operation**

Given the source code implemented for the critical region above, your task is to reimplement it using atomic operations. The code to be parallelized is [here](../src/introduction/data/atomic_region.c). Once you are done, run the cell below to run the application.

In [None]:
!cd ../src/introduction/data/ && gcc atomic_region.c -o atomic_region -fopenmp && ./atomic_region