## Quick Recap

...

## Goals for today

...

# Parallelism and Concurrency
For **shared-memory** scenarios, the *OpenMP* `pragma`-based interface for C++ allows a straightforward "high-level" parallelization of many prominent use cases (e.g., nested for loops). It also provides mechanisms to synchronize threads running in parallel (e.g., critical regions or atomic updates).
OpenMP implementations typically ship with a compiler and support a certain version of the OpenMP standard.
A prominent alternative is `TBB`, which supports similar use cases but is shipped as a third-party library (i.e., the integration is not `pragma`-based).


**TBB**
```cpp
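#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
...
auto values = std::vector<double>(10000);
tbb::parallel_for(tbb::blocked_range<std::size_t>(0, values.size()),
                  [&](const tbb::blocked_range<std::size_t> &r) {
                    // each invocation processes one chunk [r.begin(), r.end())
                    for (std::size_t i = r.begin(); i != r.end(); ++i) {
                      values[i] = 5;
                    }
                  });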
```
**OpenMP**
```cpp
#include <omp.h>
...
auto values = std::vector<double>(10000);
// `parallel` creates the team of threads, `for` distributes the iterations
#pragma omp parallel for
  for (size_t n = 0; n < values.size(); ++n) {
    values[n] = 5;
  }
```

Since C++11, the C++ standard library itself provides support for concurrency [(cppref)](https://en.cppreference.com/w/cpp/thread). We will look at some of these facilities to illustrate "how many batteries are included".


Before we proceed, let's disambiguate some important terminology and related background:

- process vs. thread (vs. task) 
  - `processes.cpp`
  - `threads.cpp`  
- context switching/task scheduling   
  - `threads_yield.cpp`
- non-shared memory vs. shared-memory
  - `threads_shared.cpp`
  - `threads_shared_lock.cpp`  
- atomic operations & memory ordering
  - `threads_shared_fetchadd.cpp`  
- lock/mutual exclusion/critical section 
  - `threads_shared_spinlock.cpp`  
- race condition vs. data race 

## std::thread
Constructing a `std::thread` [(cppref)](https://en.cppreference.com/w/cpp/thread/thread) in C++ can look like this when using a callable that requires some arguments:

In [None]:
// cell fails (cling config issue)
double c = 0;  
auto callable = [&c](int a, int b) {
	++c;
	return a + b + c;
};

int arg1 = 1;
int arg2 = 1;

std::thread thread(callable, arg1, arg2); 
// how costly is creating a thread?
std::thread thread2(callable, arg1, arg2);
// (A) how many threads are active? potentially the 2 threads from the statements above plus the main thread
// (B) can we know when a thread really starts? can there be HUGE delays?
// (C) how to get a new THREAD?
// (D) can this program run on a single core?
thread.join(); // wait for this thread to finish: branching ends here
thread2.join();
// (2) what is the value of c now?

Here, a function object obtained from a lambda expression is used.
After construction, `thread` immediately invokes the callable with the provided arguments in a new thread of execution:

- local variables of the "original scope" are not accessible unless they are captured (like `c` above) or passed as arguments
- global variables are accessible


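The `std::thread` constructor copies (or moves) its arguments into storage owned by the new thread and passes them to the callable as rvalues. A callable that takes non-const lvalue references therefore cannot be invoked directly, which is why the following is not immediately possible: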

In [None]:
      auto callable = [](int &a, int &b) { ... };
      int arg1 = 2;
      int arg2 = 2;
      std::thread thread(callable, arg1, arg2); // does not compile

**How to overcome this problem if we want to pass a reference (e.g., a large object to be manipulated by the thread)?**
```cpp
std::thread thread(callable, std::ref(arg1), std::ref(arg2));
```
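A minimal sketch of the `std::ref` approach for a larger object. The helper names `scale_in_place` and `scale_on_thread` are hypothetical and only serve as illustration:

```cpp
#include <cassert>
#include <functional> // std::ref
#include <thread>
#include <vector>

// Hypothetical helper: multiplies every element of `data` by `factor` in place.
void scale_in_place(std::vector<double> &data, double factor) {
  for (auto &x : data) x *= factor;
}

// Runs scale_in_place on a separate thread and returns once it has finished.
void scale_on_thread(std::vector<double> &data, double factor) {
  // std::ref wraps `data` in a std::reference_wrapper, so the new thread
  // works on the caller's vector instead of a private copy.
  std::thread worker(scale_in_place, std::ref(data), factor);
  worker.join(); // after join() the modifications are visible to the caller
  assert(!worker.joinable()); // the object no longer owns a running thread
}
```

Without `std::ref`, the call would not compile for the same reason as above. The caller has to make sure that the referenced object outlives the thread, which `join()` guarantees here.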

As we have seen above, a `std::thread` requires an explicit `.join()` (or `.detach()`) before the thread object is destroyed; otherwise `std::terminate` is called.
If desired, a lightweight wrapper can join the thread automatically when the variable is destructed (C++20 provides this as `std::jthread`):
```cpp
struct jthread {
  std::thread t;
  template <class... Args>
  explicit jthread(Args &&... args) : t(std::forward<Args>(args)...) {}
  ~jthread() {
    if (t.joinable()) t.join(); // join() on a non-joinable thread throws
  }
};
```
Up to now, we saw how to create threads that execute a provided callable, but we did not care about the value returned by the callable (`std::thread` simply discards it).

## std::future and std::promise

The standard library's approach to conveniently observe and obtain return values of callables executed in another thread is the pair `std::promise` [(cppref)](https://en.cppreference.com/w/cpp/thread/promise) and `std::future` [(cppref)](https://en.cppreference.com/w/cpp/thread/future).
Let's see an example which does not even involve different threads:

```cpp
  auto promise = std::promise<int>(); // create promise: no future attached
  auto future = promise.get_future(); // paired with future
  {
    auto status = future.wait_for(std::chrono::milliseconds(1));
    assert(std::future_status::timeout == status);
  }
  promise.set_value(2); // promise fulfilled
  {
    auto status = future.wait_for(std::chrono::milliseconds(1));
    assert(std::future_status::ready == status);
    future.wait(); // blocking
    auto value = future.get(); // get 2
  }
```

**What is the basic idea of `std::promise`/`std::future` pair?**

- provides synchronization point
- there is always a pair future/promise
- promise side can set a value
- future side can wait() and get()
- typically: promise is set in different thread than future


Now let's see the same example when using a thread to "fulfill the promise":
```cpp
  auto promise = std::promise<int>();
  auto future = promise.get_future();
  auto callable = [&promise]() {
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    promise.set_value(4); // not a native return but similar
  };
  std::thread t(std::move(callable)); // new thread -> promise set
  {
    auto status = future.wait_for(std::chrono::milliseconds(1));
    assert(std::future_status::timeout == status);
  }
  {
    future.wait();             // blocking
    auto value = future.get(); // get 4
  }
  t.join();
```


**We can see that the callable had to be adapted (compared to having a regular `return` value). Is this desirable?**
- no, not desirable

A convenient approach to utilize "unmodified" callables with non-void return types with threads is `std::packaged_task` [(cppref)](https://en.cppreference.com/w/cpp/thread/packaged_task):

```cpp
  auto callable = []() {
    return 6;
  };
  auto task = std::packaged_task<int()>(std::move(callable));
  auto future = task.get_future(); // get future handle before moving in execution
  std::thread t(std::move(task));
  {
    future.wait();             // blocking
    auto value = future.get(); // get 6
    std::cout << value << std::endl;
  }
  t.join();
```
We can see that a callable object can stay unmodified when executed by another thread.
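The same pattern also works for callables that take arguments. The following is a small sketch, assuming a hypothetical free function `add` as the unmodified callable:

```cpp
#include <cassert>
#include <future>
#include <thread>
#include <utility>

// Unmodified callable with a regular return value.
int add(int a, int b) { return a + b; }

// Executes add(a, b) on a separate thread and returns its result.
int add_on_thread(int a, int b) {
  std::packaged_task<int(int, int)> task(add);
  std::future<int> result = task.get_future(); // obtain before moving the task
  assert(result.valid());
  std::thread worker(std::move(task), a, b); // arguments are passed to the task
  int value = result.get();                  // blocks until add() has returned
  worker.join();
  return value;
}
```

The signature in `std::packaged_task<int(int, int)>` mirrors the callable; the arguments are handed to the `std::thread` constructor as usual.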


## std::async
To simplify launching a callable in a separate thread even further, `std::async` [(cppref)](https://en.cppreference.com/w/cpp/thread/async) can be used:

```cpp
  auto callable = [](int N, const std::string &str) {
    for (int i = 0; i < N; ++i)
      std::cout << str << std::endl;
    return 5; // not part of the loop body
  };
  int arg1 = 3;
  auto f1 = std::async(callable, arg1, "default");
  auto f2 = std::async(std::launch::deferred,  callable, arg1, "deferred");
  auto f3 = std::async(std::launch::async, callable, arg1, "async");
  auto f4 = std::async(std::launch::async, callable, arg1, "async2");
  f4.wait();
  f3.wait();
  f1.wait();
  f2.wait();  
```

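**What can we expect in terms of asynchronicity w.r.t. the different launch policies `async` and `deferred` [(cppref)](https://en.cppreference.com/w/cpp/thread/launch)?**
- `async`: the callable runs as if on a new thread of execution
- `deferred`: lazily evaluated; the callable runs in the thread that first calls `wait()` or `get()` on the future (and never if neither is called)
- default (no policy given): the implementation may choose either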
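The difference between the two policies can be made visible by comparing thread ids. This is a minimal sketch; the function names are hypothetical:

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <thread>

// Returns true if a deferred task runs on the thread that calls get().
bool deferred_runs_on_caller() {
  const auto caller = std::this_thread::get_id();
  auto f = std::async(std::launch::deferred,
                      [] { return std::this_thread::get_id(); });
  // Nothing has run yet: the shared state only stores the deferred call.
  assert(f.wait_for(std::chrono::seconds(0)) == std::future_status::deferred);
  return f.get() == caller; // the callable is invoked here, lazily
}

// Returns true if an async task runs on a thread other than the caller's.
bool async_runs_on_other_thread() {
  const auto caller = std::this_thread::get_id();
  auto f = std::async(std::launch::async,
                      [] { return std::this_thread::get_id(); });
  return f.get() != caller;
}
```

Note that `wait_for` does not trigger a deferred task; only `wait()` and `get()` do.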

`std::async` also exhibits some properties that might be unexpected:

**Example #1**
```cpp
  { // (1a)
    auto future = std::async(std::launch::async, callable, arg1, "async");
    func(std::move(future));
  } 
  { // (1b)
    auto future = std::async(std::launch::async, callable, arg1, "async");
  }  
```
**Example #2**
```cpp
  { // (2)
    std::async(callable, arg1, "is this ...");

    std::async(callable, arg1, "... async?");
  }
```

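**For the two examples above, will the two calls result in an overlapping execution of `callable` in two threads?**
- Example 1: only if `func` keeps the moved-in future alive (e.g., stores it somewhere) beyond the start of (1b), "fire and forget"; otherwise the future is destroyed when `func` returns and its destructor waits. In (1b) the destructor of `future` blocks at the closing brace.
- Example 2: no; each call returns a temporary future whose destructor blocks until the future is resolved/ready ("`~future(){wait();}`"), so the second call only starts after the first has completed (assuming the implementation picks the `async` policy). This blocking destructor is specific to futures obtained from `std::async`.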
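The blocking destructor can be observed by counting how many tasks run at the same time. The sketch below uses a hypothetical `tracked_task` and two global counters; the scoped futures mirror case (1b):

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <future>
#include <thread>

std::atomic<int> running{0};     // tasks currently executing
std::atomic<int> max_running{0}; // highest value of `running` seen so far

// Hypothetical task: records how many instances execute concurrently.
void tracked_task() {
  int now = ++running;
  int seen = max_running.load();
  while (seen < now && !max_running.compare_exchange_weak(seen, now)) {
  }
  std::this_thread::sleep_for(std::chrono::milliseconds(20));
  --running;
}

// Each future dies at the end of its scope: its destructor waits for the task.
int max_overlap_scoped() {
  assert(running.load() == 0); // no task left over from a previous run
  max_running = 0;
  { auto f = std::async(std::launch::async, tracked_task); }
  { auto f = std::async(std::launch::async, tracked_task); }
  return max_running.load();
}

// Both futures stay alive: the two tasks are free to overlap.
int max_overlap_kept() {
  assert(running.load() == 0);
  max_running = 0;
  auto f1 = std::async(std::launch::async, tracked_task);
  auto f2 = std::async(std::launch::async, tracked_task);
  f1.wait();
  f2.wait();
  return max_running.load();
}
```

`max_overlap_scoped()` always reports 1, because the second task cannot start before the first future has been destroyed. `max_overlap_kept()` will usually report 2, although the scheduler gives no guarantee that the tasks actually overlap.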

## Critical Sections (locking) 

OpenMP example:
```cpp
#pragma omp parallel
{
  // executed by all threads independently
#pragma omp critical
  {
    // only one thread at a time can enter
  }
  // executed by all threads independently
}

For the probably most common synchronization task, i.e., protecting read or write access to a shared variable, the standard library provides `std::mutex` [(cppref)](https://en.cppreference.com/w/cpp/thread/mutex) which is recommended to be used only in conjunction with a `std::unique_lock` [(cppref)](https://en.cppreference.com/w/cpp/thread/unique_lock)  or `std::lock_guard` [(cppref)](https://en.cppreference.com/w/cpp/thread/lock_guard).
Without a lock wrapper, using a mutex directly looks like this:

```cpp
    std::mutex m;
    std::vector<double> shared_data;
    auto manip = [&m, &shared_data]() {
      m.lock();
      // manipulate shared_data
      ...
      m.unlock();
    };
    // this lambda could be running on different threads simultaneously
```

**Why is this usage error-prone?**
- a manual unlock is required (easy to forget in longer code)
- an early `return` or an exception between `lock()` and `unlock()` leaves the mutex locked


When using a `lock_guard` the example transforms to this:
```cpp
    std::mutex m;
    std::vector<double> shared_data;
    auto manip = [&m, &shared_data]() {
      std::lock_guard<std::mutex> lock(m);
      // manipulate shared_data
    };
```
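One benefit of `lock_guard` is that the mutex is also released when an exception leaves the protected region. This is a small sketch with hypothetical names (`append_checked`, `append_after_failure`):

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <stdexcept>
#include <vector>

std::mutex data_mutex;
std::vector<double> shared_data; // protected by data_mutex

// Appends a value; throws for negative input while the lock is held.
void append_checked(double value) {
  std::lock_guard<std::mutex> lock(data_mutex);
  if (value < 0.0) throw std::invalid_argument("negative value");
  shared_data.push_back(value);
} // `lock` is destroyed here or during stack unwinding: the mutex is released

// Provokes an exception inside the locked region, then appends normally.
std::size_t append_after_failure() {
  bool thrown = false;
  try {
    append_checked(-1.0);
  } catch (const std::invalid_argument &) {
    thrown = true;
  }
  assert(thrown);
  // If the first call had left the mutex locked, this call would not return
  // normally (relocking a held std::mutex is undefined behavior).
  append_checked(2.0);
  return shared_data.size();
}
```

With manual `lock()`/`unlock()` calls, the `throw` would skip the `unlock()` and the second call would hang or misbehave.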
In situations where it is required to acquire multiple mutexes before performing a manipulation, `unique_lock` can be utilized like this:
```cpp
    std::mutex m1;
    std::mutex m2;
    std::vector<double> shared_data1;
    std::vector<double> shared_data2;
    auto manip = [&m1, &m2, &shared_data1, &shared_data2]() {

      // proper "multi-lock": both locks are created without locking ...
      std::unique_lock<std::mutex> dlock1(m1, std::defer_lock);
      std::unique_lock<std::mutex> dlock2(m2, std::defer_lock);
      // ... and then acquired together using a deadlock-avoidance algorithm
      std::lock(dlock1, dlock2);
      // ... manipulate shared_data1 and shared_data2 together
    };
```

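**Why is the snippet above preferable over a sequential locking using two `lock_guard`s?**
- if two threads acquire the mutexes in opposite order, as `func1` and `func2` below do, each can end up holding one mutex while waiting forever for the other (deadlock); `std::lock` acquires both without that risk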

  ```cpp
  func1(){
    ...
    // "not so good alternative"
    std::lock_guard<std::mutex> lock1(m1);    // t1 is here and lock1    
    std::lock_guard<std::mutex> lock2(m2);  
    // ... manipulate here      
    ...
  }

  func2(){
    ...
    std::lock_guard<std::mutex> lock2(m2);  // t2 is here and lock2
    std::lock_guard<std::mutex> lock1(m1); 
    // ... manipulate here    
    ...
  }
  ```
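A variant of the "multi-lock" that still uses `lock_guard` is to lock with `std::lock` first and let the guards adopt the already-held mutexes. The sketch below uses a hypothetical `Account` type and lets two threads name the mutexes in opposite order:

```cpp
#include <cassert>
#include <mutex>
#include <thread>

struct Account {
  std::mutex m;
  long balance = 0; // protected by m
};

// Moves `amount` from one account to the other while holding both mutexes.
void transfer(Account &from, Account &to, long amount) {
  assert(&from != &to); // locking the same mutex twice would be undefined
  std::lock(from.m, to.m); // deadlock-free, whatever order other threads use
  std::lock_guard<std::mutex> g1(from.m, std::adopt_lock); // take ownership
  std::lock_guard<std::mutex> g2(to.m, std::adopt_lock);
  from.balance -= amount;
  to.balance += amount;
}

// Two threads transfer in opposite directions; returns the combined balance.
long total_after_opposite_transfers(int rounds) {
  Account a, b;
  a.balance = 1000;
  b.balance = 1000;
  std::thread t1([&] { for (int i = 0; i < rounds; ++i) transfer(a, b, 1); });
  std::thread t2([&] { for (int i = 0; i < rounds; ++i) transfer(b, a, 1); });
  t1.join();
  t2.join();
  return a.balance + b.balance;
}
```

With two sequential `lock_guard`s inside `transfer`, the two threads would correspond to `func1` and `func2` and could block each other forever.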

Locking is in no way a lightweight approach: only a single thread can execute the locked region and all other threads are blocked on entry.
Let's look at a performance comparison without even using more than one thread:

```cpp
    std::vector<int> vec(N, 1.0);
    int sum = 0;
    auto accumulate = [&sum, &vec]() {
      for (auto &&item : vec) {
        sum = sum + 1; // critical section: benchmark std::atomic vs lock_guard vs no synchronization
      }
    };
```

```
g++ -std=c++17 serial_atomic_vs_lock.cpp -O3  && ./a.out 
```
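The file `serial_atomic_vs_lock.cpp` is not reproduced here. A minimal single-threaded sketch of such a comparison could look as follows; the helper `time_ms` and the use of `volatile` for the unsynchronized counter are assumptions of this sketch, not taken from that file:

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <iostream>
#include <mutex>

// Runs `body` n times and returns the elapsed wall-clock time in milliseconds.
template <class F>
double time_ms(std::size_t n, F body) {
  const auto start = std::chrono::steady_clock::now();
  for (std::size_t i = 0; i < n; ++i) body();
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Increments three counters n times each (single-threaded), prints timings,
// and returns true if all three variants produced the same sum.
bool compare_increment_variants(std::size_t n) {
  volatile long plain = 0; // volatile: keeps -O3 from folding the loop away
  std::atomic<long> atomic_sum{0};
  long locked = 0;
  std::mutex m;

  const double t_plain = time_ms(n, [&] { plain = plain + 1; });
  const double t_atomic = time_ms(n, [&] { atomic_sum.fetch_add(1); });
  const double t_lock = time_ms(n, [&] {
    std::lock_guard<std::mutex> lock(m);
    locked = locked + 1;
  });

  std::cout << "no sync: " << t_plain << " ms, atomic: " << t_atomic
            << " ms, lock_guard: " << t_lock << " ms\n";
  const long expected = static_cast<long>(n);
  return plain == expected && atomic_sum.load() == expected &&
         locked == expected;
}
```

The absolute numbers depend on compiler, flags, and hardware, so only the relative ordering of the three variants is of interest.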

## Atomic operations (std::atomic)
The standard library provides `std::atomic`, a wrapper for synchronizing access to entities that typically have native support for atomic operations:
- integer types
- pointer types

	```cpp
	std::atomic<int> a(0);
	a++;            // (1a) perform atomic increment (specialization for int)
	a.fetch_add(1); // (1b) equivalent
	a += 5;         // (2a) perform atomic addition (specialization for int)
	a.fetch_add(5); // (2b) equivalent
	```

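**Is the expression `a = a + 5;` below atomic as a whole?**
- No: the load of `a` on the RHS and the store to `a` on the LHS are each atomic, but the combination is not
- there is no guarantee about what other threads do between the two atomic operations; their updates to `a` in that window are overwritten (lost)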

```cpp
std::atomic<int> a(0);
a = a + 5;  // (3a)
a += 5;         // (3b) equivalent?
```
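The difference between (3a) and (3b) shows up once several threads are involved. In the following sketch, the helper `run_increments` is hypothetical:

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Each of `threads` threads applies `increment` to a shared counter
// `iters` times; returns the final counter value.
template <class Increment>
int run_increments(int threads, int iters, Increment increment) {
  assert(threads > 0 && iters >= 0);
  std::atomic<int> a(0);
  std::vector<std::thread> pool;
  for (int t = 0; t < threads; ++t)
    pool.emplace_back([&a, iters, increment] {
      for (int i = 0; i < iters; ++i) increment(a);
    });
  for (auto &th : pool) th.join();
  return a.load();
}

// (3a) separate atomic load and atomic store: updates can get lost.
int count_with_load_store(int threads, int iters) {
  return run_increments(threads, iters,
                        [](std::atomic<int> &a) { a = a + 1; });
}

// (3b) one atomic read-modify-write: no update is lost.
int count_with_rmw(int threads, int iters) {
  return run_increments(threads, iters,
                        [](std::atomic<int> &a) { a += 1; });
}
```

`count_with_rmw(4, 100000)` always yields 400000, whereas `count_with_load_store` may yield any smaller value. Neither version contains a data race, since every individual access is atomic; the first one still has a race condition.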

Let's now move to an example where the synchronization is actually required because multiple threads are involved:
```cpp
struct Other {
  int a = 5;
  int b = 5;
}; // a+b is always 10;

struct Widget { 
  Other o;
  void mod1() {
      --o.a;
      ++o.b;
  }
  void mod2() {
      ++o.a;
      --o.b;
  }
  int inspect() const { return o.a + o.b; }
};
```
We will look at how multi-threaded access to an `Other` through a `Widget` can be synchronized to guarantee the invariant of `Other`, namely `a+b==10`.

```
clang++ -std=c++17 mutex_lock.cpp -O3 -pthread && ./a.out
```
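The file `mutex_lock.cpp` is not reproduced here. One possible way to synchronize the access, sketched under the assumption that a single mutex per widget is acceptable (the names `SyncWidget` and `invariant_holds` are hypothetical):

```cpp
#include <cassert>
#include <mutex>
#include <thread>

struct Other {
  int a = 5;
  int b = 5;
}; // invariant: a + b == 10

// Sketch of a synchronized variant: every access to `o` holds the same mutex.
class SyncWidget {
  Other o;
  mutable std::mutex m; // mutable so that the const inspect() can lock it
public:
  void mod1() {
    std::lock_guard<std::mutex> lock(m);
    --o.a;
    ++o.b;
  }
  void mod2() {
    std::lock_guard<std::mutex> lock(m);
    ++o.a;
    --o.b;
  }
  int inspect() const {
    std::lock_guard<std::mutex> lock(m);
    return o.a + o.b;
  }
};

// Two threads modify the widget while the calling thread inspects it.
// Returns true if every inspection observed a + b == 10.
bool invariant_holds(int rounds) {
  assert(rounds >= 0);
  SyncWidget w;
  std::thread t1([&] { for (int i = 0; i < rounds; ++i) w.mod1(); });
  std::thread t2([&] { for (int i = 0; i < rounds; ++i) w.mod2(); });
  bool ok = true;
  for (int i = 0; i < rounds; ++i) ok = ok && (w.inspect() == 10);
  t1.join();
  t2.join();
  return ok && w.inspect() == 10;
}
```

Making `a` and `b` individually atomic would not be sufficient here: the invariant spans both members, so an inspection could still observe the state between the two updates.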

## std::condition_variable
Another important synchronization primitive in the standard library is `std::condition_variable`: it allows suspending the execution of threads and notifying a single one or all of them when a condition becomes true. This can be used to avoid busy waiting in existing threads that have completed their tasks and shall be reused once new tasks are available.


**Why can it be attractive to reuse threads for subsequent tasks?**
- the overhead of spawning a new thread is much larger than that of reusing an existing one


The `std::condition_variable` is always used in combination with a lock; let's see a minimal example to demonstrate its usefulness:
```
g++ -std=c++17 convar.cpp -O3 -pthread && ./a.out 
```
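The file `convar.cpp` is not reproduced here. A minimal sketch of the idea, assuming a single reusable worker thread that sums up integers from a queue (the function name `sum_on_worker` is hypothetical):

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

// A single worker sleeps until tasks arrive; returns the sum of 1..count.
int sum_on_worker(int count) {
  assert(count >= 0);
  std::mutex m;
  std::condition_variable cv;
  std::queue<int> tasks; // protected by m
  bool done = false;     // protected by m
  int sum = 0;           // only touched by the worker until join()

  std::thread worker([&] {
    std::unique_lock<std::mutex> lock(m);
    for (;;) {
      // wait() releases the lock while sleeping and re-checks the predicate
      // after every wake-up, which also covers spurious wake-ups.
      cv.wait(lock, [&] { return !tasks.empty() || done; });
      while (!tasks.empty()) {
        sum += tasks.front();
        tasks.pop();
      }
      if (done) return;
    }
  });

  for (int i = 1; i <= count; ++i) {
    {
      std::lock_guard<std::mutex> lock(m);
      tasks.push(i);
    }
    cv.notify_one(); // wake the worker if it is currently waiting
  }
  {
    std::lock_guard<std::mutex> lock(m);
    done = true;
  }
  cv.notify_one();
  worker.join();
  return sum;
}
```

The worker never spins: between tasks it is suspended inside `wait()` and consumes no CPU time. The shared state (`tasks`, `done`) is only modified while holding the mutex, so a notification cannot be missed between the predicate check and the suspension.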


## Summary

- ...
