Accelerating portable HPC Applications with Standard C++
===

# Lab 1: select

If Lab 0 (DAXPY) was quick to complete for you, Lab 1 proposes a slightly more advanced example which requires the decomposition of a problem into multiple algorithm calls. You will use different approaches, sequential and parallel, to write a function `select` which selects some elements of an input vector `v` according to a general, user-provided criterion and copies the selected elements consecutively into a new vector `w`.

This problem is easy to solve sequentially but faces an issue in a concurrent run: the position at which a thread writes into `w` depends on how many elements have been selected by other threads.
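
To make this dependency concrete, here is a minimal sequential sketch. The name `select_loop` is hypothetical and not part of the lab files. The write position `next` is the number of elements selected so far, so an iteration cannot know where to write until all earlier iterations have finished.

```c++
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper for illustration only; not part of the lab files.
template<class UnaryPredicate>
std::vector<int> select_loop(const std::vector<int>& v, UnaryPredicate pred)
{
    std::vector<int> w(v.size());   // over-allocate, shrink at the end
    std::size_t next = 0;           // write position, carried from one iteration to the next
    for (int x : v) {
        if (pred(x)) {
            w[next] = x;            // the slot depends on all earlier iterations
            ++next;
        }
    }
    assert(next <= v.size());
    w.resize(next);                 // keep only the selected elements
    return w;
}
```

If several threads ran the loop body at the same time, they would all read and update `next`, which is the conflict the parallel exercises below work around.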

## Initial condition

For all the exercises, the vector `v` is filled with pseudo-random numbers that are seeded with a constant value and are therefore identical from one execution to another.
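
As an illustration of why a constant seed gives reproducible input, the sketch below fills a vector with a fixed-seed generator. The function name `make_input`, the seed value, and the value range are assumptions made for this example; the lab's source files may initialize `v` differently.

```c++
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: name, seed, and value range are assumptions for illustration.
std::vector<int> make_input(std::size_t n)
{
    std::mt19937 gen(42);                            // constant seed: same sequence on every run
    std::uniform_int_distribution<int> dist(0, 99);  // assumed value range
    std::vector<int> v(n);
    for (auto& x : v) x = dist(gen);
    return v;
}
```

Calling `make_input(30)` twice yields the same 30 values, which makes the output of the different `select` implementations directly comparable.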



## Exercise 0: serial implementation

The goal of this first exercise is to write a version of `select` which calls the algorithm `copy_if`. It is simple and elegant, but not parallelizable.

A template for the solution is provided in [exercise0.cpp]. The `TODO` indicates the part of the template that must be completed.
A first version of `select` is provided, which looks as follows:

```c++
template<class UnaryPredicate>
std::vector<int> select(const std::vector<int>& v, UnaryPredicate pred)
{
    // TODO Instead of the line below, create a vector w and use a "copy_if" algorithm
    // call to copy all elements from v to w that are selected by the unary predicate.
    auto w = v;
    return w;
}
```

[exercise0.cpp]: ./exercise0.cpp

The example compiles and runs as provided, but it produces incorrect results due to the erroneous `select` implementation.
Replace the erroneous line with an appropriate call to the `copy_if` algorithm.
Hint: You can't allocate the right number of elements for `w` in advance, because you don't know how many elements the `copy_if` algorithm is going to copy. Instead, use `std::back_inserter` to create an iterator which inserts elements at the back of `w` and resizes the vector appropriately as the algorithm progresses.
Once you fix the code, the following block should compile and run correctly:



In [None]:
!g++ -std=c++20 -Ofast -DNDEBUG -o select exercise0.cpp
!./select 30

In [None]:
!clang++ -std=c++20 -Ofast -DNDEBUG -isystem/usr/local/range-v3/include -o select exercise0.cpp
!./select 30

In [None]:
!nvc++ -std=c++20 -fast -DNDEBUG -o select exercise0.cpp
!./select 30

In [None]:
#!dpcpp -std=c++20 -Ofast -DNDEBUG -isystem/usr/local/range-v3/include -o select exercise0.cpp
#!./select 30

### Solutions Exercise 0

The solutions for each example are available in the `solutions/` sub-directory.
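
For orientation, one possible shape of such a solution is sketched below, following the hint about `std::back_inserter`. The file in `solutions/` may differ in detail, and the function is named `select_copy_if` here (a hypothetical name) to keep it distinct from the lab's own `select`.

```c++
#include <algorithm>
#include <iterator>
#include <vector>

// Sketch only; the name select_copy_if is hypothetical.
template<class UnaryPredicate>
std::vector<int> select_copy_if(const std::vector<int>& v, UnaryPredicate pred)
{
    std::vector<int> w;                    // starts empty: the final size is not known yet
    std::copy_if(v.begin(), v.end(),
                 std::back_inserter(w),    // each write through this iterator calls w.push_back
                 pred);
    return w;
}
```

Because every write goes through `push_back`, the elements are appended one after another, which is what prevents this version from running in parallel.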

The following compiles and runs the solutions for Exercise 0 using different compilers and C++ standard versions.

In [None]:
!g++ -std=c++17 -Ofast -DNDEBUG -o select solutions/exercise0.cpp
!./select 30

In [None]:
!clang++ -std=c++17 -Ofast -DNDEBUG -isystem/usr/local/range-v3/include -o select solutions/exercise0.cpp
!./select 30

In [None]:
!nvc++ -std=c++20 -fast -DNDEBUG -o select solutions/exercise0.cpp
!./select 30

In [None]:
#!dpcpp -std=c++20 -Ofast -DNDEBUG -isystem/usr/local/range-v3/include -o select solutions/exercise0.cpp
#!./select 30

## Exercise 1

The code of exercise 0 cannot run in parallel, because the back inserter adds elements to `w` sequentially.
Open [exercise1.cpp], look out for the `TODO` comment and correct the code by proceeding in three steps:
1. Use `transform` to create a vector `v_sel` which has the same length as `v` and is filled with 0/1 values, depending on the result of the unary predicate applied to the corresponding element of `v`.
2. Use `inclusive_scan` to compute the cumulative sum of `v_sel` and store the result into the vector `index`, which provides the positions in `w` of the selected elements of `v`. *Attention: with `inclusive_scan`, the indices are off by one, so a selected element `v[i]` belongs at position `index[i] - 1` of `w`. `exclusive_scan` would not have this off-by-one offset, but `inclusive_scan` is quite convenient here: its last element indicates the total number of selected elements, and thus the number of elements to allocate for `w`*.
3. Use a `for_each` statement to copy values from `v` to `w`, depending on the outcome of the unary predicate.
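
The three steps can be traced on a small input. The sketch below is a sequential illustration under stated assumptions: the function name `select_scan` and the helper vector `ids` are hypothetical, and no execution policy is passed so that it builds without a parallel backend. In the lab, the same algorithm calls would take an execution policy such as `std::execution::par` as their first argument.

```c++
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Sketch only; select_scan and ids are hypothetical names.
template<class UnaryPredicate>
std::vector<int> select_scan(const std::vector<int>& v, UnaryPredicate pred)
{
    if (v.empty()) return {};

    // Step 1: v_sel[i] is 1 if v[i] is selected, 0 otherwise.
    std::vector<int> v_sel(v.size());
    std::transform(v.begin(), v.end(), v_sel.begin(),
                   [pred](int x) { return pred(x) ? 1 : 0; });

    // Step 2: running count of selected elements, e.g. {0,1,0,1} -> {0,1,1,2}.
    std::vector<int> index(v.size());
    std::inclusive_scan(v_sel.begin(), v_sel.end(), index.begin());

    // The last entry of the inclusive scan is the total number of selected elements.
    std::vector<int> w(index.back());

    // Step 3: each selected v[i] goes to slot index[i] - 1 (inclusive scan is off by one).
    std::vector<std::size_t> ids(v.size());
    std::iota(ids.begin(), ids.end(), std::size_t{0});
    std::for_each(ids.begin(), ids.end(), [&](std::size_t i) {
        if (pred(v[i])) w[index[i] - 1] = v[i];
    });
    return w;
}
```

With `v = {3, 8, 1, 6}` and a predicate that selects even numbers, `v_sel` is `{0, 1, 0, 1}`, `index` is `{0, 1, 1, 2}`, and `w` receives `8` at position 0 and `6` at position 1. Every iteration of the last step writes to a distinct slot computed in advance, so the iterations no longer depend on each other.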

Once the code is completed, the following blocks should complete properly and produce the same output as in exercise 0.

[exercise1.cpp]: ./exercise1.cpp



In [None]:
!g++ -std=c++20 -Ofast -DNDEBUG -o select exercise1.cpp -ltbb
!./select 30

In [None]:
!clang++ -std=c++20 -Ofast -DNDEBUG -isystem/usr/local/range-v3/include -o select exercise1.cpp -ltbb
!./select 30

In [None]:
!nvc++ -stdpar=multicore -std=c++20 -fast -DNDEBUG -o select exercise1.cpp
!./select 30

In [None]:
!nvc++ -stdpar=gpu -std=c++20 -fast -DNDEBUG -o select exercise1.cpp
!./select 30

In [None]:
#!dpcpp -std=c++20 -Ofast -DNDEBUG -isystem/usr/local/range-v3/include -o select exercise1.cpp
#!./select 30

### Solutions Exercise 1

The solutions for each example are available in the `solutions/` sub-directory.

The following compiles and runs the solutions for Exercise 1 using different compilers and C++ standard versions.

In [None]:
!g++ -std=c++17 -Ofast -DNDEBUG -o select solutions/exercise1.cpp -ltbb
!./select 30

In [None]:
!clang++ -std=c++17 -Ofast -DNDEBUG -isystem/usr/local/range-v3/include -o select solutions/exercise1.cpp -ltbb
!./select 30

In [None]:
!nvc++ -stdpar=multicore -std=c++20 -fast -DNDEBUG -o select solutions/exercise1.cpp
!./select 30

In [None]:
!nvc++ -stdpar=gpu -std=c++20 -fast -DNDEBUG -o select solutions/exercise1.cpp
!./select 30

In [None]:
#!dpcpp -std=c++20 -Ofast -DNDEBUG -isystem/usr/local/range-v3/include -o select solutions/exercise1.cpp
#!./select 30

## Exercise 2

In exercise 1, we decomposed the selection process into three algorithm calls to clearly illustrate the different steps involved in the parallelizable approach. But C++ offers the algorithm `transform_inclusive_scan`, which combines the first two steps and avoids the need to allocate the vector `v_sel`.
Open [exercise2.cpp], look out for the `TODO` comment and implement `select` through a parallelizable approach as before, but in two steps, using `transform_inclusive_scan`.
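
As a hedged illustration of how `transform_inclusive_scan` fuses the first two steps, the sketch below computes only the `index` vector. The name `selection_index` is hypothetical, and the call is shown without an execution policy so that it builds without a parallel backend.

```c++
#include <functional>
#include <numeric>
#include <vector>

// Sketch only; selection_index is a hypothetical name.
template<class UnaryPredicate>
std::vector<int> selection_index(const std::vector<int>& v, UnaryPredicate pred)
{
    std::vector<int> index(v.size());
    std::transform_inclusive_scan(
        v.begin(), v.end(), index.begin(),
        std::plus<int>{},                             // scan operation: running sum
        [pred](int x) { return pred(x) ? 1 : 0; });   // transformation: 0/1 flag per element
    return index;
}
```

The 0/1 flags are produced on the fly and fed directly into the running sum, so no intermediate `v_sel` vector is stored. For `v = {3, 8, 1, 6}` and an even-number predicate, the result is `{0, 1, 1, 2}`, the same `index` as in exercise 1.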

Once the code is completed, the following blocks should complete properly and produce the same output as in exercise 0 and exercise 1.

[exercise2.cpp]: ./exercise2.cpp



In [None]:
!g++ -std=c++20 -Ofast -DNDEBUG -o select exercise2.cpp -ltbb
!./select 30

In [None]:
!clang++ -std=c++20 -Ofast -DNDEBUG -isystem/usr/local/range-v3/include -o select exercise2.cpp -ltbb
!./select 30

In [None]:
!nvc++ -stdpar=multicore -std=c++20 -fast -DNDEBUG -o select exercise2.cpp
!./select 30

In [None]:
!nvc++ -stdpar=gpu -std=c++20 -fast -DNDEBUG -o select exercise2.cpp
!./select 30

In [None]:
#!dpcpp -std=c++20 -Ofast -DNDEBUG -isystem/usr/local/range-v3/include -o select exercise2.cpp
#!./select 30

### Solutions Exercise 2

The solutions for each example are available in the `solutions/` sub-directory.

The following compiles and runs the solutions for Exercise 2 using different compilers and C++ standard versions.

In [None]:
!g++ -std=c++17 -Ofast -DNDEBUG -o select solutions/exercise2.cpp -ltbb
!./select 30

In [None]:
!clang++ -std=c++17 -Ofast -DNDEBUG -isystem/usr/local/range-v3/include -o select solutions/exercise2.cpp -ltbb
!./select 30

In [None]:
!nvc++ -stdpar=multicore -std=c++20 -fast -DNDEBUG -o select solutions/exercise2.cpp
!./select 30

In [None]:
!nvc++ -stdpar=gpu -std=c++20 -fast -DNDEBUG -o select solutions/exercise2.cpp
!./select 30

In [None]:
#!dpcpp -std=c++20 -Ofast -DNDEBUG -isystem/usr/local/range-v3/include -o select solutions/exercise2.cpp
#!./select 30