Lab 1: DAXPY - Accelerating portable HPC Applications with Standard C++
===

This tutorial will familiarize you with the C++ parallel algorithms. We'll parallelize Double-precision $A \cdot X + Y$ (`daxpy`), one of the main algorithms in the standard Basic Linear Algebra Subprograms (BLAS) library. It scales the elements of the vector $X$ by the scalar $A$ and adds the result to the vector $Y$.

We initialize `x[i] = i` and `y[i] = 2`, and all exercises validate your implementation of `daxpy` by checking `y` afterwards. 
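To make the validation concrete: with `x[i] = i` and `y[i] = 2`, a single `daxpy` call should leave `y[i] == 2 + a * i`. The sketch below checks exactly that. The helper name `check_daxpy` is hypothetical and not part of the lab sources, and the exact floating-point comparison is an assumption that holds for small integer-valued inputs; the checks in the exercise files may differ.

```c++
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the lab sources): returns true if `y`
// holds the result of one daxpy call on x[i] = i and y[i] = 2.
bool check_daxpy(double a, std::vector<double> const &y) {
  for (std::size_t i = 0; i < y.size(); ++i) {
    double expected = 2. + a * static_cast<double>(i);
    // Exact comparison is fine for small integer-valued data;
    // a tolerance would be safer for general inputs.
    if (y[i] != expected) return false;
  }
  return true;
}
```

For example, with `a = 2` and four elements, the expected contents of `y` are `{2, 4, 6, 8}`.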

## Sequential implementation

A working sequential implementation is provided in [starting_point.cpp]. All exercises focus on the following two main functions:

```c++
/// Initialize vectors `x` and `y`: raw loop sequential version
void initialize(std::vector<double> &x, std::vector<double> &y) {
  for (std::size_t i = 0; i < x.size(); ++i) {
    x[i] = (double)i;
    y[i] = 2.;
  }
}

/// DAXPY: AX + Y: raw loop sequential version
void daxpy(double a, std::vector<double> const &x, std::vector<double> &y) {
  for (std::size_t i = 0; i < y.size(); ++i)
    y[i] += a * x[i];
}
```

[starting_point.cpp]: ./starting_point.cpp

Let's start by checking the version of some of the compilers installed in the image:

In [None]:
!g++ --version
!clang++ --version
!nvc++ --version

---

Now let's compile and run the starting point:

In [None]:
!g++ -std=c++20 -o daxpy starting_point.cpp
!./daxpy 1000000

Here, the `-std=c++20` flag selects the C++ language version.

Let's try again with optimizations using `-Ofast`, disabling debug checks `-DNDEBUG`, and compiling for the current CPU using `-march=native`:

In [None]:
!g++ -std=c++20 -Ofast -march=native -DNDEBUG -o daxpy starting_point.cpp
!./daxpy 10000000

## Exercise 1: From raw DAXPY loop to sequential C++ `std::for_each_n` algorithm

The goal of this first exercise is to re-write the raw DAXPY loop by combining:
- the C++ standard library [`std::for_each_n`] algorithm, and
- [`std::views::iota`] to create an iterator over a range of integers.

You can click on the links to access their documentation. 
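As a warm-up that is deliberately not the DAXPY solution, the sketch below shows how `std::for_each_n(first, count, f)` calls `f` on `count` consecutive elements starting at `first`. To keep it buildable as plain C++17, it materializes the indices in a vector with `std::iota` from `<numeric>`; this stands in for the `std::views::iota(0).begin()` iterator used in the exercise, which produces the same integers without storing them. The function name `squares_of_first_n` is hypothetical.

```c++
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Illustrative only: writes i * i into out[i] for i in [0, n).
// `squares_of_first_n` is a hypothetical name, not part of the lab sources.
std::vector<int> squares_of_first_n(std::size_t n) {
  std::vector<int> indices(n);
  std::iota(indices.begin(), indices.end(), 0);  // indices = {0, 1, ..., n-1}

  std::vector<int> out(n);
  // for_each_n calls the lambda once per index, for the first n indices.
  std::for_each_n(indices.begin(), n, [&out](int i) { out[i] = i * i; });
  return out;
}
```

Calling `squares_of_first_n(4)` yields `{0, 1, 4, 9}`.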

A template for the solution is provided in [exercise1.cpp]. Please implement your solution by modifying only the sections of the code that have `// TODO` comments.
There is no need to modify any other place in the program.

```c++
#include <chrono>
// TODO: add C++ standard library includes as necessary

/// DAXPY: AX + Y: sequential algorithm version
void daxpy(double a, std::vector<double> const &x, std::vector<double> &y) {
  // TODO: Implement using
  // - std::views::iota(0).begin() iterator
  // - std::for_each_n algorithm 
}
```

To test your solution, compile and run the template with the commands provided in the next cells. As is, the template compiles, but produces incorrect results because the `daxpy` implementation provided is empty. Once you fix it, the following cells should compile and run correctly.

The following commands compile and run the [Exercise 1 Template](./exercise1.cpp):

[exercise1.cpp]: ./exercise1.cpp
[`std::for_each_n`]: https://en.cppreference.com/w/cpp/algorithm/for_each_n
[`std::views::iota`]: https://en.cppreference.com/w/cpp/ranges/iota_view

In [None]:
!rm daxpy || true
!g++ -std=c++20 -Ofast -march=native -DNDEBUG -o daxpy exercise1.cpp
!./daxpy 1000000
!rm daxpy || true
!clang++ -std=c++20 -Ofast -march=native -DNDEBUG -o daxpy exercise1.cpp
!./daxpy 1000000
!rm daxpy || true
!nvc++ -std=c++20 -O4 -fast -march=native -Mllvm-fast -DNDEBUG -o daxpy exercise1.cpp
!./daxpy 1000000

### Solutions Exercise 1

The solutions for each example are available in the [`solutions/`] sub-directory; for the first exercise: [`solutions/exercise1.cpp`].

The following block compiles and runs the solutions for Exercise 1 using different compilers.

[`solutions/`]: ./solutions
[`solutions/exercise1.cpp`]: ./solutions/exercise1.cpp

In [None]:
!rm daxpy || true
!g++ -std=c++20 -Ofast -march=native -DNDEBUG -o daxpy solutions/exercise1.cpp
!./daxpy 1000000
!rm daxpy || true
!clang++ -std=c++20 -Ofast -march=native -DNDEBUG -o daxpy solutions/exercise1.cpp
!./daxpy 1000000
!rm daxpy || true
!nvc++ -std=c++20 -Ofast -march=native -DNDEBUG -o daxpy solutions/exercise1.cpp
!./daxpy 1000000

## Exercise 2: Parallelizing DAXPY with execution policies

To run DAXPY in parallel, the only things we need to do are:
- obtain access to the execution policies by including the `<execution>` header, and
- pass the [`std::execution::par`] policy as the first argument of the [`std::for_each_n`] algorithm.

```c++
#include <algorithm>
// TODO: add C++ standard library includes as necessary
// #include <...>

/// DAXPY: AX + Y: parallel algorithm version
void daxpy(double a, std::vector<double> const &x, std::vector<double> &y) {
  std::for_each_n(// TODO: pass std::execution::par, as first argument 
                  std::views::iota(0).begin(), x.size(), [&](int i) {
    y[i] += a * x[i];
  });
}
```

A template for the solution is provided in the [exercise2.cpp] file; the following cell compiles and runs this template. Notice that now the compilation options have changed:
- `clang` and `gcc`: need to link with the TBB library using the `-ltbb` flag
- `nvc++`: need to use the `-stdpar=multicore` or `-stdpar=gpu` flags

Once you make the changes, you should see the performance increase while the tests still pass.

[exercise2.cpp]: ./exercise2.cpp
[`std::for_each_n`]: https://en.cppreference.com/w/cpp/algorithm/for_each_n
[`std::execution::par`]: https://en.cppreference.com/w/cpp/algorithm/execution_policy_tag

In [None]:
!rm daxpy || true
!g++ -std=c++20 -Ofast -march=native -DNDEBUG -o daxpy exercise2.cpp -ltbb
!./daxpy 1000000
!rm daxpy || true
!clang++ -std=c++20 -Ofast -march=native -DNDEBUG -o daxpy exercise2.cpp -ltbb
!./daxpy 1000000
!rm daxpy || true
!nvc++ -std=c++20 -Ofast -march=native -DNDEBUG -stdpar=multicore -o daxpy exercise2.cpp
!./daxpy 1000000
!rm daxpy || true
!nvc++ -std=c++20 -Ofast -march=native -DNDEBUG -stdpar=gpu -o daxpy exercise2.cpp
!./daxpy 100000000

### Solutions Exercise 2

The following block compiles and runs [`solutions/exercise2.cpp`]:

[`solutions/exercise2.cpp`]: ./solutions/exercise2.cpp

In [None]:
!rm daxpy || true
!g++ -std=c++20 -Ofast -march=native -DNDEBUG -o daxpy solutions/exercise2.cpp -ltbb
!./daxpy 1000000
!rm daxpy || true
!clang++ -std=c++20 -Ofast -march=native -DNDEBUG -o daxpy solutions/exercise2.cpp -ltbb
!./daxpy 1000000
!rm daxpy || true
!nvc++ -std=c++20 -Ofast -march=native -DNDEBUG -stdpar=multicore -o daxpy solutions/exercise2.cpp
!./daxpy 1000000
!rm daxpy || true
!nvc++ -std=c++20 -Ofast -march=native -DNDEBUG -stdpar=gpu -o daxpy solutions/exercise2.cpp
!./daxpy 100000000

## Exercise 3: Improving lambda captures for GPU performance

In the previous exercise, our parallel implementation captures everything in the lambda by reference, i.e., with a `[&](...) { ... }` capture clause. This only works on heterogeneous platforms that are coherent. On modern hardware-coherent platforms like Grace Hopper, it works really well. This notebook, however, runs on a software-coherent platform, where capturing by reference costs us some performance.

In this exercise, we will learn how to recover that performance and how to write code that also works on non-coherent platforms. The solution to both issues is the same: modify the lambda's capture clause to capture by value:

```c++
void daxpy(double a, std::vector<double> const &x, std::vector<double> &y) {
  std::for_each_n(std::execution::par,
                  std::views::iota(0).begin(), x.size(),    
    [/* TODO: capture by value: a, x.data() and y.data() */](int i) {
        y[i] += a * x[i];
  });
}
```

Making a copy of the scalar `a` is fine, so we can just capture it by value: `[a]`. However, we do not want to make a copy of the vectors `x` and `y`, since that would duplicate their elements. Instead, we only want to copy pointers to the data so that we can access it directly, which we can do with an init-capture such as `[x = x.data()]`.
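The capture clause can be tried out in isolation, without any parallel algorithm. The following is a minimal sequential sketch, assuming a plain loop as the driver; the names `scale_add`, `xp`, and `yp` are hypothetical and chosen only to make clear which objects are pointers.

```c++
#include <cstddef>
#include <vector>

// Sequential sketch of the capture clause; `scale_add` is a hypothetical name.
void scale_add(double a, std::vector<double> const &x, std::vector<double> &y) {
  // `a` is copied into the lambda. `xp` and `yp` are pointers copied into the
  // lambda; the vector elements they point to are not copied.
  auto body = [a, xp = x.data(), yp = y.data()](std::size_t i) {
    yp[i] += a * xp[i];
  };
  for (std::size_t i = 0; i < x.size(); ++i) body(i);
}
```

Because the lambda holds only a `double` and two pointers, it no longer refers to the `std::vector` objects themselves, which is the property the exercise relies on.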

A template for the solution is provided in the [exercise3.cpp] file; the following cell compiles and runs this template. Once you make the changes, you should see the performance increase while the tests still pass.

[exercise3.cpp]: ./exercise3.cpp

In [None]:
!rm daxpy || true
!g++ -std=c++20 -Ofast -march=native -DNDEBUG -o daxpy exercise3.cpp -ltbb
!./daxpy 1000000
!rm daxpy || true
!clang++ -std=c++20 -Ofast -march=native -DNDEBUG -o daxpy exercise3.cpp -ltbb
!./daxpy 1000000
!rm daxpy || true
!nvc++ -std=c++20 -Ofast -march=native -DNDEBUG -stdpar=multicore -o daxpy exercise3.cpp
!./daxpy 1000000
!rm daxpy || true
!nvc++ -std=c++20 -Ofast -march=native -DNDEBUG -stdpar=gpu -o daxpy exercise3.cpp
!./daxpy 100000000

### Solutions Exercise 3

The following block compiles and runs [`solutions/exercise3.cpp`]:

[`solutions/exercise3.cpp`]: ./solutions/exercise3.cpp

In [None]:
!rm daxpy || true
!g++ -std=c++20 -Ofast -march=native -DNDEBUG -o daxpy solutions/exercise3.cpp -ltbb
!./daxpy 1000000
!rm daxpy || true
!clang++ -std=c++20 -Ofast -march=native -DNDEBUG -o daxpy solutions/exercise3.cpp -ltbb
!./daxpy 1000000
!rm daxpy || true
!nvc++ -std=c++20 -Ofast -march=native -DNDEBUG -stdpar=multicore -o daxpy solutions/exercise3.cpp
!./daxpy 1000000
!rm daxpy || true
!nvc++ -std=c++20 -Ofast -march=native -DNDEBUG -stdpar=gpu -o daxpy solutions/exercise3.cpp
!./daxpy 100000000

## Exercise 4: Know your algorithms: `transform_reduce`

In this exercise, we'll parallelize a variant of `daxpy` we'll call `daxpy_sum`: 

```c++
/// DAXPY: AX + Y and returns sum(Y)
double daxpy_sum(double a, std::vector<double> const &x, std::vector<double> &y) {
  auto ints = std::views::iota(0, (int)x.size());
  double sum = 0.;
  for (auto i : ints) {
    y[i] += a * x[i];
    sum += y[i];
  }
  return sum;
}
```

This new algorithm, like all previous exercises, performs a `daxpy`, but it also adds all elements of `y` up, that is, it also performs a reduction. 

We **cannot** solve this exercise by just using `std::for_each_n` like we did above, and directly updating `sum` concurrently from within the lambda, because C++ does not allow multiple threads to mutate a single shared value without extra synchronization (e.g. locks or atomics), which is in general expensive. 

The most efficient way to solve this problem is to use the [`std::transform_reduce`] algorithm (click for documentation; we'll be using the `(3)` overload). This algorithm iterates over all elements of the sequence, and:
- applies a function that takes one argument, `m`, to every element: `m(e)`,
- combines multiple elements using a function that takes two arguments, `r`: `r(m(e0), m(e1))` or `r(m(e2), r(m(e0), m(e1)))`, etc.

To access [`std::transform_reduce`], include the `<numeric>` header. The API we will be using is the following:

```c++
template <typename Iter, typename T, typename BinaryReduction, typename UnaryFunction>
T transform_reduce(std::execution::par,     // Execution policy
                   Iter begin, Iter end,    // [begin, end) range 
                   T init,                  // Initial value for reduction
                   BinaryReduction r,       // Binary reduction: r(x, y) -> T above
                   UnaryFunction   m);      // Unary function m(e) applied to every element in [begin, end)
```
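To see the roles of `m` and `r` on a problem other than `daxpy_sum`, here is a small sketch that computes a sum of squares. It uses the sequential overload of `std::transform_reduce` (the same arguments without an execution policy) so that it builds without any parallel runtime; the function name `sum_of_squares` is hypothetical.

```c++
#include <numeric>
#include <vector>

// Hypothetical example: returns v[0]^2 + v[1]^2 + ... using transform_reduce.
double sum_of_squares(std::vector<double> const &v) {
  return std::transform_reduce(
      v.begin(), v.end(),                        // [begin, end) range
      0.,                                        // initial value of the sum
      [](double l, double r) { return l + r; },  // r: combines two partial results
      [](double e) { return e * e; });           // m: applied to every element
}
```

For the input `{1, 2, 3}` this returns `14`. Note that `m` only maps an element to a value here; in `daxpy_sum` the unary function additionally updates `y[i]` before returning it.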

Since we want to add all elements, we'll use `std::plus` from the `#include <functional>` header as our binary reduction, so that our `daxpy_sum` becomes: 

```c++
double daxpy_sum(double a, std::vector<double> const &x, std::vector<double> &y) {
  auto ints = std::views::iota(0, (int)x.size());
  return std::transform_reduce(std::execution::par, ints.begin(), ints.end(), 0., std::plus{}, 
    [a, x = x.data(), y = y.data()](int i) {
        // TODO: y[i] += a * x[i];
        return /* TODO: y[i] */;
  });
}
```

[`std::transform_reduce`]: https://en.cppreference.com/w/cpp/algorithm/transform_reduce

In [None]:
!rm daxpy || true
!g++ -std=c++20 -Ofast -march=native -DNDEBUG -o daxpy exercise4.cpp -ltbb
!./daxpy 1000000
!rm daxpy || true
!clang++ -std=c++20 -Ofast -march=native -DNDEBUG -o daxpy exercise4.cpp -ltbb
!./daxpy 1000000
!rm daxpy || true
!nvc++ -std=c++20 -Ofast -march=native -DNDEBUG -stdpar=multicore -o daxpy exercise4.cpp
!./daxpy 1000000
!rm daxpy || true
!nvc++ -std=c++20 -Ofast -march=native -DNDEBUG -stdpar=gpu -o daxpy exercise4.cpp
!./daxpy 100000000

### Solutions Exercise 4

The following block compiles and runs [`solutions/exercise4.cpp`]:

[`solutions/exercise4.cpp`]: ./solutions/exercise4.cpp

In [None]:
!rm daxpy || true
!g++ -std=c++20 -Ofast -march=native -DNDEBUG -o daxpy solutions/exercise4.cpp -ltbb
!./daxpy 1000000
!rm daxpy || true
!clang++ -std=c++20 -Ofast -march=native -DNDEBUG -o daxpy solutions/exercise4.cpp -ltbb
!./daxpy 1000000
!rm daxpy || true
!nvc++ -std=c++20 -Ofast -march=native -DNDEBUG -stdpar=multicore -o daxpy solutions/exercise4.cpp
!./daxpy 1000000
!rm daxpy || true
!nvc++ -std=c++20 -Ofast -march=native -DNDEBUG -stdpar=gpu -o daxpy solutions/exercise4.cpp
!./daxpy 100000000

## [Optional] Exercise 5: Know your algorithms: `fill_n`

In this exercise, we are going to parallelize the `initialize` function as follows: 
- Initialize `x[i] = i;` using [`std::for_each_n`] with [`std::views::iota`], just like in the previous exercise.
- Initialize `y[i] = 2.;` using the [`std::fill_n`] algorithm, which writes the same value to all elements of a range.

```c++
/// Initialize vectors `x` and `y`: parallel algorithm version
void initialize(std::vector<double> &x, std::vector<double> &y) {
  // TODO: parallelize the initialization using
  //  - for_each_n + views::iota to initialize x
  //  - fill_n to initialize y
  // for (std::size_t i = 0; i < x.size(); ++i) {
  //   x[i] = (double)i;
  //   y[i] = 2.;
  // }
}
```

The API of [`std::fill_n`] is (click on link for documentation):

```c++
std::fill_n(std::execution::par, // Execution policy
            iterator,            // Iterator to the elements, e.g., a pointer
            number_of_elements,  // Number of elements
            value);              // Value to initialize all elements to
```
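One detail worth seeing in isolation is that `std::fill_n` writes exactly `number_of_elements` values starting at the iterator and leaves everything after that untouched. The sketch below uses the sequential overload (no execution policy) and a raw pointer as the iterator; the helper name `fill_prefix` is hypothetical.

```c++
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: overwrites the first `count` elements of `v` with `value`.
void fill_prefix(std::vector<double> &v, std::size_t count, double value) {
  // Assumes count <= v.size(); fill_n itself performs no bounds checking.
  std::fill_n(v.data(), count, value);
}
```

Applied to `{1, 1, 1, 1, 1}` with `count = 3` and `value = 2`, the vector becomes `{2, 2, 2, 1, 1}`. To initialize all of `y`, the count would be `y.size()`.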

A template for the solution is provided in [exercise5.cpp]; it compiles and runs as provided, but produces incorrect results due to the incomplete implementation of the `initialize` function. Once you fix it, the following block should compile and run correctly.

[`std::fill_n`]: https://en.cppreference.com/w/cpp/algorithm/fill_n 
[`std::for_each_n`]: https://en.cppreference.com/w/cpp/algorithm/for_each_n 
[`std::views::iota`]: https://en.cppreference.com/w/cpp/ranges/iota_view
[exercise5.cpp]: ./exercise5.cpp

In [None]:
!rm daxpy || true
!g++ -std=c++20 -Ofast -march=native -DNDEBUG -o daxpy exercise5.cpp -ltbb
!./daxpy 1000000
!rm daxpy || true
!clang++ -std=c++20 -Ofast -march=native -DNDEBUG -o daxpy exercise5.cpp -ltbb
!./daxpy 1000000
!rm daxpy || true
!nvc++ -std=c++20 -Ofast -march=native -DNDEBUG -stdpar=multicore -o daxpy exercise5.cpp
!./daxpy 1000000
!rm daxpy || true
!nvc++ -std=c++20 -Ofast -march=native -DNDEBUG -stdpar=gpu -o daxpy exercise5.cpp
!./daxpy 100000000

### Solutions Exercise 5

The following block compiles and runs [`solutions/exercise5.cpp`]:

[`solutions/exercise5.cpp`]: ./solutions/exercise5.cpp

In [None]:
!rm daxpy || true
!g++ -std=c++20 -Ofast -march=native -DNDEBUG -o daxpy solutions/exercise5.cpp -ltbb
!./daxpy 1000000
!rm daxpy || true
!clang++ -std=c++20 -Ofast -march=native -DNDEBUG -o daxpy solutions/exercise5.cpp -ltbb
!./daxpy 1000000
!rm daxpy || true
!nvc++ -std=c++20 -Ofast -march=native -DNDEBUG -stdpar=multicore -o daxpy solutions/exercise5.cpp
!./daxpy 1000000
!rm daxpy || true
!nvc++ -std=c++20 -Ofast -march=native -DNDEBUG -stdpar=gpu -o daxpy solutions/exercise5.cpp
!./daxpy 100000000

If you are done quickly, please continue with the optional [Lab 1: Select](../lab1_select/select.ipynb).

## Exercise 6: Process multiple elements per iteration with multi-dimensional span

In this exercise, we'll process multiple elements per task by _tiling_ our 1D vectors as 2D matrices using [`std::mdspan`].
We then perform a parallel for over the number of rows, `nrows`, and then sequentially process the `ncols` elements of each row.

```c++
void daxpy(double a, std::vector<double> &x, std::vector<double> &y, size_t ncols = 1) {
  assert(x.size() == y.size());
  if (x.size() % ncols != 0) { 
      std::cerr << "ERROR: size " << x.size() << " not divisible by " << ncols << std::endl; 
      std::abort(); 
  }
  size_t nrows = x.size() / ncols;

  std::mdspan xs { x.data(), nrows, ncols };
  std::mdspan ys { y.data(), nrows, ncols };
  std::for_each_n(std::execution::par,
                  std::views::iota(0).begin(), nrows, [=](int row) {
        for (size_t col = 0; col < ncols; ++col) ys(row, col) += a * xs(row, col);
  });
}
```
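It can help to see what the two-dimensional view does in terms of plain indices. With the default `std::mdspan` layout (`std::layout_right`, i.e. row-major), element `(row, col)` of an `nrows` by `ncols` view lives at offset `row * ncols + col` of the underlying vector. The sketch below spells out the same tiling with explicit index arithmetic and sequential loops; it does not use `std::mdspan` or a parallel algorithm, and the name `daxpy_tiled` is hypothetical.

```c++
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of the tiling with explicit index arithmetic instead of std::mdspan.
// `daxpy_tiled` is a hypothetical name, not part of the lab sources.
void daxpy_tiled(double a, std::vector<double> const &x, std::vector<double> &y,
                 std::size_t ncols) {
  assert(x.size() == y.size());
  assert(ncols != 0 && x.size() % ncols == 0);
  std::size_t nrows = x.size() / ncols;
  for (std::size_t row = 0; row < nrows; ++row) {    // the parallel loop in the exercise
    for (std::size_t col = 0; col < ncols; ++col) {  // the sequential inner loop
      std::size_t i = row * ncols + col;             // row-major offset of (row, col)
      y[i] += a * x[i];
    }
  }
}
```

Every offset in `[0, x.size())` is visited exactly once, so the result is the same as for the untiled loop; only the grouping of elements into tasks changes.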

[`std::mdspan`]: https://en.cppreference.com/w/cpp/container/mdspan

In [None]:
!rm daxpy || true
!g++ -std=c++23 -Ofast -march=native -DNDEBUG -o daxpy exercise6.cpp -ltbb
!./daxpy 1000000
!rm daxpy || true
!clang++ -std=c++23 -Ofast -march=native -DNDEBUG -o daxpy exercise6.cpp -ltbb
!./daxpy 1000000
!rm daxpy || true
!nvc++ -std=c++23 -Ofast -march=native -DNDEBUG -stdpar=multicore -o daxpy exercise6.cpp
!./daxpy 1000000
!nvc++ -std=c++23 -Ofast -march=native -DNDEBUG -stdpar=gpu -o daxpy exercise6.cpp
!./daxpy 100000000

### Solutions Exercise 6

The following block compiles and runs [`solutions/exercise6.cpp`]:

[`solutions/exercise6.cpp`]: ./solutions/exercise6.cpp

In [None]:
!rm daxpy || true
!g++ -std=c++23 -Ofast -march=native -DNDEBUG -o daxpy solutions/exercise6.cpp -ltbb
!./daxpy 1000000
!rm daxpy || true
!clang++ -std=c++23 -Ofast -march=native -DNDEBUG -o daxpy solutions/exercise6.cpp -ltbb
!./daxpy 1000000
!rm daxpy || true
!nvc++ -std=c++23 -Ofast -march=native -DNDEBUG -stdpar=multicore -o daxpy solutions/exercise6.cpp
!./daxpy 1000000
!nvc++ -std=c++23 -Ofast -march=native -DNDEBUG -stdpar=gpu -o daxpy solutions/exercise6.cpp
!./daxpy 100000000

## [Optional] Exercise 7: Modify mdspan layout

In this optional exercise, we'll change the layout of the [`std::mdspan`] from `std::layout_right`, which is the default and is used in the code below, to [`std::layout_left`]:

```c++
void daxpy(double a, std::vector<double> &x, std::vector<double> &y, size_t ncols = 1) {
  assert(x.size() == y.size());
  if (x.size() % ncols != 0) { 
      std::cerr << "ERROR: size " << x.size() << " not divisible by " << ncols << std::endl; 
      std::abort(); 
  }
  size_t nrows = x.size() / ncols;

  auto l = std::layout_right::mapping(std::dextents<size_t, 2>(nrows, ncols));
  std::mdspan xs { x.data(), l };
  std::mdspan ys { y.data(), l };
  std::for_each_n(std::execution::par,
                  std::views::iota(0).begin(), nrows, [=](int row) {
        for (size_t col = 0; col < ncols; ++col) {
            ys(row, col) += a * xs(row, col);
        }
  });
}
```
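The difference between the two layouts comes down to how `(row, col)` is mapped to an offset in the underlying vector. The sketch below writes both mappings as plain functions for a two-dimensional view without padding; the function names `offset_right` and `offset_left` are hypothetical and only mirror what the standard layout mappings compute.

```c++
#include <cstddef>

// Offset of element (row, col) in an nrows x ncols view.

// layout_right: the last index is contiguous (row-major).
constexpr std::size_t offset_right(std::size_t row, std::size_t col,
                                   std::size_t /*nrows*/, std::size_t ncols) {
  return row * ncols + col;
}

// layout_left: the first index is contiguous (column-major).
constexpr std::size_t offset_left(std::size_t row, std::size_t col,
                                  std::size_t nrows, std::size_t /*ncols*/) {
  return row + col * nrows;
}
```

For a 2 by 3 view, element `(1, 0)` is at offset 3 with `layout_right` but at offset 1 with `layout_left`. With `layout_left`, the inner loop over `col` therefore strides through memory in steps of `nrows`, while neighboring rows touch neighboring addresses.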

[`std::mdspan`]: https://en.cppreference.com/w/cpp/container/mdspan
[`std::layout_left`]: https://en.cppreference.com/w/cpp/container/mdspan/layout_left

In [None]:
!rm daxpy || true
!g++ -std=c++23 -Ofast -march=native -DNDEBUG -o daxpy exercise7.cpp -ltbb
!./daxpy 1000000
!rm daxpy || true
!clang++ -std=c++23 -Ofast -march=native -DNDEBUG -o daxpy exercise7.cpp -ltbb
!./daxpy 1000000
!rm daxpy || true
!nvc++ -std=c++23 -Ofast -march=native -DNDEBUG -stdpar=multicore -o daxpy exercise7.cpp
!./daxpy 1000000
!nvc++ -std=c++23 -Ofast -march=native -DNDEBUG -stdpar=gpu -o daxpy exercise7.cpp
!./daxpy 100000000

### Solutions Exercise 7

The following block compiles and runs [`solutions/exercise7.cpp`]:

[`solutions/exercise7.cpp`]: ./solutions/exercise7.cpp

In [None]:
!rm daxpy || true
!g++ -std=c++23 -Ofast -march=native -DNDEBUG -o daxpy solutions/exercise7.cpp -ltbb
!./daxpy 1000000
!rm daxpy || true
!clang++ -std=c++23 -Ofast -march=native -DNDEBUG -o daxpy solutions/exercise7.cpp -ltbb
!./daxpy 1000000
!rm daxpy || true
!nvc++ -std=c++23 -Ofast -march=native -DNDEBUG -stdpar=multicore -o daxpy solutions/exercise7.cpp
!./daxpy 1000000
!nvc++ -std=c++23 -Ofast -march=native -DNDEBUG -stdpar=gpu -o daxpy solutions/exercise7.cpp
!./daxpy 100000000

## Exercise 8: `std::views::cartesian_product`

In this exercise, we'll learn how to use [`std::views::cartesian_product`] to iterate over multi-dimensional data such as the two-dimensional [`std::mdspan`] we've used in the previous exercises. We've been using the [`std::for_each_n`] algorithm with an iterator and a count, combined with a sequential loop as follows:

```c++
  std::for_each_n(std::execution::par,
                  std::views::iota(0).begin(), nrows, [=](int row) {
        for (size_t col = 0; col < ncols; ++col) {
            ys(row, col) += a * xs(row, col);
        }
  });
```

The goal of this exercise is to convert the above to use the [`std::for_each`] algorithm (without the `_n`), to iterate in parallel over a  [`std::views::cartesian_product`] view and, within the loop, obtain the indices for each dimension:

```c++
  // Create a std::views::cartesian_product range spanning [0, nrows) x [0, ncols):
  auto is = std::views::cartesian_product(
    std::views::iota(0, nrows),
    std::views::iota(0, ncols)
  );
  // Use the std::for_each (without _n) algorithm to iterate in parallel over the cartesian_product range:
  std::for_each(std::execution::par, is.begin(), is.end(), [=](auto i) {
    // Each element of the cartesian_product range is a tuple containing one index per dimension.
    // Extract the individual indices using structured bindings:
    auto [row, col] = i;
    ys(row, col) += a * xs(row, col);
  });
```
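A cartesian product of two ranges visits its elements in the same order as two nested loops, with the last range varying fastest. The sketch below emulates that order in plain C++17 by materializing the index pairs in a vector; it does not use `std::views::cartesian_product` itself, and the helper name `index_pairs` is hypothetical. It assumes non-negative `nrows` and `ncols`.

```c++
#include <utility>
#include <vector>

// Hypothetical helper: materializes the (row, col) pairs in the order a
// two-range cartesian product visits them (last index varies fastest).
// Assumes nrows >= 0 and ncols >= 0.
std::vector<std::pair<int, int>> index_pairs(int nrows, int ncols) {
  std::vector<std::pair<int, int>> pairs;
  for (int row = 0; row < nrows; ++row)
    for (int col = 0; col < ncols; ++col)
      pairs.emplace_back(row, col);
  return pairs;
}
```

For `nrows = 2` and `ncols = 3` this produces `(0,0), (0,1), (0,2), (1,0), (1,1), (1,2)`. Each pair can be unpacked with structured bindings, e.g. `auto [row, col] = pairs[3];`, just like the tuple elements of the view.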

[`std::views::cartesian_product`]: https://en.cppreference.com/w/cpp/ranges/cartesian_product_view
[`std::mdspan`]: https://en.cppreference.com/w/cpp/container/mdspan
[`std::for_each_n`]: https://en.cppreference.com/w/cpp/algorithm/for_each_n
[`std::for_each`]: https://en.cppreference.com/w/cpp/algorithm/for_each

In [None]:
!rm daxpy || true
!g++ -std=c++23 -Ofast -march=native -DNDEBUG -o daxpy exercise8.cpp -ltbb
!./daxpy 1000000
!rm daxpy || true
!clang++ -std=c++23 -Ofast -march=native -DNDEBUG -o daxpy exercise8.cpp -ltbb
!./daxpy 1000000
!rm daxpy || true
!nvc++ -std=c++23 -Ofast -march=native -DNDEBUG -stdpar=multicore -o daxpy exercise8.cpp
!./daxpy 1000000
!nvc++ -std=c++23 -Ofast -march=native -DNDEBUG -stdpar=gpu -o daxpy exercise8.cpp
!./daxpy 100000000

### Solutions Exercise 8

The following block compiles and runs [`solutions/exercise8.cpp`]:

[`solutions/exercise8.cpp`]: ./solutions/exercise8.cpp

In [None]:
!rm daxpy || true
!g++ -std=c++23 -Ofast -march=native -DNDEBUG -o daxpy solutions/exercise8.cpp -ltbb
!./daxpy 1000000
!rm daxpy || true
!clang++ -std=c++23 -Ofast -march=native -DNDEBUG -o daxpy solutions/exercise8.cpp -ltbb
!./daxpy 1000000
!rm daxpy || true
!nvc++ -std=c++23 -Ofast -march=native -DNDEBUG -stdpar=multicore -o daxpy solutions/exercise8.cpp
!./daxpy 1000000
!nvc++ -std=c++23 -Ofast -march=native -DNDEBUG -stdpar=gpu -o daxpy solutions/exercise8.cpp
!./daxpy 100000000

## More optional exercises

For more optional exercises, check out [Lab 1: Select](../lab1_select/select.ipynb).