Accelerating portable HPC Applications with Standard C++
===

# Lab 1: DAXPY

In this tutorial we will familiarize ourselves with the C++ parallel algorithms and related features by implementing Double-precision AX Plus Y (DAXPY): $A \cdot X + Y$, one of the main functions in the standard Basic Linear Algebra Subprograms (BLAS) library.

The operation is a combination of scalar multiplication and vector addition. It takes two vectors of 64-bit floats, `x` and `y`, and a scalar value `a`.
It multiplies each element `x[i]` by `a` and adds the result to `y[i]`.
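
As a small worked instance of the description above (the symbol $n$ for the vector length and the sample values are chosen here purely for illustration), the element-wise update is

$$
y_i \leftarrow a \cdot x_i + y_i, \qquad i = 0, \dots, n - 1 .
$$

With $a = 2$, $x = (0, 1, 2)$ and $y = (2, 2, 2)$, this gives $y = (2 \cdot 0 + 2,\; 2 \cdot 1 + 2,\; 2 \cdot 2 + 2) = (2, 4, 6)$.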

A working implementation is provided in [starting_point.cpp].
Please take 2-3 minutes to skim through it.

## Validating solutions

For all the exercises, we assume that initially the values are `x[i] = i` and `y[i] = 2`.
The `check` function then verifies the effect of applying `daxpy` to these two vectors.
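
To make the validation idea concrete, here is a minimal sketch of what such a check could look like. The name `check_sketch`, its signature, and the tolerance are assumptions made for illustration; the actual `check` function is the one in the provided source file and may differ.

```c++
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

/// Hypothetical stand-in for `check`; not the tutorial's implementation.
/// Assumes x[i] = i and y[i] = 2 before DAXPY, so afterwards y[i] == a * i + 2.
bool check_sketch(double a, std::vector<double> const &y) {
  for (std::size_t i = 0; i < y.size(); ++i) {
    double const expected = a * static_cast<double>(i) + 2.;
    // Relative tolerance chosen arbitrarily for this sketch.
    if (std::abs(y[i] - expected) > 1e-12 * std::abs(expected)) return false;
  }
  return true;
}
```

For example, with `a = 3.` a four-element result of `{2., 5., 8., 11.}` would pass, and changing any element would make the sketch return `false`.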

We will always run this check once.

## Sequential implementation

The "core" of the sequential implementation provided in [starting_point.cpp] is split into two separate functions:


```c++
/// Initialize vectors `x` and `y`: raw loop sequential version
void initialize(std::vector<double> &x, std::vector<double> &y) {
  assert(x.size() == y.size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    x[i] = (double)i;
    y[i] = 2.;
  }
}

/// DAXPY: AX + Y: raw loop sequential version
void daxpy(double a, std::vector<double> const &x, std::vector<double> &y) {
  assert(x.size() == y.size());
  for (std::size_t i = 0; i < y.size(); ++i) {
    y[i] += a * x[i];
  }
}
```

For testing purposes, `initialize` sets `x[i] = i` and `y[i] = 2.`, the values covered above.

The `daxpy` function loops over all vector elements, reading from both `x` and `y` and writing the result back to `y`.

[starting_point.cpp]: ./starting_point.cpp

## Getting started

Let's start by checking the version of some of the compilers installed in the image:


In [None]:
!g++ --version
!clang++ --version
!nvc++ --version

---

Now let's compile and run the starting point:

In [None]:
!g++ -std=c++20 -o daxpy starting_point.cpp
!./daxpy 1000000

Here, the `-std=c++20` flag selects the C++ language version.

Let's try again with optimizations using `-Ofast`, disabling debug checks `-DNDEBUG`, and compiling for the current CPU using `-march=native`:

In [None]:
!g++ -std=c++20 -Ofast -march=native -DNDEBUG -o daxpy starting_point.cpp
!./daxpy 10000000

## Exercise 1: from raw DAXPY loop to serial C++ transform algorithm

The goal of this first exercise is to rewrite the raw DAXPY loop using the C++ standard library `std::transform` algorithm (see the [transform] documentation to pick the right overload: number (3)).

[transform]: https://en.cppreference.com/w/cpp/algorithm/transform
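
As a warm-up that deliberately does not solve the exercise, the sketch below uses the two-input form of `std::transform` to add two vectors element-wise. The helper name `add_elementwise` is hypothetical and not part of the exercise files.

```c++
#include <algorithm>
#include <cassert>
#include <vector>

/// Hypothetical helper (illustration only): out[i] = lhs[i] + rhs[i].
std::vector<double> add_elementwise(std::vector<double> const &lhs,
                                    std::vector<double> const &rhs) {
  assert(lhs.size() == rhs.size());
  std::vector<double> out(lhs.size());
  // Two input ranges (the second is given only by its begin iterator),
  // one output iterator, and a callable combining one element from each input.
  std::transform(lhs.begin(), lhs.end(), rhs.begin(), out.begin(),
                 [](double l, double r) { return l + r; });
  return out;
}
```

Calling `add_elementwise({1., 2., 3.}, {10., 20., 30.})` yields `{11., 22., 33.}`. The output iterator is also allowed to be the begin iterator of one of the inputs, which is the shape an in-place update needs.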

A template for the solution is provided in [exercise1.cpp]. The `TODO`s indicate the parts of the template that must be completed.
To complete this first exercise, rewrite the `daxpy` function to use the C++ standard library algorithms; this requires adding some headers:

```c++
#include <chrono>
// TODO: add C++ standard library includes as necessary

/// DAXPY: AX + Y: sequential algorithm version
void daxpy(double a, std::vector<double> const &x, std::vector<double> &y) {
  assert(x.size() == y.size());
  // TODO: Implement using SEQUENTIAL algorithm
  // ...
}
```

[exercise1.cpp]: ./exercise1.cpp

The example compiles and runs as provided, but it produces incorrect results due to the incomplete `daxpy` implementation.
Once you fix it, the following blocks should compile and run correctly:


In [None]:
!g++ -std=c++20 -Ofast -march=native -DNDEBUG -o daxpy exercise1.cpp
!./daxpy 1000000

In [None]:
!clang++ -std=c++20 -Ofast -march=native -DNDEBUG -o daxpy exercise1.cpp
!./daxpy 1000000

In [None]:
!nvc++ -std=c++20 -O4 -fast -march=native -Mllvm-fast -DNDEBUG -o daxpy exercise1.cpp
!./daxpy 1000000

### Solutions Exercise 1

The solutions for each example are available in the [`solutions/`] sub-directory.

[`solutions/`]: ./solutions

The solution for this first exercise is in [`solutions/exercise1.cpp`].

[`solutions/exercise1.cpp`]: ./solutions/exercise1.cpp

The following blocks compile and run the solutions for Exercise 1 using different compilers.

In [None]:
!g++ -std=c++20 -Ofast -march=native -DNDEBUG -o daxpy solutions/exercise1.cpp
!./daxpy 1000000

In [None]:
!clang++ -std=c++20 -Ofast -march=native -DNDEBUG -o daxpy solutions/exercise1.cpp
!./daxpy 1000000

In [None]:
!nvc++ -std=c++20 -Ofast -march=native -DNDEBUG -o daxpy solutions/exercise1.cpp
!./daxpy 1000000

## Exercise 2: from raw initialization to `std::fill_n` and `std::for_each_n`

In Exercise 3 we will parallelize `daxpy` to allow it to run on accelerator devices like GPUs.
When doing so, it is important to avoid unnecessary memory migrations across devices.

The goal of this exercise is to initialize the memory using the standard library algorithms, so that when we parallelize the initialization in Exercise 3, it will happen on the accelerator device itself.

Since we need to initialize two vectors, `x` and `y`, let's use a different approach to initialize each:

* Initialize `x` using the `std::for_each_n` algorithm (see [for_each_n] documentation) combined with `std::views::iota` (see [iota_view] documentation).
* Initialize `y` using the `std::fill_n` algorithm (see [fill_n] documentation), which is ideal for initializing data to the same value.

[fill_n]: https://en.cppreference.com/w/cpp/algorithm/fill_n 
[for_each_n]: https://en.cppreference.com/w/cpp/algorithm/for_each_n 
[iota_view]: https://en.cppreference.com/w/cpp/ranges/iota_view

A template for the solution is provided in [exercise2.cpp]. The `TODO`s indicate the parts of the template that must be completed.
To complete this exercise, rewrite the `initialize` function to use the C++ standard library algorithms; this requires adding the header that provides `std::views::iota`:

```c++
#include <algorithm>
// TODO: add C++ standard library includes as necessary

/// Initialize vectors `x` and `y`: raw loop sequential version
void initialize(std::vector<double> &x, std::vector<double> &y) {
  assert(x.size() == y.size());
  // TODO: Initialize `x` using SEQUENTIAL std::for_each_n algorithm with std::views::iota
  // TODO: Initialize `y` using SEQUENTIAL std::fill_n algorithm
}
```

[exercise2.cpp]: ./exercise2.cpp

The example compiles and runs as provided, but it produces incorrect results due to the incomplete `initialize` implementation.
The compilation commands below use C++20, which is required for `std::views::iota`.

Once you fix the `initialize` implementation, the following blocks should compile and run correctly:

In [None]:
!g++ -std=c++20 -Ofast -march=native -DNDEBUG -o daxpy exercise2.cpp
!./daxpy 1000000

In [None]:
!clang++ -std=c++20 -Ofast -march=native -DNDEBUG -isystem/usr/local/range-v3/include -o daxpy exercise2.cpp
!./daxpy 1000000

In [None]:
!nvc++ -std=c++20 -Ofast -march=native -DNDEBUG -o daxpy exercise2.cpp
!./daxpy 1000000

### Solutions Exercise 2

The solution for this exercise is in [`solutions/exercise2.cpp`].

[`solutions/exercise2.cpp`]: ./solutions/exercise2.cpp

The following blocks compile and run the solution for Exercise 2 using different compilers.

In [None]:
# Using iota range for initialize 
!g++ -std=c++20 -Ofast -march=native -DNDEBUG -o daxpy solutions/exercise2.cpp
!./daxpy 1000000

In [None]:
!clang++ -std=c++20 -Ofast -march=native -DNDEBUG -o daxpy solutions/exercise2.cpp
!./daxpy 1000000

In [None]:
!nvc++ -std=c++20 -Ofast -march=native -DNDEBUG -o daxpy solutions/exercise2.cpp
!./daxpy 1000000

## Exercise 3: parallelizing DAXPY and Initialization using C++ parallel algorithms

The goal of this final exercise in this section is to parallelize the `initialize` and `daxpy` functions to compute the results in parallel using CPUs or GPUs.

A template for the solution is provided in [exercise3.cpp].

```c++
#include <ranges>
// TODO: add C++ standard library includes as necessary

/// Initialize vectors `x` and `y`: parallel algorithm version
void initialize(std::vector<double> &x, std::vector<double> &y) {
  assert(x.size() == y.size());
  // TODO: Parallelize initialization of `x`
  auto ints = std::views::iota(0);
  std::for_each_n(ints.begin(), x.size(), [&x](int i) { x[i] = (double)i; });
  // TODO: Parallelize initialization of `y`
  std::fill_n(y.begin(), y.size(), 2.);
}

/// DAXPY: AX + Y: sequential algorithm version
void daxpy(double a, std::vector<double> const &x, std::vector<double> &y) {
  assert(x.size() == y.size());
  /// TODO: Parallelize DAXPY computation
  std::transform(x.begin(), x.end(), y.begin(), y.begin(),
                 [&](double x, double y) { return a * x + y; });
}
```

[exercise3.cpp]: ./exercise3.cpp

Compiling with support for the parallel algorithms requires:
* `g++` and `clang++`: link against Intel TBB with `-ltbb`
* `nvc++`: compile and link with the `-stdpar` flag:
  * `-stdpar=multicore` runs parallel algorithms on CPUs
  * `-stdpar=gpu` runs parallel algorithms on GPUs; further `-gpu=` flags control the GPU target
  * See the [Parallel Algorithms Documentation](https://docs.nvidia.com/hpc-sdk/compilers/c++-parallel-algorithms/index.html).
    
The example compiles, runs, and produces correct results as provided.
Parallelize it using the C++ standard library parallel algorithms and ensure that the results are still correct.

In [None]:
!g++ -std=c++20 -Ofast -march=native -DNDEBUG -o daxpy exercise3.cpp -ltbb
!./daxpy 1000000

In [None]:
!clang++ -std=c++20 -Ofast -march=native -DNDEBUG -o daxpy exercise3.cpp -ltbb
!./daxpy 1000000

In [None]:
!nvc++ -std=c++20 -Ofast -march=native -DNDEBUG -stdpar=multicore -o daxpy exercise3.cpp
!./daxpy 1000000

In [None]:
!nvc++ -std=c++20 -Ofast -march=native -DNDEBUG -stdpar=gpu -o daxpy exercise3.cpp
!./daxpy 1000000

### Solutions for Exercise 3

The solution for this exercise is in [`solutions/exercise3.cpp`].

[`solutions/exercise3.cpp`]: ./solutions/exercise3.cpp

The following blocks compile and run the solutions for Exercise 3 using different compilers on the CPU.

The last block compiles and runs the solution for Exercise 3 on the GPU. If you get an error, make sure that the lambdas capture scalars by value and that, when capturing a vector to access its data, they capture a pointer to its data by value as well, using `[x = x.data()]`.

In [None]:
!g++ -std=c++20 -Ofast -march=native -DNDEBUG -o daxpy solutions/exercise3.cpp -ltbb
!./daxpy 1000000

In [None]:
!clang++ -std=c++20 -Ofast -march=native -DNDEBUG -o daxpy solutions/exercise3.cpp -ltbb
!./daxpy 1000000

In [None]:
!nvc++ -std=c++20 -Ofast -march=native -DNDEBUG -stdpar=multicore -o daxpy solutions/exercise3.cpp
!./daxpy 1000000

In [None]:
!nvc++ -std=c++20 -Ofast -march=native -DNDEBUG -stdpar=gpu -o daxpy solutions/exercise3.cpp
!./daxpy 1000000