Accelerating portable HPC Applications with Standard C++
===

# Lab 0: DAXPY

In this tutorial we will familiarize ourselves with the C++ parallel algorithms and related features by implementing Double-precision AX Plus Y (DAXPY): $A \cdot X + Y$, one of the main functions in the standard Basic Linear Algebra Subroutines (BLAS) library.

The operation is a combination of scalar multiplication and vector addition. It takes two vectors of 64-bit floats, `x` and `y`, and a scalar value `a`.
It multiplies each element `x[i]` by `a` and adds the result to `y[i]`.

A working implementation is provided in [starting_point.cpp].
Please take 2-3 minutes to skim through it.

## Validating solutions

For all the exercises, we assume that initially the values are `x[i] = i` and `y[i] = 2`.
The `check` function then verifies the effect of applying `daxpy` to these two vectors.

This check is always run once.
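
Since DAXPY computes `y[i] = a * x[i] + y[i]`, the initial values above imply that after one call every element should satisfy `y[i] = a * i + 2`. The actual `check` function lives in [starting_point.cpp] and is not reproduced here. The following is only a minimal sketch of such a validation; the name `check_sketch` and the tolerance `1e-4` are assumptions made for illustration:

```c++
#include <cmath>
#include <cstddef>
#include <vector>

/// Hypothetical validation helper (not the tutorial's actual `check`):
/// returns true if every y[i] matches the expected value a * i + 2.
bool check_sketch(double a, std::vector<double> const &y) {
  for (std::size_t i = 0; i < y.size(); ++i) {
    double const expected = a * static_cast<double>(i) + 2.;
    // The tolerance is an assumed value chosen for illustration
    if (std::abs(y[i] - expected) > 1e-4) return false;
  }
  return true;
}
```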

## Sequential implementation

The "core" of the sequential implementation provided in [starting_point.cpp] is split into two separate functions:


```c++
/// Initializes the vectors `x` and `y`
void initialize(std::vector<double> &x, std::vector<double> &y) {
  assert(x.size() == y.size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    x[i] = (double)i;
    y[i] = 2.;
  }
}

/// DAXPY: AX + Y
void daxpy(double a, std::vector<double> const &x, std::vector<double> &y) {
  assert(x.size() == y.size());
  for (std::size_t i = 0; i < y.size(); ++i) {
    y[i] += a * x[i];
  }
}
```

For testing purposes, `initialize` sets the vectors to the values `x[i] = i` and `y[i] = 2.` described above.

The `daxpy` function loops over all vector elements, reading from both `x` and `y` and writing the result to `y`.

[starting_point.cpp]: ./starting_point.cpp

## Getting started

Let's start by checking the version of some of the compilers installed in the image:


In [1]:
!g++ --version
!clang++ --version
!nvc++ --version
!dpcpp --version

g++ (Ubuntu 11.1.0-1ubuntu1~20.04) 11.1.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Ubuntu clang version 15.0.0-++20220418052932+e0736e742922-1~exp1~20220418173019.225
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

nvc++ 22.3-0 64-bit target on x86-64 Linux -tp zen2 
NVIDIA Compilers and Tools
Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
Intel(R) oneAPI DPC++/C++ Compiler 2022.0.0 (2022.0.0.20211123)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/intel/oneapi/compiler/2022.0.2/linux/bin-llvm


---

Now let's compile and run the starting point:

In [None]:
!g++ -std=c++11 -o daxpy starting_point.cpp
!./daxpy 1000000

Here, the `-std=c++11` flag selects the C++ language version.

Let's try again with optimizations enabled using `-Ofast`, and with `-DNDEBUG` to disable the `assert` debug checks:

In [None]:
!g++ -std=c++11 -Ofast -DNDEBUG -o daxpy starting_point.cpp
!./daxpy 10000000

## Exercise 0: from raw DAXPY loop to serial C++ algorithm

The goal of this first exercise is to rewrite the raw DAXPY loop using the C++ standard library algorithms.

A template for the solution is provided in [exercise0.cpp]. The `TODO`s indicate the parts of the template that must be completed.
To complete this first exercise, the `daxpy` function needs to be rewritten to use the C++ standard library algorithms, which will require adding some headers:

```c++
#include <chrono>
// TODO: add some headers here

void daxpy(double a, std::vector<double> const &x, std::vector<double> &y) {
  assert(x.size() == y.size());
  // TODO: Implement using the C++ Standard Template Library algorithms
  // ...
}
```

[exercise0.cpp]: ./exercise0.cpp

The example compiles and runs as provided, but it produces incorrect results due to the incomplete `daxpy` implementation.
Once you fix it, the following block should compile and run correctly:


In [None]:
!g++ -std=c++20 -Ofast -DNDEBUG -o daxpy exercise0.cpp
!./daxpy 1000000

In [None]:
!clang++ -std=c++20 -Ofast -DNDEBUG -o daxpy exercise0.cpp
!./daxpy 1000000

In [None]:
!nvc++ -std=c++20 -fast -DNDEBUG -o daxpy exercise0.cpp
!./daxpy 1000000

In [2]:
!dpcpp -std=c++20 -Ofast -DNDEBUG -o daxpy exercise0.cpp
!./daxpy 1000000

ERROR!


### Solutions Exercise 0

The solutions for each example are available in the `solutions/` sub-directory.
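
The solution file itself is not reproduced here. As a rough sketch of the idea, and not necessarily identical to `solutions/exercise0.cpp`, the raw loop could be replaced by the binary form of `std::transform`, which reads `x[i]` and `y[i]` and writes the result back to `y[i]`:

```c++
#include <algorithm>
#include <cassert>
#include <vector>

/// DAXPY: AX + Y, expressed with a standard algorithm instead of a raw loop
void daxpy(double a, std::vector<double> const &x, std::vector<double> &y) {
  assert(x.size() == y.size());
  // Binary transform: for each i, y[i] = a * x[i] + y[i]
  std::transform(x.begin(), x.end(), y.begin(), y.begin(),
                 [a](double xi, double yi) { return a * xi + yi; });
}
```

The output iterator is allowed to alias the second input range, so no temporary vector is needed.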

The following cells compile and run the solutions for Exercise 0 using different compilers and C++ standard versions.

In [None]:
!g++ -std=c++17 -Ofast -DNDEBUG -o daxpy solutions/exercise0.cpp
!./daxpy 1000000

In [6]:
!clang++ -std=c++17 -Ofast -DNDEBUG -o daxpy solutions/exercise0.cpp
!./daxpy 1000000

OK!
Bandwidth [GB/s]: 31.9783


In [4]:
!nvc++ -std=c++17 -fast -DNDEBUG -o daxpy solutions/exercise0.cpp
!./daxpy 1000000

OK!
Bandwidth [GB/s]: 27.5983


In [5]:
!dpcpp -std=c++20 -Ofast -DNDEBUG -o daxpy solutions/exercise0.cpp
!./daxpy 1000000

OK!
Bandwidth [GB/s]: 31.7483


## Exercise 1: initialization using C++ algorithms

Later in this tutorial we will move towards running `daxpy` on an accelerator device like a GPU.
At that point, it becomes important to avoid unnecessary accesses to the same memory from different devices.

The goal of this exercise is to re-write the initialization function using the C++ standard library algorithms, so that we can later perform the initialization on the device itself.

A template for the solution is provided in [exercise1.cpp]. If you have completed Exercise 0, the includes need only minor changes.

However, for the reasons mentioned in the presentation about "Indexing, Ranges and Views", we will be using range-v3 as a fallback for some compilers like clang, and we modify the includes as follows:

```c++
#include <algorithm>
#if defined(__clang__)
  // clang does not support libstdc++ ranges
  #include <range/v3/all.hpp>
  namespace views = ranges::views;
#elif __cplusplus >= 202002L
  #include <ranges>
  namespace views = std::views;
  namespace ranges = std::ranges;
#endif
```

Note that the `clang++` invocation below adds the range-v3 headers as a system include directory using `-isystem/usr/local/range-v3/include`.

The core of the exercise consists of implementing the `initialize` function using the "indexing" techniques discussed in the presentation:

```c++
void initialize(std::vector<double> &x, std::vector<double> &y) {
  assert(x.size() == y.size());
  // TODO: Implement using the C++ Standard Template Library algorithms
  // ...
}
```

[exercise1.cpp]: ./exercise1.cpp

In [None]:
!g++ -std=c++20 -Ofast -DNDEBUG -o daxpy exercise1.cpp
!./daxpy 1000000

In [None]:
!clang++ -std=c++20 -Ofast -DNDEBUG -isystem/usr/local/range-v3/include -o daxpy exercise1.cpp
!./daxpy 1000000

In [None]:
!nvc++ -std=c++20 -fast -DNDEBUG -o daxpy exercise1.cpp
!./daxpy 1000000

In [8]:
!dpcpp -std=c++20 -Ofast -DNDEBUG -isystem/usr/local/range-v3/include -o daxpy exercise1.cpp
!./daxpy 1000000

ERROR!


### Solutions Exercise 1

There are two solutions provided showing different ways of doing indexing:

In [None]:
# Using iota range for initialize 
!g++ -std=c++17 -Ofast -DNDEBUG -o daxpy solutions/exercise1.cpp
!./daxpy 1000000

In [None]:
!clang++ -std=c++17 -Ofast -DNDEBUG -isystem/usr/local/range-v3/include -o daxpy solutions/exercise1.cpp
!./daxpy 1000000

In [None]:
!nvc++ -std=c++17 -fast -DNDEBUG -o daxpy solutions/exercise1.cpp
!./daxpy 1000000

In [10]:
!dpcpp -std=c++17 -Ofast -DNDEBUG -isystem/usr/local/range-v3/include -o daxpy solutions/exercise1.cpp
!./daxpy 1000000

OK!
Bandwidth [GB/s]: 32.3219


In [None]:
# Using address-based indexing for daxpy
!g++ -std=c++17 -Ofast -DNDEBUG -o daxpy solutions/exercise1_indices.cpp
!./daxpy 1000000

In [None]:
!clang++ -std=c++17 -Ofast -DNDEBUG -isystem/usr/local/range-v3/include -o daxpy solutions/exercise1_indices.cpp
!./daxpy 1000000

In [None]:
!nvc++ -std=c++17 -fast -DNDEBUG -o daxpy solutions/exercise1_indices.cpp
!./daxpy 1000000

In [11]:
!dpcpp -std=c++17 -Ofast -DNDEBUG -isystem/usr/local/range-v3/include -o daxpy solutions/exercise1_indices.cpp
!./daxpy 1000000

OK!
Bandwidth [GB/s]: 26.1595
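
The two indexing techniques can be sketched roughly as follows. This is an illustration under assumptions, not a copy of the solution files: the function names `initialize_iota` and `initialize_addresses` are hypothetical, both variants are applied to `initialize` here, and the iota variant is only compiled when C++20 `<ranges>` is available (the range-v3 fallback shown earlier is omitted for brevity).

```c++
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

#if __cplusplus >= 202002L && __has_include(<ranges>)
#include <ranges>

/// Variant 1 (hypothetical name): iterate over an iota view of the indices
void initialize_iota(std::vector<double> &x, std::vector<double> &y) {
  assert(x.size() == y.size());
  auto ints = std::views::iota(std::size_t{0}, x.size());
  // Capture raw pointers by value so the lambda does not refer to the vectors
  std::for_each(ints.begin(), ints.end(),
                [xp = x.data(), yp = y.data()](std::size_t i) {
                  xp[i] = static_cast<double>(i);
                  yp[i] = 2.;
                });
}
#endif

/// Variant 2 (hypothetical name): recover the index from the element address
void initialize_addresses(std::vector<double> &x, std::vector<double> &y) {
  assert(x.size() == y.size());
  std::for_each(x.begin(), x.end(),
                [xp = x.data(), yp = y.data()](double &xi) {
                  // The element is passed by reference, so its offset from
                  // the start of the contiguous storage is its index
                  std::ptrdiff_t const i = &xi - xp;
                  xi = static_cast<double>(i);
                  yp[i] = 2.;
                });
}
```

The second variant relies on `std::vector` storing its elements contiguously and on the algorithm passing each element by reference rather than by copy.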


## Exercise 2: parallelizing DAXPY using C++ parallel algorithms

The goal of this final exercise in this section is to parallelize the `initialize` and `daxpy` functions to compute the results in parallel using CPUs or GPUs.

A template for the solution is provided in [exercise2.cpp].

```c++
#include <algorithm>
// TODO: add C++ standard library includes as necessary

void initialize(std::vector<double> &x, std::vector<double> &y) {
  assert(x.size() == y.size());
  // TODO: Implement using the C++ Standard Template Library parallel algorithms
  // ...
}

void daxpy(double a, std::vector<double> const &x, std::vector<double> &y) {
  assert(x.size() == y.size());
  // TODO: Implement using the C++ Standard Template Library parallel algorithms
  // ...
}
```

[exercise2.cpp]: ./exercise2.cpp

Compiling with support for parallel algorithms is slightly trickier.

There are different implementations of the C++ standard library:

* `libstdc++`: GNU toolchain implementation, default on most Linux distributions
* `libc++`: LLVM/clang toolchain implementation, default on macOS
* `libnv++`: NVIDIA HPC SDK implementation
* etc.

Of these, `libc++` does not implement the C++17 parallel algorithms yet, but on Linux clang uses `libstdc++` by default, so that is not a problem.
However, `clang++` does not support using `libstdc++` in C++20 mode, so we will restrict ourselves to C++17 when using clang for now.

To enable the parallel algorithms with the different compilers:

* `g++` (GCC, using `libstdc++`):
    * requires `-std=c++17` or newer
    * requires Intel TBB to be:
        * in the include path
        * linked into the final binary using `-ltbb`
* `clang++` (LLVM):
    * needs to use the GCC C++ standard library, `libstdc++`, since LLVM's C++ standard library (`libc++`) does not support the parallel algorithms yet
    * when using `libstdc++`, the same requirements as for GCC apply
    * does not support using `libstdc++` with C++20, so C++17 must be used instead
    * to use clang with ranges, include the `range-v3` library using `-isystem/usr/local/range-v3/include`
* `nvc++` (NVIDIA):
    * requires `-std=c++17` or newer
    * requires the `-stdpar` flag, which controls which device runs the parallel algorithms:
        * `-stdpar=multicore` runs parallel algorithms on CPUs
        * `-stdpar=gpu` runs parallel algorithms on GPUs, further `-gpu=` flags control the GPU target
    * See the [Parallel Algorithms Documentation](https://docs.nvidia.com/hpc-sdk/compilers/c++-parallel-algorithms/index.html). Notice that when using `-stdpar=gpu` further restrictions apply.

Examples: 

In [37]:
!g++ -std=c++20 -Ofast -DNDEBUG -o daxpy exercise2.cpp -ltbb
!./daxpy 100000000

ERROR!


### Solutions for Exercise 2
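
As a hedged sketch of the general idea, and not a copy of `solutions/exercise2.cpp`, the algorithms from the previous exercises can be given an execution policy from `<execution>` as their first argument. To keep the sketch independent of how the binary is built, the policy is passed in as a parameter here, which is an assumption that differs from the signatures in the template; the names `initialize_with` and `daxpy_with` are hypothetical.

```c++
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <execution>
#include <utility>
#include <vector>

/// Hypothetical helper: sets x[i] = i and y[i] = 2 under the given policy
template <typename Policy>
void initialize_with(Policy &&policy, std::vector<double> &x,
                     std::vector<double> &y) {
  assert(x.size() == y.size());
  std::for_each(std::forward<Policy>(policy), x.begin(), x.end(),
                [xp = x.data(), yp = y.data()](double &xi) {
                  // Index recovered from the address of the current element
                  std::ptrdiff_t const i = &xi - xp;
                  xi = static_cast<double>(i);
                  yp[i] = 2.;
                });
}

/// Hypothetical helper: DAXPY (AX + Y) under the given policy
template <typename Policy>
void daxpy_with(Policy &&policy, double a, std::vector<double> const &x,
                std::vector<double> &y) {
  assert(x.size() == y.size());
  std::transform(std::forward<Policy>(policy), x.begin(), x.end(), y.begin(),
                 y.begin(), [a](double xi, double yi) { return a * xi + yi; });
}
```

Calling, for example, `daxpy_with(std::execution::par_unseq, a, x, y)` requests parallel and vectorized execution and needs the compiler flags described above (such as `-ltbb` or `-stdpar`), whereas `std::execution::seq` keeps the execution sequential.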

In [None]:
!g++ -std=c++20 -Ofast -DNDEBUG -o daxpy solutions/exercise2.cpp -ltbb
!./daxpy 100000000

In [None]:
!clang++ -std=c++17 -Ofast -DNDEBUG -isystem/usr/local/range-v3/include -o daxpy solutions/exercise2.cpp -ltbb
!./daxpy 100000000

In [None]:
!nvc++ -stdpar=multicore -std=c++17 -fast -Mllvm-fast -DNDEBUG -o daxpy solutions/exercise2.cpp
!./daxpy 100000000

In [None]:
!nvc++ -stdpar=gpu -std=c++20 -fast -Mllvm-fast -DNDEBUG -gpu=cc80 -o daxpy solutions/exercise2.cpp
!./daxpy 100000000

In [36]:
!dpcpp -std=c++17 -Ofast -DNDEBUG -isystem/usr/local/range-v3/include -o daxpy solutions/exercise2.cpp -ltbb
!./daxpy 1000000

/usr/bin/ld: /tmp/exercise2-4ba927.o: in function `main':
exercise2-451e19.cpp:(.text+0x293): undefined reference to `tbb::detail::r1::isolate_within_arena(tbb::detail::d1::delegate_base&, long)'
/usr/bin/ld: exercise2-451e19.cpp:(.text+0x462): undefined reference to `tbb::detail::r1::isolate_within_arena(tbb::detail::d1::delegate_base&, long)'
/usr/bin/ld: exercise2-451e19.cpp:(.text+0x4c0): undefined reference to `tbb::detail::r1::isolate_within_arena(tbb::detail::d1::delegate_base&, long)'
/usr/bin/ld: /tmp/exercise2-4ba927.o: in function `initialize(std::vector<double, std::allocator<double> >&, std::vector<double, std::allocator<double> >&)':
exercise2-451e19.cpp:(.text+0x755): undefined reference to `tbb::detail::r1::isolate_within_arena(tbb::detail::d1::delegate_base&, long)'
/usr/bin/ld: exercise2-451e19.cpp:(.text+0x790): undefined reference to `tbb::detail::r1::isolate_within_arena(tbb::detail::d1::delegate_base&, long)'
/usr/bin/ld: /tmp/exercise2-4ba927.o:exercise2-451e19.c