Accelerating portable HPC Applications with Standard C++
===

# Lab 1: 2D Unsteady Heat Equation

In this tutorial we will learn how to do multi-dimensional iteration in C++17 and C++23 and how to integrate parallel algorithms with pre-existing MPI applications, by accelerating a 2D heat equation solver (see slides).

A working sequential implementation that does not use MPI is provided in [starting_point.cpp].
Please take 5 minutes to skim through it.

[starting_point.cpp]: ./starting_point.cpp

## Getting started

Let's start by compiling and running the starting point:


In [None]:
!g++ -std=c++20 -Ofast -DNDEBUG -o heat starting_point.cpp

# The binary takes the domain dimensions NX and NY as its first two arguments and the number of iterations as the third
!./heat 1024 1024 4000

The binary writes the solution to an `output` file, which can be converted to a PNG file using the `vis` script or the following function:

In [None]:
import numpy as np
import matplotlib.pyplot as plt

#plt.style.use('dark_background') # Uncomment for dark background

def visualize(name = 'output'):
    f = open(name, 'rb')
    grid = np.fromfile(f, dtype=np.uint64, count=2, offset=0)

    nx = grid[0]
    ny = grid[1]

    times = np.fromfile(f, dtype=np.float64, count=1, offset=0)
    time = times[0]

    values = np.fromfile(f, dtype=np.float64, offset=0)
    assert len(values) == nx * ny, f'{len(values)} != {nx * ny}'
    values = values.reshape((nx, ny))

    print(f'Plotting grid {nx}x{ny}, t = {time}')

    plt.title(f'Temperature at t = {time:.3f} [s]')
    plt.xlabel('x')
    plt.ylabel('y')
    plt.pcolormesh(values, cmap=plt.cm.jet, vmin=0, vmax=values.max())
    plt.colorbar()
    plt.savefig('output.png', transparent=True, bbox_inches='tight', dpi=300)

In [None]:
visualize()
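
The reader above implies a simple binary layout for `output`: two unsigned 64-bit integers (`nx`, `ny`), one 64-bit float (the time), and then `nx * ny` 64-bit floats. As a minimal sketch of that layout, assuming native byte order (which is what `np.fromfile` reads by default), the bytes could be assembled as follows. The name `serialize_output` is hypothetical and is not taken from the lab sources.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper (not from the lab sources): packs the grid into the
// byte layout that `visualize` reads back, in native byte order.
std::vector<char> serialize_output(std::uint64_t nx, std::uint64_t ny, double time,
                                   std::vector<double> const& values) {
  assert(values.size() == nx * ny);
  std::vector<char> bytes(2 * sizeof(std::uint64_t) + sizeof(double) +
                          values.size() * sizeof(double));
  char* out = bytes.data();
  // Appends n raw bytes to the buffer and advances the write position
  auto put = [&out](void const* src, std::size_t n) {
    if (n > 0) std::memcpy(out, src, n);
    out += n;
  };
  put(&nx, sizeof(nx));      // header: grid dimensions
  put(&ny, sizeof(ny));
  put(&time, sizeof(time));  // simulation time
  put(values.data(), values.size() * sizeof(double));  // row-major grid values
  return bytes;
}
```

The resulting buffer could then be written to `output` through a `std::ofstream` opened in binary mode.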

## Exercise 0: parallelize with C++ parallel algorithms and linear indexing

The goal of this exercise is to parallelize the `stencil` and `initialize` implementations using the C++ parallel algorithms and the index `split`ing technique shown in the presentation.

A template for the solution is provided in [exercise0.cpp].
The only functions that need to be modified to achieve this are the `stencil` and `initialize` functions.

In the serial implementation, raw loops are used:

```c++
double stencil(double *u_new, double *u_old, grid g, parameters p) {
  double energy = 0.;
  for (long x = g.x_start; x < g.x_end; ++x) {
    for (long y = g.y_start; y < g.y_end; ++y) {
      energy += stencil(u_new, u_old, x, y, p);
    }
  }
  return energy;
}
```

Notice that there is already a function called `index` in the file, which maps 2D indices to 1D indices:

```c++
// Index into the memory using row-major order:
long index(long x, long y, parameters p) {
    assert(x >= 0 && x < p.nx);
    assert(y >= 0 && y < p.ny);
    return x * p.ny + y;
}
```

When implementing the `stencil` API in the [exercise0.cpp] template, create a `split` function that is the inverse of `index`, mapping a 1D index back to its 2D indices:

```c++
double stencil(double *u_new, double *u_old, grid g, parameters p) {
  double energy = 0.;
  // TODO: implement using parallel algorithms
  
  auto split = [...](long idx) -> std::pair<long, long> {
  
  };
  
  // ...TODO...
  
  return energy;
}
```

Recall that the goal for `split` and `index` is for the following invariant to hold:

```c++
auto [x1, y1] = split(index(x0, y0, p));
assert(x0 == x1 && y0 == y1);
```
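
As a minimal sketch of this invariant, assuming the row-major `index` shown above, one possible `split` divides the linear index by the row length `ny`. The `parameters` struct here is a reduced, hypothetical stand-in holding only `nx` and `ny`, and the helper name `make_split` is an assumption for illustration. In the exercise, the lambda may also need to account for the sub-grid bounds in `g`.

```cpp
#include <cassert>
#include <utility>

// Hypothetical, reduced stand-in for the lab's `parameters` type.
struct parameters { long nx, ny; };

// Row-major 2D -> 1D mapping, as shown in the lab sources.
long index(long x, long y, parameters p) {
  assert(x >= 0 && x < p.nx);
  assert(y >= 0 && y < p.ny);
  return x * p.ny + y;
}

// Hypothetical helper: builds a `split` lambda that inverts `index`.
auto make_split(parameters p) {
  return [p](long idx) -> std::pair<long, long> {
    assert(idx >= 0 && idx < p.nx * p.ny);
    // The quotient recovers the row x, the remainder recovers the column y
    return {idx / p.ny, idx % p.ny};
  };
}
```

Capturing `p` by value keeps the lambda free of references to the enclosing scope, which tends to matter when the lambda is later passed to a parallel algorithm.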

In the parallel implementation, you should use the C++ parallel algorithms discussed in the lecture.


[exercise0.cpp]: ./exercise0.cpp

[exercise0.cpp] compiles and runs as provided, but it produces incorrect results due to the incomplete `stencil` and `initialize` implementations.
Search for `TODO`s in the file and fix them until the following blocks compile and run correctly:


In [None]:
!g++ -std=c++20 -Ofast -DNDEBUG -o heat exercise0.cpp -ltbb
!./heat 1024 1024 2000

In [None]:
!clang++ -std=c++20 -Ofast -DNDEBUG -isystem/usr/local/range-v3/include  -o heat exercise0.cpp -ltbb
!./heat 1024 1024 2000

In [None]:
!nvc++ -std=c++20 -stdpar=gpu -gpu=cc80 -fast -DNDEBUG -o heat exercise0.cpp
!./heat 1024 1024 2000

### Solutions Exercise 0

The solutions for each example are available in the [solutions/exercise0.cpp] file.

[solutions/exercise0.cpp]: ./solutions/exercise0.cpp

The following compiles and runs the solutions for Exercise 0 using different compilers.

In [None]:
!rm output || true
!rm heat || true
!g++ -std=c++20 -Ofast -DNDEBUG -o heat solutions/exercise0.cpp -ltbb
!./heat 1024 1024 10000
!mv output output_gcc

In [None]:
!rm output || true
!rm heat || true
!clang++ -std=c++20 -Ofast -DNDEBUG -isystem/usr/local/range-v3/include -o heat solutions/exercise0.cpp -ltbb
!./heat 1024 1024 10000
!mv output output_clang

In [None]:
!rm output || true
!rm heat || true
!nvc++ -std=c++20 -stdpar=gpu -gpu=cc80 -fast -DNDEBUG -o heat solutions/exercise0.cpp
!./heat 1024 1024 10000
!mv output output_nvc

In [None]:
visualize('output_nvc')

## Exercise 1: parallelize with C++ parallel algorithms and `views::cartesian_product`

This is the same exercise as above, but instead of linear indexing, use `views::cartesian_product` for multi-dimensional iteration.

A template for the solution is provided in [exercise1.cpp].

[exercise1.cpp]: ./exercise1.cpp

In [None]:
!g++ -std=c++20 -Ofast -DNDEBUG -isystem/usr/local/range-v3/include  -o heat exercise1.cpp -ltbb
!./heat 1024 1024 2000

In [None]:
!clang++ -std=c++20 -Ofast -DNDEBUG -isystem/usr/local/range-v3/include  -o heat exercise1.cpp -ltbb
!./heat 1024 1024 2000

### Solutions Exercise 1

The solution is available in the `solutions/exercise1.cpp` file.

The following compiles and runs the solution for Exercise 1 using different compilers.

In [None]:
!rm output || true
!rm heat || true
!g++ -std=c++20 -Ofast -DNDEBUG -isystem/usr/local/range-v3/include  -o heat solutions/exercise1.cpp -ltbb
!./heat 1024 1024 2000
!mv output output_gcc

In [None]:
!rm output || true
!rm heat || true
!clang++ -std=c++20 -Ofast -DNDEBUG -isystem/usr/local/range-v3/include -o heat solutions/exercise1.cpp -ltbb
!./heat 1024 1024 20000
!mv output output_clang

## Exercise 2: MPI implementation

The starting point for this exercise is [starting_point_mpi.cpp].

[starting_point_mpi.cpp]: ./starting_point_mpi.cpp

The goal of this exercise is to accelerate a pre-existing MPI implementation using parallel algorithms.

The main differences from the previous examples are that the computation now involves a data exchange with neighbors, and that the computations have been split into:

* `internal`: processes internal rows that do not depend on data from neighbors
* `prev_boundary`: exchanges data with neighbor at `rank - 1` and processes the rows that depend on the elements received
* `next_boundary`: exchanges data with neighbor at `rank + 1` and processes the rows that depend on the elements received


```c++
double internal(double* u_new, double* u_old, parameters p) {
    // Rows 2 to nx - 1 do not depend on halo data received from neighbors
    grid g { .x_start = 2, .x_end = p.nx, .y_start = 1, .y_end = p.ny - 1 };
    return stencil(u_new, u_old, g, p);
}

double prev_boundary(double* u_new, double* u_old, parameters p) {
    // Send window cells, receive halo cells
    if (p.rank > 0) {
        // Send bottom boundary to bottom rank
        MPI_Send(u_old + p.ny, p.ny, MPI_DOUBLE, p.rank - 1, 0, MPI_COMM_WORLD);
        // Receive top boundary from bottom rank
        MPI_Recv(u_old + 0, p.ny, MPI_DOUBLE, p.rank - 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    // Row 1 is the row that reads the halo row 0 received above
    grid g { .x_start = 1, .x_end = 2, .y_start = 1, .y_end = p.ny - 1 };
    return stencil(u_new, u_old, g, p);
}

double next_boundary(double* u_new, double* u_old, parameters p) {
    // Send window cells, receive halo cells
    if (p.rank < p.nranks - 1) {
        // Receive bottom boundary from top rank
        MPI_Recv(u_old + (p.nx + 1) * p.ny, p.ny, MPI_DOUBLE, p.rank + 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        // Send top boundary to top rank
        MPI_Send(u_old + p.nx * p.ny, p.ny, MPI_DOUBLE, p.rank + 1, 1, MPI_COMM_WORLD);
    }
    // Row nx is the row that reads the halo row nx + 1 received above
    grid g { .x_start = p.nx, .x_end = p.nx + 1, .y_start = 1, .y_end = p.ny - 1 };
    return stencil(u_new, u_old, g, p);
}
```

You can parallelize this using the C++ parallel algorithms in the exact same way we have done above.
Pick the approach that makes the most sense to you.

A template for the solution is provided in [exercise2.cpp]. Notice that we only need to change the `stencil` function; nothing else is required to accelerate an MPI application with C++.

If the GPU-accelerated version performs worse than the non-MPI version, see the slides for how to work around this.

[exercise2.cpp]: ./exercise2.cpp



In [None]:
!rm output || true
!rm heat || true
!OMPI_CXX=g++ mpicxx -std=c++20 -Ofast -DNDEBUG -isystem/usr/local/range-v3/include  -o heat exercise2.cpp -ltbb
!OMPI_MCA_coll_hcoll_enable=0 mpirun --oversubscribe --allow-run-as-root -np 2 ./heat 1024 1024 2000
!mv output output_gcc

In [None]:
!rm output || true
!rm heat || true
!OMPI_CXX=nvc++ mpicxx -stdpar=gpu  -gpu=cc80 -std=c++20 -fast -DNDEBUG -o heat exercise2.cpp -ltbb
!OMPI_MCA_coll_hcoll_enable=0 mpirun --oversubscribe --allow-run-as-root -np 2 ./heat 1024 1024 2000
!mv output output_nvc

### Solution Exercise 2

The solution is available in the `solutions/exercise2.cpp` file, along with a `solutions/exercise2_nomanaged.cpp` variant that uses `thrust::device_vector` instead of managed memory.

The following compiles and runs the solutions for Exercise 2 using different compilers.

In [None]:
!rm output || true
!rm heat || true
!OMPI_CXX=g++ mpicxx -std=c++20 -Ofast -DNDEBUG -isystem/usr/local/range-v3/include  -o heat solutions/exercise2.cpp -ltbb
!OMPI_MCA_coll_hcoll_enable=0 mpirun --oversubscribe --allow-run-as-root -np 2 ./heat 1024 1024 2000
!mv output output_gcc

In [None]:
!rm output || true
!rm heat || true
!OMPI_CXX=nvc++ mpicxx -stdpar=gpu  -gpu=cc80 -std=c++20 -fast -DNDEBUG -o heat solutions/exercise2.cpp -ltbb
!OMPI_MCA_coll_hcoll_enable=0 mpirun --oversubscribe --allow-run-as-root -np 2 ./heat 1024 1024 2000
!mv output output_nvc

In [None]:
!rm output || true
!rm heat || true
!OMPI_CXX=nvc++ mpicxx -stdpar=gpu  -gpu=cc80,nomanaged -std=c++20 -fast -DNDEBUG -o heat solutions/exercise2_nomanaged.cpp -ltbb
!OMPI_MCA_coll_hcoll_enable=0 mpirun --oversubscribe --allow-run-as-root -np 2 ./heat 1024 1024 2000
!mv output output_nvc

## Exercise 3: overlapping communication and computation

The goal of this exercise is to extend the solution of Exercise 2 to use multiple `std::thread`s to overlap computation with communication.

The main `TODO`s are:

* using `atomic` operations to modify the energy concurrently
* using a `std::barrier` to synchronize the different threads
* using one `std::thread` per computation
* performing some of the operations in a critical section between the three threads:
  * `MPI_Reduce` of the energy
  * reset the energy to `0.` for the next iteration
  
See:


```c++
  // TODO: use a dynamically-allocated atomic variable for the energy
  double* energy = new double{0.};
    
  // TODO: use a barrier for synchronization
  // ...bar = ...

  // TODO: use threads for the different computations
  auto thread_prev = std::thread([/*TODO: complete capture */]() {
      for (long it = 0; it < p.nit(); ++it) {
          // TODO: perform the prev exchange and computation
          // TODO: update the atomic energy
          // TODO: synchronize with the barrier
      }
  });
    
  auto thread_next = /* TODO: similar to thread_prev */;
      
  auto thread_internal = /*
    TODO: same as for next and prev
    TODO: need to perform the reduction in one of the threads (for example this one)
    TODO: need to reset the atomic in one of the threads (for example this one)
  */;

  // TODO: join all threads

```

In [None]:
!rm output || true
!rm heat || true
!OMPI_CXX=g++ mpicxx -std=c++20 -Ofast -DNDEBUG -isystem/usr/local/range-v3/include  -o heat exercise3.cpp -ltbb
!OMPI_MCA_coll_hcoll_enable=0 mpirun --oversubscribe --allow-run-as-root -np 2 ./heat 1024 1024 2000
!mv output output_gcc

In [None]:
!rm output || true
!rm heat || true
!OMPI_CXX=nvc++ mpicxx -stdpar=gpu  -gpu=cc80,nomanaged -std=c++20 -fast -DNDEBUG -o heat exercise3.cpp -ltbb
!OMPI_MCA_coll_hcoll_enable=0 mpirun --oversubscribe --allow-run-as-root -np 2 ./heat 1024 1024 2000
!mv output output_nvc

### Solution Exercise 3

The solution is available in the `solutions/exercise3.cpp` file.

The following compiles and runs the solution for Exercise 3 using different compilers.

In [None]:
!rm output || true
!rm heat || true
!OMPI_CXX=g++ mpicxx -std=c++20 -Ofast -DNDEBUG -isystem/usr/local/range-v3/include  -o heat solutions/exercise3.cpp -ltbb
!OMPI_MCA_coll_hcoll_enable=0 mpirun --oversubscribe --allow-run-as-root -np 2 ./heat 1024 1024 2000
!mv output output_gcc

In [None]:
!rm output || true
!rm heat || true
!OMPI_CXX=nvc++ mpicxx -stdpar=gpu  -gpu=cc80 -std=c++20 -fast -DNDEBUG -o heat solutions/exercise3.cpp -ltbb
!OMPI_MCA_coll_hcoll_enable=0 mpirun --oversubscribe --allow-run-as-root -np 2 ./heat 1024 1024 2000
!mv output output_nvc