Accelerating portable HPC Applications with Standard C++
===

# Topic 1: 2D Unsteady Heat Equation

In this tutorial, we accelerate a 2D heat equation solver (see slides) to learn how to do multi-dimensional iteration in C++17 and C++23, and how to integrate parallel algorithms into pre-existing MPI applications.

A working sequential implementation that does not use MPI is provided in [starting_point.cpp].
Please take 5 minutes to skim through it.

[starting_point.cpp]: ./starting_point.cpp

## Getting started

Let's start by compiling and running the starting point:


In [1]:
!g++ -std=c++20 -Ofast -DNDEBUG -o heat starting_point.cpp

# The binary takes three arguments: the domain dimensions NX and NY, followed by the number of iterations
!./heat 1024 1024 10000

E(t=0) = 1.94931e-05
E(t=0.000190735) = 0.00423877
E(t=0.00038147) = 0.00604144
E(t=0.000572205) = 0.00740573
E(t=0.000762939) = 0.00854288
E(t=0.000953674) = 0.00953482
E(t=0.00114441) = 0.0104236
E(t=0.00133514) = 0.0112341
E(t=0.00152588) = 0.0119828
E(t=0.00171661) = 0.0126809
Domain 1024x1024 (0.0167772 GB): 4.55098 GB/s


The binary writes the solution to a file named `output`, which can be converted to a PNG file using the `vis` script or the following function:

In [40]:
import numpy as np
import matplotlib.pyplot as plt

#plt.style.use('dark_background') # Uncomment for dark background

def visualize(name='output'):
    # File layout: nx and ny as uint64, the time as float64, then nx * ny float64 values
    with open(name, 'rb') as f:
        grid = np.fromfile(f, dtype=np.uint64, count=2)
        times = np.fromfile(f, dtype=np.float64, count=1)
        values = np.fromfile(f, dtype=np.float64)

    # Convert to Python ints so the size check and reshape do not mix in uint64 values
    nx, ny = int(grid[0]), int(grid[1])
    time = times[0]

    assert len(values) == nx * ny, f'{len(values)} != {nx * ny}'
    values = values.reshape((nx, ny))

    print(f'Plotting grid {nx}x{ny}, t = {time}')

    plt.title(f'Temperature at t = {time:.3f} [s]')
    plt.xlabel('x')
    plt.ylabel('y')
    plt.pcolormesh(values, cmap=plt.cm.jet, vmin=0, vmax=values.max())
    plt.colorbar()
    #plt.savefig('output.png', transparent=True, bbox_inches='tight', dpi=300)



In [None]:
visualize()
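
For readers who want to produce or inspect such a file outside the solver, the layout that `visualize` expects can be sketched in C++ as follows. The layout is inferred from the reader above (two unsigned 64-bit extents, one `double` time, then `nx * ny` `double` values in native byte order), not taken from the solver source, and the names `snapshot`, `write_snapshot`, and `read_snapshot` are hypothetical.

```c++
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical in-memory form of the file that `visualize` reads.
struct snapshot {
  std::uint64_t nx = 0, ny = 0;
  double time = 0.;
  std::vector<double> values;  // nx * ny entries, row-major (assumed)
};

// Writes the header (nx, ny, time) followed by the raw values.
bool write_snapshot(const std::string& path, const snapshot& s) {
  if (s.values.size() != s.nx * s.ny) return false;
  std::ofstream out(path, std::ios::binary);
  out.write(reinterpret_cast<const char*>(&s.nx), sizeof s.nx);
  out.write(reinterpret_cast<const char*>(&s.ny), sizeof s.ny);
  out.write(reinterpret_cast<const char*>(&s.time), sizeof s.time);
  out.write(reinterpret_cast<const char*>(s.values.data()),
            static_cast<std::streamsize>(s.values.size() * sizeof(double)));
  return static_cast<bool>(out);
}

// Reads the header first, then exactly nx * ny values.
bool read_snapshot(const std::string& path, snapshot& s) {
  std::ifstream in(path, std::ios::binary);
  in.read(reinterpret_cast<char*>(&s.nx), sizeof s.nx);
  in.read(reinterpret_cast<char*>(&s.ny), sizeof s.ny);
  in.read(reinterpret_cast<char*>(&s.time), sizeof s.time);
  if (!in) return false;
  s.values.resize(s.nx * s.ny);
  in.read(reinterpret_cast<char*>(s.values.data()),
          static_cast<std::streamsize>(s.values.size() * sizeof(double)));
  return static_cast<bool>(in);
}
```

Writing a small snapshot with `write_snapshot` and loading it with `visualize` is one way to check that the plotting function works before running the solver.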

## Exercise 0: accelerate using parallel algorithms with linear indexing

The goal of this exercise is to accelerate the stencil implementation using the C++ standard library parallel algorithms and the following technique described in the presentation:

* C++17 linear indexing

A template for the solution is provided in [exercise0.cpp].
The only function that needs to be modified to achieve this is the `stencil` function.

In the serial implementation, raw loops are used:

```c++
template <typename Idx> 
double stencil(double *u_new, double *u_old, grid g, Idx idx, parameters p) {
  double energy = 0.;
  for (long x = g.x_start; x < g.x_end; ++x) {
    for (long y = g.y_start; y < g.y_end; ++y) {
      energy += stencil(u_new, u_old, idx, x, y, p);
    }
  }
  return energy;
}
```
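
As a hint for how such a loop nest maps onto a single algorithm call, here is a minimal sketch of linear indexing on a toy reduction. It is not the exercise solution: `box` and `sum_over_box` are hypothetical names, the index range is materialized in a vector to keep the sketch C++17-only, and the sequential overload of `std::transform_reduce` is used so that it builds without a parallel backend.

```c++
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical sub-rectangle [x_start, x_end) x [y_start, y_end).
struct box { long x_start, x_end, y_start, y_end; };

// Sums f(x, y) over the box using one linear index per cell.
template <typename F>
double sum_over_box(box b, F f) {
  long nx = b.x_end - b.x_start;
  long ny = b.y_end - b.y_start;
  // One entry per cell: 0, 1, ..., nx * ny - 1. A counting iterator or
  // std::views::iota would avoid this allocation.
  std::vector<long> ids(static_cast<std::size_t>(nx * ny));
  std::iota(ids.begin(), ids.end(), 0L);
  // Passing an execution policy such as std::execution::par (from
  // <execution>) as the first argument selects the parallel overload.
  return std::transform_reduce(ids.begin(), ids.end(), 0., std::plus<>{},
                               [=](long i) {
                                 long x = b.x_start + i / ny;  // slow index
                                 long y = b.y_start + i % ny;  // fast index
                                 return f(x, y);
                               });
}
```

Each linear index `i` is decoded into `(x, y)` inside the lambda, so the iterations are independent of each other and the reduction can be distributed across threads once a policy is supplied.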

In the parallel implementation, you should use the C++ parallel algorithms discussed in the lecture:


```c++
#include <fstream>
// TODO: will need some new includes here

// ...

template <typename Idx> 
double stencil(double *u_new, double *u_old, grid g, Idx idx, parameters p) {
  double energy = 0.;
  // TODO: implement using parallel algorithms
  return energy;
}
```

[exercise0.cpp]: ./exercise0.cpp

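The template compiles and runs as provided, but the incomplete `stencil` implementation produces incorrect results: the recorded outputs below, taken from the unmodified template, report zero energy and a meaningless bandwidth.
Once you complete `stencil`, the following blocks should compile and run correctly: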


In [2]:
!g++ -std=c++20 -Ofast -DNDEBUG -o heat exercise0.cpp -ltbb
!./heat 1024 1024 10000

E(t=0) = 0
E(t=0.000190735) = 0
E(t=0.00038147) = 0
E(t=0.000572205) = 0
E(t=0.000762939) = 0
E(t=0.000953674) = 0
E(t=0.00114441) = 0
E(t=0.00133514) = 0
E(t=0.00152588) = 0
E(t=0.00171661) = 0
Domain 1024x1024 (0.0167772 GB): 728541 GB/s


In [3]:
!clang++ -std=c++20 -Ofast -DNDEBUG -isystem/usr/local/range-v3/include  -o heat exercise0.cpp -ltbb
!./heat 1024 1024 10000

E(t=0) = 0
E(t=0.000190735) = 0
E(t=0.00038147) = 0
E(t=0.000572205) = 0
E(t=0.000762939) = 0
E(t=0.000953674) = 0
E(t=0.00114441) = 0
E(t=0.00133514) = 0
E(t=0.00152588) = 0
E(t=0.00171661) = 0
Domain 1024x1024 (0.0167772 GB): 1.02753e+06 GB/s


In [4]:
!nvc++ -std=c++20 -stdpar=gpu -gpu=cc80 -fast -DNDEBUG -o heat exercise0.cpp
!./heat 1024 1024 10000

E(t=0) = 0
E(t=0.000190735) = 0
E(t=0.00038147) = 0
E(t=0.000572205) = 0
E(t=0.000762939) = 0
E(t=0.000953674) = 0
E(t=0.00114441) = 0
E(t=0.00133514) = 0
E(t=0.00152588) = 0
E(t=0.00171661) = 0
Domain 1024x1024 (0.0167772 GB): 1.09215e+06 GB/s


### Solutions Exercise 0

The solution is available in `solutions/exercise0.cpp`.

The following cells compile and run the solution for Exercise 0 with different compilers.

In [7]:
!rm output || true
!rm heat || true
!g++ -std=c++20 -Ofast -DNDEBUG -o heat solutions/exercise0.cpp -ltbb
!./heat 1024 1024 10000
!mv output output_gcc

rm: cannot remove 'output': No such file or directory
rm: cannot remove 'heat': No such file or directory
E(t=0) = 1.94931e-05
E(t=0.000190735) = 0.00423877
E(t=0.00038147) = 0.00604144
E(t=0.000572205) = 0.00740573
E(t=0.000762939) = 0.00854288
E(t=0.000953674) = 0.00953482
E(t=0.00114441) = 0.0104236
E(t=0.00133514) = 0.0112341
E(t=0.00152588) = 0.0119828
E(t=0.00171661) = 0.0126809
Domain 1024x1024 (0.0167772 GB): 2.77715 GB/s


In [10]:
!rm output || true
!rm heat || true
!clang++ -std=c++20 -Ofast -DNDEBUG -isystem/usr/local/range-v3/include -o heat solutions/exercise0.cpp -ltbb
!./heat 1024 1024 10000
!mv output output_clang

rm: cannot remove 'output': No such file or directory
rm: cannot remove 'heat': No such file or directory
E(t=0) = 1.94931e-05
E(t=0.000190735) = 0.00423877
E(t=0.00038147) = 0.00604144
E(t=0.000572205) = 0.00740573
E(t=0.000762939) = 0.00854288
E(t=0.000953674) = 0.00953482
E(t=0.00114441) = 0.0104236
E(t=0.00133514) = 0.0112341
E(t=0.00152588) = 0.0119828
E(t=0.00171661) = 0.0126809
Domain 1024x1024 (0.0167772 GB): 19.6848 GB/s


In [11]:
!rm output || true
!rm heat || true
!nvc++ -std=c++20 -stdpar=gpu -gpu=cc80 -fast -DNDEBUG -o heat solutions/exercise0.cpp
!./heat 1024 1024 10000
!mv output output_nvc

rm: cannot remove 'output': No such file or directory
E(t=0) = 1.94931e-05
E(t=0.000190735) = 0.00423877
E(t=0.00038147) = 0.00604144
E(t=0.000572205) = 0.00740573
E(t=0.000762939) = 0.00854288
E(t=0.000953674) = 0.00953482
E(t=0.00114441) = 0.0104236
E(t=0.00133514) = 0.0112341
E(t=0.00152588) = 0.0119828
E(t=0.00171661) = 0.0126809
Domain 1024x1024 (0.0167772 GB): 428.806 GB/s


In [None]:
visualize('output_nvc')

## Exercise 1: accelerate using parallel algorithms and `views::cartesian_product`

This is the same exercise as above, but instead of linear indexing, use `views::cartesian_product` for multi-dimensional iteration.

A template for the solution is provided in [exercise1.cpp].

[exercise1.cpp]: ./exercise1.cpp

In [48]:
!g++ -std=c++20 -Ofast -DNDEBUG -isystem/usr/local/range-v3/include  -o heat exercise1.cpp -ltbb
!./heat 1024 1024 2000

E(t=0) = 0
E(t=0.000190735) = 0
Domain 1024x1024 (0.0167772 GB): 338083 GB/s


In [49]:
!clang++ -std=c++20 -Ofast -DNDEBUG -isystem/usr/local/range-v3/include  -o heat exercise1.cpp -ltbb
!./heat 1024 1024 2000

E(t=0) = 0
E(t=0.000190735) = 0
Domain 1024x1024 (0.0167772 GB): 323516 GB/s


### Solutions Exercise 1

The solution is available in `solutions/exercise1.cpp`.

The following cells compile and run the solution for Exercise 1 with different compilers.

In [47]:
!rm output || true
!rm heat || true
!g++ -std=c++20 -Ofast -DNDEBUG -isystem/usr/local/range-v3/include  -o heat solutions/exercise1.cpp -ltbb
!./heat 1024 1024 2000
!mv output output_gcc

rm: cannot remove 'output': No such file or directory
E(t=0) = 1.94931e-05
E(t=0.000190735) = 0.00423877
Domain 1024x1024 (0.0167772 GB): 5.65926 GB/s


In [46]:
!rm output || true
!rm heat || true
!clang++ -std=c++20 -Ofast -DNDEBUG -isystem/usr/local/range-v3/include -o heat solutions/exercise1.cpp -ltbb
!./heat 1024 1024 20000
!mv output output_clang

rm: cannot remove 'output': No such file or directory
E(t=0) = 1.94931e-05
E(t=0.000190735) = 0.00423877
E(t=0.00038147) = 0.00604144
E(t=0.000572205) = 0.00740573
E(t=0.000762939) = 0.00854288
E(t=0.000953674) = 0.00953482
E(t=0.00114441) = 0.0104236
E(t=0.00133514) = 0.0112341
E(t=0.00152588) = 0.0119828
E(t=0.00171661) = 0.0126809
E(t=0.00190735) = 0.0133366
E(t=0.00209808) = 0.0139561
E(t=0.00228882) = 0.0145443
E(t=0.00247955) = 0.0151051
E(t=0.00267029) = 0.0156415
E(t=0.00286102) = 0.0161561
E(t=0.00305176) = 0.0166511
E(t=0.00324249) = 0.0171282
E(t=0.00343323) = 0.0175891
E(t=0.00362396) = 0.018035
Domain 1024x1024 (0.0167772 GB): 6.74229 GB/s


## Exercise 2: MPI implementation

The starting point for this exercise is [starting_point_mpi.cpp].

[starting_point_mpi.cpp]: ./starting_point_mpi.cpp

The goal of this exercise is to accelerate a pre-existing MPI implementation using parallel algorithms.

The main differences from the previous examples are that the computation now involves a data exchange with neighbors, and that the computations have been split into:

* `internal`: processes internal rows that do not depend on data from neighbors
* `prev_boundary`: exchanges data with neighbor at `rank - 1` and processes the rows that depend on the elements received
* `next_boundary`: exchanges data with neighbor at `rank + 1` and processes the rows that depend on the elements received


```c++
double internal(double* u_new, double* u_old, parameters p) {
    // Rows that do not touch a halo row
    grid g { .x_start = 2, .x_end = p.nx, .y_start = 1, .y_end = p.ny - 1 };
    return stencil(u_new, u_old, g, p);
}

double prev_boundary(double* u_new, double* u_old, parameters p) {
    // Send window cells, receive halo cells
    if (p.rank > 0) {
      // Send bottom boundary to bottom rank
      MPI_Send(u_old + p.ny, p.ny, MPI_DOUBLE, p.rank - 1, 0, MPI_COMM_WORLD);
      // Receive top boundary from bottom rank
      MPI_Recv(u_old + 0, p.ny,  MPI_DOUBLE, p.rank - 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    // Row 1 is the row that depends on the halo row 0 received above
    grid g { .x_start = 1, .x_end = 2, .y_start = 1, .y_end = p.ny - 1 };
    return stencil(u_new, u_old, g, p);
}

double next_boundary(double* u_new, double* u_old, parameters p) {
    if (p.rank < p.nranks - 1) {
        // Receive bottom boundary from top rank
        MPI_Recv(u_old + (p.nx + 1) * p.ny, p.ny, MPI_DOUBLE, p.rank + 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        // Send top boundary to top rank
        MPI_Send(u_old + p.nx * p.ny, p.ny, MPI_DOUBLE, p.rank + 1, 1, MPI_COMM_WORLD);
    }
    // Row p.nx is the row that depends on the halo row p.nx + 1 received above
    grid g { .x_start = p.nx, .x_end = p.nx + 1, .y_start = 1, .y_end = p.ny - 1 };
    return stencil(u_new, u_old, g, p);
}
```
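
To make the row decomposition explicit, here is a small sketch of the three row ranges used by `internal`, `prev_boundary`, and `next_boundary`, with a helper that checks that they cover every owned row exactly once. It assumes the local layout suggested by the snippet above (rows `0` and `nx + 1` are halo rows, rows `1..nx` are owned, and `nx >= 2`); the names `rows`, `row_split`, `split_rows`, and `times_covered` are hypothetical.

```c++
// Half-open row range [x_start, x_end) of the local block.
struct rows { long x_start, x_end; };

// The three pieces of the owned rows 1..nx.
struct row_split {
  rows internal;     // rows 2..nx-1: need no halo data
  rows first_owned;  // row 1: reads halo row 0 (data from rank - 1)
  rows last_owned;   // row nx: reads halo row nx + 1 (data from rank + 1)
};

// Assumes nx >= 2 so that the first and last owned rows are distinct.
row_split split_rows(long nx) {
  return row_split{rows{2, nx}, rows{1, 2}, rows{nx, nx + 1}};
}

// Counts how many of the three ranges contain row x.
int times_covered(const row_split& s, long x) {
  auto contains = [x](rows r) { return r.x_start <= x && x < r.x_end; };
  return int(contains(s.internal)) + int(contains(s.first_owned)) +
         int(contains(s.last_owned));
}
```

Because the three ranges are disjoint, the three stencil calls write to different rows of `u_new` and only their energy contributions need to be combined.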

You can parallelize this using the C++ parallel algorithms in the exact same way we have done above.
Pick the approach that makes the most sense to you.

A template for the solution is provided in [exercise2.cpp]. Notice that we only need to change the `stencil` function; nothing else is required to accelerate an MPI application with C++ parallel algorithms.

If you notice that the performance of the GPU-accelerated version degrades relative to the non-MPI version, see the slides for how to work around this.

[exercise2.cpp]: ./exercise2.cpp



In [44]:
!rm output || true
!rm heat || true
!OMPI_CXX=g++ mpicxx -std=c++20 -Ofast -DNDEBUG -isystem/usr/local/range-v3/include  -o heat exercise2.cpp -ltbb
!OMPI_MCA_coll_hcoll_enable=0 mpirun --oversubscribe --allow-run-as-root -np 2 ./heat 1024 1024 2000
!mv output output_gcc

rm: cannot remove 'output': No such file or directory
E(t=0) = 0
E(t=0.000190735) = 0
Domain 1024x1024 (0.0167772 GB): 3743.4 GB/s


In [45]:
!rm output || true
!rm heat || true
!OMPI_CXX=nvc++ mpicxx -stdpar=gpu  -gpu=cc80 -std=c++20 -fast -DNDEBUG -o heat exercise2.cpp -ltbb
!OMPI_MCA_coll_hcoll_enable=0 mpirun --oversubscribe --allow-run-as-root -np 2 ./heat 1024 1024 2000
!mv output output_nvc

rm: cannot remove 'output': No such file or directory
E(t=0) = 0
E(t=0.000190735) = 0
Domain 1024x1024 (0.0167772 GB): 44.7711 GB/s


### Solution Exercise 2

The solution is available in `solutions/exercise2.cpp`, together with a `_nomanaged` variant (`solutions/exercise2_nomanaged.cpp`) that uses `thrust::device_vector` instead of managed memory.

The following cells compile and run the solutions for Exercise 2 with different compilers.

In [38]:
!rm output || true
!rm heat || true
!OMPI_CXX=g++ mpicxx -std=c++20 -Ofast -DNDEBUG -isystem/usr/local/range-v3/include  -o heat solutions/exercise2.cpp -ltbb
!OMPI_MCA_coll_hcoll_enable=0 mpirun --oversubscribe --allow-run-as-root -np 2 ./heat 1024 1024 2000
!mv output output_gcc

rm: cannot remove 'output': No such file or directory
E(t=0) = 1.94931e-05
E(t=0.000190735) = 0.00423877
Domain 1024x1024 (0.0167772 GB): 2.74724 GB/s


In [37]:
!rm output || true
!rm heat || true
!OMPI_CXX=nvc++ mpicxx -stdpar=gpu  -gpu=cc80 -std=c++20 -fast -DNDEBUG -o heat solutions/exercise2.cpp -ltbb
!OMPI_MCA_coll_hcoll_enable=0 mpirun --oversubscribe --allow-run-as-root -np 2 ./heat 1024 1024 2000
!mv output output_nvc

rm: cannot remove 'output': No such file or directory
E(t=0) = 1.94931e-05
E(t=0.000190735) = 0.00423877
Domain 1024x1024 (0.0167772 GB): 12.1481 GB/s


In [53]:
!rm output || true
!rm heat || true
!OMPI_CXX=nvc++ mpicxx -stdpar=gpu  -gpu=cc80,nomanaged -std=c++20 -fast -DNDEBUG -o heat solutions/exercise2_nomanaged.cpp -ltbb
!OMPI_MCA_coll_hcoll_enable=0 mpirun --oversubscribe --allow-run-as-root -np 2 ./heat 1024 1024 2000
!mv output output_nvc

rm: cannot remove 'output': No such file or directory
rm: cannot remove 'heat': No such file or directory
E(t=0) = 1.94931e-05
E(t=0.000190735) = 0.00423877
Domain 1024x1024 (0.0167772 GB): 15.9786 GB/s


## Exercise 3: overlapping communication and computation

The goal of this exercise is to extend the solution of Exercise 2 to use multiple `std::thread`s to overlap computation with communication.

The main `TODO`s are:

* using `atomic` operations to modify the energy concurrently
* using a `std::barrier` to synchronize the different threads
* using one `std::thread` per computation
* performing some of the operations in a critical section between the three threads:
  * `MPI_Reduce` of the energy
  * reset the energy to `0.` for the next iteration
  
See:


```c++
  // TODO: use a dynamically-allocated atomic variable for the energy
  double* energy = new double{0.};
    
  // TODO: use a barrier for synchronization
  // ...bar = ...

  // TODO: use threads for the different computations
  auto thread_prev = std::thread([/*TODO: complete capture */]() {
      for (long it = 0; it < p.nit(); ++it) {
          // TODO: perform the prev exchange and computation
          // TODO: update the atomic energy
          // TODO: synchronize with the barrier
      }
  });
    
  auto thread_next = /* TODO: similar to thread_prev */;
      
  auto thread_internal = /*
    TODO: same as for next and prev
    TODO: need to perform the reduction in one of the threads (for example this one)
    TODO: need to reset the atomic in one of the threads (for example this one)
  */;

  // TODO: join all threads

```

In [71]:
!rm output || true
!rm heat || true
!OMPI_CXX=g++ mpicxx -std=c++20 -Ofast -DNDEBUG -isystem/usr/local/range-v3/include  -o heat exercise3.cpp -ltbb
!OMPI_MCA_coll_hcoll_enable=0 mpirun --oversubscribe --allow-run-as-root -np 2 ./heat 1024 1024 2000
!mv output output_gcc

rm: cannot remove 'output': No such file or directory
rm: cannot remove 'heat': No such file or directory
Domain 1024x1024 (0.0167772 GB): 1.97379e+07 GB/s


In [None]:
!rm output || true
!rm heat || true
!OMPI_CXX=nvc++ mpicxx -stdpar=gpu  -gpu=cc80,nomanaged -std=c++20 -fast -DNDEBUG -o heat exercise3.cpp -ltbb
!OMPI_MCA_coll_hcoll_enable=0 mpirun --oversubscribe --allow-run-as-root -np 2 ./heat 1024 1024 2000
!mv output output_nvc

### Solution Exercise 3

The solution is available in `solutions/exercise3.cpp`.

The following cells compile and run the solution for Exercise 3 with different compilers.

In [68]:
!rm output || true
!rm heat || true
!OMPI_CXX=g++ mpicxx -std=c++20 -Ofast -DNDEBUG -isystem/usr/local/range-v3/include  -o heat solutions/exercise3.cpp -ltbb
!OMPI_MCA_coll_hcoll_enable=0 mpirun --oversubscribe --allow-run-as-root -np 2 ./heat 1024 1024 2000
!mv output output_gcc

rm: cannot remove 'output': No such file or directory
E(t=0) = 1.94931e-05
E(t=0.000190735) = 0.00423877
Domain 1024x1024 (0.0167772 GB): 2.74889 GB/s


In [73]:
!rm output || true
!rm heat || true
!OMPI_CXX=nvc++ mpicxx -stdpar=gpu  -gpu=cc80 -std=c++20 -fast -DNDEBUG -o heat solutions/exercise3.cpp -ltbb
!OMPI_MCA_coll_hcoll_enable=0 mpirun --oversubscribe --allow-run-as-root -np 2 ./heat 1024 1024 2000
!mv output output_nvc

rm: cannot remove 'output': No such file or directory
E(t=0) = 1.94931e-05
E(t=0.000190735) = 0.00423877
Domain 1024x1024 (0.0167772 GB): 16.2095 GB/s
