Accelerating portable HPC Applications with Standard C++
===

# Lab 2 Optional Exercises: `std::mdspan`

This notebook contains the optional exercises of the Heat Equation lab. 
In these exercises, we will focus on porting the solution of Exercise 1 to use `std::mdspan`.

The solution of Exercise 1 is available here: [`solutions/exercise1.cpp`](./solutions/exercise1.cpp). Its `stencil` function includes code like this:

```c++
double stencil(double *u_new, double *u_old, long x, long y, parameters p) {
  auto idx = [=](auto x, auto y) { 
      // Index into the memory using row-major order:
      assert(x >= 0 && x < (p.nx + 2));
      assert(y >= 0 && y < p.ny);
      return x * p.ny + y;
  };
  // Apply boundary conditions:
  if (y == 1) {
    u_old[idx(x, y - 1)] = 0;
  }
  // ...
}
```

Notice how in this implementation, we are hardcoding the memory layout of our data structures into the `idx` function, which we then use to perform multi-dimensional access like this: `u_old[idx(x, y - 1)]`.

In these optional exercises, we will update this solution to be generic over the memory layout of our data-structures, so that we can index it directly using C++23's multi-dimensional operator[]: 

```c++
using grid_t = std::mdspan<...>;

double stencil(grid_t u_new, grid_t u_old, long x, long y, parameters p) {
  // Apply boundary conditions:
  if (y == 1) {
    u_old[x, y - 1] = 0;
  }
  // ...
}
```

These optional exercises are structured as follows:

- [`exercise1o.cpp`](./exercise1o.cpp): introduce mdspan and use it everywhere `u_old` and `u_new` are expected.
- [`exercise2o.cpp`](./exercise2o.cpp): make the implementation of file I/O and MPI communication independent of the memory layout, so that the layout can be changed in one place.

## Exercise 1o: Using `std::mdspan` everywhere `u_old` and `u_new` are expected.

In this optional exercise, we replace the manual layout and data access computations of Exercise 1 with C++23 multi-dimensional spans `std::mdspan`. To learn more about `std::mdspan` see [A Gentle Introduction to mdspan](https://github.com/kokkos/mdspan/wiki/A-Gentle-Introduction-to-mdspan). 

In C++23, `std::mdspan` is available in the `<mdspan>` header. It provides non-owning multi-dimensional data access using the multi-dimensional square bracket operator (`array[i, j, k]`), which is another C++23 extension. The implementation provided here is backported to C++20, and the macro `MDSPAN_USE_PAREN_OPERATOR` can be defined to `1` to use the paren `operator()` instead (`array(i, j, k)`) to enable using it in older compilers. As compilers gain support for `operator[]` this macro can be removed.

In this exercise we start by including `std::mdspan`:

```c++
#define MDSPAN_USE_PAREN_OPERATOR 1
#include <mdspan>
```

The current implementation provided in the containers is under the `std::experimental` namespace. This will change once C++23 is ratified and the implementation is made standard conforming. In the meantime, writing `std::experimental::mdspan` is quite cumbersome, so we will define a namespace alias `stdex` and use that instead. Once C++23 is ratified and implementations are updated, we can switch this to `std` instead:

```c++
namespace stdex = std::experimental;
```

`mdspan` is customizable through many type parameters:

```c++
template <typename T, typename Extents, typename Layout, ... /* others */>
class mdspan;
```

* `T`: is the value type of the multi-dimensional array, e.g., `double`.
* `Extents`: describes the index type, the number of dimensions, and the number of elements per dimension in the array:
   ```c++
   using grid_extents = stdex::extents<std::size_t, stdex::dynamic_extent, stdex::dynamic_extent>;
   ```
* `Layout`: the memory layout of the array
  * `std::layout_right`: elements are contiguous along the _right most_ index (row-major in 2D).
  * `std::layout_left`: elements are contiguous along the _left most_ index (col-major in 2D).

Now that we are all set up, we will create a multi-dimensional span with two runtime dimensions indexed with an integer of type `std::size_t`. 
The dimensions of an [`std::mdspan`] are defined via the [`std::extents`] type:

```c++
// Two-dimensional dynamic extents with index type size_t:
using grid_extents = stdex::extents<std::size_t, stdex::dynamic_extent, stdex::dynamic_extent>;
```

Now we need to pick the layout of our `std::mdspan`. The C++23 standard comes with a couple of layouts, two of which are called [`std::layout_right`] and [`std::layout_left`]. Here, "left" and "right" indicate which dimension is contiguous in memory. For row-major format, the right-most dimension, i.e., `y` for our 2D grid, is contiguous in memory, so we set our layout to right:

```c++
using grid_layout = stdex::layout_right;
```
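To make the difference between the two layouts concrete, the following minimal sketch spells out the offset each one computes for a 2D array with extents `(nx, ny)`. The helpers `offset_right` and `offset_left` are hypothetical names used only for illustration and are not part of the `mdspan` API; they assume a dense array without padding.

```c++
#include <cstddef>

// Hypothetical helpers (not part of mdspan): the 1D offset that each
// layout assigns to element (x, y) of a dense array with extents (nx, ny).

// layout_right: the right-most index y is contiguous (row-major).
constexpr std::size_t offset_right(std::size_t x, std::size_t y,
                                   std::size_t /*nx*/, std::size_t ny) {
  return x * ny + y;
}

// layout_left: the left-most index x is contiguous (column-major).
constexpr std::size_t offset_left(std::size_t x, std::size_t y,
                                  std::size_t nx, std::size_t /*ny*/) {
  return y * nx + x;
}

// Neighbours along y are adjacent in memory for layout_right ...
static_assert(offset_right(1, 2, 3, 4) + 1 == offset_right(1, 3, 3, 4));
// ... while neighbours along x are adjacent for layout_left.
static_assert(offset_left(1, 2, 3, 4) + 1 == offset_left(2, 2, 3, 4));
```

Note that `offset_right` is the same expression as the `idx` lambda from Exercise 1, which is why `layout_right` reproduces the original behavior.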

And with this we can finally create our two-dimensional `std::mdspan`:

```c++
using grid_t = stdex::mdspan<double, grid_extents, grid_layout>;
```

Summary: to create a multi-dimensional span to access our grid, we've had to do:

```c++
using grid_extents = stdex::extents<std::size_t, stdex::dynamic_extent, stdex::dynamic_extent>;
using grid_layout = stdex::layout_right;
using grid_t = stdex::mdspan<double, grid_extents, grid_layout>;
```

Since our goal is to use mdspans for accessing our data, we will now rename our storage for the variables to have a `_data` suffix, and will create multi-dimensional spans to access them:

```c++
// Allocate memory
std::vector<double> u_new_data(p.n()), u_old_data(p.n());

grid_t u_new{u_new_data.data(), p.nx+2, p.ny};  // mdspan constructor takes a pointer to the data followed by the extents
grid_t u_old{u_old_data.data(), p.nx+2, p.ny};
```
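The key property used here is that the span does not own the memory: the `std::vector` keeps the data alive, and the span only stores a pointer plus the extents. The sketch below illustrates this with a deliberately simplified, hypothetical stand-in type `grid_view` (row-major only). It is not `std::mdspan`, only a model of the idea that compiles without the `<mdspan>` header.

```c++
#include <cstddef>
#include <vector>

// Simplified, hypothetical stand-in for a 2D row-major mdspan:
// it stores a pointer and the extents, but does not own the memory.
struct grid_view {
  double* ptr;
  std::size_t nx, ny;
  // Multi-dimensional access, analogous to mdspan's operator().
  double& operator()(std::size_t x, std::size_t y) const { return ptr[x * ny + y]; }
  // Analogous to mdspan's data_handle() and size().
  double* data_handle() const { return ptr; }
  std::size_t size() const { return nx * ny; }
};

// Writes through the view and reads the result back from the storage.
inline double write_through_view() {
  std::vector<double> u_data(3 * 4, 0.0);  // owning storage
  grid_view u{u_data.data(), 3, 4};        // non-owning view over it
  u(1, 2) = 42.0;                          // writes u_data[1 * 4 + 2]
  return u_data[6];
}
```

Because the view is just a pointer and two integers, passing it by value to functions such as `inner`, `prev`, and `next` is cheap and does not copy the grid.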

Before, the type of the `u_old` and `u_new` variables was `double*`, but after the above change it will be an `std::mdspan`.
Since these two types are incompatible, the compiler will helpfully emit an error everywhere we were expecting a `double*`, telling us precisely what to update next. We will chase these errors until the mini-application compiles again, and then we will verify the results.

We will start by changing a bunch of API signatures from `double*` to `grid_t`:

```c++
double inner(grid_t u_new, grid_t u_old, parameters p);
double prev (grid_t u_new, grid_t u_old, parameters p); 
double next (grid_t u_new, grid_t u_old, parameters p);
```

And then continue by implementing our `initial_condition`: 

```c++
void initial_condition(grid_t u_new, grid_t u_old) {
  std::fill_n(std::execution::par, u_old.data_handle(), u_old.size(), 0.0);
  std::fill_n(std::execution::par, u_new.data_handle(), u_new.size(), 0.0);
}
```

Notice that we have dropped the function argument `n`, and that we continue to use `fill_n` by writing directly to the data behind the `mdspan`.
For containers like `std::vector`, a handle to the data can be obtained by just calling the `vec.data()` member function.
For `mdspan`, this function is called `mdspan.data_handle()` instead.

Then we will continue by updating the pointer passed to MPI File I/O for writing the solution:

```c++
MPI_File_iwrite_at(f, values_offset, u_old.data_handle() + p.ny, values_per_rank, MPI_DOUBLE, &req[0]);
```

Finally, we get to update the `stencil` function, where `std::mdspan` makes a big difference:

```c++
// Finite-difference stencil
double stencil(grid_t u_new, grid_t u_old, long x, long y, parameters p) {
   // Update the content
}
```

Until now, we've been accessing elements with:

```c++
  auto idx = [=](auto x, auto y) { 
      // Index into the memory using row-major order:
      assert(x >= 0 && x < (p.nx + 2));
      assert(y >= 0 && y < p.ny);
      return x * p.ny + y;
  };
  u_new[idx(x, y)] = ...;
```

But now we can access them through the `mdspan` directly:

```c++
  u_new(x,y) = ...;
```

And one last thing, we need to update the handles for the MPI calls in `prev` and `next` as well.

### Compilation and run commands

[exercise1o.cpp]: ./exercise1o.cpp

While [exercise1o.cpp] compiles and runs correctly as provided, it does not use `std::mdspan` yet. Follow the TODOs in the file to port it to use `std::mdspan` as described in the steps above.

In [None]:
%run ../lab2_heat/vis.py

In [None]:
!rm output || true
!rm heat || true
!OMPI_CXX=g++ mpicxx -std=c++20 -Ofast -march=native -o heat exercise1o.cpp -ltbb
!mpirun -np 2 ./heat 256 256 16000
visualize()

In [None]:
!rm output || true
!rm heat || true
!OMPI_CXX=clang++ mpicxx -std=c++20 -Ofast -march=native -o heat exercise1o.cpp -ltbb
!mpirun -np 2 ./heat 256 256 16000
visualize()

In [None]:
!rm output || true
!rm heat || true
!OMPI_CXX=nvc++ mpicxx -std=c++20 -Ofast -march=native -stdpar=multicore -o heat exercise1o.cpp
!mpirun -np 2 ./heat 256 256 16000
visualize()

In [None]:
!rm output || true
!rm heat || true
!OMPI_CXX=nvc++ mpicxx -std=c++20 -Ofast -march=native -stdpar=gpu -o heat exercise1o.cpp
!UCX_RNDV_FRAG_MEM_TYPE=cuda mpirun -np 2 ./heat 256 256 16000
visualize()

### Solutions Exercise 1o

The solution for this exercise is in [`solutions/exercise1o.cpp`].

[`solutions/exercise1o.cpp`]: ./solutions/exercise1o.cpp

The following compiles and runs the solution for Exercise 1o using different compilers.

In [None]:
!rm output || true
!rm heat || true
!OMPI_CXX=g++ mpicxx -std=c++20 -Ofast -march=native -DNDEBUG -o heat solutions/exercise1o.cpp -ltbb
!mpirun -np 2 ./heat 256 256 16000
visualize()

In [None]:
!rm output || true
!rm heat || true
!OMPI_CXX=clang++ mpicxx -std=c++20 -Ofast -march=native -DNDEBUG -o heat solutions/exercise1o.cpp -ltbb
!mpirun -np 2 ./heat 256 256 16000
visualize()

In [None]:
!rm output || true
!rm heat || true
!OMPI_CXX=nvc++ mpicxx -std=c++20 -Ofast -march=native -DNDEBUG -stdpar=multicore  -o heat solutions/exercise1o.cpp
!mpirun -np 2 ./heat 256 256 16000
visualize()

In [None]:
!rm output || true
!rm heat || true
!OMPI_CXX=nvc++ mpicxx -std=c++20 -Ofast -march=native -DNDEBUG -stdpar=gpu -o heat solutions/exercise1o.cpp
!UCX_RNDV_FRAG_MEM_TYPE=cuda mpirun -np 2 ./heat 256 256 16000
visualize()

## Exercise 2o: Support varying the `mdspan` layout

In the previous exercise we ported our example to use `std::mdspan` everywhere, but that does not mean that our application is correct for any layout.
For example, the following change will break our results:

```c++
// using grid_layout = stdex::layout_right;  
using grid_layout = stdex::layout_left;  // Change layout from right to left!
```

The starting point at [exercise2o.cpp](./exercise2o.cpp) takes the solution of Exercise 1o and makes this one change:

In [None]:
!which mpicxx

In [None]:
!rm output || true
!rm heat || true
!OMPI_CXX=g++ mpicxx -std=c++20 -Ofast -march=native -DNDEBUG -o heat exercise2o.cpp -ltbb
!mpirun -np 2 ./heat 256 256 16000
visualize()

This solution is obviously not right. What's the problem?

The problem is that the MPI File I/O calls, the MPI communication, and the visualization scripts all expect `std::layout_right`.

We'll solve this problem by:
* File I/O: translating from our data-layout to `std::layout_right` before doing File I/O
* MPI communication: performing packing and unpacking into vectors of contiguous elements
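The two steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the solution file: the helper names `pack_row`, `unpack_row`, and `to_row_major` are hypothetical, and `left_view` is a simplified column-major stand-in so the sketch compiles without `<mdspan>`. The point is that every element is read or written through `u(x, y)`, so the helpers produce the same result for any layout.

```c++
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a column-major (layout_left) 2D view.
struct left_view {
  double* ptr;
  std::size_t nx, ny;
  double& operator()(std::size_t x, std::size_t y) const { return ptr[y * nx + x]; }
};

// Packing: copy the row with fixed index x into a contiguous buffer.
template <class Grid>
std::vector<double> pack_row(Grid u, std::size_t x, std::size_t ny) {
  std::vector<double> buf(ny);
  for (std::size_t y = 0; y < ny; ++y) buf[y] = u(x, y);
  return buf;
}

// Unpacking: write a received contiguous buffer back into row x.
template <class Grid>
void unpack_row(Grid u, std::size_t x, const std::vector<double>& buf) {
  for (std::size_t y = 0; y < buf.size(); ++y) u(x, y) = buf[y];
}

// File I/O: copy the whole grid into row-major (layout_right) order.
template <class Grid>
std::vector<double> to_row_major(Grid u, std::size_t nx, std::size_t ny) {
  std::vector<double> out(nx * ny);
  for (std::size_t x = 0; x < nx; ++x)
    for (std::size_t y = 0; y < ny; ++y) out[x * ny + y] = u(x, y);
  return out;
}
```

The `data()` pointer of each returned vector refers to contiguous memory, so it can be handed to the MPI calls in place of `u_old.data_handle()` plus a manual offset. The extra copies cost memory bandwidth, which is the price paid here for layout independence.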


### Compilation and run commands

[exercise2o.cpp]: ./exercise2o.cpp

While [exercise2o.cpp] compiles correctly, it generates incorrect results. 
Follow the TODOs in the file to fix the issues and obtain correct results.

In [None]:
!rm output || true
!rm heat || true
!OMPI_CXX=g++ mpicxx -std=c++20 -Ofast -march=native -o heat exercise2o.cpp -ltbb
!mpirun -np 2 ./heat 256 256 16000
visualize()

In [None]:
!rm output || true
!rm heat || true
!OMPI_CXX=clang++ mpicxx -std=c++20 -Ofast -march=native -o heat exercise2o.cpp -ltbb
!mpirun -np 2 ./heat 256 256 16000
visualize()

In [None]:
!rm output || true
!rm heat || true
!OMPI_CXX=nvc++ mpicxx -std=c++20 -Ofast -march=native -stdpar=multicore -o heat exercise2o.cpp
!mpirun -np 2 ./heat 256 256 16000
visualize()

In [None]:
!rm output || true
!rm heat || true
!OMPI_CXX=nvc++ mpicxx -std=c++20 -Ofast -march=native -stdpar=gpu -o heat exercise2o.cpp
!UCX_RNDV_FRAG_MEM_TYPE=cuda mpirun -np 2 ./heat 256 256 16000
visualize()

### Solutions Exercise 2o

The solution for this exercise is in [`solutions/exercise2o.cpp`].

[`solutions/exercise2o.cpp`]: ./solutions/exercise2o.cpp

The following compiles and runs the solution for Exercise 2o using different compilers.

In [None]:
!rm output || true
!rm heat || true
!OMPI_CXX=g++ mpicxx -std=c++20 -Ofast -march=native -DNDEBUG -o heat solutions/exercise2o.cpp -ltbb
!mpirun -np 2 ./heat 256 256 16000
visualize()

In [None]:
!rm output || true
!rm heat || true
!OMPI_CXX=clang++ mpicxx -std=c++20 -Ofast -march=native -DNDEBUG -o heat solutions/exercise2o.cpp -ltbb
!mpirun -np 2 ./heat 256 256 16000
visualize()

In [None]:
!rm output || true
!rm heat || true
!OMPI_CXX=nvc++ mpicxx -std=c++20 -Ofast -march=native -DNDEBUG -stdpar=multicore -o heat solutions/exercise2o.cpp
!mpirun -np 2 ./heat 256 256 16000
visualize()

In [None]:
!rm output || true
!rm heat || true
!OMPI_CXX=nvc++ mpicxx -std=c++20 -O4 -fast -march=native -DNDEBUG -stdpar=gpu -o heat solutions/exercise2o.cpp
!UCX_RNDV_FRAG_MEM_TYPE=cuda mpirun -np 2 ./heat 256 256 16000
visualize()