Accelerating portable HPC Applications with ISO Fortran
===

# Lab 2: 2D Unsteady Heat Equation

The following cell loads the visualization script. Run it first so that you can call `visualize()` below:

In [None]:
%run vis.py

# Exercise 1: Multi-GPU heat equation with `do concurrent`

In this exercise, we provide you with a mini-application for the two-dimensional unsteady heat equation.
It is written using Fortran and parallelized for distributed memory with MPI.

The goal of this first exercise is to parallelize the work of each MPI rank using `do concurrent`. The result is a hybrid mini-application that uses MPI for distributed-memory parallelism across nodes and `do concurrent` for parallelism across the CPU cores or GPUs within a node, all in portable, standard-compliant ISO Fortran 2023.

The compilation commands below compile the template of this exercise [exercise1.f90](./exercise1.f90) for CPUs and GPUs. The template produces correct results, albeit sequentially; start by running it (see below). It contains many `! TODO`s. The only code that needs to be changed is the code around the `! TODO`s; if a `subroutine` or `function` does not contain any `! TODO`s, there is nothing to do (hah).

It is recommended to start by parallelizing the `initial_condition`, since it is very similar to the previous DAXPY lab:

```fortran
  ! TODO: parallelize with do-concurrent
  do x = 1,p%nx
    do y = 1,p%ny
      u_old(x, y) = 0.
      u_new(x, y) = 0.
    end do
  end do
```
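
As a minimal sketch of the kind of transformation involved (not the exercise solution itself), a pair of nested loops over a 2D array can be expressed as a single `do concurrent` construct with two index ranges. The program below is self-contained; the array `a` and the sizes `nx` and `ny` are hypothetical names chosen for illustration only.

```fortran
program dc_init_sketch
  implicit none
  integer, parameter :: nx = 8, ny = 4   ! hypothetical tile sizes
  real :: a(nx, ny)
  integer :: x, y

  ! A single do concurrent construct covers both index ranges.
  ! Iterations may execute in any order, so they must be independent.
  do concurrent (x = 1:nx, y = 1:ny)
    a(x, y) = 0.
  end do

  ! Self-check: every element must have been zeroed.
  if (any(a /= 0.)) error stop "initialization failed"
  print *, "all elements initialized"
end program dc_init_sketch
```

Each iteration writes a distinct element and reads nothing written by another iteration, which is what makes the loop safe to run concurrently.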

Then, continue by parallelizing the loop that applies the stencil to all elements of a grid tile:

```fortran
  ! TODO: parallelize with do-concurrent
  do x = g%x_start,g%x_end
    do y = g%y_start, g%y_end
      ! Boundary conditions
      ! ...
      u_new(x, y) = stencil(u_old, x, y, p)
      energy = energy + u_new(x, y) * p%dx * p%dx
    end do
  end do
```

This loop:

* performs a reduction on `energy`, which requires the `reduce(op:variable)` locality specifier for correctness:
  ```fortran
  do concurrent (i = 1:n) reduce(+:s)
     ...
  end do
  ```
* calls a `function` (`stencil`):
  * All functions called from `do concurrent` loops must be `pure`, so we'll need to make that function and any function it calls `pure` (there are `! TODO`s for that).
  * A bug in `nvfortran` 23.5 currently requires manually annotating functions called from `do concurrent` loops as "device" routines. The simplest way to do that is to annotate them with the OpenACC `acc routine seq` directive:
  ```fortran
  function stencil(u_old, x, y, p) result(o)
    !$acc routine seq
    !...
  end function
  ```
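
The two points above can be combined in a small, self-contained sketch. It assumes a compiler that supports the Fortran 2023 `reduce` locality specifier. The module `avg_mod`, the function `avg3`, and the arrays `v` and `w` are hypothetical and unrelated to the exercise template; they only illustrate a `pure` function called from a `do concurrent` loop that also accumulates a sum.

```fortran
module avg_mod
  implicit none
contains
  ! pure: no side effects, so it may be called from do concurrent.
  pure function avg3(v, i) result(o)
    !$acc routine seq
    ! (the directive above follows the nvfortran 23.5 note; other
    !  compilers treat it as an ordinary comment)
    real, intent(in) :: v(:)
    integer, intent(in) :: i
    real :: o
    o = (v(i-1) + v(i) + v(i+1)) / 3.
  end function avg3
end module avg_mod

program dc_reduce_sketch
  use avg_mod
  implicit none
  integer, parameter :: n = 10   ! hypothetical problem size
  real :: v(n), w(n), s
  integer :: i

  v = 1.
  w = 0.
  s = 0.   ! initialize the reduction variable before the loop

  ! Each iteration writes only w(i); the sum goes through reduce.
  do concurrent (i = 2:n-1) reduce(+:s)
    w(i) = avg3(v, i)
    s = s + w(i)
  end do

  ! Eight interior points, each averaging to 1.
  if (abs(s - 8.) > 1.e-5) error stop "unexpected reduction result"
  print *, "s =", s
end program dc_reduce_sketch
```

Without `reduce(+:s)`, the concurrent updates of `s` would violate the independence requirement of `do concurrent`, and the result would be unreliable.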

The following compilation commands may be used:

In [None]:
!rm heat || true
!mpifort -Wall -Ofast -stdpar=gpu -cuda exercise1.f90 -o heat
!mpirun -np 1 ./heat 256 256 16000
visualize()

In [None]:
!rm heat || true
!mpifort -Wall -Ofast -stdpar=multicore exercise1.f90 -o heat
!mpirun -np 1 ./heat 256 256 16000
visualize()

## Solution Exercise 1

The [solutions/exercise1.f90](./solutions/exercise1.f90) solution can be tested with the following compilation commands. The problem size we've picked is very small. To see the benefits of GPU acceleration, pick a grid size of at least 8192x8192 or 16384x8192, but be mindful of other students if you are sharing computing resources.

In [None]:
!rm heat || true
!mpifort -Wall -Ofast -stdpar=gpu -cuda solutions/exercise1.f90 -o heat
!mpirun -np 1 ./heat 256 256 16000
visualize()

In [None]:
!rm heat || true
!mpifort -Wall -Ofast -stdpar=multicore solutions/exercise1.f90 -o heat
!mpirun -np 1 ./heat 256 256 16000
visualize()

# Exercise 2: Asynchrony

This mini-application uses three kernels:
* `inner`: for the inner tile of the grid, which does not depend on data from neighboring MPI ranks
* `prev` and `next`: for the boundaries of the grid, whose computation depends on exchanging a column with neighboring MPI ranks

NVIDIA is actively contributing to the standardization of asynchrony in ISO Fortran, e.g., [see this issue](https://github.com/j3-fortran/fortran_proposals/issues/271).

In this exercise, we use the OpenACC `acc kernels async` directive to make the `do concurrent` loops asynchronous, and wait for them to complete with the `acc wait` directive:

```fortran
!$acc kernels async
do concurrent (...)
  ! ...
end do
!$acc end kernels

!$acc kernels async
do concurrent (...)
  ! ...
end do
!$acc end kernels

!$acc wait
```

The [exercise2.f90](./exercise2.f90) template provides a starting point with a few `! TODO`s to achieve that. Build and run it with the following compilation commands.

In this exercise, we use a single MPI rank to fully overlap the three kernels.
In the next exercise, we will see how to overlap the kernels when host computation is involved during the overlap.

In [None]:
!rm heat || true
!mpifort -Wall -Ofast -stdpar=gpu -cuda exercise2.f90 -o heat
!mpirun -np 2 ./heat 256 256 16000
visualize()

In [None]:
!rm heat || true
!mpifort -Wall -Ofast -stdpar=multicore exercise2.f90 -o heat
!mpirun -np 2 ./heat 256 256 16000
visualize()

## Solution Exercise 2

The [solutions/exercise2.f90](./solutions/exercise2.f90) can be tested with the following compilation commands:

In [None]:
!rm heat || true
!mpifort -Wall -Ofast -stdpar=gpu -cuda solutions/exercise2.f90 -o heat
!mpirun -np 2 ./heat 256 256 16000
visualize()

In [None]:
!rm heat || true
!mpifort -Wall -Ofast -stdpar=multicore solutions/exercise2.f90 -o heat
!mpirun -np 2 ./heat 256 256 16000
visualize()

# Exercise 3: Overlapping communication and computation

In the previous exercise, we saw how to overlap three `do concurrent` loops using `acc kernels async`.

In this exercise, we will fully overlap the computation and communication using OpenMP tasks:

```fortran
!$omp parallel
!$omp master

!$omp task
do concurrent (...)
  ! ...
end do
!$omp end task

!$omp task
do concurrent (...)
  ! ...
end do
!$omp end task
    
!$omp end master
!$omp end parallel
```
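
The task pattern above can be tried in isolation with a small, self-contained sketch. The arrays `a` and `b` and the size `n` are hypothetical; the two loops stand in for independent pieces of work. The program assumes OpenMP is enabled at compile time. Without it, the directives are treated as comments and the loops simply run one after the other.

```fortran
program omp_task_sketch
  implicit none
  integer, parameter :: n = 1000   ! hypothetical problem size
  real :: a(n), b(n)
  integer :: i, j

  a = 0.
  b = 0.

  !$omp parallel
  !$omp master

  ! First task: independent work on array a.
  !$omp task
  do concurrent (i = 1:n)
    a(i) = 1.
  end do
  !$omp end task

  ! Second task: independent work on array b; it may run
  ! at the same time as the first task on another thread.
  !$omp task
  do concurrent (j = 1:n)
    b(j) = 2.
  end do
  !$omp end task

  !$omp end master
  !$omp end parallel

  ! Both tasks have completed once the parallel region has ended.
  if (any(a /= 1.) .or. any(b /= 2.)) error stop "task results missing"
  print *, "both tasks completed"
end program omp_task_sketch
```

Only the master thread creates the tasks; the other threads of the team pick them up, and the implicit barrier at the end of the `parallel` region guarantees that both have finished before the results are checked.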

Follow the `! TODO`s in the [exercise3.f90](./exercise3.f90) template using the following compilation commands:

In [None]:
!rm heat || true
!mpifort -Wall -Ofast -stdpar=gpu -cuda exercise3.f90 -o heat
!mpirun -np 2 ./heat 256 256 16000
visualize()

In [None]:
!rm heat || true
!mpifort -Wall -Ofast -stdpar=multicore exercise3.f90 -o heat
!mpirun -np 2 ./heat 256 256 16000
visualize()

## Solution Exercise 3

The [solutions/exercise3.f90](./solutions/exercise3.f90) can be tested using the following compilation commands:

In [None]:
!rm heat || true
!mpifort -Wall -Ofast -stdpar=gpu -cuda solutions/exercise3.f90 -o heat
!mpirun -np 2 ./heat 256 256 16000
visualize()

In [None]:
!rm heat || true
!mpifort -Wall -Ofast -stdpar=multicore solutions/exercise3.f90 -o heat
!mpirun -np 2 ./heat 256 256 16000
visualize()

We'll profile the mini-application with [Nsight Systems](https://developer.nvidia.com/nsight-systems) to verify that communication and computation overlap:

In [None]:
!nsys profile --trace=mpi,cuda,openmp,openacc,osrt --force-overwrite=true -o report.nsys-rep  mpirun -np 2 ./heat 16384 8192 3