Accelerating portable HPC Applications with ISO Fortran
===

# Lab 1: DAXPY

In this tutorial we will familiarize ourselves with the Fortran `do concurrent` feature by implementing Double-precision AX Plus Y (DAXPY): $A \cdot X + Y$, one of the main functions in the standard Basic Linear Algebra Subprograms (BLAS) library.

The operation is a combination of scalar multiplication and vector addition. It takes two vectors of 64-bit floats, `x` and `y`, and a scalar value `a`.
It multiplies each element `x(i)` by `a` and adds the result to `y(i)`.
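
As a quick illustration of the element-wise definition above, here is a minimal sketch on a four-element vector. The program name `daxpy_example`, the kind parameter `dp`, and the value `a = 3` are assumptions chosen for this example; they do not come from [daxpy.f90].

```fortran
program daxpy_example
  implicit none
  integer, parameter :: dp = kind(1.0d0)   ! double precision kind (illustrative name)
  real(dp) :: x(4), y(4)
  real(dp) :: a
  integer :: i

  a = 3.0_dp
  x = [(real(i, dp), i = 1, 4)]   ! x = 1, 2, 3, 4
  y = 2.0_dp                      ! y = 2, 2, 2, 2

  ! Whole-array form of y(i) = y(i) + a * x(i)
  y = y + a * x

  ! Mathematically, y is now 5, 8, 11, 14
  print *, y
end program daxpy_example
```

The whole-array expression is only used here to keep the example short; the lab itself works with explicit loops.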

A sequential working implementation is provided in [daxpy.f90].
Please take 2-3 minutes to skim through it.

## Validating solutions

For all the exercises, we assume that initially the values are `x(i) = i` and `y(i) = 2`.
The `check` function then verifies the effect of applying `daxpy` to these two vectors.
This check is always run exactly once.
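
To make the validation idea concrete: with `x(i) = i` and `y(i) = 2`, a single application of `daxpy` should leave `y(i) = 2 + a * i`. The sketch below checks exactly that. It is not the `check` function from the lab sources; the names `check_sketch` and `is_correct`, the tolerance, and the assumption of a single `daxpy` application are all illustrative.

```fortran
program check_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1000
  real(dp), parameter :: a = 2.0_dp
  real(dp) :: x(n), y(n)
  integer :: i

  ! Initial values used throughout the lab
  do i = 1, n
    x(i) = real(i, dp)
    y(i) = 2.0_dp
  end do

  ! One application of daxpy
  do i = 1, n
    y(i) = y(i) + a * x(i)
  end do

  if (.not. is_correct(y, a)) error stop "check failed"
  print *, "check passed"

contains

  ! Hypothetical helper: compares y against the closed form 2 + a * i
  logical function is_correct(y, a)
    real(dp), intent(in) :: y(:), a
    integer :: i
    is_correct = .true.
    do i = 1, size(y)
      if (abs(y(i) - (2.0_dp + a * real(i, dp))) > 1.0e-12_dp) is_correct = .false.
    end do
  end function is_correct

end program check_sketch
```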

## Sequential implementation

The "core" of the sequential implementation provided in [daxpy.f90] is split into two parts:

```fortran
! Initialize vectors `x` and `y`: raw loop sequential version
do i = 1, n
  x(i)  = i
  y(i)  = 2.
enddo
! daxpy
subroutine daxpy(x, y, n, a)
  real(kind=8), dimension(:) :: x, y
  real(kind=8) :: a
  integer :: n, i  
  ! TODO: use do concurrent here
  do i = 1, n
    y(i) = y(i) + a * x(i)
  enddo  
end subroutine 
```

We initialize the vectors to the `x(i) = i` and `y(i) = 2` expressions covered above for testing purposes.

The `daxpy` subroutine implements a loop over all vector elements, reading from both `x` and `y` and writing the solution to `y`.

[daxpy.f90]: ./daxpy.f90

## Exercise 1 - `do concurrent`

Let's start by compiling and running the [exercise1.f90](./exercise1.f90) template; the binary is invoked as `./daxpy nx niterations`:

In [None]:
!rm daxpy || true
!nvfortran -Wall -Wextra exercise1.f90 -o daxpy
!./daxpy 1000000 100

Fortran 2008 added support for `do concurrent`, which expresses that the loop iterations are independent of each other and can therefore be safely parallelized by the implementation.

For example, the following two-dimensional loop:

```fortran
do i = 1,ni
  do j = 1,nj
     if (A(i, j) >= 0) then
       cycle
     end if
     ...
  end do
end do
```

is rewritten with `do concurrent` as:

```fortran
do concurrent (i = 1:ni, j = 1:nj, A(i, j) < 0)
  ...
end do
```
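
The equivalence between the two forms above can be checked with a small self-contained program. This is a sketch under assumptions: the loop body (storing `-A(i, j)`), the array sizes, and the names `mask_example`, `B`, and `C` are invented for illustration and stand in for the elided `...`.

```fortran
program mask_example
  implicit none
  integer, parameter :: ni = 3, nj = 4
  real :: A(ni, nj), B(ni, nj), C(ni, nj)
  integer :: i, j

  ! Fill A with a mix of negative and non-negative values
  do j = 1, nj
    do i = 1, ni
      A(i, j) = real(i - j)
    end do
  end do
  B = 0.0
  C = 0.0

  ! Nested loops that skip non-negative entries with cycle
  do i = 1, ni
    do j = 1, nj
      if (A(i, j) >= 0) cycle
      B(i, j) = -A(i, j)
    end do
  end do

  ! Same computation: the mask A(i, j) < 0 selects the active iterations
  do concurrent (i = 1:ni, j = 1:nj, A(i, j) < 0)
    C(i, j) = -A(i, j)
  end do

  if (any(B /= C)) error stop "results differ"
  print *, "results match"
end program mask_example
```

Each iteration writes only its own element `C(i, j)`, so the iterations do not depend on each other, which is what `do concurrent` requires.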

In this exercise, you'll modify the sections indicated by `! TODO` comments in the [exercise1.f90](./exercise1.f90) template to parallelize the two loops using `do concurrent` as shown above.

The following compilation commands can be used to test the implementation with `gfortran` and `nvfortran`.
For both compilers, we enable optimizations using the `-Ofast` flag. 
For `nvfortran`, the `-stdpar=multicore` and `-stdpar=gpu` options auto-parallelize `do concurrent` loops on CPUs and GPUs, respectively.

In [None]:
!rm daxpy || true
!gfortran -Ofast -Wall -Wextra exercise1.f90 -o daxpy
!./daxpy 10000000 100

In [None]:
!rm daxpy || true
!nvfortran -Ofast -Wall -Wextra exercise1.f90 -o daxpy
!./daxpy 1000000 100

In [None]:
!rm daxpy || true
!nvfortran -Ofast -Wall -Wextra -stdpar=multicore exercise1.f90 -o daxpy
!./daxpy 10000000 100

In [None]:
!rm daxpy || true
!nvfortran -Ofast -Wall -Wextra -stdpar=gpu exercise1.f90 -o daxpy
!./daxpy 10000000 100

## Solution 1

The solution is available in [solutions/exercise1.f90](./solutions/exercise1.f90):

In [None]:
!rm daxpy || true
!gfortran -Ofast -Wall -Wextra solutions/exercise1.f90 -o daxpy
!./daxpy 10000000 100

In [None]:
!rm daxpy || true
!nvfortran -Ofast -Wall -Wextra solutions/exercise1.f90 -o daxpy
!./daxpy 10000000 100

In [None]:
!rm daxpy || true
!nvfortran -Ofast -Wall -Wextra -stdpar=multicore solutions/exercise1.f90 -o daxpy
!./daxpy 10000000 100

In [None]:
!rm daxpy || true
!nvfortran -Ofast -Wall -Wextra -stdpar=gpu solutions/exercise1.f90 -o daxpy
!./daxpy 10000000 100

## Exercise 2 - `do concurrent` locality specifiers

Fortran 2018 introduced the following locality specifiers:
* `default(none)`: requires every variable used in the loop to have an explicit locality specifier except for loop indices.
* `shared`: different iterations of the loop share the same variable memory
* `local`: every _iteration_ of the loop gets its own uninitialized private copy of the variable
* `local_init`: like `local`, but each private copy is initialized with the value the variable has outside the loop

Fortran 2023 introduces the following locality specifier:

* `reduce(op:variable)` (e.g. `reduce(+:sum)`): different iterations share the same variable memory and reduce to it with the given operation.
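
A minimal sketch of the `reduce` specifier, assuming a compiler that already implements this Fortran 2023 feature. The program name `reduce_example` and the variable `total` are illustrative. The input is all ones so that the sum is exact regardless of the order in which iterations are combined.

```fortran
program reduce_example
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1000
  real(dp) :: x(n), total
  integer :: i

  x = 1.0_dp
  total = 0.0_dp

  ! Every iteration contributes x(i) to the shared sum
  do concurrent (i = 1:n) reduce(+:total)
    total = total + x(i)
  end do

  ! Summing n ones is exact in floating point, so equality is safe here
  if (total /= real(n, dp)) error stop "unexpected sum"
  print *, total
end program reduce_example
```

Without the `reduce` specifier, updating `total` from every iteration would make the iterations depend on each other, which `do concurrent` does not allow.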

The locality specifiers are specified as part of the `do concurrent` loop:

```fortran
integer :: a, b, c

do concurrent(i = 1:ni, j = 1:nj) default(none) shared(a) local_init(b, c)
  ...
end do
```

Multiple variables can be specified within the same specifier, as shown by the usage of `local_init(b, c)` which specifies the locality for both variables.
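
The following sketch combines several specifiers in one small runnable loop. The kernel and the names `locality_example`, `factor`, and `tmp` are assumptions made for illustration; it also assumes a compiler with Fortran 2018 locality support.

```fortran
program locality_example
  implicit none
  integer, parameter :: n = 8
  real :: x(n), y(n), factor, tmp
  integer :: i

  x = 1.0
  factor = 2.0
  tmp = 0.0

  do concurrent (i = 1:n) default(none) shared(x, y) local_init(factor) local(tmp)
    tmp = factor * x(i)   ! tmp is private to each iteration
    y(i) = tmp + 1.0      ! each iteration writes a distinct element of the shared y
  end do

  if (any(y /= 3.0)) error stop "unexpected result"
  print *, "y = ", y
end program locality_example
```

Here `tmp` is `local` because every iteration assigns it before reading it, while `factor` is `local_init` because each iteration needs the value set before the loop.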

In this exercise, you'll modify the sections indicated by `! TODO` comments in the [exercise2.f90](./exercise2.f90) template to:
* add the `default(none)` specifier to the two `do concurrent` loops
* add the remaining locality specifiers for all other variables.

You can compile the [exercise2.f90](./exercise2.f90) template with the following compilation commands.

Note: since `gfortran` releases before GCC 15 do not support Fortran 2018 locality specifiers, we use `nvfortran` for this exercise.

In [None]:
!rm daxpy || true
!nvfortran -Ofast -Wall -Wextra exercise2.f90 -o daxpy
!./daxpy 10000000 100

In [None]:
!rm daxpy || true
!nvfortran -Ofast -Wall -Wextra -stdpar=multicore exercise2.f90 -o daxpy
!./daxpy 10000000 100

In [None]:
!rm daxpy || true
!nvfortran -Ofast -Wall -Wextra -stdpar=gpu exercise2.f90 -o daxpy
!./daxpy 10000000 100

## Solution 2

In [None]:
!rm daxpy || true
!nvfortran -Ofast -Wall -Wextra solutions/exercise2.f90 -o daxpy
!./daxpy 10000000 100

In [None]:
!rm daxpy || true
!nvfortran -Ofast -Wall -Wextra -stdpar=multicore solutions/exercise2.f90 -o daxpy
!./daxpy 10000000 100

In [None]:
!rm daxpy || true
!nvfortran -Ofast -Wall -Wextra -stdpar=gpu solutions/exercise2.f90 -o daxpy
!./daxpy 10000000 100