### Notebook 1: Compile and Run Benchmarks on Piz Daint

This notebook compiles all Fortran, C++, and CUDA source code. Once the executables are created, the various benchmarks are run. The executables write their results to binary files and the performance data is collected in text files. All resulting files are collected in the `data` folder. The next notebook will read the files collected there and continue with the analysis.

This is the only notebook that depends on the *Piz Daint* programming environment. All other notebooks can either run on *Piz Daint* or on any machine with Python installed.

For each of the executables, we run a reference configuration for validation. The parameters are

- `nx`=`ny`=128 and `nz`=64,
- `num_iter`=1024.

These values are arbitrary, and any other values would serve the same purpose.
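To avoid retyping these values, the reference configuration can be kept in one place and turned into argument lists for both executables. The following is only a sketch: the helper names `fortran_args` and `cpp_args` are hypothetical, and the option spellings are copied from the commands this notebook uses for the Fortran and C++ executables.

```python
# Reference configuration used for validation (values from the list above).
REFERENCE = {"nx": 128, "ny": 128, "nz": 64, "num_iter": 1024}


def fortran_args(cfg):
    """Long-option form, as passed to stencil2d_openacc.x in this notebook."""
    return [
        "--nx", str(cfg["nx"]),
        "--ny", str(cfg["ny"]),
        "--nz", str(cfg["nz"]),
        "--num_iter", str(cfg["num_iter"]),
    ]


def cpp_args(cfg):
    """Short-option form, as passed to stencil2d.x in this notebook."""
    return [
        f"-x{cfg['nx']}",
        f"-y{cfg['ny']}",
        f"-z{cfg['nz']}",
        f"-i{cfg['num_iter']}",
    ]


print(" ".join(fortran_args(REFERENCE)))
print(" ".join(cpp_args(REFERENCE)))
```

Changing a value in `REFERENCE` then changes it consistently for every validation run.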

During processing, the notebook prints short status messages. These make it easier to follow what happens when this notebook is run indirectly from the `main` notebook.

In [None]:
print('notebook_01: started.')

We first empty the output folder in case a previous run left some files behind. Each run rebuilds all files from scratch.

In [None]:
%%bash
rm -f ./data/*
mkdir -p ./data
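The same reset can be done in pure Python, which may be convenient when no shell is available. This is a minimal sketch, not part of the original workflow: `reset_dir` is a hypothetical helper, and unlike `rm -f ./data/*` it also removes hidden files. Subdirectories are left untouched.

```python
import tempfile
from pathlib import Path


def reset_dir(path):
    """Create `path` if needed and delete the regular files directly inside it."""
    folder = Path(path)
    folder.mkdir(parents=True, exist_ok=True)
    for entry in folder.iterdir():
        if entry.is_file():
            entry.unlink()
    return folder


# Demonstration on a throwaway directory; in this notebook the argument
# would be "./data".
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data"
    reset_dir(target)                                # creates the folder
    (target / "stale.txt").write_text("left over")   # simulate an old result
    reset_dir(target)                                # removes the stale file
    print(sorted(p.name for p in target.iterdir()))
```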

### Fortran with OpenACC

In [None]:
print('notebook_01: processing Fortran code ...')

First, we deal with the Fortran implementation. The source file `stencil2d_openacc.F90` is the same as the one used in the lecture, with some minor changes in how the output is formatted. Recall that the relevant loops are already annotated with OpenACC directives.

We compile the code with all optimizations and OpenACC, but without profiling enabled. Note that OpenACC is automatically enabled for the Cray Fortran compiler on *Piz Daint*.

In [None]:
%%bash
module switch PrgEnv-gnu PrgEnv-cray
cd fortran
ftn -O3 -eZ -c m_utils.F90
ftn -O3 -eZ -c stencil2d_openacc.F90
ftn -O3 -eZ m_utils.o stencil2d_openacc.o -o stencil2d_openacc.x

Next we run the executable with the reference values `nx=128`, `ny=128`, `nz=64` and `num_iter=1024`.

In [None]:
%%bash
srun ./fortran/stencil2d_openacc.x --nx 128 --ny 128 --nz 64 --num_iter 1024 > /dev/null

This run produced two files, `in_field.dat` and `out_field.dat`. We delete `in_field.dat` and move `out_field.dat`, which holds the resulting field, to the `data` folder for later use.

In [None]:
%%bash
rm in_field.dat
mv out_field.dat ./data/field_openacc.fld

Now we run the benchmark. We use the `scan` option, which scans over different field sizes and collects the performance measurements. The output is written to a text file in the `data` folder.

In [None]:
%%bash
srun ./fortran/stencil2d_openacc.x --scan --nz 64 --num_iter 128 > ./data/bench_openacc.txt

### C++ and CUDA

Now we compile and run the C++ code that contains the CUDA kernels. We are going to perform the same steps as we did for the Fortran code.

In [None]:
print('notebook_01: processing C++ code ...')

First, we compile the C++/CUDA code with all optimizations enabled.

In [None]:
%%bash
cd cpp
nvcc --generate-line-info -O3 -c stencil2d_kernels.cu stencil2d_occupancy.cu
CC -O3 stencil2d_common.cpp stencil2d_cuda.cpp stencil2d_host.cpp stencil2d_main.cpp stencil2d_kernels.o stencil2d_occupancy.o -o stencil2d.x

Now we run the code for the reference parameters `x=128`, `y=128` and `i=1024`.

The C++ code contains a pure C++ implementation of the 2D stencil. This implementation is not written for speed, but only for verification purposes. Since it is comparatively slow, we use only one z-component (`z=1`). This is good enough for our purposes, since the initialization and the computation are identical for every z-component. The `-h` option enables the *host* mode, in contrast to the *GPU* mode.

The resulting field is written to a file in the `data` folder, where it is kept for later use.

In [None]:
%%bash
./cpp/stencil2d.x -h -x128 -y128 -z1 -i1024 -f./data/field_cpp.fld > /dev/null

We also need to verify that our CUDA code is producing the correct results. We run the reference configuration again, but this time using the CUDA implementation instead of the pure C++ implementation. Now we can also use multiple z-components. This allows us to verify that the results are consistent across all z-components.

In [None]:
%%bash
./cpp/stencil2d.x -g -x128 -y128 -z64 -i1024 -f./data/field_cuda_shared.fld > /dev/null
./cpp/stencil2d.x -g -x128 -y128 -z64 -i1024 --noshared -f./data/field_cuda_noshared.fld > /dev/null

As a last step, we run the benchmark, where we scan over different field sizes and collect the performance measurements. To get stable timing measurements, each measurement is done with three runs (`r=3`). We have noticed that the first run is sometimes slower than the later ones, probably due to the GPU clock boost kicking in.

In [None]:
%%bash
./cpp/stencil2d.x -s -z64 -i128 -r3 -b4 > ./data/bench_cuda_shared04.txt
./cpp/stencil2d.x -s -z64 -i128 -r3 -b12 > ./data/bench_cuda_shared12.txt
./cpp/stencil2d.x -s -z64 -i128 -r3 -b28 > ./data/bench_cuda_shared28.txt
./cpp/stencil2d.x -s -z64 -i128 -r3 -b8 --noshared > ./data/bench_cuda_noshared08.txt
./cpp/stencil2d.x -s -z64 -i128 -r3 -b16 --noshared > ./data/bench_cuda_noshared16.txt
./cpp/stencil2d.x -s -z64 -i128 -r3 -b32 --noshared > ./data/bench_cuda_noshared32.txt
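The reasoning about repeated runs can be illustrated with a small sketch of how several timings might be reduced to one robust number when the first run is a slow warm-up. This is an assumption for illustration only: how `stencil2d.x` itself combines its three runs is not shown here, `summarize_timings` is a hypothetical helper, and the timings in the example are made up.

```python
import statistics


def summarize_timings(samples, discard_first=True):
    """Return (best, median) of repeated timings.

    If `discard_first` is set and more than one sample exists, the first
    sample is treated as a warm-up run and ignored.
    """
    if discard_first and len(samples) > 1:
        samples = samples[1:]
    return min(samples), statistics.median(samples)


# Made-up timings in seconds: the first run is slower than the later ones.
best, median = summarize_timings([0.120, 0.101, 0.103])
print(f"best={best:.3f}s median={median:.3f}s")
```

Reporting the minimum or the median instead of the mean keeps a single slow run from distorting the result.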

### Cleanup

Except for the result files in the `data` folder, we leave a clean workspace behind.

In [None]:
print('notebook_01: cleaning up ...')

We clean up the workspace and delete all executables and intermediate files created during compilation.

In [None]:
%%bash
rm -f ./cpp/*.o
rm -f ./fortran/*.mod ./fortran/*.cub ./fortran/*.ptx ./fortran/*.i ./fortran/*.o
rm -f ./cpp/*.x ./fortran/*.x

### Report Results

We tell the `main` notebook that everything went well and list all the files that were created.

In [None]:
print('notebook_01: completed.')
print('notebook_01: the following files have been created:')

The `data` folder now contains the field files and the benchmark files produced above. We list them here.

In [None]:
%%bash
ls ./data/*
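Beyond listing the folder, one could also check that every expected file exists and is non-empty before the next notebook starts. The sketch below assumes the file names used in the cells above; `missing_or_empty` is a hypothetical helper and does not inspect the file contents.

```python
from pathlib import Path

# File names written to the data folder by the cells in this notebook.
EXPECTED_FILES = [
    "field_openacc.fld",
    "field_cpp.fld",
    "field_cuda_shared.fld",
    "field_cuda_noshared.fld",
    "bench_openacc.txt",
    "bench_cuda_shared04.txt",
    "bench_cuda_shared12.txt",
    "bench_cuda_shared28.txt",
    "bench_cuda_noshared08.txt",
    "bench_cuda_noshared16.txt",
    "bench_cuda_noshared32.txt",
]


def missing_or_empty(folder, names=EXPECTED_FILES):
    """Return the names that are absent from `folder` or have size zero."""
    folder = Path(folder)
    return [
        name for name in names
        if not (folder / name).is_file() or (folder / name).stat().st_size == 0
    ]


problems = missing_or_empty("./data")
if problems:
    print(f"notebook_01: missing or empty files: {problems}")
else:
    print("notebook_01: all expected files are present.")
```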