# Analyzing a Fortran Stencil Program

## Understanding

Open the <tt>stencil_2d.F90</tt> Fortran program by double-clicking the file in the File Browser on the left (if the browser is hidden, click the Folder icon).

![open-stencil2d](img/open-stencil2d.png)

If you prefer, you can also open a new Terminal (File -> New -> Terminal), navigate to the directory that contains <tt>stencil_2d.F90</tt>, and use your favorite Linux editor (e.g. vim) to browse the Fortran source code. This will give you better syntax highlighting.

<div class="alert alert-block alert-info">
<b>Now it's your turn...</b><br>
<b>1.</b> Read the code and try to understand what the program is doing.<br>
<b>2.</b> There are two global variables of type <tt>integer</tt> named <tt>flop_counter</tt> and <tt>byte_counter</tt>. Change the program to count the number of floating-point operations and bytes transferred to/from memory.<br>
</div>
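
As a rough illustration of what counting means for task 2, here is a minimal Python sketch. It does not reproduce the stencil in <tt>stencil_2d.F90</tt>: the 5-point Laplacian, the function name <tt>laplacian_with_counters</tt>, and the 4-byte value size are assumptions made for this example only. The idea carries over to the Fortran code: next to each stencil update, add the number of floating-point operations and the number of bytes read and written for that update.

```python
BYTES_PER_VALUE = 4  # assumption: single-precision values


def laplacian_with_counters(field):
    """Apply a 5-point Laplacian to the interior of a 2D list of floats.

    Returns (out, flop_counter, byte_counter). The stencil is a
    hypothetical stand-in, not the one used in stencil_2d.F90.
    """
    ny = len(field)
    nx = len(field[0])
    out = [[0.0] * nx for _ in range(ny)]
    flop_counter = 0
    byte_counter = 0
    for j in range(1, ny - 1):
        for i in range(1, nx - 1):
            out[j][i] = (
                -4.0 * field[j][i]
                + field[j][i - 1]
                + field[j][i + 1]
                + field[j - 1][i]
                + field[j + 1][i]
            )
            # 1 multiplication + 4 additions per gridpoint
            flop_counter += 5
            # 5 values read + 1 value written per gridpoint (ignores caching)
            byte_counter += 6 * BYTES_PER_VALUE
    return out, flop_counter, byte_counter


# Small usage example on a 4 x 4 field f(i, j) = i^2 + j^2
field = [[float(i * i + j * j) for i in range(4)] for j in range(4)]
out, flops, nbytes = laplacian_with_counters(field)
print("FLOP:", flops, "bytes:", nbytes)
```

Counting every read and write as a memory transfer, as done here, ignores reuse of values that are already in cache, so such a counter gives an upper estimate of the real memory traffic.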

## Compiling

On Piz Daint, the programming environment is managed using so-called modules. We need to load the right modules in order to compile our program. We are going to use the Cray Fortran compiler (<tt>PrgEnv-cray</tt>) for this exercise.

In [2]:
module load daint-gpu
module switch PrgEnv-gnu PrgEnv-cray
module load perftools-lite

There is a <tt>Makefile</tt> which contains the instructions for compiling our stencil program. <tt>make clean</tt> removes any artefacts left over from previous compilations or runs of our program.

In [3]:
make clean
make

rm -f -rf *~ *.o *.mod *.MOD *.i *.x *.x+orig *.x+[0-9]* core.* *.out
ftn -eZ -ffree -N255 -ec -eC -eI -eF -c m_utils.F90
ftn -eZ -ffree -N255 -ec -eC -eI -eF -c stencil_2d.F90
ftn -eZ -ffree -N255 -ec -eC -eI -eF m_utils.o stencil_2d.o -o stencil_2d.x
INFO: creating the CrayPat-instrumented executable 'stencil_2d.x' (lite-samples) ...OK


## Running

We can run our program on all 12 cores of the Xeon E5-2690 v3 Haswell CPU that we have available using the <tt>srun</tt> command. The command line arguments <tt>nx, ny, nz, num_iter</tt> specify the size of the computational domain as well as the number of iterations.

In [4]:
srun -n 12 ./stencil_2d.x+orig -nx 128 -ny 128 -nz 64 -num_iter 1024

 Total number of GigaFLOP =  121.8896484375
 Total number of GByte transferred =  681.07177734375
 --------------------------------------------------------------------------
  Timers:
   number of workers =   12
 --------------------------------------------------------------------------
  Id      Tag                     #calls        min[s]        max[s]       mean[s] 
   1      work                         1        8.8140        8.8773        8.8514
 --------------------------------------------------------------------------
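
The numbers in this output can also be extracted in Python rather than read off by hand. The following is a minimal sketch that assumes the output has exactly the layout shown above; the function name <tt>parse_stencil_output</tt> and the returned dictionary keys are choices made for this example.

```python
import re

# Relevant lines copied from the sample output above
SAMPLE_OUTPUT = """\
 Total number of GigaFLOP =  121.8896484375
 Total number of GByte transferred =  681.07177734375
   1      work                         1        8.8140        8.8773        8.8514
"""


def parse_stencil_output(text):
    """Extract GFLOP, GByte and the 'work' timer values from the program output."""
    gflop = re.search(r"GigaFLOP\s*=\s*(\S+)", text)
    # Accept both "GByte" and "GBytes" in the label
    gbyte = re.search(r"GBytes?\s+transferred\s*=\s*(\S+)", text)
    work = re.search(
        r"^\s*\d+\s+work\s+(\d+)\s+(\S+)\s+(\S+)\s+(\S+)\s*$", text, re.MULTILINE
    )
    if gflop is None or gbyte is None or work is None:
        raise ValueError("unexpected output format")
    return {
        "gflop": float(gflop.group(1)),
        "gbyte": float(gbyte.group(1)),
        "calls": int(work.group(1)),
        "min_s": float(work.group(2)),
        "max_s": float(work.group(3)),
        "mean_s": float(work.group(4)),
    }


result = parse_stencil_output(SAMPLE_OUTPUT)
print(result)
```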


<div class="alert alert-block alert-info">
<b>Now it's your turn...</b><br>
<b>3.</b> Compute the arithmetic intensity $I$ of our stencil program using the flop and byte counters you have introduced.<br>
<b>4.</b> Compute the performance in GFLOP/s of our stencil program. Is our program memory bound or compute bound? What percentage of peak FLOP/s and peak memory bandwidth do we achieve?<br>
<b>5.</b> Run the stencil program for different <tt>nx</tt> and <tt>ny</tt> (see below). Import the data into a Python notebook and make a log-log plot of the working set size $n = nx \times ny \times nz \times 4$ in MB versus the runtime per gridpoint $r = \mathrm{runtime} \, / \, nx \, / \, ny \, / \, nz$ in $\mu s$. What would you expect for a von Neumann architecture? What might be the reason that the behavior is different?
</div>
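
The arithmetic for tasks 3 and 4 can be sketched in a few lines of Python. The values below are copied from the sample run above; your own counters may give different numbers. Using the mean runtime of the <tt>work</tt> timer (rather than the maximum) is a choice made for this sketch, and the helper names are not part of the exercise material. The peak values needed for the percentage are not given in this text and have to be looked up for the hardware.

```python
# Values copied from the sample run above (nx = ny = 128, nz = 64, num_iter = 1024)
GFLOP = 121.8896484375
GBYTE = 681.07177734375
RUNTIME_S = 8.8514  # mean[s] of the "work" timer


def arithmetic_intensity(gflop, gbyte):
    """Arithmetic intensity I in FLOP per byte."""
    return gflop / gbyte


def rate(amount, runtime_s):
    """Amount of work (GFLOP or GByte) per second."""
    return amount / runtime_s


def percent_of_peak(achieved, peak):
    """Achieved rate as a percentage of a peak rate supplied by the reader."""
    return 100.0 * achieved / peak


intensity = arithmetic_intensity(GFLOP, GBYTE)
gflops_per_s = rate(GFLOP, RUNTIME_S)
gbytes_per_s = rate(GBYTE, RUNTIME_S)

print(f"I = {intensity:.3f} FLOP/byte")
print(f"performance = {gflops_per_s:.2f} GFLOP/s")
print(f"counted memory traffic = {gbytes_per_s:.2f} GB/s")
```

To obtain percentages, call <tt>percent_of_peak(gflops_per_s, peak_gflops)</tt> and <tt>percent_of_peak(gbytes_per_s, peak_gbytes)</tt> with peak values taken from the processor specification.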

In [14]:
echo "# nx ny nz GFLOP GByte runtime"
echo "data = np.array( [ \\"
nz=64
for nx in 16 32 48 64 96 128 ; do
for ny in 16 32 48 64 96 128 ; do
    sleep 2
    # capture both stdout and stderr of the run in a per-shell temporary file
    srun -n 12 ./stencil_2d.x+orig -nx ${nx} -ny ${ny} -nz ${nz} -num_iter 1024 > /tmp/stencil_2d.$$.out 2>&1
    gflop=$(grep "GigaFLOP" /tmp/stencil_2d.$$.out | awk '{print $6}')
    gbyte=$(grep "GByte" /tmp/stencil_2d.$$.out | awk '{print $7}')
    runtime=$(grep " work " /tmp/stencil_2d.$$.out | awk '{print $6}')
    echo "[${nx}, ${ny}, ${nz}, ${gflop}, ${gbyte}, ${runtime}], \\"
    # remove only the temporary file created by this shell
    /bin/rm -f /tmp/stencil_2d.$$.out 2>/dev/null 1>/dev/null
done
done
echo "] )"

nx ny nz GFLOP GByte runtime
[ \
[16, 16, 64, 2.1240234375, 11.70703125, 0.0422], \
[16, 32, 64, 4.1162109375, 22.77392578125, 0.0813], \
[16, 48, 64, 6.1083984375, 33.8408203125, 0.1179], \
[16, 64, 64, 8.1005859375, 44.90771484375, 0.1547], \
[16, 96, 64, 12.0849609375, 67.04150390625, 0.2273], \
[16, 128, 64, 16.0693359375, 89.17529296875, 0.3015], \
[32, 16, 64, 4.1162109375, 22.77392578125, 0.0646], \
[32, 32, 64, 7.9833984375, 44.33935546875, 0.1194], \
[32, 48, 64, 11.8505859375, 65.90478515625, 0.1726], \
[32, 64, 64, 15.7177734375, 87.47021484375, 0.2262], \
[32, 96, 64, 23.4521484375, 130.60107421875, 0.6715], \
[32, 128, 64, 31.1865234375, 173.73193359375, 1.8501], \
[48, 16, 64, 6.1083984375, 33.8408203125, 0.0878], \
[48, 32, 64, 11.8505859375, 65.90478515625, 0.1605], \
[48, 48, 64, 17.5927734375, 97.96875, 0.2347], \
[48, 64, 64, 23.3349609375, 130.03271484375, 0.6159], \
[48, 96, 64, 34.8193359375, 194.16064453125, 2.1668], \
[48, 128, 64, 46.3037109375, 258.28857421875

![runtime_vs_size](img/runtime_vs_size.png)
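
A plot like the one above could be produced along the following lines. This is a minimal sketch, not the script used to create the figure: it uses only the complete rows printed above, takes 1 MB as $10^6$ bytes, and the function names are choices made for this example. Matplotlib is imported only inside <tt>plot_runtime</tt>, so the conversion functions work without it.

```python
# Complete rows copied from the output above: nx, ny, nz, GFLOP, GByte, runtime [s]
ROWS = [
    (16, 16, 64, 2.1240234375, 11.70703125, 0.0422),
    (16, 32, 64, 4.1162109375, 22.77392578125, 0.0813),
    (16, 48, 64, 6.1083984375, 33.8408203125, 0.1179),
    (16, 64, 64, 8.1005859375, 44.90771484375, 0.1547),
    (16, 96, 64, 12.0849609375, 67.04150390625, 0.2273),
    (16, 128, 64, 16.0693359375, 89.17529296875, 0.3015),
    (32, 16, 64, 4.1162109375, 22.77392578125, 0.0646),
    (32, 32, 64, 7.9833984375, 44.33935546875, 0.1194),
    (32, 48, 64, 11.8505859375, 65.90478515625, 0.1726),
    (32, 64, 64, 15.7177734375, 87.47021484375, 0.2262),
    (32, 96, 64, 23.4521484375, 130.60107421875, 0.6715),
    (32, 128, 64, 31.1865234375, 173.73193359375, 1.8501),
    (48, 16, 64, 6.1083984375, 33.8408203125, 0.0878),
    (48, 32, 64, 11.8505859375, 65.90478515625, 0.1605),
    (48, 48, 64, 17.5927734375, 97.96875, 0.2347),
    (48, 64, 64, 23.3349609375, 130.03271484375, 0.6159),
    (48, 96, 64, 34.8193359375, 194.16064453125, 2.1668),
]

BYTES_PER_VALUE = 4  # the factor 4 in the working set formula of task 5


def working_set_mb(nx, ny, nz):
    """Working set size n = nx * ny * nz * 4 in MB (assuming 1 MB = 1e6 bytes)."""
    return nx * ny * nz * BYTES_PER_VALUE / 1.0e6


def runtime_per_gridpoint_us(nx, ny, nz, runtime_s):
    """Runtime per gridpoint r = runtime / nx / ny / nz in microseconds."""
    return runtime_s / nx / ny / nz * 1.0e6


def to_points(rows):
    """Convert raw rows to (working set [MB], runtime per gridpoint [us]) pairs."""
    return sorted(
        (working_set_mb(nx, ny, nz), runtime_per_gridpoint_us(nx, ny, nz, t))
        for nx, ny, nz, _gflop, _gbyte, t in rows
    )


def plot_runtime(points):
    """Draw a log-log scatter plot of the converted points (needs matplotlib)."""
    import matplotlib.pyplot as plt

    x, y = zip(*points)
    fig, ax = plt.subplots()
    ax.loglog(x, y, "o")
    ax.set_xlabel("working set size [MB]")
    ax.set_ylabel("runtime per gridpoint [µs]")
    return fig


points = to_points(ROWS)
for mb, us in points:
    print(f"{mb:8.4f} MB  {us:8.4f} us")
```

In a notebook, calling <tt>plot_runtime(points)</tt> in a cell displays the figure.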

## Performance Analysis Tool (perftools-lite)

We can also use a performance analysis tool from Cray named <tt>perftools-lite</tt> to analyze the performance of our stencil program. In fact, our program has already been compiled for performance analysis with <tt>perftools-lite</tt> since we have loaded the corresponding module. <tt>stencil_2d.x+orig</tt> is the original executable without instrumentation for performance analysis and <tt>stencil_2d.x</tt> is an executable specifically prepared for performance analysis.

<div class="alert alert-block alert-info">
<b>Now it's your turn...</b><br>
<b>6.</b> Run your program with <tt>perftools-lite</tt>. Read the report generated by <tt>perftools-lite</tt> carefully and compare the memory transfer numbers against your values. By which factor are you off?<br>
<b>7.</b> Run again with $nx = 64$ and $ny = 32$. Did the factor by which you are off change? Is this consistent with the findings above?<br>
</div>

In [25]:
srun -n 12 ./stencil_2d.x -nx 128 -ny 128 -nz 64 -num_iter 1024

CrayPat/X:  Version 7.1.1 Revision 7c0ddd79b  08/19/19 16:58:46
 Total number of GigaFLOP =  121.8896484375
 Total number of GBytes transferred =  681.07177734375
 --------------------------------------------------------------------------
  Timers:
   number of workers =   12
 --------------------------------------------------------------------------
  Id      Tag                     #calls        min[s]        max[s]       mean[s] 
   1      work                         1        8.8329        8.8961        8.8599
 --------------------------------------------------------------------------

#################################################################
#                                                               #
#            CrayPat-lite Performance Statistics                #
#                                                               #
#################################################################

CrayPat/X:  Version 7.1.1 Revision 7c0ddd79b  08/19/19 16:58:46
Experime