# Analyzing a Fortran Stencil Program

## Understanding

Open the <tt>stencil_2d.F90</tt> Fortran program by double-clicking on the file in the File Browser on the left (if the File Browser is hidden, click on the Folder icon).

![open-stencil2d](img/open-stencil2d.png)

If you prefer, you can also open a new Terminal (File -> New -> Terminal), navigate to the directory containing <tt>stencil_2d.F90</tt> and use your favorite Linux editor (e.g. vim) to browse the Fortran source code. This will give you better syntax highlighting.

<div class="alert alert-block alert-info">
<b>Now it's your turn...</b><br>
<b>1.</b> Read the code of <tt>stencil_2d.F90</tt> and understand what the program is doing.<br>
<b>2.</b> Compile the code and run it (see below).<br>
<b>3.</b> There are two global variables of type <tt>integer</tt> named <tt>flop_counter</tt> and <tt>byte_counter</tt>. Change the program to count the number of floating-point operations and bytes transferred to/from memory.<br>
</div>
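To illustrate what counting operations and bytes means, here is a minimal Python sketch. It does not reproduce <tt>stencil_2d.F90</tt>: the function <tt>laplacian_with_counters</tt>, the 5-point Laplacian it applies, and the assumption of 4 bytes per value are hypothetical and chosen only to show the bookkeeping. The idea carries over to the Fortran loops: every time a statement executes, add its floating-point operations to <tt>flop_counter</tt> and its memory reads and writes to <tt>byte_counter</tt>.

```python
def laplacian_with_counters(in_field, bytes_per_value=4):
    """Apply a 5-point Laplacian to the interior of a 2D field.

    Hypothetical illustration only: returns the result together with
    the number of floating-point operations and bytes it counted.
    """
    ny = len(in_field)
    nx = len(in_field[0])
    out_field = [[0.0] * nx for _ in range(ny)]
    flop_counter = 0
    byte_counter = 0
    for j in range(1, ny - 1):
        for i in range(1, nx - 1):
            out_field[j][i] = (
                -4.0 * in_field[j][i]
                + in_field[j][i - 1]
                + in_field[j][i + 1]
                + in_field[j - 1][i]
                + in_field[j + 1][i]
            )
            # 1 multiplication + 4 additions per grid point
            flop_counter += 5
            # 5 values read + 1 value written per grid point
            byte_counter += 6 * bytes_per_value
    return out_field, flop_counter, byte_counter


# A 4 x 4 field has 2 x 2 = 4 interior points.
field = [[float(i * i + j * j) for i in range(4)] for j in range(4)]
result, flops, nbytes = laplacian_with_counters(field)
print(flops, nbytes)  # prints "20 96"
```

This naive count charges every read and write to memory, whether or not the value is already in cache.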

## Compiling

On Piz Daint, the programming environment is managed using so-called modules. We need to load the right modules in order to compile our program. We are going to use the Cray Fortran compiler (<tt>PrgEnv-cray</tt>) for this exercise.

In [1]:
module load daint-gpu
module switch PrgEnv-gnu PrgEnv-cray
module load perftools-lite

There is a <tt>Makefile</tt> which contains instructions for compiling our stencil program. <tt>make clean</tt> removes any artefacts left over from previous compilations or runs of our program.

In [2]:
make clean
make

rm -f -rf *~ *.o *.mod *.MOD *.i *.x *.x+orig *.x+[0-9]* core.* *.out
ftn -eZ -ffree -N255 -ec -eC -eI -eF -c m_utils.F90
ftn -eZ -ffree -N255 -ec -eC -eI -eF -c stencil_2d.F90
ftn -eZ -ffree -N255 -ec -eC -eI -eF m_utils.o stencil_2d.o -o stencil_2d.x
INFO: creating the CrayPat-instrumented executable 'stencil_2d.x' (lite-samples) ...OK


## Running

We can run our program on all 12 cores of the Xeon E5-2690 v3 Haswell CPU that we have available using the <tt>srun</tt> command. The command line arguments <tt>nx, ny, nz, num_iter</tt> specify the size of the computational domain as well as the number of iterations.

In [3]:
srun -n 12 ./stencil_2d.x+orig -nx 128 -ny 128 -nz 64 -num_iter 1024

 Counted floating-point operations [GFLOP] =  25.904525756835938
 Counted memory transfers [GB] =  255.2354736328125
 --------------------------------------------------------------------------
  Timers:
   number of workers =   12
 --------------------------------------------------------------------------
  Id      Tag                     #calls        min[s]        max[s]       mean[s] 
   1      work                         1       11.3525       11.4692       11.4283
 --------------------------------------------------------------------------
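If you want to post-process this output in Python rather than with shell tools, the three numbers of interest can be pulled out with regular expressions. The following is a sketch only: the function name <tt>parse_stencil_output</tt> is made up for this example, and the patterns assume the output format shown above.

```python
import re

# Relevant lines copied from the program output shown above.
SAMPLE_OUTPUT = """\
 Counted floating-point operations [GFLOP] =  25.904525756835938
 Counted memory transfers [GB] =  255.2354736328125
  Id      Tag                     #calls        min[s]        max[s]       mean[s]
   1      work                         1       11.3525       11.4692       11.4283
"""


def parse_stencil_output(text):
    """Extract GFLOP, GB and the timings of the 'work' timer from the output."""
    gflop = re.search(r"\[GFLOP\]\s*=\s*([0-9.eE+-]+)", text)
    gbyte = re.search(r"\[GB\]\s*=\s*([0-9.eE+-]+)", text)
    work = re.search(
        r"^\s*\d+\s+work\s+\d+\s+(\S+)\s+(\S+)\s+(\S+)\s*$", text, re.MULTILINE
    )
    if gflop is None or gbyte is None or work is None:
        raise ValueError("output does not have the expected format")
    return {
        "gflop": float(gflop.group(1)),
        "gbyte": float(gbyte.group(1)),
        "min_s": float(work.group(1)),
        "max_s": float(work.group(2)),
        "mean_s": float(work.group(3)),
    }


parsed = parse_stencil_output(SAMPLE_OUTPUT)
print(parsed["gflop"], parsed["gbyte"], parsed["mean_s"])
```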


<div class="alert alert-block alert-info">
<b>Now it's your turn...</b><br>
<b>3.</b> Compute the arithmetic intensity $I$ of our stencil program using the flop and byte counters you have introduced.<br>
<b>4.</b> Compute the performance in GFLOP/s of our stencil program. Is our program memory bound or compute bound? Which % of peak FLOP/s and memory bandwidth do we achieve?<br>
<b>5.</b> Run the stencil program for different <tt>nx</tt> and <tt>ny</tt> (see below). Import the data into a Python notebook and make a log-log plot of the runtime per gridpoint $r = \mathrm{runtime} \, / \, nx \, / \, ny \, / \, nz$ in $\mu s$ versus the working set size $n = nx \times ny \times nz \times 3$ (three fields, converted to MB using the size in bytes of one value). What would you expect for a von Neumann architecture? What might be the reason that the behavior is different?
</div>

The arithmetic intensity of the code is $I \approx 0.1$ FLOP/byte and can be computed as follows:<br>
<tt><font color="gray">
gflop_counter = 25.904525756835938<br>
gb_counter = 255.2354736328125<br>
arithmetic_intensity = gflop_counter / gb_counter = 0.1<br>
</font></tt>

The code is memory-bandwidth bound (we achieve 35% of the theoretical peak memory bandwidth, but less than 1% of the theoretical peak floating-point performance). The percentages of theoretical peak values can be computed as follows:<br>
<tt><font color="gray">
runtime = 11.4283<br>
performance_in_gflops = gflop_counter / runtime = 2.266699837844293<br>
bandwidth_in_gbs = gb_counter / runtime = 22.333634366687303<br>
performance_percent_peak = performance_in_gflops / peak_performance_in_gflops = 0.91%<br>
bandwidth_percent_peak = bandwidth_in_gbs / peak_bandwidth_in_gbs = 35%<br>
</font></tt>
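The same calculation can be written as a small Python function. This is a sketch: the measured values are taken from the run above, but <tt>PEAK_GFLOPS</tt> and <tt>PEAK_GBS</tt> are round placeholder values chosen only so that the example runs. They are assumptions, not values from the course material, so substitute the theoretical peaks of your own node.

```python
def roofline_summary(gflop, gbyte, runtime_s, peak_gflops, peak_gbs):
    """Compute arithmetic intensity, achieved rates and percent of peak."""
    intensity = gflop / gbyte          # FLOP per byte
    perf = gflop / runtime_s           # achieved GFLOP/s
    bandwidth = gbyte / runtime_s      # achieved GB/s
    machine_balance = peak_gflops / peak_gbs
    return {
        "arithmetic_intensity": intensity,
        "performance_in_gflops": perf,
        "bandwidth_in_gbs": bandwidth,
        "performance_percent_peak": 100.0 * perf / peak_gflops,
        "bandwidth_percent_peak": 100.0 * bandwidth / peak_gbs,
        # Below the machine balance the memory system is the limiting factor.
        "memory_bound": intensity < machine_balance,
    }


# Measured values from the run above.
GFLOP_COUNTER = 25.904525756835938
GB_COUNTER = 255.2354736328125
RUNTIME_S = 11.4283

# ASSUMPTION: placeholder peak values, replace with those of your hardware.
PEAK_GFLOPS = 250.0
PEAK_GBS = 64.0

summary = roofline_summary(GFLOP_COUNTER, GB_COUNTER, RUNTIME_S, PEAK_GFLOPS, PEAK_GBS)
for name, value in summary.items():
    print(name, value)
```

Comparing the arithmetic intensity with the machine balance (peak FLOP/s divided by peak bandwidth) is the usual roofline-style test for whether a kernel is memory bound or compute bound.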

In [5]:
echo "# nx ny nz perf[GFLOP] mem[GB] runtime[s]"
echo "data = np.array( [ \\"
nz=64
for nx in 16 32 48 64 96 128 ; do
for ny in 16 32 48 64 96 128 ; do
    sleep 2
    # capture stdout and stderr of the run in a temporary file unique to this shell
    srun -n 12 ./stencil_2d.x+orig -nx ${nx} -ny ${ny} -nz ${nz} -num_iter 1024 > /tmp/stencil_2d.$$.out 2>&1
    # field 6 holds the value on the counter lines and mean[s] on the timer line
    gflop=`grep "GFLOP" /tmp/stencil_2d.$$.out | awk '{print $6}' | tr -d '\n'`
    gbyte=`grep "GB" /tmp/stencil_2d.$$.out | awk '{print $6}' | tr -d '\n'`
    runtime=`grep " work " /tmp/stencil_2d.$$.out | awk '{print $6}' | tr -d '\n'`
    echo "[${nx}, ${ny}, ${nz}, ${gflop}, ${gbyte}, ${runtime}], \\"
    # remove only our own temporary file
    /bin/rm -f /tmp/stencil_2d.$$.out 2>/dev/null 1>/dev/null
done
done
echo "] )"

# nx ny nz perf[GFLOP] mem[GB] runtime[s]
data = np.array( [ \
[16, 16, 64, 0.4081878662109375, 4.8453369140625, 0.0386], \
[16, 32, 64, 0.8143157958984375, 9.1182861328125, 0.0831], \
[16, 48, 64, 1.2204437255859375, 13.3912353515625, 0.1235], \
[16, 64, 64, 1.6265716552734375, 17.6641845703125, 0.1690], \
[16, 96, 64, 2.4388275146484375, 26.2100830078125, 0.2509], \
[16, 128, 64, 3.2510833740234375, 34.7559814453125, 0.3311], \
[32, 16, 64, 0.8143157958984375, 9.1182861328125, 0.0708], \
[32, 32, 64, 1.6247406005859375, 17.2803955078125, 0.1367], \
[32, 48, 64, 2.4351654052734375, 25.4425048828125, 0.2038], \
[32, 64, 64, 3.2455902099609375, 33.6046142578125, 0.2677], \
[32, 96, 64, 4.8664398193359375, 49.9288330078125, 0.5550], \
[32, 128, 64, 6.4872894287109375, 66.2530517578125, 1.9840], \
[48, 16, 64, 1.2204437255859375, 13.3912353515625, 0.0983], \
[48, 32, 64, 2.4351654052734375, 25.4425048828125, 0.1899], \
[48, 48, 64, 3.6498870849609375, 37.4937744140625, 0.2787], \
[48, 64,

![runtime_vs_size](img/runtime_vs_size.png)
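A plot like the one above can be produced along the following lines. This is a sketch under stated assumptions: the helper names are made up for this example, only the first three rows of the data are copied in, values are assumed to be float32 (4 bytes), 1 MB is taken as $2^{20}$ bytes, and <tt>matplotlib</tt> is assumed to be installed if you call <tt>plot_loglog</tt>.

```python
# (nx, ny, nz, runtime[s]) taken from the first three rows of the data above.
ROWS = [
    (16, 16, 64, 0.0386),
    (16, 32, 64, 0.0831),
    (16, 48, 64, 0.1235),
]


def runtime_per_gridpoint_us(nx, ny, nz, runtime_s):
    """Runtime per grid point r = runtime / nx / ny / nz in microseconds."""
    return runtime_s / (nx * ny * nz) * 1.0e6


def working_set_mb(nx, ny, nz, num_fields=3, bytes_per_value=4):
    """Working set size of num_fields fields in MB (1 MB = 2**20 bytes)."""
    return nx * ny * nz * num_fields * bytes_per_value / 2**20


def plot_loglog(rows, filename="runtime_vs_size.png"):
    """Save a log-log plot of runtime per grid point versus working set size."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    sizes = [working_set_mb(nx, ny, nz) for nx, ny, nz, _ in rows]
    times = [runtime_per_gridpoint_us(nx, ny, nz, t) for nx, ny, nz, t in rows]
    plt.loglog(sizes, times, "o")
    plt.xlabel("working set size [MB]")
    plt.ylabel("runtime per gridpoint [us]")
    plt.savefig(filename)


for nx, ny, nz, runtime in ROWS:
    print(working_set_mb(nx, ny, nz), runtime_per_gridpoint_us(nx, ny, nz, runtime))
```

Calling <tt>plot_loglog(ROWS)</tt> with the complete data set writes the figure to a PNG file. On a machine without caches the runtime per gridpoint would be independent of the working set size, so deviations from a flat line point to the memory hierarchy.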

## Performance Analysis Tool (perftools-lite)

We can also use a performance analysis tool from Cray named <tt>perftools-lite</tt> to analyze the performance of our stencil program. In fact, our program has already been compiled for performance analysis with <tt>perftools-lite</tt> since we have loaded the corresponding module. <tt>stencil_2d.x+orig</tt> is the original executable without instrumentation for performance analysis and <tt>stencil_2d.x</tt> is an executable specifically prepared for performance analysis.

<div class="alert alert-block alert-info">
<b>Now it's your turn...</b><br>
<b>6.</b> Run your program with <tt>perftools-lite</tt>. Read the report generated by <tt>perftools-lite</tt> carefully. What can you learn from the profiling?<br>
<b>7.</b> Compare the memory transfer numbers against your values. By which factor are you off? What could be the reason?<br>
<b>8.</b> Run again with $nx = 64$ and $ny = 32$. Did the factor by which you are off change? Is this consistent with the findings above? Do you have an idea why this might be the case?<br>
</div>

**Solution:**<br>
Running with nx=128, ny=128, nz=64 and num_iter=1024, the counters give 255.2 GB while perftools-lite reports 128.2 GB, which is only 50% of our estimate from the counters.<br>
Running with nx=64, ny=32, nz=64 and num_iter=1024, the counters give 15.42 GB while perftools-lite reports 0.08 GB, which is only 0.52% of our estimate from the counters.<br>
The reason for the large discrepancy is that memory accesses are cached. We have 2.5 MB of L3 cache per core. A float32 field occupies 4 MB for <tt>nx x ny x nz</tt> = 128 x 128 x 64 and 0.5 MB for <tt>nx x ny x nz</tt> = 64 x 32 x 64. We have 3 fields in the code (<tt>in_field, tmp_field, out_field</tt>), so the working set is 12 MB in the first case and 1.5 MB in the second. In the first case we do not fit into L3 cache. In the second case we fit into L3 cache and only have to read the fields from main memory once; after that, reads and writes are served from cache.
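The cache argument can be checked with a few lines of arithmetic. The sketch below uses the numbers quoted in this solution (2.5 MB of L3 cache per core, three float32 fields). The function names are made up for this example, halo points are ignored, and 1 MB is taken as $2^{20}$ bytes.

```python
L3_PER_CORE_MB = 2.5   # L3 cache per core, as quoted in the solution
NUM_FIELDS = 3         # in_field, tmp_field, out_field
BYTES_PER_VALUE = 4    # float32


def field_size_mb(nx, ny, nz, bytes_per_value=BYTES_PER_VALUE):
    """Size of a single field in MB (1 MB = 2**20 bytes), halo ignored."""
    return nx * ny * nz * bytes_per_value / 2**20


def fits_in_l3(nx, ny, nz, num_fields=NUM_FIELDS, l3_mb=L3_PER_CORE_MB):
    """True if all fields together fit into the L3 cache share of one core."""
    return num_fields * field_size_mb(nx, ny, nz) <= l3_mb


def percent_of_estimate(measured_gb, counted_gb):
    """Measured memory traffic as a percentage of the counter estimate."""
    return 100.0 * measured_gb / counted_gb


for nx, ny, nz in [(128, 128, 64), (64, 32, 64)]:
    print(nx, ny, nz, field_size_mb(nx, ny, nz), fits_in_l3(nx, ny, nz))

# Measured versus counted traffic for the two runs discussed above.
print(percent_of_estimate(128.2, 255.2))
print(percent_of_estimate(0.08, 15.42))
```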

In [4]:
srun -n 12 ./stencil_2d.x -nx 128 -ny 128 -nz 64 -num_iter 1024

CrayPat/X:  Version 7.1.1 Revision 7c0ddd79b  08/19/19 16:58:46
 Counted floating-point operations [GFLOP] =  25.904525756835938
 Counted memory transfers [GB] =  255.2354736328125
 --------------------------------------------------------------------------
  Timers:
   number of workers =   12
 --------------------------------------------------------------------------
  Id      Tag                     #calls        min[s]        max[s]       mean[s] 
   1      work                         1       11.3495       11.4757       11.4193
 --------------------------------------------------------------------------

#################################################################
#                                                               #
#            CrayPat-lite Performance Statistics                #
#                                                               #
#################################################################

CrayPat/X:  Version 7.1.1 Revision 7c0ddd79b  08/19/19

In [19]:
srun -n 12 ./stencil_2d.x -nx 64 -ny 32 -nz 64 -num_iter 1024

CrayPat/X:  Version 7.1.1 Revision 7c0ddd79b  08/19/19 16:58:46
 Counted floating-point operations [GFLOP] =  0.2455902099609375
 Counted memory transfers [GB] =  15.4171142578125
 --------------------------------------------------------------------------
  Timers:
   number of workers =   12
 --------------------------------------------------------------------------
  Id      Tag                     #calls        min[s]        max[s]       mean[s] 
   1      work                         1        0.1811        0.1892        0.1852
 --------------------------------------------------------------------------

#################################################################
#                                                               #
#            CrayPat-lite Performance Statistics                #
#                                                               #
#################################################################

CrayPat/X:  Version 7.1.1 Revision 7c0ddd79b  08/19/19 