# Shared Memory Parallelism with OpenMP

In this exercise we parallelize the stencil program from day 1 using OpenMP. The goal is to apply the OpenMP concepts discussed in the lecture. If everything goes well, you will end up with a parallel version of the diffusion operator.

So let's start!

## Performance Baseline

First we measure how fast our code runs and what a straightforward insertion of compiler directives (pragmas) can achieve.

In [1]:
import timeit
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

## Validation

Before any parallelization or optimization, it is always good to make sure that the code works correctly. We plot the initial condition and the final result to check that the code still produces the same result.

In [None]:
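def read_field_from_file(filename, num_halo=None):
    # Read six int32 header values; num_halo from the file overrides the argument
    (rank, nbits, num_halo, nx, ny, nz) = np.fromfile(filename, dtype=np.int32, count=6)
    # The header holds 3 + rank int32 values; express its length in data elements
    offset = (3 + rank) * 32 // nbits
    data = np.fromfile(filename, dtype=np.float32 if nbits == 32 else np.float64,
                       count=nz * ny * nx + offset)
    # Skip the header and reshape to (nz, ny, nx) for 3D or (ny, nx) for 2D fields
    if rank == 3:
        return np.reshape(data[offset:], (nz, ny, nx))
    else:
        return np.reshape(data[offset:], (ny, nx))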

def validate_results():
    fig, axs = plt.subplots(1, 2, figsize=(12, 4))

    in_field = read_field_from_file('in_field.dat')
    im1 = axs[0].imshow(in_field[in_field.shape[0] // 2, :, :], origin='lower', vmin=-0.1, vmax=1.1)
    fig.colorbar(im1, ax=axs[0])
    axs[0].set_title('Initial condition')

    out_field = read_field_from_file('out_field.dat')
    im2 = axs[1].imshow(out_field[out_field.shape[0] // 2, :, :], origin='lower', vmin=-0.1, vmax=1.1)
    fig.colorbar(im2, ax=axs[1])
    axs[1].set_title('Final result')

    plt.show()

In [None]:
validate_results()
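
The plots only give a visual check. A numerical comparison can complement them. The following is a minimal sketch, assuming two fields of identical shape that should agree up to round-off. The helper name `fields_match`, the tolerances, and the synthetic arrays are illustrative assumptions and not part of the course code.

```python
import numpy as np

def fields_match(reference, candidate, rtol=1e-5, atol=1e-8):
    """Return True if both fields have the same shape and agree within tolerance."""
    if reference.shape != candidate.shape:
        return False
    return bool(np.allclose(reference, candidate, rtol=rtol, atol=atol))

# Synthetic stand-ins for two output fields with layout (nz, ny, nx)
rng = np.random.default_rng(0)
reference = rng.random((4, 8, 8))
perturbed = reference + 1e-12    # difference of round-off size
broken = reference.copy()
broken[2, 3, 3] += 0.5           # a genuine error in a single grid point

print(fields_match(reference, perturbed))
print(fields_match(reference, broken))
```

In practice one would pass the result of `read_field_from_file` for a trusted reference run and for the run under test. Suitable tolerances depend on the precision of the data and on how much the order of floating point operations changes.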

## C++ implementation

In [2]:
%%bash
module load daint-gpu
module load perftools-lite
CC stencil2d-base.cpp -fopenmp -o stencil2d-base.x -O3

INFO: creating the CrayPat-instrumented executable 'stencil2d-base.x' (lite-samples) ...OK


In [3]:
%%bash
module load daint-gpu
module load perftools-lite
CC stencil2d-kparallel.cpp -fopenmp -o stencil2d-kparallel.x -O3

INFO: creating the CrayPat-instrumented executable 'stencil2d-kparallel.x' (lite-samples) ...OK


In [4]:
%%bash
srun -n 1 ./stencil2d-base.x+orig --nx 128 --ny 128 --nz 64 --num_iter 1024

# ranks nx ny ny nz num_iter time
data = np.array( [ \
[ 24, 128, 128, 64, 1024, 4.87306],
] )
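
The program prints its timings in a form that resembles a Python assignment to `data`. To collect runtimes from several runs, the numeric rows can be extracted from the captured text. This is a minimal sketch; the helper name `parse_timing_output` is hypothetical, and it assumes that each result row sits on its own line inside square brackets, as in the output above.

```python
import re
import numpy as np

def parse_timing_output(text):
    """Collect the bracketed numeric rows of the program output into a 2D array."""
    rows = []
    for line in text.splitlines():
        # A row looks like "[ 24, 128, ..., 4.87306]," with an optional trailing backslash
        match = re.match(r"\s*\[(.*?)\]\s*,?\s*\\?\s*$", line)
        if match:
            rows.append([float(v) for v in match.group(1).split(",") if v.strip()])
    return np.array(rows)

sample = r"""# ranks nx ny ny nz num_iter time
data = np.array( [ \
[ 24, 128, 128, 64, 1024, 4.87306],
] )
"""

rows = parse_timing_output(sample)
print(rows[:, -1])   # the last column holds the runtime in seconds
```

The same function also accepts the Fortran output, where the numbers are printed in exponent notation and each row ends with a backslash.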


In [5]:
%%bash
export OMP_NUM_THREADS=1
srun -n 1 ./stencil2d-kparallel.x+orig --nx 128 --ny 128 --nz 64 --num_iter 1024

#threads = 1
# ranks nx ny ny nz num_iter time
data = np.array( [ \
[ 1, 128, 128, 64, 1024, 5.85649],
] )


In [6]:
%%bash
export OMP_NUM_THREADS=5
srun -n 1 ./stencil2d-kparallel.x+orig --nx 128 --ny 128 --nz 64 --num_iter 1024

#threads = 5
# ranks nx ny ny nz num_iter time
data = np.array( [ \
[ 5, 128, 128, 64, 1024, 3.14998],
] )
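
The two `kparallel` runs above can be condensed into a speedup and a parallel efficiency. This is a small sketch using the usual definitions (speedup is the single-thread time divided by the multi-thread time, efficiency is speedup divided by the thread count). The function names are illustrative, and the timings are copied from the outputs above.

```python
def speedup(t_single, t_parallel):
    """Ratio of the single-thread runtime to the multi-thread runtime."""
    return t_single / t_parallel

def parallel_efficiency(t_single, t_parallel, num_threads):
    """Speedup divided by the number of threads (1.0 would be ideal scaling)."""
    return speedup(t_single, t_parallel) / num_threads

t1 = 5.85649   # stencil2d-kparallel, OMP_NUM_THREADS=1
t5 = 3.14998   # stencil2d-kparallel, OMP_NUM_THREADS=5

print(f"speedup    = {speedup(t1, t5):.2f}")
print(f"efficiency = {parallel_efficiency(t1, t5, 5):.2f}")
```

With these numbers the speedup is about 1.86 and the efficiency about 0.37, so five threads deliver well under twice the single-thread performance. Note also that the single-thread `kparallel` run is slower than the `base` run (5.86 s versus 4.87 s).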


In [7]:
%%bash
export OMP_NUM_THREADS=5
srun -n 1 ./stencil2d-kparallel.x --nx 128 --ny 128 --nz 64 --num_iter 1024 > report_cxx.txt

CrayPat/X:  Version 7.1.1 Revision 7c0ddd79b  08/19/19 16:58:46


## Fortran implementation

In [8]:
%%bash
# Build the Fortran version with the Cray programming environment
module load daint-gpu
module switch PrgEnv-gnu PrgEnv-cray
make VERSION=base

ftn -O3 -hfp3 -eZ -ffree -N255 -ec -eC -eI -eF -rm -h omp -c stencil2d-base.F90
ftn -O3 -hfp3 -eZ -ffree -N255 -ec -eC -eI -eF -rm -h omp m_utils.o stencil2d-base.o -o stencil2d-base.x
cp stencil2d-base.x stencil2d.x


In [9]:
%%bash
# Build the Fortran version with the Cray programming environment and perftools-lite
module load daint-gpu
module switch PrgEnv-gnu PrgEnv-cray
module load perftools-lite
make VERSION=kparallel

ftn -O3 -hfp3 -eZ -ffree -N255 -ec -eC -eI -eF -rm -h omp -c stencil2d-kparallel.F90
ftn -O3 -hfp3 -eZ -ffree -N255 -ec -eC -eI -eF -rm -h omp m_utils.o stencil2d-kparallel.o -o stencil2d-kparallel.x
cp stencil2d-kparallel.x stencil2d.x


INFO: creating the CrayPat-instrumented executable 'stencil2d-kparallel.x' (lite-samples) ...OK


In [10]:
%%bash
srun -n 1 ./stencil2d-base.x --nx 128 --ny 128 --nz 64 --num_iter 1024

# ranks nx ny ny nz num_iter time
data = np.array( [ \
[    1,  128,  128,   64,    1024,  0.8072996E+00], \
] )


In [11]:
%%bash
export OMP_NUM_THREADS=1
srun -n 1 ./stencil2d-kparallel.x+orig --nx 128 --ny 128 --nz 64 --num_iter 1024

# threads =            1
# ranks nx ny ny nz num_iter time
data = np.array( [ \
[    1,  128,  128,   64,    1024,  0.8203089E+00], \
] )


In [12]:
%%bash
export OMP_NUM_THREADS=24
srun -n 1 ./stencil2d-kparallel.x+orig --nx 128 --ny 128 --nz 64 --num_iter 1024

# threads =           24
# ranks nx ny ny nz num_iter time
data = np.array( [ \
[    1,  128,  128,   64,    1024,  0.4305441E+00], \
] )
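
One possible limit on scaling is load balance. Assuming, as the name suggests, that the `kparallel` version distributes the `nz` vertical levels across threads, the busiest thread handles at least `ceil(nz / threads)` levels, which bounds the achievable speedup. The sketch below computes this bound and compares it with the measured Fortran speedup. The function name is illustrative, and the assumption about how the loop is split has not been checked against the source files.

```python
import math

def max_speedup_from_load_balance(num_iterations, num_threads):
    """Upper bound on speedup when whole loop iterations are split across threads."""
    busiest = math.ceil(num_iterations / num_threads)   # iterations on the busiest thread
    return num_iterations / busiest

nz = 64
for threads in (1, 5, 24):
    bound = max_speedup_from_load_balance(nz, threads)
    print(f"{threads:2d} threads: speedup bound = {bound:.2f}")

# Measured Fortran kparallel timings from the runs above
measured = 0.8203089 / 0.4305441
print(f"measured speedup with 24 threads = {measured:.2f}")
```

For 24 threads and 64 levels the bound is about 21, while the measured speedup is about 1.9. Under the stated assumption, load imbalance alone does not explain the gap, so other effects such as memory bandwidth or threading overhead are worth investigating in the CrayPat report.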


In [13]:
%%bash
export OMP_NUM_THREADS=24
srun -n 1 ./stencil2d-kparallel.x --nx 128 --ny 128 --nz 64 --num_iter 1024 > report_ftn.txt

CrayPat/X:  Version 7.1.1 Revision 7c0ddd79b  08/19/19 16:58:46


In [None]:
%%bash
make clean

## So is C++ just slower?

In [14]:
%%bash
module load daint-gpu
module switch PrgEnv-gnu PrgEnv-cray
make VERSION=jparallel

make: 'stencil2d-jparallel.x' is up to date.


In [15]:
%%bash
srun -n 1 ./stencil2d-jparallel.x --nx 128 --ny 128 --nz 64 --num_iter 1024

# threads =           24
# ranks nx ny ny nz num_iter time
data = np.array( [ \
[    1,  128,  128,   64,    1024,  0.1499902E+01], \
] )


In [16]:
%%bash
module load daint-gpu
CC stencil2d-jparallel.cpp -fopenmp -o stencil2d-jparallel.x -O3

In [17]:
%%bash
srun -n 1 ./stencil2d-jparallel.x --nx 128 --ny 128 --nz 64 --num_iter 1024

#threads = 24
# ranks nx ny ny nz num_iter time
data = np.array( [ \
[ 24, 128, 128, 64, 1024, 0.898099],
] )
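
To address the question in the heading, the timings measured above can be put side by side. This is a minimal sketch; the dictionary layout and the function name are illustrative, and the numbers are copied from the outputs above. The 5-thread C++ and 24-thread Fortran `kparallel` runs are left out because their thread counts differ.

```python
# Wall-clock times in seconds, copied from the runs above
timings = {
    "base":                   {"C++": 4.87306,  "Fortran": 0.8072996},
    "kparallel (1 thread)":   {"C++": 5.85649,  "Fortran": 0.8203089},
    "jparallel (24 threads)": {"C++": 0.898099, "Fortran": 1.499902},
}

def cxx_to_fortran_ratio(entry):
    """Runtime of the C++ version relative to the Fortran version (>1 means C++ is slower)."""
    return entry["C++"] / entry["Fortran"]

for variant, entry in timings.items():
    print(f"{variant:24s} C++/Fortran = {cxx_to_fortran_ratio(entry):.2f}")
```

The ratio is roughly 6 to 7 for the `base` and single-thread `kparallel` versions, but about 0.6 for `jparallel`. In these measurements C++ is therefore not slower across the board. Since each configuration was timed only once, the numbers should be read as indicative.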
