### Notebook 3: Analyze Benchmarks and Create Plots

This is the third and final notebook in the sequence. At this point, all the results have been calculated and the three implementations have been verified. Now we will create the plots for the final report.

In [None]:
print('notebook_03: started.')

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

#### Helper Functions

First, we will define some useful helper functions.

This function reads the benchmark files created by the first notebook. These files are plain text, and each one contains a Python expression that constructs a `pandas.DataFrame`. This means that we can simply evaluate the text as Python code and get a `DataFrame` as a result. The three implementations all write their output in the same format, so this function works for all of them. Because `eval` executes arbitrary code, it should only be used on files that we generated ourselves.

In [None]:
def read_benchmark_file(filename) -> pd.DataFrame:
    with open(filename, mode='r', encoding='utf-8') as file:
        contents = file.read()
    return eval(contents)
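
To illustrate the idea, the following sketch writes a small, made-up benchmark file and reads it back. The file text, the temporary path, and the numbers are assumptions for demonstration only; the exact text written by the first notebook may differ. The sketch assumes the file refers to pandas as `pd`, which is why `pd` must be imported before calling `eval`.

```python
import os
import tempfile

import pandas as pd


def read_benchmark_file(filename) -> pd.DataFrame:
    with open(filename, mode='r', encoding='utf-8') as file:
        contents = file.read()
    # The file text is evaluated as Python, so `pd` must be in scope here.
    return eval(contents)


# Hypothetical file contents: a single expression that builds a DataFrame.
example_text = (
    "pd.DataFrame({'x': [16, 32], 'y': [16, 32], "
    "'z': [64, 64], 'time': [0.5, 1.5]})"
)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'bench_example.txt')
    with open(path, mode='w', encoding='utf-8') as file:
        file.write(example_text)
    bench_example = read_benchmark_file(path)

print(bench_example)
```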

This function calculates the number of grid points.

In [None]:
def get_gridsize(bench):
    gridsize = bench['x'] * bench['y'] * bench['z']
    return gridsize

This function calculates the runtime per grid point. Note that the time column of the `DataFrame` contains the runtime in milliseconds.

In [None]:
def get_runtime_per_gridpoint(bench):
    rt_ms = bench['time'] / get_gridsize(bench)  # milliseconds per grid point
    rt_us = 1e3 * rt_ms  # convert milliseconds to microseconds
    return rt_us
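
As a quick sanity check of the unit conversion, here is a small worked example with made-up measurements (the numbers are not taken from the real benchmarks). A runtime of 2 ms on 1000 grid points corresponds to 2 µs per grid point.

```python
import pandas as pd

# Made-up measurements: runtime in milliseconds for two grid sizes.
bench = pd.DataFrame(
    {'x': [10, 100], 'y': [10, 100], 'z': [10, 10], 'time': [2.0, 50.0]}
)

gridsize = bench['x'] * bench['y'] * bench['z']  # 1000 and 100000 grid points
rt_us = 1e3 * bench['time'] / gridsize           # ms -> µs, then per grid point

print(rt_us.tolist())
```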

This function calculates the size of the working set in memory in mebibytes (MiB). We need to specify how many bytes each numeric value occupies: 4 bytes if the code uses single precision (`float`), and 8 bytes if it uses double precision (`double`). The `num_fields` argument gives the number of fields of that size that make up the working set.

In [None]:
def get_workingset(bench, bytes_per_value, num_fields):
    gs_B = num_fields * bytes_per_value * bench['x'] * bench['y'] * bench['z']
    gs_MiB = gs_B / 1024 / 1024
    return gs_MiB
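
For a feeling of the magnitudes involved, consider a hypothetical grid of 128 x 128 x 64 points with two fields. The grid dimensions are chosen for illustration only; the sketch repeats the function so that it runs on its own.

```python
import pandas as pd


def get_workingset(bench, bytes_per_value, num_fields):
    gs_B = num_fields * bytes_per_value * bench['x'] * bench['y'] * bench['z']
    gs_MiB = gs_B / 1024 / 1024
    return gs_MiB


# Hypothetical grid: 128 * 128 * 64 = 1,048,576 grid points.
bench = pd.DataFrame({'x': [128], 'y': [128], 'z': [64]})

ws_float = get_workingset(bench, bytes_per_value=4, num_fields=2)
ws_double = get_workingset(bench, bytes_per_value=8, num_fields=2)

print(ws_float[0], ws_double[0])
```

With single precision this grid occupies 8 MiB, and with double precision 16 MiB.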

This function adds all the derived measures that we need for the plots to the `DataFrame`, modifying it in place. It assumes single-precision values (4 bytes) and two fields.

In [None]:
def add_derived_measures(bench):
    bench['rt_per_gp'] = get_runtime_per_gridpoint(bench)
    bench['gridsize'] = get_gridsize(bench)
    bench['workingset'] = get_workingset(bench, bytes_per_value=4, num_fields=2)

The following functions are used to remove specific data points from a benchmark. We may want to do this if we are not interested in certain ranges of data.

In [None]:
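def remove_values(bench, mask):
    # Keep the rows where the mask is False and renumber the index from zero.
    # reset_index returns a new DataFrame, which avoids modifying a slice in place.
    return bench.loc[~mask, :].reset_index(drop=True)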

def remove_values_or(bench, x, y):
    mask = (bench['x'] == x) | (bench['y'] == y)
    return remove_values(bench, mask)

def remove_values_and(bench, x, y):
    mask = (bench['x'] == x) & (bench['y'] == y)
    return remove_values(bench, mask)
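
The difference between the two variants is easiest to see on a tiny, made-up table. The sketch below builds the same masks directly: the "or" mask drops every row where either coordinate matches, while the "and" mask drops only the rows where both match.

```python
import pandas as pd

# Made-up data covering all four combinations of x and y.
bench = pd.DataFrame(
    {'x': [12, 12, 16, 16], 'y': [12, 16, 12, 16], 'time': [1.0, 2.0, 3.0, 4.0]}
)

mask_or = (bench['x'] == 12) | (bench['y'] == 12)
mask_and = (bench['x'] == 12) & (bench['y'] == 12)

kept_or = bench.loc[~mask_or, :].reset_index(drop=True)
kept_and = bench.loc[~mask_and, :].reset_index(drop=True)

print(kept_or)   # only the row with x=16, y=16 remains
print(kept_and)  # every row except x=12, y=12 remains
```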

#### Read Benchmark Files

In [None]:
print('notebook_03: reading benchmark files ...')

Now we read the benchmark files created by the Fortran and C++/CUDA executables.

In [None]:
bench_oacc = read_benchmark_file('./data/bench_openacc.txt')
bench_cuda_direct08 = read_benchmark_file('./data/bench_cuda_noshared08.txt')
bench_cuda_direct16 = read_benchmark_file('./data/bench_cuda_noshared16.txt')
bench_cuda_direct32 = read_benchmark_file('./data/bench_cuda_noshared32.txt')
bench_cuda_shared04 = read_benchmark_file('./data/bench_cuda_shared04.txt')
bench_cuda_shared12 = read_benchmark_file('./data/bench_cuda_shared12.txt')
bench_cuda_shared28 = read_benchmark_file('./data/bench_cuda_shared28.txt')

For each implementation, we calculate the required values for the plots.

In [None]:
add_derived_measures(bench_oacc)
add_derived_measures(bench_cuda_direct08)
add_derived_measures(bench_cuda_direct16)
add_derived_measures(bench_cuda_direct32)
add_derived_measures(bench_cuda_shared04)
add_derived_measures(bench_cuda_shared12)
add_derived_measures(bench_cuda_shared28)

#### Create Plots

In [None]:
print('notebook_03: creating plots ...')


We remove some data points for very small grid sizes from the benchmarks so that the plot focuses on the relevant range of grid sizes.

In [None]:
bench_oacc = remove_values_and(bench_oacc, x=16, y=16)
bench_cuda_direct08 = remove_values_or(bench_cuda_direct08, x=12, y=12)
bench_cuda_direct16 = remove_values_or(bench_cuda_direct16, x=12, y=12)
bench_cuda_shared04 = remove_values_or(bench_cuda_shared04, x=12, y=12)
bench_cuda_shared12 = remove_values_or(bench_cuda_shared12, x=12, y=12)

In this first plot, we show the relationship between working set size and runtime per grid point. Specifically, we compare the Fortran (OpenACC) implementation with the two C++/CUDA implementations, one without and one with shared memory usage. For each CUDA implementation, we choose the block size that is most useful for practical applications (16 without shared memory, 12 with shared memory).

In [None]:
fig, ax = plt.subplots(figsize=(12, 6))
ax.loglog(bench_oacc['workingset'], bench_oacc['rt_per_gp'], '.', label='OpenACC')
ax.loglog(bench_cuda_direct16['workingset'], bench_cuda_direct16['rt_per_gp'], '.', label='CUDA Direct16')
ax.loglog(bench_cuda_shared12['workingset'], bench_cuda_shared12['rt_per_gp'], '.', label='CUDA Shared12')
ax.axvline(4, color='purple', label='L2 Cache Size')
ax.grid(which='major', linestyle='-')
ax.grid(which='minor', linestyle=':')
ax.set_xlabel('Size of working set [MiB]')
ax.set_ylabel('Runtime per gridpoint [µs]')
ax.set_title('Runtime vs. Working Set Size')
ax.legend()
plt.show()
fig.savefig('./data/plot_runtime_mix.png', dpi=300)

In the next plot, we compare how block size affects performance for a CUDA implementation with no shared memory usage.

In [None]:
fig, ax = plt.subplots(figsize=(12, 6))
ax.loglog(bench_cuda_direct08['workingset'], bench_cuda_direct08['rt_per_gp'], '.', label='CUDA Direct08')
ax.loglog(bench_cuda_direct16['workingset'], bench_cuda_direct16['rt_per_gp'], '.', label='CUDA Direct16')
ax.loglog(bench_cuda_direct32['workingset'], bench_cuda_direct32['rt_per_gp'], '.', label='CUDA Direct32')
ax.axvline(4, color='purple', label='L2 Cache Size')
ax.grid(which='major', linestyle='-')
ax.grid(which='minor', linestyle=':')
ax.set_xlabel('Size of working set [MiB]')
ax.set_ylabel('Runtime per gridpoint [µs]')
ax.set_title('Runtime vs. Working Set Size')
ax.legend()
plt.show()
fig.savefig('./data/plot_runtime_direct.png', dpi=300)

In this last plot, we again compare different block sizes. This time we are looking at the shared memory CUDA implementation.

In [None]:
fig, ax = plt.subplots(figsize=(12, 6))
ax.loglog(bench_cuda_shared04['workingset'], bench_cuda_shared04['rt_per_gp'], '.', label='CUDA Shared04')
ax.loglog(bench_cuda_shared12['workingset'], bench_cuda_shared12['rt_per_gp'], '.', label='CUDA Shared12')
ax.loglog(bench_cuda_shared28['workingset'], bench_cuda_shared28['rt_per_gp'], '.', label='CUDA Shared28')
ax.axvline(4, color='purple', label='L2 Cache Size')
ax.grid(which='major', linestyle='-')
ax.grid(which='minor', linestyle=':')
ax.set_xlabel('Size of working set [MiB]')
ax.set_ylabel('Runtime per gridpoint [µs]')
ax.set_title('Runtime vs. Working Set Size')
ax.legend()
plt.show()
fig.savefig('./data/plot_runtime_shared.png', dpi=300)
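
The three plotting cells differ only in the benchmarks they show and the output filename. One possible way to reduce the repetition is a small helper, sketched below. The name `plot_runtime`, its parameters, and the demo data are hypothetical and not part of the original notebook. The `Agg` backend is selected only so that the sketch runs without a display; inside the notebook that line would be dropped.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, assumed for this sketch only
import matplotlib.pyplot as plt
import pandas as pd


def plot_runtime(benchmarks, l2_cache_mib=4):
    """Hypothetical helper: plot runtime per grid point for several benchmarks.

    `benchmarks` maps a legend label to a DataFrame that has the
    'workingset' and 'rt_per_gp' columns.
    """
    fig, ax = plt.subplots(figsize=(12, 6))
    for label, bench in benchmarks.items():
        ax.loglog(bench['workingset'], bench['rt_per_gp'], '.', label=label)
    ax.axvline(l2_cache_mib, color='purple', label='L2 Cache Size')
    ax.grid(which='major', linestyle='-')
    ax.grid(which='minor', linestyle=':')
    ax.set_xlabel('Size of working set [MiB]')
    ax.set_ylabel('Runtime per gridpoint [µs]')
    ax.set_title('Runtime vs. Working Set Size')
    ax.legend()
    return fig, ax


# Made-up data, for demonstration only.
demo = pd.DataFrame(
    {'workingset': [1.0, 4.0, 16.0], 'rt_per_gp': [0.02, 0.01, 0.015]}
)
fig, ax = plot_runtime({'Demo A': demo, 'Demo B': demo})
```

The returned `fig` can then be saved with `fig.savefig(...)` exactly as in the cells above.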