# ARCHER Benchmark Baseline results

Baseline results for selected benchmarks run on ARCHER. Version and compilation details for each application are given in its own section below.

## Import required modules for results analysis

In [1]:
import matplotlib as mpl
from matplotlib import pyplot as plt
%matplotlib inline
mpl.rcParams['figure.figsize'] = (12,6)
import seaborn as sns
sns.set_style("white", {"font.family": "serif"})
import pandas as pd

In [2]:
import sys
sys.path.append('../python-modules')

In [3]:
from utilities import filemanip

## CASTEP: Large DNA Benchmark

### Version and compilation details

CASTEP Version: 18.1.0

Key options in the CASTEP Makefile:

```
COMMS_ARCH := mpi
FFT := fftw3
BUILD := fast
MATHLIBS := mkl10
```

Key dependencies:

- Compiler: GCC 6.3.0
- FFT is provided by the Cray programming environment FFTW 3 installation (version: 3.3.4.11)
- Math libraries are provided by MKL (version: 17.0.0.098)
- MPI provided by Cray MPICH (version: 7.5.2)

Full build instructions can be found at: https://github.com/hpc-uk/build-instructions/blob/master/CASTEP/ARCHER_18.1.0_gcc6_CrayMPT.md

### Benchmark and measuring performance

Details of the DNA benchmark can be found in this repository at:  https://github.com/hpc-uk/archer-benchmarks/blob/master/apps/CASTEP/

The typical parallel launch line within the job submission script is:

```
export OMP_NUM_THREADS=<nomp threads>
aprun -n <nprocess total> -N <nprocess per node> -d <nomp threads> castep.mpi polyA20-no-wat
```

where `<nprocess total>` is replaced by the total number of MPI processes, `<nprocess per node>` is replaced by the number of MPI processes per node and `<nomp threads>` is replaced by the number of OpenMP threads.
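The placeholders can be derived from the node count and the thread count. The following is a minimal sketch, assuming fully populated nodes with 24 cores each (the core count passed to `create_df_list` in the analysis cells of this notebook); the helper `aprun_args` is hypothetical and is not part of the benchmark repository.

```python
def aprun_args(nodes, threads, cores_per_node=24):
    """Return (nprocess total, nprocess per node, nomp threads).

    Assumption: every core is used, so MPI processes per node multiplied
    by OpenMP threads equals cores_per_node.
    """
    if cores_per_node % threads != 0:
        raise ValueError("threads must divide cores_per_node evenly")
    per_node = cores_per_node // threads
    return nodes * per_node, per_node, threads


# Example: 4 nodes with 2 OpenMP threads per MPI process
n_total, n_per_node, n_omp = aprun_args(nodes=4, threads=2)
print("export OMP_NUM_THREADS={0}".format(n_omp))
print("aprun -n {0} -N {1} -d {2} castep.mpi polyA20-no-wat".format(
    n_total, n_per_node, n_omp))
```

With these inputs the sketch yields 48 MPI processes in total, 12 per node, and 2 threads each.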

Performance is measured in 'mean SCF cycles per second'. This is calculated from the CASTEP output files by computing the SCF cycle times, removing the minimum and maximum values, and then computing the mean of the remaining values.
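The calculation can be sketched as follows. This is an illustration only, not the code in `appanalysis.castep`: the function name is hypothetical, the cycle times are made up, and it assumes that the reported rate is the reciprocal of the trimmed mean cycle time.

```python
def mean_scf_cycles_per_second(cycle_times):
    """Trimmed-mean SCF rate from a list of SCF cycle times in seconds.

    Drops one minimum and one maximum value, averages the rest, and
    returns the reciprocal (assumed conversion from s/cycle to cycles/s).
    """
    if len(cycle_times) < 3:
        raise ValueError("need at least three cycle times to trim min and max")
    trimmed = sorted(cycle_times)[1:-1]  # remove smallest and largest
    mean_time = sum(trimmed) / float(len(trimmed))
    return 1.0 / mean_time


# Made-up cycle times in seconds; 2.0 and 9.0 are discarded
perf = mean_scf_cycles_per_second([4.0, 2.0, 4.0, 9.0, 4.0])
print(perf)
```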

For each job size, the benchmark has been run 5 times at random times and days.
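Repeated runs at the same job size can be summarised per node count. The sketch below uses pandas `groupby` on a DataFrame with `Nodes` and `Perf` columns; the values are made up, and this is a stand-in for illustration rather than the `get_perf_stats` implementation used later in this notebook.

```python
import pandas as pd

# Made-up results: three repeats at each of two job sizes
runs = pd.DataFrame({
    "Nodes": [1, 1, 1, 2, 2, 2],
    "Perf": [0.10, 0.12, 0.11, 0.19, 0.22, 0.21],
})

# One row per node count with the spread across repeats
summary = runs.groupby("Nodes")["Perf"].agg(["min", "median", "max", "count"])
print(summary)
```

Reporting the maximum picks the best of the repeats for each job size, while the median is less sensitive to a single unusually fast or slow run.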

### ARCHER baseline results analysis

In [4]:
from appanalysis import castep

In [5]:
castepfiles = filemanip.get_filelist('../apps/CASTEP/DNA/results/ARCHER_baseline', 'polyA20')

In [6]:
df_list = castep.create_df_list(castepfiles, 24)
castep_df = pd.DataFrame(df_list)

In [7]:
threading = [1, 2, 4]
for threads in threading:
    print('\nPerformance (Mean SCF cycles per second): {0} threads per MPI process'.format(threads))
    # 'max' requests the best result over the repeated runs at each node count
    nodes, perf_max = castep.get_perf_stats(castep_df, threads, 'max', writestats=True)
    plt.plot(nodes, perf_max, '-o', label='{0} threads per MPI process'.format(threads))
    plt.xlabel('Nodes')
    plt.ylabel('Maximum Perf. (Mean SCF cycles per s)')
    plt.legend(loc='best')
    sns.despine()


Performance (Mean SCF cycles per second): 1 threads per MPI process


UndefinedVariableError: name 'max' is not defined

## GROMACS: 1400k atom benchmark

### Version and compilation details

GROMACS Version: 2018.2 (single precision)

Key dependencies:

- Compiler: GCC 6.3.0
- FFT is provided by the GROMACS self-build of FFTW (version: commercial-fftw-3.3.6-pl1-sse2-avx)
- Math libraries are provided by GROMACS
- MPI provided by Cray MPICH (version: 7.5.2)
- OpenMP is enabled

Full build instructions can be found at: https://github.com/hpc-uk/build-instructions/blob/master/GROMACS/ARCHER_2018.2_gcc6_ivybrg.md

### Benchmark and measuring performance

Details of the 1400k benchmark can be found in this repository at:  https://github.com/hpc-uk/archer-benchmarks/tree/master/apps/GROMACS

The typical parallel launch line within the job submission script is:

```
export OMP_NUM_THREADS=<nomp threads>
aprun -n <nprocess total> -N <nprocess per node> mdrun_mpi -s benchmark.tpr -g benchmark -noconfout
```

where `<nprocess total>` is replaced by the total number of MPI processes, `<nprocess per node>` is replaced by the number of MPI processes per node and `<nomp threads>` is replaced by the number of OpenMP threads.

Performance is measured in 'ns/day'. This is calculated by the GROMACS software itself and is read directly from the GROMACS output.
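As an illustration of reading this value, the sketch below scans log text for a line beginning with `Performance:` and takes the first number on it as ns/day. This layout is an assumption based on typical GROMACS log files, the sample text is made up, and the helper `read_ns_per_day` is hypothetical (it is not the parser in `appanalysis.gromacs`).

```python
import re


def read_ns_per_day(log_text):
    """Return the ns/day figure from GROMACS log text, or None if absent."""
    pattern = re.compile(r"^\s*Performance:\s+([0-9]+(?:\.[0-9]+)?)")
    for line in log_text.splitlines():
        match = pattern.match(line)
        if match:
            return float(match.group(1))
    return None


# Made-up excerpt in the assumed layout
sample = "\n".join([
    "                 (ns/day)    (hour/ns)",
    "Performance:        1.234       19.449",
])
print(read_ns_per_day(sample))
```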

### ARCHER baseline results analysis

In [None]:
from appanalysis import gromacs

In [None]:
gromacsfiles = filemanip.get_filelist('../apps/GROMACS/1400k-atoms/results/ARCHER', 'benchmark')

In [None]:
df_list = gromacs.create_df_list(gromacsfiles, 24)
gromacs_df = pd.DataFrame(df_list)

In [None]:
print('\nPerformance (ns/day): 1 thread per MPI process')
nodes, perf_max = gromacs.get_perf_stats(gromacs_df, 'max', threads=1, writestats=True)
print('\nPerformance (ns/day): 2 threads per MPI process')
nodes2, perf_max2 = gromacs.get_perf_stats(gromacs_df, 'max', threads=2, writestats=True)
plt.plot(nodes, perf_max, '-o', label='1 thread per MPI process')
plt.plot(nodes2, perf_max2, '-o', label='2 threads per MPI process')
plt.xlabel('Nodes')
plt.ylabel('Maximum Perf. (ns/day)')
plt.legend(loc='best')
sns.despine()

## OpenSBLI: Taylor-Green Vortex 1024 benchmark

### Version and compilation details

OpenSBLI version: as supplied for the benchmark

Key dependencies:

- Compiler: Cray compiler (version: 8.5.8)
- MPI provided by Cray MPICH (version: 7.5.2)
- HDF5 is provided by Cray HDF5 (version: 1.10.0.1)

Full build instructions can be found at: https://github.com/hpc-uk/archer-benchmarks/blob/master/apps/OpenSBLI/source/ARCHER_build.md

### Benchmark and measuring performance

Details of the Taylor-Green Vortex 1024 benchmark can be found in this repository at: https://github.com/hpc-uk/archer-benchmarks/tree/master/apps/OpenSBLI

The typical parallel launch line within the job submission script is:

```
aprun -n <nprocess total> -N <nprocess per node> ./OpenSBLI_mpi
```

where `<nprocess total>` is replaced by the total number of MPI processes and `<nprocess per node>` is replaced by the number of MPI processes per node.

Performance is measured in 'iterations/s'. The total runtime and number of iterations are read directly from the OpenSBLI output and used to compute the number of iterations per second.
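The conversion itself is a single division. A minimal sketch, with a hypothetical helper name and made-up input values (parsing of the OpenSBLI output is not shown):

```python
def iterations_per_second(n_iterations, runtime_seconds):
    """Iterations per second from an iteration count and a total runtime."""
    if runtime_seconds <= 0:
        raise ValueError("runtime must be positive")
    return float(n_iterations) / runtime_seconds


# Made-up example: 500 iterations completed in 250 s
print(iterations_per_second(500, 250.0))
```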

For each job size, the benchmark has been run 5 times at random times and days.

### ARCHER baseline results analysis

In [None]:
from appanalysis import osbli

In [None]:
osblifiles = filemanip.get_filelist('../apps/OpenSBLI/TGV1024ss/results/ARCHER_baseline/', 'output')

In [None]:
df_list = osbli.create_df_list(osblifiles, 24)
osbli_df = pd.DataFrame(df_list)

In [None]:
print('\nPerformance (iterations/s): 1 thread per MPI process')
nodes, perf_max = osbli.get_perf_stats(osbli_df, 'max', writestats=True)
plt.plot(nodes, perf_max, '-o', label='1 thread per MPI process')
plt.xlabel('Nodes')
plt.ylabel('Maximum Perf. (Iterations/s)')
plt.legend(loc='best')
sns.despine()

## CP2K: LiH-HFX benchmark

### Version and compilation details

CP2K version: 6.1-branch

Key dependencies:

- Compiler: GCC 6.3.0
- MPI provided by Cray MPICH (version: 7.5.5)
- FFT is provided by the Cray programming environment FFTW 3 installation (version: 3.3.4.11)
- ELPA (version 2015.05.001 was used for this build)
- libint (version 1.1.4 was used for this build)
- libxc (version 4.2.3 was used for this build)
- libxsmm (cloned from GitHub on 13 Sep 2018)
- PLUMED2 (version 2.6.3 was used for this build) 

Full build instructions can be found at: https://github.com/hpc-uk/build-instructions/blob/master/CP2K/CP2K_6.1_ARCHER.md

### Benchmark and measuring performance

Details of the LiH-HFX benchmark can be found in this repository at: https://github.com/hpc-uk/archer-benchmarks/tree/master/apps/CP2K

The typical parallel launch line within the job submission script is:

```
export OMP_NUM_THREADS=<nomp threads>
aprun -n <nprocess total> -N <nprocess per node> -d <nomp threads> cp2k.psmp -i input_bulk_HFX_3.inp
```

where `<nprocess total>` is replaced by the total number of MPI processes, `<nprocess per node>` is replaced by the number of MPI processes per node and `<nomp threads>` is replaced by the number of OpenMP threads.

Performance is measured in 'calculations/s'. The total runtime is read directly from the CP2K output and used to compute the number of calculations per second.

For each job size, the benchmark has been run 5 times at random times and days.

### ARCHER baseline results analysis

In [None]:
from appanalysis import cp2k

In [None]:
cp2kfiles = filemanip.get_filelist('../apps/CP2K/results/ARCHER_baseline/', 'CP2K_')

In [None]:
df_list = cp2k.create_df_list(cp2kfiles, 24)
cp2k_df = pd.DataFrame(df_list)

In [None]:
print('\nPerformance (Calculations/s): 6 threads per MPI process')
nodes, perf_max = cp2k.get_perf_stats(cp2k_df, 6, 'max', writestats=True)
plt.plot(nodes, perf_max, '-o', label='6 threads per MPI process')
plt.xlabel('Nodes')
plt.ylabel('Maximum Perf. (Calculations/s)')
plt.legend(loc='best')
sns.despine()

## HadGEM3: G31 coupled model benchmark

### Version and compilation details

Version: benchmark case

Key dependencies:

- Compiler: Cray compiler, CCE (version: 8.5.8)
- MPI provided by Cray MPICH (version: 7.5.5)

Full build instructions can be found at: https://github.com/hpc-uk/build-instructions/blob/master/OASIS/OASIS_ACRHER.md

### Benchmark and measuring performance

Details of the benchmark can be found in this repository at: https://github.com/hpc-uk/archer-benchmarks/tree/master/apps/HadGEM3

Performance is measured in 'calculations/s'. The output files from this benchmark are too large to be added to the repository so the raw data is contained in a CSV file which is then used for the analysis below.

For each job size, the benchmark has been run 5 times at random times and days.

### ARCHER baseline results analysis

In [None]:
from appanalysis import hadgem3

In [None]:
hadgem3_df = pd.read_csv('../apps/HadGEM3/results/ARCHER_baseline/HadGEM3_ARCHER_baseline_results.csv')
cpn = 24
# Derived columns, computed column-wise rather than row by row
hadgem3_df['Cores'] = cpn * hadgem3_df['Nodes']   # total cores used
hadgem3_df['Perf'] = 1.0 / hadgem3_df['Time']     # calculations per second
hadgem3_df['Count'] = 1                           # one run per row

In [None]:
print('\nPerformance (Calculations/s):')
nodes, perf_max = hadgem3.get_perf_stats(hadgem3_df, 'max', writestats=True)
plt.plot(nodes, perf_max, '-o', label='1 thread per MPI process')
plt.xlabel('Nodes')
plt.ylabel('Maximum Perf. (Calculations/s)')
plt.legend(loc='best')
sns.despine()