
Dirac ITT 2020 Benchmarks



1. Dirac ITT 2020 Grid Benchmarks

The two Grid-based benchmarks for the 2020 Dirac ITT are located in the build directories of the Grid package.

Build instructions for Grid are available on the Wiki at https://github.com/paboyle/Grid.

The key Grid benchmarks are located in branch:

release/dirac-ITT-2020

and in the corresponding release:

https://github.com/paboyle/Grid/releases

with binaries found under the build directories as:

benchmarks/Benchmark_ITT

and

benchmarks/Benchmark_IO

Compilation

The code can be compiled for two major node types: multicore CPU and GPU.

Compile instructions/scripts are provided for each of these node types.

However, details will depend on cluster software configurations; the scripts are illustrative only and are provided as examples to help vendors prepare submissions. They compile the code twice (in build-Nc3 and build-Nc4); results for both scenarios must be submitted, but only the Nc=3 build contributes to the DR1 measure of system throughput.
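As an illustration only, a multicore CPU build might use a configure line along the lines of the sketch below. The compiler wrapper, SIMD target and the --enable-Nc option are assumptions and should be checked against the provided build-Nc3/build-Nc4 scripts for your Grid release.

# Illustrative sketch only; adapt compilers and flags to your system.
# The Nc selection flag is assumed here; see the build-Nc3/build-Nc4 scripts.
../configure --enable-simd=AVX512 --enable-comms=mpi-auto --enable-Nc=3 CXX=mpicxx
make -j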

A64FX nodes

These nodes are supported thanks to work contributed by Nils Meyer at Regensburg University. The compile target is

  --enable-simd=A64FX

The code is reported to run efficiently on Fujitsu ARM systems, although we have not been able to verify this ourselves. This target should enable both Fujitsu and Cray A64FX nodes to operate efficiently.

See: Nils Meyer's APLAT 2020 talk
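As an illustration only (the Fujitsu MPI wrapper shown is an assumption; GNU or Arm toolchains can also be used, and site-specific flags will be needed), an A64FX build might be configured as:

# Illustrative sketch only; the mpiFCC wrapper is an assumption for Fujitsu toolchains.
../configure --enable-simd=A64FX --enable-comms=mpi-auto CXX=mpiFCC
make -j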

AMD GPU based nodes.

We have compiled and run the code on MI50 GPUs using the HIP environment.

--enable-accelerator=hip

We have not had access to MI100 GPUs, but anticipate these may be cost-effective.
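As an illustration only (the hipcc and mpicxx wrappers are assumptions that must match the local ROCm and MPI installations), a HIP build might be configured as:

# Illustrative sketch only for an MI50-class node.
../configure --enable-accelerator=hip --enable-simd=GPU --enable-comms=mpi-auto CXX=hipcc MPICXX=mpicxx
make -j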

2. Benchmark_ITT

Requested baseline runs

The benchmark must be run and submitted for both the Nc=3 and Nc=4 builds, with Nc=3 contributing to the system throughput metric.

  1. Single node runs on CPU and, if GPUs are included in the proposal, both single multi-GPU node and single-GPU execution.

2a) 64 node run for CPU nodes with a (2Nx2x4x4) MPI decomposition, where N is the number of sockets on a node

OR

2b) 16 node run for (multi-)GPU nodes with a (2x2x2x2N) MPI decomposition over 2x2x2x2N GPUs, where N is the number of GPUs on one node.

Optionally:

  1. 2, 4, 8, 16, 32 and 64 node runs. Larger job sizes give greater confidence in scaling extrapolations and are advantageous.

Log files should be collected after the compile options and the run and threading parameters have been optimised. These log files must be submitted in their entirety, and organised so that the run invocation parameters are clearly communicated.

Invocation

1. Single node

Single CPU node (if 2 socket)

Benchmark_ITT --mpi 2.1.1.1 --shm 2048  

Single CPU node (if 1 socket)

Benchmark_ITT --mpi 1.1.1.1 --shm 2048  

Single GPU node, 4 GPUs

Benchmark_ITT --mpi 1.1.1.4 --shm 2048  

Single GPU node, 1 GPU

Benchmark_ITT --mpi 1.1.1.1 --shm 2048  

2. a) 64 node CPU

64 node CPU (if 1 socket)

Benchmark_ITT --mpi 2.2.4.4 --shm 2048  

64 node CPU (if 2 socket)

Benchmark_ITT --mpi 4.2.4.4 --shm 2048  

2. b) 16 node GPU

16 GPU nodes, 4 GPUs per node

Benchmark_ITT --mpi 2.2.2.8 --shm 2048  
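The total number of MPI ranks launched must equal the product of the --mpi geometry (for the 16-node GPU case above, 2x2x2x8 = 64 ranks, i.e. 4 ranks per node). As an illustration only, assuming OpenMPI (launcher and binding flags are site dependent), this might be invoked as:

# Illustrative only: 16 nodes x 4 ranks per node = 64 ranks, matching --mpi 2.2.2.8
mpirun -np 64 --map-by ppr:4:node ./benchmarks/Benchmark_ITT --mpi 2.2.2.8 --shm 2048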

Submission of projected figures

A baseline run must be submitted based on measurements of real hardware at the target node count. However, the result may be modified by an adjustment that accounts for differences between the proposed node specification and the node parts used in the baseline run.

For example, runs on parts with fewer cores could be synthetically projected by running a full core-count part with a reduced thread count, which would be a high-confidence projection. Another example could be the estimated effect of increasing or decreasing the number or type of fabric interfaces, based on measurements at smaller node counts; here a more detailed discussion of timings would be required.

Any such corrections to the throughput must be clearly indicated in the proposal, together with the supporting evidence from which they are estimated. The quality of the supporting evidence for such projections will be scored, and that score may be used to take a weighted average of the baseline and projected runs, with the weighting set by the confidence score of the supporting evidence.

The intent is to give vendors ability to fine tune system configuration without imposing a large cost or disruption in reconfiguring their internal clusters. The rules are designed to allow maximum flexibility while minimising the risk of substantially incorrect extrapolations propagating into the scoring.

Total throughput

The aggregate system performance score should be estimated as the sum of the estimated CPU and (if present) GPU partition throughputs via:

CPU throughput = (64 node benchmark result) x (total CPU node count) / 64

GPU throughput = (16 node benchmark result) x (total GPU node count) / 16

System throughput = CPU throughput + GPU throughput

The result is an aggregate total system throughput in Gflop/s.
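For example (all numbers hypothetical, for illustration only): a 64-node CPU result of 5,000,000 Gflop/s on a proposed 256-node CPU partition and a 16-node GPU result of 4,000,000 Gflop/s on a proposed 32-node GPU partition would give 5,000,000 x 256/64 + 4,000,000 x 32/16 = 20,000,000 + 8,000,000 = 28,000,000 Gflop/s system throughput.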

(Nvidia and AMD) GPU based nodes.

We recommend running with one MPI rank per GPU, dividing the CPU cores evenly between ranks.

Compute nodes whose GPUs are not connected by NVLink or Infinity Fabric will not deliver good performance and will not be considered.

CPU nodes & NUMA

On multicore CPU nodes the code runs as hybrid OpenMP + MPI with NUMA/socket-aware optimisations.

While AVX2 and AVX512 are both options, even on processors that support AVX512 we sometimes find AVX2 runs as fast or faster, so both should be compared. Thread counts are controlled with the standard OMP_NUM_THREADS environment variable.

NUMA affinity options on multicore nodes can make a large difference to delivered performance. We recommend one MPI rank per NUMA domain, with OMP_NUM_THREADS set to the number of cores in each NUMA domain.

Examples include

I_MPI_PIN=1 

for Intel MPI, and

--map-by numa 

under OpenMPI.
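As an illustration only (the 32 cores per NUMA domain and the OpenMPI flags are assumptions; adjust to the actual node topology), a 64-node run on two-socket nodes with one NUMA domain per socket might be launched as:

# Illustrative only: one rank per NUMA domain, OpenMP threads filling that domain.
export OMP_NUM_THREADS=32
mpirun -np 128 --map-by numa --bind-to numa ./benchmarks/Benchmark_ITT --mpi 4.2.4.4 --shm 2048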

Manual NUMA, GPU, and HFI binding

If your MPI does not support process binding, invoke the benchmark through a wrapper script to give MPI and CUDA no choice but to do the right thing. We give an example from the Booster system. The following wrapper script proved useful for CPU and GPU execution on Booster, for NUMA and GPU-HFI proximity reasons.

#!/bin/sh
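# Example binding for a 4-GPU node: each SLURM local rank is pinned to the NUMA
# domain, InfiniBand HCA (UCX_NET_DEVICES) and GPU (CUDA_VISIBLE_DEVICES) that sit
# closest together, then the binary is launched with its memory bound via numactl.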
if [ $SLURM_LOCALID = "0" ]
then
  NUMA=1
  MLX=mlx5_0:1
  GPU=1
fi
if [ $SLURM_LOCALID = "1" ]
then
  NUMA=3
  MLX=mlx5_1:1
  GPU=0
fi
if [ $SLURM_LOCALID = "2" ]
then
  NUMA=5
  MLX=mlx5_2:1
  GPU=3
fi
if [ $SLURM_LOCALID = "3" ]
then
  NUMA=7
  MLX=mlx5_3:1
  GPU=2
fi
CMD="env CUDA_VISIBLE_DEVICES=$GPU UCX_NET_DEVICES=$MLX numactl -m $NUMA ./a.out"
echo $CMD
$CMD

Huge pages

On CPU compiles, a global comms buffer is allocated with either MMAP (the default) or SHMGET. If

    --shm-hugepages

is specified, then the software requests that Linux provide 2MB huge pages. This requires system administrator assistance to reserve the pages and to enable the user to map them. Hugetlbfs is also an appropriate solution. Any performance gain will vary with MPI fabric type.
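As an illustration only (the page count is a placeholder and root access is required), huge pages might be reserved with:

# Illustrative only: reserve 2048 x 2MB huge pages system-wide (requires root).
# For SHMGET allocations the user's group may also need adding to vm.hugetlb_shm_group.
sysctl -w vm.nr_hugepages=2048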

For GPU compiles this comms buffer is device memory.

Benchmark_IO

The I/O benchmark can be found under benchmarks/Benchmark_IO in your Grid build directory. This benchmark writes and reads back large arrays of random numbers in parallel, for a range of local volumes. For a given local volume L, the size of the written array is proc*12*L*16 bytes, where proc is the number of MPI processes.

The program uses two strategies for I/O

  1. Each MPI process reads/writes its local array into a separate file using the C++ standard library.
  2. MPI processes collectively read/write in one single file using Grid's routine based on MPI-2 I/O.

When a file is read back, it undergoes a checksum verification, and the program aborts in case of checksum mismatch. Each file contains different random data. The benchmark will loop over local volumes between 8^4 and 32^4, and repeat the whole measurement 10 times.
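As a worked example of the size formula above: at the largest local volume of 32^4 sites each process writes 12 x 16 x 32^4 ≈ 201 MB, so a 128-process run (the two-socket 64-node decomposition) writes roughly 25.8 GB per pass.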

At the end of the execution the benchmark will print summary tables on the standard output. These tables contain

  • a summary of individual results averaged over the 10 passes, with their standard deviations;
  • the robustness of individual results over the 10 passes in %, closer to 100% is better (robustness = 100% - standard deviation/mean);
  • a summary of results averaged over the volume range 24^4-32^4, with their standard deviations;
  • the robustness of the volume-averaged results.

Requested runs

As for Benchmark_ITT, it must be run on 64 CPU nodes using a (2Nx2x4x4) MPI decomposition, where N is the number of sockets on a node.

Log files should be collected after the compile options and the run and threading parameters have been optimised. These log files must be submitted in their entirety. Several results can be submitted for different file system parameters.

Compilation

Identical to Benchmark_ITT.

Invocation

64 node CPU (if 1 socket)

Benchmark_IO --mpi 2.2.4.4

64 node CPU (if 2 socket)

Benchmark_IO --mpi 4.2.4.4

Lustre striping

With the Lustre file system, striping has a very significant impact on the results. In particular, optimal performance for strategies 1 and 2 might not be achieved with the same striping parameters, due to the different numbers of files. The benchmark writes files in the directory from which Benchmark_IO was executed. To create a directory with specific striping parameters, one uses

mkdir dir
lfs setstripe -c <no> -S <size> dir

where <no> is the number of stripes and <size> is the stripe size. Running Benchmark_IO from within dir will benchmark this specific choice of parameters.
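For example (illustrative values only; the directory name and striping parameters are placeholders that must be tuned for the target file system):

# Illustrative only: stripe across all OSTs with a 4MB stripe size, then run from inside.
mkdir benchio_run
lfs setstripe -c -1 -S 4M benchio_run
cd benchio_run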