The ReproMPI Benchmark is a tool designed to accurately measure the run-time of MPI blocking collective operations. It provides multiple process synchronization methods and a flexible mechanism for predicting the number of measurements that are sufficient to obtain statistically sound results.
- an MPI library
- CMake (version >= 2.6)
- GSL libraries
cd $BENCHMARK_PATH ./cmake . make
For specific configuration options check the Benchmark Configuration section.
Running the ReproMPI Benchmark
The ReproMPI code is designed to serve two specific purposes:
Benchmarking of MPI collective calls
The most common usage scenario of the benchmark is to specify an MPI collective function to be benchmarked, a (list of) message sizes and the number of measurement repetitions for each test, as in the following example.
mpirun -np 4 ./bin/mpibenchmark --calls-list=MPI_Bcast,MPI_Allgather --msizes-list=8,1024,2048 --nrep=10
Estimating the number of measurement iterations
In this scenario, the user can generate an estimation of the number of measurements required for a stable result with respect to one or multiple prediction methods.
More details about the various methods that are supported and their usage can be found in:
- S. Hunold, A. Carpen-Amarie, F.D. Lübbe and J.L. Träff, “Automatic Verification of Self-Consistent MPI Performance Guidelines”, EuroPar (2016)
This is an example of how to perform such an estimation:
mpirun -np 4 ./bin/mpibenchmarkPredNreps --calls-list=MPI_Bcast,MPI_Allgather --msizes-list=8,1024,2048 --rep-prediction=min=1,max=200,step=5
-vprint run-times measured for each process
--msizes-list=<values>list of comma-separated message sizes in Bytes, e.g.,
--msize-interval=min=<min>,max=<max>,step=<step>list of power of 2 message sizes as an interval between 2^min and 2^max, with 2^step distance between values, e.g.,
--calls-list=<args>list of comma-separated MPI calls to be benchmarked, e.g.,
--root-proc=<process_id>root node for collective operations
--operation=<mpi_op>MPI operation applied by collective operations (where applicable), e.g.,
Supported operations: MPI_BOR, MPI_BAND, MPI_LOR, MPI_LAND, MPI_MIN, MPI_MAX, MPI_SUM, MPI_PROD
--datatype=<mpi_type>MPI datatype used by collective operations, e.g.,
Supported datatypes: MPI_CHAR, MPI_INT, MPI_FLOAT, MPI_DOUBLE
--shuffle-jobsshuffle experiments before running the benchmark
--params=k1:v1,k2:v2list of comma-separated
key:valuepairs to be printed in the benchmark output.
-f | --input-file=<path>input file containing the list of benchmarking jobs (tuples of MPI function, message size, number of repetitions). It replaces all the other common options.
Options Related to the Window-based Synchronization
--window-size=<win>window size in microseconds for Window-based synchronization
Specific options for synchronization methods based on a linear model of the clock drift
--fitpoints=<nfit>number of fitpoints (default: 20)
--exchanges=<nexc>number of exchanges (default: 10)
Specific Options for the ReproMPI Benchmark
--nrep=<nrep>set number of experiment repetitions
--summary=<args>list of comma-separated data summarizing methods (mean, median, min, max), e.g.,
Specific Options for Estimating the Number of Repetitions
--rep-prediction=min=<min>,max=<max>,step=<step>set the total number of repetitions to be estimated between
<max>, so that at each iteration i, the number of measurements (
nrep) is either
nrep(0) = <min>, or
nrep(i) = nrep(i-1) + <step> * 2^(i-1), e.g.,
--pred-method=m1,m2comma-separated list of prediction methods, i.e., rse, cov_mean, cov_median (default: rse)
--var-thres=thres1,thres2comma-separated list of thresholds corresponding to the specified prediction methods (default: 0.01)
--var-win=win1,win2comma-separated list of (non-zero) windows corresponding to the specified prediction methods;
rsedoes not rely on a measurement window, however a dummy window value is required in this list when multiple methods are used (default: 10)
Supported Collective Operations:
Mockup Functions of Various MPI Collectives
Process Synchronization Methods
This is the default synchronization method enabled for the benchmark.
To benchmark collective operations acorss multiple MPI libraries using the same barrier implementation, the benchmark provides a dissemination barrier that can replace the default MPI_Barrier to synchronize processes.
To enable the dissemination barrier, the following flag has to be set
before compiling the benchmark (e.g., using the
Both barrier-based synchronization methods can alternatively use a double barrier before each measurement.
The ReproMPI benchmark implements a window-based process synchronization mechanism, which estimates the clock offset/drift of each process relative to a reference process and then uses the obtained global clocks to synchronize processes before each measurement and to compute run-times.
It relies on one of the following clock synchronization methods:
- HCA synchronization: this is the clock synchronization algorithm we propose in . It computes a linear model of the clock drift of each process. The HCA method can be configured by setting the following flags before compilation.
ENABLE_LOGP_SYNC flag determines which variant of the HCA
algorithm is used, i.e., either HCA1 (which computes the clock models
in O(p) steps) or HCA2 (which requires only O(log p) rounds).
- SKaMPI synchronization: it implements the SKaMPI clock synchronization algorithm. To enable it, set the following flag before compilation.
- Jones and Koenig synchronization: it implements the clock synchronization algorithm introduced by Jones and Koenig~. To enable it, set the following flag before compilation.
The MPI operation run-time is computed in a different manner depending on the selected clock synchronization method. If global clocks are available, the run-times are computed as the difference between the largest exit time and the first start time among all processes.
If a barrier-based synchronization is used, the run-time of an MPI call is computed as the largest local run-time across all processes.
However, the timing proceduce that relies on global clocks can be used in combination with a barrier-based synchronization when the following flag is enabled:
More information regarding the timing procedure can be found in .
MPI_Wtime cll is used by default to obtain the current time.
To obtain accurate measurements of short time intervals, the benchmark
can rely on the high resolution
RDTSC/RDTSCP instructions (if they are
available on the test machines) by setting on of the following flags:
Additionally, setting the clock frequency of the CPU is required to obtain accurate measurements:
The clock frequency can also be automatically estimated (as done by the NetGauge tool) by enabling the following variable:
However, this method reduces the results accuracy and we advise to
manually set the highest CPU frequency instead. More details about
the usage of
RDTSC-based timers can be found in our research
List of Compilation Flags
This is the full list of compilation flags that can be used to control all the previously detailed configuration parameters.
CALIBRATE_RDTSC OFF COMPILE_BENCH_TESTS OFF COMPILE_PRED_BENCHMARK ON COMPILE_SANITY_CHECK_TESTS OFF ENABLE_BENCHMARK_BARRIER OFF ENABLE_DOUBLE_BARRIER OFF ENABLE_GLOBAL_TIMES OFF ENABLE_LOGP_SYNC OFF ENABLE_RDTSC OFF ENABLE_RDTSCP OFF ENABLE_WINDOWSYNC_HCA OFF ENABLE_WINDOWSYNC_JK OFF ENABLE_WINDOWSYNC_SK OFF FREQUENCY_MHZ 2300