Revision ae9f1e0, Dec 18, 2018. Contributors: @petermattson, @nvpaulius

MLPerf Training Rules

1. Overview

This document describes how to implement the MLPerf Training Suite using an ML framework and how to use that implementation to measure the performance of an ML software framework or hardware.

The MLPerf name and logo are trademarks. In order to refer to a result using the MLPerf name, the result must conform to the letter and spirit of the rules specified in this document. The MLPerf organization reserves the right to solely determine if a use of its name or logo is acceptable.

1.1. Definitions (read this section carefully)

The following definitions are used throughout this document:

Performance always refers to execution speed.

Quality always refers to a model’s ability to produce “correct” outputs.

A system consists of a defined set of hardware resources such as processors, memories, disks, and interconnect. It also includes specific versions of all software, such as operating system, compilers, libraries, and drivers, that significantly influence the running time of a benchmark, excluding the ML framework.

A framework is a specific version of a software library or set of related libraries, possibly with an associated offline compiler, for training ML models using a system. Examples include specific versions of Caffe2, MXNet, PaddlePaddle, PyTorch, or TensorFlow.

A benchmark is an abstract problem that can be solved using ML by training a model based on a specific dataset or simulation environment to a target quality level.

A suite is a specific set of benchmarks.

A division is a set of rules for implementing benchmarks from a suite to produce a class of comparable results.

A reference implementation is a specific implementation of a benchmark provided by the MLPerf organization.

A benchmark implementation is an implementation of a benchmark in a particular framework by a user under the rules of a specific division.

A submission implementation set is a set of benchmark implementations for one or more benchmarks from a suite under the rules of a specific division using the same framework.

A run is a complete execution of an implementation on a system, training a model from initialization to the quality target.

A run result is the wallclock time required for a run.

A reference result is the result provided by the MLPerf organization for each reference implementation.

A benchmark result is the mean of a benchmark-specific number of run results, dropping the highest and lowest results. The result is then normalized to the reference result for that benchmark. Normalization is of the form (reference result / benchmark result) such that a better benchmark result produces a higher number.

A submission result set is one benchmark result for each benchmark implementation in a submission implementation set.

A submission is a submission implementation set and a corresponding submission result set.

A custom summary result is the weighted geometric mean of an arbitrary set of results from a specific submission. MLPerf v0.5 endorses this methodology for computing custom summary results but does not endorse any official summary result.
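As an illustration of the weighted geometric mean named above, here is a minimal Python sketch. The function name `weighted_geometric_mean` and the choice to accept arbitrary non-negative weights are assumptions made for this example; the rules do not prescribe an implementation.

```python
import math


def weighted_geometric_mean(results, weights):
    """Return exp(sum(w_i * ln(x_i)) / sum(w_i)) for positive results x_i.

    Hypothetical helper for illustration only; not part of any MLPerf tooling.
    """
    if not results or len(results) != len(weights):
        raise ValueError("results and weights must be non-empty and equal in length")
    if any(x <= 0 for x in results):
        raise ValueError("results must be positive")
    total_weight = sum(weights)
    if any(w < 0 for w in weights) or total_weight == 0:
        raise ValueError("weights must be non-negative and not all zero")
    # Work in log space so that the product of many results cannot overflow.
    log_sum = sum(w * math.log(x) for x, w in zip(results, weights))
    return math.exp(log_sum / total_weight)
```

With equal weights this reduces to the ordinary geometric mean, so two benchmark results of 2.0 and 8.0 would summarize to approximately 4.0.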

2. General rules

The following rules apply to all benchmark implementations.

2.1. Strive to be fair

Benchmarking should be conducted to measure the framework and system performance as fairly as possible. Ethics and reputation matter.

2.2. System and framework must be consistent

The same system and framework must be used for a submission result set. Note that the reference implementations do not all use the same framework in v0.5.

2.3. System and framework must be available

If you are measuring the performance of a publicly available and widely-used system or framework, you must use publicly available and widely-used versions of the system or framework.

If you are measuring the performance of an experimental framework or system, you must make the system and framework you use available upon demand for replication.

2.4. Benchmark implementations must be shared

Source code used for the benchmark implementations must be open-sourced under a license that permits a commercial entity to freely use the implementation for benchmarking. The code must be available as long as the results are actively used.

2.5. Benchmark detection is not allowed

The framework and system should not detect and behave differently for benchmarks.

2.6. Pre-training is not allowed

The implementation should not encode any information about the content of the dataset or a successful model’s state in any form.

3. Benchmarks

The benchmark suite consists of the benchmarks shown in the following table.

| Area | Problem | Dataset | Quality Target |
|---|---|---|---|
|  | Image classification |  | 74.90% classification (discussing 75.9%) |
|  | Object detection (light weight) |  | 21.2% mAP |
|  | Object detection (heavy weight) |  | 0.377 Box min AP and 0.339 Mask min AP |
|  | Translation (recurrent) | WMT English-German | 21.8 Sacre BLEU |
|  | Translation (non-recurrent) | WMT English-German | 25.00 BLEU |
|  |  |  | 0.635 HR@10 |
|  | Reinforcement learning |  | 40.00% pro move prediction |

The following benchmarks are included but delayed to the next submission cycle:

| Area | Problem | Dataset | Quality Target |
|---|---|---|---|
|  | Speech recognition |  |  |

The MLPerf organization provides a reference implementation of each benchmark, which includes the following elements:

Code that implements the model in a framework.

A plain text “” file that describes:

  • Problem

    • Dataset/Environment

    • Publication/Attribution

    • Data preprocessing

    • Training and test data separation

    • Training data order

    • Test data order

    • Simulation environment (RL models only)

  • Model

    • Publication/Attribution

    • List of layers

    • Weight and bias initialization

    • Loss function

    • Optimizer

  • Quality

    • Quality metric

    • Quality target

    • Evaluation frequency (training items between quality evaluations)

    • Evaluation thoroughness (test items per quality evaluation)

  • Directions

    • Steps to configure machine

    • Steps to download and verify data

    • Steps to run and time

A “download_dataset” script that downloads the dataset.

A “verify_dataset” script that verifies the dataset against the checksum.

A “run_and_time” script that executes the benchmark and reports the wall-clock time.
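The checksum verification step could be sketched as follows. This is not the reference “verify_dataset” script: the helper names `file_sha256` and `verify_file` are hypothetical, and SHA-256 is an assumption, since this document does not state which checksum algorithm the reference scripts use.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path, expected_hex):
    """Return True if the file's digest matches the expected hex string (case-insensitive)."""
    return file_sha256(path) == expected_hex.lower()
```

Reading in fixed-size chunks keeps memory use constant, which matters for multi-gigabyte dataset archives.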

4. Divisions

There are two divisions of the benchmark suite, the Closed division and the Open division.

4.1. Closed Division

The Closed division requires using the same preprocessing, model, and training method as the reference implementation.

The closed division models are:

| Area | Problem | Model |
|---|---|---|
|  | Image classification | Resnet-50 v1.5 |
|  | Object detection (light weight) |  |
|  | Object detection (heavy weight) | Mask R-CNN |
|  | Speech recognition | Deep Speech 2 |
|  | Translation (recurrent) |  |
|  | Translation (non-recurrent) |  |
|  |  | Neural Collaborative Filtering |
|  | Reinforcement learning | Mini Go (based on Alpha Go paper) |

Closed division benchmarks must be referred to using the benchmark name plus the term Closed, e.g. “for the Image Classification Closed benchmark, the system achieved a result of 7.2.”

4.2. Open Division

The Open division allows using arbitrary preprocessing, model, and/or training method. However, the Open division still requires using supervised or reinforcement machine learning in which a model is iteratively improved based on training data, simulation, or self-play.

Open division benchmarks must be referred to using the benchmark name plus the term Open, e.g. “for the Image Classification Open benchmark, the system achieved a result of 7.2.”

5. Basics

5.1. Random numbers

CLOSED: Random numbers must be generated using stock random number generators.

Random number generators may be seeded from the following sources:

  • Clock

  • System source of randomness, e.g. /dev/random or /dev/urandom

  • Another random number generator initialized with an allowed seed

Random number generators may be initialized repeatedly in multiple processes or threads. For a single run, the same seed may be shared across multiple processes or threads.

OPEN: Any random number generation may be used.
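The three Closed division seed sources can be illustrated with Python's standard library. This is a sketch only: the function names, the 64-bit seed width, and the use of `random.Random` as the stock generator are assumptions for the example, not requirements of the rules.

```python
import os
import random
import time


def seed_from_clock():
    """Seed derived from the clock (nanoseconds since the epoch)."""
    return time.time_ns()


def seed_from_system():
    """Seed derived from the system source of randomness (os.urandom)."""
    return int.from_bytes(os.urandom(8), "big")


def seed_from_rng(parent):
    """Seed drawn from another generator that was itself seeded from an allowed source."""
    return parent.getrandbits(64)


# One seed is chosen for the run and shared by every worker, which the
# Closed division rules permit for a single run.
run_seed = seed_from_system()
worker_rngs = [random.Random(run_seed) for _ in range(4)]
```

Because every worker generator starts from the same seed, each one produces the same sequence; an implementation that wants distinct streams per worker could instead derive per-worker seeds with `seed_from_rng`.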

5.2. Numerical formats

CLOSED: The numerical formats fp64, fp32, fp16, and bfloat16 are pre-approved for use. Additional formats require explicit approval. Scaling may be added where required to compensate for different precision.

OPEN: Any format and scaling may be used.

5.3. Epoch numbering

Epochs should always be numbered from 0.

5.4. Result rounding

Public results should be rounded normally.

6. Data Set

6.1. Data State at Start of Run

Each reference implementation includes a script to download the input dataset and a script to verify the dataset using a checksum.

The data must then be preprocessed in a manner consistent with the reference implementation, excepting any transformations that must be done for each run (e.g. random transformations). The data may also be reformatted for the target system provided that the reformatting does not introduce new information or introduce duplicate copies of data. This policy is intended to simplify v0.5 and will be reconsidered.

You must flush the cache or restart the system prior to benchmarking. Data can start on any durable storage system such as local disks and cloud storage systems. This explicitly excludes RAM.

6.2. Preprocessing During the Run

Only preprocessing that must be done for each run (e.g. random transformations) must be timed.

CLOSED: The same preprocessing steps as the reference implementation must be used.

OPEN: Any preprocessing steps are allowed. However, each datum must be preprocessed individually in a manner that is not influenced by any other data.

6.3. Data Representation

CLOSED: Images must have the same size as in the reference implementation. Mathematically equivalent padding of images is allowed.

CLOSED: For benchmarks with sequence inputs, you may choose a length N and either truncate all examples to length N or throw out all examples which exceed length N. This must be done uniformly for all examples. This may only be done on the training set and not the evaluation set.
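The two permitted treatments of sequence inputs can be sketched in a few lines of Python. The function names are hypothetical, and sequences are assumed to be Python lists of tokens; either function would be applied to the training set only, with one value of N for all examples.

```python
def truncate_to_length(examples, n):
    """Truncate every training example to at most n tokens."""
    return [seq[:n] for seq in examples]


def drop_longer_than(examples, n):
    """Discard every training example whose length exceeds n tokens."""
    return [seq for seq in examples if len(seq) <= n]
```

For example, with N = 2 the inputs `[[1, 2, 3], [4]]` become `[[1, 2], [4]]` under truncation and `[[4]]` under dropping.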

OPEN: The closed division data representation restrictions only apply at the start of the run. Data may be represented in an arbitrary fashion during the run.

6.4. Training and Test Sets

If applicable, the dataset must be separated into training and test sets in the same manner as the reference implementation.

6.5. Training Data Order

CLOSED: The training and test data must be traversed in the same conceptual order as the reference implementation. For instance, the data might be traversed sequentially or randomly with uniform distribution. Batch size, shard size, and the random number generator will affect order.

Future versions of the benchmark suite may specify the Closed traversal order.

OPEN: The training data may be traversed in any order. The test data must be traversed in the same order as the reference implementation.

7. RL Environment

CLOSED: The implementation must use the same RL algorithm and simulator or game as the reference implementation, with the same parameters.

OPEN: The implementation may use a different RL algorithm but must use the same simulator or game with the same parameters. If the reference implementation generates all data online, the Open division implementation must also generate all data online.

It is allowed and encouraged to parallelize and otherwise optimize (e.g. by implementing in a compiled language) the RL environment provided that the semantics are preserved.

8. Model

CLOSED: The benchmark implementation must use the same model as the reference implementation, as defined by the remainder of this section.

OPEN: The benchmark implementation may use a different model.

8.1. Graph Definition

CLOSED: Each of the current frameworks has a graph that describes the operations performed during the forward propagation of training. The frameworks automatically infer and execute the corresponding back-propagation computations from this graph. Benchmark implementations must use the same graph as the reference implementation.

8.2. Weight and Bias Initialization

CLOSED: Weights and biases must be initialized using the same constant or random value distribution as the reference implementation.

OPEN: Weights and biases must be initialized using a consistent constant or random value distribution.

8.3. Graph Execution

CLOSED: Frameworks are free to optimize the non-weight parts of the computation graph provided that the changes are mathematically equivalent. Optimizations and graph or code transformations such as dead code elimination, common subexpression elimination, loop-invariant code motion, and recomputation of node state are therefore allowed.

OPEN: Frameworks are free to alter the graph.

9. Training Loop

9.1. Hyperparameters and Optimizer


For v0.5.0, the following rules apply:

By default, the hyperparameters and optimizer must be the same as the reference. Hyperparameters include regularization terms such as norms and weight decays.

The following scheme for scaling learning rate schedule and adding warmup applies to all models that use SGD.

  • An arbitrary batch size may be chosen to allow for tailoring the application to the hardware platform’s memory hierarchy. The batch size must be constant for the entire run, with the exception of the final batch in each epoch which may be smaller to account for the remainder when the number of samples is divided by batch.

  • Learning rate may be changed to accommodate the change in batch size or different precision. The learning rate schedule is defined relative to the reference implementation using four parameters wb, wr, t, and r:

    • A linear warm-up period of wb batches may be added with a per batch step size wr. It is assumed that the reference implementation learning rate is a constant r0 for more than wb batches. Then the warm-up learning rate for batch b is r0 - (wb - b) * wr. The term wr is constrained to be (r0 / (wb * 2^wk)) where wk is a non-negative integer.

    • The learning rate schedule may be scaled in time by multiplying the input epoch by a constant factor t and rounding down, where t is constrained to be (1 + tk/10) where tk is a positive integer.

    • The learning rate may be scaled by a constant factor r, where r is an integer.
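The warm-up and time-scaling formulas in the list above can be written out as a short Python sketch. The function names are hypothetical, and the treatment of batches at or beyond wb (returning r0 unchanged) is an assumption based on the stated premise that the reference learning rate is constant over that period. How these parameters combine with the integer factor r is not spelled out in the text, so the sketch keeps each piece separate.

```python
def warmup_step(r0, wb, wk):
    """Per-batch step size wr = r0 / (wb * 2^wk), with wk a non-negative integer."""
    if wb <= 0 or wk < 0:
        raise ValueError("wb must be positive and wk non-negative")
    return r0 / (wb * 2 ** wk)


def warmup_lr(b, r0, wb, wk):
    """Learning rate for batch b: r0 - (wb - b) * wr during warm-up, r0 afterwards."""
    if b >= wb:
        return r0
    return r0 - (wb - b) * warmup_step(r0, wb, wk)


def scaled_epoch(epoch, tk):
    """Multiply the input epoch by t = 1 + tk/10 and round down (tk a positive integer).

    Integer arithmetic avoids floating-point error in the floor operation.
    """
    if tk < 1:
        raise ValueError("tk must be a positive integer")
    return (epoch * (10 + tk)) // 10
```

With wk = 0 the warm-up starts from a learning rate of zero at batch 0; with wk = 1 it starts from r0 / 2. In both cases the rate reaches r0 at batch wb.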

The following model-specific exceptions are also allowed:

| Model | Hyperparameter | Change allowed |
|---|---|---|
|  | base learning rate | 0.1 or 0.128 |
|  | lr2 weight decay | Arbitrary constant |
|  | maximum number of samples attempted when generating a training patch for a given IoU choice | Arbitrary integer >= 1 |
| Mask R-CNN | number of image candidates | 2000 or 1000*batch_size patches (may sample either from a pool of all examples, or individually and uniformly from each image) |
|  |  | Adam or Lazy Adam |
|  |  | Arbitrary constant |
|  |  | Arbitrary constant |
|  |  | Arbitrary constant |
|  | learning rate | Arbitrary constant |
|  | learning rate decay function | May use alt_decay fn in reference |
|  | decay_start (number of updates after which the first lr decay occurs) | Arbitrary constant |
|  | decay_interval (number of updates between lr decays) | Arbitrary constant |
|  | warmup function | May use alt_warmup fn in reference |
|  |  | Arbitrary constant, suggest 200 |
|  |  | Adam or Lazy Adam |
|  | lr hyperparam to learning rate function | Arbitrary constant |
|  | learning_rate_warmup_steps hyperparam to lr function | Arbitrary constant |

Some benchmarks may require extensions of these policies; submitters are encouraged to request extensions based on data.

For version 0.5.1, MLPerf will be moving to a batchsize-to-hyperparameter-and-optimizer table.

OPEN: Hyperparameters and optimizer may be freely changed.

9.2. Loss function

CLOSED: The same loss function used in the reference implementation must be used.

OPEN: Any loss function may be used. Do not confuse the loss function with target quality measure.

9.3. Quality measure

Each run must reach a target quality level on the reference implementation quality measure. The time to measure quality is included in the wallclock time.

The same quality measure as the reference implementation must be used. The quality measure must be evaluated at least as frequently (in terms of number of training items between test sets) and at least as thoroughly (in terms of number of tests per set) as in the reference implementation. Typically, a test consists of comparing the output of one forward pass through the network with the desired output from the test set.

Checkpoints can be created at the discretion of the submitter. No checkpoints are required to be produced or retained. This policy is intended to simplify v0.5 and will be reconsidered.

9.4. Equivalence exceptions

The CLOSED division allows limited exemptions to mathematical equivalence between implementations for pragmatic purposes, including:

  • Different methods can be used to add color jitter as long as the methods are of a similar distribution and magnitude to the reference.

  • If data set size is not evenly divisible by batch size, one of several techniques may be used. The last batch in an epoch may be composed of the remaining samples in the epoch, may be padded, or may be a mixed batch composed of samples from the end of one epoch and the start of the next. If the mixed batch technique is used, quality for the ending epoch must be evaluated after the mixed batch. If the padding technique is used, the first batch may be padded instead of the last batch.

  • Values introduced for padding purposes may be reflected in batch norm computations.

  • Adam optimizer implementations may use the very small value epsilon to maintain mathematical stability in slightly different ways, provided that methods are reviewed and approved in advance. One such method involves squaring the value of epsilon and moving epsilon inside the square root in the parameter update equation.

  • ImageNet has 1000 classes but the reference uses 1001. TF has an additional 'I don’t know' class. Both are allowed.
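The three batching techniques allowed when the data set size is not evenly divisible by the batch size can be sketched as follows. The function names and the use of `None` as a padding value are assumptions for illustration; samples are assumed to be held in Python lists.

```python
def partial_last_batch(samples, batch_size):
    """Final batch holds only the remaining samples and may be smaller."""
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]


def padded_last_batch(samples, batch_size, pad_value=None):
    """Final batch is padded up to the full batch size."""
    batches = partial_last_batch(samples, batch_size)
    if batches and len(batches[-1]) < batch_size:
        missing = batch_size - len(batches[-1])
        batches[-1] = batches[-1] + [pad_value] * missing
    return batches


def mixed_batches(epochs, batch_size):
    """Batches may span an epoch boundary, mixing the end of one epoch with the start of the next."""
    stream = [sample for epoch in epochs for sample in epoch]
    return [stream[i:i + batch_size] for i in range(0, len(stream), batch_size)]
```

For five samples and a batch size of two, the first function yields a final batch of one sample and the second pads that batch to two. In `mixed_batches`, two three-sample epochs produce a middle batch containing the last sample of the first epoch and the first sample of the second, after which quality for the ending epoch would be evaluated.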

Additional exemptions need to be explicitly requested and approved in advance. In general, exemptions may be approved for techniques that are common industry practice, introduce small differences that would be difficult to engineer around relative to their significance, and do not substantially decrease the required computation. Over time, MLPerf should seek to help the industry converge on standards and remove exemptions.

The OPEN division does not restrict mathematical equivalence.

10. Run Results

A run result consists of a wall-clock timing measurement for an entire run, including model construction, any data preprocessing required to be on the clock, training, and quality testing.

The clock must start before any part of the system performs any benchmark specific work (e.g. model creation, or dataset manipulation), and may be stopped as soon as any part of the system determines target accuracy has been reached.

The clock may not be paused during the run.
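The timing rules in this section can be sketched as a small Python harness. This is not the reference “run_and_time” script: the function name and its callback parameters (`build_model`, `train_one_epoch`, `evaluate`) are hypothetical, and `time.monotonic()` is an assumed choice of clock that measures elapsed time without being affected by system clock adjustments.

```python
import time


def run_and_time(build_model, train_one_epoch, evaluate, target_quality, max_epochs):
    """Return elapsed seconds for a run that reaches the target, or None if it never does."""
    # The clock starts before any benchmark-specific work, including model creation.
    start = time.monotonic()
    model = build_model()
    for epoch in range(max_epochs):  # epochs are numbered from 0
        train_one_epoch(model, epoch)
        # Quality evaluation happens on the clock; the clock is never paused.
        quality = evaluate(model)
        if quality >= target_quality:
            # The clock may stop as soon as the target is known to be reached.
            return time.monotonic() - start
    return None
```

Keeping model construction and evaluation inside the timed region mirrors the requirement that a run result covers the entire run.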

11. Benchmark Results

Each benchmark result is based on a set of run results. The number of results for each benchmark is based on a combination of the variance of the benchmark result, the cost of each run, and the likelihood of convergence.

| Area | Problem | Number of Runs |
|---|---|---|
|  | Image classification |  |
|  | Object detection (light weight) |  |
|  | Object detection (heavy weight) |  |
|  | Translation (recurrent) |  |
|  | Translation (non-recurrent) |  |
|  |  | Run 100, use first 50 that converge |
|  | Reinforcement learning |  |

Each benchmark result is computed by dropping the fastest and slowest runs, then taking the mean of the remaining times.

Each benchmark result should be normalized by dividing the reference result for the corresponding reference implementation by the benchmark result. This normalization produces higher numbers for better results, which better aligns with human intuition.
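The computation described above (drop the fastest and slowest runs, average the rest, then normalize against the reference result) can be sketched in Python. The function name `benchmark_result` is hypothetical, and the requirement of at least three runs is an assumption made so that something remains after trimming.

```python
from statistics import mean


def benchmark_result(run_times, reference_time):
    """Return reference_time divided by the trimmed mean of the run times."""
    if len(run_times) < 3:
        raise ValueError("need at least three runs to drop the fastest and slowest")
    ordered = sorted(run_times)
    trimmed = ordered[1:-1]  # drop one fastest and one slowest run
    mean_time = mean(trimmed)
    # Dividing the reference by the measured time makes faster systems score higher.
    return reference_time / mean_time
```

For run times of 10, 40, 50, 60, and 100 with a reference result of 100, the trimmed mean is 50 and the normalized result is 2.0.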

12. Submission Result Set

All results in a submission result set must be for benchmarks from the same suite and produced under the rules of the same division. Results may be reported for one, some, or all benchmarks.

All results must be produced using the same framework and system. The only difference in software and hardware between results should be the benchmark implementation.

An organization or individual may submit multiple submissions.

13. Framework Reporting

Report the framework used, including version.

14. System Reporting

If the system is available in the cloud, it should be benchmarked in the cloud. On-premise benchmarking is allowed when the required system is not available in the cloud.

14.1. With Cloud Hardware

14.1.1. Replication recipe

Report a recipe that starts from a vanilla VM image or Docker container and a sequence of steps that creates the system that performs the benchmark measurement.

14.1.2. Scale

Cloud results are presented alongside a cloud scale number, which seeks to approximate the cost of the system involved. Cloud scale is derived from the main hardware components used in the system. It is computed after submission from a table that relates each set of hardware components to the on-demand hourly price of renting a typical system containing that set.

The table is populated from the hourly on-demand prices across a specific set of large cloud providers for a specific region, with all entries then normalized to a common system. For the initial v0.5 submission, the set of cloud providers is { Alibaba, Amazon, Google, and Microsoft }, the region is Eastern United States, and the common system is a typical system containing 1 NVIDIA P100. Cloud scale for ML accelerators offered by only a subset of the cloud providers is based on that subset. The table will be updated every submission cycle, after submissions but before result posting, on a date determined by the cloud scale working group.

14.2. With On-premise Hardware

14.2.1. Replication recipe

Report everything that will eventually be required by a third-party user to replicate the result when the hardware and software become widely available.

14.2.2. Power

For v0.5, power information is not required. You may optionally include a hyperlink to power information for the system.

15. Submissions

The MLPerf organization will create a database that collects submission data and a website that presents the results.

15.1. Submission Compliance Logs

Submissions must contain a properly formatted compliance log for each run, even if the run result does not directly impact the benchmark result because it was the lowest, highest, or allowed non-convergence. See the compliance/ directory in the GitHub repo for the Python files used to produce the compliance log.

Each compliance log is produced by the benchmark implementation calling a standard "mlperf_print" function with a different set of tags. There is a standard set of tags which are required for all submissions. Some tags are required once per run, some once per epoch, some once per eval, and some are only required if optional code, such as padding, is included. Sample logs are provided for each benchmark. A standard script is provided that checks for all required tags for a given benchmark.

The "mlperf_print" function may be reimplemented if needed in another language provided that the semantics are preserved.

For practical implementation reasons, the compliance log may be the result of combining multiple other logs from the run, e.g. from multiple processes.

All official run results will be extracted from the log based on the start and end tags.

All logs should be encrypted prior to submission using the script. The script will not contain a valid encryption key until two days before submissions and should not be used prior to that point.

15.2. Submission Form

Submissions to the database must use the provided submission form to report all required information.

15.3. Submission Process

Submit the completed form and supporting code to the MLPerf organization GitHub mlperf/results repo as a PR.