
MLPerf Training v0.5 Technical Summary

Seven Benchmarks

MLPerf Training v0.5 is a benchmark suite for measuring ML system speed. Each MLPerf Training benchmark is defined by a Dataset and a Quality Target. MLPerf Training also provides a reference implementation for each benchmark that uses a specific model. The following table summarizes the seven benchmarks in v0.5 of the suite.

Benchmark                      | Dataset            | Quality Target                      | Reference Implementation Model
Image classification           | ImageNet           | 74.90% classification               | ResNet-50 v1.5
Object detection (lightweight) | COCO 2017          | 21.2% mAP                           | SSD (ResNet-34 backbone)
Object detection (heavyweight) | COCO 2017          | 0.377 Box min AP, 0.339 Mask min AP | Mask R-CNN
Translation (recurrent)        | WMT English-German | 21.8 BLEU                           | Neural Machine Translation
Translation (non-recurrent)    | WMT English-German | 25.0 BLEU                           | Transformer
Recommendation                 | MovieLens-20M      | 0.635 HR@10                         | Neural Collaborative Filtering
Reinforcement learning         | Pro games          | 40.00% move prediction              | Mini Go

Measure is Time to Train

Each MLPerf Training benchmark measures the wallclock time required to train a model on the specified dataset to achieve the specified quality target. ML training times have substantial variance. Because of this variance, final MLPerf Training results are obtained by running the benchmark a benchmark-specific number of times, discarding the lowest and highest results, and averaging the remaining results.

All results are then converted into speedups because different benchmarks take substantially different times to train. The speedups are relative to the time it takes to train the unoptimized reference implementation on a Pascal P100. So, an MLPerf result of 10 indicates that the measured system trains a model 10 times as fast as a Pascal P100 trains the reference implementation model.

Even averaging multiple results is not sufficient to eliminate all variance. MLPerf imaging benchmark results vary by very roughly +/- 2.5%, and other MLPerf benchmark results by very roughly +/- 5%.
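As an illustration of the scoring described above, the following Python sketch drops the lowest and highest run times, averages the remainder, and divides the reference time by that average to get a speedup. The function name, inputs, and numbers are hypothetical assumptions, not official MLPerf tooling.

```python
# Illustrative sketch of MLPerf-style scoring; names and numbers are
# hypothetical and this is not official MLPerf tooling.

def benchmark_score(run_times_sec, reference_time_sec):
    """Drop the fastest and slowest runs, average the remainder, and
    return the speedup relative to the reference implementation's
    training time on a single Pascal P100."""
    if len(run_times_sec) < 3:
        raise ValueError("need at least 3 runs to discard the min and max")
    trimmed = sorted(run_times_sec)[1:-1]      # discard lowest and highest
    mean_time = sum(trimmed) / len(trimmed)    # average the remaining runs
    return reference_time_sec / mean_time      # 10.0 means 10x as fast as the reference

# Example: five runs (in hours, converted to seconds) scored against a
# hypothetical 20-hour reference time on a Pascal P100.
runs_sec = [t * 3600 for t in (2.1, 2.0, 2.2, 1.9, 2.05)]
print(round(benchmark_score(runs_sec, reference_time_sec=20 * 3600), 1))  # ~9.8
```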

Results in Divisions and Categories

MLPerf aims to encourage innovation in software as well as hardware by allowing submitters to reimplement the reference implementations. MLPerf has two Divisions that allow different levels of flexibility during reimplementation. The Closed division is intended to compare hardware platforms or software frameworks “apples-to-apples” and requires using the same model and optimizer as the reference implementation. The Open division is intended to foster faster models and optimizers and allows any ML approach that can reach the target quality.

MLPerf divides benchmarks into four Categories based on availability. Available In Cloud systems are available for rent in the cloud. Available On Premise systems contain only components that are available for purchase. Preview systems must be submittable as Available In Cloud or Available On Premise in the next submission round. Research systems either contain experimental hardware or software, or contain available components at an experimentally large scale.

Results Table

The MLPerf results table is organized first by Division and then by Category as described in the previous section. Each row in the results table is a set of results produced by a single submitter using the same software stack and hardware platform. Each row contains the following information (a sketch of a row as a simple data structure follows the list):

  • Submitter: The organization that submitted the results.

  • Hardware: The type of ML hardware used, e.g. accelerators or high-performance CPUs.

  • Chip Count and Type: The number of ML hardware chips used, and if they are accelerators (a) or CPUs (c).

  • Software: The ML framework and primary ML hardware library used.

  • Benchmark Results: The benchmark results as described above. By default, benchmark results are presented as speedups relative to a Pascal P100. The results page enables switching to absolute times.

  • Cloud Scale (only for Available In Cloud systems): Cloud scale is derived from on-demand pricing by several major cloud providers and gives a rough indicator of relative system size/cost. The reference single Pascal P100 system has a cloud scale of 1. A system with a cloud scale of 4 would be roughly four times as big/expensive.

  • Power (only for systems not in the Available In Cloud category): Due to the complexity of standardizing power measurement, this version of MLPerf simply allows voluntary reporting of arbitrary, unofficial power information.

  • Details: A link to the submission's metadata.
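The column list above maps naturally onto a simple record type. The sketch below is a hypothetical Python representation of one results row; the field names and example values are assumptions chosen to mirror the columns, not an official MLPerf schema.

```python
# Hypothetical representation of one results-table row; the field names
# mirror the columns described above and are not an official MLPerf schema.
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class ResultRow:
    submitter: str                        # organization that submitted the results
    hardware: str                         # type of ML hardware, e.g. accelerators or CPUs
    chip_count: int                       # number of ML hardware chips used
    chip_type: str                        # "a" for accelerators, "c" for CPUs
    software: str                         # ML framework and primary ML hardware library
    details_url: str                      # link to the submission's metadata
    speedups: Dict[str, float] = field(default_factory=dict)  # benchmark name -> speedup vs. P100
    cloud_scale: Optional[float] = None   # Available In Cloud systems only
    power: Optional[str] = None           # voluntary, unofficial power information

# Example row with made-up values.
row = ResultRow(
    submitter="ExampleOrg",
    hardware="Accelerator",
    chip_count=8,
    chip_type="a",
    software="SomeFramework 1.0 / SomeLibrary 2.0",
    details_url="https://example.com/submission-details",
    speedups={"Image classification": 9.8, "Translation (non-recurrent)": 7.5},
    cloud_scale=4.0,  # roughly 4x the size/cost of the single-P100 reference system
)
print(row.speedups["Image classification"])
```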

Future Plans

MLPerf v0.5.0 is the “alpha” release of an agile benchmark, and the benchmark is still evolving based on feedback from the community. Changes being considered include raising target quality, adopting a standard batch-size-to-hyperparameter table, scaling up some benchmarks (especially recommendation), and adding new benchmarks. The precise timeline for future submission rounds has not been set, but the rough plan is v0.5.1 in Q1, v0.5.2 in Q2, and v1.0 in Q3 of 2019.