Please follow this installation guide to install the MLCommons CM reproducibility and automation language in your native environment or Docker container.
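For reference, a minimal installation sketch using pip (assuming Python 3 and pip are already available; see the linked guide for OS-specific prerequisites and alternatives such as Docker):
python3 -m pip install cmind
You may need to restart your shell afterwards so that the cm command is found on your PATH.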
Then install the repository with CM automation scripts to run MLPerf benchmarks out-of-the-box across different software, hardware, models and data sets:
cm pull repo mlcommons@ck
Note that you can install Python virtual environment via CM to avoid contaminating your local Python installation as described here.
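As an illustrative sketch (the exact script tags are documented at the link above, so treat this as an assumption rather than the canonical command), a CM-managed virtual environment can be created once and is then picked up by subsequent CM scripts:
cm run script --tags=install,python-venv --name=mlperf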
BERT has two variants, bert-99 and bert-99.9, where 99 and 99.9 specify the required accuracy constraint with respect to the reference floating-point model. The bert-99.9 model is applicable only to a datacenter system. In the edge category, bert-99 has the Offline and SingleStream scenarios; in the datacenter category, both bert-99 and bert-99.9 have the Offline and Server scenarios. The commands below assume an edge category system.
cm run script --tags=generate-run-cmds,inference,_find-performance,_all-scenarios \
--model=bert-99 --implementation=reference --device=cpu --backend=onnxruntime \
--category=edge --division=open --quiet
- Use --device=cuda to run the inference on an Nvidia GPU
- Use --division=closed to run all scenarios for the closed division, including the compliance tests
- Use --category=datacenter to run datacenter scenarios (a combined example follows this list)
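For instance, combining the flags above, a find-performance run on an Nvidia GPU for the closed datacenter division would look like this (same command as above, with only the device, division and category flags changed):
cm run script --tags=generate-run-cmds,inference,_find-performance,_all-scenarios \
--model=bert-99 --implementation=reference --device=cuda --backend=onnxruntime \
--category=datacenter --division=closed --quiet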
cm run script --tags=generate-run-cmds,inference,_all-modes,_all-scenarios \
--model=bert-99 --device=cpu --implementation=reference --backend=onnxruntime \
--execution-mode=valid --results_dir=$HOME/inference_3.1_results \
--category=edge --division=open --quiet
- Use --power=yes for measuring power. It is ignored for accuracy and compliance runs. This requires a power analyzer as described here
- Use --division=closed to run all scenarios for the closed division, including the compliance tests
- --offline_target_qps, --server_target_qps and --singlestream_target_latency can be used to override the determined performance numbers (see the example after this list)
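As an illustration, the same valid run with an explicit Offline target QPS override (the value 50 is only a placeholder, not a recommendation):
cm run script --tags=generate-run-cmds,inference,_all-modes,_all-scenarios \
--model=bert-99 --device=cpu --implementation=reference --backend=onnxruntime \
--execution-mode=valid --results_dir=$HOME/inference_3.1_results \
--category=edge --division=open --offline_target_qps=50 --quiet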
cm run script --tags=generate-run-cmds,inference,_populate-readme,_all-scenarios \
--model=bert-99 --device=cpu --implementation=reference --backend=onnxruntime \
--execution-mode=valid --results_dir=$HOME/inference_3.1_results \
--category=edge --division=open --quiet
Here, we are copying the performance and accuracy log files (and the compliance logs in the case of the closed division) from the results directory to the submission tree, following the directory structure required by MLCommons Inference. After the submission tree is generated, the accuracy truncation script is called to truncate the accuracy logs, and then the submission checker is called to validate the generated submission tree.
We should use the master branch of the MLCommons inference repo for the submission checker. You can use the --hw_notes_extra option to add your name to the notes.
cm run script --tags=generate,inference,submission --results_dir=$HOME/inference_3.1_results/valid_results \
--device=cpu --submission_dir=$HOME/inference_submission_tree --clean --run-checker --submitter=cTuning \
--adr.inference-src.version=master --hw_notes_extra="Result taken by NAME" --quiet
The same commands as for onnxruntime should work after replacing --backend=onnxruntime with --backend=tf. For example,
cm run script --tags=generate-run-cmds,inference,_accuracy-only,_all-scenarios \
--model=bert-99 --device=cpu --implementation=reference --backend=tf --execution-mode=valid \
--results_dir=$HOME/inference_3.1_results --quiet
The same commands as for onnxruntime should work after replacing --backend=onnxruntime with --backend=pytorch. For example,
cm run script --tags=generate-run-cmds,inference,_accuracy-only,_all-scenarios \
--model=bert-99 --device=cpu --implementation=reference --backend=pytorch \
--execution-mode=valid --results_dir=$HOME/inference_3.1_results --quiet
For the TensorRT backend we are using the Nvidia implementation and not the MLPerf inference reference implementation, for the reasons below:
- The TensorRT backend is not supported by default in the reference implementation
- The reference implementation is mostly for fp32 models and quantization is not supported by default
- Nvidia has done some fantastic work in optimizing performance for the TensorRT backend
To get set up, please follow the instructions here to download and install TensorRT and cuDNN, unless you already have them installed. This readme also details how to handle the configuration files which are automatically generated by the Nvidia implementation scripts. Once this is done, the following command will run all the modes and scenarios.
cm run script --tags=generate-run-cmds,inference,_all-modes,_all-scenarios \
--model=bert-99 --device=cuda --implementation=nvidia-original --backend=tensorrt \
--execution-mode=valid --results_dir=$HOME/inference_3.1_results --quiet
- Use --power=yes for measuring power. It is ignored for accuracy and compliance runs. This requires a power analyzer as described here
- Use --division=closed to run all scenarios for the closed division, including the compliance tests
- --offline_target_qps, --server_target_qps and --singlestream_target_latency can be used to override the default performance numbers
- Use --category=datacenter to run datacenter scenarios (a combined example follows this list)
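For example, a closed-division datacenter run with an explicit Server target QPS override would look like this (the QPS value is only a placeholder):
cm run script --tags=generate-run-cmds,inference,_all-modes,_all-scenarios \
--model=bert-99 --device=cuda --implementation=nvidia-original --backend=tensorrt \
--execution-mode=valid --results_dir=$HOME/inference_3.1_results \
--category=datacenter --division=closed --server_target_qps=100 --quiet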
The TensorRT backend has an engine generation stage which can be time consuming. For repeated runs, the --adr.nvidia-harness.make_cmd=run_harness option avoids this engine regeneration and reuses the previously generated engines.
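For instance, a repeat of the run above that reuses the previously generated engines:
cm run script --tags=generate-run-cmds,inference,_all-modes,_all-scenarios \
--model=bert-99 --device=cuda --implementation=nvidia-original --backend=tensorrt \
--execution-mode=valid --results_dir=$HOME/inference_3.1_results \
--adr.nvidia-harness.make_cmd=run_harness --quiet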