
how to clean up before re-running a test #984

Closed
jdesfossez opened this issue Oct 24, 2023 · 5 comments

@jdesfossez
Contributor

Hi,

I have followed the docs up to the point where I can run the first example of the BERT test with CUDA:
https://github.com/mlcommons/ck/blob/master/docs/mlperf/inference/bert/README_nvidia.md

cmr "generate-run-cmds inference _find-performance _all-scenarios" \
--model=bert-99 --implementation=nvidia-original --device=cuda --backend=tensorrt \
--category=edge --division=open --quiet

It seems to work: the GPU is busy for a while (probably 30+ minutes).
But when it completes and I re-run the exact same command, it finishes in about 40 seconds and all the results are marked as INVALID.

So my question is whether we need to run some kind of cleanup command between runs, or whether I am hitting a bug somewhere.

I have attached a small portion of the output; I see an error in there, but I don't know whether it is fatal:
bert-log.txt

Thanks!

Julien

@jdesfossez
Contributor Author

jdesfossez commented Oct 24, 2023

Follow-up: I deleted the cache folders related to that test in CM/repos/local/cache/, re-ran the command, and it is now running as expected.
Is there a clean way to do this?

* Tags: bert,bert-large,bert-squad,get,language,language-processing,ml-model,raw,script-artifact-5e865dbdc65949d2,_amazon-s3,_fp32,_onnx,_unpacked
  Path: /root/CM/repos/local/cache/966c040426324c19

* Tags: bert,bert-large,bert-squad,get,language,language-processing,ml-model,raw,script-artifact-5e865dbdc65949d2,_amazon-s3,_int8,_onnx,_unpacked
  Path: /root/CM/repos/local/cache/bebee43d0df64737

* Tags: harness,inference,mlcommons,mlperf,nvidia,nvidia-harness,reproduce,script-artifact-bc3b17fb430f4732,_batch_size.1,_bert-99,_bert_,_build_engine,_cuda,_singlestream,_tensorrt
  Path: /root/CM/repos/local/cache/e0df579f9a7341e8

* Tags: harness,inference,mlcommons,mlperf,nvidia,nvidia-harness,reproduce,script-artifact-bc3b17fb430f4732,_batch_size.1024,_bert-99,_bert_,_build_engine,_cuda,_offline,_tensorrt
  Path: /root/CM/repos/local/cache/7f85574449214974
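A possibly cleaner alternative, assuming the generic CM cache commands apply to this setup (not verified here), would be to remove the matching cache entries by tag instead of deleting the folders by hand; the tag selection below is only illustrative:

cm show cache --tags=nvidia-harness,_bert-99      # list the matching cache entries first
cm rm cache --tags=nvidia-harness,_bert-99 -f     # remove only those entries
cm rm cache -f                                    # or wipe the entire CM cache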

@arjunsuresh
Copy link
Contributor

Oh, actually there is a --rerun flag to force a rerun even when results already exist. Can you please try it?
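For example, appending the flag to the find-performance command from above (a sketch only; no other changes):

cmr "generate-run-cmds inference _find-performance _all-scenarios" \
--model=bert-99 --implementation=nvidia-original --device=cuda --backend=tensorrt \
--category=edge --division=open --quiet --rerun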

@jdesfossez
Contributor Author

jdesfossez commented Oct 24, 2023 via email

@arjunsuresh
Contributor

cmr "generate-run-cmds inference _submission _all-scenarios" --model=bert-99 \
--device=cuda --implementation=nvidia-original --backend=tensorrt \
--execution-mode=valid --results_dir=$HOME/results_dir \
--category=edge --division=open --quiet --rerun

This should be the command to get a valid result. If the estimated target_qps is incorrect, you can also provide --offline_target_qps=<ACTUAL_VALUE> to get a valid run. A run is valid only if it runs for at least 10 minutes, and how long the benchmark runs (in the offline scenario) is determined by the given/estimated target_qps value.
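For example, if the measured offline throughput were roughly 4000 samples/s (the value here is illustrative, not measured), it could be passed like this:

cmr "generate-run-cmds inference _submission _all-scenarios" --model=bert-99 \
--device=cuda --implementation=nvidia-original --backend=tensorrt \
--execution-mode=valid --results_dir=$HOME/results_dir \
--offline_target_qps=4000 \
--category=edge --division=open --quiet --rerun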

@jdesfossez
Contributor Author

Ah yes, I think this is what I was missing; it is now running and the GPU looks busy as expected.
Thank you!
