
how to clean up before re-running a test #984

Closed
jdesfossez opened this issue Oct 24, 2023 · 5 comments

@jdesfossez
Contributor

Hi,

I have followed the docs up to the point where I can run the first example of the BERT test with CUDA:
https://github.com/mlcommons/ck/blob/master/docs/mlperf/inference/bert/README_nvidia.md

cmr "generate-run-cmds inference _find-performance _all-scenarios" \
--model=bert-99 --implementation=nvidia-original --device=cuda --backend=tensorrt \
--category=edge --division=open --quiet

It seems to work: the GPU is busy for a while (probably 30+ minutes).
But when it completes and I re-run the exact same command, it finishes in about 40 seconds and all the results are marked as INVALID.

So my question is whether we need to run some kind of cleanup command between runs, or whether I am hitting a bug somewhere.

I have attached a small portion of the output; I see an error in there, but I don't know whether it is fatal:
bert-log.txt

Thanks!

Julien

@jdesfossez
Contributor Author

jdesfossez commented Oct 24, 2023

Follow-up: I deleted the cache folders related to that test in CM/repos/local/cache/, re-ran the command, and it is now running as expected.
Is there a clean way to do this?

* Tags: bert,bert-large,bert-squad,get,language,language-processing,ml-model,raw,script-artifact-5e865dbdc65949d2,_amazon-s3,_fp32,_onnx,_unpacked
  Path: /root/CM/repos/local/cache/966c040426324c19

* Tags: bert,bert-large,bert-squad,get,language,language-processing,ml-model,raw,script-artifact-5e865dbdc65949d2,_amazon-s3,_int8,_onnx,_unpacked
  Path: /root/CM/repos/local/cache/bebee43d0df64737

* Tags: harness,inference,mlcommons,mlperf,nvidia,nvidia-harness,reproduce,script-artifact-bc3b17fb430f4732,_batch_size.1,_bert-99,_bert_,_build_engine,_cuda,_singlestream,_tensorrt
  Path: /root/CM/repos/local/cache/e0df579f9a7341e8

* Tags: harness,inference,mlcommons,mlperf,nvidia,nvidia-harness,reproduce,script-artifact-bc3b17fb430f4732,_batch_size.1024,_bert-99,_bert_,_build_engine,_cuda,_offline,_tensorrt
  Path: /root/CM/repos/local/cache/7f85574449214974
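A possibly cleaner alternative, assuming the generic CM cache commands apply to this setup (not verified here), would be to remove the matching cache entries by tag instead of deleting the folders by hand; the tag selection below is only illustrative:

cm show cache --tags=nvidia-harness,_bert-99      # list the matching cache entries first
cm rm cache --tags=nvidia-harness,_bert-99 -f     # remove only those entries
cm rm cache -f                                    # or wipe the entire CM cache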

@arjunsuresh
Copy link
Contributor

Oh, actually there is a --rerun flag to force a rerun even when results already exist. Can you please try it?
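For example, appending the flag to the find-performance command from above (a sketch only; no other changes):

cmr "generate-run-cmds inference _find-performance _all-scenarios" \
--model=bert-99 --implementation=nvidia-original --device=cuda --backend=tensorrt \
--category=edge --division=open --quiet --rerun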

@jdesfossez
Contributor Author

jdesfossez commented Oct 24, 2023 via email

@arjunsuresh
Contributor

cmr "generate-run-cmds inference _submission _all-scenarios" --model=bert-99 \
--device=cuda --implementation=nvidia-original --backend=tensorrt \
--execution-mode=valid --results_dir=$HOME/results_dir \
--category=edge --division=open --quiet --rerun

This should be the command to get a valid result. If the estimated target_qps is incorrect, you can also provide --offline_target_qps=<ACTUAL_VALUE> to get a valid run. A run is valid only if it runs for at least 10 minutes, and how long the benchmark runs (in the offline scenario) is determined by the given/estimated target_qps value.
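For example, if the measured offline throughput were roughly 4000 samples/s (the value here is illustrative, not measured), it could be passed like this:

cmr "generate-run-cmds inference _submission _all-scenarios" --model=bert-99 \
--device=cuda --implementation=nvidia-original --backend=tensorrt \
--execution-mode=valid --results_dir=$HOME/results_dir \
--offline_target_qps=4000 \
--category=edge --division=open --quiet --rerun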

@jdesfossez
Contributor Author

Ah yes, I think this is what I was missing; it is now running and the GPU looks busy as expected.
Thank you!
