microbenchmark - CPU Stream Benchmark Revise #712
Codecov Report
Attention: Patch coverage is
❌ Your patch status has failed because the patch coverage (60.00%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.
@@ Coverage Diff @@
## main #712 +/- ##
=======================================
Coverage 86.06% 86.07%
=======================================
Files 99 99
Lines 7211 7209 -2
=======================================
- Hits 6206 6205 -1
+ Misses 1005 1004 -1
abuccts
left a comment
pls fix the build
Description
Add release note for v0.12.0

# Main Features

## SuperBench Improvement
1. - [x] Update Image Build Pipeline (#659)
2. - [x] Add support for arm64 build (#660)
3. - [x] Upgrade dependency versions in pipeline (#671)
4. - [x] Fix installation and lint issues (#684)
5. - [x] Update Flake8 repo (#683)
6. - [x] Init latest python support. (#687)
7. - [x] Add image build on arm64 arch (#690)
8. - [x] Enhancement of ignoring errors for import pkg_resources (#692)
9. - [x] Update label in the ROCm image build (#693)
10. - [x] Support cuda12.8 for Blackwell arch (#682)
11. - [x] Merge multi-arch image (#696)
12. - [x] Update OS of runner to the latest. (#702)
13. - [x] cuda arch flag for cublaslt (#701)

## Micro-benchmark Improvement
1. - [x] Bug Fix - Fix numa error on grace cpu in gpu-copy (#658)
2. - [x] Dependency - Bump onnxruntime-gpu version from 1.10.0 to 1.12.0 (#663)
3. - [x] Benchmarks: micro benchmarks - add general CPU bandwidth and latency benchmark (#662)
4. - [x] Benchmarks: micro benchmarks - add nvbandwidth build and benchmark (#665 and #669)
5. - [x] Fix stderr message in gpu-copy benchmark (#673)
6. - [x] Add arch support for 10.0 in gemm-flops (#680)
7. - [x] Fix tensorrt-inference parsing (#674)
8. - [x] nvbandwidth benchmark need to handle N/A value (#675)
9. - [x] Avoid Unintended nvbandwidth Function Calls in All Benchmarks (#685)
10. - [x] Add GPU Stream Micro Benchmark (#697)
11. - [x] Cuda arch flag for cublaslt (#701)
12. - [x] Support autotuning in cublaslt gemm (#706)
14. - [x] Add FP4 GEMM FLOPS support for cublaslt_gemm benchmark (#711)
15. - [x] CPU Stream Benchmark Revise (#712)
16. - [x] Add cuda12.9 docker image (#716)
17. - [x] Add Grace CPU support for CPU Stream (#719)

## Model Benchmark Improvement
1. - [x] Add LLaMA-2 Models (#668)
2. - [x] Fix typos in documentation and code files (#686)
3. - [x] Add Mixture of Experts Model (#679)
4. - [ ] Add DeepSeek Training Benchmark
5. - [x] Add DeepSeek Inference Benchmark (AMD GPU) (#713)

## Documentation
1. - [x] Update CODEOWNERS (#670)
2. - [x] Update CODEOWNERS (#718)

## Result Analysis
1. - [x] Enhance logging information for diagnosis rule op baseline errors. (#689)
In the current implementation, the CPU‑stream benchmark code renames the binary before the microbench base class can verify its existence, causing the default‐binary check to fail.
This PR adds a “default” binary—built with the standard compile parameters—so that the base class can always find and validate it. Once the default binary is in place, the CPU‑stream code will rename it as needed and re‑check its presence before running the benchmark.
This PR also enables the CPU stream benchmark in the default settings.
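The check-then-rename order described above can be illustrated with a minimal Python sketch. It is not the code from this PR: the class names (`MicroBenchmarkSketch`, `CpuStreamSketch`), the binary names (`stream_default`, `stream_<variant>`), and the reading of "rename" as switching the binary name the benchmark will run are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path


class MicroBenchmarkSketch:
    """Hypothetical stand-in for the microbenchmark base class."""

    def __init__(self, bin_dir, default_bin):
        self.bin_dir = Path(bin_dir)
        self.bin_name = default_bin

    def binary_present(self):
        # True if the currently selected binary exists on disk.
        return (self.bin_dir / self.bin_name).is_file()

    def preprocess(self):
        # Base-class validation: only the currently selected binary is checked.
        return self.binary_present()


class CpuStreamSketch(MicroBenchmarkSketch):
    """Hypothetical CPU stream benchmark that picks a per-variant binary."""

    def __init__(self, bin_dir, variant):
        # Start from the default binary so the base check can always find it.
        super().__init__(bin_dir, 'stream_default')
        self.variant = variant

    def preprocess(self):
        # 1. Let the base class verify the default binary first.
        if not super().preprocess():
            return False
        # 2. Only now switch to the variant-specific binary name.
        self.bin_name = f'stream_{self.variant}'
        # 3. Re-check that the renamed target is actually present.
        return self.binary_present()


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / 'stream_default').touch()
        (Path(tmp) / 'stream_zen3').touch()
        found = CpuStreamSketch(tmp, 'zen3').preprocess()
        missing = CpuStreamSketch(tmp, 'zen4').preprocess()
    return found, missing
```

In this sketch, `demo()` exercises both paths: a variant whose binary was built passes both checks, while a variant with no binary passes the base check but fails the re-check. Switching the name before calling `super().preprocess()` would reproduce the failure mode this PR describes whenever the variant binary is the one the base class looks for.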