-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug: Fix Bug - Add barrier before 'destroy_process_group' in model benchmarks #198
Conversation
…xx_models benchmark
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to workaround, could you update the config file for amd?
otherwise, the model results will override previous ones in the same benchmark
Codecov Report
@@ Coverage Diff @@
## release/0.3 #198 +/- ##
===============================================
- Coverage 88.52% 88.49% -0.04%
===============================================
Files 57 57
Lines 2807 2808 +1
===============================================
Hits 2485 2485
- Misses 322 323 +1
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
…nchmarks (#198) **Description** Add barrier before 'destroy_process_group' to resolve the bug due to when multi models in one model benchmark, some processes haven't finished the previous process group while others failed to initialize new process group for the next model on rocm4.x when running bert_models. **Major Revision** - Add barrier before 'destroy_process_group'.
**Description** Cherry-pick bug fixes from v0.3.0 to main. **Major Revisions** * Docs - Upgrade version and release note (#209) * Benchmarks: Build Pipeline - Update rccl-test git submodule to dc1ad48 (#210) * Benchmarks: Update - Update benchmarks in configuration file (#208) * CI/CD - Update GitHub Action VM (#211) * Benchmarks: Fix Bug - Fix wrong parameters for gpu-sm-copy-bw in configuration examples (#203) * CI/CD - Fix bug in build image for push event (#205) * Benchmark: Fix Bug - fix error message of communication-computation-overlap (#204) * Tool: Fix bug - Fix function naming issue in system info (#200) * CI/CD - Push images in GitHub Action (#202) * Bug - Fix torch.distributed command for single node (#201) * CLI - Integrate system info for node (#199) * Benchmarks: Code Revision - Revise CMake files for microbenchmarks. (#196) * CI/CD - Add ROCm image build in GitHub Actions (#194) * Bug: Fix bug - fix bug of hipBusBandwidth build (#193) * Benchmarks: Build Pipeline - Restore rocblas build logic (#197) * Bug: Fix Bug - Add barrier before 'destroy_process_group' in model benchmarks (#198) * Bug - Revise 'docker run' in sb deploy (#195) * Bug - Fix Bug : fix bug of error param operations to operation in rccl-bw of hpe config (#190) Co-authored-by: Yuting Jiang <v-yujiang@microsoft.com> Co-authored-by: Guoshuai Zhao <guzhao@microsoft.com> Co-authored-by: Ziyue Yang <ziyyang@microsoft.com>
Description
Add barrier before 'destroy_process_group' to resolve the bug due to when multi models in one model benchmark, some processes haven't finished the previous process group while others failed to initialize new process group for the next model on rocm4.x when running bert_models.
Major Revision