Bug: Fix Bug - Add barrier before 'destroy_process_group' in model benchmarks #198

Merged
3 commits merged into release/0.3 from v-yujiang/runner-split
Sep 13, 2021

yukirora
Contributor

@yukirora yukirora commented Sep 10, 2021

Description
Add a barrier before 'destroy_process_group' to fix a bug that occurs when one model benchmark runs multiple models: some processes have not yet finished with the previous process group while others fail to initialize the new process group for the next model. The failure was observed on rocm4.x when running bert_models.

Major Revision

  • Add barrier before 'destroy_process_group'.
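
The PR's actual diff is not shown on this page, so the following is only a minimal sketch of the idea, assuming a PyTorch `torch.distributed` setup. The function name `teardown_process_group` and its `dist` parameter are hypothetical and are not taken from SuperBench; the parameter exists only so the sketch can be exercised without launching a real multi-process job.

```python
def teardown_process_group(dist=None):
    """Wait for every rank, then destroy the default process group.

    `dist` is expected to behave like `torch.distributed`. Passing it in is
    an assumption of this sketch, not part of the original change.
    Returns True if a process group was torn down, False otherwise.
    """
    if dist is None:
        # Lazily import the real backend when no stand-in is supplied.
        import torch.distributed as dist

    # Nothing to tear down if distributed mode was never initialized.
    if not (dist.is_available() and dist.is_initialized()):
        return False

    # Block until all ranks reach this point, so that no rank starts
    # initializing the next model's process group while other ranks are
    # still using the current one.
    dist.barrier()
    dist.destroy_process_group()
    return True
```

In a benchmark that runs several models in sequence, a call like this would sit at the end of each model's run, before the next model calls `init_process_group`.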

@yukirora yukirora requested a review from a team as a code owner September 10, 2021 12:36
Member

@abuccts abuccts left a comment

As a workaround, could you update the config file for AMD?
Otherwise, the model results will override the previous ones in the same benchmark.

@yukirora yukirora changed the title Runner: Code Revision - Runner launch new process for each model in a model benchmark Bug: Fix Bug - Add barrier before 'destroy_process_group' in model benchmarks Sep 11, 2021

codecov bot commented Sep 11, 2021

Codecov Report

Merging #198 (3262251) into release/0.3 (1f9de77) will decrease coverage by 0.03%.
The diff coverage is 0.00%.

@@               Coverage Diff               @@
##           release/0.3     #198      +/-   ##
===============================================
- Coverage        88.52%   88.49%   -0.04%     
===============================================
  Files               57       57              
  Lines             2807     2808       +1     
===============================================
  Hits              2485     2485              
- Misses             322      323       +1     
| Flag | Coverage Δ |
| --- | --- |
| cpu-unit-test | 74.13% <0.00%> (-0.03%) ⬇️ |
| cuda-unit-test | 88.42% <0.00%> (-0.04%) ⬇️ |

| Impacted Files | Coverage Δ |
| --- | --- |
| ...rbench/benchmarks/model_benchmarks/pytorch_base.py | 74.72% <0.00%> (-0.84%) ⬇️ |


@cp5555 cp5555 self-assigned this Sep 13, 2021
@cp5555 cp5555 added benchmarks SuperBench Benchmarks bug Something isn't working model-benchmarks Model Benchmark Test for SuperBench Benchmarks labels Sep 13, 2021
@cp5555 cp5555 self-requested a review September 13, 2021 05:27
@cp5555 cp5555 merged commit 7a3a450 into release/0.3 Sep 13, 2021
@cp5555 cp5555 deleted the v-yujiang/runner-split branch September 13, 2021 09:40
abuccts pushed a commit that referenced this pull request Sep 24, 2021
…nchmarks (#198)

cp5555 pushed a commit that referenced this pull request Sep 26, 2021
**Description**

Cherry-pick bug fixes from v0.3.0 to main.

**Major Revisions**
* Docs - Upgrade version and release note (#209)
* Benchmarks: Build Pipeline - Update rccl-test git submodule to dc1ad48 (#210)
* Benchmarks: Update - Update benchmarks in configuration file (#208)
* CI/CD - Update GitHub Action VM (#211)
* Benchmarks: Fix Bug - Fix wrong parameters for gpu-sm-copy-bw in configuration examples (#203)
* CI/CD - Fix bug in build image for push event (#205)
* Benchmark: Fix Bug - fix error message of communication-computation-overlap (#204)
* Tool: Fix bug - Fix function naming issue in system info  (#200)
* CI/CD - Push images in GitHub Action (#202)
* Bug - Fix torch.distributed command for single node (#201)
* CLI - Integrate system info for node (#199)
* Benchmarks: Code Revision - Revise CMake files for microbenchmarks. (#196)
* CI/CD - Add ROCm image build in GitHub Actions (#194)
* Bug: Fix bug - fix bug of hipBusBandwidth build (#193)
* Benchmarks: Build Pipeline - Restore rocblas build logic (#197)
* Bug: Fix Bug - Add barrier before 'destroy_process_group' in model benchmarks (#198)
* Bug - Revise 'docker run' in sb deploy (#195)
* Bug - Fix Bug : fix bug of error param operations to operation in rccl-bw of hpe config (#190)

Co-authored-by: Yuting Jiang <v-yujiang@microsoft.com>
Co-authored-by: Guoshuai Zhao <guzhao@microsoft.com>
Co-authored-by: Ziyue Yang <ziyyang@microsoft.com>