Release - SuperBench v0.3.0 #212

Merged
merged 18 commits into from
Sep 26, 2021

Conversation

abuccts
Member

@abuccts abuccts commented Sep 24, 2021

Description

Cherry-pick bug fixes from v0.3.0 to main.

Major Revisions

Co-authored-by: Yuting Jiang v-yujiang@microsoft.com
Co-authored-by: Guoshuai Zhao guzhao@microsoft.com
Co-authored-by: Ziyue Yang ziyyang@microsoft.com

Yuting Jiang and others added 18 commits September 24, 2021 15:19
…l-bw of hpe config (#190)

**Description**
Fix the incorrect `operations` parameter of rccl-bw in the HPE MI100 config.

**Major Revision**
- operations->operation
**Description**

Revise 'docker run' in sb deploy due to base image running endpoint/cmd under /root.

**Major Revision**

- define endpoint bash when 'docker run'
…nchmarks (#198)

**Description**
Add a barrier before 'destroy_process_group' to fix a bug that occurs when one model benchmark runs multiple models: some processes have not finished with the previous process group while others fail to initialize a new process group for the next model. Observed on rocm4.x when running bert_models.

**Major Revision**
-  Add barrier before 'destroy_process_group'.
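
The ordering described above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the helper name `teardown_process_group` and its `dist` parameter are hypothetical, and the `dist` object is assumed to expose the `torch.distributed` API (`is_available`, `is_initialized`, `barrier`, `destroy_process_group`).

```python
"""Sketch: synchronize all ranks before tearing down a process group."""


def teardown_process_group(dist=None):
    """Wait for every rank, then destroy the default process group.

    `dist` is a hypothetical injection point that should behave like the
    torch.distributed module; it exists only so the sketch can be exercised
    without a multi-GPU job. Returns True if a group was destroyed and
    False if there was nothing to tear down.
    """
    if dist is None:
        # Imported lazily so this file can be loaded without PyTorch.
        import torch.distributed as dist

    if not (dist.is_available() and dist.is_initialized()):
        return False

    # Every rank blocks here until all ranks have finished the current
    # model, so no rank calls init_process_group for the next model while
    # another rank is still using the old group.
    dist.barrier()
    dist.destroy_process_group()
    return True
```

In a real benchmark loop, a call like this would sit at the end of each model's run, before the next model initializes its own process group.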
**Description**
Restore rocblas build logic to cancel support of rocblas build in the rocm4.0_ubuntu18.04_py3.6_pytorch_1.7.0 base image.

**Major Revision**
-  restore rocblas build logic, remove gpu target limit and other resource limit for rocm4.0.
**Description**
Fix a bug in building hipBusBandwidth.

**Major Revision**
- building the Docker image failed to enter the check 'hip/samples/1_Utils/hipBusBandwidth/CMakeLists.txt', so this check is removed
- add sb_micro_path for rocm_bandwidthTest
Add ROCm image build in GitHub Actions.
…196)

**Description**
1. Do `enable_language(CUDA)` before using `CMAKE_CUDA_COMPILER_VERSION`
2. Use `cmake --install` to install the target, which calls `cmake -P cmake_install.cmake`, instead of `make Makefile`, to avoid the error `make: *** No rule to make target 'install'.  Stop.`
Integrate system info for node, add `sb node info` command.
Fix `torch.distributed` command for single node.
Push Docker images in GitHub Action.
**Description**
Fix function naming issue in system info.

**Major Revision**
- fix function naming issue in system info 
- save to json file
- add timeout for subprocess.run
- revise error handling to print exception message
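
The last three revisions can be sketched together as below. This is a rough sketch under stated assumptions, not the code in `superbench/tools/system_info.py`: the function names `run_command` and `save_info`, the default timeout of 30 seconds, and the JSON layout are all hypothetical. Only `subprocess.run`, its `timeout` and `check` arguments, and `json.dump` are standard-library APIs.

```python
import json
import subprocess
import sys


def run_command(command, timeout=30):
    """Run a command (list of args) and return its stripped stdout.

    Returns None if the command times out, exits non-zero, or cannot be
    started; the exception message is printed instead of being swallowed.
    """
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            timeout=timeout,  # raises subprocess.TimeoutExpired when exceeded
            check=True,       # raises subprocess.CalledProcessError on non-zero exit
        )
        return result.stdout.strip()
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError, OSError) as e:
        print('Command {} failed: {}'.format(command, e), file=sys.stderr)
        return None


def save_info(info, path):
    """Write the collected information to a JSON file."""
    with open(path, 'w') as f:
        json.dump(info, f, indent=2)
```

For example, `save_info({'python': run_command([sys.executable, '--version'])}, 'sys_info.json')` would record the interpreter version, or `null` if the command failed.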
…verlap (#204)

**Description**
fix bug in error message of communication-computation-overlap.

**Major Revision**
- remove non existing variable
Fix bug in build image for push event.

**Major Revision**
- Fix bug in build image for push event when `github.base_ref` is not set.

**Minor Revision**
- Unify `[` and `[[` usage.
…iguration examples (#203)

**Description**
This commit fixes wrong parameters for gpu-sm-copy-bw call in configuration examples.
Update GitHub Action VM, fix pipeline hanging.
**Description**
Update benchmarks in configuration files for single node validation of superbench v0.3.

**Major Revision**
- fix bugs of parameters in nccl-bw for single node validation in configs
- update new benchmarks in amd_mi100_hpe.yaml, amd_mi100_z53.yaml, azure_ndv4.yaml
- fix bug of wrong gpu visible prefix
#210)

**Description**
Update the rccl-test git submodule to dc1ad48, which fixes a division-by-zero bug.

**Major Revision**
- update rccl-test git submodule to dc1ad48
__Description__

Upgrade version and release note. Closes #95 and #170.

__Major Revisions__

* Upgrade package versions
* Add release note for v0.3.0
@abuccts abuccts requested review from TobeyQin and a team as code owners September 24, 2021 07:24
@abuccts abuccts added the bug Something isn't working label Sep 24, 2021
@codecov

codecov bot commented Sep 24, 2021

Codecov Report

Merging #212 (b875c44) into main (37b15db) will increase coverage by 0.19%.
The diff coverage is 76.47%.

@@            Coverage Diff             @@
##             main     #212      +/-   ##
==========================================
+ Coverage   88.52%   88.72%   +0.19%     
==========================================
  Files          57       58       +1     
  Lines        2807     2821      +14     
==========================================
+ Hits         2485     2503      +18     
+ Misses        322      318       -4     
Flag Coverage Δ
cpu-unit-test 74.43% <76.47%> (+0.27%) ⬆️
cuda-unit-test 88.65% <76.47%> (+0.19%) ⬆️

Impacted Files Coverage Δ
...ro_benchmarks/computation_communication_overlap.py 83.01% <0.00%> (ø)
...rbench/benchmarks/model_benchmarks/pytorch_base.py 74.72% <0.00%> (-0.84%) ⬇️
superbench/cli/_node_handler.py 75.00% <75.00%> (ø)
superbench/__init__.py 100.00% <100.00%> (ø)
superbench/cli/_commands.py 100.00% <100.00%> (ø)
superbench/runner/runner.py 85.21% <100.00%> (-0.75%) ⬇️
superbench/tools/system_info.py 100.00% <100.00%> (+100.00%) ⬆️
superbench/tools/__init__.py 100.00% <0.00%> (+100.00%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 37b15db...b875c44.

@cp5555 cp5555 merged commit dfbd70b into main Sep 26, 2021
@cp5555 cp5555 deleted the xiongyf/cherrypick-0.3 branch September 26, 2021 01:30
Labels
bug Something isn't working release

5 participants