Release - SuperBench v0.3.0 #212

Merged 18 commits on Sep 26, 2021

Commits on Sep 24, 2021

  1. Bug - Fix Bug: fix wrong parameter `operations` -> `operation` in rccl-bw of HPE config (#190)
    
    **Description**
    Fix the wrong parameter name `operations` of rccl-bw in the HPE MI100 config.
    
    **Major Revision**
    - `operations` -> `operation`
    Yuting Jiang authored and abuccts committed Sep 24, 2021
    fbfb58f
  2. Bug - Revise 'docker run' in sb deploy (#195)

    **Description**
    
    Revise 'docker run' in `sb deploy` because the base image runs its entrypoint/cmd under /root.
    
    **Major Revision**
    
    - Set the entrypoint to bash in 'docker run'.
    Yuting Jiang authored and abuccts committed Sep 24, 2021
    3553381
  3. Bug: Fix Bug - Add barrier before 'destroy_process_group' in model benchmarks (#198)
    
    **Description**
    Add a barrier before 'destroy_process_group' to fix a failure on ROCm 4.x when running bert_models. When one model benchmark contains multiple models, some processes had not yet finished with the previous process group while others had already failed to initialize the new process group for the next model.
    
    **Major Revision**
    - Add barrier before 'destroy_process_group'.
    Yuting Jiang authored and abuccts committed Sep 24, 2021
    c9fb724
  4. Benchmarks: Build Pipeline - Restore rocblas build logic (#197)

    **Description**
    Restore the rocblas build logic, dropping support for building rocblas in the rocm4.0_ubuntu18.04_py3.6_pytorch_1.7.0 base image.
    
    **Major Revision**
    - Restore the rocblas build logic; remove the GPU target limit and other resource limits for ROCm 4.0.
    Yuting Jiang authored and abuccts committed Sep 24, 2021
    6da800f
  5. Bug: Fix bug - fix bug of hipBusBandwidth build (#193)

    **Description**
    Fix the hipBusBandwidth build.
    
    **Major Revision**
    - Remove the check for 'hip/samples/1_Utils/hipBusBandwidth/CMakeLists.txt', which was never entered when building the Docker image
    - Add `sb_micro_path` for rocm_bandwidthTest
    Yuting Jiang authored and abuccts committed Sep 24, 2021
    2c281ba
  6. CI/CD - Add ROCm image build in GitHub Actions (#194)

    Add ROCm image build in GitHub Actions.
    abuccts committed Sep 24, 2021
    3b9edee
  7. Benchmarks: Code Revision - Revise CMake files for microbenchmarks. (#196)
    
    **Description**
    1. Do `enable_language(CUDA)` before using `CMAKE_CUDA_COMPILER_VERSION`.
    2. Use `cmake --install` to install targets (it runs `cmake -P cmake_install.cmake`) instead of invoking the Makefile, to avoid the error `make: *** No rule to make target 'install'.  Stop.`
    guoshzhao authored and abuccts committed Sep 24, 2021
    2cf1331
  8. CLI - Integrate system info for node (#199)

    Integrate system info for node, add `sb node info` command.
    abuccts committed Sep 24, 2021
    11c0ba3
  9. Bug - Fix torch.distributed command for single node (#201)

    Fix `torch.distributed` command for single node.
    abuccts committed Sep 24, 2021
    e3266da
  10. CI/CD - Push images in GitHub Action (#202)

    Push Docker images in GitHub Action.
    abuccts committed Sep 24, 2021
    2c2cad0
  11. Tool: Fix bug - Fix function naming issue in system info (#200)

    **Description**
    Fix function naming issue in system info.
    
    **Major Revision**
    - fix function naming issue in system info 
    - save to json file
    - add timeout for subprocess.run
    - revise error handling to print exception message
    Yuting Jiang authored and abuccts committed Sep 24, 2021
    c6f76ce
  12. Benchmark: Fix Bug - fix error message of communication-computation-overlap (#204)
    
    **Description**
    Fix a bug in the error message of communication-computation-overlap.
    
    **Major Revision**
    - Remove a reference to a nonexistent variable
    Yuting Jiang authored and abuccts committed Sep 24, 2021
    43da0dd
  13. CI/CD - Fix bug in build image for push event (#205)

    Fix bug in build image for push event.
    
    **Major Revision**
    - Fix bug in build image for push event when `github.base_ref` is not set.
    
    **Minor Revision**
    - Unify `[` and `[[` usage.
    abuccts committed Sep 24, 2021
    b5349ef
  14. Benchmarks: Fix Bug - Fix wrong parameters for gpu-sm-copy-bw in configuration examples (#203)
    
    **Description**
    This commit fixes wrong parameters for gpu-sm-copy-bw call in configuration examples.
    yzygitzh authored and abuccts committed Sep 24, 2021
    42465b0
  15. CI/CD - Update GitHub Action VM (#211)

    Update GitHub Action VM, fix pipeline hanging.
    abuccts committed Sep 24, 2021
    031be6a
  16. Benchmarks: Update - Update benchmarks in configuration file (#208)

    **Description**
    Update benchmarks in configuration files for single-node validation of SuperBench v0.3.
    
    **Major Revision**
    - Fix nccl-bw parameters for single-node validation in configs
    - Add the new benchmarks to amd_mi100_hpe.yaml, amd_mi100_z53.yaml, azure_ndv4.yaml
    - Fix the wrong GPU visible-devices prefix
    Yuting Jiang authored and abuccts committed Sep 24, 2021
    ddb0fd2
  17. Benchmarks: Build Pipeline - Update rccl-test git submodule to dc1ad48 (#210)
    
    **Description**
    Update rccl-test git submodule to dc1ad48, which fixes a division-by-zero bug.
    
    **Major Revision**
    - update rccl-test git submodule to dc1ad48
    Yuting Jiang authored and abuccts committed Sep 24, 2021
    d6cc73a
  18. Docs - Upgrade version and release note (#209)

    __Description__
    
    Upgrade version and release note. Closes #95 and #170.
    
    __Major Revisions__
    
    * Upgrade package versions
    * Add release note for v0.3.0
    abuccts committed Sep 24, 2021
    b875c44
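Commit #198 adds a barrier before `destroy_process_group` so that no rank tears down the process group and moves on to the next model while other ranks are still using it. The sketch below shows that ordering with the real `torch.distributed` calls `is_available`, `is_initialized`, `barrier`, and `destroy_process_group`. The helper name `teardown_process_group` is hypothetical, not SuperBench's code, and the module is passed in as a parameter only so the sketch can be exercised without PyTorch installed.

```python
def teardown_process_group(dist) -> None:
    """Wait for all ranks, then destroy the default process group.

    `dist` is expected to be the `torch.distributed` module (hypothetical
    helper; passed in only so this sketch is testable without PyTorch).
    """
    if dist.is_available() and dist.is_initialized():
        # Block until every rank reaches this point, so no rank destroys the
        # group (and starts initializing the next model's group) too early.
        dist.barrier()
        dist.destroy_process_group()
```

In a real run this would be called as `import torch.distributed as dist; teardown_process_group(dist)` after each model in a multi-model benchmark.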
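Commit #196 switches microbenchmark installation to `cmake --install`, which runs the generated `cmake_install.cmake` script directly rather than depending on an `install` target in a Makefile. As a minimal sketch, assuming CMake 3.15 or newer and hypothetical source, build, and prefix directories, the configure/build/install sequence looks like this:

```python
from typing import List


def cmake_commands(source_dir: str, build_dir: str, prefix: str) -> List[List[str]]:
    """Return configure, build, and install commands for a CMake project.

    Hypothetical helper: `cmake --install` (CMake >= 3.15) executes the
    generated cmake_install.cmake, so no Makefile `install` rule is needed.
    """
    return [
        ["cmake", "-S", source_dir, "-B", build_dir],           # configure
        ["cmake", "--build", build_dir],                         # compile
        ["cmake", "--install", build_dir, "--prefix", prefix],   # install
    ]
```

Each command can then be executed in order, for example with `subprocess.run(cmd, check=True)`.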
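Commit #201 fixes the `torch.distributed` command for single-node runs, but the message does not say exactly what changed. One plausible shape, offered purely as a hypothetical sketch, is to omit the multi-node rendezvous flags on a single node so that `torch.distributed.launch` falls back to its defaults (one node, rank 0, master at 127.0.0.1:29500). The function name and parameters are illustrative assumptions.

```python
from typing import Optional


def torch_distributed_command(script: str, nproc_per_node: int, node_num: int = 1,
                              node_rank: int = 0, master_addr: Optional[str] = None,
                              master_port: int = 29500) -> str:
    """Build a `python3 -m torch.distributed.launch` command line (hypothetical).

    On a single node the multi-node flags are omitted and the launcher's
    defaults apply; on multiple nodes a master address is required.
    """
    parts = ['python3', '-m', 'torch.distributed.launch',
             '--nproc_per_node={}'.format(nproc_per_node)]
    if node_num > 1:
        if master_addr is None:
            raise ValueError('master_addr is required for multi-node runs')
        parts += ['--nnodes={}'.format(node_num),
                  '--node_rank={}'.format(node_rank),
                  '--master_addr={}'.format(master_addr),
                  '--master_port={}'.format(master_port)]
    parts.append(script)
    return ' '.join(parts)
```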
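Commit #200 adds a timeout to `subprocess.run`, saves system info to a JSON file, and prints exception messages instead of hiding them. A minimal sketch of those three ideas using only the standard library follows; `run_command` and `save_json` are hypothetical names, not the functions in SuperBench's system info tool.

```python
import json
import subprocess
import sys
from typing import List, Optional


def run_command(cmd: List[str], timeout: float = 30) -> Optional[str]:
    """Run a command and return its output, or None on error or timeout."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=timeout,  # kill the child and raise TimeoutExpired if it hangs
            check=True,       # raise CalledProcessError on a non-zero exit code
        )
        return result.stdout
    except Exception as e:
        # Print the exception message rather than failing silently.
        print('Failed to run {}: {}'.format(cmd, e), file=sys.stderr)
        return None


def save_json(info: dict, path: str) -> None:
    """Save collected system info as a JSON file."""
    with open(path, 'w') as f:
        json.dump(info, f, indent=2)
```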
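Commit #205 handles push events, where `github.base_ref` is not set: GitHub Actions only populates the base ref (also exposed as the `GITHUB_BASE_REF` environment variable) for pull request events. The actual fix lives in the workflow's shell step; the Python function below is only a hypothetical illustration of the fallback, reading the branch from `GITHUB_REF` (for example `refs/heads/main`) when no base ref is available.

```python
import os
from typing import Mapping, Optional


def resolve_target_branch(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the branch a workflow run targets (hypothetical helper).

    GITHUB_BASE_REF is empty outside pull request events, so fall back to
    the branch name in GITHUB_REF; non-branch refs are returned unchanged.
    """
    env = os.environ if env is None else env
    base_ref = env.get('GITHUB_BASE_REF', '')
    if base_ref:
        return base_ref
    ref = env.get('GITHUB_REF', '')
    prefix = 'refs/heads/'
    return ref[len(prefix):] if ref.startswith(prefix) else ref
```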