
Enhance timeout cleanup to avoid possible hanging #405

Merged: abuccts merged 4 commits into release/0.6 from xiongyf/update-timeout on Sep 2, 2022


@abuccts abuccts commented Sep 1, 2022

Enhance timeout cleanup to avoid possible hanging.

Major Revisions

  • Skip postprocess (mainly the torch.distributed barrier and destroy calls) when an exception occurs (e.g., a timeout or a GPU crash) so that subprocesses do not hang.
  • Add cleanup to kill sb exec processes when the Ansible run fails for a benchmark.
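
The first revision can be illustrated with a minimal sketch. It is not the code from this PR: the class name `SketchBenchmark`, its `fail` flag, and the `calls` list are hypothetical and exist only to show the control flow. The idea is that when the benchmark body raises (for example on a timeout), the postprocess step is skipped, because in a distributed run that step would wait on peers that may never arrive.

```python
class SketchBenchmark:
    """Hypothetical benchmark used only to illustrate the control flow."""

    def __init__(self, fail=False):
        self.fail = fail    # assumption: simulates a timeout inside the benchmark
        self.calls = []     # records which stages actually ran

    def _preprocess(self):
        self.calls.append('preprocess')
        return True

    def _benchmark(self):
        self.calls.append('benchmark')
        if self.fail:
            raise TimeoutError('simulated timeout')
        return True

    def _postprocess(self):
        # In a distributed run, collective calls such as a barrier would live
        # here; they can block forever if another rank has already died.
        self.calls.append('postprocess')
        return True

    def run(self):
        try:
            if not self._preprocess():
                return False
            succeeded = self._benchmark()
        except Exception:
            # An exception occurred: return without calling _postprocess().
            return False
        # Normal path: postprocess runs only when no exception was raised.
        postprocessed = self._postprocess()
        return succeeded and postprocessed


if __name__ == '__main__':
    ok_run = SketchBenchmark(fail=False)
    print(ok_run.run(), ok_run.calls)
    bad_run = SketchBenchmark(fail=True)
    print(bad_run.run(), bad_run.calls)
```

The trade-off in this sketch is that a failed run leaves its resources to be reclaimed from outside the process, which is why a separate cleanup step is useful.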

Minor Revisions

  • Reduce the extra Ansible timeout from 300s to 60s.
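
The general pattern of a timeout followed by cleanup can be sketched with the Python standard library alone. This is an assumption-laden illustration, not how SuperBench or Ansible implements it: the function name `run_with_cleanup` is hypothetical, the sketch is POSIX-only because it relies on process groups, and it runs a local command rather than a remote one.

```python
import os
import signal
import subprocess
import sys


def run_with_cleanup(cmd, timeout):
    """Run cmd and return its exit code, or None if it timed out.

    On timeout the whole process group is killed so that no child is left
    hanging. Hypothetical helper, POSIX-only.
    """
    # start_new_session=True puts the child in its own session and process
    # group, so the group id equals the child's pid.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        try:
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the process exited between the timeout and the kill
        proc.wait()  # reap the child so it does not remain a zombie
        return None


if __name__ == '__main__':
    sleeper = [sys.executable, '-c', 'import time; time.sleep(30)']
    print(run_with_cleanup(sleeper, timeout=1))
```

Killing the process group rather than the single child matters when the command spawns its own workers, which is the situation the cleanup in this PR is meant to address.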

Skip postprocess (mainly torch.dist.barrier and destroy) when exception happens (e.g., timeout, GPU crashed) to avoid subprocesses hanging.
@abuccts abuccts added the benchmarks SuperBench Benchmarks label Sep 1, 2022

codecov bot commented Sep 1, 2022

Codecov Report

Merging #405 (72c4d17) into release/0.6 (db84289) will increase coverage by 0.00%.
The diff coverage is 87.50%.

@@             Coverage Diff              @@
##           release/0.6     #405   +/-   ##
============================================
  Coverage        88.77%   88.78%           
============================================
  Files               83       83           
  Lines             5265     5269    +4     
============================================
+ Hits              4674     4678    +4     
  Misses             591      591           
Flag                       Coverage Δ
cpu-python3.6-unit-test    75.24% <87.50%> (+0.01%) ⬆️
cpu-python3.7-unit-test    75.24% <87.50%> (+0.01%) ⬆️
cuda-unit-test             88.70% <87.50%> (+<0.01%) ⬆️


Impacted Files                   Coverage Δ
superbench/runner/runner.py      91.22% <85.71%> (+0.21%) ⬆️
superbench/benchmarks/base.py    94.44% <100.00%> (ø)


Add cleanup when Ansible run failed for certain benchmark.
@abuccts abuccts changed the title from "Skip postprocess after exception to avoid hanging" to "Enhance timeout cleanup to avoid possible hanging" Sep 1, 2022
@abuccts abuccts marked this pull request as ready for review September 1, 2022 11:56
@abuccts abuccts requested a review from a team as a code owner September 1, 2022 11:56
@abuccts abuccts added the runner SuperBench Runner label Sep 1, 2022
@abuccts abuccts enabled auto-merge (squash) September 2, 2022 10:03
@abuccts abuccts merged commit 8afaa37 into release/0.6 Sep 2, 2022
@abuccts abuccts deleted the xiongyf/update-timeout branch September 2, 2022 10:35
@yukirora yukirora mentioned this pull request Sep 5, 2022
27 tasks
abuccts added a commit that referenced this pull request Sep 6, 2022
Enhance timeout cleanup to avoid possible hanging.

__Major Revisions__
* Skip postprocess (mainly torch.dist.barrier and destroy) when exception happens (e.g., timeout, GPU crashed) to avoid subprocesses hanging.
* Add cleanup to kill sb exec processes when Ansible run failed for certain benchmark.

__Minor Revisions__
* Update extra Ansible timeout from 300s to 60s.
abuccts added a commit that referenced this pull request Sep 6, 2022
**Description**

Cherry-pick bug fixes from v0.6.0 to main.

**Major Revisions**

* Enable latency test in ib traffic validation distributed benchmark (#396)
* Enhance parameter parsing to allow spaces in value (#397)
* Update apt packages in dockerfile (#398)
* Upgrade colorlog for NO_COLOR support (#404)
* Analyzer - Update error handling to support exit code of sb result diagnosis (#403)
* Analyzer - Make baseline file optional in data diagnosis and fix bugs (#399)
* Enhance timeout cleanup to avoid possible hanging (#405)
* Auto generate ibstat file by pssh (#402)
* Analyzer - Format int type and unify empty value to N/A in diagnosis output file (#406)
* Docs - Upgrade version and release note (#407)
* Docs - Fix issues in document (#408)

Co-authored-by: Yang Wang <yangwang1@microsoft.com>
Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>
Labels
benchmarks SuperBench Benchmarks runner SuperBench Runner
3 participants