Benchmarks: model benchmarks - change torch.distributed.launch to torchrun #556

pnunna93 · 2023-07-24T22:24:36Z

This PR makes the following changes:

Replaces torch.distributed.launch with torchrun. torch.distributed.launch is deprecated in recent PyTorch releases, and torchrun is the recommended replacement: https://pytorch.org/docs/stable/elastic/run.html
Updates the AMD GPU detection logic. The previous logic emitted a warning when a container exposed only renderD nodes in /dev/dri; this change resolves those warnings.
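
To illustrate the second change, here is a minimal sketch of a detection check that accepts containers exposing only renderD nodes. The function name `has_dri_device_nodes` and the `card*` / `renderD*` glob patterns are assumptions made for illustration; this is not the code in superbench/common/devices/gpu.py.

```python
import tempfile
from pathlib import Path


def has_dri_device_nodes(dri_dir="/dev/dri"):
    """Return True if dri_dir holds at least one card* or renderD* entry.

    Hypothetical helper: accepting renderD* entries means a container
    that exposes only render nodes no longer looks like it has no GPU.
    """
    dri = Path(dri_dir)
    if not dri.is_dir():
        return False
    return any(dri.glob("card*")) or any(dri.glob("renderD*"))


# Demo: a temporary directory stands in for /dev/dri inside a container
# that exposes a render node but no card node.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "renderD128").touch()
    print(has_dri_device_nodes(tmp))
```

Passing the directory as a parameter keeps the check testable on machines without a GPU.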

pnunna93 · 2023-07-24T22:27:51Z

@microsoft-github-policy-service agree company="AMD"

abuccts

Thanks for the PR. Please also replace every occurrence of `python3 -m torch.distributed.launch --use_env` with `torchrun` in tests/ so that the unit tests pass.
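
The requested replacement can be sketched as a small string rewrite. This is an illustration only, not the PR's actual diff: the helper name `to_torchrun` and the example script name `train.py` are hypothetical. The `--use_env` flag is dropped because torchrun always passes the local rank through the `LOCAL_RANK` environment variable.

```python
import shlex

# Legacy launcher invocation that the tests are expected to stop using.
LEGACY_PREFIX = ["python3", "-m", "torch.distributed.launch"]


def to_torchrun(command: str) -> str:
    """Rewrite a legacy launcher command to torchrun.

    Commands that do not start with the legacy prefix are returned unchanged.
    """
    args = shlex.split(command)
    if args[:3] != LEGACY_PREFIX:
        return command
    # torchrun behaves as if --use_env were set, so the flag is removed.
    rest = [arg for arg in args[3:] if arg != "--use_env"]
    return shlex.join(["torchrun"] + rest)


# train.py is a placeholder script name.
print(to_torchrun(
    "python3 -m torch.distributed.launch --use_env --nproc_per_node=8 train.py"
))
```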

abuccts · 2023-07-25T04:36:21Z

/azp run

azure-pipelines · 2023-07-25T04:36:35Z

Azure Pipelines successfully started running 3 pipeline(s).

codecov · 2023-07-25T14:28:10Z

Codecov Report

Merging #556 (248600a) into main (e1df877) will decrease coverage by 0.64%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##             main     #556      +/-   ##
==========================================
- Coverage   86.96%   86.32%   -0.64%     
==========================================
  Files          93       93              
  Lines        6268     6268              
==========================================
- Hits         5451     5411      -40     
- Misses        817      857      +40

Flag	Coverage Δ
cpu-python3.6-unit-test	`71.86% <0.00%> (ø)`
cpu-python3.7-unit-test	`71.86% <0.00%> (ø)`
cpu-python3.8-unit-test	`72.34% <0.00%> (ø)`
cuda-unit-test	`84.91% <0.00%> (ø)`

Flags with carried forward coverage won't be shown.

Files Changed	Coverage Δ
superbench/common/devices/gpu.py	`82.60% <0.00%> (ø)`
superbench/runner/runner.py	`82.46% <ø> (ø)`

... and 1 file with indirect coverage changes

cp5555 · 2023-07-25T23:11:09Z

Thanks for your PR. We will test it soon to check whether it will impact the final performance or not. If not, we will merge it.

yukirora

Regarding functionality: this works for both single-node and multi-node runs on MI200.
Regarding performance: no significant difference was observed on a single A100 node.

Model	Precision	Previous Throughput	New Throughput
bert/pytorch-bert-base	fp32	380.13	379.61
bert/pytorch-bert-base	fp16	614.74	614.43
bert/pytorch-bert-large	fp32	130.85	130.77
bert/pytorch-bert-large	fp16	224.03	223.17
densenet/pytorch-densenet169	fp32	268.70	264.50
densenet/pytorch-densenet169	fp16	274.66	266.47
densenet/pytorch-densenet201	fp32	219.88	219.15
densenet/pytorch-densenet201	fp16	219.61	218.44
gpt/pytorch-gpt2-small	fp32	179.65	179.04
gpt/pytorch-gpt2-small	fp16	188.58	189.43
gpt/pytorch-gpt2-large	fp32	35.37	35.48
gpt/pytorch-gpt2-large	fp16	59.36	59.25
lstm/pytorch-lstm	fp32	4975.33	5026.24
lstm/pytorch-lstm	fp16	7895.35	7981.03
resnet/pytorch-resnet50	fp32	945.86	945.61
resnet/pytorch-resnet50	fp16	1273.37	1317.63
resnet/pytorch-resnet101	fp32	607.28	611.11
resnet/pytorch-resnet101	fp16	887.07	913.76
resnet/pytorch-resnet152	fp32	436.23	435.34
resnet/pytorch-resnet152	fp16	652.38	660.70
vgg/pytorch-vgg11	fp32	760.03	757.03
vgg/pytorch-vgg11	fp16	1130.59	1139.74
vgg/pytorch-vgg13	fp32	554.00	552.60
vgg/pytorch-vgg13	fp16	858.53	885.61
vgg/pytorch-vgg16	fp32	482.54	481.02
vgg/pytorch-vgg16	fp16	777.03	785.29
vgg/pytorch-vgg19	fp32	422.60	422.29
vgg/pytorch-vgg19	fp16	693.83	696.07
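
To read the table as relative differences, the percent change per row can be computed as (new - previous) / previous. The sketch below uses four rows copied from the table; the row selection and the helper name `percent_change` are illustrative choices, not part of the original comment.

```python
# (model, precision, previous throughput, new throughput), copied from the table above.
rows = [
    ("bert/pytorch-bert-base", "fp32", 380.13, 379.61),
    ("densenet/pytorch-densenet169", "fp16", 274.66, 266.47),
    ("resnet/pytorch-resnet50", "fp16", 1273.37, 1317.63),
    ("vgg/pytorch-vgg13", "fp16", 858.53, 885.61),
]


def percent_change(previous: float, new: float) -> float:
    """Relative change from previous to new, in percent."""
    return (new - previous) / previous * 100.0


for model, precision, previous, new in rows:
    print(f"{model} {precision}: {percent_change(previous, new):+.2f}%")
```

For the rows shown here, the change stays within roughly 3.5% in either direction.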

yukirora · 2023-08-04T08:35:01Z

/azp run

azure-pipelines · 2023-08-04T08:35:15Z

Azure Pipelines successfully started running 3 pipeline(s).

pnunna93 added 2 commits July 19, 2023 17:52

Change torch.distributed.launch to torchrun

cc76a56

Fix issue with GPU device detection

8438718

pnunna93 requested a review from a team as a code owner July 24, 2023 22:24

abuccts approved these changes Jul 25, 2023

Change torch.distributed.launch to torchrun in test scripts

12a9bab

cp5555 self-requested a review July 25, 2023 23:11

cp5555 approved these changes Jul 25, 2023

cp5555 assigned yukirora Jul 25, 2023

cp5555 mentioned this pull request Jul 27, 2023

V0.10.0 Release Plan #559

Closed

30 tasks

yukirora approved these changes Aug 4, 2023

Merge branch 'main' into torchrun

248600a

yukirora changed the title ~~Change torch.distributed.launch to torchrun~~ Benchmarks: model benchmarks - change torch.distributed.launch to torchrun Aug 4, 2023

yukirora merged commit 67f2aa7 into microsoft:main Aug 8, 2023
22 of 23 checks passed

pnunna93 deleted the torchrun branch August 8, 2023 15:37

yukirora mentioned this pull request Dec 6, 2023

V0.10.0 Test Plan #585

Closed

29 tasks

cp5555 added the benchmarks and model-benchmarks labels Dec 6, 2023
