[Fix] Distributed use correct local rank #973
Merged
Thanks for your contribution and we appreciate it a lot. The following instructions will make your pull request healthier and help it get feedback more easily. If you do not understand some items, don't worry, just make the pull request and seek help from maintainers.
Motivation
Related to MMCV PR
In dist.utils we set torch.cuda.set_device(rank % num_gpus), while in Runner.wrap_model we set device_ids=[os.environ['LOCAL_RANK']]. These two values, local_rank and rank % device_count, are the same when every node has the same number of devices, but differ otherwise. Besides, _init_dist_slurm can also use SLURM_LOCALID instead of rank % num_gpus. Although the original code won't raise errors (because we exported that value to LOCAL_RANK), it is better to keep the pytorch local_rank consistent with the slurm local rank for some corner use cases.
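To make the mismatch concrete, here is a small worked example; the 3 + 2 GPU split across two nodes is an assumption chosen to match the reproduction command below:

```python
# Assumed layout: node0 runs global ranks 0-2 (3 GPUs), node1 runs ranks 3-4 (2 GPUs).
# On node1, torch.cuda.device_count() == 2, so rank % num_gpus picks a different
# device than the LOCAL_RANK the launcher exports for the same process.
ranks_on_node1 = [3, 4]        # global ranks of the two processes on node1
num_gpus_on_node1 = 2          # torch.cuda.device_count() on node1
for local_rank, rank in enumerate(ranks_on_node1):
    print(rank % num_gpus_on_node1, local_rank)   # prints "1 0" then "0 1": they disagree
```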
Reproduction
Since I can only access a slurm cluster, I simulate torchrun with torchrun_in_slurm.py:
# 2 nodes with 5 GPUs in total, so that they cannot have the same number of GPUs
srun -N 2 -n 5 --gpus-per-task=1 -p my_partition python torchrun_in_slurm.py tools/train.py configs/resnet/resnet18_8xb32_in1k.py --launcher pytorch
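The contents of torchrun_in_slurm.py are not reproduced here; a minimal sketch of the idea (my assumption of what such a wrapper does, not the actual script) is to translate the SLURM environment into the variables a torchrun-launched worker expects and then run the given command:

```python
# torchrun_in_slurm.py (hypothetical sketch): map SLURM env vars to the env vars
# that torchrun / torch.distributed.launch would set, then run the given command.
import os
import subprocess
import sys

os.environ['RANK'] = os.environ['SLURM_PROCID']
os.environ['WORLD_SIZE'] = os.environ['SLURM_NTASKS']
os.environ['LOCAL_RANK'] = os.environ['SLURM_LOCALID']
os.environ.setdefault('MASTER_PORT', '29500')
# In practice MASTER_ADDR should be resolved from SLURM_NODELIST; 127.0.0.1 only
# works on a single node and is just a placeholder here.
os.environ.setdefault('MASTER_ADDR', '127.0.0.1')

# e.g. argv[1:] == ['tools/train.py', 'configs/...', '--launcher', 'pytorch']
sys.exit(subprocess.call([sys.executable] + sys.argv[1:]))
```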
Before this PR:
After this PR, it's all fine.
Below are some experiments on slurm with test_dist_slurm_env.py.
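The script itself is not included here; a minimal stand-in that prints the relevant values for each SLURM task could look like this (an assumption, not the actual test_dist_slurm_env.py):

```python
# Print, for every SLURM task, the values this PR is about, so the mismatch
# between rank % device_count and the SLURM local id can be inspected.
import os
import torch

rank = int(os.environ['SLURM_PROCID'])
local_rank = int(os.environ['SLURM_LOCALID'])
num_gpus = torch.cuda.device_count()
print(f'rank={rank} local_rank={local_rank} '
      f'device_count={num_gpus} rank%device_count={rank % num_gpus}')
```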
Obviously, rank % device_count != local_rank, although it won't cause an error in MMEngine.
Modification
Use os.environ['LOCAL_RANK'] in the pytorch dist init.
Use os.environ['SLURM_LOCALID'] in the slurm dist init.
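In spirit the change amounts to the following (a simplified sketch, not the exact diff in mmengine/dist/utils.py):

```python
import os
import torch

# Before (both launchers): the device was picked from the global rank.
#   torch.cuda.set_device(rank % torch.cuda.device_count())

# After, pytorch launcher: torchrun / torch.distributed.launch already export LOCAL_RANK.
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)

# After, slurm launcher: srun exports SLURM_LOCALID for each task.
local_rank = int(os.environ['SLURM_LOCALID'])
torch.cuda.set_device(local_rank)
```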
BC-breaking (Optional)
Theoretically not. LOCAL_RANK has been set by torch.distributed.launch since v1.5.0 (and even older versions), and also by torchrun. SLURM_LOCALID is documented here.
Use cases (Optional)
Not changed.
Checklist