[Fix] Distributed use correct local rank #973

Merged
merged 3 commits into open-mmlab:main on Apr 24, 2023

Conversation

@C1rN09 (Collaborator) commented Feb 28, 2023

Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and help it get feedback more easily. If you do not understand some items, don't worry, just open the pull request and ask the maintainers for help.

Motivation

Related to MMCV PR

In dist.utils we call torch.cuda.set_device(rank % num_gpus), while in Runner.wrap_model we set device_ids=[os.environ['LOCAL_RANK']]. These two values, local_rank and rank % num_gpus, are the same when every node has the same number of devices, but differ otherwise.
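As a concrete illustration (not from the PR itself): with 2 nodes holding 3 and 2 GPUs, the global ranks on the second node no longer satisfy rank % num_gpus == local_rank:

# Hypothetical example: node 0 has 3 GPUs (ranks 0-2), node 1 has 2 GPUs (ranks 3-4).
for rank, (local_rank, num_gpus) in enumerate([(0, 3), (1, 3), (2, 3), (0, 2), (1, 2)]):
    print(f'rank={rank}  local_rank={local_rank}  rank % num_gpus={rank % num_gpus}')
# rank=3: local_rank=0 but rank % num_gpus=1 -> set_device and device_ids disagree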

Besides, _init_dist_slurm can also use SLURM_LOCALID instead of rank % num_gpus. Although the original code does not raise errors (because it exports rank % num_gpus to LOCAL_RANK), it is better to keep the PyTorch local_rank consistent with the Slurm local rank for some corner cases.

Reproduction

Since I only have access to a Slurm cluster, I simulate torchrun with torchrun_in_slurm.py:

# 2 nodes and 5 GPUs in total, so the nodes cannot have the same number of GPUs
srun -N 2 -n 5 --gpus-per-task=1 -p my_partition python torchrun_in_slurm.py tools/train.py configs/resnet/resnet18_8xb32_in1k.py --launcher pytorch
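The wrapper script itself is not included in this thread; below is a minimal sketch of what such a wrapper might do, assuming it only maps Slurm environment variables to the ones torch.distributed expects and then runs the given command:

# Hypothetical sketch of torchrun_in_slurm.py (the actual script is not shown in this PR).
import os
import subprocess
import sys

os.environ.setdefault('RANK', os.environ['SLURM_PROCID'])
os.environ.setdefault('LOCAL_RANK', os.environ['SLURM_LOCALID'])
os.environ.setdefault('WORLD_SIZE', os.environ['SLURM_NTASKS'])
os.environ.setdefault('MASTER_PORT', '29500')
# use the first node in the allocation as the rendezvous address
node_list = os.environ['SLURM_NODELIST']
os.environ.setdefault(
    'MASTER_ADDR',
    subprocess.getoutput(f'scontrol show hostname {node_list} | head -n1'))

# forward the remaining arguments, e.g. tools/train.py <config> --launcher pytorch
subprocess.run([sys.executable] + sys.argv[1:], check=True)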

Before this PR:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper__cudnn_convolution)

After this PR, everything runs fine.

Below are some experiments on Slurm with test_dist_slurm_env.py:

srun -p my_partition -c 1 -n 5 -N 2 --gpus-per-task=1 python test_dist_slurm_env.py

[Screenshot: per-process output of test_dist_slurm_env.py]

Clearly rank % num_gpus != local_rank here, although this does not cause errors in MMEngine.
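For reference, a minimal sketch of what such a probe script could look like (the real test_dist_slurm_env.py is not shown here):

# Hypothetical sketch: print the values compared in the screenshot above.
import os
import torch

rank = int(os.environ['SLURM_PROCID'])
local_rank = int(os.environ['SLURM_LOCALID'])
num_gpus = torch.cuda.device_count()
print(f'rank={rank}  SLURM_LOCALID={local_rank}  '
      f'rank % num_gpus={rank % num_gpus}  device_count={num_gpus}')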

Modification

Use os.environ['LOCAL_RANK'] in the PyTorch dist init.

Use os.environ['SLURM_LOCALID'] in the Slurm dist init.
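A rough sketch of what the change amounts to (not the literal diff; the function names follow MMEngine's dist utils, everything else is simplified):

import os
import torch
import torch.distributed as torch_dist

def _init_dist_pytorch(backend, **kwargs):
    # before: torch.cuda.set_device(rank % torch.cuda.device_count())
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)
    torch_dist.init_process_group(backend=backend, **kwargs)

def _init_dist_slurm(backend, port=None):
    # before: torch.cuda.set_device(int(os.environ['SLURM_PROCID']) % torch.cuda.device_count())
    local_rank = int(os.environ['SLURM_LOCALID'])
    torch.cuda.set_device(local_rank)
    # ... export MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE from Slurm variables ...
    torch_dist.init_process_group(backend=backend)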

BC-breaking (Optional)

Theoretically not.

LOCAL_RANK has been set by torch.distributed.launch since v1.5.0 (and even earlier versions), and is also set by torchrun.

SLURM_LOCALID is documented here

Use cases (Optional)

Not changed.

Checklist

  1. Pre-commit or other linting tools are used to fix the potential lint issues.
  2. The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
  3. If the modification has potential influence on downstream projects, this PR should be tested with downstream projects, like MMDet or MMCls.
  4. The documentation has been modified accordingly, like docstring or example tutorials.

@C1rN09 C1rN09 marked this pull request as ready for review February 28, 2023 14:49
codecov bot commented Mar 1, 2023

Codecov Report

❗ No coverage uploaded for pull request base (main@a3e5e03).
Patch has no changes to coverable lines.

❗ Current head 09b696a differs from pull request most recent head 70d6b6f. Consider uploading reports for commit 70d6b6f to get more accurate results.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #973   +/-   ##
=======================================
  Coverage        ?   76.55%           
=======================================
  Files           ?      138           
  Lines           ?    10843           
  Branches        ?     2167           
=======================================
  Hits            ?     8301           
  Misses          ?     2185           
  Partials        ?      357           
Flag        Coverage Δ
unittests   76.55% <0.00%> (?)

Flags with carried forward coverage won't be shown.


@zhouzaida zhouzaida added this to the 0.7.0 milestone Mar 13, 2023
@zhouzaida zhouzaida modified the milestones: 0.7.0, 0.7.3 Apr 11, 2023
@zhouzaida zhouzaida merged commit 580c9d4 into open-mmlab:main Apr 24, 2023