[Fix] Distributed use correct local rank #973
Merged
Thanks for your contribution and we appreciate it a lot. The following instructions will make your pull request healthier and help it get feedback more easily. If you do not understand some items, don't worry, just make the pull request and seek help from maintainers.
Motivation
Related to MMCV PR
In dist.utils we set torch.cuda.set_device(rank % num_gpus), while in Runner.wrap_model we set device_ids=[os.environ['LOCAL_RANK']]. These two values, local_rank and rank % device_count, are the same when every node has the same number of devices, but differ otherwise. Besides, _init_dist_slurm can also use SLURM_LOCALID instead of rank % num_gpus. Although the original code won't raise errors (because we exported that value to LOCAL_RANK), it is better to keep the pytorch local_rank consistent with the slurm local rank for some corner use cases.
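To make the mismatch concrete, here is a small worked example; the 3 + 2 GPU split across two nodes is an assumption chosen to match the reproduction command below:

```python
# Assumed layout: node0 runs global ranks 0-2 (3 GPUs), node1 runs ranks 3-4 (2 GPUs).
# On node1, torch.cuda.device_count() == 2, so rank % num_gpus picks a different
# device than the LOCAL_RANK the launcher exports for the same process.
ranks_on_node1 = [3, 4]        # global ranks of the two processes on node1
num_gpus_on_node1 = 2          # torch.cuda.device_count() on node1
for local_rank, rank in enumerate(ranks_on_node1):
    print(rank % num_gpus_on_node1, local_rank)   # prints "1 0" then "0 1": they disagree
```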
Reproduction
Since I can only access a slurm cluster, I simulate torchrun with torchrun_in_slurm.py:
# 2 nodes with 5 GPUs in total, so that they cannot have the same number of GPUs
srun -N 2 -n 5 --gpus-per-task=1 -p my_partition python torchrun_in_slurm.py tools/train.py configs/resnet/resnet18_8xb32_in1k.py --launcher pytorch
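The contents of torchrun_in_slurm.py are not reproduced here; a minimal sketch of the idea (my assumption of what such a wrapper does, not the actual script) is to translate the SLURM environment into the variables a torchrun-launched worker expects and then run the given command:

```python
# torchrun_in_slurm.py (hypothetical sketch): map SLURM env vars to the env vars
# that torchrun / torch.distributed.launch would set, then run the given command.
import os
import subprocess
import sys

os.environ['RANK'] = os.environ['SLURM_PROCID']
os.environ['WORLD_SIZE'] = os.environ['SLURM_NTASKS']
os.environ['LOCAL_RANK'] = os.environ['SLURM_LOCALID']
os.environ.setdefault('MASTER_PORT', '29500')
# In practice MASTER_ADDR should be resolved from SLURM_NODELIST; 127.0.0.1 only
# works on a single node and is just a placeholder here.
os.environ.setdefault('MASTER_ADDR', '127.0.0.1')

# e.g. argv[1:] == ['tools/train.py', 'configs/...', '--launcher', 'pytorch']
sys.exit(subprocess.call([sys.executable] + sys.argv[1:]))
```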
Before this PR:
After this PR, it's all fine.
Below are some experiments on slurm with test_dist_slurm_env.py.
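The script itself is not included here; a minimal stand-in that prints the relevant values for each SLURM task could look like this (an assumption, not the actual test_dist_slurm_env.py):

```python
# Print, for every SLURM task, the values this PR is about, so the mismatch
# between rank % device_count and the SLURM local id can be inspected.
import os
import torch

rank = int(os.environ['SLURM_PROCID'])
local_rank = int(os.environ['SLURM_LOCALID'])
num_gpus = torch.cuda.device_count()
print(f'rank={rank} local_rank={local_rank} '
      f'device_count={num_gpus} rank%device_count={rank % num_gpus}')
```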
Obviously, rank % device_count != local_rank, although it won't cause an error in MMEngine.
Modification
Use os.environ['LOCAL_RANK'] in the pytorch dist init.
Use os.environ['SLURM_LOCALID'] in the slurm dist init.
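In spirit the change amounts to the following (a simplified sketch, not the exact diff in mmengine/dist/utils.py):

```python
import os
import torch

# Before (both launchers): the device was picked from the global rank.
#   torch.cuda.set_device(rank % torch.cuda.device_count())

# After, pytorch launcher: torchrun / torch.distributed.launch already export LOCAL_RANK.
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)

# After, slurm launcher: srun exports SLURM_LOCALID for each task.
local_rank = int(os.environ['SLURM_LOCALID'])
torch.cuda.set_device(local_rank)
```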
BC-breaking (Optional)
Theoretically not. LOCAL_RANK has been set by torch.distributed.launch since v1.5.0 (and even older versions), and also by torchrun. SLURM_LOCALID is documented here.
Use cases (Optional)
Not changed.
Checklist