Fix broadcast deadlock for incomplete batches in data sample for data analysis #5117
Merged
conglongli merged 17 commits into microsoft:master from bm-synth:fix_broadcast_deadlock_for_incomplete_batches on Feb 15, 2024
Conversation
mrwyattii approved these changes on Feb 14, 2024
conglongli approved these changes on Feb 15, 2024
mauryaavinash95 pushed a commit to mauryaavinash95/DeepSpeed that referenced this pull request on Feb 17, 2024:
… analysis (microsoft#5117)

When the batch is not a full batch (`drop_last=False`), the size of the current batch is smaller than expected:

```
self.global_batch_size = self.micro_batch_times_data_parallel_size * self.gradient_accumulation_steps
```

The `get_next_global_batch()` method will try to broadcast a tensor of a size smaller than `self.global_batch_size` from the master rank (`0`). In this case, the master rank sends a shorter tensor, which leads to unexpected behaviour (a deadlock, a crash, or a `None` tensor on the receiving ranks). The documentation for the [broadcast](https://pytorch.org/docs/stable/distributed.html#torch.distributed.broadcast) operation says "tensor must have the same number of elements in all processes participating in the collective." In the following call, `tensor` can have different sizes on the master rank and the other participating ranks (file `deepspeed/runtime/data_pipeline/data_sampling/data_sampler.py`, line `289`):

```
dist.broadcast(batch, 0, group=self.data_parallel_group)
```

This PR fixes that bug by filling incomplete batch indices with `-1` so that the batch tensor is always of the same size. Note: an alternative resolution is to broadcast the size of the batch tensor beforehand, but that adds an extra communication step. The current method of extending the `batch` tensor with `-1`s is memory-safe, as the batch tensor matches the one used in previous iterations with a full batch.
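A minimal sketch of the padding approach described above (illustrative only, not the exact DeepSpeed implementation; the function name `broadcast_global_batch`, its parameters, and the group-rank check are assumptions):

```python
import torch
import torch.distributed as dist


def broadcast_global_batch(local_indices, global_batch_size, data_parallel_group, device):
    # Allocate a tensor of the full expected size on every rank and pre-fill
    # it with -1, so all ranks pass identically sized tensors to broadcast.
    batch = torch.full((global_batch_size,), -1, dtype=torch.long, device=device)

    if dist.get_rank(group=data_parallel_group) == 0:
        # The master rank copies in however many indices the (possibly
        # incomplete) batch actually contains; the tail stays at -1.
        batch[: len(local_indices)] = torch.as_tensor(
            local_indices, dtype=torch.long, device=device
        )

    # Every participating process now broadcasts/receives a tensor with the
    # same number of elements, as torch.distributed.broadcast requires.
    dist.broadcast(batch, 0, group=data_parallel_group)

    # Strip the -1 padding to recover the real (possibly shorter) batch.
    return batch[batch >= 0]
```

In this sketch the padded tensor is the same size as in previous full-batch iterations, so no extra memory is needed and no additional communication round is required to exchange the batch size first.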
rraminen pushed a commit to ROCm/DeepSpeed that referenced this pull request on May 9, 2024, with the same commit message as above.