
Fix broadcast deadlock for incomplete batches in data sample for data analysis #5117

Conversation

@bm-synth (Contributor) commented Feb 12, 2024

When the batch is not a full batch (`drop_last=False`), the size of the current batch is smaller than expected:

self.global_batch_size = self.micro_batch_times_data_parallel_size * self.gradient_accumulation_steps

The `get_next_global_batch()` method then tries to broadcast a tensor smaller than `self.global_batch_size` from the master rank (`0`), i.e. the master rank sends a shorter tensor than the other ranks expect. This leads to unexpected behaviour (a deadlock, a crash, or a `None` tensor on the receiving ranks). The documentation for the [broadcast](https://pytorch.org/docs/stable/distributed.html#torch.distributed.broadcast) operation states that the "tensor must have the same number of elements in all processes participating in the collective." In the following call, `tensor` can have a different size on the master rank than on the other participating ranks. File `deepspeed/runtime/data_pipeline/data_sampling/data_sampler.py`, line `289`:

dist.broadcast(batch, 0, group=self.data_parallel_group)

This PR fixes that bug by padding incomplete batches with `-1` indices, so that the `batch` tensor always has the same size.

Note: an alternative resolution is to broadcast the size of the batch tensor beforehand, but that adds an extra communication step. The current method of padding the `batch` tensor with `-1`s is memory-safe, as the padded tensor matches the size used in previous iterations with a full batch.
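For illustration, a minimal sketch of this padding approach (a hypothetical helper, not the actual DeepSpeed code; the function name, `device` argument, and `torch.long` dtype are assumptions):

```
import torch
import torch.distributed as dist

def broadcast_global_batch(batch_indices, global_batch_size, data_parallel_group, device):
    # Allocate a fixed-size tensor filled with -1 so every rank broadcasts the
    # same number of elements, regardless of whether the last batch is full.
    batch = torch.full((global_batch_size,), -1, dtype=torch.long, device=device)
    if dist.get_rank() == 0:
        # Only the master rank knows the real (possibly shorter) list of indices.
        batch[:len(batch_indices)] = torch.as_tensor(batch_indices, dtype=torch.long, device=device)
    # Same collective as in data_sampler.py, but now all ranks pass an equally sized tensor.
    dist.broadcast(batch, 0, group=data_parallel_group)
    # Drop the -1 padding to recover the actual batch indices on every rank.
    return batch[batch >= 0]
```

Receiving ranks simply filter out the `-1` entries, which is what makes the padded broadcast safe for the final, shorter batch.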

@bm-synth bm-synth marked this pull request as ready for review February 12, 2024 14:45
@conglongli conglongli added this pull request to the merge queue Feb 15, 2024
Merged via the queue into microsoft:master with commit 2b41110 Feb 15, 2024
12 checks passed
@bm-synth bm-synth deleted the fix_broadcast_deadlock_for_incomplete_batches branch February 15, 2024 14:37
mauryaavinash95 pushed a commit to mauryaavinash95/DeepSpeed that referenced this pull request Feb 17, 2024
… analysis (microsoft#5117)

rraminen pushed a commit to ROCm/DeepSpeed that referenced this pull request May 9, 2024
… analysis (microsoft#5117)
