streaming datasets doesn't work properly with multi-node #6623
Comments
@mariosasko, @lhoestq, @albertvillanova
Hi!
> Considering there's just 1 shard and 2 worker nodes, do you mean each worker node will load the whole dataset but still receive half of that shard while streaming?
Yes, both nodes will stream from the 1 shard, but each node will skip half of the examples. This way, each example is seen once and exactly once during your distributed training. Though in terms of I/O, the dataset is effectively read/streamed twice.
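A minimal sketch of that behavior, simulating both nodes in one process (toy values; assumes `datasets.distributed.split_dataset_by_node` and `Dataset.to_iterable_dataset`):

```python
from datasets import Dataset
from datasets.distributed import split_dataset_by_node

# A single shard holding six toy examples.
ds = Dataset.from_dict({"value": [1, 2, 3, 4, 5, 6]}).to_iterable_dataset(num_shards=1)

# With 1 shard, every node streams the full shard but keeps only
# 1 example out of world_size, so each example is seen exactly once overall.
for rank in range(2):
    node_ds = split_dataset_by_node(ds, rank=rank, world_size=2)
    print(f"rank {rank}:", [ex["value"] for ex in node_ds])

# Expected shape of the output (the exact interleaving is an implementation detail):
# rank 0: [1, 3, 5]
# rank 1: [2, 4, 6]
```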
> What if the number of samples in that shard % num_nodes != 0? Will it break/get stuck? Or is the data repeated in that case for gradient sync?
In that case, at least one of the nodes will get an empty/incomplete batch. The data is not repeated in that case. If the training loop doesn't take this into account, it can lead to unexpected behaviors indeed. In the future we'd like to add a feature that would allow the nodes to ignore the last batch, so that all the nodes only get full batches.
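Until such a feature lands, a hedged sketch of one way a training loop can take this into account (plain PyTorch, not a `datasets` API; `compute_loss` is a hypothetical helper, and the process group is assumed to be initialized):

```python
import torch
import torch.distributed as dist

def train_epoch(dataloader, model, optimizer, device):
    it = iter(dataloader)
    while True:
        try:
            batch = next(it)
            exhausted = torch.tensor(0, device=device)
        except StopIteration:
            batch, exhausted = None, torch.tensor(1, device=device)
        # If any rank ran out of data, all ranks stop together, so no rank
        # blocks forever on a gradient sync that never comes.
        dist.all_reduce(exhausted, op=dist.ReduceOp.MAX)
        if exhausted.item():
            break  # trailing partial batches on the other ranks are dropped
        loss = compute_loss(model, batch)  # hypothetical loss computation
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```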
> Is there any method to modify a dataset's n_shards? Is modifying the number of files OK? Is one file == one shard?
Yep, one file == one shard :)
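For example (hypothetical file names; `n_shards` follows the number of data files passed to `load_dataset` in streaming mode):

```python
from datasets import load_dataset

# Four JSON Lines files -> four shards (hypothetical paths).
data_files = {"train": ["part-0.jsonl", "part-1.jsonl", "part-2.jsonl", "part-3.jsonl"]}
ds = load_dataset("json", data_files=data_files, split="train", streaming=True)
print(ds.n_shards)  # 4 -- so with world_size=2, each node would get 2 whole shards
```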
Feature request
Let's say I have a dataset with 5 samples with values [1, 2, 3, 4, 5], with 2 GPUs (for DDP) and a batch size of 2. This dataset is an `IterableDataset` since I am streaming it. Now I split the dataset using `split_dataset_by_node` to ensure it doesn't get repeated. And since it's already split, I don't have to use `DistributedSampler` (they don't work with iterable datasets anyway). But in this case I noticed the following:
First iteration:
- first GPU will get → [1, 2]
- second GPU will get → [3, 4]

Second iteration:
- first GPU will get → [5]
- second GPU will get → nothing
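A sketch reproducing this scenario in a single process (toy data; the exact values each rank receives depend on how examples are skipped, but the batch-count mismatch is the point):

```python
from datasets import Dataset
from datasets.distributed import split_dataset_by_node
from torch.utils.data import DataLoader

ds = Dataset.from_dict({"value": [1, 2, 3, 4, 5]}).to_iterable_dataset(num_shards=1)

for rank in range(2):  # simulate the 2 GPUs
    node_ds = split_dataset_by_node(ds, rank=rank, world_size=2)
    loader = DataLoader(node_ds, batch_size=2)
    print(f"rank {rank}:", [batch["value"].tolist() for batch in loader])

# One rank ends up with an extra (partial) final batch while the other gets none,
# which is exactly the mismatch that stalls gradient sync under DDP.
```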
This actually creates an issue, since with `DistributedSampler` the samples are repeated internally to ensure none of the GPUs is missing data at any iteration for gradient sync. So my questions are:

1. Can streaming datasets be used with `DistributedSampler`? If yes, how?
2. In the docs for `split_dataset_by_node`, this is mentioned: "If the dataset has a number of shards that is a factor of `world_size` (i.e. if `dataset.n_shards % world_size == 0`), then the shards are evenly assigned across the nodes, which is the most optimized. Otherwise, each node keeps 1 example out of `world_size`, skipping the other examples." Can you explain the last part here? (A sketch contrasting the two regimes follows this list.)
3. If `dataset.n_shards % world_size != 0`, is it possible to shard the streaming dataset on the fly to avoid the case where data is missing?
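For reference, a sketch contrasting the two regimes from the quoted docs, simulated locally (toy data; the precise shard/example assignment per rank is an implementation detail):

```python
from datasets import Dataset
from datasets.distributed import split_dataset_by_node

base = Dataset.from_dict({"value": list(range(8))})

# 4 shards, 2 nodes: n_shards % world_size == 0, so each node is assigned
# 2 whole shards and never reads the other node's shards.
sharded = base.to_iterable_dataset(num_shards=4)
print([ex["value"] for ex in split_dataset_by_node(sharded, rank=0, world_size=2)])

# 3 shards, 2 nodes: n_shards % world_size != 0, so every node streams all
# shards and keeps only 1 example out of world_size, skipping the rest.
uneven = base.to_iterable_dataset(num_shards=3)
print([ex["value"] for ex in split_dataset_by_node(uneven, rank=0, world_size=2)])
```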
Motivation
Streaming datasets should work with DDP: big LLMs need a lot of data, DDP/multi-node is mostly used to train such models, and streaming can actually help solve the data part of it.
Your contribution
Yes, I can help by submitting the PR once we reach a mutual understanding of how it should behave.