Distributed support #5369
Conversation
The documentation is not available anymore as the PR was closed or merged.
Force-pushed from be79dda to 8d73fef.
Alright, all the tests are passing - this is ready for review.
One nit.
Updated benchmarks (PyArrow==6.0.0): benchmark_array_xd.json, benchmark_getitem_100B.json, benchmark_indices_mapping.json, benchmark_iterating.json, benchmark_map_filter.json
just added a note :)
Thanks, looks all good now!
Hi @lhoestq, let's assume I have 127 parquet files and world_size is 4. I was not able to fully comprehend the above statement.
If you have 128 parquet files, then `dataset.n_shards % world_size == 0` and each node is assigned 128 / 4 = 32 of the parquet files. On the other hand, if you have 127 parquet files, the shards can't be evenly distributed across the 4 nodes. Therefore in this case, all the workers take care of the 127 parquet files, but workers will skip examples to not end up with duplicates. That's what "each node keeps 1 example out of world_size, skipping the other examples" means, and in your case it implies: rank 0 keeps examples 0, 4, 8, ...; rank 1 keeps examples 1, 5, 9, ...; rank 2 keeps examples 2, 6, 10, ...; and rank 3 keeps examples 3, 7, 11, ...
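A toy sketch of this behavior (the 8-row dataset and its shard count are hypothetical, and this assumes a `datasets` version that has `Dataset.to_iterable_dataset`):

```python
from datasets import Dataset
from datasets.distributed import split_dataset_by_node

# Toy iterable dataset with 2 shards: since 2 % 4 != 0, the shards can't
# be evenly assigned, so each rank iterates over all examples and keeps
# 1 example out of world_size.
ds = Dataset.from_dict({"x": list(range(8))}).to_iterable_dataset(num_shards=2)

world_size = 4
for rank in range(world_size):
    node_ds = split_dataset_by_node(ds, rank=rank, world_size=world_size)
    print(rank, [example["x"] for example in node_ds])
# Expected: rank 0 keeps x = [0, 4], rank 1 -> [1, 5],
#           rank 2 -> [2, 6], rank 3 -> [3, 7]
```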
Thanks a lot @lhoestq, this helps!
Hi, in the case above, if we use
Also they are perfectly sharded using
Hi, please correct me if I'm mistaken about anything:
Generally, the dilemma I'm facing is: is it possible to load 120GB once into 4 * A100 (which together have around 4*120GB of memory) and make each process read from this shared data in memory? Theoretically, maybe it should be faster?
Feel free to ask your questions on the forum if you don't mind - this way the discussions may be useful to other people ;)
To split your dataset across your training nodes, you can use the new [`datasets.distributed.split_dataset_by_node`]:

This works for both map-style datasets and iterable datasets.
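A minimal usage sketch (assuming `rank` and `world_size` come from your launcher, e.g. the `RANK`/`WORLD_SIZE` environment variables set by `torchrun`; the dataset name is only illustrative):

```python
import os

from datasets import load_dataset
from datasets.distributed import split_dataset_by_node

rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])

# Works for both map-style and iterable (streaming) datasets
ds = load_dataset("c4", "en", split="train", streaming=True)
ds = split_dataset_by_node(ds, rank=rank, world_size=world_size)
```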
The dataset is split for the node at rank `rank` in a pool of nodes of size `world_size`.

For map-style datasets:
- Each node is assigned a chunk of data, e.g. rank 0 is given the first chunk of the dataset (see the sketch below).
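A toy illustration of the map-style case (the 8-row dataset is hypothetical):

```python
from datasets import Dataset
from datasets.distributed import split_dataset_by_node

ds = Dataset.from_dict({"x": list(range(8))})  # 8 rows

# Rank 0 of 4 nodes is given the first contiguous chunk
part = split_dataset_by_node(ds, rank=0, world_size=4)
print(part["x"])  # expected: [0, 1]
```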
For iterable datasets:
- If the dataset has a number of shards that is a multiple of `world_size` (i.e. if `dataset.n_shards % world_size == 0`), then the shards are evenly assigned across the nodes, which is the most optimized.
- Otherwise, each node keeps 1 example out of `world_size`, skipping the other examples.

This can also be combined with a `torch.utils.data.DataLoader` if you want each node to use multiple workers to load the data.

This also supports shuffling: at each epoch, the iterable dataset shards are reshuffled across all the nodes - you just have to call `iterable_ds.set_epoch(epoch_number)`.
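A sketch of the training-loop side, assuming `ds` is the iterable dataset returned by `split_dataset_by_node` above (batch size, worker count, and epoch count are illustrative):

```python
from torch.utils.data import DataLoader

dataloader = DataLoader(ds, batch_size=32, num_workers=4)

for epoch in range(3):
    ds.set_epoch(epoch)  # reshuffle the dataset shards across all nodes
    for batch in dataloader:
        ...  # training step
```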
TODO:
Related to huggingface/transformers#20770
Close #5360