New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Map-style Dataset to IterableDataset #5410
Conversation
@@ -493,7 +493,7 @@ def _create_builder_config( | |||
) | |||
is_custom = (config_id not in self.builder_configs) and config_id != "default" | |||
if is_custom: | |||
logger.warning(f"Using custom data configuration {config_id}") | |||
logger.info(f"Using custom data configuration {config_id}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did this because I think it's not relevant anymore, and because I find it confusing to show this when calling IterableDataset.from_generator
Show benchmarksPyArrow==6.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
|
The documentation is not available anymore as the PR was closed or merged. |
Show benchmarksPyArrow==6.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
|
Show benchmarksPyArrow==6.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
|
Show benchmarksPyArrow==6.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
|
Show benchmarksPyArrow==6.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
|
Show benchmarksPyArrow==6.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is so cool, I learned a lot reading this. I'm sure it'll be super valuable and welcomed by the community! 😄
I think this would be more of a Conceptual Guide doc since this is more explanatory and compares the differences between a Dataset
and an IterableDataset
. It’s not necessarily a how-to for how to do something, but it discusses and explains the two types of datasets. There are definitely places in the docs where we can add a nice and link to this doc though to build up the user's understanding of this topic. For example, in the Know your dataset tutorial, we only introduce the regular Dataset
object and not the IterableDataset
. We can add a section there for IterableDataset
and then link to this doc that explains the difference between the two 🙂
|
||
## Downloading and streaming | ||
|
||
When you have a regular "map-style" [`Dataset`], you can access it using `my_dataset[0]`: we have what we call "random access" to the rows. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is meant by a “map-style” Dataset
? If I understand correctly, this is just a regular Dataset
. So it might be easier for users to understand if we don’t use this specific term and just use Dataset
or if we define what we mean by “map-style” (unless this is commonly known jargon, in which case ignore this haha).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it refers to datasets with random access, i.e. that allows you to do my_dataset[0]
. I'll define it properly
print(my_dataset[0]) | ||
``` | ||
|
||
To not have to wait for the conversion to Arrow, you can define an iterable dataset by streaming from your local files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the benefit of not converting Dataset to Arrow (obvs it’s faster, but it’d be good to mention this explicitly for the user)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Faster + save disk space + you can modify your original data and re-instantiate the dataset without having to reconvert the original data. I'll mention this !
sounds good to me !
good idea, thanks :) |
I'll open a PR to add a section on |
I moved the doc page to conceptual guides and took your suggestions into account :) I think this is ready for final review now |
Show benchmarksPyArrow==6.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
|
Show benchmarksPyArrow==6.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome doc, thanks for sharing all this infos!
@@ -0,0 +1,220 @@ | |||
# Differences between Dataset and IterableDataset | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just add a sentence or two here that introduces the topic and scope of the doc. Something like:
There are two types of dataset objects, a Dataset
and an IterableDataset
. Whichever type of dataset you choose to use or create depends on the size of the dataset. In general, an IterableDataset
is ideal for big datasets (think hundreds of GBs!) due to its lazy behavior and speed advantages, while a Dataset
is great for everything else. This page will compare the differences between a Dataset
and an IterableDataset
to help you pick the right dataset object for you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good to me !
my_iterable_dataset.n_shards # 1024 | ||
``` | ||
|
||
Feel free to open a discussion on the 🤗 Datasets [forum](https://discuss.huggingface.co/c/datasets/10) if you have questions ! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feel free to open a discussion on the 🤗 Datasets [forum](https://discuss.huggingface.co/c/datasets/10) if you have questions ! | |
Feel free to open a discussion on the 🤗 Datasets [forum](https://discuss.huggingface.co/c/datasets/10) if you have questions! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would remove this sentence altogether. Two existing links in our docs are more than enough :).
Returns: | ||
[`datasets.IterableDataset`] | ||
|
||
Example: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love all the example usages here! 😍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
The code looks good.
Regarding the docs, I think it would be better to add this info as notes/tips/sections to the existing docs (Process/Stream; e.g. a tip under Dataset.shuffle
that explains how to make this operation more performant by using to_iterable
+ shuffle
, etc.) rather than introducing a new doc page.
my_iterable_dataset.n_shards # 1024 | ||
``` | ||
|
||
Feel free to open a discussion on the 🤗 Datasets [forum](https://discuss.huggingface.co/c/datasets/10) if you have questions ! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would remove this sentence altogether. Two existing links in our docs are more than enough :).
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Show benchmarksPyArrow==6.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
|
Show benchmarksPyArrow==6.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
|
I took your comments into account :)
I added a paragraph in the Dataset.shuffle docstring, and a note in the Process doc page |
Show benchmarksPyArrow==6.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I love the new doc about Dataset vs IterableDataset, thank you! and I think it's worth a separate page.
I left just a few comments to the text.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! I guess we can keep the new page.
Btw, this page boils down to "switch from Dataset to IterableDataset to save time/disk space by avoiding the full rewrite of a dataset" (map
always does it, flatten_indices
is good to run after shuffle
to preserve the iteration speed), so maybe a better title for it would be "Optimize processing" (or "Working with datasets at scale" as I mentioned earlier on Slack)
PS: I think it would be a good idea to add links to the Guide pages for better discoverability and to somewhat "justify their presence in the docs" (from the tutorial/how-to pages to the guides; some guides are not referenced at all)
cc @stevhliu
Co-authored-by: Polina Kazakova <polina@huggingface.co>
Show benchmarksPyArrow==6.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
|
Took your last comments into account !
I think the content would be slightly different, e.g. focus more on multiprocessing/sharding or what data formats to use. This can be a complementary page IMO
Added a link in the how-to stream page. We may want to include it in the tutorial at one point at well - right now none of the tutorials mention streaming |
Show benchmarksPyArrow==6.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
|
Show benchmarksPyArrow==6.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
|
Show benchmarksPyArrow==6.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
|
Just merged #5485, which references this new doc! Will look for other pages in the docs where it'd make sense to add them :) |
Added
ds.to_iterable()
to get an iterable dataset from a map-style arrow dataset.It also has a
num_shards
argument to split the dataset before converting to an iterable dataset. Sharding is important to enable efficient shuffling and parallel loading of iterable datasets.TODO:
Fix #5265