
documentation missing how to split a dataset #259

Closed

fotisj opened this issue Jun 10, 2020 · 7 comments

fotisj commented Jun 10, 2020

I am trying to understand how to split a dataset (as an arrow_dataset).
I know I can do something like this to access a split which is already in the original dataset:

ds_test = nlp.load_dataset('imdb', split='test')

But how can I split ds_test into a test and a validation set (without reading the data into memory and keeping the arrow_dataset as container)?
I guess it has something to do with the split module :-) but there is no real documentation in the code, only a reference to a longer description:

See the guide on splits for more information.

But the guide seems to be missing.

To clarify: I know that this has been modelled after tensorflow datasets and that some of the documentation there can be used, like this one. But to come back to the example above: I cannot simply split the test set by doing this:

ds_test = nlp.load_dataset('imdb', split='test[:5000]')
ds_val = nlp.load_dataset('imdb', split='test[5000:]')

because the imdb test data is sorted by class (which is probably not a good idea anyway).

fotisj commented Jun 10, 2020

This seems to work for my specific problem:

self.train_ds, self.test_ds, self.val_ds = map(_prepare_ds, ('train', 'test[:25%]+test[50%:75%]', 'test[75%:]'))
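
(For reference, _prepare_ds is my own helper, so here is a minimal sketch of what it might look like, assuming it just loads the named split; the real version would also tokenize/format the data:)

import nlp

def _prepare_ds(split):
    # hypothetical helper: load the requested IMDB split
    # (split strings like 'test[:25%]+test[50%:75%]' combine slices of a split)
    return nlp.load_dataset('imdb', split=split)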

lhoestq commented Jun 11, 2020

Currently you can indeed split a dataset using ds_test = nlp.load_dataset('imdb', split='test[:5000]') (this also works with percentages).

However, right now we don't have a way to shuffle a dataset, but we are thinking about it in the discussion in #166. Feel free to share your thoughts about it.

One trick that you can do until we have a better solution is to shuffle and split the indices of your dataset:

import nlp
from sklearn.model_selection import train_test_split

imdb = nlp.load_dataset('imdb', split='test')
test_indices, val_indices = train_test_split(list(range(len(imdb))))

and then to iterate each split:

for i in test_indices:
    example = imdb[i]
    ...
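
Once the library gained Dataset.select (see the follow-up below), the shuffled indices can be turned into standalone dataset objects; a sketch, assuming today's datasets package (the renamed successor of nlp):

import datasets
from sklearn.model_selection import train_test_split

imdb = datasets.load_dataset('imdb', split='test')
test_indices, val_indices = train_test_split(list(range(len(imdb))), random_state=42)

# select() returns a new Dataset backed by an indices mapping,
# so the underlying arrow data is not copied into memory
test_ds = imdb.select(test_indices)
val_ds = imdb.select(val_indices)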

lhoestq commented Jun 15, 2020

I added a small guide here that explains how to split a dataset. It is very similar to the tensorflow datasets guide, as we kept the same logic.
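
For reference, the split string syntax supports absolute slicing, percentage slicing, and combining splits; a few examples (using the nlp API of the time, though the same strings work in today's datasets):

import nlp

# absolute slicing
ds = nlp.load_dataset('imdb', split='test[:5000]')

# percentage slicing
ds = nlp.load_dataset('imdb', split='train[:10%]')

# combining (slices of) splits
ds = nlp.load_dataset('imdb', split='train+test[:50%]')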

fotisj commented Jun 15, 2020

Thanks a lot, the new explanation is very helpful!

About using train_test_split from sklearn: I stumbled across the same error message as this user and thought it couldn't be used in this context at the moment. Will check it out again.

One of the problems is how to shuffle very large datasets which don't fit into memory. Well, one strategy could be shuffling the data in sections. But in a case where the data is sorted by label, you would have to swap larger sections first.
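
Shuffling just an index permutation would sidestep this, since the permutation fits in memory even when the rows don't; a rough sketch:

import nlp
import numpy as np

dataset = nlp.load_dataset('imdb', split='test')

# the permutation is tiny compared to the data itself,
# and rows are then read from disk in permuted order
rng = np.random.default_rng(42)
for i in rng.permutation(len(dataset)):
    example = dataset[int(i)]
    ...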

lhoestq commented Jun 18, 2020

We added a way to shuffle datasets (shuffle the indices and then reorder to make a new dataset).
You can do shuffled_dset = dataset.shuffle(seed=my_seed). It shuffles the whole dataset.
There is also dataset.train_test_split(), which is very handy (with the same signature as sklearn).
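
Putting the two together, a quick sketch (parameter names as in today's datasets docs):

import nlp

imdb_test = nlp.load_dataset('imdb', split='test')

# shuffle the whole dataset (via an indices mapping, nothing is copied)
shuffled = imdb_test.shuffle(seed=42)

# split into new datasets; returns a dict-like object with 'train' and 'test' keys
# (here the two halves serve as test and validation sets)
splits = imdb_test.train_test_split(test_size=0.5, seed=42)
test_ds, val_ds = splits['train'], splits['test']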

Closing this issue as we added the docs for splits and tools to split datasets. Thanks again for your feedback!

lhoestq closed this as completed Jun 18, 2020

lhoestq commented Mar 14, 2023

The updated documentation doesn't link to this anymore: https://huggingface.co/docs/datasets/v2.10.0/en/package_reference/builder_classes#datasets.Split
