
documentation missing how to split a dataset #259

Closed

fotisj opened this issue Jun 10, 2020 · 7 comments

fotisj commented Jun 10, 2020

I am trying to understand how to split a dataset (as an arrow_dataset).
I know I can do something like this to access a split which is already in the original dataset:

ds_test = nlp.load_dataset('imdb', split='test')

But how can I split ds_test into a test and a validation set (without reading the data into memory and keeping the arrow_dataset as container)?
I guess it has something to do with the split module :-) but there is no real documentation in the code, only a reference to a longer description:

See the guide on splits for more information.

But the guide seems to be missing.

To clarify: I know that this has been modelled after tensorflow datasets and that some of the documentation there can be used, like this one. But to come back to the example above: I cannot simply split the test set by doing this:

ds_test = nlp.load_dataset('imdb', split='test[:5000]')
ds_val = nlp.load_dataset('imdb', split='test[5000:]')

because the imdb test data is sorted by class (which is probably not a good idea anyway).

fotisj commented Jun 10, 2020

This seems to work for my specific problem:

self.train_ds, self.test_ds, self.val_ds = map(_prepare_ds, ('train', 'test[:25%]+test[50%:75%]', 'test[75%:]'))
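
(For reference, _prepare_ds is my own helper, so here is a minimal sketch of what it might look like, assuming it just loads the named split; the real version would also tokenize/format the data:)

import nlp

def _prepare_ds(split):
    # hypothetical helper: load the requested IMDB split
    # (split strings like 'test[:25%]+test[50%:75%]' combine slices of a split)
    return nlp.load_dataset('imdb', split=split)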

lhoestq commented Jun 11, 2020

Currently you can indeed split a dataset using ds_test = nlp.load_dataset('imdb', split='test[:5000]') (this also works with percentages).

However, right now we don't have a way to shuffle a dataset, but we are thinking about it in the discussion in #166. Feel free to share your thoughts about it.

One trick that you can do until we have a better solution is to shuffle and split the indices of your dataset:

import nlp
from sklearn.model_selection import train_test_split

imdb = nlp.load_dataset('imdb', split='test')
test_indices, val_indices = train_test_split(list(range(len(imdb))))

and then to iterate each split:

for i in test_indices:
    example = imdb[i]
    ...
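
Once the library gained Dataset.select (see the follow-up below), the shuffled indices can be turned into standalone dataset objects; a sketch, assuming today's datasets package (the renamed successor of nlp):

import datasets
from sklearn.model_selection import train_test_split

imdb = datasets.load_dataset('imdb', split='test')
test_indices, val_indices = train_test_split(list(range(len(imdb))), random_state=42)

# select() returns a new Dataset backed by an indices mapping,
# so the underlying arrow data is not copied into memory
test_ds = imdb.select(test_indices)
val_ds = imdb.select(val_indices)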

lhoestq commented Jun 15, 2020

I added a small guide here that explains how to split a dataset. It is very similar to the tensorflow datasets guide, as we kept the same logic.
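
For reference, the split string syntax supports absolute slicing, percentage slicing, and combining splits; a few examples (using the nlp API of the time, though the same strings work in today's datasets):

import nlp

# absolute slicing
ds = nlp.load_dataset('imdb', split='test[:5000]')

# percentage slicing
ds = nlp.load_dataset('imdb', split='train[:10%]')

# combining (slices of) splits
ds = nlp.load_dataset('imdb', split='train+test[:50%]')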

fotisj commented Jun 15, 2020

Thanks a lot, the new explanation is very helpful!

About using train_test_split from sklearn: I stumbled across the same error message as this user and thought it couldn't be used in this context at the moment. Will check it out again.

One of the problems is how to shuffle very large datasets which don't fit into memory. Well, one strategy could be shuffling the data in sections. But in a case where the data is sorted by label, you would have to swap larger sections first.
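
Shuffling just an index permutation would sidestep this, since the permutation fits in memory even when the rows don't; a rough sketch:

import nlp
import numpy as np

dataset = nlp.load_dataset('imdb', split='test')

# the permutation is tiny compared to the data itself,
# and rows are then read from disk in permuted order
rng = np.random.default_rng(42)
for i in rng.permutation(len(dataset)):
    example = dataset[int(i)]
    ...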

lhoestq commented Jun 18, 2020

We added a way to shuffle datasets (shuffle the indices and then reorder to make a new dataset).
You can do shuffled_dset = dataset.shuffle(seed=my_seed). It shuffles the whole dataset.
There is also dataset.train_test_split(), which is very handy (with the same signature as sklearn).
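
Putting the two together, a quick sketch (parameter names as in today's datasets docs):

import nlp

imdb_test = nlp.load_dataset('imdb', split='test')

# shuffle the whole dataset (via an indices mapping, nothing is copied)
shuffled = imdb_test.shuffle(seed=42)

# split into new datasets; returns a dict-like object with 'train' and 'test' keys
# (here the two halves serve as test and validation sets)
splits = imdb_test.train_test_split(test_size=0.5, seed=42)
test_ds, val_ds = splits['train'], splits['test']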

Closing this issue as we added the docs for splits and tools to split datasets. Thanks again for your feedback!

lhoestq closed this as completed Jun 18, 2020

lhoestq commented Mar 14, 2023

The updated documentation doesn't link to this anymore: https://huggingface.co/docs/datasets/v2.10.0/en/package_reference/builder_classes#datasets.Split
