documentation missing how to split a dataset #259
This seems to work for my specific problem:
Currently you can indeed split a dataset using the `split` argument. However right now we don't have a way to shuffle a dataset, but we are thinking about it in the discussion in #166. Feel free to share your thoughts about it. One trick that you can do until we have a better solution is to shuffle and split the indices of your dataset:

```python
import nlp
from sklearn.model_selection import train_test_split

imdb = nlp.load_dataset('imdb', split='test')
test_indices, val_indices = train_test_split(range(len(imdb)))
```

and then to iterate each split:

```python
for i in test_indices:
    example = imdb[i]
    ...
```
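The snippet above needs the `nlp` library and a real IMDB download. The same trick can be sketched self-contained: only integer indices are ever shuffled, so the dataset itself can be stubbed by its length (the `n_examples` value below is a hypothetical stand-in):

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for len(imdb): the trick only shuffles integer
# indices, so the Arrow-backed rows are never loaded into memory.
n_examples = 1000

# Split the index range 80/20; each split is a list of row indices.
test_indices, val_indices = train_test_split(
    range(n_examples), test_size=0.2, random_state=0
)

# Each split is then iterated lazily, one example at a time:
# for i in test_indices:
#     example = imdb[i]
```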
I added a small guide here that explains how to split a dataset. It is very similar to the tensorflow datasets guide, as we kept the same logic.
Thanks a lot, the new explanation is very helpful! About using train_test_split from sklearn: I stumbled across the same error message as this user and thought it couldn't be used in this context at the moment. I will check it out again. One of the remaining problems is how to shuffle very large datasets that don't fit into memory. One strategy could be shuffling the data in sections, but when the data is sorted by label you would have to swap larger sections first.
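The "shuffle in sections" idea can be sketched as a two-pass external shuffle (a sketch of the general technique, not what `datasets` implements): first scatter indices into random buckets, so label-sorted neighbours land in different buckets, then shuffle each bucket in memory. Only one bucket ever needs to fit in RAM:

```python
import random

def external_shuffle(n_records, n_buckets=10, seed=0):
    """Approximately shuffle [0, n_records) using bucket-sized memory."""
    rng = random.Random(seed)
    buckets = [[] for _ in range(n_buckets)]
    for i in range(n_records):            # pass 1: scatter indices randomly
        buckets[rng.randrange(n_buckets)].append(i)
    order = []
    for bucket in buckets:                # pass 2: shuffle one bucket at a time
        rng.shuffle(bucket)
        order.extend(bucket)
    return order

order = external_shuffle(1000)
```

Because the scatter pass already separates adjacent (same-label) records, this avoids the "swap larger sections first" problem for class-sorted data.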
We added a way to shuffle datasets (shuffle the indices and then reorder to make a new dataset). Closing this issue as we added the docs for splits and tools to split datasets. Thanks again for your feedback!
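The "shuffle the indices and then reorder" approach mentioned above can be sketched with a plain list standing in for the Arrow table (a sketch under that assumption, not the actual `datasets` internals):

```python
import random

# Stand-in for a label-sorted dataset (like imdb's test split).
records = [("neg", i) for i in range(500)] + [("pos", i) for i in range(500)]

# Step 1: shuffle only the indices.
indices = list(range(len(records)))
random.Random(42).shuffle(indices)

# Step 2: reorder, materializing a new dataset in the shuffled order.
shuffled = [records[i] for i in indices]

# A contiguous slice of the shuffled order is now a valid random split.
train, test = shuffled[:800], shuffled[800:]
```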
The updated documentation doesn't link to this anymore: https://huggingface.co/docs/datasets/v2.10.0/en/package_reference/builder_classes#datasets.Split |
I am trying to understand how to split a dataset (as arrow_dataset).
I know I can do something like this to access a split which is already in the original dataset:

```python
ds_test = nlp.load_dataset('imdb', split='test')
```
But how can I split ds_test into a test and a validation set (without reading the data into memory and keeping the arrow_dataset as container)?
I guess it has something to do with the `split` module :-) but there is no real documentation in the code, only a reference to a longer description:
But the guide seems to be missing.
To clarify: I know that this has been modelled after the tensorflow datasets and that some of the documentation there can be used, like this one. But to come back to the example above: I cannot simply split the test set doing this:

```python
ds_test = nlp.load_dataset('imdb', split='test[:5000]')
ds_val = nlp.load_dataset('imdb', split='test[5000:]')
```
because the imdb test data is sorted by class (probably not a good idea anyway)
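Because the test data is sorted by class, a contiguous `test[:5000]` slice would contain only one label. A stratified split over the indices avoids this; below is a stdlib-only sketch, where the label layout is a hypothetical stand-in for imdb's class-sorted test split:

```python
import random
from collections import defaultdict

# Hypothetical stand-in: 5000 negative labels followed by 5000 positive,
# mimicking a class-sorted test split.
labels = [0] * 5000 + [1] * 5000

# Group indices by label, shuffle within each group, then take the same
# fraction from every group so both halves keep the class ratio.
by_label = defaultdict(list)
for i, y in enumerate(labels):
    by_label[y].append(i)

rng = random.Random(0)
test_idx, val_idx = [], []
for group in by_label.values():
    rng.shuffle(group)
    cut = len(group) // 2
    test_idx.extend(group[:cut])
    val_idx.extend(group[cut:])
```

sklearn's `train_test_split(..., stratify=labels)` does the same job in one call once it works on your indices.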