@stas00 stas00 commented Mar 30, 2022

Starting with datasets==1.18.4, ds.select(myrange) raises an IndexError when myrange includes indices at or beyond the length of the dataset. This breaks all our examples whenever the requested max samples exceeds the dataset size, e.g.:

stderr: Traceback (most recent call last):
stderr:   File "/mnt/nvme0/code/huggingface/transformers-master/examples/pytorch/translation/run_translation.py", line 624, in <module>
stderr:     main()
stderr:   File "/mnt/nvme0/code/huggingface/transformers-master/examples/pytorch/translation/run_translation.py", line 436, in main
stderr:     train_dataset = train_dataset.select(range(data_args.max_train_samples))
stderr:   File "/home/stas/anaconda3/envs/py38-pt111/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 486, in wrapper
stderr:     out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
stderr:   File "/home/stas/anaconda3/envs/py38-pt111/lib/python3.8/site-packages/datasets/fingerprint.py", line 458, in wrapper
stderr:     out = func(self, *args, **kwargs)
stderr:   File "/home/stas/anaconda3/envs/py38-pt111/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2601, in select
stderr:     _check_valid_indices_value(int(max(indices)), size=size)
stderr:   File "/home/stas/anaconda3/envs/py38-pt111/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 573, in _check_valid_indices_value
stderr:     raise IndexError(
stderr: IndexError: Invalid value 15 in indices iterable. All values must be within range [-11, 10].
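
The range in the error message suggests that, for a dataset of 11 rows, an index is accepted only if -11 <= index <= 10. A minimal sketch of that rule follows; `check_index` is a hypothetical helper written for illustration, inferred from the message alone, and is not the actual `datasets` implementation:

```python
def check_index(index: int, size: int) -> None:
    """Hypothetical bounds check, inferred from the error message above."""
    # Negative indices count from the end, so the valid range is [-size, size - 1].
    if index < -size or index >= size:
        raise IndexError(
            f"Invalid value {index} in indices iterable. "
            f"All values must be within range [{-size}, {size - 1}]."
        )


size = 11  # dataset length implied by the reported range [-11, 10]

check_index(10, size)   # last valid positive index: no exception
check_index(-11, size)  # first valid negative index: no exception

try:
    check_index(15, size)  # the value reported in the traceback
except IndexError as err:
    message = str(err)
    print(message)
```

Under this assumption, any range whose largest value reaches the dataset length is rejected.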

This PR fixes the issue across all PyTorch examples by clamping the requested sample count to the dataset length, using:

find examples -type f -name "*.py" -exec perl -pi -e 's|^(\s+)(eval_dataset = eval_dataset.select.range.data_args.max_eval_samples..)|$1max_eval_samples = min(len(eval_dataset), data_args.max_eval_samples)\n$1eval_dataset = eval_dataset.select(range(max_eval_samples))|' {} \;
find examples -type f -name "*.py" -exec perl -pi -e 's|^(\s+)(eval_examples = eval_examples.select.range.data_args.max_eval_samples..)|$1max_eval_samples = min(len(eval_examples), data_args.max_eval_samples)\n$1eval_examples = eval_examples.select(range(max_eval_samples))|' {} \;
find examples -type f -name "*.py" -exec perl -pi -e 's|^(\s+)(predict_dataset = predict_dataset.select.range.data_args.max_predict_samples..)|$1max_predict_samples = min(len(predict_dataset), data_args.max_predict_samples)\n$1predict_dataset = predict_dataset.select(range(max_predict_samples))|' {} \;
find examples -type f -name "*.py" -exec perl -pi -e 's|^(\s+)(test_dataset = test_dataset.select.range.data_args.max_eval_samples..)|$1max_eval_samples = min(len(test_dataset), data_args.max_eval_samples)\n$1test_dataset = test_dataset.select(range(max_eval_samples))|' {} \;
find examples -type f -name "*.py" -exec perl -pi -e 's|^(\s+)(train_dataset = train_dataset.select.range.data_args.max_train_samples..)|$1max_train_samples = min(len(train_dataset), data_args.max_train_samples)\n$1train_dataset = train_dataset.select(range(max_train_samples))|' {} \;

I may have missed some cases, but this should cover most of it.

This PR only adjusts the PyTorch examples.
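
For readers who don't want to parse the perl substitutions, here is the pattern they write into each script, sketched in Python. `TinyDataset` is a hypothetical stand-in that only mimics the out-of-range check (it is not the `datasets.Dataset` class), and `requested_max_train_samples` stands in for `data_args.max_train_samples`:

```python
class TinyDataset:
    """Hypothetical stand-in for a dataset whose select() rejects out-of-range indices."""

    def __init__(self, rows):
        self.rows = list(rows)

    def __len__(self):
        return len(self.rows)

    def select(self, indices):
        indices = list(indices)
        if indices and max(indices) >= len(self.rows):
            raise IndexError(f"Invalid value {max(indices)} in indices iterable.")
        return TinyDataset(self.rows[i] for i in indices)


train_dataset = TinyDataset(range(11))
requested_max_train_samples = 16  # assumption: larger than the dataset

# Before the fix: selecting range(16) on 11 rows raises IndexError.
try:
    train_dataset.select(range(requested_max_train_samples))
    unclamped_failed = False
except IndexError:
    unclamped_failed = True

# After the fix: clamp the count to the dataset length, then select.
max_train_samples = min(len(train_dataset), requested_max_train_samples)
train_dataset = train_dataset.select(range(max_train_samples))
print(len(train_dataset))
```

The clamp changes nothing when the requested count is already within the dataset size, which is why applying it to every script is safe.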

@sgugger

@stas00 stas00 changed the title from "[examples] max samples can't be bigger than then len of dataset" to "[examples] max samples can't be bigger than the len of dataset" Mar 30, 2022

HuggingFaceDocBuilderDev commented Mar 30, 2022

The documentation is not available anymore as the PR was closed or merged.

@sgugger sgugger left a comment

Thanks for fixing those!
The same fix will need to be deployed on the TensorFlow and Flax examples AFAICT.

stas00 commented Mar 30, 2022

Sure, I replayed the same substitutions for the TF/Flax examples as well.

sgugger commented Mar 30, 2022

Very nice of you, thanks a lot!

You caught some research projects in the change, but it's more defensive, so fine to merge.

@stas00 stas00 merged commit a73281e into main Mar 30, 2022
@stas00 stas00 deleted the examples-fix-max-samples branch March 30, 2022 19:33