Closed
Labels: enhancement (New feature or request)
Description
Feature request
In [6]: ds = datasets.load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:1000]")
In [7]: ds = datasets.load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:1_000]")
[dozens of frames skipped]
File /usr/local/lib/python3.10/dist-packages/datasets/arrow_reader.py:444, in _str_to_read_instruction(spec)
442 res = _SUB_SPEC_RE.match(spec)
443 if not res:
--> 444 raise ValueError(f"Unrecognized instruction format: {spec}")
ValueError: Unrecognized instruction format: train_sft[:1_000]
It took me a while to understand what the problem was. Apparently the split-spec parser (the error is raised from arrow_reader.py in datasets, per the traceback above) doesn't accept Python number literals that include _, as in 1_000. The _ aids readability: 10_000_000 is obviously easier to read than 10000000.
Feature request:
Ideally datasets, being a Python module, would do the right thing and convert Python number literals into whatever the parser supports, in this case by stripping the _s.
Second best, it would raise an error telling the user that numbers with _ in split slices are not acceptable, so that the user doesn't have to decipher a huge traceback they know nothing about.
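To illustrate the first option, here is a minimal sketch of the kind of normalization I have in mind. The helper name normalize_split_spec is hypothetical and not part of datasets. It assumes underscores only need stripping between digits inside the [...] slice part, so split names such as train_sft are left untouched:

```python
import re

# Hypothetical helper for illustration only; not part of the datasets API.
# Matches an underscore that sits directly between two digits (as in 1_000).
_DIGIT_UNDERSCORE_RE = re.compile(r"(?<=\d)_(?=\d)")
# Matches the bracketed slice part of a split spec, e.g. "[:1_000]".
_SLICE_RE = re.compile(r"\[([^\]]*)\]")


def normalize_split_spec(spec: str) -> str:
    """Strip digit-separating underscores inside [...] slices only."""

    def _strip(match: re.Match) -> str:
        return "[" + _DIGIT_UNDERSCORE_RE.sub("", match.group(1)) + "]"

    return _SLICE_RE.sub(_strip, spec)


print(normalize_split_spec("train_sft[:1_000]"))
```

As a user-side workaround until something like this exists in the library, the normalized string could be passed as split=normalize_split_spec("train_sft[:1_000]") to datasets.load_dataset.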
Thank you!