deal with python 10_000 legal number in slice syntax #7481

@sfc-gh-sbekman

Feature request

In [6]: ds = datasets.load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:1000]")

In [7]: ds = datasets.load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:1_000]")
[dozens of frames skipped]
File /usr/local/lib/python3.10/dist-packages/datasets/arrow_reader.py:444, in _str_to_read_instruction(spec)
    442 res = _SUB_SPEC_RE.match(spec)
    443 if not res:
--> 444     raise ValueError(f"Unrecognized instruction format: {spec}")
ValueError: Unrecognized instruction format: train_sft[:1_000]

It took me a while to understand what the problem was. Apparently the split-slice parser in datasets (arrow_reader.py, per the traceback above) doesn't accept Python numbers that include _, as in 1_000. The _ aids readability: 10_000_000 is obviously easier to read than 10000000.
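To illustrate the mismatch, here is a minimal sketch. The pattern below is a hypothetical simplification, not the actual `_SUB_SPEC_RE` from datasets; it only assumes that slice bounds are matched with a digits-only pattern such as `\d+`, which would explain the error above.

```python
import re

# Hypothetical, simplified stand-in for a split-slice pattern.
# This is NOT the real _SUB_SPEC_RE; it only assumes bounds are matched by \d+.
SIMPLE_SLICE_RE = re.compile(
    r"^(?P<split>\w+)\[(?P<start>-?\d+)?:(?P<stop>-?\d+)?\]$"
)

def matches(spec: str) -> bool:
    """Return True if the spec is accepted by the simplified pattern."""
    return SIMPLE_SLICE_RE.match(spec) is not None

plain_ok = matches("train_sft[:1000]")        # digits only
underscore_ok = matches("train_sft[:1_000]")  # "_" is not matched by \d
python_value = int("1_000")                   # Python itself accepts the underscore (PEP 515)
```

Under that assumption, the first spec matches, the second does not, and Python's own `int()` parses `"1_000"` without complaint.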

Feature request:

Ideally datasets, being a Python module, would do the right thing and convert Python numbers into whatever the slice parser supports, in this case by stripping the _s.

Second best, it would raise an error telling the user that numbers with _ are not acceptable in split slices, so that the user doesn't have to wade through a huge traceback they know nothing about.
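Both options can be sketched on the user side. The helpers below (`normalize_split_spec` and `reject_underscored_numbers`) are hypothetical names for illustration, not part of datasets. They assume the numbers of interest appear only inside the square brackets of the split string.

```python
import re

# An underscore sitting directly between two digits, e.g. the "_" in "1_000".
_DIGIT_SEP_RE = re.compile(r"(?<=\d)_(?=\d)")
# The bracketed slice part of a split spec, e.g. "[:1_000]".
_BRACKET_RE = re.compile(r"\[[^\]]*\]")

def normalize_split_spec(spec: str) -> str:
    """Option 1: strip digit-group underscores inside [...] only.

    The split name (e.g. "train_sft") is left untouched.
    """
    return _BRACKET_RE.sub(lambda m: _DIGIT_SEP_RE.sub("", m.group(0)), spec)

def reject_underscored_numbers(spec: str) -> None:
    """Option 2: fail early with a message that names the actual problem."""
    for bracket in _BRACKET_RE.findall(spec):
        if _DIGIT_SEP_RE.search(bracket):
            raise ValueError(
                f"Numbers with '_' are not supported in split slices: {spec!r}. "
                f"Try {normalize_split_spec(spec)!r} instead."
            )

# Possible usage with the call from this issue (requires network access):
# ds = datasets.load_dataset(
#     "HuggingFaceH4/ultrachat_200k",
#     split=normalize_split_spec("train_sft[:1_000]"),
# )
```

Restricting the substitution to the bracketed part keeps underscores in split names such as `train_sft` intact, which a blanket `spec.replace("_", "")` would not.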

Thank you!

Labels: enhancement (New feature or request)
