deal with python `10_000` legal number in slice syntax

### Feature request

```
In [6]: ds = datasets.load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:1000]")

In [7]: ds = datasets.load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:1_000]")
[dozens of frames skipped]
File /usr/local/lib/python3.10/dist-packages/datasets/arrow_reader.py:444, in _str_to_read_instruction(spec)
    442 res = _SUB_SPEC_RE.match(spec)
    443 if not res:
--> 444     raise ValueError(f"Unrecognized instruction format: {spec}")
ValueError: Unrecognized instruction format: train_sft[:1_000]
```

It took me a while to understand what the problem was. Apparently the split-slice parser (the traceback points at `_str_to_read_instruction` in `datasets/arrow_reader.py`) doesn't accept Python integer literals that include `_`, as in `1_000`. The `_` aids readability: `10_000_000` is obviously easier to grasp than `10000000`.

Feature request:

Ideally `datasets`, being a Python module, would do the right thing and convert Python numbers into whatever the slice parser supports, in this case by stripping the `_`s.
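To make the request concrete, here is a minimal sketch of the stripping step, written as a user-side workaround. `normalize_split` is a hypothetical helper name, not something `datasets` provides, and the assumption is that it is enough to remove a single `_` sitting between two digits inside the `[...]` part of the spec, so split names like `train_sft` stay untouched.

```python
import re

# One underscore sitting between two digits, e.g. the "_" in "1_000".
_DIGIT_SEP = re.compile(r"(?<=\d)_(?=\d)")
# The bracketed slice part of a split spec, e.g. "[:1_000]".
_BRACKETS = re.compile(r"\[([^\]]*)\]")


def normalize_split(spec: str) -> str:
    """Hypothetical helper: drop digit-separator underscores inside [...] only."""
    return _BRACKETS.sub(
        lambda m: "[" + _DIGIT_SEP.sub("", m.group(1)) + "]",
        spec,
    )


print(normalize_split("train_sft[:1_000]"))
```

The result could then be passed along as `datasets.load_dataset("HuggingFaceH4/ultrachat_200k", split=normalize_split("train_sft[:1_000]"))`.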

Second best, it would raise an error telling the user that numbers with `_` in split slices are not acceptable, so that the user doesn't have to dig through a huge traceback they know nothing about.
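A rough sketch of what that second option could look like, again as a standalone function. `check_split_slice` and its message wording are made up for illustration and do not reflect how `datasets` validates split specs today.

```python
import re

# The bracketed slice part of a split spec, e.g. "[:1_000]".
_BRACKETS = re.compile(r"\[([^\]]*)\]")
# A digit, an underscore, then another digit, e.g. "1_0" in "1_000".
_UNDERSCORE_NUMBER = re.compile(r"\d_\d")


def check_split_slice(spec: str) -> None:
    """Hypothetical check: fail early with a readable message."""
    for inner in _BRACKETS.findall(spec):
        if _UNDERSCORE_NUMBER.search(inner):
            raise ValueError(
                f"Split slice in {spec!r} uses '_' as a digit separator, "
                "which is not supported here; write the number without '_' "
                "(for example 1000 instead of 1_000)."
            )


check_split_slice("train_sft[:1000]")  # passes silently
```

Calling `check_split_slice("train_sft[:1_000]")` raises a `ValueError` that names the actual cause instead of the generic "Unrecognized instruction format".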

Thank you!
