Shard Dataset at specific indices

I have a dataset of sequences (episodes), where each example in a sequence is a separate row in the dataset (similar to LeRobotDataset). When running `Dataset.save_to_disk`, how can I provide the indices at which the dataset may be sharded, so that no episode spans more than one shard? And then, when I run `Dataset.load_from_disk`, how can I load just a subset of the shards to save memory and time on different ranks?

I guess an alternative to this would be, given a loaded `Dataset`, how can I run `Dataset.shard` such that sharding doesn't split any episode across shards?
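
To make the request concrete, here is a rough sketch of the behaviour I have in mind, written in plain Python so it does not depend on any library internals. It assumes each row carries an episode identifier (I call the column `episode_index` here, which is an assumption about the schema) and that rows of the same episode are stored contiguously. The helper name `episode_aligned_shards` is hypothetical and not part of `datasets`.

```python
from typing import List, Sequence, Tuple


def episode_aligned_shards(
    episode_ids: Sequence[int], num_shards: int
) -> List[Tuple[int, int]]:
    """Return [start, end) row ranges, cut only where a new episode begins.

    Hypothetical helper: assumes rows of one episode are contiguous.
    May return fewer than `num_shards` ranges if there are too few episodes.
    """
    n = len(episode_ids)
    if n == 0 or num_shards < 1:
        return []

    # Row indices at which a new episode starts.
    starts = [0] + [i for i in range(1, n) if episode_ids[i] != episode_ids[i - 1]]

    cuts = [0]
    for k in range(1, num_shards):
        # Ideal (even) cut position, snapped to the nearest episode start.
        target = k * n / num_shards
        best = min(starts, key=lambda s: abs(s - target))
        # Skip cuts that would create an empty or out-of-order shard.
        if best > cuts[-1]:
            cuts.append(best)
    cuts.append(n)

    return list(zip(cuts[:-1], cuts[1:]))


# Toy example: four episodes of lengths 3, 2, 4 and 1.
episode_ids = [0, 0, 0, 1, 1, 2, 2, 2, 2, 3]
ranges = episode_aligned_shards(episode_ids, num_shards=3)
```

With ranges like these, I could presumably build each shard with `ds.select(range(start, end))` and save or load them separately per rank, but that feels like a workaround. Ideally `save_to_disk` or `Dataset.shard` would accept the boundaries directly.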

Shard Dataset at specific indices #7415
