Shard Dataset at specific indices #7415

@nikonikolov

I have a dataset of sequences (episodes), where each example in a sequence is a separate row in the dataset (similar to LeRobotDataset). When running Dataset.save_to_disk, how can I provide the indices at which the dataset may be sharded, so that no episode spans more than one shard? Consequently, when I run Dataset.load_from_disk, how can I load just a subset of the shards to save memory and time on different ranks?

Alternatively, given a loaded Dataset, how can I run Dataset.shard so that no episode is split across shards?
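
To make the request concrete, here is a minimal sketch of the boundary computation being asked for. It is not an existing `datasets` API: the helper name `episode_shard_ranges` and the per-row `episode_ids` column are hypothetical, and it assumes that all rows of one episode are stored contiguously. It splits by episode count, so shards may differ in row count.

```python
from typing import List, Sequence, Tuple


def episode_shard_ranges(
    episode_ids: Sequence[int], num_shards: int
) -> List[Tuple[int, int]]:
    """Return contiguous [start, end) row ranges that never cut an episode.

    Hypothetical helper: assumes rows of the same episode are adjacent.
    """
    if num_shards < 1:
        raise ValueError("num_shards must be >= 1")
    n = len(episode_ids)
    if n == 0:
        return []

    # Row indices where a new episode begins; only these are legal cut points.
    starts = [0] + [i for i in range(1, n) if episode_ids[i] != episode_ids[i - 1]]
    # bounds[e] .. bounds[e + 1] is the row range of episode e.
    bounds = starts + [n]
    num_episodes = len(starts)

    # Never create more shards than there are episodes.
    num_shards = min(num_shards, num_episodes)

    ranges = []
    for k in range(num_shards):
        first = k * num_episodes // num_shards
        last = (k + 1) * num_episodes // num_shards
        ranges.append((bounds[first], bounds[last]))
    return ranges


# Four episodes of lengths 3, 2, 4 and 1.
example_ids = [0, 0, 0, 1, 1, 2, 2, 2, 2, 3]
two_shards = episode_shard_ranges(example_ids, 2)
```

Each `(start, end)` range could then be materialized with `ds.select(range(start, end))` and written with `save_to_disk` into its own directory, so that every rank calls `load_from_disk` only on the directories it needs. That is a workaround under the assumptions above, not a confirmed way to pass custom boundaries to `save_to_disk` itself.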
