Open
Description
I have a dataset of sequences (episodes), where each example in a sequence is a separate row in the dataset (similar to LeRobotDataset). When running Dataset.save_to_disk, how can I provide the indices at which the dataset may be split into shards, so that no episode spans more than one shard? Then, when I run Dataset.load_from_disk, how can I load just a subset of the shards, to save memory and time on different ranks?
I guess an alternative would be: given a loaded Dataset, how can I run Dataset.shard so that it doesn't split any episode across shards?
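To make the request concrete, here is a minimal sketch of the boundary computation I have in mind. It is plain Python and not part of the `datasets` API; the function name `episode_aligned_shards` is hypothetical, and it assumes every episode occupies a contiguous run of rows and that the per-row episode ids are available as a list (for example from an episode-index column).

```python
from itertools import groupby


def episode_aligned_shards(episode_ids, num_shards):
    """Return [start, end) row ranges, one per shard, that never split an episode.

    Assumes rows of the same episode are contiguous in `episode_ids`.
    """
    if num_shards < 1:
        raise ValueError("num_shards must be >= 1")

    # Length of each contiguous run of identical episode ids.
    lengths = [sum(1 for _ in run) for _, run in groupby(episode_ids)]
    if not lengths:
        return []

    total = sum(lengths)
    # There cannot be more non-empty shards than episodes.
    num_shards = min(num_shards, len(lengths))

    ranges = []
    start = end = 0
    for i, length in enumerate(lengths):
        end += length
        shards_after = num_shards - len(ranges) - 1  # shards still to fill after this one
        episodes_after = len(lengths) - i - 1        # episodes not yet assigned
        if shards_after == 0:
            continue  # the last shard takes every remaining episode
        # Close the shard once it reaches its proportional share of rows, or
        # when the remaining episodes are only just enough for the remaining shards.
        target = total * (len(ranges) + 1) / num_shards
        if end >= target or episodes_after == shards_after:
            ranges.append((start, end))
            start = end
    ranges.append((start, total))
    return ranges
```

For example, `episode_aligned_shards([0, 0, 0, 1, 1, 2, 2, 2, 2, 3], 3)` returns `[(0, 5), (5, 9), (9, 10)]`. As a workaround, each range could presumably be materialised with `Dataset.select(range(start, end))` and written to its own directory with `save_to_disk`, so that each rank calls `load_from_disk` only on the directories it needs. What I am really asking is whether `save_to_disk` or `Dataset.shard` can do this directly.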