🚀 Feature
Efficient data loader for huge dataset with lazy loading!
Motivation
I am working with a huge dataset consisting of ~120M examples (~40 GB of raw text) in a single CSV file. I tried to follow the run_glue distributed training example, but it is too slow because it first creates all the examples and caches them here. Basically, only the first process in distributed training processes the dataset, and the other processes just use the cache.
Is there any data loader (or a working example) that would be efficient for training the model on such a huge dataset?
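For reference, one workaround I have seen (not an official Transformers API) is a map-style `Dataset` that indexes the byte offset of every CSV line once, then reads and converts a single example on demand in `__getitem__`, so no full cache of features is ever built. Below is a minimal sketch under that assumption; the file path, the one-example-per-line layout, and the place where tokenization would happen are all placeholders.

```python
# Minimal sketch of lazy loading from a large CSV (assumptions: one example
# per line, path and tokenization step are placeholders, not a real API).
from torch.utils.data import Dataset


class LazyCsvDataset(Dataset):
    """Indexes byte offsets of each line once, then reads lines on demand."""

    def __init__(self, csv_path):
        self.csv_path = csv_path
        self.offsets = []
        with open(csv_path, "rb") as f:
            offset = 0
            for line in f:
                self.offsets.append(offset)
                offset += len(line)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        # Seek to the stored offset and read only the requested line.
        with open(self.csv_path, "rb") as f:
            f.seek(self.offsets[idx])
            line = f.readline().decode("utf-8")
        # Tokenization / feature conversion would happen here lazily,
        # instead of being precomputed and cached for the whole file.
        return line
```

This keeps memory roughly proportional to the number of lines (one integer offset each) rather than to the tokenized dataset, and each distributed worker can open the file independently.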
Additional context