
Efficient data loading functionality #1031

@shubhamagarwal92

Description

🚀 Feature

Efficient data loader for huge datasets with lazy loading!

Motivation

I am working with a huge dataset of 120M examples (~40 GB of raw text) in a single CSV file. I tried to follow the run_glue distributed training example, but it is too slow because it first creates all the examples and caches them here. Basically, only the first process in distributed training processes the dataset, and the other processes just use the cache.

Is there any data loader (or a working example) that would be efficient for training the model on such a huge dataset?
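For reference, here is a minimal sketch of one possible lazy-loading approach: a PyTorch `IterableDataset` that streams rows from the CSV and tokenizes them on the fly, instead of building and caching all features up front. The file name `train.csv`, the two-column `(text, label)` layout, the `bert-base-uncased` checkpoint, and the `rank`/`world_size` handling are assumptions for illustration, not part of the library.

```python
# Minimal sketch (not a built-in library API) of a lazy-loading CSV dataset.
import csv

import torch
from torch.utils.data import IterableDataset, DataLoader, get_worker_info
from transformers import AutoTokenizer


class LazyCsvDataset(IterableDataset):
    """Streams rows from a large CSV file and tokenizes them on the fly."""

    def __init__(self, csv_path, tokenizer, max_length=128, rank=0, world_size=1):
        self.csv_path = csv_path
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.rank = rank              # assumed to be set by the distributed launcher
        self.world_size = world_size

    def __iter__(self):
        # Shard rows across distributed processes and DataLoader workers so
        # each example is read exactly once per epoch.
        worker = get_worker_info()
        num_workers = worker.num_workers if worker else 1
        worker_id = worker.id if worker else 0
        shard = self.rank * num_workers + worker_id
        num_shards = self.world_size * num_workers

        with open(self.csv_path, newline="") as f:
            reader = csv.reader(f)
            for i, row in enumerate(reader):
                if i % num_shards != shard:
                    continue
                text, label = row[0], int(row[1])  # assumed column layout
                enc = self.tokenizer(
                    text,
                    truncation=True,
                    max_length=self.max_length,
                    padding="max_length",
                    return_tensors="pt",
                )
                yield {
                    "input_ids": enc["input_ids"].squeeze(0),
                    "attention_mask": enc["attention_mask"].squeeze(0),
                    "labels": torch.tensor(label),
                }


tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
dataset = LazyCsvDataset("train.csv", tokenizer)
loader = DataLoader(dataset, batch_size=32, num_workers=4)
```

With this kind of streaming dataset, each process and worker only reads its own slice of the file, so no single process has to materialize or cache features for all 120M examples.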

Additional context
