
Efficient data loading functionality #1031

@shubhamagarwal92

Description

🚀 Feature

Efficient data loader for huge datasets with lazy loading!

Motivation

I am working with a huge dataset of 120M examples (~40 GB of raw text) in a single CSV file. I tried to follow the run_glue distributed training example, but it is too slow because it first creates all the examples and caches them here. Basically, only the first process in distributed training processes the dataset, and the other processes just use the cache.

Is there any data loader (or a working example) that would be efficient for training the model on such a huge dataset?
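For reference, here is a minimal sketch of one possible lazy-loading approach: a PyTorch `IterableDataset` that streams rows from the CSV and tokenizes them on the fly, instead of building and caching all features up front. The file name `train.csv`, the two-column `(text, label)` layout, the `bert-base-uncased` checkpoint, and the `rank`/`world_size` handling are assumptions for illustration, not part of the library.

```python
# Minimal sketch (not a built-in library API) of a lazy-loading CSV dataset.
import csv

import torch
from torch.utils.data import IterableDataset, DataLoader, get_worker_info
from transformers import AutoTokenizer


class LazyCsvDataset(IterableDataset):
    """Streams rows from a large CSV file and tokenizes them on the fly."""

    def __init__(self, csv_path, tokenizer, max_length=128, rank=0, world_size=1):
        self.csv_path = csv_path
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.rank = rank              # assumed to be set by the distributed launcher
        self.world_size = world_size

    def __iter__(self):
        # Shard rows across distributed processes and DataLoader workers so
        # each example is read exactly once per epoch.
        worker = get_worker_info()
        num_workers = worker.num_workers if worker else 1
        worker_id = worker.id if worker else 0
        shard = self.rank * num_workers + worker_id
        num_shards = self.world_size * num_workers

        with open(self.csv_path, newline="") as f:
            reader = csv.reader(f)
            for i, row in enumerate(reader):
                if i % num_shards != shard:
                    continue
                text, label = row[0], int(row[1])  # assumed column layout
                enc = self.tokenizer(
                    text,
                    truncation=True,
                    max_length=self.max_length,
                    padding="max_length",
                    return_tensors="pt",
                )
                yield {
                    "input_ids": enc["input_ids"].squeeze(0),
                    "attention_mask": enc["attention_mask"].squeeze(0),
                    "labels": torch.tensor(label),
                }


tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
dataset = LazyCsvDataset("train.csv", tokenizer)
loader = DataLoader(dataset, batch_size=32, num_workers=4)
```

With this kind of streaming dataset, each process and worker only reads its own slice of the file, so no single process has to materialize or cache features for all 120M examples.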

Additional context
