Pull request #231 significantly reduced the memory footprint of the training script by using memmaps. However, this adds a lot of overhead during data loading: data loading is now the bottleneck when training on GPU and causes a significant drop in training speed (around 5x). There is also this issue in numpy which highlights that memmap is incredibly slow on Linux. Even when using 8 workers to load data with the PyTorch DataLoader, the problem persists. From initial experiments and profiling, this function seems to be the major bottleneck.
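For context, here is a minimal, hypothetical sketch of the access pattern that causes the slowdown (the file name `train_features.npy`, shapes, and batch size are placeholders, not the actual training code): every `__getitem__` touches the memmapped `.npy` file, so each batch spends most of its time waiting on disk reads, and a small timing loop makes that wait visible.

```python
# Timing sketch (hypothetical paths/shapes) to see how long each training step
# waits on the DataLoader compared to the GPU forward/backward pass.
import time

import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class MemmapDataset(Dataset):
    """Reads individual samples out of a memmapped .npy file."""

    def __init__(self, path):
        # mmap_mode="r" keeps the array on disk; every __getitem__ hits the file.
        # On Linux, DataLoader workers are forked and inherit this memmap handle.
        self.data = np.load(path, mmap_mode="r")

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # np.array() forces the actual read from the memmap for this sample.
        return torch.from_numpy(np.array(self.data[idx]))


loader = DataLoader(MemmapDataset("train_features.npy"),
                    batch_size=64, num_workers=8)

t0 = time.perf_counter()
for i, batch in enumerate(loader):
    t_load = time.perf_counter() - t0
    # ... forward/backward pass would go here ...
    print(f"batch {i}: waited {t_load:.4f}s for data")
    t0 = time.perf_counter()
```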
Possible solutions:
- Maybe PyTorch DataLoader's prefetching can be improved?
- Remove the heavy pre-processing from training by doing it once up front, dumping the results as binary files, and using dask with the PyTorch DataLoader.
- Use HDF5 instead of npy files. (Initial experiments show memmapped npy files are much faster to load than HDF5 files, though.)
- Convert the training to TensorFlow 2.0 and use the tf.data pipeline. (This is not the most elegant solution.)
- Use DALI. I guess it's more optimized for images, though.
@svlandeg and I are leaning more towards solution 2.
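To make solution 2 a bit more concrete, here is a rough sketch under the assumption that the expensive per-sample work can be run once offline; `preprocess()`, the shard layout, and the paths below are placeholders, not the actual training code, and dask could replace the simple shard reader if lazy, chunked access is needed.

```python
# Rough sketch of solution 2 (hypothetical preprocess() and file layout):
# run the heavy preprocessing once, dump ready-to-use arrays as .npy shards,
# and let the training-time Dataset read those arrays directly.
from pathlib import Path

import numpy as np
import torch
from torch.utils.data import ConcatDataset, DataLoader, Dataset


def preprocess(sample):
    # Placeholder for the expensive per-sample processing done in the current
    # training script; replace with the real feature extraction.
    return np.asarray(sample, dtype=np.float32)


def dump_shards(raw_samples, out_dir, shard_size=10_000):
    """One-off preprocessing pass: process samples and write them in shards."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    shard, shard_idx = [], 0
    for sample in raw_samples:
        shard.append(preprocess(sample))
        if len(shard) == shard_size:
            np.save(out_dir / f"shard_{shard_idx:05d}.npy", np.stack(shard))
            shard, shard_idx = [], shard_idx + 1
    if shard:
        np.save(out_dir / f"shard_{shard_idx:05d}.npy", np.stack(shard))


class ShardDataset(Dataset):
    """Training-time dataset: loads one preprocessed shard fully into memory."""

    def __init__(self, shard_path):
        self.data = np.load(shard_path)  # no per-item processing left to do

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return torch.from_numpy(self.data[idx])


# At training time, concatenate the shards and iterate as usual.
shards = sorted(Path("preprocessed/").glob("shard_*.npy"))
loader = DataLoader(ConcatDataset([ShardDataset(p) for p in shards]),
                    batch_size=64, shuffle=True, num_workers=4)
```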
Suggestions are welcome!