
Training is slow with the lazy mode enabled #234

@Abhijit-2592

Description


Pull request #231 significantly reduced the memory footprint of the training script by using memmaps. However, this adds a lot of overhead during data loading, which is now the bottleneck when training on GPU and causes a significant drop in training speed (around 5x). There is also this issue in NumPy, which highlights that memmap is incredibly slow on Linux. Even when loading data with 8 workers in the PyTorch DataLoader, the problem persists. From initial experiments and profiling, this function seems to be the major bottleneck.
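
For context, here is a minimal sketch of the kind of memmap-backed Dataset that shows this behaviour. This is not the actual training code: the file name, array layout, and loader settings are hypothetical placeholders.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class MemmapDataset(Dataset):
    def __init__(self, npy_path):
        # mmap_mode="r" keeps the array on disk; each __getitem__ then
        # triggers page faults / disk reads, which is where the per-sample
        # overhead shows up.
        self.data = np.load(npy_path, mmap_mode="r")

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Copying out of the memmap forces the actual read from disk.
        sample = np.array(self.data[idx])
        return torch.from_numpy(sample)

loader = DataLoader(
    MemmapDataset("train_features.npy"),  # hypothetical file name
    batch_size=32,
    num_workers=8,   # even with 8 workers the memmap reads dominate
    pin_memory=True,
)
```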

Possible solutions:

  1. Maybe PyTorch DataLoader's prefetching can be improved?
  2. Remove the heavy pre-processing from the training loop by running it once, dumping the results as binary files, and using dask with the PyTorch DataLoader (see the sketch after this list).
  3. Use HDF5 instead of npy files. However, initial experiments show that memmapped npy files are much faster to load than HDF5 files.
  4. Convert the training to TensorFlow 2.0 and use the tf.data pipeline. (This is not the most elegant solution.)
  5. Use NVIDIA DALI, though I guess it is more optimized for images.
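
A minimal sketch of what solution 2 could look like, assuming the expensive preprocessing can be run once up front. The `preprocess` function, file and directory names, and chunk size below are hypothetical placeholders, not a definitive implementation.

```python
import numpy as np
import dask.array as da
import torch
from torch.utils.data import Dataset, DataLoader

def preprocess(block):
    # Placeholder for the expensive per-sample preprocessing (hypothetical).
    return block.astype(np.float32)

def dump_preprocessed(raw_path, out_dir, chunk_rows=4096):
    # Run the heavy preprocessing once and persist the result as a stack
    # of .npy chunks on disk.
    raw = np.load(raw_path, mmap_mode="r")
    processed = da.from_array(raw, chunks=(chunk_rows, *raw.shape[1:]))
    processed = processed.map_blocks(preprocess)
    da.to_npy_stack(out_dir, processed, axis=0)

class DaskChunkDataset(Dataset):
    """Reads the already-preprocessed chunks back lazily via dask."""
    def __init__(self, out_dir):
        self.data = da.from_npy_stack(out_dir, mmap_mode="r")

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, idx):
        # .compute() materializes only the requested row; no heavy
        # preprocessing happens at training time any more.
        return torch.from_numpy(np.asarray(self.data[idx].compute()))

dump_preprocessed("train_features.npy", "preprocessed/")  # one-off step
loader = DataLoader(DaskChunkDataset("preprocessed/"), batch_size=32, num_workers=8)
```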

@svlandeg and I are leaning more towards solution 2.

Suggestions are welcome!
