This is a neat library, but I don't have the extra disk space to convert my highly complex torch Dataset into ffcv format. My bottleneck is disk I/O, so what I would really like is a simple standalone quasi-random sampling dataloader that works with existing torch datasets. The idea would be to load items into a memory pool as fast as possible and draw samples from whatever is currently in the pool. Each sample in the pool keeps a count of how many times it has been sampled, and the most-used data is kicked out when the pool gets too big. This way the dataloader itself reads from memory and bypasses disk overhead, at the cost of some batch diversity.
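Roughly, I'm imagining something like the following (untested sketch; the class name and parameters are made up), wrapping an existing map-style torch Dataset:

```python
import random
from torch.utils.data import Dataset, IterableDataset

class PooledDataset(IterableDataset):
    """Hypothetical sketch: wrap an existing map-style Dataset, keep a
    bounded in-memory pool, draw random items from the pool, and evict
    the most-sampled entry once the pool is full."""

    def __init__(self, dataset: Dataset, pool_size: int = 1024,
                 draws_per_read: int = 4, seed: int = 0):
        self.dataset = dataset
        self.pool_size = pool_size
        self.draws_per_read = draws_per_read   # memory draws per disk read
        self.rng = random.Random(seed)

    def __iter__(self):
        pool = {}                              # index -> [item, times_sampled]
        read_order = list(range(len(self.dataset)))
        self.rng.shuffle(read_order)
        for idx in read_order:
            item = self.dataset[idx]           # the slow disk read
            if len(pool) >= self.pool_size:
                # Kick out the most-used entry to keep RAM bounded.
                victim = max(pool, key=lambda k: pool[k][1])
                del pool[victim]
            pool[idx] = [item, 0]
            # Several memory-only draws per disk read: this is the
            # diversity-for-throughput trade-off described above.
            for _ in range(self.draws_per_read):
                key = self.rng.choice(list(pool))
                pool[key][1] += 1
                yield pool[key][0]

# e.g. loader = torch.utils.data.DataLoader(PooledDataset(my_dataset), batch_size=64)
```

In practice the disk reads would probably live in a background thread so the pool can be sampled faster than the disk delivers, but the sketch shows the shape of the idea.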
I don't think that would be possible, as we use the file format itself to enable good performance with QUASI_RANDOM. Each dataset file is cut into pages/blocks (8MB by default), and we generate a random order that ensures we only ever need BATCH_SIZE pages in RAM at any time. Moreover, since we know ahead of time in which order we will need these pages, we can pre-load them and compensate for slow disk I/O.
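For intuition, here is a toy sketch (not ffcv's actual code) of how a page-aware order can visit every sample exactly once while only keeping a small, fixed number of pages resident, with the page visit order known up front so pages can be prefetched:

```python
import random

def quasi_random_order(num_samples, page_size, pages_in_buffer, seed=0):
    """Toy illustration only: emit every sample index once while at most
    `pages_in_buffer` pages' worth of samples are resident at a time."""
    rng = random.Random(seed)
    pages = [list(range(p, min(p + page_size, num_samples)))
             for p in range(0, num_samples, page_size)]
    rng.shuffle(pages)                      # page order is fixed up front -> prefetchable
    buffer, next_page = [], 0
    while buffer or next_page < len(pages):
        # Top the buffer up to the page budget before drawing samples.
        while next_page < len(pages) and len(buffer) < pages_in_buffer:
            buffer.append(pages[next_page])
            next_page += 1
        page = rng.choice(buffer)           # draw from a random resident page
        yield page.pop(rng.randrange(len(page)))
        if not page:
            buffer.remove(page)             # page exhausted -> free its RAM

order = list(quasi_random_order(num_samples=20, page_size=4, pages_in_buffer=2))
assert sorted(order) == list(range(20))     # every sample visited exactly once
```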
Unfortunately I don't see how we could make this possible without assuming the structure of the dataset file. What you suggest is a very different strategy. I suggest you look into webdataset's implementation of shuffle; it seems closer to what you want and should work without converting your dataset.
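The core idea behind that kind of buffered shuffle is simple. Here is a generic sketch (not webdataset's code) that streams any sequential iterable through a fixed-size shuffle buffer:

```python
import random

def buffered_shuffle(iterable, bufsize=1000, seed=0):
    """Generic buffered shuffle over a sequential stream: keep `bufsize`
    items in memory, emit a random one each time a new item arrives,
    then drain the buffer at the end."""
    rng = random.Random(seed)
    buf = []
    for item in iterable:
        if len(buf) < bufsize:
            buf.append(item)
            continue
        idx = rng.randrange(bufsize)
        yield buf[idx]
        buf[idx] = item                      # replace the emitted slot
    rng.shuffle(buf)
    yield from buf

# Works with any torch Dataset read sequentially, e.g.:
# samples = buffered_shuffle((dataset[i] for i in range(len(dataset))), bufsize=4096)
```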