
Standalone quasi-random sampling. #58

Closed
Erotemic opened this issue Jan 19, 2022 · 1 comment

Comments

@Erotemic

This is a neat library, but I don't have the extra disk space to convert my highly complex torch Dataset into ffcv format. My bottleneck is disk I/O, so what I would really like is a simple standalone quasi-random-sampling dataloader that works with existing torch datasets. The idea: read items into a memory pool as fast as possible, and draw batches from whatever is currently in the pool. Each item in the pool keeps a count of how many times it has been sampled, and the most-used items are evicted when the pool gets too big. The dataloader can then read mostly from memory and bypass disk overhead, at the cost of some batch diversity.
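To make the proposal concrete, here is a minimal sketch of the cache-pool idea described above. This is hypothetical illustration code, not part of FFCV; the class name `PoolSampler` and all parameters are invented, and a real version would run the loading in a background worker.

```python
import random


class PoolSampler:
    """Hypothetical sketch of the proposed cache-pool sampler.

    Items are read from the underlying dataset into an in-memory pool;
    batches are drawn from the pool, each item tracks a use count, and
    the most-used item is evicted once the pool is full.
    """

    def __init__(self, dataset, pool_size=1024, seed=0):
        self.dataset = dataset          # any indexable dataset
        self.pool_size = pool_size
        self.rng = random.Random(seed)
        self.pool = {}                  # index -> [item, use_count]
        self.cursor = 0                 # next dataset index to load

    def _load_one(self):
        """Admit one fresh item, evicting the hottest item if the pool is full."""
        if self.cursor >= len(self.dataset):
            return
        if len(self.pool) >= self.pool_size:
            # Kick out the most-sampled item to preserve batch diversity.
            hottest = max(self.pool, key=lambda i: self.pool[i][1])
            del self.pool[hottest]
        self.pool[self.cursor] = [self.dataset[self.cursor], 0]
        self.cursor += 1

    def sample(self, batch_size):
        """Draw a batch from the pool, bumping use counts."""
        # Admit fresh items first (stands in for the background reader).
        for _ in range(batch_size):
            self._load_one()
        indices = self.rng.sample(list(self.pool),
                                  min(batch_size, len(self.pool)))
        batch = []
        for i in indices:
            self.pool[i][1] += 1
            batch.append(self.pool[i][0])
        return batch
```

Under this scheme disk reads happen at most once per item, while the eviction rule bounds how stale any cached item can get.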

@GuillaumeLeclerc
Collaborator

I don't think that would be possible, as we use the file format itself to enable good performance with QUASI_RANDOM. Each dataset file is cut into pages/blocks (8 MB by default), and we generate a random order that ensures we only ever need BATCH_SIZE pages in RAM at any time. Moreover, because we know ahead of time in which order we will need these pages, we can pre-load them to compensate for slow disk I/O.
Unfortunately I don't see how we could make this work without assuming the structure of the dataset file; what you suggest is a very different strategy. I suggest you look into webdataset's implementation of shuffle. It seems closer to what you want, and it should also work without converting your dataset.
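To illustrate the page-based ordering described above, here is a minimal sketch (not FFCV's actual implementation; the function name, page grouping, and parameters are assumptions for illustration). Samples are grouped into fixed-size pages, pages are visited in a random order, and samples are shuffled only within a small group of resident pages, so the full order is known ahead of time and the next resident pages can be prefetched.

```python
import random


def quasi_random_order(num_samples, page_size, pages_in_ram, seed=0):
    """Sketch of a page-grouped quasi-random ordering.

    Guarantees that reading samples in the returned order touches at
    most `pages_in_ram` distinct pages at any point, and the whole
    order is computed up front so pages can be prefetched from disk.
    """
    rng = random.Random(seed)
    # Cut the sample indices into contiguous pages.
    pages = [list(range(p, min(p + page_size, num_samples)))
             for p in range(0, num_samples, page_size)]
    rng.shuffle(pages)                  # randomize the page visit order
    order = []
    for g in range(0, len(pages), pages_in_ram):
        # Only this small group of pages needs to be resident in RAM.
        resident = [i for page in pages[g:g + pages_in_ram] for i in page]
        rng.shuffle(resident)           # shuffle within the resident set
        order.extend(resident)
    return order
```

The trade-off is the same one the thread discusses: randomness is restricted to the resident group, buying sequential page reads and bounded memory in exchange for less global shuffling.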
