This is a neat library, but I don't have the extra disk space to convert my highly complex torch Dataset into ffcv format. My bottleneck is disk I/O, so what I would really like is a simple standalone quasi-random sampling dataloader that works with existing torch datasets. The idea would be to load items into a memory pool as fast as possible and draw samples from whatever is currently in the pool. Each sample in the pool keeps a count of how many times it has been sampled, and the most-used data is kicked out when the pool gets too big. This way the dataloader itself reads from memory and bypasses disk overhead, at the cost of some batch diversity.
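Roughly, I'm imagining something like the following (untested sketch; the class name and parameters are made up), wrapping an existing map-style torch Dataset:

```python
import random
from torch.utils.data import Dataset, IterableDataset

class PooledDataset(IterableDataset):
    """Hypothetical sketch: wrap an existing map-style Dataset, keep a
    bounded in-memory pool, draw random items from the pool, and evict
    the most-sampled entry once the pool is full."""

    def __init__(self, dataset: Dataset, pool_size: int = 1024,
                 draws_per_read: int = 4, seed: int = 0):
        self.dataset = dataset
        self.pool_size = pool_size
        self.draws_per_read = draws_per_read   # memory draws per disk read
        self.rng = random.Random(seed)

    def __iter__(self):
        pool = {}                              # index -> [item, times_sampled]
        read_order = list(range(len(self.dataset)))
        self.rng.shuffle(read_order)
        for idx in read_order:
            item = self.dataset[idx]           # the slow disk read
            if len(pool) >= self.pool_size:
                # Kick out the most-used entry to keep RAM bounded.
                victim = max(pool, key=lambda k: pool[k][1])
                del pool[victim]
            pool[idx] = [item, 0]
            # Several memory-only draws per disk read: this is the
            # diversity-for-throughput trade-off described above.
            for _ in range(self.draws_per_read):
                key = self.rng.choice(list(pool))
                pool[key][1] += 1
                yield pool[key][0]

# e.g. loader = torch.utils.data.DataLoader(PooledDataset(my_dataset), batch_size=64)
```

In practice the disk reads would probably live in a background thread so the pool can be sampled faster than the disk delivers, but the sketch shows the shape of the idea.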
I don't think that would be possible, as we use the file format itself to enable good performance with QUASI_RANDOM. Each dataset file is cut into pages/blocks (8MB by default), and we generate a random order that ensures we only ever need BATCH_SIZE pages in RAM at any time. Moreover, since we know ahead of time in which order we will need these pages, we can pre-load them and compensate for slow disk I/O.
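For intuition, here is a toy sketch (not ffcv's actual code) of how a page-aware order can visit every sample exactly once while only keeping a small, fixed number of pages resident, with the page visit order known up front so pages can be prefetched:

```python
import random

def quasi_random_order(num_samples, page_size, pages_in_buffer, seed=0):
    """Toy illustration only: emit every sample index once while at most
    `pages_in_buffer` pages' worth of samples are resident at a time."""
    rng = random.Random(seed)
    pages = [list(range(p, min(p + page_size, num_samples)))
             for p in range(0, num_samples, page_size)]
    rng.shuffle(pages)                      # page order is fixed up front -> prefetchable
    buffer, next_page = [], 0
    while buffer or next_page < len(pages):
        # Top the buffer up to the page budget before drawing samples.
        while next_page < len(pages) and len(buffer) < pages_in_buffer:
            buffer.append(pages[next_page])
            next_page += 1
        page = rng.choice(buffer)           # draw from a random resident page
        yield page.pop(rng.randrange(len(page)))
        if not page:
            buffer.remove(page)             # page exhausted -> free its RAM

order = list(quasi_random_order(num_samples=20, page_size=4, pages_in_buffer=2))
assert sorted(order) == list(range(20))     # every sample visited exactly once
```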
Unfortunately I don't see how we could make this possible without assuming the structure of the dataset file. What you suggest is a very different strategy. I suggest you look into webdataset's implementation of shuffle; it seems closer to what you want and should work without converting your dataset.
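The core idea behind that kind of buffered shuffle is simple. Here is a generic sketch (not webdataset's code) that streams any sequential iterable through a fixed-size shuffle buffer:

```python
import random

def buffered_shuffle(iterable, bufsize=1000, seed=0):
    """Generic buffered shuffle over a sequential stream: keep `bufsize`
    items in memory, emit a random one each time a new item arrives,
    then drain the buffer at the end."""
    rng = random.Random(seed)
    buf = []
    for item in iterable:
        if len(buf) < bufsize:
            buf.append(item)
            continue
        idx = rng.randrange(bufsize)
        yield buf[idx]
        buf[idx] = item                      # replace the emitted slot
    rng.shuffle(buf)
    yield from buf

# Works with any torch Dataset read sequentially, e.g.:
# samples = buffered_shuffle((dataset[i] for i in range(len(dataset))), bufsize=4096)
```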