Data loader for PovertyMap is very slow #12
Comments
Thanks for the question! The copy is mainly there to protect against writing into the original memmapped data array, since PyTorch doesn't support immutable tensors. However, if you're sure that the algorithm won't do any in-place operations or write into the input tensor (like when `no_nl = True`), then it should be safe to skip the copy. In general it isn't safe, though, so for now we'd like to keep the copy. My other thought is that loading should scale well with the number of data loader workers. Have you tried just increasing the number of prefetching workers in the data loader?
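To illustrate why the copy matters, here is a minimal sketch (the array is a small stand-in for the memmap; the variable names are made up): `torch.from_numpy` shares memory with the source array, so an in-place op on the tensor writes straight back into it, while a copied slice is independent.

```python
import numpy as np
import torch

# Stand-in for the memmapped data array (rows 0..2 hold 0..11).
arr = np.arange(12, dtype=np.float32).reshape(3, 4)

# Safe path: copy first, so the tensor owns its own buffer.
safe = torch.from_numpy(np.array(arr[0], copy=True))
safe.add_(1.0)                 # in-place op is harmless now
assert arr[0, 0] == 0.0        # original row untouched

# Unsafe path: from_numpy shares memory with the numpy array.
shared = torch.from_numpy(arr[1])
shared.add_(1.0)
assert arr[1, 0] == 5.0        # the in-place write leaked into arr
```

With a real memmap opened read-only the unsafe path would raise instead of silently corrupting data, which is another reason the dataset copies defensively.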
Yeah, totally agree it makes sense to be copying! Trying to increase the number of workers now. It looks like the time per minibatch stays roughly constant as num_workers increases (is this the keyword argument you meant?). The loading times are quite uneven between minibatches when this number increases, so I sometimes have to aggregate times across sets of minibatches: num_workers = 0 -> 5-6s for 1 minibatch. That said, I'm just testing now; I think that once I get it up and running with a model this may become less noticeable (especially in the multi-threaded setting), but that's not clear yet. However, I'd be curious if you have any recommendations for other ways of loading the data (possibly other formats you tried?)
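For anyone wanting to reproduce this kind of measurement, here is a toy benchmark sketch (the dataset and delay are invented stand-ins, not PovertyMap itself): it times one pass over a `DataLoader` for a given `num_workers`, which is the keyword argument in question.

```python
import time
import torch
from torch.utils.data import DataLoader, Dataset

class SlowDataset(Dataset):
    """Toy stand-in: each item takes a fixed time to 'load'."""
    def __init__(self, n=64, delay=0.01):
        self.n, self.delay = n, delay
    def __len__(self):
        return self.n
    def __getitem__(self, i):
        time.sleep(self.delay)   # simulates the per-example copy cost
        return torch.zeros(8)

def time_epoch(num_workers):
    loader = DataLoader(SlowDataset(), batch_size=16, num_workers=num_workers)
    start = time.perf_counter()
    for _ in loader:
        pass
    return time.perf_counter() - start

# Worker processes load batches in parallel, so for CPU- or I/O-bound
# __getitem__ calls the epoch time should drop as num_workers grows
# (modulo worker startup overhead on small datasets):
# time_epoch(0) vs. time_epoch(4)
```

If the per-minibatch time does not improve with more workers, the bottleneck is likely shared (e.g. disk I/O on one file) rather than per-example CPU work.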
Thanks for doing the test with num_workers. Hm, I'll look into this soon myself and get back to you on faster loading methods. The numpy file is about 30 GB, so during our tests we were able to fit the whole array in RAM (on machines with enough memory) and set cache_counter to 9999999; in that case, after the first epoch everything is cached by numpy, I believe. If you can increase your RAM, that might help. This isn't a solution for everyone, of course.
Okay, thanks. I haven't tried playing with cache_counter; I'll take a look at doing that and see how high I can set it.
Sorry, I misspoke: I meant the variable called cache_size, not cache_counter. If you can make your RAM large enough to fit the entire file, you may want to apply the change from this commit in one of the open PRs: d308c0b
Thanks again for flagging this issue and for your patience! We have updated PovertyMap-WILDS in our v1.1 release to address this. We no longer use memmap; we've instead converted the .npy files (losslessly) into individual compressed .npz files to improve disk I/O and memory usage. The underlying data is the same, so results should not be affected. Thank you!
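The conversion described above can be sketched roughly like this (array shape and file names are illustrative, not the actual WILDS layout): each example from the big array is written to its own compressed `.npz`, so loading one example touches only that example's file.

```python
import os
import tempfile
import numpy as np

with tempfile.TemporaryDirectory() as d:
    # Stand-in for the large monolithic array (4 small "images" here).
    images = np.random.rand(4, 8, 32, 32).astype(np.float32)

    # Split into per-example compressed .npz files.
    for i, img in enumerate(images):
        np.savez_compressed(os.path.join(d, f"img_{i}.npz"), x=img)

    # Loading a single example reads only that example's file.
    loaded = np.load(os.path.join(d, "img_0.npz"))["x"]
    assert np.array_equal(loaded, images[0])   # lossless round-trip
```

Compression trades a little CPU on load for much smaller files and no memmap page-cache churn, and the round-trip is exact for integer and float arrays.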
Hi -
Ran into a bit of an issue with data loading for the PovertyMap dataset: loading a single minibatch of 128 examples takes about 5-6 seconds. This is not a huge deal, but it's slow enough to make me curious whether there's a faster way of doing this.
Digging into the code a bit, it looks like the slowdown is mostly due to the array copy on line 239 of poverty_dataset.py (wilds/wilds/datasets/poverty_dataset.py, line 239 at commit f984047).
FWIW it looks like this is a known issue for memory-mapped numpy arrays on Linux systems (https://stackoverflow.com/questions/42864320/numpy-memmap-performance-issues).
I'm not sure if there are any recommendations for getting around this, or if there's another way the data could be loaded in? Or let me know if I'm totally off-base here. Thanks!
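For reference, the access pattern behind the slowdown looks roughly like this (file name and array shape are invented for the demo): slicing a memmapped `.npy` file is lazy, so the `np.array(..., copy=True)` step is where the disk reads actually happen, and on a cold page cache that cost is paid on every minibatch.

```python
import os
import tempfile
import numpy as np

with tempfile.TemporaryDirectory() as d:
    # Build a small stand-in for the 30 GB .npy file.
    path = os.path.join(d, "demo.npy")
    np.save(path, np.zeros((100, 8, 32, 32), dtype=np.float32))

    mm = np.load(path, mmap_mode="r")     # maps the file; no data read yet
    batch = np.array(mm[:16], copy=True)  # pages data in from disk + copies

    assert not mm.flags["WRITEABLE"]      # the memmap is read-only
    assert batch.flags["WRITEABLE"]       # the copy is safe to modify
```

This also shows why the copy doubles as a safety measure: the read-only memmap would reject writes, while the copied batch can be mutated freely downstream.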