
Data loader for PovertyMap is very slow #12

Closed
dmadras opened this issue Jan 12, 2021 · 6 comments

Comments


dmadras commented Jan 12, 2021

Hi -

Ran into a bit of an issue with data loading for the PovertyMap dataset: loading a single minibatch of 128 examples takes about 5-6 seconds. This isn't a huge deal, but it's slow enough to make me curious whether there's a faster way of doing this.

Digging into the code a bit, it looks like the slowdown is mostly due to the array copy on line 239 of poverty_dataset.py:

img = self.imgs[idx].copy()

FWIW it looks like this is a known issue for memory-mapped numpy arrays on Linux systems (https://stackoverflow.com/questions/42864320/numpy-memmap-performance-issues).

Are there any recommendations for getting around this, or another way the data could be loaded in? Or let me know if I'm totally off-base here. Thanks!

sangmichaelxie (Member) commented:

Thanks for the question! The copy is mainly there as protection against writing into the original memmapped data array, since PyTorch doesn't support immutable tensors. However, if you're sure that the algorithm won't do any in-place operations or write into the input tensor (like when `no_nl = True`), then it should be safe to skip the copy. But in general it isn't safe, so for now we'd like to keep the copy.
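For concreteness, here's a rough sketch of why the copy matters; the file name and shapes are just placeholders, not the actual dataset:

```python
import numpy as np
import torch

# Create a small memmapped array on disk as a stand-in for the large imgs file.
imgs = np.lib.format.open_memmap("imgs.npy", mode="w+",
                                 dtype=np.float32, shape=(4, 8, 32, 32))

# torch.from_numpy shares memory with the array it wraps, so an in-place
# op on the tensor writes straight back into the memmapped file.
shared = torch.from_numpy(imgs[0])            # no copy: still backed by imgs.npy
detached = torch.from_numpy(imgs[0].copy())   # copy: safe to modify

detached.mul_(2.0)     # touches only the in-memory copy
# shared.mul_(2.0)     # would silently modify imgs.npy on disk
```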

My other thought is that it should scale well with the number of data loader workers. Have you tried just increasing the number of prefetching worker processes in the data loader?
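Something roughly like this; the TensorDataset here is just a stand-in for the PovertyMap subset, not our actual setup:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset standing in for the PovertyMap subset from WILDS.
train_data = TensorDataset(torch.randn(1024, 8, 32, 32), torch.randn(1024))

train_loader = DataLoader(
    train_data,
    batch_size=128,
    shuffle=True,
    num_workers=4,     # number of prefetching worker processes
    pin_memory=True,
)

for x, y in train_loader:
    pass  # training step would go here
```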

@dmadras
Copy link
Author

dmadras commented Jan 15, 2021

Yeah, totally agree it makes sense to be copying! Trying to increase the number of workers now - it looks like the time per minibatch stays roughly constant as num_workers increases (is this the keyword argument you meant?). The loading times are quite uneven between minibatches when this number increases, so I sometimes have to aggregate times across sets of minibatches:

num_workers = 0 -> 5-6s for 1 minibatch
num_workers = 1 -> 6-7s for 1 minibatch
num_workers = 2 -> 14-15s for 2 minibatches
num_workers = 3 -> 27-30s for 3 minibatches
num_workers = 4 -> 32-40s for 4 minibatches

That said, I'm just testing right now; I think that once I get it up and running with a model, this may become less noticeable (especially in the multi-worker setting), but that's not clear yet. However, I'd be curious whether you have any recommendations for other methods of loading the data (possibly other formats you tried?).
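(For reference, a simple way to measure the per-minibatch loading time is something like the loop below - the loader here is just a placeholder for the one being benchmarked:)

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder loader; swap in the actual PovertyMap loader being benchmarked.
train_loader = DataLoader(TensorDataset(torch.randn(1024, 8, 32, 32)),
                          batch_size=128, num_workers=2)

n_batches = 4
start = time.time()
for i, batch in enumerate(train_loader):
    if i + 1 == n_batches:
        break
elapsed = time.time() - start
print(f"{elapsed:.2f}s for {n_batches} minibatches "
      f"({elapsed / n_batches:.2f}s per minibatch on average)")
```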

sangmichaelxie (Member) commented:

Thanks for doing the test with num_workers. Hm, I'll look into this soon myself and get back to you on faster methods for loading. The numpy file is about 30 GB, so during our tests we were able to fit the entire array in RAM (on machines with enough memory) and set cache_counter to 9999999 - in that case, everything is cached by numpy after the first epoch, I believe. If you can increase your RAM, that might help. This isn't a solution for everyone, of course.
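Independent of the caching machinery in the repo, the general idea looks roughly like this (the file name is illustrative):

```python
import numpy as np

# With mmap_mode, pages are read from disk lazily, so each indexed access
# (plus the .copy()) can hit disk and be slow.
imgs_mmap = np.load("landsat_poverty_imgs.npy", mmap_mode="r")

# Without mmap_mode the whole array is read into RAM up front; after that,
# indexing and copying run at memory speed. Only viable with enough RAM for
# the whole ~30 GB array.
imgs_ram = np.load("landsat_poverty_imgs.npy")
```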


dmadras commented Jan 19, 2021

Okay, thanks - I haven't tried playing with cache_counter yet; I'll take a look at doing that and see how high I can set it.

sangmichaelxie (Member) commented:

Sorry, I misspoke - I meant the variable called cache_size, not cache_counter. If you can get your RAM large enough to fit the entire file, you may also want to make the change from this commit in one of the open PRs: d308c0b
That way, if you set cache_size to a large number, it won't try to refresh the cache anymore.
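As a generic sketch of the idea - not the actual code in the repo or in that commit - it's roughly:

```python
import numpy as np

class CachedArray:
    """Generic sketch: cache copied rows of a (memmapped) array in a dict,
    clearing the cache only when it grows past cache_size. A very large
    cache_size therefore means the cache is never refreshed."""

    def __init__(self, arr, cache_size):
        self.arr = arr
        self.cache_size = cache_size
        self.cache = {}

    def __getitem__(self, idx):
        if idx not in self.cache:
            if len(self.cache) >= self.cache_size:
                self.cache.clear()   # refresh only once the cap is exceeded
            self.cache[idx] = self.arr[idx].copy()
        return self.cache[idx]

imgs = CachedArray(np.zeros((100, 8, 32, 32), dtype=np.float32), cache_size=10 ** 9)
img = imgs[0]   # first access copies from the backing array; later ones hit the cache
```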

ssagawa (Collaborator) commented Mar 10, 2021

Thanks again for flagging this issue and for your patience! We have updated PovertyMap-WILDS in our v1.1 release to address this. We no longer use memmap, and we’ve instead converted the npy files (losslessly) into individual compressed .npz files to improve disk I/O and memory usage. The underlying data is the same, so results should not be affected. Thank you!
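Roughly, the conversion looks like the sketch below; the file names and the array key are illustrative, not necessarily what the v1.1 release uses:

```python
import os
import numpy as np

os.makedirs("images", exist_ok=True)

# Split the one large array into one compressed file per example.
imgs = np.load("landsat_poverty_imgs.npy", mmap_mode="r")
for i in range(imgs.shape[0]):
    np.savez_compressed(os.path.join("images", f"landsat_poverty_img_{i}.npz"),
                        x=imgs[i])

# Loading a single example now reads one small compressed file instead of
# paging through a single ~30 GB memmapped array.
img = np.load("images/landsat_poverty_img_42.npz")["x"]
```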

ssagawa closed this as completed Mar 10, 2021
teetone added a commit that referenced this issue Dec 10, 2021
Support coarse-grained domains