Data loader for PovertyMap is very slow #12
Comments
Thanks for the question! The copy is mainly there to protect against writing into the original memmapped data array, since PyTorch doesn't support immutable tensors. However, if you're sure that the algorithm won't do any in-place operations or write into the input tensor (like when `no_nl = True`), then it should be safe to skip the copy. In general it isn't safe, though, so for now we'd like to keep the copy. My other thought is that loading should scale well with the number of data loader workers. Have you tried just increasing the number of prefetching workers in the data loader?
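To illustrate why the copy matters, here is a minimal sketch (the array is a small stand-in for the memmap; the variable names are made up): `torch.from_numpy` shares memory with the source array, so an in-place op on the tensor writes straight back into it, while a copied slice is independent.

```python
import numpy as np
import torch

# Stand-in for the memmapped data array (rows 0..2 hold 0..11).
arr = np.arange(12, dtype=np.float32).reshape(3, 4)

# Safe path: copy first, so the tensor owns its own buffer.
safe = torch.from_numpy(np.array(arr[0], copy=True))
safe.add_(1.0)                 # in-place op is harmless now
assert arr[0, 0] == 0.0        # original row untouched

# Unsafe path: from_numpy shares memory with the numpy array.
shared = torch.from_numpy(arr[1])
shared.add_(1.0)
assert arr[1, 0] == 5.0        # the in-place write leaked into arr
```

With a real memmap opened read-only the unsafe path would raise instead of silently corrupting data, which is another reason the dataset copies defensively.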
Yeah, totally agree it makes sense to be copying! Trying to increase the number of workers now. It looks like the time per minibatch stays roughly constant as num_workers increases (is this the keyword argument you meant?). The loading times are quite uneven between minibatches when this number increases, so I sometimes have to aggregate times across sets of minibatches: num_workers = 0 -> 5-6s for 1 minibatch. That said, I'm just testing now; I think that once I get it up and running with a model this may become less noticeable (especially in the multi-threaded setting), but that's not clear yet. However, I'd be curious if you have any recommendations for other ways of loading the data (possibly other formats you tried?)
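For anyone wanting to reproduce this kind of measurement, here is a toy benchmark sketch (the dataset and delay are invented stand-ins, not PovertyMap itself): it times one pass over a `DataLoader` for a given `num_workers`, which is the keyword argument in question.

```python
import time
import torch
from torch.utils.data import DataLoader, Dataset

class SlowDataset(Dataset):
    """Toy stand-in: each item takes a fixed time to 'load'."""
    def __init__(self, n=64, delay=0.01):
        self.n, self.delay = n, delay
    def __len__(self):
        return self.n
    def __getitem__(self, i):
        time.sleep(self.delay)   # simulates the per-example copy cost
        return torch.zeros(8)

def time_epoch(num_workers):
    loader = DataLoader(SlowDataset(), batch_size=16, num_workers=num_workers)
    start = time.perf_counter()
    for _ in loader:
        pass
    return time.perf_counter() - start

# Worker processes load batches in parallel, so for CPU- or I/O-bound
# __getitem__ calls the epoch time should drop as num_workers grows
# (modulo worker startup overhead on small datasets):
# time_epoch(0) vs. time_epoch(4)
```

If the per-minibatch time does not improve with more workers, the bottleneck is likely shared (e.g. disk I/O on one file) rather than per-example CPU work.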
Thanks for doing the test with num_workers. Hm, I'll look into this soon myself and get back to you on faster loading methods. The numpy file is about 30 GB, so during our tests we were able to fit the whole array in RAM (on machines with enough memory) and set cache_counter to 9999999; in that case, after the first epoch everything is cached by numpy, I believe. If you can increase your RAM, that might help. This isn't a solution for everyone, of course.
Okay, thanks. I haven't tried playing with cache_counter; I'll take a look at doing that and see how high I can set it.
Sorry, I misspoke: I meant the variable called cache_size, not cache_counter. If you can make your RAM large enough to fit the entire file, you may want to apply the change from this commit in one of the open PRs: d308c0b
Thanks again for flagging this issue and for your patience! We have updated PovertyMap-WILDS in our v1.1 release to address this. We no longer use memmap; we've instead converted the .npy files (losslessly) into individual compressed .npz files to improve disk I/O and memory usage. The underlying data is the same, so results should not be affected. Thank you!
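The conversion described above can be sketched roughly like this (array shape and file names are illustrative, not the actual WILDS layout): each example from the big array is written to its own compressed `.npz`, so loading one example touches only that example's file.

```python
import os
import tempfile
import numpy as np

with tempfile.TemporaryDirectory() as d:
    # Stand-in for the large monolithic array (4 small "images" here).
    images = np.random.rand(4, 8, 32, 32).astype(np.float32)

    # Split into per-example compressed .npz files.
    for i, img in enumerate(images):
        np.savez_compressed(os.path.join(d, f"img_{i}.npz"), x=img)

    # Loading a single example reads only that example's file.
    loaded = np.load(os.path.join(d, "img_0.npz"))["x"]
    assert np.array_equal(loaded, images[0])   # lossless round-trip
```

Compression trades a little CPU on load for much smaller files and no memmap page-cache churn, and the round-trip is exact for integer and float arrays.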
Hi -
Ran into a bit of an issue with data loading for the PovertyMap dataset: loading a single minibatch of 128 examples takes about 5-6 seconds. This is not a huge deal, but it's slow enough to make me curious whether there's a faster way of doing this.
Digging into the code a bit, it looks like the slowdown is mostly due to the array copy on line 239 of poverty_dataset.py (wilds/wilds/datasets/poverty_dataset.py, line 239 at commit f984047).
FWIW it looks like this is a known issue for memory-mapped numpy arrays on Linux systems (https://stackoverflow.com/questions/42864320/numpy-memmap-performance-issues).
I'm not sure if there are any recommendations for getting around this, or if there's another way the data could be loaded in? Or let me know if I'm totally off-base here. Thanks!
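For reference, the access pattern behind the slowdown looks roughly like this (file name and array shape are invented for the demo): slicing a memmapped `.npy` file is lazy, so the `np.array(..., copy=True)` step is where the disk reads actually happen, and on a cold page cache that cost is paid on every minibatch.

```python
import os
import tempfile
import numpy as np

with tempfile.TemporaryDirectory() as d:
    # Build a small stand-in for the 30 GB .npy file.
    path = os.path.join(d, "demo.npy")
    np.save(path, np.zeros((100, 8, 32, 32), dtype=np.float32))

    mm = np.load(path, mmap_mode="r")     # maps the file; no data read yet
    batch = np.array(mm[:16], copy=True)  # pages data in from disk + copies

    assert not mm.flags["WRITEABLE"]      # the memmap is read-only
    assert batch.flags["WRITEABLE"]       # the copy is safe to modify
```

This also shows why the copy doubles as a safety measure: the read-only memmap would reject writes, while the copied batch can be mutated freely downstream.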