
Bug: Memory issues using map with LazySignal #2045

Closed
woozey opened this issue Aug 30, 2018 · 7 comments · Fixed by #2691
woozey (Contributor) commented Aug 30, 2018

I have tried to use map() to apply a self-made function to a large dataset loaded lazily. For simplicity, in this example I've replaced the custom function with np.mean():

>>> h = hs.load(file_names[:2635], signal_type='hologram', lazy=True, stack=True)
>>> h
<LazyHologramImage, title: holo_time_series_01_Ltz_11_5K, dimensions: (2635|3710, 3838)>
>>> mean = h.map(np.mean,
...              inplace=False,
...              parallel=True,
...              ragged=True)
>>> mean.compute()
>>> mean
<BaseSignal, title: holo_time_series_01_Ltz_11_5K, dimensions: (2635|)>

I was puzzled by the ragged argument, so I tried both ragged=True and ragged=False; it doesn't play any role here... Next I tried mean.plot(), and it already took much longer than expected (though I haven't timed it properly). Next:

>>> mean.save(dir_path+'_mean')
>>> os.path.getsize(dir_path+'_mean.hspy') / 2**30 # in GB
1.9926677970215678

2 GB for a one-dimensional dataset of 2635 'float32' values is certainly too much... The performance also suggests that it really keeps 2 GB for the mean in memory.

Any ideas?

francisco-dlp (Member):

Probably the issue is more with the load function: when loading and stacking, it keeps the metadata of all the files and stacks it too. If the original_metadata of all those files is big (as can be the case for e.g. dm files), that can explain the size of the file.

Regarding the computation time, map applied to np.mean should be exactly equivalent here to h.mean(axis=(1, 2)). The time it takes in lazy mode depends on the chunks. As you are lazy-stacking the signals, it has to read them one by one for processing, which can take longer than performing the same operation on a single HDF5 file optimized for the purpose.
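For reference, a minimal sketch of the equivalent direct reduction, assuming the lazy signal h from the first comment:

>>> # direct reduction over the two signal axes; in lazy mode the run
>>> # time is dominated by how the stacked files are read and chunked
>>> mean_direct = h.mean(axis=(1, 2))
>>> mean_direct.compute()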

francisco-dlp (Member):

It would probably be a good idea to add an option not to stack the original_metadata when loading and stacking. @woozey, could you confirm that this was actually your issue?
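A hypothetical sketch of such an option (the stack_metadata keyword is an assumption for illustration, not an existing parameter):

>>> # hypothetical: stack_metadata=False would skip stacking the
>>> # per-file original_metadata when loading with stack=True
>>> h = hs.load(file_names[:2635], signal_type='hologram', lazy=True,
...             stack=True, stack_metadata=False)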

woozey (Contributor, Author) commented Aug 30, 2018

I've done some more tests, and the behaviour is exactly the same if I use BaseSignal.mean() instead of map(). It seems that BaseSignal.mean() with out=None uses _deepcopy_with_new_data(), just as map() does, so the issue is likely in _deepcopy_with_new_data(). Passing a preallocated output signal to mean() works fine (a sketch follows below), which is consistent with your idea that the memory is used up by original_metadata. I'm now running some tests to confirm it and will give an update soon.

Regarding the computation time, there is no issue at all. It is the large size of the result that causes trouble with all subsequent computations, i.e. transpose, save, or plot.
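A rough sketch of the preallocated-output path mentioned above; the output shape, dtype, and construction are assumptions based on the result shown in the first comment:

>>> import numpy as np
>>> import hyperspy.api as hs
>>> # a bare signal with dimensions (2635|) and empty original_metadata;
>>> # passing it as `out` lets mean() write into it instead of going
>>> # through _deepcopy_with_new_data()
>>> out = hs.signals.BaseSignal(np.zeros(2635, dtype='float32')).T
>>> h.mean(axis=(1, 2), out=out)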

woozey (Contributor, Author) commented Aug 30, 2018

You were right: original_metadata causes the trouble here. For instance, following the example above:

>>> mean.original_metadata = hyperspy.misc.utils.DictionaryTreeBrowser()
>>> mean.save(dir_path+'_mean', overwrite=True)
>>> os.path.getsize(dir_path+'_mean.hspy')/2**10 # in KB
27.4404296875

Just 27 KB, which is totally reasonable.
So adding an option not to stack the original_metadata would be the ideal solution.
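Until such an option exists, the workaround above can be wrapped in a small helper (a hypothetical name, not part of HyperSpy); note that it clears the signal's original_metadata in place:

>>> from hyperspy.misc.utils import DictionaryTreeBrowser
>>> def save_without_original_metadata(signal, path):
...     # replace the (possibly huge) stacked original_metadata with an
...     # empty tree, then save; the signal is modified in place
...     signal.original_metadata = DictionaryTreeBrowser()
...     signal.save(path, overwrite=True)
...
>>> save_without_original_metadata(mean, dir_path + '_mean')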

francisco-dlp (Member):

Thanks for the feedback. We definitely need to add an option not to store the original_metadata.

jlaehne (Contributor) commented Mar 16, 2021

When adding that option, we should include a similar fix for hs.stack.
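A matching hypothetical sketch for hs.stack (again, the stack_metadata keyword and signal_list variable are assumptions for illustration):

>>> # hypothetical: the same option applied to an explicit stack call,
>>> # where signal_list is a list of previously loaded signals
>>> stacked = hs.stack(signal_list, stack_metadata=False)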

francisco-dlp (Member):

Fixed in #2691.
