
Bug: Memory issues using map with LazySignal #2045

Closed
woozey opened this issue Aug 30, 2018 · 7 comments · Fixed by #2691
woozey (Contributor) commented Aug 30, 2018

I have tried to use map() to apply a self-made function to a large dataset loaded lazily. For simplicity, in this example I've replaced the custom function with np.mean():

>>> h = hs.load(file_names[:2635], signal_type='hologram', lazy=True, stack=True)
>>> h
<LazyHologramImage, title: holo_time_series_01_Ltz_11_5K, dimensions: (2635|3710, 3838)>
>>> mean = h.map(np.mean,
...              inplace=False,
...              parallel=True,
...              ragged=True)
>>> mean.compute()
>>> mean
<BaseSignal, title: holo_time_series_01_Ltz_11_5K, dimensions: (2635|)>

I was puzzled by the ragged argument, so I tried both ragged=True and ragged=False; it doesn't play any role here... Next I tried mean.plot(), and it already took much longer than expected (though I haven't timed it properly). Next:

>>> mean.save(dir_path+'_mean')
>>> os.path.getsize(dir_path+'_mean.hspy') / 2**30 # in GB
1.9926677970215678

2 GB for a one-dimensional dataset of 2635 'float32' values is certainly too much... The performance also suggests that it really keeps 2 GB for the mean in memory.

Any ideas?

francisco-dlp (Member):

Probably the issue is more with the load function: when loading and stacking, it keeps the metadata of all the files and stacks it too. If the original_metadata of all those files is big (as can be the case for e.g. dm files), that can explain the size of the file.

Regarding the computation time, map applied to np.mean should be exactly equivalent here to h.mean(axis=(1, 2)). The time it takes in lazy mode depends on the chunks. As you are lazy-stacking the signals, it has to read them one by one for processing, which can take longer than performing the same operation on a single HDF5 file optimized for the purpose.
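For reference, a minimal sketch of the equivalent direct reduction, assuming the lazy signal h from the first comment:

>>> # direct reduction over the two signal axes; in lazy mode the run
>>> # time is dominated by how the stacked files are read and chunked
>>> mean_direct = h.mean(axis=(1, 2))
>>> mean_direct.compute()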

francisco-dlp (Member):

It would probably be a good idea to add an option not to stack the original_metadata when loading and stacking. @woozey, could you confirm that this was actually your issue?
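A hypothetical sketch of such an option (the stack_metadata keyword is an assumption for illustration, not an existing parameter):

>>> # hypothetical: stack_metadata=False would skip stacking the
>>> # per-file original_metadata when loading with stack=True
>>> h = hs.load(file_names[:2635], signal_type='hologram', lazy=True,
...             stack=True, stack_metadata=False)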

woozey (Contributor, Author) commented Aug 30, 2018

I've done some more tests, and the behaviour is exactly the same if I use BaseSignal.mean() instead of map(). It seems that BaseSignal.mean() with out=None uses _deepcopy_with_new_data(), just as map() does, so the issue is likely in _deepcopy_with_new_data(). Passing a preallocated output signal to mean() works fine (a sketch follows below), which is consistent with your idea that the memory is used up by original_metadata. I'm now running some tests to confirm it and will give an update soon.

Regarding the computation time, there is no issue at all. It is the large size of the result that causes trouble with all subsequent computations, i.e. transpose, save, or plot.
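A rough sketch of the preallocated-output path mentioned above; the output shape, dtype, and construction are assumptions based on the result shown in the first comment:

>>> import numpy as np
>>> import hyperspy.api as hs
>>> # a bare signal with dimensions (2635|) and empty original_metadata;
>>> # passing it as `out` lets mean() write into it instead of going
>>> # through _deepcopy_with_new_data()
>>> out = hs.signals.BaseSignal(np.zeros(2635, dtype='float32')).T
>>> h.mean(axis=(1, 2), out=out)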

woozey (Contributor, Author) commented Aug 30, 2018

You were right: original_metadata causes the trouble here. For instance, following the example above:

>>> mean.original_metadata = hyperspy.misc.utils.DictionaryTreeBrowser()
>>> mean.save(dir_path+'_mean', overwrite=True)
>>> os.path.getsize(dir_path+'_mean.hspy')/2**10 # in KB
27.4404296875

Just 27 KB, which is totally reasonable.
So adding an option not to stack the original_metadata would be the ideal solution.
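Until such an option exists, the workaround above can be wrapped in a small helper (a hypothetical name, not part of HyperSpy); note that it clears the signal's original_metadata in place:

>>> from hyperspy.misc.utils import DictionaryTreeBrowser
>>> def save_without_original_metadata(signal, path):
...     # replace the (possibly huge) stacked original_metadata with an
...     # empty tree, then save; the signal is modified in place
...     signal.original_metadata = DictionaryTreeBrowser()
...     signal.save(path, overwrite=True)
...
>>> save_without_original_metadata(mean, dir_path + '_mean')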

francisco-dlp (Member):

Thanks for the feedback. We definitely need to add an option not to store the original_metadata.

jlaehne (Contributor) commented Mar 16, 2021

When adding that option, we should include a similar fix for hs.stack.
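A matching hypothetical sketch for hs.stack (again, the stack_metadata keyword and signal_list variable are assumptions for illustration):

>>> # hypothetical: the same option applied to an explicit stack call,
>>> # where signal_list is a list of previously loaded signals
>>> stacked = hs.stack(signal_list, stack_metadata=False)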

francisco-dlp (Member):

Fixed in #2691.
