BUG: O(n) growth of original_metadata with stacking #1398
Hello, to be clear about what I do: I have this function, with which I go through all bcf files, read them, and calculate the max pixel spectra:

```python
from copy import copy
from glob import glob
import gc

import hyperspy.api as hs
from tqdm import tqdm_notebook


def collective_max_pixel_spectrum(bcf_file_list, **kwargs):
    """Calculate and return the max_pixel_spectrum for a whole list of bcf files."""
    eds_list = []
    for i in tqdm_notebook(bcf_file_list):
        # print(' '.join([str(bcfs.index(i)), 'from', str(len(bcf_file_list))]))
        EDS = hs.load(i, select_type='spectrum', **kwargs)
        eds_list.append(copy(EDS.max()))
        del EDS
        gc.collect()
    stacky = hs.stack(eds_list)
    max_pixel_spectra = stacky.max(axis=0)
    return max_pixel_spectra
```

When I apply it to the real file list:

```python
bcfs = glob('*.bcf')
max_pixel_spectrum = collective_max_pixel_spectrum(bcfs, downsample=4,
                                                   cutoff_at_kV=13)
```
```
Out[]: 100% 336/336 [07:44<00:00, 1.45s/it]
```

This `max_pixel_spectrum` is only one EDX spectrum:

```python
In []: max_pixel_spectrum.data.shape
Out[]: (1347,)

In []: max_pixel_spectrum
Out[]: <EDSSEMSpectrum, title: Stack of EDX, dimensions: (|1347)>
```

It looks to me like a completely ordinary EDS spectrum. Then:

```python
In []: lines = ['C_Ka', 'O_Ka', 'F_Ka', 'Na_Ka', 'Mg_Ka',
                'Al_Ka', 'Si_Ka', 'P_Ka', 'Zr_La', 'Nb_La',
                'K_Ka', 'Ca_Ka', 'Ti_Ka', 'Cr_Ka', 'Fe_Ka',
                'Ni_Ka', 'S_Ka', 'Cl_Ka', 'La_La', 'Ce_La',
                'Nd_La', 'Cu_Ka', 'Zn_Ka', 'Au_La', 'Th_Ma']

In []: len(lines)
Out[]: 25
```

And on my laptop's AMD Turion II (it is an old CPU, but I get only about twice the speed on an Intel Xeon) it takes:
Oh, and by the way, if I plot without lines:

```python
%%time
max_pixel_spectrum.plot()
```

the output is:
Currently it plots the lines one by one. Maybe we could make it run faster using collections?
I've labeled it as a bug because the speed issue makes hyperspy unfit for the purpose, and there is room for improvement.
It does not look to me like the problem is the one-by-one plotting (I do such plotting a lot and never run into such an issue outside hyperspy).
Probably the best way to identify what's making it sluggish is to use a code profiler, e.g. line_profiler.
I don't understand how to inject line_profiler to catch things, but cProfile works; however, it takes about 10 s to profile the plot with all lines.
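For what it's worth, a minimal self-contained cProfile sketch. The profiled function here is a stand-in doing throwaway work, since profiling the real `max_pixel_spectrum.plot()` call needs the loaded data:

```python
import cProfile
import io
import pstats


def plot_stand_in():
    """Stand-in for max_pixel_spectrum.plot(): does some throwaway work."""
    return sum(i * i for i in range(100_000))


profiler = cProfile.Profile()
profiler.enable()
plot_stand_in()
profiler.disable()

# Report the functions sorted by cumulative time; the hot spot shows up first.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats('cumulative').print_stats(5)
report = stream.getvalue()
print('plot_stand_in' in report)   # → True
```

With the real object you would call `profiler.enable()`, run `max_pixel_spectrum.plot()`, then `profiler.disable()` and inspect the same report.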
OK, I think I am starting to isolate the problem.

If I set `original_metadata` to None, then it plots instantaneously but does not plot the lines (it errors out); it looks like it does deepcopying.

Good catch!
So I found a workaround for my workflow; however, it is still a bug, as it will be felt on any stack of images from formats providing huge original-metadata trees:
PS: it is enough to implement 1.; then 2 and 3 are no longer a problem. However, I guess somebody implemented this behaviour and somebody's workflow depends on it, so I am not sure I should just remove it.
Deepcopying `original_metadata` when slicing does not make much sense: it is the original metadata of the original file, but, as you mention, why should it be copied to signals which are derived from it? I am tempted to think that this is actually a bug and that it may be causing sluggishness elsewhere, not only when stacking. Regarding 1, I agree that most of the time it is unnecessary to stack the original metadata; we could add an option for this. 3 is a good idea anyway because deepcopying a signal is an expensive operation that we should avoid as much as possible internally.
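To make the cost concrete, a toy comparison with plain dicts standing in for `DictionaryTreeBrowser` (the sizes here are made up; only the shape of the comparison matters): deepcopying a large metadata tree for every derived signal is far more work than sharing a reference.

```python
import copy
import time

# Stand-in for a concatenated original_metadata tree (e.g. 336 stacked files).
metadata = {f'file_{i}': {'detector': {'params': list(range(200))}}
            for i in range(336)}

t0 = time.perf_counter()
for _ in range(50):          # 50 derived signals, each deepcopying the tree
    deep = copy.deepcopy(metadata)
deep_time = time.perf_counter() - t0

t0 = time.perf_counter()
for _ in range(50):          # 50 derived signals, each sharing a reference
    ref = metadata
ref_time = time.perf_counter() - t0

print(deep == metadata, deep is not metadata)   # → True True
print(deep_time > ref_time)                     # → True
```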
Would it make sense for the original metadata to be read-only? If so, it would never have to be deepcopied, and only references could be passed around. Then 1. and 2. go away (I would probably still fix 3.). Given its name (original), it makes sense to keep it read-only. Are there any cases where this is/should not be true?
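In plain Python, the read-only idea can be sketched with `types.MappingProxyType` standing in for a hypothetical read-only `DictionaryTreeBrowser` (note the proxy is only shallowly read-only; nested dicts would also need wrapping):

```python
from types import MappingProxyType

original = {'acquisition': {'voltage_kV': 13, 'detector': 'EDX'}}
read_only = MappingProxyType(original)

# Every derived signal can share the same proxy: nothing to deepcopy.
derived_a = read_only
derived_b = read_only
print(derived_a is derived_b)   # → True

# Writes through the proxy are rejected.
try:
    read_only['acquisition'] = {}
except TypeError:
    print('read-only')          # → read-only
```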
That's an excellent point. I am not sure though about not implementing 2. Any thoughts on pros and cons?
Yeah, I don't really see any point in changing the content of the original metadata, since it is supposed to be whatever it loads from the original file. Any metadata which one wants to use/modify should (in my opinion) be in
I am renaming the issue to get closer to the point.
I have done a small test with Zeiss tifs (which have very rich original metadata), timing `isig` slicing with and without the concatenated original metadata:

```python
slices = hs.load('ROI2/FIBSLICE0**.TIF', stack=True, new_axis_name='depth', lazy=True)
```

```python
%%time
slices.isig[:]
```

```python
slices.original_metadata = DictionaryTreeBrowser()
```

```python
%%time
slices.isig[:]
```

increasing the number of slices by moving the wildcard in the filename. The results:

So this is not an O(n^2) but only an O(n) increase in time.
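The linear growth itself is easy to model without hyperspy (toy code; the `stack_element_<i>` key naming is only illustrative): if stacking keeps every slice's original metadata, the stacked tree grows with the number of slices, and every later deepcopy pays for all of it.

```python
def toy_stack(slice_metadata_list):
    """Toy 'stack': keep every slice's metadata under its own key."""
    return {f'stack_element_{i}': md
            for i, md in enumerate(slice_metadata_list)}


slice_md = {'detector': 'EDX', 'params': list(range(100))}
sizes = [len(toy_stack([slice_md] * n)) for n in (10, 100, 1000)]
print(sizes)   # → [10, 100, 1000] -- entries grow linearly with n
```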
Outline for possible solution:
Fixed in #2691. |
**The current hyperspy behaviour causing issues with memory and slowness**

When loading a sliced dataset with `stack=True`, hyperspy concatenates the original metadata from all slices; this concatenated tree is then deepcopied and attached to every new subset/slice of the hypercube.

**Known issues**

(for a description of the benchmark see below)
**Known hack/workaround**

After loading the dataset stack, override the whole concatenated original metadata:
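In toy form (a minimal stand-in class, not hyperspy itself, with the deepcopy-on-slice behaviour mimicked by hand), the workaround amounts to replacing the tree before any slicing happens:

```python
import copy


class ToySignal:
    """Minimal stand-in for a signal carrying original_metadata."""

    def __init__(self, original_metadata):
        self.original_metadata = original_metadata

    def isig_slice(self):
        # Mimics the problematic behaviour: the derived signal gets a
        # deepcopy of the whole original_metadata tree.
        return ToySignal(copy.deepcopy(self.original_metadata))


stack = ToySignal({f'stack_element_{i}': {'params': list(range(100))}
                   for i in range(336)})

stack.original_metadata = {}           # the workaround: drop the tree
sliced = stack.isig_slice()
print(len(sliced.original_metadata))   # → 0
```

With hyperspy itself the equivalent is the `slices.original_metadata = DictionaryTreeBrowser()` line from the benchmark above.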
**Possible fixes**

alternatively:

(see below; originally the third point was to prevent slicing when plotting EDS, but after finding that this bug affects not only EDS line plotting but everything stacked, that point is no longer considered, as points 1 or 2 would deal with it)
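Fix 1 could look like an opt-out flag on stacking, sketched here on the toy model (`stack_original_metadata` is a hypothetical parameter name, not existing hyperspy API):

```python
def stack_slices(slice_metadata_list, stack_original_metadata=True):
    """Toy stacking that concatenates per-slice metadata only on request."""
    if not stack_original_metadata:
        return {}                 # stacked signal carries no original tree
    return {f'stack_element_{i}': md
            for i, md in enumerate(slice_metadata_list)}


mds = [{'file': f'slice_{i}.tif'} for i in range(5)]
print(len(stack_slices(mds)))                                   # → 5
print(len(stack_slices(mds, stack_original_metadata=False)))    # → 0
```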
**The original message was moved down.**