
BUG: O(n) growth of original_metadata with stacking #1398

Closed
sem-geologist opened this issue Jan 22, 2017 · 22 comments · Fixed by #2691

@sem-geologist (Contributor) commented Jan 22, 2017

Current hyperspy behaviour causing memory and speed issues

When loading a sliced dataset with stack=True, hyperspy concatenates the original_metadata of all slices; this concatenated tree is then deepcopied and attached to every new subset/slice of the hypercube.

Known issues

  1. Plotting EDS spectra with X-ray lines gets very slow if the stack is big.
  2. Accessing slices is very slow if the format carries very rich original_metadata (e.g. Zeiss tif files):
    (plot: slicing time vs. number of slices, with and without original_metadata; for a description of the benchmark see below)

Known hack/workaround

After loading the dataset stack,

thingy = hs.load('FIBSLICE*.tif', stack=True)

override the whole concatenated original metadata:

from hyperspy.misc.utils import DictionaryTreeBrowser 
original_metadata = thingy.original_metadata  # optional, if you need something from it later
thingy.original_metadata = DictionaryTreeBrowser()
# enjoy de-crippled speed!!!

Possible fixes

  1. do not accumulate the original metadata while stacking
  2. do not copy original_metadata when slicing
    alternatively:
  3. make original_metadata read-only and pass only a reference (proposed by @vidartf)

(see below; originally the third point was to avoid slicing when plotting EDS lines, but after finding that this bug affects not only EDS line plotting but everything stacked, that point is no longer considered, as points 1 or 2 would deal with it)
** The original message was moved down.

@sem-geologist
Copy link
Contributor Author

sem-geologist commented Jan 22, 2017

Hello,
I don't know if any of you have ever run into this problem, but on my Linux Debian and on hyperspy installed through the bundle on Windows I experience extremely slow plotting of EDS spectra if there is a geological amount of lines. I imagine that most of the devs do a lot of physics and casual experiments are done with 2-4 elements, so maybe this problem is not obvious, but in geology we often deal with much more. E.g. major elements like ['C', 'O', 'Mg', 'Al', 'Si', 'K', 'Ca', 'Ti', 'Cr', 'Fe'] (10 elements) is normally the minimum. But there are plenty of complicated rocks where 'Th', 'U', 'Zr', 'Nb', 'S', 'Pb', 'Cu', 'Ag', 'P', 'Y', 'Cl', 'Mn', 'Ba', 'Sr' and the whole bunch of REE (La-Lu) can be present. So it is completely normal to have about 20 (or even more) elements in most rocks. Plotting that (only the primary Ka, La or Ma lines) without any backgrounds takes enough time to go make a coffee. Plotting with multiple lines and background lines allows going for a two-course lunch.
I don't think this is the price of matplotlib; rather there is some particularly bad algorithm adding the lines. From time to time I do some matplotlib plotting (not with hyperspy, just other stuff) with plots quite rich in plot items (like 50 ellipses, a few thousand points, lines and so on...), and it usually takes no more than a second (in some cases 2 seconds at most), not minutes.

To be clear about what I do:

I have this function with which I go through all bcf files, read them and calculate the max pixel spectrum:

import gc
from copy import copy

import hyperspy.api as hs
from tqdm import tqdm_notebook


def collective_max_pixel_spectrum(bcf_file_list, **kwargs):
    """Calculate and return the max pixel spectrum for a whole list of bcf files."""
    eds_list = []
    for i in tqdm_notebook(bcf_file_list):
        EDS = hs.load(i, select_type='spectrum', **kwargs)
        eds_list.append(copy(EDS.max()))
        del EDS
        gc.collect()
    stacky = hs.stack(eds_list)
    max_pixel_spectra = stacky.max(axis=0)
    return max_pixel_spectra

When I apply it to the real file list:

from glob import glob

bcfs = glob('*.bcf')
max_pixel_spectrum = collective_max_pixel_spectrum(bcfs, downsample=4,
                                                   cutoff_at_kV=13)
Out[]: 100% 336/336 [07:44<00:00, 1.45s/it]

This max_pixel_spectrum is just a single EDS spectrum:

In[]: max_pixel_spectrum.data.shape
Out[]: (1347,)
In[]: max_pixel_spectrum
Out[]: <EDSSEMSpectrum, title: Stack of EDX, dimensions: (|1347)>

It looks to me like a completely ordinary EDS spectrum.

then:

In[]: lines = ['C_Ka', 'O_Ka', 'F_Ka', 'Na_Ka', 'Mg_Ka',
         'Al_Ka', 'Si_Ka', 'P_Ka', 'Zr_La', 'Nb_La',
         'K_Ka', 'Ca_Ka', 'Ti_Ka', 'Cr_Ka','Fe_Ka',
         'Ni_Ka', 'S_Ka', 'Cl_Ka', 'La_La', 'Ce_La',
         'Nd_La','Cu_Ka', 'Zn_Ka', 'Au_La', 'Th_Ma']
      len(lines)
Out[]: 25

And on my laptop's AMD Turion II (it is an old CPU, but I get only about twice the speed on an Intel Xeon) it takes:

In[]: %%time
       max_pixel_spectrum.plot(xray_lines=lines)

       CPU times: user 1min 54s, sys: 2.2 s, total: 1min 57s
       Wall time: 1min 55s

Oh and by the way, if I plot without lines:

%%time
max_pixel_spectrum.plot()

output is:

CPU times: user 252 ms, sys: 56 ms, total: 308 ms
Wall time: 250 ms

@francisco-dlp (Member) commented

Currently it plots the lines one by one. Maybe we could make it run faster using collections?
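For reference, a rough, hypothetical illustration of that idea (this is not hyperspy's actual plotting code): instead of adding one matplotlib artist per X-ray line, the vertical markers could be batched into a single LineCollection, so the canvas only has to manage one artist.

import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection

# Illustrative Ka energies in keV (C, O, Si, Ca, Fe); a real implementation
# would take them from the elements/lines of the signal.
energies = [0.277, 0.525, 1.740, 3.692, 6.404]
ymin, ymax = 0.0, 1.0

# One segment per line marker, all drawn by a single collection.
segments = [[(e, ymin), (e, ymax)] for e in energies]
fig, ax = plt.subplots()
ax.add_collection(LineCollection(segments, colors='black', linewidths=0.75))
ax.set_xlim(0, 10)
ax.set_ylim(ymin, ymax)
plt.show()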

@francisco-dlp (Member) commented

I've labelled it as a bug because the speed issue makes hyperspy unfit for the purpose and there is room for improvement.

@sem-geologist (Contributor, Author) commented Jan 23, 2017

It does not look to me like the problem is one-by-one plotting (I do such plotting a lot and never run into such an issue outside hyperspy).

@francisco-dlp (Member) commented

Probably the best way to identify what's making it sluggish is to use a code profiler e.g. line_profiler.
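For reference, a minimal sketch of how line_profiler can be hooked in from IPython/Jupyter (assumes the line_profiler package is installed; the function passed to -f is just an example starting point, the actual hot spot would have to be found iteratively):

# Load the extension and profile a specific function while running the slow call.
%load_ext line_profiler
%lprun -f max_pixel_spectrum.plot max_pixel_spectrum.plot(xray_lines=lines)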

@sem-geologist (Contributor, Author) commented

I don't understand how to inject line_profiler to catch things, but cProfile works; however, it profiles the plot with all the lines in 10 s.

@sem-geologist (Contributor, Author) commented Jan 23, 2017

Another weird thing is that without the profiler the plot hangs for around 1 min 40 s, then in the last 10 seconds the EDS curve appears and the lines are plotted. With the profiler it takes the same ten seconds but without the 1 min 40 s delay O_o

@sem-geologist (Contributor, Author) commented

OK, I think I am starting to isolate the problem.
The larger the bcf set I feed to the function above to generate the max_pixel_spectrum, the longer the delay in plotting! So probably this max(axis=0) does not return a clean single spectrum, but a single spectrum with some traces?

@sem-geologist (Contributor, Author) commented

First sign of such traces: the original_metadata of this max_pixel_spectrum contains all the metadata of the stack, so the original metadata of the whole stack is copied to the signal returned by .max(axis=0).

@sem-geologist (Contributor, Author) commented

If I set original_metadata to None then it plots instantaneously, but it does not plot the lines (it errors out). It looks like it deepcopies the whole signal and copies everything for plotting?!? wtf, that looks highly inefficient! Why does it need to copy original_metadata to plot lines?

@francisco-dlp (Member) commented

Good catch! deepcopying original_metadata is a common source of sluggishness. Today I don't have time to look into this. Isn't there a way to implement the functionality without calling Signal.max?
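Not hyperspy API, but one way to sidestep hs.stack (and thus the concatenated original_metadata) altogether, following the workflow from the first comment: keep a running element-wise maximum of the raw numpy data. A rough sketch, assuming all files yield spectra of the same length:

from glob import glob

import numpy as np
import hyperspy.api as hs

running_max = None
for fname in glob('*.bcf'):
    eds = hs.load(fname, select_type='spectrum', downsample=4, cutoff_at_kV=13)
    spectrum = eds.max().data  # max over navigation axes -> 1D numpy array
    running_max = spectrum if running_max is None else np.maximum(running_max, spectrum)
    del eds

# Wrap the result back into a signal if needed (the energy axis calibration and
# EDS signal type would still have to be set for X-ray line plotting).
max_pixel_spectrum = hs.signals.Signal1D(running_max)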

@sem-geologist (Contributor, Author) commented

So I found a workaround for my workflow:
I need to do this before plotting, and then I get no delay:

from hyperspy.misc.utils import DictionaryTreeBrowser
max_pixel_spectrum.original_metadata = DictionaryTreeBrowser()

However, it is still a bug, as this will be felt on any stack of images in formats that provide huge original metadata trees.
There are a few possibilities to deal with this:

  1. do not accumulate the original metadata while stacking (accumulating it hardly makes any sense)
  2. do not copy original_metadata when slicing
  3. do not slice when plotting the lines

@francisco-dlp (Member) commented

  1. 👍
  2. 👍
  3. 👍

@sem-geologist (Contributor, Author) commented

PS: it is enough to implement 1.; then 2 and 3 are not a problem anymore. However, I guess somebody implemented this and somebody's workflow depends on it, so I am not sure I should just go ahead and remove it.

@francisco-dlp (Member) commented

Deepcopying original_metadata when slicing does not make much sense: it is the original metadata of the original file, but, as you mention, why should it be copied to signals which are derived from it? I am tempted to think that this is actually a bug and it may be causing sluggishness elsewhere, not only when stacking.

Regarding 1, I agree that most of the time it is unnecessary to stack the original metadata. We could add an option for this to load with default True for hspy 1.x and False for hspy 2+.

3 is a good idea anyway because deepcopying a signal is an expensive operation that we should avoid as much as possible internally.

@vidartf (Member) commented Jan 27, 2017

Would it make sense for the original metadata to be read-only? If so, it would never have to be deep copied, and only references could be passed around. Then 1. and 2. go away (I would probably still fix 3.). By its name (original), it makes sense to keep it read only. Are there any cases where this is/should not be true?

@francisco-dlp (Member) commented

That's an excellent point. I am not sure though about not implementing 2. Any thoughts on pros and cons?

@magnunor (Contributor) commented

Yeah, I don't really see any point in changing the content of the original metadata, since it is supposed to be whatever is loaded from the original file. Any metadata which one wants to use/modify should (in my opinion) be in s.metadata.

@sem-geologist (Contributor, Author) commented Sep 13, 2018

I am renaming the issue to get closer to the point.

sem-geologist changed the title from "extremely slow plotting with many lines" to "BUG: O(n^2) growth of original_metadata with stacks" on Sep 13, 2018
sem-geologist changed the title from "BUG: O(n^2) growth of original_metadata with stacks" to "BUG: O(n) growth of original_metadata with stacking" on Sep 14, 2018
@sem-geologist (Contributor, Author) commented

I have done a small test with Zeiss tifs (which have very rich original metadata), increasing the number of slices and using %%time to measure.
It was done in these steps:

  1. loading the slices:
    slices = hs.load('ROI2/FIBSLICE0**.TIF', stack=True, new_axis_name='depth', lazy=True)
  2. measuring the time to return the whole dataset:
    %%time
    slices.isig[:]
  3. overwriting the original metadata with an empty tree:
    slices.original_metadata = DictionaryTreeBrowser()
  4. measuring the time to return the whole dataset again:
    %%time
    slices.isig[:]

The number of slices was increased by moving the wildcard in the filename passed to load() (a consolidated timing sketch follows the list below):

  • FIBSLICE0000.TIF
  • FIBSLICE000*.TIF
  • FIBSLICE00*.TIF
  • FIBSLICE0*.TIF
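A consolidated sketch of the above procedure as a plain script (hypothetical file pattern; timing with time.perf_counter instead of %%time):

import time

import hyperspy.api as hs
from hyperspy.misc.utils import DictionaryTreeBrowser


def time_isig(signal, repeats=3):
    """Return the best wall-clock time of `signal.isig[:]` over a few runs."""
    best = float('inf')
    for _ in range(repeats):
        t0 = time.perf_counter()
        signal.isig[:]
        best = min(best, time.perf_counter() - t0)
    return best


slices = hs.load('ROI2/FIBSLICE0*.TIF', stack=True,
                 new_axis_name='depth', lazy=True)
with_om = time_isig(slices)

slices.original_metadata = DictionaryTreeBrowser()  # drop the stacked tree
without_om = time_isig(slices)

print(f"with original_metadata:    {with_om:.4f} s")
print(f"without original_metadata: {without_om:.4f} s")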

The results:

slices   with_OM (s)   without_OM (s)
     1        0.0356           0.0086
    10        0.274            0.0101
   100        2.42             0.0109
   262        6.28             0.0107

And graphically:
(plot: slicing time vs. number of slices, with and without original_metadata)

So this is not O(n^2) but only an O(n) increase in time.
Anyway, with huge datasets it gets very annoying.
Maybe the problem is not the deepcopying of original_metadata, but the original_metadata object itself, which performs much worse than a simple dictionary?
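A quick way to check that hypothesis would be a micro-benchmark comparing a plain dict against a DictionaryTreeBrowser built from the same (made-up) data; a sketch:

import timeit

from hyperspy.misc.utils import DictionaryTreeBrowser

# Made-up flat metadata with 1000 entries, purely for illustration.
data = {f'key_{i}': i for i in range(1000)}

plain = timeit.timeit(lambda: dict(data), number=100)
browser = timeit.timeit(lambda: DictionaryTreeBrowser(data), number=100)

print(f"plain dict:            {plain:.4f} s for 100 rebuilds")
print(f"DictionaryTreeBrowser: {browser:.4f} s for 100 rebuilds")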

@vidartf (Member) commented Sep 14, 2018

Outline for a possible solution (a rough sketch follows the list below):

  • After loading original_metadata, deep freeze it if it is not already. I.e. make it immutable/read-only recursively. Pass this dict to the dict browser.
  • On (deep)copy, simply copy the reference.
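A minimal sketch (plain Python, not hyperspy API) of what such a recursive deep freeze could look like, using read-only mapping proxies so the tree can be shared by reference instead of deepcopied:

from types import MappingProxyType


def deep_freeze(d):
    """Recursively convert nested dicts into read-only mapping proxies."""
    frozen = {}
    for key, value in d.items():
        if isinstance(value, dict):
            frozen[key] = deep_freeze(value)
        else:
            frozen[key] = value
    return MappingProxyType(frozen)


# Made-up metadata tree, purely for illustration.
original = {'Instrument': {'beam_energy': 20.0, 'Detector': {'azimuth': 45.0}}}
frozen = deep_freeze(original)
# frozen['Instrument']['beam_energy'] = 30.0  # would raise TypeError: read-only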

@ericpre (Member) commented Apr 1, 2021

Fixed in #2691.

ericpre added this to the v1.6.2 milestone Apr 1, 2021
ericpre linked a pull request Apr 7, 2021 that will close this issue