
BUG: O(n) growth of original_metadata with stacking #1398

Closed
sem-geologist opened this issue Jan 22, 2017 · 22 comments · Fixed by #2691

@sem-geologist (Contributor) commented Jan 22, 2017

Current hyperspy behaviour causing memory and speed issues

When loading a sliced dataset with stack=True, hyperspy concatenates the original_metadata of all slices; this concatenated tree is then deepcopied and attached to every new subset/slice of the hypercube.

Known issues

  1. Plotting EDS spectra with X-ray lines gets very slow if the stack is big.
  2. Accessing slices is very slow if the format carries very rich original_metadata (e.g. Zeiss tif files):
    (plot: slicing time vs. number of slices, with and without original_metadata; for a description of the benchmark see below)

Known hack/workaround

After loading the dataset stack,

thingy = hs.load('FIBSLICE*.tif', stack=True)

override the whole concatenated original metadata:

from hyperspy.misc.utils import DictionaryTreeBrowser 
original_metadata = thingy.original_metadata  # optional, if you need something from it later
thingy.original_metadata = DictionaryTreeBrowser()
# enjoy de-crippled speed!!!

Possible fixes

  1. do not accumulate the original metadata while stacking
  2. do not copy original_metadata when slicing
    alternatively:
  3. make original_metadata read-only and pass only a reference (proposed by @vidartf)

(see below; originally the third point was to avoid slicing when plotting EDS lines, but after finding that this bug affects not only EDS line plotting but everything stacked, that point is no longer considered, as points 1 or 2 would deal with it)
** The original message was moved down.

@sem-geologist
Copy link
Contributor Author

sem-geologist commented Jan 22, 2017

Hello,
I don't know if any of you have ever run into this problem, but on my Linux Debian and on hyperspy installed through the bundle on Windows I experience extremely slow plotting of EDS spectra if there is a geological amount of lines. I imagine that most of the devs do a lot of physics and casual experiments are done with 2-4 elements, so maybe this problem is not obvious, but in geology we often deal with much more. E.g. major elements like ['C', 'O', 'Mg', 'Al', 'Si', 'K', 'Ca', 'Ti', 'Cr', 'Fe'] (10 elements) is normally the minimum. But there are plenty of complicated rocks where 'Th', 'U', 'Zr', 'Nb', 'S', 'Pb', 'Cu', 'Ag', 'P', 'Y', 'Cl', 'Mn', 'Ba', 'Sr' and the whole bunch of REE (La-Lu) can be present. So it is completely normal to have about 20 (or even more) elements in most rocks. Plotting that (only the primary Ka, La or Ma lines) without any backgrounds takes enough time to go make a coffee. Plotting with multiple lines and background lines allows going for a two-course lunch.
I don't think this is the price of matplotlib; rather there is some particularly bad algorithm adding the lines. From time to time I do some matplotlib plotting (not with hyperspy, just other stuff) with plots quite rich in plot items (like 50 ellipses, a few thousand points, lines and so on...), and it usually takes no more than a second (in some cases 2 seconds at most), not minutes.

To be clear about what I do:

I have this function with which I go through all bcf files, read them and calculate the max pixel spectrum:

import gc
from copy import copy

import hyperspy.api as hs
from tqdm import tqdm_notebook


def collective_max_pixel_spectrum(bcf_file_list, **kwargs):
    """Calculate and return the max pixel spectrum for a whole list of bcf files."""
    eds_list = []
    for i in tqdm_notebook(bcf_file_list):
        EDS = hs.load(i, select_type='spectrum', **kwargs)
        eds_list.append(copy(EDS.max()))
        del EDS
        gc.collect()
    stacky = hs.stack(eds_list)
    max_pixel_spectra = stacky.max(axis=0)
    return max_pixel_spectra

When I apply it to the real file list:

from glob import glob

bcfs = glob('*.bcf')
max_pixel_spectrum = collective_max_pixel_spectrum(bcfs, downsample=4,
                                                   cutoff_at_kV=13)
Out[]: 100% 336/336 [07:44<00:00, 1.45s/it]

This max_pixel_spectrum is just a single EDS spectrum:

In[]: max_pixel_spectrum.data.shape
Out[]: (1347,)
In[]: max_pixel_spectrum
Out[]: <EDSSEMSpectrum, title: Stack of EDX, dimensions: (|1347)>

It looks to me like a completely ordinary EDS spectrum.

then:

In[]: lines = ['C_Ka', 'O_Ka', 'F_Ka', 'Na_Ka', 'Mg_Ka',
         'Al_Ka', 'Si_Ka', 'P_Ka', 'Zr_La', 'Nb_La',
         'K_Ka', 'Ca_Ka', 'Ti_Ka', 'Cr_Ka','Fe_Ka',
         'Ni_Ka', 'S_Ka', 'Cl_Ka', 'La_La', 'Ce_La',
         'Nd_La','Cu_Ka', 'Zn_Ka', 'Au_La', 'Th_Ma']
      len(lines)
Out[]: 25

And on my laptop's AMD Turion II (it is an old CPU, but I get only about twice the speed on an Intel Xeon) it takes:

In[]: %%time
       max_pixel_spectrum.plot(xray_lines=lines)

       CPU times: user 1min 54s, sys: 2.2 s, total: 1min 57s
       Wall time: 1min 55s

Oh and by the way, if I plot without lines:

%%time
max_pixel_spectrum.plot()

output is:

CPU times: user 252 ms, sys: 56 ms, total: 308 ms
Wall time: 250 ms

@francisco-dlp (Member) commented

Currently it plots the lines one by one. Maybe we could make it run faster using collections?
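For reference, a rough, hypothetical illustration of that idea (this is not hyperspy's actual plotting code): instead of adding one matplotlib artist per X-ray line, the vertical markers could be batched into a single LineCollection, so the canvas only has to manage one artist.

import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection

# Illustrative Ka energies in keV (C, O, Si, Ca, Fe); a real implementation
# would take them from the elements/lines of the signal.
energies = [0.277, 0.525, 1.740, 3.692, 6.404]
ymin, ymax = 0.0, 1.0

# One segment per line marker, all drawn by a single collection.
segments = [[(e, ymin), (e, ymax)] for e in energies]
fig, ax = plt.subplots()
ax.add_collection(LineCollection(segments, colors='black', linewidths=0.75))
ax.set_xlim(0, 10)
ax.set_ylim(ymin, ymax)
plt.show()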

@francisco-dlp (Member) commented

I've labelled it as a bug because the speed issue makes hyperspy unfit for the purpose and there is room for improvement.

@sem-geologist (Contributor, Author) commented Jan 23, 2017

It does not look to me like the problem is one-by-one plotting (I do such plotting a lot and never run into such an issue outside hyperspy).

@francisco-dlp (Member) commented

Probably the best way to identify what's making it sluggish is to use a code profiler e.g. line_profiler.
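For reference, a minimal sketch of how line_profiler can be hooked in from IPython/Jupyter (assumes the line_profiler package is installed; the function passed to -f is just an example starting point, the actual hot spot would have to be found iteratively):

# Load the extension and profile a specific function while running the slow call.
%load_ext line_profiler
%lprun -f max_pixel_spectrum.plot max_pixel_spectrum.plot(xray_lines=lines)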

@sem-geologist (Contributor, Author) commented

I don't understand how to inject line_profiler to catch things, but cProfile works; however, it profiles the plot with all the lines in 10 s.

@sem-geologist (Contributor, Author) commented Jan 23, 2017

Another weird thing is that without the profiler the plot hangs for around 1 min 40 s, then in the last 10 seconds the EDS curve appears and the lines are plotted. With the profiler it takes the same ten seconds but without the 1 min 40 s delay O_o

@sem-geologist (Contributor, Author) commented

OK, I think I am starting to isolate the problem.
The larger the bcf set I feed to the function above to generate the max_pixel_spectrum, the longer the delay in plotting! So probably this max(axis=0) does not return a clean single spectrum, but a single spectrum with some traces?

@sem-geologist (Contributor, Author) commented

First sign of such traces: the original_metadata of this max_pixel_spectrum contains all the metadata of the stack, so the original metadata of the whole stack is copied to the signal returned by .max(axis=0).

@sem-geologist (Contributor, Author) commented

If I set original_metadata to None then it plots instantaneously, but it does not plot the lines (it errors out). It looks like it deepcopies the whole signal and copies everything for plotting?!? wtf, that looks highly inefficient! Why does it need to copy original_metadata to plot lines?

@francisco-dlp (Member) commented

Good catch! deepcopying original_metadata is a common source of sluggishness. Today I don't have time to look into this. Isn't there a way to implement the functionality without calling Signal.max?
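Not hyperspy API, but one way to sidestep hs.stack (and thus the concatenated original_metadata) altogether, following the workflow from the first comment: keep a running element-wise maximum of the raw numpy data. A rough sketch, assuming all files yield spectra of the same length:

from glob import glob

import numpy as np
import hyperspy.api as hs

running_max = None
for fname in glob('*.bcf'):
    eds = hs.load(fname, select_type='spectrum', downsample=4, cutoff_at_kV=13)
    spectrum = eds.max().data  # max over navigation axes -> 1D numpy array
    running_max = spectrum if running_max is None else np.maximum(running_max, spectrum)
    del eds

# Wrap the result back into a signal if needed (the energy axis calibration and
# EDS signal type would still have to be set for X-ray line plotting).
max_pixel_spectrum = hs.signals.Signal1D(running_max)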

@sem-geologist (Contributor, Author) commented

So I found a workaround for my workflow:
I need to do this before plotting, and then I get no delay:

from hyperspy.misc.utils import DictionaryTreeBrowser
max_pixel_spectrum.original_metadata = DictionaryTreeBrowser()

However, it is still a bug, as this will be felt on any stack of images in formats that provide huge original metadata trees.
There are a few possibilities to deal with this:

  1. do not accumulate the original metadata while stacking (accumulating it hardly makes any sense)
  2. do not copy original_metadata when slicing
  3. do not slice when plotting the lines

@francisco-dlp (Member) commented

  1. 👍
  2. 👍
  3. 👍

@sem-geologist (Contributor, Author) commented

PS: it is enough to implement 1.; then 2 and 3 are not a problem anymore. However, I guess somebody implemented this and somebody's workflow depends on it, so I am not sure I should just go ahead and remove it.

@francisco-dlp (Member) commented

Deepcopying original_metadata when slicing does not make much sense: it is the original metadata of the original file, but, as you mention, why should it be copied to signals which are derived from it? I am tempted to think that this is actually a bug and it may be causing sluggishness elsewhere, not only when stacking.

Regarding 1, I agree that most of the time it is unnecessary to stack the original metadata. We could add an option for this to load with default True for hspy 1.x and False for hspy 2+.

3 is a good idea anyway because deepcopying a signal is an expensive operation that we should avoid as much as possible internally.

@vidartf (Member) commented Jan 27, 2017

Would it make sense for the original metadata to be read-only? If so, it would never have to be deep copied, and only references could be passed around. Then 1. and 2. go away (I would probably still fix 3.). By its name (original), it makes sense to keep it read only. Are there any cases where this is/should not be true?

@francisco-dlp (Member) commented

That's an excellent point. I am not sure though about not implementing 2. Any thoughts on pros and cons?

@magnunor (Contributor) commented

Yeah, I don't really see any point in changing the content of the original metadata, since it is supposed to be whatever is loaded from the original file. Any metadata which one wants to use/modify should (in my opinion) be in s.metadata.

@sem-geologist (Contributor, Author) commented Sep 13, 2018

I am renaming the issue to get closer to the point.

sem-geologist changed the title from "extremely slow plotting with many lines" to "BUG: O(n^2) growth of original_metadata with stacks" on Sep 13, 2018
sem-geologist changed the title from "BUG: O(n^2) growth of original_metadata with stacks" to "BUG: O(n) growth of original_metadata with stacking" on Sep 14, 2018
@sem-geologist (Contributor, Author) commented

I have done a small test with Zeiss tifs (which have very rich original metadata), increasing the number of slices and using %%time to measure.
It was done in these steps:

  1. loading the slices:
    slices = hs.load('ROI2/FIBSLICE0**.TIF', stack=True, new_axis_name='depth', lazy=True)
  2. measuring the time to return the whole dataset:
    %%time
    slices.isig[:]
  3. overwriting the original metadata with an empty tree:
    slices.original_metadata = DictionaryTreeBrowser()
  4. measuring the time to return the whole dataset again:
    %%time
    slices.isig[:]

The number of slices was increased by moving the wildcard in the filename passed to load() (a consolidated timing sketch follows the list below):

  • FIBSLICE0000.TIF
  • FIBSLICE000*.TIF
  • FIBSLICE00*.TIF
  • FIBSLICE0*.TIF
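A consolidated sketch of the above procedure as a plain script (hypothetical file pattern; timing with time.perf_counter instead of %%time):

import time

import hyperspy.api as hs
from hyperspy.misc.utils import DictionaryTreeBrowser


def time_isig(signal, repeats=3):
    """Return the best wall-clock time of `signal.isig[:]` over a few runs."""
    best = float('inf')
    for _ in range(repeats):
        t0 = time.perf_counter()
        signal.isig[:]
        best = min(best, time.perf_counter() - t0)
    return best


slices = hs.load('ROI2/FIBSLICE0*.TIF', stack=True,
                 new_axis_name='depth', lazy=True)
with_om = time_isig(slices)

slices.original_metadata = DictionaryTreeBrowser()  # drop the stacked tree
without_om = time_isig(slices)

print(f"with original_metadata:    {with_om:.4f} s")
print(f"without original_metadata: {without_om:.4f} s")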

The results:

slices   with_OM (s)   without_OM (s)
     1        0.0356           0.0086
    10        0.274            0.0101
   100        2.42             0.0109
   262        6.28             0.0107

And graphically:
(plot: slicing time vs. number of slices, with and without original_metadata)

So this is not O(n^2) but only an O(n) increase in time.
Anyway, with huge datasets it gets very annoying.
Maybe the problem is not the deepcopying of original_metadata, but the original_metadata object itself, which performs much worse than a simple dictionary?
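A quick way to check that hypothesis would be a micro-benchmark comparing a plain dict against a DictionaryTreeBrowser built from the same (made-up) data; a sketch:

import timeit

from hyperspy.misc.utils import DictionaryTreeBrowser

# Made-up flat metadata with 1000 entries, purely for illustration.
data = {f'key_{i}': i for i in range(1000)}

plain = timeit.timeit(lambda: dict(data), number=100)
browser = timeit.timeit(lambda: DictionaryTreeBrowser(data), number=100)

print(f"plain dict:            {plain:.4f} s for 100 rebuilds")
print(f"DictionaryTreeBrowser: {browser:.4f} s for 100 rebuilds")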

@vidartf (Member) commented Sep 14, 2018

Outline for a possible solution (a rough sketch follows the list below):

  • After loading original_metadata, deep freeze it if it is not already. I.e. make it immutable/read-only recursively. Pass this dict to the dict browser.
  • On (deep)copy, simply copy the reference.
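A minimal sketch (plain Python, not hyperspy API) of what such a recursive deep freeze could look like, using read-only mapping proxies so the tree can be shared by reference instead of deepcopied:

from types import MappingProxyType


def deep_freeze(d):
    """Recursively convert nested dicts into read-only mapping proxies."""
    frozen = {}
    for key, value in d.items():
        if isinstance(value, dict):
            frozen[key] = deep_freeze(value)
        else:
            frozen[key] = value
    return MappingProxyType(frozen)


# Made-up metadata tree, purely for illustration.
original = {'Instrument': {'beam_energy': 20.0, 'Detector': {'azimuth': 45.0}}}
frozen = deep_freeze(original)
# frozen['Instrument']['beam_energy'] = 30.0  # would raise TypeError: read-only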

@ericpre (Member) commented Apr 1, 2021

Fixed in #2691.

ericpre added this to the v1.6.2 milestone Apr 1, 2021
ericpre linked a pull request Apr 7, 2021 that will close this issue