
NEW: bruker's composite file (bcf) hypermap io (read-only) library #712

Merged · 98 commits · Jul 13, 2016

Conversation

@sem-geologist (Contributor) commented Aug 27, 2015

A library for reading Bruker composite file (BCF) hypermaps and images.

TO DO:

  • header parser, SEM imagery, sum EDS spectra to numpy
  • pure python parser functions of elemental mapping to numpy (fully functioning, but memory inefficient)
  • pure python parser with dynamic loading of file and optional dynamic zlib decompression (required to enable cython based parser)
  • the hyperspy abstract/mapped functions of the library (file_reader)
  • cython based implementation of Delphi/Bruker packed array fast parsing
  • modify the setup.py, to compile cython code
  • some minimal loading tests
  • integration to hyperspy tests

@francisco-dlp francisco-dlp added this to the 0.9 milestone Aug 27, 2015
got rid of while loops; improved the functions to work better with containers holding many files and with older sfs containers with smaller chunk sizes.

It would probably be more practical to use xrange in for loops, but the numbers are not huge and range in Python 2 is efficient enough. A future port to Python 3, however, will need no changes.
fixed one typo, and fixed a missing part for compression, where the problem was reading the file pointer table instead of the beginning of the packed file; some types were also compared incorrectly.
@sem-geologist (Contributor Author):

Why does the code quality check take so long?

@francisco-dlp (Member):

This is a recurrent issue with landscape.io, but it is unfortunately out of our control.


@sem-geologist (Contributor Author):

The Python code will not work on a big-endian machine... exactly between lines 563 and 581, in the function bin_to_numpy. Suggestions on how to fix it for big-endian machines are highly appreciated. The Cython code is not going to have this problem...
update: I think I fixed it; it needs testing on a big-endian machine...

@sem-geologist (Contributor Author):

I remember I asked half a year ago how to return both imagery and hyperspectral data with one file_reader function... Has anything changed in that area? I just want to point out again that bcf contains both SEM imagery and hyperspectral data.

@francisco-dlp (Member):

The emi file reader also returns several signals as a list. See https://github.com/hyperspy/hyperspy/blob/master/hyperspy/io_plugins/fei.py#L231
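
For reference, a minimal sketch of that list-of-signals convention (my own illustration, not the fei.py or bcf code; parse_bcf is a hypothetical stand-in for the real parsing routine):

import numpy as np

def parse_bcf(filename):
    # hypothetical stand-in for the real parser: a 2D SEM image
    # and a 3D (y, x, energy) hypermap
    return np.zeros((32, 32)), np.zeros((32, 32, 1024))

def file_reader(filename, **kwds):
    image, hypermap = parse_bcf(filename)
    # one dictionary per signal; the loader turns each into its own signal
    return [
        {'data': image, 'metadata': {'General': {'title': 'SEM image'}}},
        {'data': hypermap, 'metadata': {'General': {'title': 'EDS hypermap'}}},
    ]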

…g chunks by implementing generators,

fixing the silent "try, except, pass" blocks into explicit exception handling, preventing unintended masking of real errors.
part in bin_to_numpy function to work also on big-endian machines
from string and casting array back to string so as it was read.
The byteswap is needed, and that even simplifies the solution, as both big-endian and little-endian machines have to swap it once.
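
A minimal sketch of one endian-portable way to do such a read (my illustration, not this commit's code): declaring the stored little-endian byte order in the dtype makes numpy perform the swap only on hosts that actually need it.

import numpy as np

raw = b'\x01\x00\x02\x00'                  # two little-endian uint16 values
counts = np.frombuffer(raw, dtype='<u2')   # -> array([1, 2], dtype=uint16)
# native-order copy; the swap happens only on big-endian hosts
native = counts.astype(counts.dtype.newbyteorder('='))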
…f bcf

NEW: added downsample and cutoff abilities at parsing level.
MOVED: removed the old memory-inefficient parsing functions. From now on the parser is a method of the BCF_reader class
@sem-geologist (Contributor Author):

So in the end I have moved everything into classes and saw no slowdown in execution; on the contrary, I got rid of some parts and saw a 1-2% improvement over the previous heavy parsing functions sitting outside the class. I think landscape will complain like hell about that function (now a method of the BCF_parser class) for too many variables and too much complexity. The price of dividing it up would be additional slowness (that was my experience on the first approach: calling a function/method per pixel is expensive, not only the call itself, but also copying data for those functions). In the end I am not creating something new; the method has to parse binary data with so many quirks and whistles that making it fast in pure Python is nearly impossible... Python is far from the best tool for doing that. Doing it with raw pointers would be much easier and faster (which is why my goal is to move the heavy parsing functions to Cython). My profiling showed that the most time is spent in numpy. Numpy is efficient at reading, saving, and doing vectorised math on a whole array, but it is very slow at element-by-element operations, which is exactly the case when parsing Bruker data.
My plan is to create the final numpy array (zeros), then pass it to Cython as a memoryview, where accessing and changing array elements is orders of magnitude faster (which is what my experimental code does). I would really appreciate it if somebody would look through my code and suggest whether it should be changed somehow. I tested it with a single bcf. The largest map I have (~4 GB when saved uncompressed as ripple) is parsed in ~6-7 minutes (on my overheating AMD Turion II 2100 MHz laptop processor). I expect better results on more powerful machines. However, as hyperspy's main aim is FIB-based sliced hypermaps, which tend to be small and have low counts, I specifically thought about that and added downsampling and energy cut-off arguments to the parser. That way, high-resolution imagery (the Bruker beam-controlling board and Esprit software are capable of 4096x4096 pixels) and hypermapping can be done simultaneously: the imagery can be read at full resolution and the hypermap at a sensible downsampled resolution. As I tested, downsampling adds no (<1%) overhead from the slow numpy in-place addition on a selection:

numpy_array[i] += value  # in-place addition is ~3 times slower than assignment
numpy_array[i] = value  # and even plain assignment is slow

So, for example, with a downsample ratio of 2, every 2x2 block of pixels is squeezed into 1. The slow assignment or "in-place" addition to a numpy slice is then done 4 times per pixel, but there are 4 times fewer pixels in the downsampled array, so it cancels out. A downsampled hypermap takes just 2 times less memory instead of 4 times, because a Python integer can't be added in-place to a numpy slice of unsigned type: the final arrays have to be signed (uint8 -> int16, uint16 -> int32, ...), and the wider dtype doubles the bytes per channel, eating half of the 4x pixel saving. A sketch of this bookkeeping follows below.
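
A self-contained sketch of that accumulation scheme (my own illustration with made-up sizes, not the PR code):

import numpy as np

ratio = 2
src = np.random.randint(0, 256, (4, 4, 8), dtype=np.uint8)  # y, x, energy
# signed and wider than the source, for the reasons described above
dst = np.zeros((4 // ratio, 4 // ratio, 8), dtype=np.int16)

for y in range(src.shape[0]):
    for x in range(src.shape[1]):
        # four in-place additions per output pixel, but four times
        # fewer output pixels, so the cost cancels out
        dst[y // ratio, x // ratio] += src[y, x]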
Another optional attribute/feature, the cut-off by energy, I think could also be very valuable. Bruker's Esprit software only allows setting the energy range to 10, 20, 40 or 80 kV. I am not very experienced with TEM (if somebody has a Bruker on a TEM, testing of this library is highly appreciated), but on SEM the 0-15 kV range is most commonly useful, as it lets you resolve some overlaps (Y, Nb, Zr, P). The 15-20 kV region has little value for us, given our SEM acceleration voltage limits and the lack of elements in geological specimens for that region; in hypermaps with a 20 kV range it normally contains nothing but noise for us. Another situation I came up with is FIB slicing, where on some SEMs the imaging is done at low voltages (3-7 kV) (InLens detector and ESB (energy selective BSE) mounted in the beam column). The cut-off value will also be useful for stacking hypermaps, where it can be set to the same value for all imported hypermap files: instead of padding all slices to one of the Esprit-set channel ranges (10, 20, ...), it can be set to some sane value, saving 10-40% of memory.
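
A small illustration of that saving (the numbers are mine, not from the PR): with a 20 kV range spread over 2048 channels, cutting at 15 kV keeps 1536 channels and drops a quarter of every spectrum.

import numpy as np

n_channels, e_max_kV, cutoff_kV = 2048, 20.0, 15.0
keep = int(n_channels * cutoff_kV / e_max_kV)   # 1536 of 2048 channels
# allocating only the kept channels up front is what saves the memory
hypermap = np.zeros((256, 256, keep), dtype=np.int16)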

There is something that worries me. If I test my library in ipython:
1. create an instance of the BCF_parser class and assign it to a variable
2. del the instance, or rebind the variable (which pointed to the instance) to 0

the memory is not released and garbage collected... if I import gc and do gc.collect(), it then gets released. I will probably have to use gc in the file_reader function to cope with this... It probably points to some badly designed references of mine :/ which I can't find, so the default gc behaviour does not clean the memory automatically. Or does this have something to do with ipython, as I found reported in many places...?
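
A self-contained illustration of that behaviour (a deliberately broken toy class, not a claim about the actual cause in BCF_parser): an accidental reference cycle keeps the object alive past del, and only the cycle collector frees it.

import gc

class Parser:
    def __init__(self):
        self.buf = bytearray(10**7)  # stand-in for the parsed data
        self.owner = self            # an accidental reference cycle

p = Parser()
del p         # the refcount never reaches zero because of the cycle
gc.collect()  # the cycle detector finds and frees it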

of bcf insanely packed array depacking to numpy
@sem-geologist sem-geologist changed the title Bcf implementation NEW: bruker's composite file (bcf) hypermap io (read-only) library Feb 28, 2016
cdef struct Val16:
    uint16_t val

cdef struct Val32:
    uint32_t val
Member (review comment):

Can you not use ctypedef uint32_t Val32?

@sem-geologist (Contributor Author, review reply):

I will try and see if that works... I noticed some bugs in some non-essential parts of the python code (night work...) and just updated...
I used structs because I find it much easier and cleaner to cast a pointer to a struct than to ctypedef'ed variables... those would still have to be accessed with the Val32[0] notation instead of Val32.val (probably; I just tried removing those structs and '.val', and it no longer compiles), and the casting gets complicated... I am very inexperienced with C and Cython, and the python syntax is quite easy to mix up... I will see how this goes.

and metadata wrapping to the file_reader function.
@francisco-dlp (Member):

Could you update this with the current master?

@francisco-dlp (Member):

Let us know when this is ready to merge.

@sem-geologist (Contributor Author) commented Jul 12, 2016

@francisco-dlp yesterday I looked through "bye bye record_by". While I renamed record_by where it was used not as intended in the kwargs, it is still used inside the dictionary which is passed to the hyperspy loader. Can you confirm that it can stay in the dict, or should I change it (is there some alternative/new API to tell the loader that this array is an image and that array is an EDS signal raster/whatever)?

which accidentally slipped in with the merge with master
@francisco-dlp (Member):

If your reader doesn't read anything with signal_dimension > 2 or signal_dimension == 0, it can carry on using record_by internally; record_by is removed from the metadata afterwards. This is the code. I took this approach because, although record_by is not general enough for hyperspy, it is general enough for most file formats. In this way we avoid the danger of introducing bugs by touching all the io readers...
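
For what it's worth, a hedged sketch of what carrying on with record_by internally looks like in a reader's output dictionaries ('image' and 'spectrum' follow the historical convention; the shapes and surrounding keys here are illustrative only):

import numpy as np

signal_dicts = [
    {'data': np.zeros((512, 512)), 'record_by': 'image'},         # SEM image
    {'data': np.zeros((64, 64, 1024)), 'record_by': 'spectrum'},  # EDS hypermap
]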

@sem-geologist (Contributor Author):

I don't get which number of dimensions is what: is an image (BSE, SEI, whatever; 1 channel) 2 dimensions (x, y)?
Is an EDS hypermap 3 dimensions (x, y, Energy)? I can also think of 2 dimensions as a cross-section (x, Energy), and one dimension as EDS from a single point (Energy). What is 0 dimensions? If I have this right, then my hypermaps are >2 and I am in trouble... how do I tell the loader what is what?

@francisco-dlp (Member):

You got it right. A signal dimension of 0 would be a scalar, e.g. mapped across a 2D space:

>>> scalar = hs.signals.BaseSignal(np.random.random((32, 32)))
>>> scalar.axes_manager.set_signal_dimension(0)
>>> scalar
<BaseSignal, title: , dimensions: (32, 32|)>

as this gives headaches in tests, and something changed in master
that is hard to grasp with my little brain...
@sem-geologist (Contributor Author) commented Jul 12, 2016

@francisco-dlp I am very confused: the bcf tests can't pass and complain about record_by. I got rid of 'record_by' completely (also in the dictionary which is passed to the loader) and the test is still failing. What changed in the API, and what do I need to change to make it work again?

@francisco-dlp (Member):

Notice that #1127 has not been merged yet, that's probably the issue: you can't get rid of record_by without #1127.

@sem-geologist (Contributor Author) commented Jul 12, 2016

@francisco-dlp I see, it is not merged. However, I tried merging the bcf implementation with #1127 on a separate branch: the previously failing record_by tests are gone... BUT it breaks the scale!

AssertionError: 0.010001 != 4.368392531e-05 within 12 places

which looks to me like the axes got mixed up again (instead of the x or y scale it took the energy scale); I had a problem with this before, when finalizing this io_plugin... :/ (maybe because it does not know what is what, as we don't tell it anymore...)

@francisco-dlp (Member):

That makes sense. I think that for your reader it is ok to keep record_by as it doesn't need the more general approach. I would simply revert the latest commit.

to dictionary,
fix test
@francisco-dlp (Member):

This is still failing. We'll probably release HyperSpy 1.0 today or early tomorrow; any chance of fixing this before the release? If not, no worries: HyperSpy 1.1 will probably be released in a few weeks anyway, given the number of almost-ready PRs in the pipeline.

@sem-geologist (Contributor Author):

Did #1127 get merged? I tested this with it merged on a separate branch (fix_bcf) and it works. If the record_by removal is merged now, then afterwards I can update this to master and it should stop throwing errors without any more tweaking. Basically it is ready for review, as nothing is going to change.

@francisco-dlp (Member):

I just merged #1127.

@francisco-dlp (Member):

Once Travis and Appveyor are satisfied, I'll happily merge this.

@sem-geologist (Contributor Author):

Sorry for that, I somehow managed to miss the decimal places in the number comparison, so wasted testing cycles... especially since osx and appveyor are expensive at the moment. I should start wearing glasses...

@sem-geologist (Contributor Author):

@francisco-dlp, appveyor and travis just passed :). Landscape has probably hung.
Do not forget the documentation in #934, which I updated and merged with master today.

@francisco-dlp (Member):

Reverse engineering this format is an amazing achievement. Thanks a lot for contributing your work to HyperSpy.

@sem-geologist (Contributor Author):

Indeed, it gives a lot of satisfaction. I now look at any binary format with a bit of hope :D that one day... I will rule it too.

@sem-geologist sem-geologist deleted the bcf_implementation branch March 4, 2017 10:25