Would anybody be interested in read-only support for Bruker's bcf format? #597
Comments
Adding reading support for Bruker bcf files would be really nice. I think that @pburdet intended to implement this at some point but I don't know how far he got. This is certainly within the scope of this project, as the inability to read proprietary data formats is very often one of the main obstacles that people experience when using open data analysis packages such as HyperSpy. |
Being able to read bcf-files into HyperSpy would be very nice. I've got a little experience with making various data files loadable in HyperSpy, so I could probably help a little bit :) |
Hmm, I wonder how far @pburdet got with this; I will wait for his reply before starting. Mostly I do very large mappings (we planned to do 3D, because our lab also has a FIB, but the inability of the Bruker Esprit(R) software to save to any format other than bcf in automatic mode turned us away from even trying: sitting at the SEM and clicking every 2-5 minutes makes little sense). However, in the geology field we do a lot of large mappings (10x10 cm) and I wrote some primitive software (I plan to publish it on git when it gets more stable) which stitches together tiles (e.g. 50x40) from the txt files saved by Bruker; however, the peak interference problems are not solved this way, as the txt files are obtained not from deconvolution but from energy regions. That is why I started to look into bcf. So how do I do that?: |
This seems related: http://probesoftware.com/smf/index.php?topic=274.0 |
@magnunor, it is rather not: the topic you brought up is about the raw format (they call it lispix), or ripple, which is already fully implemented in hyperspy, whereas I am talking here about bcf, which is not the same. You can save an acquired hypermap in this raw format only for a single manual hypermap (in the Bruker Esprit software), but you can't do that in automation ("jobs" in Esprit) mode (for example 3000 mapping tiles), which is the main stopper for doing FIB tomography / 3D EDS with Bruker spectrometers. Unless you are keen on sitting at the SEM and pushing buttons every 2-5 minutes for half a day to a few days.... |
The first problem I ran into when designing the library: how should I return items from bcf? bcf contains not just the hypermap, but also EDS and SEM imagery (BSE, SE, etc...). I looked in hyperspy/io_plugins and I see that every plug-in returns a list with one dictionary inside. In the case of bcf, should a list with 3 dictionaries inside be returned? |
@sem-geologist, I wasn't aware there were so many different formats; I'm so used to Digital Micrograph, which basically saves everything in a single file format (.dm4) :) I'm not really sure how best to implement everything, @pburdet or @francisco-dlp will probably have some better ideas, but some thoughts: I don't have access to any .bcf files, could you share an example .bcf file? It doesn't have to be any real data. With regards to how to return the data from a bcf-file containing several signals (hypermap, BSE, ...): currently HyperSpy doesn't have functionality for "projects", so each signal should be in a separate object. So it can probably return a list of signals found in the bcf-file. "Project" functionality has been discussed, but is (probably) still a bit far away: #576 How big are the datasets you're working on? |
hmm.. I tried to look into the dm4 plugin to get an idea of how to return those parameters... generally bcf also contains everything. The nice side of that is that the mapping always fits the SEM imagery, and imagery is always saved in 16 bits. Imagery, the EDS sum spectrum and metadata are held in the xml part of the container. Those parts can be read plainly (no fiddling with big/little endianness). The hyperspectral part is complicated. The smallest data chunks are sometimes just a nibble (half a byte) and reversed inside the byte (I was really confused by that little-endian stuff at a different scale; I had always thought that swapping the halves happens in 2- or 4-byte chunks). The good thing is that it has instructions in the binary data stream. I guess that means that mmap'ing it like ripple is out of the question. To get a useful hypermap the thing has to be read nibble by nibble. This will probably result in slow... slow.. slow.... performance. I am thinking about where to share an example bcf. The problem is that it is not enough to examine 1 bcf; you have to examine a bunch, with different conditions: sizes, intensities, resolutions... and if you really want to decode it, or check that you are doing it right, you also need raw/rpl files from the same mapping, so storage space is an issue. Datasets... in geology they are big. In the extreme case I had ~2500 tiles at 512x478 pixels/tile. The good part is that the final picture is 2D, and for browsing the results I think GIS systems perform best. However, I have started to notice a terrible performance drop lately, and have started considering whether to create a lightweight app with pyqtgraph and hdf5 only. @francisco-dlp, what is the procedure for getting a code contribution into this project? |
@sem-geologist, currently only the FEI emi reader returns multiple signals; this is the relevant code. As you mention, before going any further I would wait until @pburdet joins the conversation. Our workflow for contributions is the standard one on github: fork the project, clone your fork, create a branch to work on your new feature and, once you have a prototype, send a pull request. The pull request should be sent to our master branch if it is a new feature and to our maintenance branch (0.8.x currently) otherwise. It is better to send the pull request as soon as you have some code in your branch that you would like to show to others for discussion. Once the code is ready and passes the review of at least one member of the development team we'll merge it into HyperSpy. |
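For illustration, returning several signals just means returning several dictionaries. A minimal sketch (the dictionary keys follow the convention of the existing io_plugins; every bcf-specific value below is a placeholder, not the real format):

import numpy as np

def file_reader(filename, *args, **kwargs):
    """Hypothetical skeleton of a bcf reader: one dictionary per signal
    found in the container (hypermap, BSE, SE, ...)."""
    signals = []
    # ... parse the sfs container and decode each signal here ...
    for title, array in [('BSE', np.zeros((478, 512), dtype=np.uint16)),
                         ('EDX', np.zeros((2048, 478, 512), dtype=np.uint16))]:
        signals.append({'data': array,
                        'metadata': {'General': {'title': title}},
                        'original_metadata': {}})
    return signals  # a list with one dictionary per signal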
I got in contact with people at Bruker. They were interested to see their format read by open source software. They provided me the following program that can read their format. The source is open (the main file is 'Sample source\MainUnit.pas'). It could help @sem-geologist to understand the bcf format further. The few times I was exposed to this format, I used Esprit to transform the map into rpl. I didn't seriously investigate further. |
Oh, stupid me... like in western films: "shoot first, then ask the questions..." so I REd it first instead of asking Bruker if they would be interested... |
This is what I feared... it is only partly open-source. BrukerDataImport(64).dll is not open. The open source part is just an example of how to use that library in Delphi. I am not surprised, and that's why I didn't ask Bruker for help, as I noticed at first that bcf is a container in the first place: a container which uses Delphi(tm) and 'AidAim Software'(tm) proprietary SFS code, and I strongly doubt Bruker has no NDA covering that part of the code. The above can be interpreted as: yet another packaging pattern that is even more complicated. |
These example bcf files in the shared zip are making me crazy:
So it looks like I am chasing a moving target... Any ideas how to deal with this? |
Here's a decent routine for unpacking 12-bit data in Python. I'm sure some of the steps can be improved, but the skeleton is there.

import numpy as np

def unpack12(data, sb_in=True, sb_out=True):
"""Unpack packed 12-bit data to 16-bit uints. 'sb_in' and 'sb_out'
determines whether the byte order should be switched on input
and output respectively.
"""
if sb_in:
switched_i2 = np.fromstring(data.tostring(), dtype='i2').byteswap()
data = np.fromstring(switched_i2.tostring(), dtype=np.uint8)
# Expand by map [0, 1, 1, 2, 3, 4, 4, 5, etc.].
# This can probably be optimized to one operation by intelligent indexing
    expanded = np.empty(len(data) * 4 // 3, dtype=np.uint8)
expanded[0::4] = data[0::3]
expanded[1::4] = expanded[2::4] = data[1::3]
expanded[3::4] = data[2::3]
# Reinterpret expanded as 16-bit
exp16 = np.fromstring(expanded.tostring(), np.uint16,
count=int(len(expanded)/2))
if sb_out:
exp16 = exp16.byteswap()
exp16[0::2] >>= 4 # Shift every second short by 4
exp16 &= np.uint16(0x0FFF) # Mask upper 4-bits on all shorts
return exp16
def test():
    test_data = np.array(
[0x70, 0x02, 0x02, 0x2D, 0x2D, 0xD0, 0xD0, 0x02, 0x02, 0x2E, 0x2E,
0xE0, 0xE0, 0x02, 0x02, 0x2E, 0x2E, 0xE0], dtype=np.uint8)
out = unpack12(test_data)
expected = [0x27, 0x2d, 0x2d, 0x2d, 0x2d, 0x2e,
0x2e, 0x2e, 0x2e, 0x2e, 0x2e, 0x2e]
assert(np.all(out == expected))
# Print output
print([hex(s) for s in out])
test() |
Thanks @vidartf, your approach with numpy is an order of magnitude faster than mine with bit-wise operations on integers. All our old / version 1 bcf files contain an 'EDSDatabase' directory with two files [HeaderData, SpectrumData0]. The bcf from the demo has the same directory, but with more files: [HeaderData, SpectrumData0, SpectrumData1, SpectrumPositions0, SpectrumPositions1]. Another difference is that the sfs containers use different chunk sizes. If I open an old bcf with the demo program and save it, the saved bcf is version 2: it gets the additional file SpectrumPositions0, and the chunk sizes are as in the new type of bcf. I thought about how to do it right (probably I did it badly). I created an initial library which reads sfs and prepares for fetching data: it traverses the file, gets the list of files inside, fetches the chunk tables and constructs the tree (a dictionary). What I am wondering is: what is the purpose of the empty SpectrumData1 and SpectrumPosition[x] files? Would it be a good idea to post the code here, or should I wait until I have all the working pieces (I fear it will be big) which return the appropriate dictionary like the other plugins? update: I understood what the SpectrumPosition[x] files are used for. They are an extension to the SpectrumData[x] file: the file contains pointers to pixel start positions in SpectrumData, so that to show spectra at given pixels only part of the data needs to be fetched. I can safely ignore it, as this is a Delphist's approach to the problem. |
I long ago moved to python3. For python2 it probably needs to be minimally adjusted. It is probably a hardly readable mess... I thought this could go into the misc folder of hyperspy after modifications.

# -*- coding: utf-8 -*-
# Copyright 2015 Petras Jokubauskas
# and
# Copyright 2015 Hyperspy project
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This library is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with any project and source this library is coupled with.
# If not, see <http://www.gnu.org/licenses/>.
#
# This python library provides basic reading capability of sfs proprietary
# AidAim Software(tm) files.
#
######################################################################
#Reverse engineering of this file format should be considered as FAIR USE:
# The main reason behind RE is interoperability of Bruker composite files (*.bcf,
# *.pan saved with Esprit software) with open source packages/software allowing
# much more advanced data manipulation and analysis.
#
######################################################################
### Introduction ###
#SFS is a proprietary single file system developed by AidAim Software(tm),
# natively used with Delphi(tm), a draconically over-expensive and useless language.
# This library tries to implement minimal reading capabilities based on
# reverse engineering (RE), and may become useless with future format versions
# if the mentioned developers introduce critical changes.
#
# This library is very basic and can read just the un-encrypted type of sfs.
# Observation of numerous bcf's shows that, if compression is used, it is zlib.
# However, other types of sfs containers could use different compression
# methods which are unknown to the developer of this library.
#
#STRUCTURE:
# All non-string data are little-endian.
# The data in an sfs container are placed in chunks and can be highly fragmented;
# every chunk begins with a 32-byte header.
# The header of the first chunk is special as it contains:
# * 8-bytes filled with string 'AASFSSGN' (probably means
# AidAimSingleFileSystemSiGNature).
# * 8-bytes some kind of signature which differs for files generated on the different
# machines.
# * 4-bytes -- size of the chunks.
# * the meaning of the rest of the values in this header is still unknown.
# After this header there are some unknown values, but the value at 0x148
# (4 bytes, uint32) is the total number of chunks in the sfs container.
#
# The SFS tries to mimic the file system, thus it contains tree/list of files,
# file table (order and index of chunks with the file) and data.
# The initial file tree/list starts at 5th chunk (0x4118 or 0x40118 or 0x400118...).
# If the file tree is larger than a chunk, it is continued in another chunk;
# in such a case the chunk header begins with the 4-byte integer chunk_index
# where the list is continued, otherwise the beginning of the header is filled
# with 00's, ff's or some number higher than the max chunk index.
#
# The tree/list is constructed so:
# * 0x200 total address space/item:
# * the first 0xe0 bytes contain non-string information about the item:
# * 4 bytes - the chunk index with the file chunk table (or the beginning of it)
# (if it is 0xffffffff -- it is a dir (and has no data) or the file is empty)
# * 8 bytes - the length of the file (0 if the preceding 4 bytes are ff's)
# * 8 bytes + 8 bytes + 8 bytes -- 3 identical values in freshly made files,
# most probably "times" (creation, modification, something else?)
# * 4 bytes of unknown; observed values 0, 10... (guess: file permissions,
# most known file systems have such a feature)
# * 4 bytes - the child of item (index, where first item 0);
# if value == 0xffffffff, it is child of root
# * 0xb0 bytes filled with 00's (what a waste of space)
# * last 4 bytes (uint32) can be 0 or 1:
# * 0 - item is a file
# * 1 - item is a directory
# * second part (next 0x120 bytes) -- string -- the file/dir name
# (this probably means that file/dir name should be <= 288 chars)
# # THE NOTE: the items in chunks are always aligned with the beginning of the
# # chunk (+ 32 bytes of header). The chunk size, after removal of the 32-byte
# # header and division by 512 (0x200), will have a remainder, which simply has
# # to be skipped. Most commonly the remainder is of size 0x1e0 (512 - 32), but
# # it can be different if the chunk size is not divisible by 512. The simplest
# # practice is to read (chunksize - 512) bytes from the chunk after the header.
#
# The file table is constructed so:
# * the 32-byte header, where the first 4 bytes contain the index of the next
# chunk if the table exceeds the space of 1 chunk, else ff's, 00's or
# something higher than the max index.
# * ordered chunk indexes (4 bytes / index)
#
# Compression (if used) is applied just to the data part
import os
import numpy as np
import struct
# helper functions:
def bin_tree_to_lists(data):
"""unpacks sfs list/tree data into dictionary.
args: data (the tree/list binary/string continuous data with 512 bytes per item)
returns: 5 lists (file table, size, parent, is_dir (boolen), name)"""
    # Due to the possibility of child items appearing before their parents in
    # the sfs list, before constructing a tree-like structure we need to sort
    # out the child/parent relations; thus the different kinds of data are
    # fetched into separate lists. The function has lots of room for optimization....
# initializing dictionaries:
tab_pointers = [] # index of the chunk with file table (or beginning of it)
size = [] # in bytes
#create_time = []
#mod_time = []
#wtf_time = []
#permissions = []
parent = [] # if 0xffffffff, then file/dir is directly in root
    is_dir = []  # boolean
name = []
#get data into lists:
item = 0
table_pointer = struct.unpack('<I', data[0:4])
while table_pointer[0] != 0:
tab_pointers += table_pointer
size += struct.unpack('<Q', data[4+item*0x200:12+item*0x200])
#create_time += struct.unpack('<d', data[12+item*0x200:20+item*0x200])
#mod_time += struct.unpack('<d', data[20+item*0x200:28+item*0x200])
#wtf_time += struct.unpack('<d', data[28+item*0x200:36+item*0x200])
#permissions += struct.unpack('<I', data[36+item*0x200:40+item*0x200])
parent += struct.unpack('<I', data[40+item*0x200:44+item*0x200])
is_dir += struct.unpack('?', data[0xDC+item*0x200:0xDD+item*0x200])
name.append(data[0xE0+item*0x200:0x200+item*0x200].strip(b'\x00'))
item += 1
table_pointer = struct.unpack('<I', data[item*0x200:4+item*0x200])
return tab_pointers, size, parent, is_dir, name
def items_to_dict(file_tables, size, parent, is_dir, name, last_size):
## sort items into dictionary:
tree0 = {}
## check if there is any tree structure or list is flat:
    n_parents = np.sort(np.unique(np.fromiter(parent, dtype='uint32')))
    ## uint32 is needed: root-level items carry the 0xFFFFFFFF parent marker
if len(n_parents) > 1: # there is tree
for i in range(0,len(is_dir),1):
if parent[i] == 0xFFFFFFFF:
if is_dir[i]:
tree0[name[i]] = {}
else:
tree0[name[i]] = {'size': size[i],
'pointers': file_tables[i],
'last_size': last_size[i]}
if len(n_parents) == 2: #*.bcf's known till 2015.09
for i in range(0,len(is_dir),1):
if parent[i] != 0xFFFFFFFF:
if is_dir[i]:
tree0[name[parent[i]]][name[i]] = {}
else:
tree0[name[parent[i]]][name[i]] = {'size': size[i],
'pointers': file_tables[i],
'last_size': last_size[i]}
        else:  # unknown, but possible stuff
# if you need it, add the code there
print('Deep trees are Not Implemented')
else: #list is flat
for i in range(0,len(is_dir),1):
if is_dir[i]:
tree0[name[i]] = {}
else:
tree0[name[i]] = {'size': size[i],
'pointers': file_tables[i],
'last_size': last_size[i]}
return tree0
# main functions:
def get_sfs_file_tree(filename):
""" Function scans sfs type file and returns dictionary with file/dir structure.
    The function fetches just the metadata, which is significantly smaller
    than the data and can safely be fetched whole into memory.
args: filename
returns: dictionary.
"""
with open(filename, 'rb') as fn:
a = fn.read(8)
if a == b'AAMVHFSS': # check: is file SFS container?
            # useful kind of "key" used in non-pointing chunk headers:
fn.seek(0x116)
thingy = fn.read(2)
# get the chunk size:
fn.seek(0x128)
chunksize = struct.unpack('<I', fn.read(4))[0]
chunk_usable = chunksize - 32
# get value, how many chunks are in sfs file?
fn.seek(0x148)
n_of_chunks = struct.unpack('<I', fn.read(4))[0]
tree_address = 4 # initial tree address
# jump to tree/list addr. and check the header
fn.seek(chunksize * tree_address + 0x118)
next_chunk = struct.unpack('<I', fn.read(4))[0]
if (next_chunk < n_of_chunks) and (next_chunk != 0):
#list/tree exceeds the chunk
fn.seek(28, 1)
list_data = fn.read(chunksize - 512)
tree_address = next_chunk
fn.seek(chunksize * tree_address + 0x118)
next_chunk = struct.unpack('<I', fn.read(4))[0]
#check for chain of chunks with list:
while (tree_address != next_chunk) and (
next_chunk < n_of_chunks) and (
next_chunk != 0):
fn.seek(28, 1)
list_data += fn.read(chunksize - 512)
tree_address = next_chunk
fn.seek(chunksize * tree_address + 0x118)
next_chunk = struct.unpack('<I', fn.read(4))[0]
else:
                    #if the chunk header points to itself or has invalid
                    # values, it means it is the last chunk
fn.seek(28, 1)
list_data += fn.read(chunksize - 512)
else:
#list/tree fits into 1 chunk
fn.seek(28, 1)
list_data = fn.read(chunksize - 512)
tab_pointers, size, parent, is_dir, name = bin_tree_to_lists(list_data)
file_tables = []
last_size = []
for i in range(0, len(is_dir), 1):
if tab_pointers[i] != 0xFFFFFFFF:
size_in_last_chunk = size[i] % (chunk_usable)
if size_in_last_chunk == 0:
last_size.append(chunk_usable)
else:
last_size.append(size_in_last_chunk)
table_entries = -( -size[i] // (chunk_usable)) #getting ceiling
table_size = -( -table_entries // (chunk_usable // 4)) # in chunks
fn.seek(chunksize * tab_pointers[i] + 0x118)
if table_size > 1:
j = 0 # chunk counter
next_chunk = tab_pointers[i]
temp_table = b''
while j != table_size:
fn.seek(chunksize * next_chunk + 0x118)
next_chunk = struct.unpack('<I', fn.read(4))[0]
fn.seek(28, 1)
temp_table += fn.read(chunk_usable)
j += 1
else:
fn.seek(chunksize * tab_pointers[i] + 0x138)
temp_table = fn.read(chunk_usable)
table = np.trim_zeros(np.fromstring(temp_table, dtype='uint32'),
'b') * chunksize + 0x138
file_tables.append(table)
else:
file_tables.append(None)
last_size.append(None)
tree = items_to_dict(file_tables, size, parent, is_dir, name, last_size)
return tree
else:
raise RuntimeError('file is not SFS container') |
Thanks, I tried your script. I get that:

>>> get_sfs_file_tree(fn)
{'EDSDatabase': {'HeaderData': {'last_size': 2615,
'pointers': array([ 24888, 28984, 33080, ..., 7475512, 7479608, 7483704], dtype=uint32),
'size': 7399095},
'SpectrumData0': {'last_size': 515,
'pointers': array([ 7541048, 7545144, 7549240, ..., 122216760, 122220856,
122224952], dtype=uint32),
'size': 113682787}}}

What is the next step to get the data? |
@pburdet, this is already a bit outdated. It worked on bcf, but was not practical on other sfs containers (for example *.pan, Bruker's particle/features format). SFS (bcf is SFS) is a file system, thus it has a file tree and file tables. The function get_sfs_file_tree returned not the data but the file tree and file tables. The file tree can then be used to read the sfs (or bcf) at the provided pointers, joining the chunks into a binary string representing one of the files. This outdated function also didn't return the chunk size. It worked reasonably well with bcf (as you see in your tested example of a version 1 BCF, which contains just 2 files in the EDSDatabase directory), but very slowly with *.pan files, which can have thousands of files. That was impractical, as calling the function would parse the whole file. Thus I changed it so that the function returns just one pointer to the first chunk of the file table. The file tables are later fetched separately in other functions, so if you want to get the file 'HeaderData' you parse the bcf just for it and do not fetch the other file tables as in the old function. I have made the final version of this sfs extractor library, which works reasonably fast and returns data as io.BytesIO. I tested it on both python3 (my main working environment) and python2, with different files, even up to 700MB (the mappings which take a whole weekend). Another thing: I figured out the compression and implemented it. This is of course just the first step to extract data from the sfs container. |
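To illustrate the lazy approach, a sketch under the assumptions that the data chunks are uncompressed and the chunk table fits in one chunk (the function name and signature are made up for this example; the 0x138 offsets follow the code earlier in the thread):

import io
import struct

def get_file_from_sfs(filename, table_pointer, file_size, chunksize):
    """Sketch: fetch a single file from an sfs container by reading only
    that file's chunk table, instead of parsing every table up front."""
    usable = chunksize - 32  # payload bytes per chunk after the 32-byte header
    n_chunks = -(-file_size // usable)  # ceiling division
    with open(filename, 'rb') as fn:
        # assumption: the chunk table itself fits in one chunk
        fn.seek(chunksize * table_pointer + 0x138)
        indexes = struct.unpack('<%dI' % n_chunks, fn.read(4 * n_chunks))
        data = b''
        for idx in indexes:
            fn.seek(chunksize * idx + 0x138)
            data += fn.read(min(usable, file_size - len(data)))
    return io.BytesIO(data)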
Thanks, it seems to work. I'll be happy to get the parsing part. For submitting changes to hyperspy, you are doing things right: fork hyperspy and create a branch, as you have done. Now you can open a pull request, even if it is not ready. We can then start discussing/reviewing/testing/improving your code from the pull request. |
I created a repository with some bcf files (version 1). There it is. If somebody has the possibility to generate version 2 bcf files (from Esprit 2.x), you are welcome to push them to that repository. There are 3 hypermaps: 2 generated on the demo version of Esprit 1.9.4 with a 4 by 3 pixel image size, the other is a real-life mapping.
I didn't upload the files which were shared above with the demo pascal code. They are version 2 bcf. Data in them are mostly constructed instructively + 16bit packed data per count.
I mentioned that I have a working parsing example, but it is shamefully slow. I allocate the array with numpy.zeros((number_of_channels, height, width), dtype=depth) and then parse the data stream pixel by pixel, populating the final array.
data = unsfs.getFileFromTheSFS('P50_situation.bcf', 'EDSDatabase/SpectrumData0')

The speed of this step is proportional to the IO speed of the disk and memory.
def loop_the_data(data):
data.seek(0)
height, width = struct.unpack('<ii', data.read(8))
data.seek(0x1A0) # skipping some kind of header, meaning is unknown
for line_cnt in range(0, height, 1): # or xrange() in python2
#line_head contains number of non-empty pixels in line
line_head = struct.unpack('<i', data.read(4))[0]
for pix_cnt in range(0, line_head, 1):
#x_index of the pixel:
x_pix = struct.unpack('<i', data.read(4))[0]
#number of max channel with counts > 0; twice
chan1 = struct.unpack('<H', data.read(2))[0]
chan2 = struct.unpack('<H', data.read(2))[0]
# some placeholder 0x0000 0C00. always!
data.seek(4, 1)
# packing types:
# flag = 1 -- 12bit per address of 1 count
# flag = 2 or 3 -- instructively packed +
            # 16bit per address of 1 count (optionally)
flag = struct.unpack('<H', data.read(2))[0]
# if flag > 1, data_size1 > 0 and data_size1 == data_size2
data_size1 = struct.unpack('<H', data.read(2))[0]
# number of pulses in 1count/address type data:
n_of_pulses = struct.unpack('<H', data.read(2))[0]
# data_size2. always >0
data_size2 = struct.unpack('<H', data.read(2))[0]
data.seek(2, 1) # always 0x0000
# check the gathered header info and
# decide what to do:
if flag == 1: # and (chan1 != chan2)
                data.seek(data_size2, 1)  # skipping to next pixel/line
else:
data.seek(data_size2-4, 1) #skip to end of data - 4bytes
if n_of_pulses > 0:
# some additional data for the same pixel recorded
# as the pulses.
# additional data size:
add_s = struct.unpack('<I', data.read(4))[0]
data.seek(add_s, 1)
                else:  # if there are no additional pulses, jump to
data.seek(4, 1) # next pixel or line
    return None

I measured execution time with the %%timeit magic in ipython3 with different files.
*I can't share it, it is too big. It could be better, but generally it is not bad. You probably noticed that before I jump to the next pixel I read the length of the pixel data from the pixel header: in other words, to get to a wanted pixel (x, y) of the mapping I need to parse the mapping line by line, pixel by pixel, up to that wanted pixel. I guess that's why Bruker in version 2 bcf added the SpectrumPositions0 file to the sfs/bcf container (the trailing digit matches the one of the corresponding SpectrumData file). That file begins with height and width, and then there is a table of offset values pointing to every pixel in the SpectrumData[x] file, one 4-byte int value per pixel. If the value is 0xFFFFFFFF (-1 as an integer), the pixel is empty. Version 1 bcf has no such feature. Note this, I will come back to it later.
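As a sketch, reading such a positions table could look like this, assuming the layout is exactly as just described (two int32's for height and width, then one int32 offset per pixel; the function name is mine, not Bruker's):

import numpy as np
import struct

def read_spectrum_positions(data):
    """Sketch: parse a SpectrumPositions[x] stream into a (height, width)
    array of byte offsets into the matching SpectrumData[x] file.
    -1 (0xFFFFFFFF) marks an empty pixel."""
    data.seek(0)
    height, width = struct.unpack('<ii', data.read(8))
    table = np.frombuffer(data.read(4 * height * width), dtype='<i4')
    return table.reshape(height, width)

# usage sketch (file names as in the container tree above):
# pos = read_spectrum_positions(
#     unsfs.getFileFromTheSFS('map.bcf', 'EDSDatabase/SpectrumPositions0'))
# empty = pos == -1  # boolean mask of empty pixels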
def bin_to_numpy(data, max_channels, depth):
data.seek(0)
height, width = struct.unpack('<ii', data.read(8))
#create the empty numpy array hypermap called hm:
hm = np.zeros((max_channels, height, width), dtype=depth)
data.seek(0x1A0) # skipping some kind of header, meaning is unknown
for line_cnt in range(0, height, 1): # or xrange() in python2
#line_head contains number of non-empty pixels in line
line_head = struct.unpack('<i', data.read(4))[0]
for pix_cnt in range(0, line_head, 1):
#x_index of the pixel:
x_pix = struct.unpack('<i', data.read(4))[0]
#number of max channel with counts > 0; twice
chan1 = struct.unpack('<H', data.read(2))[0]
chan2 = struct.unpack('<H', data.read(2))[0]
# some placeholder 0x0000 0C00. always!
data.seek(4, 1)
# packing types:
# flag = 1 -- 12bit per address of 1 count
# flag = 2 or 3 -- instructively packed +
            # 16bit per address of 1 count (optionally)
flag = struct.unpack('<H', data.read(2))[0]
# if flag > 1, data_size1 > 0 and data_size1 == data_size2
data_size1 = struct.unpack('<H', data.read(2))[0]
# number of pulses in 1count/address type data:
n_of_pulses = struct.unpack('<H', data.read(2))[0]
# data_size2. always >0
data_size2 = struct.unpack('<H', data.read(2))[0]
data.seek(2, 1) # always 0x0000
# check the gathered header info and
# decide what to do:
if flag == 1: # and (chan1 != chan2)
                data.read(data_size2)  # skipping to next pixel/line
hm[:chan1, line_cnt, x_pix] = 1
else:
data.read(data_size2-4) #skip to end of data - 4bytes
hm[:chan1, line_cnt, x_pix] = 1
if n_of_pulses > 0:
# some additional data for the same pixel recorded
# as the pulses.
# additional data size:
add_s = struct.unpack('<I', data.read(4))[0]
data.read(add_s)
hm[:chan1, line_cnt, x_pix] += 1
                else:  # if there are no additional pulses, jump to
data.seek(4, 1) # next pixel or line
    return hm

I got these results (the first 2 used the %%time magic instead of %%timeit):

%%timeit
a = bin_to_numpy(data, 2048, np.uint16)
I was quite surprised that the huge bcf did not incur more overhead; I can't find an explanation for that.
So what is my strategy?:
|
I didn't go through your code in detail, but from a general point of view and a quick overview I can offer these comments and suggestions:
unpack_str = '<iHHiHHHHH'  # note: the byte-order character appears only once
unpack_bytes = struct.calcsize(unpack_str)  # 22
x_pix, chan1, chan2, _, flag, data_size1, n_of_pulses, data_size2, _ = \
    struct.unpack(unpack_str, data.read(unpack_bytes))

However, it might make the code a bit more unreadable, so check the profiling first. |
Regarding how to get the actual pixel data into the array, the 'easiest' alternative I see is this:

if flag == 1:
    hm[:chan1, line_cnt, x_pix] = np.frombuffer(data, count=1, dtype=depth)

However, I'm a bit unsure if I understand the pixel data. What lengths/types can it be? E.g. can it be a triplet for RGB, or will it always be a single value? What can |
The problem is that every pixel has a variable length; there can be empty pixels (simply skipped). Then, if there is any pixel information, it is packed by 3 possible methods: 1 - packed as 12-bit packed values; 2 - packed with instruction-described packing in small bits, for example: 0x00 0x25 would mean the next 0x25 channels are coded with 0 nibbles/channel, result: [0,0,0,0...] of 0x25 length; 0x01 0x05 0x09 0x8f 0x65 0x04 -> the next 5 channels are coded with 1 nibble/channel, + 9 as an addition to those channels, result: [f+9, 8+9, 5+9, 6+9, 4+9] (the last 0 is discarded); and 3 is like 2, but can have 16-bit packed values (similar to the 12-bit packing), impulses/channel, at the end of the stream. The shape of the pixel is recorded, however every pixel has a different shape. dtype=depth is not, and has to be decided before parsing. Maybe it is recorded in some mess in the data stream header from 0xF to 0x1A0, but I could not find it. The same goes for the shape, maybe it is in the mentioned header. My strategy is to use the EDS sum spectrum in the xml part, divide it by height and width, and use the double 0kV peak (Bruker has a special peak at 0kV generated by the processing unit, which is comparable with intense peaks) to find what the max intensities could be and choose the dtype for the main array. For the shape there is no such info in the xml part; it can be retrieved only after looping through all pixel headers, reading and finding the max value. |
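To make the instruction-coded case concrete, here is a toy decoder for a single run following my example above. The [nibbles-per-channel, channel count, additive base, packed nibbles] layout and the low-nibble-first order are my reading of the format, so treat it purely as an illustration, not the real bcf parser:

def decode_nibble_run(stream, pos):
    """Toy decoder for one instruction-coded run following the example
    above: [nibbles_per_channel, n_channels, base, packed nibbles...].
    Handles only the 0-nibble and 1-nibble cases."""
    nibbles_per_chan = stream[pos]
    n_channels = stream[pos + 1]
    if nibbles_per_chan == 0:
        return [0] * n_channels, pos + 2  # zero-coded run, no payload
    base = stream[pos + 2]
    pos += 3
    n_bytes = (n_channels + 1) // 2
    nibbles = []
    for byte in stream[pos:pos + n_bytes]:
        nibbles.append(byte & 0x0F)         # low nibble first...
        nibbles.append((byte >> 4) & 0x0F)  # ...then high nibble
    # drop the trailing padding nibble (the "discarded 0") if present:
    return [n + base for n in nibbles[:n_channels]], pos + n_bytes

# the worked example from the comment above:
vals, _ = decode_nibble_run(bytes([0x01, 0x05, 0x09, 0x8F, 0x65, 0x04]), 0)
print([hex(v) for v in vals])  # ['0x18', '0x11', '0xe', '0xf', '0xd']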
Well, from what you are saying, I interpret that it should be possible to preallocate a numpy ndarray of the correct dimensions/type. If you then pass this array + pixel indices to your unpack functions, you can read the unpacked values directly into the final array. |
Theoretically yes, |
In my 2nd approach I did something like this:

for y in range(height):
    # check the line header code
    for pixel in range(non_empty_pixels_in_line):
        # find out which pixel this is, and from it, where the next pixel starts
|
I would like to update the library, but I am not sure about python3 vs python2. I wrote it on python3; it was not working on python2 (the problem of simple unpacking of byte strings), thus I added some code bits which work on both py2 and py3 but add a bit of overhead. I marked them (with the comment py2) and commented out the py3-only pieces (in case hyperspy switches to python3 completely in the future). The library is rewritten completely (some pieces of code were reused) from functions to classes (I am so happy I took the classes about object-oriented programming :), it is much easier to do this in an OO way). It contains some very handy general classes, such as SFS_reader with SFSTreeItem, which can be used for general unpacking of SFS-packed data. Then there are Bruker and Bruker-hypermap specific classes, and finally there is a class inheriting and using all of the above: BCF_reader. In the meanwhile I am working on a cython version of the hypermap parsing function, which should speed up the parsing; however, will it be possible to use it in hyperspy? It would require compiling during installation. |
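To make the layering concrete, a rough sketch of how such classes might hang together; everything here except the names SFS_reader, SFSTreeItem and BCF_reader is an illustrative guess, not the actual library:

import io

class SFSTreeItem(object):
    """One file inside an sfs container; fetches its own data lazily."""
    def __init__(self, name, pointer, size, sfs):
        self.name, self.pointer, self.size, self.sfs = name, pointer, size, sfs

    def get_as_BytesIO(self):
        # read only this item's chunk chain from the container:
        return io.BytesIO(self.sfs.read_chunks(self.pointer, self.size))

class SFS_reader(object):
    """Generic sfs container: parses the file tree into SFSTreeItems."""
    def __init__(self, filename):
        self.filename = filename
        self.vfs = {}  # path -> SFSTreeItem, filled by a tree parser

    def read_chunks(self, pointer, size):
        ...  # follow the chunk table, decompress (zlib) if needed

class BCF_reader(SFS_reader):
    """bcf-specific layer: knows the EDSDatabase layout on top of sfs."""
    def parse_hypermap(self):
        data = self.vfs['EDSDatabase/SpectrumData0'].get_as_BytesIO()
        ...  # pixel-by-pixel parsing as in the functions earlier in the thread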
I think that we shouldn't merge this until travis properly runs the tests on the cython code. Many projects use cython (see e.g. sklearn) and manage to get their tests running in travis. So should we... Also, I think that distributing the C code is unnecessary since it can be generated by cython. We should distribute hyperspy using wheels, and therefore the end-user shouldn't need to compile anything nor install Cython anyway. And talking about wheels, we should make sure that we can easily produce wheels for all relevant platforms (mac and windows?), possibly before merging this. @vidartf, would it be possible to ask appveyor to create windows wheels automatically on release? And how can we produce wheels for mac, can this be done from linux? |
@francisco-dlp : You can get AppVeyor to make wheels quite easily, yes. Just add
(or possibly I have no experience with wheels for Mac. |
That is the whole point of distributing the C code: it does not require cython as a dependency, just any C compiler (the same as for python itself, so on windows it is MSVC, on linux gcc, and so on...). The cython manual even warns that cython can change and previously proper code can become obsolete, but not the C, thus it is highly recommended to distribute the C along with the pyx; the setup.py building the extensions is also suggested to compile from C by default (look at http://docs.cython.org/src/reference/compilation.html in the "Distributing Cython modules" section). The only thing I would change is that I would add to setup.py the possibility to force the use of cython (but it should not be the default) if a USE_CYTHON flag is passed. I guess for most windows users it is of little relevance, as they are using binaries anyway. However, the generated C should be left in the source. (I know it takes nearly 1 MB. Maybe I will get the courage one day to rewrite it in clean C and wrap it through cython.) What are wheels? Binaries? It is not that travis or appveyor doesn't work with cython (both compile it nicely and place it where it is supposed to go), it is the testing that doesn't. |
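For reference, a minimal sketch of the pattern from that Cython docs section, adapted to this case (the module path hyperspy/io_plugins/unbcf_fast is my assumption for where the extension would live):

# setup.py fragment: compile from the shipped .c by default, and only
# regenerate it with Cython when USE_CYTHON is requested.
import os
from setuptools import setup
from setuptools.extension import Extension

USE_CYTHON = bool(os.environ.get('USE_CYTHON'))

ext = '.pyx' if USE_CYTHON else '.c'
extensions = [Extension('hyperspy.io_plugins.unbcf_fast',
                        ['hyperspy/io_plugins/unbcf_fast' + ext])]

if USE_CYTHON:
    from Cython.Build import cythonize
    extensions = cythonize(extensions)

setup(name='hyperspy', ext_modules=extensions)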
@sem-geologist, the full setup and testing procedure is written in the log. In particular, this line runs the actual tests:

python continuous_integration/nosetest.py --with-coverage hyperspy

The actual "run tests" step is the last line. |
Oh, ok, so this is nosetest behind it. Now I at least know what to search for. Thanks |
Yes, you even import it yourself! :) |
I thought it was just a toolset of handy assert functions :D |
I think the problem is that nosetest tests everything inside the source folder. Installation, however, copies the *.so or *.pyd files to their final destination. I think I should inject something there |
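For example (an assumption on my side that the tests indeed run from the source tree), building the extension in place before invoking the test runner should make the compiled *.so/*.pyd visible to nosetest:

python setup.py build_ext --inplace
python continuous_integration/nosetest.py --with-coverage hyperspy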
@sem-geologist, sorry that I didn't explain myself. I agree that we should distribute the binaries (with wheels) or C files (with source distributions) when packaging HyperSpy. However, I don't think that we should distribute the C files in the github repository. In other words, the C/binary files should be generated on packaging i.e. when running |
@vidartf, thanks for the appveyor lines. It seems that travis can run and deploy for macos too. I have no experience whatsoever with mac though. |
http://eng.hroncok.cz/2016/03/28/pytest-mac might help, though I have never used it. Or, apparently, travis can do it too. |
So where is the place for those C files? And when and where exactly would they be generated (it should not be on the customer's side)? |
No, they'll be generated on packaging e.g. when producing the source distribution "python setup.py sdist". So, the code to generate the C files must be in |
ok.. the existing code is ok (it compiles c extension during |
No, it does |
But we'll modify it to make |
So for appveyor we'll do as @vidartf showed. |
I tested locally and observed that nosetest fails with cython if there are no compiled files in the source tree, and if it is compiled then my additional cython-vs-python tests succeed. |
why? |
Because the tests will fail? (It was written "on release", so until release it would fail.) |
Appveyor successfully builds unbcf_fast in the testing environment and the tests pass. Travis does not want to build_ext --inplace. Now I am trying to put
@francisco-dlp, how is it that setuptools is in the requirements, but setup.py uses distutils.core instead? |
@sem-geologist, thanks for spotting the issue with setuptools. Probably the issue was caused by an incorrect merge. It should be setuptools. Would you like to send a separate PR to correct it? |
Regarding the C file issue, you're right, Cython should generate it whenever |
Yes, I will make a separate PR for setuptools, so it gets quite complicated. |
Then, install may require Cython in this case :)
It seems so. However, there are a lot of well established packages that may have found a good solution. I suggest having a look at what others are doing. |
@francisco-dlp, I did a separate PR to change distutils to setuptools: it looks like the old warning about unrecognised 'install_requires' and 'setup_requires' is gone, but 3 new warnings appeared (look in the travis logs): #949. |
It seems that the main issue with #949 is that |
That is no more: it was due to a copy-paste of some piece of code from another branch... (my bad habit) |
The #949 got merged, so this can be closed. |
I think I have studied Bruker's *.bcf files enough (I spent half a week in hex) and figured out how the stuff works. How many people would be interested in functionality for importing data straight from bcf? Would the authors of this project be interested, or does this contradict the aims of the project (because bcf is a proprietary format)? I am going to write the library anyway (as in the European Union we have the right to RE and open our data) and I wish it to be useful to this project (by the way, I have some piece of code for reading Bruker's EDS...). However, I am not very experienced (I am a geologist in the first place and program just as a hobby), I am average in python and I do not know C or C++. How many people would be interested, would it be useful for hyperspy, and could I get a helping hand or some advice?