
Would anybody be interested in the bruker's bcf format read-only functionality #597

Closed
sem-geologist opened this issue Jul 16, 2015 · 93 comments

@sem-geologist
Contributor

I think I have studied Bruker's *.bcf files enough (spent half a week in a hex editor) and figured out how they work. How many people would be interested in functionality for importing data straight from bcf? Would the authors of this project be interested, or does this contradict the aims of the project (because bcf is a proprietary format)? I am going to write a library anyway (in the European Union we have the right to reverse engineer and open our own data) and I wish for it to be useful to this project (by the way, I already have some code for reading Bruker's EDS...). However, I am not very experienced (I am a geologist in the first place and program just as a hobby), I am average in Python and I do not know C or C++. How many people would be interested, would it be useful for HyperSpy, and could I get a helping hand or some advice?

@francisco-dlp
Member

Adding reading support for Bruker bcf files would be really nice. I think that @pburdet intended to implement this at some point but I don't know how far he got.

This is certainly within the scope of this project, as the inability to read proprietary data formats is very often one of the main obstacles that people experience when using open data analysis packages such as HyperSpy.

@francisco-dlp francisco-dlp added this to the Wish list milestone Jul 16, 2015
@magnunor
Contributor

Being able to read bcf-files into HyperSpy would be very nice.

I've got a little experience with making various data files loadable in HyperSpy, so I could probably help a little bit :)

@sem-geologist
Contributor Author

Hmm, I wonder how far @pburdet is with this; I will wait for his reply before starting. Mostly I do very large mappings (we planned to do 3D, because our lab also has a FIB, but since the Bruker Esprit(R) software cannot save to any format other than bcf in automatic mode, that put us off even trying; sitting by the SEM and clicking every 2-5 minutes makes little sense). However, in the geology field we do a lot of large mappings (10x10 cm), and I wrote some primitive software (which I plan to publish on git when it gets more stable) that stitches tiles (e.g. 50x40) from the txt files saved by Bruker. However, peak interference problems are not solved that way, as the txt files come not from deconvolution but from energy regions. That is why I started to look into bcf.
Why do I want to make a modular library? I am not very happy with some of the parts used in HyperSpy.
For such large data sets I think matplotlib is a bit of an overkill, and from my experience pyqtgraph works more smoothly; that is why I am thinking about a library which could be reused in different projects. Anyway, I thought that duplication of work should not happen (our time is quite expensive in all our fields), and I wrote here because I think it could benefit everybody.

So how to do that?
Generally, Bruker uses two forms for writing data: xml (with base64, and optionally zlib compression) and SFS (single file system) as a container for the xml and binary data. In my opinion the proper way would be to write two libraries: one to read the SFS container, the other to decode the xml and binary parts. The SFS containers are used not only for hypermapping, but also for particle analysis (so-called "features" in the Esprit software). However, I have insufficient knowledge of how exactly to deal with SFS (particle data has many more parts in the container, and the number of parts varies). Extracting data from a bcf is much easier: it always contains just two "files"/pieces, and decoding the first (which is very easy) helps to get the second right. I thought about how to do it in the easiest way. If I went with two libraries (actually more than two, as the xml in the bcf contains the EDS and much other needed stuff), it could proceed like this: first extract those two files as objects in memory, and then decode them (the more memory-hungry way). The other approach would be to skip the idea of a proper SFS decoding library and instead decode the xml from the bcf in one homogeneous library, get from that decoded xml the data which helps to find the beginning of the hypermap spectrum data, and, using the right addresses, populate the table (as I understand, hdf5? or first to numpy and then to some more advanced container?) reading the binary part of the bcf in very small chunks.
update: I sorted out the SFS format.
Sorry for the long post.

@magnunor
Contributor

This seems related: http://probesoftware.com/smf/index.php?topic=274.0

@sem-geologist
Contributor Author

@magnunor, it is rather not; the topic you brought up is about the raw format (they call it lispix), or ripple, which is already fully implemented in HyperSpy, while I am talking here about bcf, which is not the same. You can save a single, manually acquired hypermap in that raw format (in the Bruker Esprit software), but you can't do that in automation (jobs in Esprit) mode (for example 3000 mapping tiles), which is the main blocker for doing FIB tomography / 3D EDS with Bruker spectrometers. Unless you are keen on sitting by the SEM and pushing buttons every 2-5 minutes for half a day to a few days....

@sem-geologist
Contributor Author

The first problem I ran into while designing the library: how should I return items from a bcf? A bcf contains not just the hypermap, but also the EDS and SEM imagery (BSE, SE etc...). I looked in hyperspy/io_plugins and I see that every plug-in returns a list with one dictionary inside. In the case of bcf, should a list with 3 dictionaries inside be returned?
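To make the question concrete, something like this minimal sketch is what I imagine: a list of dictionaries, one per signal. read_bcf_contents is a purely hypothetical placeholder, and the exact keys are just my guess from skimming io_plugins, so treat this as an assumption rather than a working plugin.

import numpy as np

def read_bcf_contents(filename):
    """Hypothetical stand-in for the real parser: yields (name, data, metadata)."""
    yield 'EDS hypermap', np.zeros((3, 4, 2048), dtype=np.uint16), {}
    yield 'BSE', np.zeros((3, 4), dtype=np.uint16), {}
    yield 'SE', np.zeros((3, 4), dtype=np.uint16), {}

def file_reader(filename, **kwds):
    # one dictionary per signal found in the bcf, all returned in a single list
    return [{'data': data,
             'metadata': {'General': {'title': name}},
             'original_metadata': meta}
            for name, data, meta in read_bcf_contents(filename)]

print(len(file_reader('dummy.bcf')))  # -> 3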

@magnunor
Contributor

@sem-geologist, I wasn't aware there were so many different formats, I'm so used to Digital Micrograph which basically saves everything in a single file format (.dm4) :)

I'm not really sure how to best implement everything, @pburdet or @francisco-dlp will probably have some better ideas, but some thoughts:

I don't have access to any .bcf files, could you share some .bcf-example file? Doesn't have to be any real data.
Could you also put the code you got for reading the bcf-files in a repository somewhere, or point to where we can find it?

With regards to how to return the data from a bcf-file containing several signals (hypermap, BSE, ...): currently HyperSpy doesn't have functionality for "projects", so each signal should be in a separate object. It can probably return a list of the signals found in the bcf-file. "Project" functionality has been discussed, but is (probably) still a bit far away: #576

How big are the datasets you're working on?

@sem-geologist
Contributor Author

hmm.. I tried to look into the dm4 plugin to get an idea of how to return those parameters... generally a bcf also contains everything. The nice side of that is that the mapping always fits the SEM imagery, and the images are always saved in 16 bits. Imagery, the EDS sum spectrum and the metadata are held in the xml part of the container. Those parts can be read plainly (no fiddling with big/little endianness). The hyperspectral part is complicated. The smallest data chunks are sometimes just a nibble (half a byte) and reversed inside the byte (I was really confused by that little-endian stuff at a different scale; I always thought that switching the halves happens in 2 or 4 byte chunks). The good thing is that it has instructions in the binary data stream. I guess that means that mmap'ing it like ripple is out of the question. To get a useful hypermap the thing has to be read nibble by nibble. This will probably result in slow... slow.. slow.... performance.
I still don't have a fully working code example; I am trying to figure out how to write it so that it would be modular enough for me to reuse. Today I already wrote the piece which finds the xml part and objectifies it (so at this stage, if I knew the exact geometry of the list-of-dictionaries format, I could already output the images and the EDS). I raised the issue here because yesterday I finished examining a whole bunch of bcf's in hex, figured out the meaning of most of the bits, and got ready to write the code. I simply wanted to know whether it would be useful here, so that instead of writing selfishly independent code I could make it compatible with HyperSpy. (I hate duplicating efforts, and who loves it?) This positive feedback made me consider trying to make it compatible with HyperSpy.
I would still like to hear @pburdet's opinion.

I am thinking about where to share example bcf's. The problem is that it is not enough to examine 1 bcf; you have to examine a bunch, with different conditions: sizes, intensities, resolutions... and if you really want to decode, or check that you are doing it right, you also need raw/rpl files from the same mapping, so storage space is an issue.

Datasets... in geology they are big. In the extreme case I had ~2500 tiles with 512x478 pixels/tile. The good part is that the final picture is 2D, and for browsing the results I think GIS systems have the best performance. However, I have started to notice some terrible performance drops lately, and started to consider whether to create a lightweight app with pyqtgraph and hdf5 only.

@francisco-dlp, what is the procedure for getting a code contribution into this project?

@francisco-dlp
Member

@sem-geologist, currently only the FEI emi reader returns multiple signals, this is the relevant code.

As you mention, before going any further I would wait until @pburdet joins the conversation.

Our workflow for contributions is the standard in github: fork the project, clone your fork, create a branch to work on your new feature and, once you have a prototype, send a pull request. The pull request should be sent to our master branch if it is a new feature and to our maintenance branch (0.8.x currently) otherwise. It is better you send the pull request as soon as you have some code in your branch that you would like to show to others for discussion. Once the code is ready and passes the review of at least one member of the development team we'll merge it into HyperSpy.

@pburdet
Contributor

pburdet commented Jul 20, 2015

I got in contact with people at Bruker. They were interested to see their format read by open source software. They provided me with the following program that can read their format.

The source is open (the main file is 'Sample source\MainUnit.pas'). It could help @sem-geologist to understand the bcf format further.

The few times I was exposed to this format, I used Esprit to transform the map into rpl. I didn't seriously investigate further.

@sem-geologist
Contributor Author

Oh, stupid me... like in western films: "shoot first, then ask the questions..." So first I reverse engineered, instead of asking Bruker whether they would be interested...
@pburdet, I have trouble downloading from the link; I tried on different machines. Could you share it in another form?

@sem-geologist
Contributor Author

This is what I feared... it is only partly open source. BrukerDataImport(64).dll is not open. The open-source part is just an example of how to use that library in Delphi. I am not surprised, and that's why I didn't ask Bruker for help: I noticed from the start that bcf is first of all a container. A container which uses Delphi(tm) and the proprietary 'AidAim Software'(tm) SFS code, and I strongly doubt Bruker would have no NDA for that part of the code.
And in the end it is good that I reverse engineered it :) . Another thing I observed during RE is that the data are very densely packed in sub-byte chunks, and of course not aligned. This is one of the "features" of Delphi which is completely discouraged in C.
And this is my main obstacle now.
I am thinking how to efficiently read such packed data and unpack it into Python integers.
I tried the struct and bitarray libraries... they would be perfect if the data were aligned...
Also, it needs to be fast.
I am trying to do it with bitwise operations.
An example of the depacking steps needed for one of the packing algorithms
(such a packing pattern is used when counts are low):
The raw binary stream of one pixel:
70 02 02 2D 2D D0 D0 02 02 2E 2E E0 E0 02 02 2E 2E E0 E0 02 ...
1st step, little-endian to big-endian:
02 70 2D 02 D0 2D 02 D0 2E 02 E0 2E 02 E0 2E 02 E0 2E 02 E0...
2nd step, cut every 3 nibbles:
027 02D 02D 02D 02D 02E 02E 02E 02E 02E 02E 02E 02E 0....

the above can be interpreted as:
channel 0x027 has 1 count,
channel 0x02D has 4 counts,
and channel 0x02E has 8 counts

Another packing pattern is even more complicated:
it has an index of the first signal and then instruction/count patterns with variable lengths (from at least 2 bytes down to 1 nibble, according to my observations).
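For illustration, this is roughly how I picture turning such a decoded stream of 12-bit channel addresses into a per-pixel spectrum (a sketch only; the 2048-channel count is just an assumed example):

import numpy as np

# the decoded addresses from the example above: 1x 0x027, 4x 0x02D, 8x 0x02E
channel_addresses = np.array([0x027] + [0x02D] * 4 + [0x02E] * 8, dtype=np.uint16)
# the spectrum of this pixel is simply a histogram of the channel addresses
spectrum = np.bincount(channel_addresses, minlength=2048)
print(spectrum[0x027], spectrum[0x02D], spectrum[0x02E])  # -> 1 4 8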

@sem-geologist
Contributor Author

These example bcf's in the shared zip are making me crazy:

  1. The bcf's I analysed before come from 3 SEMs with Bruker detectors, and I thought that some things in these files could be safely simplified and treated with static addresses: in all the bcf files/containers from our labs that I have ever seen, the header part begins at address 0x4218. In the examples from the zip file it starts at 0x40218 (this looks very similar but there is an additional 0). Likewise, in our labs' bcf's the xml part always starts at address 0x6138; in the examples from this zip the xml starts at 0x60138.
  2. Generally the bcf's in the zip file look cleaner: e.g. the bcf's from our lab have unknown 32-bit pointers every 0x1000 bytes with a 0x118 offset (0x118, 0x1118, 0x2118....), which are always populated with some "random" noise. In the shared examples inside the zip those pointers are empty (all zeroes) unless they touch positions with real data.
  3. Our lab bcf's always contain just two components: the xml and the binary spectra database. In the zipped bcf's there are 5 components: the xml (header), two binary databases (one of which is empty), and some 'Spectrum position' x2 binary stuff which I have no idea what it does :/
     I have no idea how it is possible to save two databases inside one bcf??? It is impossible to do that through the Bruker Esprit software which we possess in our lab. @pburdet, when did you get those Bruker examples? The files which I used for RE were made with Esprit 1.9... I am wondering whether the zipped files are an older or a newer format (in the zip file I saw this ***64.dll, which I think means a newer implementation (our 1-year-old SEMs still use the 32-bit Esprit software)).

so it looks like I am chasing a moving target...

Any ideas how to deal with this?

@vidartf
Member

vidartf commented Jul 20, 2015

I am thinking how to efficiently read such packed data and unpack it into Python integers.
I tried the struct and bitarray libraries... they would be perfect if the data were aligned...
Also, it needs to be fast.
I am trying to do it with bitwise operations.

Here's a decent routine for unpacking 12-bit data in Python. I'm sure some of the steps can be improved, but the skeleton is there.

import numpy as np


def unpack12(data, sb_in=True, sb_out=True):
    """Unpack packed 12-bit data to 16-bit uints. 'sb_in' and 'sb_out'
    determine whether the byte order should be switched on input
    and output respectively.
    """
    if sb_in:
        switched_i2 = np.fromstring(data.tostring(), dtype='i2').byteswap()
        data = np.fromstring(switched_i2.tostring(), dtype=np.uint8)

    # Expand by map [0, 1, 1, 2, 3, 4, 4, 5, etc.].
    # This can probably be optimized to one operation by intelligent indexing
    expanded = np.ndarray(len(data) // 3 * 4, dtype=np.uint8)
    expanded[0::4] = data[0::3]
    expanded[1::4] = expanded[2::4] = data[1::3]
    expanded[3::4] = data[2::3]
    # Reinterpret expanded as 16-bit
    exp16 = np.fromstring(expanded.tostring(), np.uint16,
                          count=int(len(expanded)/2))
    if sb_out:
        exp16 = exp16.byteswap()
    exp16[0::2] >>= 4             # Shift every second short by 4
    exp16 &= np.uint16(0x0FFF)    # Mask upper 4-bits on all shorts
    return exp16


def test():
    test_data = packed = np.array(
        [0x70, 0x02, 0x02, 0x2D, 0x2D, 0xD0, 0xD0, 0x02, 0x02, 0x2E, 0x2E,
         0xE0, 0xE0, 0x02, 0x02, 0x2E, 0x2E, 0xE0], dtype=np.uint8)
    out = unpack12(test_data)
    expected = [0x27, 0x2d, 0x2d, 0x2d, 0x2d, 0x2e,
                0x2e, 0x2e, 0x2e, 0x2e, 0x2e, 0x2e]
    assert(np.all(out == expected))
    # Print output
    print([hex(s) for s in out])

test()

@sem-geologist
Contributor Author

Thanks @vidartf, your approach with numpy is an order of magnitude faster than mine with bit-wise operations on integers.
I was absent for a week on holiday, but I am getting back to the issue. After closer examination of our own bcf's, the bcf from the demo zip, and a bcf opened and re-saved by the demo program, I decided to implement the sfs reader (a bcf is a pure sfs container) as a separate library. I am no longer troubled about the different types of files. Technically the old and new bcf's are very close. I understand that Bruker released the new Esprit 2 (x64) (we use 1.9.4..., which is just 32-bit), and there are some differences.
Generally, both versions use the sfs container. This sfs container is a file system in one file. Many things are still not completely clear, but I think I grasped enough to get some functionality.

So all our old / version 1 bcf's contain an 'EDSDatabase' directory with two files [HeaderData, SpectrumData0]. The bcf from the demo has the same named directory, but more files: [HeaderData, SpectrumData0, SpectrumData1, SpectrumPositions0, SpectrumPositions1]. Another difference is that the sfs containers use different chunk sizes. If I open an old bcf with the demo program and save it, the saved bcf is version 2: it gets the additional file SpectrumPositions0, and the chunk size is as in the new type of bcf.
SFS is a file system: it divides and contains data in chunks, which can be extremely fragmented; thus, like many file systems, it contains tables which record where the chunks of a file are and the ordered sequence of chunks. It has a list/tree of files and directories, which points to the file chunk tables. The data can be compressed; if compression is applied, it is applied just to the data, not to the internal sfs structure (chunk headers, tables, lists). It looks like Bruker uses zlib compression, but to my knowledge sfs containers can also be compressed with bzip2 and some other, weirder compression.
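To make that concrete, reassembling one file from its ordered chunk data offsets boils down to something like this rough sketch (not the actual library code; it assumes the offsets already point just past each 32-byte chunk header):

def assemble_file(fh, data_offsets, chunksize, file_size):
    """Glue a file back together from the absolute offsets of its chunk payloads."""
    usable = chunksize - 32      # payload bytes per chunk (32-byte header skipped)
    pieces = []
    remaining = file_size
    for offset in data_offsets:
        fh.seek(offset)
        pieces.append(fh.read(min(usable, remaining)))
        remaining -= len(pieces[-1])
    return b''.join(pieces)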

I thought about how to do it right (probably I did it badly). I created an initial library which reads the sfs and prepares for fetching data: it traverses the file, gets the list of the files inside, fetches the chunk tables and constructs the tree (a dictionary).

What I am wondering is: what is the purpose of the empty SpectrumData1 and the SpectrumPosition[x] files?
The first idea was that the new Esprit software can do stacks, but the HeaderData file normally holds pictures (BSE, SEI) from just one slide. I am not sure our laboratory will ever upgrade to Esprit 2, and I have a dilemma about how future-proof the concept of the library should be. Another thing I am not sure about is my sfs reading library's approach of getting the metadata first and reading and combining the real data later. Oh and BTW, the demo program from Bruker is buggy: it doubles the EDS spectra every time a bcf is opened and saved as a different file.

Would it be a good idea to post the code here, or should I wait until I have all the working pieces (I fear it will be big) which return the appropriate dictionary like the other plugins do?

update: I understood what the SpectrumPosition[x] files are used for. They are an extension to the SpectrumData[x] file. The file contains pointers to the pixel start positions in the SpectrumData, so that to show the spectra of given pixels only part of the data needs to be fetched. I can safely ignore it, as this is the Delphi programmer's approach to the problem.

@sem-geologist
Contributor Author

I moved to python3 long ago. For python2 it probably needs to be minimally adjusted. It is probably a hardly readable mess... I thought this could go into the misc folder of hyperspy after modifications.

# -*- coding: utf-8 -*-
# Copyright 2015 Petras Jokubauskas
# and
# Copyright 2015 Hyperspy project
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This library is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with any project and source this library is coupled.
# If not, see <http://www.gnu.org/licenses/>.
#
# This python library provides basic reading capability of sfs proprietary
# AidAim Software(tm) files.
#
######################################################################
#Reverse engineering of this file format should be considered as FAIR USE:
#  The main reason behind RE is interoperability of Bruker composite files (*.bcf,
#   *.pan saved with Esprit software) with open source packages/software allowing
#   much more advanced data manipulation and analysis.
#
######################################################################
### Introduction ###
#SFS is a proprietary single file system developed by AidAim Software(tm),
# natively used with Delphi(tm), a draconically over-expensive language.
# This library tries to implement minimal reading capabilities based on
#  reverse engineering (RE), and could become useless if the mentioned
#  developers introduce critical changes to future versions of the format.
#
# This library is very basic and can read just the un-encrypted type of sfs.
# Observations of numerous bcf's show that, when compression is used, it is zlib.
# However, other types of sfs containers could use different compression methods
# which are unknown to the developer of this library.
#
#STRUCTURE:
# All non-string data are little-endian.
# The data in the sfs container are placed in chunks and can be highly fragmented;
#  every chunk begins with a 32-byte header.
# The header of the first chunk is special as it contains:
#   * 8-bytes filled with string 'AASFSSGN' (probably means
#    AidAimSingleFileSystemSiGNature).
#   * 8-bytes of some kind of signature which differs between files generated on
#     different machines.
#   * 4-bytes -- size of the chunks.
#   * the meaning of the rest of the values in this header is still unknown.
#   After this header there are some unknown values, but the value at 0x148 (4-bytes
#    (uint32)) holds the total number of chunks in the sfs container.
#    
# The SFS tries to mimic a file system, thus it contains a tree/list of files,
#   file tables (order and index of the chunks holding a file) and data.
# The initial file tree/list starts at the 5th chunk (0x4118 or 0x40118 or 0x400118...).
# If the file tree is larger than a chunk, it is continued in another chunk;
#  in that case the chunk header begins with the 4-byte integer chunk_index of where
#  the list is continued, otherwise the beginning of the header is filled with 00's,
#  ff's or some number higher than the max chunk index.
#    
# The tree/list is constructed so:
#  * 0x200 total address space/item:
#    * the first 0xe0 bytes contain non-string information about the item:
#      * 4 bytes - the chunk index with the file chunk table (or the beginning of it)
#          (if it is 0xffffffff -- it is a dir (and has no data) or the file is empty)
#      * 8 bytes - the length of the file (0 if the preceding 4 bytes are ff's)
#      * 8 bytes + 8 bytes + 8 bytes -- 3 identical values in freshly made files,
#          most probably "times" (creation, modification, something else?)
#      * 4 bytes of unknown; observed values 0, 10... (guess it is file permissions,
#          most known file systems have such a feature)
#      * 4 bytes - the index of the parent item (where the first item is 0);
#            if value == 0xffffffff, the item is a child of root
#      * 0xb0 bytes filled with 00's (what a waste of space)
#      * last 4 bytes (uint32) can be 0 or 1:
#        * 0 - item is a file
#        * 1 - item is a directory
#    * second part (next 0x120 bytes) -- string -- the file/dir name
#        (this probably means that a file/dir name should be <= 288 chars)
#     # THE NOTE: the items in chunks are always aligned with the beginning of the
#     #  chunk (+ 32 bytes of header). The chunk size, after removal of the 32-byte
#     #  header and division by 512 (0x200), will have a remainder, which has to be
#     #  simply skipped. Most commonly the remainder is of size 0x1e0 (512 - 32), but
#     #  it can be different if the chunk size is not divisible by 512. The simplest
#     #  practice is to read (chunksize - 512) bytes from the chunk after the header.
#
#     The file table is constructed so:
#     * a 32-byte header, where the first 4 bytes contain the index of the next chunk
#          if the table exceeds the space of 1 chunk, else ff's, 00's or something
#          higher than the max index.
#     * ordered chunk indexes (4 bytes / index)
#
#  Compression (if used) is applied just to the data part

import os
import numpy as np
import struct

# helper functions:

def bin_tree_to_lists(data):
    """Unpack sfs list/tree data into separate lists.

    args: data (the tree/list binary/string continuous data with 512 bytes per item)

    returns: 5 lists (file table, size, parent, is_dir (boolean), name)"""

    # due to the possibility of child items appearing before their parent in the sfs
    # list, before we construct a tree-like structure we need to sort out the
    # child/parent relation; thus the different kinds of data are fetched
    # into separate lists. The function has lots of space for optimization....

    # initializing the lists:
    tab_pointers = []  # index of the chunk with the file table (or beginning of it)
    size = []  # in bytes
    #create_time = []
    #mod_time = []
    #wtf_time = []
    #permissions = []
    parent = []  # if 0xffffffff, then the file/dir is directly in root
    is_dir = []  # boolean
    name = []

    #get data into lists:
    item = 0
    table_pointer = struct.unpack('<I', data[0:4])
    while table_pointer[0] != 0:
        tab_pointers += table_pointer
        size += struct.unpack('<Q', data[4+item*0x200:12+item*0x200])
        #create_time += struct.unpack('<d', data[12+item*0x200:20+item*0x200])
        #mod_time += struct.unpack('<d', data[20+item*0x200:28+item*0x200])
        #wtf_time += struct.unpack('<d', data[28+item*0x200:36+item*0x200])
        #permissions += struct.unpack('<I', data[36+item*0x200:40+item*0x200])
        parent += struct.unpack('<I', data[40+item*0x200:44+item*0x200])
        is_dir += struct.unpack('?', data[0xDC+item*0x200:0xDD+item*0x200])
        name.append(data[0xE0+item*0x200:0x200+item*0x200].strip(b'\x00'))
        item += 1
        table_pointer = struct.unpack('<I', data[item*0x200:4+item*0x200])

    return tab_pointers, size, parent, is_dir, name


def items_to_dict(file_tables, size, parent, is_dir, name, last_size):
    ## sort items into dictionary:
    tree0 = {}
    ## check if there is any tree structure or list is flat:
    n_parents = np.sort(np.unique(np.fromiter(parent, dtype='uint8')))
    ## if above will cause trouble, change it to uint16 or uint32
    if len(n_parents) > 1:               # there is tree
        for i in range(0,len(is_dir),1):
            if parent[i] == 0xFFFFFFFF:
                if is_dir[i]:
                    tree0[name[i]] = {}
                else:
                    tree0[name[i]] = {'size': size[i],
                                      'pointers': file_tables[i],
                                      'last_size': last_size[i]}
        if len(n_parents) == 2: #*.bcf's known till 2015.09
            for i in range(0,len(is_dir),1):
                if parent[i] != 0xFFFFFFFF:
                    if is_dir[i]:
                        tree0[name[parent[i]]][name[i]] = {}
                    else:
                       tree0[name[parent[i]]][name[i]] = {'size': size[i],
                                            'pointers': file_tables[i],
                                            'last_size': last_size[i]}
        else:  # unknown, but possible stuff
            # if you need it, add the code there
            print('Deep trees are Not Implemented')
    else:                               #list is flat
        for i in range(0,len(is_dir),1):
            if is_dir[i]:
                tree0[name[i]] = {}
            else:
                tree0[name[i]] = {'size': size[i],
                                  'pointers': file_tables[i],
                                  'last_size': last_size[i]}
    return tree0


# main functions:

def get_sfs_file_tree(filename):
    """ Function scans sfs type file and returns dictionary with file/dir structure.
    The function fetches just metadata which is significantly smaller than data,
    and safely can be fetch as whole into memory.

    args: filename

    returns: dictionary.
    """
    with open(filename, 'rb') as fn:
        a = fn.read(8)
        if a == b'AAMVHFSS': # check: is the file an SFS container?
            # useful kind of "key" used in non-pointing chunk headers:
            fn.seek(0x116)  
            thingy = fn.read(2) 
            # get the chunk size:
            fn.seek(0x128) 
            chunksize = struct.unpack('<I', fn.read(4))[0]
            chunk_usable = chunksize - 32
            # get value, how many chunks are in sfs file?
            fn.seek(0x148) 
            n_of_chunks = struct.unpack('<I', fn.read(4))[0]
            tree_address = 4 # initial tree address
            # jump to tree/list addr. and check the header
            fn.seek(chunksize * tree_address + 0x118)
            next_chunk = struct.unpack('<I', fn.read(4))[0]
            if (next_chunk < n_of_chunks) and (next_chunk != 0):
                #list/tree exceeds the chunk
                fn.seek(28, 1)
                list_data = fn.read(chunksize - 512)
                tree_address = next_chunk
                fn.seek(chunksize * tree_address + 0x118)
                next_chunk = struct.unpack('<I', fn.read(4))[0]
                #check for chain of chunks with list:
                while (tree_address != next_chunk) and (
                       next_chunk < n_of_chunks) and (
                       next_chunk != 0):
                    fn.seek(28, 1)
                    list_data += fn.read(chunksize - 512)
                    tree_address = next_chunk
                    fn.seek(chunksize * tree_address + 0x118)
                    next_chunk = struct.unpack('<I', fn.read(4))[0]
                else:
                    #if chunk header points to itself, or have non valid
                    # values, it means -- last chunk
                    fn.seek(28, 1)
                    list_data += fn.read(chunksize - 512)
            else:
                #list/tree fits into 1 chunk
                fn.seek(28, 1)
                list_data = fn.read(chunksize - 512)
            tab_pointers, size, parent, is_dir, name = bin_tree_to_lists(list_data)
            file_tables = []
            last_size = []
            for i in range(0, len(is_dir), 1):
                if tab_pointers[i] != 0xFFFFFFFF:
                    size_in_last_chunk = size[i] % (chunk_usable)
                    if size_in_last_chunk == 0:
                        last_size.append(chunk_usable)
                    else:
                        last_size.append(size_in_last_chunk)
                    table_entries = -( -size[i] // (chunk_usable)) #getting ceiling
                    table_size = -( -table_entries // (chunk_usable // 4)) # in chunks
                    fn.seek(chunksize * tab_pointers[i] + 0x118)
                    if table_size > 1:
                        j = 0  # chunk counter
                        next_chunk = tab_pointers[i]
                        temp_table = b''
                        while j != table_size:
                            fn.seek(chunksize * next_chunk + 0x118)
                            next_chunk = struct.unpack('<I', fn.read(4))[0]
                            fn.seek(28, 1)
                            temp_table += fn.read(chunk_usable)
                            j += 1
                    else:
                        fn.seek(chunksize * tab_pointers[i] + 0x138)
                        temp_table = fn.read(chunk_usable)
                    table = np.trim_zeros(np.fromstring(temp_table, dtype='uint32'),
                                          'b') * chunksize + 0x138
                    file_tables.append(table)
                else:
                    file_tables.append(None)
                    last_size.append(None)

            tree = items_to_dict(file_tables, size, parent, is_dir, name, last_size)

            return tree

        else:
            raise RuntimeError('file is not SFS container')

@pburdet
Contributor

pburdet commented Aug 21, 2015

Thanks, I tried your script. I get this:

>>> get_sfs_file_tree(fn)
{'EDSDatabase': {'HeaderData': {'last_size': 2615,
   'pointers': array([  24888,   28984,   33080, ..., 7475512, 7479608, 7483704], dtype=uint32),
   'size': 7399095},
  'SpectrumData0': {'last_size': 515,
   'pointers': array([  7541048,   7545144,   7549240, ..., 122216760, 122220856,
          122224952], dtype=uint32),
   'size': 113682787}}}

What's the next step to get the data?

@sem-geologist
Contributor Author

@pburdet, this is already a bit outdated. It worked on bcf, but was not practical on other sfs containers (for example *.pan -- Bruker's particle/features format). SFS (a bcf is SFS) is a file system, thus it has a file tree and file tables. The function get_sfs_file_tree returned not the data but the file tree and the file tables. The file tree can then be used to read the sfs (or bcf) at the provided pointers and join the pieces into a binary string representing one of the files. This outdated function also didn't return the chunk size. It worked reasonably well with bcf (as you see in your tested example of a version 1 BCF, which contains just 2 files in the EDSDatabase directory), but very slowly with *.pan files, which can hold a thousand files. That was not practical, as calling the function would parse the whole file. Thus I changed it so that the function returns just one pointer to the first chunk of the file table. The file tables are fetched later, separately, in other functions, so if you want to get the file 'HeaderData' you parse the bcf just for it and do not fetch the other file tables as the old function did. I have made the final version of this sfs extractor library, which works reasonably fast and returns data as io.BytesIO. I tested it on both python3 (my main working environment) and python2, with different files, even up to 700MB (mappings which take a whole weekend). Another thing: I figured out the compression and implemented it.
I forked hyperspy to my git account and I am trying to figure out the contribution system (I am very inexperienced with git). I would like to put the sfs reader into hyperspy/misc/io/unsfs.py, as these are just helper functions. You can check it out in the bcf_implementation branch of the hyperspy fork in my account, in the mentioned folder. The main function in the latest library is called 'getFileFromTheSFS()', where the first argument is the filename/path and the second the inner path to the file; i.e. to retrieve the xml part from a bcf this would look like this:
getFileFromTheSFS('/path/to/Your/bcf/file.bcf', 'EDSDatabase/HeaderData').
To retrieve the hypermapping part:
getFileFromTheSFS('/path/to/Your/bcf/file.bcf', 'EDSDatabase/SpectrumData0')
*(newer versions of bcf can also have SpectrumData1 and so on...)

This is of course just the first step to extract data from the sfs container.
I am working on parsing the data part, and have some working code, but it is so slow that I feel ashamed of it. I can get the numpy array out of the demo Sample.bcf in 1-2 minutes, and I don't want to tell you how long it takes on bigger files. I have it in a messy ipython notebook; I will put the code in my branch as soon as I get some more reasonable results and clean up the mess.

@pburdet
Contributor

pburdet commented Aug 27, 2015

Thanks, it seems to work. I'll be happy to get the parsing part.

For submitting changes to hyperspy, you are doing things right. Fork hyperspy and create a branch, as you have done. Now you can open a pull request, even if it is not ready. We can then start discussing/reviewing/testing/improving your code from the pull request.

@sem-geologist
Contributor Author

I created a repository with some bcf files (version 1). There it is. If somebody has the possibility to generate version 2 bcf's (from Esprit 2.x), you are welcome to push them to that repository. There are 3 hypermaps: 2 generated with the demo version of Esprit 1.9.4 with a 4 by 3 pixel image size, the other is a real-life mapping.

  • 4on3.bcf had a very short acquisition and demonstrates 12-bit packed data (1 count / 12 bits, where a record is the channel index of the EDX) with some completely empty pixels (0-2 counts/pixel).
  • 4on3full.bcf demonstrates a high-intensity mapping, where the data are stored in instructively packed arrays (data are recorded from channel 0 to the max channel; a record means the number of counts for a channel).
  • P50_situation.bcf is a real-life few-minute mapping with low-intensity and higher-intensity zones, thus the data in some pixels are packed as pulses/12-bit and in others instructively.

I didn't upload the files which were shared above with the demo pascal code. They are version 2 bcf's. Data in them are mostly constructed instructively + 16-bit packed data per count.

@sem-geologist
Contributor Author

I mentioned that I have a working parsing example, but it is shamefully slow.
After bumping my head against the wall I think I found the problem. I want to comment on my parsing attempts here, maybe it will help.
My first strategy for parsing was to create the final numpy array with

numpy.zeros((number_of_channels, height, width), dtype=depth)

and then parse the data stream pixel by pixel and populate the final array.
I learned the hard way that this is a very bad idea (or a very slow slow one).
I started to reduce the code and did a lot of timeit on different pieces of code.
Let's start from the simplest construction and benchmark it!

  • let's use the unsfs library (which I pushed) and get the data stream of the bcf:
data = unsfs.getFileFromTheSFS('P50_situation.bcf', 'EDSDatabase/SpectrumData0')

The speed of this step is proportional to IO speed of the disk and memory.

  • Now let's construct a very thin algorithm to go through the data stream pixel by pixel:
def loop_the_data(data):
    data.seek(0)
    height, width = struct.unpack('<ii', data.read(8))
    data.seek(0x1A0) # skipping some kind of header, meaning is unknown
    for line_cnt in range(0, height, 1):  # or xrange() in python2
        #line_head contains number of non-empty pixels in line
        line_head = struct.unpack('<i', data.read(4))[0]
        for pix_cnt in range(0, line_head, 1):
            #x_index of the pixel:
            x_pix = struct.unpack('<i', data.read(4))[0]
            #number of max channel with counts > 0; twice
            chan1 = struct.unpack('<H', data.read(2))[0]
            chan2 = struct.unpack('<H', data.read(2))[0]
            # some placeholder 0x0000 0C00. always!
            data.seek(4, 1)
            # packing types:
            # flag = 1 -- 12bit per address of 1 count
            # flag = 2 or 3 -- instructively packed +
            #             16-bit per address of 1 count (optionally)
            flag = struct.unpack('<H', data.read(2))[0]
            # if flag > 1, data_size1 > 0 and data_size1 == data_size2 
            data_size1 = struct.unpack('<H', data.read(2))[0]
            # number of pulses in 1count/address type data:
            n_of_pulses = struct.unpack('<H', data.read(2))[0]
            # data_size2. always >0
            data_size2 = struct.unpack('<H', data.read(2))[0]
            data.seek(2, 1) # always 0x0000
            # check the gathered header info and
            # decide what to do:
            if flag == 1:  # and (chan1 != chan2)
                data.seek(data_size2, 1) # skipping to next pixel/line
            else:
                data.seek(data_size2-4, 1) #skip to end of data - 4bytes
                if n_of_pulses > 0:
                    # some additional data for the same pixel recorded
                    # as the pulses.
                    # additional data size:
                    add_s = struct.unpack('<I', data.read(4))[0]
                    data.seek(add_s, 1)
                else:  # if there are no additional pulses, jump to
                    data.seek(4, 1)  # next pixel or line

    return None

I measured execution time with %%timeit magic in ipython3 with different files.

File                         Time
huge bcf (582.3MiB)*         4.44 s
P50_situation.bcf (72MiB)    1.15 s
Sample.bcf (14MiB)           248 ms
4on3full.bcf                 75.9 µs
4on3.bcf                     22.9 µs

*I can't share, it is too big.

It could be better, but generally it is not bad.
If I change the last 3 data skips ('data.seek') into data.read, the results are the same (within the deviation).

You probably realized that before I jump to the next pixel I read the length of the pixel data from the pixel header: in other words, to get to a wanted pixel (x, y) of the mapping I need to parse the mapping line by line, pixel by pixel, up to that pixel. And I guess that's why Bruker, in the version 2 bcf, added the SpectrumPositions0 file to the sfs/bcf container (the last character matches the count of the corresponding SpectrumData file). That file begins with height and width, and then there is a table with offset values pointing to every pixel in the SpectrumData[x] file, one 4-byte int value per pixel. If the value is 0xFFFFFFFF, or -1 as an integer, it means the pixel is empty. Version 1 bcf has no such feature. Note this, I will come back to it later.

  • So let's go back to my bad idea, and add the main array and modify it pixel by pixel. Let's not add any unpacking, just simulate adding the unpacked spectra to the main numpy array (hm) for every pixel:
def bin_to_numpy(data, max_channels, depth):
    data.seek(0)
    height, width = struct.unpack('<ii', data.read(8))
    #create the empty numpy array hypermap called hm:
    hm = np.zeros((max_channels, height, width), dtype=depth)
    data.seek(0x1A0) # skipping some kind of header, meaning is unknown
    for line_cnt in range(0, height, 1):  # or xrange() in python2
        #line_head contains number of non-empty pixels in line
        line_head = struct.unpack('<i', data.read(4))[0]
        for pix_cnt in range(0, line_head, 1):
            #x_index of the pixel:
            x_pix = struct.unpack('<i', data.read(4))[0]
            #number of max channel with counts > 0; twice
            chan1 = struct.unpack('<H', data.read(2))[0]
            chan2 = struct.unpack('<H', data.read(2))[0]
            # some placeholder 0x0000 0C00. always!
            data.seek(4, 1)
            # packing types:
            # flag = 1 -- 12bit per address of 1 count
            # flag = 2 or 3 -- instructively packed +
            #             16-bit per address of 1 count (optionally)
            flag = struct.unpack('<H', data.read(2))[0]
            # if flag > 1, data_size1 > 0 and data_size1 == data_size2 
            data_size1 = struct.unpack('<H', data.read(2))[0]
            # number of pulses in 1count/address type data:
            n_of_pulses = struct.unpack('<H', data.read(2))[0]
            # data_size2. always >0
            data_size2 = struct.unpack('<H', data.read(2))[0]
            data.seek(2, 1) # always 0x0000
            # check the gathered header info and
            # decide what to do:
            if flag == 1:  # and (chan1 != chan2)
                data.read(data_size2) # skipping to next pixel/line
                hm[:chan1, line_cnt, x_pix] = 1
            else:
                data.read(data_size2-4) #skip to end of data - 4bytes
                hm[:chan1, line_cnt, x_pix] = 1
                if n_of_pulses > 0:
                    # some additional data for the same pixel recorded
                    # as the pulses.
                    # additional data size:
                    add_s = struct.unpack('<I', data.read(4))[0]
                    data.read(add_s)
                    hm[:chan1, line_cnt, x_pix] += 1
                else:  # if there are no additional pulses, jump to
                    data.seek(4, 1)  # next pixel or line

    return hm

I got these results (the first 2 used the %%time magic instead of %%timeit):

%%timeit
a = bin_to_numpy(data, 2048, np.uint16)
File                         Time      size
huge bcf (582.3MiB)          29.6 s    987x730
P50_situation.bcf (72MiB)    33.1 s    512x384
Sample.bcf (14MiB)           2.86 s    221x166
4on3full.bcf                 147 µs    4x3
4on3.bcf                     42.7 µs   4x3

I was quite surprised that the huge bcf didn't get more overhead. I can't find an explanation for that.

  • And now imagine that every pixel used a function which returns the pixel spectrum as a numpy array.
    This was a very bad idea...

So what is my strategy? (a sketch of the first point follows after this list):

  • generate the missing SpectrumPositions file for version 1 files, or use the existing file in version 2.
  • with that, I think I could use my depacking functions inside list comprehensions and create the array from those lists in a final step, instead of numpy-arraying per pixel. Any suggestions?
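A sketch of the first point: a single pass with the same skip logic as loop_the_data() above could record where each non-empty pixel starts; the -1 convention for empty pixels mimics what I described for the SpectrumPositions files. Treat it as an illustration, not tested code.

import struct
import numpy as np

def build_pixel_index(data):
    """One pass over SpectrumData, recording the offset of every non-empty pixel."""
    data.seek(0)
    height, width = struct.unpack('<ii', data.read(8))
    index = np.full((height, width), -1, dtype=np.int64)  # -1 marks empty pixels
    data.seek(0x1A0)
    for line_cnt in range(height):
        n_non_empty = struct.unpack('<i', data.read(4))[0]
        for _ in range(n_non_empty):
            offset = data.tell()
            (x_pix, chan1, chan2, _, flag, data_size1,
             n_of_pulses, data_size2, _) = struct.unpack('<iHHiHHHHH', data.read(22))
            index[line_cnt, x_pix] = offset
            if flag == 1:
                data.seek(data_size2, 1)
            else:
                data.seek(data_size2 - 4, 1)
                if n_of_pulses > 0:
                    add_s = struct.unpack('<I', data.read(4))[0]
                    data.seek(add_s, 1)
                else:
                    data.seek(4, 1)
    return index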

@vidartf
Member

vidartf commented Aug 28, 2015

I didn't go through your code in detail, but from a general point of view and a quick overview I can offer these comments and suggestions:

  • For figuring out where time is spent while executing, I would recommend using a profiling tool instead of timeit. It basically tells you where you are spending your time. It is a lot easier to use once set up, and many python IDEs integrate a profiler so that it is a one-click action. I might also recommend pprofile or line_profiler, as they are line profilers which give you a line-by-line breakdown of the time spent.
  • Python is an interpreted language, which means there is quite a bit of overhead per code instruction compared to compiled code. For this reason it is preferable to vectorize operations instead of looping. For your code, I can't tell whether you're limited by I/O or code execution, but if a lot of time is spent on the minor reads before your conditionals I'd recommend collapsing the unpacking:
unpack_str = "<iHHiHHHHH"
unpack_bytes = struct.calcsize(unpack_str)    # 22
x_pix, chan1, chan2, _, flag, data_size1, n_of_pulses, data_size2, _ = \
    struct.unpack(unpack_str, data.read(unpack_bytes))

However, it might make the code a bit less readable, so check the profiling first.
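As a small follow-up, precompiling the format with struct.Struct avoids re-parsing the format string for every pixel; a toy sketch (the field values below are fabricated just to show the call):

import io
import struct

pixel_header = struct.Struct('<iHHiHHHHH')  # 22 bytes, the same fields as above

def read_pixel_header(stream):
    """Unpack one 22-byte pixel header from a binary stream in a single call."""
    return pixel_header.unpack(stream.read(pixel_header.size))

example = io.BytesIO(struct.pack('<iHHiHHHHH', 3, 100, 100, 0x0C00, 1, 0, 5, 12, 0))
print(read_pixel_header(example))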

@vidartf
Member

vidartf commented Aug 28, 2015

Regarding how to get the actual pixel data into the array, the 'easiest' alternative I see is this:

if flag == 1:
    hm[:chan1, line_cnt, x_pix] = np.frombuffer(data, count=1, dtype=depth)

However, I'm a bit unsure if I understand the pixel data. What lengths/types can it be? E.g. can it be a triplet for RGB, or will it always be a single value? What can depth be?

@sem-geologist
Contributor Author

The problem is that every pixel has variable length; there can be empty pixels (simply skipped). If there is any pixel information, it is packed with one of 3 possible methods: 1 - packed as 12-bit values; 2 - packed with instruction-described packing in small bits, for example: 0x00 0x25 -> would mean the next 0x25 channels are coded with 0 nibbles/channel - result: [0,0,0,0...] of 0x25 length; 0x01 0x05 0x09 0x8f 0x65 0x04 -> the next 5 channels are coded with 1 nibble/channel, + 9 as an addition to those channels - result [f+9, 8+9, 5+9, 6+9, 4+9] (the last 0 is discarded). And 3 is like 2, but can have 16-bit packed values (similar to the 12-bit packing) of impulse/channel at the end of the stream. The shape of a pixel is recorded, however every pixel has a different shape. dtype=depth is not recorded and has to be decided before parsing. Maybe it is recorded somewhere in the mess of the data stream header from 0xF to 0x1A0, but I could not find it. The same goes for the shape; maybe it is in the mentioned header. My strategy is to use the EDS sum spectrum in the xml part, divide it by height and width, and use the doubled 0 kV peak (Bruker has a special peak at 0 kV generated by the processing unit, which is comparable with intensive peaks) to find what the max intensities could be and choose the dtype for the main array. For the shape there is no such info in the xml part; this can be retrieved only after looping through all pixel headers, reading and finding the max value.
update:
The shape can actually be deduced from the sum EDS spectrum in the xml part, by looking up the last non-zero channel.
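To make method 2 more tangible, here is a rough sketch of my reading of it, covering only the 0-nibble and 1-nibble runs (the real format surely has more cases, so treat this as an illustration, not verified decoding):

def decode_run(data, pos):
    """Decode one run of the 'instructed' packing starting at data[pos]."""
    nibble_size = data[pos]     # how many nibbles encode each channel
    n_channels = data[pos + 1]  # how many channels this run covers
    pos += 2
    if nibble_size == 0:
        return [0] * n_channels, pos
    if nibble_size == 1:
        base = data[pos]        # the additive offset (0x09 in the example above)
        pos += 1
        values = []
        for _ in range((n_channels + 1) // 2):
            byte = data[pos]
            values.append((byte & 0x0F) + base)  # low nibble first
            values.append((byte >> 4) + base)    # then the high nibble
            pos += 1
        return values[:n_channels], pos          # trailing pad nibble discarded
    raise NotImplementedError('wider fields are not sketched here')

vals, _ = decode_run(bytes([0x01, 0x05, 0x09, 0x8F, 0x65, 0x04]), 0)
print([hex(v) for v in vals])  # -> ['0x18', '0x11', '0xe', '0xf', '0xd']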

@vidartf
Member

vidartf commented Sep 2, 2015

Well, from what you are saying, I interpret that it should be possible to preallocate a numpy ndarray of the correct dimensions/type. If you then pass this array + pixel indices to your unpack functions, you can read the unpacked values directly into the final array.
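A tiny illustration of that pattern (names and sizes are made up): the caller owns one preallocated array and the unpacking function writes into a view of it, so no per-pixel arrays get created:

import numpy as np

def unpack_pixel_into(out_slice, raw_bytes):
    """Decode one pixel's packed data directly into out_slice (a 1-D view).
    The real decoding logic would go here; this stub just copies bytes."""
    out_slice[:len(raw_bytes)] = np.frombuffer(raw_bytes, dtype=np.uint8)

hypermap = np.zeros((2048, 3, 4), dtype=np.uint16)     # preallocated once
unpack_pixel_into(hypermap[:, 0, 0], b'\x01\x02\x03')  # writes through the view
print(hypermap[:4, 0, 0])                              # -> [1 2 3 0]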

@sem-geologist
Contributor Author

Theoretically yes,
and that was my first approach,
but in practice the loops are killing the performance; looping through the array is slow.
I also thought about a generator, and making the numpy array from an iterator, but it resulted in slower execution. The latest code takes just below 30 s to load Sample.bcf (~14MiB, which is dominated by instructed data) and just below 30 s for P50_situation.bcf (~72MiB, which is dominated by 12-bit packed pixel data). The first code was more readable; the latest lost some readability, however I got more than a 50% speedup, which is nice, but I want more :). If I had to parse 300 slices I would need to wait a few hours... and sometimes I have around 1000 hypermaps.... that would take 8 hours! +/-50%. (A 700MiB bcf (instructed-data based) parses in 11 min.) I am wondering... should I upload two files to git, one for readability and a 2nd for speed?

@sem-geologist
Contributor Author

In my 2nd approach I did the following:

  1. precalculate a flat table of spectra positions if the bcf is version 1 (up to 2-4 s for a 700MiB file), else use the ready precalculated table.
  2. iterate through that table and populate the numpy array. Having this table, I no longer need such a loop in the main function:
for y in height:
    #check the line header code
    for pixel in range(nonemptypixels_from_line):
        #get to know which is the pixel, and after it get to know next pixel.
  3. Another speed-up is to not create numpy arrays for small chunks, as generating a numpy array has huge overheads. Instructed data are parsed using lists, because instructed data most commonly come in 5-15 channel long pieces and lists are faster in this case.
  4. reshape the completely flat array into the map at the end (see the sketch below).
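A sketch of point 4 (hypothetical names): build plain lists while parsing and make a single numpy array only at the very end.

import numpy as np

def spectra_to_hypermap(flat_spectra, height, width, n_channels, dtype=np.uint16):
    """flat_spectra: per-pixel count lists (each n_channels long), ordered row by
    row, with empty pixels already filled in as zero lists."""
    flat = np.fromiter((count for spectrum in flat_spectra for count in spectrum),
                       dtype=dtype, count=height * width * n_channels)
    return flat.reshape(height, width, n_channels)

# toy usage: a 2x2 map with 4 channels
toy = [[0, 1, 0, 2], [3, 0, 0, 0], [0, 0, 0, 0], [1, 1, 1, 1]]
print(spectra_to_hypermap(toy, 2, 2, 4)[0, 0])  # -> [0 1 0 2]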

@sem-geologist
Contributor Author

I would like to update the library, but I am not sure about python3 vs python2. I wrote it for python3; it was not working on python2 (the problem of simple unpacking of byte strings), so I added some bits of code which work on both py2 and py3 but add a bit of overhead, marked them (with the comment py2) and commented out the py3-only pieces (in case hyperspy switches to python3 completely in the future).
As I said, it is tested mostly on py3, and on py2 only very limited testing was done.

The library is completely rewritten (some pieces of code were reused) from functions into classes (I am so happy I took the classes on object oriented programming :), it is much easier to do this in an OO way). It contains some very handy general classes, SFS_reader with SFSTreeItem, which can be used for general unpacking of SFS-packed data. Then there are Bruker and Bruker-hypermap specific classes:
HyperHeader -- class parsing the header xml
EDXSpectrum -- class parsing the EDS from an objectified piece of xml
HyperMap -- optional class containing the hypermap(s) if it needs to be kept persistently.

and finally there is the class inheriting and using all of the above: BCF_reader,
which inherits the SFS container opening and reading from the SFS_reader class, then checks whether it is the Bruker hypermap format, and sets up and parses the header xml. It does not parse the hypermaps automatically; instead it presents a few methods to parse the hypermaps (persistently or not), which call separate functions (not methods) in the library to do that (bin_to_numpy).

In the meanwhile I am working on a cython version of the hypermap parsing function, which should speed up the parsing; however, will it be possible to use it in hyperspy? It would require compiling during installation.

@francisco-dlp
Member

francisco-dlp commented Apr 6, 2016

I think that we shouldn't merge this until travis runs properly the tests on the cython code. Many projects use cython (see e.g. sklearn) and manage to get their tests running in travis. So should we...

Also, I think that distributing the C code is unnecessary since it can be generated by cython. We should distribute hyperspy using wheels, and therefore the end-user shouldn't need to compile anything nor install Cython anyway.

And talking about wheels, we should make sure that we can easily produce wheels for all relevant platforms (mac and windows?), possibly before merging this. @vidartf, would it be possible to ask appveyor to create windows wheels automatically on release? And, how can we produce wheels for mac, can this be done from linux?

@vidartf
Member

vidartf commented Apr 6, 2016

@francisco-dlp : You can get AppVeyor to make wheels quite easily, yes. Just add

build:
    - python setup.py bdist_wheel

(or possibly python setup.py build_ext bdist_wheel). And then add the wheel to the artifacts (either by field in deploy: or by function as used for bundle). It might be possible to set up a second deployment that creates wheels for every push on master as well (excluding PRs), if that would be helpful.

I have no experience with wheels for Mac.

@sem-geologist
Contributor Author

That is the whole point of distributing the C code. It does not require cython in the dependencies, just any C compiler (the same as python, so on windows it is MSVC, on linux gcc and so on...). The cython manual even warns that cython can change and previously correct code can become obsolete, but not the C; thus it is highly recommended to distribute the C alongside the pyx, and setup.py's extension building is suggested to compile from C by default (look at http://docs.cython.org/src/reference/compilation.html in the "Distributing Cython modules" section). The only thing I would change is to add to setup.py the possibility to force using cython (but it should not be the default) if a USE_CYTHON flag were passed. I guess for most windows users it is of little relevance as they are using a binary anyway. However, the generated C should be left in the source. (I know it takes nearly 1 MB. Maybe I will get the courage one day to rewrite it in clean C and wrap it through cython.)

What are wheels? Binaries?

It is not that travis or appveyor doesn't work with cython (both compile it nicely and place it where it is supposed to be placed); the testing doesn't.
To make travis testing work with cython I need to know how the testing is done. (I mean, how does it work: does travis on its own search for the testing folder and iterate through it, or are there some other libraries which are called from travis and iterate through the tests?)

@to266
Contributor

to266 commented Apr 7, 2016

@sem-geologist the full setup and testing procedure is written in the log. In particular, this line runs the actual tests

python continuous_integration/nosetest.py --with-coverage hyperspy

Where the python is from anaconda that was set up in previous commands, obviously.

The actual "run tests" is the last line, run_exit(), which is just alias for this: https://nose.readthedocs.org/en/latest/api/core.html#nose.core.TestProgram

@sem-geologist
Contributor Author

Oh, ok, so this is nosetests behind it. Now at least I know what to search for. Thanks.

@to266
Contributor

to266 commented Apr 7, 2016

Yes, you even import it yourself ! :)

@sem-geologist
Contributor Author

I thought it was just a toolset of handy assert functions :D

@sem-geologist
Contributor Author

I think the problem is that nosetests tests everything inside the source folder. Installation, however, copies the *.so or *.pyd files to the final destination. I think I should inject python setup.py develop somewhere, which would build the *.so in place. Alternatively, python setup.py build_ext --inplace should do the trick.

@francisco-dlp
Member

@sem-geologist, sorry that I didn't explain myself. I agree that we should distribute the binaries (with wheels) or C files (with source distributions) when packaging HyperSpy. However, I don't think that we should distribute the C files in the github repository. In other words, the C/binary files should be generated on packaging i.e. when running setup.py. Does it make sense?

@francisco-dlp
Member

@vidartf, thanks for the appveyor lines. It seems that travis can run and deploy for macos too. I have no experience whatsoever with mac though.

@to266
Contributor

to266 commented Apr 7, 2016

http://eng.hroncok.cz/2016/03/28/pytest-mac might help, though I have never used it

Or, apparently travis can do it too.

@sem-geologist
Contributor Author

So where is the place for those C files? And when and where exactly would they be generated (it should not be on the customer's side)?

@francisco-dlp
Member

No, they'll be generated on packaging e.g. when producing the source distribution "python setup.py sdist". So, the code to generate the C files must be in setup.py

@sem-geologist
Contributor Author

ok.. the existing code is fine (it compiles the C extension during install, build_ext or develop). The code to generate the C is very simple and I will add it; however, when will it be invoked? Does travis do sdist?

@francisco-dlp
Member

No, it does install

@francisco-dlp
Member

But we'll modify it to make sdist and wheels on release too.

@francisco-dlp
Member

So we'll do it for appveyor as @vidartf showed.

@sem-geologist
Contributor Author

I tested locally and observed that nosetests fails with cython if there are no compiled files in the source tree; if it is compiled, then my additional cython-vs-python tests succeed.
I am going to add python setup.py build_ext --inplace (to the travis and appveyor configurations), which should fix the issue. If you are going to produce sdist and wheels only at release, then the C file has to stay in the source.

@francisco-dlp
Member

If you are going to produce sdist and wheels only at release, then the C file has to stay in the source.

why?

@sem-geologist
Contributor Author

Because the tests will fail? (It was written "on release", so until a release they would fail.)

@sem-geologist
Contributor Author

Appveyor successfully builds unbcf_fast in the testing environment and passes. Travis does not want to build_ext --inplace. Now I am trying to put python setup.py develop into travis.yml instead... and it fails...

@francisco-dlp, how is it that there is a requirement for setuptools, but setup.py uses distutils.core instead?

@francisco-dlp
Member

@sem-geologist, thanks for spotting the issue with setuptools. The issue was probably caused by an incorrect merge. It should be setuptools. Would you like to send a separate PR to correct it?

@francisco-dlp
Member

Regarding the C file issue, you're right: Cython should generate it whenever setup.py is invoked for installation or packaging purposes, not only for packaging.

@sem-geologist
Contributor Author

Yes, I will make a separate PR for setuptools.
No, setup.py install should not require cython (that would be a bad pattern). There should be an optional command argument to re-cythonize the .pyx files (generate the C), but it should not be mandatory, so that the user would not need a dependency on cython.
However, distribution (source and binary) should depend on cython, and cythonize the pyx files to C.
Also, setup.py has to have a checking mechanism for when the C files are missing (if the user tries to install straight from git, for example pip git+html://......... or a local git copy, as it will have no C file).

So it gets quite complicated.
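A rough sketch of that setup.py logic, under the assumptions above (the module path and the flag name are made up; the pattern follows the "Distributing Cython modules" section of the Cython docs linked earlier):

import os
import sys
from setuptools import setup, Extension

USE_CYTHON = '--use-cython' in sys.argv          # optional, not the default
if USE_CYTHON:
    sys.argv.remove('--use-cython')

base = os.path.join('hyperspy', 'io_plugins', 'unbcf_fast')  # hypothetical location
if not USE_CYTHON and not os.path.exists(base + '.c'):
    # e.g. installing straight from git: no generated C file, fall back to Cython
    USE_CYTHON = True

extensions = [Extension('hyperspy.io_plugins.unbcf_fast',
                        [base + ('.pyx' if USE_CYTHON else '.c')])]
if USE_CYTHON:
    from Cython.Build import cythonize
    extensions = cythonize(extensions)

setup(ext_modules=extensions)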

@francisco-dlp
Member

Also, setup.py has to have a checking mechanism for when the C files are missing

Then, install may require Cython in this case :)

so it gets quite complicated.

It seems so. However, there are a lot of well established packages that may have found a good solution. I suggest having a look at what others are doing.

@sem-geologist
Contributor Author

@francisco-dlp, I did a separate PR to change distutils to setuptools: it looks like the old warning about unrecognised 'install_requires' and 'setup_requires' is gone, but 3 new warnings appeared (look in the travis logs): #949.

@francisco-dlp
Member

It seems that the main issue with #949 is that ext_modules tries to compile files that aren't in that branch (unbcf_fast).

@sem-geologist
Contributor Author

That is no longer the case. It was due to a copy-paste of a piece of code from another branch... (my bad habit)

@sem-geologist
Contributor Author

#949 got merged, so this can be closed.
