Sparse container #653

Open
kif opened this issue Mar 28, 2019 · 3 comments

Comments

@kif

kif commented Mar 28, 2019

Hi,

We need to store "sparse data": either sparse matrices (https://en.wikipedia.org/wiki/Sparse_matrix) or data acquired with very few events. This is especially true when using faster detectors at constant flux. PyFAI also makes intensive internal use of sparse matrices for performance, and some users wish to persist that state on disk.

There has been a proposal to use an extension to HDF5 directly (https://github.com/appier/h5sparse), but I believe it would be better to have a specification at the NeXus level.

The sparse matrix representation is well standardized, so we could imagine having an
@NXclass: NXsparse

Because there are a few flavours of sparse representation, I would add an attribute @format holding a string (DOK, COO, CSR, CSC, ...) to specify the type of storage.
Then there would be a few other attributes (depending on the @format attribute) to specify the names of the datasets which contain the actual data.

Maybe an example is simpler than a description. The CSR sparse matrix representation uses 3 arrays called data, indices and indptr as defined in https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html#scipy.sparse.csr_matrix

In NeXus, this would give the following layout:

My_matrix
@NXclass: NXsparse
@format: CSR
@shape: (nlines, ncol)
@data: my_data 
@indices: my_indices
@indptr: line_start

    line_start = dataset(nlines+1, dtype=int)
    my_indices = dataset(nnz, dtype=int)
    my_data = dataset(nnz, dtype=float)
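To make the three arrays concrete, here is a minimal pure-Python sketch that builds them for a small 3 x 4 matrix and looks up one line. It is only an illustration: the group is a plain dict standing in for an HDF5 group, the helper names `dense_to_csr` and `csr_row` are hypothetical, and the attribute names simply copy the layout above.

```python
def dense_to_csr(rows):
    """Convert a dense matrix (list of equal-length rows) to CSR arrays."""
    data, indices, indptr = [], [], [0]
    for row in rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(float(value))
                indices.append(col)
        # indptr[k + 1] marks where line k ends in data/indices
        indptr.append(len(data))
    return data, indices, indptr


dense = [
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 2],
]
my_data, my_indices, line_start = dense_to_csr(dense)

# Plain-dict stand-in for the proposed group (assumption: not a real HDF5 file).
my_matrix = {
    "attrs": {
        "NXclass": "NXsparse",
        "format": "CSR",
        "shape": (len(dense), len(dense[0])),
        "data": "my_data",
        "indices": "my_indices",
        "indptr": "line_start",
    },
    "line_start": line_start,  # length nlines + 1
    "my_indices": my_indices,  # length nnz
    "my_data": my_data,        # length nnz
}


def csr_row(group, k):
    """Return the non-zero (column, value) pairs of line k."""
    attrs = group["attrs"]
    indptr = group[attrs["indptr"]]
    start, stop = indptr[k], indptr[k + 1]
    cols = group[attrs["indices"]][start:stop]
    vals = group[attrs["data"]][start:stop]
    return list(zip(cols, vals))
```

Here `line_start` has nlines + 1 = 4 entries, and an empty line shows up as two equal consecutive entries. A reader resolves the dataset names through the attributes rather than assuming fixed names.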

The CSR format uses data, indices and indptr, but other formats have different components, whose names need to be normalized...
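One possible way to normalize the component names per format could be a small lookup table keyed on @format. The sketch below is an assumption, not part of any NeXus definition: the component names are borrowed from the scipy.sparse attribute names, DOK is left out because it has no obvious array layout, and `missing_components` is a hypothetical helper.

```python
# Hypothetical table of required attributes per @format value.
# Names follow scipy.sparse conventions; nothing here is an agreed standard.
COMPONENTS = {
    "COO": ("data", "row", "col"),
    "CSR": ("data", "indices", "indptr"),
    "CSC": ("data", "indices", "indptr"),
}


def missing_components(attrs):
    """List the component attributes that a group's attrs fail to declare."""
    required = COMPONENTS[attrs["format"]]
    return [name for name in required if name not in attrs]
```

For example, `missing_components({"format": "CSR", "data": "my_data", "indices": "my_indices"})` reports that `indptr` is missing.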

Once the 2D case is addressed with matrices, we will probably want to represent higher-dimensional datasets, for example a stack of images (nframes x nlines x ncolumns). There should be a way to indicate how frames, lines and columns are packed into a sparse matrix.

My_matrix_of_4D_dataset
@NXclass: NXsparse
@format: CSR
@shape: (nlines, ncol) 
@data: my_data 
@indices: my_indices
@indptr: line_start
@original_shape: (i, z, y, x)
@map_row: (0, 1)
@map_col: (3, 2)

    line_start = dataset(nlines+1, dtype=int)
    my_indices = dataset(nnz, dtype=int)
    my_data = dataset(nnz, dtype=float)

In this example the original dataset is a 4D array with i * z = nlines and y * x = ncol. Moreover, map_row and map_col describe how i, z, y and x are packed into the matrix dimensions. This scheme can of course be extended to higher dimensionality.
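As a sketch of how such a mapping could be interpreted, the snippet below converts a 4D index into a (row, col) pair of the packed matrix. It assumes that the axes listed in map_row and map_col are combined in C order (the first listed axis varies slowest); the proposal does not fix this convention, and the function names are hypothetical.

```python
def pack_index(index, original_shape, map_row, map_col):
    """Map an N-D index to (row, col) of the packed 2-D matrix.

    Assumption: the axes listed in map_row / map_col are combined in
    C order, i.e. the first listed axis varies slowest.
    """
    def ravel(axes):
        flat = 0
        for axis in axes:
            flat = flat * original_shape[axis] + index[axis]
        return flat

    return ravel(map_row), ravel(map_col)


def packed_shape(original_shape, map_row, map_col):
    """Return (nlines, ncol) as products of the mapped axis lengths."""
    nlines = 1
    for axis in map_row:
        nlines *= original_shape[axis]
    ncol = 1
    for axis in map_col:
        ncol *= original_shape[axis]
    return nlines, ncol


original_shape = (2, 3, 4, 5)  # (i, z, y, x), toy sizes
map_row = (0, 1)               # rows combine i (slow) and z (fast)
map_col = (3, 2)               # columns combine x (slow) and y (fast)
```

With these toy sizes the packed matrix has 2 * 3 = 6 lines and 5 * 4 = 20 columns, and the last element (1, 2, 3, 4) lands at row 5, column 19.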

Access to high-dimensional data is needed for many new types of experiments (https://www.nature.com/articles/nature16060), and as an archival format, NeXus needs to be able to store and manipulate the associated data.

Your comments are welcome.
Jérôme from ESRF

A copy for @jonwright as he is interested in the subject.

@benajamin
Contributor

We discussed this proposal at the telco. This is a bit special in that it describes how a dataset is represented, while NeXus tends to concentrate on the physical meaning of datasets. It is, however, similar to NXlog and NXevent_data in this respect and so should definitely be considered as a worthy addition to NeXus. While we think it is probably best (long-term) dealt with in HDF5, rather than NeXus, the HDF Group needs money for development and a NeXus solution might be a better work-around than the h5sparse code. This idea should be further developed into a proposal for the NIAC to consider adopting.

The NXlog and NXevent_data classes are not widely used and how they integrate with the rest of the NeXus structure is not clearly described in the documentation (to me, at least), so it would help to clear that up and treat each of the data representations generally. Perhaps @FreddieAkeroyd has used these classes and can share some wisdom.

@rayosborn
Contributor

Sorry I missed the telco, but I think you will find that the NXlog classes (and I presume the NXevent_data classes, though I leave that to Mantid) are heavily used at neutron sources to store sample environment and data acquisition logs. In fact, the number of NXlog groups often swamps the number of NXdata groups, although they are usually discreetly buried in subgroups.

@FreddieAkeroyd
Member

The NXevent_data class is used for storing neutron event data at both ISIS and ORNL. It stores a list of tuples (detector_id, time_offset_from_pulse) with an additional array storing indices into this list that allow splitting the list by accelerator pulse. It is thus like the COO sparse matrix representation you describe above, with (detector_id, time_offset_from_pulse) being (row, column) and the value always being 1. The class could probably be extended to allow values other than 1 (which might be useful if weighted events needed to be stored in NeXus) and also to store a general n-tuple. Would supporting only a COO-style representation be OK, or are there sufficient benefits to being able to store in a CSR style?
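A rough pure-Python sketch of that analogy follows. The values are made up, the variable and function names are illustrative rather than the actual NXevent_data field names, and the sketch assumes the per-pulse index array holds the position of the first event of each pulse with no trailing end marker.

```python
# Toy event list (assumed values): one entry per detected neutron.
detector_id = [7, 2, 7, 5, 2]
time_offset = [10, 12, 10, 4, 8]  # offset from the pulse, arbitrary units
pulse_start = [0, 3, 3]           # index of the first event of each pulse


def events_of_pulse(p):
    """Return the (detector_id, time_offset) tuples recorded during pulse p."""
    start = pulse_start[p]
    if p + 1 < len(pulse_start):
        stop = pulse_start[p + 1]
    else:
        stop = len(detector_id)
    return list(zip(detector_id[start:stop], time_offset[start:stop]))


def to_coo(events):
    """Sum repeated (row, col) events into COO triplets with integer counts."""
    counts = {}
    for key in events:
        counts[key] = counts.get(key, 0) + 1
    keys = sorted(counts)
    rows = [k[0] for k in keys]
    cols = [k[1] for k in keys]
    values = [counts[k] for k in keys]
    return rows, cols, values
```

In this sketch the per-pulse array plays a role similar to the CSR indptr array (here pulse 1 is empty), while each event is a COO entry with an implicit value of 1. Weighted events would only need an extra values list alongside the two index lists.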

@benajamin benajamin removed the telco label Aug 7, 2019