Sparse container #653

kif · 2019-03-28T08:34:19Z

Hi,

We need to store "sparse data": either sparse matrices (https://en.wikipedia.org/wiki/Sparse_matrix) or data acquired with very few events. This is especially true when using faster detectors with constant flux. PyFAI also makes intensive use of sparse matrices internally for performance, and some users wish to persist that state on disk.

There has been a proposal for using an extension to HDF5 directly (https://github.com/appier/h5sparse) but I believe it would be better to have some specification at the nexus level.

The sparse matrix representation is well normalized so we could imagine having an
@NXclass: NXsparse

Because there are a few flavours of sparse representation, I would add an attribute @format being equal to a string (DOK, COO, CSR, CSC, ...) to specify the type of storage.
Then there would be a few other attributes (depending on the @format attribute) to specify the names of the datasets which contain the actual data.

Maybe an example is simpler than a description. The CSR sparse matrix representation uses 3 arrays called data, indices and indptr as defined in https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html#scipy.sparse.csr_matrix

In NeXus, this would give the following layout:

My_matrix
@NXclass: NXsparse
@format: CSR
@shape: (nlines, ncol)
@data: my_data 
@indices: my_indices
@indptr: line_start

    line_start = dataset(nlines+1, dtype=int)
    my_indices = dataset(nnz, dtype=int)
    my_data = dataset(nnz, dtype=float)
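For readers less familiar with CSR, here is a minimal pure-Python sketch of what the three datasets above would contain for a small matrix. The helper names `dense_to_csr` and `csr_row` are made up for this illustration and are not part of any proposal; scipy's `csr_matrix` exposes the same three arrays as `data`, `indices` and `indptr`.

```python
def dense_to_csr(dense):
    """Return (data, indices, indptr) for a matrix given as a list of rows."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)    # the non-zero value itself
                indices.append(col)   # its column index
        indptr.append(len(data))      # where the next row starts in data
    return data, indices, indptr


def csr_row(data, indices, indptr, row, ncol):
    """Rebuild one dense row from the three CSR arrays."""
    out = [0] * ncol
    for k in range(indptr[row], indptr[row + 1]):
        out[indices[k]] = data[k]
    return out


# A 3 x 4 example matrix (nlines=3, ncol=4, nnz=3); the second row is empty.
dense = [
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [1.5, 0.0, 0.0, 3.0],
]
my_data, my_indices, line_start = dense_to_csr(dense)
```

Note that `line_start` has nlines+1 entries, so an empty row simply repeats the previous offset.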

The CSR format uses data, indices and indptr, but other formats have different components, which need to be normalized...

Once the 2D case is addressed with matrices, we will probably want to represent larger dimensionality datasets, for example considering the storage of a stack of images (nframes x nlines x ncolumns) and there should be a way to indicate how frames, lines and columns are packed into a sparse matrix.

My_matrix_of_4D_dataset
@NXclass: NXsparse
@format: CSR
@shape: (nlines, ncol) 
@data: my_data 
@indices: my_indices
@indptr: line_start
@original_shape: (i, z, y, x)
@map_row: (0, 1)
@map_col: (3, 2)

    line_start = dataset(nlines+1, dtype=int)
    my_indices = dataset(nnz, dtype=int)
    my_data = dataset(nnz, dtype=float)

In this example the original dataset is a 4D array with i * z = nlines and y * x = ncol. Moreover, map_row and map_col describe how i, z, y and x are packed into the matrix dimensions. This scheme can of course be extended to higher dimensionality.
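One possible reading of the mapping attributes, sketched in pure Python. The assumption here (not stated in the proposal) is that each map lists axes of original_shape from slowest to fastest varying, so map_row = (0, 1) gives row = i * nz + z and map_col = (3, 2) gives col = x * ny + y. The function names and the example shape are hypothetical.

```python
def matrix_position(index, original_shape, map_row, map_col):
    """Map an N-D index to (row, col), assuming slowest-axis-first packing."""
    def flatten(axes):
        pos = 0
        for axis in axes:
            pos = pos * original_shape[axis] + index[axis]
        return pos
    return flatten(map_row), flatten(map_col)


def matrix_shape(original_shape, map_row, map_col):
    """Return (nlines, ncol) implied by the two axis maps."""
    def size(axes):
        n = 1
        for axis in axes:
            n *= original_shape[axis]
        return n
    return size(map_row), size(map_col)


original_shape = (2, 3, 4, 5)   # (i, z, y, x), sizes chosen for illustration
map_row = (0, 1)                # rows pack i (slow) then z (fast)
map_col = (3, 2)                # columns pack x (slow) then y (fast)

nlines, ncol = matrix_shape(original_shape, map_row, map_col)
row, col = matrix_position((1, 2, 3, 4), original_shape, map_row, map_col)
```

With these sizes, nlines = 2 * 3 = 6 and ncol = 5 * 4 = 20, and the last element (1, 2, 3, 4) lands in the last cell (5, 19) of the matrix. If the proposal intends the opposite ordering convention, only `flatten` would need to change.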

Access to high-dimensional data is needed for many new types of experiments (https://www.nature.com/articles/nature16060), and as an archival format, NeXus needs to ensure that the associated data can be stored and manipulated.

Your comments are welcome.
Jérôme from ESRF

A copy for @jonwright as he is interested in the subject.

benajamin · 2019-04-24T15:53:27Z

We discussed this proposal at the telco. This is a bit special in that it is describing how a dataset is represented, while NeXus tends to concentrate on the physical meaning of datasets. It is, however, similar to NXlog and NXevent_data in this respect and so should definitely be considered as a worthy addition to NeXus. While we think it is probably best (long-term) dealt with in HDF5, rather than NeXus, the HDF group need money for development and a NeXus solution might be a better work-around than the h5sparse code. This idea should be further developed into a proposal for the NIAC to consider adopting.

The NXlog and NXevent_data classes are not widely used and how they integrate with the rest of the NeXus structure is not clearly described in the documentation (to me, at least), so it would help to clear that up and treat each of the data representations generally. Perhaps @FreddieAkeroyd has used these classes and can share some wisdom.

rayosborn · 2019-04-24T15:57:53Z

Sorry I missed the telco, but I think you will find that the NXlog classes (and I presume the NXevent_data classes, though I leave that to Mantid) are heavily used at neutron sources to store sample environment and data acquisition logs. In fact, the number of NXlog groups often swamps the number of NXdata groups, although they are usually discreetly buried in subgroups.

FreddieAkeroyd · 2019-04-24T22:19:07Z

The NXevent_data class is used for storing neutron event data at both ISIS and ORNL. It stores a list of tuples (detector_id, time_offset_from_pulse) with an additional array storing indices into this list that allow splitting the list according to each accelerator pulse. It is thus like the COO sparse matrix representation you describe above, with (detector_id, time_offset_from_pulse) being (row, column) and the value always being 1. The class could probably be extended to allow for values other than 1 (which might be useful if weighted events needed to be stored in NeXus) and also for storing a general n-tuple. Would only supporting a COO style representation be OK, or are there sufficient benefits to being able to store in a CSR style?
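To make the analogy concrete, here is a rough pure-Python sketch of an event list with a per-pulse index array, read as COO triplets. All variable names and values are invented for the illustration and do not follow the actual NXevent_data field names; the time offsets are assumed to be already binned to integers so that they can serve as column indices.

```python
# One (detector_id, time_offset) pair per detected neutron, in arrival order.
detector_id = [7, 3, 7, 1, 3, 3]
time_offset = [120, 450, 455, 80, 300, 910]

# Index of the first event of each pulse; pulse 1 recorded no events.
pulse_start = [0, 3, 3, 4]


def events_of_pulse(pulse):
    """Return the (detector_id, time_offset) pairs belonging to one pulse."""
    start = pulse_start[pulse]
    if pulse + 1 < len(pulse_start):
        stop = pulse_start[pulse + 1]
    else:
        stop = len(detector_id)   # the last pulse runs to the end of the list
    return list(zip(detector_id[start:stop], time_offset[start:stop]))


# COO view: every event is a (row, col, value) triplet with value 1.
coo_row = list(detector_id)
coo_col = list(time_offset)
coo_value = [1] * len(detector_id)

# Storing weighted events would only require non-unit entries in coo_value.
```

The pulse index array plays a role similar to indptr in CSR, except that it slices the event list by pulse rather than by matrix row.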

benajamin added the telco label Apr 16, 2019

benajamin removed the telco label Aug 7, 2019
