
Parquet skeletons #134

Open
clbarnes opened this issue Nov 14, 2023 · 4 comments
Labels
enhancement New feature or request

Comments

@clbarnes
Collaborator

clbarnes commented Nov 14, 2023

I had just been thinking about using feather (arrow IPC) or parquet for skeleton storage and as is often the case, you've thought about it first.

Something I'd considered is using a null rather than -1 for the parent_id of root nodes. That way we don't sacrifice a sentinel value (or force a signed type) just to mark one root per skeleton, and node IDs can map onto the whole uint64 space (not that we're likely to run out of IDs, but they don't necessarily count up from 0). AFAIK, null counts are recorded in parquet's column metadata, so retrieving root nodes could be much faster than scanning the whole file.

If we could pin the pandas requirement to >=2, we could use the arrow backend and switch navis generally onto using a nullable column, but until then it's a fairly simple switch to do at the IO stage. N.B. using the arrow backend would, I think, allow an extremely fast IO mode where the memory buffer could either be dumped straight to a file or read directly by other libraries as feather format (NBLAST memory sharing?).

For bundling several parquets together I'd consider tar rather than zip (tar-quet??). Parquet files are probably best compressed using their internal codecs, which make the best use of the file's structure; any compression layered on top of that will slow IO without significant space savings. Tarballs don't add that overhead.

@clbarnes
Collaborator Author

This is possibly better for a Discussion rather than an issue, but I'll continue my roaming train of thought...

Rather than a neuron column, we could consider calling it something like fragment_id. This avoids presupposing that every fragment is a complete neuron with a known identity; there are cases where someone might want to represent a neuron as a forest (i.e. multiple fragments), which would throw off readers' assumptions about how many roots to associate with a single neuron. There's a trade-off here between a strict definition with reliable assumptions that is useful in fewer places ("neurons must be complete"), and a broader definition which can be used for either complete neurons or incomplete fragments.

Returning to the arrow point above, defining the spec in terms of an arrow schema ("neurarrow"?) would give us the Feather/IPC format for free. The implementations of parquet I've seen tend to read it into arrow format anyway.

@ceesem

ceesem commented Nov 14, 2023

I can't speak for others, but I would be interested in making and using a generic modern skeleton format that lives in something like arrow, particularly if it has capabilities to be more expressive than an SWC in terms of things like synapses and other annotations (and optionally meshes, if I can ask for extra nice things).

I've implemented an h5 format that does things like this within our tooling, and it works perfectly well, but it is special-purpose and lacks some features I've come to think I want. This is a scenario where it would make a lot more sense for us connectomics folks to put in the extra 20-50% effort: instead of building yet another special-purpose format, let's try to make something general-purpose, with a core library providing all the main functions, on which we can build our more opinionated, dataset-specific features. Arrow is definitely the right kind of technology for this, between the fast IO and the new pandas implementations.

(Also, segment_id is the term I'd suggest)

@clbarnes
Collaborator Author

This may be a good reference: geoparquet (and more generally geoarrow).

@schlegelp schlegelp added the enhancement New feature or request label Nov 30, 2023
@clbarnes
Collaborator Author

I made an attempt: https://github.com/clbarnes/neurarrow

I would have liked to use structs for the coordinates (they're still stored in a column-oriented way, i.e. all the Xs would be contiguous), but pandas doesn't support structs yet (it's in beta). It makes some use of list and map types too (n.b. in arrow, dictionary means a categorical type, map means key-value like python's dicts).

I added some required and suggested metadata, a way of specifying optional and derived columns, and so on.
