
Parquet skeletons #134

Open
clbarnes opened this issue Nov 14, 2023 · 4 comments
Labels
enhancement New feature or request

Comments

@clbarnes
Collaborator

clbarnes commented Nov 14, 2023

I had just been thinking about using feather (arrow IPC) or parquet for skeleton storage and as is often the case, you've thought about it first.

Something I'd considered is using a null rather than -1 for the parent_id of root nodes. That way we don't sacrifice a sentinel value (or force a signed type) just to mark one root per skeleton, and node IDs can map onto the whole uint64 space (not that we're likely to run out of IDs, but they don't necessarily count up from 0). AFAIK, null counts are recorded in parquet's column metadata, so retrieving root nodes could be much faster than scanning the whole file.

If we could pin the pandas requirement to >=2, we could use the arrow backend and switch navis generally onto using a nullable column, but until then it's a fairly simple switch to do at the IO stage. N.B. using the arrow backend would, I think, allow an extremely fast IO mode where the memory buffer could either be dumped straight to a file or read directly by other libraries as feather format (NBLAST memory sharing?).

For bundling several parquets together I'd consider tar rather than zip (tar-quet??). Parquet files are probably best compressed using their internal codecs, which make the best use of the file's structure; any compression layered on top of that will slow IO without significant space savings. Tarballs don't add that overhead.

@clbarnes
Collaborator Author

This is possibly better for a Discussion rather than an issue, but I'll continue my roaming train of thought...

Rather than a neuron column, we could consider calling it something like fragment_id. This avoids presupposing that every fragment is a complete neuron with a known identity; there are cases where someone might want to represent a neuron as a forest (i.e. multiple fragments), which would throw off readers' assumptions about how many roots to associate with a single neuron. There's a trade-off here between a strict definition with reliable assumptions that is useful in fewer places ("neurons must be complete"), and a broader definition which can be used for either complete neurons or incomplete fragments.

Returning to the arrow point above, defining the spec in terms of an arrow schema ("neurarrow"?) would give us the Feather/IPC format for free. The implementations of parquet I've seen tend to read it into arrow format anyway.

@ceesem

ceesem commented Nov 14, 2023

I can't speak for others, but I would be interested in making and using a generic modern skeleton format that lives in something like arrow, particularly if it has capabilities to be more expressive than an SWC in terms of things like synapses and other annotations (and optionally meshes, if I can ask for extra nice things).

I've implemented an h5 format that does things like this within our tooling, and it works perfectly well, but it is special-purpose and lacks some features I've come to think I want. This is a scenario where it would make a lot more sense for us connectomics folks to put in the extra 20-50% effort: instead of building yet another special-purpose format, let's try to make something general-purpose, with a core library providing all the main functions, on which we can build our more opinionated, dataset-specific features. Arrow is definitely the right kind of technology for this, between the fast IO and the new pandas implementations.

(Also, segment_id is the term I'd suggest)

@clbarnes
Collaborator Author

This may be a good reference: geoparquet (and more generally geoarrow).

@schlegelp schlegelp added the enhancement New feature or request label Nov 30, 2023
@clbarnes
Collaborator Author

I made an attempt: https://github.com/clbarnes/neurarrow

I would have liked to use structs for the coordinates (they're still stored in a column-oriented way, i.e. all the Xs would be contiguous), but pandas doesn't support structs yet (it's in beta). It makes some use of list and map types too (n.b. in arrow, dictionary means a categorical type, map means key-value like python's dicts).

I added some required and suggested metadata, a way of specifying optional and derived columns, and so on.
