precomputed annotation format related ID lookup #529

Closed
fcollman opened this issue Feb 13, 2024 · 2 comments

@fcollman
Contributor

I'm wondering if there is a path forward to allow users to select segments based on interactions with precomputed annotations.

The present precomputed annotation format doesn't store the relationships alongside the annotation data, so it's hard for the UI to let users identify and select related segments by interacting with the annotations. The related ID index provides an efficient way to query all the annotations associated with a given segment, but not which segments are associated with a given annotation. Selecting related segments this way presently works for local annotations.

The simplest solution, I think, would be to encode the relationships along with the other properties in a "v2" format. I don't know if you have other or better ideas. The present v1 approach duplicates a lot of data, but it has the advantage of using a fixed number of bytes per annotation, and this change would break that convention.
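
To make the fixed-length trade-off concrete, here is a minimal sketch in Python. It assumes point annotations stored as little-endian float32 coordinates followed by property bytes, padded to a multiple of 4 bytes, behind an 8-byte count header. The inline relationship layout in `variable_offsets` (a uint32 count followed by uint64 IDs for each relationship) is a hypothetical "v2" layout used only for illustration, and all function names here are made up for this example.

```python
def point_record_size(rank: int, property_bytes: int) -> int:
    """Bytes per point annotation: float32 coordinates plus properties,
    padded up to a multiple of 4 bytes (assumed v1-style layout)."""
    size = 4 * rank + property_bytes
    return (size + 3) // 4 * 4


def fixed_offset(index: int, record_size: int, header_bytes: int = 8) -> int:
    """With fixed-length records, annotation `index` is found by arithmetic alone."""
    return header_bytes + index * record_size


def variable_offsets(related_counts, record_size: int, header_bytes: int = 8):
    """Hypothetical v2 layout: each record is followed inline by, for each
    relationship, a uint32 count and that many uint64 IDs. Offsets now
    depend on every preceding record, so they need a running sum."""
    offsets = []
    position = header_bytes
    for counts in related_counts:
        offsets.append(position)
        position += record_size + sum(4 + 8 * n for n in counts)
    return offsets
```

With fixed-length records a reader can jump straight to the n-th annotation. With inline relationships it would have to scan the chunk or store an extra offset table.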

@jbms
Collaborator

jbms commented Feb 13, 2024

Yes, currently you have to make a separate request to the by_id index per annotation in order to retrieve the list of all related segments, because only the by_id index stores that information. It would be reasonable to store the relationship data in the other indices if there were a use for it, but that would indeed be a format change.
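
As an illustration of what that per-annotation lookup returns, here is a minimal sketch of decoding the relationship lists from a single by_id entry. It assumes the entry holds the fixed-length geometry and property bytes first, followed by, for each relationship in order, a little-endian uint32 count and that many little-endian uint64 segment IDs. The function name and the `record_size` parameter are hypothetical, and fetching the entry from storage is left out.

```python
import struct


def parse_related_ids(entry: bytes, record_size: int, num_relationships: int):
    """Return one list of related segment IDs per relationship.

    `record_size` is the byte length of the geometry and properties that
    precede the relationship data (assumed known from the layer metadata).
    """
    offset = record_size
    related = []
    for _ in range(num_relationships):
        # Number of related segment IDs for this relationship.
        (count,) = struct.unpack_from("<I", entry, offset)
        offset += 4
        # The IDs themselves, each an unsigned 64-bit integer.
        ids = list(struct.unpack_from(f"<{count}Q", entry, offset))
        offset += 8 * count
        related.append(ids)
    return related
```

For example, an entry for a 3D point with two relationships could be built and decoded like this:

```python
entry = struct.pack("<3f", 1.0, 2.0, 3.0) + struct.pack("<I2Q", 2, 10, 20) + struct.pack("<I", 0)
related = parse_related_ids(entry, record_size=12, num_relationships=2)
```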

Regarding a v2 format, there are a few thoughts I had on that:

  • It may make sense to use Parquet or similar arrow-related format for encoding each chunk rather than the custom binary format currently used, but I have not investigated that too much. I don't think Parquet is particularly suitable for representing an entire index, but I could be mistaken.
  • It would be nice to allow indices to be defined on arbitrary (ordered) combinations of geometry, relationships, properties.
  • It would be nice if there were an existing database format that could be leveraged (i.e. designed to be read directly without a server over high-latency storage, can be written via batch process also without a server) but unfortunately I don't think there is.
  • OCDBT could be used in place of the precomputed sharded format, as that would allow ordered indices over arbitrary strings rather than just hash indices over uint64 values.

@fcollman
Contributor Author

I realized that part of my confusion was that I had a bug in #522 that was not writing the by_id index correctly, so the related segments were not showing up for my layers. I fixed that, so thank you for clarifying, which helped me find my bug. The comments on a v2 format make sense and I wholeheartedly agree. We've been discussing what format to write old versions of materialized data from CAVE to, and many of these same issues came up in that discussion.
