Propose to hard-code lineage data path in the dataset metadata document #482
In datacube there is a concept of "Product Metadata". It is a mapping from a "well-known name" to the path within a metadata document where the value of that variable is stored. It is an indirection mechanism that allows datasets of different types to be treated in a consistent fashion. For example
This is fine and consistent with the approach of having a metadata indirection mechanism, but it comes with some significant limitations. With the lineage location being a "dynamic property" like this, it is impossible to traverse the lineage tree without first associating every dataset along the lineage path with a product. This is needed because Metadata is a property of a product.
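A minimal sketch of the coupling described above, not datacube's actual code. The two metadata definitions, the helper names `resolve` and `sources_of`, and the example document are hypothetical; the `lineage`/`source_datasets` path is assumed to match the eo-style layout.

```python
from typing import Any, Mapping, Sequence

# Hypothetical metadata definitions: "well-known name" -> path in the document.
EO_LIKE = {"id": ["id"], "sources": ["lineage", "source_datasets"]}
OTHER_LIKE = {"id": ["uuid"], "sources": ["provenance", "inputs"]}


def resolve(doc: Mapping[str, Any], path: Sequence[str]) -> Any:
    """Follow a path of keys into a nested document."""
    node: Any = doc
    for key in path:
        node = node[key]
    return node


def sources_of(doc: Mapping[str, Any], metadata: Mapping[str, Sequence[str]]) -> Any:
    """Look up lineage sources; the path is only known via the metadata definition."""
    return resolve(doc, metadata["sources"])


# Hypothetical dataset document laid out in the eo-like way.
example_doc = {
    "id": "a",
    "lineage": {"source_datasets": {"level1": {"id": "b"}}},
}

# Works only when the right metadata definition has already been chosen.
sources = sources_of(example_doc, EO_LIKE)
```

With the wrong definition (`OTHER_LIKE` here) the lookup raises `KeyError`, which is why every dataset along the lineage path has to be matched to a product before its sources can be read.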
In the current system, where lineage is an "all or nothing" deal (you either skip all of it, or all of the lineage must be indexable), this coupling of traversal to product matching is annoying from an implementation perspective but is not a deal breaker. But if we ever want to implement in-between solutions, where some of the lineage data is recorded but not mapped to actual datasets in the database, this inability to navigate the lineage tree without establishing
We can of course have partial solutions like "assume the eo metadata scheme" for those datasets that do not map to any product in the database. But this feels like "hard-coding by convention", rather than demanding a certain path for lineage data.
Just realised that
But auto-matching for lineage data makes the user experience super miserable.
@omad @jeremyh thoughts? To me it looks like we aren't really using the metadata mechanism to describe sources outside of our control; instead we write converters that massage some external input into
I think it is safe to forgo the "metadata as a mechanism to describe external sources" use case, and treat it only as a "sort-of type system for describing dataset documents" + a "mechanism for bridging the yaml/sql gap for basic types".
If we do that, we can add constraints like "path for
This gives us the ability to traverse dataset documents without having access to the DB and without relying on product matching. Tools like "dump all lineage uuids as a flat list" will then become possible.
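As a rough sketch of what such a tool could look like, assuming the hard-coded path ends up being `lineage.source_datasets` and that each source is an embedded dataset document with an `id` field. Both are assumptions for illustration, as are the names `LINEAGE_PATH`, `lineage_uuids` and the example document.

```python
from typing import Any, Iterator, Mapping

# Assumed fixed location of lineage data in every dataset document.
LINEAGE_PATH = ("lineage", "source_datasets")


def _sources(doc: Mapping[str, Any]) -> Mapping[str, Any]:
    """Return the mapping of source name -> embedded source document, or {}."""
    node: Any = doc
    for key in LINEAGE_PATH:
        if not isinstance(node, Mapping) or key not in node:
            return {}
        node = node[key]
    return node if isinstance(node, Mapping) else {}


def lineage_uuids(doc: Mapping[str, Any]) -> Iterator[str]:
    """Yield ids of all ancestors, depth first, with no DB or product lookup."""
    for source in _sources(doc).values():
        if not isinstance(source, Mapping):
            continue
        if "id" in source:
            yield str(source["id"])
        yield from lineage_uuids(source)


# Hypothetical document with two levels of lineage.
example = {
    "id": "top",
    "lineage": {
        "source_datasets": {
            "nbar": {
                "id": "mid",
                "lineage": {"source_datasets": {"level1": {"id": "leaf"}}},
            }
        }
    },
}

flat = list(lineage_uuids(example))  # ["mid", "leaf"]
```

Because the path is fixed, the traversal needs only the parsed document itself, so it would work equally on a YAML file read from disk.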