Propose to hard-code lineage data path in the dataset metadata document #482

Closed
Kirill888 opened this issue Jun 12, 2018 · 2 comments

Comments

@Kirill888
Contributor

commented Jun 12, 2018

In datacube there is a concept of "Product Metadata": a mapping from a "well known name" to the path within a metadata document where the value for that name is stored. It is an indirection mechanism that allows datasets of different types to be treated in a consistent fashion. For example, id is a property that every dataset ought to have, but it can be stored in the metadata document under a different name or location depending on the product. This mechanism is particularly useful when metadata document generation is outside of ODC's control: instead of massaging the original metadata into the consistent form ODC expects, one can just define a custom metadata document that describes where the information can be found.

sources is the name used to point to the document sub-tree containing lineage data:

# Where to find a dict of embedded source datasets
# -> The dict is of form: classifier->source_dataset_doc
# -> 'classifier' is how to classify/identify the relationship (usually the type of source it was eg. 'nbar').
# An arbitrary string, but you should be consistent between datasets (to query relationships).
sources: ['lineage', 'source_datasets']
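
To make the indirection concrete, here is a minimal sketch of how such an offset could be resolved against a dataset document. The helper `get_doc_offset`, the `EO_OFFSETS` mapping and the sample document are hypothetical names and data made up for illustration; this is not a claim about how datacube implements the lookup.

```python
from typing import Any, Mapping, Sequence


def get_doc_offset(offset: Sequence[str], document: Mapping[str, Any]) -> Any:
    """Follow a list of keys into a nested document.

    Raises KeyError if any key along the path is missing.
    """
    value: Any = document
    for key in offset:
        value = value[key]
    return value


# Hypothetical metadata definition: well-known name -> path inside the document.
EO_OFFSETS = {
    "id": ["id"],
    "sources": ["lineage", "source_datasets"],
}

# Hypothetical dataset document that follows those offsets.
dataset_doc = {
    "id": "ds-1",
    "lineage": {
        "source_datasets": {
            "nbar": {"id": "ds-0", "lineage": {"source_datasets": {}}},
        }
    },
}

# The sub-tree found under 'sources' is a dict of classifier -> source dataset doc.
sources = get_doc_offset(EO_OFFSETS["sources"], dataset_doc)
print(sorted(sources))  # classifiers of the embedded source datasets
```

A product with a different layout would only need a different offsets mapping; the lookup code stays the same.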

This is fine and consistent with the metadata indirection approach, but it comes with some significant limitations. Because the lineage location is a "dynamic property" like this, it is impossible to traverse the lineage tree without first associating every dataset along the lineage path with a product. This is needed since Metadata is a property of a product.
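
A rough sketch of that coupling, under stated assumptions: `walk_lineage`, `lookup`, the `LAYOUTS` table and the toy `kind` field used for matching are all hypothetical and only stand in for real product matching. The point is that every node has to be matched before its id or its sources can even be located.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional

Doc = Dict[str, Any]
Offsets = Dict[str, List[str]]


def lookup(doc: Doc, path: List[str]) -> Any:
    """Follow a list of keys into a nested document."""
    value: Any = doc
    for key in path:
        value = value[key]
    return value


def walk_lineage(doc: Doc, offsets_for: Callable[[Doc], Optional[Offsets]]) -> Iterator[Any]:
    """Yield the id of `doc` and of every embedded source dataset, depth-first.

    `offsets_for` plays the role of product matching: it must succeed for
    every node, otherwise neither the id nor the sources can be found.
    """
    offsets = offsets_for(doc)
    if offsets is None:
        raise ValueError("cannot traverse: document does not match any known product")
    yield lookup(doc, offsets["id"])
    for source_doc in lookup(doc, offsets["sources"]).values():
        yield from walk_lineage(source_doc, offsets_for)


# Two hypothetical layouts, selected by a made-up 'kind' field.
LAYOUTS: Dict[str, Offsets] = {
    "a": {"id": ["id"], "sources": ["lineage", "source_datasets"]},
    "b": {"id": ["meta", "uuid"], "sources": ["meta", "parents"]},
}


def offsets_for(doc: Doc) -> Optional[Offsets]:
    """Toy stand-in for product matching."""
    return LAYOUTS.get(doc.get("kind"))


doc = {
    "kind": "a",
    "id": "top",
    "lineage": {
        "source_datasets": {
            "src": {"kind": "b", "meta": {"uuid": "leaf", "parents": {}}},
        }
    },
}

print(list(walk_lineage(doc, offsets_for)))
```

If any node in the tree fails to match, the walk stops with an error at that node, even when the document itself contains perfectly usable lineage data.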

In the current system, where lineage is an "All or Nothing" deal (you either skip all of it, or all of the lineage has to be indexable), this coupling of traversal to product matching is annoying from an implementation perspective but is not a deal breaker. But if we ever want to implement in-between solutions, where some of the lineage data is recorded but not mapped to actual datasets in the database, this inability to navigate the lineage tree without establishing dataset<->product mappings for every node becomes a real problem.

We can of course have partial solutions like "assume eo metadata scheme" for those datasets that do not map to any product in the database. But this feels like "hard-coding by convention" instead of explicitly demanding a certain path for lineage data.

@Kirill888

Contributor Author

commented Jun 15, 2018

Just realised that the id field also doesn't have a fixed location in the document. So we MUST rely on product auto-matching even for datasets that are already in the database: until we have matched a dataset to a product we don't know which metadata schema to use, so we can't tell what uuid the document has, and therefore can't check whether it is already associated with a product in our database.

But auto-matching for lineage data makes the user experience super miserable.

@omad @jeremyh thoughts? To me it looks like we aren't really using the metadata mechanism to describe sources outside of our control; instead we write converters that massage some external input into the eo metadata format anyway.

@Kirill888

Contributor Author

commented Jun 18, 2018

I think it is safe to forgo the "metadata as a mechanism to describe external sources" use case, and treat it only as a "sort-of type system for describing dataset documents" plus a "mechanism for bridging the yaml/sql gap for basic types".

If we do that, we can add constraints such as: the paths for id and sources shall be fixed to the current convention, which is id and lineage.source_datasets respectively.

This gives us the ability to traverse dataset documents without having access to the DB and without relying on product matching. Tools like "dump all lineage uuids as a flat list" will then become possible.
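
With the paths fixed to id and lineage.source_datasets, such a tool could be as small as the sketch below. The function name `lineage_uuids` and the sample document are made up for illustration; the only assumption about the data is the fixed layout proposed above.

```python
from typing import Any, Dict, Iterator

Doc = Dict[str, Any]


def lineage_uuids(doc: Doc) -> Iterator[str]:
    """Yield the id of `doc`, then the ids of all embedded sources, depth-first.

    Relies only on the fixed paths 'id' and 'lineage.source_datasets';
    no database access and no product matching are involved.
    Duplicates are not removed if a dataset appears more than once in the tree.
    """
    yield doc["id"]
    # Tolerate a missing or empty lineage section.
    sources = (doc.get("lineage") or {}).get("source_datasets") or {}
    for source_doc in sources.values():
        yield from lineage_uuids(source_doc)


# Hypothetical dataset document with two levels of lineage.
doc = {
    "id": "c",
    "lineage": {
        "source_datasets": {
            "nbar": {
                "id": "b",
                "lineage": {"source_datasets": {"level1": {"id": "a"}}},
            },
        }
    },
}

print(list(lineage_uuids(doc)))
```

Because nothing here depends on a product definition, the same function works on documents that were never indexed.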

@Kirill888 Kirill888 referenced this issue Jun 27, 2018
9 of 9 tasks complete

@Kirill888 Kirill888 closed this Jul 18, 2018
