Spike: Can we pick up domains from create a derived table #108

seanprivett · 2024-02-28T11:45:03Z (edited by tom-webber)

Resources: https://mojdt.slack.com/archives/C03QZ776JVA/p1708360503491379?thread_ts=1708360446.980769&cid=C03QZ776JVA

YvanMOJdigital · 2024-02-29T13:25:24Z

Investigate process for importing domains from CDT. Import to test environment to avoid disrupting UR.

LavMatt · 2024-02-29T13:27:31Z

see domains here (the folders) https://github.com/moj-analytical-services/create-a-derived-table/tree/main/mojap_derived_tables/models

LavMatt · 2024-03-11T11:07:59Z

The short answer is yes, we can pick up domains from create-a-derived-table in an automated way.

But there are some points to consider in the details of implementation.

Although DBT recognises domains as a core concept of data management, it does not capture them as an explicit metadata property, i.e. a domain does not have its own key/value in the overall DBT manifest json file (the file containing all table metadata/configurations).

This video from DBT shows create-a-derived-table follows the recommended implementation of data domains through folders.

Domain Ingestion Methods

It's possible to create a custom ingestion method to handle domain ingestion from create-a-derived-table, which can derive and ingest domains as set in the manifest json created on each run of DBT. Because of the point raised above (no domain key) there are a couple of different approaches that could be taken:

1. Infer domains from the latest DBT manifest file using the fqn key (fully qualified name). This is a list that DBT builds from the path of each table, where the 2nd item of the list is always the domain folder.
2. Infer domains from the latest DBT manifest file using the external_location key, which is the s3 path to the table files. This follows hive partition pathing conventions, e.g. domain=prison/database_name=db_1/table_name=tb1/...
3. Add domain tags to model configurations and then pull these from the central manifest. This has not been explored because it would need changes to the create-a-derived-table projects.

I have created a PoC custom ingestion source following option 1 and have ingested domains into the Datahub test env.
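To make the first two approaches concrete, here is a minimal sketch of how a domain could be inferred from a manifest node. It is not the PoC ingestion source itself. The sample manifest, the bucket name, and the helper names are hypothetical, and it assumes external_location sits under each node's config block.

```python
import re
from typing import Callable, Optional, Set

# Matches the hive-style "domain=<name>" path segment used by option 2.
DOMAIN_SEGMENT = re.compile(r"(?:^|/)domain=([^/]+)")


def domain_from_fqn(node: dict) -> Optional[str]:
    """Option 1: treat the 2nd item of the fqn list as the domain folder."""
    fqn = node.get("fqn") or []
    # With only [project, model] there is no folder level to read a domain from.
    return fqn[1] if len(fqn) > 2 else None


def domain_from_external_location(node: dict) -> Optional[str]:
    """Option 2: read the domain=<name> segment of the s3 path."""
    # Assumption: external_location is stored under the node's "config" block.
    location = (node.get("config") or {}).get("external_location") or ""
    match = DOMAIN_SEGMENT.search(location)
    return match.group(1) if match else None


def collect_domains(
    manifest: dict, extract: Callable[[dict], Optional[str]]
) -> Set[str]:
    """Apply one extraction strategy to every model node in the manifest."""
    domains: Set[str] = set()
    for node in manifest.get("nodes", {}).values():
        if node.get("resource_type") != "model":
            continue  # skip tests, seeds, snapshots, etc.
        domain = extract(node)
        if domain:
            domains.add(domain)
    return domains


# Hypothetical manifest fragment; a real manifest carries many more keys.
SAMPLE_MANIFEST = {
    "nodes": {
        "model.mojap_derived_tables.db_1__tb1": {
            "resource_type": "model",
            "fqn": ["mojap_derived_tables", "prison", "db_1", "db_1__tb1"],
            "config": {
                "external_location": (
                    "s3://example-bucket/domain=prison/"
                    "database_name=db_1/table_name=tb1/"
                )
            },
        },
        "model.mojap_derived_tables.db_2__tb2": {
            "resource_type": "model",
            "fqn": ["mojap_derived_tables", "courts", "db_2", "db_2__tb2"],
            "config": {
                "external_location": (
                    "s3://example-bucket/domain=courts/"
                    "database_name=db_2/table_name=tb2/"
                )
            },
        },
        "test.mojap_derived_tables.not_null_tb1": {
            "resource_type": "test",
            "fqn": ["mojap_derived_tables", "prison", "not_null_tb1"],
            "config": {},
        },
    }
}

if __name__ == "__main__":
    print(sorted(collect_domains(SAMPLE_MANIFEST, domain_from_fqn)))
    print(sorted(collect_domains(SAMPLE_MANIFEST, domain_from_external_location)))
```

For a real run, the manifest would be loaded with json.load from the file DBT writes on each run, rather than from an inline dict. Keeping the two extraction strategies as separate functions makes it easy to compare their results on the same manifest.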

Potential Issues

Neither of the options is perfect, and both could break if the folder structure or naming conventions change.
Custom ingestions cannot yet be run via the Datahub UI, so we couldn't set up and configure domain ingestion/alignment to run from our Datahub instance (we could run it via airflow if we want a regular schedule). Mat M thought we may be able to do this, though.
Datahub domains allow other associated metadata, e.g. a description. With the current create-a-derived-table setup, these methods would only ingest the names of the domains.

LavMatt · 2024-03-19T12:43:59Z

Talking with @SoumayaMauthoorMOJ, there is possibly a chance that domains will be introduced as dynamic tags within the config of create-a-derived-table, and as part of the work planned to alter how domains are represented within s3 paths.
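If domains do arrive as tags, picking them up from the manifest could look roughly like the sketch below. The "domain:" tag prefix is a purely hypothetical convention used for illustration. Nothing in this thread defines how the tags would actually be named, and the sketch assumes each manifest node exposes its tags as a list of strings under a tags key.

```python
from typing import List

# Hypothetical naming convention; not defined by DBT or create-a-derived-table.
DOMAIN_TAG_PREFIX = "domain:"


def domains_from_tags(node: dict) -> List[str]:
    """Return the domain names encoded in a node's tags, in tag order."""
    tags = node.get("tags") or []
    names = [
        tag[len(DOMAIN_TAG_PREFIX):]
        for tag in tags
        if tag.startswith(DOMAIN_TAG_PREFIX)
    ]
    # Drop empty names produced by a bare "domain:" tag.
    return [name for name in names if name]


if __name__ == "__main__":
    # Hypothetical node: one scheduling tag and one domain tag.
    example_node = {"tags": ["daily", "domain:prison"]}
    print(domains_from_tags(example_node))
```

Unlike the folder and s3 path approaches, this would not depend on the folder structure, but it relies on every model being tagged consistently.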

LavMatt self-assigned this Mar 7, 2024

YvanMOJdigital changed the title ~~Can we pick up domains from create a derived table~~ Spike: Can we pick up domains from create a derived table Mar 13, 2024

LavMatt mentioned this issue Mar 14, 2024

Replace data product labelling in Datahub for source system metadata #175

Closed

5 tasks

MatMoore mentioned this issue Mar 18, 2024

Populate DataHub instances with appropriate data ministryofjustice/data-catalogue#5

Open

10 tasks

murdo-moj mentioned this issue May 29, 2024

Assign domains to entities in CaDeT #343

Closed
