Spike: Can we pick up domains from create a derived table #108

Open
seanprivett opened this issue Feb 28, 2024 · 4 comments
@seanprivett
Contributor

seanprivett commented Feb 28, 2024

Resources: https://mojdt.slack.com/archives/C03QZ776JVA/p1708360503491379?thread_ts=1708360446.980769&cid=C03QZ776JVA

@YvanMOJdigital

Investigate process for importing domains from CDT. Import to test environment to avoid disrupting UR.

@LavMatt
Contributor

LavMatt commented Feb 29, 2024

@LavMatt LavMatt self-assigned this Mar 7, 2024
@LavMatt
Contributor

LavMatt commented Mar 11, 2024

The short answer is yes, we can pick up domains from create-a-derived-table in an automated way.

But there are some points to consider in the details of implementation.

Whilst DBT recognises domains as a core concept of data management, it does not capture them as an explicit metadata property, i.e. a domain does not have its own key/value in the overall DBT manifest JSON file (the file containing all table metadata/configurations).

This video from DBT shows that create-a-derived-table follows the recommended implementation of data domains through folders.

Domain Ingestion Methods

It's possible to create a custom ingestion method to handle domain ingestion from create-a-derived-table, deriving and ingesting domains from the manifest JSON created on each run of DBT. Because of the point raised above (no domain key), there are a few different approaches that could be taken:

  1. Infer domains via the latest DBT manifest file, using the fqn key (fully qualified name). This is a list created by DBT related to the path of tables, where the 2nd item of that list is always the domain folder.
  2. Infer domains via the latest DBT manifest file using the external_location key, which is the s3 path to table files. This follows hive partition pathing conventions e.g. domain=prison/database_name=db_1/table_name=tb1/...
  3. It could also be possible to add domain tags to model configurations and then pull these from the central manifest. This was not explored because it would need changes to create-a-derived-table projects.
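
As a rough illustration of options 1 and 2, the sketch below infers a domain for each model in a manifest that has already been loaded into a dict. It is not the PoC code. It assumes the usual manifest layout (a top-level `nodes` mapping with `resource_type` and `fqn` on each node) and, as an assumption, that `external_location` sits under each node's `config`. The function names, project name, and bucket name are all hypothetical.

```python
import re
from typing import Callable, Optional

# Matches a hive-style "domain=<value>" path segment, e.g. ".../domain=prison/..."
DOMAIN_SEGMENT = re.compile(r"(?:^|/)domain=([^/]+)")


def domain_from_fqn(node: dict) -> Optional[str]:
    """Option 1: treat the 2nd item of fqn as the domain folder."""
    fqn = node.get("fqn") or []
    # fqn is [project, folder(s)..., model]; with only two items there is no folder.
    return fqn[1] if len(fqn) > 2 else None


def domain_from_external_location(node: dict) -> Optional[str]:
    """Option 2: read the domain=... segment of the S3 path.

    Assumption: external_location is stored under the node's config.
    """
    location = (node.get("config") or {}).get("external_location") or ""
    match = DOMAIN_SEGMENT.search(location)
    return match.group(1) if match else None


def collect_domains(
    manifest: dict,
    infer: Callable[[dict], Optional[str]] = domain_from_fqn,
) -> dict[str, set[str]]:
    """Group model unique_ids by the domain inferred for each of them."""
    domains: dict[str, set[str]] = {}
    for unique_id, node in manifest.get("nodes", {}).items():
        if node.get("resource_type") != "model":
            continue  # skip seeds, tests, snapshots, etc.
        domain = infer(node)
        if domain:
            domains.setdefault(domain, set()).add(unique_id)
    return domains


# Hypothetical manifest fragment, for illustration only.
example_manifest = {
    "nodes": {
        "model.example_project.tb1": {
            "resource_type": "model",
            "fqn": ["example_project", "prison", "db_1", "tb1"],
            "config": {
                "external_location": (
                    "s3://example-bucket/domain=prison/"
                    "database_name=db_1/table_name=tb1/"
                )
            },
        },
        "seed.example_project.lookup": {
            "resource_type": "seed",
            "fqn": ["example_project", "lookup"],
            "config": {},
        },
    }
}

by_fqn = collect_domains(example_manifest)
by_path = collect_domains(example_manifest, infer=domain_from_external_location)
```

Both calls group the single example model under `prison`; a real manifest would need to be read from wherever the latest DBT run writes it.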

I have created a PoC custom ingestion source following option 1 and have used it to ingest domains into the Datahub test environment.

Potential Issues

  • Neither of the first two options is perfect: both could break if the folder structure or naming conventions change.
  • Custom ingestions cannot yet be run via the Datahub UI, so we couldn't set up and configure domain ingestion/alignment to run out of our Datahub instance (we could run it via Airflow if we wanted a regular schedule). Mat M thought this might be possible, though.
  • Datahub domains allow for other associated metadata, e.g. a description. With the current create-a-derived-table setup, these methods would only allow ingestion of the domain names.
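
One way to soften the first issue would be to compare the two inference methods and flag models where they disagree before ingesting anything. The sketch below is only an illustration of that idea: the function name and the example model ids are hypothetical, and it takes the two inferred domains per model as already computed.

```python
from typing import Optional

DomainPair = tuple[Optional[str], Optional[str]]


def find_disagreements(inferred: dict[str, DomainPair]) -> dict[str, DomainPair]:
    """Return models whose fqn-based and path-based domains are missing or differ."""
    return {
        model: (from_fqn, from_path)
        for model, (from_fqn, from_path) in inferred.items()
        if from_fqn is None or from_path is None or from_fqn != from_path
    }


# Hypothetical per-model results: (domain from fqn, domain from external_location).
inferred = {
    "model.example_project.tb1": ("prison", "prison"),
    "model.example_project.tb2": ("probation", "courts"),
    "model.example_project.tb3": ("staging", None),
}

suspect = find_disagreements(inferred)
```

A scheduled run could skip or report the models in `suspect` rather than ingest a domain that a folder or path change has made unreliable.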

@YvanMOJdigital YvanMOJdigital changed the title Can we pick up domains from create a derived table Spike: Can we pick up domains from create a derived table Mar 13, 2024
@LavMatt
Contributor

LavMatt commented Mar 19, 2024

Having talked with @SoumayaMauthoorMOJ, there is a chance that domains will be introduced as dynamic tags within the config of create-a-derived-table, as part of the work planned to alter how domains are represented within S3 paths.
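
If domains do end up as tags, reading them from the manifest could look something like the sketch below. The `domain:` prefix is purely a hypothetical convention, as the actual tag format has not been decided; the sketch relies only on manifest nodes carrying a `tags` list (also present under `config`, where it may be a single string).

```python
from typing import Optional

# Hypothetical convention: a domain is tagged as "domain:<name>".
DOMAIN_TAG_PREFIX = "domain:"


def domain_from_tags(node: dict) -> Optional[str]:
    """Return the first domain found in a node's tags, or None."""
    tags = node.get("tags") or (node.get("config") or {}).get("tags") or []
    if isinstance(tags, str):
        tags = [tags]  # config.tags may be a single string rather than a list
    for tag in tags:
        if tag.startswith(DOMAIN_TAG_PREFIX):
            # An empty name after the prefix is treated as no domain.
            return tag[len(DOMAIN_TAG_PREFIX):] or None
    return None


# Hypothetical node fragment, for illustration only.
example_node = {"tags": ["daily", "domain:prison"], "config": {}}
example_domain = domain_from_tags(example_node)
```

Unlike the folder and path approaches, this would not depend on where a model lives in the project.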

Projects
Status: Done

3 participants