
Populate DataHub instances with appropriate data #5

Open · 7 of 10 tasks
tom-webber opened this issue Mar 13, 2024 · 7 comments
tom-webber commented Mar 13, 2024

We need to populate the different DataHub instances (dev, test, and preprod) with appropriate data. Preprod and test should contain copies of the data currently in dev.

Metadata that needs importing

In theory we could sink from instance to instance using the DataHub source; however, I don't think we can filter this (?).
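
For illustration, a minimal sketch of what an instance-to-instance copy might look like, assuming the `datahub` source (which reads from the source instance's storage layer rather than its REST API). The connection details and token are placeholders, and whether the results can be filtered is exactly the open question above:

# Sketch only: copy metadata from one DataHub instance into another.
cat > instance_to_instance.yaml <<'EOF'
source:
  type: datahub
  config:
    # database/kafka connection settings for the source instance go here
    # (see the DataHub docs for the `datahub` source)

sink:
  type: datahub-rest
  config:
    server: "https://datahub-catalogue-ENV.apps.live.cloud-platform.service.justice.gov.uk/api/gms"
    token: xxxxx
EOF

datahub ingest -c instance_to_instance.yaml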

Definition of Done

  • test DataHub instance populated
  • preprod DataHub instance populated
MatMoore self-assigned this Mar 18, 2024
MatMoore commented

Extract glossary terms from a DataHub export JSON:

jq -r '.[] | select(.entityType == "glossaryTerm") | [.entityUrn, .aspect.json.name, .aspect.json.definition, .aspect.json.parentNode] | @csv' datahub_export.json
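
For reference, a hypothetical wrapper that writes the same output to a CSV file with a header row (the column names are just labels for the fields selected above):

# Prepend a header row and save the glossary terms as CSV
{
  echo '"urn","name","definition","parentNode"'
  jq -r '.[] | select(.entityType == "glossaryTerm")
         | [.entityUrn, .aspect.json.name, .aspect.json.definition, .aspect.json.parentNode]
         | @csv' datahub_export.json
} > glossary_terms.csv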

MatMoore commented

I think I'll have to create the "well documented" view manually; I can't see a way to export these.

I also had a look in the DataHub Slack to see if anyone's thought about a Terraform provider for some of the data that is relatively static. There doesn't seem to be anything available at the moment, but there was some discussion about this about a year ago: https://datahubspace.slack.com/archives/CUMUWQU66/p1635352024067800

MatMoore commented Mar 19, 2024

Draft runbook section on populating environments

We have three DataHub environments:

  1. Dev (may be unstable if we are working on deployment)
  2. Test (relatively stable, will be kept up to date as needed for user testing)
  3. Pre-production (most stable, most out of date)

Note: At least during alpha, all of these should be populated from the same sources of metadata, so that research participants are working with a catalogue that is as realistic as possible. This means there should be nothing important in our pre-production catalogue that is not also in the dev & test catalogues (including metadata from the production analytical platform).

Prerequisites for populating environments

  1. Install the DataHub CLI
  2. Create an access token and configure it in the CLI (example below)
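
As a sketch, assuming a recent acryl-datahub release, the setup might look like this (`datahub init` prompts for the GMS server URL and the access token created in the DataHub UI):

# Install the DataHub CLI (assumes Python 3 and pip are available)
pip install --upgrade acryl-datahub

# Store the server URL and access token in ~/.datahubenv
datahub init

# Confirm the CLI is installed
datahub version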

One-off ingestions

In each environment, we (the Data Catalogue team) have prepopulated a set of metadata that we have collated ourselves.
This represents our current best guess at what good metadata may look like, and may form the basis of future ingestion mechanisms.

These ingestions are push-based, using the DataHub API and/or the command line.

After these steps, we expect the following to be created:

  • Datasets and containers describing the Analytical Platform, based on the previous Data Discovery Tool
  • Example data products - our best guess at what a well-maintained data product will look like
  • Our draft domain model
  • Our draft glossary
  • A sample of users we have identified as possible data owners or custodians - and ownership associations to the datasets and data products above

Step 1: Import draft domain model from Create a Derived Table

  1. Download the DBT manifest.json from s3://mojap-derived-tables/prod/run_artefacts/
  2. Use the DataHub CLI to ingest the domain model from Create a Derived Table, using the custom ingestion source (TODO: add commands for this; a sketch follows the example recipe below)

Example YAML:

source:
  type: create_derived_table_domains_source.source.CreateDerivedTableDomainsSource
  config:
    manifest_local_path: "manifest.json"

sink:
  type: datahub-rest
  config:
    server: "https://datahub-catalogue-ENV.apps.live.cloud-platform.service.justice.gov.uk/api/gms"
    token: xxxxx
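
Putting the pieces together, a hedged sketch of the commands (the manifest key is taken from a later comment in this thread, and recipe.yaml is the file shown above):

# Download the dbt manifest produced by Create a Derived Table
# (full key per a later comment in this thread)
aws s3 cp s3://mojap-derived-tables/prod/run_artefacts/latest/target/manifest.json .

# Push the domain model to the environment's GMS endpoint
# (assumes the custom source plugin is importable in the active environment)
datahub ingest -c recipe.yaml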

Step 2: Import draft glossary and users

Follow the instructions to import via the CLI in https://github.com/ministryofjustice/data-catalogue-metadata

Step 3: Import metadata taken from Data Discovery Tool

Follow the instructions to run the python script in https://github.com/ministryofjustice/data-catalogue-metadata

Scheduled ingestions

Each environment is configured with scheduled ingestions for metadata we expect to be updated. This demonstrates how we can continually pull data from other parts of the MOJ estate that the catalogue has direct access to.

These sources are configured from the ingestion tab in DataHub.

TBC: should the configuration for these be checked into a GitHub repo?

Note that the ingestion tab may also show ingestions triggered from the command line, although these will show up as view-only and cannot be triggered again from the UI.

Step 4: Schedule DBT ingestion

This brings in derived tables and their lineage. Source tables may overlap with those ingested from other sources.

Step 5: Schedule custom ingestion for Justice Data charts (currently untested)

Manual environment setup

Certain aspects of the environment are not reproducible from code. These include:

  • Setting a warning notice on the DataHub homepage
  • Creating custom views in DataHub, e.g. "Well documented data"

These must be set up manually when recreating an environment.

MatMoore commented

Our previous DBT config used a presigned S3 URL to access the manifest. This time round we would like to try:

  • populating via S3 using an assumed role (see the sketch after the recipe below)
  • broadening the list of tables we pull in
source:
    type: dbt
    config:

        # insert s3 path here

        target_platform: s3
        entities_enabled:
            test_results: 'NO'
            seeds: 'NO'
            snapshots: 'NO'
            models: 'YES'
            sources: 'NO'
            test_definitions: 'NO'
        node_name_pattern:
            allow:
                - '.*oasys_set$'
                - '.*oasys_section$'
                - '.*oasys_question$'
                - '.*oasys_answer$'
                - '.*oasys_assessment_group$'
                - '.*offender$'
                - '.*ref_question$'
                - '.*prison_population_history__imprisonment_spells$'
                - '.*prison_population_history__jicsl_lookup_ao_population_nart$'
                - '.*derived_delius__components_at_latest$'
                - '.*derived_delius__components_at_comm$'
                - '.*derived_delius__components_at_term$'
                - '.*derived_delius__contacts$'
                - '.*derived_delius__court_appearances$'
                - '.*derived_delius__court_reports$'
                - '.*derived_delius__first_release$'
                - '.*derived_delius__releases$'
                - '.*derived_delius__sentences_at_disp$'
                - '.*derived_delius__sentences_at_latest$'
                - '.*derived_delius__sentences_at_term$'
                - '.*derived_delius__upw_appointments$'
                - '.*common_platform_derived__all_offence_fct$'
                - '.*common_platform_derived__cases_fct$'
                - '.*common_platform_derived__crown_trials_fct$'
                - '.*common_platform_derived__def_hearing_summary_fct$'
                - '.*common_platform_derived__defendant_summary_fct$'
                - '.*common_platform_derived__disposal_summary_fct$'
                - '.*common_platform_derived__sjp_all_offence_fct$'
                - '.*common_platform_derived__sjp_defendant_summary_fct$'
                - '.*common_platform_derived__sjp_disposal_summary_fct$'
                - '.*common_platform_derived__sjp_session_summary_fct$'
                - '.*lookup_offence_v2__cjs_offence_code_to_ho_offence_code$'
                - '.*lookup_offence_v2__ho_offence_codes$'
                - '.*lookup_offence_v2__offence_group$'
                - '.*lookup_offence_v2__offence_group_code$'
                - '.*lookup_offence_v2__offence_priority$'
        stateful_ingestion:
            remove_stale_metadata: true
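
One way to try the assumed-role approach from a terminal, as a sketch (the role ARN, session name, and recipe filename are placeholders, not values from this project):

# Assume a role that can read the mojap-derived-tables bucket (ARN is a placeholder)
creds=$(aws sts assume-role \
  --role-arn arn:aws:iam::123456789012:role/datahub-ingestion \
  --role-session-name datahub-dbt-ingest \
  --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' \
  --output text)

# Export the temporary credentials so the ingestion can read the manifest from S3
read -r AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN <<< "$creds"
export AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN

datahub ingest -c dbt_recipe.yaml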

MatMoore commented

Modified version for running on test.

I haven't set this up as a scheduled task yet, but we can try this once https://github.com/moj-analytical-services/create-a-derived-table/pull/1269 is merged.

source:
    type: dbt
    config:
        platform_instance: create_a_derived_table
        manifest_path: 'manifest.json'
        catalog_path: 'catalog.json'
        target_platform: athena
        entities_enabled:
            test_results: 'NO'
            seeds: 'NO'
            snapshots: 'NO'
            models: 'YES'
            sources: 'NO'
            test_definitions: 'NO'
        node_name_pattern:
            allow:
                - '.*oasys_set$'
                - '.*oasys_section$'
                - '.*oasys_question$'
                - '.*oasys_answer$'
                - '.*oasys_assessment_group$'
                - '.*offender$'
                - '.*ref_question$'
                - '.*prison_population_history__imprisonment_spells$'
                - '.*prison_population_history__jicsl_lookup_ao_population_nart$'
                - '.*derived_delius__components_at_latest$'
                - '.*derived_delius__components_at_comm$'
                - '.*derived_delius__components_at_term$'
                - '.*derived_delius__contacts$'
                - '.*derived_delius__court_appearances$'
                - '.*derived_delius__court_reports$'
                - '.*derived_delius__first_release$'
                - '.*derived_delius__releases$'
                - '.*derived_delius__sentences_at_disp$'
                - '.*derived_delius__sentences_at_latest$'
                - '.*derived_delius__sentences_at_term$'
                - '.*derived_delius__upw_appointments$'
                - '.*common_platform_derived__all_offence_fct$'
                - '.*common_platform_derived__cases_fct$'
                - '.*common_platform_derived__crown_trials_fct$'
                - '.*common_platform_derived__def_hearing_summary_fct$'
                - '.*common_platform_derived__defendant_summary_fct$'
                - '.*common_platform_derived__disposal_summary_fct$'
                - '.*common_platform_derived__sjp_all_offence_fct$'
                - '.*common_platform_derived__sjp_defendant_summary_fct$'
                - '.*common_platform_derived__sjp_disposal_summary_fct$'
                - '.*common_platform_derived__sjp_session_summary_fct$'
                - '.*lookup_offence_v2__cjs_offence_code_to_ho_offence_code$'
                - '.*lookup_offence_v2__ho_offence_codes$'
                - '.*lookup_offence_v2__offence_group$'
                - '.*lookup_offence_v2__offence_group_code$'
                - '.*lookup_offence_v2__offence_priority$'
        stateful_ingestion:
            remove_stale_metadata: true

sink:
  type: datahub-rest
  config:
    server: "https://datahub-catalogue-test.apps.live.cloud-platform.service.justice.gov.uk/api/gms"
    token: xxxx

MatMoore commented Mar 20, 2024

manifest_path will change to s3://mojap-derived-tables/dev/run_artefacts/latest/target/manifest.json, with dev replaced by prod so we don't pick up metadata that isn't live.

MatMoore commented

This is done as far as the test environment is concerned.

The remaining work is blocked on ministryofjustice/find-moj-data#175

We just need to make sure the Python script runs on preprod after this.

The above DBT job can also be converted into a scheduled ingestion once there is data in s3://mojap-derived-tables/prod/run_artefacts/latest/target/manifest.json (see the sketch below).
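
If it comes to that, one way to register the recipe as a UI-managed scheduled ingestion from the CLI, as a sketch (assumes a CLI version that includes `datahub ingest deploy`; the name and cron schedule are placeholders):

# Register the recipe as a scheduled ingestion source (name and schedule are placeholders)
datahub ingest deploy --name "cadet-dbt" --schedule "0 5 * * *" -c dbt_recipe.yaml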
