Add get_identifier to hive source in metadata ingestion #2667

zack3241 · 2021-06-08T23:00:45Z

The hive source currently does not include the catalog name
in the schema name. This is due to the SQLAlchemy
reflection that pyhive performs. Without the catalog
name, multiple hive sources could have collisions
with the urns generated. This function will add
the database name where applicable and solve this problem.

Checklist

The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
Links to related issues (if applicable)
Tests for the changes have been added/updated (if applicable)
Docs related to the changes have been added/updated (if applicable)

zack3241 · 2021-06-08T23:33:11Z

The build is failing because of a mypy issue stemming from their 0.900 and 0.901 release that went out today. I can either fix those issues or I can pin the mypy package to a specific version.

hsheth2

LGTM - thanks for adding this @zack3241

The hive source currently does not include the catalog name in the schema name. This is due to the SQLAlchemy reflection that pyhive performs. Without the catalog name, multiple hive sources could have collisions with the urns generated. This function will add the database name where applicable and solve this problem.

jjoyce0510 · 2021-06-09T18:20:30Z

LGTM. Thanks!

shirshanka

LGTM. Thank you Zack!

* fix(react): update ispartofbuilderfromdataflow, update ui in datajob header (datahub-project#2619) * fix(ingestion): improve robustness of glue ingestion source (datahub-project#2626) fixes: datahub-project#2625 Co-authored-by: thomas.larsson <thomas.larsson@klarna.com> * feat(react): custom properties are now sortable by name in the UI (datahub-project#2622) * fix(ci): update trigger to always generate docker images (datahub-project#2636) * fix(ingest): include urn as key for kafka emitter (datahub-project#2634) * fix(react): url encoding urns and tag profile fix (datahub-project#2623) * feat(react): replace user urn with username (datahub-project#2599) Co-authored-by: shubham.garg <shubham.garg@thoughtworks.com> * fix(ingest): improve redshift ingestion performance (datahub-project#2635) * docs: update roadmap with accomplished items (datahub-project#2617) * feat: No Code Metadata Modeling (datahub-project#2629) Co-authored-by: Dexter Lee <dexter@acryl.io> Co-authored-by: Gabe Lyons <itsgabelyons@gmail.com> Co-authored-by: Shirshanka Das <shirshanka@apache.org> * docs(no-code): removing unused snapshot annotations (datahub-project#2642) * feat(datahub-upgrade): Use the "head" image tag. (datahub-project#2641) * fix(datahub-upgrade): remove debug flag (datahub-project#2640) * fix(build): don't purge ingestion codegen files on gradle clean (datahub-project#2645) * fix(analytics): fix index names (datahub-project#2647) * fix(nocode): Fix docker image tag for run_upgrade script (datahub-project#2648) * docs(nocode): Update migration docs with workaround for stuck upgrade(datahub-project#2649) * fix: use head tag by default in quickstart (datahub-project#2650) * fix(datahub-upgrade): fixing mis-spelled schema registry url (datahub-project#2652) * fix(gms): Return empty snapshots when one does not exist (datahub-project#2646) * docs(nocode): fix links to datahub-upgrade (datahub-project#2651) * feat(datahub-upgrade): improve no code upgrade logging (datahub-project#2653) * fix(docker): pin containers to golden hash for release (datahub-project#2654) * feat(nocode): Add datahub-upgrade job to helm chart and set version to v0.8.1 (datahub-project#2643) * docs(nocode): Adding documentation for no-migration upgrade option (datahub-project#2656) * revert: "fix(docker): pin containers to golden hash for release (datahub-project#2654)" (datahub-project#2659) This reverts commit a483933 and moves us back to using HEAD in quickstart on the master branch. * fix(docker): use debug tag in local dev images (datahub-project#2658) * fix(ingest): support mssql encryption via ODBC (datahub-project#2657) * fix(NoCode): Update snapshot json to latest (datahub-project#2655) * fix(ingest): exclude mssql-odbc from "all" extra (datahub-project#2660) * docs(nocode): adding documentation to handle backwards incompatibility issues (datahub-project#2662) * docs(nocode): fix typo in autocomplete annotation (datahub-project#2664) * build(docker): test docker builds in pull request CI (datahub-project#2661) * fix(ingest): fix MyPy stubs (datahub-project#2666) * feat(ingest): Feast ingestion integration (datahub-project#2605) * Add feast testing setup * Init Feast test script * Add feast to dependencies * Update feast descriptors * Sort integrations * Working feast pytest * Clean up feast docker-compose file * Expand Feast tests * Setup feast classes * Add continuous and bytes data to feature types * Update field type mapping * Add PDLs * Add MLFeatureSetUrn.java * Comment out feast setup * Add snapshot file and update inits * Init Feast golden files generation * Clean up Feast ingest * Feast testing comments * Yield Feature snapshots * Fix Feature URN naming * Update feast MCE * Update Feature URN prefix * Add MLEntity * Update golden files with entities * Specify feast sources * Add feast source configs * Working feast docker ingestion * List entities and features before adding tables * Add featureset names * Remove unused * Rename feast image * Update README * Add env to feast URNs * Fix URN naming * Remove redundant URN names * Fix enum backcompatibility * Move feast testing to docker * Move URN generators to mce_builder * Add source for features * Switch TypeClass -> enum_type * Rename source -> sourceDataset * Add local Feast ingest image builds * Rename Entity -> MLPrimaryKey * Restore features and keys for each featureset * Do not json encode source configs * Remove old source properties from feature sets * Regenerate golden file * Fix race condition with Feast tests * Exclude unknown source * Update feature datatype enum * Update README and fix typos * Fix Entity typo * Fix path to local docker image * Specify feast config and version * Fix feast env variables * PR fixes * Refactor feast ingest constants * Make feature sources optional for back-compatibility * Remove unused GCP files * adding docker publish workflow * Simplify name+namespace in PrimaryKeys * adding docker publish workflow * debug * final attempt * final final attempt * final final final commit * Switch to published ingestion image * Update name and namespace in java files * Rename FeatureSet -> FeatureTable * Regenerate codegen * Fix initial generation errors * Update snapshot jsons * Regenerated schemas * Fix URN formats * Revise builds * Clean up feast URN builders * Fix naming typos * Fix Feature Set -> Feature Table * Fix comments * PR fixes * All you need is Urn * Regenerate snapshots and update validation * Add UNKNOWN data type * URNs for source types * Add note on docker requirement * Fix typo * Reorder aspect unions * Refactor feast ingest functions * Update snapshot jsons * Rebuild Co-authored-by: Shirshanka Das <shirshanka@apache.org> * fix(no-code): Adding Chart input relationship annotations (datahub-project#2669) * chart input relationship pdl fix * commiting schema.avsc changes * Add get_identifier to hive source in metadata ingestion (datahub-project#2667) * feat(k8s): Add imagePullSecrets to all K8's jobs (datahub-project#2671) * fix(analytics): Support sasl authentication to kafka (datahub-project#2675) * feat(ingest): headers for codegen Python scripts (datahub-project#2637) * fix(ingest): pin to new mypy version (datahub-project#2670) * Fix 2592 neo4j connection options (datahub-project#2632) * fix(ingest): upgrade acryl-pyhive to use sasl3 instead of sasl (datahub-project#2684) * feat(ingest): support Oracle service names (datahub-project#2676) * feat(sql_views): added views as datasets for SQLAlchemy DBs (datahub-project#2663) * removing whitespace from service aspect (datahub-project#2688) * Only work with dbt catalog data if load_catalog is False (datahub-project#2686) * feat(docs): Docs for S3 ingestion with AWS Glue (datahub-project#2672) * feat(datahub cli): DataHub CLI Quickstart (datahub-project#2689) * feat(gms): Merge MAE, MCE consumers into GMS (datahub-project#2690) * fix(NoCode): Fix product analytics (datahub-project#2697) * feat(gms): support basic auth header when connecting to elasticSearch (datahub-project#2701) * Add support for different oidc client authentication methods (datahub-project#2691) * feat(aspects): support fetching of versioned aspects (datahub-project#2677) * feat(entities): add markdown description update/viewer feature in dataset, datajob, dataflow, chart and dashboard, update ui/ux (datahub-project#2707) * feat(ingest): Add test case and docs for SQL view ingestion (datahub-project#2709) * fix(ingest): use `looker` data platform (datahub-project#2708) * test(ingest): simplify docker cleanup commands (datahub-project#2699) * feat: Adding annotation option to frontend service (datahub-project#2674) * feat(ingest): expose additional types to Python via codegen (datahub-project#2712) * Removed key not in catalog filter from dbt and added node filter using AllowDenyPattern (datahub-project#2702) * added node_type_pattern in dbt yaml file (datahub-project#2705) * fix(editable descriptions): adding indexing for editable descriptions (datahub-project#2710) * feat(quickstart): use versioned docker images when on a release tag (datahub-project#2715) * fix(noCode): Improving efficiency of EntityService "listLatestAspects" API (datahub-project#2711) * fix(react): update schema description edit behavior (datahub-project#2718) * Update version for helm (datahub-project#2719) * fixing docs sidebar (datahub-project#2721) * fix(docs): update extending-the-metadata-model.md (datahub-project#2727) * fix(docker): update default tags to head (datahub-project#2730) * fix(looker): fix invalid URN syntax error (datahub-project#2737) * fix(ci): increase Feast docker setup timeout (datahub-project#2722) * fix(ingest): types for dbt (datahub-project#2716) * feat(ingest): add support for Glue ETL jobs (datahub-project#2687) * fix(ingest): fix lookml platform URN (datahub-project#2742) * fix(ingest): update lookml test (datahub-project#2741) * fix(gms): fixes for version aspect fetching (datahub-project#2723) * feat(graph): support using elasticsearch as graph backend. (datahub-project#2726) * feat(ingest): print docker logs on timeout (datahub-project#2743) * fix(ci): increase wait-for-it timeout to fix flaky feast test (datahub-project#2747) * feat(docker): reduce quickstart footprint (datahub-project#2744) * feat(quickstart): remove orphaned docker containers on quickstart through cli (datahub-project#2748) * fix(gms): add rest.li validation in gms (datahub-project#2745) * fix(datahub-upgrade): add support for postgres migration (datahub-project#2740) * fix(docker): modernize docker images and fix vulnerabilities (datahub-project#2746) * feat(ingest): ingest last-modified from dbt sources.json (datahub-project#2729) * feat(ingest): add option to specify source platform database in lookml ingestion (datahub-project#2749) * fix(gms): make get return entity type (datahub-project#2739) * fix(elastic-as-graph): adding elasticsearch setup back in (datahub-project#2752) * fix(browse): sort by doc count descending (datahub-project#2751) * Revert "fix(gms): add rest.li validation in gms (datahub-project#2745)" (datahub-project#2754) This reverts commit f0b4adc. * fix(docker): use head tag for datahub-ingestion (datahub-project#2760) * feat(elastic-as-graph): defaulting to elastic in quickstart (datahub-project#2753) * fix(react): reverse result of topological sort, autofocus in add tag modal (datahub-project#2759) * feat: usage stats (part 1) (datahub-project#2750) Co-authored-by: Gabe Lyons <itsgabelyons@gmail.com> * feat: usage stats (part 2) (datahub-project#2762) Co-authored-by: Gabe Lyons <itsgabelyons@gmail.com> * docs: update for Jun townhall (datahub-project#2764) * fix(frontend): auth session ttl specified in hours instead of days. M… (datahub-project#2695) Fixes: datahub-project#2680 Co-authored-by: thomas.larsson <thomas.larsson@klarna.com> * fix(react): move percent sign after number and update meta tag (datahub-project#2765) * fix(docker): Fix dependency vulnerability (datahub-project#2763) * fix(datahub-upgrade): removing the CleanupStep from datahub upgrade (datahub-project#2728) * docs(ingest): move usage stats docs into the "sources" section (datahub-project#2766) * fix(react): update the query frequency text label (datahub-project#2767) * feat(k8s): add GCP deploy recipe (datahub-project#2768) * fix(react): fix graphql apollo cache update issue cause of usagestats (datahub-project#2769) * feat(logs): improve logging in GMS and datahub-frontend (datahub-project#2761) * fix(nocode): removing service PDL (datahub-project#2772) * fix(react): update platform text in dataset profile header (datahub-project#2771) * feat(logs): add thresholding, misc cleanup (datahub-project#2773) * docs: upgrade docusaurus, minor ingestion updates (datahub-project#2774) * feat(ingest): add non-random sampling for mongo (datahub-project#2778) * feat(react-graphql-mock): mock graphql queries and mutations with miragejs * feat(react): configure Cypress + MirageJS + GraphQL mock for functional testing plus a couple of example tests * fix(react): fix mutation helper for updateentitylink * feat(react): update graphql issues with mock data * fix(docker): Upgrade to 4.1.44 netty-all library (datahub-project#2784) * refactor(ingest): use common get_sys_time method (datahub-project#2782) * fix(docs): links to Feast entities (datahub-project#2780) * feat(k8s): Upgrade to v0.8.4 (datahub-project#2792) * fix(frontend): making nested SSL configs optional (datahub-project#2785) * fix(ingest): use correct platform for MongoDB ingestion (datahub-project#2783) * Update quickstart to include system prune (datahub-project#2794) * docs(website): add releases page (datahub-project#2776) * refactor(ingest): remove deprecated methods and warn on deprecated import (datahub-project#2797) * fix(usage-stats): add indices target for usage stats query (datahub-project#2798) * fix(ingest): handle case when view definition handler is not implemented (datahub-project#2796) * fix(ingest): quote table names in hive (datahub-project#2801) * fix(datahub-upgrade): add runtime dependency on logbackClassic (datahub-project#2799) * fix(search): Filter out "removed" entities from autocomplete and analytics (datahub-project#2781) * feat(ingest): Improve lookml sql derived tables detection, add cascading derived tables to lineage (datahub-project#2770) * feat(ingest): SageMaker feature store ingestion (datahub-project#2758) * fix(ingest): better warnings and error handling for rest sink (datahub-project#2800) * fix(usage-stats): Add usage stats factory to mae processor config (datahub-project#2803) * refactor(ingest): extract sqlalchemy uri generation logic (datahub-project#2786) * fix(quickstart): Fixing manual mysql quickstart (datahub-project#2811) * feat(ingest): support ingesting from multiple snowflake dbs (datahub-project#2793) * feat(backup): Add restore indices and restore backup tasks (datahub-project#2779) * docs(website): hide outdated FAQs page (datahub-project#2809) * feat(ingest): refactor mce comparison and add pytest update golden files option (datahub-project#2812) * fix(k8s): upgrade helm version (datahub-project#2810) * feat(ingest): basic support for complex hive types (datahub-project#2804) * fix(datahub-upgrade): fix vulnerabilities (datahub-project#2813) * chore(helm): upgrade version to v0.8.5 (datahub-project#2814) * feat(datahub-frontend): Adding basic file-based authentication to datahub-frontend (datahub-project#2818) * fix(ingest): view handling resilience for redshift (datahub-project#2816) * docs(ingest): add extra info for Redshift behind a proxy (datahub-project#2817) * fix(docs): update OIDC docs (datahub-project#2823) * docs(docker): fix check command reference (datahub-project#2822) Fixes datahub-project#2820. * docs(elastic-for-graph): Add migrating from neo4j to elastic instructions (datahub-project#2826) * fix(ingest): convert superset timestamps to micros (datahub-project#2827) * Previous values were in seconds, which rendered incorrectly in the UI. * feat(ci): separate metadata-ingestion into a separate workflow (datahub-project#2828) * fix(cli): change docker nuke to also remove stopped containers (datahub-project#2825) * fix(mae-consumer): support standalone mae consumer without neo4j (datahub-project#2824) * docs: update H2 2021 roadmap (datahub-project#2829) * docs(ingest): update links to Kafka docs (datahub-project#2834) * fix(ingest): mask password in info-level logs (datahub-project#2835) * fix(ingest): various BigQuery source fixes (datahub-project#2836) * docs(ingest): clarify that the Kafka options are pass-through (datahub-project#2837) * fix(ingest): do not fail dbt ingestion when encountering missing nodes (datahub-project#2833) * feat (helm datahub-gms): add ingress template to datahub-gms helm chart (datahub-project#2832) Co-authored-by: christopher.coulson <christopher.coulson@higpagesgroup.com.au> * feat(react): fix apollo client cache issues in entities profile pages (datahub-project#2841) * fix(ingest): avoid setting timestamps unless source system provides it (datahub-project#2843) * fix(docs): fix some broken links in the docs (datahub-project#2846) * fix(ingest): Fix glob pattern and handle possible recursion in lookml (datahub-project#2851) * feat(ingest): prettify stack traces in CLI (datahub-project#2845) * fix(analytics): Fix SSL issue with analytics on frontend (datahub-project#2840) * fix(ingest): check for dbt materialization before proceeding (datahub-project#2842) * feat(docs): throttle and retry requests in doc generation (datahub-project#2847) * feat(ingest): SageMaker jobs and models (datahub-project#2830) * docs(ingest): remove hanging sentence from docs (datahub-project#2853) * fix(ingest): delete pycache files when running clean (datahub-project#2852) * fix(ingest): handle 'fields' list missing in bigquery-usage (datahub-project#2844) * feat(k8s): Extract helm charts into a separate repo (datahub-project#2848) * fix(react): fix searchbar width issue in small screen (datahub-project#2856) * feat(react): implement visualizing historical versions of dataset schema (datahub-project#2855) * feat(docs): carousels for videos and articles (datahub-project#2857) * fix(frontend): handling null aspects in AspectType (datahub-project#2860) * fix: fixing lint issue (datahub-project#2861) * feat(ingest): support dynamic imports for transfomer methods (datahub-project#2858) * feat(docs): swap Medium and videos sections (datahub-project#2859) * feat(ingest): add browse paths + dataplatform for Feast features (datahub-project#2849) * fix(build): increase retries for dependency fetches (datahub-project#2867) * build(ingest): separate metadata-ingestion build workflow fully (datahub-project#2862) * fix(quickstart): update compose spec version (datahub-project#2873) Also update the defaults to point at the head tag instead of latest. * docs(quickstart): add default password to quickstart (datahub-project#2875) * feat(ingest): extract dbt meta fields (datahub-project#2876) * feat(ingest): update golden files only when diff fails (datahub-project#2869) * build(ingestion): add version prompt to release script (datahub-project#2866) * fix(react): fix bug in description update modal (datahub-project#2874) * fix(react): update dataset graphql mock Co-authored-by: Thomas Larsson <thomas.p.larsson@gmail.com> Co-authored-by: thomas.larsson <thomas.larsson@klarna.com> Co-authored-by: Rickard Cardell <rickard.cardell@klarna.com> Co-authored-by: Dexter Lee <dexter@acryl.io> Co-authored-by: Harshal Sheth <hsheth2@gmail.com> Co-authored-by: Gabe Lyons <itsgabelyons@gmail.com> Co-authored-by: shubham garg <gargshubham49@gmail.com> Co-authored-by: shubham.garg <shubham.garg@thoughtworks.com> Co-authored-by: Shirshanka Das <shirshanka@apache.org> Co-authored-by: John Joyce <john@acryl.io> Co-authored-by: Kevin Hu <6051736+kevinhu@users.noreply.github.com> Co-authored-by: zack3241 <zmjlawson@gmail.com> Co-authored-by: Vincenzo Lavorini <vincenzo.lavorini@protonmail.ch> Co-authored-by: Remi <salmon.remi@gmail.com> Co-authored-by: JeffSkinner <Jeff@RigaBee.com> Co-authored-by: Peter Mortier <kwark@users.noreply.github.com> Co-authored-by: kuntalkumarbasu <42655948+kuntalkumarbasu@users.noreply.github.com> Co-authored-by: vijayan-nallasami-curve <85870918+vijayan-nallasami-curve@users.noreply.github.com> Co-authored-by: wan <wan@noreply.gitlab.com> Co-authored-by: datascienceChris <coulson.christopher@gmail.com> Co-authored-by: christopher.coulson <christopher.coulson@higpagesgroup.com.au> Co-authored-by: Fredrik Sannholm <fredrik.sannholm@wolt.com>

hsheth2 approved these changes Jun 8, 2021

View reviewed changes

zack3241 and others added 2 commits June 8, 2021 16:47

fix(ingest): fix MyPy stubs (datahub-project#2666)

38af7ca

zack3241 force-pushed the feature/hive-database-identifier branch from 1b252db to 38af7ca Compare June 8, 2021 23:47

shirshanka approved these changes Jun 9, 2021

View reviewed changes

shirshanka merged commit 91eb3cc into datahub-project:master Jun 9, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add get_identifier to hive source in metadata ingestion #2667

Add get_identifier to hive source in metadata ingestion #2667

zack3241 commented Jun 8, 2021 •

edited

zack3241 commented Jun 8, 2021

hsheth2 left a comment

jjoyce0510 commented Jun 9, 2021

shirshanka left a comment

Add get_identifier to hive source in metadata ingestion #2667

Add get_identifier to hive source in metadata ingestion #2667

Conversation

zack3241 commented Jun 8, 2021 • edited

Checklist

zack3241 commented Jun 8, 2021

hsheth2 left a comment

Choose a reason for hiding this comment

jjoyce0510 commented Jun 9, 2021

shirshanka left a comment

Choose a reason for hiding this comment

zack3241 commented Jun 8, 2021 •

edited