Add Column Level Lineage Support #2931

Closed
harshach opened this issue Feb 22, 2022 · 0 comments

harshach commented Feb 22, 2022

Is your feature request related to a problem? Please describe.
In 0.8 we added lineage APIs, and we currently support lineage extraction from several sources:

  1. Airflow
  2. BigQuery
  3. Redshift
  4. Snowflake
  5. MSSQL
  6. Vertica
  7. View Level Lineage Capture
  8. Metabase
  9. Superset
  10. Tableau

This list continues to grow with each release.

We should also allow lineage to be captured at the column level, both through the APIs and from the sources above, so that the UI lineage editor can build column-level lineage as well.

Task breakdown

API Support

Currently, OpenMetadata supports the following lineage:

  • Table to Table
  • Tables to a pipeline and then to the output table
  • Nodes in the lineage graph can also include other data assets, such as dashboards and reports.

[Image: Blank diagram - Page 13]

This is a proposal to add column-level lineage to the backend and to support it in the lineage APIs. The lineage edge between two tables is enhanced to carry column-level lineage information as follows:

[Image: Blank diagram - Page 13 (1)]

The column lineage details are stored as an additional property of the lineage edge between two tables, using a new type, lineageDetails, with:

  • columnsLineage: a list of entries, each with a set of source columns, the function used to transform them, and the resulting destination column
  • sqlQuery: the SQL query that consumes a set of source tables to generate the destination table
  • pipeline: the pipeline that ran the SQL query to generate the destination table

The lineageDetails schema is shown below:

    "columnLineage": {
      "type" : "object",
      "properties": {
        "fromColumns" : {
          "description": "One or more source columns identified by fully qualified column name used by transformation function to create destination column.",
          "type" : "array",
          "items" : {
            "$ref" : "../type/basic.json#/definitions/fullyQualifiedEntityName"
          }
        },
        "toColumn" : {
          "description": "Destination column identified by fully qualified column name created by the transformation of source columns.",
          "$ref" : "../type/basic.json#/definitions/fullyQualifiedEntityName"
        },
        "function" : {
          "description": "Transformation function applied to source columns to create destination column. That is `function(fromColumns) -> toColumn`.",
          "$ref" : "../type/basic.json#/definitions/sqlFunction"
        }
      }
    },
    "lineageDetails" : {
      "description" : "Lineage details including sqlQuery + pipeline + columnLineage.",
      "type" : "object",
      "properties": {
        "sqlQuery" : {
          "description": "SQL used for transformation.",
          "$ref" : "../type/basic.json#/definitions/sqlQuery"
        },
        "columnsLineage" : {
          "description" : "Lineage information of how upstream columns were combined to get downstream column.",
          "type" : "array",
          "items" : {
            "$ref" : "#/definitions/columnLineage"
          }
        },
        "pipeline" : {
          "description": "Pipeline where the sqlQuery is periodically run.",
          "$ref" : "../type/entityReference.json"
        }
      },
      "required": ["sqlQuery", "columnsLineage"]
    },
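
To make the schema more concrete, here is a minimal Python sketch that builds one lineageDetails value. The field names (`fromColumns`, `toColumn`, `function`, `sqlQuery`, `columnsLineage`, `pipeline`) come from the schema above. The dataclasses, the `to_payload` helper, and the table and column names are hypothetical and exist only for illustration; they are not part of OpenMetadata.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class ColumnLineage:
    # Fully qualified names of the source columns.
    fromColumns: List[str]
    # Fully qualified name of the destination column.
    toColumn: str
    # Transformation applied to the source columns (optional in the schema).
    function: Optional[str] = None


@dataclass
class LineageDetails:
    # Both fields below are listed under "required" in the schema.
    sqlQuery: str
    columnsLineage: List[ColumnLineage]
    # Entity reference to the pipeline; its exact shape is assumed here.
    pipeline: Optional[dict] = None


def to_payload(details: LineageDetails) -> dict:
    """Serialize to a plain dict, dropping optional fields left unset."""
    def prune(value):
        if isinstance(value, dict):
            return {k: prune(v) for k, v in value.items() if v is not None}
        if isinstance(value, list):
            return [prune(v) for v in value]
        return value

    return prune(asdict(details))


# Hypothetical example: two source columns concatenated into one.
details = LineageDetails(
    sqlQuery="INSERT INTO customers SELECT CONCAT(first, last) FROM users",
    columnsLineage=[
        ColumnLineage(
            fromColumns=[
                "svc.db.public.users.first",
                "svc.db.public.users.last",
            ],
            toColumn="svc.db.public.customers.full_name",
            function="CONCAT",
        )
    ],
)
payload = to_payload(details)
```

Since `pipeline` is not in the schema's `required` list, the sketch leaves it out of the payload when it is unset.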

All existing APIs will remain as they are, except for the addition of lineageDetails to the edge.
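
As a rough illustration of what this could look like for a client, the sketch below builds an edge payload in which lineageDetails is optional, so a table-level edge keeps its current shape. The `fromEntity` and `toEntity` keys, the `id`/`type` reference shape, and the `build_edge` helper are assumptions made for this example and are not confirmed by this issue. Only the lineageDetails field names and its required list are taken from the schema.

```python
from typing import Optional

# Taken from the "required" list of the lineageDetails schema.
REQUIRED_DETAIL_KEYS = ("sqlQuery", "columnsLineage")


def build_edge(from_table_id: str, to_table_id: str,
               lineage_details: Optional[dict] = None) -> dict:
    """Build a table-to-table edge, optionally carrying lineageDetails.

    The edge shape (fromEntity/toEntity with id and type) is hypothetical.
    """
    edge = {
        "fromEntity": {"id": from_table_id, "type": "table"},
        "toEntity": {"id": to_table_id, "type": "table"},
    }
    if lineage_details is not None:
        missing = [k for k in REQUIRED_DETAIL_KEYS if k not in lineage_details]
        if missing:
            raise ValueError(f"lineageDetails is missing: {missing}")
        edge["lineageDetails"] = lineage_details
    return edge


# A table-level edge, unchanged from today's shape.
table_edge = build_edge("users-id", "customers-id")

# The same edge with column-level details attached.
column_edge = build_edge(
    "users-id",
    "customers-id",
    {
        "sqlQuery": "INSERT INTO customers SELECT email FROM users",
        "columnsLineage": [
            {
                "fromColumns": ["svc.db.public.users.email"],
                "toColumn": "svc.db.public.customers.email",
            }
        ],
    },
)
```

Keeping lineageDetails as an optional key means callers that only record table-level lineage would not need to change.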

Ingestion Support

UI Support

@harshach harshach added the epic label Feb 22, 2022
@harshach harshach added this to To do in Release 0.10.0 via automation Feb 22, 2022
@harshach harshach removed this from To do in Release 0.10.0 Mar 12, 2022
@harshach harshach added this to To do in Release 0.11 via automation Mar 12, 2022
@harshach harshach closed this as completed Jul 4, 2022
Release 0.11 automation moved this from To do to Done Jul 4, 2022