Supported Data Sources and Limitations for the Azure Databricks to Purview Solution Accelerator Connector

This Databricks to Purview Solution Accelerator extracts lineage information from Spark's logical plans, emits it in the standardized OpenLineage format, and translates that standard format into Apache Atlas / Microsoft Purview types.

The solution accelerator supports a limited set of data sources for ingestion into Microsoft Purview and can be extended to support more.

Connecting to Assets in Purview

For the supported databases listed below, the Databricks to Purview Solution Accelerator will connect to a scanned asset present in Microsoft Purview.

For the supported filestores listed below, the Databricks to Purview Solution Accelerator will connect to folders or resource sets.

  • If you are reading a specific file in a folder (e.g. container/path/to/some.csv), the asset this solution connects to is the folder (e.g. container/path/to/), not the specific file, as illustrated below.
  • If the folder contains a resource set, this solution links to the resource set asset instead of the folder.
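
For example, a minimal sketch, assuming a Databricks notebook where spark is available; the storage account, container, and path names are hypothetical:

```python
# Reading one specific file (hypothetical account/container/path).
df = spark.read.csv(
    "abfss://container@mystorageacct.dfs.core.windows.net/path/to/some.csv",
    header=True,
)
# The Purview asset this run connects to is the parent folder,
# container/path/to/ (or its resource set), not some.csv itself.
```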

Azure Blob File System (ABFS)

Supports Azure Data Lake Storage Gen2 through the ABFS connector built into Apache Spark.

Azure Databricks Mount Points

Supports mapping Databricks Mount Points to the underlying storage location, as sketched after the notes below.

  • Only mount points with ABFS(S) or WASB(S) storage locations can be mapped to their appropriate Purview type.
  • Databricks Credential Passthrough is not currently supported.
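
A minimal sketch of a mount that can be mapped, assuming a Databricks notebook (spark and dbutils available) and hypothetical storage account, secret scope, and tenant values:

```python
# Service principal OAuth settings for the mount (values hypothetical);
# credential passthrough is NOT supported by the connector.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": dbutils.secrets.get("scope", "sp-client-id"),
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("scope", "sp-client-secret"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

# Mount an ABFS(S) location so the mount can be mapped to its Purview type.
dbutils.fs.mount(
    source="abfss://container@mystorageacct.dfs.core.windows.net/",
    mount_point="/mnt/lake",
    extra_configs=configs,
)

# Reads through the mount emit lineage against the underlying abfss:// path.
df = spark.read.parquet("/mnt/lake/path/to/data")
```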

Azure Storage Blobs (WASB)

Supports Azure Blob Storage through the WASB connector built into Apache Spark.

  • Only WASB paths with blob.core.windows.net in the host name will generate lineage with the correct Azure Blob Storage Purview type, as in the sketch below.
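
For illustration, a minimal sketch of a WASB read with the required host name (account and container names hypothetical):

```python
# Note blob.core.windows.net in the host name; other hosts will not
# produce the Azure Blob Storage Purview type.
df = spark.read.parquet(
    "wasbs://container@mystorageacct.blob.core.windows.net/path/to/data"
)
```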

Azure Synapse SQL Pools

Supports querying Azure Synapse SQL Pools with the Databricks Synapse Connector.

Does not support:

  • preAction / postAction (these SQL statements run in Synapse, NOT as a Spark job)
  • query as a data source: lineage will show the input table as dbo.COMPLEX.
  • Spark jobs that use Synapse tables in the same Synapse workspace for both input and output (all operations happen inside Synapse in this scenario, so no lineage is emitted).

Limited support for Azure Synapse as an output:

  • Because of how the Databricks connector is implemented, output lineage is reported against the staging folder path used to temporarily store the data before a PolyBase / COPY command is executed inside your Synapse SQL Pool, as in the sketch below.
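
A minimal sketch of a Synapse write with the Databricks Synapse Connector; the workspace, database, storage, and table names are hypothetical:

```python
# Output lineage is reported against the tempDir staging path below,
# not dbo.MyTable, because the final load runs inside Synapse.
(df.write
    .format("com.databricks.spark.sqldw")
    .option("url", "jdbc:sqlserver://myworkspace.sql.azuresynapse.net:1433;database=mydb")
    .option("tempDir", "abfss://staging@mystorageacct.dfs.core.windows.net/tmp")
    .option("forwardSparkAzureStorageCredentials", "true")
    .option("dbTable", "dbo.MyTable")
    .save())
```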

Azure SQL DB

Supports Azure SQL DB through the Apache Spark Connector for Azure SQL DB.

  • If you specify the dbTable value without the database schema (e.g. dbo), the connector assumes you are using the default dbo schema.
    • For users and Service Principals with different default schemas, this may result in incorrect lineage.
    • This can be corrected by specifying the database schema in the Spark job, as in the sketch below.
  • Does not support emitting lineage for cross-database table sources.
  • The default configuration supports defining a custom schema with dot-separated names, for example myschema.mytable. This will not work if table names in your organization may contain dot characters. In that case, delete the "azureSQLNonDboNoDotsInNames" section from the "OlToPurviewMappings" function configuration setting; note that you will then need bracket syntax to denote a custom schema, for example [myschema].[my.table].
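
A sketch of a schema-qualified read with the Apache Spark Connector for SQL Server / Azure SQL; the server, database, secret scope, and table names are hypothetical:

```python
# Qualifying dbtable with the schema avoids falling back to the
# caller's default schema and producing incorrect lineage.
df = (spark.read
    .format("com.microsoft.sqlserver.jdbc.spark")
    .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb")
    .option("dbtable", "myschema.mytable")  # not just "mytable"
    .option("user", dbutils.secrets.get("scope", "sql-user"))
    .option("password", dbutils.secrets.get("scope", "sql-password"))
    .load())
```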

Delta Lake File Format

Supports Delta File Format.

  • Does NOT support the MERGE INTO statement on Databricks due to differences between the Databricks and open-source Delta classes.
    • An earlier release mistakenly indicated support.
  • Does not support Delta on Spark 2 Databricks Runtimes.
  • Commands such as VACUUM or OPTIMIZE do not emit any lineage information and will not result in a Purview asset.
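
For illustration, a sketch of Delta operations that do and do not emit lineage (paths hypothetical):

```python
# Path-based Delta reads and writes emit lineage.
input_df = spark.read.format("delta").load(
    "abfss://container@mystorageacct.dfs.core.windows.net/in"
)
(input_df.write
    .format("delta")
    .mode("overwrite")
    .save("abfss://container@mystorageacct.dfs.core.windows.net/out"))

# These do NOT produce lineage on Databricks:
# spark.sql("MERGE INTO ...")   # unsupported due to the class differences above
# spark.sql("OPTIMIZE ...") / spark.sql("VACUUM ...")   # no lineage emitted
```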

Azure MySQL

Supports Azure MySQL through JDBC.
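
A minimal JDBC sketch; the server, database, user, and secret names are hypothetical, and the MySQL JDBC driver is assumed to be installed on the cluster:

```python
df = (spark.read
    .format("jdbc")
    .option("url", "jdbc:mysql://myserver.mysql.database.azure.com:3306/mydb")
    .option("dbtable", "mytable")
    .option("user", "myadmin")
    .option("password", dbutils.secrets.get("scope", "mysql-password"))
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .load())
```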

PostgreSQL

Supports both Azure PostgreSQL and on-prem/VM installations of PostgreSQL through JDBC.

  • If you specify the dbTable value without the database schema, the connector assumes you are using the default public schema.
    • For users and Service Principals with different default schemas, this may result in incorrect lineage.
    • This can be corrected by specifying the database schema in the Spark job.
  • The default configuration supports defining a custom schema with dot-separated names, for example myschema.mytable (see the sketch below).
  • If you register and scan your Postgres server as localhost in Microsoft Purview but use its IP address within the Databricks notebook, the assets will not be matched correctly; use the IP address when registering the Postgres server.
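
A sketch tying these notes together; the host, database, user, and secret names are hypothetical, and the PostgreSQL JDBC driver is assumed to be installed:

```python
# Use a schema-qualified dbtable, and use the same host/IP here that
# was used when registering and scanning the server in Purview.
df = (spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://10.0.0.4:5432/mydb")
    .option("dbtable", "myschema.mytable")  # avoids defaulting to public
    .option("user", "postgres")
    .option("password", dbutils.secrets.get("scope", "pg-password"))
    .option("driver", "org.postgresql.Driver")
    .load())
```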

Azure Data Explorer

Supports Azure Data Explorer (aka Kusto) through the Azure Data Explorer Connector for Apache Spark.

  • Only supports the kustoTable option (see the sketch below).
  • If you use the kustoQuery option, the lineage is captured as a Purview Generic Connector entity named COMPLEX, because arbitrary Kusto queries cannot be parsed at this time.
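
A sketch of a table-based read with the Azure Data Explorer Connector for Apache Spark; the cluster, database, table, and authentication values are hypothetical, and the exact option names may vary by connector version:

```python
df = (spark.read
    .format("com.microsoft.kusto.spark.datasource")
    .option("kustoCluster", "https://mycluster.westus2.kusto.windows.net")
    .option("kustoDatabase", "mydb")
    .option("kustoTable", "MyTable")  # supported for lineage
    # .option("kustoQuery", "MyTable | take 10")  # would yield a COMPLEX entity
    .option("kustoAadAppId", dbutils.secrets.get("scope", "kusto-app-id"))
    .option("kustoAadAppSecret", dbutils.secrets.get("scope", "kusto-app-secret"))
    .option("kustoAadAuthorityID", "<tenant-id>")
    .load())
```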

Azure Data Factory

Supports capturing lineage for Databricks Notebook activities in Azure Data Factory (ADF). After running a notebook through ADF on an interactive or job cluster, you will see a Databricks Job asset in Microsoft Purview with a name similar to ADF_<factory name>_<pipeline name>. For each Databricks notebook activity, you will also see a Databricks Task with a name similar to ADF_<factory name>_<pipeline name>_<activity name>.

  • At this time, the Microsoft Purview view of Azure Data Factory lineage will not contain these tasks unless the Databricks Task uses or feeds a data source to a Data Flow or Copy activity.
  • Copy activities may not show lineage connecting to these Databricks tasks, since a copy activity emits individual file assets rather than folder or resource set assets.

Other Data Sources and Limitations

Lineage for Unsupported Data Sources

The OpenLineage project supports emitting lineage for other data sources, such as HDFS, S3, GCP, BigQuery, Apache Iceberg, and more. However, this connector does not translate data sources beyond those listed above.

Instead, any unknown data type will land in Microsoft Purview as a "dummy" type.

We welcome contributions to help map those types to official Purview types as they become available. Alternatively, in your implementation, you may choose to extend this solution to map to your own custom types.

Case Sensitivity

Microsoft Purview's Fully Qualified Names are case sensitive. Spark jobs may reference data source connections with casing that differs from the data source itself (e.g. dbo.InputTable might be the physical table's name in the SQL database, while a Spark query references it as dbo.iNpUtTaBlE).

As a result, this solution attempts to find the best matching existing asset. If no existing asset matches based on qualified name, the data source name as found in the Spark query is used to create a dummy asset. A subsequent scan of the data source in Purview, followed by another run of the Spark query with the connector enabled, will resolve the linkage.

Hive Metastore / Delta Table Names

The solution currently does not support emitting the Hive Metastore / Delta table SQL names. For example, if you have a Delta table named default.events whose physical location is abfss://container@storage/path, the solution will report abfss://container@storage/path.

OpenLineage is considering adding this feature with OpenLineage#435.

Spark Streaming

The solution does not currently support Spark Streaming. OpenLineage emits events for Spark Streaming jobs; however, it does not currently support retrieving their input and output data sources.

If you are using foreachBatch with a supported output data source, it is possible to receive lineage, as in the sketch below. However, a lineage event is emitted for each execution of the foreachBatch function. This may increase Purview API calls per second and result in higher capacity unit charges.
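
A sketch of such a foreachBatch job with a supported Delta sink (paths hypothetical):

```python
# Each micro-batch runs this function as a normal batch write, so each
# batch emits its own lineage event (and its own Purview API calls).
def write_batch(batch_df, epoch_id):
    (batch_df.write
        .format("delta")
        .mode("append")
        .save("abfss://container@mystorageacct.dfs.core.windows.net/out"))

(spark.readStream
    .format("delta")
    .load("abfss://container@mystorageacct.dfs.core.windows.net/in")
    .writeStream
    .foreachBatch(write_batch)
    .option("checkpointLocation",
            "abfss://container@mystorageacct.dfs.core.windows.net/chk")
    .start())
```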

At this time, we recommend not installing the OpenLineage jar on clusters running Spark Structured Streaming jobs.

Spark 2 Support

The solution supports Spark 2 jobs on job clusters. Note that Databricks has removed Spark 2 from its Long Term Support program.

Spark 3.3+ Support

The solution supports Spark 3.0, 3.1, 3.2, and 3.3 on interactive and job clusters. The solution has been tested on Databricks Runtime 11.3 LTS.

Private Endpoints on Microsoft Purview

Currently, the solution does not support pushing lineage to a Microsoft Purview service behind a private endpoint. The solution may be customized so that the deployed Azure Function can connect to Microsoft Purview privately. Consider reviewing the documentation on how to Connect privately and securely to your Microsoft Purview account.

Column Level Mapping Supported Sources

Starting with OpenLineage 0.18.0 and release 2.3.0 of the solution accelerator, the connector supports emitting column-level mappings from the following sources and their combinations:

  • Read / Write to ABFSS file paths (mount or explicit path abfss://)
  • Read / Write to WASBS file paths (mount or explicit path wasbs://)
  • Read / Write to the default metastore in Azure Databricks
    • Does NOT support custom hive metastores

Column Mapping Support for Delta Format

  • Delta Merge statements are not supported at this time
  • Delta to Delta is NOT supported at this time