fix: resolve path-based lineage for Databricks external tables (#27561) #27648

Open
ShivamChavan01 wants to merge 2 commits into open-metadata:main from
ShivamChavan01:fix/databricks-external-table-path-lineage-27561

@ShivamChavan01

Describe your changes:

Fixes #27561

External tables in Databricks are referenced using cloud storage paths (e.g. delta.`abfss://...`) instead of table names. In this case, Databricks system tables populate source_path/target_path and leave source_table_full_name/target_table_full_name as null. The lineage processor was filtering out these rows entirely, resulting in missing lineage for all external tables.
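
To illustrate the effect of the filter, here is a minimal sketch that mimics the old and relaxed conditions in plain Python rather than SQL. The row shape, the sample values, and the helper names `strict_filter` and `relaxed_filter` are assumptions made for illustration; they are not taken from the PR's code, and the exact old WHERE clause is assumed to require both name columns.

```python
from typing import List, Optional, TypedDict


class LineageRow(TypedDict):
    # Column names follow the PR description; values below are made up.
    source_table_full_name: Optional[str]
    target_table_full_name: Optional[str]
    source_path: Optional[str]
    target_path: Optional[str]


def strict_filter(row: LineageRow) -> bool:
    """Assumed old behaviour: both table names must be non-null."""
    return (
        row["source_table_full_name"] is not None
        and row["target_table_full_name"] is not None
    )


def relaxed_filter(row: LineageRow) -> bool:
    """Relaxed behaviour: each side needs either a name or a path."""
    has_source = (
        row["source_table_full_name"] is not None or row["source_path"] is not None
    )
    has_target = (
        row["target_table_full_name"] is not None or row["target_path"] is not None
    )
    return has_source and has_target


rows: List[LineageRow] = [
    {  # managed table: names are populated
        "source_table_full_name": "cat.schema.orders",
        "target_table_full_name": "cat.schema.orders_agg",
        "source_path": None,
        "target_path": None,
    },
    {  # external table read by path: source name is null, path is populated
        "source_table_full_name": None,
        "target_table_full_name": "cat.schema.sales_agg",
        "source_path": "abfss://container@account.dfs.core.windows.net/sales",
        "target_path": None,
    },
]

strict_kept = [r for r in rows if strict_filter(r)]
relaxed_kept = [r for r in rows if relaxed_filter(r)]
```

Under these assumptions the strict filter keeps only the managed-table row, while the relaxed filter also keeps the path-based row so it can be resolved later.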

Changes:

  • databricks/queries.py + unitycatalog/queries.py: Added source_path and target_path to SELECT; relaxed WHERE filter from hard IS NOT NULL on name columns to (name IS NOT NULL OR path IS NOT NULL)
  • databricks/client.py: Pass source_path and target_path through the lineage cache dict
  • unitycatalog/lineage.py: Build a reverse path → table_fqn map from the external locations cache; fall back to path resolution when full_name is null; ensure _cache_external_locations() runs before _cache_lineage() so the reverse map is available
  • test_unity_catalog_lineage.py: Updated mock row definitions to include path fields; added tests for path resolution, unresolvable path skipping, and reverse map construction
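
The reverse map and fallback described for unitycatalog/lineage.py can be sketched roughly as follows. This is a simplified illustration, not the PR's implementation: the shape of the external locations cache (table FQN to storage path), the trailing-slash normalization, and the function names `build_path_to_fqn` and `resolve_table` are all assumptions.

```python
from typing import Dict, Optional


def normalize_path(path: str) -> str:
    """Strip trailing slashes so equivalent paths compare equal (assumed rule)."""
    return path.rstrip("/")


def build_path_to_fqn(external_locations: Dict[str, str]) -> Dict[str, str]:
    """Invert an assumed {table_fqn: storage_path} cache into {path: table_fqn}."""
    return {
        normalize_path(path): fqn
        for fqn, path in external_locations.items()
        if path
    }


def resolve_table(
    full_name: Optional[str],
    path: Optional[str],
    path_to_fqn: Dict[str, str],
) -> Optional[str]:
    """Prefer the table name; fall back to the path; return None if unresolvable."""
    if full_name:
        return full_name
    if path:
        return path_to_fqn.get(normalize_path(path))
    return None


# Hypothetical cache contents, for illustration only.
external_locations = {
    "cat.schema.sales": "abfss://container@account.dfs.core.windows.net/sales/",
}
path_to_fqn = build_path_to_fqn(external_locations)

resolved = resolve_table(
    None, "abfss://container@account.dfs.core.windows.net/sales", path_to_fqn
)
unresolved = resolve_table(None, "abfss://other@account.dfs.core.windows.net/x", path_to_fqn)
```

Because the map is built from the external locations cache, it has to be populated before lineage rows are processed, which matches the ordering change noted above. Rows whose path is not in the map resolve to None and can be skipped.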

Type of change:

  • Bug fix

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes #27561: resolve path-based lineage for Databricks external tables
  • I have commented on my code, particularly in hard-to-understand areas.
  • I have added a test that covers the exact scenario we are fixing.

@ShivamChavan01 ShivamChavan01 requested a review from a team as a code owner April 23, 2026 02:40
@github-actions
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Comment thread ingestion/src/metadata/ingestion/source/database/databricks/queries.py Outdated
…-FQN resolution

Reverts the path-based fallback in DATABRICKS_GET_TABLE_LINEAGE and
DATABRICKS_GET_COLUMN_LINEAGE queries since DatabricksClient lacks
the external_path_to_fqn map needed to resolve paths to FQNs.

Without this map, relaxing the IS NOT NULL constraints creates dict keys
containing None values that never match downstream lookups.
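
A small sketch of why such keys are dead weight, assuming the cache is a dict keyed by a (source, target) tuple of table names. The key shape and the sample names are assumptions for illustration, not the actual DatabricksClient structure.

```python
from typing import Dict, List, Optional, Tuple

TableKey = Tuple[Optional[str], Optional[str]]

# Hypothetical rows as (source_table_full_name, target_table_full_name).
rows: List[TableKey] = [
    ("cat.schema.src", "cat.schema.tgt"),
    (None, "cat.schema.ext_tgt"),  # external source: name is null
]

cache: Dict[TableKey, bool] = {}
for source, target in rows:
    cache[(source, target)] = True

# Downstream lookups always use real table names, so the None key never matches.
managed_hit = ("cat.schema.src", "cat.schema.tgt") in cache
external_hit = ("cat.schema.ext_src", "cat.schema.ext_tgt") in cache
phantom_keys = [key for key in cache if None in key]
```

In this sketch the external row is stored but unreachable: no lookup built from a real table name can produce a key containing None.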

@gitar-bot

gitar-bot Bot commented Apr 23, 2026

Code Review ✅ Approved 1 resolved / 1 findings

Resolves path-based lineage for Databricks external tables by enabling path fallback during column lineage caching. No issues found.

✅ 1 resolved
Bug: DatabricksClient column lineage caching ignores path fallback

📄 ingestion/src/metadata/ingestion/source/database/databricks/client.py:370-379
📄 ingestion/src/metadata/ingestion/source/database/databricks/queries.py:107-121
📄 ingestion/src/metadata/ingestion/source/database/databricks/client.py:348-355
📄 ingestion/src/metadata/ingestion/source/database/databricks/queries.py:90-104
The DATABRICKS_GET_COLUMN_LINEAGE query was relaxed to allow rows where source_table_full_name or target_table_full_name is NULL (as long as the corresponding path is not null). However, the cache_lineage() method in client.py (lines 370-379) still directly uses row.source_table_full_name and row.target_table_full_name without any path-based fallback. This means:

  1. Column lineage rows for external tables will create dict keys containing None (e.g., (None, 'cat.schema.target')), which won't match any downstream lookup.
  2. These phantom entries silently pollute entity_column_lineage and will never produce useful lineage.

The same path-resolution logic added to unitycatalog/lineage.py should be applied here, or the column lineage query's WHERE clause should retain the IS NOT NULL filter on table name columns (as done before this PR) since there's no external_path_to_fqn map available in DatabricksClient.
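
As a rough sketch of the second option expressed on the Python side, a caching loop can skip any row whose table name is missing, which has the same effect as keeping the IS NOT NULL filter in the query. The row type, the cache shape, and the function name `cache_column_lineage` are assumptions for illustration and do not reproduce the real cache_lineage() method.

```python
from collections import defaultdict
from typing import DefaultDict, Iterable, List, NamedTuple, Optional, Tuple


class ColumnLineageRow(NamedTuple):
    # Hypothetical row shape; only the table-name fields come from the review text.
    source_table_full_name: Optional[str]
    source_column_name: str
    target_table_full_name: Optional[str]
    target_column_name: str


def cache_column_lineage(
    rows: Iterable[ColumnLineageRow],
) -> DefaultDict[Tuple[str, str], List[Tuple[str, str]]]:
    """Group column pairs by (source_table, target_table), skipping null names."""
    cache: DefaultDict[Tuple[str, str], List[Tuple[str, str]]] = defaultdict(list)
    for row in rows:
        # Without a path-to-FQN map there is nothing to resolve, so skip the row.
        if not row.source_table_full_name or not row.target_table_full_name:
            continue
        key = (row.source_table_full_name, row.target_table_full_name)
        cache[key].append((row.source_column_name, row.target_column_name))
    return cache


sample_rows = [
    ColumnLineageRow("cat.schema.src", "id", "cat.schema.tgt", "id"),
    ColumnLineageRow(None, "amount", "cat.schema.tgt", "amount"),
]
column_cache = cache_column_lineage(sample_rows)
```

With this guard the cache never contains a key with None in it, at the cost of dropping column lineage for path-based rows until a path resolver is available in that client.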

Development

Successfully merging this pull request may close these issues.

[BUG] Lineage Databricks is not performed for external tables using path-based queries.
