feat(clickhouse): support cross-database and dictionary lineage (#26095) #27551

Open
mohitjeswani01 wants to merge 3 commits into open-metadata:main from mohitjeswani01:feat/26095-clickhouse-lineage
@mohitjeswani01

Description:

Fixes #26095

What changes did you make?

  1. Dictionary Lineage:
    • Patched metadata.py and utils.py to ingest ClickHouse Dictionary engines as TableType.External.
    • Added CLICKHOUSE_DICTIONARY_LINEAGE to query system.dictionaries, and a regex parser (_parse_clickhouse_dict_source) to extract the upstream database and table/view from the SOURCE() clause.
    • Yields Source.ViewLineage edges.
  2. Cross-Database Lineage:
    • Implemented yield_cross_database_lineage() in ClickhouseLineageSource, following the established Trino pattern, to resolve FQNs from crossDatabaseServiceNames.
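
For readers who have not opened the diff, here is a minimal sketch of what parsing a dictionary SOURCE() clause can look like. It assumes the DDL-style form `SOURCE(CLICKHOUSE(... DB '...' TABLE '...'))`. The function name `parse_dict_source_sketch` and both patterns are hypothetical and are not the PR's `_parse_clickhouse_dict_source`.

```python
import re
from typing import Optional, Tuple

# Hypothetical patterns for illustration; the PR's actual regex may differ.
_CLICKHOUSE_SOURCE = re.compile(
    r"SOURCE\s*\(\s*CLICKHOUSE\s*\((?P<args>.*)\)\s*\)",
    re.IGNORECASE | re.DOTALL,
)
# Matches `DB 'name'`, `TABLE "name"` or unquoted values; quotes must pair up.
_KEY_VALUE = re.compile(
    r"\b(?P<key>DB|TABLE)\s+(?P<quote>['\"]?)(?P<value>[\w.]+)(?P=quote)",
    re.IGNORECASE,
)


def parse_dict_source_sketch(
    source: Optional[str],
) -> Optional[Tuple[Optional[str], str]]:
    """Return (database, table) for a ClickHouse-backed dictionary, else None."""
    if not source:
        return None
    match = _CLICKHOUSE_SOURCE.search(source)
    if match is None:
        # Non-ClickHouse sources (POSTGRESQL, MYSQL, ...) or malformed text.
        return None
    found = {
        m.group("key").upper(): m.group("value")
        for m in _KEY_VALUE.finditer(match.group("args"))
    }
    table = found.get("TABLE")
    if table is None:
        return None
    # DB is optional in the clause, so the database may be None.
    return found.get("DB"), table
```

Returning None for anything that is not a ClickHouse source keeps the caller simple: it can skip the dictionary instead of emitting a lineage edge to a table that cannot be resolved.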

Why did you make them?
To resolve the missing cross-database lineage blocker (#26095) and to fulfill the explicit hackathon request from @agusosimani to support upstream lineage for ClickHouse dictionaries (e.g., correctly mapping geo_location_dict to its source view geo_locations).

How did you test your changes?

  • Wrote 37 parameterized unit tests in test_clickhouse_lineage.py achieving 100% pass rate.
  • Verified the regex aggressively strips single/double quotes to prevent lookup failures, gracefully ignores non-ClickHouse sources (Postgres, MySQL, etc.), and handles malformed strings.
  • Attached screenshots of the local test suite execution below.

Screenshots of passing test suite: (two images)

Type of change:

  • Bug fix
  • Improvement
  • New feature
  • Breaking change
  • Documentation

Checklist:

  • I have read the CONTRIBUTING document.

  • My PR title is Fixes <issue-number>: <short explanation>

  • I have commented on my code, particularly in hard-to-understand areas.

  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.

  • I have added tests around the new logic.

  • The issue properly describes why the new feature is needed, what's the goal, and how we are building it. Any discussion
    or decision-making process is reflected in the issue.

Copilot AI review requested due to automatic review settings April 20, 2026 14:46
@mohitjeswani01 mohitjeswani01 requested a review from a team as a code owner April 20, 2026 14:46
@github-actions
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Comment thread ingestion/src/metadata/ingestion/source/database/clickhouse/lineage.py Outdated
Copilot AI left a comment

Pull request overview

Adds missing ClickHouse lineage capabilities to address #26095 by introducing cross-database lineage (Trino-style matching) and dictionary-based lineage derived from system.dictionaries.

Changes:

  • Added ClickHouse dictionary discovery (ingested as TableType.External) and a new system.dictionaries query for lineage extraction.
  • Implemented ClickHouse cross-database lineage resolution using crossDatabaseServiceNames.
  • Added unit tests for dictionary source-string parsing.
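
As a rough illustration of the cross-database resolution summarized above, the sketch below tries each configured service name in order and returns the first fully qualified name that exists. All of it is assumed for illustration: `resolve_across_services` and the `exists` callback are hypothetical stand-ins for the metadata lookup, and the FQN is built by plain string joining without the quoting a real FQN builder applies to names containing dots.

```python
from typing import Callable, Iterable, Optional


def resolve_across_services(
    service_names: Iterable[str],
    database: str,
    schema: str,
    table: str,
    exists: Callable[[str], bool],
) -> Optional[str]:
    """Return the first service.database.schema.table FQN that `exists` accepts."""
    for service in service_names:
        # Simplified FQN; real names with dots would need quoting.
        fqn = f"{service}.{database}.{schema}.{table}"
        if exists(fqn):
            return fqn
    return None


# Usage with an in-memory set standing in for the metadata server.
known = {"clickhouse_prod.default.geo.geo_locations"}
print(
    resolve_across_services(
        ["clickhouse_dev", "clickhouse_prod"],
        "default",
        "geo",
        "geo_locations",
        known.__contains__,
    )
)
```

Trying services in the configured order makes the result deterministic when the same table name exists under more than one service.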

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| ingestion/src/metadata/ingestion/source/database/clickhouse/lineage.py | Implements cross-database lineage and dictionary lineage extraction, plus parsing helper. |
| ingestion/src/metadata/ingestion/source/database/clickhouse/metadata.py | Registers dictionary engine objects as TableType.External during metadata ingestion. |
| ingestion/src/metadata/ingestion/source/database/clickhouse/utils.py | Adds SQLAlchemy inspector/dialect helpers to list dictionary names. |
| ingestion/src/metadata/ingestion/source/database/clickhouse/queries.py | Refactors query strings formatting and adds CLICKHOUSE_DICTIONARY_LINEAGE. |
| ingestion/src/metadata/ingestion/source/database/clickhouse/usage.py | Minor formatting-only change. |
| ingestion/tests/unit/topology/database/test_clickhouse_lineage.py | Adds unit tests for _parse_clickhouse_dict_source. |

Comment thread ingestion/src/metadata/ingestion/source/database/clickhouse/lineage.py Outdated
Comment thread ingestion/src/metadata/ingestion/source/database/clickhouse/utils.py Outdated
@gitar-bot

gitar-bot bot commented Apr 20, 2026

Code Review ✅ Approved 2 resolved / 2 findings

Adds cross-database and dictionary lineage support for Clickhouse. Resolves issues regarding dictionary table duplication and generator handling in lineage collection.

✅ 2 resolved
Bug: Dictionary tables may be duplicated in regular_tables list

📄 ingestion/src/metadata/ingestion/source/database/clickhouse/metadata.py:142 📄 ingestion/src/metadata/ingestion/source/database/clickhouse/metadata.py:158-161 📄 ingestion/src/metadata/ingestion/source/database/clickhouse/metadata.py:163
In metadata.py, regular_tables is built from self.inspector.get_table_names(schema_name) which typically returns all non-view tables from system.tables in clickhouse-sqlalchemy — including those with engine = 'Dictionary'. The new dictionary_tables list queries system.tables WHERE engine = 'Dictionary' separately, so dictionaries will likely appear in both lists, causing duplicate ingestion attempts.

This could lead to the same entity being processed twice per ingestion run, potentially causing conflicts or wasted API calls.
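
One possible way to avoid the double listing, sketched under the assumption that both lists hold plain table names for the same schema: drop every name that also appears in the dictionary list before treating the rest as regular tables. The helper name `exclude_dictionaries` is hypothetical and is not taken from the PR.

```python
from typing import Iterable, List


def exclude_dictionaries(
    table_names: Iterable[str], dictionary_names: Iterable[str]
) -> List[str]:
    """Keep the original order of table_names, minus any dictionary names."""
    dictionaries = set(dictionary_names)
    return [name for name in table_names if name not in dictionaries]


regular_tables = exclude_dictionaries(
    ["events", "geo_location_dict", "users"],
    ["geo_location_dict"],
)
print(regular_tables)
```

Filtering on the Python side keeps the two queries independent; excluding `engine = 'Dictionary'` in the table query itself would be an alternative.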

Edge Case: yield_dictionary_lineage or [] is a no-op on generators

📄 ingestion/src/metadata/ingestion/source/database/clickhouse/lineage.py:369
On line 369, yield from self.yield_dictionary_lineage() or [] — since yield_dictionary_lineage is a generator function, calling it always returns a generator object (which is truthy). The or [] branch can never execute. This is harmless but misleading to future readers who might think it guards against None.
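
The behaviour described here can be checked in isolation. The sketch below uses a hypothetical generator, `yield_nothing`, that produces no items; the generator object it returns is still truthy, so the `or []` fallback is never selected.

```python
from typing import Iterator


def yield_nothing() -> Iterator[int]:
    """A generator function that yields no items."""
    yield from ()


gen = yield_nothing()
# Generator objects define neither __bool__ nor __len__, so they are truthy.
chosen = gen or []
print(chosen is gen)
```

Because the empty generator already iterates zero times, `yield from self.yield_dictionary_lineage()` on its own has the same effect as the guarded form.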

@mohitjeswani01
Author

mohitjeswani01 commented Apr 20, 2026

@harshach all bot comments have been addressed. Could you please add the "safe to test" label? Thank you! 😊🙏

Development

Successfully merging this pull request may close these issues.

Support Cross Database Lineage for Clickhouse

2 participants