Fixes #26200: Fix BigQuery string bindings on uniqueCount CTE for binary columns #27256

Open
aniruddhaadak80 wants to merge 12 commits into open-metadata:main from aniruddhaadak80:fix-bigquery-unique-count-type

Conversation

aniruddhaadak80 commented Apr 10, 2026

What it does

Fixes a crash in the BigQuery profiler pipeline on BYTES / BINARY columns during the uniqueCount calculation. BigQuery rejects the query with: No matching signature for operator = for argument types: INT64, STRING.

How it does it

The BigQuery branch of the SQLAlchemy metric builds COUNTIF(col == 1) with the column's original type (for example STRING) attached to col. In sqa_profiler_interface.py, however, the labelled metric query runs through a wrapping CTE in which that column holds an INT64 COUNT result. SQLAlchemy therefore binds the literal 1 as a STRING '1', and BigQuery ends up comparing the INT64 count returned by the subquery with a STRING. Using a plain, untyped column(col.name) keeps the original type out of the bind and avoids the BigQuery mismatch error.
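The type propagation described above can be reproduced in isolation. This is a minimal sketch, not the profiler's code: HypotheticalBytesType is an invented stand-in for a custom column type, and it assumes the column's type reaches the comparison the way a SQLAlchemy TypeDecorator does (its default coerce_compared_value returns the decorator itself, so the literal inherits the column's type). Whether OpenMetadata's BYTES mapping behaves exactly this way is not confirmed here.

```python
from sqlalchemy import Integer, String, column, func
from sqlalchemy.types import NullType, TypeDecorator


class HypotheticalBytesType(TypeDecorator):
    """Illustrative stand-in for a custom column type; not an OpenMetadata class."""

    impl = String
    cache_ok = True


# Typed reference: the literal 1 is bound with the column's own type,
# because TypeDecorator.coerce_compared_value returns the decorator by default.
typed_cmp = column("col_count", HypotheticalBytesType()) == 1

# Untyped reference: the column is NullType, so the bind type is inferred
# from the Python value 1, which gives Integer.
untyped_cmp = column("col_count") == 1

typed_bind_is_column_type = isinstance(typed_cmp.right.type, HypotheticalBytesType)
untyped_bind_is_integer = isinstance(untyped_cmp.right.type, Integer)
untyped_col_is_nulltype = isinstance(column("col_count").type, NullType)

# The shape used by the fix: COUNTIF over the untyped comparison.
countif_expr = func.countif(untyped_cmp).label("uniqueCount")
```

With the untyped column(), the bind type is derived from the Python value rather than from the column, which is what keeps a STRING or BYTES type out of the COUNTIF comparison.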

Fixes #26200

Copilot AI review requested due to automatic review settings April 10, 2026 17:15
@aniruddhaadak80 aniruddhaadak80 requested a review from a team as a code owner April 10, 2026 17:15
@github-actions
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Copilot AI left a comment

Pull request overview

This PR aims to fix the BigQuery profiler’s uniqueCount calculation on BYTES/BINARY columns by avoiding an incorrect STRING-typed bind in the COUNTIF(... = 1) comparison when the metric is executed via the “Label + wrapping subquery” path.

Changes:

  • Adjust BigQuery UniqueCount SQLAlchemy expression to use an untyped column reference for the COUNTIF(col == 1) comparison.
  • Update CollectionDAO (UserDAO) list queries to pass filter.getQueryParams() into the underlying JDBI queries.
  • Add tag-pruning logic in TableRepository.addDataModel(...) for table and column tags.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| ingestion/src/metadata/profiler/metrics/static/unique_count.py | Uses an untyped column(self.col.name) for BigQuery COUNTIF to prevent STRING binding when comparing against 1. |
| openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/CollectionDAO.java | Adds filter.getQueryParams() binding to UserDAO list methods and propagates it into the DAO query signatures. |
| openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/TableRepository.java | Prunes AUTOMATED tags not present in incoming DataModel tags for tables and columns before persisting/applying tags. |

Comment on lines +1410 to +1415
if (table.getTags() != null) {
  java.util.List<String> incomingTags = dataModel.getTags() != null
      ? dataModel.getTags().stream().map(org.openmetadata.schema.type.TagLabel::getTagFQN).collect(java.util.stream.Collectors.toList())
      : java.util.Collections.emptyList();
  mergedTableTags.removeIf(t -> t.getLabelType() == org.openmetadata.schema.type.TagLabel.LabelType.AUTOMATED && !incomingTags.contains(t.getTagFQN()));
}
Copilot AI Apr 10, 2026

Current logic will remove all existing AUTOMATED table tags whenever table.getTags() is non-null and dataModel.getTags() is null/omitted, because incomingTags becomes empty and the removeIf predicate matches all automated tags. If the caller omits tags (vs explicitly sending an empty list), this is an unintended behavior change. Consider only pruning automated tags when dataModel.getTags() is explicitly provided (non-null), or distinguish between null and empty to preserve existing tags when tags aren't part of the update payload.
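The null-versus-empty distinction suggested here can be sketched in a few lines. This is a simplified Python model under stated assumptions, not the TableRepository code: Tag and prune_stale_automated are hypothetical names, None stands in for an omitted tags field, and the merge step that precedes pruning in the PR is left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Tag:
    fqn: str
    label_type: str  # "AUTOMATED" or "MANUAL"


def prune_stale_automated(existing: List[Tag], incoming: Optional[List[Tag]]) -> List[Tag]:
    """Drop AUTOMATED tags missing from incoming, but only if incoming was provided."""
    if incoming is None:
        # Tags were omitted from the payload: leave existing tags untouched.
        return list(existing)
    incoming_fqns = {tag.fqn for tag in incoming}
    return [
        tag
        for tag in existing
        if tag.label_type != "AUTOMATED" or tag.fqn in incoming_fqns
    ]


existing_tags = [Tag("PII.Sensitive", "AUTOMATED"), Tag("Tier.Tier1", "MANUAL")]
kept_when_omitted = prune_stale_automated(existing_tags, None)
kept_when_empty = prune_stale_automated(existing_tags, [])
```

Here an omitted tags field keeps both tags, while an explicit empty list removes only the AUTOMATED one.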

Comment on lines +1433 to +1438
if (stored.getTags() != null) {
  java.util.List<String> incomingColTags = modelColumn.getTags() != null
      ? modelColumn.getTags().stream().map(org.openmetadata.schema.type.TagLabel::getTagFQN).collect(java.util.stream.Collectors.toList())
      : java.util.Collections.emptyList();
  mergedColumnTags.removeIf(t -> t.getLabelType() == org.openmetadata.schema.type.TagLabel.LabelType.AUTOMATED && !incomingColTags.contains(t.getTagFQN()));
}
Copilot AI Apr 10, 2026

Similar to table tags: if modelColumn.getTags() is null/omitted but stored.getTags() is non-null, incomingColTags becomes empty and this removes all existing AUTOMATED column tags. If tags are not being updated for the column, this likely wipes automated tags unintentionally. Consider gating the prune on modelColumn.getTags() != null (or otherwise distinguishing null vs empty).

Comment on lines 6359 to 6363
      return EntityDAO.super.listCount(filter);
    }
    return listCount(
-       getTableName(), mySqlCondition, postgresCondition, team, Relationship.HAS.ordinal());
+       getTableName(), mySqlCondition, postgresCondition, team, Relationship.HAS.ordinal(), filter.getQueryParams());
  }
Copilot AI Apr 10, 2026

filter.getQueryParams() likely contains a team key (and potentially other keys) that overlaps with separately bound parameters (e.g., @BindFQN("team") String team). Passing it via @BindMap risks overriding the intended binding for :team in the query, which could break filtering (te.nameHash = :team). Consider passing a copy of queryParams with overlapping keys removed (e.g., remove "team" and any other explicitly-bound names) before binding the map.
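The collision risk can be modelled with plain dictionaries. This is a rough sketch, assuming the worst case in which map entries are applied after the explicit binding and win on a name clash; it does not reproduce JDBI's actual binding order, and strip_explicit_binds and effective_binds are hypothetical helpers.

```python
from typing import Dict, Iterable


def strip_explicit_binds(query_params: Dict[str, str], explicit_names: Iterable[str]) -> Dict[str, str]:
    """Return a copy of query_params without keys that are bound separately."""
    reserved = set(explicit_names)
    return {key: value for key, value in query_params.items() if key not in reserved}


def effective_binds(explicit: Dict[str, str], extra: Dict[str, str]) -> Dict[str, str]:
    """Model the assumed worst case: entries from the map are applied last."""
    merged = dict(explicit)
    merged.update(extra)
    return merged


explicit = {"team": "hash-of-team-fqn"}  # stands in for @BindFQN("team")
query_params = {"team": "Engineering", "isBot": "false"}

clobbered = effective_binds(explicit, query_params)
safe = effective_binds(explicit, strip_explicit_binds(query_params, explicit))
```

In this model the unfiltered map replaces the hashed team value with the raw name, while the filtered copy leaves the explicit binding intact and still carries the remaining parameters.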

Comment on lines +1407 to +1412

List<TagLabel> mergedTableTags =
mergeTagsWithIncomingPrecedence(table.getTags(), dataModel.getTags());
if (table.getTags() != null) {
java.util.List<String> incomingTags = dataModel.getTags() != null
? dataModel.getTags().stream().map(org.openmetadata.schema.type.TagLabel::getTagFQN).collect(java.util.stream.Collectors.toList())
Copilot AI Apr 10, 2026

The PR description focuses on fixing BigQuery uniqueCount binding, but this file also introduces tag pruning behavior changes for tables/columns. If these changes are intentional, the PR description should cover them; otherwise consider splitting into a separate PR to keep scope and review risk contained.

Comment on lines 70 to +74
  if session.get_bind().dialect.name == Dialects.BigQuery:
-     return func.countif(col == 1).label(self.name())
+     # We are querying against the subquery output (which is a COUNT), so the type is numeric.
+     # Use an untyped column to avoid passing the original metric type (like STRING or BYTES) into the COUNTIF comparison.
+     count_col = column(self.col.name)
+     return func.countif(count_col == 1).label(self.name())
Copilot AI Apr 10, 2026

Please add a regression test for the BigQuery path to ensure the generated SQL compares the COUNT subquery output as a numeric (e.g., no :STRING binding for the literal 1 when the original column type is STRING/BYTES). There are existing unit tests for UniqueCount, but they don’t appear to cover the BigQuery Label hotfix flow in SQAProfilerInterface.

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comment on lines +6436 to +6445
      beforeId,
-     Relationship.HAS.ordinal());
+     Relationship.HAS.ordinal(),
+     filter.getQueryParams());
Copilot AI Apr 11, 2026

Same binding-collision issue as listCount: listBefore now passes filter.getQueryParams() into @BindMap params while also binding :team via @BindFQN("team"). If queryParams contains "team", it can override the hashed :team value expected by te.nameHash = :team, causing the filter to stop matching. Remove colliding keys from the map (e.g., "team") before binding, or bind the extra params with a prefix.

Comment on lines 6527 to 6529
filter.getQueryParams());
}

Copilot AI Apr 11, 2026

Same binding-collision issue as listCount/listBefore: listAfter passes filter.getQueryParams() into @BindMap params while also binding :team via @BindFQN("team"). If queryParams contains "team", it can override the hashed value used by te.nameHash = :team. Consider removing colliding keys from the map or binding the extra params with a prefix.

Suggested change
-         filter.getQueryParams());
-   }
+         getListAfterQueryParams(filter));
+   }
+   private Map<String, String> getListAfterQueryParams(ListFilter filter) {
+     Map<String, String> queryParams = new HashMap<>(filter.getQueryParams());
+     queryParams.remove("team");
+     return queryParams;
+   }

@aniruddhaadak80
Author

Hello! I am participating in the WeMakeDevs hackathon. Could a maintainer please assign the safe to test label so the GitHub Actions workflows can validate my fixes? Thank you!

@aniruddhaadak80
Author

aniruddhaadak80 commented Apr 13, 2026

Could someone help trigger the CI by adding the safe to test label here? Much appreciated.

Copilot AI review requested due to automatic review settings April 13, 2026 14:08

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comment on lines +1411 to +1414
java.util.List<String> incomingTags = dataModel.getTags() != null
? dataModel.getTags().stream().map(org.openmetadata.schema.type.TagLabel::getTagFQN).collect(java.util.stream.Collectors.toList())
: java.util.Collections.emptyList();
mergedTableTags.removeIf(t -> t.getLabelType() == org.openmetadata.schema.type.TagLabel.LabelType.AUTOMATED && !incomingTags.contains(t.getTagFQN()));
Copilot AI Apr 13, 2026

These newly added lines are not formatted to the repository's standard (Spotless/google-java-format) and rely on fully-qualified names inside the method body, making the code harder to read/maintain. Please apply the standard formatter and use existing imports (e.g., TagLabel::getTagFQN, Collectors/toList) to keep the style consistent and avoid CI formatting failures.

Comment on lines +6521 to +6527
      afterId,
-     Relationship.HAS.ordinal());
+     Relationship.HAS.ordinal(),
+     filter.getQueryParams());
Copilot AI Apr 13, 2026

Same binding-collision risk as listCount/listBefore: passing filter.getQueryParams() via @BindMap can re-bind :team and override the @BindFQN("team") hashed value (or cause duplicate binding). Please pass a cleaned params map with conflicting keys removed (at minimum team).

@aniruddhaadak80
Author

Absolutely! I just refactored TableRepository.java on this branch: the stale-tag removal logic now lives in a small removeStaleAutomatedTags helper method, the fully qualified class names are replaced with standard top-of-file imports, and the lookup uses Collectors.toSet() instead of List.contains for efficiency. Thanks for pointing that out!

@harshach
Collaborator

@aniruddhaadak80 this shouldn't be touching TableRepository.java. I'm not sure I follow the changes you are making here.

Copilot AI review requested due to automatic review settings April 13, 2026 17:12

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 2 changed files in this pull request and generated 1 comment.

@@ -68,7 +68,10 @@ def query(self, sample: Optional[type], session: Optional[Session] = None):

# TODO: Move all connectors from subquery to COUNT(IF) or COUNTIF for peformance
Copilot AI Apr 13, 2026

Typo in TODO comment: “peformance” → “performance”.

Suggested change
- # TODO: Move all connectors from subquery to COUNT(IF) or COUNTIF for peformance
+ # TODO: Move all connectors from subquery to COUNT(IF) or COUNTIF for performance

Copilot AI review requested due to automatic review settings April 13, 2026 17:24

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comment on lines +9 to +17
def test_bigquery_unique_count():
    # Mocking session binding
    session_mock = Mock()
    session_mock.get_bind().dialect.name = Dialects.BigQuery

    unique_count_metric = UniqueCount(Column("test_col"))
    result = unique_count_metric.fn(session_mock)

    assert "countif" in str(result).lower()
Copilot AI Apr 13, 2026

The test only asserts that COUNTIF appears in the rendered SQL, but it doesn’t verify the regression being fixed (i.e., that the comparison is numeric and not bound/typed as a string) and it doesn’t cover the problematic BYTES/BINARY column scenario described in the PR. Consider constructing the metric with a binary column type (e.g., LargeBinary/BINARY) and asserting against the compiled expression (BigQuery dialect) that the = 1 comparison is treated as numeric (e.g., literal 1 or an integer-typed bindparam), so this test fails under the previous buggy behavior.
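One way to make such a check concrete is to inspect the expression tree rather than the rendered string. The sketch below is an assumption-laden illustration, not the project's test: build_bigquery_unique_count is a hypothetical helper that rebuilds the expression from the diff instead of calling the UniqueCount metric, and comparison_is_numeric is likewise invented for this example.

```python
from sqlalchemy import column, func
from sqlalchemy.sql import visitors
from sqlalchemy.sql.elements import BindParameter, ColumnClause
from sqlalchemy.types import Integer, NullType


def build_bigquery_unique_count(col_name: str, label: str):
    """Rebuild the expression from the diff (hypothetical helper, not the metric class)."""
    count_col = column(col_name)  # untyped on purpose
    return func.countif(count_col == 1).label(label)


def comparison_is_numeric(expr) -> bool:
    """True when every column is untyped and every bound literal is Integer."""
    elements = list(visitors.iterate(expr, {}))
    columns = [e for e in elements if isinstance(e, ColumnClause)]
    binds = [e for e in elements if isinstance(e, BindParameter)]
    return (
        bool(columns)
        and bool(binds)
        and all(isinstance(c.type, NullType) for c in columns)
        and all(isinstance(b.type, Integer) for b in binds)
    )


def test_bigquery_unique_count_compares_numerically():
    expr = build_bigquery_unique_count("binary_col", "uniqueCount")
    assert comparison_is_numeric(expr)
```

Unlike a substring match on countif, this check returns False for an expression built from a typed column, so it can distinguish the two variants.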

Comment thread ingestion/tests/unit/profiler/metrics/test_unique_count.py Outdated
Comment thread ingestion/tests/unit/profiler/metrics/test_unique_count.py
…perly validate untyped column typing for BigQuery

@gitar-bot

gitar-bot bot commented Apr 13, 2026

Code Review ✅ Approved 3 resolved / 3 findings

Fixes BigQuery string bindings on uniqueCount CTE for binary columns. The findings about fully qualified class names in TableRepository and the incorrect test method call have all been addressed.

✅ 3 resolved
Quality: Fully qualified class names instead of imports in TableRepository

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/TableRepository.java:1411-1414 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/TableRepository.java:1434-1437 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/TableRepository.java:1410-1415 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/TableRepository.java:1433-1438
The new code at lines 1411-1414 and 1434-1437 uses fully qualified class names (java.util.List, java.util.stream.Collectors, java.util.Collections, org.openmetadata.schema.type.TagLabel) instead of using imports at the top of the file. Most of these classes are likely already imported. This hurts readability and is inconsistent with the rest of the codebase.

Bug: Test calls non-existent fn() instead of query()

📄 ingestion/tests/unit/profiler/metrics/test_unique_count.py:15
The test calls unique_count_metric.fn(session_mock) but UniqueCount extends QueryMetric, which defines query() not fn(). The fn() method is only on StaticMetric. This test will raise an AttributeError at runtime.

Additionally, query() requires a sample parameter (first positional arg after self), so the correct call should pass both sample and session.

Quality: Test only checks string output, not the untyped column fix

📄 ingestion/tests/unit/profiler/metrics/test_unique_count.py:17
The test asserts "countif" in str(result).lower() which would pass even with the old buggy code (which also used countif). Consider asserting that the generated SQL does NOT contain the original column type (e.g., STRING or BYTES), or inspect the clause elements to verify an untyped column is used in the comparison. This would make the test actually validate the fix.

@aniruddhaadak80
Author

All feedback incorporated!

  1. The accidental changes to TableRepository.java and CollectionDAO.java are completely reverted. This PR now exclusively touches the Python BigQuery unique_count.py fix.
  2. Fixed the test: it now correctly calls query(sample=None, session=session_mock) instead of the non-existent fn() method.
  3. Updated the test to explicitly verify that the generated column is untyped (NullType) inside the countif expression to prevent the BigQuery type mismatch error.

Looks like CI is failing on Verify PR labels. Could someone re-add the safe to test label so the workflows can run? Thanks!

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Binary Column Cause Profiler Agent to Fail in BigQury

3 participants