Fixes #26200: Fix BigQuery string bindings on uniqueCount CTE for binary columns#27256
Fixes #26200: Fix BigQuery string bindings on uniqueCount CTE for binary columns#27256aniruddhaadak80 wants to merge 12 commits intoopen-metadata:mainfrom
Conversation
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
There was a problem hiding this comment.
Pull request overview
This PR aims to fix the BigQuery profiler’s uniqueCount calculation on BYTES/BINARY columns by avoiding an incorrect STRING-typed bind in the COUNTIF(... = 1) comparison when the metric is executed via the “Label + wrapping subquery” path.
Changes:
- Adjust BigQuery
UniqueCountSQLAlchemy expression to use an untyped column reference for theCOUNTIF(col == 1)comparison. - Update
CollectionDAO(UserDAO) list queries to passfilter.getQueryParams()into the underlying JDBI queries. - Add tag-pruning logic in
TableRepository.addDataModel(...)for table and column tags.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| ingestion/src/metadata/profiler/metrics/static/unique_count.py | Uses an untyped column(self.col.name) for BigQuery COUNTIF to prevent STRING binding when comparing against 1. |
| openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/CollectionDAO.java | Adds filter.getQueryParams() binding to UserDAO list methods and propagates it into the DAO query signatures. |
| openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/TableRepository.java | Prunes AUTOMATED tags not present in incoming DataModel tags for tables and columns before persisting/applying tags. |
| if (table.getTags() != null) { | ||
| java.util.List<String> incomingTags = dataModel.getTags() != null | ||
| ? dataModel.getTags().stream().map(org.openmetadata.schema.type.TagLabel::getTagFQN).collect(java.util.stream.Collectors.toList()) | ||
| : java.util.Collections.emptyList(); | ||
| mergedTableTags.removeIf(t -> t.getLabelType() == org.openmetadata.schema.type.TagLabel.LabelType.AUTOMATED && !incomingTags.contains(t.getTagFQN())); | ||
| } |
There was a problem hiding this comment.
Current logic will remove all existing AUTOMATED table tags whenever table.getTags() is non-null and dataModel.getTags() is null/omitted, because incomingTags becomes empty and the removeIf predicate matches all automated tags. If the caller omits tags (vs explicitly sending an empty list), this is an unintended behavior change. Consider only pruning automated tags when dataModel.getTags() is explicitly provided (non-null), or distinguish between null and empty to preserve existing tags when tags aren't part of the update payload.
| if (stored.getTags() != null) { | ||
| java.util.List<String> incomingColTags = modelColumn.getTags() != null | ||
| ? modelColumn.getTags().stream().map(org.openmetadata.schema.type.TagLabel::getTagFQN).collect(java.util.stream.Collectors.toList()) | ||
| : java.util.Collections.emptyList(); | ||
| mergedColumnTags.removeIf(t -> t.getLabelType() == org.openmetadata.schema.type.TagLabel.LabelType.AUTOMATED && !incomingColTags.contains(t.getTagFQN())); | ||
| } |
There was a problem hiding this comment.
Similar to table tags: if modelColumn.getTags() is null/omitted but stored.getTags() is non-null, incomingColTags becomes empty and this removes all existing AUTOMATED column tags. If tags are not being updated for the column, this likely wipes automated tags unintentionally. Consider gating the prune on modelColumn.getTags() != null (or otherwise distinguishing null vs empty).
| return EntityDAO.super.listCount(filter); | ||
| } | ||
| return listCount( | ||
| getTableName(), mySqlCondition, postgresCondition, team, Relationship.HAS.ordinal()); | ||
| getTableName(), mySqlCondition, postgresCondition, team, Relationship.HAS.ordinal(), filter.getQueryParams()); | ||
| } |
There was a problem hiding this comment.
filter.getQueryParams() likely contains a team key (and potentially other keys) that overlaps with separately bound parameters (e.g., @BindFQN("team") String team). Passing it via @BindMap risks overriding the intended binding for :team in the query, which could break filtering (te.nameHash = :team). Consider passing a copy of queryParams with overlapping keys removed (e.g., remove "team" and any other explicitly-bound names) before binding the map.
|
|
||
| List<TagLabel> mergedTableTags = | ||
| mergeTagsWithIncomingPrecedence(table.getTags(), dataModel.getTags()); | ||
| if (table.getTags() != null) { | ||
| java.util.List<String> incomingTags = dataModel.getTags() != null | ||
| ? dataModel.getTags().stream().map(org.openmetadata.schema.type.TagLabel::getTagFQN).collect(java.util.stream.Collectors.toList()) |
There was a problem hiding this comment.
The PR description focuses on fixing BigQuery uniqueCount binding, but this file also introduces tag pruning behavior changes for tables/columns. If these changes are intentional, the PR description should cover them; otherwise consider splitting into a separate PR to keep scope and review risk contained.
| if session.get_bind().dialect.name == Dialects.BigQuery: | ||
| return func.countif(col == 1).label(self.name()) | ||
| # We are querying against the subquery output (which is a COUNT), so the type is numeric. | ||
| # Use an untyped column to avoid passing the original metric type (like STRING or BYTES) into the COUNTIF comparison. | ||
| count_col = column(self.col.name) | ||
| return func.countif(count_col == 1).label(self.name()) |
There was a problem hiding this comment.
Please add a regression test for the BigQuery path to ensure the generated SQL compares the COUNT subquery output as a numeric (e.g., no :STRING binding for the literal 1 when the original column type is STRING/BYTES). There are existing unit tests for UniqueCount, but they don’t appear to cover the BigQuery Label hotfix flow in SQAProfilerInterface.
| beforeId, | ||
| Relationship.HAS.ordinal()); | ||
| Relationship.HAS.ordinal(), | ||
| filter.getQueryParams()); |
There was a problem hiding this comment.
Same binding-collision issue as listCount: listBefore now passes filter.getQueryParams() into @BindMap params while also binding :team via @BindFQN("team"). If queryParams contains "team", it can override the hashed :team value expected by te.nameHash = :team, causing the filter to stop matching. Remove colliding keys from the map (e.g., "team") before binding, or bind the extra params with a prefix.
| filter.getQueryParams()); | ||
| } | ||
|
|
There was a problem hiding this comment.
Same binding-collision issue as listCount/listBefore: listAfter passes filter.getQueryParams() into @BindMap params while also binding :team via @BindFQN("team"). If queryParams contains "team", it can override the hashed value used by te.nameHash = :team. Consider removing colliding keys from the map or binding the extra params with a prefix.
| filter.getQueryParams()); | |
| } | |
| getListAfterQueryParams(filter)); | |
| } | |
| private Map<String, String> getListAfterQueryParams(ListFilter filter) { | |
| Map<String, String> queryParams = new HashMap<>(filter.getQueryParams()); | |
| queryParams.remove("team"); | |
| return queryParams; | |
| } |
|
Hello! I am participating in the WeMakeDevs hackathon. Could a maintainer please assign the |
|
Could someone help trigger the CI by adding the |
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
| java.util.List<String> incomingTags = dataModel.getTags() != null | ||
| ? dataModel.getTags().stream().map(org.openmetadata.schema.type.TagLabel::getTagFQN).collect(java.util.stream.Collectors.toList()) | ||
| : java.util.Collections.emptyList(); | ||
| mergedTableTags.removeIf(t -> t.getLabelType() == org.openmetadata.schema.type.TagLabel.LabelType.AUTOMATED && !incomingTags.contains(t.getTagFQN())); |
There was a problem hiding this comment.
These newly added lines are not formatted to the repository's standard (Spotless/google-java-format) and rely on fully-qualified names inside the method body, making the code harder to read/maintain. Please apply the standard formatter and use existing imports (e.g., TagLabel::getTagFQN, Collectors/toList) to keep the style consistent and avoid CI formatting failures.
| afterId, | ||
| Relationship.HAS.ordinal()); | ||
| Relationship.HAS.ordinal(), | ||
| filter.getQueryParams()); |
There was a problem hiding this comment.
Same binding-collision risk as listCount/listBefore: passing filter.getQueryParams() via @BindMap can re-bind :team and override the @BindFQN("team") hashed value (or cause duplicate binding). Please pass a cleaned params map with conflicting keys removed (at minimum team).
…et for fast lookup
|
Absolutely! I just refactored |
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
|
@aniruddhaadak80 this shouldn't be touching TableRepository.java, Not sure I follow the changes you are doing here |
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
| @@ -68,7 +68,10 @@ def query(self, sample: Optional[type], session: Optional[Session] = None): | |||
|
|
|||
| # TODO: Move all connectors from subquery to COUNT(IF) or COUNTIF for peformance | |||
There was a problem hiding this comment.
Typo in TODO comment: “peformance” → “performance”.
| # TODO: Move all connectors from subquery to COUNT(IF) or COUNTIF for peformance | |
| # TODO: Move all connectors from subquery to COUNT(IF) or COUNTIF for performance |
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
1 similar comment
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
| def test_bigquery_unique_count(): | ||
| # Mocking session binding | ||
| session_mock = Mock() | ||
| session_mock.get_bind().dialect.name = Dialects.BigQuery | ||
|
|
||
| unique_count_metric = UniqueCount(Column("test_col")) | ||
| result = unique_count_metric.fn(session_mock) | ||
|
|
||
| assert "countif" in str(result).lower() |
There was a problem hiding this comment.
The test only asserts that COUNTIF appears in the rendered SQL, but it doesn’t verify the regression being fixed (i.e., that the comparison is numeric and not bound/typed as a string) and it doesn’t cover the problematic BYTES/BINARY column scenario described in the PR. Consider constructing the metric with a binary column type (e.g., LargeBinary/BINARY) and asserting against the compiled expression (BigQuery dialect) that the = 1 comparison is treated as numeric (e.g., literal 1 or an integer-typed bindparam), so this test fails under the previous buggy behavior.
…perly validate untyped column typing for BigQuery
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
Code Review ✅ Approved 3 resolved / 3 findingsFixes BigQuery string bindings on uniqueCount CTE for binary columns by using fully qualified class names in TableRepository and correcting test method calls. All findings have been addressed. ✅ 3 resolved✅ Quality: Fully qualified class names instead of imports in TableRepository
✅ Bug: Test calls non-existent
|
| Compact |
|
Was this helpful? React with 👍 / 👎 | Gitar
|
All feedback incorporated!
Looks like CI is failing on |
What it does
Fixes the BigQuery profiler pipeline that crashes on
BYTES/BINARYcolumns duringuniqueCountcalculation due toNo matching signature for operator = for argument types: INT64, STRING.How it does it
SQLAlchemy BigQuery metric runner passes the original metric type (like
STRING) into theCOUNTIF(col == 1)check. However, in thesqa_profiler_interface.pyexecution, BigQuery executes the metric label query via a wrapping CTE wheredataacts as anINT64COUNT output. SQLAlchemy then attempts to compare theINT64count returned by the subquery against a boundSTRING'1'. Using a standard, un-typed genericcolumn(col.name)instead skips the aggressive data type injection and solves the BigQuery mismatch error.Fixes #26200