Add source filter and indexed hash prefix to cert tag batch query #27847

Merged: sonika-shah merged 4 commits into main from fix-cert-batch-tag-usage-source-filter on Apr 30, 2026

@sonika-shah (Collaborator)

Summary

The certification tag batch query (TagUsageDAO.getCertTagsInternalBatch) took ~12 seconds per call on instances with heavy classification hierarchies. It fires ~5,800 times per Data Insights (DI) run, which added up to roughly 19 hours of cumulative DB time per DI run on a customer instance with deep nested containers.

Root cause

Two missing index-friendly predicates in the existing SQL:

  1. No source filter: without a source predicate the planner couldn't use idx_tag_usage_target_exact (source, targetFQNHash, state) INCLUDE (tagFQN, labelType), a covering index whose INCLUDE list already carries tagFQN.
  2. tagFQN LIKE 'Certification.%' on the raw column: there is no LIKE-friendly index on raw tagFQN; only tagfqn_lower (text_pattern_ops) and tagFQNHash are indexed for LIKE patterns. As a result, the LIKE always ran as a post-filter on every row the IN clause returned.

Changes

CollectionDAO.TagUsageDAO.getCertTagsInternalBatch

-- Before
WHERE targetFQNHash IN (<targetFQNHashes>)
  AND tagFQN LIKE :tagFQNPrefix

-- After
WHERE source = :source
  AND targetFQNHash IN (<targetFQNHashes>)
  AND tagFQNHash LIKE :tagFQNHashPrefix

Caller updates (EntityRepository)

Two call sites — getCertification() (single-entity GET) and batchFetchCertification() (bulk LIST). Both updated to:

  • Pass TagLabel.TagSource.CLASSIFICATION.ordinal() as source.
  • Pass FullyQualifiedName.buildHash(certClassification) + ".%" instead of the raw certClassification + ".%".

The hash is computed once per call via the existing FullyQualifiedName.buildHash helper (the same MD5 used by @BindFQN when storing the row), so the LIKE prefix matches the hierarchical hash format actually stored in tag_usage.tagFQNHash.
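To make the prefix construction concrete, here is a minimal, self-contained sketch. It assumes the stored hash is the lowercase MD5 hex of each dot-separated FQN part, joined with "."; that layout is inferred from the "hierarchical hash format" wording above and is not confirmed here. The class and method names (HashPrefixSketch, md5Hex, hierarchicalHash, likePrefix) and the sample tag names are hypothetical stand-ins, not the real FullyQualifiedName.buildHash, which may handle quoting and other edge cases differently.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class HashPrefixSketch {
  // Lowercase hex MD5 of a single FQN part.
  static String md5Hex(String part) {
    try {
      byte[] digest =
          MessageDigest.getInstance("MD5").digest(part.getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder();
      for (byte b : digest) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("MD5 not available", e);
    }
  }

  // ASSUMPTION: hash each dot-separated part, then join the hashes with ".".
  static String hierarchicalHash(String fqn) {
    StringBuilder out = new StringBuilder();
    for (String part : fqn.split("\\.")) {
      if (out.length() > 0) {
        out.append('.');
      }
      out.append(md5Hex(part));
    }
    return out.toString();
  }

  // LIKE pattern matching every tag hash under the given classification.
  static String likePrefix(String classification) {
    return hierarchicalHash(classification) + ".%";
  }

  public static void main(String[] args) {
    String pattern = likePrefix("Certification");
    // Drop the trailing '%' to emulate the LIKE prefix match in plain Java.
    String literal = pattern.substring(0, pattern.length() - 1);
    System.out.println(hierarchicalHash("Certification.Gold").startsWith(literal));
    System.out.println(hierarchicalHash("PII.Sensitive").startsWith(literal));
  }
}
```

Under that assumption, the hash of any tag below a classification begins with the classification's own hash followed by ".", which is why a plain prefix LIKE on the hash column can stand in for the prefix LIKE on the raw FQN.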

Correctness improvement (bonus)

The new source = 0 filter excludes glossary terms (source = 1) that happen to have FQNs starting with "Certification.". Previously such glossary terms could be incorrectly returned as certifications via the unconstrained LIKE; now they're correctly excluded.
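The difference between the old and new predicates can be illustrated with a small in-memory sketch. The Row class, the sample rows, and the placeholder hash strings ("hCert" standing in for the hash of "Certification") are hypothetical; it also assumes a glossary named "Certification" would share the same hash prefix as the classification. Only the source values (0 for classification, 1 for glossary) and the shape of the two WHERE clauses come from the description above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CertFilterSketch {
  static final int CLASSIFICATION = 0; // source value for classification tags
  static final int GLOSSARY = 1; // source value for glossary terms

  // Simplified stand-in for one tag_usage row (hypothetical shape).
  static final class Row {
    final int source;
    final String targetFQNHash;
    final String tagFQN;
    final String tagFQNHash;

    Row(int source, String targetFQNHash, String tagFQN, String tagFQNHash) {
      this.source = source;
      this.targetFQNHash = targetFQNHash;
      this.tagFQN = tagFQN;
      this.tagFQNHash = tagFQNHash;
    }
  }

  // Placeholder data: one cert tag, one glossary term, one non-cert tag.
  static List<Row> sampleRows() {
    return Arrays.asList(
        new Row(CLASSIFICATION, "t1", "Certification.Gold", "hCert.hGold"),
        new Row(GLOSSARY, "t2", "Certification.Audit", "hCert.hAudit"),
        new Row(CLASSIFICATION, "t3", "PII.Sensitive", "hPII.hSensitive"));
  }

  // Old predicate: targetFQNHash IN (...) AND tagFQN LIKE '<prefix>%'
  static List<String> before(List<Row> rows, List<String> targets, String tagFqnPrefix) {
    List<String> out = new ArrayList<>();
    for (Row r : rows) {
      if (targets.contains(r.targetFQNHash) && r.tagFQN.startsWith(tagFqnPrefix)) {
        out.add(r.tagFQN);
      }
    }
    return out;
  }

  // New predicate: source = ? AND targetFQNHash IN (...) AND tagFQNHash LIKE '<hashPrefix>%'
  static List<String> after(List<Row> rows, int source, List<String> targets, String hashPrefix) {
    List<String> out = new ArrayList<>();
    for (Row r : rows) {
      if (r.source == source
          && targets.contains(r.targetFQNHash)
          && r.tagFQNHash.startsWith(hashPrefix)) {
        out.add(r.tagFQN);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<String> targets = Arrays.asList("t1", "t2", "t3");
    System.out.println(before(sampleRows(), targets, "Certification."));
    System.out.println(after(sampleRows(), CLASSIFICATION, targets, "hCert."));
  }
}
```

With this sample data, the old predicate returns both Certification.Gold and the glossary term Certification.Audit, while the new predicate returns only Certification.Gold. In this sketch the hash prefix alone would not exclude the glossary term; the source filter is what does.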

Cross-DB compatibility

All constructs (source = ?, targetFQNHash IN (...), tagFQNHash LIKE 'prefix%', ORDER BY) work identically on MySQL and Postgres. No @ConnectionAwareSqlQuery split needed.

Index usage (verified via EXPLAIN ANALYZE on RDS)

  • idx_tag_usage_target_fqn_hash or idx_tag_usage_target_source for the targetFQNHash IN (...) clause
  • idx_tag_usage_join_source (Postgres) / idx_tag_usage_tag_fqn_hash (MySQL) for the tagFQNHash LIKE 'hash.%' clause
  • The source = 0 filter unlocks the idx_tag_usage_target_exact covering scan when the planner picks it

Tests

  • Existing test_certificationTagNotLeakingIntoTagsField (in TagResourceIT) already covers the happy path — single GET and bulk LIST both populate certification correctly and the cert tag does not leak into tags. Continues to pass with the new SQL.
  • New test_certBatch_bulkFetchReturnsCorrectCertsPerEntity — exercises the bulk fetch path with three schemas:
    • One with a cert tag → must have certification populated correctly
    • One without any cert → must have certification == null (no false positives from the IN list)
    • One with a non-cert tag from a different classification → must have certification == null (regression test for the source filter + hash prefix)

Performance impact

| Metric | Before | After |
| --- | --- | --- |
| Per-call latency | ~12 s | sub-second (index seek) |
| Cumulative during a Data Insights run (~5,800 batch calls) | ~19 hrs DB time | ~minutes DB time |
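
As a quick sanity check, the cumulative "before" figure follows from the per-call latency and call count quoted above (both approximate):

```latex
5{,}800\ \text{calls} \times 12\ \tfrac{\text{s}}{\text{call}}
  = 69{,}600\ \text{s}
  = \frac{69{,}600}{3{,}600}\ \text{h}
  \approx 19.3\ \text{h}
```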

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings April 30, 2026 12:55
@github-actions (bot) added the "backend" and "safe to test" (Add this label to run secure Github workflows on PRs) labels Apr 30, 2026
@sonika-shah added the "To release" (Will cherry-pick this PR into the release branch) label Apr 30, 2026
The certification tag batch query (TagUsageDAO.getCertTagsInternalBatch)
was hitting ~12 seconds per call on instances with deep classification
hierarchies — fired ~5,800 times per Data Insights run, contributing
~19 hrs of cumulative DB time per DI run.

Two missing index-friendly predicates caused the slowness:
1. No `source = ?` filter — couldn't use idx_tag_usage_target_exact
   (source, targetFQNHash, state) INCLUDE (tagFQN, labelType) whose
   covering INCLUDE has tagFQN.
2. `tagFQN LIKE 'Certification.%'` on the raw column — there's no
   LIKE-friendly index on raw tagFQN, only on tagfqn_lower text_pattern_ops
   and tagFQNHash. The LIKE always ran as a post-filter on every row the
   IN clause returned.

Fix:
- Add `source = :source` filter (Certifications are always Classification
  source = 0).
- Switch `tagFQN LIKE :tagFQNPrefix` → `tagFQNHash LIKE :tagFQNHashPrefix`,
  with the hash prefix pre-computed via FullyQualifiedName.buildHash so the
  query hits the indexed hash column.

Same SQL on MySQL and Postgres — no @ConnectionAwareSqlQuery split needed.

Also a correctness improvement: the `source = 0` filter excludes glossary
terms (source = 1) that happen to have FQNs starting with "Certification.".
Previously such glossary terms could be incorrectly returned as
certifications; now they're excluded as expected.

Test:
- Added test_certBatch_bulkFetchReturnsCorrectCertsPerEntity in
  TagResourceIT — exercises the bulk fetch path with three schemas
  (cert-tagged / untagged / non-cert-tagged) and asserts each gets
  the right certification (or null) in the listed response. Locks in
  source-filter correctness and prevents future regressions where a
  non-cert tag could leak into the certification field.
@sonika-shah sonika-shah force-pushed the fix-cert-batch-tag-usage-source-filter branch from edfb21d to e832508 Compare April 30, 2026 12:57
Copilot AI (Contributor) left a comment

Pull request overview

Optimizes certification tag batch fetching by making the getCertTagsInternalBatch query index-friendly (adding a source predicate and filtering by tagFQNHash prefix), and updates repository call sites accordingly to pass the new parameters. Adds an integration test intended to validate correctness of the bulk certification fetch path.

Changes:

  • Update CollectionDAO.TagUsageDAO.getCertTagsInternalBatch SQL to filter by source and tagFQNHash LIKE :prefix.
  • Update EntityRepository call sites to pass TagSource.CLASSIFICATION and a FullyQualifiedName.buildHash(certClassification) + ".%" prefix.
  • Add a new IT covering bulk list behavior for certification population.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java | Updates certification fetches to pass source and hashed prefix into the batch DAO query. |
| openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/CollectionDAO.java | Makes cert batch query use source and tagFQNHash prefix for index usage and correctness. |
| openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/TagResourceIT.java | Adds a new integration test to validate bulk certification fetching behavior. |

@github-actions (Contributor)

The Java checkstyle failed.

Please run mvn spotless:apply in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Java code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

Copilot AI review requested due to automatic review settings April 30, 2026 13:14

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.

gitar-bot (bot) commented Apr 30, 2026

Code Review ✅ Approved

Implements source filtering and indexed hash prefixes in the certification tag batch query. No issues found.

@github-actions (Contributor)

🟡 Playwright Results — all passed (19 flaky)

✅ 3979 passed · ❌ 0 failed · 🟡 19 flaky · ⏭️ 86 skipped

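| Shard | Passed | Failed | Flaky | Skipped |
| --- | --- | --- | --- | --- |
| 🟡 Shard 1 | 298 | 0 | 1 | 4 |
| 🟡 Shard 2 | 746 | 0 | 8 | 8 |
| 🟡 Shard 3 | 742 | 0 | 4 | 7 |
| 🟡 Shard 4 | 774 | 0 | 1 | 18 |
| 🟡 Shard 5 | 685 | 0 | 2 | 41 |
| 🟡 Shard 6 | 734 | 0 | 3 | 8 |

🟡 19 flaky test(s) (passed on retry)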
  • Pages/Bots.spec.ts › Bots Page should work properly (shard 1, 1 retry)
  • Features/ActivityAPI.spec.ts › Activity event is created when description is updated (shard 2, 1 retry)
  • Features/ActivityAPI.spec.ts › Activity event shows the actor who made the change (shard 2, 1 retry)
  • Features/BulkImport.spec.ts › Keyboard Delete selection (shard 2, 1 retry)
  • Features/ColumnBulkOperations.spec.ts › should filter by metadata status and verify API param (shard 2, 1 retry)
  • Features/ColumnBulkOperations.spec.ts › should not reset stats to zero while search request is loading (shard 2, 1 retry)
  • Features/DataQuality/TestCaseImportExportE2eFlow.spec.ts › Admin: Complete export-import-validate flow (shard 2, 1 retry)
  • Features/DataQuality/TestCaseResultPermissions.spec.ts › User with only VIEW cannot PATCH results (shard 2, 1 retry)
  • Features/Glossary/GlossaryWorkflow.spec.ts › should start term as Draft when glossary has reviewers (shard 2, 1 retry)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Features/UserProfileOnlineStatus.spec.ts › Should show "Active recently" for users active within last hour (shard 3, 1 retry)
  • Flow/PersonaDeletionUserProfile.spec.ts › User profile loads correctly before and after persona deletion (shard 3, 1 retry)
  • Flow/PersonaFlow.spec.ts › Set default persona for team should work properly (shard 3, 1 retry)
  • Pages/DataContracts.spec.ts › Create Data Contract and validate for Store Procedure (shard 4, 1 retry)
  • Pages/Entity.spec.ts › User as Owner with unsorted list (shard 5, 1 retry)
  • Pages/ExplorePageRightPanel.spec.ts › Should verify deleted user not visible in owner selection for table (shard 5, 1 retry)
  • Pages/Glossary.spec.ts › Drag and Drop Glossary Term (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
  • Pages/Lineage/LineageRightPanel.spec.ts › Verify custom properties tab IS visible for supported type: searchIndex (shard 6, 1 retry)

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

@sonika-shah sonika-shah merged commit 4a2f42f into main Apr 30, 2026
66 of 69 checks passed
@sonika-shah sonika-shah deleted the fix-cert-batch-tag-usage-source-filter branch April 30, 2026 18:37
@github-actions (Contributor)

Changes have been cherry-picked to the 1.12.7 branch.

github-actions Bot pushed a commit that referenced this pull request Apr 30, 2026
…7847)

* Add source filter and use indexed hash prefix in cert tag batch query

* Fix duplicate schema names in cert batch test, trim verbose comments

* Update EntityRepositoryCertificationTest mocks for new getCertTagsInternalBatch signature

* fix check style

(cherry picked from commit 4a2f42f)

@github-actions (Contributor)

Changes have been cherry-picked to the 1.13 branch.

github-actions Bot pushed a commit that referenced this pull request Apr 30, 2026
…7847)

(cherry picked from commit 4a2f42f)
mohitjeswani01 pushed a commit to mohitjeswani01/OpenMetadata that referenced this pull request Apr 30, 2026
…en-metadata#27847)
