Remove fuzzy match on ngram; merge SearchUtils into single class; add more test coverage by harshach · Pull Request #27636 · open-metadata/OpenMetadata

harshach · 2026-04-22T16:51:15Z

fixes https://github.com/open-metadata/openmetadata-collate/issues/3793


Summary by Gitar

Search performance and reliability:
- Implemented a sub-token count heuristic in SearchUtils to dynamically disable fuzziness and limit expansions for complex search queries, preventing Lucene clause-count overflow.
- Updated searchSettings.json to add name.keyword with high boost for exact-match ranking in table assets.
Code organization:
- Merged SearchUtil into SearchUtils and centralized all index classification and mapping logic.
Testing and validation:
- Added comprehensive matrix tests in SearchResourceIT to validate search behavior under varying query complexity and separator configurations.
- Added unit test suite in SearchUtilsTest to verify fuzzy-logic boundaries and index classification consistency.


Copilot

Pull request overview

This PR refines search-query fuzziness/max-expansion heuristics to prevent Lucene clause explosions, consolidates search helper utilities into SearchUtils, and adds regression coverage (unit + integration) around the affected search behavior and ranking.

Changes:

Merged the former SearchUtil helpers into SearchUtils and updated call sites accordingly.
Updated fuzzy-query heuristics to key off “alphanumeric sub-token” count (mirroring the ngram tokenizer split behavior) and added unit tests to pin the boundary behavior.
Improved search configuration and integration coverage (e.g., adding name.keyword exact boost for tables; matrix tests to guard shard-failure regressions).
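The sub-token heuristic described above can be sketched roughly as follows. This is a minimal illustration, not the code in `SearchUtils`: the class name, method names, and the cutoff value are all assumptions made for the example, and the real boundary is whatever `SearchUtilsTest` pins.

```java
import java.util.regex.Pattern;

/** Illustrative only: names and the cutoff are assumptions, not the PR's code. */
final class SubTokenHeuristicSketch {
  // Hypothetical cutoff; the actual boundary lives in SearchUtils.
  static final int MAX_SUB_TOKENS_FOR_FUZZY = 2;

  // Any run of non-alphanumeric characters acts as a separator.
  private static final Pattern NON_ALNUM = Pattern.compile("[^\\p{Alnum}]+");

  /** Counts runs of ASCII letters/digits in the query. */
  static int countAlnumSubTokens(String query) {
    if (query == null || query.isBlank()) {
      return 0;
    }
    int count = 0;
    for (String part : NON_ALNUM.split(query.trim())) {
      // split() can yield a leading empty string when the query starts with a separator.
      if (!part.isEmpty()) {
        count++;
      }
    }
    return count;
  }

  /** Fuzziness stays on only for queries with few sub-tokens. */
  static boolean fuzzinessEnabled(String query) {
    int n = countAlnumSubTokens(query);
    return n > 0 && n <= MAX_SUB_TOKENS_FOR_FUZZY;
  }
}
```

With this sketch, a single-word query such as `custmer` keeps fuzziness, while a query like `customer_analytics.v2 daily` (four sub-tokens) would not, which is the kind of input that can otherwise multiply clauses.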

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 1 comment.


| File | Description |
| --- | --- |
| openmetadata-service/src/test/java/org/openmetadata/service/search/SearchUtilsTest.java | Adds parameterized unit tests for fuzziness/max_expansions heuristics and index classification routing. |
| openmetadata-service/src/main/resources/json/data/settings/searchSettings.json | Adds `name.keyword` exact-match boost for the `table` asset configuration. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/opensearch/OpenSearchSourceBuilderFactory.java | Switches static imports from `SearchUtil` to `SearchUtils`. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/elasticsearch/ElasticSearchSourceBuilderFactory.java | Switches static imports from `SearchUtil` to `SearchUtils`. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/SearchUtils.java | Incorporates index classification + fuzziness/max_expansions logic (formerly in `SearchUtil`) using sub-token counting. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/SearchUtil.java | Removed (functionality merged into `SearchUtils`). |
| openmetadata-service/src/main/java/org/openmetadata/service/search/SearchSourceBuilderFactory.java | Switches static imports from `SearchUtil` to `SearchUtils`. |
| openmetadata-mcp/src/main/java/org/openmetadata/mcp/tools/SearchMetadataTool.java | Updates static import to `SearchUtils.mapEntityTypesToIndexNames`. |
| openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/SearchResourceIT.java | Adds integration regression tests for dataAsset alias clause-explosion behavior, typo-tolerance guard, and exact-name ranking guard. |
| .gitignore | Broadens ignores for `.claude/` content. |


Copilot

Pull request overview

Copilot reviewed 9 out of 10 changed files in this pull request and generated no new comments.

github-actions · 2026-04-22T19:25:55Z

🔴 Playwright Results — 1 failure(s), 13 flaky

✅ 3697 passed · ❌ 1 failed · 🟡 13 flaky · ⏭️ 89 skipped

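| Shard | Passed | Failed | Flaky | Skipped |
| --- | --- | --- | --- | --- |
| 🔴 Shard 1 | 480 | 1 | 0 | 4 |
| 🟡 Shard 2 | 654 | 0 | 2 | 7 |
| 🟡 Shard 3 | 663 | 0 | 3 | 1 |
| 🟡 Shard 4 | 645 | 0 | 3 | 27 |
| ✅ Shard 5 | 611 | 0 | 0 | 42 |
| 🟡 Shard 6 | 644 | 0 | 5 | 8 |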

Genuine Failures (failed on all attempts)

❌ Pages/SearchSettings.spec.ts › Restore default search settings (shard 1)

Error: expect(received).toEqual(expected) // deep equality

- Expected  - 0
+ Received  + 5

@@ -45,10 +45,15 @@
        "boost": 20,
        "field": "displayName.keyword",
        "matchType": "exact",
      },
      Object {
+       "boost": 20,
+       "field": "name.keyword",
+       "matchType": "exact",
+     },
+     Object {
        "boost": 10,
        "field": "name",
        "matchType": "phrase",
      },
      Object {

🟡 13 flaky test(s) (passed on retry)

Features/BulkEditEntity.spec.ts › Glossary (shard 2, 1 retry)
Features/Glossary/GlossaryHierarchy.spec.ts › should cancel move operation (shard 2, 1 retry)
Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
Features/UserProfileOnlineStatus.spec.ts › Should show online status badge on user profile for active users (shard 3, 1 retry)
Flow/PersonaDeletionUserProfile.spec.ts › User profile loads correctly before and after persona deletion (shard 3, 1 retry)
Pages/Customproperties-part2.spec.ts › entityReferenceList shows item count, scrollable list, no expand toggle (shard 4, 1 retry)
Pages/DataContractsSemanticRules.spec.ts › Validate Description Rule Is_Not_Set (shard 4, 1 retry)
Pages/Entity.spec.ts › Announcement create, edit & delete (shard 4, 1 retry)
Pages/Lineage/DataAssetLineage.spec.ts › verify create lineage for entity - Container (shard 6, 1 retry)
Pages/Lineage/DataAssetLineage.spec.ts › verify create lineage for entity - Api Endpoint (shard 6, 1 retry)
Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
Pages/Lineage/LineageRightPanel.spec.ts › Verify custom properties tab IS visible for supported type: searchIndex (shard 6, 1 retry)
Pages/Users.spec.ts › Permissions for table details page for Data Consumer (shard 6, 1 retry)

📦 Download artifacts

How to debug locally

# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

test(search): bump indexing-wait timeouts from 30s to 90s

CI was timing out in the Awaitility loops that wait for newly created tables to appear in the search index. Indexing is asynchronous, driven by change events, and can take noticeably longer under CI load than locally. 30s left no margin; 90s gives a 3x cushion without slowing the happy path, because the wait returns as soon as the table is found.
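To see why a longer ceiling does not slow the passing case, here is a minimal plain-Java polling sketch. It is not Awaitility and not the test code from this PR; `PollingSketch` and `pollUntil` are hypothetical names used only to show the shape of a poll-with-deadline loop.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Illustrative stand-in for a poll-with-timeout loop; not the Awaitility API. */
final class PollingSketch {
  /**
   * Checks the condition repeatedly until it holds or atMost has elapsed.
   * Returns true as soon as the condition holds, false on timeout or interrupt.
   */
  static boolean pollUntil(BooleanSupplier condition, Duration atMost, Duration interval) {
    long deadline = System.nanoTime() + atMost.toNanos();
    while (true) {
      if (condition.getAsBoolean()) {
        return true; // happy path exits immediately, regardless of the ceiling
      }
      if (System.nanoTime() - deadline >= 0) {
        return false; // ceiling reached without success
      }
      try {
        Thread.sleep(interval.toMillis());
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
  }
}
```

The ceiling only matters when the condition is slow to become true, so raising it from 30s to 90s costs nothing when indexing is quick.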

Copilot

Pull request overview

Copilot reviewed 9 out of 10 changed files in this pull request and generated 2 comments.

test(search): use distinctive xqz prefix and bump matrix size to 50

CI was failing on three short-prefix matrix scenarios that queried the seeded table's unique tag. The tag was pure hex from uniqueShortId(), which shares ngrams with every UUID/hash in a busy CI index, so our table got pushed out of the top-15 hits by ngram-overlap noise from other tests. Two fixes:

- Prefix the tag with "xqz", a trigraph rare in any real document. Now the first sub-token is uniquely ours regardless of index pollution.
- Bump matrix size from 15 to 50. The matrix tests retrievability, not top-N ranking; testExactFullNameRanksSeededTableFirst already pins the production-UI ranking concern at size=10.
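A rough sketch of the tag construction follows. `SeedTagSketch` and `shortHexId` are hypothetical stand-ins; the actual `uniqueShortId()` helper and the exact way the PR joins the prefix to the id are not shown in this thread, so plain concatenation here is an assumption.

```java
import java.util.UUID;

/** Illustrative only: stand-in for building a seeded-table tag with a rare prefix. */
final class SeedTagSketch {
  // None of 'x', 'q', 'z' is a hex digit, so this trigram cannot occur inside a hex string.
  static final String RARE_PREFIX = "xqz";

  /** Hypothetical substitute for uniqueShortId(): eight lowercase hex characters. */
  static String shortHexId() {
    return UUID.randomUUID().toString().replace("-", "").substring(0, 8);
  }

  /** Assumed format: prefix followed directly by the hex id. */
  static String seedTag() {
    return RARE_PREFIX + shortHexId();
  }
}
```

Because hex digits are limited to `0-9` and `a-f`, a query on the `xqz` prefix cannot collide with ngrams drawn from UUIDs or hashes left in the index by other tests.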

gitar-bot · 2026-04-23T05:37:30Z

Code Review ✅ Approved

Consolidates SearchUtils into a single class and removes ngram fuzzy matching while increasing test coverage. No issues found.


Copilot

Pull request overview

Copilot reviewed 9 out of 10 changed files in this pull request and generated 1 comment.

Copilot · 2026-04-23T05:40:59Z

+    Table table = createTestTable(ns, "customer_analytics");
+    String indexedName = table.getName();
+    String firstSeg = indexedName.split("_+")[0];
+
+    Awaitility.await()
+        .atMost(90, TimeUnit.SECONDS)
+        .pollInterval(500, TimeUnit.MILLISECONDS)
+        .until(
+            () -> {
+              String r =
+                  client.search().query(firstSeg).index("table_search_index").size(25).execute();
+              JsonNode root = OBJECT_MAPPER.readTree(r);
+              for (JsonNode hit : root.path("hits").path("hits")) {
+                if (indexedName.equals(hit.path("_source").path("name").asText())) {
+                  return true;
+                }
+              }
+              return false;
+            });
+
+    // "custmer" is a 1-char typo of "customer", 1 alnum sub-token → fuzziness path is active.
+    String typoQuery = "custmer";
+    String response =
+        client.search().query(typoQuery).index("dataAsset").deleted(false).size(25).execute();
+    JsonNode root = OBJECT_MAPPER.readTree(response);


testSingleWordTypoStillMatchesViaFuzzy is likely to be flaky because it queries dataAsset with a very generic term ("custmer") and only fetches 25 hits; matches from other entities/fields (e.g., column names containing "customer") can easily push the seeded table out of the top-N. Consider making the seeded table name include a short unique marker (similar to the xqz prefix used in the matrix test) and/or increasing the result size, so the assertion is deterministic while still exercising the single-token fuzzy path.
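The code comment in the hunk above calls "custmer" a one-character typo of "customer". A quick way to check a claim like that when picking typo queries for fuzzy tests is to compute the edit distance. The sketch below uses plain Levenshtein distance as an illustration; `EditDistanceSketch` is a hypothetical helper, not part of the PR, and the search engine's own fuzzy matching may count edits slightly differently (for example, by treating a transposition as one edit).

```java
/** Illustrative helper: classic two-row Levenshtein distance. */
final class EditDistanceSketch {
  static int levenshtein(String a, String b) {
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];
    // Distance from the empty prefix of a to each prefix of b.
    for (int j = 0; j <= b.length(); j++) {
      prev[j] = j;
    }
    for (int i = 1; i <= a.length(); i++) {
      curr[0] = i;
      for (int j = 1; j <= b.length(); j++) {
        int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
        int insertion = curr[j - 1] + 1;
        int deletion = prev[j] + 1;
        int substitution = prev[j - 1] + cost;
        curr[j] = Math.min(Math.min(insertion, deletion), substitution);
      }
      // Reuse the two rows instead of allocating a full matrix.
      int[] tmp = prev;
      prev = curr;
      curr = tmp;
    }
    return prev[b.length()];
  }
}
```

Here `levenshtein("custmer", "customer")` is 1 (one inserted `o`), consistent with the comment in the test.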

sonarqubecloud · 2026-04-23T06:37:17Z

Quality Gate passed for 'open-metadata-ingestion'

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

github-actions · 2026-04-23T14:53:10Z

Failed to cherry-pick changes to the 1.12.7 branch.
Please cherry-pick the changes manually.
You can find more details here.


Remove fuzzy match on ngram; merge SearchUtils into single class; add more test coverage

55fba86

Copilot AI review requested due to automatic review settings April 22, 2026 16:51

github-actions Bot added the backend and safe to test labels Apr 22, 2026

Copilot started reviewing on behalf of harshach April 22, 2026 16:51

Copilot AI reviewed Apr 22, 2026


Comment thread openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/SearchResourceIT.java Outdated

harshach had a problem deploying to test April 22, 2026 17:02 — with GitHub Actions Error

pmbrull previously approved these changes Apr 22, 2026

Update openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/SearchResourceIT.java

2052e26

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

harshach dismissed pmbrull’s stale review via 2052e26 April 22, 2026 17:16

Copilot AI review requested due to automatic review settings April 22, 2026 17:16

harshach added the To release label Apr 22, 2026

Copilot started reviewing on behalf of harshach April 22, 2026 17:16

Copilot AI reviewed Apr 22, 2026


harshach temporarily deployed to test April 22, 2026 17:26 — with GitHub Actions Inactive

harshach had a problem deploying to test April 22, 2026 17:26 — with GitHub Actions Failure

harshach temporarily deployed to test April 22, 2026 17:26 — with GitHub Actions Inactive

harshach had a problem deploying to test April 22, 2026 17:26 — with GitHub Actions Failure

harshach temporarily deployed to test April 22, 2026 17:26 — with GitHub Actions Inactive

Fix tests

cf1ad84

harshach had a problem deploying to test April 23, 2026 01:52 — with GitHub Actions Error

Copilot AI review requested due to automatic review settings April 23, 2026 02:52

Copilot started reviewing on behalf of harshach April 23, 2026 02:53

Copilot AI reviewed Apr 23, 2026


Comment thread openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/SearchResourceIT.java Outdated

Comment thread openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/SearchResourceIT.java Outdated

harshach temporarily deployed to test April 23, 2026 03:03 — with GitHub Actions Inactive

harshach had a problem deploying to test April 23, 2026 03:03 — with GitHub Actions Failure

harshach temporarily deployed to test April 23, 2026 03:03 — with GitHub Actions Inactive

harshach and others added 2 commits April 22, 2026 22:34

Merge branch 'main' into search_tests

d417391

Copilot AI review requested due to automatic review settings April 23, 2026 05:36

Copilot started reviewing on behalf of harshach April 23, 2026 05:37

Copilot AI reviewed Apr 23, 2026


harshach temporarily deployed to test April 23, 2026 05:46 — with GitHub Actions Inactive

harshach had a problem deploying to test April 23, 2026 05:46 — with GitHub Actions Failure

harshach temporarily deployed to test April 23, 2026 05:46 — with GitHub Actions Inactive

harshach merged commit 10e43a4 into main Apr 23, 2026
57 of 61 checks passed

harshach deleted the search_tests branch April 23, 2026 14:46

harshach merged 6 commits into main from search_tests


3 participants
