Remove fuzzy match on ngram; merge SearchUtils into single class; add more test coverage#27636
Pull request overview
This PR refines search-query fuzziness/max-expansion heuristics to prevent Lucene clause explosions, consolidates search helper utilities into SearchUtils, and adds regression coverage (unit + integration) around the affected search behavior and ranking.
Changes:
- Merged the former `SearchUtil` helpers into `SearchUtils` and updated call sites accordingly.
- Updated fuzzy-query heuristics to key off “alphanumeric sub-token” count (mirroring the ngram tokenizer split behavior) and added unit tests to pin the boundary behavior.
- Improved search configuration and integration coverage (e.g., adding a `name.keyword` exact boost for tables; matrix tests to guard shard-failure regressions).
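The sub-token heuristic in the second bullet can be sketched roughly as follows. This is a minimal illustration, not the actual `SearchUtils` code: the class name `SubTokenSketch`, its method names, and the threshold `MAX_SUB_TOKENS_FOR_FUZZY` are hypothetical, and splitting on runs of non-alphanumeric characters is an assumption about how the ngram tokenizer separates tokens.

```java
import java.util.ArrayList;
import java.util.List;

final class SubTokenSketch {
  // Hypothetical threshold: fuzziness stays on only for single-sub-token queries.
  static final int MAX_SUB_TOKENS_FOR_FUZZY = 1;

  // Splits a query on runs of non-alphanumeric characters (assumed tokenizer behavior)
  // and returns the non-empty pieces.
  static List<String> alnumSubTokens(String query) {
    List<String> tokens = new ArrayList<>();
    if (query == null) {
      return tokens;
    }
    for (String part : query.split("[^\\p{Alnum}]+")) {
      if (!part.isEmpty()) {
        tokens.add(part);
      }
    }
    return tokens;
  }

  // Fuzzy matching is allowed only when the query has at least one sub-token
  // and no more than the (hypothetical) threshold.
  static boolean fuzzyAllowed(String query) {
    int count = alnumSubTokens(query).size();
    return count >= 1 && count <= MAX_SUB_TOKENS_FOR_FUZZY;
  }

  public static void main(String[] args) {
    for (String q : new String[] {"custmer", "customer_analytics", "a.b-c"}) {
      System.out.println(q + " -> " + alnumSubTokens(q) + ", fuzzy=" + fuzzyAllowed(q));
    }
  }
}
```

Under these assumptions a one-word typo such as `custmer` keeps the fuzzy path, while a multi-part name such as `customer_analytics` does not, which is the boundary the unit tests are described as pinning.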
Reviewed changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| openmetadata-service/src/test/java/org/openmetadata/service/search/SearchUtilsTest.java | Adds parameterized unit tests for fuzziness/max_expansions heuristics and index classification routing. |
| openmetadata-service/src/main/resources/json/data/settings/searchSettings.json | Adds name.keyword exact-match boost for the table asset configuration. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/opensearch/OpenSearchSourceBuilderFactory.java | Switches static imports from SearchUtil to SearchUtils. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/elasticsearch/ElasticSearchSourceBuilderFactory.java | Switches static imports from SearchUtil to SearchUtils. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/SearchUtils.java | Incorporates index classification + fuzziness/max_expansions logic (formerly in SearchUtil) using sub-token counting. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/SearchUtil.java | Removed (functionality merged into SearchUtils). |
| openmetadata-service/src/main/java/org/openmetadata/service/search/SearchSourceBuilderFactory.java | Switches static imports from SearchUtil to SearchUtils. |
| openmetadata-mcp/src/main/java/org/openmetadata/mcp/tools/SearchMetadataTool.java | Updates static import to SearchUtils.mapEntityTypesToIndexNames. |
| openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/SearchResourceIT.java | Adds integration regression tests for dataAsset alias clause-explosion behavior, typo-tolerance guard, and exact-name ranking guard. |
| .gitignore | Broadens ignores for .claude/ content. |
🔴 Playwright Results: 1 failure(s), 13 flaky
✅ 3697 passed · ❌ 1 failed · 🟡 13 flaky · ⏭️ 89 skipped
Genuine Failures (failed on all attempts) ❌
CI was timing out in the Awaitility loops that wait for newly created tables to appear in the search index. Indexing is async via change events and can take noticeably longer under CI load than locally. 30s gave no margin; 90s is a 3x cushion that does not slow the happy path.
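To illustrate why a larger ceiling does not slow the passing case, here is a plain-Java sketch of a poll-until-true loop. It is not Awaitility's implementation: `PollingWaitSketch` and `waitUntil` are hypothetical names, and the loop only approximates the `atMost` / `pollInterval` behavior used in the tests.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

final class PollingWaitSketch {
  // Polls the condition until it returns true or the deadline passes.
  // Returns as soon as the condition holds, so the timeout is a ceiling, not a delay.
  static boolean waitUntil(BooleanSupplier condition, Duration atMost, Duration pollInterval) {
    long deadline = System.nanoTime() + atMost.toNanos();
    while (true) {
      if (condition.getAsBoolean()) {
        return true;
      }
      if (System.nanoTime() - deadline >= 0) {
        return false;
      }
      try {
        Thread.sleep(pollInterval.toMillis());
      } catch (InterruptedException e) {
        // Restore the interrupt flag and give up waiting.
        Thread.currentThread().interrupt();
        return false;
      }
    }
  }

  public static void main(String[] args) {
    long start = System.nanoTime();
    // The condition is already true, so this returns without sleeping
    // even though the ceiling is 90 seconds.
    boolean found = waitUntil(() -> true, Duration.ofSeconds(90), Duration.ofMillis(500));
    long elapsedMs = (System.nanoTime() - start) / 1_000_000;
    System.out.println("found=" + found + " after ~" + elapsedMs + " ms");
  }
}
```

The cost of the longer timeout is paid only when the document never shows up, which is the failing case anyway.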
CI was failing on three short-prefix matrix scenarios that queried the seeded table's unique tag. The tag was pure hex from uniqueShortId(), which shares ngrams with every UUID/hash in a busy CI index, so our table got pushed out of the top-15 hits by ngram-overlap noise from other tests. Two fixes:
- Prefix the tag with "xqz", a trigraph rare in any real document. Now the first sub-token is uniquely ours regardless of index pollution.
- Bump matrix size from 15 to 50. The matrix tests retrievability, not top-N ranking; testExactFullNameRanksSeededTableFirst already pins the production-UI ranking concern at size=10.
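The ngram-overlap problem and the effect of the prefix can be shown with a small sketch. It assumes 3-character ngrams purely for illustration (the index's real ngram sizes are not stated here), and `DistinctTagSketch`, `trigrams`, `sharesTrigram`, and `distinctiveTag` are hypothetical names; the hex strings are made-up sample values, not output of `uniqueShortId()`.

```java
import java.util.HashSet;
import java.util.Set;

final class DistinctTagSketch {
  // Marker taken from the commit message; chosen because it never occurs in hex text.
  static final String MARKER = "xqz";

  // Returns every 3-character substring of s (assumed ngram size of 3).
  static Set<String> trigrams(String s) {
    Set<String> grams = new HashSet<>();
    for (int i = 0; i + 3 <= s.length(); i++) {
      grams.add(s.substring(i, i + 3));
    }
    return grams;
  }

  // True when the two strings have at least one trigram in common.
  static boolean sharesTrigram(String a, String b) {
    Set<String> common = trigrams(a);
    common.retainAll(trigrams(b));
    return !common.isEmpty();
  }

  // Prepends the marker so the tag starts with a trigram that hex ids cannot contain.
  static String distinctiveTag(String hexId) {
    return MARKER + hexId;
  }

  public static void main(String[] args) {
    String seededTag = "a1b2c3d4"; // sample hex tag
    String unrelatedId = "99a1b2ff"; // sample hex fragment from another test's document
    System.out.println("hex tags share a trigram: " + sharesTrigram(seededTag, unrelatedId));
    System.out.println(
        "unrelated id contains marker trigram: " + trigrams(unrelatedId).contains(MARKER));
    System.out.println("prefixed tag: " + distinctiveTag(seededTag));
  }
}
```

The prefixed tag still shares its hex trigrams with other documents, but a query on the `xqz` prefix targets a trigram that pure-hex text cannot produce.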
Code Review ✅ Approved
Consolidates SearchUtils into a single class and removes ngram fuzzy matching while increasing test coverage. No issues found.

    Table table = createTestTable(ns, "customer_analytics");
    String indexedName = table.getName();
    String firstSeg = indexedName.split("_+")[0];

    Awaitility.await()
        .atMost(90, TimeUnit.SECONDS)
        .pollInterval(500, TimeUnit.MILLISECONDS)
        .until(
            () -> {
              String r =
                  client.search().query(firstSeg).index("table_search_index").size(25).execute();
              JsonNode root = OBJECT_MAPPER.readTree(r);
              for (JsonNode hit : root.path("hits").path("hits")) {
                if (indexedName.equals(hit.path("_source").path("name").asText())) {
                  return true;
                }
              }
              return false;
            });

    // "custmer" is a 1-char typo of "customer", 1 alnum sub-token → fuzziness path is active.
    String typoQuery = "custmer";
    String response =
        client.search().query(typoQuery).index("dataAsset").deleted(false).size(25).execute();
    JsonNode root = OBJECT_MAPPER.readTree(response);

testSingleWordTypoStillMatchesViaFuzzy is likely to be flaky because it queries dataAsset with a very generic term ("custmer") and only fetches 25 hits; matches from other entities/fields (e.g., column names containing "customer") can easily push the seeded table out of the top-N. Consider making the seeded table name include a short unique marker (similar to the xqz prefix used in the matrix test) and/or increasing the result size, so the assertion is deterministic while still exercising the single-token fuzzy path.
Failed to cherry-pick changes to the 1.12.7 branch.
… more test coverage (#27636)

* Remove fuzzy match on ngram; merge SearchUtils into single class; add more test coverage
* Update openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/SearchResourceIT.java

  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Fix tests
* test(search): bump indexing-wait timeouts from 30s to 90s

  CI was timing out in the Awaitility loops that wait for newly-created tables to appear in the search index. Indexing is async via change events and can take noticeably longer under CI load than locally. 30s gave no margin; 90s is 3x cushion without slowing the happy path.
* test(search): use distinctive xqz prefix and bump matrix size to 50

  CI was failing on three short-prefix matrix scenarios that queried the seeded table's unique tag. The tag was pure hex from uniqueShortId(), which shares ngrams with every UUID/hash in a busy CI index, so our table got pushed out of the top-15 hits by ngram-overlap noise from other tests. Two fixes:
  - Prefix the tag with "xqz", a trigraph rare in any real document. Now the first sub-token is uniquely ours regardless of index pollution.
  - Bump matrix size from 15 to 50. The matrix tests retrievability, not top-N ranking; testExactFullNameRanksSeededTableFirst already pins the production-UI ranking concern at size=10.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>



fixes https://github.com/open-metadata/openmetadata-collate/issues/3793
Describe your changes:
Summary by Gitar
- `SearchUtils` now dynamically disables fuzziness and limits expansions for complex search queries, preventing Lucene clause-count overflow.
- `searchSettings.json` now adds `name.keyword` with a high boost for exact-match ranking in table assets.
- `SearchUtil` is merged into `SearchUtils`, which centralizes all index classification and mapping logic.
- `SearchResourceIT` now validates search behavior under varying query complexity and separator configurations.
- `SearchUtilsTest` verifies fuzzy-logic boundaries and index classification consistency.