Skip to content

Remove fuzzy match on ngram; merge SearchUtils into single class; add more test coverage#27636

Merged
harshach merged 6 commits intomainfrom
search_tests
Apr 23, 2026
Merged

Remove fuzzy match on ngram; merge SearchUtils into single class; add more test coverage#27636
harshach merged 6 commits intomainfrom
search_tests

Conversation

@harshach
Copy link
Copy Markdown
Collaborator

@harshach harshach commented Apr 22, 2026

fixes https://github.com/open-metadata/openmetadata-collate/issues/3793

Describe your changes:

Fixes

I worked on ... because ...

Type of change:

  • Bug fix
  • Improvement
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes <issue-number>: <short explanation>
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.

Summary by Gitar

  • Search performance and reliability:
    • Implemented a sub-token count heuristic in SearchUtils to dynamically disable fuzziness and limit expansions for complex search queries, preventing Lucene clause-count overflow.
    • Updated searchSettings.json to add name.keyword with high boost for exact-match ranking in table assets.
  • Code organization:
    • Merged SearchUtil into SearchUtils and centralized all index classification and mapping logic.
  • Testing and validation:
    • Added comprehensive matrix tests in SearchResourceIT to validate search behavior under varying query complexity and separator configurations.
    • Added unit test suite in SearchUtilsTest to verify fuzzy-logic boundaries and index classification consistency.

This will update automatically on new commits.

Copilot AI review requested due to automatic review settings April 22, 2026 16:51
@github-actions github-actions Bot added backend safe to test Add this label to run secure Github workflows on PRs labels Apr 22, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refines search-query fuzziness/max-expansion heuristics to prevent Lucene clause explosions, consolidates search helper utilities into SearchUtils, and adds regression coverage (unit + integration) around the affected search behavior and ranking.

Changes:

  • Merged the former SearchUtil helpers into SearchUtils and updated call sites accordingly.
  • Updated fuzzy-query heuristics to key off “alphanumeric sub-token” count (mirroring the ngram tokenizer split behavior) and added unit tests to pin the boundary behavior.
  • Improved search configuration and integration coverage (e.g., adding name.keyword exact boost for tables; matrix tests to guard shard-failure regressions).

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
openmetadata-service/src/test/java/org/openmetadata/service/search/SearchUtilsTest.java Adds parameterized unit tests for fuzziness/max_expansions heuristics and index classification routing.
openmetadata-service/src/main/resources/json/data/settings/searchSettings.json Adds name.keyword exact-match boost for the table asset configuration.
openmetadata-service/src/main/java/org/openmetadata/service/search/opensearch/OpenSearchSourceBuilderFactory.java Switches static imports from SearchUtil to SearchUtils.
openmetadata-service/src/main/java/org/openmetadata/service/search/elasticsearch/ElasticSearchSourceBuilderFactory.java Switches static imports from SearchUtil to SearchUtils.
openmetadata-service/src/main/java/org/openmetadata/service/search/SearchUtils.java Incorporates index classification + fuzziness/max_expansions logic (formerly in SearchUtil) using sub-token counting.
openmetadata-service/src/main/java/org/openmetadata/service/search/SearchUtil.java Removed (functionality merged into SearchUtils).
openmetadata-service/src/main/java/org/openmetadata/service/search/SearchSourceBuilderFactory.java Switches static imports from SearchUtil to SearchUtils.
openmetadata-mcp/src/main/java/org/openmetadata/mcp/tools/SearchMetadataTool.java Updates static import to SearchUtils.mapEntityTypesToIndexNames.
openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/SearchResourceIT.java Adds integration regression tests for dataAsset alias clause-explosion behavior, typo-tolerance guard, and exact-name ranking guard.
.gitignore Broadens ignores for .claude/ content.

pmbrull
pmbrull previously approved these changes Apr 22, 2026
…it/tests/SearchResourceIT.java

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 22, 2026 17:16
@harshach harshach added the To release Will cherry-pick this PR into the release branch label Apr 22, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 10 changed files in this pull request and generated no new comments.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 22, 2026

🔴 Playwright Results — 1 failure(s), 13 flaky

✅ 3697 passed · ❌ 1 failed · 🟡 13 flaky · ⏭️ 89 skipped

Shard Passed Failed Flaky Skipped
🔴 Shard 1 480 1 0 4
🟡 Shard 2 654 0 2 7
🟡 Shard 3 663 0 3 1
🟡 Shard 4 645 0 3 27
✅ Shard 5 611 0 0 42
🟡 Shard 6 644 0 5 8

Genuine Failures (failed on all attempts)

Pages/SearchSettings.spec.ts › Restore default search settings (shard 1)
Error: �[2mexpect(�[22m�[31mreceived�[39m�[2m).�[22mtoEqual�[2m(�[22m�[32mexpected�[39m�[2m) // deep equality�[22m

�[32m- Expected  - 0�[39m
�[31m+ Received  + 5�[39m

�[33m@@ -45,10 +45,15 @@�[39m
�[2m        "boost": 20,�[22m
�[2m        "field": "displayName.keyword",�[22m
�[2m        "matchType": "exact",�[22m
�[2m      },�[22m
�[2m      Object {�[22m
�[31m+       "boost": 20,�[39m
�[31m+       "field": "name.keyword",�[39m
�[31m+       "matchType": "exact",�[39m
�[31m+     },�[39m
�[31m+     Object {�[39m
�[2m        "boost": 10,�[22m
�[2m        "field": "name",�[22m
�[2m        "matchType": "phrase",�[22m
�[2m      },�[22m
�[2m      Object {�[22m
🟡 13 flaky test(s) (passed on retry)
  • Features/BulkEditEntity.spec.ts › Glossary (shard 2, 1 retry)
  • Features/Glossary/GlossaryHierarchy.spec.ts › should cancel move operation (shard 2, 1 retry)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Features/UserProfileOnlineStatus.spec.ts › Should show online status badge on user profile for active users (shard 3, 1 retry)
  • Flow/PersonaDeletionUserProfile.spec.ts › User profile loads correctly before and after persona deletion (shard 3, 1 retry)
  • Pages/Customproperties-part2.spec.ts › entityReferenceList shows item count, scrollable list, no expand toggle (shard 4, 1 retry)
  • Pages/DataContractsSemanticRules.spec.ts › Validate Description Rule Is_Not_Set (shard 4, 1 retry)
  • Pages/Entity.spec.ts › Announcement create, edit & delete (shard 4, 1 retry)
  • Pages/Lineage/DataAssetLineage.spec.ts › verify create lineage for entity - Container (shard 6, 1 retry)
  • Pages/Lineage/DataAssetLineage.spec.ts › verify create lineage for entity - Api Endpoint (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
  • Pages/Lineage/LineageRightPanel.spec.ts › Verify custom properties tab IS visible for supported type: searchIndex (shard 6, 1 retry)
  • Pages/Users.spec.ts › Permissions for table details page for Data Consumer (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

CI was timing out in the Awaitility loops that wait for newly-created
tables to appear in the search index. Indexing is async via change
events and can take noticeably longer under CI load than locally.
30s gave no margin; 90s is 3x cushion without slowing the happy path.
Copilot AI review requested due to automatic review settings April 23, 2026 02:52
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 10 changed files in this pull request and generated 2 comments.

harshach and others added 2 commits April 22, 2026 22:34
CI was failing on three short-prefix matrix scenarios that queried the
seeded table's unique tag. The tag was pure hex from uniqueShortId(),
which shares ngrams with every UUID/hash in a busy CI index — our
table got pushed out of the top-15 hits by ngram-overlap noise from
other tests.

Two fixes:
- Prefix the tag with "xqz", a trigraph rare in any real document.
  Now the first sub-token is uniquely ours regardless of index pollution.
- Bump matrix size from 15 to 50. The matrix tests retrievability,
  not top-N ranking — testExactFullNameRanksSeededTableFirst already
  pins the production-UI ranking concern at size=10.
Copilot AI review requested due to automatic review settings April 23, 2026 05:36
@gitar-bot
Copy link
Copy Markdown

gitar-bot Bot commented Apr 23, 2026

Code Review ✅ Approved

Consolidates SearchUtils into a single class and removes ngram fuzzy matching while increasing test coverage. No issues found.

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 10 changed files in this pull request and generated 1 comment.

Comment on lines +593 to +617
Table table = createTestTable(ns, "customer_analytics");
String indexedName = table.getName();
String firstSeg = indexedName.split("_+")[0];

Awaitility.await()
.atMost(90, TimeUnit.SECONDS)
.pollInterval(500, TimeUnit.MILLISECONDS)
.until(
() -> {
String r =
client.search().query(firstSeg).index("table_search_index").size(25).execute();
JsonNode root = OBJECT_MAPPER.readTree(r);
for (JsonNode hit : root.path("hits").path("hits")) {
if (indexedName.equals(hit.path("_source").path("name").asText())) {
return true;
}
}
return false;
});

// "custmer" is a 1-char typo of "customer", 1 alnum sub-token → fuzziness path is active.
String typoQuery = "custmer";
String response =
client.search().query(typoQuery).index("dataAsset").deleted(false).size(25).execute();
JsonNode root = OBJECT_MAPPER.readTree(response);
Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

testSingleWordTypoStillMatchesViaFuzzy is likely to be flaky because it queries dataAsset with a very generic term ("custmer") and only fetches 25 hits; matches from other entities/fields (e.g., column names containing "customer") can easily push the seeded table out of the top-N. Consider making the seeded table name include a short unique marker (similar to the xqz prefix used in the matrix test) and/or increasing the result size, so the assertion is deterministic while still exercising the single-token fuzzy path.

Copilot uses AI. Check for mistakes.
@sonarqubecloud
Copy link
Copy Markdown

@harshach harshach merged commit 10e43a4 into main Apr 23, 2026
57 of 61 checks passed
@harshach harshach deleted the search_tests branch April 23, 2026 14:46
@github-actions
Copy link
Copy Markdown
Contributor

Failed to cherry-pick changes to the 1.12.7 branch.
Please cherry-pick the changes manually.
You can find more details here.

pmbrull pushed a commit that referenced this pull request Apr 24, 2026
… more test coverage (#27636)

* Remove fuzzy match on ngram; merge SearchUtils into single class; add more test coverage

* Update openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/SearchResourceIT.java

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Fix tests

* test(search): bump indexing-wait timeouts from 30s to 90s

CI was timing out in the Awaitility loops that wait for newly-created
tables to appear in the search index. Indexing is async via change
events and can take noticeably longer under CI load than locally.
30s gave no margin; 90s is 3x cushion without slowing the happy path.

* test(search): use distinctive xqz prefix and bump matrix size to 50

CI was failing on three short-prefix matrix scenarios that queried the
seeded table's unique tag. The tag was pure hex from uniqueShortId(),
which shares ngrams with every UUID/hash in a busy CI index — our
table got pushed out of the top-15 hits by ngram-overlap noise from
other tests.

Two fixes:
- Prefix the tag with "xqz", a trigraph rare in any real document.
  Now the first sub-token is uniquely ours regardless of index pollution.
- Bump matrix size from 15 to 50. The matrix tests retrievability,
  not top-N ranking — testExactFullNameRanksSeededTableFirst already
  pins the production-UI ranking concern at size=10.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
sonika-shah added a commit that referenced this pull request Apr 29, 2026
… more test coverage (#27636)

* Remove fuzzy match on ngram; merge SearchUtils into single class; add more test coverage

* Update openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/SearchResourceIT.java

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Fix tests

* test(search): bump indexing-wait timeouts from 30s to 90s

CI was timing out in the Awaitility loops that wait for newly-created
tables to appear in the search index. Indexing is async via change
events and can take noticeably longer under CI load than locally.
30s gave no margin; 90s is 3x cushion without slowing the happy path.

* test(search): use distinctive xqz prefix and bump matrix size to 50

CI was failing on three short-prefix matrix scenarios that queried the
seeded table's unique tag. The tag was pure hex from uniqueShortId(),
which shares ngrams with every UUID/hash in a busy CI index — our
table got pushed out of the top-15 hits by ngram-overlap noise from
other tests.

Two fixes:
- Prefix the tag with "xqz", a trigraph rare in any real document.
  Now the first sub-token is uniquely ours regardless of index pollution.
- Bump matrix size from 15 to 50. The matrix tests retrievability,
  not top-N ranking — testExactFullNameRanksSeededTableFirst already
  pins the production-UI ranking concern at size=10.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
jatinmasaram pushed a commit to jatinmasaram/OpenMetadata that referenced this pull request May 2, 2026
… more test coverage (open-metadata#27636)

* Remove fuzzy match on ngram; merge SearchUtils into single class; add more test coverage

* Update openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/SearchResourceIT.java

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Fix tests

* test(search): bump indexing-wait timeouts from 30s to 90s

CI was timing out in the Awaitility loops that wait for newly-created
tables to appear in the search index. Indexing is async via change
events and can take noticeably longer under CI load than locally.
30s gave no margin; 90s is 3x cushion without slowing the happy path.

* test(search): use distinctive xqz prefix and bump matrix size to 50

CI was failing on three short-prefix matrix scenarios that queried the
seeded table's unique tag. The tag was pure hex from uniqueShortId(),
which shares ngrams with every UUID/hash in a busy CI index — our
table got pushed out of the top-15 hits by ngram-overlap noise from
other tests.

Two fixes:
- Prefix the tag with "xqz", a trigraph rare in any real document.
  Now the first sub-token is uniquely ours regardless of index pollution.
- Bump matrix size from 15 to 50. The matrix tests retrievability,
  not top-N ranking — testExactFullNameRanksSeededTableFirst already
  pins the production-UI ranking concern at size=10.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend safe to test Add this label to run secure Github workflows on PRs To release Will cherry-pick this PR into the release branch

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants