Skip to content

fix: tokenize _ in ngram#24935

Merged
TeddyCr merged 1 commit intoopen-metadata:mainfrom
TeddyCr:ISSUE-2501-CLT
Dec 22, 2025
Merged

fix: tokenize _ in ngram#24935
TeddyCr merged 1 commit intoopen-metadata:mainfrom
TeddyCr:ISSUE-2501-CLT

Conversation

@TeddyCr
Copy link
Copy Markdown
Collaborator

@TeddyCr TeddyCr commented Dec 20, 2025

Describe your changes:

Fixes

I worked on ... because ...

Type of change:

  • Bug fix
  • Improvement
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes <issue-number>: <short explanation>
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.

Summary by Gitar

  • Elasticsearch tokenization fix:
    • Modified edge_ngram_tokenizer in test_case_index_mapping.json to preserve underscores via custom_token_chars: "_"
  • Wildcard search behavior:
    • Fixed incorrect matches where *api_e* matched api_service_entity_test (now only matches substrings containing "api_e")
  • New test coverage:
    • Added test_listTestCasesFromSearch_wildcardQuery (107 lines) validating wildcard patterns with strategic test case names

This will update automatically on new commits.


@sonarqubecloud
Copy link
Copy Markdown

@TeddyCr
Copy link
Copy Markdown
Collaborator Author

TeddyCr commented Dec 20, 2025

gitar review this PR

@gitar-bot
Copy link
Copy Markdown

gitar-bot Bot commented Dec 20, 2025

gitar review this PR

Code Review Summary

I've reviewed PR #24935. This is a well-executed bug fix for Elasticsearch tokenization. Here's my assessment:

✅ Strengths

1. Precise Configuration Fix

  • The Elasticsearch edge_ngram_tokenizer modification is minimal and targeted
  • Adding custom_token_chars: "_" correctly addresses the underscore tokenization issue
  • Configuration change is in the right location: test_case_index_mapping.json:8-14

2. Excellent Test Coverage

  • New 107-line test test_listTestCasesFromSearch_wildcardQuery comprehensively validates the fix
  • Strategic test case naming (api_endpoint_column_test vs api_service_entity_test) clearly demonstrates the expected behavior
  • Proper async handling with Awaitility (10s timeout, 1s polling)
  • Complete resource cleanup to prevent test pollution

3. Documentation Quality

  • Inline comments clearly explain WHY certain patterns match/don't match
  • Updated existing test assertions with explanatory comments about behavior changes
  • Well-documented test expectations

🔍 What This Fixes

Problem: Wildcard query *api_e* incorrectly matched api_service_entity_test because underscores were stripped during tokenization

Solution: Preserve underscores in edge n-gram tokens so *api_e* only matches strings containing the literal substring "api_e" (like api_endpoint_column_test)

💡 Minor Suggestions

  1. Consider broader application: This fix only updates the test_case index. Consider auditing other index mappings (188+ files) to see if they'd benefit from the same underscore preservation

  2. OpenSearch compatibility: Verify this configuration works with OpenSearch as well as Elasticsearch

  3. Migration notes: Document in release notes that this is a breaking change in search behavior for existing deployments

✅ Verdict: Approved

This is production-ready code with proper testing and documentation. The fix is well-targeted and the implementation follows OpenMetadata conventions.

@TeddyCr TeddyCr merged commit ce7c6ed into open-metadata:main Dec 22, 2025
25 of 30 checks passed
ShaileshParmar11 pushed a commit that referenced this pull request Dec 26, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Ingestion safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants