fix: tokenize _ in ngram by TeddyCr · Pull Request #24935 · open-metadata/OpenMetadata

TeddyCr · 2025-12-20T09:07:04Z

Describe your changes:

Fixes

I worked on ... because ...

Type of change:

Checklist:

I have read the CONTRIBUTING document.
My PR title is Fixes <issue-number>: <short explanation>
I have commented on my code, particularly in hard-to-understand areas.
For JSON Schema changes: I updated the migration scripts or explained why it is not needed.

Summary by Gitar

Elasticsearch tokenization fix:
- Modified edge_ngram_tokenizer in test_case_index_mapping.json to preserve underscores via custom_token_chars: "_"
Wildcard search behavior:
- Fixed incorrect matches where *api_e* matched api_service_entity_test (the pattern now matches only names containing the literal substring "api_e")
New test coverage:
- Added test_listTestCasesFromSearch_wildcardQuery (107 lines) validating wildcard patterns with strategic test case names
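For orientation, the tokenizer change summarized above might look roughly like the settings below, written as a Python dict so it can be dumped to JSON. Only the tokenizer name edge_ngram_tokenizer and custom_token_chars: "_" come from this PR; the type, min_gram, max_gram and token_chars values are assumptions for illustration and may differ from the actual test_case_index_mapping.json. In Elasticsearch, custom_token_chars takes effect only when token_chars includes "custom".

```python
import json

# Hypothetical excerpt of the index analysis settings. Only the tokenizer name
# and custom_token_chars are taken from the PR; every other value is assumed.
analysis_settings = {
    "analysis": {
        "tokenizer": {
            "edge_ngram_tokenizer": {
                "type": "edge_ngram",
                "min_gram": 1,   # assumed value
                "max_gram": 20,  # assumed value
                # "custom" enables the characters listed in custom_token_chars
                "token_chars": ["letter", "digit", "custom"],
                "custom_token_chars": "_",
            }
        }
    }
}

if __name__ == "__main__":
    print(json.dumps(analysis_settings, indent=2))
```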


sonarqubecloud · 2025-12-20T11:06:13Z

Quality Gate passed for 'open-metadata-ingestion'

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

TeddyCr · 2025-12-20T15:46:42Z

gitar review this PR

gitar-bot · 2025-12-20T15:48:45Z


Code Review Summary

I've reviewed PR #24935. This is a well-executed bug fix for Elasticsearch tokenization. Here's my assessment:

✅ Strengths

1. Precise Configuration Fix

- The Elasticsearch edge_ngram_tokenizer modification is minimal and targeted
- Adding custom_token_chars: "_" correctly addresses the underscore tokenization issue
- The configuration change is in the right location: test_case_index_mapping.json:8-14

2. Excellent Test Coverage

- The new 107-line test test_listTestCasesFromSearch_wildcardQuery comprehensively validates the fix
- Strategic test case naming (api_endpoint_column_test vs api_service_entity_test) clearly demonstrates the expected behavior
- Proper async handling with Awaitility (10s timeout, 1s polling)
- Complete resource cleanup to prevent test pollution
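The Awaitility pattern mentioned above (retry an assertion until it holds or a timeout expires) can be sketched in Python as follows. This is a rough analogue and not the PR's Java test code; poll_until and its parameters are hypothetical names, with defaults mirroring the 10s timeout and 1s polling interval noted in the review.

```python
import time


def poll_until(condition, timeout=10.0, interval=1.0):
    """Call condition() repeatedly until it returns True or the timeout expires.

    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


if __name__ == "__main__":
    # Stand-in for "the search index has caught up": succeeds on the third poll.
    attempts = []

    def index_is_ready():
        attempts.append(1)
        return len(attempts) >= 3

    print(poll_until(index_is_ready, timeout=2.0, interval=0.1))
```

Polling is needed because search indexing is asynchronous: a test case created through the API may not be visible to a search query immediately.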

3. Documentation Quality

- Inline comments clearly explain WHY certain patterns match/don't match
- Updated existing test assertions with explanatory comments about behavior changes
- Well-documented test expectations

🔍 What This Fixes

Problem: Wildcard query *api_e* incorrectly matched api_service_entity_test because underscores were stripped during tokenization

Solution: Preserve underscores in edge n-gram tokens so *api_e* only matches strings containing the literal substring "api_e" (like api_endpoint_column_test)
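The effect described above can be illustrated with a simplified Python model of edge n-gram tokenization. This is a sketch and not Elasticsearch's implementation: the edge_ngrams helper, the min_gram and max_gram values, and the lowercasing step are assumptions, and the query side is reduced to checking which grams exist in the index.

```python
import re


def edge_ngrams(text, extra_token_chars="", min_gram=1, max_gram=20):
    """Simplified edge n-gram model: split on non-token characters,
    then emit the leading prefixes of each resulting word."""
    # Letters and digits are token characters; extra_token_chars adds more.
    word_pattern = re.compile("[A-Za-z0-9" + re.escape(extra_token_chars) + "]+")
    grams = set()
    for word in word_pattern.findall(text.lower()):
        for n in range(min_gram, min(max_gram, len(word)) + 1):
            grams.add(word[:n])
    return grams


if __name__ == "__main__":
    name = "api_service_entity_test"

    # Underscore acts as a separator: the name splits into api/service/entity/test,
    # so grams for both "api" and "e" exist and the pieces of "api_e" can each match.
    split_grams = edge_ngrams(name)
    print({"api", "e"} <= split_grams)  # True

    # Underscore is a token character: the name stays one token, and "api_e"
    # is not one of its prefixes (the fifth prefix is "api_s").
    kept_grams = edge_ngrams(name, extra_token_chars="_")
    print("api_e" in kept_grams)  # False

    # The intended match still works.
    print("api_e" in edge_ngrams("api_endpoint_column_test", extra_token_chars="_"))  # True
```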

💡 Minor Suggestions

- Consider broader application: This fix only updates the test_case index. Consider auditing other index mappings (188+ files) to see if they'd benefit from the same underscore preservation
- OpenSearch compatibility: Verify this configuration works with OpenSearch as well as Elasticsearch
- Migration notes: Document in release notes that this is a breaking change in search behavior for existing deployments
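The audit suggested above could start from a small script like this one. It is a sketch under assumptions: it presumes each mapping file is a JSON object with tokenizers under settings.analysis.tokenizer, which may not hold for every file, and the function names and directory argument are hypothetical.

```python
import json
from pathlib import Path


def tokenizers_missing_underscore(mapping):
    """Return names of ngram/edge_ngram tokenizers that do not list "_" in
    custom_token_chars. Assumes the settings.analysis.tokenizer layout."""
    tokenizers = mapping.get("settings", {}).get("analysis", {}).get("tokenizer", {})
    return [
        name
        for name, cfg in tokenizers.items()
        if cfg.get("type") in ("ngram", "edge_ngram")
        and "_" not in cfg.get("custom_token_chars", "")
    ]


def audit(root):
    """Scan every *.json file under root and report affected tokenizers per file."""
    report = {}
    for path in sorted(Path(root).rglob("*.json")):
        try:
            mapping = json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError):
            continue  # skip files that are not valid JSON
        if not isinstance(mapping, dict):
            continue
        missing = tokenizers_missing_underscore(mapping)
        if missing:
            report[str(path)] = missing
    return report


if __name__ == "__main__":
    # "." is a placeholder; point this at the directory holding the index mappings.
    for file_name, names in audit(".").items():
        print(file_name, names)
```

A tokenizer flagged by this script is only a candidate for review; whether underscore preservation is desirable depends on how each index is queried.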

✅ Verdict: Approved

This is production-ready code with proper testing and documentation. The fix is well-targeted and the implementation follows OpenMetadata conventions.

fix: tokenize _ in ngram

7b074d6

TeddyCr requested a review from a team as a code owner December 20, 2025 09:07

TeddyCr temporarily deployed to test December 20, 2025 09:07 — with GitHub Actions Inactive

github-actions Bot added the Ingestion and safe to test labels Dec 20, 2025

harshach approved these changes Dec 20, 2025


TeddyCr merged commit ce7c6ed into open-metadata:main Dec 22, 2025
25 of 30 checks passed

ShaileshParmar11 pushed a commit that referenced this pull request Dec 26, 2025

fix: tokenize _ in ngram (#24935)

6668b34

TeddyCr merged 1 commit into open-metadata:main from TeddyCr:ISSUE-2501-CLT
