Add fuzzy matching suggestions for validation errors #5
This feature improves user experience when validation fails by using fuzzy matching to suggest corrections. Validation remains strict (deterministic substring matching), but when it fails, RapidFuzz is used to find close matches and provide actionable suggestions.

## Problem Solved

When supporting text doesn't match exactly (e.g., typos, capitalization differences), the validator now helps users find the correct text instead of just saying "not found".

## Implementation

### New dependency

- Added `rapidfuzz >= 3.14.3` for efficient fuzzy string matching

### New module: fuzzy_text_utils.py

- `normalize_whitespace()` - text normalization for matching
- `get_significant_words()` - extract content words (excludes stopwords)
- `calculate_word_overlap()` - prevent false positives by checking that matches share actual content words, not just common sequences
- `split_into_sentences()` - segment text for efficient matching
- `find_fuzzy_match_in_text()` - multi-stage fuzzy matching with:
  - Exact substring match (fastest)
  - Short query (<5 words) exact match requirement
  - Sentence-by-sentence RapidFuzz partial_ratio matching
  - Word overlap validation (min 50%) to prevent false positives

### Updated SupportingTextMatch model

- Added `suggested_fix` field - actionable suggestion text
- Added `best_match` field - closest matching text found

### Updated SupportingTextValidator

- Added `generate_suggested_fix()` method that:
  - Uses fuzzy matching ONLY for suggestions, not validation
  - Detects capitalization differences
  - Reports similarity percentage for close matches
- Updated `_substring_match()` to call suggestion generation on failure

### Updated CLI

- `validate text` command now shows suggestions when validation fails
- `validate text-file` command shows suggestions for each failed line

### Bug fix: False positive prevention

The initial implementation could report high similarity (94%) for completely unrelated texts due to common short word sequences matching. Fixed by adding word overlap validation that requires at least 50% of significant (non-stopword) words to be shared.

## Key Design Principle

From ai-gene-review: "Validation stays strict, fuzzy matching only helps users fix issues"

## Example Output

```
✗ INVALID: Supporting text not found in reference PMID:123
  Suggestion: Very close match (92%) - try: "JAK1 protein is a tyrosine kinase"
```

## Tests

- 34 new tests in test_fuzzy_suggestions.py covering:
  - Word extraction and overlap calculation
  - Fuzzy matching accuracy
  - False positive prevention (specific bug case)
  - Validation strictness (fuzzy matching doesn't change pass/fail)
  - Suggestion generation
  - Integration tests

## Future Enhancements

Documented in docs/todo.md - additional issue detection features from ai-gene-review (ellipsis detection, short text warnings, bracket ratio) that could be added later.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
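The multi-stage matching listed under Implementation can be sketched roughly as follows. This is not the code from `fuzzy_text_utils.py`: the function names follow the PR description, but the bodies, the stopword list, the `min_overlap` parameter, and the use of the standard library's `difflib.SequenceMatcher` in place of RapidFuzz's `partial_ratio` are all assumptions made so the example runs without extra dependencies. `SequenceMatcher.ratio()` compares whole strings, whereas `partial_ratio` aligns the best-matching substring, so the scores will differ from the real module's.

```python
import re
from difflib import SequenceMatcher

# Hypothetical stopword list; the real module's list is not shown in the PR.
STOPWORDS = {"a", "an", "the", "is", "are", "was", "of", "in", "to",
             "and", "that", "by", "for", "with", "on"}


def get_significant_words(text: str) -> set[str]:
    """Lowercased alphanumeric tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def calculate_word_overlap(query: str, candidate: str) -> float:
    """Fraction of the query's significant words that also occur in the candidate."""
    query_words = get_significant_words(query)
    if not query_words:
        return 0.0
    return len(query_words & get_significant_words(candidate)) / len(query_words)


def split_into_sentences(text: str) -> list[str]:
    """Naive split on sentence-ending punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_fuzzy_match_in_text(query, text, threshold=70.0, min_overlap=0.5):
    """Return (best_match, score on a 0-100 scale), or (None, 0.0) if nothing qualifies."""
    # Stage 1: exact substring match (fastest path).
    if query in text:
        return query, 100.0
    # Stage 2: short queries (<5 words) must match exactly, so give up here.
    if len(query.split()) < 5:
        return None, 0.0
    # Stage 3: score each sentence; stand-in for RapidFuzz partial_ratio.
    best, best_score = None, 0.0
    for sentence in split_into_sentences(text):
        score = 100.0 * SequenceMatcher(None, query.lower(), sentence.lower()).ratio()
        if score < threshold or score <= best_score:
            continue
        # Stage 4: reject candidates that share too few content words.
        if calculate_word_overlap(query, sentence) < min_overlap:
            continue
        best, best_score = sentence, score
    return best, best_score
```

The overlap check in stage 4 is what guards against the false positive described above: a candidate that scores well on character sequences alone is still rejected if fewer than half of the query's content words appear in it.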
4af2202 to ec8ef74
Pull request overview
This PR adds fuzzy text matching to provide helpful suggestions when validation fails, while maintaining strict substring-based validation. The implementation uses RapidFuzz for efficient similarity scoring and includes word overlap validation to prevent false positive matches on common word sequences.
Key changes:
- New `rapidfuzz` dependency for efficient fuzzy text matching
- Fuzzy matching utilities with word overlap validation to prevent false positives
- Enhanced validation results with suggested fixes displayed in CLI output
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| tests/test_fuzzy_suggestions.py | Comprehensive test suite covering fuzzy matching utilities, suggestion generation, and validation strictness |
| src/linkml_reference_validator/validation/supporting_text_validator.py | Added generate_suggested_fix() method and integrated fuzzy suggestions into validation flow |
| src/linkml_reference_validator/validation/fuzzy_text_utils.py | New module implementing fuzzy matching with sentence splitting, word overlap checks, and similarity calculation |
| src/linkml_reference_validator/models.py | Extended SupportingTextMatch with suggested_fix and best_match fields |
| src/linkml_reference_validator/cli/validate.py | Updated CLI to display suggestions when validation fails |
| pyproject.toml | Added rapidfuzz dependency |
| notebooks/02_advanced_usage.ipynb | Added missing outputs field to notebook cell |
| docs/todo.md | New documentation of potential future enhancements from ai-gene-review |
Comments suppressed due to low confidence (1)
src/linkml_reference_validator/validation/fuzzy_text_utils.py:1
- The magic number 70 appears both as a default parameter (`threshold=70.0`) and in this condition. This creates potential inconsistency if the threshold parameter is changed. Use the `threshold` parameter value instead of hardcoding 70 here.
"""Efficient fuzzy text matching utilities for large documents.
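The point of this suppressed comment can be illustrated with a small sketch. The helper name `best_above_threshold` and its input shape are hypothetical and do not come from the PR; the only detail taken from the review is the `threshold=70.0` default. The comparison reads the parameter, so changing the argument changes the behavior consistently.

```python
def best_above_threshold(scored, threshold=70.0):
    """Return the highest-scoring (text, score) pair at or above `threshold`, else None.

    `scored` is an iterable of (text, score) pairs with scores on a 0-100 scale.
    """
    # Compare against the parameter, never against a hardcoded 70.
    candidates = [pair for pair in scored if pair[1] >= threshold]
    return max(candidates, key=lambda pair: pair[1], default=None)
```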
| return SupportingTextMatch( | ||
| found=False, | ||
| similarity_score=0.0, | ||
| similarity_score=similarity / 100.0, # Convert to 0-1 scale |
The similarity score uses different scales in different places (0-100 in generate_suggested_fix() return value, 0-1 in SupportingTextMatch). Consider standardizing on a single scale throughout the codebase to avoid confusion and conversion errors.
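One way to act on this suggestion is to store the score on a single scale and convert only at the boundaries. The `Similarity` class below is a hypothetical helper, not part of the PR; it assumes the 0-1 scale used by `SupportingTextMatch` is the canonical one and that percentages are needed only when ingesting a 0-100 score or rendering a message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Similarity:
    """A similarity score held on a 0-1 scale (hypothetical helper)."""

    value: float

    def __post_init__(self):
        # Catch a 0-100 value passed where a 0-1 value was expected.
        if not 0.0 <= self.value <= 1.0:
            raise ValueError(f"similarity must be in [0, 1], got {self.value}")

    @classmethod
    def from_percent(cls, percent: float) -> "Similarity":
        """Build from a 0-100 score such as a fuzzy-matching library returns."""
        return cls(percent / 100.0)

    def as_percent(self) -> int:
        """Whole-number percentage for display in suggestion text."""
        return round(self.value * 100)
```

With a wrapper like this, an accidental mix of scales fails loudly at construction time instead of producing a silently wrong score.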
| def test_suggests_capitalization_fix(self, validator): | ||
| """Should detect capitalization differences.""" | ||
| fix, match, score = validator.generate_suggested_fix( | ||
| "jak1 protein is a tyrosine kinase", | ||
| "The JAK1 protein is a tyrosine kinase that activates STAT.", | ||
| ) | ||
| assert score >= 90 | ||
| # May or may not detect as exact capitalization difference due to surrounding text |
The comment indicates uncertainty about what the test validates. The assertion only checks score >= 90 but doesn't verify that the capitalization fix was actually detected. Consider adding an assertion that fix contains "Capitalization differs" when this is indeed a pure case difference.
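A pure case difference can be detected deterministically, which would let the test assert on the suggestion text and not only on the score. The sketch below is an assumption about how such a check might look: the function names and the exact message format are hypothetical, and only the phrase "Capitalization differs" comes from the review comment. It also assumes text whose length does not change under `str.lower()`, which holds for ASCII.

```python
def find_case_insensitive_match(query: str, reference: str):
    """Return the reference's own spelling of `query` if only letter case differs, else None."""
    if query in reference:
        return None  # Exact match: nothing to fix.
    # Assumes lower() preserves string length (true for ASCII input).
    start = reference.lower().find(query.lower())
    if start == -1:
        return None  # Differs by more than case.
    return reference[start:start + len(query)]


def suggest_capitalization_fix(query: str, reference: str):
    """Return a suggestion string for a pure case difference, else None."""
    actual = find_case_insensitive_match(query, reference)
    if actual is None:
        return None
    return f'Capitalization differs - try: "{actual}"'
```

For the inputs used in `test_suggests_capitalization_fix`, this check finds the query inside the reference once case is ignored, so an assertion that the fix contains "Capitalization differs" would be well defined.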
@copilot open a new pull request to apply changes based on this feedback
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…dator.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Summary
Changes
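- Added `rapidfuzz >= 3.14.3` dependency
- New `fuzzy_text_utils.py` with efficient fuzzy matching utilities
- Added `suggested_fix` and `best_match` fields to `SupportingTextMatch`
- Added `generate_suggested_fix()` method
- Future enhancements documented in `docs/todo.md`

Example Output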
Bug Fix Included
Fixed false positive issue where unrelated texts could report high similarity (94%) due to common short word sequences. Added word overlap validation requiring at least 50% of significant (non-stopword) words to be shared.
Test plan
🤖 Generated with Claude Code