Add fuzzy matching suggestions for validation errors #5

Merged: cmungall merged 3 commits into main from feature/fuzzy-matching-suggestions on Dec 1, 2025

cmungall commented Dec 1, 2025

Summary

  • Add RapidFuzz-based fuzzy matching to suggest corrections when validation fails
  • Validation remains strict (deterministic substring matching); suggestions are purely informational
  • Includes word overlap validation to prevent false positive matches on common word sequences

Changes

  • New dependency: rapidfuzz >= 3.14.3
  • New module: fuzzy_text_utils.py with efficient fuzzy matching utilities
  • Updated models: Added suggested_fix and best_match fields to SupportingTextMatch
  • Updated validator: Added generate_suggested_fix() method
  • Updated CLI: Shows suggestions when validation fails
  • Documentation: Future enhancements documented in docs/todo.md

Example Output

✗ INVALID: Supporting text not found in reference PMID:123
  Suggestion: Very close match (92%) - try: "JAK1 protein is a tyrosine kinase"

Bug Fix Included

Fixed false positive issue where unrelated texts could report high similarity (94%) due to common short word sequences. Added word overlap validation requiring at least 50% of significant (non-stopword) words to be shared.

Test plan

  • 34 new tests for fuzzy matching functionality
  • All 160 existing tests pass
  • Doctests pass
  • mypy type checking passes

🤖 Generated with Claude Code

This feature improves user experience when validation fails by using
fuzzy matching to suggest corrections. Validation remains strict
(deterministic substring matching), but when it fails, RapidFuzz is
used to find close matches and provide actionable suggestions.

## Problem Solved

When supporting text doesn't match exactly (e.g., typos, capitalization
differences), the validator now helps users find the correct text
instead of just saying "not found".

## Implementation

### New dependency
- Added `rapidfuzz >= 3.14.3` for efficient fuzzy string matching

### New module: fuzzy_text_utils.py
- `normalize_whitespace()` - text normalization for matching
- `get_significant_words()` - extract content words (excludes stopwords)
- `calculate_word_overlap()` - prevent false positives by checking
  that matches share actual content words, not just common sequences
- `split_into_sentences()` - segment text for efficient matching
- `find_fuzzy_match_in_text()` - multi-stage fuzzy matching with:
  - Exact substring match (fastest)
  - Short query (<5 words) exact match requirement
  - Sentence-by-sentence RapidFuzz partial_ratio matching
  - Word overlap validation (min 50%) to prevent false positives
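
The stages above can be sketched roughly as follows. This is an illustration only, not the code in fuzzy_text_utils.py: `find_fuzzy_match`, its return shape, and the naive sentence splitter are hypothetical, and the standard library's `difflib.SequenceMatcher` stands in for RapidFuzz's `partial_ratio` so the sketch has no third-party dependency. Unlike `partial_ratio`, `SequenceMatcher.ratio()` compares whole strings, so a query that is a small fragment of a long sentence scores lower here than it would with RapidFuzz. The word overlap stage is left out of this sketch.

```python
import re
from difflib import SequenceMatcher


def split_into_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def similarity(query: str, candidate: str) -> float:
    """Case-insensitive similarity on a 0-100 scale (difflib stand-in)."""
    return 100.0 * SequenceMatcher(None, query.lower(), candidate.lower()).ratio()


def find_fuzzy_match(query: str, text: str, threshold: float = 70.0):
    """Return (found, score, best_sentence). Hypothetical signature."""
    # Collapse runs of whitespace so line breaks do not defeat matching.
    query = " ".join(query.split())
    text = " ".join(text.split())

    # Stage 1: exact substring match, the cheapest check.
    if query in text:
        return True, 100.0, query

    # Stage 2: short queries (<5 words) must match exactly, so stop here.
    if len(query.split()) < 5:
        return False, 0.0, None

    # Stage 3: score each sentence and keep the best one.
    best_score, best_sentence = 0.0, None
    for sentence in split_into_sentences(text):
        score = similarity(query, sentence)
        if score > best_score:
            best_score, best_sentence = score, sentence

    # Compare against the threshold parameter rather than a hardcoded value.
    if best_score >= threshold:
        return True, best_score, best_sentence
    return False, best_score, None
```

Scoring sentence by sentence keeps each comparison small, which matters when the reference is a full-text article rather than an abstract.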

### Updated SupportingTextMatch model
- Added `suggested_fix` field - actionable suggestion text
- Added `best_match` field - closest matching text found

### Updated SupportingTextValidator
- Added `generate_suggested_fix()` method that:
  - Uses fuzzy matching ONLY for suggestions, not validation
  - Detects capitalization differences
  - Reports similarity percentage for close matches
- Updated `_substring_match()` to call suggestion generation on failure
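
A minimal sketch of what suggestion generation along these lines might look like, intended to run only after the strict substring check has already failed. The function name `suggest_fix`, the 90% cutoff for "very close", and the use of `difflib` instead of RapidFuzz are assumptions made for illustration; the message formats follow the example output and review comments in this PR, but the real `generate_suggested_fix()` may differ.

```python
import re
from difflib import SequenceMatcher
from typing import Optional, Tuple


def suggest_fix(query: str, reference: str) -> Tuple[Optional[str], Optional[str], float]:
    """Return (suggestion, best_match, score on a 0-100 scale)."""
    # A case-insensitive exact hit means only capitalization differs.
    hit = re.search(re.escape(query), reference, flags=re.IGNORECASE)
    if hit:
        actual = hit.group(0)  # the text exactly as it appears in the reference
        return f'Capitalization differs - try: "{actual}"', actual, 100.0

    # Otherwise score each sentence and keep the closest one.
    best_match, best_score = None, 0.0
    for sentence in re.split(r"(?<=[.!?])\s+", reference.strip()):
        score = 100.0 * SequenceMatcher(None, query.lower(), sentence.lower()).ratio()
        if score > best_score:
            best_match, best_score = sentence, score

    if best_score >= 90.0:  # assumed cutoff for a "very close" match
        return (
            f'Very close match ({best_score:.0f}%) - try: "{best_match}"',
            best_match,
            best_score,
        )
    return None, None, best_score
```

The return value never feeds back into the pass/fail decision, which is how the suggestion stays purely informational.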

### Updated CLI
- `validate text` command now shows suggestions when validation fails
- `validate text-file` command shows suggestions for each failed line

### Bug fix: False positive prevention
The initial implementation could report high similarity (94%) for
completely unrelated texts due to common short word sequences matching.
Fixed by adding word overlap validation that requires at least 50% of
significant (non-stopword) words to be shared.
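
To make the overlap rule concrete, here is a small sketch under stated assumptions. The function names mirror those listed for the new module, but the bodies, the signatures, the stopword list, and the choice to measure overlap relative to the query's significant words are all guesses for illustration, not the merged implementation.

```python
import re

# Small illustrative stopword list; the real module's list is not shown in this PR.
STOPWORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "by", "for", "in", "is",
    "it", "of", "on", "or", "that", "the", "to", "was", "with",
})


def get_significant_words(text: str) -> set[str]:
    """Lowercased alphanumeric tokens with stopwords removed."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if w not in STOPWORDS}


def calculate_word_overlap(query: str, candidate: str) -> float:
    """Fraction of the query's significant words that also occur in the candidate."""
    query_words = get_significant_words(query)
    if not query_words:
        return 0.0
    return len(query_words & get_significant_words(candidate)) / len(query_words)


def passes_overlap_check(query: str, candidate: str, min_overlap: float = 0.5) -> bool:
    """Reject fuzzy matches that share too few content words."""
    return calculate_word_overlap(query, candidate) >= min_overlap
```

For example, "it is in the nucleus of the cell" and "it is in the garden of the house" share a long run of common short words, which can inflate a character-level similarity score, yet they have no significant words in common, so the check rejects the pair.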

## Key Design Principle

From ai-gene-review: "Validation stays strict, fuzzy matching only
helps users fix issues"

## Example Output

```
✗ INVALID: Supporting text not found in reference PMID:123
  Suggestion: Very close match (92%) - try: "JAK1 protein is a tyrosine kinase"
```

## Tests

- 34 new tests in test_fuzzy_suggestions.py covering:
  - Word extraction and overlap calculation
  - Fuzzy matching accuracy
  - False positive prevention (specific bug case)
  - Validation strictness (fuzzy matching doesn't change pass/fail)
  - Suggestion generation
  - Integration tests

## Future Enhancements

Documented in docs/todo.md - additional issue detection features from
ai-gene-review (ellipsis detection, short text warnings, bracket ratio)
that could be added later.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

This PR adds fuzzy text matching to provide helpful suggestions when validation fails, while maintaining strict substring-based validation. The implementation uses RapidFuzz for efficient similarity scoring and includes word overlap validation to prevent false positive matches on common word sequences.

Key changes:

  • New rapidfuzz dependency for efficient fuzzy text matching
  • Fuzzy matching utilities with word overlap validation to prevent false positives
  • Enhanced validation results with suggested fixes displayed in CLI output

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| tests/test_fuzzy_suggestions.py | Comprehensive test suite covering fuzzy matching utilities, suggestion generation, and validation strictness |
| src/linkml_reference_validator/validation/supporting_text_validator.py | Added generate_suggested_fix() method and integrated fuzzy suggestions into validation flow |
| src/linkml_reference_validator/validation/fuzzy_text_utils.py | New module implementing fuzzy matching with sentence splitting, word overlap checks, and similarity calculation |
| src/linkml_reference_validator/models.py | Extended SupportingTextMatch with suggested_fix and best_match fields |
| src/linkml_reference_validator/cli/validate.py | Updated CLI to display suggestions when validation fails |
| pyproject.toml | Added rapidfuzz dependency |
| notebooks/02_advanced_usage.ipynb | Added missing outputs field to notebook cell |
| docs/todo.md | New documentation of potential future enhancements from ai-gene-review |

Comments suppressed due to low confidence (1)

src/linkml_reference_validator/validation/fuzzy_text_utils.py:1

  • The magic number 70 appears both as a default parameter (threshold=70.0) and in this condition. This creates potential inconsistency if the threshold parameter is changed. Use the threshold parameter value instead of hardcoding 70 here.
"""Efficient fuzzy text matching utilities for large documents.

Comment thread src/linkml_reference_validator/validation/fuzzy_text_utils.py
Comment thread tests/test_fuzzy_suggestions.py Outdated
      return SupportingTextMatch(
          found=False,
    -     similarity_score=0.0,
    +     similarity_score=similarity / 100.0,  # Convert to 0-1 scale

Copilot AI Dec 1, 2025

The similarity score uses different scales in different places (0-100 in generate_suggested_fix() return value, 0-1 in SupportingTextMatch). Consider standardizing on a single scale throughout the codebase to avoid confusion and conversion errors.
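
One way to act on this suggestion is to keep 0-100 scores internal to the fuzzy matcher and convert exactly once, at the point where the result model is built. A minimal sketch, assuming a hypothetical helper `percent_to_fraction` that is not part of this PR:

```python
def percent_to_fraction(score_percent: float) -> float:
    """Convert a 0-100 similarity percentage to the 0-1 scale used by the model."""
    # Fail loudly on out-of-range input instead of silently storing a bad score.
    if not 0.0 <= score_percent <= 100.0:
        raise ValueError(f"similarity must be in [0, 100], got {score_percent}")
    return score_percent / 100.0
```

Routing every conversion through one function makes an accidental double conversion, or a missed one, easier to spot in review.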

Comment on lines +243 to +250
    def test_suggests_capitalization_fix(self, validator):
        """Should detect capitalization differences."""
        fix, match, score = validator.generate_suggested_fix(
            "jak1 protein is a tyrosine kinase",
            "The JAK1 protein is a tyrosine kinase that activates STAT.",
        )
        assert score >= 90
        # May or may not detect as exact capitalization difference due to surrounding text

Copilot AI Dec 1, 2025

The comment indicates uncertainty about what the test validates. The assertion only checks score >= 90 but doesn't verify that the capitalization fix was actually detected. Consider adding an assertion that fix contains "Capitalization differs" when this is indeed a pure case difference.

Member Author

@copilot open a new pull request to apply changes based on this feedback

Comment thread src/linkml_reference_validator/validation/supporting_text_validator.py Outdated
cmungall and others added 2 commits December 1, 2025 09:40
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…dator.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot AI commented Dec 1, 2025

@cmungall I've opened a new pull request, #6, to work on those changes. Once the pull request is ready, I'll request review from you.

@cmungall cmungall merged commit 11af923 into main Dec 1, 2025
4 checks passed

3 participants