Add fuzzy matching suggestions for validation errors by cmungall · Pull Request #5 · linkml/linkml-reference-validator

cmungall · 2025-12-01T16:45:40Z

Summary

- Add RapidFuzz-based fuzzy matching to suggest corrections when validation fails
- Validation remains strict (deterministic substring matching); suggestions are purely informational
- Includes word overlap validation to prevent false positive matches on common word sequences

Changes

- New dependency: rapidfuzz >= 3.14.3
- New module: fuzzy_text_utils.py with efficient fuzzy matching utilities
- Updated models: Added suggested_fix and best_match fields to SupportingTextMatch
- Updated validator: Added generate_suggested_fix() method
- Updated CLI: Shows suggestions when validation fails
- Documentation: Future enhancements documented in docs/todo.md

Example Output

✗ INVALID: Supporting text not found in reference PMID:123
  Suggestion: Very close match (92%) - try: "JAK1 protein is a tyrosine kinase"

Bug Fix Included

Fixed false positive issue where unrelated texts could report high similarity (94%) due to common short word sequences. Added word overlap validation requiring at least 50% of significant (non-stopword) words to be shared.
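The word overlap check described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the function names mirror those listed for `fuzzy_text_utils.py`, but their signatures, the stopword list, the tokenization, and the choice of the query's words as the denominator are all assumptions made for this example.

```python
import re

# Hypothetical stopword list; the list used in the PR is not shown.
STOPWORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "by", "for", "in", "is",
    "it", "of", "on", "or", "that", "the", "to", "was", "with",
})


def get_significant_words(text: str) -> set:
    """Return lowercase content words, dropping stopwords and 1-character tokens."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if len(w) > 1 and w not in STOPWORDS}


def calculate_word_overlap(query: str, candidate: str) -> float:
    """Fraction of the query's significant words that also occur in the candidate."""
    query_words = get_significant_words(query)
    if not query_words:
        # Nothing meaningful to compare, so never treat this as a match.
        return 0.0
    shared = query_words & get_significant_words(candidate)
    return len(shared) / len(query_words)


def passes_overlap_check(query: str, candidate: str, min_overlap: float = 0.5) -> bool:
    """Accept a fuzzy candidate only if enough content words are shared."""
    return calculate_word_overlap(query, candidate) >= min_overlap
```

With this sketch, two sentences that share only words such as "the", "of", and "in" score an overlap of 0.0 and are rejected, however similar their character sequences look to a fuzzy scorer.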

Test plan

- 34 new tests for fuzzy matching functionality
- All 160 existing tests pass
- Doctests pass
- mypy type checking passes

🤖 Generated with Claude Code

This feature improves user experience when validation fails by using fuzzy matching to suggest corrections. Validation remains strict (deterministic substring matching), but when it fails, RapidFuzz is used to find close matches and provide actionable suggestions.

## Problem Solved

When supporting text doesn't match exactly (e.g., typos, capitalization differences), the validator now helps users find the correct text instead of just saying "not found".

## Implementation

### New dependency

- Added `rapidfuzz >= 3.14.3` for efficient fuzzy string matching

### New module: fuzzy_text_utils.py

- `normalize_whitespace()` - text normalization for matching
- `get_significant_words()` - extract content words (excludes stopwords)
- `calculate_word_overlap()` - prevent false positives by checking that matches share actual content words, not just common sequences
- `split_into_sentences()` - segment text for efficient matching
- `find_fuzzy_match_in_text()` - multi-stage fuzzy matching with:
  - Exact substring match (fastest)
  - Short query (<5 words) exact match requirement
  - Sentence-by-sentence RapidFuzz partial_ratio matching
  - Word overlap validation (min 50%) to prevent false positives

### Updated SupportingTextMatch model

- Added `suggested_fix` field - actionable suggestion text
- Added `best_match` field - closest matching text found

### Updated SupportingTextValidator

- Added `generate_suggested_fix()` method that:
  - Uses fuzzy matching ONLY for suggestions, not validation
  - Detects capitalization differences
  - Reports similarity percentage for close matches
- Updated `_substring_match()` to call suggestion generation on failure

### Updated CLI

- `validate text` command now shows suggestions when validation fails
- `validate text-file` command shows suggestions for each failed line

### Bug fix: False positive prevention

The initial implementation could report high similarity (94%) for completely unrelated texts due to common short word sequences matching. Fixed by adding word overlap validation that requires at least 50% of significant (non-stopword) words to be shared.

## Key Design Principle

From ai-gene-review: "Validation stays strict, fuzzy matching only helps users fix issues"

## Example Output

```
✗ INVALID: Supporting text not found in reference PMID:123
  Suggestion: Very close match (92%) - try: "JAK1 protein is a tyrosine kinase"
```

## Tests

- 34 new tests in test_fuzzy_suggestions.py covering:
  - Word extraction and overlap calculation
  - Fuzzy matching accuracy
  - False positive prevention (specific bug case)
  - Validation strictness (fuzzy matching doesn't change pass/fail)
  - Suggestion generation
  - Integration tests

## Future Enhancements

Documented in docs/todo.md - additional issue detection features from ai-gene-review (ellipsis detection, short text warnings, bracket ratio) that could be added later.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
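The "validation stays strict" principle from the commit message above might look something like the sketch below. Several things here are assumptions rather than the PR's code: the standard library's `difflib.SequenceMatcher` stands in for RapidFuzz's `partial_ratio` so the example runs without third-party packages, `SupportingTextMatch` is reduced to a plain dataclass with only the fields named in the PR, and `validate_supporting_text()`, its `threshold` default, and the sentence-splitting regex are hypothetical.

```python
import re
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Optional


@dataclass
class SupportingTextMatch:
    """Simplified stand-in for the PR's model (assumed shape)."""
    found: bool
    similarity_score: float = 0.0  # 0-1 scale
    suggested_fix: Optional[str] = None
    best_match: Optional[str] = None


def split_into_sentences(text: str) -> List[str]:
    """Naive split on sentence-ending punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def validate_supporting_text(
    query: str, reference: str, threshold: float = 0.7
) -> SupportingTextMatch:
    # Stage 1: the pass/fail decision is a deterministic substring check.
    if query in reference:
        return SupportingTextMatch(found=True, similarity_score=1.0)

    # Stage 2: fuzzy scoring runs only after failure and never flips `found`.
    best_sentence, best_score = None, 0.0
    for sentence in split_into_sentences(reference):
        score = SequenceMatcher(None, query.lower(), sentence.lower()).ratio()
        if score > best_score:
            best_sentence, best_score = sentence, score

    if best_sentence is None or best_score < threshold:
        return SupportingTextMatch(found=False, similarity_score=best_score)

    fix = f'Close match ({best_score:.0%}) - try: "{best_sentence}"'
    return SupportingTextMatch(
        found=False,
        similarity_score=best_score,
        suggested_fix=fix,
        best_match=best_sentence,
    )
```

The point of the structure is that `found` is set in exactly one place, by the exact check, so no similarity score can turn a failing validation into a passing one.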

Copilot

Pull request overview

This PR adds fuzzy text matching to provide helpful suggestions when validation fails, while maintaining strict substring-based validation. The implementation uses RapidFuzz for efficient similarity scoring and includes word overlap validation to prevent false positive matches on common word sequences.

Key changes:

New rapidfuzz dependency for efficient fuzzy text matching
Fuzzy matching utilities with word overlap validation to prevent false positives
Enhanced validation results with suggested fixes displayed in CLI output

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| tests/test_fuzzy_suggestions.py | Comprehensive test suite covering fuzzy matching utilities, suggestion generation, and validation strictness |
| src/linkml_reference_validator/validation/supporting_text_validator.py | Added `generate_suggested_fix()` method and integrated fuzzy suggestions into validation flow |
| src/linkml_reference_validator/validation/fuzzy_text_utils.py | New module implementing fuzzy matching with sentence splitting, word overlap checks, and similarity calculation |
| src/linkml_reference_validator/models.py | Extended `SupportingTextMatch` with `suggested_fix` and `best_match` fields |
| src/linkml_reference_validator/cli/validate.py | Updated CLI to display suggestions when validation fails |
| pyproject.toml | Added rapidfuzz dependency |
| notebooks/02_advanced_usage.ipynb | Added missing outputs field to notebook cell |
| docs/todo.md | New documentation of potential future enhancements from ai-gene-review |

Comments suppressed due to low confidence (1)

src/linkml_reference_validator/validation/fuzzy_text_utils.py:1

The magic number 70 appears both as a default parameter (threshold=70.0) and in this condition. This creates potential inconsistency if the threshold parameter is changed. Use the threshold parameter value instead of hardcoding 70 here.

"""Efficient fuzzy text matching utilities for large documents.
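The pattern this suppressed comment asks for can be illustrated with a small, hypothetical helper. The function below is not from the PR: `classify_match()`, its labels, and the `very_close` cut-off are invented for illustration, and only the `threshold=70.0` default comes from the review comment.

```python
from typing import Optional


def classify_match(
    score: float, threshold: float = 70.0, very_close: float = 90.0
) -> Optional[str]:
    """Label a 0-100 similarity score, or return None below the threshold."""
    # Compare against the parameter, not a literal 70, so that callers who
    # pass a different threshold get consistent behaviour.
    if score < threshold:
        return None
    return "very close" if score >= very_close else "partial"
```

With a hardcoded 70 in the condition, a call such as `classify_match(75, threshold=80)` would still report a match, which is the inconsistency the comment describes.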

Copilot · 2025-12-01T17:12:56Z

                return SupportingTextMatch(
                    found=False,
-                    similarity_score=0.0,
+                    similarity_score=similarity / 100.0,  # Convert to 0-1 scale


The similarity score uses different scales in different places (0-100 in generate_suggested_fix() return value, 0-1 in SupportingTextMatch). Consider standardizing on a single scale throughout the codebase to avoid confusion and conversion errors.
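One way to act on this comment is to keep a single scale internally and convert only at the boundaries. The sketch below assumes the 0-1 scale is the internal one, as in the `SupportingTextMatch` hunk above; the helper names `percent_to_fraction()` and `format_similarity()` are hypothetical and do not appear in the PR.

```python
def percent_to_fraction(percent: float) -> float:
    """Convert a 0-100 score (as a fuzzy scorer might return) to the 0-1 scale."""
    if not 0.0 <= percent <= 100.0:
        raise ValueError(f"expected a 0-100 score, got {percent!r}")
    return percent / 100.0


def format_similarity(fraction: float) -> str:
    """Render a 0-1 score as a whole percentage for user-facing messages."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError(f"expected a 0-1 score, got {fraction!r}")
    return f"{fraction * 100:.0f}%"
```

The range checks make a mixed-up scale fail loudly: passing 92 where 0.92 is expected raises an error instead of printing "9200%".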

Copilot · 2025-12-01T17:12:56Z

+    def test_suggests_capitalization_fix(self, validator):
+        """Should detect capitalization differences."""
+        fix, match, score = validator.generate_suggested_fix(
+            "jak1 protein is a tyrosine kinase",
+            "The JAK1 protein is a tyrosine kinase that activates STAT.",
+        )
+        assert score >= 90
+        # May or may not detect as exact capitalization difference due to surrounding text


The comment indicates uncertainty about what the test validates. The assertion only checks score >= 90 but doesn't verify that the capitalization fix was actually detected. Consider adding an assertion that fix contains "Capitalization differs" when this is indeed a pure case difference.
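A tighter version of this test could assert the case-only condition directly. The sketch below does not call the PR's `generate_suggested_fix()`, whose return format is not shown here; `differs_only_by_case()` is a hypothetical helper used to show what "a pure case difference" means for the inputs in the test above.

```python
def differs_only_by_case(query: str, reference: str) -> bool:
    """True when the query is absent verbatim but present when case is ignored."""
    return query not in reference and query.casefold() in reference.casefold()


def test_detects_capitalization_difference() -> None:
    reference = "The JAK1 protein is a tyrosine kinase that activates STAT."
    # Same words as the reference, but "jak1" is lowercased.
    assert differs_only_by_case("jak1 protein is a tyrosine kinase", reference)
    # An exact match is not a capitalization problem.
    assert not differs_only_by_case("JAK1 protein is a tyrosine kinase", reference)
    # A typo is not a capitalization problem either.
    assert not differs_only_by_case("jak1 protien is a tyrosine kinase", reference)


test_detects_capitalization_difference()
```

When this predicate holds for a test's inputs, the test can also assert on the suggestion text itself instead of checking only the score.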

@copilot open a new pull request to apply changes based on this feedback

Copilot · 2025-12-01T17:41:39Z

@cmungall I've opened a new pull request, #6, to work on those changes. Once the pull request is ready, I'll request review from you.

cmungall force-pushed the feature/fuzzy-matching-suggestions branch from 4af2202 to ec8ef74 on December 1, 2025 17:04

cmungall requested a review from Copilot on December 1, 2025 17:08

Copilot started reviewing on behalf of cmungall on December 1, 2025 17:08

Copilot finished reviewing on behalf of cmungall on December 1, 2025 17:11

Copilot AI reviewed on Dec 1, 2025

cmungall and others added 2 commits December 1, 2025 09:40

Update tests/test_fuzzy_suggestions.py

cee7018

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update src/linkml_reference_validator/validation/supporting_text_validator.py

6196141

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot AI mentioned this pull request Dec 1, 2025

Verify capitalization detection in test_suggests_capitalization_fix #6

Merged

cmungall merged commit 11af923 into main Dec 1, 2025
4 checks passed

cmungall merged 3 commits into main from feature/fuzzy-matching-suggestions

3 participants
