Feat/nupunkt sentence tokenization #73
Merged
- Add optional processor parameter to Redlines class for custom diff strategies
- Implement NupunktProcessor using nupunkt library for advanced sentence tokenization
- Add concatenate_sentences_and_add_chr_182() helper function
- Update Python requirement to 3.11+ (required by nupunkt)
- Add nupunkt>=0.6.0 dependency
- Maintain backward compatibility with WholeDocumentProcessor as default
- Handle tuple return values from sent_tokenize for type safety
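The helper named in the list above is not shown in this conversation, so the following is only a rough sketch of what a function in that spirit might do: rejoin the sentences of each paragraph and mark paragraph ends with `chr(182)` (the pilcrow, `¶`). The function name, the regex-based sentence split, and the exact spacing around the marker are all assumptions for illustration, not the code merged in this PR.

```python
import re

PARAGRAPH_MARK = chr(182)  # the pilcrow character "¶"


def join_sentences_with_pilcrow(text: str) -> str:
    """Hypothetical stand-in for a sentence-concatenating helper.

    Splits text into paragraphs on blank lines, splits each paragraph
    into sentences with a naive regex (a real implementation would use
    a proper tokenizer), and appends a pilcrow after each paragraph.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    pieces = []
    for paragraph in paragraphs:
        # Naive split: break after ., ! or ? when followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]
        pieces.append(" ".join(sentences) + " " + PARAGRAPH_MARK + " ")
    return "".join(pieces)


marked = join_sentences_with_pilcrow("One. Two.\n\nThree.")
```

Keeping the paragraph marker as an ordinary token means a later diff step can report an inserted or removed paragraph break like any other change.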
NupunktProcessor provides better handling of:
- Abbreviations (Dr., Mr., etc.)
- Decimals and numbers (3.14, $5.99)
- URLs and email addresses
- Complex punctuation
Example usage:
from redlines import Redlines
from redlines.processor import NupunktProcessor
processor = NupunktProcessor()
r = Redlines(source, test, processor=processor)
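To illustrate why abbreviation handling matters for sentence-level diffs, here is a small standard-library sketch that contrasts a naive punctuation split with an abbreviation-aware one. It does not use nupunkt and is not how nupunkt works internally; the abbreviation set and both function names are made up for this example.

```python
import re

# Illustrative only; a real tokenizer ships a far larger, learned list.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "etc."}


def naive_split(text: str) -> list[str]:
    """Split after every ., ! or ? that is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def abbreviation_aware_split(text: str) -> list[str]:
    """Split on sentence-final punctuation unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_sentence = token[-1] in ".!?" and token.lower() not in ABBREVIATIONS
        if ends_sentence:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without final punctuation
        sentences.append(" ".join(current))
    return sentences


sample = "Dr. Smith paid $5.99 for it. Mr. Jones declined."
naive = naive_split(sample)
aware = abbreviation_aware_split(sample)
```

On this sample the naive splitter breaks after "Dr." and "Mr.", producing four fragments, while the abbreviation-aware version returns the two intended sentences.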
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Move nupunkt>=0.6.0 from required to optional dependencies
- Users can install with: pip install redlines[nupunkt]
- NupunktProcessor will raise ImportError if nupunkt not installed
- Allows users to use WholeDocumentProcessor without nupunkt

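The optional-dependency behaviour described above can be sketched as a guarded import plus a check at construction time. This is a minimal illustration, not the code from `redlines.processor`: the class name `SentenceProcessorSketch` and the error message wording are assumptions, and the import line assumes nupunkt exposes `sent_tokenize` at package level, as the description of this PR suggests.

```python
try:
    # Optional dependency: only present when installed via redlines[nupunkt].
    from nupunkt import sent_tokenize
    NUPUNKT_AVAILABLE = True
except ImportError:
    NUPUNKT_AVAILABLE = False


class SentenceProcessorSketch:
    """Hypothetical processor that needs nupunkt at construction time."""

    def __init__(self) -> None:
        if not NUPUNKT_AVAILABLE:
            # Fail early with an actionable message rather than at first use.
            raise ImportError(
                "nupunkt is required for this processor. "
                "Install it with: pip install redlines[nupunkt]"
            )

    def sentences(self, text: str) -> list:
        # list() normalises the result in case a non-list sequence is returned.
        return list(sent_tokenize(text))
```

Raising in `__init__` rather than at import time keeps `import redlines` working for users who only need the default processor.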
- Add 14 test cases covering edge cases and functionality
- Test abbreviations, decimals, URLs, legal citations
- Test initials, acronyms, ellipsis, complex punctuation
- Test comparison with WholeDocumentProcessor
- Add test for ImportError when nupunkt not installed
- Tests skip gracefully when nupunkt not available

- Add installation instructions for redlines[nupunkt]
- Add section on custom processors (WholeDocumentProcessor vs NupunktProcessor)
- Document when to use each processor
- Link to demo comparison for benchmarks
- Update usage examples with NupunktProcessor

- Add demo/processor_comparison.py with 13 edge case tests
- Add demo/README.md with detailed findings and recommendations
- Test speed across 4 text sizes (400 chars to 400K chars)
- Show accuracy comparison (both processors detect same changes)
- Generate detailed performance and accuracy tables
- Document that differences are in formatting, not detection
Key findings:
- Both processors have identical accuracy
- NupunktProcessor is 5.3x slower on average
- WholeDocument: ~10M chars/s, Nupunkt: ~1.6M chars/s
- Output differs in sentence boundary formatting only

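Throughput figures like the chars/s numbers above are typically obtained by timing a function over inputs of increasing size. The sketch below shows one way to do that with the standard library only. It is not the contents of `demo/processor_comparison.py`; the two stub functions are placeholders standing in for the real processors, and the sizes are arbitrary.

```python
import re
import time
from typing import Callable


def chars_per_second(func: Callable[[str], object], text: str, repeats: int = 3) -> float:
    """Return throughput in characters per second, using the best of several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(text)
        best = min(best, time.perf_counter() - start)
    return len(text) / best if best > 0 else float("inf")


def whole_document_stub(text: str) -> list[str]:
    # Placeholder for a whole-document tokenizer: plain whitespace split.
    return text.split()


def sentence_split_stub(text: str) -> list[str]:
    # Placeholder for a sentence tokenizer: regex split on end punctuation.
    return re.split(r"(?<=[.!?])\s+", text)


base = "Dr. Smith paid $5.99 for it. Mr. Jones declined. "
results = {}
for copies in (10, 100, 1000):
    text = base * copies
    results[len(text)] = (
        chars_per_second(whole_document_stub, text),
        chars_per_second(sentence_split_stub, text),
    )
```

Taking the best of several runs reduces noise from other processes; absolute numbers will vary by machine, so ratios between processors are the more portable result.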
- Reflect nupunkt as optional dependency in lock file
- Move nupunkt from dependencies to optional-dependencies

- Add proper type guards for optional fields in tests
- Add TypedDicts (EdgeCaseResult, PerformanceResult) for demo
- Fix type annotations in demo comparison functions
- Add type narrowing with isinstance checks
- All mypy checks now pass with strict mode

- Update requires-python to >=3.10 (core library)
- Add python_version>='3.11' constraint to nupunkt optional dependency
- Update CI to conditionally install nupunkt based on Python version
- Python 3.10: runs without nupunkt (WholeDocumentProcessor only)
- Python 3.11+: runs with nupunkt (both processors)
- All tests pass on all Python versions

- Add fallback definition for sent_tokenize when nupunkt not available
- Add mypy override to ignore missing nupunkt imports
- Ensures mypy passes on Python 3.10 where nupunkt can't be installed
- All tests still pass (43 passed, 1 skipped)

- Import NUPUNKT_AVAILABLE directly from processor module
- Remove redundant try/except in test file
- Tests will now properly skip on Python 3.10 where nupunkt unavailable
- Fixes test failures in CI for Python 3.10

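Skipping tests on an availability flag can be sketched as follows. This example uses `unittest` and `importlib.util.find_spec` so that it stays within the standard library; the project's own test suite may well use a different test framework, and the test class and assertion here are invented for illustration. In the PR the flag is imported from the processor module, whereas here it is computed locally so the snippet is self-contained.

```python
import importlib.util
import unittest

# Computed locally for this sketch; True only if nupunkt can be imported.
NUPUNKT_AVAILABLE = importlib.util.find_spec("nupunkt") is not None


@unittest.skipUnless(NUPUNKT_AVAILABLE, "nupunkt not installed")
class NupunktTokenizationTests(unittest.TestCase):
    """Hypothetical tests that only make sense when nupunkt is present."""

    def test_returns_at_least_one_sentence(self) -> None:
        # Imported inside the test so module import never fails without nupunkt.
        from nupunkt import sent_tokenize

        sentences = sent_tokenize("Dr. Smith arrived. He sat down.")
        self.assertGreaterEqual(len(sentences), 1)
```

Run it with `python -m unittest <module>`; on an interpreter without nupunkt the class is reported as skipped rather than failed, which matches the "passed, skipped" style of result described in these commits.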
- Add LEVENSHTEIN_AVAILABLE flag to processor.py
- Make Levenshtein-dependent test assertions conditional
- Tests now pass regardless of Levenshtein installation
Closes #48 (including comparison table for nupunkt processor)