Skip to content

Feat/nupunkt sentence tokenization#73

Merged
houfu merged 13 commits into
mainfrom
feat/nupunkt-sentence-tokenization
Oct 14, 2025
Merged

Feat/nupunkt sentence tokenization#73
houfu merged 13 commits into
mainfrom
feat/nupunkt-sentence-tokenization

Conversation

@houfu
Copy link
Copy Markdown
Owner

@houfu houfu commented Oct 14, 2025

Closes #48 (including comparison table for nupunkt processor)

houfu and others added 13 commits October 12, 2025 23:44
- Add optional processor parameter to Redlines class for custom diff strategies
- Implement NupunktProcessor using nupunkt library for advanced sentence tokenization
- Add concatenate_sentences_and_add_chr_182() helper function
- Update Python requirement to 3.11+ (required by nupunkt)
- Add nupunkt>=0.6.0 dependency
- Maintain backward compatibility with WholeDocumentProcessor as default
- Handle tuple return values from sent_tokenize for type safety

NupunktProcessor provides better handling of:
- Abbreviations (Dr., Mr., etc.)
- Decimals and numbers (3.14, $5.99)
- URLs and email addresses
- Complex punctuation

Example usage:
    from redlines import Redlines
    from redlines.processor import NupunktProcessor

    processor = NupunktProcessor()
    r = Redlines(source, test, processor=processor)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Move nupunkt>=0.6.0 from required to optional dependencies
- Users can install with: pip install redlines[nupunkt]
- NupunktProcessor will raise ImportError if nupunkt not installed
- Allows users to use WholeDocumentProcessor without nupunkt

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add 14 test cases covering edge cases and functionality
- Test abbreviations, decimals, URLs, legal citations
- Test initials, acronyms, ellipsis, complex punctuation
- Test comparison with WholeDocumentProcessor
- Add test for ImportError when nupunkt not installed
- Tests skip gracefully when nupunkt not available

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add installation instructions for redlines[nupunkt]
- Add section on custom processors (WholeDocumentProcessor vs NupunktProcessor)
- Document when to use each processor
- Link to demo comparison for benchmarks
- Update usage examples with NupunktProcessor

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add demo/processor_comparison.py with 13 edge case tests
- Add demo/README.md with detailed findings and recommendations
- Test speed across 4 text sizes (400 chars to 400K chars)
- Show accuracy comparison (both processors detect same changes)
- Generate detailed performance and accuracy tables
- Document that differences are in formatting, not detection

Key findings:
- Both processors have identical accuracy
- NupunktProcessor is 5.3x slower on average
- WholeDocument: ~10M chars/s, Nupunkt: ~1.6M chars/s
- Output differs in sentence boundary formatting only

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Reflect nupunkt as optional dependency in lock file
- Move nupunkt from dependencies to optional-dependencies

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add proper type guards for optional fields in tests
- Add TypedDicts (EdgeCaseResult, PerformanceResult) for demo
- Fix type annotations in demo comparison functions
- Add type narrowing with isinstance checks
- All mypy checks now pass with strict mode

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Update requires-python to >=3.10 (core library)
- Add python_version>='3.11' constraint to nupunkt optional dependency
- Update CI to conditionally install nupunkt based on Python version
- Python 3.10: runs without nupunkt (WholeDocumentProcessor only)
- Python 3.11+: runs with nupunkt (both processors)
- All tests pass on all Python versions

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add fallback definition for sent_tokenize when nupunkt not available
- Add mypy override to ignore missing nupunkt imports
- Ensures mypy passes on Python 3.10 where nupunkt can't be installed
- All tests still pass (43 passed, 1 skipped)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Import NUPUNKT_AVAILABLE directly from processor module
- Remove redundant try/except in test file
- Tests will now properly skip on Python 3.10 where nupunkt unavailable
- Fixes test failures in CI for Python 3.10

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add LEVENSHTEIN_AVAILABLE flag to processor.py
- Make Levenshtein-dependent test assertions conditional
- Tests now pass regardless of Levenshtein installation
@houfu houfu merged commit b491906 into main Oct 14, 2025
10 checks passed
@houfu houfu deleted the feat/nupunkt-sentence-tokenization branch October 14, 2025 15:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feat: Using nupunkt as a sentence boundary detector

1 participant