Feat/nupunkt sentence tokenization by houfu · Pull Request #73 · houfu/redlines

houfu · 2025-10-14T15:07:23Z

Closes #48 (including comparison table for nupunkt processor)

- Add optional processor parameter to Redlines class for custom diff strategies
- Implement NupunktProcessor using nupunkt library for advanced sentence tokenization
- Add concatenate_sentences_and_add_chr_182() helper function
- Update Python requirement to 3.11+ (required by nupunkt)
- Add nupunkt>=0.6.0 dependency
- Maintain backward compatibility with WholeDocumentProcessor as default
- Handle tuple return values from sent_tokenize for type safety

NupunktProcessor provides better handling of:
- Abbreviations (Dr., Mr., etc.)
- Decimals and numbers (3.14, $5.99)
- URLs and email addresses
- Complex punctuation

Example usage:

    from redlines import Redlines
    from redlines.processor import NupunktProcessor

    processor = NupunktProcessor()
    r = Redlines(source, test, processor=processor)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
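
To illustrate why a pluggable tokenization strategy matters, here is a minimal standard-library sketch. It is not the redlines implementation: `Tokenizer`, `WordTokenizer`, `NaiveSentenceTokenizer`, and `changed_units` are hypothetical names, and the regex splitter is a deliberately naive stand-in for nupunkt.

```python
import difflib
import re
from typing import Protocol


class Tokenizer(Protocol):
    """Hypothetical interface: turn a document into comparable units."""

    def tokenize(self, text: str) -> list[str]: ...


class WordTokenizer:
    """Treats the whole document as one stream of whitespace-separated words."""

    def tokenize(self, text: str) -> list[str]:
        return text.split()


class NaiveSentenceTokenizer:
    """Splits after '.', '!' or '?' followed by whitespace.

    It has no abbreviation handling, so 'Dr. Smith' is cut in two.
    """

    def tokenize(self, text: str) -> list[str]:
        return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def changed_units(
    source: str, test: str, tokenizer: Tokenizer
) -> list[tuple[str, list[str], list[str]]]:
    """Return the non-equal diff opcodes between two texts, unit by unit."""
    a, b = tokenizer.tokenize(source), tokenizer.tokenize(test)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

With this sketch, `NaiveSentenceTokenizer().tokenize("Dr. Smith paid $5.99. He left.")` yields `["Dr.", "Smith paid $5.99.", "He left."]`: the decimal survives, but the abbreviation is split off as its own "sentence". That is the class of boundary error the list above says NupunktProcessor handles better.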

- Move nupunkt>=0.6.0 from required to optional dependencies
- Users can install with: pip install redlines[nupunkt]
- NupunktProcessor will raise ImportError if nupunkt not installed
- Allows users to use WholeDocumentProcessor without nupunkt

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
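
The optional-dependency behaviour described above can be sketched as a guarded import plus a check at construction time. This is an assumption about the general pattern, not the code in `redlines.processor`; `SentenceProcessorSketch` is a hypothetical class name and the error message wording is invented for illustration.

```python
try:
    import nupunkt  # noqa: F401  (optional extra; may be absent)

    NUPUNKT_AVAILABLE = True
except ImportError:
    NUPUNKT_AVAILABLE = False


class SentenceProcessorSketch:
    """Hypothetical stand-in for a processor that needs the optional extra."""

    def __init__(self) -> None:
        # Fail at construction time, with an actionable message, instead of
        # failing at module import time. Importing the module stays safe for
        # users who only need the default processor.
        if not NUPUNKT_AVAILABLE:
            raise ImportError(
                "nupunkt is not installed. "
                "Install it with: pip install redlines[nupunkt]"
            )
```

Deferring the failure to `__init__` is what lets the rest of the package import cleanly without the extra installed.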

- Add 14 test cases covering edge cases and functionality
- Test abbreviations, decimals, URLs, legal citations
- Test initials, acronyms, ellipsis, complex punctuation
- Test comparison with WholeDocumentProcessor
- Add test for ImportError when nupunkt not installed
- Tests skip gracefully when nupunkt not available

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

- Add installation instructions for redlines[nupunkt]
- Add section on custom processors (WholeDocumentProcessor vs NupunktProcessor)
- Document when to use each processor
- Link to demo comparison for benchmarks
- Update usage examples with NupunktProcessor

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

- Add demo/processor_comparison.py with 13 edge case tests
- Add demo/README.md with detailed findings and recommendations
- Test speed across 4 text sizes (400 chars to 400K chars)
- Show accuracy comparison (both processors detect same changes)
- Generate detailed performance and accuracy tables
- Document that differences are in formatting, not detection

Key findings:
- Both processors have identical accuracy
- NupunktProcessor is 5.3x slower on average
- WholeDocument: ~10M chars/s, Nupunkt: ~1.6M chars/s
- Output differs in sentence boundary formatting only

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
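
For readers who want to reproduce throughput figures like these, a rough timing harness might look like the following. It is a sketch, not the code in demo/processor_comparison.py; `chars_per_second` and `slowdown` are hypothetical helper names.

```python
import time
from typing import Callable


def chars_per_second(fn: Callable[[str], object], text: str, repeats: int = 5) -> float:
    """Best-of-N throughput of fn on text, in characters per second."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(text)
        best = min(best, time.perf_counter() - start)
    # Guard against a timer resolution too coarse to measure a tiny input.
    return len(text) / best if best > 0 else float("inf")


def slowdown(baseline_cps: float, candidate_cps: float) -> float:
    """How many times slower the candidate is than the baseline."""
    return baseline_cps / candidate_cps
```

Note that `slowdown(10e6, 1.6e6)` gives 6.25 for the two rounded throughput figures. That need not match the reported 5.3x average if that figure averages the per-size ratios across the four text sizes; the demo README is the place to check how it was computed.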

- Reflect nupunkt as optional dependency in lock file
- Move nupunkt from dependencies to optional-dependencies

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

- Add proper type guards for optional fields in tests
- Add TypedDicts (EdgeCaseResult, PerformanceResult) for demo
- Fix type annotations in demo comparison functions
- Add type narrowing with isinstance checks
- All mypy checks now pass with strict mode

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
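
As an illustration of the TypedDict plus `isinstance` narrowing approach, consider the sketch below. The name `EdgeCaseResult` comes from the commit message, but its fields here are assumptions made up for the example, and `describe` is a hypothetical helper.

```python
from typing import Optional, TypedDict


class EdgeCaseResult(TypedDict):
    # Field names are illustrative assumptions, not the demo's actual schema.
    name: str
    same_changes: bool
    error: Optional[str]


def describe(result: EdgeCaseResult) -> str:
    """Summarise one edge-case comparison as a single line."""
    error = result["error"]
    # isinstance narrows Optional[str] to str, so a strict type checker
    # accepts the string formatting below without a cast.
    if isinstance(error, str):
        return f"{result['name']}: failed ({error})"
    status = "same changes" if result["same_changes"] else "different changes"
    return f"{result['name']}: {status}"
```

Typing the result dictionaries this way lets a type checker catch a misspelled key or an unchecked optional field in the reporting code.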

- Update requires-python to >=3.10 (core library)
- Add python_version>='3.11' constraint to nupunkt optional dependency
- Update CI to conditionally install nupunkt based on Python version
- Python 3.10: runs without nupunkt (WholeDocumentProcessor only)
- Python 3.11+: runs with nupunkt (both processors)
- All tests pass on all Python versions

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

- Add fallback definition for sent_tokenize when nupunkt not available
- Add mypy override to ignore missing nupunkt imports
- Ensures mypy passes on Python 3.10 where nupunkt can't be installed
- All tests still pass (43 passed, 1 skipped)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

- Import NUPUNKT_AVAILABLE directly from processor module
- Remove redundant try/except in test file
- Tests will now properly skip on Python 3.10 where nupunkt unavailable
- Fixes test failures in CI for Python 3.10

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
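
Skipping on a single availability flag can be sketched with the standard library's `unittest`. The project's own tests may well use pytest, where `pytest.mark.skipif` plays the same role. In this sketch the flag is computed locally as a stand-in for importing it from the processor module, and `NupunktOnlyTests` and `run` are hypothetical names.

```python
import importlib.util
import unittest

# Stand-in for the flag that the real tests import from the processor module.
NUPUNKT_AVAILABLE = importlib.util.find_spec("nupunkt") is not None


@unittest.skipUnless(NUPUNKT_AVAILABLE, "nupunkt not installed")
class NupunktOnlyTests(unittest.TestCase):
    def test_package_imports(self) -> None:
        import nupunkt

        self.assertIsNotNone(nupunkt)


def run() -> unittest.TestResult:
    """Run the test case and return the result object for inspection."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(NupunktOnlyTests)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

Keeping one source of truth for the flag means the test module and the library cannot disagree about whether the optional package is present, which is the kind of mismatch a duplicated try/except in the test file invites.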

- Add LEVENSHTEIN_AVAILABLE flag to processor.py
- Make Levenshtein-dependent test assertions conditional
- Tests now pass regardless of Levenshtein installation

houfu and others added 13 commits October 12, 2025 23:44

chore: Update uv.lock for optional nupunkt dependency

8678d70

Merge branch 'main' into feat/nupunkt-sentence-tokenization

054d32f

fix: Add mypy override for Levenshtein optional dependency

c28126c

fix: Handle optional Levenshtein dependency in tests

40af105

houfu merged commit b491906 into main Oct 14, 2025
10 checks passed

houfu deleted the feat/nupunkt-sentence-tokenization branch October 14, 2025 15:31

houfu merged 13 commits into main from feat/nupunkt-sentence-tokenization
