Add advanced statistics and analytics for text comparison #72
- Extend Stats dataclass with comprehensive analytics fields including longest/shortest/average change lengths, change ratio, character-level statistics, and Levenshtein distance
- Enhance stats() method with advanced calculation logic for change metrics and optional Levenshtein distance integration
- Update JSON output to include all new statistics fields
- Fix constructor to properly handle empty strings as valid test input
- Add comprehensive test suite with 10 new test functions covering various scenarios
- Maintain full backward compatibility with existing APIs
    change_lengths.append(len(text))

    longest_change_length = max(change_lengths) if change_lengths else 0
    shortest_change_length = min(change_lengths) if change_lengths else None
Suggested change:
    - shortest_change_length = min(change_lengths) if change_lengths else None
    + shortest_change_length = min(change_lengths) if change_lengths else 0
It should follow the same logic as longest_change_length and fall back to 0 when there are no changes.
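The empty-list fallback discussed in this comment can be sketched as follows. This is a minimal illustration, not the PR's code: the helper name `change_length_stats` is hypothetical, and it assumes the average should also fall back to a zero value when no changes exist.

```python
from statistics import mean


def change_length_stats(change_lengths: list[int]) -> tuple[int, int, float]:
    """Return (longest, shortest, average) change length.

    Hypothetical helper: all three values fall back to 0 when the list
    is empty, so callers never have to special-case None.
    """
    if not change_lengths:
        return 0, 0, 0.0
    return (
        max(change_lengths),
        min(change_lengths),
        float(mean(change_lengths)),
    )
```

Returning 0 for both extremes keeps the two fields the same type (`int`), at the cost of not distinguishing "no changes" from "a change of length 0".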
    if self._test is not None and self.source == self._test:
    levenshtein_distance: int | None = 0
    try:
        import Levenshtein
Levenshtein should be added as an optional dependency in pyproject.toml.
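For context, an optional dependency is usually paired with a guarded import so that the feature degrades gracefully when the package is missing. The sketch below shows one common pattern; the function name `optional_levenshtein` is hypothetical, and returning `None` when the package is unavailable is an assumption based on the `int | None` annotation in the diff.

```python
from __future__ import annotations

try:
    import Levenshtein  # optional third-party package
except ImportError:
    Levenshtein = None  # the feature is simply unavailable


def optional_levenshtein(source: str, test: str) -> int | None:
    """Return the edit distance, or None if the optional package is missing."""
    if source == test:
        # Identical texts have distance 0; no library needed.
        return 0
    if Levenshtein is None:
        return None
    return Levenshtein.distance(source, test)
```

With this shape, callers can treat `None` as "metric not computed" rather than handling an `ImportError` themselves.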
    changed_chars = sum(
        len(change.source_text or change.test_text or "") for change in changes
    )
    change_ratio = changed_chars / total_chars if total_chars > 0 else 0.0
Problem: This calculates the ratio based on the source text length only, which doesn't accurately represent the proportion of text that changed. For insertions, this can produce ratios > 1.0 (violating the documented constraint "0.0 to 1.0 scale").
Example:
- Source: "Hello world" (11 chars)
- Test: "Hello beautiful world" (21 chars)
- Insert: "beautiful " (10 chars)
- Current ratio: 10/11 = 0.909 (90.9%)
This suggests 90.9% of the text changed, when actually only a small portion was modified. The test in test_changes_api.py:317-318 even expects this incorrect behavior.
Consider using the maximum of the source and test lengths as the denominator?
I see that now, thanks!
I think using something like max(source_len, test_len) as the denominator should solve this.
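A minimal sketch of the proposed denominator, assuming the changed-character count is computed as in the diff above. The function name `change_ratio` and its parameters are hypothetical. The final clamp to 1.0 is an extra safeguard not discussed in the thread: when a deletion and a separate insertion are both counted, the numerator can still exceed the longer text's length.

```python
def change_ratio(changed_chars: int, source_len: int, test_len: int) -> float:
    """Share of text changed, relative to the longer of the two texts."""
    denominator = max(source_len, test_len)
    if denominator == 0:
        # Both texts are empty, so nothing can have changed.
        return 0.0
    # Clamp (an assumption of this sketch) to honor the 0.0 to 1.0 scale.
    return min(1.0, changed_chars / denominator)


# Reviewer's example: 10 inserted chars, source 11 chars, test 21 chars.
ratio = change_ratio(10, 11, 21)  # 10 / 21, instead of 10 / 11
```

For a short source with a long insertion, such as 15 characters added to a 2-character source, the source-only denominator would give 7.5, while this version gives 15/17.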
Overview
This pull request implements comprehensive statistics and analytics for text comparison in the Redlines library, addressing issue #15. The enhancement provides quantitative measures of text differences for data analysis, quality assurance, and research applications.
New Features
Advanced Statistics Metrics
- longest_change_length: Character length of the longest individual change
- shortest_change_length: Character length of the shortest individual change
- average_change_length: Average length across all changes
- change_ratio: Percentage of text modified (0.0 to 1.0 scale)
- chars_added: Total number of characters added across all operations
- chars_deleted: Total number of characters deleted across all operations
- chars_net_change: Net change in characters (added - deleted)
- levenshtein_distance: Edit distance between source and test texts
Technical Implementation
- Extended Stats dataclass while maintaining full backward compatibility
- Enhanced stats() method with advanced calculation logic
Code Quality
Usage Examples
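As a rough, library-independent illustration of the character-level metrics (chars_added, chars_deleted, chars_net_change), the sketch below uses Python's standard `difflib` rather than the Redlines API. The function name `char_stats` is hypothetical, and Redlines may tokenize and align text differently, so its numbers need not match.

```python
from difflib import SequenceMatcher


def char_stats(source: str, test: str) -> dict[str, int]:
    """Count characters added and deleted between two strings."""
    matcher = SequenceMatcher(None, source, test, autojunk=False)
    added = deleted = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            deleted += i2 - i1  # characters removed from source
        if tag in ("replace", "insert"):
            added += j2 - j1  # characters introduced in test
    return {
        "chars_added": added,
        "chars_deleted": deleted,
        "chars_net_change": added - deleted,
    }


stats = char_stats("Hello world", "Hello beautiful world")
# stats == {"chars_added": 10, "chars_deleted": 0, "chars_net_change": 10}
```

The net change always equals the difference in text lengths, so it can serve as a quick sanity check on the other two counts.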
Files Modified
- redlines/processor.py: Extended Stats dataclass with new fields
- redlines/redlines.py: Enhanced stats() method and JSON output
- tests/test_changes_api.py: Added comprehensive test suite
- README.md: Added advanced analytics documentation
Backward Compatibility
Use Cases Enabled
Verification Steps
Test Screenshot
Closes #15: show change statistics/analytics when comparing text