add advanced statistics and analytics for text comparison by Kaos599 · Pull Request #72 · houfu/redlines

Kaos599 · 2025-10-12T13:30:13Z

Overview

This pull request implements comprehensive statistics and analytics for text comparison in the Redlines library, addressing issue #15. The enhancement provides quantitative measures of text differences for data analysis, quality assurance, and research applications.

New Features

Advanced Statistics Metrics

longest_change_length: Character length of the longest individual change
shortest_change_length: Character length of the shortest individual change
average_change_length: Average length across all changes
change_ratio: Percentage of text modified (0.0 to 1.0 scale)
chars_added: Total number of characters added across all operations
chars_deleted: Total number of characters deleted across all operations
chars_net_change: Net change in characters (added - deleted)
levenshtein_distance: Edit distance between source and test texts

Technical Implementation

Extended the existing Stats dataclass while maintaining full backward compatibility
Enhanced the stats() method with advanced calculation logic
Updated JSON output to include all new statistics fields
Added optional Levenshtein distance integration with graceful fallback
Fixed constructor to properly handle empty strings as valid test input

Code Quality

MyPy type checking passes
Black code formatting compliant
No breaking changes to existing APIs
Performance maintained (sub-2ms processing for large texts)

Usage Examples

from redlines import Redlines

# Basic analytics
r = Redlines("Hello world", "Hello there world")
stats = r.stats()

print(f"Total changes: {stats.total_changes}")
print(f"Change ratio: {stats.change_ratio:.1%}")
print(f"Longest change: {stats.longest_change_length} chars")
print(f"Levenshtein distance: {stats.levenshtein_distance}")

# JSON export with all metrics
data = r.output_json()  # Includes all new statistics

Files Modified

redlines/processor.py: Extended Stats dataclass with new fields
redlines/redlines.py: Enhanced stats() method and JSON output
tests/test_changes_api.py: Added comprehensive test suite
README.md: Added advanced analytics documentation

Backward Compatibility

Zero breaking changes to existing APIs
Extended Stats object maintains all original fields
JSON output includes new fields alongside existing ones
All existing code continues to work unchanged

Use Cases Enabled

Content Analysis: Measure document similarity quantitatively
Quality Assurance: Automated diff analysis in CI/CD pipelines
Research Applications: Statistical analysis of text modifications
Data Processing: ETL pipeline validation and monitoring

Verification Steps

# Run all tests
python -m pytest tests/ -v

# Check advanced stats specifically
python -m pytest tests/test_changes_api.py -k "advanced_stats" -v

# Verify linting
python -m mypy redlines/ --ignore-missing-imports
python -m black --check redlines/

Test Screenshot

Closes #15: show change statistics/analytics when comparing text

- Extend Stats dataclass with comprehensive analytics fields including longest/shortest/average change lengths, change ratio, character-level statistics, and Levenshtein distance - Enhance stats() method with advanced calculation logic for change metrics and optional Levenshtein distance integration - Update JSON output to include all new statistics fields - Fix constructor to properly handle empty strings as valid test input - Add comprehensive test suite with 10 new test functions covering various scenarios - Maintain full backward compatibility with existing APIs

houfu · 2025-10-12T15:38:58Z

+            change_lengths.append(len(text))
+
+        longest_change_length = max(change_lengths) if change_lengths else 0
+        shortest_change_length = min(change_lengths) if change_lengths else None


Suggested change

shortest_change_length = min(change_lengths) if change_lengths else None

shortest_change_length = min(change_lengths) if change_lengths else 0

It should follow the same logic as shortest.

houfu · 2025-10-12T15:44:27Z

+        if self._test is not None and self.source == self._test:
+            levenshtein_distance: int | None = 0
+            try:
+                import Levenshtein


Levenshtein should be added as an optional dependency in pyproject.toml

houfu · 2025-10-12T16:25:15Z

+        changed_chars = sum(
+            len(change.source_text or change.test_text or "") for change in changes
+        )
+        change_ratio = changed_chars / total_chars if total_chars > 0 else 0.0


Problem: This calculates the ratio based on the source text length only, which doesn't accurately represent the proportion of text that changed. For
insertions, this can produce ratios > 1.0 (violating the documented constraint "0.0 to 1.0 scale").

Example:

Source: "Hello world" (11 chars)

Test: "Hello beautiful world" (21 chars)

Insert: "beautiful " (10 chars)

Current ratio: 10/11 = 0.909 (90.9%)

This suggests 90.9% of the text changed, when actually only a small portion was modified. The test in test_changes_api.py:317-318 even expects this incorrect behavior.

Consider using maximum of source or test?

I see that now thanks!

I think if i use something like "max(source_len, test_len)" as the denominator that should solve this

Kaos599 mentioned this pull request Oct 12, 2025

New feat: show change statistics/anaytics when comparing text #15

Closed

4 tasks

houfu reviewed Oct 12, 2025

View reviewed changes

Resolved Comments

1caf2b4

houfu merged commit c84ea82 into houfu:main Oct 13, 2025
5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add advanced statistics and analytics for text comparison#72

add advanced statistics and analytics for text comparison#72
houfu merged 2 commits into
houfu:mainfrom
Kaos599:main

Kaos599 commented Oct 12, 2025

Uh oh!

houfu Oct 12, 2025

Uh oh!

houfu Oct 12, 2025

Uh oh!

houfu Oct 12, 2025

Uh oh!

Kaos599 Oct 12, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	shortest_change_length = min(change_lengths) if change_lengths else None
	shortest_change_length = min(change_lengths) if change_lengths else 0

Conversation

Kaos599 commented Oct 12, 2025

Overview

New Features

Advanced Statistics Metrics

Technical Implementation

Code Quality

Usage Examples

Files Modified

Backward Compatibility

Use Cases Enabled

Verification Steps

Test Screenshot

Uh oh!

houfu Oct 12, 2025

Choose a reason for hiding this comment

Uh oh!

houfu Oct 12, 2025

Choose a reason for hiding this comment

Uh oh!

houfu Oct 12, 2025

Choose a reason for hiding this comment

Uh oh!

Kaos599 Oct 12, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants