add advanced statistics and analytics for text comparison #72

Merged
houfu merged 2 commits into houfu:main from Kaos599:main on Oct 13, 2025

Conversation

Contributor

@Kaos599 Kaos599 commented Oct 12, 2025

Overview

This pull request implements comprehensive statistics and analytics for text comparison in the Redlines library, addressing issue #15. The enhancement provides quantitative measures of text differences for data analysis, quality assurance, and research applications.

New Features

Advanced Statistics Metrics

  • longest_change_length: Character length of the longest individual change
  • shortest_change_length: Character length of the shortest individual change
  • average_change_length: Average length across all changes
  • change_ratio: Fraction of text modified (0.0 to 1.0 scale)
  • chars_added: Total number of characters added across all operations
  • chars_deleted: Total number of characters deleted across all operations
  • chars_net_change: Net change in characters (added - deleted)
  • levenshtein_distance: Edit distance between source and test texts
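
As a rough illustration of how the character-level counters relate to one another, the sketch below derives them from the opcodes of Python's standard `difflib`. This is a stand-in and not the Redlines implementation, which works from its own token-level changes; the function name `char_change_counts` is hypothetical.

```python
from difflib import SequenceMatcher


def char_change_counts(source: str, test: str) -> dict[str, int]:
    """Count added and deleted characters between two strings (illustrative only)."""
    added = 0
    deleted = 0
    # autojunk=False stops difflib from ignoring very frequent characters on long inputs.
    matcher = SequenceMatcher(None, source, test, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            deleted += i2 - i1  # characters removed from the source
        if tag in ("insert", "replace"):
            added += j2 - j1  # characters introduced by the test text
    return {
        "chars_added": added,
        "chars_deleted": deleted,
        "chars_net_change": added - deleted,
    }


counts = char_change_counts("Hello world", "Hello there world")
print(counts)
```

With this kind of counting, the net change always equals `len(test) - len(source)`, which makes it a convenient sanity check on the other two fields.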

Technical Implementation

  • Extended the existing Stats dataclass while maintaining full backward compatibility
  • Enhanced the stats() method with advanced calculation logic
  • Updated JSON output to include all new statistics fields
  • Added optional Levenshtein distance integration with graceful fallback
  • Fixed constructor to properly handle empty strings as valid test input
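
For readers unfamiliar with the pattern, an optional dependency with graceful fallback might look roughly like the sketch below. The helper name `optional_levenshtein` is hypothetical, and returning `None` when the package is missing is an assumption based on the `int | None` annotation in this PR, not a copy of the merged code.

```python
from typing import Optional


def optional_levenshtein(source: str, test: str) -> Optional[int]:
    """Return the edit distance, or None when the optional package is missing."""
    if source == test:
        return 0  # identical texts need no library call
    try:
        import Levenshtein  # optional third-party dependency
    except ImportError:
        return None  # graceful fallback: the metric is simply unavailable
    return Levenshtein.distance(source, test)


same = optional_levenshtein("kitten", "kitten")
different = optional_levenshtein("kitten", "sitting")
print(same)       # 0
print(different)  # 3 if the Levenshtein package is installed, otherwise None
```

Keeping the import inside the function means users who never call `stats()` pay nothing for the missing package.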

Code Quality

  • MyPy type checking passes
  • Black code formatting compliant
  • No breaking changes to existing APIs
  • Performance maintained (sub-2ms processing for large texts)

Usage Examples

from redlines import Redlines

# Basic analytics
r = Redlines("Hello world", "Hello there world")
stats = r.stats()

print(f"Total changes: {stats.total_changes}")
print(f"Change ratio: {stats.change_ratio:.1%}")
print(f"Longest change: {stats.longest_change_length} chars")
print(f"Levenshtein distance: {stats.levenshtein_distance}")

# JSON export with all metrics
data = r.output_json()  # Includes all new statistics

Files Modified

  • redlines/processor.py: Extended Stats dataclass with new fields
  • redlines/redlines.py: Enhanced stats() method and JSON output
  • tests/test_changes_api.py: Added comprehensive test suite
  • README.md: Added advanced analytics documentation

Backward Compatibility

  • Zero breaking changes to existing APIs
  • Extended Stats object maintains all original fields
  • JSON output includes new fields alongside existing ones
  • All existing code continues to work unchanged

Use Cases Enabled

  1. Content Analysis: Measure document similarity quantitatively
  2. Quality Assurance: Automated diff analysis in CI/CD pipelines
  3. Research Applications: Statistical analysis of text modifications
  4. Data Processing: ETL pipeline validation and monitoring

Verification Steps

# Run all tests
python -m pytest tests/ -v

# Check advanced stats specifically
python -m pytest tests/test_changes_api.py -k "advanced_stats" -v

# Verify linting
python -m mypy redlines/ --ignore-missing-imports
python -m black --check redlines/

Test Screenshot

Screenshot 2025-10-12 174954

Closes #15: show change statistics/analytics when comparing text

- Extend Stats dataclass with comprehensive analytics fields including longest/shortest/average change lengths, change ratio, character-level statistics, and Levenshtein distance
- Enhance stats() method with advanced calculation logic for change metrics and optional Levenshtein distance integration
- Update JSON output to include all new statistics fields
- Fix constructor to properly handle empty strings as valid test input
- Add comprehensive test suite with 10 new test functions covering various scenarios
- Maintain full backward compatibility with existing APIs
Comment thread redlines/redlines.py Outdated
change_lengths.append(len(text))

longest_change_length = max(change_lengths) if change_lengths else 0
shortest_change_length = min(change_lengths) if change_lengths else None
Owner

Suggested change
- shortest_change_length = min(change_lengths) if change_lengths else None
+ shortest_change_length = min(change_lengths) if change_lengths else 0

It should follow the same logic as longest_change_length.
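
A minimal sketch of consistent empty-case defaults for the length metrics, assuming the changes have already been reduced to a list of integer lengths. The helper name `summarize_change_lengths` is hypothetical, and the `0.0` default for the average is an assumption not stated in this thread.

```python
from typing import List, Tuple


def summarize_change_lengths(change_lengths: List[int]) -> Tuple[int, int, float]:
    """Return (longest, shortest, average) change length, all zero when there are no changes."""
    if not change_lengths:
        # Same default for every metric, so callers never have to handle None.
        return 0, 0, 0.0
    longest = max(change_lengths)
    shortest = min(change_lengths)
    average = sum(change_lengths) / len(change_lengths)
    return longest, shortest, average


print(summarize_change_lengths([]))          # (0, 0, 0.0)
print(summarize_change_lengths([10, 3, 5]))  # (10, 3, 6.0)
```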

Comment thread redlines/redlines.py
if self._test is not None and self.source == self._test:
levenshtein_distance: int | None = 0
try:
import Levenshtein
Owner

Levenshtein should be added as an optional dependency in pyproject.toml

Comment thread redlines/redlines.py
changed_chars = sum(
len(change.source_text or change.test_text or "") for change in changes
)
change_ratio = changed_chars / total_chars if total_chars > 0 else 0.0
Owner

Problem: This calculates the ratio based on the source text length only, which doesn't accurately represent the proportion of text that changed. For insertions, this can produce ratios > 1.0 (violating the documented constraint "0.0 to 1.0 scale").

Example:

  • Source: "Hello world" (11 chars)
  • Test: "Hello beautiful world" (21 chars)
  • Insert: "beautiful " (10 chars)
  • Current ratio: 10/11 = 0.909 (90.9%)

This suggests 90.9% of the text changed, when actually only a small portion was modified. The test in test_changes_api.py:317-318 even expects this incorrect behavior.

Consider using the maximum of the source and test lengths as the denominator?

Contributor Author

I see that now, thanks!

I think using something like "max(source_len, test_len)" as the denominator should solve this.
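
To make the arithmetic concrete, here is a small sketch of the proposed denominator applied to the example from this thread. The function name and signature are hypothetical; only the `max(source_len, test_len)` idea comes from the discussion.

```python
def change_ratio(changed_chars: int, source_len: int, test_len: int) -> float:
    """Ratio of changed characters to the longer of the two texts."""
    total_chars = max(source_len, test_len)
    return changed_chars / total_chars if total_chars > 0 else 0.0


source = "Hello world"
test = "Hello beautiful world"
inserted = "beautiful "

old_ratio = len(inserted) / len(source)  # source-only denominator
new_ratio = change_ratio(len(inserted), len(source), len(test))
print(f"{old_ratio:.3f} -> {new_ratio:.3f}")  # 0.909 -> 0.476
```

Note that a diff with both large deletions and large insertions could still push this ratio above 1.0, because the numerator counts characters from both texts, so clamping the result may be worth considering.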

@houfu houfu merged commit c84ea82 into houfu:main Oct 13, 2025
5 checks passed