v0.31.0 Release Notes
New Features
Character Replacements for Text Normalization (#130)
Added support for custom character replacements that normalize special Unicode characters before text comparison. This addresses cases where PDFs contain visually identical but technically different characters (e.g., non-breaking spaces, `\u00A0`, versus regular spaces).
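To illustrate the problem this feature targets, here is a minimal Python sketch of replacement-based normalization. The helper `normalize_text` and the sample strings are hypothetical and written only for this example; they are not the library's actual implementation.

```python
from typing import Dict, Optional


def normalize_text(text: str, replacements: Optional[Dict[str, str]] = None) -> str:
    """Hypothetical helper: replace every key of `replacements` with its value."""
    if not replacements:
        # No mapping given (None or empty): return the text unchanged.
        return text
    for old, new in replacements.items():
        text = text.replace(old, new)
    return text


# Two strings that look identical when rendered but differ in their whitespace.
reference = "Total amount: 100 EUR"
candidate = "Total\u00a0amount:\u00a0100\u00a0EUR"

mapping = {"\u00a0": " "}

# A plain comparison fails because U+00A0 is not U+0020.
raw_equal = reference == candidate

# After normalization both sides use regular spaces and compare equal.
normalized_equal = normalize_text(reference, mapping) == normalize_text(candidate, mapping)
```

Applying the same mapping to both sides before comparing keeps the check symmetric: it does not matter which document contains the non-breaking spaces.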
PdfTest Library:
- New `character_replacements` initialization parameter applies to all keywords
- Keyword-level `character_replacements` parameter for `Compare Pdf Documents`, `Compare Pdf Structure`, `PDF Should Contain Strings`, and `PDF Should Not Contain Strings`

    *** Settings ***
    Library    DocTest.PdfTest    character_replacements={'\u00A0': ' '}

    *** Test Cases ***
    Compare with normalized whitespace
        Compare Pdf Documents    ref.pdf    cand.pdf    character_replacements={'\u00A0': ' '}

VisualTest Library:
- New `character_replacements` initialization parameter
- New `Set Character Replacements` keyword for runtime configuration
- Applied to `Get Text`, `Get Text From Document`, and `Get Text From Area` keywords

    *** Test Cases ***
    Get text with normalized characters
        Set Character Replacements    {'\u00A0': ' '}
        ${text}=    Get Text    document.pdf
        Set Character Replacements    ${NONE}

Ignore Page Boundaries in PDF Structure Comparison (#129)
Added options to compare PDF text content while ignoring page structure differences. This is useful when font or size changes cause text to reflow across pages differently.
New Parameters:
| Parameter | Default | Description |
|---|---|---|
| `ignore_page_boundaries` | `${False}` | Flatten text across all pages and compare only content and order |
| `check_geometry` | `${True}` | When `${False}`, skip line position/size comparison |
| `check_block_count` | `${True}` | When `${False}`, skip block count validation per page |
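
The idea behind `ignore_page_boundaries` can be sketched in a few lines of Python. This is only a conceptual illustration: the functions `flatten_pages` and `pages_match`, and the representation of a document as a list of pages holding text lines, are assumptions made for this example and do not reflect the library's internals.

```python
from typing import List

Page = List[str]  # assumed representation: one page is a list of text lines


def flatten_pages(pages: List[Page]) -> List[str]:
    """Concatenate the lines of all pages, preserving their order."""
    return [line for page in pages for line in page]


def pages_match(ref: List[Page], cand: List[Page],
                ignore_page_boundaries: bool = False) -> bool:
    """Hypothetical comparison: page by page, or on the flattened text."""
    if ignore_page_boundaries:
        return flatten_pages(ref) == flatten_pages(cand)
    return ref == cand


# Same content and order, but "Line 2" reflowed onto the second page.
reference = [["Title", "Line 1", "Line 2"], ["Line 3"]]
candidate = [["Title", "Line 1"], ["Line 2", "Line 3"]]

strict_match = pages_match(reference, candidate)
relaxed_match = pages_match(reference, candidate, ignore_page_boundaries=True)

# Reordered content still differs, even when page boundaries are ignored.
reordered = [["Title", "Line 2"], ["Line 1", "Line 3"]]
reordered_match = pages_match(reference, reordered, ignore_page_boundaries=True)
```

In this sketch the page-by-page comparison reports a difference for the reflowed document, while the flattened comparison accepts it and still rejects a change in line order.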

    *** Test Cases ***
    Compare PDFs ignoring page breaks
        Compare Pdf Structure    reference.pdf    candidate.pdf
        ...    ignore_page_boundaries=${True}

    Compare content only (ignore positions)
        Compare Pdf Structure    reference.pdf    candidate.pdf
        ...    check_geometry=${False}    check_block_count=${False}

Improvements
- Improved LLM prompt quality for more consistent AI-assisted comparisons
- Reduced test flakiness in LLM-related tests
- Added comprehensive unit tests for text normalization and PDF structure comparison
- Added acceptance tests for character replacement functionality
Internal Changes
- Extended `StructureExtractionConfig` dataclass with `character_replacements` field
- Added `apply_character_replacements()` function to `TextNormalization.py`
- Added `compare_document_text_only()` function to `PdfStructureComparator.py`
- Extended `StructureComparisonResult` with difference counting