v0.31.0

@manykarim manykarim released this 07 Jan 19:45

v0.31.0 Release Notes

New Features

Character Replacements for Text Normalization (#130)

Added support for custom character replacements that normalize special Unicode characters before text comparison. This addresses comparisons that fail because PDFs contain visually identical but technically different characters (e.g., a non-breaking space \u00A0 versus a regular space).
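The idea can be illustrated with a minimal Python sketch. The helper name `replace_characters` and the sample strings are hypothetical, chosen only for illustration; this is not the library's actual implementation, just one plausible way a replacement mapping could be applied before comparing text.

```python
def replace_characters(text, replacements):
    """Return text with every key of `replacements` swapped for its value.

    Hypothetical helper for illustration only. A missing or empty
    mapping leaves the text unchanged.
    """
    if not replacements:
        return text
    for old, new in replacements.items():
        text = text.replace(old, new)
    return text


# Two strings that look identical when rendered but differ in one character:
# the reference uses a non-breaking space, the candidate a regular space.
reference = "Total:\u00a0100 EUR"
candidate = "Total: 100 EUR"
mapping = {"\u00a0": " "}

raw_equal = reference == candidate
normalized_equal = (
    replace_characters(reference, mapping) == replace_characters(candidate, mapping)
)
```

Without normalization the two strings compare as different; once the same mapping is applied to both sides, they compare as equal.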

PdfTest Library:

  • New character_replacements initialization parameter that applies to all keywords
  • Keyword-level character_replacements parameter for Compare Pdf Documents, Compare Pdf Structure, PDF Should Contain Strings, and PDF Should Not Contain Strings

*** Settings ***
Library    DocTest.PdfTest    character_replacements={'\u00A0': ' '}

*** Test Cases ***
Compare with normalized whitespace
    Compare Pdf Documents    ref.pdf    cand.pdf    character_replacements={'\u00A0': ' '}

VisualTest Library:

  • New character_replacements initialization parameter
  • New Set Character Replacements keyword for runtime configuration
  • Applied to Get Text, Get Text From Document, and Get Text From Area keywords

*** Settings ***
Library    DocTest.VisualTest

*** Test Cases ***
Get text with normalized characters
    Set Character Replacements    {'\u00A0': ' '}
    ${text}=    Get Text    document.pdf
    Set Character Replacements    ${NONE}

Ignore Page Boundaries in PDF Structure Comparison (#129)

Added options to compare PDF text content while ignoring page structure differences. This is useful when font or size changes cause text to reflow across pages differently.

New Parameters:

| Parameter | Default | Description |
|---|---|---|
| ignore_page_boundaries | ${False} | Flatten text across all pages and compare only content and order |
| check_geometry | ${True} | When ${False}, skip line position/size comparison |
| check_block_count | ${True} | When ${False}, skip block count validation per page |

*** Test Cases ***
Compare PDFs ignoring page breaks
    Compare Pdf Structure    reference.pdf    candidate.pdf
    ...    ignore_page_boundaries=${True}

Compare content only (ignore positions)
    Compare Pdf Structure    reference.pdf    candidate.pdf
    ...    check_geometry=${False}    check_block_count=${False}
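
What ignoring page boundaries means can be sketched in a few lines of Python. The function names and the page data below are hypothetical and do not reflect the library's internal code. The sketch also assumes that a reflow moves whole lines between pages without changing the line breaks themselves.

```python
def flatten_pages(pages):
    """Concatenate the lines of all pages into one list, keeping their order."""
    return [line for page in pages for line in page]


def same_text_ignoring_pages(reference_pages, candidate_pages):
    """Compare content and order only; page breaks are ignored."""
    return flatten_pages(reference_pages) == flatten_pages(candidate_pages)


# Hypothetical extracted text: one inner list of lines per page.
# In the candidate, "Line 2" has moved from page 1 to page 2.
reference_pages = [["Title", "Line 1", "Line 2"], ["Line 3"]]
candidate_pages = [["Title", "Line 1"], ["Line 2", "Line 3"]]

per_page_equal = reference_pages == candidate_pages
flattened_equal = same_text_ignoring_pages(reference_pages, candidate_pages)
```

A page-by-page comparison reports a difference because the page contents differ, while the flattened comparison passes because the overall content and order are identical.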

Improvements

  • Improved LLM prompt quality for more consistent AI-assisted comparisons
  • Reduced test flakiness in LLM-related tests
  • Added comprehensive unit tests for text normalization and PDF structure comparison
  • Added acceptance tests for character replacement functionality

Internal Changes

  • Extended StructureExtractionConfig dataclass with character_replacements field
  • Added apply_character_replacements() function to TextNormalization.py
  • Added compare_document_text_only() function to PdfStructureComparator.py
  • Extended StructureComparisonResult with difference counting