fix: Escape html special characters in `hocr_document_template.xml.j2` #279

holtskinner · 2024-03-11T14:52:15Z

Special characters must be escaped so that the output of the hOCR conversion can be used in other tools. The Jinja documentation also recommends escaping them (see "HTML Escaping" at https://jinja.palletsprojects.com/en/3.0.x/templates/).
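For illustration, here is a minimal sketch of why unescaped text breaks downstream XML tools. It does not reproduce the template change in this PR; it uses only the standard library (`html.escape` and `xml.etree.ElementTree`), and the helper names `wrap_word` and `is_well_formed` are hypothetical. In a Jinja template, the equivalent step would typically be the built-in `escape` filter (alias `e`) or autoescaping.

```python
import html
import xml.etree.ElementTree as ET


def wrap_word(text, escape=True):
    """Wrap OCR text in an hOCR-style span (hypothetical helper)."""
    # html.escape replaces &, <, > and quotes with entity references
    content = html.escape(text) if escape else text
    return f"<span class='ocrx_word'>{content}</span>"


def is_well_formed(xml_text):
    """Return True if the string parses as XML (hypothetical helper)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


raw = "AT&T <Q>"
print(is_well_formed(wrap_word(raw, escape=False)))  # unescaped markup
print(is_well_formed(wrap_word(raw)))                # escaped markup
```

With escaping enabled, the parser recovers the original text from the span, so downstream tools see the same characters the OCR produced.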

Reported in Customer Issue b/329048716

Fixes #213 🦕

Replacement for #239

holtskinner · 2024-03-11T15:34:51Z

Verified that the test input failed before HTML escaping was added:

_________________ test_export_hocr_str_with_escape_characters __________________

    def test_export_hocr_str_with_escape_characters():
        wrapped_document = document.Document.from_document_path(
            document_path="tests/unit/resources/toolbox_invoice_test-0-hocr-escape.json"
        )
    
        actual_hocr = wrapped_document.export_hocr_str(title="toolbox_invoice_test-0")
        assert actual_hocr
    
>       element = ElementTree.fromstring(actual_hocr)

tests/unit/test_document.py:827: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

text = '<?xml version="1.0" encoding="UTF-8"?>\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3...rx_word\' id=\'word_1_30_0_0_4\' title=\'bbox 585 1781 620 1818\'>t Q</span></span></p></span></div>\n</body>\n</html>'
parser = <xml.etree.ElementTree.XMLParser object at 0x7f67e1a1c160>

    def XML(text, parser=None):
        """Parse XML document from string constant.
    
        This function can be used to embed "XML Literals" in Python code.
    
        *text* is a string containing XML data, *parser* is an
        optional parser instance, defaulting to the standard XMLParser.
    
        Returns an Element instance.
    
        """
        if not parser:
            parser = XMLParser(target=TreeBuilder())
>       parser.feed(text)
E       xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 14, column 279

/opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/xml/etree/ElementTree.py:1345: ParseError
- generated xml file: /home/runner/work/python-documentai-toolbox/python-documentai-toolbox/unit_3.11_sponge_log.xml -
=========================== short test summary info ============================
FAILED tests/unit/test_document.py::test_export_hocr_str_with_escape_characters - xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 14, column 279
1 failed, 152 passed in 57.65s
nox > Command py.test --quiet --junitxml=unit_3.11_sponge_log.xml --cov=google --cov=tests/unit --cov-append --cov-config=.coveragerc --cov-report= --cov-fail-under=0 tests/unit failed with exit code 1
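The `ParseError` above reports only a line and column. As a rough sketch for locating the offending text in a long hOCR string, the snippet below reads the `position` attribute that `xml.etree.ElementTree.ParseError` exposes. The helper name `parse_error_context` and the `width` parameter are hypothetical, and the reported column may not line up exactly with string indices when the input contains non-ASCII characters.

```python
import xml.etree.ElementTree as ET


def parse_error_context(xml_text, width=20):
    """Return (line, column, nearby text) for the first parse error, or None.

    Hypothetical helper; relies on ParseError.position, a (line, column) tuple.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position
        # The reported line number is 1-based
        bad_line = xml_text.splitlines()[line - 1]
        start = max(column - width, 0)
        return line, column, bad_line[start:column + width]
    return None


sample = "<p>\n<span>t & Q</span>\n</p>"
print(parse_error_context(sample))
```

Printing the surrounding slice makes it easier to see which character in the OCR text needs escaping.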

- Added test output

dizcology

If possible, please use smaller test files that minimally illustrate the issue being tested.

holtskinner requested review from a team as code owners March 11, 2024 14:52

product-auto-label bot added the size: xs Pull request size is extra small. label Mar 11, 2024

holtskinner mentioned this pull request Mar 11, 2024

fix: Escape html special characters in hocr_document_template.xml.j2 #239

Closed

test: Add Unit test for hOCR XML validity.

a5ae7d2

holtskinner force-pushed the hocr-escape branch from 855de6b to a5ae7d2 Compare March 11, 2024 15:22

product-auto-label bot added size: xl Pull request size is extra large. and removed size: xs Pull request size is extra small. labels Mar 11, 2024

fix: Escape html special characters in hocr_document_template.xml.j2

41b345e

- Added test output

holtskinner assigned parthea and dizcology Mar 11, 2024

dizcology approved these changes Mar 11, 2024

parthea removed their assignment Mar 11, 2024

Shorten hocr escape test files

4112037

product-auto-label bot added size: l Pull request size is large. and removed size: xl Pull request size is extra large. labels Mar 11, 2024

holtskinner changed the title ~~fix: Escape html special characters in hocr_document_template.xml.j2~~ fix: Escape html special characters in `hocr_document_template.xml.j2` Mar 11, 2024

holtskinner enabled auto-merge (squash) March 11, 2024 16:08

Merge branch 'main' into hocr-escape

dcd86ca

holtskinner added the owlbot:run Add this label to trigger the Owlbot post processor. label Mar 11, 2024

gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label Mar 11, 2024

holtskinner merged commit 2d9f05b into main Mar 11, 2024
25 checks passed

holtskinner deleted the hocr-escape branch March 11, 2024 16:39

release-please bot mentioned this pull request Mar 11, 2024

chore(main): release 0.13.3-alpha #276

Merged
