fix: Escape html special characters in hocr_document_template.xml.j2 #279

Merged
merged 4 commits into from
Mar 11, 2024

Conversation

holtskinner
Member

@holtskinner holtskinner commented Mar 11, 2024

Special characters must be escaped so that the output of the hOCR conversion can be used in other tools. The Jinja documentation also recommends escaping these characters (see "HTML Escaping" at https://jinja.palletsprojects.com/en/3.0.x/templates/).
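For illustration, here is a minimal sketch of why escaping matters for downstream XML tools. It uses only the Python standard library rather than the actual Jinja template, and `render_word_span` is a hypothetical helper, not part of the toolbox. Assume the template change has an equivalent effect on text interpolated into the hOCR output.

```python
from html import escape
from xml.etree import ElementTree


def render_word_span(word_id: str, text: str) -> str:
    """Hypothetical helper: build one hOCR-style word span with escaped text."""
    # escape() replaces &, <, > and (by default) both quote characters
    # with entity references that an XML parser can read back.
    return f"<span class='ocrx_word' id='{escape(word_id)}'>{escape(text)}</span>"


raw_text = "R&D <draft>"
span = render_word_span("word_1_1_0_0_0", raw_text)

# The escaped markup parses cleanly, and the parser restores the original text.
element = ElementTree.fromstring(span)
assert element.text == raw_text
```

In a Jinja template, the documented ways to get the same effect are the `e` (`escape`) filter on individual expressions or enabling autoescaping for the environment.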

Reported in Customer Issue b/329048716

Fixes #213

Replacement for #239

@holtskinner holtskinner requested review from a team as code owners March 11, 2024 14:52
@product-auto-label product-auto-label bot added the size: xs Pull request size is extra small. label Mar 11, 2024
@product-auto-label product-auto-label bot added size: xl Pull request size is extra large. and removed size: xs Pull request size is extra small. labels Mar 11, 2024
@holtskinner
Member Author

Verified that the test input failed before HTML escaping was added:

_________________ test_export_hocr_str_with_escape_characters __________________

    def test_export_hocr_str_with_escape_characters():
        wrapped_document = document.Document.from_document_path(
            document_path="tests/unit/resources/toolbox_invoice_test-0-hocr-escape.json"
        )
    
        actual_hocr = wrapped_document.export_hocr_str(title="toolbox_invoice_test-0")
        assert actual_hocr
    
>       element = ElementTree.fromstring(actual_hocr)

tests/unit/test_document.py:827: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

text = '<?xml version="1.0" encoding="UTF-8"?>\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3...rx_word\' id=\'word_1_30_0_0_4\' title=\'bbox 585 1781 620 1818\'>t Q</span></span></p></span></div>\n</body>\n</html>'
parser = <xml.etree.ElementTree.XMLParser object at 0x7f67e1a1c160>

    def XML(text, parser=None):
        """Parse XML document from string constant.
    
        This function can be used to embed "XML Literals" in Python code.
    
        *text* is a string containing XML data, *parser* is an
        optional parser instance, defaulting to the standard XMLParser.
    
        Returns an Element instance.
    
        """
        if not parser:
            parser = XMLParser(target=TreeBuilder())
>       parser.feed(text)
E       xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 14, column 279

/opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/xml/etree/ElementTree.py:1345: ParseError
- generated xml file: /home/runner/work/python-documentai-toolbox/python-documentai-toolbox/unit_3.11_sponge_log.xml -
=========================== short test summary info ============================
FAILED tests/unit/test_document.py::test_export_hocr_str_with_escape_characters - xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 14, column 279
1 failed, 152 passed in 57.65s
nox > Command py.test --quiet --junitxml=unit_3.11_sponge_log.xml --cov=google --cov=tests/unit --cov-append --cov-config=.coveragerc --cov-report= --cov-fail-under=0 tests/unit failed with exit code 1
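
The failure above can be reproduced in isolation with a much smaller input. The following is a sketch only: the log does not show which character in the fixture triggered the error, so the `&` used here is an assumption, and `is_well_formed` is a hypothetical helper for this example.

```python
from xml.etree import ElementTree


def is_well_formed(xml_text: str) -> bool:
    """Hypothetical helper: report whether ElementTree can parse the string."""
    try:
        ElementTree.fromstring(xml_text)
    except ElementTree.ParseError:
        return False
    return True


# A bare ampersand in element text is not valid XML (assumed offending character).
unescaped = "<span class='ocrx_word'>t & Q</span>"
# The same text with the ampersand written as an entity reference.
escaped = "<span class='ocrx_word'>t &amp; Q</span>"

print(is_well_formed(unescaped))
print(is_well_formed(escaped))
```

Only the escaped variant parses, which is the same check the test performs with `ElementTree.fromstring(actual_hocr)`.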

Collaborator

@dizcology dizcology left a comment

If possible, please use smaller test files that minimally illustrate the issue being tested.

@parthea parthea removed their assignment Mar 11, 2024
@product-auto-label product-auto-label bot added size: l Pull request size is large. and removed size: xl Pull request size is extra large. labels Mar 11, 2024
@holtskinner holtskinner changed the title fix: Escape html special characters in hocr_document_template.xml.j2 fix: Escape html special characters in hocr_document_template.xml.j2 Mar 11, 2024
@holtskinner holtskinner enabled auto-merge (squash) March 11, 2024 16:08
@holtskinner holtskinner added the owlbot:run Add this label to trigger the Owlbot post processor. label Mar 11, 2024
@gcf-owl-bot gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label Mar 11, 2024
@holtskinner holtskinner merged commit 2d9f05b into main Mar 11, 2024
25 checks passed
@holtskinner holtskinner deleted the hocr-escape branch March 11, 2024 16:39
Labels
size: l Pull request size is large.
Development

Successfully merging this pull request may close these issues.

conversion to HOCR does not convert HTML special characters, it leaves them plain text
3 participants