Discrepancies in Properties and Format within hocr_document_template.xml.j2 #156

amehmood-pls · 2023-08-08T14:53:52Z

I'm trying to create an hOCR file from a JSON file that I generated through DocumentAI. However, the hocr_document_template.xml.j2 template seems to expect a different structure: paragraphs containing nested lines, with each line in turn containing nested words. The JSON produced by DocumentAI doesn't follow this format. The template also uses property names such as hocr_bounding_box that are absent from the JSON file (attached).

Adding to this, the code repository displays a templates folder at google/cloud/documentai_toolbox/templates/. Nonetheless, upon installing google.cloud.documentai_toolbox via pip, the templates folder appears to be missing.

wikipedia-sample.json.txt

holtskinner · 2023-08-08T15:02:35Z

For clarification, the Document object structure directly from Document AI doesn't follow the hOCR structure with lines nested inside paragraphs, etc. But the Document "wrapper" class within the Document AI Toolbox library does follow that nested structure as of v0.9.0-alpha. It can import a Document AI Document and convert it to the wrapped document structure, which can be exported into the hOCR format.

This code sample shows how to convert a Document AI Document JSON file into an hOCR string.

https://cloud.google.com/document-ai/docs/toolbox#hocr-conversion

As for the templates folder being missing: can you try the sample code first and report back on whether it works as expected?
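To make the structural difference concrete, here is a minimal sketch of how flat page elements could be nested by their text offsets. It is not the Toolbox's actual implementation. The `Node` class and the helper names are hypothetical, and the input is assumed to follow the Document JSON layout, where `paragraphs`, `lines`, and `tokens` are sibling lists under each page and each item carries `layout.textAnchor.textSegments`.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Node:
    """Hypothetical container for one layout element and its nested children."""
    text: str
    start: int
    end: int
    children: List["Node"] = field(default_factory=list)


def to_nodes(items: List[Dict[str, Any]], text: str) -> List[Node]:
    """Turn flat layout items into nodes using their text offsets."""
    nodes = []
    for item in items:
        segments = item["layout"]["textAnchor"]["textSegments"]
        # Assumption: startIndex may be omitted when it is 0, and indices
        # may be serialized as strings.
        start = int(segments[0].get("startIndex", 0))
        end = int(segments[-1]["endIndex"])
        nodes.append(Node(text[start:end], start, end))
    return nodes


def nest(parents: List[Node], children: List[Node]) -> List[Node]:
    """Attach each child to the parent whose offset range contains it."""
    for parent in parents:
        parent.children = [
            c for c in children if parent.start <= c.start and c.end <= parent.end
        ]
    return parents


def nest_page(page: Dict[str, Any], text: str) -> List[Node]:
    """Build paragraphs -> lines -> words from one flat page dictionary."""
    words = to_nodes(page.get("tokens", []), text)
    lines = nest(to_nodes(page.get("lines", []), text), words)
    return nest(to_nodes(page.get("paragraphs", []), text), lines)
```

The nested result has the paragraph, line, word shape that the hOCR template iterates over, which is why the raw JSON cannot be passed to the template directly.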

amehmood-pls · 2023-08-08T17:33:27Z

Thanks for the clarification. Upon executing the sample code, the following error is displayed.

Traceback (most recent call last):
  File "/home/am/ocr/hocr.py", line 26, in <module>
    hocr_string = convert_document_to_hocr_sample(document_path, document_title)
  File "/home/am/ocr/hocr.py", line 21, in convert_document_to_hocr_sample
    hocr_string = wrapped_document.export_hocr_str(title=document_title)
  File "/home/am/anaconda3/lib/python3.10/site-packages/google/cloud/documentai_toolbox/wrappers/document.py", line 769, in export_hocr_str
    loader=PackageLoader("google.cloud.documentai_toolbox", "templates")
  File "/home/am/anaconda3/lib/python3.10/site-packages/jinja2/loaders.py", line 323, in __init__
    raise ValueError(
ValueError: The 'google.cloud.documentai_toolbox' package was not installed in a way that PackageLoader understands.
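A quick way to check whether the installed package actually ships a templates directory is sketched below. It uses only the standard library, and `package_has_directory` is a hypothetical helper name. It only tests whether the directory exists on disk; it does not reproduce everything that `PackageLoader` checks.

```python
import importlib.util
from pathlib import Path


def package_has_directory(package_name: str, directory: str) -> bool:
    """Return True if the installed package contains the given subdirectory."""
    try:
        spec = importlib.util.find_spec(package_name)
    except (ImportError, ValueError):
        # A parent package is missing, or the module has no usable spec.
        return False
    if spec is None or spec.submodule_search_locations is None:
        # Not installed, or a plain module rather than a package.
        return False
    return any(
        (Path(location) / directory).is_dir()
        for location in spec.submodule_search_locations
    )


if __name__ == "__main__":
    # Package and directory names are taken from the traceback above.
    print(package_has_directory("google.cloud.documentai_toolbox", "templates"))
```

If this prints `False` for a pip-installed copy, the template files were left out of the distribution, which would be consistent with the error above.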

holtskinner · 2023-08-08T18:24:07Z

I'm able to reproduce the behavior on my machine. I'm not sure why the samples tests didn't catch this. I'm going to keep investigating.

Also, I noticed that the sample JSON you attached does not entirely follow the expected format. It looks like it was modified after processing by Document AI, because some fields, such as pages.image, have invalid values. Note: you also need to provide just the document part, not the full response JSON.
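One way to strip the response wrapper is sketched below. It assumes the full response JSON keeps the Document under a top-level `document` key, as in a process response. `extract_document` and `write_document_only` are hypothetical helper names, not part of the Toolbox.

```python
import json
from typing import Any, Dict


def extract_document(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Return the Document part of a parsed JSON payload.

    Assumption: a full response wraps the Document under a "document" key,
    while a bare Document has no such key and is returned unchanged.
    """
    inner = payload.get("document")
    if isinstance(inner, dict):
        return inner
    return payload


def write_document_only(source_path: str, target_path: str) -> None:
    """Read a JSON file and write only its Document part to a new file."""
    with open(source_path, encoding="utf-8") as source:
        payload = json.load(source)
    with open(target_path, "w", encoding="utf-8") as target:
        json.dump(extract_document(payload), target)
```

The resulting file can then be used as the document path in the hOCR sample.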

amehmood-pls · 2023-08-09T10:09:01Z

I'm unsure about the invalid values such as pages.image. The JSON output was not modified after processing by Document AI. I created a processor based on Document OCR to produce the JSON.

* feat: Add export merged sharded Document proto
  - `to_documentai_document` exports a documentai Document proto from all of the shards in the wrapped Document
* fix: Refactor `_apply_text_offset()` to use original implementation with dictionary.
  - Found issue with implementation when trying to update test coverage
* chore: Update min python client library for documentai
* Update test constraints
* fix: Change test to not include indent
* fix: merge_document_shards_sample_test
* fix: Address lint error for type checking
* Fix lint error for incorrect typing
* Rename `to_documentai_document` to `to_merged_documentai_document`
* Change `to_merged_documentai_document()` to use a deepcopy instead of editing in place
* Add more specific type annotation to `_apply_text_offset()`
* fix: Fixed how template files are included in the library
  - Fixes #156
* 🦉 Updates from OwlBot post-processor
  See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md
* refactor: Updated `from_document_path()` to additionally support directory of shards
* fix: Fix type annotation

Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>

holtskinner · 2023-08-09T17:52:22Z

Release 0.10.0-alpha should resolve the PackageLoader issue
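To confirm which release is installed, a check along these lines may help. It is a sketch only: the distribution name `google-cloud-documentai-toolbox` is an assumption, and `release_tuple` and `has_template_fix` are hypothetical helper names. It compares only the numeric release segment and ignores pre-release suffixes such as `-alpha`.

```python
import re
from importlib import metadata
from typing import Tuple


def release_tuple(version: str) -> Tuple[int, ...]:
    """Parse the leading numeric release segment, e.g. '0.10.0a0' -> (0, 10, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def has_template_fix(dist_name: str = "google-cloud-documentai-toolbox") -> bool:
    """Return True if the installed release is at least 0.10."""
    try:
        installed = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return False
    return release_tuple(installed) >= (0, 10)


if __name__ == "__main__":
    print(has_template_fix())
```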

amehmood-pls · 2023-08-10T17:33:47Z

Thank you, @holtskinner!

holtskinner added a commit that referenced this issue Aug 9, 2023

fix: Fixed how template files are included in the library

c3da29a

- Fixes #156

holtskinner mentioned this issue Aug 9, 2023

feat: Add export merged sharded Document proto #145

Merged

holtskinner closed this as completed in #145 Aug 9, 2023

holtskinner self-assigned this Feb 2, 2024
