Discrepancies in Properties and Format within hocr_document_template.xml.j2 #156

Closed
amehmood-pls opened this issue Aug 8, 2023 · 6 comments · Fixed by #145

@amehmood-pls

I'm trying to create an hOCR file from a JSON file that I generated through DocumentAI. However, the hocr_document_template.xml.j2 template seems to expect a different structure: paragraphs that contain nested lines, with each line in turn containing nested words. The JSON produced by DocumentAI doesn't follow this format. The template also uses property names such as hocr_bounding_box, which are absent from the JSON file (attached).

Adding to this, the code repository displays a templates folder at google/cloud/documentai_toolbox/templates/. Nonetheless, upon installing google.cloud.documentai_toolbox via pip, the templates folder appears to be missing.

wikipedia-sample.json.txt
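
For illustration, the nesting described above (paragraphs containing lines, lines containing words) can be sketched roughly as follows. This is a minimal sketch and not the toolbox's implementation: the `Span` class and the `nest` helper are hypothetical names, and it assumes each element can be identified by a start/end offset into the document text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    """A text range [start, end) plus the smaller spans nested inside it."""
    start: int
    end: int
    children: List["Span"] = field(default_factory=list)


def nest(parents: List[Span], children: List[Span]) -> List[Span]:
    """Attach each child to the first parent whose range fully contains it."""
    for child in children:
        for parent in parents:
            if parent.start <= child.start and child.end <= parent.end:
                parent.children.append(child)
                break
    return parents


# Flat lists, standing in for elements that sit side by side in the JSON
# (illustrative offsets only, not taken from the attached file)
paragraphs = [Span(0, 20)]
lines = [Span(0, 10), Span(10, 20)]
words = [Span(0, 5), Span(5, 10), Span(10, 20)]

nest(lines, words)       # words go inside lines
nest(paragraphs, lines)  # lines go inside paragraphs
```

After the two calls, `paragraphs[0].children` holds the two lines, and each line holds its own words.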

@holtskinner
Member

holtskinner commented Aug 8, 2023

For clarification, the Document object structure directly from Document AI doesn't follow the hOCR structure with lines nested inside paragraphs, etc. But the Document "wrapper" class within the Document AI Toolbox library does follow that nested structure as of v0.9.0-alpha. It can import a Document AI Document and convert it to the wrapped document structure, which can be exported into the hOCR format.

This code sample shows how to convert a Document AI Document JSON file into an hOCR string.

https://cloud.google.com/document-ai/docs/toolbox#hocr-conversion
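
For readers who can't open the link, the conversion likely looks something like the sketch below. It uses only the calls that appear elsewhere in this thread (`from_document_path()` and `export_hocr_str(title=...)`); the function name is hypothetical, and the import is placed inside the function only so the snippet can be loaded without the toolbox installed.

```python
def convert_document_to_hocr(document_path: str, document_title: str) -> str:
    """Wrap a Document AI Document JSON file and export it as an hOCR string."""
    # Assumption: google-cloud-documentai-toolbox is installed.
    # Imported lazily so defining this function does not require the package.
    from google.cloud.documentai_toolbox import document

    # Build the wrapped (nested) document from the JSON file on disk
    wrapped_document = document.Document.from_document_path(
        document_path=document_path
    )

    # Render the wrapped document through the hOCR template
    return wrapped_document.export_hocr_str(title=document_title)
```

Calling it with the path to a Document JSON file and a title should return the hOCR markup as a string.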

As for the templates folder being missing: can you try the sample code first and report back on whether it works as expected?

@amehmood-pls
Author

Thanks for the clarification. Upon executing the sample code, the following error is displayed.

Traceback (most recent call last):
  File "/home/am/ocr/hocr.py", line 26, in <module>
    hocr_string = convert_document_to_hocr_sample(document_path, document_title)
  File "/home/am/ocr/hocr.py", line 21, in convert_document_to_hocr_sample
    hocr_string = wrapped_document.export_hocr_str(title=document_title)
  File "/home/am/anaconda3/lib/python3.10/site-packages/google/cloud/documentai_toolbox/wrappers/document.py", line 769, in export_hocr_str
    loader=PackageLoader("google.cloud.documentai_toolbox", "templates")
  File "/home/am/anaconda3/lib/python3.10/site-packages/jinja2/loaders.py", line 323, in __init__
    raise ValueError(
ValueError: The 'google.cloud.documentai_toolbox' package was not installed in a way that PackageLoader understands.
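
One way to confirm that the error comes from a missing templates directory is to check the installed package on disk. The helper below is a rough sketch using only the standard library; `package_has_subdir` is a hypothetical name, and it only approximates the kind of lookup PackageLoader performs.

```python
import importlib.util
from pathlib import Path


def package_has_subdir(package_name: str, subdir: str) -> bool:
    """Return True if an installed package contains the given subdirectory."""
    try:
        spec = importlib.util.find_spec(package_name)
    except ImportError:
        # A parent package (e.g. "google.cloud") is not importable at all
        return False
    if spec is None or not spec.submodule_search_locations:
        # Not installed, or a plain module rather than a package
        return False
    return any(
        (Path(location) / subdir).is_dir()
        for location in spec.submodule_search_locations
    )


# In the environment from the traceback, this would be expected to print False
print(package_has_subdir("google.cloud.documentai_toolbox", "templates"))
```

If this prints `False` while the package itself imports fine, the templates were most likely left out of the installed distribution.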

@holtskinner
Member

I'm able to reproduce the behavior on my machine. I'm not sure why the samples tests didn't catch this. I'm going to keep investigating.

Also, I noticed that the sample JSON you attached does not entirely follow the correct format. It looks like it was modified after processing by Document AI, because some fields have invalid values, like pages.image. Note: you also need to provide just the document part, not the full response JSON.
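
To illustrate the last point, here is a minimal sketch of pulling the document part out of a saved response. It assumes the full response JSON keeps the Document under a top-level `document` key; `extract_document` is a hypothetical helper, and the sample payload is made up for the example.

```python
import json


def extract_document(payload: dict) -> dict:
    """Return the Document part of a response, or the payload if already bare."""
    # Assumption: a full response nests the Document under "document"
    inner = payload.get("document")
    if isinstance(inner, dict):
        return inner
    return payload


# Made-up response text standing in for a saved processing result
response_text = '{"document": {"text": "Hello", "pages": []}, "humanReviewStatus": {}}'
document_only = extract_document(json.loads(response_text))

# document_only can then be written back out and passed to the toolbox
document_json = json.dumps(document_only)
```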

@amehmood-pls
Author

I'm not sure about the invalid values like page.image. The JSON output was not changed after processing by Document AI. I created a processor based on Document OCR to produce the JSON.

holtskinner added a commit that referenced this issue Aug 9, 2023
* feat: Add export merged sharded Document proto

- `to_documentai_document` exports a documentai Document proto from all of the shards in the wrapped Document

* fix: Refactor `_apply_text_offset()` to use original implementation with dictionary.

- Found issue with implementation when trying to update test coverage

* chore: Update min python client library for documentai

* Update test constraints

* fix: Change test to not include indent

* fix: merge_document_shards_sample_test

* fix: Address lint error for type checking

* Fix lint error for incorrect typing

* Rename `to_documentai_document` to `to_merged_documentai_document`

* Change `to_merged_documentai_document()` to use a deepcopy instead of editing in place

* Add more specific type annotation to `_apply_text_offset()`

* fix: Fixed how template files are included in the library

- Fixes #156

* 🦉 Updates from OwlBot post-processor

See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md

* refactor: Updated `from_document_path()` to additionally support directory of shards

* fix: Fix type annotation

---------

Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
@holtskinner
Member

Release 0.10.0-alpha should resolve the PackageLoader issue

@amehmood-pls
Author

Thank you, @holtskinner!

@holtskinner holtskinner self-assigned this Feb 2, 2024