Discrepancies in Properties and Format within hocr_document_template.xml.j2 #156
Comments
For clarification, the This code sample shows how to convert a Document AI https://cloud.google.com/document-ai/docs/toolbox#hocr-conversion As for the templates folder being missing: Can you try the sample code first and report back on if it works as expected? |
Thanks for the clarification. Upon executing the sample code, the following error is displayed. Traceback (most recent call last): |
I'm able to reproduce the behavior on my machine. Not sure why the samples tests didn't catch this. I'm going to keep investigating Also, I noticed that the sample json you attached does not entirely follow the correct format, it looks like it was modified after processing by Document AI, because some fields have invalid values, like |
Uncertain about the invalid values like |
* feat: Add export merged sharded Document proto
  - `to_documentai_document` exports a documentai Document proto from all of the shards in the wrapped Document
* fix: Refactor `_apply_text_offset()` to use original implementation with dictionary.
  - Found issue with implementation when trying to update test coverage
* chore: Update min python client library for documentai
* Update test constraints
* fix: Change test to not include indent
* fix: merge_document_shards_sample_test
* fix: Address lint error for type checking
* Fix lint error for incorrect typing
* Rename `to_documentai_document` to `to_merged_documentai_document`
* Change `to_merged_documentai_document()` to use a deepcopy instead of editing in place
* Add more specific type annotation to `_apply_text_offset()`
* fix: Fixed how template files are included in the library
  - Fixes #156
* 🦉 Updates from OwlBot post-processor
  See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md
* refactor: Updated `from_document_path()` to additionally support directory of shards
* fix: Fix type annotation

---------

Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
Release 0.10.0-alpha should resolve the |
Thank you, @holtskinner!
I'm trying to create an hOCR file from a JSON file that I generated with Document AI. However, the hocr_document_template.xml.j2 template seems to expect a different structure: paragraphs that contain nested lines, with each line in turn containing nested words. The JSON produced by Document AI does not follow this format. The template also uses property names such as hocr_bounding_box, which are absent from the JSON file (attached).
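To illustrate the mismatch described above, here is a rough sketch of how the flat `paragraphs`, `lines`, and `tokens` lists of a Document AI page could be nested by comparing their text-anchor offsets, and how an hOCR-style `bbox x0 y0 x1 y1` string could be derived from normalized vertices. It works on the plain JSON dict, not on the toolbox's wrapper classes. The helper names (`segment_range`, `hocr_bbox`, `children_within`, `nest_page`) are made up for this example and are not the toolbox's API; the toolbox may compute `hocr_bounding_box` differently.

```python
def segment_range(layout):
    """Return the (start, end) character offsets covered by a layout's text anchor."""
    segments = layout.get("textAnchor", {}).get("textSegments", [])
    if not segments:
        return (0, 0)
    # In the JSON form, int64 offsets are strings and startIndex is omitted when 0.
    starts = [int(s.get("startIndex", 0)) for s in segments]
    ends = [int(s.get("endIndex", 0)) for s in segments]
    return (min(starts), max(ends))


def hocr_bbox(layout, width, height):
    """Build an hOCR-style 'bbox x0 y0 x1 y1' string (assumed pixel units)."""
    vertices = layout.get("boundingPoly", {}).get("normalizedVertices", [])
    if not vertices:
        return "bbox 0 0 0 0"
    xs = [v.get("x", 0.0) * width for v in vertices]
    ys = [v.get("y", 0.0) * height for v in vertices]
    return f"bbox {round(min(xs))} {round(min(ys))} {round(max(xs))} {round(max(ys))}"


def children_within(parent, candidates):
    """Keep the candidates whose text range lies inside the parent's text range."""
    p_start, p_end = segment_range(parent.get("layout", {}))
    kept = []
    for candidate in candidates:
        start, end = segment_range(candidate.get("layout", {}))
        if start < end and p_start <= start and end <= p_end:
            kept.append(candidate)
    return kept


def nest_page(page):
    """Turn one page's flat lists into paragraphs -> lines -> words."""
    dimension = page.get("dimension", {})
    width, height = dimension.get("width", 0), dimension.get("height", 0)
    nested = []
    for paragraph in page.get("paragraphs", []):
        lines = []
        for line in children_within(paragraph, page.get("lines", [])):
            words = [
                {"bbox": hocr_bbox(token.get("layout", {}), width, height),
                 "range": segment_range(token.get("layout", {}))}
                for token in children_within(line, page.get("tokens", []))
            ]
            lines.append({"bbox": hocr_bbox(line.get("layout", {}), width, height),
                          "words": words})
        nested.append({"bbox": hocr_bbox(paragraph.get("layout", {}), width, height),
                       "lines": lines})
    return nested
```

Calling `nest_page(doc["pages"][0])` on the parsed JSON would give a paragraph/line/word tree of the shape the template iterates over, assuming the file still has the field layout that Document AI emits.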
In addition, the code repository contains a templates folder at google/cloud/documentai_toolbox/templates/. However, after installing google.cloud.documentai_toolbox via pip, the templates folder appears to be missing.
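One way to confirm whether the templates folder made it into an installed copy is to look inside the package's install location. The snippet below is a minimal sketch using only the standard library; `list_packaged_files` is a hypothetical helper written for this check, and it assumes the package is installed as ordinary files on disk (not zipped).

```python
from importlib import util
from pathlib import Path


def list_packaged_files(package, subdir="templates"):
    """Return the file names found in <package>/<subdir> of the installed package.

    An empty list means the package is not installed or the folder is missing.
    """
    try:
        # For dotted names, find_spec imports the parent packages first.
        spec = util.find_spec(package)
    except (ImportError, ValueError):
        return []
    if spec is None or not spec.submodule_search_locations:
        return []
    names = []
    for location in spec.submodule_search_locations:
        folder = Path(location) / subdir
        if folder.is_dir():
            names.extend(sorted(p.name for p in folder.iterdir() if p.is_file()))
    return names


if __name__ == "__main__":
    found = list_packaged_files("google.cloud.documentai_toolbox")
    print(found or "templates folder not found in the installed package")
```

If this prints the "not found" message for a version where the repository does contain the folder, the template files were most likely left out of the built distribution.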
wikipedia-sample.json.txt