feat: added hOCR export functionality #123

galz10 · 2023-06-06T21:06:34Z

This PR is based on #111.

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

- Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
- Ensure the tests and linter pass
- Code coverage does not decrease (if any source code was changed)
- Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕

dizcology · 2023-06-15T17:32:29Z

google/cloud/documentai_toolbox/wrappers/document.py

+
+        Args:
+            filename (str):
+                Required. The filename of the original file.


What is the original filename? Is it the original pdf filename?

Is this information already available in the Document AI's Document message? It is a little odd that we are asking the caller to provide this. The expectation should be that the Toolbox's Document object knows how to present itself as an hOCR string.

The hOCR output needs a filename. I don't believe the Document AI response includes a filename or anything close to it, which is why we ask the caller to provide it.

In the current code, filename is used for the title. Is it the "title" of the whole hOCR document? If so, perhaps we should just call it title. Also, if it belongs only to the top level (the whole hOCR document), perhaps it does not need to be passed to the lower-level elements (e.g. the hOCR page).

Will do. I don't think filename gets passed down to lower elements.

Just checked: ocr_page also needs a title, which is why it is passed down to the page.
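
For illustration only (this is not the code in this PR): a minimal sketch of how a single caller-supplied title could feed both the document-level `<title>` and each `ocr_page` element. The function names `render_hocr_document` and `render_hocr_page` are hypothetical, and using the title as the `image` property of `ocr_page` is an assumption about how the title is consumed at the page level.

```python
from html import escape
from typing import List, Tuple


def render_hocr_page(title: str, page_number: int, width: int, height: int) -> str:
    """Render one empty ocr_page element (hypothetical helper).

    Assumption: the caller-supplied title is reused as the page's `image` property.
    """
    props = f'image "{title}"; bbox 0 0 {width} {height}; ppageno {page_number}'
    # Escape quotes so the properties are safe inside a double-quoted attribute.
    return (
        f'<div class="ocr_page" id="page_{page_number}" '
        f'title="{escape(props, quote=True)}"></div>'
    )


def render_hocr_document(title: str, page_sizes: List[Tuple[int, int]]) -> str:
    """Render a skeleton hOCR document; the title is passed down to every page."""
    pages = "\n".join(
        render_hocr_page(title, number, width, height)
        for number, (width, height) in enumerate(page_sizes)
    )
    return (
        "<html>\n<head>\n"
        f"<title>{escape(title)}</title>\n"
        "</head>\n<body>\n"
        f"{pages}\n"
        "</body>\n</html>"
    )
```

With this shape the caller supplies the title once at the document level, and the page renderer receives it as an argument rather than asking the caller again.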

dizcology · 2023-06-15T17:36:15Z

google/cloud/documentai_toolbox/wrappers/page.py

    text: str
+    tokens: List[Token]


Should this be calculated? As is right now when a (Toolbox) Line is instantiated, the caller has to provide the underlying documentai_line, its text, and its tokens, and it seems like both text and tokens can be calculated from documentai_line (and maybe the page containing the line).

If I understand this correctly, this class is to be used only when we create a Toolbox Document, and the user of the library should never have to write code like wrapped_line = Line(documentai_line=..., text=..., tokens=[...]). Is that correct? If so, let's at least update the docstring to say that users of the library should not instantiate objects of this class (and some other classes) directly. Even then, I would still expect things like text and tokens to eventually be lazily calculated instead of being passed in as init arguments.

Yes, users should never wrap their own Line, Entity, Paragraph, etc. They should only use the Document.from_ methods to wrap a whole document, not specific parts of a document. My thinking is that if these are lazily calculated, we'll need to pass in the whole page to be able to access things. I think the current approach is okay for the initial release, and if there are performance issues we can refactor to do lazy loading.

Sounds good on not worrying about lazy calculation until later.

On the other hand, let's evaluate whether the whole page should be passed to every element as a private attribute. Only a reference to the page will be passed, so there will be no copies. As it is right now, every time an element with layout needs to look up some text, the caller of the method has to pass in a page, since only the page has the text (is this true?).

There seems to be no harm in including the page of an element (Block, Line, etc.) as a private attribute of the element, and it would simplify the code (we would no longer need to pass text into the methods).

This is assuming that every element belongs to only one page, though. Do we know this is true?

Every element that is loaded is associated with a specific page, so as long as we pass in the page I think we will be fine. I'll try passing the page as a reference.
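
A rough sketch of the idea discussed above, not the PR's implementation: each element keeps a private reference to its page and derives its text on first access. `FakeLayoutElement` is a hypothetical stand-in for the Document AI message, and the `(start, end)` offsets are assumed here to index into the page's text.

```python
from dataclasses import dataclass, field
from functools import cached_property
from typing import List, Tuple


@dataclass
class FakeLayoutElement:
    """Hypothetical stand-in for a documentai element: (start, end) text offsets."""

    segments: List[Tuple[int, int]]


@dataclass
class Page:
    text: str
    lines: List["Line"] = field(default_factory=list)


@dataclass
class Line:
    documentai_object: FakeLayoutElement
    # Only a reference is stored, so the page is not copied per element.
    _page: Page = field(repr=False)

    @cached_property
    def text(self) -> str:
        # Computed on first access and cached on the instance afterwards.
        return "".join(
            self._page.text[start:end]
            for start, end in self.documentai_object.segments
        )
```

With this shape, callers no longer pass `text` into the constructor or into methods; the element looks it up through `_page` when needed.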

dizcology · 2023-06-15T17:40:38Z

google/cloud/documentai_toolbox/wrappers/page.py

+
+    documentai_formfield: documentai.Document.Page.FormField
+    field_name: str
+    field_value: str


Same comment here: shouldn't field_name and field_value already be inside documentai_formfield?

I don't know why this is showing up as a new change. I did not add this; it is from Holt's addition. The variables field_name and field_value are meant to make access easier, similar to entity.type_ and entity.mention_text.

Did you perhaps merge/rebase with Holt's PR at some point? It seems like we will encounter a pretty bad merge conflict if this is not cleaned up very soon.

No, this was pulled from main, not from Holt's PR. Once Holt's PR is submitted, I will probably need to change some of the implementation.

dizcology · 2023-06-15T17:50:19Z

google/cloud/documentai_toolbox/wrappers/page.py

-def _get_blocks(blocks: List[documentai.Document.Page.Block], text: str) -> List[Block]:
-    r"""Returns a list of Block.
+def _get_tokens_in_line(
+    line: documentai.Document.Page.Line, tokens: List[Token]


These helper functions _get_tokens_in_line, _get_lines_in_paragraph, and _get_paragraphs_in_blocks look identical except for different local variable names, but I might have missed something. If they are indeed identical, how about having a single function instead?

For example:

def _get_elements_in_parent(parent: ElementWithLayout) -> List[ElementWithLayout]: ...

The return type is different for each: _get_tokens_in_line returns List[Token], _get_lines_in_paragraph returns List[Line], and _get_paragraphs_in_blocks returns List[Paragraph].

I think there are fancy ways to properly express that with TypeVar, but we probably don't gain much doing that. Is there anything wrong with returning a list of ElementWithLayout, which is a union type? (Technically this would allow the function to return a list of elements that may be of different types, but that may be okay for now.)

It's not returning the wrapped versions of Line, Paragraph, etc., so we wouldn't be able to use ElementWithLayout, but we could probably define a similar type. Would each type need to have the same attributes? For example, Paragraph has similar variables except that it has lines, while Line has tokens. Would that still be fine?

dizcology · 2023-06-22T17:20:37Z

google/cloud/documentai_toolbox/wrappers/page.py

+
+def _get_hocr_bounding_box(
+    element_with_layout: ElementWithLayout,
+    dimension: documentai.Document.Page.Dimension,


Is the page's dimension already included in the element's _page?
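
As a sketch of what such a helper might compute (an illustration, not the PR's code): normalized vertices are scaled by the page dimension and reduced to the `bbox x0 y0 x1 y1` form that hOCR uses. `NormalizedVertex` and `Dimension` below are simplified stand-ins for the Document AI types, and `hocr_bounding_box` is a hypothetical name.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NormalizedVertex:
    """Stand-in for a vertex with coordinates in the range [0, 1]."""

    x: float
    y: float


@dataclass
class Dimension:
    """Stand-in for the page dimension in pixels."""

    width: float
    height: float


def hocr_bounding_box(vertices: List[NormalizedVertex], dimension: Dimension) -> str:
    """Scale normalized vertices to pixels and format an hOCR bbox property."""
    min_x = min(vertex.x for vertex in vertices)
    min_y = min(vertex.y for vertex in vertices)
    max_x = max(vertex.x for vertex in vertices)
    max_y = max(vertex.y for vertex in vertices)
    return (
        f"bbox {round(min_x * dimension.width)} {round(min_y * dimension.height)} "
        f"{round(max_x * dimension.width)} {round(max_y * dimension.height)}"
    )
```

If every element already holds a reference to its page, the `dimension` argument could be read from that page instead of being passed in separately, which is the point of the question above.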

- Changed page elements to be initialized with document text instead of page text.
- Caused all paragraphs/blocks/etc. after the first page to be empty.
- Introduced in #123
- https://github.com/googleapis/python-documentai-toolbox/pull/123/files#diff-af92c6de8f8e84ca66d2fb9fa7e9bddb5bd644e944153bf7a78d35f47c05853eR251
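
A small sketch of how this kind of bug can arise, assuming that element text offsets index into the full document text rather than into a single page's text. The helper `text_from_segments` and the sample strings are hypothetical.

```python
from typing import List, Tuple


def text_from_segments(text: str, segments: List[Tuple[int, int]]) -> str:
    """Join the slices of `text` named by (start, end) offsets."""
    return "".join(text[start:end] for start, end in segments)


document_text = "Page one text.\nPage two text.\n"
page_two_text = "Page two text.\n"

# Offsets of a paragraph on page two, expressed relative to the whole document.
page_two_segments = [(15, 30)]

# Slicing the document text recovers the paragraph.
from_document = text_from_segments(document_text, page_two_segments)

# Slicing only the page's own text runs past its end and yields an empty string.
from_page = text_from_segments(page_two_text, page_two_segments)
```

Elements on the first page are unaffected in this sketch because their offsets happen to fall inside both strings, which matches the symptom of only later pages coming out empty.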

galz10 added 13 commits April 10, 2023 14:14

chore: edit get_storage_client to add module name

53742c5

added module name to get_bytes

ebe3a36

fixed failing test

00057e6

chore: added hocr

fe033ef

removed test files

9d264a3

Merge branch 'analytic-changes' of https://github.com/galz10/python-documentai-toolbox into analytic-changes

61e5ed8

revised code per comments

93ca69b

Merge branch 'googleapis:main' into analytic-changes

3c70d0a

feat: added hOCR export functionality

2fe42ab

Merge branch 'googleapis:main' into analytic-changes-2

bc1505f

changed line_text to use line.text

38edc81

Merge branch 'analytic-changes-2' of https://github.com/galz10/python-documentai-toolbox into analytic-changes-2

5110aee

added tests

686ed37

product-auto-label bot added the size: xl label Jun 6, 2023

galz10 requested a review from dizcology June 6, 2023 21:07

fix lint failure

f13673f

dizcology reviewed Jun 15, 2023

revised code

879b4bb

dizcology reviewed Jun 16, 2023

google/cloud/documentai_toolbox/wrappers/page.py (outdated)

dizcology reviewed Jun 16, 2023

google/cloud/documentai_toolbox/wrappers/page.py (outdated)

galz10 added 3 commits June 16, 2023 11:35

revise code

b19d9a4

refactored code

fc2d5de

refactored code

7b3443e

dizcology requested changes Jun 21, 2023

galz10 added 2 commits June 22, 2023 09:41

expanded test_Page

9baac10

refactored code

47055e2

dizcology requested changes Jun 22, 2023

galz10 added 2 commits June 22, 2023 15:00

refactored code

6f8b516

refactored code

172df5d

galz10 requested a review from dizcology June 23, 2023 17:09

gcf-owl-bot bot removed the owlbot:run label Jun 27, 2023

🦉 Updates from OwlBot post-processor

f9df644

See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md

galz10 requested a review from holtskinner June 27, 2023 16:07

holtskinner added 2 commits June 27, 2023 16:55

Merge branch 'main' into analytic-changes-2

45041f6

Merge branch 'main' into analytic-changes-2

fe8d1db

holtskinner added the owlbot:run label Jun 28, 2023

gcf-owl-bot bot removed the owlbot:run label Jun 28, 2023

galz10 added 2 commits June 28, 2023 12:55

templated hocr file format

1efd731

Merge branch 'analytic-changes-2' of https://github.com/galz10/python-documentai-toolbox into analytic-changes-2

ca6a256

dizcology reviewed Jun 28, 2023

google/cloud/documentai_toolbox/wrappers/document.py

dizcology reviewed Jun 28, 2023

google/cloud/documentai_toolbox/wrappers/page.py

dizcology reviewed Jun 28, 2023

google/cloud/documentai_toolbox/wrappers/page.py (outdated)

galz10 added the owlbot:run label Jun 28, 2023

gcf-owl-bot bot removed the owlbot:run label Jun 28, 2023

dizcology reviewed Jun 28, 2023

google/cloud/documentai_toolbox/wrappers/page.py (outdated)

dizcology reviewed Jun 28, 2023

google/cloud/documentai_toolbox/wrappers/page.py

holtskinner approved these changes Jun 28, 2023

dizcology reviewed Jun 28, 2023

google/cloud/documentai_toolbox/wrappers/page.py

dizcology approved these changes Jun 28, 2023

galz10 added 2 commits June 28, 2023 14:59

refactored code

8f35f5f

fixed failing test

2646078

galz10 added the owlbot:run label Jun 28, 2023

gcf-owl-bot bot removed the owlbot:run label Jun 28, 2023

🦉 Updates from OwlBot post-processor

e7b7174

See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md

galz10 merged commit 87d2fc1 into googleapis:main Jun 28, 2023
21 checks passed

galz10 deleted the analytic-changes-2 branch June 28, 2023 22:46

release-please bot mentioned this pull request Jun 28, 2023

chore(main): release 0.9.0-alpha #136

Merged

holtskinner mentioned this pull request Oct 20, 2023

fix: Empty Page SubElements #186

Merged
