Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: added hOCR export functionality #123

Merged
merged 38 commits into from
Jun 28, 2023

Conversation

galz10
Copy link
Collaborator

@galz10 galz10 commented Jun 6, 2023

This PR is based off of #111

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕

@product-auto-label product-auto-label bot added the size: xl Pull request size is extra large. label Jun 6, 2023
@galz10 galz10 requested a review from dizcology June 6, 2023 21:07

Args:
filename (str):
Required. The filename of the original file.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the original filename? Is it the original pdf filename?

Is this information already available in the Document AI's Document message? It is a little odd that we are asking the caller to provide this. The expectation should be that the Toolbox's Document object knows how to present itself as an hOCR string.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need the filename for the hOCR, i don't believe the DocAI response has any filename or something close to it so that's why we ask the caller for that information.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the current code, filename is used for title, is it the "title" of the whole hOCR document? If so perhaps we should just call it title? Also if it belongs only to the top level (the whole hOCR document), perhaps it does not need to be passed to the lower level elements (e.g. hOCR page).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do, i don't think filename gets passed down to lower elements.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checked, ocr_page needs a title also so that's why it's passed down to page

text: str
tokens: List[Token]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be calculated? As is right now when a (Toolbox) Line is instantiated, the caller has to provide the underlying documentai_line, its text, and its tokens, and it seems like both text and tokens can be calculated from documentai_line (and maybe the page containing the line).

If I understand this correctly, this class is to be used only when we create a Toolbox Document, and the user of the library should never have to write code like wrapped_line = Line(documentai_line=..., text=..., token=[...]). Is that correct? If so let's at least update the docstring and say something like the user of the library should not instantiate objects of this class (and some other classes) directly. But even then, I would still expect that things like text and tokens to be eventually lazily calculated, instead of having to be passed in as init arguments.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea users should never wrap their own Line,Entity,Paragraph,etc they should only use the Document.from_ to wrap a whole document not specific parts of a document. My thought process if they are lazy calculated we'll need to pass in the whole page to be able to access things. I think for the initial release this might be okay and if there are performance issues we can go and refactor to do lazy loading.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good on not worrying about lazy calculation until later.

On the other hand, let's evaluate whether the whole page should be passed to every element as a private attribute. Only a reference of the page will be passed so there will be no copies. As is right now, every time an element with layout needs to look up some text, the caller of the method has to pass in a page, since only the page has the text (is this true?).

There seems no harm to just include the page of an element (Block, Line, etc) as a private attribute of the element, and it would simplify the code (we will no longer need to pass in text in the methods).

This is assuming that every element belongs to only one page, though. Do we know this is true?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it every element that is loaded is associated to a specific page so as long as we pass in the page i think we will be fine. I'll try out passing the page as reference.


documentai_formfield: documentai.Document.Page.FormField
field_name: str
field_value: str
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here, shouldn't field_name and field_value already inside of documentai_formfield?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know why this is showing up as a new change but this i did not add this, this is from Holts addition. The variables field_name and field_value and meant to make it easier to access similar to entity.type_ and entity.mention_text

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you perhaps merge/rebase with Holt's PR at some point? It seems like we will encounter a pretty bad merge conflict if this is not cleaned up very soon.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No this was pulled from Main, not holts PR but once holt's PR is submitted i will need to change some of the implementation probably

google/cloud/documentai_toolbox/wrappers/page.py Outdated Show resolved Hide resolved
def _get_blocks(blocks: List[documentai.Document.Page.Block], text: str) -> List[Block]:
r"""Returns a list of Block.
def _get_tokens_in_line(
line: documentai.Document.Page.Line, tokens: List[Token]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These helper functions _get_tokens_in_line, _get_lines_in_paragraph, and _get_paragraphs_in_blocks look identical except for different local variable names, but I might have missed something. If they are indeed identical, how about have a single function instead?

For example:

def _get_elements_in_parent(parent: ElementWithLayout) -> List[ElementWithLayout]:
  ...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the return type is different for them, _get_tokens_in_line returns List[Token], _get_lines_in_paragraph returns List[Line], _get_paragraphs_in_blocks return List[Paragraph]

Copy link
Collaborator

@dizcology dizcology Jun 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are fancy ways to properly express that with TypeVar, but we probably don't gain much doing that. Is there anything wrong with returning a list of ElementWithLayout, which is a union type? (Technically this would allow the function to return a list of elements that may be of different types, but that may be okay for now.)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it's not returning the Wrapped version of Lines,Paragraph,etc... so we wouldn't be able to use ElementWithLayout but we can probably make a similar type, would each type need to have the same objects for example Paragraph has similar variables except lines and Line has tokens , would that still be fine?

google/cloud/documentai_toolbox/wrappers/page.py Outdated Show resolved Hide resolved
google/cloud/documentai_toolbox/wrappers/page.py Outdated Show resolved Hide resolved
google/cloud/documentai_toolbox/wrappers/page.py Outdated Show resolved Hide resolved
google/cloud/documentai_toolbox/wrappers/page.py Outdated Show resolved Hide resolved
google/cloud/documentai_toolbox/wrappers/page.py Outdated Show resolved Hide resolved
google/cloud/documentai_toolbox/wrappers/page.py Outdated Show resolved Hide resolved
google/cloud/documentai_toolbox/wrappers/page.py Outdated Show resolved Hide resolved
google/cloud/documentai_toolbox/wrappers/page.py Outdated Show resolved Hide resolved
google/cloud/documentai_toolbox/wrappers/page.py Outdated Show resolved Hide resolved
google/cloud/documentai_toolbox/wrappers/page.py Outdated Show resolved Hide resolved
google/cloud/documentai_toolbox/wrappers/page.py Outdated Show resolved Hide resolved

def _get_hocr_bounding_box(
element_with_layout: ElementWithLayout,
dimension: documentai.Document.Page.Dimension,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the page's dimension already included in the element's _page?

google/cloud/documentai_toolbox/wrappers/page.py Outdated Show resolved Hide resolved
google/cloud/documentai_toolbox/wrappers/page.py Outdated Show resolved Hide resolved
google/cloud/documentai_toolbox/wrappers/page.py Outdated Show resolved Hide resolved
google/cloud/documentai_toolbox/wrappers/page.py Outdated Show resolved Hide resolved
@galz10 galz10 requested a review from dizcology June 23, 2023 17:09
@gcf-owl-bot gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label Jun 27, 2023
@galz10 galz10 requested a review from holtskinner June 27, 2023 16:07
@holtskinner holtskinner added the owlbot:run Add this label to trigger the Owlbot post processor. label Jun 28, 2023
@gcf-owl-bot gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label Jun 28, 2023
@galz10 galz10 added the owlbot:run Add this label to trigger the Owlbot post processor. label Jun 28, 2023
@gcf-owl-bot gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label Jun 28, 2023
@galz10 galz10 added the owlbot:run Add this label to trigger the Owlbot post processor. label Jun 28, 2023
@gcf-owl-bot gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label Jun 28, 2023
@galz10 galz10 merged commit 87d2fc1 into googleapis:main Jun 28, 2023
21 checks passed
@galz10 galz10 deleted the analytic-changes-2 branch June 28, 2023 22:46
holtskinner added a commit that referenced this pull request Oct 20, 2023
- Changed page elements to be initalized with document text instead of page text.
  - Caused all paragraphs/blocks/etc after the first page to be empty.
- Introduced in #123
  - https://github.com/googleapis/python-documentai-toolbox/pull/123/files#diff-af92c6de8f8e84ca66d2fb9fa7e9bddb5bd644e944153bf7a78d35f47c05853eR251
holtskinner added a commit that referenced this pull request Oct 23, 2023
- Changed page elements to be initalized with document text instead of page text.
  - Caused all paragraphs/blocks/etc after the first page to be empty.
- Introduced in #123
  - https://github.com/googleapis/python-documentai-toolbox/pull/123/files#diff-af92c6de8f8e84ca66d2fb9fa7e9bddb5bd644e944153bf7a78d35f47c05853eR251
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size: xl Pull request size is extra large.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants