-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: added hOCR export functionality #123
Conversation
…ocumentai-toolbox into analytic-changes
…-documentai-toolbox into analytic-changes-2
|
||
Args: | ||
filename (str): | ||
Required. The filename of the original file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the original filename? Is it the original pdf filename?
Is this information already available in the Document AI's Document message? It is a little odd that we are asking the caller to provide this. The expectation should be that the Toolbox's Document object knows how to present itself as an hOCR string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need the filename for the hOCR, i don't believe the DocAI response has any filename or something close to it so that's why we ask the caller for that information.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the current code, filename
is used for title
, is it the "title" of the whole hOCR document? If so perhaps we should just call it title
? Also if it belongs only to the top level (the whole hOCR document), perhaps it does not need to be passed to the lower level elements (e.g. hOCR page).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do, i don't think filename gets passed down to lower elements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just checked, ocr_page needs a title also so that's why it's passed down to page
text: str | ||
tokens: List[Token] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be calculated? As is right now when a (Toolbox) Line is instantiated, the caller has to provide the underlying documentai_line
, its text
, and its tokens
, and it seems like both text
and tokens
can be calculated from documentai_line
(and maybe the page containing the line).
If I understand this correctly, this class is to be used only when we create a Toolbox Document, and the user of the library should never have to write code like wrapped_line = Line(documentai_line=..., text=..., token=[...])
. Is that correct? If so let's at least update the docstring and say something like the user of the library should not instantiate objects of this class (and some other classes) directly. But even then, I would still expect that things like text
and tokens
to be eventually lazily calculated, instead of having to be passed in as init arguments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea users should never wrap their own Line,Entity,Paragraph,etc
they should only use the Document.from_ to wrap a whole document not specific parts of a document. My thought process if they are lazy calculated we'll need to pass in the whole page to be able to access things. I think for the initial release this might be okay and if there are performance issues we can go and refactor to do lazy loading.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good on not worrying about lazy calculation until later.
On the other hand, let's evaluate whether the whole page should be passed to every element as a private attribute. Only a reference of the page will be passed so there will be no copies. As is right now, every time an element with layout needs to look up some text, the caller of the method has to pass in a page, since only the page has the text (is this true?).
There seems no harm to just include the page of an element (Block, Line, etc) as a private attribute of the element, and it would simplify the code (we will no longer need to pass in text
in the methods).
This is assuming that every element belongs to only one page, though. Do we know this is true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So it every element that is loaded is associated to a specific page so as long as we pass in the page i think we will be fine. I'll try out passing the page as reference.
|
||
documentai_formfield: documentai.Document.Page.FormField | ||
field_name: str | ||
field_value: str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment here, shouldn't field_name
and field_value
already inside of documentai_formfield
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know why this is showing up as a new change but this i did not add this, this is from Holts addition. The variables field_name
and field_value
and meant to make it easier to access similar to entity.type_
and entity.mention_text
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you perhaps merge/rebase with Holt's PR at some point? It seems like we will encounter a pretty bad merge conflict if this is not cleaned up very soon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No this was pulled from Main, not holts PR but once holt's PR is submitted i will need to change some of the implementation probably
def _get_blocks(blocks: List[documentai.Document.Page.Block], text: str) -> List[Block]: | ||
r"""Returns a list of Block. | ||
def _get_tokens_in_line( | ||
line: documentai.Document.Page.Line, tokens: List[Token] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These helper functions _get_tokens_in_line
, _get_lines_in_paragraph
, and _get_paragraphs_in_blocks
look identical except for different local variable names, but I might have missed something. If they are indeed identical, how about have a single function instead?
For example:
def _get_elements_in_parent(parent: ElementWithLayout) -> List[ElementWithLayout]:
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the return type is different for them, _get_tokens_in_line
returns List[Token], _get_lines_in_paragraph
returns List[Line], _get_paragraphs_in_blocks
return List[Paragraph]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there are fancy ways to properly express that with TypeVar, but we probably don't gain much doing that. Is there anything wrong with returning a list of ElementWithLayout
, which is a union type? (Technically this would allow the function to return a list of elements that may be of different types, but that may be okay for now.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So it's not returning the Wrapped version of Lines,Paragraph,etc...
so we wouldn't be able to use ElementWithLayout
but we can probably make a similar type, would each type need to have the same objects for example Paragraph
has similar variables except lines
and Line
has tokens
, would that still be fine?
|
||
def _get_hocr_bounding_box( | ||
element_with_layout: ElementWithLayout, | ||
dimension: documentai.Document.Page.Dimension, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the page's dimension already included in the element's _page
?
- Changed page elements to be initalized with document text instead of page text. - Caused all paragraphs/blocks/etc after the first page to be empty. - Introduced in #123 - https://github.com/googleapis/python-documentai-toolbox/pull/123/files#diff-af92c6de8f8e84ca66d2fb9fa7e9bddb5bd644e944153bf7a78d35f47c05853eR251
- Changed page elements to be initalized with document text instead of page text. - Caused all paragraphs/blocks/etc after the first page to be empty. - Introduced in #123 - https://github.com/googleapis/python-documentai-toolbox/pull/123/files#diff-af92c6de8f8e84ca66d2fb9fa7e9bddb5bd644e944153bf7a78d35f47c05853eR251
This PR is based off of #111
Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:
Fixes #<issue_number_goes_here> 🦕