feat: added DocumentWithRevisions with read functionality #128

Closed
wants to merge 24 commits into from

Conversation

Collaborator

@galz10 galz10 commented Jun 14, 2023

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕

@galz10 galz10 requested review from a team as code owners June 14, 2023 19:35
@product-auto-label product-auto-label bot added the size: l Pull request size is large. label Jun 14, 2023
@holtskinner holtskinner added the owlbot:run Add this label to trigger the Owlbot post processor. label Jun 14, 2023
@gcf-owl-bot gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label Jun 14, 2023
@holtskinner
Member

Could you add more context to the PR and/or the code comments on why this is needed? I get that this includes the information about Human-in-the-Loop revisions, but I don't understand why a new type is needed to include the revisions. Could these be added as optional fields to the toolbox.Document class?

@holtskinner holtskinner left a comment (Member)

Also, be sure to add in testing

storage_client = storage.Client()

blob_list = storage_client.list_blobs(output_bucket, prefix=output_prefix)
pb = documentai.Document.pb()
Member

Why are you accessing pb() in the document? Do these doc.dp.bp files work differently from the document.json files?

Collaborator Author

Yes, they're very different, and the only way to get a usable object is by accessing pb() and converting the dp.bp files to a Document object using the underlying pb class.

Collaborator

Why not just use documentai.Document.from_json (https://github.com/googleapis/proto-plus-python/blob/main/proto/message.py#L407) followed by accessing its text attribute? (https://github.com/googleapis/google-cloud-python/blob/main/packages/google-cloud-documentai/google/cloud/documentai_v1/types/document.py#L72)

I think these practically do the same thing, but it may be worthwhile to avoid having to access the lower-level protobuf API pb and FromString.

Member

I agree with @dizcology; I think that would be possible.

Note: be sure to use the parameter ignore_unknown_fields, because if the client library lags behind the core Document protos, there could be some fields in the Document output that aren't in the client library yet. This prevents exceptions from being thrown.

document = documentai.Document.from_json(
    blob.download_as_bytes(), ignore_unknown_fields=True
)

Collaborator Author

I am not able to use Document.from_json, since the dp.bp file isn't the JSON bytes of a Document object; it's the serialized bytes of the Document protobuf message, which is different from what Document.from_json expects. If I try using Document.from_json I get this error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb3 in position 3: invalid start byte
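That failure mode is easy to reproduce without Document AI at all: from_json first decodes the payload as UTF-8 text, and arbitrary binary protobuf bytes are generally not valid UTF-8. A stdlib-only illustration (the byte string below is made up):

```python
# Made-up bytes standing in for a binary .dp.bp payload; 0xb3 appearing without
# a lead byte is not valid UTF-8, so text decoding fails immediately.
binary_blob = b"\x0a\x05\xb3\x01\x02\x03\x04"


def looks_like_json(blob: bytes) -> bool:
    """Return True only if the blob decodes as text (a precondition of from_json)."""
    try:
        blob.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```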

Comment on lines 215 to 255
class DocumentWithRevisions:
    r"""Represents a wrapped Document.

    A single Document protobuf message might be written as several JSON files on
    GCS by Document AI's BatchProcessDocuments method. This class hides away the
    shards from the users and implements convenient methods for searching and
    extracting information within the Document.

    Attributes:
        gcs_prefix (str):
            Required. The gcs path to a single processed document.

            Format: `gs://{bucket}/{optional_folder}/{operation_id}/{folder_id}`
            where `{operation_id}` is the operation-id given from BatchProcessDocument
            and `{folder_id}` is the number corresponding to the target document.
    """

    document: Document = dataclasses.field(init=True, repr=False)
    revision_nodes: List[documentai.Document] = dataclasses.field(init=True, repr=False)
    gcs_prefix: str = dataclasses.field(init=True, repr=False, default=None)
    parent_ids: List[str] = dataclasses.field(init=True, repr=False, default=None)

    next_: Document = dataclasses.field(init=False, repr=False, default=None)
    last: Document = dataclasses.field(init=False, repr=False, default=None)
    revision_id: str = dataclasses.field(init=False, repr=False, default=None)
    history: List[str] = dataclasses.field(init=False, repr=False, default_factory=list)
    root_revision: Document = dataclasses.field(init=False, repr=False, default=None)

    parent: DocumentWithRevisions = dataclasses.field(
        init=False, repr=False, default=None
    )

    children: List[DocumentWithRevisions] = dataclasses.field(
        init=False, repr=False, default_factory=list
    )
    children_ids: List[str] = dataclasses.field(
        init=False, repr=False, default_factory=list
    )
    root_revision_nodes: List[DocumentWithRevisions] = dataclasses.field(
        init=False, repr=False, default_factory=list
    )
Member

Ok, I guess I understand more why the extra class for revisions is here (to keep the base Document class simpler), but I'm still not sure I like this solution; it seems like there are a lot of fields involved, and this object could get quite large with all of the Documents and DocumentWithRevisions included.

Collaborator Author

This is implemented like this for multiple reasons. The first is to keep continuity with the internal implementation. Another is that the document we are importing from GCS here is completely different: we are not taking in JSON files, we're processing dp.bp files, which are binary protobuf. Also, the revision documents should have a tree-like hierarchy, which would introduce a lot of clutter/confusion in Document, so it's best to separate it out and make it easier to make changes to DocumentWithRevisions. There are also a lot of fields to support the tree structure and movement between non-children nodes.

@product-auto-label product-auto-label bot added size: xl Pull request size is extra large. and removed size: l Pull request size is large. labels Jul 6, 2023
@galz10 galz10 added the owlbot:run Add this label to trigger the Owlbot post processor. label Jul 6, 2023
@gcf-owl-bot gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label Jul 6, 2023
@galz10 galz10 added the owlbot:run Add this label to trigger the Owlbot post processor. label Jul 6, 2023
@gcf-owl-bot gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label Jul 6, 2023

name = blob.name.split("/")[-1]
if blob.name.endswith(".dp.bp"):
    blob_as_bytes = blob.download_as_bytes()
    if re.search(r"^doc.dp.bp", name):
Collaborator

Why not use name.startswith("doc.dp.bp") instead? (same comment for the next two cases below)

Collaborator Author

@galz10 galz10 Jul 7, 2023

That might work for doc.dp.bp, but I don't know if it will work for rev and pages, since they will be rev_0000, rev_0001, ... and pages_0000, pages_0001, ...
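For what it's worth, startswith covers the numbered shards too, since only the prefix is fixed. A sketch with hypothetical shard names (classify_shard is an illustrative helper, not the PR's code):

```python
def classify_shard(name: str) -> str:
    """Bucket an output file by its fixed name prefix (illustrative only)."""
    if name.startswith("doc.dp.bp"):
        return "doc"
    if name.startswith("rev_"):  # matches rev_0000, rev_0001, ...
        return "rev"
    if name.startswith("pages_"):  # matches pages_0000, pages_0001, ...
        return "pages"
    return "other"
```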

        }
    )
elif e.provenance.type_ == _OP_TYPE.REMOVE:
    entity = entities_array.pop(int(e.id) - 1)
Collaborator

This kind of operation is pretty tricky, popping from entities_array, which is added to or modified by the other branches. Can we start by expanding the docstring to clarify what this function intends to accomplish?
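The concern can be made concrete with plain lists: popping by a stale 1-based id goes wrong as soon as an earlier element has already been removed (stand-in data below, not the PR's entities):

```python
# Pretend entity ids are the 1-based strings "1".."4".
entities_array = ["e1", "e2", "e3", "e4"]

# First REMOVE: entity id "1".
entities_array.pop(int("1") - 1)  # -> ["e2", "e3", "e4"]

# Second REMOVE intends entity id "3" ("e3"), but after the first pop
# index 2 now holds "e4", so the wrong element is removed.
removed = entities_array.pop(int("3") - 1)
```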

gcs_bucket_name: str = dataclasses.field(repr=False, default=None)
gcs_prefix: str = dataclasses.field(repr=False, default=None)
parent_ids: List[str] = dataclasses.field(repr=False, default=None)
all_node_ids: List[str] = dataclasses.field(repr=False, default=None)
Collaborator

I would expect parent_ids and all_node_ids are things that could be extracted from the data on GCS in __post_init__? (as opposed to requiring them in __init__, that is)
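A sketch of that shape; RevisionTree and _scan_node_ids are hypothetical names, and the stub stands in for whatever would list and parse the shards under gcs_prefix:

```python
import dataclasses
from typing import List


@dataclasses.dataclass
class RevisionTree:
    # Only the location comes from the caller...
    gcs_prefix: str
    # ...derived fields are excluded from __init__ entirely.
    parent_ids: List[str] = dataclasses.field(init=False, default_factory=list)
    all_node_ids: List[str] = dataclasses.field(init=False, default_factory=list)

    def __post_init__(self) -> None:
        # Derive rather than require: in the real class this would read GCS.
        self.all_node_ids = self._scan_node_ids()
        self.parent_ids = [i for i in self.all_node_ids if i.startswith("root")]

    def _scan_node_ids(self) -> List[str]:
        # Stub for listing revision shards under gcs_prefix.
        return ["root-0", "child-1"]
```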

class DocumentWithRevisions:
r"""Represents a wrapped Document.

A single Document protobuf message with revisions will be written as several `dp.bp` files on
Collaborator

"A single Document protobuf message with revisions" in the docstring here suggests that an instance of this is a protobuf message (which it is not) that comes with special attributes such as revisions. But from the presence of the next_, parent attributes, we seem to be saying an instance of this class is a revision (that is, a single document at a particular point in time)? These are quite different kinds of data - what are we modeling here?

children: List["DocumentWithRevisions"] = dataclasses.field(
init=False, repr=False, default_factory=list
)
children_ids: List[str] = dataclasses.field(
Collaborator

Can this be just a property, computed from children?
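That would look roughly like this, on a minimal stand-in class:

```python
import dataclasses
from typing import List


@dataclasses.dataclass
class Node:
    """Minimal stand-in for DocumentWithRevisions (illustrative only)."""

    revision_id: str
    children: List["Node"] = dataclasses.field(default_factory=list)

    @property
    def children_ids(self) -> List[str]:
        # Computed from children on demand, so it can never drift out of sync.
        return [child.revision_id for child in self.children]
```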

Comment on lines 370 to 384
if self.parent:
    current_index = self.parent.children_ids.index(self.revision_id)
    if current_index != 0:
        return self.parent.children[current_index - 1]
    elif current_index != None:
        print("hi")
    elif current_index is not None:
        return self.parent
elif self.revision_id in self.parent_ids:
    non_parent_index = self.parent_ids.index(self.revision_id)
    if non_parent_index > 0:
        next_parent = self.parent_ids[non_parent_index - 1]
        index = self.all_node_ids.index(next_parent)
        return self.root_revision_nodes[index]
    else:
        return self
return self
Member

I think this if/else tree could be simplified to make it easier to read. I find that "guard" statements work well to visualize the flow. Swap the conditions for the if and add returns early instead of using many elses.
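One way the branch above could be flattened with guard clauses, sketched on a minimal stand-in class (RevisionNode and previous_revision are illustrative names, and the stray print("hi") branch is dropped):

```python
import dataclasses
from typing import List, Optional


@dataclasses.dataclass
class RevisionNode:
    """Minimal stand-in for DocumentWithRevisions (illustrative only)."""

    revision_id: str
    parent: Optional["RevisionNode"] = None
    children: List["RevisionNode"] = dataclasses.field(default_factory=list)
    children_ids: List[str] = dataclasses.field(default_factory=list)
    parent_ids: List[str] = dataclasses.field(default_factory=list)
    all_node_ids: List[str] = dataclasses.field(default_factory=list)
    root_revision_nodes: List["RevisionNode"] = dataclasses.field(default_factory=list)

    def previous_revision(self) -> "RevisionNode":
        # Guard clause: handle the root-level cases first and return early.
        if self.parent is None:
            if self.revision_id not in self.parent_ids:
                return self
            idx = self.parent_ids.index(self.revision_id)
            if idx == 0:
                return self
            prev_parent = self.parent_ids[idx - 1]
            return self.root_revision_nodes[self.all_node_ids.index(prev_parent)]
        # Main path: step back among siblings, else fall back to the parent.
        idx = self.parent.children_ids.index(self.revision_id)
        if idx > 0:
            return self.parent.children[idx - 1]
        return self.parent
```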

@holtskinner holtskinner left a comment (Member)

I recommend addressing @dizcology's comments. I agree with all of them.

Overall, still not sure I understand the use case for this or why it was architected the way it is. Is there a design doc with the reasonings?

Also, add a sample for how to use this.

@holtskinner
Member

Closing this out based on our conversations

@holtskinner holtskinner closed this Aug 3, 2023