feat: added DocumentWithRevisions with read functionality #128
Conversation
Could you add more context to the PR and/or the code comments on why this is needed? I get that this includes the information about Human-in-the-Loop revisions, but I don't understand why a new type is needed to include the revisions. Could these be added as optional fields to the Document class?
Also, be sure to add tests.
storage_client = storage.Client()

blob_list = storage_client.list_blobs(output_bucket, prefix=output_prefix)
pb = documentai.Document.pb()
Why are you accessing the pb() in the document? Do these doc.dp.bp files work differently to the document.json files?
Yes, they're very different, and the only way to get a usable object is by accessing pb() and converting the dp.bp files to a Document object using the underlying protobuf message class.
Why not just use documentai.Document.from_json (https://github.com/googleapis/proto-plus-python/blob/main/proto/message.py#L407) followed by accessing its text attribute (https://github.com/googleapis/google-cloud-python/blob/main/packages/google-cloud-documentai/google/cloud/documentai_v1/types/document.py#L72)? I think these practically do the same thing, but it may be worthwhile to avoid having to access the lower-level protobuf API pb and FromString.
I agree with @dizcology, I think that would be possible.
Note: be sure to use the parameter ignore_unknown_fields, because if the client library lags behind the core Document protos, there could be some fields in the Document output that aren't in the client library yet. This prevents exceptions from being thrown.
document = documentai.Document.from_json(
blob.download_as_bytes(), ignore_unknown_fields=True
)
I am not able to use Document.from_json, since the dp.bp file isn't the JSON bytes of a Document object; it is the serialized bytes of the Document message, which is different from what Document.from_json expects. If I try using Document.from_json I get this error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb3 in position 3: invalid start byte
class DocumentWithRevisions:
    r"""Represents a wrapped Document.

    A single Document protobuf message might be written as several JSON files on
    GCS by Document AI's BatchProcessDocuments method. This class hides away the
    shards from the users and implements convenient methods for searching and
    extracting information within the Document.

    Attributes:
        gcs_prefix (str):
            Required. The gcs path to a single processed document.

            Format: `gs://{bucket}/{optional_folder}/{operation_id}/{folder_id}`
            where `{operation_id}` is the operation-id given from BatchProcessDocument
            and `{folder_id}` is the number corresponding to the target document.
    """

    document: Document = dataclasses.field(init=True, repr=False)
    revision_nodes: List[documentai.Document] = dataclasses.field(init=True, repr=False)
    gcs_prefix: str = dataclasses.field(init=True, repr=False, default=None)
    parent_ids: List[str] = dataclasses.field(init=True, repr=False, default=None)

    next_: Document = dataclasses.field(init=False, repr=False, default=None)
    last: Document = dataclasses.field(init=False, repr=False, default=None)
    revision_id: str = dataclasses.field(init=False, repr=False, default=None)
    history: List[str] = dataclasses.field(init=False, repr=False, default_factory=list)
    root_revision: Document = dataclasses.field(init=False, repr=False, default=None)

    parent: DocumentWithRevisions = dataclasses.field(
        init=False, repr=False, default=None
    )

    children: List[DocumentWithRevisions] = dataclasses.field(
        init=False, repr=False, default_factory=list
    )
    children_ids: List[str] = dataclasses.field(
        init=False, repr=False, default_factory=list
    )
    root_revision_nodes: List[DocumentWithRevisions] = dataclasses.field(
        init=False, repr=False, default_factory=list
    )
Ok, I guess I understand more why the extra class for revisions is here (to keep the base Document class simpler), but I'm still not sure I like this solution. It seems like there's a lot of fields involved, and this object could get quite large with all of the Documents and DocumentWithRevisions included.
This is implemented like this for multiple reasons. The first is to keep continuity with the internal implementation. Another is that the document we are importing from GCS here is completely different: we are not taking in JSON files, we're processing dp.bp files, which are binary protobuf, and the revision documents need a tree-like hierarchy, which would introduce a lot of clutter/confusion in Document. So it's best to separate it out to make it easier to make changes to DocumentWithRevisions. Also, there are a lot of fields because they support the tree structure and movement between non-children nodes.
name = blob.name.split("/")[-1]
if blob.name.endswith(".dp.bp"):
    blob_as_bytes = blob.download_as_bytes()
    if re.search(r"^doc.dp.bp", name):
Why not use name.startswith("doc.dp.bp") instead? (Same comment for the next two cases below.)
That might work for doc.dp.bp, but I don't know if it will work for rev and pages, since those will be rev_0000, rev_0001... and pages_0000, pages_0001...
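For what it's worth, str.startswith also accepts a tuple of prefixes, so it can cover the numbered shards without a regex. A sketch using made-up shard names from this thread:

```python
# Shard names discussed above: doc.dp.bp, rev_0000.dp.bp, pages_0001.dp.bp.
# str.startswith accepts a tuple, so one call covers all three naming patterns.
names = ["doc.dp.bp", "rev_0000.dp.bp", "pages_0001.dp.bp", "output.json"]
shards = [
    n for n in names
    if n.endswith(".dp.bp") and n.startswith(("doc", "rev_", "pages_"))
]
# shards == ["doc.dp.bp", "rev_0000.dp.bp", "pages_0001.dp.bp"]
```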
        }
    )
elif e.provenance.type_ == _OP_TYPE.REMOVE:
    entity = entities_array.pop(int(e.id) - 1)
This kind of operation is pretty tricky: it pops from entities_array, which is added to or modified by the other branches. Can we start by expanding the docstring to clarify what this function intends to accomplish?
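As an aside, the pattern under discussion amounts to replaying provenance operations against a list. A hypothetical, heavily simplified replay (the ADD/REMOVE names and the 1-based id are assumptions based on the diff, not the PR's actual helper):

```python
def replay_operations(ops):
    """Hypothetical, simplified replay of provenance ops over an entity list.

    Each op is a (kind, payload) pair: ADD appends an entity; REMOVE pops by
    1-based entity id, mirroring the entities_array.pop(int(e.id) - 1) branch.
    """
    entities = []
    for kind, payload in ops:
        if kind == "ADD":
            entities.append(payload)
        elif kind == "REMOVE":
            # Popping by index shifts every later entry, which is why mixing
            # this with the other mutating branches is easy to get wrong.
            entities.pop(payload - 1)
    return entities
```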
gcs_bucket_name: str = dataclasses.field(repr=False, default=None)
gcs_prefix: str = dataclasses.field(repr=False, default=None)
parent_ids: List[str] = dataclasses.field(repr=False, default=None)
all_node_ids: List[str] = dataclasses.field(repr=False, default=None)
I would expect parent_ids and all_node_ids are things that could be extracted from the data on GCS in __post_init__? (As opposed to requiring them in __init__, that is.)
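Something along these lines could work (a toy sketch; the placeholder derivation stands in for whatever would actually be computed from the revision data on GCS):

```python
import dataclasses
from typing import List


@dataclasses.dataclass
class RevisionIndex:
    """Toy sketch: derive fields in __post_init__ instead of requiring them."""

    node_ids: List[str]  # supplied by the caller (e.g. read from GCS)
    parent_ids: List[str] = dataclasses.field(init=False)
    all_node_ids: List[str] = dataclasses.field(init=False)

    def __post_init__(self):
        # Placeholder derivation: the real class would compute these from the
        # revision data itself rather than taking them as constructor args.
        self.all_node_ids = list(self.node_ids)
        self.parent_ids = self.node_ids[:-1]
```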
class DocumentWithRevisions:
    r"""Represents a wrapped Document.

    A single Document protobuf message with revisions will be written as several `dp.bp` files on
"A single Document protobuf message with revisions" in the docstring here suggests that an instance of this is a protobuf message (which it is not) that comes with special attributes such as revisions. But from the presence of the next_ and parent attributes, we seem to be saying an instance of this class is a revision (that is, a single document at a particular point in time)? These are quite different kinds of data, so what are we modeling here?
children: List["DocumentWithRevisions"] = dataclasses.field(
    init=False, repr=False, default_factory=list
)
children_ids: List[str] = dataclasses.field(
Can this be just a property, computed from children?
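That suggestion would look roughly like this, sketched on a stripped-down stand-in class rather than the PR's actual dataclass:

```python
import dataclasses
from typing import List


@dataclasses.dataclass
class Node:
    revision_id: str
    children: List["Node"] = dataclasses.field(default_factory=list)

    @property
    def children_ids(self) -> List[str]:
        # Derived on access, so it can never drift out of sync with children.
        return [child.revision_id for child in self.children]
```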
if self.parent:
    current_index = self.parent.children_ids.index(self.revision_id)
    if current_index != 0:
        return self.parent.children[current_index - 1]
    elif current_index != None:
        print("hi")
    elif current_index is not None:
        return self.parent
elif self.revision_id in self.parent_ids:
    non_parent_index = self.parent_ids.index(self.revision_id)
    if non_parent_index > 0:
        next_parent = self.parent_ids[non_parent_index - 1]
        index = self.all_node_ids.index(next_parent)
        return self.root_revision_nodes[index]
    else:
        return self
return self
I think this if/else tree could be simplified to make it easier to read. I find that "guard" statements work well to visualize the flow: swap the conditions for the if and return early instead of using many elses.
I recommend addressing @dizcology's comments. I agree with all of them.
Overall, I'm still not sure I understand the use case for this or why it was architected the way it is. Is there a design doc with the reasoning?
Also, add a sample for how to use this.
Closing this out based on our conversations.