feat: added DocumentWithRevisions with read functionality #128
Conversation
Could you add more context to the PR and/or the code comments on why this is needed? I get that this includes the information about Human-in-the-Loop revisions, but I don't understand why a new type is needed to include the revisions. Could these be added as optional fields to the Document class?
Also, be sure to add tests.
storage_client = storage.Client()

blob_list = storage_client.list_blobs(output_bucket, prefix=output_prefix)
pb = documentai.Document.pb()
Why are you accessing the pb() in the document? Do these doc.dp.bp files work differently to the document.json files?
Yes, they're very different, and the only way to get a usable object is by accessing pb() and converting the dp.bp files to a Document object using the underlying protobuf message class.
Why not just use documentai.Document.from_json (https://github.com/googleapis/proto-plus-python/blob/main/proto/message.py#L407) followed by accessing its text attribute (https://github.com/googleapis/google-cloud-python/blob/main/packages/google-cloud-documentai/google/cloud/documentai_v1/types/document.py#L72)? I think these practically do the same thing, but it may be worthwhile to avoid having to access the lower-level protobuf API pb and FromString.
I agree with @dizcology, I think that would be possible.
Note: be sure to use the parameter ignore_unknown_fields, because if the client library lags behind the core Document protos, there could be some fields in the Document output that aren't in the client library yet. This prevents exceptions from being thrown.
document = documentai.Document.from_json(
blob.download_as_bytes(), ignore_unknown_fields=True
)
I am not able to use Document.from_json, since the dp.bp file isn't the JSON bytes of a Document object; it is the serialized bytes of the Document message, which is different from what Document.from_json expects. If I try using Document.from_json I get this error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb3 in position 3: invalid start byte
class DocumentWithRevisions:
    r"""Represents a wrapped Document.

    A single Document protobuf message might be written as several JSON files on
    GCS by Document AI's BatchProcessDocuments method. This class hides away the
    shards from the users and implements convenient methods for searching and
    extracting information within the Document.

    Attributes:
        gcs_prefix (str):
            Required. The gcs path to a single processed document.

            Format: `gs://{bucket}/{optional_folder}/{operation_id}/{folder_id}`
            where `{operation_id}` is the operation-id given from BatchProcessDocument
            and `{folder_id}` is the number corresponding to the target document.
    """

    document: Document = dataclasses.field(init=True, repr=False)
    revision_nodes: List[documentai.Document] = dataclasses.field(init=True, repr=False)
    gcs_prefix: str = dataclasses.field(init=True, repr=False, default=None)
    parent_ids: List[str] = dataclasses.field(init=True, repr=False, default=None)

    next_: Document = dataclasses.field(init=False, repr=False, default=None)
    last: Document = dataclasses.field(init=False, repr=False, default=None)
    revision_id: str = dataclasses.field(init=False, repr=False, default=None)
    history: List[str] = dataclasses.field(init=False, repr=False, default_factory=list)
    root_revision: Document = dataclasses.field(init=False, repr=False, default=None)

    parent: DocumentWithRevisions = dataclasses.field(
        init=False, repr=False, default=None
    )

    children: List[DocumentWithRevisions] = dataclasses.field(
        init=False, repr=False, default_factory=list
    )
    children_ids: List[str] = dataclasses.field(
        init=False, repr=False, default_factory=list
    )
    root_revision_nodes: List[DocumentWithRevisions] = dataclasses.field(
        init=False, repr=False, default_factory=list
    )
Ok, I guess I understand more why the extra class for revisions is here (to keep the base Document class simpler), but I'm still not sure I like this solution. It seems like there's a lot of fields involved, and this object could get quite large with all of the Documents and DocumentWithRevisions included.
This is implemented like this for multiple reasons. The first is to keep continuity with the internal implementation. Another is that the document we are importing from GCS here is completely different: we are not taking in JSON files, we're processing dp.bp files, which are binary protobuf, and the revision documents need a tree-like hierarchy, which would introduce a lot of clutter/confusion in Document. So it's best to separate it out to make it easier to make changes to DocumentWithRevisions. Also, there are a lot of fields because they support the tree structure and movement between non-children nodes.
name = blob.name.split("/")[-1]
if blob.name.endswith(".dp.bp"):
    blob_as_bytes = blob.download_as_bytes()
    if re.search(r"^doc.dp.bp", name):
Why not use name.startswith("doc.dp.bp") instead? (Same comment for the next two cases below.)
That might work for doc.dp.bp, but I don't know if it will work for rev and pages, since those will be rev_0000, rev_0001... and pages_0000, pages_0001...
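For what it's worth, str.startswith also accepts a tuple of prefixes, so it can cover the numbered shards without a regex. A sketch using made-up shard names from this thread:

```python
# Shard names discussed above: doc.dp.bp, rev_0000.dp.bp, pages_0001.dp.bp.
# str.startswith accepts a tuple, so one call covers all three naming patterns.
names = ["doc.dp.bp", "rev_0000.dp.bp", "pages_0001.dp.bp", "output.json"]
shards = [
    n for n in names
    if n.endswith(".dp.bp") and n.startswith(("doc", "rev_", "pages_"))
]
# shards == ["doc.dp.bp", "rev_0000.dp.bp", "pages_0001.dp.bp"]
```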
        }
    )
elif e.provenance.type_ == _OP_TYPE.REMOVE:
    entity = entities_array.pop(int(e.id) - 1)
This kind of operation is pretty tricky: it pops from entities_array, which is added to or modified by the other branches. Can we start by expanding the docstring to clarify what this function intends to accomplish?
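As an aside, the pattern under discussion amounts to replaying provenance operations against a list. A hypothetical, heavily simplified replay (the ADD/REMOVE names and the 1-based id are assumptions based on the diff, not the PR's actual helper):

```python
def replay_operations(ops):
    """Hypothetical, simplified replay of provenance ops over an entity list.

    Each op is a (kind, payload) pair: ADD appends an entity; REMOVE pops by
    1-based entity id, mirroring the entities_array.pop(int(e.id) - 1) branch.
    """
    entities = []
    for kind, payload in ops:
        if kind == "ADD":
            entities.append(payload)
        elif kind == "REMOVE":
            # Popping by index shifts every later entry, which is why mixing
            # this with the other mutating branches is easy to get wrong.
            entities.pop(payload - 1)
    return entities
```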
gcs_bucket_name: str = dataclasses.field(repr=False, default=None)
gcs_prefix: str = dataclasses.field(repr=False, default=None)
parent_ids: List[str] = dataclasses.field(repr=False, default=None)
all_node_ids: List[str] = dataclasses.field(repr=False, default=None)
I would expect parent_ids and all_node_ids are things that could be extracted from the data on GCS in __post_init__? (As opposed to requiring them in __init__, that is.)
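Something along these lines could work (a toy sketch; the placeholder derivation stands in for whatever would actually be computed from the revision data on GCS):

```python
import dataclasses
from typing import List


@dataclasses.dataclass
class RevisionIndex:
    """Toy sketch: derive fields in __post_init__ instead of requiring them."""

    node_ids: List[str]  # supplied by the caller (e.g. read from GCS)
    parent_ids: List[str] = dataclasses.field(init=False)
    all_node_ids: List[str] = dataclasses.field(init=False)

    def __post_init__(self):
        # Placeholder derivation: the real class would compute these from the
        # revision data itself rather than taking them as constructor args.
        self.all_node_ids = list(self.node_ids)
        self.parent_ids = self.node_ids[:-1]
```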
class DocumentWithRevisions:
    r"""Represents a wrapped Document.

    A single Document protobuf message with revisions will be written as several `dp.bp` files on
"A single Document protobuf message with revisions" in the docstring here suggests that an instance of this is a protobuf message (which it is not) that comes with special attributes such as revisions. But from the presence of the next_ and parent attributes, we seem to be saying an instance of this class is a revision (that is, a single document at a particular point in time)? These are quite different kinds of data, so what are we modeling here?
children: List["DocumentWithRevisions"] = dataclasses.field(
    init=False, repr=False, default_factory=list
)
children_ids: List[str] = dataclasses.field(
Can this be just a property, computed from children?
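That suggestion would look roughly like this, sketched on a stripped-down stand-in class rather than the PR's actual dataclass:

```python
import dataclasses
from typing import List


@dataclasses.dataclass
class Node:
    revision_id: str
    children: List["Node"] = dataclasses.field(default_factory=list)

    @property
    def children_ids(self) -> List[str]:
        # Derived on access, so it can never drift out of sync with children.
        return [child.revision_id for child in self.children]
```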
if self.parent:
    current_index = self.parent.children_ids.index(self.revision_id)
    if current_index != 0:
        return self.parent.children[current_index - 1]
    elif current_index != None:
        print("hi")
    elif current_index is not None:
        return self.parent
elif self.revision_id in self.parent_ids:
    non_parent_index = self.parent_ids.index(self.revision_id)
    if non_parent_index > 0:
        next_parent = self.parent_ids[non_parent_index - 1]
        index = self.all_node_ids.index(next_parent)
        return self.root_revision_nodes[index]
    else:
        return self
return self
I think this if/else tree could be simplified to make it easier to read. I find that "guard" statements work well to visualize the flow: swap the conditions for the if and return early instead of using many elses.
I recommend addressing @dizcology's comments. I agree with all of them.
Overall, I'm still not sure I understand the use case for this or why it was architected the way it is. Is there a design doc with the reasoning?
Also, add a sample for how to use this.
Closing this out based on our conversations.