fix: Handle Edge Case where GCS Shards are out of order (#69)
* fix: Handle Edge Case where GCS Shards are out of order

fixes #68

* test: Add additional test cases for unordered shards

* docs: updates to docstrings

* Replace Test Sharded Document with smaller document

* Added Check for multiple shards before sorting
holtskinner committed Mar 2, 2023
1 parent c32a371 commit 709fe86
Showing 8 changed files with 49 additions and 11 deletions.
20 changes: 11 additions & 9 deletions google/cloud/documentai_toolbox/wrappers/document.py
@@ -106,11 +106,11 @@ def _get_bytes(gcs_bucket_name: str, gcs_prefix: str) -> List[bytes]:
gcs_bucket_name (str):
Required. The name of the gcs bucket.
- Format: `gs://bucket/optional_folder/target_folder/` where gcs_bucket_name=`bucket`.
+ Format: `gs://{bucket_name}/{optional_folder}/{target_folder}/` where gcs_bucket_name=`bucket`.
gcs_prefix (str):
Required. The prefix of the json files in the target_folder
- Format: `gs://bucket/optional_folder/target_folder/` where gcs_prefix=`optional_folder/target_folder`.
+ Format: `gs://{bucket_name}/{optional_folder}/{target_folder}/` where gcs_prefix=`{optional_folder}/{target_folder}`.
Returns:
List[bytes]:
A list of bytes.
@@ -138,11 +138,11 @@ def _get_shards(gcs_bucket_name: str, gcs_prefix: str) -> List[documentai.Docume
gcs_bucket_name (str):
Required. The name of the gcs bucket.
- Format: `gs://bucket/optional_folder/target_folder/` where gcs_bucket_name=`bucket`.
+ Format: `gs://{bucket_name}/{optional_folder}/{target_folder}/` where gcs_bucket_name=`bucket`.
gcs_prefix (str):
Required. The prefix of the json files in the target_folder.
- Format: `gs://bucket/optional_folder/target_folder/` where gcs_prefix=`optional_folder/target_folder`.
+ Format: `gs://{bucket_name}/{optional_folder}/{target_folder}/` where gcs_prefix=`{optional_folder}/{target_folder}`.
Returns:
List[google.cloud.documentai.Document]:
A list of documentai.Documents.
@@ -160,6 +160,8 @@ def _get_shards(gcs_bucket_name: str, gcs_prefix: str) -> List[documentai.Docume
for byte in byte_array:
    shards.append(documentai.Document.from_json(byte, ignore_unknown_fields=True))

+ if len(shards) > 1:
+     shards.sort(key=lambda x: int(x.shard_info.shard_index))
return shards
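
Review note: the sort key above casts `shard_index` to `int`. As a minimal sketch of why a numeric key matters, the example below uses plain dictionaries shaped like the `shardInfo` block of the test JSON in this commit, where `shardIndex` is serialized as a string. The helper name `sort_shards` and the sample data are hypothetical and are not part of the toolbox.

```python
from typing import Any, Dict, List


def sort_shards(shards: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return shards ordered by numeric shard index (hypothetical helper)."""
    # Zero or one shard needs no sorting, mirroring the len(shards) > 1 guard.
    if len(shards) <= 1:
        return list(shards)
    # Assumption: a missing shardIndex is treated as 0.
    return sorted(
        shards,
        key=lambda s: int(s.get("shardInfo", {}).get("shardIndex", 0)),
    )


# Made-up shards listed out of order, as a bucket listing might return them.
shards = [
    {"shardInfo": {"shardIndex": str(i), "shardCount": "12"}}
    for i in (10, 2, 0, 11, 1)
]

numeric = [s["shardInfo"]["shardIndex"] for s in sort_shards(shards)]
lexical = sorted(s["shardInfo"]["shardIndex"] for s in shards)

print(numeric)  # numeric order: 0, 1, 2, 10, 11
print(lexical)  # string order places "10" and "11" before "2"
```

With ten or more shards, ordering by the raw string (or by object name) would interleave shard 10 between shards 1 and 2, which is presumably the kind of misordering the fix guards against.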


@@ -181,11 +183,11 @@ def print_gcs_document_tree(gcs_bucket_name: str, gcs_prefix: str) -> None:
gcs_bucket_name (str):
Required. The name of the gcs bucket.
- Format: `gs://bucket/optional_folder/target_folder/` where gcs_bucket_name=`bucket`.
+ Format: `gs://{bucket_name}/{optional_folder}/{target_folder}/` where gcs_bucket_name=`bucket`.
gcs_prefix (str):
Required. The prefix of the json files in the target_folder.
- Format: `gs://bucket/optional_folder/target_folder/` where gcs_prefix=`optional_folder/target_folder`.
+ Format: `gs://{bucket_name}/{optional_folder}/{target_folder}/` where gcs_prefix=`{optional_folder}/{target_folder}`.
Returns:
None.
@@ -240,11 +242,11 @@ class Document:
gcs_bucket_name (Optional[str]):
Optional. The name of the gcs bucket.
- Format: `gs://bucket/optional_folder/target_folder/` where gcs_bucket_name=`bucket`.
+ Format: `gs://{bucket_name}/{optional_folder}/{target_folder}/` where gcs_bucket_name=`bucket`.
gcs_prefix (Optional[str]):
Optional. The prefix of the json files in the target_folder.
- Format: `gs://bucket/optional_folder/target_folder/` where gcs_prefix=`optional_folder/target_folder`.
+ Format: `gs://{bucket_name}/{optional_folder}/{target_folder}/` where gcs_prefix=`{optional_folder}/{target_folder}`.
For more information please take a look at https://cloud.google.com/storage/docs/json_api/v1/objects/list .
pages: (List[Page]):
@@ -315,7 +317,7 @@ def from_gcs(cls, gcs_bucket_name: str, gcs_prefix: str):
gcs_prefix (str):
Required. The prefix to the location of the target folder.
- Format: Given `gs://{bucket_name}/optional_folder/target_folder` where gcs_prefix=`{optional_folder}/{target_folder}`.
+ Format: Given `gs://{bucket_name}/{optional_folder}/{target_folder}` where gcs_prefix=`{optional_folder}/{target_folder}`.
Returns:
Document:
A document from gcs.
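
Review note: the docstrings above describe how a `gs://` location maps onto the two arguments. The sketch below shows one way a caller might derive `gcs_bucket_name` and `gcs_prefix` from a full URI. The helper `split_gcs_uri` is hypothetical and is not provided by the toolbox; it only illustrates the documented format.

```python
from typing import Tuple


def split_gcs_uri(gcs_uri: str) -> Tuple[str, str]:
    """Split gs://{bucket_name}/{prefix}/ into (bucket_name, prefix).

    Hypothetical helper, not part of documentai_toolbox.
    """
    scheme = "gs://"
    if not gcs_uri.startswith(scheme):
        raise ValueError(f"Expected a URI starting with {scheme!r}, got {gcs_uri!r}")
    # Everything up to the first "/" is the bucket; the rest is the prefix.
    bucket_name, _, prefix = gcs_uri[len(scheme):].partition("/")
    if not bucket_name:
        raise ValueError(f"No bucket name found in {gcs_uri!r}")
    # The docstrings show the prefix without a trailing slash.
    return bucket_name, prefix.rstrip("/")


bucket_name, prefix = split_gcs_uri("gs://bucket/optional_folder/target_folder/")
print(bucket_name)  # bucket
print(prefix)       # optional_folder/target_folder
```

The two values could then be passed as `gcs_bucket_name` and `gcs_prefix` to `from_gcs`, whose signature appears in the hunk header above.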
7 changes: 5 additions & 2 deletions google/cloud/documentai_toolbox/wrappers/page.py
@@ -326,11 +326,14 @@ class Page:
Required. The original google.cloud.documentai.Document.Page object.
text: (str):
Required. The full text of the Document containing the Page.
- lines (List[str]):
+ form_fields (List[FormField]):
+ Required. A list of visually detected form fields on the
+ page.
+ lines (List[Line]):
Required. A list of visually detected text lines on the
page. A collection of tokens that a human would
perceive as a line.
- paragraphs (List[str]):
+ paragraphs (List[Paragraph]):
Required. A list of visually detected text paragraphs
on the page. A collection of lines that a human
would perceive as a paragraph.
@@ -0,0 +1 @@
{"pages":[{"layout":{"boundingPoly":{"normalizedVertices":[{},{"x":1},{"x":1,"y":1},{"y":1}],"vertices":[{},{"x":1596},{"x":1596,"y":2505},{"y":2505}]},"confidence":0.98390293,"orientation":"PAGE_UP","textAnchor":{"textSegments":[{"endIndex":"942"}]}},"pageNumber":41},{"layout":{"boundingPoly":{"normalizedVertices":[{},{"x":1},{"x":1,"y":1},{"y":1}],"vertices":[{},{"x":1602},{"x":1602,"y":2496},{"y":2496}]},"confidence":0.98344266,"orientation":"PAGE_UP","textAnchor":{"textSegments":[{"endIndex":"2211","startIndex":"942"}]}},"pageNumber":42},{"layout":{"boundingPoly":{"normalizedVertices":[{},{"x":1},{"x":1,"y":1},{"y":1}],"vertices":[{},{"x":1602},{"x":1602,"y":2496},{"y":2496}]},"confidence":0.79652208,"orientation":"PAGE_UP","textAnchor":{"textSegments":[{"endIndex":"2573","startIndex":"2211"}]}},"pageNumber":43},{"layout":{"boundingPoly":{"normalizedVertices":[{},{"x":1},{"x":1,"y":1},{"y":1}],"vertices":[{},{"x":1622},{"x":1622,"y":2465},{"y":2465}]},"confidence":0.97713888,"orientation":"PAGE_UP","textAnchor":{"textSegments":[{"endIndex":"3381","startIndex":"2573"}]}},"pageNumber":44},{"layout":{"boundingPoly":{"normalizedVertices":[{},{"x":1},{"x":1,"y":1},{"y":1}],"vertices":[{},{"x":1597},{"x":1597,"y":2503},{"y":2503}]},"confidence":0.87524492,"orientation":"PAGE_UP","textAnchor":{"textSegments":[{"endIndex":"3599","startIndex":"3381"}]}},"pageNumber":45},{"layout":{"boundingPoly":{"normalizedVertices":[{},{"x":1},{"x":1,"y":1},{"y":1}],"vertices":[{},{"x":1616},{"x":1616,"y":2473},{"y":2473}]},"confidence":0.98405439,"orientation":"PAGE_UP","textAnchor":{"textSegments":[{"endIndex":"4424","startIndex":"3599"}]}},"pageNumber":46},{"layout":{"boundingPoly":{"normalizedVertices":[{},{"x":1},{"x":1,"y":1},{"y":1}],"vertices":[{},{"x":1605},{"x":1605,"y":2490},{"y":2490}]},"confidence":0.97508377,"orientation":"PAGE_UP","textAnchor":{"textSegments":[{"endIndex":"5175","startIndex":"4424"}]}},"pageNumber":47},{"layout":{"boundingPoly":{"normalizedVertices":[{},
{"x":1},{"x":1,"y":1},{"y":1}],"vertices":[{},{"x":1619},{"x":1619,"y":2469},{"y":2469}]},"confidence":0.98273796,"orientation":"PAGE_UP","textAnchor":{"textSegments":[{"endIndex":"6181","startIndex":"5175"}]}},"pageNumber":48},{"layout":{"boundingPoly":{"normalizedVertices":[{},{"x":1},{"x":1,"y":1},{"y":1}],"vertices":[{},{"x":1605},{"x":1605,"y":2490},{"y":2490}]},"confidence":0.97522026,"orientation":"PAGE_UP","textAnchor":{"textSegments":[{"endIndex":"7366","startIndex":"6181"}]}},"pageNumber":49},{"layout":{"boundingPoly":{"normalizedVertices":[{},{"x":1},{"x":1,"y":1},{"y":1}],"vertices":[{},{"x":1609},{"x":1609,"y":2484},{"y":2484}]},"confidence":0.97771299,"orientation":"PAGE_UP","textAnchor":{"textSegments":[{"endIndex":"8532","startIndex":"7366"}]}},"pageNumber":50}],"shardInfo":{"shardCount":"5","shardIndex":"4","textOffset":"27701"},"text":"WINNIE-THE-POOH\n\"Pooh!\" cried Piglet. \"Do you think it is another\nWoozle?\"\n38\n\"No,\" said Pooh, \"because it makes different marks.\nIt is either Two Woozles and one, as it might be,\nWizzle, or Two, as it might be, Wizzles and one,\nif so it is, Woozle. Let us continue to follow them.'\nSo they went on, feeling just a little anxious now,\nin case the three animals in front of them were of\nHostile Intent. And Piglet wished very much that\nhis Grandfather T. W. were there, instead of else-\nwhere, and Pooh thought how nice it would be if\nthey met Christopher Robin suddenly but quite ac-\ncidentally, and only because he liked Christopher\nRobin so much. And then, all of a sudden, Winnie-\nthe-Pooh stopped again, and licked the tip of his\nnose in a cooling manner, for he was feeling more\nhot and anxious than ever in his life before. There\nwere four animals in front of them!\n\"Do you see, Piglet? Look at their tracks! Three,\nDigitized by\nGoogle\nPOOH AND PIGLET HUNT\n39\nas it were, Woozles, and one, as it was, Wizzle. An-\nother Woozle has joined them!”\nAnd so it seemed to be. 
There were the tracks;\ncrossing over each other here, getting muddled up\nwith each other there; but, quite plainly every now\nand then, the tracks of four sets of paws.\n\"I think,\" said Piglet, when he had licked the tip\nof his nose too, and found that it brought very little\ncomfort, \"I think that I have just remembered\nsomething. I have just remembered something that\nI forgot to do yesterday and shan't be able to do to-\nmorrow. So I suppose I really ought to go back and\ndo it now.'\n\"We'll do it this afternoon, and I'll come with\nyou,\" said Pooh.\n\"It isn't the sort of thing you can do in the after-\nnoon,” said Piglet quickly. “It's a very particular\nmorning thing, that has to be done in the morning,\nand, if possible, between the hours of What\nwould you say the time was?\"\n\"About twelve,\" said Winnie-the-Pooh, looking at\nthe sun.\n\"Between, as I was saying, the hours of twelve and\ntwelve five. So, really, dear old Pooh, if you'll ex-\ncuse me- What's that?\"\nPooh looked up at the sky, and then, as he heard\nthe whistle again, he looked up into the branches of\na big oak-tree, and then he saw a friend of his.\nDigitized by\nGoogle\n40\n\"It's Christopher Robin,\" he said.\nWINNIE-THE-POOH\nDigitized by\nMart\nAM\n\"Ah, then you'll be all right,\" said Piglet. \"You'll\nbe quite safe with him. Good-bye,\" and he trotted\noff home as quickly as he could, very glad to be\nOut of All Danger again.\nGoogle\n13\nWATER\nChristopher Robin came slowly down his tree.\n\"Silly old Bear,\" he said, \"what were you doing?\nPOOH AND PIGLET HUNT\n41\nFirst you went round the spinney twice by your-\nself, and then Piglet ran after you and you went\nround again together, and then you were just going\nround a fourth time--\"\n\"Wait a moment,\" said Winnie-the-Pooh, holding\nup his paw.\nHe sat down and thought, in the most thoughtful\nway he could think. Then he fitted his paw into\none of the Tracks . . . 
and then he scratched his\nnose twice, and stood up.\n\"Yes,\" said Winnie-the-Pooh.\n\"I see now,\" said Winnie-the-Pooh.\n\"I have been Foolish and Deluded,\" said he, \"and\nI am a Bear of No Brain at All.\"\n\"You're the Best Bear in All the World,” said\nChristopher Robin soothingly.\n\"Am I?\" said Pooh hopefully. And then he bright-\nened up suddenly.\n\"Anyhow,\" he said, \"it is nearly Luncheon Time.\"\nSo he went home for it.\nDigitized by Google\nIN WHICH Eeyore Loses a Tail\nand Pooh Finds One\nTHE Old Grey Donkey, Eeyore,\nstood by himself in a thistly corner of the forest,\nhis front feet well apart, his head on one side, and\nC\nCHAPTER IV\n42\nDigitized by\nGoogle\nEEYORE LOSES A TAIL\n43\nthought about things. Sometimes he thought sadly\nto himself, \"Why?\" and sometimes he thought,\n\"Wherefore?\" and sometimes he thought, \"Inas-\nmuch as which?\"-and sometimes he didn't quite\nknow what he was thinking about. So when Winnie-\nthe-Pooh came stumping along, Eeyore was very\nglad to be able to stop thinking for a little, in order\nto say \"How do you do?\" in a gloomy manner to\nhim.\n\"And how are you?\" said Winnie-the-Pooh.\nEeyore shook his head from side to side.\n\"Not very how,\" he said. \"I don't seem to have\nfelt at all how for a long time.\"\n\"Dear, dear,\" said Pooh, \"I'm sorry about that.\nLet's have a look at you.\"\nSo Eeyore stood there, gazing sadly at the ground,\nand Winnie-the-Pooh walked all round him once.\n\"Why, what's happened to your tail?\" he said in\nsurprise.\nDigitized by\nGoogle\n44\n\"What has happened to it?\" said Eeyore.\n\"It isn't there!\"\n\"Are you sure?\"\nWINNIE-THE-POOH\n\"Well, either a tail is there or it isn't there. You\ncan't make a mistake about it. 
And yours isn't\nthere!\"\n\"Then what is?\"\n\"Nothing.\"\n\"Let's have a look,\" said Eeyore, and he turned\nslowly round to the place where his tail had been a\nlittle while ago, and then, finding that he couldn't\ncatch it up, he turned round the other way, until he\ncame back to where he was at first, and then he put\nhis head down and looked between his front legs,\nand at last he said, with a long, sad sigh, \"I believe\nyou're right.\"\n\"Of course I'm right,\" said Pooh.\n\"That Accounts for a Good Deal,\" said Eeyore\ngloomily. \"It Explains Everything. No Wonder.\"\nDigitized by\nGoogle\nEEYORE LOSES A TAIL\n45\n\"You must have left it somewhere,\" said Winnie-\nthe-Pooh.\n\"Somebody must have taken it,\" said Eeyore. \"How\nLike Them,\" he added, after a long silence.\nPooh felt that he ought to say something helpful\nabout it, but didn't quite know what. So he decided\nto do something helpful instead.\n\"Eeyore,\" he said solemnly, \"I, Winnie-the-Pooh,\nwill find your tail for you.\"\nn\n\"Thank you, Pooh,\" answered Eeyore. \"You're a\nreal friend,\" said he. \"Not like Some,\" he said.\nSo Winnie-the-Pooh went off to find Eeyore's tail.\nIt was a fine spring morning in the forest as he\nstarted out. Little soft clouds played happily in a\nblue sky, skipping from time to time in front of the\nsun as if they had come to put it out, and then slid-\ning away suddenly so that the next might have his\nturn. Through them and between them the sun\nshone bravely; and a copse which had worn its firs\nall the year round seemed old and dowdy now be-\nside the new green lace which the beeches had put\nDigitized by\nGoogle\n46\non so prettily. Through copse and spinney marched\nBear; down open slopes of gorse and heather, over\nrocky beds of streams, up steep banks of sandstone\ninto the heather again; and so at last, tired and hun-\ngry, to the Hundred Acre Wood. 
For it was in the\nHundred Acre Wood that Owl lived.\n\"And if anyone knows anything about anything,\"\nsaid Bear to himself, \"it's Owl who knows some-\nthing about something,\" he said, “or my name's not\nWinnie-the-Pooh,” he said. “Which it is,” he added.\n\"So there\nyou are.\nOwl lived at The Chestnuts, an old-world resi-\ndence of great charm, which was grander than any-\nbody else's, or seemed so to Bear, because it had\nboth a knocker and a bell-pull. Underneath the\nknocker there was a notice which said:\nWINNIE-THE-POOH\nPLES RING IF AN RNSER IS REQIRD.\nUnderneath the bell-pull there was a notice which\nsaid:\nPLEZ CNOKE IF AN RNSR IS NOT REQID.\nThese notices had been written by Christopher\nRobin, who was the only one in the forest who\ncould spell; for Owl, wise though he was in many\nways, able to read and write and spell his own name\nWOL, yet somehow went all to pieces over delicate\nwords like MEASLES and BUTTERED TOAST.\nDigitized by\nGoogle\n48\nWINNIE-THE-POOH\nWinnie-the-Pooh read the two notices very care-\nfully, first from left to right, and afterwards, in case\nhe had missed some of it, from right to left. Then,\nto make quite sure, he knocked and pulled the\nknocker, and he pulled and knocked the bell-rope,\nand he called out in a very loud voice, “Owl! I re-\nquire an answer! It's Bear speaking.\" And the door\nopened, and Owl looked out.\n\"Hallo, Pooh,\" he said. \"How's things?\"\n\"Terrible and Sad,\" said Pooh, \"because Eeyore,\nwho is a friend of mine, has lost his tail. And he's\nMoping about it. So could you very kindly tell me\nhow to find it for him?\"\n\"Well,\" said Owl, \"the customary procedure in\nsuch cases is as follows.\"\n\"What does Crustimoney Proseedcake mean?” said\nPooh. \"For I am a Bear of Very Little Brain, and\nlong words, Bother me.\"\n\"It means the Thing to Do.\"\n\"As long as it means that, I don't mind,\" said Pooh\nhumbly.\n\"The thing to do is as follows. First, Issue a Re-\nward. 
Then--\"\n\"Just a moment,\" said Pooh, holding up his paw.\n“What do we do to this-what you were saying?\nYou sneezed just as you were going to tell me.\"\n\"I didn't sneeze.\"\n\"Yes, you did, Owl.\"\nDigitized by\nGoogle\n"}
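
Review note: in the test shard above, each page's `layout.textAnchor.textSegments` holds `startIndex`/`endIndex` offsets into the shard's `text`, and the first page omits `startIndex`. The sketch below shows how a page's text might be recovered from such a structure. It assumes a missing `startIndex` means 0 and uses a tiny made-up ASCII shard instead of the test file; `page_text` is a hypothetical helper, not a toolbox function.

```python
from typing import Any, Dict


def page_text(shard: Dict[str, Any], page: Dict[str, Any]) -> str:
    """Join the text segments a page's layout anchor points at (hypothetical helper)."""
    text = shard.get("text", "")
    segments = page.get("layout", {}).get("textAnchor", {}).get("textSegments", [])
    parts = []
    for segment in segments:
        # Indices are serialized as strings; a missing startIndex is assumed to be 0.
        start = int(segment.get("startIndex", 0))
        end = int(segment["endIndex"])
        parts.append(text[start:end])
    return "".join(parts)


# Made-up shard with the same shape as the test JSON, kept small for readability.
shard = {
    "text": "first page\nsecond page\n",
    "pages": [
        {"layout": {"textAnchor": {"textSegments": [{"endIndex": "11"}]}}, "pageNumber": 1},
        {
            "layout": {"textAnchor": {"textSegments": [{"startIndex": "11", "endIndex": "23"}]}},
            "pageNumber": 2,
        },
    ],
}

texts = [page_text(shard, page) for page in shard["pages"]]
print(texts)
```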
