Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: Restructuring of classes to reduce duplication and increase readibility #110

Merged
merged 22 commits into from
Jun 23, 2023

Conversation

holtskinner
Copy link
Member

@holtskinner holtskinner commented Apr 19, 2023

  • Changed vision_helpers.py to input Document.Page to reduce extra imports.
  • Rewrote Entity image extraction to avoid storing the Pillow image in memory.
  • Added error checking for missing shards.
  • Added sorting of shards by index in case they are read out of order.
  • Fixed bug where Entity page numbers would be incorrect in multi-shard documents.
  • Simplified creation of Page sub classes by using __post_init__()
  • Renamed Page.text and Table.text to document_text to be more descriptive and match the other Page elements.
  • Rewrote table_sample.py to use more accurate/descriptive variable names.
  • Minor formatting of comments/docs

@holtskinner holtskinner requested review from a team as code owners April 19, 2023 21:43
@product-auto-label product-auto-label bot added the size: m Pull request size is medium. label Apr 19, 2023
@holtskinner holtskinner added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Apr 19, 2023
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Apr 19, 2023
@product-auto-label product-auto-label bot added size: l Pull request size is large. and removed size: m Pull request size is medium. labels Apr 20, 2023
@holtskinner holtskinner requested a review from a team as a code owner April 20, 2023 21:59
@galz10 galz10 requested a review from dizcology April 25, 2023 14:53
@galz10
Copy link
Collaborator

galz10 commented Apr 25, 2023

@dizcology i'd like to get a second opinion on these change, please take a look when you have time.

@galz10 galz10 requested review from galz10 and removed request for balajismaniam April 25, 2023 14:54
google/cloud/documentai_toolbox/wrappers/document.py Outdated Show resolved Hide resolved
google/cloud/documentai_toolbox/wrappers/document.py Outdated Show resolved Hide resolved
google/cloud/documentai_toolbox/wrappers/entity.py Outdated Show resolved Hide resolved
google/cloud/documentai_toolbox/wrappers/entity.py Outdated Show resolved Hide resolved
google/cloud/documentai_toolbox/wrappers/entity.py Outdated Show resolved Hide resolved
google/cloud/documentai_toolbox/wrappers/page.py Outdated Show resolved Hide resolved
samples/snippets/table_sample.py Outdated Show resolved Hide resolved
google/cloud/documentai_toolbox/wrappers/page.py Outdated Show resolved Hide resolved
- Add incomplete shard check
- refactor some of the vision helpers to simplify imports
- Fix documentation formatting issue
@dizcology dizcology mentioned this pull request May 2, 2023
4 tasks
- Avoids some backwards incompatibility issues
- Fixed bug where Entity page numbers would not line up with multi-shard documents
@holtskinner holtskinner changed the title refactor: Change classes to avoid extra copies refactor: Restructuring of classes to reduce duplication and increase readibility Jun 15, 2023
@holtskinner holtskinner added the owlbot:run Add this label to trigger the Owlbot post processor. label Jun 15, 2023
@gcf-owl-bot gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label Jun 15, 2023
holtskinner and others added 3 commits June 20, 2023 14:15
- Added a more specific type for dictionary. `Dict[str, Union[str, List[str]]]`
- Rewrote Docstring for `Entity.page_offset`
- Wrote unit test for `page_offset`
@holtskinner holtskinner added the owlbot:run Add this label to trigger the Owlbot post processor. label Jun 20, 2023
@gcf-owl-bot gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label Jun 20, 2023
@galz10 galz10 requested a review from dizcology June 23, 2023 17:20
@holtskinner holtskinner merged commit 778e433 into main Jun 23, 2023
21 checks passed
@holtskinner holtskinner deleted the memory branch June 23, 2023 17:35
holtskinner added a commit that referenced this pull request Jun 30, 2023
holtskinner added a commit that referenced this pull request Jul 7, 2023
* refactor: Reorganize hocr functions

- Use more jinja templating instead of hardcoding strings
- Simplified bounding box function
- Changed parameter name for `_get_hocr_bounding_box` to `page_dimension` for more clarity.

* samples: Added sample for convert to hocr

* refactor: Reordering of classes in page.py

* refactor: Re-added refactoring to remove extra `get_*()` methods in page.py

- Added in #110 Lost in Merge

* fix: Moved `templates` directory into package.
- Required for template to work in installed library
holtskinner added a commit that referenced this pull request Jul 20, 2023
…ciency and follow standard practices. (#139)

* refactor: Reorganize hocr functions

- Use more jinja templating instead of hardcoding strings
- Simplified bounding box function
- Changed parameter name for `_get_hocr_bounding_box` to `page_dimension` for more clarity.

* samples: Added sample for convert to hocr

* refactor: Reordering of classes in page.py

* refactor: Re-added refactoring to remove extra `get_*()` methods in page.py

- Added in #110 Lost in Merge

* fix: Moved `templates` directory into package.
- Required for template to work in installed library

* chore: Ran isort and black

* chore: Ran no-implicit-optional

* refactor: Refactored document.py
- improve readability, follow python conventions, and improve efficiency
- Also, fixed a previously unknown bug where `Document.search_pages()` returned inaccurate results because it only searched paragraph.text, not page.text

* refactor: Refactor gcs_utilities for readability/pythonic style

* refactor: Refactor page.py to improve efficiency, readability and follow python conventions

* refactor: Rename `Entity.documentai_entity` to `Entity.documentai_object` to match the page.py file

* refactor: Move bounding box extraction to `docai_utilities.py`

* refactor: Major Refactoring of converter_helpers.py to simplify/organize functions, reduce complexity, and increase readability

* fix: Fixed refactor of export_images in document.py

* refactor: Cleanup of blocks.py using `getattr()`

* refactor: Refactoring of bbox_conversion.py to improve readability and efficiency

* fix: Change _get_files() to send full gcs uri to _get_bytes()
- Also reduce wait_time in tests

* refactor: Move `converter_helpers.py` functions into `converter.py`

- `converter.py` only had one external facing function that called an internal function with the same parameters.
- Not sure if there was a specific reason for this setup, can be undone if needed.

* chore(deps): update dependency google-cloud-documentai to v2.16.1 (#138)

* fix: Change _get_files() to send full gcs uri to _get_bytes()
- Also reduce wait_time in tests

* refactor: Move `converter_helpers.py` functions into `converter.py`

- `converter.py` only had one external facing function that called an internal function with the same parameters.
- Not sure if there was a specific reason for this setup, can be undone if needed.

* chore: Reran black formatting after merge conflict

* refactor: Minor refactoring of test_bbox_conversion.py to improve readability

* refactor: Changed blocks.py to block.py for consistency.

- Changed how `Block` is initialized.
- Changed `load_blocks_from_schema` into a `@classmethod` to simplify imports.

* fix: Added Missing type annotations to `document.py`

* fix: Add new filename for block.py into test_bbox_conversion.py

* 🦉 Updates from OwlBot post-processor

See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md

* fix: Fix failing tests for Block class. Changed all fields to have types

* fix: Changed `converter._get_bytes` to return a Tuple

* chore: Addressed Code Review Comments

- Removed FILES_TO_IGNORE
- Simplification of logic in `_get_multiplier` `convert_bbox_to_docproto_bbox`
- Addressed other lint errors
- Adjusted function names to indicate not protected members.

* fix: Remove extra reference to metadata_blob

* fix: Change expected test output and remove references to `geometry`

---------

Co-authored-by: Mend Renovate <bot@renovateapp.com>
Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size: l Pull request size is large.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants