fix: `docai_utilities.py` to return `Optional` #176

holtskinner · 2023-10-03T16:29:08Z

Should resolve the customer-reported issue in support case #47169701 relating to duplicate/inaccurate elements in hOCR output.
Followup to:
- fix: Add handling for documents missing all layout elements. #161
- fix: Change ocr_line <span> to include all ocr_word #169
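As a rough illustration of what "return `Optional`" can mean for an hOCR exporter, here is a minimal sketch. It is not the code from this PR: `Vertex`, `Layout`, `bounding_box`, and `hocr_title` are hypothetical names invented for the example. It assumes the helper returns `None` when an element has no layout geometry, so the caller can skip the element rather than emit a placeholder.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


# Hypothetical stand-ins for layout data; not the toolbox's real types.
@dataclass
class Vertex:
    x: float
    y: float


@dataclass
class Layout:
    vertices: List[Vertex] = field(default_factory=list)


def bounding_box(layout: Optional[Layout]) -> Optional[Tuple[int, int, int, int]]:
    """Return (x_min, y_min, x_max, y_max), or None if there is no geometry."""
    if layout is None or not layout.vertices:
        return None
    xs = [v.x for v in layout.vertices]
    ys = [v.y for v in layout.vertices]
    return int(min(xs)), int(min(ys)), int(max(xs)), int(max(ys))


def hocr_title(layout: Optional[Layout]) -> Optional[str]:
    """Build an hOCR title attribute value, or None if the element should be skipped."""
    box = bounding_box(layout)
    if box is None:
        return None
    return "bbox {} {} {} {}".format(*box)
```

With this shape, a template or caller can test for `None` and omit the element, which is one plausible way to avoid spurious entries in the output.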

conventional-commit-lint-gcf · 2023-10-03T16:29:13Z

🤖 I detect that the PR title and the commit message differ and there's only one commit. To use the PR title for the commit history, you can use Github's automerge feature with squashing, or use automerge label. Good luck human!

-- conventional-commit-lint bot
https://conventionalcommits.org/

google/cloud/documentai_toolbox/utilities/docai_utilities.py

google/cloud/documentai_toolbox/wrappers/entity.py

tests/unit/test_document.py

dizcology · 2023-10-03T18:19:43Z

If possible, please also give some explanations about the duplicate/inaccurate elements in hOCR output in the PR description.

holtskinner · 2023-10-03T18:21:49Z

> If possible, please also give some explanations about the duplicate/inaccurate elements in hOCR output in the PR description.

I will once I get that information

holtskinner · 2023-10-06T16:09:11Z

Note: In the customer's code, they use the following hOCR whenever the document is blank. I'm not sure whether this is a standard structure for blank hOCR documents, but it could be worth looking into.

Note: It doesn't follow the corrected structure introduced in #169.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="unknown" lang="unknown">
<head>
<title>hocr</title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<meta name="ocr-system" content="Document AI OCR" />
<meta name="ocr-langs" content="unknown" />
<meta name="ocr-number-of-pages" content="1" />
<meta name="ocr-capabilities" content="ocr_page ocr_carea ocr_par ocr_line ocrx_word" />
</head>
<body>
<div class='ocr_page' lang='unknown' title='bbox 0 0 0 0'>
<span class='ocr_carea' id='block_1_0' title='bbox 0 0 0 0'>
<span class='ocr_par' id='par_1_0_0' title='bbox 0 0 0 0'>
<span class='ocr_line' id='line_1_0_0_0' title='bbox 0 0 0 0'></span>
<span class='ocrx_word' id='word_1_0_0_0_0' title='bbox 0 0 0 0'></span>
</span>
</span>
</div>
</body>
</html>
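To make the "blank document" case above easier to spot in tests, here is a minimal sketch of a check for hOCR in which every `bbox` is `0 0 0 0`. The function name `is_blank_hocr` is hypothetical, and the assumption that an all-zero set of boxes means "blank" comes only from the customer's sample above, not from the hOCR specification.

```python
import re

# Matches the four integer coordinates of an hOCR bbox property.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def is_blank_hocr(hocr: str) -> bool:
    """Return True if the text contains bboxes and all of them are 0 0 0 0."""
    boxes = BBOX_RE.findall(hocr)
    # An input with no bbox at all is not treated as a blank placeholder.
    return bool(boxes) and all(all(v == "0" for v in box) for box in boxes)
```

This is a plain text scan rather than an XML parse, so it does not validate nesting such as whether `ocrx_word` sits inside `ocr_line`.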

holtskinner requested review from a team as code owners October 3, 2023 16:29

product-auto-label bot added the `size: m` label Oct 3, 2023

fix: docai_utilities.py to return Optional

4da74a2

- Should resolve customer reported issue in support case #47169701 relating to duplicate/inaccurate elements in hOCR output
- Followup to:
  - #161
  - #169

holtskinner changed the title ~~fix: Update docai_utilities.py to return an Optional~~ fix: docai_utilities.py to return Optional Oct 3, 2023

holtskinner force-pushed the hocr-fixes branch from 15a3021 to 4da74a2 Compare October 3, 2023 16:31

holtskinner assigned dizcology Oct 3, 2023

holtskinner and others added 2 commits October 3, 2023 11:47

Increase test coverage

48772b8

Merge branch 'main' into hocr-fixes

1fbf76d

holtskinner added the `kokoro:force-run` label Oct 3, 2023

yoshi-kokoro removed the `kokoro:force-run` label Oct 3, 2023

dizcology requested changes Oct 3, 2023

google/cloud/documentai_toolbox/utilities/docai_utilities.py (resolved)

google/cloud/documentai_toolbox/wrappers/entity.py (outdated, resolved)

tests/unit/test_document.py (outdated, resolved)

Addressed review comments

38d6727

holtskinner added the `kokoro:force-run` and `owlbot:run` labels Oct 3, 2023

gcf-owl-bot bot removed the `owlbot:run` label Oct 3, 2023

yoshi-kokoro removed the `kokoro:force-run` label Oct 3, 2023

dizcology approved these changes Oct 6, 2023

holtskinner merged commit 028bc37 into main Oct 6, 2023
23 checks passed

holtskinner deleted the hocr-fixes branch October 6, 2023 16:14

release-please bot mentioned this pull request Oct 6, 2023

chore(main): release 0.10.3-alpha #177

Merged
