fix: docai_utilities.py to return Optional #176
🤖 I detect that the PR title and the commit message differ and there's only one commit. To use the PR title for the commit history, you can use Github's automerge feature with squashing, or use -- conventional-commit-lint bot
docai_utilities.py to return an Optional
docai_utilities.py to return Optional: 15a3021 to 4da74a2
If possible, please also explain the duplicate/inaccurate elements in the hOCR output in the PR description.
I will once I get that information.
Note: The customer's code uses this whenever the document is blank. I'm not sure whether this is a standard structure for blank hOCR documents, but it could be worth looking into. Note that it doesn't follow the corrected structure after #169:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="unknown" lang="unknown">
<head>
<title>hocr</title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<meta name="ocr-system" content="Document AI OCR" />
<meta name="ocr-langs" content="unknown" />
<meta name="ocr-number-of-pages" content="1" />
<meta name="ocr-capabilities" content="ocr_page ocr_carea ocr_par ocr_line ocrx_word" />
</head>
<body>
<div class='ocr_page' lang='unknown' title='bbox 0 0 0 0'>
<span class='ocr_carea' id='block_1_0' title='bbox 0 0 0 0'>
<span class='ocr_par' id='par_1_0_0' title='bbox 0 0 0 0'>
<span class='ocr_line' id='line_1_0_0_0' title='bbox 0 0 0 0'></span>
<span class='ocrx_word' id='word_1_0_0_0_0' title='bbox 0 0 0 0'></span>
</span>
</span>
</div>
</body>
</html>
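In the blank-page hOCR above, the `ocrx_word` span sits next to the `ocr_line` span rather than inside it. Assuming the corrected structure after #169 means every `ocrx_word` is nested within an `ocr_line` (inferred from that PR's title, not confirmed here), a minimal sketch for spotting the mismatch could look like this. The helper name `words_outside_lines` is hypothetical and not part of `docai_utilities.py`; the sample is a trimmed copy of the body only.

```python
import xml.etree.ElementTree as ET
from typing import List

# Trimmed copy of the blank-page hOCR above (head, DOCTYPE, and XML
# declaration omitted for brevity).
BLANK_HOCR = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<div class='ocr_page' lang='unknown' title='bbox 0 0 0 0'>
<span class='ocr_carea' id='block_1_0' title='bbox 0 0 0 0'>
<span class='ocr_par' id='par_1_0_0' title='bbox 0 0 0 0'>
<span class='ocr_line' id='line_1_0_0_0' title='bbox 0 0 0 0'></span>
<span class='ocrx_word' id='word_1_0_0_0_0' title='bbox 0 0 0 0'></span>
</span>
</span>
</div>
</body>
</html>"""


def words_outside_lines(hocr: str) -> List[str]:
    """Return ids of ocrx_word elements whose direct parent is not an ocr_line.

    Hypothetical helper: it assumes words should be direct children of
    their line element.
    """
    root = ET.fromstring(hocr.encode("utf-8"))
    misplaced = []
    # Walk every element and inspect its direct children, since
    # ElementTree nodes do not keep a reference to their parent.
    for parent in root.iter():
        for child in parent:
            is_word = child.get("class") == "ocrx_word"
            if is_word and parent.get("class") != "ocr_line":
                misplaced.append(child.get("id", ""))
    return misplaced


print(words_outside_lines(BLANK_HOCR))  # ['word_1_0_0_0_0']
```

The check matches on the `class` attribute only, so it is unaffected by the XHTML namespace on the root element.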
ocr_line <span> to include all ocr_word #169