Add interword space option to HOCR pdf renderer #225
This commit includes an optional workaround for limitations of the PDF.js viewer described in #133. It explicitly adds an additional space to text elements before drawing them on the PDF canvas when using the HOCR renderer. This option does not apply to other PDF renderers in OCRmyPDF and is turned off by default.
Thanks for documenting this so thoroughly.
I am curious - does the default sandwich renderer not work for you?
Is there a reason why this should not be default behavior - for example if
it doesn't work on some viewers?
…On Mar 1, 2018 10:31 AM, "Charles Forcey" ***@***.***> wrote:
This pull request adds a new advanced option --interword-spaces to
OCRmyPDF to allow the hocr renderer to produce PDF output compatible with
PDF.js and potentially other viewers that have difficulty detecting
phrases, lines, and paragraphs in separately placed text layers. This new
switch is a workaround for limitations of the PDF.js viewer described in
#133.
Background
OCRmyPDF justifiably prioritizes the accurate placement of words on the
text layer as individual glyphs. Most PDF viewers have heuristics that
allow them to identify paragraphs, lines, and phrases while searching and
to insert the correct inter-word spacing when copying and pasting. PDF.js
has over 80 issues flagged with 4-text-selection and there have been a
number of pull requests to address the issue that have apparently gotten
bogged down with edge cases, performance concerns, and perhaps the inherent
challenges of a pure Javascript and HTML approach to PDF rendering.
Strategy
The goal of this pull request is to add an unobtrusive option to OCRmyPDF
to allow it to produce PDF.js compatible output for those that must support
PDF.js as a business requirement. Specifically, this PR follows the code
conventions by adding an advanced option --interword-spaces to the options
parser and ensures this option is available to the hocrtransform.py
renderer. When set to true, the HOCR renderer will add an additional space
at the end of each text element before drawing it on the text layer. This
option does not apply to other pdf renderers in OCRmyPDF, is turned off by
default, and issues a warning if used without the --pdf-renderer hocr
option also set.
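The option handling described above could look roughly like the sketch
below. This is an illustration only, not the code in this PR: the helper
names build_parser, check_options, and render_word are hypothetical, and
the "auto" default for --pdf-renderer is an assumption.

```python
import argparse
import warnings
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical, minimal stand-in for the real options parser."""
    parser = argparse.ArgumentParser(prog="ocrmypdf-sketch")
    # Assumption: the real parser defines more renderers and another default.
    parser.add_argument("--pdf-renderer", default="auto")
    parser.add_argument(
        "--interword-spaces",
        action="store_true",
        help="append a space to each text element (hocr renderer only)",
    )
    return parser


def check_options(options: argparse.Namespace) -> List[str]:
    """Warn when the flag is set but the hocr renderer is not selected."""
    messages = []
    if options.interword_spaces and options.pdf_renderer != "hocr":
        messages.append(
            "--interword-spaces has no effect without --pdf-renderer hocr"
        )
    for message in messages:
        warnings.warn(message)
    return messages


def render_word(text: str, interword_spaces: bool) -> str:
    """Return the string that would be drawn on the text layer."""
    return text + " " if interword_spaces else text
```

With the flag off, render_word returns the word unchanged, so the
default output stays the same.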
Documentation
This PR added a new section to the advanced documentation for the new
option, a note on the 'hocr' renderer description about the option in the
same file, and a note that this is available in the introduction where
there is a relevant discussion of PDF as a layout format dependent on the
viewer to interpret the structure of the document in terms of words,
sentences, and paragraphs.
Testing
We confirmed that existing tests that exercised this code continue to
pass. We encountered some seemingly preexisting failures in other tests. We
explored the option of adding additional tests to confirm the warning is
provided and the output is as expected, and would welcome guidance as to
where that test should be placed and how best to combine it with RENDERERS
tests in the test_main.py or the more specific test_hocrtransform.py.
Sample PDF Output
The following file was processed with this option set to true. When loaded
into the latest PDF.js viewer, multi-word search and copy and paste are
improved over the standard HOCR output:
Input PDF: https://github.com/logikcull/OCRmyPDF/blob/master/tests/resources/linn.pdf
# original command
ocrmypdf --output-type pdf --pdf-renderer hocr ./tests/resources/linn.pdf output.linn.hocr.pdf
Output PDF: https://www.dropbox.com/s/2ugzxldqsvy8q6x/output.linn.hocr.pdf
Behavior when loaded into latest PDF.js viewer
<https://mozilla.github.io/pdf.js/web/viewer.html> -- note that you have
to remove spaces to find multiple words. Selecting and pasting the text
also has spaces removed:
[image: screen shot 2018-03-01 at 12 03 10 pm]
<https://user-images.githubusercontent.com/38297/36858555-6a119b3e-1d49-11e8-970b-a9e8f3c69aec.png>
# command with new --interword-spaces option
ocrmypdf --output-type pdf --interword-spaces --pdf-renderer hocr ./tests/resources/linn.pdf output.linn.hocr.interword.pdf
Output PDF: https://www.dropbox.com/s/tukp6ftpjebe1gh/output.linn.hocr.interword.pdf
Behavior when loaded into latest PDF.js viewer
<https://mozilla.github.io/pdf.js/web/viewer.html> -- note that you can
find multiple words separated by spaces. Copy and paste is also improved:
[image: screen shot 2018-03-01 at 12 03 42 pm]
<https://user-images.githubusercontent.com/38297/36858656-b19047f8-1d49-11e8-82a3-1d68c4519242.png>
Testing in Adobe Reader and Chrome's native PDF viewer showed that files
rendered with the new option continued to perform as well or better when
searching and copying and pasting. Apple Preview handled neither output
file particularly well so we think there has at least been no harm done.
Alternative Approaches
If this approach of adding a new option with a warning if used without
hocr is too disruptive, we could also consider contributing a new pipeline
task for a fifth renderer titled 'hocr-sloppy-text' or something similar
that runs a nearly identical version of hocrtransform.py with the space
suffix turned on by default. This approach has the serious downside of
repeating complex code, but the upside of leaving the existing hocr
renderer behavior 100% unchanged and opening the way in the future for
other "sloppy-text" fixes required to produce PDFs for simpler viewers like
PDF.js.
Related Issues
OCRMyPDF:
- 133: Some hints that Tesseract upgrades might provide some relief,
but underlying conclusion was that PDF.js has a naive implementation of
text selection and word boundaries (#133).
Tesseract:
- 1235 (December 2017): tesseract-ocr/tesseract#1235 includes a good
explanation of the reason for space detection issues: "Known problem. Root
cause is PDF spec which forces heuristics into text extraction, and Preview
is well known to have some of the wonkiest heuristics."
- 699: tesseract-ocr/tesseract#699 (comment)
- 382: tesseract-ocr/tesseract#382
- 337: tesseract-ocr/tesseract#337
PDF.js:
- 7310: Super helpful discussion of HTML divs: mozilla/pdf.js#7310
- 6657: mozilla/pdf.js#6657
- Related PR not merged: mozilla/pdf.js#5783
- Dozens of text selection issues: https://github.com/mozilla/pdf.js/issues?q=is%3Aopen+is%3Aissue+label%3A4-text-selection
Commit Summary
- Add option to explicitly add interword spaces to HOCR pdf-renderer
- Add a note to the documentation about interword-spaces
File Changes
- *M* docs/advanced.rst
<https://github.com/jbarlow83/OCRmyPDF/pull/225/files#diff-0> (13)
- *M* docs/introduction.rst
<https://github.com/jbarlow83/OCRmyPDF/pull/225/files#diff-1> (3)
- *M* ocrmypdf/__main__.py
<https://github.com/jbarlow83/OCRmyPDF/pull/225/files#diff-2> (9)
- *M* ocrmypdf/hocrtransform.py
<https://github.com/jbarlow83/OCRmyPDF/pull/225/files#diff-3> (14)
- *M* ocrmypdf/pipeline.py
<https://github.com/jbarlow83/OCRmyPDF/pull/225/files#diff-4> (8)
Patch Links:
- https://github.com/jbarlow83/OCRmyPDF/pull/225.patch
- https://github.com/jbarlow83/OCRmyPDF/pull/225.diff
|
Thanks so much @jbarlow83 for your quick response. Two great questions:
Thanks so much for this remarkable script -- we tried a lot of the others and did not want to lose all the other great aspects of this pipeline just because of the PDF.js challenges. Best wishes, Charlie |
@jbreiden Thought you'd find this interesting. This PR proposes a change to the hocr to PDF renderer, and the finding is that adding space characters between words helps the usual suspects among mediocre PDF readers find word boundaries, without any regressions for smarter readers. Charlie also reported that Tesseract's PDF renderer had the same problem. So of course I'm thinking, why can't we apply a similar solution to Tesseract's? Has this been tried with Tesseract's PDF renderer? Wouldn't be the first time doing something not quite to spec in a PDF produced better results in the field.... |
Thanks for mentioning. I remember discussing with an accessibility expert from Adobe, but can't remember if I experimented with it. The sample PS. Tesseract & PDF.js have even bigger problems. mozilla/pdf.js#6509 mozilla/pdf.js#6863 |
In PDF.js, for baseline, double-clicking highlights several words. For interword, double-clicking highlights a word consistently. So interword is an improvement here, although the bounding boxes seem to be a regression on Acrobat and PDFium. As for PDF.js itself, there are a lot of web apps that use it to present a custom PDF viewer – I get the impression that is @cforcey's case. So it should be with us for a long time even if Mozilla adopts PDFium. The hocr to PDF renderer isn't capable of handling full Unicode, so I don't know how other layouts would be affected. |
Let me know what you think. The regression spotted so far, which appears in both Acrobat and PDFium (Chrome PDF), is that double-clicking a word doesn't highlight properly. If several words are selected, the bounding box on the last character will be misaligned too. Highlighting the last letter in a word looks a little wonky. hocrtransform.py has a bounding boxes debug feature that might be helpful. There is no way to enable it other than hard coding. Of course the bounding boxes in hocr mode are not perfect vertically,
I suspect the spot you'll need to adjust is the |
@jbarlow83 I suspect the pre-existing test failures were related to #217. I was running the test suite on my MacBook with Tesseract 4 and getting test failures due to Tesseract timeouts. |
Here we manually scale the pt width used for the BoundingBox and the Text element when adding whitespace to account for limitations of the PDF.js viewer. This fixes an initial regression noticed when selecting text elements in Chrome and PDFium. The width of the Text element and BoundingBox had not been adjusted for the additional whitespace, so the highlighting was offset slightly.
The latest commit is an attempt to address the regression found for text selection in Acrobat and PDFium. Here are three samples: There are still some examples where text highlighting is offset from the image text in the linn.pdf but these do not appear to be regressions caused by interword-spaces but rather HOCR inaccuracies that already exist in the control. I'm also not entirely sure if scaling the pt Rect tuple is appropriate here or if we should specifically target the horizontal scaling of the Text element. The rationale for scaling the pt Rect tuple was to provide scaling of the Text element as well as the BoundingBox element. I'm open to suggestions if we are not happy with this implementation. |
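To make the width adjustment discussed above concrete, here is a minimal sketch of one way to widen a word's rectangle when a trailing space is appended. It is not the code from this commit: the Rect type and the widen_for_trailing_space helper are hypothetical, and the sketch assumes every character (including the space) has the same advance width, which will not hold for proportional fonts.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Hypothetical word rectangle in PDF points."""
    x1: float
    y1: float
    x2: float
    y2: float


def widen_for_trailing_space(box: Rect, text: str) -> Rect:
    """Stretch the box to the right so it also covers one appended space.

    Assumption: all characters have equal advance width, so a word of
    n characters plus one space needs (n + 1) / n times the original width.
    """
    if not text:
        return box  # nothing to scale against
    scale = (len(text) + 1) / len(text)
    width = box.x2 - box.x1
    return Rect(box.x1, box.y1, box.x1 + width * scale, box.y2)
```

For a four-letter word in a 40 pt wide box this yields a 50 pt wide box. If the gap to the next word is narrower than the added space, the widened box will run into the neighbouring word, so the result probably needs clamping in practice.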
Homebrew removed python3 and python now defaults to version 3. Here we use `brew upgrade python` to upgrade the pre-installed version of python to python3.
Not sure if the Homebrew python fix belongs on this branch so I submitted #227. Maybe it makes more sense to merge the python fix to master and then rebase master here. |
A couple quick thoughts from the peanut gallery. Please be super careful to never introduce overlap in word bounding boxes. This causes all sorts of chaos. It's also probably worth checking Preview on MacOS X, which has notoriously persnickety word boundary heuristics. Finally, if you get a chance, try testing on a language like Chinese and see what happens. |
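One way to guard against overlapping word boxes is a small check over the horizontal extents of the words on a line. The sketch below is only an illustration: find_overlaps is a hypothetical helper, not part of hocrtransform.py, and it assumes the boxes for one text line are given as (x1, x2) pairs in points.

```python
from typing import List, Sequence, Tuple


def find_overlaps(boxes: Sequence[Tuple[float, float]]) -> List[Tuple[int, int]]:
    """Return index pairs of horizontally adjacent word boxes that overlap.

    Boxes that merely touch (one ends exactly where the next begins)
    are not reported. Only neighbours in left-to-right order are compared.
    """
    # Sort indices by left edge so the input does not need to be ordered.
    order = sorted(range(len(boxes)), key=lambda i: boxes[i][0])
    overlaps = []
    for left, right in zip(order, order[1:]):
        if boxes[right][0] < boxes[left][1]:
            overlaps.append((left, right))
    return overlaps
```

Running this on the boxes before and after the interword-space adjustment would show whether the adjustment introduced any new overlaps.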
Thank you for the feedback. I have not noticed any material difference when using the proposed
Control HOCR renderer
ocrmypdf --pdf-renderer hocr ./tests/resources/linn.pdf linn.hocr.pdf
Interword spaces with HOCR renderer
ocrmypdf --interword-spaces --pdf-renderer hocr ./tests/resources/linn.pdf linn.hocr.interword-spaces.pdf
I also tested simplified Chinese as you suggested. The HOCR renderer is not great in terms of 'copy-and-paste' and search highlighting on PDFium and Adobe but there were no observable regressions from the HOCR renderer with
@jbarlow83 @jbreiden The only loose end we have is whether this option should be the default HOCR behaviour or, as coded and documented now, an optional mode of HOCR rendering. Keeping it optional follows the principle of least surprise for existing users upgrading for other reasons, so that makes the most sense to us, but of course either way works well for our use case. |
I took what you started and made some further improvements. I believe I've addressed the "ransom note" effect by using the same font size for everything Tesseract considers part of the same line, improved the quality of the PDF content stream, and took a stab at dealing with text on a skewed baseline. It's still a little vertically misaligned in some cases. I'd prefer if we can get confident enough in the changes overall to remove the need for an option. Please try out the |
I would love to review HOCR + PDF pairs when available. |
Thanks so much @jbarlow83 for the improvements to HOCR layout -- Tucker reviewed the changes with me and we were both thrilled with the improvements and the results. Tucker will be posting some samples shortly, but I realized in the meantime that I have neglected to post a gist of the solution that my colleagues came up with as a mash-up of other open source solutions. I include it here just because this script gave us pretty good HOCR positioning and might have a few useful fragments: https://gist.github.com/cforcey/0f219f5be2017a7d8059e030abf971eb#file-gistfile1-txt-L177 I am also attaching an original PDF and a sample processed by this script. The one with _ocr in the filename is No pressure to even look at it as it might be more primitive than what is being done currently on this branch. 143931771.pdf When you are ready to decide about adding or removing the interword-spacing option, let us know and I will edit the documents to match the final interface if that is helpful. Thanks so much! Charlie |
@cforcey @jbarlow83 I just tested your branch |
Here are sample HOCR output and PDFs rendered with the hocr pdf-renderer, taken from existing test resources using the Here are the switches I used: ocrmypdf --interword-spaces --pdf-renderer hocr --clean --deskew --rotate-pages @jbarlow83 |
Thank you for the samples, I am reviewing now. Right now I'm a little confused why PDFium is highlighting these two words from
|
Added a few further improvements, mainly adjusting the baseline. I'm confident enough to switch this on for everyone unless someone can point out a serious issue. Two files: PDF.js looks quite sharp now even in skew: macOS Preview is still not quite right, but nothing to do with the interword space change. |
@jbarlow83 This is thrilling progress. Our PDF.js users will be very happy with the tighter highlighting placement you have engineered! Can't thank you enough! Tucker and I look forward to pulling all this goodness down from the official branch and getting it into production. Best wishes and looking forward to future evolutions of this incredibly useful project. Charlie |
This looks good to me. I will see if PDFium team has any comments but not expecting trouble. |
Released in 5.7.0 |
Thanks again for all your help and improvements getting this out. We will let you know once we get it running how things go. Best wishes! Charlie |
They are not going to switch to PDFium. |
Sorry for my naivety, but i cannot access |
|
@jbarlow83 Could I please see an ocrmypdf result for this file? pdf.js is still really struggling with horizontally stretching a glyphless invisible font. |
I set DPI to 150. The HOCR renderer discussed in this PR is not enabled by default because it doesn't work as well in non-Latin languages. Case 1: Using the Tesseract PDF renderer and not using Ghostscript: As a baseline, this first case uses Tesseract's PDF renderer so it should be nearly the same as using Tesseract directly.
Result: Case 2: Using the HOCR renderer and no Ghostscript: Now enable HOCR.
Result: Neither works particularly well. I think there must be some bug in pdf.js since highlighting the text with command+A gives different results than click and drag. |