Add interword space option to HOCR pdf renderer #225

Closed
wants to merge 5 commits into from

Contributor

@cforcey cforcey commented Mar 1, 2018

This pull request adds a new advanced option --interword-spaces to OCRmyPDF to allow the hocr renderer to produce PDF output compatible with PDF.js and potentially other viewers that have difficulty detecting phrases, lines, and paragraphs in separately placed text layers. This new switch is a workaround for limitations of the PDF.js viewer described in #133.

Background

OCRmyPDF justifiably prioritizes the accurate placement of words on the text layer as individual glyphs. Most PDF viewers have heuristics that allow them to identify paragraphs, lines, and phrases while searching and to insert the correct inter-word spacing when copying and pasting. PDF.js has over 80 issues flagged with 4-text-selection and there have been a number of pull requests to address the issue that have apparently gotten bogged down with edge cases, performance concerns, and perhaps the inherent challenges of a pure JavaScript and HTML approach to PDF rendering.

Strategy

The goal of this pull request is to add an unobtrusive option to OCRmyPDF to allow it to produce PDF.js compatible output for those that must support PDF.js as a business requirement. Specifically, this PR follows the code conventions by adding an advanced option --interword-spaces to the options parser and ensures this option is available to the hocrtransform.py renderer. When set to true, the HOCR renderer will add an additional space at the end of each text element before drawing it on the text layer. This option does not apply to other pdf renderers in OCRmyPDF, is turned off by default, and issues a warning if used without the --pdf-renderer hocr option also set.
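The behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the parser is a stripped-down stand-in, and `build_parser`, `effective_interword_spaces`, and `text_to_draw` are hypothetical names chosen for the example.

```python
import argparse
import warnings


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical, stripped-down parser; only the two options discussed here.
    parser = argparse.ArgumentParser(prog="ocr-sketch")
    parser.add_argument("--pdf-renderer", default=None)
    parser.add_argument(
        "--interword-spaces",
        action="store_true",
        help="append a space to each word drawn by the hocr renderer",
    )
    return parser


def effective_interword_spaces(options: argparse.Namespace) -> bool:
    """Return True only when the option can take effect; warn otherwise."""
    if options.interword_spaces and options.pdf_renderer != "hocr":
        warnings.warn("--interword-spaces has no effect without --pdf-renderer hocr")
        return False
    return options.interword_spaces


def text_to_draw(word: str, interword_spaces: bool) -> str:
    """Text placed on the invisible text layer for one hOCR word."""
    return word + " " if interword_spaces else word
```

With the flag off, `text_to_draw` returns the word unchanged, so the default output stays as it was.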

Documentation

This PR added a new section to the advanced documentation for the new option, a note on the 'hocr' renderer description about the option in the same file, and a note that this is available in the introduction where there is a relevant discussion of PDF as a layout format dependent on the viewer to interpret the structure of the document in terms of words, sentences, and paragraphs.

Testing

We confirmed that existing tests that exercised this code continue to pass. We encountered some seemingly preexisting failures in other tests. We explored the option of adding additional tests to confirm the warning is provided and the output is as expected, and would welcome guidance as to where that test should be placed and how best to combine it with RENDERERS tests in the test_main.py or the more specific test_hocrtransform.py.

Sample PDF Output

The following file was processed with this option set to true. When loaded into the latest PDF.js viewer, multi-word search and copy and paste are improved over the standard HOCR output:

Input PDF: https://github.com/logikcull/OCRmyPDF/blob/master/tests/resources/linn.pdf

# original command 
ocrmypdf --output-type pdf --pdf-renderer hocr ./tests/resources/linn.pdf output.linn.hocr.pdf

Output PDF: https://www.dropbox.com/s/2ugzxldqsvy8q6x/output.linn.hocr.pdf

Behavior when loaded into latest PDF.js viewer -- note that you have to remove spaces to find multiple words. Selecting and pasting the text also has spaces removed:

screen shot 2018-03-01 at 12 03 10 pm

# command with new --interword-spaces option
ocrmypdf --output-type pdf --interword-spaces --pdf-renderer hocr ./tests/resources/linn.pdf output.linn.hocr.interword.pdf

Output PDF: https://www.dropbox.com/s/tukp6ftpjebe1gh/output.linn.hocr.interword.pdf

Behavior when loaded into latest PDF.js viewer -- note that you can find multiple words separated by spaces. Copy and paste is also improved:
screen shot 2018-03-01 at 12 03 42 pm

Testing in Adobe Reader and Chrome's native PDF viewer showed that files rendered with the new option continued to perform as well or better when searching and copying and pasting. Apple Preview handled neither output file particularly well so we think there has at least been no harm done.

Alternative Approaches

If this approach of adding a new option with a warning if used without hocr is too disruptive, we could also consider contributing a new pipeline task for a fifth renderer titled 'hocr-sloppy-text' or something similar that runs a nearly identical version of hocrtransform.py with the space suffix turned on by default. This approach has the serious downside of repeating complex code, but the upside of leaving the existing hocr rendered behavior 100% unchanged and opening the way in the future for other "sloppy-text" fixes required to produce PDFs for simpler viewers like PDF.js.

Related Issues

OCRMyPDF:

Tesseract:

PDF.js:

ctbarbour and others added 2 commits March 1, 2018 13:15
This commit includes an optional work around for limitations of the
PDF.js viewer described in
#133. Here we explicitly
add an additional space to text elements before drawing them on the PDF
canvas when using the HOCR renderer. This option does not apply to
other pdf renderers in OCRmyPDF and is turned off by default.
@jbarlow83
Collaborator

jbarlow83 commented Mar 1, 2018 via email

@cforcey
Contributor Author

cforcey commented Mar 1, 2018

Thanks so much @jbarlow83 for your quick response. Two great questions:

  1. In terms of the sandwich renderer, we took a sample tif from the UNLV ground truth tesseract testing project (open source) and ran it through with each of the render options including sandwich. In each case, we observed in PDF.js the same mashing of words together without whitespace.

screen shot 2018-03-01 at 3 38 33 pm

pdf_render_modes_tested.zip

  2. As to whether this should be the default for the hocr renderer, that is a great question. We could not in light testing of our originals observe any degradation on the more advanced viewers (Acrobat, Chrome, etc) and did confirm improvements to the white space issues that were blocking our searchable PDF deployment to our PDF.js application. In my previous job, I ran very similar code (written by another team member) that added the space and produced reliably searchable PDFs in PDF.js across several million files (https://powersuite.aee.net). I will send along a gist of that code by email. Since this will only fire on relatively legacy setups -- hocr -- it might be acceptable to make this the default behavior.

Thanks so much for this remarkable script -- we tried a lot of the others and did not want to lose all the other great aspects of this pipeline just because of the PDF.js challenges. Best wishes,

Charlie

@jbarlow83
Collaborator

@jbreiden Thought you'd find this interesting. This PR proposes a change to the hocr-to-PDF renderer, and the finding is that adding space characters between words assists the usual suspect mediocre PDF readers with finding word boundaries, without any regressions for smarter readers.

Charlie also reported that Tesseract's PDF renderer had the same problem. So of course I'm thinking, why can't we apply a similar solution to Tesseract's?

Has this been tried with Tesseract's PDF renderer? Wouldn't be the first time doing something not quite to spec in a PDF produced better results in the field....

@jbreiden

jbreiden commented Mar 2, 2018

Thanks for mentioning. I remember discussing with an accessibility expert from Adobe, but can't remember if I experimented with it. The sample output.linn.hocr.pdf is a little weird in Chrome/PDFium, for example double clicking a word gives a smaller bounding box because the space is not included. I also have questions in my mind about some of the diverse languages and layouts that are fed to Tesseract. What is the future of PDF.js, by the way? Is Mozilla planning to switch to PDFium?

PS. Tesseract & PDF.js have even bigger problems. mozilla/pdf.js#6509 mozilla/pdf.js#6863

@jbarlow83
Collaborator

output.linn.hocr.pdf is the baseline and output.linn.hocr.interword.pdf is the proposed change.

In PDF.js, for baseline, double-clicking highlights several words. For interword, double-clicking highlights a word consistently. So interword is an improvement here, although the bounding boxes seem to be a regression on Acrobat and PDFium.

As for PDF.js itself, there are a lot of web apps that use it to present a custom PDF viewer – I get the impression that is @cforcey's case. So it should be with us for a long time even if Mozilla adopts PDFium.

The hocr to PDF renderer isn't capable of handling full Unicode, so I don't know how other layouts would be affected.

@jbarlow83
Collaborator

@cforcey

  • Regarding the pre-existing test failures you spotted, feel free to report any new issues you identify. On Travis CI everything passes, so it should be possible to get everything to pass. (Might want to check if it's related to the Debian issue.)

  • I'm prepared to accept this pretty much as is, presented as a PDF.js compatibility option.

  • But if you're able to fix the regression (or any others that show up in the meantime) we can remove the option entirely and just make it the default/only hocr behaviour. I'd prefer to do it this way.

Let me know what you think.


The regression spotted so far, which appears in both Acrobat and PDFium (Chrome PDF), is that double-clicking a word doesn't highlight properly. If several words are selected, the bounding box on the last character will be misaligned too. Highlighting the last letter in a word looks a little wonky.

Baseline
image

Interwords
image

hocrtransform.py has a bounding boxes debug feature that might be helpful. No way to enable this other than hard coding.

Of course the bounding boxes in hocr mode are not perfect vertically,

@jbarlow83
Collaborator

I suspect the spot you'll need to adjust is the text.setHorizScale() calculation since that doesn't account for the additional space added on to the end of the word.
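To make the scaling issue concrete, here is a small sketch. It assumes a hypothetical fixed-width font in which every character is half an em wide, and the function names are invented for the example; the real renderer measures text with its PDF library's font metrics. The point is only that a scale computed from the padded string squeezes the visible word, while a scale computed from the word alone keeps it aligned with its box and lets the space fall after it.

```python
def natural_width(text: str, font_size: float, char_width_em: float = 0.5) -> float:
    """Width of `text` at 100% horizontal scale in a hypothetical fixed-width font."""
    return len(text) * char_width_em * font_size


def naive_scale(drawn_text: str, box_width: float, font_size: float) -> float:
    """Scale (percent) that fits the whole drawn string, space included, into the box."""
    return 100.0 * box_width / natural_width(drawn_text, font_size)


def horiz_scale(word: str, box_width: float, font_size: float, interword_space: bool = False):
    """Scale (percent) that fits the word alone into its hOCR box.

    Returns (scale, drawn_text, drawn_width); any trailing space extends past the box.
    """
    scale = 100.0 * box_width / natural_width(word, font_size)
    drawn = word + " " if interword_space else word
    drawn_width = natural_width(drawn, font_size) * scale / 100.0
    return scale, drawn, drawn_width


# A 3-letter word in a 30pt-wide box at font size 10:
scale, drawn, width = horiz_scale("OCR", 30.0, 10.0, interword_space=True)
print(scale, repr(drawn), width)           # word fills the box, space adds extra width
print(naive_scale("OCR ", 30.0, 10.0))     # smaller scale: the word no longer fills the box
```

Whether the extra width of the space is acceptable depends on the gap to the next word, so a real implementation would also need to keep neighbouring boxes from overlapping.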

@ctbarbour
Contributor

ctbarbour commented Mar 2, 2018

@jbarlow83 I suspect the pre-existing test failures were related to #217. I was running the test suite on my MacBook with Tesseract 4 and getting test failures due to Tesseract timeouts.

Here we are manually scaling the pt width used for the BoundingBox and
the Text element when manually adding whitespace to account for
limitations of the PDF.js viewer. This fixes an initial regression
noticed when selecting text elements in Chrome and PDFium. The width
of the Text element and BoundBox had not been adjusted for the
additional whitespace so the highlighting was offset slightly.
@ctbarbour
Contributor

ctbarbour commented Mar 2, 2018

The latest commit is an attempt to address the regression found for text selection in Acrobat and PDFium. Here are three samples:

Baseline
linn hocr baseline

Interwords
linn hocr interwordspaces

Interwords with Scaling
linn hocr intewordspaces-scaled

There are still some examples where text highlighting is offset from the image text in the linn.pdf but these do not appear to be regressions caused by interword-spaces but rather HOCR inaccuracies that already exist in the control.

Baseline
linn locate baseline

Interwords with Scaling
linn locate interspacespaces-scaled

I'm also not entirely sure if scaling the pt Rect tuple is appropriate here or if we should specifically target the horizontal scaling of the Text element. The rationale for scaling the pt Rect tuple was to provide scaling of the Text element as well as the BoundingBox element. I'm open to suggestions if we are not happy with this implementation.

Homebrew removed python3 and python now defaults to version 3. Here we
use `brew upgrade python` to upgrade the pre-installed version of
python to python3.
@ctbarbour
Contributor

Not sure if the Homebrew python fix belongs on this branch so I submitted #227. Maybe it makes more sense to merge the python fix to master and then rebase master here.

@jbreiden

jbreiden commented Mar 2, 2018

A couple quick thoughts from the peanut gallery. Please be super careful to never introduce overlap in word bounding boxes. This causes all sorts of chaos. It's also probably worth checking Preview on MacOS X, which has notoriously persnickety word boundary heuristics. Finally, if you get a chance, try testing on a language like Chinese and see what happens.
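One way to guard against the overlap problem mentioned above is a simple check over the word boxes of a page. The sketch below is only an illustration: `Box` and `find_overlaps` are hypothetical names, and the boxes are assumed to be in the hOCR `bbox x0 y0 x1 y1` order after any widening for the added space. Boxes that merely touch at an edge are not counted as overlapping.

```python
from typing import List, NamedTuple, Tuple


class Box(NamedTuple):
    x0: float
    y0: float
    x1: float
    y1: float


def overlaps(a: Box, b: Box) -> bool:
    """True if the two rectangles share interior area (touching edges do not count)."""
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


def find_overlaps(boxes: List[Box]) -> List[Tuple[int, int]]:
    """Return index pairs (i, j), i < j, of word boxes that overlap."""
    pairs = []
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if overlaps(boxes[i], boxes[j]):
                pairs.append((i, j))
    return pairs


# Three words on one line; the third starts before the second ends.
line = [Box(0, 0, 10, 10), Box(10, 0, 20, 10), Box(18, 0, 30, 10)]
print(find_overlaps(line))
```

The pairwise loop is quadratic, which is fine for the few hundred words on a typical page.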

@ctbarbour
Contributor

@jbreiden

Thank you for the feedback. I have not noticed any material difference when using the proposed --interword-spaces on macOS Preview and Safari compared to the HOCR renderer without --interword-spaces enabled. I'm using Preview Version 10.0 (944.4) and Safari Version 11.0.3 (13604.5.6) on macOS 10.13.3.

Control HOCR renderer

ocrmypdf --pdf-renderer hocr ./tests/resources/linn.pdf linn.hocr.pdf

linn hocr

Interword spaces with HOCR renderer

ocrmypdf --interword-spaces --pdf-renderer hocr ./tests/resources/linn.pdf linn.hocr.interword-spaces.pdf

linn hocr interword-spaces

I also tested simplified Chinese as you suggested. The HOCR renderer is not great in terms of 'copy-and-paste' and search highlighting on PDFium and Adobe but there were no observable regressions from the HOCR renderer with --interword-spaces enabled. The sandwich renderer is far better and works as one would expect with both PDFium and Adobe.

@jbarlow83 @jbreiden
After implementing your suggested changes, I think we have restored parity with the existing HOCR implementation in terms of bounding box accuracy, and, of course, not solved any of the issues with HOCR that are solved by more modern approaches to OCR text placement for non-ASCII output. The goal of the PR was just to extend the compatibility of legacy HOCR workflows specifically for naive PDF viewers like PDF.js. As those viewers improve, both HOCR and this interword-spacing shim will hopefully no longer be needed.

The only loose end we have is whether this option should be the default HOCR behaviour or, as coded and documented now, an optional mode of HOCR rendering. Keeping it as an option follows the principle of least surprise for existing users upgrading for other reasons, so that makes the most sense to us, but of course either way works well for our use case.

@jbarlow83
Collaborator

I took what you started and made some further improvements. I believe I've addressed the "ransom note" effect by using the same font size for everything Tesseract considers part of the same line, improved the quality of the PDF content stream, and took a stab at dealing with text on a skewed baseline. It's still a little vertically misaligned in some cases.

I'd prefer if we can get confident enough in the changes overall to remove the need for an option.

Please try out the feature/better-hocr branch and let me know what you think.

@jbreiden

jbreiden commented Mar 5, 2018

I would love to review HOCR + PDF pairs when available.

@cforcey
Contributor Author

cforcey commented Mar 6, 2018

Thanks so much @jbarlow83 for the improvements to HOCR layout -- Tucker reviewed the changes with me and we were both thrilled with the improvements and the results. Tucker will be posting some samples shortly, but I realized in the meantime that I have neglected to post a gist of the solution that my colleagues came up with as a mash-up of other open source solutions. I include it here just because this script gave us pretty good HOCR positioning and might have a few useful fragments:

https://gist.github.com/cforcey/0f219f5be2017a7d8059e030abf971eb#file-gistfile1-txt-L177

I am also attaching an original PDF and a sample processed by this script. The one with _ocr in the filename is the processed output. No pressure to even look at it, as it might be more primitive than what is being done currently on this branch.

143931771.pdf
143931771_ocr.pdf

When you are ready to decide about adding or removing the interword-spacing option, let us know and I will edit up the documents to match the final interface if that is helpful. Thanks so much! Charlie

@acaloiaro

acaloiaro commented Mar 7, 2018

@cforcey @jbarlow83 I just tested your branch feature/better-hocr for a few viewers that I'm targeting: PDF.js, Preview, Adobe Reader. And it renders the text near flawlessly. Great work!

@ctbarbour
Contributor

ctbarbour commented Mar 7, 2018

@jbreiden

Here are sample HOCR output and PDFs rendered with the hocr PDF renderer, taken from existing test resources, using the feature/better-hocr branch. The tar should be globally readable so let me know if you have issues.

Here are the switches I used:

ocrmypdf --interword-spaces --pdf-renderer hocr --clean --deskew --rotate-pages

@jbarlow83
All the HOCR output generated using feature/better-hocr looks great for our use-cases.

@jbreiden

jbreiden commented Mar 8, 2018

Thank you for the samples, I am reviewing now. Right now I'm a little confused why PDFium is highlighting these two words from 00000000.pdf as adjacent when they should have a small gap between them. By the way if this is really successful, maybe it can replace the hocr-pdf tool that I wrote for https://github.com/tmbdev/hocr-tools/blob/master/hocr-pdf

 65.28667 Tz
(OCR\040) Tj
16.5466 0 Td
192.3833 Tz
(is\040) Tj
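For readers less familiar with PDF content streams: in the fragment above, `Tz` sets the horizontal text scaling as a percentage, `Td` moves the text position, `Tj` shows a string, and `\040` is the octal escape for a space. The sketch below decodes just this fragment into a list of operations. It is a toy, line-based reader written for this example (one operator per line, octal escapes only), not a general PDF parser, and `parse_text_ops` is a hypothetical name.

```python
import re

SNIPPET = r"""
65.28667 Tz
(OCR\040) Tj
16.5466 0 Td
192.3833 Tz
(is\040) Tj
"""


def decode_literal(raw: str) -> str:
    """Decode octal escapes such as \\040 inside a PDF literal string."""
    return re.sub(r"\\([0-7]{1,3})", lambda m: chr(int(m.group(1), 8)), raw)


def parse_text_ops(stream: str):
    """Split each line into (operator, operands); assumes one operator per line."""
    ops = []
    for line in stream.strip().splitlines():
        operands, operator = line.strip().rsplit(" ", 1)
        if operator == "Tj":
            # Strip the surrounding parentheses, then decode escapes.
            ops.append((operator, decode_literal(operands[1:-1])))
        else:
            ops.append((operator, [float(x) for x in operands.split()]))
    return ops


for op in parse_text_ops(SNIPPET):
    print(op)
```

Read this way, each word is drawn together with its trailing space under its own horizontal scale, which is relevant to how a viewer might compute the highlight extents for adjacent words.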

@jbarlow83
Collaborator

Added a few further improvements, mainly adjusting the baseline. I'm confident enough to switch this on for everyone unless someone can point out a serious issue.

Two files:
Improved
linn_improved.pdf
Skew case
_linn_skew_improved.pdf

PDF.js looks quite sharp now even in skew:
image

macOS Preview is still not quite right, but nothing to do with the interword space change.

PDFium still does the ransom note effect a little:
image

@cforcey
Contributor Author

cforcey commented Mar 15, 2018

@jbarlow83 This is thrilling progress. Our PDF.js users will be very happy with the tighter highlighting placement you have engineered! Can't thank you enough! Tucker and I look forward to pulling all this goodness down from the official branch and getting it into production. Best wishes and looking forward to future evolutions of this incredibly useful project. Charlie

@jbreiden

This looks good to me. I will see if PDFium team has any comments but not expecting trouble.

@jbarlow83
Collaborator

Released in 5.7.0

@jbarlow83 jbarlow83 closed this Mar 17, 2018
@cforcey
Contributor Author

cforcey commented Mar 19, 2018

Thanks again for all your help and improvements getting this out. We will let you know once we get it running how things go. Best wishes! Charlie

@amitdo

amitdo commented Jul 20, 2018

@jbreiden

What is the future of PDF.js, by the way? Is Mozilla planning to switch to PDFium?

They are not going to switch to PDFium.
https://wiki.mozilla.org/Mortar_Projecte

@vinaynb

vinaynb commented Aug 30, 2018

Sorry for my naivety, but I cannot access the --interword-spaces option on the command line when using the Docker image of ocrmypdf v7.0.3, even though this was released in 5.7.0.
Any help?
cc @jbarlow83

@jbarlow83
Collaborator

--interword-spaces was only used for comparative testing of the new feature. Because it's an improvement with no drawbacks, it is now "always on".

@jbreiden

jbreiden commented Sep 6, 2018

@jbarlow83 Could I please see an ocrmypdf result for this file? pdf.js is still really struggling with horizontally stretching a glyphless invisible font.

align

@jbarlow83
Collaborator

I set DPI to 150.

The HOCR renderer discussed in this PR is not enabled by default because it doesn't work as well in non-Latin languages.

Case 1: Using the Tesseract PDF renderer and not using Ghostscript:

As a baseline, this first case uses Tesseract's PDF renderer so it should be nearly the same as using Tesseract directly.

ocrmypdf --output-type pdf --image-dpi 150 45187336-31724680-b1e5-11e8-90cc-cf437e154046.png _.pdf

Result:
tess_no_gs.pdf

Case 2: Using the HOCR renderer and no Ghostscript:

Now enable HOCR.

ocrmypdf --output-type pdf --pdf-renderer hocr --image-dpi 150 45187336-31724680-b1e5-11e8-90cc-cf437e154046.png _.pdf

Result:
hocr_no_gs.pdf

Neither works particularly well. I think there must be some bug in pdf.js, since highlighting the text with command+A gives different results than click and drag.
