Add interword space option to HOCR pdf renderer #225

cforcey · 2018-03-01T18:31:44Z

This pull request adds a new advanced option --interword-spaces to OCRmyPDF to allow the hocr renderer to produce PDF output compatible with PDF.js and potentially other viewers that have difficulty detecting phrases, lines, and paragraphs in separately placed text layers. This new switch is a workaround for limitations of the PDF.js viewer described in #133.

Background

OCRmyPDF justifiably prioritizes the accurate placement of words on the text layer as individual glyphs. Most PDF viewers have heuristics that allow them to identify paragraphs, lines, and phrases while searching and to insert the correct inter-word spacing when copying and pasting. PDF.js has over 80 issues flagged with 4-text-selection and there have been a number of pull requests to address the issue that have apparently gotten bogged down with edge cases, performance concerns, and perhaps the inherent challenges of a pure JavaScript and HTML approach to PDF rendering.

Strategy

The goal of this pull request is to add an unobtrusive option to OCRmyPDF to allow it to produce PDF.js compatible output for those that must support PDF.js as a business requirement. Specifically, this PR follows the code conventions by adding an advanced option --interword-spaces to the options parser and ensures this option is available to the hocrtransform.py renderer. When set to true, the HOCR renderer will add an additional space at the end of each text element before drawing it on the text layer. This option does not apply to other pdf renderers in OCRmyPDF, is turned off by default, and issues a warning if used without the --pdf-renderer hocr option also set.
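To make the option handling described above concrete, here is a minimal sketch using the standard library's argparse. It is not the code in this PR: `build_parser`, `check_options`, and `render_word` are hypothetical names, and the `auto` default for `--pdf-renderer` is an assumption made only for illustration.

```python
import argparse
import warnings


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser with only the two options discussed here."""
    parser = argparse.ArgumentParser(prog="ocrmypdf-sketch")
    # Assumption: "auto" as the default renderer name is illustrative only.
    parser.add_argument("--pdf-renderer", default="auto")
    # Off by default, as described in the Strategy section.
    parser.add_argument("--interword-spaces", action="store_true")
    return parser


def check_options(options: argparse.Namespace) -> None:
    """Warn when the option is set without the hocr renderer."""
    if options.interword_spaces and options.pdf_renderer != "hocr":
        warnings.warn(
            "--interword-spaces has no effect unless --pdf-renderer hocr is set"
        )


def render_word(text: str, interword_spaces: bool) -> str:
    """Return the string to draw: the word, plus a trailing space if enabled."""
    return text + " " if interword_spaces else text


if __name__ == "__main__":
    opts = build_parser().parse_args(
        ["--interword-spaces", "--pdf-renderer", "hocr"]
    )
    check_options(opts)
    print(repr(render_word("OCR", opts.interword_spaces)))
```

The warning, rather than an error, keeps the option unobtrusive for anyone who passes it with another renderer by mistake.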

Documentation

This PR added a new section to the advanced documentation for the new option, a note on the 'hocr' renderer description about the option in the same file, and a note that this is available in the introduction where there is a relevant discussion of PDF as a layout format dependent on the viewer to interpret the structure of the document in terms of words, sentences, and paragraphs.

Testing

We confirmed that existing tests that exercised this code continue to pass. We encountered some seemingly preexisting failures in other tests. We explored the option of adding additional tests to confirm the warning is provided and the output is as expected, and would welcome guidance as to where that test should be placed and how best to combine it with RENDERERS tests in the test_main.py or the more specific test_hocrtransform.py.

Sample PDF Output

The following file was processed with this option set to true. When loaded into the latest PDF.js viewer, multi-word search and copy and paste are improved over the standard HOCR output:

Input PDF: https://github.com/logikcull/OCRmyPDF/blob/master/tests/resources/linn.pdf

# original command 
ocrmypdf --output-type pdf --pdf-renderer hocr ./tests/resources/linn.pdf output.linn.hocr.pdf

Output PDF: https://www.dropbox.com/s/2ugzxldqsvy8q6x/output.linn.hocr.pdf

Behavior when loaded into latest PDF.js viewer -- note that you have to remove spaces to find multiple words. Selecting and pasting the text also has spaces removed:

# command with new --interword-spaces option
ocrmypdf --output-type pdf --interword-spaces --pdf-renderer hocr ./tests/resources/linn.pdf output.linn.hocr.interword.pdf

Output PDF: https://www.dropbox.com/s/tukp6ftpjebe1gh/output.linn.hocr.interword.pdf

Behavior when loaded into latest PDF.js viewer -- note that you can find multiple words separated by spaces. Copy and paste is also improved:

Testing in Adobe Reader and Chrome's native PDF viewer showed that files rendered with the new option continued to perform as well or better when searching and copying and pasting. Apple Preview handled neither output file particularly well so we think there has at least been no harm done.

Alternative Approaches

If this approach of adding a new option with a warning if used without hocr is too disruptive, we could also consider contributing a new pipeline task for a fifth renderer titled 'hocr-sloppy-text' or something similar that runs a nearly identical version of hocrtransform.py with the space suffix turned on by default. This approach has the serious downside of repeating complex code, but the upside of leaving the existing hocr renderer behavior 100% unchanged and opening the way in the future for other "sloppy-text" fixes required to produce PDFs for simpler viewers like PDF.js.

Related Issues

OCRMyPDF:

133: Some hints that Tesseract upgrades might provide some relief, but underlying conclusion was that PDF.js has a naive implementation of text selection and word boundaries (Feature request, recognize words on a line as a continuous phrase. #133).

Tesseract:

1235 December 2017: PDF output is missing spaces in some cases, while TXT output contains them tesseract-ocr/tesseract#1235 includes good explanation of reason for space detection issues: "Known problem. Root cause is PDF spec which forces heuristics into text extraction, and Preview is well known to have some of the wonkiest heuristics."
699 PDF output: odd spaces on OSX preview tesseract-ocr/tesseract#699 (comment)
382 Text is garbled in pdf.js (Cygwin / UB Mannheim binaries) tesseract-ocr/tesseract#382
337 Spaces between characters when generating a pdf tesseract-ocr/tesseract#337

PDF.js:

7310: Super helpful discussion of HTML divs: missing all spaces on page 1 (text selection) mozilla/pdf.js#7310
6657: Spaces missing when copying text from PDF mozilla/pdf.js#6657
Related PR not merged: Improve Copy/Paste mozilla/pdf.js#5783
Dozens of text selection issues: https://github.com/mozilla/pdf.js/issues?q=is%3Aopen+is%3Aissue+label%3A4-text-selection

This commit includes an optional workaround for limitations of the PDF.js viewer described in #133. Here we explicitly add an additional space to text elements before drawing them on the PDF canvas when using the HOCR renderer. This option does not apply to other pdf renderers in OCRmyPDF and is turned off by default.

jbarlow83 · 2018-03-01T20:10:23Z

Thanks for documenting this so thoroughly. I am curious - does the default sandwich renderer not work for you? Is there a reason why this should not be default behavior - for example if it doesn't work on some viewers?


cforcey · 2018-03-01T20:50:54Z

Thanks so much @jbarlow83 for your quick response. Two great questions:

In terms of the sandwich renderer, we took a sample tif from the UNLV ground truth tesseract testing project (open source) and ran it through with each of the render options including sandwich. In each case, we observed in PDF.js the same mashing of words together without whitespace.

pdf_render_modes_tested.zip

As to whether this should be the default for the hocr renderer, that is a great question. We could not in light testing of our originals observe any degradation on the more advanced viewers (Acrobat, Chrome, etc) and did confirm improvements to the white space issues that were blocking our searchable PDF deployment to our PDF.js application. In my previous job, I ran very similar code (written by another team member) that added the space and produced reliably searchable PDFs in PDF.js across several million files (https://powersuite.aee.net). I will send along a gist of that code by email. Since this will only fire on relatively legacy setups -- hocr -- it might be acceptable to make this the default behavior.

Thanks so much for this remarkable script -- we tried a lot of the others and did not want to lose all the other great aspects of this pipeline just because of the PDF.js challenges. Best wishes,

Charlie

jbarlow83 · 2018-03-02T01:04:58Z

@jbreiden Thought you'd find this interesting. This PR proposes a change to the hocr to PDF renderer, and the finding is that adding space characters between words assists the usual suspect mediocre PDF readers with finding word boundaries, without any regressions for smarter readers.

Charlie also reported that Tesseract's PDF renderer had the same problem. So of course I'm thinking, why can't we apply a similar solution to Tesseract's?

Has this been tried with Tesseract's PDF renderer? Wouldn't be the first time doing something not quite to spec in a PDF produced better results in the field....

jbreiden · 2018-03-02T01:45:09Z

Thanks for mentioning. I remember discussing with an accessibility expert from Adobe, but can't remember if I experimented with it. The sample output.linn.hocr.pdf is a little weird in Chrome/PDFium, for example double clicking a word gives a smaller bounding box because the space is not included. I also have questions in my mind about some of the diverse languages and layouts that are fed to Tesseract. What is the future of PDF.js, by the way? Is Mozilla planning to switch to PDFium?

PS. Tesseract & PDF.js have even bigger problems. mozilla/pdf.js#6509 mozilla/pdf.js#6863

jbarlow83 · 2018-03-02T07:46:57Z

output.linn.hocr.pdf is the baseline and output.linn.hocr.interword.pdf is the proposed change.

In PDF.js, for baseline, double-clicking highlights several words. For interword, double-clicking highlights a word consistently. So interword is an improvement here, although the bounding boxes seem to be a regression on Acrobat and PDFium.

As for PDF.js itself, there are a lot of web apps that use it to present a custom PDF viewer – I get the impression that is @cforcey's case. So it should be with us for a long time even if Mozilla adopts PDFium.

The hocr to PDF renderer isn't capable of handling full Unicode, so I don't know how other layouts would be affected.

jbarlow83 · 2018-03-02T08:12:21Z

@cforcey

Regarding the pre-existing test failures you spotted, feel free to report any new issues you identify. On Travis CI everything passes, so it should be possible to get everything to pass. (Might want to check if it's related to the Debian issue.)
I'm prepared to accept this pretty much as is, presented as a PDF.js compatibility option.
But if you're able to fix the regression (or any others that show up in the meantime) we can remove the option entirely and just make it the default/only hocr behaviour. I'd prefer to do it this way.

Let me know what you think.

The regression spotted so far, which appears in both Acrobat and PDFium (Chrome PDF), is that double-clicking a word doesn't highlight properly. If several words are selected, the bounding box on the last character will be misaligned too. Highlighting the last letter in a word looks a little wonky.

Baseline

Interwords

hocrtransform.py has a bounding boxes debug feature that might be helpful. No way to enable this other than hard coding.

Of course the bounding boxes in hocr mode are not perfect vertically.

jbarlow83 · 2018-03-02T08:17:03Z

I suspect the spot you'll need to adjust is the text.setHorizScale() calculation since that doesn't account for the additional space added on to the end of the word.
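To illustrate the mismatch, here is a small sketch of a horizontal-scale calculation. It is not the code in hocrtransform.py: `string_width` is a hypothetical stand-in for a real font metric lookup (it pretends every glyph is half an em wide), and `horiz_scale` is an illustrative name. The result is a percentage of the kind passed to text.setHorizScale().

```python
def string_width(text: str, font_size: float) -> float:
    """Hypothetical font metric: every glyph is 0.5 em wide."""
    return 0.5 * font_size * len(text)


def horiz_scale(text: str, box_width: float, font_size: float) -> float:
    """Percentage by which to stretch `text` so it spans `box_width` points."""
    natural = string_width(text, font_size)
    if natural <= 0:
        return 100.0  # nothing to scale
    return 100.0 * box_width / natural


word, box_width, font_size = "OCR", 30.0, 10.0

# Baseline: the bare word is stretched to fill its box exactly.
baseline = horiz_scale(word, box_width, font_size)

# Space appended, but the scale still computed from the bare word:
# the drawn string is now wider than the box it was meant to fill.
drawn_width = string_width(word + " ", font_size) * baseline / 100.0

# Scale computed from the string that is actually drawn.
adjusted = horiz_scale(word + " ", box_width, font_size)

print(baseline, drawn_width, adjusted)
```

Whether the target width should stay fixed (shrinking the glyphs) or grow to cover the gap up to the next word is a separate design choice that this sketch does not settle.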

ctbarbour · 2018-03-02T09:54:54Z

@jbarlow83 I suspect the pre-existing test failures were related to #217. I was running the test suite on my MacBook with Tesseract 4 and getting test failures due to Tesseract timeouts.

Here we are manually scaling the pt width used for the BoundingBox and the Text element when manually adding whitespace to account for limitations of the PDF.js viewer. This fixes an initial regression noticed when selecting text elements in Chrome and PDFium. The width of the Text element and BoundingBox had not been adjusted for the additional whitespace so the highlighting was offset slightly.

ctbarbour · 2018-03-02T11:38:12Z

The latest commit is an attempt to address the regression found for text selection in Acrobat and PDFium. Here are three samples:

Baseline

Interwords

Interwords with Scaling

There are still some examples where text highlighting is offset from the image text in the linn.pdf but these do not appear to be regressions caused by interword-spaces but rather HOCR inaccuracies that already exist in the control.

Baseline

Interwords with Scaling

I'm also not entirely sure if scaling the pt Rect tuple is appropriate here or if we should specifically target the horizontal scaling of the Text element. The rationale for scaling the pt Rect tuple was to provide scaling of the Text element as well as the BoundingBox element. I'm open to suggestions if we are not happy with this implementation.

Homebrew removed python3 and python now defaults to version 3. Here we use `brew upgrade python` to upgrade the pre-installed version of python to python3.

ctbarbour · 2018-03-02T14:39:53Z

Not sure if the Homebrew python fix belongs on this branch so I submitted #227. Maybe it makes more sense to merge the python fix to master and then rebase master here.

jbreiden · 2018-03-02T16:31:52Z

A couple quick thoughts from the peanut gallery. Please be super careful to never introduce overlap in word bounding boxes. This causes all sorts of chaos. It's also probably worth checking Preview on MacOS X, which has notoriously persnickety word boundary heuristics. Finally, if you get a chance, try testing on a language like Chinese and see what happens.
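As a rough illustration of the overlap concern, the sketch below widens each word box on a single line to make room for a trailing space, but never past the left edge of the next word. The function names and the `(x0, y0, x1, y1)` box layout are assumptions for this example, not anything taken from hocrtransform.py.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), all on one text line


def widen_without_overlap(boxes: List[Box], extra: float) -> List[Box]:
    """Extend each box rightward by up to `extra`, stopping at the next box."""
    ordered = sorted(boxes, key=lambda b: b[0])
    result = []
    for i, (x0, y0, x1, y1) in enumerate(ordered):
        new_x1 = x1 + extra
        if i + 1 < len(ordered):
            next_x0 = ordered[i + 1][0]
            # Clamp to the neighbour's left edge, but never shrink the box.
            new_x1 = min(new_x1, max(x1, next_x0))
        result.append((x0, y0, new_x1, y1))
    return result


def has_overlap(boxes: List[Box]) -> bool:
    """True if any box extends past the left edge of its right-hand neighbour."""
    ordered = sorted(boxes, key=lambda b: b[0])
    return any(a[2] > b[0] for a, b in zip(ordered, ordered[1:]))


line = [(0, 0, 10, 5), (12, 0, 20, 5)]
widened = widen_without_overlap(line, extra=5)
print(widened, has_overlap(widened))
```

A check like `has_overlap` could also serve as an assertion in a test over sample hOCR pages, which would catch the chaos described above before a PDF is written.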

ctbarbour · 2018-03-05T14:42:29Z

@jbreiden

Thank you for the feedback. I have not noticed any material difference when using the proposed --interword-spaces on macOS Preview and Safari compared to the HOCR renderer without --interword-spaces enabled. I'm using Preview Version 10.0 (944.4) and Safari Version 11.0.3 (13604.5.6) on macOS 10.13.3.

Control HOCR renderer

ocrmypdf --pdf-renderer hocr ./tests/resources/linn.pdf linn.hocr.pdf

Interword spaces with HOCR render

ocrmypdf --interword-spaces --pdf-renderer hocr ./tests/resources/linn.pdf linn.hocr.interword-spaces.pdf

I also tested simplified Chinese as you suggested. The HOCR renderer is not great in terms of 'copy-and-paste' and search highlighting on PDFium and Adobe but there were no observable regressions from the HOCR renderer with --interword-spaces enabled. The sandwich renderer is far better and works as one would expect with both PDFium and Adobe.

@jbarlow83 @jbreiden
After implementing your suggested changes, I think we have restored parity with the existing HOCR implementation in terms of bounding box accuracy, and, of course, not solved any of the issues with HOCR that are solved by more modern approaches to OCR text placement for non-ascii output. The goal of the PR was just to extend the compatibility of legacy HOCR workflows specifically for naive PDF viewers like PDF.js. As those viewers improve, both HOCR and this interword-spacing shim will hopefully no longer be needed.

The only loose end we have is whether this option should be the default HOCR behaviour or, as coded and documented now, an optional mode of HOCR rendering. Keeping it optional follows the principle of least surprise for existing users upgrading for other reasons, so that makes the most sense to us, but of course either way works well for our use case.

jbarlow83 · 2018-03-05T16:29:33Z

I took what you started and made some further improvements. I believe I've addressed the "ransom note" effect by using the same font size for everything Tesseract considers part of the same line, improved the quality of the PDF content stream, and took a stab at dealing with text on a skewed baseline. It's still a little vertically misaligned in some cases.

I'd prefer if we can get confident enough in the changes overall to remove the need for an option.

Please try out the feature/better-hocr branch and let me know what you think.

jbreiden · 2018-03-05T20:04:09Z

I would love to review HOCR + PDF pairs when available.

cforcey · 2018-03-06T14:54:34Z

Thanks so much @jbarlow83 for the improvements to HOCR layout -- Tucker reviewed the changes with me and we were both thrilled with the improvements and the results. Tucker will be posting some samples shortly, but I realized in the meantime that I have neglected to post a gist of the solution that my colleagues came up with as a mash up of other open source solutions. I include it here just because this script gave us pretty good HOCR positioning and might have a few useful fragments:

https://gist.github.com/cforcey/0f219f5be2017a7d8059e030abf971eb#file-gistfile1-txt-L177

I am also attaching an original PDF and a sample processed by this script. The one with _ocr in the filename is the processed output. No pressure to even look at it, as it might be more primitive than what is being done currently on this branch.

143931771.pdf
143931771_ocr.pdf

When you are ready to decide about adding or removing the interword-spacing option, let us know and I will edit up the documents to match the final interface if that is helpful. Thanks so much! Charlie

acaloiaro · 2018-03-07T01:00:49Z

@cforcey @jbarlow83 I just tested your branch feature/better-hocr for a few viewers that I'm targeting: PDF.js, Preview, Adobe Reader. And it renders the text near flawlessly. Great work!

ctbarbour · 2018-03-07T17:24:05Z

@jbreiden

Here are sample HOCR output and PDFs rendered with the hocr pdf-renderer, taken from existing test resources, using the feature/better-hocr branch. The tar should be globally readable so let me know if you have issues.

Here are the switches I used:

ocrmypdf --interword-spaces --pdf-renderer hocr --clean --deskew --rotate-pages

@jbarlow83
All the HOCR output generated using feature/better-hocr looks great for our use-cases.

jbreiden · 2018-03-08T23:56:19Z

Thank you for the samples, I am reviewing now. Right now I'm a little confused why PDFium is highlighting these two words from 00000000.pdf as adjacent when they should have a small gap between them. By the way if this is really successful, maybe it can replace the hocr-pdf tool that I wrote for https://github.com/tmbdev/hocr-tools/blob/master/hocr-pdf

 65.28667 Tz
(OCR\040) Tj
16.5466 0 Td
192.3833 Tz
(is\040) Tj
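For readers less familiar with content streams: `Tz` sets the horizontal scaling as a percentage, `Td` moves the text position, `Tj` shows a string, and `\040` is the octal escape for a space character. The sketch below emits a fragment of the same shape from per-word values. It is only an illustration; `pdf_escape` and `word_ops` are hypothetical names, not the code that produced the sample above.

```python
from typing import List, Optional


def pdf_escape(text: str) -> str:
    """Escape text for a PDF literal string; spaces are written as \\040."""
    out = []
    for ch in text:
        if ch in "\\()":
            out.append("\\" + ch)  # backslash and parentheses need escaping
        elif ch == " ":
            out.append("\\040")  # mirror the sample, which spells spaces in octal
        else:
            out.append(ch)
    return "".join(out)


def word_ops(word: str, scale: float, advance: Optional[float] = None) -> List[str]:
    """Operators that draw one word followed by a trailing space."""
    ops = []
    if advance is not None:
        ops.append(f"{advance} 0 Td")  # move right before drawing this word
    ops.append(f"{scale} Tz")  # horizontal scale for this word, in percent
    ops.append(f"({pdf_escape(word + ' ')}) Tj")  # show the word plus a space
    return ops


fragment = word_ops("OCR", 65.28667) + word_ops("is", 192.3833, advance=16.5466)
print("\n".join(fragment))
```

Because the trailing space is part of each scaled string, the space is stretched along with the word.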

jbarlow83 · 2018-03-15T01:03:11Z

Added a few further improvements, mainly adjusting the baseline. I'm confident enough to switch this on for everyone unless someone can point out a serious issue.

Two files:
Improved
linn_improved.pdf
Skew case
_linn_skew_improved.pdf

PDF.js looks quite sharp now even in skew:

macOS Preview is still not quite right, but nothing to do with the interword space change.

PDFium still does the ransom note effect a little:

cforcey · 2018-03-15T13:57:34Z

@jbarlow83 This is thrilling progress. Our PDF.js users will be very happy with the tighter highlighting placement you have engineered! Can't thank you enough! Tucker and I look forward to pulling all this goodness down from the official branch and getting it into production. Best wishes and looking forward to future evolutions of this incredibly useful project. Charlie

jbreiden · 2018-03-15T14:13:59Z

This looks good to me. I will see if PDFium team has any comments but not expecting trouble.

jbarlow83 · 2018-03-17T00:24:33Z

Released in 5.7.0

cforcey · 2018-03-19T12:57:57Z

Thanks again for all your help and improvements getting this out. We will let you know once we get it running how things go. Best wishes! Charlie

amitdo · 2018-07-20T14:56:44Z

@jbreiden

"What is the future of PDF.js, by the way? Is Mozilla planning to switch to PDFium?"

They are not going to switch to PDFium.
https://wiki.mozilla.org/Mortar_Projecte

vinaynb · 2018-08-30T03:51:51Z

Sorry for my naivety, but I cannot access the --interword-spaces option on the command line when using the docker image of ocrmypdf with v 7.0.3, even though this was released in 5.7.0.
Any help?
cc @jbarlow83

jbarlow83 · 2018-08-30T06:19:04Z

--interword-spaces was only used for comparative testing of the new feature. Because it's an improvement with no drawbacks, it is now "always on".

jbreiden · 2018-09-06T22:00:10Z

@jbarlow83 Could I please see an ocrmypdf result for this file? pdf.js is still really struggling with horizontally stretching a glyphless invisible font.

jbarlow83 · 2018-09-07T06:16:30Z

I set DPI to 150.

The HOCR renderer discussed in this PR is not enabled by default because it doesn't work as well in non-Latin languages.

Case 1: Using the Tesseract PDF renderer and not using Ghostscript:

As a baseline, this first case uses Tesseract's PDF renderer so it should be nearly the same as using Tesseract directly.

ocrmypdf --output-type pdf --image-dpi 150 45187336-31724680-b1e5-11e8-90cc-cf437e154046.png _.pdf

Result:
tess_no_gs.pdf

Case 2: Using the HOCR renderer and no Ghostscript:

Now enable HOCR.

ocrmypdf --output-type pdf --pdf-renderer hocr --image-dpi 150 45187336-31724680-b1e5-11e8-90cc-cf437e154046.png _.pdf

Result:
hocr_no_gs.pdf

Neither works particularly well. I think there must be some bug in pdf.js since highlighting the text with command+A gives different results than click and drag.

ctbarbour and others added 2 commits March 1, 2018 13:15

Add a note to the documentation about interword-spaces

422e619

Fix Homebrew python package

f6c7031

Merge branch 'master' into feature/interword-spaces

d82e7c3

jbarlow83 closed this Mar 17, 2018

jbarlow83 mentioned this pull request Aug 28, 2018

Merged words in OCRed PDF #286

Closed

jbreiden mentioned this pull request Sep 6, 2018

intraword spacing for slightly better pdf copy-paste performance tesseract-ocr/tesseract#1900

Closed

yregaieg mentioned this pull request Apr 21, 2019

Issues when trying to OCR files with Arabic script #379

Closed
