2 columns only sometimes recognized #77

Closed
Jmuccigr opened this issue Jun 29, 2016 · 12 comments

Comments

@Jmuccigr
Contributor

I have a document with two columns of text, which seem to give some trouble. When I convert the PDF to a 300-dpi tif to use directly with tesseract, the columns do get recognized successfully.

Is there a way to improve performance with ocrmypdf?

@jbarlow83
Collaborator

You could try ocrmypdf --pdf-renderer tesseract to see if the issue is
related to how ocrmypdf's internal PDF renderer compares to tesseract.

Also try ocrmypdf --tesseract-pagesegmode 1


@Jmuccigr
Contributor Author

Ah, much better with the tesseract renderer. The PSM didn't make much of a difference that I could see.

tesseract itself still does a better job of finding word divisions, but I wonder if that's due to the resolution of the tiff (300dpi) compared with the pdf (72, I think).

@jbarlow83
Collaborator

You can force ocrmypdf to oversample to 300 DPI with --oversample 300

The details are: PDF is a vector format that can be rendered at any DPI.
Raster objects (images) inside a PDF have an implied DPI based on their
pixel dimensions and the target canvas. ocrmypdf calculates the implied
DPI, and then constrains it to a reasonable value. Many images however do
not have DPI set correctly, and many programs including those that ought to
know better, like Photoshop in some cases, overwrite the DPI with 72 or 96,
causing the implied DPI in the PDF to be incorrect (8.5x11" page at 300 DPI
becomes a 26x34" page at 96 DPI for example). You will probably want to fix
this in your workflow because oversized PDFs cause printing and display
problems.
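
The arithmetic behind that example can be checked with a short sketch. The function names here are illustrative only and are not part of ocrmypdf; the 2550x3300 pixel dimensions are simply what an 8.5x11" page comes to at 300 DPI.

```python
def page_size_inches(width_px, height_px, dpi):
    """Physical page size implied by pixel dimensions and a DPI tag."""
    return width_px / dpi, height_px / dpi


def implied_dpi(pixels, inches):
    """DPI implied by drawing `pixels` pixels across `inches` of canvas."""
    return pixels / inches


# An 8.5x11" page scanned at 300 DPI is 2550x3300 pixels.
width_px, height_px = round(8.5 * 300), round(11 * 300)

# With a correct 300 DPI tag the page keeps its real size.
print(page_size_inches(width_px, height_px, 300))

# If software overwrites the tag with 96, the same pixels imply a far larger page.
print(page_size_inches(width_px, height_px, 96))

# Going the other way: the same pixels on an 8.5" wide canvas imply 300 DPI.
print(implied_dpi(width_px, 8.5))
```

The second call works out to 26.5625 x 34.375 inches, which is the roughly 26x34" page mentioned above.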


@Jmuccigr
Contributor Author

Jmuccigr commented Jun 30, 2016

No change with --oversample 300.

The file is obtained from a Canon scanner which has a setting for scanning at 200 dpi resolution. I'll check later today on what Adobe says about the images in it.

EDIT

ImageMagick's identify reports:

  Number pixels: 482K
  Geometry: 616x793+0+0
  Resolution: 72x72

So that's all consistent, though confusing to me considering what I set the scanner to do, unless I don't understand how it reports.

@jbarlow83
Collaborator

Could you send me a copy of the PDF and TIFF? I'd like to investigate why
oversample did not improve results.

A Dropbox link will do. Thanks.


@Jmuccigr
Contributor Author


Here you go: https://dl.dropboxusercontent.com/u/55583820/Archive.zip

John

@jbarlow83
Collaborator

You should definitely use ocrmypdf --deskew on files that are that skewed. It will make a big difference to OCR quality. There is an open bug in either ghostscript or tesseract that I am currently working on; it affects the quality of OCR produced by tesseract and then processed by ghostscript. ocrmypdf uses both tools in conjunction; using tesseract on its own does not involve the separate ghostscript step. Most likely, this is the issue you encountered.

--oversample 300 works as designed but had little impact on OCR performance either way for this file. You might still see slight differences between a PDF you rasterize to 300 DPI and one created with --oversample 300 if you specify different raster settings or use a program other than ghostscript to rasterize it.

@Jmuccigr
Contributor Author

Jmuccigr commented Jul 1, 2016

Just to beat this dead horse a little more, I played around with various settings. Here are some results:

tl;dr: Easily the best OCR work was done by tesseract by itself on the 300-dpi tiff I exported via Preview.app from the pdf I have. It preserved line endings and read the columns correctly. None of the various options with ocrmypdf generated text that was nearly so accurate. The various combinations all, to one degree or another, missed word breaks, jumped across columns, or oddly cut lines part-way through.

The tesseract text starts like this, with the only error being the insertion of an apostrophe for a tiny footnote 1:

In reconstructing the historical events of Rome,
we are provided with valuable information about
kings, consuls, and wars in the writings of both
Livy and Dionysios of Halikarnassos.‘ Although
large sections are missing, it is still possible to
analyze the overall scope and structure of their
works, at least for the early history of Rome. Each
author has a definite reason for writing, and a set
goal in mind. Thus, for Livy the accounts of the
past center around individuals, men and women,

Here's an image showing the results when I try to select text (again in Preview) by dragging down the first column. All but the last one jump columns (though you can barely see that on the second-to-last):
[Image: all]

And here's what the text looks like when copied out of those various files, all with ocrmypdf and the indicated options.

1. --clean: starts off well then runs words together, right where it jumps across the column break (at "Dedicationdates"). The output varies in observance of line breaks.

In reconstructing the historical events of Rome, we are provided with valuable information about kings, consuls, and wars in the writings of both Livy and Dionysios of Halikarnassos.‘ Although large sections are missing, it is still possible to analyze the overall scope and structure of their
works, at least for the early history of Rome. Each
author has a definite reason for writing, and a set
goal in mind. Thus, for Livy the accounts of the
pastcenteraroundindividuals,menandwomen, Dedicationdatesoftemplesandtherituals whointheirrolesasleadersandheroesdisplayas connectedwitheithertheconsecrationorthe  

2. --clean and deskew: running together more words at the start and jumping the column right away too, so again those two problems seem connected.

InreconstructingthehistoricaleventsofRome, Firstofall,wemustestablishwhenLivyand weareprovidedwithvaluableinformationabout DionysiosrefertoabuildinginRome,andthe
context in which it is mentioned (or not mention- ed, in cases where their accounts differ). Since neither of them is writing a study of Roman to- pographyorarchitecture,itmustbeassumedthat such references are made with a specific purpose in mind, one that makes them relevant to the
narrative as a whole.‘
Dedication dates of temples and the rituals

3. clean, deskew, oversample 300: still a jump at the start, but now it stops.

In reconstructing the historical events of Rome, First of all, we must establish wben Livy and weareprovidedwithvaluableinformationabout DionysiosrefertoabuildinginRome,andthe
kings, consuls, and wars in the writings of both
Livy and Dionysios of Halikarnassos.‘ Although large sections are missing, it is still possible to
analyze the overall scope and structure of their works, at least for the early history of Rome. Each author has a definite reason for writing, and a set goal in mind. Thus, for Livy the accounts of the past center around individuals, men and women, who in their roles as leaders and heroes display as exempla the true Roman ideals of virtue, faith, and courage. The stability of the past is idealized, and placed in contrast with the uncertainty of the

4. clean, deskew, oversample 300 and tesseract rendering: a little better at the start, but still runs some words together.

InreconstructingthehistoricaleventsofRome, Firstofall,wemustestablishwhenLivyand
we are provided with valuable information about kings, consuls, and wars in the writings of both Livy and Dionysios of Halikarnassos.1 Although large sections are missing, it is still possible to analyze the overall scope and structure of their works,atleastfortheearlyhistoryofRome.Each author has a de nite reason for writing, and a set goal in mind. Thus, for Livy the accounts of the
past center around individuals, men and women, who in their roles as leaders and heroes display as example the true Roman ideals of virtue, faith, and courage. The stability of the past is idealized, and placed in contrast with the uncertainty of the 

5. Same as #4, but without deskew: this one started out best, but then started leaving out words at the end of some lines. You can barely see it in the image about ⅔ of the way down, where a line-ending "Polybius" isn't highlighted, but it happens a few more times at the end of the column, even breaking off mid-word.

InreconstructingthehistoricaleventsofRome, we are provided with valuable information about kings, consuls, and wars in the writings of both Livy and Dionysios of Halikarnassos.1 Although large sections are missing, it is still possible to analyze the overall scope and structure of their works,atleastfortheearlyhistoryofRome.Each author has a de nite reason for writing, and a set goal in mind. Thus, for Livy the accounts of the past center around individuals, men and women, who in their roles as leaders and heroes display as
example the true Roman ideals of virtue, faith, and courage. The stability of the past is idealized, and placed in contrast with the uncertainty of the present.2DionysionsofHalikarnassos,ontheother hand, sees it as his mission to explain to his coun- trymen, and to some extent also to the Romans,  how Rome in fact is part of the Greek world, and how her miraculous achievements were ac-
complished.WhereasLivyapologizesforspending so much time on the early history (Praef.), Diony-
siosisconsciously llingagapinthehistoriography of Rome by concentrating on the period down to
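
A rough way to put a number on the "words run together" problem in the samples above is to count suspiciously long tokens. This is only a sketch: the function name and the 15-character threshold are assumptions chosen for illustration, and it does nothing to detect column jumps.

```python
def run_together_ratio(text, max_word_len=15):
    """Fraction of whitespace-separated tokens longer than max_word_len.

    Surrounding punctuation is stripped before measuring, so a trailing
    comma or period does not count toward the length.
    """
    tokens = text.split()
    if not tokens:
        return 0.0
    long_tokens = [t for t in tokens if len(t.strip(".,;:()")) > max_word_len]
    return len(long_tokens) / len(tokens)


# First line as read by tesseract on the 300-dpi tiff.
good = "In reconstructing the historical events of Rome,"

# First line from sample 2 (--clean and deskew).
bad = "InreconstructingthehistoricaleventsofRome, Firstofall,wemustestablishwhenLivyand"

print(run_together_ratio(good))
print(run_together_ratio(bad))
```

A higher ratio means more glued words, so the same text copied out of different viewers or produced with different options can be compared on one scale.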

@jbarlow83
Collaborator

Thanks for your detailed investigation. The problem appears to be in Ghostscript, which ocrmypdf uses to create a PDF/A. I am waiting on them to investigate.
http://bugs.ghostscript.com/show_bug.cgi?id=696874

I don't have an option to bypass Ghostscript at the moment, because no other open source tools create PDF/As that I know of (without using Ghostscript under the hood). I am considering a non-PDF/A option for users who prefer the smallest file sizes and are not concerned about archiving (see #48).

That said, Preview.app does a much worse job at extracting text from a PDF than other viewers, and specifically has the problem of gluing words together where other viewers do not. Have a look in Chrome PDF Viewer or Adobe Reader and you'll probably see different and better results. This infuriating issue actually comes back to discrepancies in the PDF spec and the fact that OCR text is basically a hack to the spec done with invisible fonts.

jbarlow83 reopened this Jul 1, 2016

@Jmuccigr
Contributor Author

Jmuccigr commented Jul 1, 2016

Yes, I was wondering about the text extraction. texttopdf does an awful job, often sticking spaces in between every character, it seems. Skim looks like Preview (and maybe is using the same system service?). Acrobat gets it right, even if the way selection works is very un-Mac-like (or un-Windows-like for that matter). Chrome PDF Viewer looked good on the page, but the copied text was all run together.

My use case involves using a viewer and highlighting text to be extracted by copying or scripting, so I'd like to have the text match up as much as possible.

Interesting to me is that the Canon software, which doesn't do a great job of OCR, is able to put in a text layer that Preview reads without problems (apart from the crappy OCR).

@jbarlow83
Collaborator

My use case involves using a viewer and highlighting text to be extracted by copying or scripting, so I'd like to have the text match up as much as possible.

All PDF viewers use heuristics to decide what text is contiguous because PDFs don't include this information on their own. A PDF is conceptually closer to a vector drawing than a word processor document. It supports precise glyph positioning, but offers little for organizing text into words, lines, or paragraphs. This is why editing PDFs is hard. It seems that Canon and Preview.app are just a lucky combination that plays well together (sometimes).
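
To illustrate the kind of heuristic involved, here is a minimal sketch of grouping positioned glyphs into words by the size of the horizontal gap between them. It is not the algorithm of Preview.app or any other viewer; the `Glyph` class, the `layout` helper, and the `gap_ratio` threshold are all made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge, in points
    width: float  # advance width, in points


def layout(text, glyph_width=6.0, space_width=3.0):
    """Place glyphs left to right; spaces only advance x, as in a PDF."""
    glyphs, x = [], 0.0
    for ch in text:
        if ch == " ":
            x += space_width
        else:
            glyphs.append(Glyph(ch, x, glyph_width))
            x += glyph_width
    return glyphs


def group_into_words(glyphs, gap_ratio=0.3):
    """Split a line of glyphs wherever the gap exceeds gap_ratio * mean width."""
    if not glyphs:
        return []
    glyphs = sorted(glyphs, key=lambda g: g.x)
    threshold = gap_ratio * sum(g.width for g in glyphs) / len(glyphs)
    words = [glyphs[0].char]
    prev = glyphs[0]
    for g in glyphs[1:]:
        gap = g.x - (prev.x + prev.width)
        if gap > threshold:
            words.append(g.char)   # gap is wide enough: start a new word
        else:
            words[-1] += g.char    # gap is small: same word
        prev = g
    return words


# A normal word space is detected; a slightly narrow one glues the words.
print(group_into_words(layout("of Rome", space_width=3.0)))
print(group_into_words(layout("of Rome", space_width=1.0)))
```

With the wider space the line splits into two words, and with the narrower one it comes out as a single glued token. Two viewers with different thresholds can therefore disagree on the same file.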

Perhaps something like DjVu is a more appropriate format if you want text extraction to be perfect.

@jbarlow83
Collaborator

v4.2 adds an option to bypass Ghostscript and produce a plain PDF instead of PDF/A. If you use --pdf-renderer tesseract --output-type pdf you should be able to replicate the results that you'd get from manually converting the image to TIFF and using Tesseract for TIFF to PDF.
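
For scripted use, the invocation can be assembled as in the sketch below. It uses only the flags named in this thread as of v4.2, which may differ in other versions; the helper name and the file names scan.pdf and scan_ocr.pdf are placeholders.

```python
import shlex


def build_ocrmypdf_command(input_pdf, output_pdf, deskew=False, clean=False,
                           oversample=None):
    """Build an ocrmypdf argument list that bypasses Ghostscript PDF/A output."""
    cmd = ["ocrmypdf", "--pdf-renderer", "tesseract", "--output-type", "pdf"]
    if deskew:
        cmd.append("--deskew")
    if clean:
        cmd.append("--clean")
    if oversample is not None:
        cmd += ["--oversample", str(oversample)]
    cmd += [input_pdf, output_pdf]
    return cmd


# Hypothetical file names; nothing is executed here, the command is only printed.
cmd = build_ocrmypdf_command("scan.pdf", "scan_ocr.pdf", deskew=True, oversample=300)
print(shlex.join(cmd))
```

Passing the returned list to subprocess.run(cmd, check=True) would run it, assuming ocrmypdf is installed and on the PATH.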
