2 columns only sometimes recognized #77
Comments
You could try ocrmypdf --pdf-renderer tesseract to see if the issue is [...]. Also try ocrmypdf --tesseract-pagesegmode 1. (In reply to John Muccigrosso, Wed, 29 Jun 2016 at 14:19.)
|
Ah, much better with the tesseract renderer. The PSM didn't make much of a difference that I could see. tesseract itself still does a better job of finding word divisions, but I wonder if that's due to the resolution of the tiff (300dpi) compared with the pdf (72, I think). |
You can force ocrmypdf to oversample to 300 DPI with --oversample 300. The details are: PDF is a vector format that can be rendered at any DPI. (In reply to John Muccigrosso, Wed, Jun 29, 2016 at 16:02.)
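The DPI point comes down to simple arithmetic, sketched below in Python. PDF user space defaults to 72 points per inch, so a page has no pixel size until a DPI is chosen, and an embedded scan's effective resolution depends on how large it is placed on the page. The function names and the US Letter numbers are illustrative assumptions, not taken from the file discussed here.

```python
POINTS_PER_INCH = 72  # default PDF user-space unit


def raster_size(width_pt, height_pt, dpi):
    """Pixel dimensions of a page rendered at the given DPI."""
    return (
        round(width_pt / POINTS_PER_INCH * dpi),
        round(height_pt / POINTS_PER_INCH * dpi),
    )


def effective_dpi(image_width_px, placed_width_pt):
    """Resolution of an embedded image given the width it occupies on the page."""
    return image_width_px / (placed_width_pt / POINTS_PER_INCH)


# Illustrative values: a US Letter page is 612 x 792 points.
print(raster_size(612, 792, 300))  # (2550, 3300)
print(effective_dpi(1700, 612))    # 200.0
```

A 1700-pixel-wide scan stretched across a Letter page is therefore a 200 DPI image, whatever DPI a viewer or converter later uses to rasterize the page.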
|
No change with [...]. The file is obtained from a Canon scanner, which has a setting for scanning at 200 dpi resolution. I'll check later today on what Adobe says about the images in it. EDIT: ImageMagick's identify reports:
So that's all consistent, though confusing to me considering what I set the scanner to do, unless I don't understand how it reports. |
Could you send me a copy of the PDF and TIFF? I'd like to investigate why [...]. A Dropbox link will do. Thanks. (In reply to John Muccigrosso, Thu, 30 Jun 2016 at 08:08.)
|
Here you go: https://dl.dropboxusercontent.com/u/55583820/Archive.zip John |
You should definitely
|
Just to beat this dead horse a little more, I played around with various settings. Here are some results: tl;dr: Easily the best OCR work was done by tesseract by itself on the 300-dpi tiff I exported via Preview.app from the pdf I have. It preserved line endings and read the columns correctly. None of the various options with ocrmypdf generated text that was nearly so accurate. The various combinations all, to one degree or another, missed word breaks, jumped across columns, or oddly cut lines part-way through. The tesseract text starts like this, with the only error being the insertion of an apostrophe for a tiny footnote 1:
Here's an image showing the results when I try to select text (again in Preview) by dragging down the first column. All but the last one jump columns (though you can barely see that on the second-to-last):

And here's what the text looks like when copied out of those various files, all with ocrmypdf and the indicated options.

1. --clean: starts off well, then runs words together right where it jumps across the column break (at "Dedicationdates"). The output varies in observance of line breaks.
2. --clean and deskew: runs together more words at the start and jumps the column right away too, so again those two problems seem connected.
3. clean, deskew, oversample 300: still a jump at the start, but now it stops.
4. clean, deskew, oversample 300 and tesseract rendering: a little better at the start, but still runs some words together.
5. Same as #4, but without deskew: this one started out best, but then started leaving out words at the end of some lines. You can barely see it in the image about ⅔ of the way down, where a line-ending "Polybius" isn't highlighted, but it happens a few more times at the end of the column, even breaking off mid-word.
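For anyone who wants to repeat the comparison, the five combinations above can be scripted. This is a rough sketch that only assembles argument lists from flags named in this thread (--clean, --deskew, --oversample, --pdf-renderer, --tesseract-pagesegmode). The helper names `build_command` and `run_all` and the output file names are hypothetical, and the flags may differ between ocrmypdf versions, so check `ocrmypdf --help` first.

```python
import subprocess


def build_command(input_pdf, output_pdf, clean=False, deskew=False,
                  oversample=None, renderer=None, pagesegmode=None):
    """Assemble an ocrmypdf argument list (hypothetical helper)."""
    cmd = ["ocrmypdf"]
    if clean:
        cmd.append("--clean")
    if deskew:
        cmd.append("--deskew")
    if oversample is not None:
        cmd += ["--oversample", str(oversample)]
    if renderer is not None:
        cmd += ["--pdf-renderer", renderer]
    if pagesegmode is not None:
        cmd += ["--tesseract-pagesegmode", str(pagesegmode)]
    return cmd + [input_pdf, output_pdf]


# The five combinations compared in the list above.
COMBINATIONS = {
    1: dict(clean=True),
    2: dict(clean=True, deskew=True),
    3: dict(clean=True, deskew=True, oversample=300),
    4: dict(clean=True, deskew=True, oversample=300, renderer="tesseract"),
    5: dict(clean=True, oversample=300, renderer="tesseract"),
}


def run_all(input_pdf):
    """Run every combination; requires ocrmypdf to be installed and on PATH."""
    for number, options in COMBINATIONS.items():
        output_pdf = f"out_{number}.pdf"  # placeholder output name
        subprocess.run(build_command(input_pdf, output_pdf, **options), check=True)
```

Calling `run_all("scan.pdf")` would write out_1.pdf through out_5.pdf, which can then be opened in the same viewer to compare text selection.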
|
Thanks for your detailed investigation. The problem appears to be in Ghostscript, which ocrmypdf uses to create a PDF/A. I am waiting on them to investigate. I don't have an option to bypass Ghostscript at the moment, because no other open source tools create PDF/As that I know of (without using Ghostscript under the hood). I am considering a non-PDF/A option for users who prefer the smallest file sizes and are not concerned about archiving (see #48). That said, Preview.app does a much worse job at extracting text from a PDF than other viewers, and specifically has the problem of gluing words together where other viewers do not. Have a look in Chrome PDF Viewer or Adobe Reader and you'll probably see different and better results. This infuriating issue actually comes back to discrepancies in the PDF spec and the fact that OCR text is basically a hack to the spec done with invisible fonts. |
Yes, I was wondering about the text extraction. texttopdf does an awful job, often sticking spaces in between every character, it seems. Skim looks like Preview (and maybe is using the same system service?). Acrobat gets it right, even if the way selection works is very un-Mac-like (or un-Windows-like for that matter). Chrome PDF Viewer looked good on the page, but the copied text was all run together. My use case involves using a viewer and highlighting text to be extracted by copying or scripting, so I'd like to have the text match up as much as possible. Interesting to me is that the Canon software, which doesn't do a great job of OCR, is able to put in a text layer that Preview reads without problems (apart from the crappy OCR). |
All PDF viewers use heuristics to decide what text is contiguous because PDFs don't include this information on their own. A PDF is conceptually closer to a vector drawing than a word processor. It supports precise glyph positioning, but there is little to organize text into words, lines, or paragraphs. This is why editing PDFs is hard. It seems that Canon and Preview.app are just a lucky combination that plays well together (sometimes). Perhaps something like DjVu is a more appropriate format if you want text extraction to be perfect. |
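To make the heuristics point concrete, here is a toy sketch of the kind of rule a viewer might apply: insert a space whenever the horizontal gap between two glyphs exceeds some fraction of the average glyph width. This is an assumed, simplified rule for illustration only, not the actual algorithm of Preview, Chrome, or any other viewer; the `join_glyphs` name, the `gap_ratio` parameter, and the coordinates are all made up.

```python
from typing import List, Tuple

Glyph = Tuple[str, float, float]  # (character, left x, width)


def join_glyphs(glyphs: List[Glyph], gap_ratio: float = 0.3) -> str:
    """Rebuild one line of text from positioned glyphs.

    A space is inserted when the gap to the previous glyph is wider
    than gap_ratio times the mean glyph width (illustrative rule).
    """
    if not glyphs:
        return ""
    ordered = sorted(glyphs, key=lambda g: g[1])
    mean_width = sum(g[2] for g in ordered) / len(ordered)
    pieces = [ordered[0][0]]
    for prev, cur in zip(ordered, ordered[1:]):
        gap = cur[1] - (prev[1] + prev[2])
        if gap > gap_ratio * mean_width:
            pieces.append(" ")
        pieces.append(cur[0])
    return "".join(pieces)


# Made-up positions for the glyphs of "to be".
line = [("t", 0.0, 5.0), ("o", 5.5, 5.0), ("b", 13.0, 5.0), ("e", 18.5, 5.0)]

print(join_glyphs(line))        # to be
print(join_glyphs(line, 0.6))   # tobe
print(join_glyphs(line, 0.05))  # t o b e
```

The same glyph positions give three different strings depending on the threshold, which matches the symptoms reported in this thread: one extractor glues words together while another puts a space between every character.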
v4.2 adds an option to bypass Ghostscript and produce a plain PDF instead of PDF/A. If you use |
I have a document with two columns of text, which seem to give some trouble. When I convert the PDF to a 300-dpi tif to use directly with tesseract, the columns do get recognized successfully.
Is there a way to improve performance with ocrmypdf?