First column of two column PDF text image skipped #248

atanasj · 2018-04-04T01:31:19Z

Hi,

I have run into a problem. I have a file (2008.pdf) that I want to run ocrmypdf on. The PDF is a photocopy of a book, with two book pages per PDF page in landscape orientation. I think of this as two columns per page (but I could be thinking about this wrong).

I have run the following command:

ocrmypdf -v -d -c --oversample=300 -f 2008.pdf output/2008b.pdf >> ~/Desktop/ocrmypdfddebug.txt 2>&1

with the output in ocrmypdfddebug.txt.

The resultant output file is 2008b.pdf. As you can see, the ocr layer is only on the right side (second column) of each page.

Other relevant details:

ocrmypdf 5.7.0 
MacBook Pro (Retina, 13-inch, Late 2013)
macOS 10.13.3

Not sure if I am doing something wrong, or if this is an issue in the package. Please let me know if you need more info.


jbarlow83 · 2018-04-05T09:04:38Z

The problem is the -c (--clean) argument. This runs a program called unpaper, and unpaper is erasing all of your left columns because it expects to see only a single column.

I do not send a --layout double argument to unpaper. https://github.com/Flameeyes/unpaper/blob/master/doc/basic-concepts.md

(You can use -dci to see what the image that is sent to OCR looks like.)

I personally don't trust unpaper unsupervised (without reviewing the output), just so you're aware.

It looks like I should be sending unpaper --layout none, since that is a more conservative setting than its default.
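
To make the layout options concrete, here is a minimal sketch of how an unpaper command line with an explicit --layout value could be assembled. The helper name build_unpaper_args and the file names are hypothetical and exist only for illustration; this is not OCRmyPDF's actual code, and it only builds the argument list without running unpaper.

```python
from typing import List

# Layout values accepted by unpaper's --layout option.
VALID_LAYOUTS = ("single", "double", "none")


def build_unpaper_args(input_image: str, output_image: str,
                       layout: str = "none") -> List[str]:
    """Build an unpaper command line (hypothetical helper, not OCRmyPDF code)."""
    if layout not in VALID_LAYOUTS:
        raise ValueError(f"unknown layout: {layout!r}")
    # unpaper takes its options first, then the input and output image files.
    return ["unpaper", "--layout", layout, input_image, output_image]


# A landscape scan holding two book pages side by side (placeholder file names):
print(build_unpaper_args("page.pnm", "page_cleaned.pnm", layout="double"))
```

Passing "none" asks unpaper not to assume any page layout, which is the more conservative choice discussed above; "double" tells it to expect two pages side by side.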

atanasj · 2018-04-05T14:22:22Z

Thanks for your response.

Sounds like I should drop the -c argument, as there is no way to pass --layout none to unpaper?

Thanks for such a useful package too---truly appreciated!

jbarlow83 · 2018-04-05T20:10:03Z

Yes, drop -c.

If you have a lot of files and want to play with it, you could add --layout none or --layout double to ocrmypdf.exec.unpaper.clean.

atanasj · 2018-04-07T02:22:22Z

Great, that seems to have worked. I have three more questions, one related to this issue, and two not so related (let me know if you want me to move these questions somewhere else).

1. Where do I find ocrmypdf.exec.unpaper.clean if I wanted to play with it?
2. Is there a way to prepend a timestamp to the lines of output that I usually see in my terminal window when running ocrmypdf?
3. Is there a way to preprocess documents in ocrmypdf to turn a PDF into image files (e.g., jpg) and then back into a PDF? I ask because I have received PDF files that ocrmypdf failed to convert. I wasn't sure why, but assumed the file was corrupt (perhaps a mix of image and pdf within a pdf; I'm not sure, as I don't really know how this works). However, this was resolved by first converting the provided (corrupt) PDF into separate jpg files, then converting all the jpg files to PDF, and then running ocrmypdf to get the OCR layer. I suppose I could write a script for this (I'm still learning how to script, so this is easier said than done), but I was wondering if this is already a feature or if you are aware of an easier workflow.

Again, thanks for all the awesome!

jbarlow83 · 2018-04-07T03:45:49Z

1. src/ocrmypdf/exec/unpaper.py, def clean. If you're going to play with it, you should clone the GitHub version and install it into a Python virtual environment.
2. I think this would do it: https://unix.stackexchange.com/questions/26728/prepending-a-timestamp-to-each-line-of-output-from-a-command . Or you could also hack it into src/ocrmypdf/__main__.py in logging_factory, where the output format is configured.
3. The argument --force-ocr does that internally, and on top of that it tries really hard to preserve the quality level of images in the corrupt PDF. That being said, if the PDF is in really bad shape, OCRmyPDF might not be able to figure out enough about the file to use --force-ocr, in which case you could "refry"/completely reconstruct the input file with Ghostscript: gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -o out1.pdf in1.pdf
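
For the timestamp question, one option that avoids installing extra tools is a tiny line filter. The sketch below is only an illustration under stated assumptions: the function name prepend_timestamps, the timestamp format, and the placeholder input lines are all made up here, and none of this is part of OCRmyPDF.

```python
from datetime import datetime
from typing import Callable, Iterable, Iterator


def prepend_timestamps(lines: Iterable[str],
                       now: Callable[[], datetime] = datetime.now) -> Iterator[str]:
    """Yield each input line with the current time in front of it."""
    for line in lines:
        stamp = now().strftime("%Y-%m-%d %H:%M:%S")
        text = line.rstrip("\n")  # drop the trailing newline, if any
        yield f"{stamp} {text}"


# Demonstration on placeholder lines (not real ocrmypdf output):
for stamped in prepend_timestamps(["first line\n", "second line\n"]):
    print(stamped)
```

To use it as a filter, the demonstration loop could be replaced with one over sys.stdin (after import sys), and the combined stdout and stderr of ocrmypdf piped into the script.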

atanasj · 2018-04-07T11:47:01Z

Hi again. Thanks for all your help.

1. I think I could probably handle the def clean hacking, but I'll probably just drop the -c, as I am not sure how to set up a Python virtual environment. Although I'd like to learn, I don't really have the spare time right now.
2. This one might be a bit beyond me. I had already tried the solutions suggested in the link you referenced via the command line, and I just cannot get them to work. The first suggestion, using moreutils, conflicts with GNU parallel, as they both provide a parallel command. I tried all the others with no joy. I wanted to set up a script using watchman and GNU parallel, as you suggest in the Batch Processing section of your help page. I can set up a shell script and run it via the command line, but I don't seem to be able to get watchman to work... Anyway, that's another issue! Do you think prepended timestamps would be useful to others? Perhaps not, otherwise they likely would have been requested by now, and given the link above, there are other workarounds that could work.
3. Thanks for the tips. I'll try using the built-in --force-ocr and then progress to the Ghostscript command you provided above if I need more.
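
The "try --force-ocr first, then fall back to Ghostscript" plan could be scripted roughly as follows. This is a sketch under assumptions, not a tested workflow: the function names, the refried.pdf intermediate file name, and the injectable run parameter are hypothetical, and it assumes both ocrmypdf and gs are on the PATH and signal failure with a non-zero exit code.

```python
import subprocess
from typing import List


def force_ocr_cmd(src: str, dst: str) -> List[str]:
    """Command line for OCRmyPDF with --force-ocr."""
    return ["ocrmypdf", "--force-ocr", src, dst]


def refry_cmd(src: str, dst: str) -> List[str]:
    """The Ghostscript 'refry' command quoted earlier in this thread."""
    return ["gs", "-dBATCH", "-dNOPAUSE", "-q", "-sDEVICE=pdfwrite",
            "-o", dst, src]


def ocr_with_fallback(src: str, dst: str, refried: str = "refried.pdf",
                      run=subprocess.run) -> bool:
    """Return True if one of the two attempts exited with status 0."""
    # First attempt: let OCRmyPDF handle the file directly.
    if run(force_ocr_cmd(src, dst)).returncode == 0:
        return True
    # Fallback: rebuild the PDF with Ghostscript, then OCR the rebuilt file.
    if run(refry_cmd(src, refried)).returncode != 0:
        return False
    return run(force_ocr_cmd(refried, dst)).returncode == 0
```

Calling ocr_with_fallback("in1.pdf", "out.pdf") would run the real tools; the run parameter exists only so the logic can be exercised without them.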

Once again, thanks again for all the help and guidance and fine work.

jbarlow83 · 2018-04-08T19:50:37Z

1. It would work to locate where Homebrew installs unpaper.py and edit that file in place. Of course, when Homebrew updates ocrmypdf, your changes will be lost, which is why, long term, it's worth keeping your changes in a private fork. The latest Homebrew version is 5.7.0, which is a fair bit behind the current release.
2. Apparently you can do brew install moreutils --without-parallel to work around the conflict with parallel. I'd try watchmedo/watchdog first; it's simpler than watchman. Programs like watchmedo and watchman usually have their own timestamp abilities, so I don't think I'd want to add that; "orthogonal tools" is a good principle.
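
If neither watcher cooperates, a plain polling approach is another possibility. To be clear, the sketch below does not use watchmedo, watchdog, or watchman; it swaps in a simple directory comparison, and the function name pending_pdfs and the inbox/outbox folder convention are assumptions made for illustration.

```python
from pathlib import Path
from typing import List


def pending_pdfs(inbox: Path, outbox: Path) -> List[Path]:
    """List PDFs in inbox that have no same-named file in outbox yet."""
    return sorted(
        pdf for pdf in inbox.glob("*.pdf")
        if not (outbox / pdf.name).exists()
    )
```

A script run on a schedule could call pending_pdfs and hand each returned path to ocrmypdf, writing the result under the same name in the outbox folder so it is skipped on the next pass.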

If this is a commercial project and you'd like support for setting up a batch processing solution that is something I can offer.

atanasj · 2018-04-18T01:00:22Z

Thanks for all your help on this issue. I will close it, as your help provided a solution to my problem. Thanks again.

atanasj closed this as completed Apr 18, 2018
