First column of two column PDF text image skipped #248

Closed
atanasj opened this issue Apr 4, 2018 · 8 comments

@atanasj

atanasj commented Apr 4, 2018

Hi,

I have run into a problem. I have a file (2008.pdf) that I want to run ocrmypdf on. The PDF is a photocopy of a book, with two pages of the book per PDF page, in landscape orientation. I think of this as two columns per page (but I could be thinking about this wrong).

I have run the following command:

ocrmypdf -v -d -c --oversample=300 -f 2008.pdf output/2008b.pdf >> ~/Desktop/ocrmypdfddebug.txt 2>&1

with the output in ocrmypdfddebug.txt.

The resultant output file is 2008b.pdf. As you can see, the ocr layer is only on the right side (second column) of each page.

Other relevant details:

ocrmypdf 5.7.0 
MacBook Pro (Retina, 13-inch, Late 2013)
macOS 10.13.3

Not sure if I am doing something wrong, or if this is an issue in the package. Please let me know if you need more info.

@jbarlow83
Collaborator

The problem is the -c (--clean) argument. This runs a program called unpaper, and unpaper is erasing all of your left columns because it expects to see only a single column.

I do not send a --layout double argument to unpaper. https://github.com/Flameeyes/unpaper/blob/master/doc/basic-concepts.md

(You can use -dci to see what the image that is sent to OCR looks like.)

I personally don't trust unpaper unsupervised (without reviewing the output), just so you're aware.

It looks like I should be sending unpaper --layout none, since that is a more conservative setting than its default.

@atanasj
Author

atanasj commented Apr 5, 2018

Thanks for your response.

Sounds like I should drop the -c argument, as there is no way to pass --layout none through to unpaper?

Thanks for such a useful package too---truly appreciated!

@jbarlow83
Collaborator

Yes, drop -c.

If you have a lot of files and want to play with it, you could add --layout none or --layout double to ocrmypdf.exec.unpaper.clean.
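
As a rough sketch of what such a change amounts to, the snippet below builds an unpaper command line with an explicit --layout value. The helper name `build_unpaper_args` and the file names are hypothetical and are not OCRmyPDF's actual code; only the --layout option and its none/double values come from the discussion above, and `single` is assumed to be unpaper's name for the one-column case.

```python
from typing import List

# Layout values mentioned in this thread; "single" is an assumption about
# unpaper's name for the one-column case.
VALID_LAYOUTS = ("none", "single", "double")


def build_unpaper_args(input_file: str, output_file: str, layout: str = "none") -> List[str]:
    """Hypothetical helper: build an unpaper argument list with an explicit layout."""
    if layout not in VALID_LAYOUTS:
        raise ValueError(f"unsupported layout: {layout!r}")
    # unpaper takes its options first, then the input and output files.
    return ["unpaper", "--layout", layout, input_file, output_file]


if __name__ == "__main__":
    # Two book pages per scanned page, as in the 2008.pdf example.
    print(build_unpaper_args("page.pnm", "page_clean.pnm", layout="double"))
```

Passing the layout as a parameter keeps the conservative `none` as the default while still allowing `double` for two-up scans like this one.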

@atanasj
Author

atanasj commented Apr 7, 2018

Great, that seems to have worked. I have three more questions, one related to this issue, and two not so related (let me know if you want me to move these questions somewhere else).

  1. Where do I find ocrmypdf.exec.unpaper.clean if I wanted to play with it?
  2. Is there a way to prepend a timestamp to the lines of output that I usually see in my terminal window when running ocrmypdf?
  3. Is there a way to preprocess documents in ocrmypdf by turning a PDF into image files (e.g., jpg) and then back into a PDF? I ask because I have received PDF files that ocrmypdf would fail to convert. I wasn't sure why, but assumed that the file was corrupt (perhaps a mix of image and PDF content within a PDF; I'm not sure, as I don't really know how this works). However, this was resolved by first converting the provided (corrupt) PDF into separate jpg files, then converting all the jpg files to a PDF, and then running ocrmypdf to get the OCR layer. I suppose I could write a script for this (I'm still learning how to script, so this is easier said than done), but I was wondering if this was already a feature or if you were aware of an easier workflow.

Again, thanks for all the awesome!

@jbarlow83
Collaborator

  1. src/ocrmypdf/exec/unpaper.py, def clean. If you're going to play with it, you should clone the GitHub version and install it into a Python virtual environment.

  2. I think this would do it: https://unix.stackexchange.com/questions/26728/prepending-a-timestamp-to-each-line-of-output-from-a-command . Or you could also hack it into src/ocrmypdf/__main__.py in logging_factory where the output format is configured.

  3. The argument --force-ocr does that internally, and on top of that it tries really hard to preserve the quality level of images in the corrupt PDF. That being said, if the PDF is in really bad shape OCRmyPDF might not be able to figure out enough about the file to use --force-ocr, in which case you could "refry"/completely reconstruct the input file with Ghostscript: gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -o out1.pdf in1.pdf
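
The fallback order described in item 3 can be sketched as a small script: try --force-ocr first, and only if that fails, rebuild the PDF with Ghostscript and retry. This is a minimal sketch, not part of OCRmyPDF; the function names and the injectable `run` parameter are hypothetical, while the two command lines are the ones quoted in this thread.

```python
import subprocess
from typing import Callable, List, Sequence


def force_ocr_cmd(src: str, dst: str) -> List[str]:
    """Command line for OCRmyPDF with --force-ocr, as mentioned above."""
    return ["ocrmypdf", "--force-ocr", src, dst]


def ghostscript_refry_cmd(src: str, dst: str) -> List[str]:
    """The Ghostscript "refry" command quoted above, with file names as parameters."""
    return ["gs", "-dBATCH", "-dNOPAUSE", "-q", "-sDEVICE=pdfwrite", "-o", dst, src]


def _run(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code (0 means success)."""
    return subprocess.run(list(cmd)).returncode


def ocr_with_fallback(src: str, dst: str, refried: str,
                      run: Callable[[Sequence[str]], int] = _run) -> bool:
    """Hypothetical helper: try --force-ocr, then refry with Ghostscript and retry."""
    if run(force_ocr_cmd(src, dst)) == 0:
        return True
    # First attempt failed: reconstruct the input, then OCR the reconstructed copy.
    if run(ghostscript_refry_cmd(src, refried)) != 0:
        return False
    return run(force_ocr_cmd(refried, dst)) == 0
```

Calling `ocr_with_fallback("in1.pdf", "out.pdf", "out1.pdf")` would run the real tools, assuming both ocrmypdf and gs are on the PATH; the `run` parameter exists only so the sequence can be exercised without them.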

@atanasj
Author

atanasj commented Apr 7, 2018

Hi again. Thanks for all your help.

  1. I think I could probably handle the def clean hacking... but I'll probably just drop the -c, as I am not sure how to set up a Python virtual environment. Although I'd like to learn, I don't really have the spare time right now.
  2. This one might be a bit beyond me. I had already tried the suggested solutions in the link you referenced via the command line, and I just cannot get it to work. The first suggestion, using moreutils, conflicts with GNU parallel, as they both have parallel commands. I tried all the others and no joy. I wanted to set up a script using watchman and GNU parallel, as you suggested in the Batch Processing section of your help page. I can set up a shell script and run it via the command line, but don't seem to be able to get watchman to work... Anyway, that's another issue! Do you think prepended timestamps are something that would be useful to others? Perhaps not, otherwise it likely would have been requested by now, and given the link above, there are other workarounds that could work.
  3. Thanks for the tips. I'll try using the inbuilt --force-ocr and then progress to the Ghostscript you provided above if I need more.
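
For the timestamp question in item 2, one workaround that avoids the moreutils conflict is a small filter script that the ocrmypdf output is piped through. The sketch below is an assumption-laden example, not an OCRmyPDF feature; the function name `prepend_timestamps` and the script name used in the usage note are hypothetical.

```python
import sys
from datetime import datetime
from typing import Callable, Iterable, Iterator


def prepend_timestamps(lines: Iterable[str],
                       now: Callable[[], datetime] = datetime.now,
                       fmt: str = "%Y-%m-%d %H:%M:%S") -> Iterator[str]:
    """Yield each input line prefixed with the time at which it was read."""
    for line in lines:
        yield f"{now().strftime(fmt)} {line}"


if __name__ == "__main__":
    # Read the piped output line by line and echo it with a timestamp prefix.
    for stamped in prepend_timestamps(sys.stdin):
        sys.stdout.write(stamped)
        sys.stdout.flush()
```

Saved as, say, timestamp_lines.py, it could be used as `ocrmypdf -v 2008.pdf output/2008b.pdf 2>&1 | python3 timestamp_lines.py`. The timestamps reflect when each line reached the filter, which may lag slightly behind when ocrmypdf wrote it.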

Once again, thanks for all the help, guidance, and fine work.

@jbarlow83
Collaborator

  1. It would work to locate where homebrew installs unpaper.py and edit that file in place. Of course, when homebrew updates ocrmypdf, your changes will be lost, which is why long term it's worth keeping your changes in a private fork. The latest homebrew version is 5.7.0 which is a fair bit behind the current release.

  2. Apparently you can do brew install moreutils --without-parallel to work around the conflict with parallel. I'd try watchmedo/watchdog first; it's simpler than watchman. Programs like watchmedo and watchman usually have their own timestamp abilities, so I don't think I'd want to add that to ocrmypdf. "Orthogonal tools" is a good principle.

If this is a commercial project and you'd like support for setting up a batch processing solution that is something I can offer.

@atanasj
Author

atanasj commented Apr 18, 2018

Thanks for all your help on this issue. I will close it, as your help provided a solution to my problem. Thanks again.

@atanasj atanasj closed this as completed Apr 18, 2018