Pdf-renderer tess4 looses DPI Info with any image-preprocessing. #147

Closed
17Halbe opened this issue Mar 24, 2017 · 9 comments

17Halbe commented Mar 24, 2017

Hi there,

A 600 dpi PDF scanned with a Fujitsu ScanSnap (not yet OCR'd) that is preprocessed with any of the preprocessing parameters (I tried -c, -r, -d, --oversample DPI and --remove-background) triggers this Tesseract warning:

INFO - 1: [tesseract] Warning. Invalid resolution 0 dpi. Using 70 instead.

This is happening on the jbarlow83/ocrmypdf-tess4 Docker image.
Exact command line:
ocrmypdf -l deu -c --pdf-renderer tess4 input.pdf output.pdf
See also: [Clarification request/bug?] "Warning. Invalid resolution 0 dpi. Using 70 instead." #649
and also a Tesseract Forums entry: Invalid resolution 0 dpi. Using 70 instead.

So it seems the DPI information is lost during preprocessing (unpaper?).
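To make "DPI information" concrete, here is a minimal sketch of how resolution metadata can be present in or absent from an image file. It uses PNG and its pHYs chunk only because that can be built and read with the Python standard library; it is an assumption-laden illustration, not a description of the files OCRmyPDF or unpaper actually exchange. The helper names make_png and read_png_dpi are hypothetical.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
METRES_PER_INCH = 0.0254


def _chunk(kind, data):
    """Build one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)


def make_png(dpi=None):
    """Create a 1x1 grayscale PNG, optionally carrying a pHYs resolution chunk."""
    # IHDR: width, height, bit depth, colour type, compression, filter, interlace
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    chunks = [_chunk(b"IHDR", ihdr)]
    if dpi is not None:
        pixels_per_metre = round(dpi / METRES_PER_INCH)
        # Unit specifier 1 means "pixels per metre".
        phys = struct.pack(">IIB", pixels_per_metre, pixels_per_metre, 1)
        chunks.append(_chunk(b"pHYs", phys))
    # One scanline: filter byte 0 followed by a single gray pixel.
    chunks.append(_chunk(b"IDAT", zlib.compress(b"\x00\x00")))
    chunks.append(_chunk(b"IEND", b""))
    return PNG_SIGNATURE + b"".join(chunks)


def read_png_dpi(png_bytes):
    """Return the horizontal DPI stored in pHYs, or None if there is none."""
    if not png_bytes.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(png_bytes):
        length, kind = struct.unpack(">I4s", png_bytes[pos:pos + 8])
        data = png_bytes[pos + 8:pos + 8 + length]
        if kind == b"pHYs":
            x_ppm, _y_ppm, unit = struct.unpack(">IIB", data)
            return round(x_ppm * METRES_PER_INCH) if unit == 1 else None
        if kind == b"IEND":
            break
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC
    return None


print(read_png_dpi(make_png(dpi=600)))  # image that carries resolution metadata
print(read_png_dpi(make_png()))         # same image with the metadata dropped
```

An image written without the resolution field is still a valid image, which is why a step that drops it can go unnoticed until a later stage asks for the DPI.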

Anyone else?

regards Alex

@17Halbe 17Halbe changed the title Pdf-renderer tess4 loses DPI Info with any image-preprocessing. Pdf-renderer tess4 looses DPI Info with any image-preprocessing. Mar 24, 2017
jbarlow83 (Collaborator) commented

Looks like unpaper was indeed responsible for discarding the DPI. It always did, but the loss of this information only matters to the tess4 renderer.

Fixed in 4.5.2.

17Halbe commented Mar 25, 2017

That was a quick one! Thanks again for your effort and that great piece of software you cobbled together!

17Halbe commented Mar 25, 2017

Somehow the automated build for the Docker image (jbarlow83/ocrmypdf-tess4) didn't kick in. Can you please build it manually? Thanks a lot!

jbarlow83 commented Mar 26, 2017 via email

17Halbe commented Mar 28, 2017

Further investigations! ;)

Most of the problems are solved, but when using the --clean parameter I still get the warning:
INFO - 1: [tesseract] Warning. Invalid resolution 0 dpi. Using 70 instead.

Perhaps surprisingly, with the --clean-final parameter everything works as expected.
I also tested the following additional parameters, and they all work:

  • --rotate-pages (with a page to rotate)
  • --remove-background
  • --deskew
  • --oversample DPI
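When comparing several flag combinations like this, it can help to scan the captured log output for the warning automatically. The following is a small sketch under the assumption that the log has been saved as text. The function name find_dpi_warnings is hypothetical, and the second line of the sample log is a made-up placeholder, not real OCRmyPDF output.

```python
import re

# Matches the Tesseract warning quoted in this thread.
DPI_WARNING = re.compile(r"Invalid resolution (\d+) dpi\. Using (\d+) instead")


def find_dpi_warnings(log_text):
    """Return a (reported_dpi, fallback_dpi) pair for every warning found."""
    return [
        (int(match.group(1)), int(match.group(2)))
        for match in DPI_WARNING.finditer(log_text)
    ]


sample_log = (
    "INFO - 1: [tesseract] Warning. Invalid resolution 0 dpi. Using 70 instead.\n"
    "placeholder for an unrelated log line\n"
)

print(find_dpi_warnings(sample_log))
```

An empty result for a given run means the warning did not appear in that log.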

@jbarlow83 jbarlow83 reopened this Mar 28, 2017
jbarlow83 (Collaborator) commented

Could you provide a full command line that is still causing trouble?

Also please check that ocrmypdf --version is 4.5.2
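The version check can also be scripted. This is a minimal sketch, assuming the text printed by ocrmypdf --version contains a plain X.Y.Z number; the helper names parse_version and has_fix are hypothetical, and 4.5.2 is the release named above as containing the fix.

```python
import re

FIXED_IN = (4, 5, 2)  # release named in this thread as containing the fix


def parse_version(text):
    """Extract the first X.Y.Z version number in text as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def has_fix(version_output):
    """True if the reported version is at least the fixed release."""
    return parse_version(version_output) >= FIXED_IN


print(has_fix("4.5.1"))
print(has_fix("4.5.2"))
```

Comparing tuples of integers rather than raw strings keeps the ordering correct for versions such as 4.5.10.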

17Halbe commented Mar 29, 2017

Oh, you are right. Docker pulled an update after you kicked it, but it still shows 4.5.1.
So either the version number somehow didn't make it into the Docker image, or Docker didn't build the new image.

docker run --rm -v /myDir/:/home/docker ocrmypdf --version
4.5.1

Docker did pull something new, though, because some errors went away (same PDF, same command line).
The exact command line is:
docker run --rm -v /Homepool/Documents/Home-Folder/alex/No-Images/:/home/docker ocrmypdf -j 1 --tesseract-timeout 360 -l deu+eng -c --pdf-renderer tess4 -f input.pdf output.pdf

jbarlow83 commented Mar 29, 2017 via email

17Halbe commented Apr 3, 2017

Yep, that was the problem. I'm fairly new to Docker, so excuse my lack of knowledge!

Can be closed!
