[BUG] --deskew not compatible with blank pages or with tesseract_timeout = 0 #1049

Closed
deexpabada opened this issue Dec 15, 2022 · 6 comments

Comments

@deexpabada

deexpabada commented Dec 15, 2022

Describe the bug
The --deskew option is not behaving as expected on OCRmyPDF 13.7.0. I am experiencing two issues related to deskew.

Issue 1: Deskew not working on blank pages

I'm using the options --deskew --output-type=pdf --tesseract-timeout=30 on this blank_image.pdf.
When I run the Ocrmypdf command above, I get a SubprocessOutputError. I see that issue is referenced here: #868, but I don't think the bug fix covered all scenarios.

Issue 2: Deskew not working with tesseract_timeout=0

I want to deskew PDFs without running OCR on them, as mentioned in the docs here. However, when --tesseract-timeout=0, the document is not being deskewed because OCR is not being run. If I change --tesseract-timeout to a different integer, it successfully deskews. Here is a skewed PDF that can be used to reproduce the issue: skewed_text.pdf

To Reproduce
Issue 1: Use blank_image.pdf and run ocrmypdf --deskew --output-type=pdf --tesseract-timeout=30 blank_image.pdf result.pdf.
Issue 2: Use skewed_text.pdf and run ocrmypdf --deskew --output-type=pdf --tesseract-timeout=0 skewed_text.pdf result.pdf.

Expected behavior
I expect that blank pages do not completely block the ocrmypdf command from running. It should be able to gracefully handle the error and skip deskewing that specific page.
I expect that with --tesseract-timeout=0 the page can be deskewed without having OCR applied.
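
Until blank pages are handled gracefully, one possible workaround is to retry without --deskew when the first attempt fails. The following is a minimal sketch, not part of OCRmyPDF: `run_with_deskew_fallback` and its `runner` parameter are hypothetical names, and it assumes that ocrmypdf exits with a non-zero code on SubprocessOutputError. It retries the whole file rather than skipping a single page, and it retries on any failure, not only on blank pages.

```python
import subprocess
from typing import Callable, Sequence


def _run(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def run_with_deskew_fallback(
    input_pdf: str,
    output_pdf: str,
    extra_args: Sequence[str] = (),
    runner: Callable[[Sequence[str]], int] = _run,
) -> bool:
    """Try ocrmypdf with --deskew; if that fails, retry once without it.

    Returns True if the deskewed run succeeded, False if the fallback
    (no deskew) was used. Raises RuntimeError if both attempts fail.
    """
    without_deskew = ["ocrmypdf", *extra_args, input_pdf, output_pdf]
    with_deskew = ["ocrmypdf", "--deskew", *extra_args, input_pdf, output_pdf]

    if runner(with_deskew) == 0:
        return True
    if runner(without_deskew) == 0:
        return False
    raise RuntimeError(
        f"ocrmypdf failed for {input_pdf} with and without --deskew"
    )
```

For Issue 1 this could be called as `run_with_deskew_fallback("blank_image.pdf", "result.pdf", ["--output-type=pdf", "--tesseract-timeout=30"])`.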

Screenshots
Deskew with 0 second timeout:
skewed_with_0_second_timeout
Deskew with 30 second timeout:
skewed_with_30_second_timeout

System (please complete the following information):

  • OS: MacOS Ventura 13.0.1
  • OCRmyPDF version: 13.7.0

Installation
brew install ocrmypdf

@HF706

HF706 commented Mar 21, 2023

I want to deskew PDFs without running OCR. The --deskew option is not working. OCRmyPDF v14.0.4.

@deexpabada
Author

@jbarlow83 Any thoughts on the above? :)

@jbarlow83
Collaborator

Fixed as of v14.1.0
--tesseract-timeout now applies only to OCR, not to deskew; a new option, --tesseract-non-ocr-timeout, allows control of the timeout on deskew if necessary.
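
As an illustration of the behavior described above, a deskew-only invocation on v14.1.0 or later might be assembled like this. This is a sketch under assumptions: `deskew_only_command` is a hypothetical helper, not an OCRmyPDF function, and the 30-second value in the usage note is only an example.

```python
from typing import List, Optional


def deskew_only_command(
    input_pdf: str,
    output_pdf: str,
    non_ocr_timeout: Optional[float] = None,
) -> List[str]:
    """Build an ocrmypdf command line that deskews but skips OCR.

    --tesseract-timeout=0 disables OCR; from v14.1.0 on it no longer
    affects deskew. The optional non_ocr_timeout is passed as
    --tesseract-non-ocr-timeout to bound the deskew step instead.
    """
    cmd = ["ocrmypdf", "--deskew", "--output-type=pdf", "--tesseract-timeout=0"]
    if non_ocr_timeout is not None:
        cmd.append(f"--tesseract-non-ocr-timeout={non_ocr_timeout:g}")
    cmd += [input_pdf, output_pdf]
    return cmd
```

The resulting list can be handed to `subprocess.run`, for example `subprocess.run(deskew_only_command("skewed_text.pdf", "result.pdf", 30), check=True)`.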

@akki42

akki42 commented Jun 25, 2023

Hi @jbarlow83,
Thank you for adding the --tesseract-non-ocr-timeout option.
However, that does not seem to address the "Issue 1: Deskew not working on blank pages" reported above and in #868 .

With v14.3.0, I am still encountering that problem:

akki@op:~ > ocrmypdf --deskew empty.pdf result.pdf
Scanning contents: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 256.17page/s]
1 [tesseract] [DS] Profile read from file (tesseract_opencl_profile_devices.dat).
1 [tesseract] [DS] Device[1] 0:(null) score is 0.511918
1 [tesseract] [DS] Selected Device[1]: "(null)" (Native)
OCR: 0%| | 0.0/1.0 [00:00<?, ?page/s]
SubprocessOutputError

Adding --rotate-pages, tesseract gives an additional error message:

akki@op:~ > ocrmypdf --deskew --rotate-pages empty.pdf result.pdf
Scanning contents: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 287.71page/s]
1 [tesseract] [DS] Profile read from file (tesseract_opencl_profile_devices.dat).
1 [tesseract] [DS] Device[1] 0:(null) score is 0.511918
1 [tesseract] [DS] Selected Device[1]: "(null)" (Native)
1 [tesseract] Too few characters. Skipping this page
1 [tesseract] Error during processing.
1 page is facing ⇧, confidence 0.00 - no change
1 [tesseract] [DS] Profile read from file (tesseract_opencl_profile_devices.dat).
1 [tesseract] [DS] Device[1] 0:(null) score is 0.511918
1 [tesseract] [DS] Selected Device[1]: "(null)" (Native)
OCR: 0%| | 0.0/1.0 [00:01<?, ?page/s]
SubprocessOutputError

Version information:

akki@op:~ > ocrmypdf --version
14.3.0
akki@op:~ > tesseract --version
tesseract 5.3.1
leptonica-1.82.0
libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.1) : libpng 1.6.40 : libtiff 4.0.9 : zlib 1.2.13 : libwebp 1.0.3 : libopenjp2 2.3.0
OpenCL info:
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found OpenMP 201511
Found libarchive 3.5.1 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.5.0
Found libcurl/8.0.1 OpenSSL/1.1.1l-fips zlib/1.2.13 brotli/1.0.7 zstd/1.5.0 libidn2/2.2.0 libpsl/0.20.1 (+libidn2/2.2.0) libssh/0.9.6/openssl/zlib nghttp2/1.40.0

Sample file:
empty.pdf

As pointed out by @azogue in #868 (comment), d48254d seems not to cover sample pages like this.

(I encounter this error quite frequently when duplex-scanning incoming mail, where a sheet, e.g. the last one, has an empty back page.)

Any chance to address this? Happy to provide further information or test solutions / workarounds.
Thank you!

@jbarlow83
Collaborator

You are also using the OpenCL version of Tesseract. Although I haven't checked in recently, when I last looked it was quite unstable and not something anyone should expect to work reliably. As you can see from the commits, I'm relying on Tesseract to provide consistent error messages, and it seems to not do that for OpenCL. What happens if you use regular Tesseract?

@akki42

akki42 commented Jun 25, 2023

Thanks for the quick response.
I am using openSUSE 15.5; the above mentioned tesseract version was installed from one of the main repositories.
If I understand the changelog correctly, the SUSE maintainers have been building tesseract with OpenCL enabled since December 2019.

Using the Tesseract build mentioned in the Tesseract online documentation (currently at version 5.3.0), which is built without the OpenCL flag and is packaged for openSUSE 15.4 rather than 15.5, does indeed work without a SubprocessOutputError.

I will stick to that for the time being...
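
For anyone wanting to check whether their own Tesseract is an OpenCL build, a rough sketch follows. It assumes that an "OpenCL info:" line in the `tesseract --version` output, as seen in the version information posted above, indicates an OpenCL-enabled build; both function names are hypothetical.

```python
import subprocess


def version_output_mentions_opencl(version_text: str) -> bool:
    """Return True if any line of the version output starts with 'OpenCL info'."""
    return any(
        line.strip().lower().startswith("opencl info")
        for line in version_text.splitlines()
    )


def installed_tesseract_uses_opencl() -> bool:
    """Inspect the tesseract binary found on PATH.

    stdout and stderr are combined because the stream used for
    version output may differ between Tesseract versions.
    """
    result = subprocess.run(
        ["tesseract", "--version"], capture_output=True, text=True
    )
    return version_output_mentions_opencl(result.stdout + result.stderr)
```

If `installed_tesseract_uses_opencl()` returns True, switching to a build without OpenCL, as described above, may avoid the SubprocessOutputError.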
