[BUG] tesseract returns SIGFPE Signal #1062

Closed
C0D3D3V opened this issue Jan 17, 2023 · 4 comments

Comments

@C0D3D3V

C0D3D3V commented Jan 17, 2023

Describe the bug
tesseract is killed by a SIGFPE signal while OCRmyPDF is processing page 42:

   41 Rasterize with png16m, rotation 0
   41 Running: ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-dInterpolateControl=-1', '-sDEVICE=png16m', '-dFirstPage=41', '-dLastPage=41', '-r599.441022x599.441022', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', '/tmp/ocrmypdf.io.29mqbkv2/origin.pdf']
   40 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
   40 Grafting
   40 Page rotation: (content, auto) -> page = (0, 0) -> 0
   41 Rotating output by 0
   41 resolution (599.4399999999999, 599.4399999999999)
   41 Running: ['tesseract', '-l', 'deu', '-c', 'textonly_pdf=1', '/tmp/ocrmypdf.io.29mqbkv2/000041_ocr.png', '/tmp/ocrmypdf.io.29mqbkv2/000041_ocr_tess', 'pdf', 'txt']
   42 Rasterize with png16m, rotation 0
   42 Running: ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-dInterpolateControl=-1', '-sDEVICE=png16m', '-dFirstPage=42', '-dLastPage=42', '-r599.441022x599.441022', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', '/tmp/ocrmypdf.io.29mqbkv2/origin.pdf']
   41 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
   41 Grafting
   41 Page rotation: (content, auto) -> page = (0, 0) -> 0
   42 Rotating output by 0
   42 resolution (599.4399999999999, 599.4399999999999)
   42 Running: ['tesseract', '-l', 'deu', '-c', 'textonly_pdf=1', '/tmp/ocrmypdf.io.29mqbkv2/000042_ocr.png', '/tmp/ocrmypdf.io.29mqbkv2/000042_ocr_tess', 'pdf', 'txt']
   42 [tesseract] Image too small to scale!! (2x48 vs min width of 3)
   42 [tesseract] Line cannot be recognized!!
   43 Rasterize with png16m, rotation 0
   43 Running: ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-dInterpolateControl=-1', '-sDEVICE=png16m', '-dFirstPage=43', '-dLastPage=43', '-r599.441022x599.441022', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', '/tmp/ocrmypdf.io.29mqbkv2/origin.pdf']
   43 Rotating output by 0
   43 resolution (599.4399999999999, 599.4399999999999)
   43 Running: ['tesseract', '-l', 'deu', '-c', 'textonly_pdf=1', '/tmp/ocrmypdf.io.29mqbkv2/000043_ocr.png', '/tmp/ocrmypdf.io.29mqbkv2/000043_ocr_tess', 'pdf', 'txt']
OCR:  49%|█████████████████████████████████████████████████████████████████████████████████                                                                                     | 41.0/84.0 [09:55<10:24, 14.52s/page]
ExitCodeException
Traceback (most recent call last):
  File "/home/daniel/.local/lib/python3.10/site-packages/ocrmypdf/_exec/tesseract.py", line 401, in generate_pdf
    p = run(args_tesseract, stdout=PIPE, stderr=STDOUT, timeout=timeout, check=True)
  File "/home/daniel/.local/lib/python3.10/site-packages/ocrmypdf/subprocess/__init__.py", line 57, in run
    proc = subprocess_run(args, env=env, check=check, **kwargs)
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['tesseract', '-l', 'deu', '-c', 'textonly_pdf=1', '/tmp/ocrmypdf.io.29mqbkv2/000042_ocr.png', '/tmp/ocrmypdf.io.29mqbkv2/000042_ocr_tess', 'pdf', 'txt']' died with <Signals.SIGFPE: 8>.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/daniel/.local/lib/python3.10/site-packages/ocrmypdf/_sync.py", line 393, in run_pipeline
    optimize_messages = exec_concurrent(context, executor)
  File "/home/daniel/.local/lib/python3.10/site-packages/ocrmypdf/_sync.py", line 280, in exec_concurrent
    executor(
  File "/home/daniel/.local/lib/python3.10/site-packages/ocrmypdf/_concurrent.py", line 87, in __call__
    self._execute(
  File "/home/daniel/.local/lib/python3.10/site-packages/ocrmypdf/builtin_plugins/concurrency.py", line 141, in _execute
    result = future.result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/daniel/.local/lib/python3.10/site-packages/ocrmypdf/_sync.py", line 220, in exec_page_sync
    (ocr_out, text_out) = ocr_engine_textonly_pdf(ocr_image_out, page_context)
  File "/home/daniel/.local/lib/python3.10/site-packages/ocrmypdf/_pipeline.py", line 661, in ocr_engine_textonly_pdf
    ocr_engine.generate_pdf(
  File "/home/daniel/.local/lib/python3.10/site-packages/ocrmypdf/builtin_plugins/tesseract_ocr.py", line 189, in generate_pdf
    tesseract.generate_pdf(
  File "/home/daniel/.local/lib/python3.10/site-packages/ocrmypdf/_exec/tesseract.py", line 413, in generate_pdf
    raise SubprocessOutputError() from e
ocrmypdf.exceptions.SubprocessOutputError
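
For context on the first traceback: on POSIX systems, Python's subprocess module reports a child that was killed by a signal as a negative return code, which is where "died with <Signals.SIGFPE: 8>" comes from. Below is a minimal sketch of how a caller could tell a signal death apart from an ordinary non-zero exit. The helper names are illustrative and not part of OCRmyPDF, and the crashing child is a stand-in Python process that sends SIGFPE to itself, not a real tesseract run.

```python
import signal
import subprocess
import sys


def describe_failure(exc: subprocess.CalledProcessError) -> str:
    """Summarize why a child process failed (hypothetical helper)."""
    if exc.returncode < 0:
        # On POSIX, a negative return code -N means "killed by signal N".
        return f"killed by {signal.Signals(-exc.returncode).name}"
    return f"exited with status {exc.returncode}"


def run_and_describe(args: list[str]) -> str:
    """Run a command and report success or the kind of failure."""
    try:
        subprocess.run(args, check=True, capture_output=True)
    except subprocess.CalledProcessError as exc:
        return describe_failure(exc)
    return "ok"


# Stand-in child: a Python process that sends SIGFPE to itself.
CRASHING_CHILD = [
    sys.executable,
    "-c",
    "import os, signal; os.kill(os.getpid(), signal.SIGFPE)",
]

if __name__ == "__main__":
    print(run_and_describe(CRASHING_CHILD))
```

A signal death like this means the tesseract process itself crashed; it did not exit with an error status of its own choosing.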

To Reproduce

ocrmypdf -v -l deu --jobs 1  'test.pdf' 'test.pdf'

I also tried without --jobs and with --force-ocr

Example file

This only happens with this test file; on 33 similar files it worked without problems.

The test file is available for 30 days:
https://easyupload.io/as1sst

System

  • OS: Arch Linux 6.1.6-arch1-1
  • OCRmyPDF Version: 14.0.2
  • How did you install ocrmypdf? pip
C0D3D3V changed the title from "[BUG]" to "[BUG] tesseract returns SIGFPE Signal" on Jan 17, 2023
@C0D3D3V
Author

C0D3D3V commented Jan 17, 2023

I opened an issue on the tesseract repo too; I guess it's not really related to OCRmyPDF:
tesseract-ocr/tesseract#3995
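
One way to check whether the crash is independent of OCRmyPDF is to rerun the tesseract command from the log on its own. The sketch below rebuilds the argument list seen in the log for page 42. The function names and file paths are placeholders, and it assumes the page image is still available, for example rasterized again by hand, since the log only shows a temporary directory.

```python
import subprocess


def tesseract_command(image: str, output_base: str, lang: str = "deu") -> list[str]:
    """Rebuild the tesseract invocation shown in the OCRmyPDF log."""
    return [
        "tesseract",
        "-l", lang,
        "-c", "textonly_pdf=1",
        image,
        output_base,
        "pdf", "txt",
    ]


def reproduce(image: str, output_base: str) -> int:
    """Run tesseract directly and return its raw return code.

    On POSIX a negative value means the process was killed by a signal.
    """
    result = subprocess.run(tesseract_command(image, output_base), check=False)
    return result.returncode
```

Calling reproduce("000042_ocr.png", "000042_ocr_tess") with the image of the failing page would then show whether tesseract alone dies with the same signal.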

@C0D3D3V
Author

C0D3D3V commented Jan 17, 2023

An option to ignore tesseract errors would be nice, so that a page with an error is just skipped instead of crashing OCRmyPDF.

@jbarlow83
Collaborator

I'm reluctant to add such an option because it could mask more serious issues than a one-time failure. I think it's reasonable for the program to ask for user intervention in this case, and an exception is a good way of doing that.

One could write a plugin to suppress errors from the OCR engine if needed.
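
A rough sketch of the error-suppressing idea, kept independent of OCRmyPDF's actual plugin interface: wrap the call that runs OCR for one page, and when it fails, log the failure and fall back to a placeholder result instead of propagating the exception. Every name here (ocr_page_or_skip, run_ocr, fallback, failing_ocr) is hypothetical. A real plugin would have to supply its own OCR engine through OCRmyPDF's plugin hooks and produce valid empty output for the skipped page.

```python
import logging
from typing import Callable, TypeVar

log = logging.getLogger(__name__)
T = TypeVar("T")


def ocr_page_or_skip(
    page_number: int,
    run_ocr: Callable[[], T],
    fallback: Callable[[], T],
    skipped: list[int],
) -> T:
    """Run OCR for one page; on failure, record the page and use a fallback.

    run_ocr and fallback are hypothetical callables supplied by the caller.
    """
    try:
        return run_ocr()
    except Exception as exc:  # deliberately broad: this is the "mask errors" trade-off
        log.warning("OCR failed on page %d, skipping: %r", page_number, exc)
        skipped.append(page_number)
        return fallback()


def failing_ocr() -> str:
    """Stand-in for an OCR call that crashes, as tesseract did on page 42."""
    raise RuntimeError("tesseract died with SIGFPE")


if __name__ == "__main__":
    skipped_pages: list[int] = []
    texts = [
        ocr_page_or_skip(41, lambda: "text of page 41", lambda: "", skipped_pages),
        ocr_page_or_skip(42, failing_ocr, lambda: "", skipped_pages),
    ]
    print(texts, skipped_pages)
```

Recording the skipped pages keeps each failure visible to the user, which goes some way toward the concern about masking more serious problems.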

jbarlow83 closed this as not planned on Jan 17, 2023
@C0D3D3V
Author

C0D3D3V commented Jan 17, 2023

Just a side note: gscan2pdf also issued the warnings

42 [tesseract] Image too small to scale!! (2x48 vs min width of 3)
42 [tesseract] Line cannot be recognized!!

for page 42, but it did not crash and created a complete PDF. It also uses tesseract. I tried to dig into the gscan2pdf code to find a difference in how it executes tesseract, but gave up. (I guess it has a fallback to cuneiform/gocr, but I'm not totally sure.)
