ZeroDivisionError #253

Closed
jullit31 opened this issue Apr 14, 2018 · 3 comments

Comments

@jullit31

When running ocrmypdf on a specific PDF (which I cannot share here), I get a ZeroDivisionError: float division by zero error. The PDF was created from PNGs using PDF24. Other PDFs created in the same manner work fine, as does piping img2pdf's output using the same PNGs.
Running ocrmypdf -v 1 in.pdf out.pdf |& tee debug.txt gives me this:
debug.txt

I'm running Ubuntu 16.04 under WSL.

@jbarlow83
Collaborator

jbarlow83 commented Apr 14, 2018 via email

jbarlow83 pushed a commit that referenced this issue Apr 15, 2018
Issue #253 - PDF that produces the error is not available, but if font_width
is zero, chances are the text is nonprinting characters, so suppress it.
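
The guard described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the function name `horizontal_scale_percent` and its parameters are hypothetical, and it assumes the renderer stretches each word to fit its OCR bounding box by dividing the box width by the measured width of the text.

```python
def horizontal_scale_percent(box_width, font_width):
    """Return the horizontal stretch (in percent) needed to fit text in a box.

    Hypothetical helper: returns None when the measured text width is zero
    or negative, which tells the caller to suppress the text entirely.
    """
    if font_width <= 0:
        # Nothing measurable to draw (likely nonprinting characters), so
        # skip the word instead of dividing by zero.
        return None
    return 100.0 * box_width / font_width
```

A caller would simply skip any word for which the helper returns `None`.
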
@jullit31
Author

I can confirm that this solved the problem. Thanks!

@gwern

gwern commented Mar 7, 2019

I recently got a strange PDF from a university scanning service, the second of 2 PDFs. The first PDF processed without any issues, but the second one crashes with the same division by zero error as above:

ocrmypdf --version; ocrmypdf 462763_Vol2.pdf 462763_Vol2-small.pdf 
8.0.0
  ERROR - Traceback (most recent call last):
  File "/home/gwern/.local/lib/python3.6/site-packages/ruffus/task.py", line 712, in run_pooled_job_without_exceptions
    register_cleanup, touch_files_only)
  File "/home/gwern/.local/lib/python3.6/site-packages/ruffus/task.py", line 544, in job_wrapper_io_files
    ret_val = user_defined_work_func(*params)
  File "/home/gwern/bin/miniconda2/envs/fastai/lib/python3.6/site-packages/ocrmypdf/_pipeline.py", line 171, in repair_and_parse_pdf
    output_file, detailed_page_analysis=detailed_page_analysis, log=log
  File "/home/gwern/bin/miniconda2/envs/fastai/lib/python3.6/site-packages/ocrmypdf/pdfinfo/__init__.py", line 753, in __init__
    infile, detailed_page_analysis, log=log
  File "/home/gwern/bin/miniconda2/envs/fastai/lib/python3.6/site-packages/ocrmypdf/pdfinfo/__init__.py", line 630, in _pdf_get_all_pageinfo
    page = PageInfo(pdf, n, infile, page_xml, detailed_analysis)
  File "/home/gwern/bin/miniconda2/envs/fastai/lib/python3.6/site-packages/ocrmypdf/pdfinfo/__init__.py", line 640, in __init__
    self._pageinfo = _pdf_get_pageinfo(pdf, pageno, infile, xmltext)
  File "/home/gwern/bin/miniconda2/envs/fastai/lib/python3.6/site-packages/ocrmypdf/pdfinfo/__init__.py", line 606, in _pdf_get_pageinfo
    xres = Decimal(max(image.xres for image in pageinfo['images']))
  File "/home/gwern/bin/miniconda2/envs/fastai/lib/python3.6/site-packages/ocrmypdf/pdfinfo/__init__.py", line 606, in <genexpr>
    xres = Decimal(max(image.xres for image in pageinfo['images']))
  File "/home/gwern/bin/miniconda2/envs/fastai/lib/python3.6/site-packages/ocrmypdf/pdfinfo/__init__.py", line 346, in xres
    return _get_dpi(self._shorthand, (self._width, self._height))[0]
  File "/home/gwern/bin/miniconda2/envs/fastai/lib/python3.6/site-packages/ocrmypdf/pdfinfo/__init__.py", line 255, in _get_dpi
    scale_w = image_size[0] / image_drawn_width
ZeroDivisionError: float division by zero
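
The last frame of the traceback divides the image's pixel width by the width at which it is drawn on the page, so an image drawn with zero width triggers the error. A defensive version of that calculation might look like the sketch below. It is an illustration only, not OCRmyPDF's actual fix: the names `get_dpi_safe` and `max_xres` are hypothetical, and it assumes the drawn size is derived from the first four entries of the transformation matrix shorthand and measured in PDF points (72 per inch), which the traceback does not show.

```python
from math import hypot

POINTS_PER_INCH = 72.0


def get_dpi_safe(ctm_shorthand, image_size):
    """Return (xres, yres) in DPI, or None for an image drawn with zero size.

    ctm_shorthand: six numbers (a, b, c, d, e, f) describing the transform.
    image_size: (width, height) of the image in pixels.
    """
    a, b, c, d, _, _ = ctm_shorthand
    drawn_width = hypot(a, b)   # assumed: drawn width in points
    drawn_height = hypot(c, d)  # assumed: drawn height in points
    if drawn_width == 0 or drawn_height == 0:
        # Degenerate image: it occupies no area, so it has no meaningful DPI.
        return None
    return (
        image_size[0] / drawn_width * POINTS_PER_INCH,
        image_size[1] / drawn_height * POINTS_PER_INCH,
    )


def max_xres(images):
    """Largest horizontal DPI among (ctm_shorthand, image_size) pairs.

    Degenerate images are ignored; returns None if none are usable.
    """
    resolutions = (get_dpi_safe(ctm, size) for ctm, size in images)
    xres_values = [res[0] for res in resolutions if res is not None]
    return max(xres_values) if xres_values else None
```

Returning `None` rather than an arbitrary number leaves it to the caller to decide how to treat a page whose images are all degenerate.
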

However, the suggested workaround no longer seems to exist: running with --pdf-renderer tesseract reports that it is not a valid choice:

usage: ocrmypdf [-h] [-l LANGUAGE] [--image-dpi DPI] [--output-type {pdfa,pdf,pdfa-1,pdfa-2,pdfa-3}] [--sidecar [FILE]] [--version] [-j N] [-q] [-v [VERBOSE]]
                [--title TITLE] [--author AUTHOR] [--subject SUBJECT] [--keywords KEYWORDS] [-r] [--remove-background] [-d] [-c] [-i] [--oversample DPI]
                [--remove-vectors] [--mask-barcodes] [--threshold] [-f] [-s] [--redo-ocr] [--skip-big MPixels] [-O {0,1,2,3}] [--jpeg-quality Q] [--png-quality Q]
                [--jbig2-lossy] [--max-image-mpixels MPixels] [--tesseract-config CFG] [--tesseract-pagesegmode PSM] [--tesseract-oem MODE]
                [--pdf-renderer {auto,hocr,sandwich}] [--tesseract-timeout SECONDS] [--rotate-pages-threshold CONFIDENCE]
                [--pdfa-image-compression {auto,jpeg,lossless}] [--user-words FILE] [--user-patterns FILE] [-k] [--flowchart FLOWCHART]
                input_pdf_or_image output_pdf
ocrmypdf: error: argument --pdf-renderer: invalid choice: 'tesseract' (choose from 'auto', 'hocr', 'sandwich')

I'm pretty sure I have Tesseract installed, and if I run with hocr or sandwich instead, the crash is identical.

As another workaround, I tried opening the PDF in gscan2pdf, deleting pages that looked extremely weird (like a single large character), exporting to DjVu (because of a gscan2pdf file-size issue when exporting to PDF directly), converting back to PDF with ddjvu, and running ocrmypdf on that version. This gives an entirely different set of errors:

...
   INFO -  492: background removal skipped on mono page
  ERROR - Traceback (most recent call last):
  File "/home/gwern/.local/lib/python3.6/site-packages/ruffus/task.py", line 712, in run_pooled_job_without_exceptions
    register_cleanup, touch_files_only)
  File "/home/gwern/.local/lib/python3.6/site-packages/ruffus/task.py", line 544, in job_wrapper_io_files
    ret_val = user_defined_work_func(*params)
  File "/home/gwern/bin/miniconda2/envs/fastai/lib/python3.6/site-packages/ocrmypdf/_pipeline.py", line 540, in preprocess_remove_background
    leptonica.remove_background(input_file, output_file)
  File "/home/gwern/bin/miniconda2/envs/fastai/lib/python3.6/site-packages/ocrmypdf/leptonica.py", line 861, in remove_background
    pix = pix.background_norm(tile_size=tile_size).gamma_trc(
  File "/home/gwern/bin/miniconda2/envs/fastai/lib/python3.6/site-packages/ocrmypdf/leptonica.py", line 563, in background_norm
    smooth_kernel[1],
  File "/home/gwern/bin/miniconda2/envs/fastai/lib/python3.6/site-packages/ocrmypdf/leptonica.py", line 148, in __init__
    raise ValueError('Tried to wrap a NULL ' + self.LEPTONICA_TYPENAME)
ValueError: Tried to wrap a NULL PIX

  ERROR - Traceback (most recent call last):
  File "/home/gwern/.local/lib/python3.6/site-packages/ruffus/task.py", line 712, in run_pooled_job_without_exceptions
    register_cleanup, touch_files_only)
  File "/home/gwern/.local/lib/python3.6/site-packages/ruffus/task.py", line 544, in job_wrapper_io_files
    ret_val = user_defined_work_func(*params)
  File "/home/gwern/bin/miniconda2/envs/fastai/lib/python3.6/site-packages/ocrmypdf/_pipeline.py", line 540, in preprocess_remove_background
    leptonica.remove_background(input_file, output_file)
  File "/home/gwern/bin/miniconda2/envs/fastai/lib/python3.6/site-packages/ocrmypdf/leptonica.py", line 861, in remove_background
    pix = pix.background_norm(tile_size=tile_size).gamma_trc(
  File "/home/gwern/bin/miniconda2/envs/fastai/lib/python3.6/site-packages/ocrmypdf/leptonica.py", line 563, in background_norm
    smooth_kernel[1],
  File "/home/gwern/bin/miniconda2/envs/fastai/lib/python3.6/site-packages/ocrmypdf/leptonica.py", line 148, in __init__
    raise ValueError('Tried to wrap a NULL ' + self.LEPTONICA_TYPENAME)
ValueError: Tried to wrap a NULL PIX
...

A puzzling PDF. I'm not sure how to deal with these errors, so for now I'm settling for the first PDF/half being properly processed, leaving the second half alone, and just concatenating them into a single PDF.
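
For anyone wanting to script that final concatenation step, one possible sketch is below. It assumes the `qpdf` command-line tool is installed and uses its `--empty --pages ... --` form for merging files; the helper names and the file names in the usage note are hypothetical, not the ones used above.

```python
import subprocess


def build_concat_command(input_paths, output_path):
    """Build a qpdf command that joins the input PDFs in the given order."""
    if len(input_paths) < 2:
        raise ValueError("need at least two PDFs to concatenate")
    # "--empty" starts from a blank document; "--pages ... --" lists the
    # source files whose pages are appended in order.
    return ["qpdf", "--empty", "--pages", *input_paths, "--", output_path]


def concatenate_pdfs(input_paths, output_path):
    """Run qpdf to write the concatenated PDF; raises if qpdf fails."""
    subprocess.run(build_concat_command(input_paths, output_path), check=True)
```

For example, `concatenate_pdfs(["first-half-ocr.pdf", "second-half-original.pdf"], "combined.pdf")` would write the OCRed first half followed by the untouched second half.
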
