ZeroDivisionError #253

jullit31 · 2018-04-14T17:46:10Z

When running ocrmypdf on a specific PDF (which I cannot share here), I get a ZeroDivisionError: float division by zero error. The PDF was created from PNGs using PDF24. Other PDFs created in the same manner work fine, as does piping img2pdf's output using the same PNGs.
Running ocrmypdf -v 1 in.pdf out.pdf |& tee debug.txt gives me this:
debug.txt <https://github.com/jbarlow83/OCRmyPDF/files/1910673/debug.txt>

I'm running Ubuntu 16.04 with the WSL.


jbarlow83 · 2018-04-14T19:51:22Z

The log is probably enough for me to fix it. You should be able to work around it by adding the argument --pdf-renderer tesseract.


jullit31 · 2018-04-19T21:31:38Z

I can confirm that this solved the problem. Thanks!

gwern · 2019-03-07T14:36:16Z

I recently got a strange PDF from a university scanning service, the second of 2 PDFs. The first PDF processed without any issues, but the second one crashes with the same division by zero error as above:

ocrmypdf --version; ocrmypdf 462763_Vol2.pdf 462763_Vol2-small.pdf 
8.0.0
  ERROR - Traceback (most recent call last):
  File "/home/gwern/.local/lib/python3.6/site-packages/ruffus/task.py", line 712, in run_pooled_job_without_exceptions
    register_cleanup, touch_files_only)
  File "/home/gwern/.local/lib/python3.6/site-packages/ruffus/task.py", line 544, in job_wrapper_io_files
    ret_val = user_defined_work_func(*params)
  File "/home/gwern/bin/miniconda2/envs/fastai/lib/python3.6/site-packages/ocrmypdf/_pipeline.py", line 171, in repair_and_parse_pdf
    output_file, detailed_page_analysis=detailed_page_analysis, log=log
  File "/home/gwern/bin/miniconda2/envs/fastai/lib/python3.6/site-packages/ocrmypdf/pdfinfo/__init__.py", line 753, in __init__
    infile, detailed_page_analysis, log=log
  File "/home/gwern/bin/miniconda2/envs/fastai/lib/python3.6/site-packages/ocrmypdf/pdfinfo/__init__.py", line 630, in _pdf_get_all_pageinfo
    page = PageInfo(pdf, n, infile, page_xml, detailed_analysis)
  File "/home/gwern/bin/miniconda2/envs/fastai/lib/python3.6/site-packages/ocrmypdf/pdfinfo/__init__.py", line 640, in __init__
    self._pageinfo = _pdf_get_pageinfo(pdf, pageno, infile, xmltext)
  File "/home/gwern/bin/miniconda2/envs/fastai/lib/python3.6/site-packages/ocrmypdf/pdfinfo/__init__.py", line 606, in _pdf_get_pageinfo
    xres = Decimal(max(image.xres for image in pageinfo['images']))
  File "/home/gwern/bin/miniconda2/envs/fastai/lib/python3.6/site-packages/ocrmypdf/pdfinfo/__init__.py", line 606, in <genexpr>
    xres = Decimal(max(image.xres for image in pageinfo['images']))
  File "/home/gwern/bin/miniconda2/envs/fastai/lib/python3.6/site-packages/ocrmypdf/pdfinfo/__init__.py", line 346, in xres
    return _get_dpi(self._shorthand, (self._width, self._height))[0]
  File "/home/gwern/bin/miniconda2/envs/fastai/lib/python3.6/site-packages/ocrmypdf/pdfinfo/__init__.py", line 255, in _get_dpi
    scale_w = image_size[0] / image_drawn_width
ZeroDivisionError: float division by zero

However, the workaround suggested no longer seems to exist? Trying with --pdf-renderer tesseract tells me that's not an option:

usage: ocrmypdf [-h] [-l LANGUAGE] [--image-dpi DPI] [--output-type {pdfa,pdf,pdfa-1,pdfa-2,pdfa-3}] [--sidecar [FILE]] [--version] [-j N] [-q] [-v [VERBOSE]]
                [--title TITLE] [--author AUTHOR] [--subject SUBJECT] [--keywords KEYWORDS] [-r] [--remove-background] [-d] [-c] [-i] [--oversample DPI]
                [--remove-vectors] [--mask-barcodes] [--threshold] [-f] [-s] [--redo-ocr] [--skip-big MPixels] [-O {0,1,2,3}] [--jpeg-quality Q] [--png-quality Q]
                [--jbig2-lossy] [--max-image-mpixels MPixels] [--tesseract-config CFG] [--tesseract-pagesegmode PSM] [--tesseract-oem MODE]
                [--pdf-renderer {auto,hocr,sandwich}] [--tesseract-timeout SECONDS] [--rotate-pages-threshold CONFIDENCE]
                [--pdfa-image-compression {auto,jpeg,lossless}] [--user-words FILE] [--user-patterns FILE] [-k] [--flowchart FLOWCHART]
                input_pdf_or_image output_pdf
ocrmypdf: error: argument --pdf-renderer: invalid choice: 'tesseract' (choose from 'auto', 'hocr', 'sandwich')

I'm pretty sure I have Tesseract installed. And if I run with hocr or sandwich as the options instead, the crashes are identical.

As another workaround, I tried opening in gscan2pdf, deleting pages which looked extremely weird (like a single large character), exporting to DJVU (due to a gscan2pdf filesize issue if you export to PDF directly), converting to PDF with ddjvu and running ocrmypdf on that PDF version, which gives an entirely different set of errors:

...
   INFO -  492: background removal skipped on mono page
  ERROR - Traceback (most recent call last):
  File "/home/gwern/.local/lib/python3.6/site-packages/ruffus/task.py", line 712, in run_pooled_job_without_exceptions
    register_cleanup, touch_files_only)
  File "/home/gwern/.local/lib/python3.6/site-packages/ruffus/task.py", line 544, in job_wrapper_io_files
    ret_val = user_defined_work_func(*params)
  File "/home/gwern/bin/miniconda2/envs/fastai/lib/python3.6/site-packages/ocrmypdf/_pipeline.py", line 540, in preprocess_remove_background
    leptonica.remove_background(input_file, output_file)
  File "/home/gwern/bin/miniconda2/envs/fastai/lib/python3.6/site-packages/ocrmypdf/leptonica.py", line 861, in remove_background
    pix = pix.background_norm(tile_size=tile_size).gamma_trc(
  File "/home/gwern/bin/miniconda2/envs/fastai/lib/python3.6/site-packages/ocrmypdf/leptonica.py", line 563, in background_norm
    smooth_kernel[1],
  File "/home/gwern/bin/miniconda2/envs/fastai/lib/python3.6/site-packages/ocrmypdf/leptonica.py", line 148, in __init__
    raise ValueError('Tried to wrap a NULL ' + self.LEPTONICA_TYPENAME)
ValueError: Tried to wrap a NULL PIX

  ERROR - Traceback (most recent call last):
  File "/home/gwern/.local/lib/python3.6/site-packages/ruffus/task.py", line 712, in run_pooled_job_without_exceptions
    register_cleanup, touch_files_only)
  File "/home/gwern/.local/lib/python3.6/site-packages/ruffus/task.py", line 544, in job_wrapper_io_files
    ret_val = user_defined_work_func(*params)
  File "/home/gwern/bin/miniconda2/envs/fastai/lib/python3.6/site-packages/ocrmypdf/_pipeline.py", line 540, in preprocess_remove_background
    leptonica.remove_background(input_file, output_file)
  File "/home/gwern/bin/miniconda2/envs/fastai/lib/python3.6/site-packages/ocrmypdf/leptonica.py", line 861, in remove_background
    pix = pix.background_norm(tile_size=tile_size).gamma_trc(
  File "/home/gwern/bin/miniconda2/envs/fastai/lib/python3.6/site-packages/ocrmypdf/leptonica.py", line 563, in background_norm
    smooth_kernel[1],
  File "/home/gwern/bin/miniconda2/envs/fastai/lib/python3.6/site-packages/ocrmypdf/leptonica.py", line 148, in __init__
    raise ValueError('Tried to wrap a NULL ' + self.LEPTONICA_TYPENAME)
ValueError: Tried to wrap a NULL PIX
...

A puzzling PDF. I'm not sure how to deal with these errors, so for now I'm settling for the first PDF/half being properly processed, leaving the second half alone, and just concatenating them into a single PDF.
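
The last frame of the first traceback above divides the image's pixel width by `image_drawn_width`, which is evidently zero for at least one image in this file. A minimal sketch of a guard follows. It assumes the drawn size is taken from the first four entries of the image's transformation matrix; the name `safe_dpi` and the choice to return `None` for an undefined resolution are hypothetical, not OCRmyPDF's actual fix.

```python
from math import hypot


def safe_dpi(ctm_shorthand, image_size):
    """Return (xres, yres) in DPI, with None for any axis drawn at zero size.

    ctm_shorthand: (a, b, c, d, e, f) transformation matrix entries (assumed layout).
    image_size: (width, height) of the image in pixels.
    """
    a, b, c, d, _, _ = ctm_shorthand
    drawn_width = hypot(a, b)   # drawn width in points (1/72 inch)
    drawn_height = hypot(c, d)  # drawn height in points

    def dpi(pixels, drawn_points):
        # A zero drawn size has no meaningful resolution, so report None
        # instead of raising ZeroDivisionError.
        if drawn_points == 0:
            return None
        return pixels * 72.0 / drawn_points

    return dpi(image_size[0], drawn_width), dpi(image_size[1], drawn_height)
```

A caller that takes `max()` over image resolutions, as line 606 of the traceback does, would then need to skip the `None` entries before comparing.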

jbarlow83 pushed a commit that referenced this issue Apr 15, 2018

hocr: avoid division by zero

2482296

Issue #253 - PDF that produces the error is not available, but if font_width is zero, chances are the text is nonprinting characters, so suppress it.
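
The guard described in this commit message can be sketched roughly as below. This is an illustration under assumptions, not the code from the commit: the function names and the `(text, box_width, font_width)` tuples are hypothetical. The idea is that text whose rendered font width is zero cannot be stretched to fit its bounding box, so it is skipped rather than used as a divisor.

```python
def horizontal_scale(box_width, font_width):
    """Percent scale that stretches text to fill its box, or None if undefined."""
    if font_width <= 0:
        # Zero width most likely means nonprinting characters: suppress the text.
        return None
    return 100.0 * box_width / font_width


def scales_for_words(words):
    """Map (text, box_width, font_width) tuples to (text, scale), dropping unscalable text."""
    result = []
    for text, box_width, font_width in words:
        scale = horizontal_scale(box_width, font_width)
        if scale is None:
            continue  # suppressed: nothing is drawn for this word
        result.append((text, scale))
    return result
```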

jbarlow83 closed this as completed Apr 15, 2018
