OCR problem: "cannot write mode P as JPEG" exception #151

Closed
AEgit opened this issue Apr 18, 2017 · 4 comments

@AEgit

AEgit commented Apr 18, 2017

Another PDF file, which proves a bit difficult to OCR:
https://app.box.com/s/ffraogy4ayco5gc87t8kj406ww3o731v

Using

ocrmypdf -l por --force myfile.pdf myfile_ocr.pdf

it was possible to OCR the file. However, many pages are OCRed very poorly (most sentences are missing, and the recognized parts are completely wrong). The following exceptions are thrown:

Original exception:

    Exception #1
      'builtins.OSError(cannot write mode P as JPEG)' raised in ...
       Task = def ocrmypdf.pipeline.select_visible_page_image(...):
       Job  = [[.../000004.page.png, .../000004.pp-background.png, .../000004.pp-clean.png, .../000004.pp-deskew.png] -> .../000004.image, <LoggingProxy>, <ocrmypdf.pipeline.JobContext>]

    Traceback (most recent call last):
      File "/usr/local/lib/python3.4/dist-packages/PIL/JpegImagePlugin.py", line 599, in _save
        rawmode = RAWMODE[im.mode]
    KeyError: 'P'

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 751, in run_pooled_job_without_exceptions
        register_cleanup, touch_files_only)
      File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 567, in job_wrapper_io_files
        ret_val = user_defined_work_func(*params)
      File "/usr/local/lib/python3.4/dist-packages/ocrmypdf/pipeline.py", line 534, in select_visible_page_image
        im.save(output_file, format='JPEG', dpi=dpi)
      File "/usr/local/lib/python3.4/dist-packages/PIL/Image.py", line 1826, in save
        save_handler(self, fp, filename)
      File "/usr/local/lib/python3.4/dist-packages/PIL/JpegImagePlugin.py", line 601, in _save
        raise IOError("cannot write mode %s as JPEG" % im.mode)
    OSError: cannot write mode P as JPEG

I'm wondering whether the poor OCR quality for that file is due to the image quality of the document itself or to the above exceptions.

I'm using the current ocrmypdf version 4.5.3.

@jbarlow83
Collaborator

I couldn't reproduce it, but I added a likely fix anyway for 4.5.4. The error came up because page 4 is blank (possibly due to file corruption), and the logic for handling a blank page when `--force` is given was incomplete.
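
For anyone who hits the same OSError elsewhere: it can be reproduced with Pillow alone, because Pillow's JPEG writer does not accept palette (mode P) images. Below is a minimal sketch, assuming Pillow is installed. `save_as_jpeg` is a hypothetical helper for illustration; it is not ocrmypdf's code and not necessarily the change that went into 4.5.4.

```python
import io

from PIL import Image


def save_as_jpeg(im, fp, **save_kwargs):
    """Save `im` as JPEG, converting palette images to RGB first.

    Hypothetical helper for illustration only.
    """
    if im.mode == "P":
        # JPEG has no palette mode, so expand the palette to RGB.
        im = im.convert("RGB")
    im.save(fp, format="JPEG", **save_kwargs)


# Stand-in for a blank page: a palette image whose index 0 is white.
blank = Image.new("P", (64, 64))
blank.putpalette([255, 255, 255] + [0, 0, 0] * 255)

# Saving the palette image directly fails the same way as in the traceback.
try:
    blank.save(io.BytesIO(), format="JPEG")
    direct_save_error = None
except OSError as exc:
    direct_save_error = str(exc)

# Converting first lets the save go through.
buf = io.BytesIO()
save_as_jpeg(blank, buf, dpi=(300, 300))
jpeg_bytes = buf.getvalue()

print(direct_save_error)
print(len(jpeg_bytes), "bytes written")
```

The conversion only avoids the crash; it has no effect on what Tesseract recognizes on other pages.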

To improve the OCR, I suggest trying Tesseract 4 (alpha version) and consulting the documentation for the recommended ocrmypdf arguments when using Tesseract 4 (--pdf-renderer tess4). Tesseract 3 performs poorly on fonts it has not been trained with, and it may not have been trained with this one. It may also be that this is a technical paper that uses words outside Tesseract's Portuguese dictionary, so it "corrects" ambiguous words to the wrong ones. The documentation also has instructions for disabling the Tesseract dictionary.

@AEgit
Author

AEgit commented Apr 19, 2017

Thanks for the quick reply. I can confirm that the error messages no longer appear with version 4.5.4.

As you said, the fix didn't change the OCR results, so I will have to play around a bit with Tesseract 4 and see whether it gives better results.

Thanks again for your help!

@jbarlow83
Collaborator

You noticed a change in file size. Because I regularly run ocrmypdf on batches of >10k files, I watch any such reports closely.

With --force-ocr, ocrmypdf must rasterize every page and save the rasterized output to a new file. It so happens that the input file is optimized in a way that is lost when the whole page is rasterized, so the output images are 55% larger by pixel count to preserve the original resolution. The average compression ratio is better in the output file, but not enough to compensate for so many more pixels.

This file looks like it would work without --force-ocr.
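
To make the size argument concrete, here is a rough back-of-the-envelope sketch. It assumes the file size is dominated by image data and scales as pixel count divided by compression ratio. Only the 1.55 pixel growth comes from the numbers above; the 20% compression gain is a made-up value for illustration, and `relative_file_size` is a hypothetical helper.

```python
def relative_file_size(pixel_growth, compression_gain):
    """Estimate output size relative to input size.

    pixel_growth:     output pixel count / input pixel count
                      (1.55 for "55% larger by pixel count").
    compression_gain: output compression ratio / input compression ratio
                      (above 1.0 means the output compresses better).
    """
    if pixel_growth <= 0 or compression_gain <= 0:
        raise ValueError("both factors must be positive")
    return pixel_growth / compression_gain


# Hypothetical: the output compresses 20% better than the input.
ratio = relative_file_size(1.55, 1.20)
print(f"output is about {ratio:.2f}x the input size")
```

Under these assumptions the output only breaks even once the compression ratio improves by the same factor of 1.55 as the pixel count.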

@AEgit
Author

AEgit commented Apr 19, 2017

Yes, sorry about that - I initially reported the increase in file size, but then realised that the old file had been OCRed without the --force option. Indeed, when rerunning the OCR process without --force I got a similar file size. That's why I decided to edit my comment.
