Compression

PDF files full of images cannot be compressed as efficiently as DjVu files, so they sometimes reach hundreds of megabytes in size. Fortunately, book pages are often bitonal, which allows for efficient compression schemes such as group4 or jbig2. Unfortunately, in badly digitized books the scanned images may be saved as color JPEG files; this can be partially mitigated using --mode bitonal (possibly for only a range of pages).

We perform compression in two stages:

The first stage is the default compression provided by Pillow. For bitonal images, Pillow's PDF generation code uses group4 compression if libtiff is available.
The second stage is optional. If OCRmyPDF is installed (possibly via the ocr or compress extras), its PDF optimization can be enabled with the flags -O1 to -O3 (this involves no OCR). This gives access to more advanced techniques, including JBIG2 compression via jbig2enc.
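The first stage can be illustrated with a minimal sketch that uses Pillow directly. It is not this project's actual code: the helper names `libtiff_available` and `bitonal_pdf_bytes` and the fixed threshold of 128 are assumptions made for illustration. It only shows how an image can be reduced to one bit per pixel and handed to Pillow's PDF writer, which then chooses the compression itself.

```python
from io import BytesIO

from PIL import Image, features


def libtiff_available() -> bool:
    """Report whether this Pillow build includes libtiff support."""
    return bool(features.check("libtiff"))


def bitonal_pdf_bytes(image: Image.Image, threshold: int = 128) -> bytes:
    """Threshold an image to 1 bit per pixel and save it as a one-page PDF.

    The threshold value is an illustrative assumption; real scans may
    need a different cutoff or an adaptive method.
    """
    gray = image.convert("L")
    # Map every gray level to pure black or pure white, producing mode "1".
    bitonal = gray.point(lambda v: 255 if v >= threshold else 0, mode="1")
    buffer = BytesIO()
    # Pillow selects the compression filter for the embedded image itself.
    bitonal.save(buffer, format="PDF")
    return buffer.getvalue()
```

Calling `bitonal_pdf_bytes(Image.open("page.jpg"))` with a hypothetical scan returns the PDF as bytes, and `libtiff_available()` indicates whether the group4 path mentioned above can be expected.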

If you run OCRmyPDF manually, note that the optimization command suggested in its documentation (setting --tesseract-timeout to 0) may ruin existing text layers. To perform only PDF optimization, you can use the following undocumented tool instead:

python -m ocrmypdf.optimize <input_file> <level> <output_file>
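For scripted use, the command above can be wrapped in a small Python helper. This is a sketch only: the function names `build_optimize_command` and `run_optimize` are hypothetical, and restricting the level to 1 through 3 is an assumption based on the -O1 to -O3 flags mentioned earlier. Because the tool is undocumented, its argument order may change between OCRmyPDF releases.

```python
import subprocess
import sys
from pathlib import Path


def build_optimize_command(input_file, level, output_file):
    """Build the argv list for `python -m ocrmypdf.optimize`.

    The argument order (input, level, output) follows the command shown
    above. The accepted range 1..3 is an assumption of this sketch.
    """
    if level not in (1, 2, 3):
        raise ValueError(f"optimization level must be 1, 2 or 3, got {level!r}")
    return [
        sys.executable,  # reuse the interpreter that has OCRmyPDF installed
        "-m",
        "ocrmypdf.optimize",
        str(Path(input_file)),
        str(level),
        str(Path(output_file)),
    ]


def run_optimize(input_file, level, output_file):
    """Run the optimizer and raise CalledProcessError if it fails."""
    subprocess.run(build_optimize_command(input_file, level, output_file), check=True)
```

For example, `run_optimize("book.pdf", 2, "book-small.pdf")` would optimize a hypothetical book.pdf at level 2, provided OCRmyPDF is installed in the same environment.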
