Compression
PDF files full of images cannot be compressed as efficiently as DjVu, which sometimes leads to files hundreds of megabytes in size. Fortunately, books are often bitonal, which allows for efficient compression schemes such as group4 or jbig2. Unfortunately, in badly digitized books the scanned images may be saved as color JPEG files. This can be partially mitigated using --mode bitonal (possibly for only a range of pages).
We perform compression in two stages:
- The first one is the default compression provided by Pillow. For bitonal images, the PDF generation code says that, if libtiff is available, group4 compression is used.
- If OCRmyPDF is installed (possibly via the ocr or compress extras), its PDF optimization can be used via the flags -O1 to -O3 (this involves no OCR). This allows us to use advanced techniques, including JBIG2 compression via jbig2enc.
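To see which of the two stages can actually take effect on a given machine, a small check along the following lines may help. It is only a sketch: the function name compression_capabilities is hypothetical, and it assumes that jbig2enc is installed as an executable named jbig2 on the PATH, which is its usual name but may differ on some systems.

```python
import importlib.util
import shutil


def compression_capabilities():
    """Report which optional compression back ends appear to be available."""
    caps = {"libtiff": False, "ocrmypdf": False, "jbig2enc": False}

    # Pillow reports codec support through PIL.features; group4 needs libtiff.
    if importlib.util.find_spec("PIL") is not None:
        from PIL import features
        caps["libtiff"] = bool(features.check_codec("libtiff"))

    # Stage two needs the ocrmypdf package to be importable.
    caps["ocrmypdf"] = importlib.util.find_spec("ocrmypdf") is not None

    # Assumption: jbig2enc provides an executable called "jbig2".
    caps["jbig2enc"] = shutil.which("jbig2") is not None
    return caps


if __name__ == "__main__":
    for name, available in compression_capabilities().items():
        print(f"{name}: {'available' if available else 'missing'}")
```

If libtiff is reported as missing, bitonal pages are presumably stored without group4 compression; if jbig2enc is missing, OCRmyPDF's optimizer cannot apply JBIG2.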
If you run OCRmyPDF manually, note that the optimization command suggested in its documentation (setting --tesseract-timeout to 0) may ruin existing text layers. To perform only PDF optimization, you can use the following undocumented tool instead:
python -m ocrmypdf.optimize <input_file> <level> <output_file>
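When the optimizer has to be run over many files, the same command can be driven from Python. The sketch below assembles exactly the invocation shown above. The helper names build_optimize_command and optimize_pdf are hypothetical, and restricting the level to 1 through 3 is an assumption based on the -O1 to -O3 flags mentioned earlier. Since the tool is undocumented, its interface may change between OCRmyPDF releases.

```python
import subprocess
import sys


def build_optimize_command(input_file, level, output_file):
    """Build the argument list for `python -m ocrmypdf.optimize`."""
    # Assumption: levels mirror the -O1 to -O3 flags.
    if level not in (1, 2, 3):
        raise ValueError("level must be 1, 2 or 3")
    return [
        sys.executable, "-m", "ocrmypdf.optimize",
        str(input_file), str(level), str(output_file),
    ]


def optimize_pdf(input_file, level, output_file):
    """Run the optimizer and raise CalledProcessError if it fails."""
    subprocess.run(
        build_optimize_command(input_file, level, output_file),
        check=True,
    )
```

A call such as optimize_pdf("book.pdf", 2, "book-optimized.pdf") writes to a separate output file, so the input is left untouched if the optimizer fails.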