Excessive file size growth with --force-ocr #237
Comments
Partially fixed for v6 (duplicate object removal). Color coercion remains an issue.
File size is also an issue with scanned PDFs. These tend to expand to 1 to 2 MB per page when running ocrmypdf --force-ocr --output-type pdfa-1, so full-length books take a lot of disk space. (I tried running k2pdfopt -mode copy -dev dx afterwards, but that scrambled the OCR'd text. I also tried running a Ghostscript conversion tool, but it blurred the image.) I wonder if a built-in option to rasterize the image to certain dimensions, in pixels, after OCR would be practical.
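For anyone wondering what "rasterize to certain dimensions, in pixels" would mean in terms of resolution, here is a minimal sketch of the arithmetic. It only assumes that PDF page sizes are given in points (72 per inch); the function names are made up for this example and are not part of ocrmypdf.

```python
POINTS_PER_INCH = 72.0


def dpi_for_pixel_width(page_width_pt: float, target_width_px: int) -> float:
    """Resolution needed so a page of the given width rasterizes to target_width_px."""
    return target_width_px / (page_width_pt / POINTS_PER_INCH)


def raster_size(page_width_pt: float, page_height_pt: float, dpi: float) -> tuple:
    """Pixel dimensions of a page rasterized at the given resolution."""
    return (
        round(page_width_pt / POINTS_PER_INCH * dpi),
        round(page_height_pt / POINTS_PER_INCH * dpi),
    )


# US Letter is 612 x 792 points (8.5 x 11 inches).
dpi = dpi_for_pixel_width(612, 1275)
width_px, height_px = raster_size(612, 792, dpi)
```

A fixed pixel width therefore maps to a different DPI for every page size, which is one reason a pixel-based option would behave differently from a DPI-based one on mixed-size documents.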
@MarjaE2 Possibly. Please attach a sample page if you can.
Here, or an excerpt using cpdf: https://archive.org/details/voliaukrainy3219unse It looks like an improvement for the excerpt, but I haven't tested it for the longer file.
Same problem here, I think:
$ ocrmypdf --version
--force-ocr on a born digital PDF with a small, losslessly compressed color logo on each page: the presence of the logo causes the whole page to be rasterized and saved as lossless color at high DPI. File size increased 26x.
Color segmentation for the whole page image would cover this case universally and would optimize other cases. Another option may be to inspect vector and raster content separately for color requirements, and rasterize them separately.
This could also mean a born digital page with a monochrome image and color text might get forced to monochrome, if we're not testing for color in vector content.
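As a rough illustration of the per-region color test discussed above (not OCRmyPDF's actual code), the sketch below classifies RGB pixels as monochrome, grayscale, or color, then applies that test per tile so a small color logo only forces color for the tiles it touches. The function names, the tolerance values, and the tile size are all assumptions made for this example. The image is a plain list of rows of (r, g, b) tuples, so nothing beyond the standard library is needed.

```python
from typing import Dict, Iterable, List, Tuple

Pixel = Tuple[int, int, int]


def classify_color_mode(
    pixels: Iterable[Pixel],
    chroma_tolerance: int = 8,    # assumed: max channel spread still treated as gray
    bilevel_tolerance: int = 32,  # assumed: distance from pure black/white still treated as bilevel
) -> str:
    """Return 'monochrome', 'grayscale', or 'color' for a stream of RGB pixels."""
    needs_gray = False
    for r, g, b in pixels:
        # Any pixel whose channels differ noticeably requires a color image.
        if max(r, g, b) - min(r, g, b) > chroma_tolerance:
            return "color"
        # A mid-tone gray pixel rules out a 1-bit representation.
        luma = (r + g + b) // 3
        if bilevel_tolerance < luma < 255 - bilevel_tolerance:
            needs_gray = True
    return "grayscale" if needs_gray else "monochrome"


def classify_tiles(image: List[List[Pixel]], tile: int = 64) -> Dict[Tuple[int, int], str]:
    """Classify each tile of the image separately, keyed by the tile's (left, top) corner."""
    height = len(image)
    width = len(image[0]) if height else 0
    modes = {}
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            block = (
                image[y][x]
                for y in range(top, min(top + tile, height))
                for x in range(left, min(left + tile, width))
            )
            modes[(left, top)] = classify_color_mode(block)
    return modes


# A 4x4 white page with one red "logo" pixel in the bottom-right corner.
WHITE, RED = (255, 255, 255), (200, 30, 30)
page = [[WHITE] * 4 for _ in range(4)]
page[3][3] = RED
tile_modes = classify_tiles(page, tile=2)
```

With this kind of test, only the tile containing the logo would need color, and the rest of the page could be stored as monochrome. Note that it only looks at rasterized pixels; checking vector content for color, as mentioned above, would need a separate pass.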
--skip-ocr also causes considerable inflation due to the presence of duplicate objects.