MRC #9
Comments
Comment by jbarlow83 pdfbeads (a Ruby project) attempts to do that, although it has problems aligning the hidden OCR text layer with the image, has some crash bugs, and is documented mainly in Russian. I've looked into making the changes for OCRmyPDF. It would be a major overhaul or rewrite and would call for a new PDF generation backend.
Comment by b21e jbig2enc itself is quite stable, and it now detects the resolution of the images quite well. It also supports basic foreground/background separation. There is a one-page Python script for generating multilayer PDFs, written for an earlier version of jbig2enc; at that time, detection of the PDF's resolution did not yet work reliably. In short, JBIG2 is a must for scans, but it is still not available on Linux.
Blocked as discussed in #48
As #48 is now solved, does it make sense to reopen?
jbig2enc is currently supported along with optimization (although it can be inconvenient to install, since many distributions don't package it). However, we don't do color segmentation the way jbig2's pdf.py does; that approach is not a good solution for an application that accepts PDFs rather than images as input. By color segmentation, I mean examining every image to see whether it can be separated into one dominant foreground color (usually black) and a grayscale or color image, and whether that is a more efficient compression option than retaining the original.
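To make the color segmentation idea above concrete, here is a minimal sketch in pure Python. It is not what OCRmyPDF, jbig2enc, or Leptonica actually do. The function names (`separate`, `pack_bits`, `worth_splitting`), the threshold of 128, the tolerance, and the use of zlib as a stand-in for real JBIG2/JPEG encoders are all assumptions made for illustration.

```python
import zlib
from collections import Counter


def separate(pixels, tolerance=32):
    """Split rows of 0-255 grayscale values into (fg_value, mask, background).

    Hypothetical helper: the dominant dark value is taken as the foreground
    color, and masked pixels are filled with the mean of the remaining ones.
    """
    dark = [p for row in pixels for p in row if p < 128]
    if not dark:
        return None  # no dark foreground candidate at all
    fg_value = Counter(dark).most_common(1)[0][0]
    mask = [[abs(p - fg_value) <= tolerance for p in row] for row in pixels]
    rest = [p for row, mrow in zip(pixels, mask)
            for p, m in zip(row, mrow) if not m]
    fill = round(sum(rest) / len(rest)) if rest else 255
    background = [[fill if m else p for p, m in zip(row, mrow)]
                  for row, mrow in zip(pixels, mask)]
    return fg_value, mask, background


def pack_bits(mask):
    """Pack a boolean mask into bytes, 8 pixels per byte, each row padded."""
    out = bytearray()
    for row in mask:
        for i in range(0, len(row), 8):
            chunk = row[i:i + 8]
            byte = 0
            for bit in chunk:
                byte = (byte << 1) | int(bit)
            out.append(byte << (8 - len(chunk)))
    return bytes(out)


def worth_splitting(pixels):
    """Compare zlib sizes of the original and of mask + background.

    zlib is only a proxy here; a real pipeline would compare JBIG2 for the
    mask and JPEG (or similar) for the background against the original.
    """
    result = separate(pixels)
    if result is None:
        return False
    _, mask, background = result
    original = zlib.compress(bytes(p for row in pixels for p in row), 9)
    split = (zlib.compress(pack_bits(mask), 9)
             + zlib.compress(bytes(p for row in background for p in row), 9))
    return len(split) < len(original)
```

The point of the last function is the second half of the definition: separating is only worthwhile if the two layers together end up smaller than the original image.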
As far as I can see, jbig2's pdf.py does not do color segmentation itself. The only open-source implementation I know of that does this is in fact didjvu, using Gamera, as mentioned in #9 (comment)
I would indeed see it as a good feature if a large, badly compressed or even uncompressed PDF (e.g. one created directly by img2pdf or by a commercial scanner) could be taken in and converted to a highly optimized MRC PDF.
jbig2 uses Leptonica for color segmentation: some sort of API call that returns a foreground "black" image and a background color or grayscale image. OCRmyPDF has soft ABI-level bindings to Leptonica, which I currently want to replace... that might mean spinning off a new Leptonica for Python package with API bindings (although that means I'd have to maintain another package that builds a binary wheel on every platform-architecture combination and depends on C libraries). Unfortunately, didjvu is licensed under GPL2, so we cannot use it. Gamera is not packaged as Python wheels and needs to be built manually, so it is not suitable. Thank you for the suggestions, though, as I had not heard of either.... (We actually do color segmentation in a special case: if pngquant is installed, high optimization is used, and pngquant is able to reduce the image to monochrome. In that case, we notice a suboptimal 1-bit PNG image and convert it to JBIG2. But the stars have to align perfectly. A case where this does not help is, say, black text on a yellowed background.)
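A rough sketch of the special case described above, for illustration only. This is not OCRmyPDF's actual code; `is_black_and_white`, `choose_encoding`, and the `slack` value are hypothetical. It assumes the quantizer has already produced a palette of RGB tuples.

```python
def is_black_and_white(palette, slack=16):
    """True if the palette has one or two colors, each near pure black or white."""
    def near(color, target):
        return all(abs(channel - target) <= slack for channel in color)

    return 1 <= len(palette) <= 2 and all(
        near(color, 0) or near(color, 255) for color in palette
    )


def choose_encoding(palette):
    """Pick JBIG2 only when the quantized image is effectively 1-bit."""
    return "jbig2" if is_black_and_white(palette) else "png"


# Clean black-on-white scan: eligible for JBIG2.
print(choose_encoding([(0, 0, 0), (255, 255, 255)]))
# Black text on a yellowed page: the background is not white, so it stays PNG.
print(choose_encoding([(0, 0, 0), (230, 215, 160)]))
```

The second call shows why the yellowed-background case falls through: without a separate background layer, there is no 1-bit image to hand to JBIG2.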
The segmentation didjvu does seems to be a bit more complicated:
I could imagine having OCRmyPDF call an external binary for steps 1-2 or steps 1-3 to avoid GPL problems, as is done with unpaper. didjvu conveniently supports the "separate" option, which would output a mask, covering steps 1-2. I have also analyzed an MRC PDF produced by a cheap commercial multifunction printer (see below). For some reason, they seem to just keep the background as a JPEG image and then layer a fixed number of 31 single-color b/w images on top. Each b/w image encodes one color and has its own offset.
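The layer structure observed in that printer's PDF can be modeled roughly as follows. This is only a sketch of the layout as described in the comment above (a background plus single-color 1-bit images placed at offsets). The `MaskLayer` type, its field names, and `composite` are made up for illustration and do not correspond to any PDF library's API.

```python
from dataclasses import dataclass


@dataclass
class MaskLayer:
    color: tuple  # the single RGB color this layer paints
    x: int        # horizontal offset of the mask on the page
    y: int        # vertical offset of the mask on the page
    mask: list    # rows of 0/1; 1 means "paint color here"


def composite(background, layers):
    """Paint each 1-bit layer over a copy of the background, in order."""
    page = [row[:] for row in background]
    for layer in layers:
        for dy, mask_row in enumerate(layer.mask):
            for dx, bit in enumerate(mask_row):
                py, px = layer.y + dy, layer.x + dx
                # Ignore mask pixels that fall outside the page.
                if bit and 0 <= py < len(page) and 0 <= px < len(page[py]):
                    page[py][px] = layer.color
    return page


white = (255, 255, 255)
background = [[white] * 3 for _ in range(2)]
stroke = MaskLayer(color=(0, 0, 0), x=1, y=0, mask=[[1], [1]])
page = composite(background, [stroke])
```

Because each layer carries only one color and a small bounding box, the 1-bit masks compress well while the background can be stored at lower quality.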
Issue by b21e
Fri Sep 19 16:14:39 2014
Originally opened as fritz-hh/OCRmyPDF#88
Hi, especially for scans, integrating jbig2enc for better compression of the text image layer would make this software perfect.