MRC #9

Closed
OCRmyPDF-issuebot opened this issue Sep 14, 2015 · 9 comments

@OCRmyPDF-issuebot

Issue by b21e
Fri Sep 19 16:14:39 2014
Originally opened as fritz-hh/OCRmyPDF#88


Hi, integrating jbig2enc for better compression of the text image layer, especially for scans, would make this software perfect.

@OCRmyPDF-issuebot
Author

Comment by jbarlow83
Sat Sep 20 06:33:15 2014


pdfbeads (a Ruby project) attempts to do that, although it has issues aligning the hidden OCR text layer with the image, has some crash bugs, and is documented mainly in Russian.

I've looked into making the changes for OCRmyPDF. It would be a major overhaul/rewrite and would call for a new PDF generation backend.

@OCRmyPDF-issuebot
Author

Comment by b21e
Sat Sep 20 07:57:02 2014


jbig2enc itself is quite stable, and it now also recognises the resolution of the images quite well. There is also support for basic foreground/background separation. There is a one-page Python script for generating multilayer PDFs, written for an earlier version of jbig2enc. When this script was written, recognition of the resolution of the PDF did not yet work reliably. In short, JBIG2 is a must for scans, but on Linux this is still not available.

@OCRmyPDF-issuebot
Author

Comment by b21e
Sat Sep 20 08:38:25 2014


If one is willing to use more than one graphics library (Leptonica, written in C, for jbig2enc, and Gamera, written in Python, for the text foreground/background separation as done by didjvu), all the ingredients are already there and well tested.

@jbarlow83
Collaborator

Blocked as discussed in #48

@blaueente

> Blocked as discussed in #48

As #48 is now solved, does it make sense to reopen?
Is anyone interested in implementing or helping to implement this? @jbarlow83: Do you think this is feasible, and would you accept patches for such functionality?

@jbarlow83
Collaborator

Jbig2enc is currently supported along with optimization (although it can be inconvenient to install since many distributions don't distribute it).

Although we don't do color segmentation like jbig2's pdf.py. That's not a good solution for an application that accepts PDFs rather than images as input.

By color segmentation, I mean examining every image to see if it can be separated into one dominant foreground color (usually black) and a grayscale or color image, and if that is a more efficient compression option than retaining the original.
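As a rough illustration of that check, the following sketch splits a grayscale image into a 1-bit foreground mask and a background with the ink filled in, then compares compressed sizes. It is not OCRmyPDF code: the function names are hypothetical, the fixed threshold of 128 is an assumption, color is ignored, and `zlib` is only a stand-in for the real codecs (JBIG2 for the mask, JPEG or similar for the background).

```python
import zlib


def split_by_threshold(pixels, threshold=128):
    """Return (mask, background) for rows of 0-255 grayscale ints.

    mask[y][x] is 1 where the pixel counts as foreground (dark ink).
    In the background, ink pixels are replaced by the mean paper value
    so that the text no longer costs bits in that layer.
    """
    mask = [[1 if p < threshold else 0 for p in row] for row in pixels]
    paper = [p for row, mrow in zip(pixels, mask)
             for p, m in zip(row, mrow) if not m]
    fill = sum(paper) // len(paper) if paper else 255
    background = [[fill if m else p for p, m in zip(row, mrow)]
                  for row, mrow in zip(pixels, mask)]
    return mask, background


def pack_bits(mask):
    """Pack a 0/1 mask into bytes, 8 pixels per byte, rows zero-padded."""
    out = bytearray()
    for row in mask:
        for i in range(0, len(row), 8):
            chunk = row[i:i + 8]
            byte = 0
            for bit in chunk:
                byte = (byte << 1) | bit
            out.append(byte << (8 - len(chunk)))
    return bytes(out)


def compare_sizes(pixels, threshold=128):
    """Return (original_size, split_size) in bytes after zlib compression."""
    mask, background = split_by_threshold(pixels, threshold)
    flat = bytes(p for row in pixels for p in row)
    flat_bg = bytes(p for row in background for p in row)
    original = len(zlib.compress(flat, 9))
    split = len(zlib.compress(pack_bits(mask), 9)) + len(zlib.compress(flat_bg, 9))
    return original, split
```

Segmentation would only be kept for an image when the second number is smaller than the first; with real codecs the outcome can differ from what `zlib` suggests.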

@blaueente

blaueente commented Jun 16, 2021

As far as I can see, jbig2's pdf.py does not do color segmentation itself. The only open source implementation I know of that does this is in fact didjvu using gamera, as mentioned in #9 (comment)
So I guess the quickest or most feasible way to do this would be to adapt code from didjvu and include gamera. I am not sure if this is a good option, as gamera would be another dependency.
An alternative could be to make a more or less independent "helper" program that just takes an image, does the separation, and returns 3 (unoptimized) images, leaving the PDF stuff to ocrmypdf?

> Although we don't do color segmentation like jbig2's pdf.py. That's not a good solution for an application that accepts PDFs rather than images as input.

I would indeed see it as a good feature that a large, badly compressed or even uncompressed PDF (e.g. directly created by img2pdf, or by commercial scanners) is taken in, and converted to a highly optimized MRC PDF.
The advantage compared to a completely standalone MRC pdf creator would be that all the nice processing options of ocrmypdf including deskew and OCR, could be done before converting to a lower quality MRC.
Or is there a better way to do this without bloating ocrmypdf or making the process too cumbersome for the user?

@jbarlow83
Collaborator

jbig2 uses leptonica for color segmentation - some sort of API call that returns a foreground "black" image and a background color or grayscale image. Ocrmypdf has soft ABI-level bindings to Leptonica, which I currently want to replace... that might mean spinning off a new Leptonica for Python package with API bindings (although, that means I'd have to maintain another package that builds a binary wheel on every platform-architecture combination and depends on C libraries). There's some nasty business in leptonica.py that involves redirecting stderr on the fly if you want to see what I want to eliminate.... In short it's very tempting to move to scikit-image or opencv since they are well maintained libraries with good packaging, even though not necessarily focused on document imaging.

didjvu uses GPL2 so we cannot use it. Gamera is not available packaged as Python wheels and needs to be manually built so it is not suitable. Thank you for the suggestions, though, as I had not heard of either....

(We actually do color segmentation in a special case: if pngquant is installed, high optimization is used, and pngquant is able to reduce the image to monochrome. In that case, we notice a suboptimal 1-bit PNG image and convert it to JBIG2. But the stars have to align perfectly. A case where this does not help is, say, black text on a yellowed background.)
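To make the "stars have to align" condition concrete, here is a minimal sketch of the kind of test involved: an image qualifies for 1-bit encoding only if it uses at most two colors. The helper names are hypothetical and this is not how OCRmyPDF or pngquant implement it; it only mimics the outcome described above.

```python
def luminance(rgb):
    """Approximate perceived brightness of an (r, g, b) tuple."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def as_one_bit(pixels):
    """Return a 0/1 mask if the image uses at most two colors, else None.

    pixels is a list of rows of (r, g, b) tuples. The darker of the two
    colors becomes 1 (ink); an image with fewer than two colors is
    treated as a blank page.
    """
    colors = sorted({px for row in pixels for px in row}, key=luminance)
    if len(colors) > 2:
        return None  # more than two colors: not a 1-bit candidate
    if len(colors) < 2:
        return [[0] * len(row) for row in pixels]
    ink = colors[0]
    return [[1 if px == ink else 0 for px in row] for row in pixels]
```

Clean black-on-white input yields a mask that could be handed to a JBIG2 encoder, while black text on a yellowed scan has many slightly different paper shades, so the function returns `None` and no conversion happens.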

@blaueente

blaueente commented Jun 17, 2021

The segmentation didjvu does seems to be a bit more complicated:

  1. Some kind of local thresholding with multiple different algorithms, as can be seen at http://manpages.ubuntu.com/manpages/bionic/man1/didjvu.1.html (argument "-m"). These are all implemented purely in gamera; didjvu only does organizational stuff.
    This results in a binary mask.
  2. Then didjvu itself applies some kind of morphological operations to that mask and uses the result to cut a background and a foreground image out of the one image, both reduced in resolution.
  3. Then didjvu compresses the mask with JBIG2, and the fg/bg by calling the IW44 wavelet codec. Note that the latter seems to support a mask, and is optimized such that the masked-out regions decode to whatever makes the file size minimal.
    For PDF, one could use JPEG 2000 for that, but I have not found an obvious option to use the mask in JPEG 2000 encoders. It might be "ROI" encoding, but I didn't find a free software encoder that supports this. This would mean that the mask data is somehow encoded in the JPEG 2000 file, unnecessarily increasing the file size somewhat :( But this might turn out to be only a minimal problem.
  4. The final step is assembly, which in the case of PDF should be more or less trivial with the 3 images fg/mask/bg:
    use an SMask / ImageMask for the foreground and overlay that onto the background. I did some experiments; this resulted in correct but slow rendering, unlike djvu, which is very fast.
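The layer model behind steps 1, 2 and 4 can be sketched in a few lines. This is not didjvu's or gamera's code: a single global threshold stands in for the local thresholding algorithms, the morphological cleanup is skipped, and the function names and the `factor` parameter are hypothetical. It works on grayscale rows of 0-255 ints to stay self-contained.

```python
def downsample_layer(pixels, mask, want, factor, default):
    """Per factor x factor block, average only pixels whose mask bit == want."""
    h, w = len(pixels), len(pixels[0])
    out = []
    for by in range(0, h, factor):
        row = []
        for bx in range(0, w, factor):
            vals = [pixels[y][x]
                    for y in range(by, min(by + factor, h))
                    for x in range(bx, min(bx + factor, w))
                    if mask[y][x] == want]
            # Blocks with no pixel of this layer get an arbitrary default,
            # since the mask hides them at display time anyway.
            row.append(sum(vals) // len(vals) if vals else default)
        out.append(row)
    return out


def decompose(pixels, threshold=128, factor=2):
    """Return (mask, fg, bg): full-resolution mask, reduced-resolution layers."""
    mask = [[1 if p < threshold else 0 for p in row] for row in pixels]
    fg = downsample_layer(pixels, mask, 1, factor, default=0)
    bg = downsample_layer(pixels, mask, 0, factor, default=255)
    return mask, fg, bg


def recompose(mask, fg, bg, factor=2):
    """Show fg where the mask is set and bg elsewhere, upsampling both layers."""
    return [[(fg if m else bg)[y // factor][x // factor]
             for x, m in enumerate(row)]
            for y, row in enumerate(mask)]
```

`recompose` is what a viewer effectively does when a PDF paints the background image and then the foreground through the mask; sharp edges come from the full-resolution mask, which is why the two color layers can be stored at reduced resolution.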

I could imagine having ocrmypdf call an external binary for steps 1-2 or steps 1-3 to avoid GPL problems, as is done with unpaper. didjvu conveniently supports the "separate" option, which would output a mask, covering steps 1-2.

I have also analyzed an MRC PDF coming out of a cheap commercial multifunction printer (see below). For some reason it seems to just keep the bg as a JPEG image and then layer a fixed number of 31 single-color b/w images on top. Each b/w image encodes one color and has its own offset.
It also has ugly problems: some letters or even whole text regions are encoded only in the bg image, and text on a colored background often ends up in the background as a whole.
The didjvu results I have seen are much better, allowing for colored text and line graphics. The exception is non-binary images, where the separation messed things up a bit, although the result never became really unreadable.

page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    1208  1716  rgb     3   8  jpeg   no        11  0   151   150 11.1K 0.2%
   1     1 stencil   496    30  -       1   1  ccitt  no        12  0   300   300  317B  17%
   1     2 stencil    48    28  -       1   1  ccitt  no        13  0   300   300   16B 9.5%
   1     3 stencil   928   844  -       1   1  ccitt  no        24  0   300   300  192B 0.2%
   1     4 stencil     8     6  -       1   1  ccitt  no        35  0   300   300    7B 117%
   1     5 stencil  1344  1992  -       1   1  ccitt  no        37  0   300   300 2179B 0.7%
   1     6 stencil  1408  1272  -       1   1  ccitt  no        38  0   301   300  721B 0.3%
   1     7 stencil  1536   810  -       1   1  ccitt  no        39  0   300   300  128B 0.1%
   1     8 stencil    24     8  -       1   1  ccitt  no        40  0   300   300   11B  46%
   1     9 stencil  1216   918  -       1   1  ccitt  no        41  0   300   300 1942B 1.4%
   1    10 stencil  1784  1276  -       1   1  ccitt  no        42  0   300   300 4385B 1.5%
   1    11 stencil     8     6  -       1   1  ccitt  no        14  0   300   300    7B 117%
   1    12 stencil   440   570  -       1   1  ccitt  no        15  0   300   300   85B 0.3%
   1    13 stencil     8     6  -       1   1  ccitt  no        16  0   300   300    8B 133%
   1    14 stencil  1024   494  -       1   1  ccitt  no        17  0   300   300   80B 0.1%
   1    15 stencil     8     6  -       1   1  ccitt  no        18  0   300   300    8B 133%
   1    16 stencil  1776   556  -       1   1  ccitt  no        19  0   300   300 2215B 1.8%
   1    17 stencil     8     6  -       1   1  ccitt  no        20  0   300   300    7B 117%
   1    18 stencil     8     6  -       1   1  ccitt  no        21  0   300   300    8B 133%
   1    19 stencil     8     8  -       1   1  ccitt  no        22  0   300   300    9B 112%
   1    20 stencil  1320   198  -       1   1  ccitt  no        23  0   300   300 2356B 7.2%
   1    21 stencil  1344    12  -       1   1  ccitt  no        25  0   300   300   14B 0.7%
   1    22 stencil   176     6  -       1   1  ccitt  no        26  0   300   300   34B  26%
   1    23 stencil     8     6  -       1   1  ccitt  no        27  0   300   300    8B 133%
   1    24 stencil     8     8  -       1   1  ccitt  no        28  0   300   300    9B 112%
   1    25 stencil    24    30  -       1   1  ccitt  no        29  0   300   300   33B  37%
   1    26 stencil  1752    48  -       1   1  ccitt  no        30  0   300   300 1852B  18%
   1    27 stencil     8     6  -       1   1  ccitt  no        31  0   300   300    8B 133%
   1    28 stencil     8     6  -       1   1  ccitt  no        32  0   300   300    8B 133%
   1    29 stencil    24    38  -       1   1  ccitt  no        33  0   300   300   37B  32%
   1    30 stencil   128    38  -       1   1  ccitt  no        34  0   300   300   32B 5.3%
   1    31 stencil     8     8  -       1   1  ccitt  no        36  0   300   300    9B 112%
