not sharp #160

Golddouble · 2017-02-22T08:30:35Z

Hello,

This is my input file:
Beispiel.PDF

According to http://pdf-analyser.edpsciences.org/ it has a 90dpi resolution.

1)
For the OCR I made the following settings:

I named the output file "Output 90dpi, b-w":
Output 90dpi, b-w.pdf

2)
For the OCR I made the following settings:

I named the output file "Output 90dpi, colour, loseless":
Output 90dpi, colour, loseless.pdf

So my question is:
Why is neither "Output 90dpi, b-w.pdf" nor "Output 90dpi, colour, loseless.pdf" as sharp as the input file "Beispiel.pdf"?

Thank you.

manisandro · 2017-02-22T08:59:30Z

Uhm, I recognized with dpi 90 and saved with dpi 90, and the output (ZIP, grayscale) looks good to me:
Beispiel-90-gray.pdf

Golddouble · 2017-02-22T09:17:02Z

I tried again, also with dpi 90, ZIP and grey (Windows 7).
It is again not sharp:
Output grey, zip.PDF

It's really strange.

Edit: But my grey version is three times as large (in KB) as your version (?)

manisandro · 2017-02-22T09:27:45Z

Can you make a screencast of the steps you perform to produce the PDF?

Golddouble · 2017-02-22T09:42:05Z

screencast deleted

manisandro · 2017-02-22T09:43:39Z

In the advanced image controls (button left of the OCR mode button), select 90 as dpi.

Golddouble · 2017-02-22T09:53:29Z

Ah OK. I had never paid attention to this setting. Thank you, now it works. My setting there was 300 dpi.

Question: what is the difference between the DPI in the export dialog window and the DPI in the image controls? Must the DPI in the image controls always be the same as the DPI of the input file?

manisandro · 2017-02-22T10:00:57Z

The DPI setting in the advanced image controls toolbar is the DPI at which the input image is sampled to produce the image on which recognition is performed.
The DPI setting in the export dialog is the DPI at which the recognition image is written to the output PDF.

The purpose of the first DPI control is to be able to artificially upscale the image to improve recognition results. The default is 300, which is recommended for good OCR results. Depending on the ratio between the original image dpi and the sampling dpi, interpolation may produce a blurry sampled image. In theory if you choose an integer multiple of the original dpi (say 180 or 270) you should get a smoother image.
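
To make the ratio concrete, here is a small Python sketch. It is not gImageReader code; the function name `sampling_scale` is made up for this illustration. It computes the factor by which a 90 dpi input is rescaled for recognition and reports whether that factor is a whole number:

```python
from fractions import Fraction


def sampling_scale(source_dpi: int, sampling_dpi: int) -> Fraction:
    """Factor by which the input image is scaled when sampled for OCR.

    Hypothetical helper for illustration only.
    """
    return Fraction(sampling_dpi, source_dpi)


# 90 dpi is the resolution of the input file discussed in this thread.
for dpi in (90, 180, 270, 300):
    scale = sampling_scale(90, dpi)
    kind = "integer" if scale.denominator == 1 else "fractional"
    print(f"{dpi} dpi -> x{float(scale):.2f} ({kind})")
```

With the default of 300 the factor is 10/3, so every source pixel has to be spread over a non-integer number of sampled pixels, which is where interpolation blur can come from.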

Golddouble · 2017-02-22T12:25:40Z

> In theory if you choose an integer multiple of the original dpi (say 180 or 270) you should get a smoother image.

I have now tried exactly 180, but this does not seem to give good results (?)
Output 1bit.pdf

manisandro · 2017-02-22T13:57:24Z

Mh yeah looks like the downsampling in the end adds too much blur for the conversion to monochrome to look decent. If you choose grayscale, it looks decentish.
I think it should be possible to tweak the code to directly convert/resample the original input image instead of using the already resampled OCR image.

Golddouble · 2017-02-22T14:12:10Z

> I think it should be possible to tweak the code to directly convert/resample the original input image instead of using the already resampled OCR image.

Thank you.

> If you choose grayscale, it looks decentish.

-> This is not true.

I tried this:
- 180 dpi in the advanced image controls toolbar
- saved with 90 dpi, grey and lossless
Result: Output, grey loseless 180,90.pdf

When you now compare "Output, grey loseless 180,90.pdf" with
- the input file "Beispiel.PDF" from my first post, or
- your output file "Beispiel-90-gray.pdf" (second post),
you can see that "Output, grey loseless 180,90.pdf" is much less sharp.

manisandro · 2017-02-22T14:23:11Z

I said decent-ish, not decent ;)

manisandro · 2017-02-23T21:57:34Z

Please try:

https://smani.fedorapeople.org/tmp/gImageReader_3.2.1_qt5_i686.exe
https://smani.fedorapeople.org/tmp/gImageReader_3.2.1_qt5_x86_64.exe

This version samples the input image directly instead of the recognition image. This means that as far as image quality in the output is concerned it does not matter what dpi you choose in the advanced image controls. If you choose the same dpi in the output as the original source, you should get pretty much the same image again.
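
The difference between the two approaches can be sketched on a single row of pixels. This is only an illustration in Python with plain linear interpolation; the `resample` and `max_error` helpers and the pixel values are assumptions made for this example and do not reflect the actual gImageReader implementation or its scaling filter:

```python
def resample(row, new_len):
    """Resample a row of pixel values to new_len samples (linear interpolation)."""
    old_len = len(row)
    out = []
    for i in range(new_len):
        # Map the centre of output pixel i back into input coordinates.
        x = (i + 0.5) * old_len / new_len - 0.5
        x = min(max(x, 0.0), old_len - 1.0)  # clamp at the borders
        left = int(x)
        right = min(left + 1, old_len - 1)
        t = x - left
        out.append((1 - t) * row[left] + t * row[right])
    return out


def max_error(a, b):
    """Largest absolute per-pixel difference between two rows."""
    return max(abs(x - y) for x, y in zip(a, b))


# A sharp black/white pattern standing in for one row of a 90 dpi scan.
source = [0, 255, 0, 255, 255, 0, 0, 255, 0]

# Previous behaviour: sample at 300 dpi for OCR, then scale that
# recognition image back down to 90 dpi for the PDF.
ocr_row = resample(source, len(source) * 300 // 90)
round_trip = resample(ocr_row, len(source))

# Test version: render the source directly at the output dpi (90 -> 90).
direct = resample(source, len(source))

print("round trip error:", max_error(round_trip, source))
print("direct error:    ", max_error(direct, source))
```

In this toy setup the direct path reproduces the source row exactly, while the round trip through the recognition resolution smears the sharp edges, regardless of which recognition dpi was chosen.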

I'd appreciate it if you could test various combinations (PDF with and without invisible text overlay) and various resolutions to check whether it behaves as you expect. Thanks!

Golddouble · 2017-02-23T22:10:19Z

Wow, thank you for making this test version.
I will test it tomorrow.

Golddouble · 2017-02-24T19:28:23Z

I like these advanced image controls. Yes, the text recognition is indeed sometimes much better with a higher dpi there.

So far, I have noticed the following issues during testing:

1.

A) Input file 90 dpi. Advanced image setting 90 dpi. Saved with 90 dpi and as OCR (no picture):
The font size was chosen so that the text fits the page well.
B) Input file 90 dpi. Advanced image setting 360 dpi. Saved with 90 dpi and as OCR (no picture):

My expectation was that the text in B) would now be the same size as in A).
- But it is much smaller. Is this behaviour intended?
- Second, every text line in A) has exactly the same size. In B), some text lines have a different size. For example:

Inputfile: Beispiel.pdf
Output A): Output A.pdf
Output B): Output B.pdf

The settings for both A) and B):
https://guides.github.com/features/mastering-markdown/

2.

Input file 300 dpi (TIF). Advanced image setting 300 dpi. Start text recognition. Worked as expected.
Input file 300 dpi (TIF). Advanced image setting 600 dpi. Start text recognition. -> It does not start. Maybe it is too much computation for the program?
Inputfile: Input.zip

3.

Input file 300 dpi (TIF). Width: 2835 px; height: 2209 px. Advanced image setting 300 dpi. Start text recognition. Saved with these settings:

My expectation was that "width" and "height" would not change. But they did:
Output file: width: 8505 px; height: 6627 px

Inputfile: Input.zip
Output 3: Output 3.pdf

manisandro · 2017-02-28T23:14:51Z

1. The font size detection is done by tesseract. The only aspect I might be able to influence is how the font size depends on the resolution of the OCRed image. But I'd need to check the tesseract internals.
2. It takes a while (30 sec) to start but eventually finishes pretty quickly, albeit using up to 4 GB of RAM.
3. True, but I observe the same behavior without this change. I'll need to check.


manisandro · 2017-03-23T20:49:39Z

Addendum:

For very low recognition dpis, tesseract has difficulties estimating the font size. But for higher dpis, it is pretty consistent.

This is harder. For non-PDF sources, gImageReader currently ignores the DPI of the file and just picks 100 as a basically arbitrary number (the idea behind it is that 100 means 100%, so if you enter 150, the image is scaled to 150% of its original size). It is true that many image formats (TIFF, JPG, PNG and BMP, to list a few) support specifying the dpi in the metadata, but few images actually carry the physically correct dpi there; most just carry the screen dpi, say 72 or 96.

I'm inclined to keep the current behaviour, since it means that you are recognizing the input image as-is. If gImageReader tried to interpret the dpi in the metadata, chances are that you would end up upscaling the image by a factor of four or more (image metadata says dpi=72, gImageReader defaults to 300 dpi => input image is upscaled 4.16x). That might cause RAM usage to blow up: your 2100 × 2970 px image (24 MB in RAM) with a bogus dpi of 72 is rendered at 8736 × 12355 px (411 MB in RAM) at dpi=300. I'll add this explanation to the FAQ.
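
The arithmetic can be checked with a short Python sketch. It is an illustration only, not gImageReader code: the helper names `scaled_size` and `ram_mib` are invented here, and 4 bytes per pixel is an assumption chosen because it matches the 24 MB figure quoted above. Using the exact ratio 300/72 gives marginally larger numbers than those quoted, which appear to be based on the factor truncated to 4.16.

```python
def scaled_size(width_px, height_px, source_dpi, target_dpi):
    """Pixel dimensions after rescaling from source_dpi to target_dpi."""
    return (round(width_px * target_dpi / source_dpi),
            round(height_px * target_dpi / source_dpi))


def ram_mib(width_px, height_px, bytes_per_pixel=4):
    """Approximate uncompressed size in MiB (4 bytes/pixel is an assumption)."""
    return width_px * height_px * bytes_per_pixel / 2**20


# Current behaviour for non-PDF input: 100 is treated as "100 %",
# so a setting of 300 triples both dimensions.
print(scaled_size(2835, 2209, 100, 300))  # (8505, 6627)

# Hypothetical behaviour if a 72 dpi metadata value were honoured.
w, h = scaled_size(2100, 2970, 72, 300)
print(w, h, round(ram_mib(2100, 2970)), round(ram_mib(w, h)))  # 8750 12375 24 413
```

The first call is consistent with the size change reported earlier in this thread for the TIF input (2835 × 2209 px becoming 8505 × 6627 px).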

Closing the ticket since the original issue of blurry output has been addressed as far as possible in da93d34.

manisandro added a commit that referenced this issue Mar 23, 2017

Re-render source image at output resolution when exporting to PDF instead of rescaling image used for recognition (#160)

da93d34

manisandro closed this as completed Mar 23, 2017
