not sharp #160

Closed
Golddouble opened this issue Feb 22, 2017 · 16 comments
Comments

@Golddouble

Hello,

This is my input file:
Beispiel.PDF

According to http://pdf-analyser.edpsciences.org/ it has a 90dpi resolution.

1)
For the OCR I made the following settings:
[image]

I named the output file "Output 90dpi, b-w":
Output 90dpi, b-w.pdf

2)
For the OCR I made the following settings:
[image]

I named the output file "Output 90dpi, colour, loseless":
Output 90dpi, colour, loseless.pdf

So my question is:
Why is neither "Output 90dpi, b-w.pdf" nor "Output 90dpi, colour, loseless.pdf" as sharp as the input file "Beispiel.pdf" ?

Thank you.

@manisandro
Owner

Uhm, I recognized with dpi 90 and saved with dpi 90, and the output (ZIP, grayscale) looks good to me:
Beispiel-90-gray.pdf

@Golddouble
Author

Golddouble commented Feb 22, 2017

I tried again, also with dpi 90, ZIP and grey (Windows 7).
The result is again not sharp:
Output grey, zip.PDF

It's really strange.

Edit: But my grey version is 3 times as large as yours (in KB) (?)

@manisandro
Owner

Can you make a screencast of the steps you perform to produce the PDF?

@Golddouble
Author

Golddouble commented Feb 22, 2017

screencast deleted

@manisandro
Owner

In the advanced image controls (button left of the OCR mode button), select 90 as dpi.

@Golddouble
Author

Ah OK. I have never paid attention to this setting. Thank you, now it works. My setting there was 300dpi.

Question: what is the difference between the DPI in the export dialog window and the DPI in the image controls? Must the DPI in the image controls always be the same as the DPI of the input file?

@manisandro
Owner

The DPI setting in the advanced image controls toolbar is the DPI at which the input image is sampled to produce the image on which recognition is performed.
The DPI setting in the export dialog is the DPI at which the recognition image is written to the output PDF.

The purpose of the first DPI control is to be able to artificially upscale the image to improve recognition results. The default is 300, which is recommended for good OCR results. Depending on the ratio between the original image dpi and the sampling dpi, interpolation may produce a blurry sampled image. In theory if you choose an integer multiple of the original dpi (say 180 or 270) you should get a smoother image.
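
As a rough illustration of the ratio mentioned above (this is not gImageReader code, and the function names are made up for this sketch), the scale factor between the original dpi and the sampling dpi can be checked like this:

```python
from fractions import Fraction


def sampling_scale(source_dpi: int, sampling_dpi: int) -> Fraction:
    """Factor by which the input is scaled when sampled at sampling_dpi."""
    return Fraction(sampling_dpi, source_dpi)


def is_integer_multiple(source_dpi: int, sampling_dpi: int) -> bool:
    """True if every source pixel maps onto a whole number of sampled pixels."""
    return sampling_scale(source_dpi, sampling_dpi).denominator == 1


# The 90 dpi source from this thread, sampled at a few candidate dpi values.
for dpi in (90, 180, 270, 300):
    scale = sampling_scale(90, dpi)
    print(dpi, float(scale), is_integer_multiple(90, dpi))
```

With a 90 dpi source, 180 and 270 give whole-number factors, while the default of 300 gives a fractional factor of 10/3, which is the case where interpolation is more likely to smear edges.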

@Golddouble
Author

> In theory if you choose an integer multiple of the original dpi (say 180 or 270) you should get a smoother image.

Now I tried exactly 180, but this does not seem to give good results (?)
Output 1bit.pdf

@manisandro
Owner

Mh yeah looks like the downsampling in the end adds too much blur for the conversion to monochrome to look decent. If you choose grayscale, it looks decentish.
I think it should be possible to tweak the code to directly convert/resample the original input image instead of using the already resampled OCR image.

@Golddouble
Author

> I think it should be possible to tweak the code to directly convert/resample the original input image instead of using the already resampled OCR image.

Thank you.

> If you choose grayscale, it looks decentish.

-> This is not true.

I tried this:
-180dpi in the advanced image controls toolbar
-saved with 90dpi, grey and loseless
Result: Output, grey loseless 180,90.pdf

When you compare "Output, grey loseless 180,90.pdf" with
-the input file "Beispiel.PDF" from my first post, or with
-your output file "Beispiel-90-gray.pdf" (second post),
you can see that "Output, grey loseless 180,90.pdf" is much less sharp.

@manisandro
Owner

I said decent-ish, not decent ;)

@manisandro
Owner

Please try:

https://smani.fedorapeople.org/tmp/gImageReader_3.2.1_qt5_i686.exe
https://smani.fedorapeople.org/tmp/gImageReader_3.2.1_qt5_x86_64.exe

This version samples the input image directly instead of the recognition image. This means that as far as image quality in the output is concerned it does not matter what dpi you choose in the advanced image controls. If you choose the same dpi in the output as the original source, you should get pretty much the same image again.

I'd appreciate if you could test various combinations (PDF with and without invisible text overlay) and various resolutions to check whether it behaves as you expect. Thanks!
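
To show why sampling the input directly helps, here is a toy sketch on a single row of pixels. It is not the gImageReader implementation: `resample_linear` is a hypothetical helper, and plain linear interpolation is an assumption made only for this illustration.

```python
def resample_linear(samples, new_len):
    """Resample a 1-D row of pixel values to new_len with linear interpolation."""
    old_len = len(samples)
    if new_len == old_len:
        return list(samples)  # same resolution: nothing to interpolate
    scale = old_len / new_len
    out = []
    for i in range(new_len):
        # Map the centre of output pixel i back into input coordinates.
        x = (i + 0.5) * scale - 0.5
        x = min(max(x, 0.0), old_len - 1.0)  # clamp at the borders
        lo = int(x)
        hi = min(lo + 1, old_len - 1)
        t = x - lo
        out.append(samples[lo] * (1 - t) + samples[hi] * t)
    return out


# One row of a sharp black-to-white edge, standing in for a 90 dpi scan.
source_row = [0, 0, 0, 255, 255, 255]

# Old behaviour (as described in this thread): upsample to the recognition
# dpi (here 180), then downsample that image again for the 90 dpi output.
recognition_row = resample_linear(source_row, 12)
two_step_row = resample_linear(recognition_row, len(source_row))

# New behaviour: resample the original input directly to the output dpi.
direct_row = resample_linear(source_row, len(source_row))

print("two-step:", two_step_row)
print("direct:  ", direct_row)
```

In this toy row the two-step path leaves grey values on both sides of the edge, while the direct path returns the row unchanged. Grey pixels like these are a plausible source of the ragged look after conversion to monochrome.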

@Golddouble
Author

Wow, thank you for making this test version.
I will test it tomorrow.

@Golddouble
Author

I like these advanced image controls. Yes, text recognition is indeed sometimes much better with a higher dpi here.

So far during testing, the following issues caught my attention:

1.

A) Input file 90dpi. Advanced image setting 90dpi. Saved with 90dpi and as OCR (no picture):
The font size was chosen so that the text fits the page well.
B) Input file 90dpi. Advanced image setting 360dpi. Saved with 90dpi and as OCR (no picture):

My expectation was that the text size in B would be the same as in A.
-But it is much smaller. Is this behaviour intended?
-Second, every text line in A) has exactly the same size. In B) some text lines have a different size. For example:
[image]

Inputfile: Beispiel.pdf
Output A): Output A.pdf
Output B): Output B.pdf

The settings for both A) and B):
https://guides.github.com/features/mastering-markdown/

[image]

2.

Inputfile 300dpi (Tif). Advanced image setting 300dpi. Start text recognition. Worked as expected.
Inputfile 300dpi (Tif). Advanced image setting 600dpi. Start text recognition. -> Does not start. Maybe it is too much computation for the program?
Inputfile: Input.zip

3.

Inputfile 300dpi (Tif). Width: 2835pix; height: 2209pix. Advanced image setting 300dpi. Start text recognition. Saved with these settings:
[image]

My expectation was that "width" and "height" would not change. But they did:
Outputfile: Width: 8505pix; height: 6627pix

Inputfile: Input.zip
Output 3: Output 3.pdf

@manisandro
Owner

  1. The font size detection is done by tesseract. The only aspect I might be able to influence is the font-size dependence on the resolution of the OCRed image. But I'd need to check the tesseract internals.

  2. Takes a while (30sec) to start but eventually finishes pretty quickly, albeit using up to 4GB of RAM

  3. True, but I observe the same behavior without this change. I'll need to check.

manisandro added a commit that referenced this issue Mar 23, 2017
…tead of rescaling image used for recognition (#160)
@manisandro
Owner

Addendum:

  1. For very low recognition dpis, tesseract has difficulties estimating the font size. But for higher dpis, it is pretty consistent.
  2. This is harder. For non-PDF sources, gImageReader currently ignores the DPI of the file and just picks 100 as a basically arbitrary number (the idea behind it is that 100 means 100%, so if you enter 150 the image will be scaled to 150% of the original size). It is true that many image formats (TIFF, JPG, PNG, BMP, to list a few) support specifying the dpi in the metadata, but few images actually carry the physically correct dpi there; most just have the screen dpi, say 72 or 96. I'm inclined to keep the current behaviour, since it means that you are actually recognizing the input image as-is. If gImageReader tried to interpret the dpi in the metadata, chances are that you'd end up upscaling the image by a factor of four or more (image metadata says dpi=72, gImageReader defaults to 300dpi => input image is upscaled 4.16x). That could make RAM usage blow up: your 2100 × 2970 px image (24MB in RAM) with a bogus dpi of 72 is rendered at 8736x12355 (411 MB in RAM) at dpi=300. I'll add this explanation to the FAQ.
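
The arithmetic above can be sketched as follows. This is only an illustration, not gImageReader code: the function names are made up, and 4 bytes per pixel is an assumption chosen because it lands close to the RAM figures quoted above.

```python
def scaled_size(width_px, height_px, assumed_dpi, sampling_dpi):
    """Pixel size of an image after scaling from assumed_dpi to sampling_dpi."""
    return (
        round(width_px * sampling_dpi / assumed_dpi),
        round(height_px * sampling_dpi / assumed_dpi),
    )


def ram_mib(width_px, height_px, bytes_per_pixel=4):
    """Approximate uncompressed size in MiB (4 bytes/pixel is an assumption)."""
    return width_px * height_px * bytes_per_pixel / 2**20


# TIFF from point 3 of the test report: the file's dpi is ignored, 100 is
# assumed, so choosing 300 in the image controls triples both dimensions.
print(scaled_size(2835, 2209, assumed_dpi=100, sampling_dpi=300))

# Hypothetical case from the addendum: metadata claims 72 dpi, sampled at 300.
w, h = scaled_size(2100, 2970, assumed_dpi=72, sampling_dpi=300)
print(w, h, round(ram_mib(2100, 2970)), round(ram_mib(w, h)))
```

The first call reproduces the 8505 × 6627 px output reported for the 2835 × 2209 px TIFF. The second lands near, not exactly on, the figures quoted above, since the exact pixel counts depend on how the scale factor is rounded.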

Closing the ticket since the original issue of blurry output has been addressed as far as possible in da93d34.
