Line detection with different font sizes #46

zuphilip · 2015-06-15T13:19:13Z

The header line (title) of a document is often set in a larger font than the normal text. I have found that ocropus sometimes cuts a line in a larger font into two lines (which are then recognized as nonsense). If the header font is not much larger (twice the size seems okay), then the splitting into lines is fine. But the problem occurs if the header font is 3 times the size of the normal font (36pt versus 12pt). E.g. ocropus-gpageseg of 0002 bin

where the headline is split up into three lines:

i.e.

Can the parameters of ocropus-gpageseg be set to avoid such behaviour? Or can line detection be tweaked in general?


tmbdev · 2015-06-15T17:57:55Z

ocropus-gpageseg assumes that text lines are roughly the same scale. In return, it can detect even touching text lines in noisy documents pretty well. But that's only one of many strategies and possible tradeoffs. Your documents look like they are quite clean but have large variations in font size.

The best way to do text line recognition reliably is probably to run multiple different line detectors and combine their outputs.

As a simple version of that, you could try to run ocropus-gpageseg at different scales, try to recognize all the candidate text lines from the different parameter settings, and throw away those that give gibberish either due to being merged or split up.

Obviously, that is not going to be cheap. But ultimately, the only arbiter of whether a text line has been correctly segmented is whether you can recognize it, so for general purpose text line segmentation, invoking a recognizer somewhere is necessary.
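The multi-scale idea could be sketched roughly as follows. This is only an illustration under stated assumptions: `--vscale` is an existing ocropus-gpageseg option, but the helper names, the chosen scale values, and the `gibberish_score` heuristic are made up for this example. A real pipeline would judge candidates from the recognizer's output (for instance with a dictionary check) rather than by counting characters.

```python
def gpageseg_commands(image, vscales=(1.0, 2.0, 3.0)):
    """Build one ocropus-gpageseg command line per vertical scale.

    The scale values are arbitrary examples. As far as I know, each run
    writes its line images to a directory derived from the image name,
    so in practice every scale would need its own copy of the image.
    """
    return [["ocropus-gpageseg", "--vscale", str(v), image] for v in vscales]


def gibberish_score(text):
    """Hypothetical heuristic: share of non-space characters that are
    neither letters nor digits. Stands in for a real recognizer check."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 1.0  # an empty recognition result counts as gibberish
    return sum(not c.isalnum() for c in chars) / len(chars)


def keep_plausible(recognized_lines, max_score=0.3):
    """Drop candidate lines whose recognized text looks like gibberish."""
    return [t for t in recognized_lines if gibberish_score(t) <= max_score]
```

The commands can be handed to `subprocess.run` one at a time; the recognized text of each resulting line then goes through `keep_plausible`, so that split or merged candidates are thrown away.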

For Latin script, you can also try to classify individual connected components as text/non-text and then attempt to group those together.
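A rough sketch of the grouping step, assuming the connected components have already been extracted and filtered down to text, and are given as `(x0, y0, x1, y1)` bounding boxes. The function name, the box format, and the 0.5 overlap threshold are assumptions for illustration, not anything ocropus provides.

```python
def group_into_lines(boxes, min_overlap=0.5):
    """Group (x0, y0, x1, y1) boxes into text lines by vertical overlap.

    A box joins the first line whose vertical extent overlaps it by at
    least `min_overlap` of the shorter of the two heights; otherwise it
    starts a new line. No global text scale is assumed.
    """
    lines = []  # each entry: {"y0": top, "y1": bottom, "boxes": [...]}
    for box in sorted(boxes, key=lambda b: b[0]):  # left to right
        x0, y0, x1, y1 = box
        target = None
        for line in lines:
            overlap = min(y1, line["y1"]) - max(y0, line["y0"])
            shorter = min(y1 - y0, line["y1"] - line["y0"])
            if shorter > 0 and overlap / shorter >= min_overlap:
                target = line
                break
        if target is None:
            lines.append({"y0": y0, "y1": y1, "boxes": [box]})
        else:
            target["boxes"].append(box)
            target["y0"] = min(target["y0"], y0)
            target["y1"] = max(target["y1"], y1)
    lines.sort(key=lambda l: l["y0"])  # top to bottom
    return [l["boxes"] for l in lines]
```

Because grouping depends only on how neighbouring boxes overlap, a 36pt headline and 12pt body text can each end up in their own line without a shared scale parameter. Small marks such as dots and accents would need extra handling that this sketch leaves out.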

I'm planning on releasing a 2D LSTM based segmenter at some point, but that will still take a while.

zuphilip · 2017-04-20T20:59:58Z

Actually, in my example above the layout segmentation is perfect with ocropus-gpageseg --vscale 2.

Fixes ocropus-archive#46

tmbdev closed this as completed Jun 15, 2015

amitdo mentioned this issue Apr 13, 2017

Gpageseg with different size is not correct #200

Open

amitdo mentioned this issue Apr 20, 2017

Improve textline finding for Arabic and other languages with many diacritics tesseract-ocr/tesseract#657

Open

kba pushed a commit to kba/ocropy that referenced this issue Dec 16, 2017

Add version switch to kraken/ketos

3345f71

Fixes ocropus-archive#46
