Line detection with different font sizes #46

Closed
zuphilip opened this issue Jun 15, 2015 · 2 comments

Comments

@zuphilip
Collaborator

The header line (title) of a document is often set in a larger font than the normal text. I have found that ocropus sometimes cuts a line in a larger font into two or more lines, which are then recognized as nonsense. If the header font is not too much larger (twice the size seems okay), the splitting into lines works. The problem occurs when the header font is three times the size of the normal font (36pt versus 12pt). For example, in the ocropus-gpageseg output for 0002 bin, the headline is split up into three lines.

Can the parameters of ocropus-gpageseg be set to avoid this behaviour? Or can line detection be tweaked in general?

@tmbdev
Collaborator

tmbdev commented Jun 15, 2015

ocropus-gpageseg assumes that text lines are roughly the same scale. In return, it can detect even touching text lines in noisy documents pretty well. But that's only one of many strategies and possible tradeoffs. Your documents look like they are quite clean but have large variations in font size.

The best way to do text line recognition reliably is probably to run multiple different line detectors and combine their outputs.

As a simple version of that, you could try to run ocropus-gpageseg at different scales, try to recognize all the candidate text lines from the different parameter settings, and throw away those that give gibberish either due to being merged or split up.

Obviously, that is not going to be cheap. But ultimately, the only arbiter of whether a text line has been correctly segmented is whether you can recognize it, so for general purpose text line segmentation, invoking a recognizer somewhere is necessary.
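The multi-scale idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: `gpageseg_command`, `looks_like_text`, and `keep_plausible` are hypothetical helper names, the "gibberish" test is a crude character-ratio heuristic rather than anything ocropus provides, and only the `--vscale` flag mentioned in this thread is used.

```python
from typing import Callable, Iterable, List, Tuple


def gpageseg_command(image: str, vscale: float) -> List[str]:
    """Build an ocropus-gpageseg command line for one vertical scale.

    Only the --vscale flag mentioned in this thread is used.
    """
    return ["ocropus-gpageseg", "--vscale", str(vscale), image]


def looks_like_text(text: str, min_ratio: float = 0.7) -> bool:
    """Crude gibberish check (an assumption, not part of ocropus).

    Accepts a recognized line if most of its characters are
    letters, digits, or spaces.
    """
    stripped = text.strip()
    if len(stripped) < 2:
        return False
    good = sum(ch.isalnum() or ch.isspace() for ch in stripped)
    return good / len(stripped) >= min_ratio


def keep_plausible(
    candidates: Iterable[Tuple[float, str]],
    accept: Callable[[str], bool] = looks_like_text,
) -> List[Tuple[float, str]]:
    """Keep (vscale, recognized_text) pairs whose text passes the check."""
    return [(vscale, text) for vscale, text in candidates if accept(text)]
```

Each command could be run with `subprocess.run(gpageseg_command(image, v), check=True)` for, say, `v` in `(1, 2, 3)`, followed by a recognizer pass over the resulting line images. Each run would probably need its own copy of the input or output directory so that successive runs do not overwrite each other's line images; that is an assumption about how the tool lays out its output. A real arbiter would more likely use recognizer confidence or a dictionary check than this character ratio.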

For Latin script, you can also try to classify individual connected components as text/non-text and then attempt to group those together.
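A minimal sketch of that connected-component approach, assuming the component bounding boxes have already been extracted (for instance with a labeling step such as `scipy.ndimage.label` plus `find_objects`). The names `is_text_like` and `group_into_lines` and all thresholds are hypothetical, and the size-based text/non-text test is a stand-in for a real classifier.

```python
from typing import List, Tuple

# Bounding box as (x0, y0, x1, y1); y grows downward.
Box = Tuple[int, int, int, int]


def is_text_like(box: Box, min_h: int = 5, max_h: int = 200,
                 max_aspect: float = 10.0) -> bool:
    """Size-based text/non-text guess (thresholds are assumptions)."""
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    if w <= 0 or h < min_h or h > max_h:
        return False  # specks, rules, or large graphics
    return w / h <= max_aspect


def group_into_lines(boxes: List[Box]) -> List[List[Box]]:
    """Greedily group text-like boxes that overlap vertically.

    The overlap test is relative to the smaller of the two heights,
    so a 36pt headline and 12pt body text are each grouped at
    their own scale.
    """
    lines: List[List[Box]] = []
    extents: List[Tuple[int, int]] = []  # (top, bottom) of each line
    text_boxes = [b for b in boxes if is_text_like(b)]
    for box in sorted(text_boxes, key=lambda b: (b[1], b[0])):
        _, y0, _, y1 = box
        for i, (top, bottom) in enumerate(extents):
            overlap = min(y1, bottom) - max(y0, top)
            if overlap >= 0.5 * min(y1 - y0, bottom - top):
                lines[i].append(box)
                extents[i] = (min(top, y0), max(bottom, y1))
                break
        else:
            # No existing line overlaps enough: start a new one.
            lines.append([box])
            extents.append((y0, y1))
    return [sorted(line, key=lambda b: b[0]) for line in lines]
```

Because nothing here depends on a single global line height, large headline glyphs are not cut apart by the body-text scale. The trade-off is that touching lines and noisy pages, which ocropus-gpageseg handles well, would likely break this simple grouping.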

I'm planning on releasing a 2D LSTM based segmenter at some point, but that will still take a while.

@zuphilip
Collaborator Author

Actually, in my example above the layout segmentation is perfect with ocropus-gpageseg --vscale 2.

kba pushed a commit to kba/ocropy that referenced this issue Dec 16, 2017