Line detection with different font sizes #46
Comments
ocropus-gpageseg assumes that text lines are roughly the same scale. In return, it can detect even touching text lines in noisy documents pretty well. But that's only one of many strategies and possible tradeoffs. Your documents look like they are quite clean but have large variations in font size.

The best way to do text line recognition reliably is probably to run multiple different line detectors and combine their outputs. As a simple version of that, you could try to run ocropus-gpageseg at different scales, try to recognize all the candidate text lines from the different parameter settings, and throw away those that give gibberish either due to being merged or split up. Obviously, that is not going to be cheap. But ultimately, the only arbiter of whether a text line has been correctly segmented is whether you can recognize it, so for general purpose text line segmentation, invoking a recognizer somewhere is necessary.

For Latin script, you can also try to classify individual connected components as text/non-text and then attempt to group those together. I'm planning on releasing a 2D LSTM based segmenter at some point, but that will still take a while.
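A minimal sketch of the "segment at several scales, recognize everything, drop the gibberish" idea might look like the following. It is not OCRopus code: `segment` and `recognize` are hypothetical callables standing in for a run of the line segmenter at a given scale and for a recognizer, and the score is assumed to be some confidence in [0, 1] where low values mean gibberish. Overlapping candidates are resolved by keeping the best-scoring one.

```python
from typing import Callable, Iterable, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), hypothetical line bounding box


def overlaps(a: Box, b: Box, min_frac: float = 0.5) -> bool:
    """True if the intersection covers at least min_frac of the smaller box."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    if w <= 0 or h <= 0:
        return False
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return w * h >= min_frac * min(area_a, area_b)


def pick_lines(
    scales: Iterable[float],
    segment: Callable[[float], List[Box]],        # assumed: line boxes found at one scale
    recognize: Callable[[Box], Tuple[str, float]],  # assumed: (text, confidence in [0, 1])
    min_score: float = 0.5,
) -> List[Tuple[Box, str]]:
    # Collect candidates from every scale, dropping those that recognize as gibberish.
    candidates = []
    for scale in scales:
        for box in segment(scale):
            text, score = recognize(box)
            if score >= min_score:
                candidates.append((score, box, text))

    # Greedily keep the most confident candidate among overlapping ones.
    candidates.sort(key=lambda c: c[0], reverse=True)
    kept = []
    for score, box, text in candidates:
        if not any(overlaps(box, k[1]) for k in kept):
            kept.append((score, box, text))

    # Return the surviving lines in top-to-bottom order.
    kept.sort(key=lambda c: c[1][1])
    return [(box, text) for _, box, text in kept]
```

With a small scale a 36pt headline may come back as several fragments that score poorly, while a large scale may merge 12pt body lines; each region is then covered by whichever scale produced a recognizable line. As noted above, this costs one recognizer call per candidate per scale.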
Actually, in my example above the layout segmentation is perfect with |
The header line (title) of a document is often set in a larger font than the normal text. I have found that ocropus sometimes cuts a line in a larger font into two lines (which are then recognized as nonsense). If the header font is not too much larger (twice the size seems okay), the splitting into lines works. But the problem occurs if the header font is 3 times the size of the normal font (36pt and 12pt). E.g. ocropus-gpageseg of 0002 bin, where the headline is split up into three lines.
Can the parameters of ocropus-gpageseg avoid such behaviour? Or can line detection be tweaked in general?