ocropus-gpageseg: Enable usage of masks to specify column separators/ ignore areas of scan #236
- reads mask from base-path with ".mask.png" extension as binary
- white areas are treated like column separators (can be used to prevent further processing of the area)
This is both simple and useful, esp. for images with static layout. It is backwards-compatible too. We should add tests/examples and document it.
Yes, this looks really interesting! I haven't had the time to look at it. Really sorry for that :-( I will try to do so soon, and feel free to ping me again...
tldr: Thank you very much @lehzwo for this nice PR! ✨ This looks very useful and is just a small addition to the code, as @kba already pointed out. We are IMO ready to merge this now. @kba Can you do the final review and then merge it? I updated the PR and added a test case. However, I think that the smoke test here may not detect many errors. Currently, the output of the test results looks like:
I didn't adjust the maximum number of lines, because the ordering of the lines takes quite some time. Thus, the test results in this ERROR, but it is still a valid test case for us.
Yes, this is another nice application of such masks. I cannot think of any negative effect on the segmentation. The "smearing" step will not go into the white areas of the mask, but it will go everywhere else it would go in a normal run. This looks like exactly the behavior we want. @lehzwo Did you have anything specific in mind here? I would rather go further from here: with these "masks" it is also possible to run the recognition only in a limited area by just coloring everything else white. Thus, one can for example recognize only one column and ignore the remaining parts of the page. The only concern I have at the moment is that this feature is now very hidden. We certainly should document it well, e.g. in the wiki https://github.com/tmbdev/ocropy/wiki/Page-Segmentation . But maybe it also makes sense to add an additional parameter for it. With an additional parameter we can then change the default name as you suggest, but more important IMO is that the feature appears in
Intention
As stated in #171, @jze had the idea of applying a mask to an image if the column separator detection did not work as expected.
This behavior could be interesting for #208 as well as #89.
Current behavior of the mask
A white-on-black mask can be applied to an image file. The white areas of the mask are interpreted as column separators and are merged with the column separators that have been identified by OCRopus itself.
The merged separators are then treated as before. If no mask is found for an image, the behavior of ocropus-gpageseg is not changed.
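The merge described above can be sketched as a pixel-wise maximum. This is only an illustration, not the code from this PR: the function name `merge_separators` and the convention that separators are stored as 0/1 arrays are assumptions.

```python
import numpy as np

def merge_separators(detected_seps, mask=None):
    """Combine automatically detected separators with the white areas of a mask.

    detected_seps: 2D array, nonzero where OCRopus found a separator (assumed layout).
    mask: 2D array of the same shape, nonzero (white) where the user marked a
          separator, or None if no mask file was found.
    """
    if mask is None:
        # No mask available: behavior stays exactly as before.
        return detected_seps
    if mask.shape != detected_seps.shape:
        raise ValueError("mask and page image must have the same shape")
    # A pixel is a separator if either source marks it as one.
    user_seps = (mask > 0).astype(detected_seps.dtype)
    return np.maximum(detected_seps, user_seps)
```

With this formulation a missing mask is a no-op, which matches the backwards-compatible behavior described above.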
Naming convention
The mask file is named after the basename of the image it applies to, extended by ".mask.png".
E.g. if the basename of the file is ".../book/0003",
the mask is stored as ".../book/0003.mask.png".
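A minimal sketch of this naming rule follows. The helper name `mask_path_for` and the example path `book/0003.bin.png` are hypothetical, and the sketch assumes that the basename is the file name with everything after the first dot removed.

```python
import os

def mask_path_for(image_path):
    """Return the path where a mask for image_path would be expected.

    Assumption: the basename is the file name up to the first dot,
    so "0003.bin.png" and "0003.png" share the mask "0003.mask.png".
    """
    directory, filename = os.path.split(image_path)
    basename = filename.split(".", 1)[0]
    return os.path.join(directory, basename + ".mask.png")

# Hypothetical example path, not taken from the PR.
example = mask_path_for(os.path.join("book", "0003.bin.png"))
```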
Further usages
Column separators identified by OCRopus currently delete the pixels beneath them. The column separators specified by the mask behave the same way. Because of this, the mask can also be used to ignore areas of a scan. This might screw up the segmentation, though!
This way, images such as the one in the example of #38, or other areas, could be ignored.
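The "delete the pixels beneath" effect can be illustrated roughly as follows. This is a sketch under assumptions, not the actual ocropus-gpageseg code: it assumes a binary page where 1 means ink and 0 means background, and the name `remove_masked_pixels` is made up for the example.

```python
import numpy as np

def remove_masked_pixels(binary, seps):
    """Clear every ink pixel that lies under a separator or ignore area.

    binary: 2D array, 1 = ink, 0 = background (assumed convention).
    seps:   2D array of the same shape, nonzero where a separator or
            ignore area was marked.
    """
    keep = 1 - (seps > 0).astype(binary.dtype)
    # Ink survives only where keep is 1.
    return np.minimum(binary, keep)
```

Painting a whole illustration white in the mask therefore removes it from the input that the line finder sees.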
Potential improvements/ changes
Using the mask to ignore areas might lead to segmentation problems if not done with care. It might therefore be desirable to support different kinds of masks, e.g. a mask with the extension ".mask.sep.png" for column separators and another mask with the extension ".mask.ia.png" for areas of the scan that should be ignored.
This would lead to multiple mask files per image. Another way to achieve this behavior might be to work with different colors, so that only one mask file is needed. This would work for the actions "ignore area" and "apply separator", but it is not as extensible, since each pixel can only have one color.
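To make the single-file color variant concrete, here is a rough sketch of how such a mask could be decoded. Nothing like this exists in the PR; the color assignments (red for separators, blue for ignore areas) and the function name `split_color_mask` are purely hypothetical.

```python
import numpy as np

# Hypothetical color assignments, chosen only for this example.
SEPARATOR_COLOR = (255, 0, 0)  # red   = "apply separator"
IGNORE_COLOR = (0, 0, 255)     # blue  = "ignore area"

def split_color_mask(rgb_mask):
    """Split an RGB mask of shape (height, width, 3) into two boolean masks."""
    rgb_mask = np.asarray(rgb_mask)
    separators = np.all(rgb_mask == SEPARATOR_COLOR, axis=-1)
    ignore = np.all(rgb_mask == IGNORE_COLOR, axis=-1)
    return separators, ignore
```

The limitation mentioned above is visible here: a pixel matches at most one color, so it cannot be both a separator and an ignore area.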
Another potential improvement would be to introduce a new parameter to ocropus-gpageseg for passing a "master mask" (or several master masks, if multiple masks are allowed) that is applied to every image processed by ocropus-gpageseg. This way, unwanted recurring patterns can easily be ignored on each scan without renaming or copying an existing mask.
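One possible way to select the masks for a page under this proposal is sketched below. The function `masks_for`, its parameters, and the choice to apply master masks in addition to a per-image mask are assumptions for illustration; no such parameter exists in ocropus-gpageseg at this point.

```python
import os

def masks_for(image_mask_path, master_mask_paths=(), exists=os.path.exists):
    """List the mask files that should be applied to one page.

    image_mask_path:   per-image mask path (e.g. "<basename>.mask.png").
    master_mask_paths: hypothetical master masks applied to every page.
    exists:            file-existence check, injectable for testing.
    """
    candidates = list(master_mask_paths) + [image_mask_path]
    # Missing files are skipped, so pages without any mask behave as before.
    return [path for path in candidates if exists(path)]
```

The resulting masks could then be merged one after another in the same way as a single mask.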
Examples
Mask for the first image of #208 as well as lineseeds with and without mask. The mask is used to identify the correct separator and to prevent identification of wrong separators in the text of the scan.
Params:
ocropus-nlbin -b 0.2
with mask:
ocropus-gpageseg --debug -n --minscale 11
without mask:
ocropus-gpageseg -b --maxseps 5 --debug -n --minscale 11
Column separator identification by OCRopus was disabled when the mask is used, since we only want to apply the separators specified by the mask in this case.
Mask for the image of #89 as well as lineseeds with and without mask. The mask is used to prevent "holes" in the separators (it would have been enough to mark only the holes in the mask instead of complete separators, since the specified areas are merged with the separators found by OCRopus).
Params:
ocropus-nlbin -b 0.2
ocropus-gpageseg --maxcolseps 0 -b --maxseps 10 --sepwiden 50 -d -n
Mask for the image of #38 as well as lineseeds with and without mask. The mask is used to ignore the image on the scan.
Params:
ocropus-nlbin -b 0.2
ocropus-gpageseg --maxseps 1000 --maxcolseps 1000 -n -d --maxlines 10000