ocropus-gpageseg: Enable usage of masks to specify column separators/ ignore areas of scan #236

Merged: 6 commits, Feb 19, 2018


lehzwo
Contributor

@lehzwo lehzwo commented Jul 21, 2017

Intention

As stated in #171, @jze had the idea of applying a mask to an image when the column separator detection does not work as expected.
This behavior could be interesting for #208 as well as #89.

Current behavior of the mask

A white-on-black mask can be applied to an image file. The white areas of the mask are interpreted as column separators and are merged with the column separators that OCRopus has identified itself.
The merged separators are then treated as before. If no mask is found for an image, the behavior of ocropus-gpageseg is unchanged.

Naming convention

The file that specifies the mask is named after the basename of the file it applies to, extended by ".mask.png".

E.g. if the basename of the file is ".../book/0003",
the mask is stored as ".../book/0003.mask.png".
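
The naming convention can be sketched as a small path helper. This is only an illustration: `mask_path_for` is a hypothetical name, and the sketch assumes that the basename is everything before the first dot of the file name (so that "0003.bin.png" maps to "0003").

```python
import os

MASK_SUFFIX = ".mask.png"  # suffix stated in this PR


def mask_path_for(image_path):
    """Return the path where a mask for image_path would be looked up.

    Hypothetical helper. Assumption: the basename is the part of the
    file name before its first dot.
    """
    directory, name = os.path.split(image_path)
    base = name.split(".", 1)[0]
    return os.path.join(directory, base + MASK_SUFFIX)


print(mask_path_for(os.path.join("book", "0003.bin.png")))
```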

Further usages

Column separators identified by OCRopus currently delete the pixels beneath them, and the column separators specified by the mask behave the same way. Because of this, the mask can also be used to ignore areas of a scan. This might screw up the segmentation, though!

This way images as in the example of #38 or other areas could be ignored.
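
The merge and the pixel deletion described above can be sketched with NumPy. This is a minimal illustration, not the code of this PR: `merge_mask` and the toy arrays are made up for the example, and all images are assumed to be 0/1 arrays of the same shape (1 = ink in the page image, 1 = white in the separator images and the mask).

```python
import numpy as np


def merge_mask(binary, colseps, mask=None):
    """Merge a user mask into the detected separators (hypothetical helper).

    Returns the page with all pixels under a separator removed, together
    with the merged separator image.
    """
    if mask is None:
        # No mask found for this image: nothing changes.
        return binary, colseps
    # A pixel is a separator if OCRopus or the mask marks it as one.
    merged = np.maximum(colseps, mask)
    # Pixels beneath a separator are deleted from the page.
    cleaned = np.minimum(binary, 1 - merged)
    return cleaned, merged


binary = np.array([[1, 1, 1, 1],
                   [1, 1, 1, 1]])
colseps = np.array([[0, 1, 0, 0],
                    [0, 1, 0, 0]])
mask = np.array([[0, 0, 0, 1],
                 [0, 0, 0, 1]])
cleaned, merged = merge_mask(binary, colseps, mask)
```

Since white mask areas end up deleted from the page, the same operation covers both the "separator" and the "ignore area" use.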

Potential improvements/ changes

Using the mask to ignore areas might lead to segmentation problems if not done with care. It might therefore be desirable to support different kinds of masks, e.g. a mask with the extension ".mask.sep.png" for column separators and another mask with the extension ".mask.ia.png" for areas of the scan to ignore.
This would lead to multiple mask files per image. Another way to achieve this behavior might be to work with different colors, so that only one mask file is needed. This would work for the actions "ignore area" and "apply separator" but is not as extensible, since each pixel can only have one color.
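
To make the single-file variant more concrete, here is a rough sketch of how a color-coded mask could be split into one layer per action. Nothing of this exists in the PR: the color assignment (red = separator, blue = ignore area) and the name `split_color_mask` are assumptions chosen for the example.

```python
import numpy as np

# Hypothetical color coding, not defined anywhere in OCRopus.
SEPARATOR = (255, 0, 0)  # pure red marks a column separator
IGNORE = (0, 0, 255)     # pure blue marks an area to ignore


def split_color_mask(rgb):
    """Split an H x W x 3 uint8 mask into two 0/1 layers (sketch)."""
    sep = np.all(rgb == SEPARATOR, axis=-1).astype(np.uint8)
    ign = np.all(rgb == IGNORE, axis=-1).astype(np.uint8)
    return sep, ign


rgb = np.zeros((2, 3, 3), dtype=np.uint8)
rgb[:, 0] = SEPARATOR  # first column is a separator
rgb[0, 2] = IGNORE     # one pixel belongs to an ignore area
sep, ign = split_color_mask(rgb)
```

The limitation mentioned above is visible here: a pixel carries exactly one color, so it can belong to only one layer.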

Another potential improvement would be a new parameter for ocropus-gpageseg that takes a "master mask" (or several master masks, if multiple masks are allowed) which is applied to every image processed by ocropus-gpageseg. This way unwanted recurring patterns could easily be ignored on each scan without renaming or copying an existing mask.
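
A possible shape for such a parameter, sketched with argparse. The option name `--mastermask` and the helper `masks_for` are hypothetical; ocropus-gpageseg does not define them, and this is only one way the idea could look.

```python
import argparse

parser = argparse.ArgumentParser(description="sketch of a master-mask option")
parser.add_argument("files", nargs="+", help="input page images")
# Hypothetical option: may be given several times, once per master mask.
parser.add_argument("--mastermask", action="append", default=[],
                    help="mask applied to every processed image")


def masks_for(base, master_masks, exists):
    """List the masks to merge for one page (hypothetical helper).

    base is the page basename, exists a predicate such as os.path.exists.
    The per-image mask is used if present, followed by all master masks.
    """
    own = base + ".mask.png"
    return ([own] if exists(own) else []) + list(master_masks)


args = parser.parse_args(["--mastermask", "stamp.png", "book/0001.bin.png"])
```

Passing `exists` as an argument keeps the helper independent of the file system, which makes it easy to try out.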

Examples

Mask for first image of #208 as well as lineseeds with and without mask. Mask is used to identify correct separator and prevent identification of wrong separators in the text of the scan.
Params:
ocropus-nlbin -b 0.2
with mask:
ocropus-gpageseg --debug -n --minscale 11

without mask:
ocropus-gpageseg -b --maxseps 5 --debug -n --minscale 11

Column separator identification by OCRopus was disabled when the mask was used, since in this case we only want to apply the separators specified by the mask.

[Images: 0001 mask, lineseeds_applied_mask, lineseeds_without_mask]

Mask for image of #89 as well as lineseeds with and without mask. Mask is used to prevent "holes" in the separators (would have been enough to mark holes in the mask instead of complete separators since the specified areas are merged with the separators found by OCRopus).
Params:
ocropus-nlbin -b 0.2
ocropus-gpageseg --maxcolseps 0 -b --maxseps 10 --sepwiden 50 -d -n

[Images: 0001 mask, lineseeds_applied_mask, lineseeds_without_mask]

Mask for image of #38 as well as lineseeds with and without mask. Mask is used to ignore image on scan.
Params:
ocropus-nlbin -b 0.2
ocropus-gpageseg --maxseps 1000 --maxcolseps 1000 -n -d --maxlines 10000

[Images: 0001 mask, lineseeds_applied_mask, lineseeds_without_mask]

Lenno and others added 3 commits January 6, 2017 17:26
- reads mask from base-path with ".mask.png" extension as binary
- white areas are treated like column separators (can be used to prevent further processing of the area)
@kba
Collaborator

kba commented Sep 29, 2017

This is both simple and useful, esp. for images with static layout. Is backwards-compatible too. We should add tests/examples and document it.

@kba kba requested a review from zuphilip September 29, 2017 14:17
@zuphilip
Collaborator

zuphilip commented Sep 29, 2017

Yes, this looks really interesting! I haven't had the time to look at it yet. Really sorry for that :-( I will try to do so soon, and feel free to ping me again...

@zuphilip
Collaborator

tldr: Thank you very much @lehzwo for this nice PR! ✨ This looks very useful and is just a small addition to the code as @kba already pointed out. We are IMO ready to merge this now. @kba Can you do the final review and merge it then?


I updated the PR and added a test case. However, I think that the smoke test here may not detect many errors. Currently, the output of the test looks like this:

INFO:
INFO:  ########## ./ocropus-gpageseg temp/table.bin.png --debug -n --minscale
INFO:
INFO:  temp/table.bin.png
INFO:  scale 7.416198
INFO:  computing segmentation
INFO:  computing column separators
INFO:  considering at most 0 whitespace column separators
INFO:  debug _1thresh.png
INFO:  debug _2grad.png
INFO:  debug _3seps.png
INFO:  debug _4seps.png
INFO:  debug _colwsseps.png
INFO:  debug _masked_seps.png
INFO:  computing lines
INFO:  debug _cleaned.png
INFO:  debug _lineseeds.png
INFO:  debug _seeds.png
INFO:  propagating labels
INFO:  spreading labels
ERROR:  temp/table.bin.png: too many lines 613

I didn't adjust the maximum number of lines, because the ordering of the lines takes quite some time. Thus, the run ends with this ERROR, but it is still a valid test case for us.

> Column separators identified by OCRopus currently delete the pixels beneath them, and the column separators specified by the mask behave the same way. Because of this, the mask can also be used to ignore areas of a scan. This might screw up the segmentation, though!

Yes, this is another nice application of such masks. I cannot think of any negative effect on the segmentation. The "smearing" step will not go into the white areas of the mask, but it will go everywhere else where it would go in a normal run as well. This looks like exactly the behavior we want. @lehzwo Did you have anything specific in mind here? I would rather go further from here: with these "masks" it is also possible to run the recognition only in a limited area by just coloring everything else white. Thus, one can for example recognize only one column while ignoring the remaining parts of the page.
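
As a rough illustration of the "limited area" idea, a mask that is white everywhere except for one rectangle could be built like this. The helper name `mask_outside` and the toy dimensions are assumptions for the example, and 1 stands for white.

```python
import numpy as np


def mask_outside(shape, top, bottom, left, right):
    """Return a mask that is white (1) everywhere except the rectangle
    covering rows top:bottom and columns left:right (hypothetical helper).
    """
    mask = np.ones(shape, dtype=np.uint8)
    mask[top:bottom, left:right] = 0  # keep this region for recognition
    return mask


# Keep only the middle "column" of a tiny 4 x 6 page.
mask = mask_outside((4, 6), top=0, bottom=4, left=2, right=4)
```

Saved under the ".mask.png" name of the page (with any image library), such a mask would blank out everything outside the chosen region.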

The only concern I have at the moment is that this feature is now very hidden. We certainly should document it well, e.g. in the wiki https://github.com/tmbdev/ocropy/wiki/Page-Segmentation . But maybe it also makes sense to add an additional parameter for it. With an additional parameter we can then change the default name as you suggest, but more important IMO is that the feature appears in --help. But maybe we can do this later in another PR...

@kba kba merged commit e9b6121 into ocropus-archive:master Feb 19, 2018