ocropus-gpageseg: Enable usage of masks to specify column separators/ ignore areas of scan #236

lehzwo · 2017-07-21T12:19:13Z

Intention

As stated in #171 @jze had the idea of applying a mask to an image if the column separator detection did not work as expected.
This behavior could be interesting for #208 as well as #89.

Current behavior of the mask

A white on black mask can be applied to an image file. The white areas of the mask are interpreted as column separators and are merged with the column separators that have been identified by OCRopus itself.
The merged separators are then treated as before. If no mask is found for an image, ocropus-gpageseg's behavior is not changed.
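The merging step described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the function name `merge_mask_with_separators` and the toy arrays are assumptions, and the only thing taken from the description is that white mask pixels are combined with the separators OCRopus detected.

```python
import numpy as np

def merge_mask_with_separators(seps, mask):
    """Combine detected separators with a user-supplied mask.

    Both inputs are 2-D arrays of the same shape; any non-zero pixel
    counts as "separator". The result is a 0/1 uint8 array.
    (Hypothetical helper, not part of ocropus-gpageseg.)
    """
    if seps.shape != mask.shape:
        raise ValueError("mask shape %r does not match image shape %r"
                         % (mask.shape, seps.shape))
    return np.logical_or(seps > 0, mask > 0).astype(np.uint8)

# Toy 3x4 page: a separator was detected in column 1,
# and the mask adds another one in column 3.
detected = np.array([[0, 1, 0, 0],
                     [0, 1, 0, 0],
                     [0, 0, 0, 0]], dtype=np.uint8)
mask = np.array([[0, 0, 0, 255],
                 [0, 0, 0, 255],
                 [0, 0, 0, 255]], dtype=np.uint8)
merged = merge_mask_with_separators(detected, mask)
```

Because the two sources are simply OR-ed, a mask only ever adds separators; it cannot remove one that OCRopus found on its own.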

Naming convention

The file which specifies the mask is saved as the basename of the file it shall be applied on, extended by ".mask.png".

e.g. the basename of the file is ".../book/0003"
in this case the mask is stored as ".../book/0003.mask.png"
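As a rough sketch of the naming convention, the mask path could be derived like this. The helper names `mask_path_for` and `find_mask` are hypothetical, and the sketch assumes that "basename" means the image path with all extensions removed (so that `0003.png` and `0003.bin.png` would share one mask).

```python
import os

def mask_path_for(image_path):
    """Return the mask path that belongs to an image path.

    Assumption: the base name is the file name up to its first dot.
    """
    directory, filename = os.path.split(image_path)
    stem = filename.split(".", 1)[0]  # drop everything after the first dot
    return os.path.join(directory, stem + ".mask.png")

def find_mask(image_path):
    """Return the mask path if that file exists, else None (no mask is applied)."""
    candidate = mask_path_for(image_path)
    return candidate if os.path.isfile(candidate) else None
```

Returning `None` when no file is found mirrors the statement above that the behavior is unchanged for images without a mask.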

Further usages

Column separators identified by OCRopus currently delete the pixels beneath them, and the column separators specified by the mask behave the same way. Because of this, the mask can also be used to ignore areas of a scan. This might screw up the segmentation, though!

This way images as in the example of #38 or other areas could be ignored.
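The "ignore area" effect follows directly from separators deleting the pixels beneath them. A minimal sketch of that idea, with a hypothetical function name and toy data (the actual implementation in ocropus-gpageseg may differ):

```python
import numpy as np

def remove_pixels_under_separators(binary, seps):
    """Zero out every ink pixel that lies under a separator pixel.

    `binary` holds ink as non-zero values, `seps` holds separator or
    mask pixels as non-zero values. (Hypothetical helper.)
    """
    return binary * (seps == 0)

# Toy 2x4 page that is entirely ink; the mask covers the right half.
page = np.ones((2, 4), dtype=np.uint8)
ignore = np.array([[0, 0, 255, 255],
                   [0, 0, 255, 255]], dtype=np.uint8)
cleaned = remove_pixels_under_separators(page, ignore)
```

Everything under the white mask area disappears before line finding, which is why a large masked region behaves like an ignored area.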

Potential improvements/ changes

Using the mask to ignore areas might lead to segmentation problems if not done with care. It might therefore be desirable to support different kinds of masks, e.g. a mask with the extension ".mask.sep.png" for column separators and another mask with the extension ".mask.ia.png" for areas of the scan to ignore.
This would lead to multiple mask files per image. Another way to achieve this would be to work with different colors, so that only one mask file is needed. This would work for the actions "ignore area" and "apply separator" but is less extensible, since each pixel can only have one color.
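To illustrate the single-file, color-coded variant, here is a small sketch. The color assignment (red for separators, blue for ignored areas) and the function name `split_color_mask` are purely hypothetical; the PR does not define any colors.

```python
import numpy as np

# Hypothetical color assignment, chosen only for this illustration.
SEPARATOR_COLOR = (255, 0, 0)  # red: apply separator
IGNORE_COLOR = (0, 0, 255)     # blue: ignore area

def split_color_mask(rgb):
    """Split an H x W x 3 uint8 mask into two boolean H x W layers."""
    separators = np.all(rgb == SEPARATOR_COLOR, axis=-1)
    ignored = np.all(rgb == IGNORE_COLOR, axis=-1)
    return separators, ignored

# Toy 2x3 mask: a red separator in column 1, one blue pixel at (1, 2).
rgb = np.zeros((2, 3, 3), dtype=np.uint8)
rgb[:, 1] = SEPARATOR_COLOR
rgb[1, 2] = IGNORE_COLOR
separators, ignored = split_color_mask(rgb)
```

The two layers can never overlap, which is the limitation mentioned above: a pixel carries exactly one color and therefore exactly one action.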

Another potential improvement would be to introduce a new parameter to ocropus-gpageseg for passing a "master mask" (or several master masks, if multiple masks are allowed) that is applied to every image processed by ocropus-gpageseg. This way, unwanted recurring patterns can easily be ignored on each scan without renaming or copying an existing mask.
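One way the proposed master mask could interact with per-image masks is sketched below. No such parameter exists in this PR; the function name `masks_to_apply` and its `master_masks` argument are assumptions made for illustration.

```python
import os

def masks_to_apply(image_base, master_masks=()):
    """List the mask files for one image.

    The image's own "<base>.mask.png" comes first (if it exists),
    followed by the master masks, without duplicates.
    (Hypothetical helper; `master_masks` is not an existing option.)
    """
    own = image_base + ".mask.png"
    paths = [own] if os.path.isfile(own) else []
    for master in master_masks:
        if master not in paths:
            paths.append(master)
    return paths
```

With this shape, an image without its own mask would still receive the master masks, and an image with neither would be processed exactly as before.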

Examples

Mask for first image of #208 as well as lineseeds with and without mask. Mask is used to identify correct separator and prevent identification of wrong separators in the text of the scan.
Params:
ocropus-nlbin -b 0.2
with mask:
ocropus-gpageseg --debug -n --minscale 11
without mask:
ocropus-gpageseg -b --maxseps 5 --debug -n --minscale 11

Column separator identification by OCRopus was disabled for the run with the mask, since we only want to apply the separators specified by the mask in this case.

Mask for image of #89 as well as lineseeds with and without mask. Mask is used to prevent "holes" in the separators (would have been enough to mark holes in the mask instead of complete separators since the specified areas are merged with the separators found by OCRopus).
Params:
ocropus-nlbin -b 0.2
ocropus-gpageseg --maxcolseps 0 -b --maxseps 10 --sepwiden 50 -d -n

Mask for image of #38 as well as lineseeds with and without mask. Mask is used to ignore image on scan.
Params:
ocropus-nlbin -b 0.2
ocropus-gpageseg --maxseps 1000 --maxcolseps 1000 -n -d --maxlines 10000

- reads mask from base-path with ".mask.png" extension as binary
- white areas are treated like column separators (can be used to prevent further processing of the area)

kba · 2017-09-29T14:17:09Z

This is both simple and useful, esp. for images with static layout. Is backwards-compatible too. We should add tests/examples and document it.

zuphilip · 2017-09-29T14:26:11Z

Yes, this looks really interesting! I haven't had the time to look at it. Really sorry for that :-( I will try to do so soon, and feel free to ping me again...

zuphilip · 2017-12-30T13:54:26Z

tldr: Thank you very much @lehzwo for this nice PR! ✨ This looks very useful and is just a small addition to the code as @kba already pointed out. We are IMO ready to merge this now. @kba Can you do the final review and merge it then?

I updated the PR and added a test case. However, I think that the smoke test here may not detect many errors. Currently, the output of the test looks like:

INFO:
INFO:  ########## ./ocropus-gpageseg temp/table.bin.png --debug -n --minscale
INFO:
INFO:  temp/table.bin.png
INFO:  scale 7.416198
INFO:  computing segmentation
INFO:  computing column separators
INFO:  considering at most 0 whitespace column separators
INFO:  debug _1thresh.png
INFO:  debug _2grad.png
INFO:  debug _3seps.png
INFO:  debug _4seps.png
INFO:  debug _colwsseps.png
INFO:  debug _masked_seps.png
INFO:  computing lines
INFO:  debug _cleaned.png
INFO:  debug _lineseeds.png
INFO:  debug _seeds.png
INFO:  propagating labels
INFO:  spreading labels
ERROR:  temp/table.bin.png: too many lines 613

I didn't adjust the maximum number of lines, because the ordering of the lines takes quite some time. The run therefore ends with this ERROR, but it is still a valid test case for us.

Column separators identified by OCRopus currently delete the pixels beneath them, and the column separators specified by the mask behave the same way. Because of this, the mask can also be used to ignore areas of a scan. This might screw up the segmentation, though!

Yes, this is another nice application of such masks. I cannot think of any negative effect on the segmentation. The "smearing" step will not go into the white areas of the mask, but it will go everywhere else it would go in a normal run. This looks like exactly the behavior we want. @lehzwo Did you have anything specific in mind here? I would rather go further from here: with these "masks" it is also possible to run the recognition only in a limited area by just coloring everything else white. Thus, one can for example recognize only one column and ignore the remaining parts of the page.
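The "limited area" idea could be sketched like this: build a mask that is white everywhere except the region to keep. The function name `keep_only_region` and the coordinate convention are assumptions for illustration, and the array would still have to be saved under the ".mask.png" name described in the PR.

```python
import numpy as np

def keep_only_region(shape, top, bottom, left, right):
    """Build a mask that is white (255) everywhere except one rectangle.

    Rows top..bottom-1 and columns left..right-1 stay black (0), so only
    that region would remain for segmentation. (Hypothetical helper.)
    """
    mask = np.full(shape, 255, dtype=np.uint8)
    mask[top:bottom, left:right] = 0
    return mask

# Toy 2x4 page: keep only the two middle columns.
column_mask = keep_only_region((2, 4), 0, 2, 1, 3)
```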

The only concern I have at the moment is that this feature is now very hidden. We certainly should document it well, e.g. in the wiki https://github.com/tmbdev/ocropy/wiki/Page-Segmentation . But maybe it also makes sense to add an additional parameter for it. With an additional parameter we could then change the default name as you suggest, but more important IMO is that the feature appears in --help. But maybe we can do this later in another PR...

Lenno and others added 3 commits January 6, 2017 17:26

Apply mask to image.

8227c4e

- reads mask from base-path with ".mask.png" extension as binary
- white areas are treated like column separators (can be used to prevent further processing of the area)

Fixed return on error

b851d0e

Merge apply_mask branch to master

d63014d

kba requested a review from zuphilip September 29, 2017 14:17

Merge github.com:tmbdev/ocropy

aee43ee

zuphilip added the ✨ enhancement label Dec 8, 2017

zuphilip added 2 commits December 27, 2017 00:39

Delete ocropus-gpageseg.orig, fix numpy calls

af2dd4c

Add test case for using a mask in page segmentation

289a58f

zuphilip approved these changes Dec 30, 2017


kba merged commit e9b6121 into ocropus-archive:master Feb 19, 2018

IgorMunizS mentioned this pull request Feb 21, 2018

How can I improve ocropus accuracy? #296

Open
