Inconsistently sorted coordinate points for certain words in kant_aufklaerung sample #11

Closed
kba opened this issue Aug 6, 2018 · 12 comments

Comments

Member

kba commented Aug 6, 2018

For example, all Word elements with an ID matching word_* in https://github.com/OCR-D/assets/blob/master/data/kant_aufklaerung_1784/page/kant_aufklaerung_1784_0020.xml, such as https://github.com/OCR-D/assets/blob/master/data/kant_aufklaerung_1784/page/kant_aufklaerung_1784_0020.xml#L71

Everywhere else, coordinates are sorted clockwise starting with top-left, but these coordinates start with bottom-right.

Can this be fixed upstream? If not, we could adapt the coordinate translation utilities in core.

@tboenig @bertsky


finkf commented Aug 14, 2018

One more thing: are 4 points really needed? If we use rectangles, top-left and bottom-right would be sufficient.

Member Author

kba commented Aug 14, 2018

I interpret page:Coords to be the points of a polygon, not necessarily a rectangle; cf. https://github.com/PRImA-Research-Lab/PAGE-XML/blob/master/pagecontent/schema/pagecontent.xsd#L441

A two-coordinate tuple could be a special case, but translating between all these representations is confusing enough as it is IMHO.
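
To illustrate what translating between the representations involves, here is a minimal sketch that expands a two-corner rectangle into the polygon form used by the points attribute (comma-separated x,y pairs separated by spaces). The function name `box_to_points` is hypothetical and not part of core; the clockwise, top-left-first output order is simply the convention described at the top of this issue.

```python
def box_to_points(x0, y0, x1, y1):
    """Expand a (top-left, bottom-right) box into a PAGE-style points string.

    Hypothetical helper for illustration only. Corners are emitted
    clockwise starting at top-left: TL, TR, BR, BL.
    """
    corners = [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]
    return " ".join(f"{x},{y}" for x, y in corners)


if __name__ == "__main__":
    print(box_to_points(10, 20, 110, 60))
```

The reverse direction (polygon to box) loses information whenever the polygon is not an axis-aligned rectangle, which is one reason to treat the two-point form as a special case rather than the default.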


finkf commented Aug 14, 2018

Yes, OK.
But then I would suggest not relying on any ordering of the points. If you have more or fewer than 4 points (technically a triangle is a polygon, too), you need a more robust way to calculate the corresponding bounding boxes anyway.


finkf commented Aug 14, 2018

And you cannot check/enforce the ordering in a schema AFAIK.

Member Author

kba commented Aug 14, 2018

The problem behind this issue was a segfault in tesseract for certain words IIRC.

I wouldn't want to enforce this via schema; I was just curious how this happens, since the coordinates are shifted only in these specific cases.

Good point about polygons and bounding boxes. So far we do not support computing bounding boxes for polygons at all, because we assume coordinates to be coordinates.

@cneud @wrznr @tboenig Do we have samples of non-rectangular text blocks to test?

Contributor

bertsky commented Aug 14, 2018

This point is probably decisive. Even if we fixed this upstream now, we can never be sure how components evolve. Since we have no simple and predictable way to check whether the clockwise, top-left-first assumption is fulfilled, some day things will go wrong.

That goes for weaker assumptions too: we cannot enforce a clockwise, or even an ordered, path for the points via XML schema.
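
Although a schema cannot express it, orientation can be checked in code. The following is a rough sketch, not anything that exists in core: `is_clockwise` is a hypothetical name, the input is assumed to be a points string of integer x,y pairs, and the check uses the signed area from the shoelace formula. Because image coordinates have y growing downwards, a positive sum corresponds to a path that looks clockwise on the page.

```python
def parse_points(points):
    """Parse 'x1,y1 x2,y2 ...' into a list of (x, y) integer tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def is_clockwise(points):
    """Return True if the polygon is visually clockwise in image coordinates.

    Hypothetical helper. Uses the shoelace sum (twice the signed area);
    with the y axis pointing down, a positive value means clockwise.
    """
    pts = parse_points(points)
    # Pair every point with its successor, wrapping around to the first.
    twice_area = sum(
        x0 * y1 - x1 * y0
        for (x0, y0), (x1, y1) in zip(pts, pts[1:] + pts[:1])
    )
    return twice_area > 0


if __name__ == "__main__":
    # Same rectangle, once starting top-left and once starting bottom-right.
    print(is_clockwise("0,0 10,0 10,5 0,5"))
    print(is_clockwise("10,5 0,5 0,0 10,0"))
```

Note that a path starting at bottom-right can still be clockwise: the orientation test says nothing about which point comes first, so it does not replace an order-independent bounding-box computation.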

So how much more expensive would a robust solution in utils.py be? Every PAGE element has coordinates, every page goes through several processing steps involving core's functions.

Member

cneud commented Aug 14, 2018

@kba Yes, we have plenty of examples (incl. some with PAGE ground truth) of non-rectangular text blocks; I will try to upload some samples over the next few days.


finkf commented Aug 14, 2018

A simple programmatic solution for this would be to calculate the min/max x and y coordinates over all points. I do have a simple fix for this, if you are interested in it.
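
The min/max idea could look roughly like the sketch below. This is not the code from the eventual pull request; `bbox_from_points` is a hypothetical name, and the input is assumed to be a points string of integer x,y pairs. The result does not depend on the order or the number of points.

```python
def bbox_from_points(points):
    """Return (x_min, y_min, x_max, y_max) for a PAGE-style points string.

    Hypothetical helper for illustration. Works for any polygon with at
    least one point, regardless of point order.
    """
    xs, ys = [], []
    for pair in points.split():
        x, y = pair.split(",")
        xs.append(int(x))
        ys.append(int(y))
    return min(xs), min(ys), max(xs), max(ys)


if __name__ == "__main__":
    # Rectangle whose points start at bottom-right, as in the affected words.
    print(bbox_from_points("110,60 10,60 10,20 110,20"))
    # A triangle is handled the same way.
    print(bbox_from_points("0,0 8,3 2,9"))
```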

Member Author

kba commented Aug 14, 2018

> So how much more expensive would a robust solution in utils.py be?

Not that much I guess. I'll send a PR.

Member Author

kba commented Aug 14, 2018

> I do have a simple fix for this -- if you are interested in it.

Didn't see this before. Contributions welcome :)


finkf commented Aug 14, 2018

pull request

Member Author

kba commented Aug 22, 2018

For posterity's sake: the original problem was a bug in Transkribus that has been fixed upstream and will be rolled out in the next release. HT @tboenig


4 participants