Inconsistently sorted coordinate points for certain words in kant_aufklaerung sample #11

kba · 2018-08-06T13:35:09Z

E.g. all Word elements with an ID matching word_* in https://github.com/OCR-D/assets/blob/master/data/kant_aufklaerung_1784/page/kant_aufklaerung_1784_0020.xml, such as https://github.com/OCR-D/assets/blob/master/data/kant_aufklaerung_1784/page/kant_aufklaerung_1784_0020.xml#L71

Everywhere else, coordinates are sorted clockwise starting with top-left, but these coordinates start with bottom-right.

Can this be fixed upstream? If not, we could adapt the coordinate translation utilities in core.

@tboenig @bertsky

finkf · 2018-08-14T13:17:04Z

One more thing: are 4 points really needed? If we use rectangles, top-left and bottom-right would be sufficient.

kba · 2018-08-14T13:43:55Z

I interpret page:Coords to be the points of a polygon, not necessarily a rectangle, cf. https://github.com/PRImA-Research-Lab/PAGE-XML/blob/master/pagecontent/schema/pagecontent.xsd#L441

A two-coordinate tuple could be a special case, but translating between all these representations is confusing enough as it is, IMHO.
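To illustrate the translation between the two representations, here is a minimal sketch that expands a top-left/bottom-right pair into a four-point polygon string, clockwise from top-left. The function name `bbox_to_points` is hypothetical and not taken from core; the sketch assumes integer pixel coordinates with the y axis pointing down and the `x,y x,y ...` form of the `points` attribute.

```python
def bbox_to_points(x0, y0, x1, y1):
    """Expand a (top-left, bottom-right) box into a polygon points string.

    Hypothetical helper for illustration only. Corners are emitted
    clockwise starting at top-left, assuming y grows downwards.
    """
    corners = [
        (x0, y0),  # top-left
        (x1, y0),  # top-right
        (x1, y1),  # bottom-right
        (x0, y1),  # bottom-left
    ]
    return " ".join(f"{x},{y}" for x, y in corners)


print(bbox_to_points(10, 20, 110, 70))
```

Going in this direction is lossless; going from an arbitrary polygon back to a box is not, which is why the polygon form is the more general one.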

finkf · 2018-08-14T13:56:42Z

Yes OK.
But then I would suggest not relying on any ordering of the points. If you have more or fewer than 4 points (technically a triangle is a polygon, too), you need a more robust way to calculate the corresponding bounding boxes anyway.

finkf · 2018-08-14T13:59:16Z

And you cannot check/enforce the ordering in a schema AFAIK.

kba · 2018-08-14T14:13:02Z

The problem behind this issue was a segfault in Tesseract for certain words, IIRC.

I wouldn't want to enforce this via the schema. I was just curious how this happens, since the coordinates are shifted only in these specific cases.

Good point about polygons and bounding boxes. So far we do not support bounding polygons with boxes at all, because we assume the coordinates already describe a rectangle.

@cneud @wrznr @tboenig Do we have samples of non-rectangular text blocks to test?

bertsky · 2018-08-14T14:13:38Z

This point is probably decisive. Even if we fixed this upstream now, we could never be sure how components evolve. Since we have no simple and reliable way to check whether the assumption (clockwise, starting at top-left) holds, some day things will go wrong.

That goes for weaker assumptions too: we cannot enforce a clockwise or even an ordered path for points via XML schema.
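For reference, the ordering could be tested at runtime instead of in the schema. The following is only a sketch with hypothetical function names (`is_clockwise`, `starts_top_left`), not code from core. It assumes image coordinates with y growing downwards and a simple (non-self-intersecting) polygon, and uses the sign of the shoelace sum to determine orientation.

```python
def is_clockwise(polygon):
    """Return True if the points run clockwise as seen on screen.

    `polygon` is a list of (x, y) tuples. With y pointing down,
    a positive shoelace sum corresponds to a visually clockwise path.
    """
    twice_area = 0
    for (x0, y0), (x1, y1) in zip(polygon, polygon[1:] + polygon[:1]):
        twice_area += x0 * y1 - x1 * y0
    return twice_area > 0


def starts_top_left(polygon):
    """Return True if the first point has the smallest x and the smallest y.

    Only meaningful for axis-aligned rectangles; a general polygon
    may have no point that is minimal in both dimensions.
    """
    min_x = min(x for x, _ in polygon)
    min_y = min(y for _, y in polygon)
    return polygon[0] == (min_x, min_y)


expected = [(10, 20), (110, 20), (110, 70), (10, 70)]  # starts top-left
shifted = [(110, 70), (10, 70), (10, 20), (110, 20)]   # starts bottom-right
print(is_clockwise(expected), starts_top_left(expected))
print(is_clockwise(shifted), starts_top_left(shifted))
```

A check like this only detects a violated assumption; it does not remove the dependency on point order, which is the argument for an order-independent computation.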

So how much more expensive would a robust solution in utils.py be? Every PAGE element has coordinates, every page goes through several processing steps involving core's functions.

cneud · 2018-08-14T14:16:10Z

@kba Yes, we have plenty of examples (incl. some with PAGE ground truth) of non-rectangular text blocks. I will try to upload some samples over the next few days.

finkf · 2018-08-14T14:20:38Z

A simple programmatic solution for this would be to calculate the min/max x and y coordinates over all points. I have a simple fix for this, if you are interested.
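The min/max idea can be sketched as follows. This is not the fix from the pull request, just an illustration under assumptions: the function name `bbox_from_points` is hypothetical, and the input is assumed to be a `points` string of whitespace-separated `x,y` integer pairs.

```python
def bbox_from_points(points):
    """Compute (min_x, min_y, max_x, max_y) from a points string.

    Hypothetical helper for illustration. The result does not depend
    on the order or the number of points, so triangles and polygons
    with more than four points are handled the same way.
    """
    xs, ys = [], []
    for pair in points.split():
        x, y = pair.split(",")
        xs.append(int(x))
        ys.append(int(y))
    return min(xs), min(ys), max(xs), max(ys)


# Same rectangle, once starting top-left and once starting bottom-right.
print(bbox_from_points("10,20 110,20 110,70 10,70"))
print(bbox_from_points("110,70 10,70 10,20 110,20"))
```

Both calls return the same box, so the point order in the source file no longer matters for this computation.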

kba · 2018-08-14T15:18:51Z

> So how much more expensive would a robust solution in utils.py be?

Not that much, I guess. I'll send a PR.

kba · 2018-08-14T15:19:30Z

> I do have a simple fix for this -- if you are interested in it.

Didn't see this before. Contributions welcome :)

finkf · 2018-08-14T16:18:06Z

pull request

kba · 2018-08-22T10:24:25Z

For posterity's sake: the original problem was a bug in Transkribus that has been fixed upstream and will be rolled out in the next release. HT @tboenig

kba closed this as completed in OCR-D/core@523aa8a Aug 15, 2018
