hocr-check complains assert doc.xpath("//meta[@name='ocr-id']")!=[] #59

CharlesNepote · 2016-09-15T14:23:07Z

Can be reproduced with both tesseract and gImageReader hOCR files.
manisandro/gImageReader#101

Does the script end with this error or is it still checking the other issues?

kba · 2016-09-15T14:59:23Z

I would recommend using hocr-spec-python. It is an explicit replacement for hocr-check, which is more or less nonsensical at the moment.

kba · 2016-09-15T15:00:06Z

You can try it online here: http://digi.bib.uni-mannheim.de/ocr-fileformat/#validate

CharlesNepote · 2016-09-15T15:52:43Z

Every file I try to upload to hocr-spec-python online -- http://digi.bib.uni-mannheim.de/ocr-fileformat/#validate -- ends with "NameError: global name 'KeyErrora' is not defined" (from tesseract or gImageReader).

kba · 2016-09-15T15:56:13Z

KeyErrora is a typo in https://github.com/kba/hocr-spec-python/blob/dc6377b8b46d1cb81a8571ad71c22746758bb1a2/hocr_spec/spec.py#L410. I'll fix it, thanks.

kba · 2016-09-15T16:06:30Z

Should be fixed and deployed. BTW: If you have any samples, ideally complicated ones, you're welcome to contribute them to ocr-fileformat-samples.

zuphilip · 2016-09-15T17:06:55Z

However, the main problem here is something we should also fix in hocr-check:

ocr-id should be replaced with ocr-system in the check itself and in the tests.
Then we conform to the spec again.
Also, current Ocropus output looks like this: https://github.com/kba/ocr-fileformat-samples/blob/master/samples/hocr/1.1/433934212_0017.html

I can make this change, but I just want to make sure that we agree here.
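
For illustration, here is a minimal sketch of what the corrected metadata check could look like. It assumes only the Python standard library, not the parser hocr-check actually uses; `MetaNames` and `has_ocr_system` are hypothetical names, and the sample document (including its `content` values) is made up.

```python
from html.parser import HTMLParser


class MetaNames(HTMLParser):
    """Collect the name attribute of every <meta> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.names = set()

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> tags are also routed here by HTMLParser.
        if tag == "meta":
            name = dict(attrs).get("name")
            if name:
                self.names.add(name)


def has_ocr_system(hocr_text):
    """Return True if the document declares a <meta name='ocr-system'> element."""
    parser = MetaNames()
    parser.feed(hocr_text)
    return "ocr-system" in parser.names


# Made-up sample header, for demonstration only.
sample = """<html><head>
<meta name='ocr-system' content='tesseract 3.04' />
<meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_line' />
</head><body></body></html>"""

print(has_ocr_system(sample))
```

A file that only carries the old `ocr-id` name would fail this check, which is the behaviour the fix is meant to produce.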

kba · 2016-09-15T18:22:27Z

Before you implement this further, I have a branch somewhere where I've done that, let me check.

Several actually.

kba · 2016-09-15T18:33:52Z

Ah, no I did not actually fix it, so yeah: ocr-id should be ocr-system. I would probably remove the containment checks and concentrate on the bounding box overlap check since that is actually an interesting check. And I don't think it's in the specs BTW.

Should we maybe use hocr-spec-python as a base to reorganize this (#42)? The parsing of title= attributes could come in handy.
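
A rough sketch of the two pieces mentioned above, parsing the `bbox` property out of a `title=` attribute and testing two boxes for overlap, might look like this. It is an assumption-laden illustration, not code from hocr-check or hocr-spec-python; `parse_bbox` and `boxes_overlap` are hypothetical names, and only the `bbox x0 y0 x1 y1` property is handled.

```python
import re

# Matches the hOCR "bbox x0 y0 x1 y1" property inside a title attribute.
BBOX_RE = re.compile(r"\bbbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


def parse_bbox(title):
    """Extract (x0, y0, x1, y1) from a title attribute, or None if absent."""
    match = BBOX_RE.search(title)
    if match is None:
        return None
    return tuple(int(value) for value in match.groups())


def boxes_overlap(a, b):
    """Return True if two boxes share a region of non-zero area."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1


# Made-up title values, for demonstration only.
first = parse_bbox("bbox 10 10 200 40; baseline 0 -5")
second = parse_bbox("bbox 150 30 300 60")
print(first, second, boxes_overlap(first, second))
```

Boxes that merely touch along an edge are not counted as overlapping here; whether that is the right rule for sibling elements is a design choice left open.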

CharlesNepote · 2016-09-15T19:00:03Z

I agree that ocr-id should be ocr-system.

zuphilip · 2016-09-15T19:46:33Z

I created a simple fix in PR #61. We can also improve hocr-check beyond that, but maybe we should open a separate issue to discuss it further. I don't completely understand your comment, @kba, and I am not sure if I agree...

kba · 2016-09-15T20:11:09Z

Checking whether ocr_line is within ocr_page does not tell you much. Of course it should be checked but it's rather minor and only checking three classes seems arbitrary.

What's more important is that ocr_column is not in the spec anymore; this should be ocr_carea.

See https://github.com/kba/hocr-spec/blob/master/hocr-spec.md#ocr_column
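
To make the containment check concrete, here is a small sketch that applies to any parent/child pair rather than a fixed set of three classes. The flat `(css_class, bbox, parent_index)` layout and the names `contains` and `check_nesting` are assumptions made for this example, not hocr-check's data model, and the coordinates are made up.

```python
def contains(outer, inner):
    """Return True if box inner lies entirely within box outer (edges may touch)."""
    ox0, oy0, ox1, oy1 = outer
    ix0, iy0, ix1, iy1 = inner
    return ox0 <= ix0 and oy0 <= iy0 and ix1 <= ox1 and iy1 <= oy1


def check_nesting(elements):
    """Yield (child_class, parent_class) pairs whose boxes violate containment.

    elements is a list of (css_class, bbox, parent_index) tuples, where
    parent_index is None for the root element.
    """
    for css_class, bbox, parent_index in elements:
        if parent_index is None:
            continue
        parent_class, parent_bbox, _ = elements[parent_index]
        if not contains(parent_bbox, bbox):
            yield css_class, parent_class


# Made-up page: the line sticks out of its ocr_carea on the right-hand side.
page = [
    ("ocr_page", (0, 0, 1000, 1400), None),
    ("ocr_carea", (50, 50, 950, 700), 0),
    ("ocr_line", (60, 60, 980, 100), 1),
]
violations = list(check_nesting(page))
print(violations)
```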

Fix metadata check: ocr-id -> ocr-system, #59

kba mentioned this issue Sep 15, 2016

Typo leads to error in property parsing kba/hocr-spec-python#1

Closed

zuphilip added a commit to UB-Mannheim/hocr-tools that referenced this issue Sep 15, 2016

Fix metadata check: ocr-id -> ocr-system, ocropus#59

89a8e1d

zuphilip added a commit to UB-Mannheim/hocr-tools that referenced this issue Sep 15, 2016

Fix metadata check: ocr-id -> ocr-system, ocropus#59

398dc31

zuphilip added the bug label Sep 15, 2016

kba added a commit to kba/hocr-tools that referenced this issue Sep 15, 2016

ocr_column should be ocr_carea, ocropus#59, ocropus#61

d30c086

See https://github.com/kba/hocr-spec/blob/master/hocr-spec.md#ocr_column

kba mentioned this issue Sep 15, 2016

ocr_column should be ocr_carea #62

Merged

zuphilip added a commit to UB-Mannheim/hocr-tools that referenced this issue Sep 16, 2016

Fix metadata check: ocr-id -> ocr-system, ocropus#59

c25cc74

zuphilip added a commit that referenced this issue Sep 16, 2016

Merge pull request #61 from UB-Mannheim/id-system

fff7c99

Fix metadata check: ocr-id -> ocr-system, #59

zuphilip closed this as completed Sep 16, 2016
