hocr-check complains assert doc.xpath("//meta[@name='ocr-id']")!=[] #59

Closed
CharlesNepote opened this issue Sep 15, 2016 · 11 comments

@CharlesNepote

Can be reproduced with both tesseract and gImageReader hOCR files.
manisandro/gImageReader#101

Does the script stop at this error, or does it still check for the other issues?

@kba
Contributor

kba commented Sep 15, 2016

I'd recommend using hocr-spec-python. It's an explicit replacement for hocr-check, which is more or less nonsensical at the moment.

@kba
Contributor

kba commented Sep 15, 2016

You can try it online here: http://digi.bib.uni-mannheim.de/ocr-fileformat/#validate

@CharlesNepote
Author

CharlesNepote commented Sep 15, 2016

Every file I try to upload to hocr-spec-python online -- http://digi.bib.uni-mannheim.de/ocr-fileformat/#validate -- ends with "NameError: global name 'KeyErrora' is not defined" (from tesseract or gImageReader).

@kba
Contributor

kba commented Sep 15, 2016

@kba
Contributor

kba commented Sep 15, 2016

Should be fixed and deployed. BTW: if you have any samples, preferably complicated ones, you're welcome to contribute them to ocr-fileformat-samples.

@zuphilip
Collaborator

However, we should also fix the main problem here in hocr-check:

I can make this change, but I just want to make sure that we agree here.

@kba
Contributor

kba commented Sep 15, 2016

Before you implement this further, I have a branch somewhere where I've done that, let me check.

Several actually.

@kba
Contributor

kba commented Sep 15, 2016

Ah, no, I did not actually fix it, so yeah: ocr-id should be ocr-system. I would probably remove the containment checks and concentrate on the bounding box overlap check, since that is actually an interesting check. And I don't think it's in the specs, BTW.

Should we maybe use hocr-spec-python as a base to reorganize this (#42)? The parsing of title= attributes could come in handy.
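For illustration, here is a minimal sketch of what parsing `title=` attributes and a bounding box overlap check could look like. It is not code from hocr-check or hocr-spec-python; the helper names `parse_title`, `bbox`, and `overlaps` are made up for this example, and it assumes simple titles in which properties are separated by `;` and no value contains a semicolon.

```python
def parse_title(title):
    """Split an hOCR title attribute into {property: [values, ...]}.

    Simplification (assumption): no property value contains a ';'.
    """
    props = {}
    for part in title.split(";"):
        tokens = part.split()
        if tokens:
            props[tokens[0]] = tokens[1:]
    return props


def bbox(title):
    """Return (x0, y0, x1, y1) as ints, or None if there is no usable bbox."""
    values = parse_title(title).get("bbox")
    if values is None or len(values) != 4:
        return None
    return tuple(int(v) for v in values)


def overlaps(a, b):
    """True if two boxes share interior area; boxes that only touch do not count."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


# Hypothetical title strings, as an OCR engine might emit them for two words.
first = bbox("bbox 10 20 110 60; x_wconf 93")
second = bbox("bbox 100 20 200 60; x_wconf 88")
print(first, second, overlaps(first, second))
```

Whether boxes that merely share an edge should count as overlapping is a design choice; this sketch treats them as non-overlapping.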

@CharlesNepote
Author

I agree that ocr-id should be ocr-system.
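As a rough sketch of the metadata check being discussed, the snippet below looks for a `meta` element by name using only the standard library, instead of the lxml `xpath` call quoted in the issue title. The helper `has_meta` and the sample document are assumptions made for this example, not the actual hocr-check code.

```python
import xml.etree.ElementTree as ET


def has_meta(root, name):
    """Return True if a <meta name="..."> element exists, ignoring XML namespaces."""
    for el in root.iter():
        # ElementTree spells namespaced tags as "{uri}meta"; keep the local part only.
        local = el.tag.rsplit("}", 1)[-1]
        if local == "meta" and el.get("name") == name:
            return True
    return False


# Hypothetical, heavily shortened hOCR file for demonstration purposes.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <meta name="ocr-system" content="tesseract 3.04" />
  <meta name="ocr-capabilities" content="ocr_page ocr_carea ocr_par ocr_line ocrx_word" />
 </head>
 <body><div class="ocr_page" title="bbox 0 0 100 100"></div></body>
</html>"""

root = ET.fromstring(SAMPLE)
print("ocr-system present:", has_meta(root, "ocr-system"))
print("ocr-id present:", has_meta(root, "ocr-id"))
```

A file like this sample passes a check for `ocr-system` but fails one for `ocr-id`, which matches the behaviour reported at the top of this issue.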

zuphilip added a commit to UB-Mannheim/hocr-tools that referenced this issue Sep 15, 2016
zuphilip added a commit to UB-Mannheim/hocr-tools that referenced this issue Sep 15, 2016
zuphilip added the bug label Sep 15, 2016
@zuphilip
Collaborator

zuphilip commented Sep 15, 2016

I created a simple fix in PR #61. We can also improve hocr-check beyond that, but maybe we should open another issue to discuss that further. @kba, I don't completely understand your comment, and I am not sure I agree...

@kba
Contributor

kba commented Sep 15, 2016

Checking whether ocr_line is within ocr_page does not tell you much. Of course it should be checked, but it's rather minor, and only checking three classes seems arbitrary.

What's more important is that ocr_column is not in the spec anymore; this should be ocr_carea.
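One way to avoid hard-coding three classes would be to compare every element that carries a bbox against its nearest ancestor that also has one. The following is only a sketch under that assumption, not what hocr-check currently does; `get_bbox`, `contains`, and `containment_errors` are hypothetical names, and the sample fragment is invented.

```python
import re
import xml.etree.ElementTree as ET

# Matches "bbox x0 y0 x1 y1" anywhere inside a title attribute.
BBOX_RE = re.compile(r"\bbbox\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)")


def get_bbox(el):
    """Return the element's bbox as a tuple of ints, or None if it has none."""
    match = BBOX_RE.search(el.get("title", ""))
    return tuple(int(g) for g in match.groups()) if match else None


def contains(outer, inner):
    """True if inner lies completely inside outer (shared edges are allowed)."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def containment_errors(el, parent_box=None, errors=None):
    """Collect (class, bbox) for each element that sticks out of its nearest boxed ancestor."""
    if errors is None:
        errors = []
    box = get_bbox(el)
    if box is not None and parent_box is not None and not contains(parent_box, box):
        errors.append((el.get("class"), box))
    # Children are compared against this element's box, or the inherited one if it has none.
    next_box = box if box is not None else parent_box
    for child in el:
        containment_errors(child, next_box, errors)
    return errors


# Hypothetical fragment using ocr_carea rather than the dropped ocr_column.
SAMPLE = """<div class="ocr_page" title="bbox 0 0 200 300">
  <div class="ocr_carea" title="bbox 10 10 190 100">
    <span class="ocr_line" title="bbox 20 20 180 40">ok</span>
    <span class="ocr_line" title="bbox 20 50 195 70">too wide</span>
  </div>
</div>"""

errors = containment_errors(ET.fromstring(SAMPLE))
print(errors)
```

Because the walk does not name any class, it applies equally to ocr_page, ocr_carea, ocr_par, ocr_line, and word-level elements.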

zuphilip added a commit to UB-Mannheim/hocr-tools that referenced this issue Sep 16, 2016
zuphilip added a commit that referenced this issue Sep 16, 2016
Fix metadata check: ocr-id -> ocr-system, #59