LayoutLMv2FeatureExtractor now supports non-English languages when applying Tesseract OCR. #14514
…utlmv2.py, which is used in pytesseract.image_to_data.
… is used when calling apply_tesseract
… the ocr_lang argument should be a language code.
…v2.py Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
Can you verify the slow tests (which are not run by the CI) are passing as well? i.e. |
I assume you meant I am developing on Windows, therefore my options when installing tesseract are limited to available installer versions. After moving from v5.0.0 to v4.1.0, which is the closest to v4.1.1, the version used to get bboxes in the As for
…v2.py Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
LGTM! Thanks for adding.
Regarding the failing tests: these probably have to do with the slightly different version of Tesseract, so that's OK with me.
Great, concise addition! Thanks @Xargonus, LGTM
…plying Tesseract OCR. (huggingface#14514)
* Added the lang argument to apply_tesseract in feature_extraction_layoutlmv2.py, which is used in pytesseract.image_to_data.
* Added ocr_lang argument to LayoutLMv2FeatureExtractor.__init__, which is used when calling apply_tesseract
* Updated the documentation of the LayoutLMv2FeatureExtractor
* Specified in the documentation of the LayoutLMv2FeatureExtractor that the ocr_lang argument should be a language code.
* Update src/transformers/models/layoutlmv2/feature_extraction_layoutlmv2.py
  Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
* Split comment into two lines to adhere to the max line size limit.
* Update src/transformers/models/layoutlmv2/feature_extraction_layoutlmv2.py
  Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
What does this PR do?
This PR adds an ocr_lang argument to the __init__ method of LayoutLMv2FeatureExtractor, which specifies which Tesseract model to use when applying Tesseract OCR.
Fixes #14511
@NielsRogge
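For readers skimming the thread, here is a minimal sketch of the plumbing the description outlines: a language code given at construction time is stored and later forwarded to the OCR call. This is not the code merged in this PR. `FeatureExtractorSketch`, `apply_tesseract_sketch`, `fake_image_to_data`, and `ocr_calls` are hypothetical names, and the fake backend stands in for `pytesseract.image_to_data` so the example runs without Tesseract installed. The `None` default for `ocr_lang` is also an assumption.

```python
from typing import Callable, Dict, List, Optional

# Records the language each OCR call received, so the plumbing can be inspected.
ocr_calls: List[Optional[str]] = []


def fake_image_to_data(image, lang: Optional[str] = None, output_type: str = "dict") -> Dict[str, List[str]]:
    """Hypothetical stand-in for pytesseract.image_to_data (no Tesseract needed)."""
    ocr_calls.append(lang)
    return {"text": ["Hallo", "", "Welt"]}


def apply_tesseract_sketch(image, lang: Optional[str], ocr_fn: Callable = fake_image_to_data) -> List[str]:
    """Run OCR on `image`, forwarding `lang` to the OCR backend."""
    data = ocr_fn(image, lang=lang, output_type="dict")
    # Keep only non-empty words.
    return [word for word in data["text"] if word.strip()]


class FeatureExtractorSketch:
    """Hypothetical, simplified mirror of the constructor change described above."""

    def __init__(self, apply_ocr: bool = True, ocr_lang: Optional[str] = None):
        self.apply_ocr = apply_ocr
        # Tesseract language code such as "deu"; None (assumed default) leaves
        # the choice of language to the OCR backend.
        self.ocr_lang = ocr_lang

    def __call__(self, image) -> Optional[List[str]]:
        if not self.apply_ocr:
            return None
        return apply_tesseract_sketch(image, self.ocr_lang)


extractor = FeatureExtractorSketch(ocr_lang="deu")
words = extractor(image="placeholder-image")
print(words, ocr_calls)
```

With the real class, the equivalent call would presumably be `LayoutLMv2FeatureExtractor(ocr_lang="deu")`, assuming the matching Tesseract language data is installed locally.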