LayoutLMv2FeatureExtractor now supports non-English languages when applying Tesseract OCR. #14514

Merged

Xargonus
Contributor

What does this PR do?

This PR adds an ocr_lang argument to the __init__ method of LayoutLMv2FeatureExtractor, which specifies which Tesseract language model to use when applying Tesseract OCR.
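
For illustration, here is a minimal sketch of how a language code could travel from the constructor to the OCR call. The names `FeatureExtractorSketch`, `apply_tesseract_sketch`, and the `backend` parameter are hypothetical and exist only for this example; they are not the code merged in this PR. The default backend assumes `pytesseract` and the requested Tesseract language data are installed.

```python
from typing import Any, Callable, Dict, List, Optional

# Hypothetical type alias: a backend takes (image, lang) and returns Tesseract-style data.
OcrBackend = Callable[[Any, Optional[str]], Dict[str, List[Any]]]


def pytesseract_backend(image: Any, lang: Optional[str]) -> Dict[str, List[Any]]:
    """Run Tesseract via pytesseract (imported lazily so the sketch loads without it)."""
    import pytesseract

    # With lang=None no language is requested, so Tesseract uses its default model.
    return pytesseract.image_to_data(image, lang=lang, output_type="dict")


def apply_tesseract_sketch(
    image: Any,
    lang: Optional[str] = None,
    backend: OcrBackend = pytesseract_backend,
) -> List[str]:
    """Return the non-empty words recognized in the image."""
    data = backend(image, lang)
    return [word for word in data["text"] if word.strip()]


class FeatureExtractorSketch:
    """Hypothetical stand-in showing where an ocr_lang argument would be stored and used."""

    def __init__(
        self,
        apply_ocr: bool = True,
        ocr_lang: Optional[str] = None,
        backend: OcrBackend = pytesseract_backend,
    ) -> None:
        self.apply_ocr = apply_ocr
        self.ocr_lang = ocr_lang
        self._backend = backend

    def __call__(self, image: Any) -> List[str]:
        if not self.apply_ocr:
            return []
        # The stored language code is forwarded unchanged to the OCR call.
        return apply_tesseract_sketch(image, self.ocr_lang, self._backend)
```

Under these assumptions, `FeatureExtractorSketch(ocr_lang="deu")(image)` would request the German model, provided that language data is installed alongside Tesseract.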

Fixes #14511

@NielsRogge

… the ocr_lang argument should be a language code.
…v2.py

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
@NielsRogge
Contributor

NielsRogge commented Nov 24, 2021

Can you verify the slow tests (which are not run by the CI) are passing as well?

i.e. RUN_SLOW=yes pytest tests/test_feature_extraction_layoutlmv2.py and RUN_SLOW=yes pytest tests/test_feature_processor_layoutlmv2.py
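
As background on why these tests do not run in CI by default, an environment flag such as RUN_SLOW is typically read by a decorator that skips the test unless the flag is set. The sketch below shows that pattern with the standard library only; `parse_flag_from_env` and `slow` here are illustrative helpers written for this example and are not claimed to match the repository's own implementation.

```python
import os
import unittest


def parse_flag_from_env(key: str, default: bool = False) -> bool:
    """Interpret an environment variable as a boolean flag."""
    value = os.environ.get(key)
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "y", "on"}


def slow(test_case):
    """Skip the decorated test unless RUN_SLOW is truthy when the decorator is applied."""
    return unittest.skipUnless(parse_flag_from_env("RUN_SLOW"), "test is slow")(test_case)


class ExampleTest(unittest.TestCase):
    @slow
    def test_expensive_path(self):
        # Placeholder body; a real slow test would exercise OCR end to end.
        self.assertTrue(True)
```

Because the flag is read when the decorator is applied, it has to be set before the test module is imported, which is what prefixing the pytest command with `RUN_SLOW=yes` achieves.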

@Xargonus
Contributor Author

Xargonus commented Nov 24, 2021

I assume you meant RUN_SLOW=yes pytest tests/test_feature_extraction_layoutlmv2.py and RUN_SLOW=yes pytest tests/test_processor_layoutlmv2.py?

I am developing on Windows, so my options for installing Tesseract are limited to the available installer versions. After moving from v5.0.0 to v4.1.0, the closest available version to v4.1.1 (the version used to get the bboxes in test_feature_extraction_layoutlmv2.py), all of these tests ran successfully (there are no slow ones in that file).

As for test_processor_layoutlmv2.py, test_processor_case1 failed for me, even on the master branch. I was unable to reproduce on my computer the environment used to develop these tests. The cause is probably a different Tesseract version or model. The error comes from this line:

self.assertSequenceEqual(decoding, expected_decoding)
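
Since the expected values depend on the installed Tesseract version, one way to make such a mismatch visible is to compare the local version against the one the fixtures were generated with. The sketch below is only an illustration: `read_tesseract_banner`, `parse_tesseract_version`, and `same_minor_release` are hypothetical helpers, the `tesseract` binary is assumed to be on PATH, and treating a matching major.minor as "close enough" is an assumption rather than a guarantee.

```python
import re
import subprocess
from typing import Optional, Tuple

_VERSION_RE = re.compile(r"(\d+)\.(\d+)\.(\d+)")


def read_tesseract_banner() -> str:
    """Return the output of `tesseract --version` (assumes the binary is on PATH)."""
    result = subprocess.run(
        ["tesseract", "--version"], capture_output=True, text=True, check=False
    )
    # Read both streams so the banner is found whichever one it is written to.
    return result.stdout + result.stderr


def parse_tesseract_version(banner: str) -> Optional[Tuple[int, int, int]]:
    """Extract the first major.minor.patch triple found in the banner text."""
    match = _VERSION_RE.search(banner)
    if match is None:
        return None
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch)


def same_minor_release(installed: Tuple[int, int, int], reference: Tuple[int, int, int]) -> bool:
    """Heuristic (an assumption): versions sharing major.minor are treated as comparable."""
    return installed[:2] == reference[:2]
```

A test could use these helpers to skip or annotate OCR-dependent assertions when the local version differs from the reference, instead of failing with an opaque sequence mismatch.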

…v2.py

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
Contributor

@NielsRogge NielsRogge left a comment

LGTM! Thanks for adding.

Regarding the failing tests: these are probably caused by the slightly different version of Tesseract, so that's OK with me.

Member

@LysandreJik LysandreJik left a comment

Great, concise addition! Thanks @Xargonus, LGTM

@LysandreJik LysandreJik merged commit 4ee0b75 into huggingface:master Nov 29, 2021
Albertobegue pushed a commit to Albertobegue/transformers that referenced this pull request Jan 27, 2022
…plying Tesseract OCR. (huggingface#14514)

* Added the lang argument to apply_tesseract in feature_extraction_layoutlmv2.py, which is used in pytesseract.image_to_data.

* Added ocr_lang argument to LayoutLMv2FeatureExtractor.__init__, which is used when calling apply_tesseract

* Updated the documentation of the LayoutLMv2FeatureExtractor

* Specified in the documentation of the LayoutLMv2FeatureExtractor that the ocr_lang argument should be a language code.

* Update src/transformers/models/layoutlmv2/feature_extraction_layoutlmv2.py

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* Split comment into two lines to adhere to the max line size limit.

* Update src/transformers/models/layoutlmv2/feature_extraction_layoutlmv2.py

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
Successfully merging this pull request may close these issues.

LayoutXLMProcessor applies the english Tesseract model