LayoutLMv2FeatureExtractor now supports non-English languages when applying Tesseract OCR. #14514

Xargonus · 2021-11-24T12:24:40Z

What does this PR do?

This PR adds an ocr_lang argument to the __init__ method of LayoutLMv2FeatureExtractor, which specifies the Tesseract language model to use when applying Tesseract OCR.
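The data flow described above can be sketched as follows. This is a minimal illustration, not the code from this PR: `FeatureExtractorSketch`, `apply_ocr_sketch`, and `fake_backend` are hypothetical stand-ins so the example runs without Tesseract installed, and the `None` default for `ocr_lang` is an assumption. In the actual change, the language code is forwarded to `pytesseract.image_to_data` through `apply_tesseract`.

```python
from typing import Callable, List, Optional

# Hypothetical OCR backend signature: (image, lang) -> recognized words.
# In the PR, this role is played by pytesseract.image_to_data(..., lang=lang).
OcrBackend = Callable[[object, Optional[str]], List[str]]


def fake_backend(image: object, lang: Optional[str]) -> List[str]:
    """Stand-in backend that only reports which language it was asked to use."""
    return [f"lang={lang or 'default'}"]


def apply_ocr_sketch(image: object, lang: Optional[str], backend: OcrBackend) -> List[str]:
    """Plays the role of apply_tesseract: forward `lang` to the OCR call."""
    return backend(image, lang)


class FeatureExtractorSketch:
    """Illustrative only; this is not the transformers class."""

    def __init__(
        self,
        apply_ocr: bool = True,
        ocr_lang: Optional[str] = None,  # assumed default; a code such as "deu"
        backend: OcrBackend = fake_backend,
    ) -> None:
        self.apply_ocr = apply_ocr
        self.ocr_lang = ocr_lang
        self.backend = backend

    def __call__(self, images: List[object]) -> List[List[str]]:
        # When OCR is disabled, no words are produced for any image.
        if not self.apply_ocr:
            return [[] for _ in images]
        # The language stored at construction time is passed on every call.
        return [apply_ocr_sketch(img, self.ocr_lang, self.backend) for img in images]


german = FeatureExtractorSketch(ocr_lang="deu")
default = FeatureExtractorSketch()
print(german(["page-1"]))   # [['lang=deu']]
print(default(["page-1"]))  # [['lang=default']]
```

The point of the sketch is that the language is chosen once, at construction time, and every later OCR call picks it up without any change to the call site.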

Fixes #14511

@NielsRogge

NielsRogge · 2021-11-24T15:42:32Z

Can you verify the slow tests (which are not run by the CI) are passing as well?

i.e. RUN_SLOW=yes pytest tests/test_feature_extraction_layoutlmv2.py and RUN_SLOW=yes pytest tests/test_feature_processor_layoutlmv2.py

Xargonus · 2021-11-24T16:44:33Z

I assume you meant RUN_SLOW=yes pytest tests/test_feature_extraction_layoutlmv2.py and RUN_SLOW=yes pytest tests/test_processor_layoutlmv2.py?

I am developing on Windows, so my choice of Tesseract versions is limited to the available installers. After moving from v5.0.0 to v4.1.0, the closest available release to v4.1.1 (the version used to generate the bounding boxes in test_feature_extraction_layoutlmv2.py), all of the tests in that file passed (none of them are marked as slow).

As for test_processor_layoutlmv2.py, test_processor_case1 failed for me, even on the master branch. I was unable to reproduce the environment used to develop these tests on my computer, so the cause is probably a different Tesseract version or model. The error comes from this line:

transformers/tests/test_processor_layoutlmv2.py

Line 210 in 3772af4

self.assertSequenceEqual(decoding, expected_decoding)
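To make a version mismatch like this easier to spot, the installed Tesseract version can be recorded before running the tests. The following is a minimal sketch and not part of this PR: the helper names `parse_tesseract_version` and `installed_tesseract_version` are hypothetical, and it assumes the `tesseract` executable is on PATH and reports its version on the first line of `tesseract --version`.

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple

Version = Tuple[int, int, int]


def parse_tesseract_version(banner: str) -> Optional[Version]:
    """Extract (major, minor, patch) from the first line of a version banner."""
    lines = banner.strip().splitlines()
    if not lines:
        return None
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", lines[0])
    if match is None:
        return None
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


def installed_tesseract_version() -> Optional[Version]:
    """Return the local Tesseract version, or None if it cannot be determined."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    result = subprocess.run([exe, "--version"], capture_output=True, text=True)
    # Some builds write the banner to stderr rather than stdout.
    return parse_tesseract_version(result.stdout or result.stderr)


if __name__ == "__main__":
    expected = (4, 1, 1)  # version mentioned in this thread for the test fixtures
    found = installed_tesseract_version()
    print(f"expected {expected}, found {found}")
```

Comparing the reported tuple against (4, 1, 1) gives a quick hint about whether differing OCR output is likely to come from the Tesseract build rather than from the code under test.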

NielsRogge

LGTM! Thanks for adding.

Regarding the failing tests: these are probably caused by the slightly different version of Tesseract, so that's OK with me.

LysandreJik

Great, concise addition! Thanks @Xargonus, LGTM

…plying Tesseract OCR. (huggingface#14514)

* Added the lang argument to apply_tesseract in feature_extraction_layoutlmv2.py, which is used in pytesseract.image_to_data.
* Added ocr_lang argument to LayoutLMv2FeatureExtractor.__init__, which is used when calling apply_tesseract
* Updated the documentation of the LayoutLMv2FeatureExtractor
* Specified in the documentation of the LayoutLMv2FeatureExtractor that the ocr_lang argument should be a language code.
* Update src/transformers/models/layoutlmv2/feature_extraction_layoutlmv2.py Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
* Split comment into two lines to adhere to the max line size limit.
* Update src/transformers/models/layoutlmv2/feature_extraction_layoutlmv2.py Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

Xargonus added 3 commits November 24, 2021 13:01

Added the lang argument to apply_tesseract in feature_extraction_layoutlmv2.py, which is used in pytesseract.image_to_data.

3ad64ca

Added ocr_lang argument to LayoutLMv2FeatureExtractor.__init__, which is used when calling apply_tesseract

95fe37d

Updated the documentation of the LayoutLMv2FeatureExtractor

1d3f3e3

NielsRogge reviewed Nov 24, 2021

src/transformers/models/layoutlmv2/feature_extraction_layoutlmv2.py (outdated)

Specified in the documentation of the LayoutLMv2FeatureExtractor that the ocr_lang argument should be a language code.

ccfaa07

NielsRogge reviewed Nov 24, 2021

src/transformers/models/layoutlmv2/feature_extraction_layoutlmv2.py (outdated)

Update src/transformers/models/layoutlmv2/feature_extraction_layoutlmv2.py

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

83d6608

Split comment into two lines to adhere to the max line size limit.

d550ea5

NielsRogge requested a review from LysandreJik November 25, 2021 14:35

NielsRogge reviewed Nov 25, 2021

src/transformers/models/layoutlmv2/feature_extraction_layoutlmv2.py (outdated)

Update src/transformers/models/layoutlmv2/feature_extraction_layoutlmv2.py

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

2ae5d82

NielsRogge approved these changes Nov 25, 2021

LysandreJik approved these changes Nov 29, 2021

LysandreJik merged commit 4ee0b75 into huggingface:master Nov 29, 2021
