Creating a custom plugin for improved rotation #712

C-monC · 2021-01-11T06:20:47Z

Hi,

Where is the value returned by get_orientation passed to Tesseract, and how do you execute the OCR when writing your own plugin?

I would like to OCR the page in all four orientations when the rotation confidence is low, then compare the number of English words recognized in each and add only the best-scoring orientation to the output PDF. I'm struggling to find the file in which the OCR is executed and where the PDF is generated.

Kind regards,
Simon
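
The selection step described above (OCR at each rotation, count recognized words, keep the best) could be sketched roughly as follows. This is only an illustration: `pick_best_rotation`, `count_known_words` and the `ocr_at_angle` callable are hypothetical names, not part of ocrmypdf or Tesseract, and the actual OCR call is left to the caller.

```python
import re
from typing import Callable, Iterable, Set

# The four page orientations to try, in degrees.
ANGLES = (0, 90, 180, 270)


def count_known_words(text: str, vocabulary: Set[str]) -> int:
    """Count alphabetic tokens in `text` that appear in `vocabulary`.

    `vocabulary` is expected to contain lowercase words.
    """
    tokens = re.findall(r"[A-Za-z]+", text)
    return sum(1 for token in tokens if token.lower() in vocabulary)


def pick_best_rotation(
    ocr_at_angle: Callable[[int], str],  # hypothetical: returns OCR text for one angle
    vocabulary: Iterable[str],
) -> int:
    """Return the angle whose OCR text contains the most vocabulary words."""
    vocab = {word.lower() for word in vocabulary}
    scores = {
        angle: count_known_words(ocr_at_angle(angle), vocab) for angle in ANGLES
    }
    # max() keeps the first best entry, so ties fall back to the smallest angle.
    return max(ANGLES, key=lambda angle: scores[angle])
```

Passing the OCR step in as a callable keeps the comparison logic independent of how the text is produced, so it can be tried on canned strings before wiring it to a real engine.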

jbarlow83 · 2021-01-11T18:13:17Z

The default plugins are in src/ocrmypdf/builtin_plugins/, and some plugins used for testing (e.g. to bypass OCR for a test) are in tests/plugins/. Have a look at the file tesseract_cache.py for a template; it subclasses the standard OCR engine class to add caching for the test suite.

At low confidence, the input is probably quite noisy and will give a low number of word matches in all orientations. Your strategy might help in some cases but will probably give results quite similar to the existing orientation confidence.

I would instead invest effort in cleaning the input image before sending it to Tesseract. Sometimes you can apply domain specific knowledge - you probably know something about your input that Tesseract doesn't. That's usually where you can help it most.

C-monC · 2021-01-12T06:39:45Z

Thank you for the insight. Do you think adding --user-words will improve the OCR significantly? The documents are full of engineering jargon and acronyms.
I see the documentation does not have a section on adding user words, but you mentioned it in a previous release several years ago.

C-monC · 2021-01-12T06:49:55Z

I see there is the option for parsing keywords and user_words. I assume the user_words points to your word list like in Tesseract's docs. What is the difference between the two?

jbarlow83 · 2021-01-12T19:35:22Z

--keywords is a list of metadata keywords to attach to the document. There is no parsing - your keywords are added verbatim.

--user-words points to a Tesseract word list; ocrmypdf just passes it on to Tesseract. The improvement from a word list is modest, not significant.
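
As a rough sketch of preparing such a word list: Tesseract user-word files are plain text with one word per line, so a small helper can deduplicate the jargon and write it out. `write_user_words` and the file name are hypothetical, chosen for illustration only.

```python
from pathlib import Path
from typing import Iterable


def write_user_words(words: Iterable[str], path: Path) -> int:
    """Write unique, non-empty words to `path`, one per line.

    Returns the number of words written. Order of first appearance is kept.
    """
    seen = set()
    unique = []
    for word in words:
        word = word.strip()
        if word and word not in seen:
            seen.add(word)
            unique.append(word)
    path.write_text("".join(f"{word}\n" for word in unique), encoding="utf-8")
    return len(unique)


if __name__ == "__main__":
    # Example jargon list; replace with terms from your own documents.
    count = write_user_words(["HVAC", "P&ID", "flange", "HVAC"], Path("user_words.txt"))
    print(f"wrote {count} words")
```

The resulting file would then be passed along the lines of `ocrmypdf --user-words user_words.txt input.pdf output.pdf`.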

C-monC · 2021-01-14T05:22:47Z

Thank you for the help. The main rotation issues come from PowerPoint slides that are rotated. The text is very clear, but I believe the background gradients/colouring are throwing Tesseract off.

jbarlow83 · 2021-01-14T07:11:58Z

In that case I suggest looking into how ocrmypdf implements the --threshold option and doing the same for rotation. Tesseract isn't that good at thresholding color images to binary, which is the form OCR actually runs on.
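
To illustrate what thresholding to binary means here, below is a minimal pure-Python sketch of Otsu's method on a flat list of 8-bit grayscale values. This is a swapped-in textbook technique, not taken from ocrmypdf's source, and `otsu_threshold` and `binarize` are hypothetical helper names; a real plugin would more likely use an image library and possibly an adaptive method better suited to gradients.

```python
from typing import List, Sequence


def otsu_threshold(pixels: Sequence[int]) -> int:
    """Return the level t (0..255) that maximizes between-class variance.

    Pixels with value <= t form the dark class, the rest the light class.
    """
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    total_sum = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_low, sum_low = 0, 0
    for level in range(256):
        weight_low += histogram[level]
        sum_low += level * histogram[level]
        if weight_low == 0:
            continue  # no dark pixels yet
        weight_high = total - weight_low
        if weight_high == 0:
            break  # every pixel is already in the dark class
        mean_low = sum_low / weight_low
        mean_high = (total_sum - sum_low) / weight_high
        variance = weight_low * weight_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarize(pixels: Sequence[int], threshold: int) -> List[int]:
    """Map each pixel to 0 (dark) or 255 (light) using `threshold`."""
    return [0 if value <= threshold else 255 for value in pixels]
```

With dark text on a lighter gradient background, the chosen threshold would be expected to fall between the text values and the darkest background values, so the gradient collapses to white before orientation detection runs.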


C-monC closed this as completed Jan 14, 2021
