Creating a custom plugin for improved rotation #712

Closed
C-monC opened this issue Jan 11, 2021 · 6 comments

@C-monC

C-monC commented Jan 11, 2021

Hi,

Where is the value returned by get_orientation passed to Tesseract, and how do you run the OCR step when writing your own plugin?

I would like to OCR the page in all four orientations if the rotation confidence is low, then compare the total number of English words in each result and add only the highest-scoring orientation to the output PDF. I'm struggling to find the file in which the OCR is executed and where the PDF is generated.
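A minimal sketch of the selection step described above, in plain Python. It is not OCRmyPDF code: `ocr_at` is a hypothetical callable that returns the recognized text for a page rotated by a given angle, and `vocabulary` is an assumed set of lowercase English words. Only the "score each orientation and keep the best" logic is shown.

```python
from typing import Callable, Sequence, Set

ROTATIONS = (0, 90, 180, 270)


def count_known_words(text: str, vocabulary: Set[str]) -> int:
    """Count whitespace-separated tokens that appear in the vocabulary."""
    # Strip common punctuation and lowercase before the lookup.
    tokens = (t.strip(".,;:!?()[]\"'").lower() for t in text.split())
    return sum(1 for t in tokens if t in vocabulary)


def pick_best_rotation(
    ocr_at: Callable[[int], str],  # hypothetical: angle -> recognized text
    vocabulary: Set[str],
    rotations: Sequence[int] = ROTATIONS,
) -> int:
    """Return the rotation whose OCR text contains the most known words."""
    scores = {angle: count_known_words(ocr_at(angle), vocabulary)
              for angle in rotations}
    # On a tie, max() keeps the first candidate, so 0 degrees wins by default.
    return max(rotations, key=lambda angle: scores[angle])
```

Running OCR four times per page is expensive, so this would only make sense as a fallback for pages where the orientation confidence is low.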

Kind regards,
Simon

@jbarlow83
Collaborator

The default plugins are in src/ocrmypdf/builtin_plugins/, and some plugins used for testing (e.g. to bypass OCR for a test) are in tests/plugins/. Have a look at the file tesseract_cache.py for a template; it "subclasses" the standard OCR engine class to add caching for the test suite.

At low confidence, the input is probably quite noisy and will give a low number of word matches in all orientations. Your strategy might help in some cases but will probably give results quite similar to the existing orientation confidence.

I would instead invest effort in cleaning the input image before sending it to Tesseract. Sometimes you can apply domain specific knowledge - you probably know something about your input that Tesseract doesn't. That's usually where you can help it most.

@C-monC
Author

C-monC commented Jan 12, 2021

Thank you for the insight. Do you think adding --user-words will improve the OCR significantly? The documents are full of engineering jargon/acronyms.
I see the documentation does not have a section on adding user words, but you mentioned it in a previous release several years ago.

@C-monC
Author

C-monC commented Jan 12, 2021

I see there is the option for parsing keywords and user_words. I assume the user_words points to your word list like in Tesseract's docs. What is the difference between the two?

@jbarlow83
Collaborator

--keywords is a list of metadata keywords to attach to the document. There is no parsing - your keywords are added verbatim.

The improvement from a word list is modest, not significant. ocrmypdf just passes the information on to Tesseract. --user-words means use a Tesseract word list.
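As a rough illustration of how the two options differ, the sketch below formats a jargon list as a word-list file body and assembles an ocrmypdf command line. The helper names `format_user_words` and `build_command` are made up for this example, and the one-word-per-line layout is an assumption about what Tesseract expects from a user word list. The command is only built, not executed.

```python
from typing import Iterable, List


def format_user_words(words: Iterable[str]) -> str:
    """Return word-list file contents: unique words, one per line (assumed format)."""
    unique = sorted({w.strip() for w in words if w.strip()})
    return "\n".join(unique) + "\n"


def build_command(
    input_pdf: str,
    output_pdf: str,
    user_words_path: str,
    keywords: str = "",
) -> List[str]:
    """Assemble an ocrmypdf argument list without running it."""
    # --user-words is passed through to Tesseract as a word list file.
    cmd = ["ocrmypdf", "--user-words", user_words_path]
    if keywords:
        # --keywords only sets document metadata; it does not affect OCR.
        cmd += ["--keywords", keywords]
    cmd += [input_pdf, output_pdf]
    return cmd
```

The string from `format_user_words` would be written to a text file, and that file's path passed as `user_words_path`; the resulting list can then be handed to `subprocess.run`.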

@C-monC
Author

C-monC commented Jan 14, 2021

Thank you for the help. The main rotation issues come from PowerPoint slides that are rotated. The text is very clear, but I believe the background gradients/colouring are throwing Tesseract off.

@C-monC C-monC closed this as completed Jan 14, 2021
@jbarlow83
Collaborator

jbarlow83 commented Jan 14, 2021 via email
