Creating a custom plugin for improved rotation #712
Comments
The default plugins are in src/ocrmypdf/builtin_plugins/, and some plugins used for testing (e.g. to bypass OCR for a test) are in tests/plugins/. Have a look at the file
At low confidence, the input is probably quite noisy and will give a low number of word matches in all orientations. Your strategy might help in some cases, but it will probably give results quite similar to the existing orientation confidence. I would instead invest effort in cleaning the input image before sending it to Tesseract. Sometimes you can apply domain-specific knowledge: you probably know something about your input that Tesseract doesn't. That's usually where you can help it most.
Thank you for the insight. Do you think adding --user-words will improve the OCR significantly? The documents are full of engineering jargon and acronyms.
I see there are options for passing keywords and user_words. I assume user_words points to your word list, as in Tesseract's docs. What is the difference between the two?
The improvement from a word list is modest, not significant. ocrmypdf just passes the information on to Tesseract. --user-words tells Tesseract to use the word list you supply.
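For anyone preparing such a list, here is a minimal sketch of writing one. It assumes the plain-text format described in Tesseract's docs (one word per line); the helper name `write_user_words` and the file name are made up for illustration and are not part of ocrmypdf.

```python
from pathlib import Path
from typing import Iterable


def write_user_words(words: Iterable[str], path: str) -> int:
    """Write a word list, one word per line, and return how many words were written.

    Assumption: the OCR engine expects a plain UTF-8 text file with one word
    per line. Blank entries and exact duplicates are dropped; order is kept.
    """
    seen = set()
    cleaned = []
    for word in words:
        word = word.strip()
        if word and word not in seen:
            seen.add(word)
            cleaned.append(word)
    # Join with a trailing newline after every word (empty file if no words).
    Path(path).write_text("".join(w + "\n" for w in cleaned), encoding="utf-8")
    return len(cleaned)
```

The resulting file could then be passed along the lines of `ocrmypdf --user-words engineering_words.txt input.pdf output.pdf`, where the file names are placeholders.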
Thank you for the help. The main rotation issues come from PowerPoint slides that are rotated. The text is very clear, but I believe the background gradients/colouring are throwing Tesseract off.
In that case I suggest looking into how ocrmypdf implements the --threshold function and doing the same for rotation. Tesseract isn't that good at thresholding color to binary (where OCR happens).
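To make the idea of thresholding color to binary concrete, here is a rough pure-Python sketch of Otsu's method. This is a generic textbook technique chosen for illustration, not ocrmypdf's actual --threshold code; the function names are hypothetical, and real code would more likely use an imaging library than plain lists.

```python
from typing import List, Sequence, Tuple


def to_gray(rgb_pixels: Sequence[Tuple[int, int, int]]) -> List[int]:
    """Convert RGB pixels to 8-bit grayscale using the common BT.601 luma weights."""
    return [min(255, round(0.299 * r + 0.587 * g + 0.114 * b)) for r, g, b in rgb_pixels]


def otsu_threshold(gray_pixels: Sequence[int]) -> int:
    """Return the threshold t (0..255) that maximizes between-class variance.

    Pixels <= t form one class and pixels > t form the other.
    """
    hist = [0] * 256
    for p in gray_pixels:
        hist[p] += 1
    total = len(gray_pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels at or below the candidate threshold
    sum0 = 0  # sum of intensities at or below the candidate threshold
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # no pixels in the lower class yet
        w1 = total - w0
        if w1 == 0:
            break  # every pixel is already in the lower class
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(gray_pixels: Sequence[int], threshold: int) -> List[int]:
    """Map pixels above the threshold to white (255) and the rest to black (0)."""
    return [255 if p > threshold else 0 for p in gray_pixels]
```

The binarized page, rather than the original color image, would then be what gets handed to the orientation check. Whether that actually helps with gradient backgrounds is something to verify on the real slides.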
On Wed., Jan. 13, 2021, 21:23 Simon Harvey wrote:
Thank you for the help. The main rotation issues come from powerpoint slides that are rotated - The text is very clear but I believe the background gradients/colouring is throwing tesseract.
Hi,
Where is the value returned by get_orientation passed to Tesseract, and how do you execute the OCR when making your own plugin?
I would like to OCR the page in all 4 orientations if the rotation confidence is low, then compare the total number of English words found in each and add only the highest-scoring orientation to the output PDF. I'm struggling to find the file in which the OCR is executed and where the PDF is generated.
Kind regards,
Simon
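The selection step described in this question can be sketched independently of ocrmypdf. In the sketch below, the OCR call is a hypothetical callable supplied by the caller (`ocr_at_angle`, which returns the recognized text for a page rotated by the given angle), and the vocabulary is any set of known English words. None of these names come from ocrmypdf or Tesseract.

```python
import re
from typing import Callable, Dict, Iterable, Tuple

ANGLES = (0, 90, 180, 270)


def count_known_words(text: str, vocabulary: Iterable[str]) -> int:
    """Count alphabetic tokens in text that appear in the vocabulary (case-insensitive)."""
    vocab = {w.lower() for w in vocabulary}
    tokens = re.findall(r"[A-Za-z]+", text)
    return sum(1 for token in tokens if token.lower() in vocab)


def pick_best_angle(
    ocr_at_angle: Callable[[int], str],
    vocabulary: Iterable[str],
) -> Tuple[int, Dict[int, int]]:
    """Run the supplied OCR callable at each angle and return (best_angle, scores).

    ocr_at_angle is a hypothetical stand-in for whatever performs the OCR.
    Ties are resolved in favor of the earliest angle in ANGLES, so a page
    with no recognizable words is left unrotated.
    """
    vocab = frozenset(w.lower() for w in vocabulary)
    scores = {angle: count_known_words(ocr_at_angle(angle), vocab) for angle in ANGLES}
    best = max(ANGLES, key=lambda angle: scores[angle])
    return best, scores
```

Note that this runs OCR four times per low-confidence page, so it is only a sketch of the comparison logic, not a statement about how or where a plugin should hook into the pipeline.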