Bugfix: Handle RTL languages better #1665
This is great! @OmarWazzan, perhaps you're able / interested to test this out a bit?
Hi @shamoon! Happy to help, but unfortunately I'm on a work trip for another week or two. I can test it when I'm back. Thank you for submitting the fix, stumpy, greatly appreciate it.
Hi @shamoon! I'm back and able to test. What would be the best way to do so?
I created a sample. It does assume just a Linux installation, not Portainer, Synology, etc. It's a few steps, but if you're willing to test it out, here are the steps I would do:
@OmarWazzan any luck?
Hi @shamoon, unfortunately I had a thing pop up and have not been able to get to it yet. I run my instance on the Unraid LSIO container, so I need to spin up a VM to get it running.
No worries, thanks 🙏
… back to forced OCR, which handles RTL text better
ea90da0 to 34fc3df
Pull Request Test Coverage Report for Build 3578094965 (Coveralls)
I see no harm in going for it with this. We will get more feedback after a release, and it shouldn't affect much other than the relevant cases.
Proposed change
For a digitally born PDF, the text is extracted using `pdfminer.six`. Unfortunately, they have a long open issue about supporting RTL languages. Fortunately, Tesseract via OCRmyPDF handles RTL languages.

So, with this PR, if the detected language for text extracted via `pdfminer.six` is an RTL language (or at least one of the common ones), the processing will force OCR of the document, which produces a sidecar file with the content, formatted correctly.

Original Document:
1.9.1 Content:
This branch:
The OCR isn't amazing, but at least to me, it's pretty clear the ordering is fixed.
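For readers skimming the thread, the decision described under "Proposed change" can be sketched roughly as below. This is a minimal illustration, not the code from this PR: `RTL_LANGUAGES`, `rtl_ratio`, `should_force_ocr`, the `detect_language` callable, and the `threshold` value are all hypothetical names and choices. When no language detector is supplied, the sketch falls back to a simple heuristic based on Unicode bidirectional classes, which is an assumption of this example rather than something the PR states.

```python
import unicodedata
from typing import Callable, Optional

# Assumption: an illustrative set of ISO 639-1 codes for common RTL languages.
# The set actually used by the PR is not shown in this thread.
RTL_LANGUAGES = {"ar", "fa", "he", "ur"}


def rtl_ratio(text: str) -> float:
    """Return the fraction of alphabetic characters with an RTL bidi class."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    # "R" covers scripts such as Hebrew, "AL" covers Arabic-script letters.
    rtl = sum(1 for c in letters if unicodedata.bidirectional(c) in ("R", "AL"))
    return rtl / len(letters)


def should_force_ocr(
    text: str,
    detect_language: Optional[Callable[[str], str]] = None,
    threshold: float = 0.5,
) -> bool:
    """Decide whether the extracted text looks RTL and OCR should be forced.

    detect_language is a hypothetical callable returning a language code;
    if it is omitted, the bidi-class heuristic above is used instead.
    """
    if not text or not text.strip():
        # Nothing was extracted, so there is no RTL text to re-process.
        return False
    if detect_language is not None:
        return detect_language(text) in RTL_LANGUAGES
    return rtl_ratio(text) >= threshold
```

In a pipeline like the one described, a `True` result would mean discarding the extracted text and taking the content from the OCR sidecar file instead; how that is wired into the parser is outside this sketch.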
Fixes #1163
Type of change
Checklist:
`pre-commit` hooks, see documentation.