Bugfix: Handle RTL languages better #1665

Merged
merged 1 commit into from Dec 30, 2022

Conversation

stumpylog
Member

Proposed change

For a digitally born PDF, the text is extracted using pdfminer.six. Unfortunately, pdfminer.six has a long-standing open issue about supporting RTL (right-to-left) languages. Fortunately, Tesseract, via OCRmyPDF, handles RTL languages.

So, with this PR, if the language detected for the text extracted via pdfminer.six is an RTL language (or at least one of the common ones), processing will force OCR of the document. That produces a sidecar file with the content, formatted correctly.
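As a rough illustration of the decision described above, here is a minimal sketch. It is not the code from this PR: instead of running language detection on the extracted text, it swaps in a simpler stand-in that counts characters with a strongly right-to-left Unicode bidirectional class, using the standard library's `unicodedata.bidirectional`. The names `rtl_ratio` and `should_force_ocr` and the `0.5` threshold are hypothetical and chosen only for this example.

```python
import unicodedata

# Unicode bidirectional classes for strongly right-to-left characters
# ("R": Hebrew and similar scripts, "AL": Arabic letters).
RTL_BIDI_CLASSES = {"R", "AL"}


def rtl_ratio(text: str) -> float:
    """Return the share of strongly directional characters that are RTL."""
    rtl = 0
    strong = 0
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in RTL_BIDI_CLASSES:
            rtl += 1
            strong += 1
        elif bidi == "L":
            strong += 1
    # Text without any strongly directional characters counts as not RTL.
    return rtl / strong if strong else 0.0


def should_force_ocr(extracted_text: str, threshold: float = 0.5) -> bool:
    """Hypothetical helper: True when the extracted text looks mostly RTL.

    The threshold is an assumption made for this sketch, not a value
    taken from the PR.
    """
    return rtl_ratio(extracted_text) >= threshold
```

A caller could use the result to pick the OCR mode, for example by discarding the pdfminer.six text and running OCRmyPDF with forced OCR (its `--force-ocr` option) when `should_force_ocr` returns `True`.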

Original Document:
image
1.9.1 Content:
image
This branch:
image

The OCR isn't amazing, but at least to me, it's pretty clear the ordering is fixed.

Fixes #1163

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Other (please explain)

Checklist:

  • I have read & agree with the contributing guidelines.
  • If applicable, I have tested my code for new features & regressions on both mobile & desktop devices, using the latest version of major browsers.
  • If applicable, I have checked that all tests pass, see documentation.
  • I have run all pre-commit hooks, see documentation.
  • I have made corresponding changes to the documentation as needed.
  • I have checked my modifications for any breaking changes.

@stumpylog stumpylog requested a review from a team as a code owner September 27, 2022 14:53
@github-actions github-actions bot added the enhancement New feature label Sep 27, 2022
@stumpylog stumpylog added bug Bug report or a Bug-fix and removed enhancement New feature labels Sep 27, 2022
@stumpylog stumpylog linked an issue Sep 27, 2022 that may be closed by this pull request
@stumpylog stumpylog self-assigned this Sep 27, 2022
@shamoon
Member

shamoon commented Sep 27, 2022

This is great! @OmarWazzan perhaps you're able and interested to test this out a bit?

@OmarWazzan

OmarWazzan commented Sep 27, 2022

Hi @shamoon!

Happy to help, but unfortunately I'm on a work trip for another week or two. I can test it when I'm back.

Thank you for submitting the fix, stumpy. I greatly appreciate it.

@shamoon shamoon added this to the Next Release milestone Oct 1, 2022
@OmarWazzan

Hi @shamoon!

I'm back and able to test. What would be the best way to do so?

@stumpylog
Member Author

I created a sample docker-compose.yml here: https://gist.github.com/stumpylog/0b66002be64e87da3f3419525fac3ff8

It assumes a plain Linux installation, not Portainer, Synology, etc. It's a few steps, but if you're willing to test it out, here are the steps I would follow:

  1. Drop that file into a bare folder.
  2. Create the folders redis-data, data and media.
  3. Adjust PAPERLESS_OCR_LANGUAGE as well; I left those settings commented out, but they're there.
  4. Start with docker-compose up.
  5. Once started, it will be available on port 8180, i.e. http://localhost:8180
  6. Test it out with lots of RTL docs.
  7. When finished, press Ctrl+C to stop the containers.
  8. Feel free to remove them and the network.
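Step 2 above can be scripted if that is more convenient. The following is only a sketch: the folder names come from the steps above, while the helper name `prepare_test_folder` is made up for this example, and it assumes the folders should sit next to docker-compose.yml. It does not start the containers; docker-compose up is still run by hand.

```python
from pathlib import Path

# Folder names taken from step 2 above.
REQUIRED_DIRS = ("redis-data", "data", "media")


def prepare_test_folder(base: Path) -> list[Path]:
    """Create the folders from step 2 inside `base` (hypothetical helper).

    Existing folders are left untouched, so it is safe to run twice.
    """
    created = []
    for name in REQUIRED_DIRS:
        path = base / name
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

For example, `prepare_test_folder(Path("."))` creates the three folders in the current directory.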

@shamoon
Member

shamoon commented Oct 20, 2022

@OmarWazzan any luck?

@OmarWazzan

Hi @shamoon, unfortunately something came up and I have not been able to get to it yet.

I run my instance as an LSIO container on Unraid, so I need to spin up a VM to get this running.

@shamoon
Member

shamoon commented Oct 21, 2022

No worries thanks 🙏

@shamoon shamoon removed this from the Next Release milestone Nov 10, 2022
@stumpylog stumpylog added this to the v1.10.1 milestone Nov 29, 2022
… back to forced OCR, which handles RTL text better
@coveralls

Pull Request Test Coverage Report for Build 3578094965

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 22 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.03%) to 92.764%

Files with Coverage Reduction     New Missed Lines    %
paperless_tesseract/parsers.py    22                  88.3%

Totals (Coverage Status)
Change from base Build 3569042433:    0.03%
Covered Lines:                        5051
Relevant Lines:                       5445

💛 - Coveralls

@shamoon shamoon modified the milestones: v1.10.1, Next Release Dec 2, 2022
Member

@shamoon shamoon left a comment

I see no harm in going ahead with this. We will get more feedback after a release, and it shouldn't affect much beyond the relevant cases.

@stumpylog stumpylog merged commit a2b7687 into dev Dec 30, 2022
@stumpylog stumpylog deleted the feature-fix-1163-rtl branch December 30, 2022 00:02
@github-actions
Contributor

This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new discussion or issue for related concerns.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 17, 2023
Labels
bug Bug report or a Bug-fix
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

[BUG] RTL Import is reversed
4 participants