[HELP] Inconsistent Reading order #1091

Closed
emtee14 opened this issue Apr 6, 2023 · 2 comments
emtee14 commented Apr 6, 2023

Describe the bug
What's the problem?
When processing two PDFs whose layouts are visibly identical apart from the text values, one PDF yields the text as a simple left-to-right block, whilst the other breaks the text into two columns. Is there a way to make it read everything as a single block of text instead of trying to interpret columns?

I've linked two of the PDFs that have the issue. On the 2nd page of both is the balance sheet I'm trying to read.

Expected behaviour (text read as a block) - bit.ly/3zG1LJO
Undesired behaviour (text split into two columns) - bit.ly/3UbvO5Q

  • OS: macOS
  • Python version: 3.10
  • OCRmyPDF version: 14.0.4
  • Platform: ARM
@jbarlow83
Collaborator

The reading order is inferred by the PDF viewer, unless there is markup in the PDF to indicate the appropriate reading order (such as making a tagged PDF). Adding this information is beyond the capabilities of OCR engines at the moment.

emtee14 commented Apr 6, 2023

Ah okay, thank you. I'm using pypdf to extract the text afterwards, so that must be where the issue lies.
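For anyone landing here with the same problem: since the reading order is decided at extraction time, one option is to rebuild it yourself from fragment positions. The following is a minimal sketch only, assuming your extractor can report an x/y position for each text fragment (pypdf's `extract_text` accepts a `visitor_text` callback that may be usable for this). The `Fragment` type, `rows_in_reading_order`, and the `y_tolerance` value are hypothetical names and settings chosen for illustration, not part of pypdf or OCRmyPDF.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    """A piece of text plus the position of its origin, in PDF units."""
    x: float
    y: float
    text: str


def rows_in_reading_order(fragments, y_tolerance=3.0):
    """Group fragments into visual rows, then read each row left to right.

    y_tolerance is an assumed value; tune it to the line spacing of the page.
    """
    rows = []  # each entry is [reference_y, [fragments on that row]]
    # PDF y coordinates grow upward, so descending y visits the top of the page first.
    for frag in sorted(fragments, key=lambda f: -f.y):
        if rows and abs(rows[-1][0] - frag.y) <= y_tolerance:
            rows[-1][1].append(frag)
        else:
            rows.append([frag.y, [frag]])
    # Within a row, ascending x gives left-to-right order across column boundaries.
    return [" ".join(f.text for f in sorted(row, key=lambda f: f.x)) for _, row in rows]


# Fragments delivered column by column, as in the undesired output.
column_major = [
    Fragment(50, 700, "Cash"),
    Fragment(50, 680, "Debtors"),
    Fragment(300, 701, "1,200"),
    Fragment(300, 680, "450"),
]
for line in rows_in_reading_order(column_major):
    print(line)
```

Grouping by approximate baseline rather than exact y is deliberate: OCR output rarely places the label and the figure of one balance-sheet row on precisely the same baseline.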

@emtee14 emtee14 closed this as completed Apr 6, 2023