[HELP] Inconsistent Reading order #1091

emtee14 · 2023-04-06T01:08:04Z

Describe the bug
What's the problem?
When processing two PDFs that are visibly identical in layout and differ only in their text values, one PDF gives the text as a simple left-to-right block whilst the other breaks the text into two columns. Is there a way to make it read all of the text as a single block and not try to interpret columns?

I've linked two of the PDFs that have the issue. On the 2nd page of both is the balance sheet I'm trying to read.

Expected behaviour (text read as a block) - bit.ly/3zG1LJO
Undesired behaviour (text split into two columns) - bit.ly/3UbvO5Q

OS: macOS
Python version: 3.10
OCRmyPDF version: 14.0.4
Platform: ARM

jbarlow83 · 2023-04-06T01:26:16Z

The reading order is inferred by the PDF viewer, unless there is markup in the PDF to indicate the appropriate reading order (such as making a tagged PDF). Adding this information is beyond the capabilities of OCR engines at the moment.

emtee14 · 2023-04-06T01:31:26Z

Ah okay, thank you. I'm using pypdf to then extract the text, so that must be where the issue lies.
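
For anyone landing here with the same problem, a rough sketch of one way to force block-style reading order on the extraction side. It assumes you can get each text fragment together with its page coordinates (pypdf's `extract_text` accepts a `visitor_text` callback that can be used to collect them). The `(x, y, text)` tuple shape, the function name `block_reading_order`, and the `line_tolerance` value are all made up for illustration, not part of pypdf or OCRmyPDF, and the tolerance would likely need tuning per document.

```python
def block_reading_order(fragments, line_tolerance=3.0):
    """Join text fragments top-to-bottom, then left-to-right, ignoring columns.

    fragments: iterable of (x, y, text) tuples in PDF coordinates, where y
    increases towards the top of the page (hypothetical input shape).
    line_tolerance: maximum y difference for two fragments to count as being
    on the same line (an assumed value, tune as needed).
    """
    lines = []  # each entry is [reference_y, [(x, text), ...]]
    # Visit fragments from the top of the page downwards.
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        text = text.strip()
        if not text:
            continue  # skip whitespace-only fragments
        if lines and abs(lines[-1][0] - y) <= line_tolerance:
            lines[-1][1].append((x, text))  # same visual line
        else:
            lines.append([y, [(x, text)]])  # start a new line
    # Within each line, order fragments left to right.
    return "\n".join(
        " ".join(t for _, t in sorted(items, key=lambda item: item[0]))
        for _, items in lines
    )


# Two "columns" of a balance-sheet-like page, deliberately out of order.
sample = [
    (300, 700, "1,000"),
    (50, 700, "Cash"),
    (50, 680, "Debtors"),
    (300, 681, "2,500"),
]
print(block_reading_order(sample))
```

Because fragments are grouped by vertical position only, a label and the value on the same row stay together even when they sit in what a column detector would treat as separate columns. It will not cope with rotated text or genuinely multi-column prose.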

emtee14 closed this as completed Apr 6, 2023
