Disabling boxes flow makes lines be in wrong order within text boxes #411

Closed
jstockwin opened this issue Apr 6, 2020 · 1 comment · Fixed by #412

Comments

@jstockwin
Member

Bug report

When boxes_flow is None, we skip the full advanced layout analysis and order the text boxes purely by their position on the page. This is intentional.

When boxes_flow is set, we group the text boxes and then call analyze on each group (here). That call propagates down, so analyze also ends up being called on the text boxes themselves. When boxes_flow=None, we never call analyze on the text boxes, so their lines are never sorted and come out in the wrong order.

Note that boxes_flow is not used in the analyze method of text boxes. It is only used for groups of text boxes (which we never have if boxes flow is disabled).

To fix this, we just need to make sure that analyze is always called on the text boxes, even if we don't group them.
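
The mechanism above can be illustrated with a minimal toy model. This is only a sketch and not pdfminer's actual code: the Line and TextBox classes, the layout function, the always_analyze flag and the y1 values are all hypothetical stand-ins. The only assumption carried over from the report is that a text box's analyze step is what sorts its lines from top to bottom.

```python
class Line:
    """Hypothetical stand-in for a text line: its text and its top edge (y1)."""

    def __init__(self, text, y1):
        self.text = text
        self.y1 = y1


class TextBox:
    """Hypothetical stand-in for a text box holding lines in arrival order."""

    def __init__(self, lines):
        self.lines = list(lines)

    def analyze(self):
        # PDF y coordinates grow upwards, so sorting by -y1 puts the top line first.
        self.lines.sort(key=lambda line: -line.y1)

    def get_text(self):
        return "".join(line.text for line in self.lines)


def layout(boxes, boxes_flow, always_analyze):
    """Toy layout pass; always_analyze=True models the proposed fix."""
    if boxes_flow is not None:
        # Grouping path: analyzing the groups reaches every box.
        for box in boxes:
            box.analyze()
    elif always_analyze:
        # Proposed fix: analyze the boxes even though they are not grouped.
        for box in boxes:
            box.analyze()
    return [box.get_text() for box in boxes]


def make_box():
    # Lines arrive bottom-to-top, as in the report; the y1 values are made up.
    return TextBox([
        Line("Text 3\n", 364.5),
        Line("Text 2\n", 385.6),
        Line("Text 1\n", 406.7),
    ])


buggy = layout([make_box()], boxes_flow=None, always_analyze=False)
fixed = layout([make_box()], boxes_flow=None, always_analyze=True)
print(buggy)
print(fixed)
```

In this model, the box keeps its arrival order when boxes_flow is None and analyze is skipped, and the lines come out top to bottom once analyze runs unconditionally.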

Example PDF

Example code:

from pdfminer.high_level import extract_pages
from pdfminer.layout import LAParams

la_params = LAParams(boxes_flow=None)

for page in extract_pages("example.pdf", laparams=la_params, caching=False):
    print("*****OUTPUT:*****")
    for element in page:
        print(element)

Example output:

*****OUTPUT:*****
<LTTextBoxHorizontal(-1) 58.100,343.402,106.178,406.702 'Text 3Text 2Text 1'>

Expected output:

*****OUTPUT:*****
<LTTextBoxHorizontal(-1) 58.100,343.402,106.178,406.702 'Text 1\nText 2\nText 3\n'>
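
Until a fix is available, one possible workaround is to sort the lines yourself when reading a text box. The helper below is a sketch under stated assumptions: text_in_reading_order is a hypothetical name, it only handles horizontal text, and it assumes that iterating a text box yields its lines and that each line exposes a y1 attribute and a get_text() method. The FakeLine class exists only to make the snippet self-contained.

```python
def text_in_reading_order(box):
    """Join the lines of a horizontal text box from top to bottom.

    Assumes iterating `box` yields lines that have `y1` and `get_text()`.
    """
    # Larger y1 means higher on the page, so sort in descending y1 order.
    lines = sorted(box, key=lambda line: -line.y1)
    return "".join(line.get_text() for line in lines)


class FakeLine:
    """Hypothetical stand-in for a text line, used only for this demo."""

    def __init__(self, text, y1):
        self._text = text
        self.y1 = y1

    def get_text(self):
        return self._text


# Lines in the wrong (bottom-to-top) order; the y1 values are made up.
unsorted_box = [
    FakeLine("Text 3\n", 364.5),
    FakeLine("Text 2\n", 385.6),
    FakeLine("Text 1\n", 406.7),
]
print(repr(text_in_reading_order(unsorted_box)))
```

With the example code above, this would mean calling text_in_reading_order(element) on each text box element instead of printing the element directly.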
@rcyost

rcyost commented Nov 21, 2022

pdfminer.six couldn't get the order correct until I passed None to boxes_flow. Seems a bit buggy.
