Fix ordering of textlines within a textbox when boxes_flow is disabled #412

Merged on May 9, 2020 (2 commits)

@jstockwin (Member) commented Apr 6, 2020

Pull request

Closes #411

Note: Adding the analyze call inside the get_key function does work, but that felt a bit strange since it has nothing to do with getting the key, so I decided to do it outside of that function.
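To illustrate the design choice in the note above, here is a minimal, self-contained sketch. It does not use pdfminer: the Line and Box classes, the order_boxes helper and the y1 coordinates are hypothetical stand-ins, and only the names analyze and get_key are borrowed from the note. The idea is to order the lines inside each box first, then sort the boxes with a key function that has no side effects.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Line:
    """Hypothetical stand-in for a text line; y1 is its top edge."""
    text: str
    y1: float


@dataclass
class Box:
    """Hypothetical stand-in for a text box holding several lines."""
    x0: float
    y0: float
    lines: List[Line] = field(default_factory=list)

    def analyze(self) -> None:
        # Order lines top to bottom (PDF y coordinates grow upwards).
        self.lines.sort(key=lambda line: -line.y1)

    def get_text(self) -> str:
        return "".join(line.text + "\n" for line in self.lines)


def get_key(box: Box) -> Tuple[float, float]:
    # Side-effect-free key: top-most box first, then left to right.
    return (-box.y0, box.x0)


def order_boxes(boxes: List[Box]) -> List[Box]:
    # Step 1: fix the line order inside every box (this mutates the boxes).
    for box in boxes:
        box.analyze()
    # Step 2: order the boxes themselves using the pure key function.
    return sorted(boxes, key=get_key)


# Made-up coordinates: the lines are stored bottom to top, as in the bug.
box = Box(
    x0=58.1,
    y0=343.4,
    lines=[Line("Text 3", 360.0), Line("Text 2", 380.0), Line("Text 1", 400.0)],
)
print(repr(order_boxes([box])[0].get_text()))  # 'Text 1\nText 2\nText 3\n'
```

Keeping the mutation out of get_key means the key function can be called any number of times by the sort without changing the boxes.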

How Has This Been Tested?

Example PDF

Example code:

from pdfminer.high_level import extract_pages
from pdfminer.layout import LAParams

la_params = LAParams(boxes_flow=None)

for page in extract_pages("example.pdf", laparams=la_params, caching=False):
    print("*****OUTPUT:*****")
    for element in page:
        print(element)

Before fix:

*****OUTPUT:*****
<LTTextBoxHorizontal(-1) 58.100,343.402,106.178,406.702 'Text 3Text 2Text 1'>

After fix:

*****OUTPUT:*****
<LTTextBoxHorizontal(-1) 58.100,343.402,106.178,406.702 'Text 1\nText 2\nText 3\n'>

Checklist

  • I have added tests that prove my fix is effective or that my feature
    works -> Actually, the tests I added for the case without boxes flow do show the issue: you can see the output change (fewer \n). The output is now consistent.
  • I have added docstrings to newly created methods and classes
  • I have optimized the code at least one time after creating the initial
    version
  • I have updated the README.md or I have verified that this
    is not necessary
  • I have updated the readthedocs documentation or I
    verified that this is not necessary
  • I have added a concise human-readable description of the change to
    CHANGELOG.md

@jstockwin
Member Author

Not sure why the tests still say pending... Travis is green: https://travis-ci.org/github/pdfminer/pdfminer.six/builds/671715428

I actually had this issue on a different project the other day...

Member

@pietermarsman pietermarsman left a comment

Looks good!

Maybe you can add your test PDF (with "Text 1", "Text 2" and "Text 3") as a test case.

@jstockwin
Member Author

@pietermarsman Cool, I've added my test PDF, and added some tests for when boxes flow is disabled. I also threw in some bonus tests for the line margin parameter, since I know the line margin is 0.2 in this case.

I'll try and get round to adding some more tests, but probably not part of this PR.
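As a rough idea of what an ordering check in such a test could look like, here is a small sketch. The helper name lines_are_top_to_bottom is hypothetical, and the fake lines use made-up coordinates. The only assumption about the real objects is that each text line exposes a y1 attribute for its top edge.

```python
from types import SimpleNamespace
from typing import Iterable


def lines_are_top_to_bottom(lines: Iterable) -> bool:
    """Return True if each line's top edge (y1) is at or below the previous one's."""
    tops = [line.y1 for line in lines]
    return all(upper >= lower for upper, lower in zip(tops, tops[1:]))


# Fake lines with made-up coordinates; in a real test these would come
# from iterating over a text box produced with boxes_flow=None.
fixed = [SimpleNamespace(y1=400.0), SimpleNamespace(y1=380.0), SimpleNamespace(y1=360.0)]
broken = list(reversed(fixed))

print(lines_are_top_to_bottom(fixed))   # True
print(lines_are_top_to_bottom(broken))  # False
```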

@jstockwin
Member Author

Travis didn't build for my new commit... not sure how to fix that

@jstockwin
Member Author

@pietermarsman I have just rebased to resolve conflicts in the changelog, and Travis has now done its job. Any chance this can get merged?

@pietermarsman pietermarsman merged commit 7254530 into pdfminer:develop May 9, 2020
@pietermarsman
Member

Thanks @jstockwin!

(Sorry for the delay)

Linked issue that may be closed by this pull request: Disabling boxes flow makes lines be in wrong order within text boxes