How to prevent pdfminer.six from executing layout algorithm to create textboxes? #892

Closed
yeus opened this issue Jun 16, 2023 · 2 comments
Labels
component: converter Related to any PDFLayoutAnalyzer type: question

Comments


yeus commented Jun 16, 2023

Is there a way to prevent pdfminer.six from executing the layout algorithm, so that one only gets a list of line, graphic, and image elements? I have several PDFs where the layout algorithm takes a very long time, simply because there are so many tables and textboxes distributed all over the pages. See also this issue: euske/pdfminer#61

Since I don't need the layout algorithm, it would be sufficient for me to simply iterate over the page without textboxes.

How would I do that? Right now I am using this: https://pdfminersix.readthedocs.io/en/latest/reference/highlevel.html#extract-pages

Is it somehow possible with the composable API? https://pdfminersix.readthedocs.io/en/latest/tutorial/composable.html
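For reference, one way to approach this with the composable API is sketched below. It assumes that `PDFPageAggregator` only runs layout analysis when it is given an `LAParams` object, so passing `laparams=None` leaves each page as a flat collection of characters, lines, rectangles, curves, figures, and images. The function name `iter_pages_without_layout` and the file name in the usage comment are hypothetical. As far as I can tell, `extract_pages(..., laparams=None)` does not do the same thing, because it falls back to a default `LAParams()`.

```python
def iter_pages_without_layout(path):
    """Yield one LTPage per page without running pdfminer.six layout analysis."""
    # Imported inside the function so that merely defining it does not
    # require pdfminer.six to be installed.
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    resource_manager = PDFResourceManager()
    # Assumption: with laparams=None the aggregator collects the raw objects
    # and skips the analysis step that builds text lines and text boxes.
    device = PDFPageAggregator(resource_manager, laparams=None)
    interpreter = PDFPageInterpreter(resource_manager, device)

    with open(path, "rb") as fp:
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
            yield device.get_result()


# Usage (the file name is a placeholder):
# for layout in iter_pages_without_layout("example.pdf"):
#     for element in layout:
#         print(type(element).__name__, getattr(element, "bbox", None))
```

With this approach text arrives as individual `LTChar` objects rather than lines or boxes, so any grouping has to be done by the caller.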


P0L3 commented Jan 12, 2024

It seems your problem could be solved by passing None as the boxes_flow parameter of LAParams, i.e. LAParams(boxes_flow=None). This conclusion is drawn from #411.
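A minimal sketch of that suggestion with the high-level API could look like the following. The function name `extract_without_box_flow` and the file name in the usage comment are hypothetical. Note that, as I understand it, `boxes_flow=None` does not switch off layout analysis entirely: text lines and text boxes are still built, and only the more expensive hierarchical grouping and ordering of the boxes is skipped.

```python
def extract_without_box_flow(path):
    """Yield page layouts using LAParams(boxes_flow=None)."""
    # Imported inside the function so that merely defining it does not
    # require pdfminer.six to be installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LAParams

    # boxes_flow=None disables the advanced ordering of text boxes;
    # boxes are instead ordered by position.
    laparams = LAParams(boxes_flow=None)
    yield from extract_pages(path, laparams=laparams)


# Usage (the file name is a placeholder):
# for layout in extract_without_box_flow("example.pdf"):
#     for element in layout:
#         print(type(element).__name__)
```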

@pietermarsman pietermarsman added type: question component: converter Related to any PDFLayoutAnalyzer labels Jan 12, 2024

yeus commented Jan 29, 2024

> Seems like your problem could be solved with passing None to boxes_flow attribute of LAParams() object as: LAParams(boxes_flow=None), conclusion drawn from #411.

Hi,

Just want to reply that this seems to work quite well. I haven't removed all the heavy lifting yet, but to give you a better idea, I did some timings:

The following was done with "vanilla" LAParams():

Time taken for page 10: 0.368527889251709 seconds, elements:602
Time taken for page 16: 23.002509593963623 seconds, elements:31153
Time taken for page 17: 14.153702735900879 seconds, elements:31262
Time taken for page 18: 0.3653285503387451 seconds, elements:576
Time taken for page 19: 0.36687445640563965 seconds, elements:596

With LAParams(detect_vertical=False):

Time taken for page 10: 0.3694620132446289 seconds, elements:602
Time taken for page 16: 22.085140705108643 seconds, elements:31153
Time taken for page 17: 13.33229398727417 seconds, elements:31262
Time taken for page 18: 0.36687588691711426 seconds, elements:576
Time taken for page 19: 0.359846830368042 seconds, elements:596

Then with LAParams(boxes_flow=None):

Time taken for page 10: 0.3904867172241211 seconds, elements:602
Time taken for page 16: 3.1319127082824707 seconds, elements:31153
Time taken for page 17: 3.055600643157959 seconds, elements:31262
Time taken for page 18: 0.248185396194458 seconds, elements:576
Time taken for page 19: 0.24678397178649902 seconds, elements:596

With LAParams(detect_vertical=False, boxes_flow=None):

Time taken for page 10: 0.3619980812072754 seconds, elements:602
Time taken for page 16: 3.106546640396118 seconds, elements:31153
Time taken for page 17: 3.0158188343048096 seconds, elements:31262
Time taken for page 18: 0.24485182762145996 seconds, elements:576
Time taken for page 19: 0.2494044303894043 seconds, elements:596

So the improvement is quite big. detect_vertical=False, by contrast, seems to have minimal influence on performance.
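To put a number on the improvement, the per-page speedup can be computed directly from the timings posted above (default `LAParams()` versus `LAParams(boxes_flow=None)`). The dictionary names are only for illustration.

```python
# Seconds per page, copied from the timings above.
baseline = {
    10: 0.368527889251709,
    16: 23.002509593963623,
    17: 14.153702735900879,
    18: 0.3653285503387451,
    19: 0.36687445640563965,
}
boxes_flow_none = {
    10: 0.3904867172241211,
    16: 3.1319127082824707,
    17: 3.055600643157959,
    18: 0.248185396194458,
    19: 0.24678397178649902,
}

# Speedup factor = time with default LAParams / time with boxes_flow=None.
speedup = {page: baseline[page] / boxes_flow_none[page] for page in baseline}

for page, factor in speedup.items():
    print(f"page {page}: {factor:.1f}x")
```

On the two heavy pages (16 and 17, with about 31,000 elements each) this works out to roughly 7.3x and 4.6x, while the light pages change only slightly.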

Any other ideas on how we could speed this up? :)
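To compare further settings in the same way as the timings above, a small harness along these lines could be used. It is a sketch, not the script used for the numbers in this thread: the helper names `time_each_page` and `compare_laparams` are hypothetical, pages are numbered by iteration order starting at 1, and the "elements" figure here is simply the number of top-level objects on each page, which may not match how the counts above were obtained.

```python
import time


def time_each_page(pages):
    """Yield (index, seconds, top_level_count) for each item produced by `pages`.

    `pages` can be any iterable of sized page layouts, for example the
    generator returned by pdfminer.six's extract_pages().
    """
    iterator = iter(pages)
    index = 0
    while True:
        start = time.perf_counter()
        try:
            # extract_pages() is lazy, so the work happens inside next().
            layout = next(iterator)
        except StopIteration:
            return
        elapsed = time.perf_counter() - start
        index += 1
        yield index, elapsed, len(layout)


def compare_laparams(path):
    """Print per-page timings for a few LAParams configurations."""
    # Imported here so that time_each_page() works without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LAParams

    configurations = {
        "LAParams()": LAParams(),
        "LAParams(boxes_flow=None)": LAParams(boxes_flow=None),
    }
    for label, laparams in configurations.items():
        print(label)
        for index, seconds, count in time_each_page(
            extract_pages(path, laparams=laparams)
        ):
            print(f"Time taken for page {index}: {seconds} seconds, elements:{count}")
```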
