How to prevent pdfminer.six from executing layout algorithm to create textboxes? #892

yeus · 2023-06-16T23:55:47Z

Is there a way to prevent pdfminer.six from executing the layout algorithm, so that one only gets a list of lines, graphics, image elements, etc.? I have several PDFs where the layout algorithm takes a very long time, simply because there are so many tables and text boxes distributed all over the pages. Also see this issue: euske/pdfminer#61

As I don't need the layout algorithm, it would be sufficient for me to simply iterate over the page without text boxes.

How would I do that? Right now I am using this: https://pdfminersix.readthedocs.io/en/latest/reference/highlevel.html#extract-pages

Is it somehow possible with this? https://pdfminersix.readthedocs.io/en/latest/tutorial/composable.html

P0L3 · 2024-01-12T07:35:59Z

It seems your problem could be solved by passing None as the boxes_flow parameter of LAParams, i.e. LAParams(boxes_flow=None). This conclusion is drawn from #411.

yeus · 2024-01-29T20:57:56Z

Hi,

Just want to reply that this seems to work quite well. I haven't removed all the heavy lifting yet, but to give you a better idea, I did some timings:

The following was done with "vanilla" LAParams():

Time taken for page 10: 0.368527889251709 seconds, elements:602
Time taken for page 16: 23.002509593963623 seconds, elements:31153
Time taken for page 17: 14.153702735900879 seconds, elements:31262
Time taken for page 18: 0.3653285503387451 seconds, elements:576
Time taken for page 19: 0.36687445640563965 seconds, elements:596

With LAParams(detect_vertical=False):

Time taken for page 10: 0.3694620132446289 seconds, elements:602
Time taken for page 16: 22.085140705108643 seconds, elements:31153
Time taken for page 17: 13.33229398727417 seconds, elements:31262
Time taken for page 18: 0.36687588691711426 seconds, elements:576
Time taken for page 19: 0.359846830368042 seconds, elements:596

Then with LAParams(boxes_flow=None):

Time taken for page 10: 0.3904867172241211 seconds, elements:602
Time taken for page 16: 3.1319127082824707 seconds, elements:31153
Time taken for page 17: 3.055600643157959 seconds, elements:31262
Time taken for page 18: 0.248185396194458 seconds, elements:576
Time taken for page 19: 0.24678397178649902 seconds, elements:596

With LAParams(detect_vertical=False, boxes_flow=None):

Time taken for page 10: 0.3619980812072754 seconds, elements:602
Time taken for page 16: 3.106546640396118 seconds, elements:31153
Time taken for page 17: 3.0158188343048096 seconds, elements:31262
Time taken for page 18: 0.24485182762145996 seconds, elements:576
Time taken for page 19: 0.2494044303894043 seconds, elements:596

So the improvement is quite big: pages 16 and 17 drop from roughly 23 s and 14 s to about 3 s each. detect_vertical=False, on the other hand, seems to have minimal influence on the timings.
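
A rough sketch of how per-page timings like the ones above could be collected; this is not necessarily the script used for these numbers. `timed` is a hypothetical helper. It relies on extract_pages being a generator, so the work for each page happens when the next page is requested. As far as I know, detect_vertical already defaults to False in LAParams, which would explain why setting it explicitly changes nothing.

```python
import time


def timed(iterable):
    """Yield (item, seconds), where seconds is the time spent producing item."""
    iterator = iter(iterable)
    while True:
        start = time.perf_counter()
        try:
            item = next(iterator)
        except StopIteration:
            return
        yield item, time.perf_counter() - start


# Example usage with pdfminer.six (the file name is a placeholder):
#
#   from pdfminer.high_level import extract_pages
#   from pdfminer.layout import LAParams
#
#   pages = extract_pages("document.pdf", laparams=LAParams(boxes_flow=None))
#   for page, seconds in timed(pages):
#       print(f"Time taken for page {page.pageid}: {seconds} seconds")
```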

Any other ideas on how we could speed this up? :)

pietermarsman closed this as completed Jan 12, 2024

pietermarsman added the "type: question" and "component: converter" labels Jan 12, 2024
