Is there a way to prevent pdfminer.six from executing the layout algorithm, so that one only gets a list of line/graphic/image elements etc.? I have several PDFs where the layout algorithm takes a very long time, simply because there are so many tables and text boxes distributed all over the pages. See also this issue: euske/pdfminer#61
As I don't need the layout algorithm, it would be sufficient for me to simply iterate over the page without text boxes.
It seems your problem could be solved by passing None as the boxes_flow parameter of LAParams, i.e. LAParams(boxes_flow=None). This conclusion is drawn from #411.
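A minimal sketch of that suggestion, assuming the high-level extract_pages function is used. The helper name fast_pages is hypothetical, and the imports sit inside the function only so the snippet loads where pdfminer.six is not installed. Note that this setting turns off the ordering of text boxes; characters are still grouped into lines and boxes.

```python
def fast_pages(pdf_path):
    """Yield LTPage objects using LAParams(boxes_flow=None)."""
    # Imported lazily so the sketch can be loaded without pdfminer.six present.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LAParams

    # boxes_flow=None disables the advanced (hierarchical) ordering of text boxes,
    # which is the step that becomes slow on pages with very many boxes.
    params = LAParams(boxes_flow=None)
    return extract_pages(pdf_path, laparams=params)
```

Usage would look like `for page in fast_pages("example.pdf"): ...`, where "example.pdf" is a placeholder path.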
Hi,
Just want to reply that this seems to work quite well. I haven't removed all the heavy lifting yet, but to give you a better idea, I did some timings:
The following was done with "vanilla" LAParams():
Time taken for page 10: 0.368527889251709 seconds, elements:602
Time taken for page 16: 23.002509593963623 seconds, elements:31153
Time taken for page 17: 14.153702735900879 seconds, elements:31262
Time taken for page 18: 0.3653285503387451 seconds, elements:576
Time taken for page 19: 0.36687445640563965 seconds, elements:596
With LAParams(detect_vertical=False):
Time taken for page 10: 0.3694620132446289 seconds, elements:602
Time taken for page 16: 22.085140705108643 seconds, elements:31153
Time taken for page 17: 13.33229398727417 seconds, elements:31262
Time taken for page 18: 0.36687588691711426 seconds, elements:576
Time taken for page 19: 0.359846830368042 seconds, elements:596
Then with LAParams(boxes_flow=None):
Time taken for page 10: 0.3904867172241211 seconds, elements:602
Time taken for page 16: 3.1319127082824707 seconds, elements:31153
Time taken for page 17: 3.055600643157959 seconds, elements:31262
Time taken for page 18: 0.248185396194458 seconds, elements:576
Time taken for page 19: 0.24678397178649902 seconds, elements:596
And with LAParams(detect_vertical=False, boxes_flow=None):
Time taken for page 10: 0.3619980812072754 seconds, elements:602
Time taken for page 16: 3.106546640396118 seconds, elements:31153
Time taken for page 17: 3.0158188343048096 seconds, elements:31262
Time taken for page 18: 0.24485182762145996 seconds, elements:576
Time taken for page 19: 0.2494044303894043 seconds, elements:596
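So the improvement is quite big: with boxes_flow=None, pages 16 and 17 drop from roughly 23 s and 14 s to about 3 s each. detect_vertical=False, on the other hand, seems to have minimal influence on the run time.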
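The script behind these timings is not shown, so the following is only a sketch of how such per-page measurements could be taken. The helper name time_pages is hypothetical, and counting the top-level children of each page is an assumption about what "elements" means in the output above.

```python
import time


def time_pages(pages):
    """Return a list of (page_number, seconds, element_count) tuples.

    `pages` is any iterable of iterable pages, e.g. the generator
    returned by pdfminer's extract_pages.
    """
    results = []
    iterator = iter(pages)
    page_number = 0
    while True:
        start = time.perf_counter()
        try:
            # With a lazy generator, the page is interpreted and analysed here.
            page = next(iterator)
        except StopIteration:
            break
        elapsed = time.perf_counter() - start
        page_number += 1
        # Assumption: "elements" counts the page's top-level layout objects.
        results.append((page_number, elapsed, len(list(page))))
    return results
```

It could be called as `time_pages(extract_pages("example.pdf", laparams=LAParams(boxes_flow=None)))`, with "example.pdf" standing in for a real file, and repeated with different LAParams settings to compare them.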
How would I skip the layout analysis in practice? Right now I am using extract_pages: https://pdfminersix.readthedocs.io/en/latest/reference/highlevel.html#extract-pages
Is it somehow possible with the composable API described here: https://pdfminersix.readthedocs.io/en/latest/tutorial/composable.html
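For the composable route, one possible sketch is below. It rests on the assumption that PDFPageAggregator only runs the layout analysis when it is given an LAParams object, so passing laparams=None should leave each page as a flat collection of raw items (LTChar, LTLine, LTRect, LTCurve, LTFigure and so on). The helper names raw_pages and count_by_type are hypothetical, and the behaviour is worth checking against the installed pdfminer.six version.

```python
from collections import Counter


def raw_pages(pdf_path):
    """Yield LTPage objects built without LAParams (no text lines or boxes)."""
    # Imported lazily so the sketch can be loaded without pdfminer.six present.
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    resource_manager = PDFResourceManager()
    # Assumption: laparams=None means the aggregator skips layout analysis.
    device = PDFPageAggregator(resource_manager, laparams=None)
    interpreter = PDFPageInterpreter(resource_manager, device)
    with open(pdf_path, "rb") as fp:
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
            yield device.get_result()


def count_by_type(page):
    """Count the top-level items of a page by class name."""
    return Counter(type(item).__name__ for item in page)
```

Iterating with `for page in raw_pages("example.pdf"): print(count_by_type(page))` would then show which kinds of items each page holds; "example.pdf" is a placeholder path.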