Reading columns correctly. #25

bes827 · 2021-11-09T00:15:26Z

I would like to thank you for working on this great package. It's extremely useful and has plenty of applications. I hope you continue to work on and maintain it.

I noticed that one of the limitations (as you mentioned) is text fragmentation when the text in a PDF is laid out in columns (e.g., most scientific articles). I came across the function tabulizer::extract_text(file), which can read multiple columns. I wonder if you could use something similar in your package to fix that issue. This tabulizer function will still cause issues with tables and image/table captions, but at least it will get the flow of the main text correct.

Thank you.

lebebr01 · 2021-11-16T16:27:58Z

Thanks for the comments. One of my graduate students and I are currently working on expanding this package and a companion package. One of the elements we are working to improve is this feature. I don't plan to use the tabulizer package, as it has some pretty strict dependencies (i.e., Java). However, look for some improvements to multiple-column PDFs coming soon.

lebebr01 self-assigned this Nov 16, 2021

lebebr01 added the enhancement label Nov 16, 2021

lebebr01 added this to the v0.4 milestone Nov 16, 2021
