Reading columns correctly. #25

Open
bes827 opened this issue Nov 9, 2021 · 1 comment
bes827 commented Nov 9, 2021

I would like to thank you for working on this great package. It's extremely useful and has plenty of applications. I hope you continue to develop and maintain it.

I noticed that one of the limitations (as you mentioned) is text fragmentation when the text in a PDF is laid out in columns (e.g., most scientific articles). I came across the function tabulizer::extract_text(file), which can read multiple columns. I wonder if you could use something similar in your package to fix that issue. The tabulizer function still has problems with tables and with image/table captions, but at least it gets the flow of the main text correct.
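To make the fragmentation problem concrete, here is a minimal sketch in Python. It is not how tabulizer or this package works internally. It assumes a hypothetical input of text fragments with page coordinates and a simple two-column layout split at the page midpoint. The names `naive_order` and `column_order` and the sample coordinates are made up for illustration.

```python
from typing import List, Tuple

# Hypothetical fragment: (x, y, text), with y growing downward on the page.
Fragment = Tuple[float, float, str]


def naive_order(frags: List[Fragment]) -> List[str]:
    """Read strictly top to bottom, left to right (interleaves columns)."""
    return [text for _, _, text in sorted(frags, key=lambda f: (f[1], f[0]))]


def column_order(frags: List[Fragment], page_width: float) -> List[str]:
    """Read the left column top to bottom, then the right column.

    Assumes exactly two columns separated at the page midpoint.
    """
    mid = page_width / 2
    left = sorted((f for f in frags if f[0] < mid), key=lambda f: f[1])
    right = sorted((f for f in frags if f[0] >= mid), key=lambda f: f[1])
    return [text for _, _, text in left + right]


# Made-up sample page: two columns, two lines each.
page = [
    (50.0, 100.0, "Left line 1"),
    (320.0, 100.0, "Right line 1"),
    (50.0, 120.0, "Left line 2"),
    (320.0, 120.0, "Right line 2"),
]

print(naive_order(page))
print(column_order(page, page_width=612.0))
```

The naive ordering alternates between the two columns line by line, which is the fragmentation described above. The column-aware ordering keeps each column's text together. Real pages would also need handling for full-width titles, tables, and captions, which this sketch ignores.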

Thank you.

lebebr01 (Owner) commented

Thanks for the comments. One of my graduate students and I are currently working on expanding this package and a companion package. One of the elements we are working to improve is this feature. I don't plan to use the tabulizer package, as it has some pretty strict dependencies (i.e., Java). However, look for some improvements to multi-column PDF handling coming soon.

lebebr01 self-assigned this Nov 16, 2021
lebebr01 added this to the v0.4 milestone Nov 16, 2021