Refextract fails to extract from two-columned layout pdf #85

Closed
Apurv3377 opened this issue Sep 7, 2021 · 2 comments

@Apurv3377

When the input PDF has a two-column layout, refextract returns an empty list of references:

from refextract import extract_references_from_url
references = extract_references_from_url('https://arxiv.org/pdf/1710.11035.pdf')
print(references[0])

When the input PDF has a one-column layout, refextract works fine:

from refextract import extract_references_from_url
references = extract_references_from_url('https://arxiv.org/pdf/1509.03588.pdf')
print(references[0])

How can I get refextract to parse both types of layout?

Thank you.

@michamos
Contributor

michamos commented Sep 7, 2021

I don't think the issue is related to the layout; two-column layouts usually work just fine. Refextract is not meant to be a general-purpose reference extraction tool; it has been tuned to work well for High-Energy Physics and related fields. If citation styles are very different, it will run into trouble. In this case, I believe the cause is that the heading is called "Bibliographic references", which is not among the expected titles:

titles = [u'references',
          u'r\u00C9f\u00E9rences',
          u'r\u00C9f\u00C9rences',
          u'r\xb4ef\xb4erences',
          u'bibliography',
          u'bibliographie',
          u'literaturverzeichnis',
          u'citations',
          u'refs',
          u'publications',
          u'r\u00E9fs',
          u'r\u00C9fs',
          u'reference',
          u'r\u00E9f\u00E9rence',
          u'r\u00C9f\u00C9rence']
Additionally, there are no markers at the beginning of each reference, so it might struggle to separate them. If you're looking for a general-purpose tool, I would look into https://github.com/kermitt2/grobid instead.
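
To illustrate why a heading such as "Bibliographic references" could be missed, here is a minimal sketch of matching a heading line against a fixed list of titles. It is not refextract's actual code: the function name `looks_like_reference_heading`, the shortened `KNOWN_TITLES` list, and the exact-match rule are all assumptions made for illustration, and the real library may match headings differently.

```python
import re

# Subset of the titles quoted above (ASCII entries only, for brevity).
KNOWN_TITLES = [
    "references", "bibliography", "bibliographie", "literaturverzeichnis",
    "citations", "refs", "publications", "reference",
]


def looks_like_reference_heading(line, titles=KNOWN_TITLES):
    """Hypothetical helper: True if the line is one of the known titles.

    Assumption: a heading matches only if, after removing optional
    leading section numbering (e.g. "7." or "7.1") and trailing
    punctuation, the whole line equals a known title.
    """
    # Drop leading numbering such as "7. " or "7.1 ".
    stripped = re.sub(r"^\s*\d+(\.\d+)*\.?\s*", "", line)
    # Normalise case and remove trailing ":" or ".".
    stripped = stripped.strip().rstrip(":.").lower()
    return stripped in titles


for heading in ["References", "7. References", "Bibliographic references"]:
    print(heading, "->", looks_like_reference_heading(heading))
```

Under this assumed whole-line matching, "References" and "7. References" are accepted while "Bibliographic references" is not, because the extra leading word keeps the line from equalling any listed title.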

@Apurv3377
Author

Thanks for the prompt and detailed response. It also answers the other doubt I had. :)
