Refextract fails to extract from two-columned layout pdf #85

Apurv3377 · 2021-09-07T08:05:57Z

When the input PDF has a two-column layout, refextract outputs an empty array of references.

from refextract import extract_references_from_url
references = extract_references_from_url('https://arxiv.org/pdf/1710.11035.pdf')
print(references[0])

When the input PDF has a one-column layout, refextract works fine.

from refextract import extract_references_from_url
references = extract_references_from_url('https://arxiv.org/pdf/1509.03588.pdf')
print(references[0])

How can I get refextract to parse both types of layout?

Thank you.


michamos · 2021-09-07T08:19:33Z

I don't think the issue is related to the layout. Two-column layouts should usually work just fine. Refextract is not meant to be a general-purpose reference extraction tool; it has been tuned to work well for High-Energy Physics and related fields. If the citation styles are very different, it will get into trouble. In this case, I believe the cause is that the heading is called "Bibliographic references", which is not one of the expected titles:

refextract/refextract/references/regexs.py

Lines 696 to 710 in 24418cd

    titles = [u'references',
              u'r\u00C9f\u00E9rences',
              u'r\u00C9f\u00C9rences',
              u'r\xb4ef\xb4erences',
              u'bibliography',
              u'bibliographie',
              u'literaturverzeichnis',
              u'citations',
              u'refs',
              u'publications'
              u'r\u00E9fs',
              u'r\u00C9fs',
              u'reference',
              u'r\u00E9f\u00E9rence',
              u'r\u00C9f\u00C9rence']

Additionally, there are no markers at the beginning of each reference, so it might struggle to separate them. If you're looking for a general-purpose tool, I would look into https://github.com/kermitt2/grobid instead.
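To illustrate why an unexpected heading could lead to an empty result, here is a minimal sketch of whole-line heading matching against a subset of the titles quoted above. The names `KNOWN_TITLES`, `HEADING_RE` and `is_known_heading`, and the matching rule itself (optional section number, then a known title filling the entire line), are assumptions made for illustration only; this is not refextract's actual implementation.

```python
import re

# Subset of the heading titles quoted above. Everything below is a
# hypothetical illustration, not the logic refextract really uses.
KNOWN_TITLES = ["references", "bibliography", "bibliographie", "citations", "refs"]

# Assumed rule: an optional section number, then a known title that fills
# the whole line (case-insensitive), optionally followed by a colon.
HEADING_RE = re.compile(
    r"^\s*(?:\d+\.?\s+)?(?:" + "|".join(map(re.escape, KNOWN_TITLES)) + r")\s*:?\s*$",
    re.IGNORECASE,
)


def is_known_heading(line):
    """Return True if the line looks like a recognised reference-section heading."""
    return HEADING_RE.match(line) is not None


if __name__ == "__main__":
    for heading in ["References", "7. References", "Bibliographic references"]:
        print(heading, "->", is_known_heading(heading))
```

Under this assumed rule, "References" and "7. References" are recognised, while "Bibliographic references" is not, so a tool relying on such a check would never locate the start of the reference section.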

Apurv3377 · 2021-09-07T18:05:47Z

Thanks for the prompt and detailed response. It also answers the other doubt I had. :)

Apurv3377 closed this as completed Sep 7, 2021
