arXiv identifiers not extracted #275

dksanyal · 2018-01-09T12:36:30Z

Hi, I found that arXiv identifiers are not extracted from PDF files in most cases, both in the online version and in our local build; I have seen only a few cases where it succeeds. I have uploaded 2 example PDFs where this problem occurs. Kindly let me know if GROBID supports extraction of arXiv ids and, if not, whether there is any plan to support it in the near future.

1801.00857.pdf
1801.02609.pdf

kermitt2 · 2018-01-12T08:19:09Z

Hello, arXiv identifiers are well supported by the bibliographical reference models (nearly 2000 annotated examples have been added with arXiv ids); however, the header model has not yet been updated to support them similarly. So an arXiv id present in the header of the PDF will in general not be extracted for the moment.

I plan to add that when I update the header model, which is long overdue. As it is quite a lot of work (the training data will need to be updated too), I need to find enough free time to launch the effort. It should be done in the first half of this year, so rather mid-term future.

dksanyal · 2018-01-12T10:36:58Z

Thank you Sir for your response and for meticulously maintaining such a complex tool for the community.
We hope to see the updated header model soon!

philgooch · 2018-08-02T08:16:14Z

I find that regex extraction works well for well-defined identifiers such as arXiv ids and DOIs, rather than training a model to detect them.

For modern arXiv ids, something like Pattern.compile("(?i)arXiv:\\d{4}[.]\\d{4,5}(v\\d+)?") works well.

getAllBlocksClean() in the Grobid Document class will give you the raw text of the document; you can then restrict the arXiv id search to the first few lines of the document to reduce the risk of picking up an arXiv id from the bibliography.
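The suggestion above can be sketched as follows. This is a minimal illustration, not GROBID code: the class name ArxivHeaderIdFinder, the method findInHead and the maxLines parameter are hypothetical, and the raw text is assumed to already be available as a plain String (however it was obtained).

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of GROBID.
class ArxivHeaderIdFinder {
    // The pattern quoted in the comment above: modern arXiv ids with an optional version.
    private static final Pattern ARXIV_ID =
        Pattern.compile("(?i)arXiv:\\d{4}[.]\\d{4,5}(v\\d+)?");

    /** Returns the first arXiv id found within the first maxLines lines of rawText. */
    static Optional<String> findInHead(String rawText, int maxLines) {
        // Split off at most maxLines lines; the remainder of the text stays in the last slot
        // and is never scanned.
        String[] lines = rawText.split("\\R", maxLines + 1);
        int limit = Math.min(lines.length, maxLines);
        for (int i = 0; i < limit; i++) {
            Matcher m = ARXIV_ID.matcher(lines[i]);
            if (m.find()) {
                return Optional.of(m.group());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String text = "Some Title\n"
            + "arXiv:1801.00857v1 [stat.ML] 2 Jan 2018\n"
            + "Abstract ...\n"
            + "References\n"
            + "[1] A. Author. arXiv:1234.56789";
        // Only the first three lines are searched, so the bibliography entry is ignored.
        System.out.println(findInHead(text, 3).orElse("none")); // prints arXiv:1801.00857v1
    }
}
```

The line limit is a trade-off: too small and an id printed in a margin watermark may be missed, too large and ids from the reference list start to leak in.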

kermitt2 · 2018-08-02T17:28:37Z

hello @philgooch !

DOI detection has worked exactly like this in the header for something like 4-5 years.

This indeed works very well for highly discriminative identifiers.

The advantage of integrating these regexes into the ML model as features, as in the citation parser (there is no ML model trained specifically for any type of identifier in GROBID), is that it also helps the prediction of the surrounding structures, it makes the process more robust to PDF noise, and it could work for more ambiguous identifiers like ISSN, so it is more general. For citations and arXiv ids, this approach performed better than only using a regex on the citation string independently from the CRF (though not a lot better, if I remember well...). For the header, we will see!
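To illustrate the idea of using a regex as a feature rather than as the final decision, here is a rough sketch. It is not GROBID's feature format or its actual patterns: the class RegexFeatureSketch, both patterns and the ARXIV=/ISSN= column names are assumptions made for the example. Each token gets binary flags that a sequence labeller such as a CRF could weigh together with its other features.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch, not GROBID's feature generator.
class RegexFeatureSketch {
    // Assumed token-level patterns, for illustration only.
    private static final Pattern ARXIV_TOKEN =
        Pattern.compile("(?i)(arXiv:)?\\d{4}[.]\\d{4,5}(v\\d+)?");
    private static final Pattern ISSN_LIKE =
        Pattern.compile("\\d{4}-\\d{3}[\\dXx]");

    /** Builds one feature row per token: the token followed by binary regex flags. */
    static List<String> featurize(List<String> tokens) {
        List<String> rows = new ArrayList<>();
        for (String token : tokens) {
            int arxiv = ARXIV_TOKEN.matcher(token).matches() ? 1 : 0;
            int issn = ISSN_LIKE.matcher(token).matches() ? 1 : 0;
            // The flags are hints for the model, not final labels.
            rows.add(token + " ARXIV=" + arxiv + " ISSN=" + issn);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("arXiv:1801.00857v1", "1999-2004", "0028-0836");
        for (String row : featurize(tokens)) {
            System.out.println(row);
        }
    }
}
```

Note that the year range 1999-2004 raises the ISSN flag just like 0028-0836 does. A standalone regex would mislabel it, whereas a model that sees the flag alongside the surrounding tokens can learn to discount it, which is the robustness argument made above for ambiguous identifiers.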

kermitt2 · 2020-08-12T12:33:26Z

arXiv ids are now well supported both in citations (for quite a long time already) and in the metadata header. For the original example PDFs from this issue:

1801.00857.pdf ->

...
                    <monogr>
                        <imprint>
                            <date type="published" when="2018-01-02">2 Jan 2018</date>
                        </imprint>
                    </monogr>
                    <idno type="arXiv">arXiv:1801.00857v1[stat.ML]</idno>
                </biblStruct>

1801.02609.pdf ->

                    <monogr>
                        <imprint>
                            <date type="published" when="2018-01-08">8 Jan 2018</date>
                        </imprint>
                    </monogr>
                    <idno type="arXiv">arXiv:1801.02609v1[cs.NI]</idno>
                </biblStruct>
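For readers consuming this output, the identifier can be read back from the TEI with a standard XML parser. The sketch below uses only the JDK's DOM API; the class TeiArxivIdReader and its method are hypothetical names, and it assumes the TEI result is available as a well-formed XML string.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of GROBID.
class TeiArxivIdReader {
    /** Returns the text of every <idno type="arXiv"> element in the given XML. */
    static List<String> arxivIds(String teiXml) {
        try {
            // The default factory is not namespace-aware, so elements are matched by plain tag name.
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(teiXml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("idno");
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element idno = (Element) nodes.item(i);
                if ("arXiv".equals(idno.getAttribute("type"))) {
                    ids.add(idno.getTextContent().trim());
                }
            }
            return ids;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse TEI XML", e);
        }
    }

    public static void main(String[] args) {
        String tei = "<biblStruct><idno type=\"arXiv\">arXiv:1801.00857v1[stat.ML]</idno></biblStruct>";
        System.out.println(arxivIds(tei));
    }
}
```

As the examples above show, the extracted value keeps the version and subject class (for instance v1[stat.ML]), so downstream code may want to normalise it before using it as a key.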

elonzh · 2021-09-14T07:04:35Z

The case https://arxiv.org/pdf/2004.07180.pdf does not seem to work.

bnewbold · 2021-11-13T01:51:08Z

@elonzh I'm not sure which part you meant, the header or the citations. I tried with GROBID 0.7.0+.

The grey text watermark does get segmented incorrectly. Instead of being part of the header, it ends up as a figure in the body:

            <figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0">
                <head></head>
                <label></label>
                <figDesc>2 SPECTER: Scientific Paper Embeddings using Citationinformed TransformERs arXiv:2004.07180v4 [cs.CL] 20 May 2020</figDesc>
            </figure>

Separately, a number of the citations in the paper have, to me, strangely formatted references to arxiv.org pre-prints. They have abs/ in front of the identifier, which is not a valid section and also mixes the "old" and "new" identifier styles. GROBID's behavior in this corner case seems reasonable to me. Here are some snipped examples:

<note type="raw_reference">Erik Holmer and Andreas Marfurt. 2018. Explaining away syntactic structure in semantic document rep- resentations. ArXiv, abs/1806.01620.</note>
<idno>abs/1806.01620</idno>

<note type="raw_reference">Chanwoo Jeong, Sion Jang, Hyuna Shin, Eun- jeong Lucy Park, and Sungchul Choi. 2019. A context-aware citation recommendation model with bert and graph convolutional networks. ArXiv, abs/1903.06464.</note>
<idno>abs/1903.06464</idno>
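If such references need to be resolved downstream, one option is a small post-processing step on the raw reference string. The following is only a sketch of that idea and does not describe what GROBID does: the class ArxivAbsNormalizer and its pattern are assumptions, covering just the "ArXiv, abs/NNNN.NNNNN" shape seen in the two examples above.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical post-processing helper, not part of GROBID.
class ArxivAbsNormalizer {
    // Assumed shape: "ArXiv", optional comma or colon, optional space, "abs/", new-style id.
    private static final Pattern ARXIV_ABS =
        Pattern.compile("(?i)arXiv[,:]?\\s*abs/(\\d{4}[.]\\d{4,5}(?:v\\d+)?)");

    /** Rewrites "ArXiv, abs/1806.01620" style mentions to "arXiv:1806.01620". */
    static Optional<String> normalize(String rawReference) {
        Matcher m = ARXIV_ABS.matcher(rawReference);
        return m.find() ? Optional.of("arXiv:" + m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String raw = "Erik Holmer and Andreas Marfurt. 2018. ArXiv, abs/1806.01620.";
        System.out.println(normalize(raw).orElse("none")); // prints arXiv:1806.01620
    }
}
```

Requiring the word "ArXiv" before abs/ keeps the pattern from firing on unrelated strings that merely contain abs/ followed by digits.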

kermitt2 added the enhancement label Jul 5, 2019

kermitt2 mentioned this issue May 10, 2020: [WIP] Full update of the header model #580 (Merged)

lfoppiano added the implemented label Aug 12, 2020

kermitt2 removed the enhancement label Aug 13, 2020

bnewbold mentioned this issue Nov 13, 2021: expand scope of arxiv identifier matcher, and fix some training data annotations #858 (Open)

lfoppiano closed this as completed Apr 11, 2023
