arXiv identifiers not extracted #275

Closed
dksanyal opened this issue Jan 9, 2018 · 7 comments
Labels
implemented The issue has been implemented

@dksanyal

dksanyal commented Jan 9, 2018

Hi, I found that arXiv identifiers are not extracted from PDF files in most cases (both in the online version and in our local build); I have seen only a very few cases where it succeeds. I have uploaded two example PDFs where this problem occurs. Kindly let me know whether GROBID supports extraction of arXiv ids and, if not, whether there is any plan to support it in the near future.

1801.00857.pdf
1801.02609.pdf

@kermitt2
Owner

Hello, arXiv identifiers are well supported by the bibliographical reference models (nearly 2000 annotated examples have been added with arXiv ids); however, the header model has not yet been updated to support them similarly. So arXiv ids present in the header of the PDF will in general not be extracted for the moment.

I plan to add that when I update the header model, which is long overdue. As it is quite a lot of work (the training data will need to be updated too), I need to find enough free time to launch the effort. It should be done in the first half of this year, so rather in the mid-term future.

@dksanyal
Author

Thank you Sir for your response and for meticulously maintaining such a complex tool for the community.
We hope to see the updated header model soon!

@philgooch
Contributor

I find that regex extraction works well for well-defined identifiers such as arXiv ids and DOIs, as an alternative to training a model to detect them.

For modern arXiv ids, something like Pattern.compile("(?i)arXiv:\\d{4}[.]\\d{4,5}(v\\d+)?") works well.

getAllBlocksClean() in the Grobid Document class will give you the raw text of the document. You can then restrict the arXiv id search to the first few lines of the document to reduce the risk of picking up an arXiv id from the bibliography.
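A minimal sketch of that idea, assuming the raw text is already available as a plain String (for example from getAllBlocksClean(); the call itself is not shown here). The class name, method name, and the line limit are hypothetical and chosen only for illustration:

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of GROBID: searches only the first lines of a document.
class ArxivHeaderIdFinder {
    // Modern arXiv ids, e.g. "arXiv:1801.00857v1" (pattern taken from the comment above).
    private static final Pattern ARXIV_ID =
            Pattern.compile("(?i)arXiv:\\d{4}[.]\\d{4,5}(v\\d+)?");

    // Returns the first arXiv id found within the first maxLines lines, if any.
    static Optional<String> findInFirstLines(String rawText, int maxLines) {
        // Split into at most maxLines + 1 pieces; the last piece holds the rest of the text.
        String[] lines = rawText.split("\\R", maxLines + 1);
        int limit = Math.min(maxLines, lines.length);
        for (int i = 0; i < limit; i++) {
            Matcher m = ARXIV_ID.matcher(lines[i]);
            if (m.find()) {
                return Optional.of(m.group());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String text = "Some Title\n"
                + "arXiv:1801.00857v1 [stat.ML] 2 Jan 2018\n"
                + "Abstract\n"
                + "References\n"
                + "arXiv:1234.56789";
        // Only the first three lines are searched, so the bibliography entry is ignored.
        System.out.println(findInFirstLines(text, 3));
    }
}
```

The line limit is a trade-off: too small and a watermark placed lower on the first page is missed, too large and ids from the reference list may be picked up.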

@kermitt2
Owner

kermitt2 commented Aug 2, 2018

Hello @philgooch!

DOI detection has worked exactly like this in the header for something like 4-5 years.

This works very well for highly distinctive identifiers indeed.

The advantage of integrating these regexes into the ML model as features, as in the citation parser (there is no ML model trained specifically for any type of identifier in GROBID), is that it also helps the prediction of the surrounding structures, makes the process more robust to PDF noise, and could work for more ambiguous identifiers like ISSN, so it is more general. For citations and arXiv ids, this approach performed better than using a regex alone on the citation string, independently of the CRF (though not a lot better, if I remember well...). For the header, we will see!
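To illustrate the general idea of a regex used as a feature rather than as a standalone extractor, here is a toy sketch. It is not GROBID's feature code: the class name, the `ARXIV=` flag, and the row layout are all made up for illustration. Each whitespace-separated token gets a binary flag saying whether it overlaps a regex match, and a sequence labeller such as a CRF could then weigh that flag alongside its other features:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy illustration only; the feature name and row format are hypothetical.
class RegexFeatureSketch {
    private static final Pattern ARXIV_ID =
            Pattern.compile("(?i)arXiv:\\d{4}[.]\\d{4,5}(v\\d+)?");
    private static final Pattern TOKEN = Pattern.compile("\\S+");

    // Produces one "token ARXIV=0|1" row per whitespace-separated token.
    static List<String> featureRows(String line) {
        // Mark every character position covered by an arXiv id match.
        boolean[] covered = new boolean[line.length()];
        Matcher id = ARXIV_ID.matcher(line);
        while (id.find()) {
            for (int i = id.start(); i < id.end(); i++) {
                covered[i] = true;
            }
        }
        List<String> rows = new ArrayList<>();
        Matcher tok = TOKEN.matcher(line);
        while (tok.find()) {
            // A token is flagged if any of its characters lies inside a match.
            boolean hit = false;
            for (int i = tok.start(); i < tok.end() && !hit; i++) {
                hit = covered[i];
            }
            rows.add(tok.group() + " ARXIV=" + (hit ? "1" : "0"));
        }
        return rows;
    }
}
```

Because the flag is only one input among many, the labeller can still override it when the surrounding context disagrees, which is where the robustness to noise would come from.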

@kermitt2
Owner

arXiv ids are now well supported both for citations (for quite a long time already) and in the metadata header. For the original example PDFs of this issue:

1801.00857.pdf ->

...
                    <monogr>
                        <imprint>
                            <date type="published" when="2018-01-02">2 Jan 2018</date>
                        </imprint>
                    </monogr>
                    <idno type="arXiv">arXiv:1801.00857v1[stat.ML]</idno>
                </biblStruct>

1801.02609.pdf ->

                    <monogr>
                        <imprint>
                            <date type="published" when="2018-01-08">8 Jan 2018</date>
                        </imprint>
                    </monogr>
                    <idno type="arXiv">arXiv:1801.02609v1[cs.NI]</idno>
                </biblStruct>
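For anyone consuming this output downstream, the following is a minimal sketch of reading the arXiv `idno` values from a TEI fragment with the JDK's built-in DOM parser. The class and method names are hypothetical, and the parser is deliberately not namespace-aware so that the unprefixed `idno` elements can be looked up by plain tag name:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for reading GROBID-style TEI output; not part of GROBID.
class TeiArxivIdReader {
    // Returns the text of every <idno type="arXiv"> element in the given XML.
    static List<String> arxivIdnos(String teiXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Limit what the parser will do with untrusted input.
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(teiXml)));
            NodeList nodes = doc.getElementsByTagName("idno");
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element idno = (Element) nodes.item(i);
                if ("arXiv".equals(idno.getAttribute("type"))) {
                    ids.add(idno.getTextContent().trim());
                }
            }
            return ids;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse TEI XML", e);
        }
    }
}
```

Note that the value is returned as it appears in the TEI, so for the first example it would include the subject class suffix, as in `arXiv:1801.00857v1[stat.ML]`.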

@lfoppiano added the "implemented" label Aug 12, 2020
@elonzh
Contributor

elonzh commented Sep 14, 2021

The case https://arxiv.org/pdf/2004.07180.pdf does not seem to work.

@bnewbold
Contributor

@elonzh I'm not sure which part you meant, the header or the citations. I tried with GROBID 0.7.0+.

The grey text watermark does get segmented incorrectly. Instead of being part of the header, it ends up as a figure in the body:

            <figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0">
                <head></head>
                <label></label>
                <figDesc>2 SPECTER: Scientific Paper Embeddings using Citationinformed TransformERs arXiv:2004.07180v4 [cs.CL] 20 May 2020</figDesc>
            </figure>

Separately, a number of the citations in the paper have, to me, strangely formatted references to arxiv.org pre-prints. They have abs/ in front of the identifier, which is not a valid section and also mixes the "old" and "new" identifier styles. GROBID's behavior in this corner case seems reasonable to me. Here are some snipped examples:

<note type="raw_reference">Erik Holmer and Andreas Marfurt. 2018. Explaining away syntactic structure in semantic document rep- resentations. ArXiv, abs/1806.01620.</note>
<idno>abs/1806.01620</idno>

<note type="raw_reference">Chanwoo Jeong, Sion Jang, Hyuna Shin, Eun- jeong Lucy Park, and Sungchul Choi. 2019. A context-aware citation recommendation model with bert and graph convolutional networks. ArXiv, abs/1903.06464.</note>
<idno>abs/1903.06464</idno>
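If such values need to be matched against arXiv downstream, one option is a small post-processing step outside GROBID. The sketch below is an assumption about how that could look, not something GROBID does: the class and method names are hypothetical, and it only handles new-style ids with an optional `arXiv:` or `abs/` prefix.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical post-processing helper; not part of GROBID.
class ArxivIdNormalizer {
    // Optional "arXiv:" and "abs/" prefixes, then a new-style id with optional version.
    private static final Pattern RAW = Pattern.compile(
            "(?i)^\\s*(?:arXiv:)?\\s*(?:abs/)?(\\d{4}\\.\\d{4,5})(v\\d+)?");

    // Reduces values like "abs/1806.01620" or "arXiv:1801.00857v1[stat.ML]" to the bare id.
    static Optional<String> normalize(String raw) {
        Matcher m = RAW.matcher(raw);
        if (!m.lookingAt()) {
            return Optional.empty();
        }
        String version = m.group(2) == null ? "" : m.group(2);
        return Optional.of(m.group(1) + version);
    }

    public static void main(String[] args) {
        System.out.println(normalize("abs/1806.01620"));
        System.out.println(normalize("arXiv:1801.00857v1[stat.ML]"));
    }
}
```

Anything that does not start with a recognizable id yields an empty result, so untyped `idno` values that are not arXiv ids are left alone.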
