arXiv identifiers not extracted #275
Comments
Hello, arXiv identifiers are well supported by the bibliographical reference models (nearly 2000 annotated examples have been added with arXiv ids); however, the header model has not yet been updated to support them similarly. So an arXiv id present in the header of the PDF will in general not be extracted for the moment. I've planned to add that when I update the header model, which is long overdue... As it is quite a lot of work (training data will need to be updated too), I need to find enough free time to launch the effort. It should be done in the first half of this year, so rather mid-term future. |
Thank you Sir for your response and for meticulously maintaining such a complex tool for the community. |
I find that regex extraction works well for well-defined identifiers such as arXiv ids and DOIs, rather than training a model to detect them. For modern arXiv ids, something like
|
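As an illustration of the regex approach described in the comment above, here is a minimal Python sketch. The pattern is an assumption written for this example: it is neither the expression from the original comment nor the one GROBID uses. It targets new-style arXiv ids (`YYMM.NNNN` or `YYMM.NNNNN`, with an optional `vN` version suffix); the names `ARXIV_NEW_STYLE` and `find_arxiv_ids` are hypothetical.

```python
import re

# Assumed pattern for new-style arXiv ids: two-digit year, two-digit month,
# a dot, then 4 or 5 digits, optionally followed by a version such as "v1".
# The lookarounds reject matches embedded in longer digit/dot sequences.
ARXIV_NEW_STYLE = re.compile(
    r"(?<![\d.])(\d{2}(?:0[1-9]|1[0-2])\.\d{4,5})(v\d+)?(?!\d)"
)


def find_arxiv_ids(text):
    """Return (identifier, version) pairs; version is None when absent."""
    return [(m.group(1), m.group(2)) for m in ARXIV_NEW_STYLE.finditer(text)]


print(find_arxiv_ids("arXiv:1801.00857v1[stat.ML]"))
print(find_arxiv_ids("https://arxiv.org/pdf/2004.07180.pdf"))
```

A pattern like this can still produce false positives on strings that merely look like an id (for example a year followed by a decimal number), so in practice it would likely be combined with context such as an `arXiv:` prefix.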
Hello @philgooch! DOI detection has worked exactly like this in the header for something like 4-5 years. This works very well for highly discriminative identifiers indeed. The advantage of integrating these regexes into the ML model as features (there is no ML model trained specifically for any type of identifier in GROBID), as in the citation parser, is that it also helps the prediction of the surrounding structures, makes the process more robust to PDF noise, and could work for more ambiguous identifiers like ISSN, so it is more general. For citations and arXiv ids, this approach performed better than only applying a regex to the citation string independently from the CRF (but not a lot better either, if I remember well...). For the header, we will see! |
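To make the "regex as a feature" idea above concrete, here is a rough Python sketch of a per-token feature function for a sequence labeler such as a CRF. It is only an illustration under assumptions: the feature names, the `ARXIV_TOKEN` pattern, and the `token_features` function are hypothetical and do not reflect GROBID's actual feature set or implementation.

```python
import re

# Assumed token-level pattern: an optional "arXiv:" prefix, a new-style id,
# and an optional version suffix. Hypothetical, for illustration only.
ARXIV_TOKEN = re.compile(r"^(?:arXiv:)?\d{4}\.\d{4,5}(?:v\d+)?$", re.IGNORECASE)


def token_features(tokens, i):
    """Build a feature dict for tokens[i]; the regex match is one feature among others."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_digit": tok.isdigit(),
        # The regex result is exposed to the model rather than used as a hard rule.
        "matches_arxiv": bool(ARXIV_TOKEN.match(tok)),
        # Context feature: lets the model use the match when labeling neighbors.
        "prev_matches_arxiv": i > 0 and bool(ARXIV_TOKEN.match(tokens[i - 1])),
    }


tokens = ["arXiv:1801.00857v1", "[stat.ML]", "2", "Jan", "2018"]
for i in range(len(tokens)):
    print(tokens[i], token_features(tokens, i))
```

Because the match is a feature and not a final decision, the labeler can weigh it against other evidence, which is what makes the approach usable for more ambiguous identifiers.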
arXiv ids are now well supported both for citations (for quite a long time) and in the metadata header. For the original example PDFs of this issue:
1801.00857.pdf -> ...
<monogr>
<imprint>
<date type="published" when="2018-01-02">2 Jan 2018</date>
</imprint>
</monogr>
<idno type="arXiv">arXiv:1801.00857v1[stat.ML]</idno>
</biblStruct>
1801.02609.pdf ->
<monogr>
<imprint>
<date type="published" when="2018-01-08">8 Jan 2018</date>
</imprint>
</monogr>
<idno type="arXiv">arXiv:1801.02609v1[cs.NI]</idno>
</biblStruct> |
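For readers who want to pick the identifier out of output like the above, here is a small Python sketch using the standard library. The `fragment` string is a hand-completed, well-formed version of the excerpt shown in the comment, not full GROBID output, and the function name `arxiv_idnos` is hypothetical. Full TEI documents declare an XML namespace, so the sketch compares local tag names to work with or without one.

```python
import xml.etree.ElementTree as ET


def arxiv_idnos(xml_text):
    """Return the text of every <idno type="arXiv"> element in the given XML."""
    root = ET.fromstring(xml_text)
    found = []
    for el in root.iter():
        # Strip a "{namespace}" prefix if present, so both plain and
        # namespaced documents are handled.
        local_name = el.tag.rsplit("}", 1)[-1]
        if local_name == "idno" and el.get("type") == "arXiv":
            found.append((el.text or "").strip())
    return found


# Assumed, hand-completed fragment based on the excerpt in this thread.
fragment = """<biblStruct>
  <monogr>
    <imprint>
      <date type="published" when="2018-01-02">2 Jan 2018</date>
    </imprint>
  </monogr>
  <idno type="arXiv">arXiv:1801.00857v1[stat.ML]</idno>
</biblStruct>"""

print(arxiv_idnos(fragment))  # ['arXiv:1801.00857v1[stat.ML]']
```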
The case https://arxiv.org/pdf/2004.07180.pdf does not seem to work. |
@elonzh I'm not sure which part you meant, the header or the citations. I tried with GROBID 0.7.0+. The grey text watermark does get segmented incorrectly. Instead of part of the header it ends up as a figure in the body:
Separately, a number of the citations in the paper have, to me, strangely formatted references to arxiv.org pre-prints. They have
|
Hi, I found that arXiv identifiers are not extracted from PDF files in most cases (both in the online version and in our local build). I have uploaded 2 example PDFs where this problem occurs: 1801.00857.pdf and 1801.02609.pdf. (I have seen only a very few cases where it succeeds.) Kindly let me know if GROBID supports extraction of arXiv ids and, if not, whether there is any plan to support it in the near future.