[WIP] Line number support #572
I've started testing it. I first built the latest version of pdfalto from the line-number branch. NOTE: I used the Then I copied the binary in the
Am I doing everything correctly? |
I am not sure my last push for pdfalto was completed.
|
OK, some commits had not been pushed. It works fine now. |
I tested a few articles and checked the code, and it looks good to me. It is not so easy to find articles with line numbers. To compare the results properly, I merged this branch locally with the latest master (f47cc10).
Here are some examples:
- https://www.biorxiv.org/content/10.1101/2020.04.21.054221v1
  The numbered lines are gone; however, there are mismatches in the affiliations: the master version is able to link some of the affiliations with authors, while the master version is able to link some of them. See results here
- https://www.biorxiv.org/content/10.1101/2020.04.21.054221v1
  There are no line numbers; I tested it only because the affiliation list might use numbers as identifiers. There are two discrepancies with master: the first affiliation, and a table (the supplementary table) mistakenly extracted as a citation. See results here
- https://www.biorxiv.org/content/10.1101/2020.04.21.054429v1
  Just one affiliation is extracted wrongly. Not sure it's relevant. See results here
@lfoppiano thanks for the test, Luca! Is there a problem with the links for the first PDF? The PDF link is the same as in the second example, and the result link points to the results of the third example. For the second example, I tested with this branch, and I see valid affiliations, with no discrepancies from master:
<author>
<persName xmlns="http://www.tei-c.org/ns/1.0">
<forename type="first">Maria</forename>
<surname>Vasilarou</surname>
</persName>
<affiliation key="aff0">
<orgName type="department" key="dep1">Institute of Molecular Biology and Biotechnology (IMBB)</orgName>
<orgName type="department" key="dep2">Foundation for Research and Technology Hellas (FORTH)</orgName>
</affiliation>
<affiliation key="aff1">
<orgName type="department">Department of Biology</orgName>
<orgName type="institution">University of Crete</orgName>
<address>
<settlement>Crete</settlement>
<country key="GR">Greece</country>
</address>
</affiliation>
</author> |
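To make this kind of branch-versus-master comparison less dependent on reading XML by eye, here is a minimal sketch in Python that lists the affiliations attached to each author and reports the authors whose affiliations differ between two outputs. The function names `author_affiliations` and `diff_affiliations` and the `SAMPLE` string are hypothetical, introduced only for illustration; they are not part of GROBID. The sketch matches elements by local name, because the snippet above carries the TEI namespace only on `persName`.

```python
import xml.etree.ElementTree as ET

# Shortened version of the <author> snippet quoted in this thread.
SAMPLE = """<author>
  <persName xmlns="http://www.tei-c.org/ns/1.0">
    <forename type="first">Maria</forename>
    <surname>Vasilarou</surname>
  </persName>
  <affiliation key="aff1">
    <orgName type="department">Department of Biology</orgName>
    <orgName type="institution">University of Crete</orgName>
  </affiliation>
</author>"""


def local(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def author_affiliations(xml_text):
    """Map each author's name to a list of affiliations (tuples of orgName texts)."""
    root = ET.fromstring(xml_text)
    result = {}
    for author in root.iter():
        if local(author.tag) != "author":
            continue
        # Author name: forename(s) followed by surname, in document order.
        name = " ".join(
            el.text.strip()
            for el in author.iter()
            if local(el.tag) in ("forename", "surname") and el.text
        )
        affiliations = []
        for aff in author.iter():
            if local(aff.tag) != "affiliation":
                continue
            affiliations.append(tuple(
                el.text.strip()
                for el in aff.iter()
                if local(el.tag) == "orgName" and el.text
            ))
        result[name] = affiliations
    return result


def diff_affiliations(a, b):
    """Return {author: (affiliations_in_a, affiliations_in_b)} where they differ."""
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }
```

On a complete TEI file this would also pick up `author` elements inside the bibliography, and authors with identical names would overwrite each other, so in practice the input should probably be restricted to the header section first.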
Sorry, I must have made a mess when copy-pasting the various links. I'll try to correct them now.
The link is wrong, I meant this:
The explanation is correct, but the link is indeed wrong, this should be the correct zip file:
Could you provide the git ref from the commits that you used to compare? I compared |
I've checked the differences between Mac and Linux, and here is an example: I have no clue where these come from, but it seems that the Mac version leaves out some characters too... MullenJSSv18i03.linux.xml.zip The GROBID output, however, does not seem to have such discrepancies. Most of the differences are in formulas and tables, and they are very few. Here is the GROBID output for Mac and Linux: |
I'm adding another test: https://iopscience.iop.org/article/10.1088/1361-6668/aabddb/pdf (I used the submitted manuscript, which contains line numbers). I'm attaching both results (with and without line number support). The output seems to improve in general; however, I see many differences in the citation extraction: |
This PR adds generic support for preprints/manuscripts with line numbers, via pdfalto.
Line numbers are identified by pdfalto via x/y clustering and some additional constraints. The pdfalto parameter noLineNumber (branch line-number) controls whether the line numbers are kept in the output or not. This PR supports line numbers on the right or on the left, any increment, partial line numbering within a page, and pages with and without line numbers in the same document. It also has some mechanisms to avoid false positives (in particular for bibliographical reference sections and tables with vertically aligned numbers).
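As a rough illustration of the x/y clustering idea, the following Python sketch groups numeric tokens on one page by horizontal position and keeps the largest group whose values increase by a constant step from top to bottom. It is not pdfalto's actual implementation (pdfalto is written in C++ and applies additional constraints not shown here). The `Token` type, the `find_line_numbers` function, and the `x_tol` and `min_count` parameters are all hypothetical, and `x` is assumed to be the right edge of the token so that right-aligned numbers of different widths line up.

```python
from collections import defaultdict
from typing import NamedTuple


class Token(NamedTuple):
    text: str
    x: float  # assumed: horizontal position of the token's right edge
    y: float  # vertical position, increasing down the page


def find_line_numbers(tokens, x_tol=2.0, min_count=5):
    """Return the tokens on one page that look like margin line numbers."""
    # Step 1: bucket purely numeric tokens by x position.
    clusters = defaultdict(list)
    for tok in tokens:
        if tok.text.isdigit():
            clusters[round(tok.x / x_tol)].append(tok)

    best = []
    for members in clusters.values():
        if len(members) < min_count:
            continue
        # Step 2: read the column top to bottom and require one constant,
        # positive increment (1, 5, ...). A table column of arbitrary
        # numbers fails this check.
        members.sort(key=lambda t: t.y)
        values = [int(t.text) for t in members]
        steps = {b - a for a, b in zip(values, values[1:])}
        if len(steps) == 1 and steps.pop() > 0 and len(members) > len(best):
            best = members
    return best
```

Fixed-width bucketing can split a column that straddles a bucket boundary, and a reference list numbered 1, 2, 3, ... would pass the constant-step check, which is presumably why the PR mentions extra safeguards for bibliographical sections and tables.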
No regression with PMC 1942 set.
This is ready to be tested on Linux. pdfalto will need to be compiled for other platforms before merging.
Feedback and reports of remaining error cases are welcome, so that this can be refined if necessary.