[WIP] Line number support #572

Merged
merged 7 commits into from Aug 11, 2020

kermitt2
Owner

This is a generic support for preprints/manuscripts with line numbers, via pdfalto.

Line numbers are identified by pdfalto via x/y clustering and some additional constraints. The -noLineNumbers parameter of pdfalto (branch line-number) controls whether the line numbers are kept in the output.

This PR supports line numbers on the right or on the left, any increment, pages that are only partially numbered, and documents that mix pages with and without line numbers. It also has some mechanisms to avoid false positives (in particular for bibliographical reference sections and tables with vertically aligned numbers).
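
As a rough illustration of the kind of x/y clustering mentioned above, here is a minimal Python sketch. It is not pdfalto's implementation (pdfalto is written in C++ and its exact constraints are not described in this thread). The `(text, x, y)` token layout, the `x_tol` bucket size and the `min_run` threshold are all assumptions made for the example.

```python
from collections import defaultdict

def find_line_numbers(tokens, x_tol=2.0, min_run=5):
    """tokens: list of (text, x, y) for one page, with y growing downward.

    Returns the set of token indices that look like margin line numbers.
    Assumes x is the coordinate of the edge on which the numbers are aligned.
    """
    # Group purely numeric tokens into buckets of similar x position.
    columns = defaultdict(list)
    for i, (text, x, y) in enumerate(tokens):
        if text.isdecimal():
            columns[round(x / x_tol)].append((y, int(text), i))

    found = set()
    for members in columns.values():
        members.sort()  # top to bottom
        if len(members) < min_run:
            continue  # too few aligned numbers to be a numbering column
        values = [value for _, value, _ in members]
        steps = {b - a for a, b in zip(values, values[1:])}
        # Accept any constant positive increment (1, 5, ...); a table column
        # of arbitrary numbers will usually fail this check.
        if len(steps) == 1 and min(steps) > 0:
            found.update(i for _, _, i in members)
    return found
```

The constant-increment check is one simple way to reject vertically aligned numbers in tables. A real detector would need more constraints than this, for example to handle bibliographical reference sections.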

No regression with PMC 1942 set.

This is ready to be tested on Linux. pdfalto will need to be compiled for other platforms before merging.

Feedback and remaining error cases are welcome for further refinement.

@kermitt2 kermitt2 requested a review from lfoppiano April 20, 2020 04:33
@lfoppiano
Collaborator

lfoppiano commented Apr 20, 2020

I've started testing it.

I first built the latest version of pdfalto from the line-number branch. NOTE: I used the HEAD of xpdf-4.0.

Then I copied the binary into grobid-home and started the service to test some files. There seems to be a problem between grobid and the new version of pdfalto.

ERROR [2020-04-20 11:11:33,451] org.grobid.core.process.ProcessPdfToXml: pdftoxml process finished with error code: 99. [/Users/lfoppiano/development/projects/grobid/grobid-home/pdf2xml/mac-64/pdfalto_server, -noImageInline, -fullFontName, -noLineNumbers, -noImage, -annotation, -filesLimit, 2000, /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/origin339189738859558337.pdf, /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/ZUuPYKJ7pr.lxml]
ERROR [2020-04-20 11:11:33,452] org.grobid.core.process.ProcessPdfToXml: pdftoxml return message: 
pdfalto version 0.3
Usage: pdfalto [options] <PDF-file> [<xml-file>]
  -f <int>                      : first page to convert
  -l <int>                      : last page to convert
  -verbose                      : display pdf attributes
  -noText                       : do not extract textual objects
  -noImage                      : do not extract Images (Bitmap and Vectorial)
  -noImageInline                : do not include images inline in the stream
  -outline                      : create an outline file xml
  -annotation                   : create an annotations file xml
  -blocks                       : add blocks informations within the structure
  -readingOrder                 : blocks follow the reading order
  -charReadingOrderAttr         : include TYPE attribute to String elements to indicate right-to-left reading order (not valid ALTO)
  -fullFontName                 : fonts names are not normalized
  -nsURI <string>               : add the specified namespace URI
  -opw <string>                 : owner password (for encrypted files)
  -upw <string>                 : user password (for encrypted files)
  -q                            : don't print any messages or errors
  -v                            : print version info
  -h                            : print usage information
  -help                         : print usage information
  --help                        : print usage information
  -?                            : print usage information
  --saveconf <string>           : save all command line parameters in the specified XML <file>
  -conf <string>                : configuration file to use in place of xpdfrc
  -filesLimit <int>             : limit of asset files be extracted

ERROR [2020-04-20 11:11:33,456] org.grobid.service.process.GrobidRestProcessFiles: An unexpected exception occurs. 
! org.grobid.core.exceptions.GrobidException: [BAD_INPUT_DATA] PDF to XML conversion failed with error code: 99

Am I doing everything correctly?

@kermitt2
Owner Author

kermitt2 commented Apr 20, 2020

I am not sure my last push for pdfalto went through.
With the line-number branch, pdfalto -h should give you the following:

Usage: pdfalto [options] <PDF-file> [<xml-file>]
  -f <int>                      : first page to convert
  -l <int>                      : last page to convert
  -verbose                      : display pdf attributes
  -noImage                      : do not extract Images (Bitmap and Vectorial)
  -noImageInline                : do not include images inline in the stream
  -outline                      : create an outline file xml
  -annotation                   : create an annotations file xml
  -noLineNumbers                : do not output line numbers added in manuscript-style textual documents
  -readingOrder                 : blocks follow the reading order
  -noText                       : do not extract textual objects (might be useful, but non-valid ALTO)
  -charReadingOrderAttr         : include TYPE attribute to String elements to indicate right-to-left reading order (might be useful, but non-valid ALTO)
  -fullFontName                 : fonts names are not normalized
  -nsURI <string>               : add the specified namespace URI
  -opw <string>                 : owner password (for encrypted files)
  -upw <string>                 : user password (for encrypted files)
  -filesLimit <int>             : limit of asset files be extracted
  -q                            : don't print any messages or errors
  -v                            : print version info
  -h                            : print usage information
  -help                         : print usage information
  --help                        : print usage information
  -?                            : print usage information

@lfoppiano
Collaborator

OK, some commits had not been pushed. It works fine now.
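
The failure above comes from passing -noLineNumbers to a binary whose usage text does not list that option. A small Python sketch of how one might check this before starting the service is shown below. The function name `supported_options` is hypothetical, and the parsing only assumes the usage layout visible in the two listings above.

```python
import re

def supported_options(usage_text):
    """Collect option names from pdfalto-style usage output."""
    options = set()
    for line in usage_text.splitlines():
        # Option lines look like "  -noImage        : do not extract ...",
        # optionally with an argument placeholder such as "<int>".
        match = re.match(r"\s+(-{1,2}[\w?]+)(?:\s+<\w+>)?\s+:", line)
        if match:
            options.add(match.group(1))
    return options

old_usage = """Usage: pdfalto [options] <PDF-file> [<xml-file>]
  -f <int>                      : first page to convert
  -noImage                      : do not extract Images (Bitmap and Vectorial)
  -filesLimit <int>             : limit of asset files be extracted
"""
# The stale binary does not know the new flag.
print("-noLineNumbers" in supported_options(old_usage))
```

The usage text would have to be captured by running the binary with -h. Whether pdfalto writes it to standard output or standard error is not stated here, so capturing both is the safer assumption.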

@kermitt2 kermitt2 added this to the 0.6.1 milestone Apr 21, 2020
@kermitt2 kermitt2 changed the title Line number support [WIP] [WIP] Line number support Apr 21, 2020
@coveralls

coveralls commented Apr 21, 2020

Coverage increased (+0.03%) to 38.291% when pulling d63b914 on line_number_support into 4bb6fb6 on master.

Collaborator

@lfoppiano lfoppiano left a comment

I tested a few articles and checked the code, and it looks good to me. It is not so easy to find articles with line numbers. To compare properly, I locally merged this branch with the latest master (f47cc10).

Here are some examples:

@kermitt2
Owner Author

@lfoppiano thanks for the tests, Luca! The links for the first PDF seem to have some problems: the PDF link points to the same PDF as the second example, and the result link points to the results of the third example.

For the second example, I tested with this branch, and I see valid affiliations, with no discrepancies from master:

                       <author>
                            <persName
                                xmlns="http://www.tei-c.org/ns/1.0">
                                <forename type="first">Maria</forename>
                                <surname>Vasilarou</surname>
                            </persName>
                            <affiliation key="aff0">
                                <orgName type="department" key="dep1">Institute of Molecular Biology and Biotechnology (IMBB)</orgName>
                                <orgName type="department" key="dep2">Foundation for Research and Technology Hellas (FORTH)</orgName>
                            </affiliation>
                            <affiliation key="aff1">
                                <orgName type="department">Department of Biology</orgName>
                                <orgName type="institution">University of Crete</orgName>
                                <address>
                                    <settlement>Crete</settlement>
                                    <country key="GR">Greece</country>
                                </address>
                            </affiliation>
                        </author>
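
To compare author/affiliation linking between two builds more systematically than by eye, something like the following Python sketch could be used. It is only an illustration: the function names are hypothetical, and it assumes each output can be parsed as a TEI fragment like the one above.

```python
import xml.etree.ElementTree as ET

def local(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]

def author_affiliations(tei_xml):
    """Map 'Forename Surname' to the list of affiliation keys found for that author."""
    root = ET.fromstring(tei_xml)
    result = {}
    for author in root.iter():
        if local(author.tag) != "author":
            continue
        # Name parts in document order (forename, then surname in the sample).
        name_parts = [el.text.strip() for el in author.iter()
                      if local(el.tag) in ("forename", "surname") and el.text]
        keys = [el.get("key") for el in author.iter()
                if local(el.tag) == "affiliation"]
        result[" ".join(name_parts)] = keys
    return result
```

Calling `author_affiliations` on the output of master and on the output of this branch, then comparing the two dictionaries, would show which authors gained or lost affiliation links.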

@lfoppiano
Collaborator

lfoppiano commented Apr 27, 2020

> @lfoppiano thanks for the tests, Luca! The links for the first PDF seem to have some problems: the PDF link points to the same PDF as the second example, and the result link points to the results of the third example.

Sorry, I must have made a mess when copy-pasting the various links.

I'll try to correct them now.

> Here are some examples:
>
> * https://www.biorxiv.org/content/10.1101/2020.04.21.054221v1

The link is wrong, I meant this:

> The numbered lines are gone, however, there are mismatches in the affiliation, the master version is able to link some of the affiliations with authors, while the master version is able to link some of them. See results [here](https://github.com/kermitt2/grobid/files/4526121/2020.04.21.054429v1.full.pdf.results.zip)

The explanation is correct, but the link is indeed wrong. This should be the correct zip file:
2020.04.21.054122v1.full.pdf.results.zip

> For the second example, I tested with this branch, and I see valid affiliations, with no discrepancies from master:
>
> Could you provide the git ref from the commits that you used to compare?

I compared master at ref f47cc10 with a custom merge of master (f47cc10) into line_number_support at ref 9aaa39a (the pdfalto binary was built from ref 43343b7).

@lfoppiano
Collaborator

lfoppiano commented May 8, 2020

I've checked the differences between the mac and linux pdfalto outputs, and here is an example:

image

I have no clue where these differences come from, but it seems that the mac build leaves out some characters too...

MullenJSSv18i03.linux.xml.zip
MullenJSSv18i03.mac.xml.zip
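
For this kind of platform comparison, a token-level diff of the two ALTO files can be easier to read than a raw text diff. The Python sketch below assumes only that the text is carried in the CONTENT attribute of ALTO String elements. The function names are hypothetical, and the attached files were not inspected to write it.

```python
import difflib
import xml.etree.ElementTree as ET

def alto_tokens(alto_xml):
    """Return the CONTENT of every ALTO String element, in document order."""
    root = ET.fromstring(alto_xml)
    return [el.get("CONTENT", "") for el in root.iter()
            if el.tag.rsplit("}", 1)[-1] == "String"]

def token_diff(xml_a, xml_b):
    """List (op, tokens_a, tokens_b) for every span where the two outputs differ."""
    a, b = alto_tokens(xml_a), alto_tokens(xml_b)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [(op, a[i1:i2], b[j1:j2])
            for op, i1, i2, j1, j2 in matcher.get_opcodes() if op != "equal"]
```

Passing the bytes read from the linux and mac XML files to `token_diff` would list the dropped or changed tokens without the noise of coordinate differences.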

The grobid output, however, does not seem to have such discrepancies. Most of the differences are in formulas and tables, and there are very few of them.

image

Here is the grobid output for mac and linux:

Archive.zip

@lfoppiano
Collaborator

I'm adding another test: https://iopscience.iop.org/article/10.1088/1361-6668/aabddb/pdf (I used the submitted manuscript, which contains line numbers). I'm attaching both results (with and without line number support).

It seems to improve the results in general; however, I see many differences in the citation extraction part:

Archive 2.zip

@lfoppiano lfoppiano linked an issue Jun 23, 2020 that may be closed by this pull request
@kermitt2 kermitt2 merged commit 933837b into master Aug 11, 2020
Successfully merging this pull request may close these issues.

Fail to parse nicely documents containing row numbers at their margin