[WIP] Line number support #572

kermitt2 · 2020-04-20T03:33:13Z

This is a generic support for preprints/manuscripts with line numbers, via pdfalto.

pdfalto identifies line numbers via x/y clustering plus some additional constraints. The -noLineNumbers parameter of pdfalto (branch line-number) controls whether line numbers are kept in the output.

This PR supports line numbers on the right or on the left, any increment, partial line numbers in a page, and documents mixing pages with and without line numbers. It also has some mechanisms to avoid false positives (in particular for bibliographical reference sections and tables with vertically aligned numbers).
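To illustrate the general idea of clustering plus constraints, here is a minimal Python sketch. It is not pdfalto's implementation: the `Token` type, the `find_line_numbers` helper, the tolerance and the minimum run length are all assumptions made for this example. It groups purely numeric tokens into vertical columns by x position and keeps a column only if its values grow by one constant positive increment down the page, which is one simple way to reject table columns with arbitrary numbers.

```python
from collections import defaultdict
from typing import List, NamedTuple, Set


class Token(NamedTuple):
    text: str
    x: float  # horizontal anchor of the token (hypothetical layout unit)
    y: float  # vertical position, growing down the page


def find_line_numbers(tokens: List[Token], x_tol: float = 2.0, min_run: int = 5) -> Set[int]:
    """Return indices of tokens that look like margin line numbers."""
    # Group purely numeric tokens into vertical columns by bucketing x.
    # Bucketing is crude (values near a bucket edge can be split), but it
    # is enough to show the principle.
    columns = defaultdict(list)
    for idx, tok in enumerate(tokens):
        if tok.text.isdigit():
            columns[round(tok.x / x_tol)].append(idx)

    found: Set[int] = set()
    for members in columns.values():
        if len(members) < min_run:
            continue  # too few aligned numbers to be a numbering column
        members.sort(key=lambda i: tokens[i].y)
        values = [int(tokens[i].text) for i in members]
        steps = {b - a for a, b in zip(values, values[1:])}
        # Accept only a single positive, constant increment down the page.
        if len(steps) == 1 and next(iter(steps)) > 0:
            found.update(members)
    return found
```

Because the check only requires a constant increment, it accepts numbering by 1 as well as by 5 or 10, and it works the same for a left or right margin. A real detector needs more constraints than this, for instance to handle numbered reference lists.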

No regression with PMC 1942 set.

This is ready to be tested on Linux. pdfalto will need to be compiled for other platforms before merging.

Feedback and remaining error cases are welcome, for further refinement if necessary.

lfoppiano · 2020-04-20T11:15:09Z

I've started testing it.

I first built the latest version of pdfalto from the line-number branch. NOTE: I used the HEAD of xpdf-4.0

Then I copied the binary into grobid-home and started the service to test some files. There seems to be a problem between grobid and the new version of pdfalto.

ERROR [2020-04-20 11:11:33,451] org.grobid.core.process.ProcessPdfToXml: pdftoxml process finished with error code: 99. [/Users/lfoppiano/development/projects/grobid/grobid-home/pdf2xml/mac-64/pdfalto_server, -noImageInline, -fullFontName, -noLineNumbers, -noImage, -annotation, -filesLimit, 2000, /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/origin339189738859558337.pdf, /Users/lfoppiano/development/projects/grobid/grobid-home/tmp/ZUuPYKJ7pr.lxml]
ERROR [2020-04-20 11:11:33,452] org.grobid.core.process.ProcessPdfToXml: pdftoxml return message: 
pdfalto version 0.3
Usage: pdfalto [options] <PDF-file> [<xml-file>]
  -f <int>                      : first page to convert
  -l <int>                      : last page to convert
  -verbose                      : display pdf attributes
  -noText                       : do not extract textual objects
  -noImage                      : do not extract Images (Bitmap and Vectorial)
  -noImageInline                : do not include images inline in the stream
  -outline                      : create an outline file xml
  -annotation                   : create an annotations file xml
  -blocks                       : add blocks informations within the structure
  -readingOrder                 : blocks follow the reading order
  -charReadingOrderAttr         : include TYPE attribute to String elements to indicate right-to-left reading order (not valid ALTO)
  -fullFontName                 : fonts names are not normalized
  -nsURI <string>               : add the specified namespace URI
  -opw <string>                 : owner password (for encrypted files)
  -upw <string>                 : user password (for encrypted files)
  -q                            : don't print any messages or errors
  -v                            : print version info
  -h                            : print usage information
  -help                         : print usage information
  --help                        : print usage information
  -?                            : print usage information
  --saveconf <string>           : save all command line parameters in the specified XML <file>
  -conf <string>                : configuration file to use in place of xpdfrc
  -filesLimit <int>             : limit of asset files be extracted

ERROR [2020-04-20 11:11:33,456] org.grobid.service.process.GrobidRestProcessFiles: An unexpected exception occurs. 
! org.grobid.core.exceptions.GrobidException: [BAD_INPUT_DATA] PDF to XML conversion failed with error code: 99

Am I doing everything correctly?
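In the log above, the binary prints its usage and exits with code 99, and the printed usage has no -noLineNumbers entry even though that flag is passed. A small Python sketch of how such a mismatch could be spotted automatically follows. The helper names `supported_flags` and `unsupported_args` are hypothetical, and nothing here is part of grobid or pdfalto; it only parses usage text of the shape shown in this thread.

```python
import re
from typing import List, Set


def supported_flags(usage_text: str) -> Set[str]:
    """Collect option names from lines shaped like '  -flag <arg>   : description'."""
    flags = set()
    for line in usage_text.splitlines():
        # Option lines are indented; the 'Usage:' and version lines are not.
        match = re.match(r"\s+(-\S+)\s.*\s:\s", line)
        if match:
            flags.add(match.group(1))
    return flags


def unsupported_args(args: List[str], usage_text: str) -> List[str]:
    """Return the dash-prefixed arguments that the usage text does not list."""
    known = supported_flags(usage_text)
    return [arg for arg in args if arg.startswith("-") and arg not in known]
```

The usage text could be captured by running the binary with -h, which both usage listings in this thread include. Values such as 2000 and the file paths are skipped because they do not start with a dash.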

kermitt2 · 2020-04-20T20:45:01Z

I am not sure my last push for pdfalto was completed.
The pdfalto line-number branch -h should give you the following:

Usage: pdfalto [options] <PDF-file> [<xml-file>]
  -f <int>                      : first page to convert
  -l <int>                      : last page to convert
  -verbose                      : display pdf attributes
  -noImage                      : do not extract Images (Bitmap and Vectorial)
  -noImageInline                : do not include images inline in the stream
  -outline                      : create an outline file xml
  -annotation                   : create an annotations file xml
  -noLineNumbers                : do not output line numbers added in manuscript-style textual documents
  -readingOrder                 : blocks follow the reading order
  -noText                       : do not extract textual objects (might be useful, but non-valid ALTO)
  -charReadingOrderAttr         : include TYPE attribute to String elements to indicate right-to-left reading order (might be useful, but non-valid ALTO)
  -fullFontName                 : fonts names are not normalized
  -nsURI <string>               : add the specified namespace URI
  -opw <string>                 : owner password (for encrypted files)
  -upw <string>                 : user password (for encrypted files)
  -filesLimit <int>             : limit of asset files be extracted
  -q                            : don't print any messages or errors
  -v                            : print version info
  -h                            : print usage information
  -help                         : print usage information
  --help                        : print usage information
  -?                            : print usage information

lfoppiano · 2020-04-21T01:51:24Z

OK, some commits had not been pushed. It works fine now.

coveralls · 2020-04-21T03:43:10Z

Coverage increased (+0.03%) to 38.291% when pulling d63b914 on line_number_support into 4bb6fb6 on master.

lfoppiano

I tested a few articles and checked the code, and it looks good to me. It is not so easy to find articles with line numbers. To compare properly, I merged this branch locally with the latest master (f47cc10).

Here are some examples:

https://www.biorxiv.org/content/10.1101/2020.04.21.054221v1
The numbered lines are gone, however, there are mismatches in the affiliation, the master version is able to link some of the affiliations with authors, while the master version is able to link some of them. See results here
https://www.biorxiv.org/content/10.1101/2020.04.21.054221v1
There are no line numbers; I tested it only because the affiliation list might use numbers as identifiers. There are two discrepancies with master: the first affiliation, and a table (the supplementary table) mistakenly extracted as a citation. See results here
https://www.biorxiv.org/content/10.1101/2020.04.21.054429v1
Just one affiliation extracted wrongly. Not sure it's relevant. See results here

kermitt2 · 2020-04-24T14:40:23Z

@lfoppiano thanks for the test Luca! The links for the first PDF have some problems? The pdf link is the same PDF as the second example, and the result points to the results of the third example.

For the second example, I tested with this branch, but I see the valid affiliations without discrepancies with the master:

                       <author>
                            <persName
                                xmlns="http://www.tei-c.org/ns/1.0">
                                <forename type="first">Maria</forename>
                                <surname>Vasilarou</surname>
                            </persName>
                            <affiliation key="aff0">
                                <orgName type="department" key="dep1">Institute of Molecular Biology and Biotechnology (IMBB)</orgName>
                                <orgName type="department" key="dep2">Foundation for Research and Technology Hellas (FORTH)</orgName>
                            </affiliation>
                            <affiliation key="aff1">
                                <orgName type="department">Department of Biology</orgName>
                                <orgName type="institution">University of Crete</orgName>
                                <address>
                                    <settlement>Crete</settlement>
                                    <country key="GR">Greece</country>
                                </address>
                            </affiliation>
                        </author>

lfoppiano · 2020-04-27T01:02:39Z

> @lfoppiano thanks for the test Luca! The links for the first PDF have some problems? The pdf link is the same PDF as the second example, and the result points to the results of the third example.

Sorry. I must have made a mess with the copy-paste of the various links.

I'll try to correct them now.

> Here some examples:
>
> * https://www.biorxiv.org/content/10.1101/2020.04.21.054221v1

The link is wrong, I meant this:

https://www.biorxiv.org/content/10.1101/2020.04.21.054122v1.full

> The numbered lines are gone, however, there are mismatches in the affiliation, the master version is able to link some of the affiliations with authors, while the master version is able to link some of them. See results [here](https://github.com/kermitt2/grobid/files/4526121/2020.04.21.054429v1.full.pdf.results.zip)

The explanation is correct, but the link is indeed wrong. This should be the correct zip file:
2020.04.21.054122v1.full.pdf.results.zip

> For the second example, I tested with this branch, but I see the valid affiliations without discrepancies with the master:

Could you provide the git refs of the commits that you used for the comparison?

I compared master at ref f47cc10 with a custom merge of master (f47cc10) into line_number_support at ref 9aaa39a (the built pdfalto binary was based on ref 43343b7).

lfoppiano · 2020-05-08T11:49:18Z

I've checked the differences between mac and linux, and here is an example:

I have no clue where these come from, but it seems that mac leaves out some characters too...

MullenJSSv18i03.linux.xml.zip
MullenJSSv18i03.mac.xml.zip
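One way to locate such discrepancies, assuming the two attachments are pdfalto ALTO files, is to diff the sequence of String CONTENT values rather than the raw XML. The Python sketch below does that with the standard library; the function names `alto_words` and `word_diff` are made up for this example, and the sketch says nothing about where the differences originate.

```python
import difflib
import xml.etree.ElementTree as ET
from typing import List, Tuple


def alto_words(xml_text: str) -> List[str]:
    """Return the CONTENT of every String element, ignoring the XML namespace."""
    root = ET.fromstring(xml_text)
    return [
        el.get("CONTENT", "")
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "String"
    ]


def word_diff(a_xml: str, b_xml: str) -> List[Tuple[str, List[str], List[str]]]:
    """List the non-matching word runs between two ALTO documents."""
    a, b = alto_words(a_xml), alto_words(b_xml)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

Each returned tuple holds the operation (replace, delete or insert) with the words on either side, so dropped characters show up as replace or delete entries instead of being buried in coordinate changes.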

The grobid output, however, does not seem to have such discrepancies. Most of the differences are in formulas and tables, but they are very few.

Here is the grobid output for mac and linux:

Archive.zip

lfoppiano · 2020-05-12T00:02:57Z

I'm adding another test: https://iopscience.iop.org/article/10.1088/1361-6668/aabddb/pdf (I used the submitted manuscript, which contains line numbers). I'm attaching both results (with and without line number support).

It seems to improve in general; however, I see many differences in the citation extraction part:

Archive 2.zip

kermitt2 added 4 commits April 19, 2020 03:24

add line number support via pdfalto

f383c3a

update pdfalto

f7edbf9

Review pdfalto parameters; review training data involving line numbers

b46e17a

update resources ad pdfalto

5c84f7f

kermitt2 requested a review from lfoppiano April 20, 2020 04:33

kermitt2 added this to the 0.6.1 milestone Apr 21, 2020

kermitt2 added the enhancement label Apr 21, 2020

kermitt2 changed the title ~~Line number support [WIP]~~ [WIP] Line number support Apr 21, 2020

kermitt2 mentioned this pull request Apr 21, 2020

Supercript feature in fulltext #573

Merged

Adding pdfalto for mac

9aaa39a

lfoppiano approved these changes Apr 24, 2020

kermitt2 added 2 commits April 24, 2020 16:23

Merge branch 'master' into line_number_support

1884155

update pdfalto

d63b914

kermitt2 mentioned this pull request Apr 25, 2020

Recommended way of annotating line numbers #560

Closed

lfoppiano linked an issue Jun 23, 2020 that may be closed by this pull request

Fail to parse nicely documents containing row numbers at their margin #297

Closed

kermitt2 merged commit 933837b into master Aug 11, 2020
