Merge ff0bd69 into ec78709

kermitt2 committed Aug 24, 2020
2 parents ec78709 + ff0bd69 commit 7283c69
Showing 84 changed files with 4,610 additions and 1,053 deletions.
2 changes: 2 additions & 0 deletions build.gradle
@@ -246,6 +246,8 @@ project("grobid-core") {
implementation "joda-time:joda-time:2.9.9"
implementation "org.apache.lucene:lucene-analyzers-common:4.5.1"
implementation 'black.ninia:jep:3.8.2'
implementation 'org.apache.opennlp:opennlp-tools:1.9.1'
implementation group: 'org.jruby', name: 'jruby-complete', version: '9.2.13.0'

shadedLib "org.apache.lucene:lucene-analyzers-common:4.5.1"
}
9 changes: 5 additions & 4 deletions doc/Coordinates-in-PDF.md
@@ -4,13 +4,14 @@

### Limitations

Since April 2017 (GROBID version 0.4.2 and higher), coordinate areas can be obtained for the following document substructures:

* ```ref``` for bibliographical, figure, table and formula reference markers - for example (_Toto and al. 1999_), see _Fig. 1_, as shown by _formula (245)_, etc.,
* ```biblStruct``` for a bibliographical reference,
* ```persName``` for a complete author name,
* ```figure``` for figure AND table,
* ```formula``` for mathematical equations,
* ```s``` for optional sentence structure.

However, there is in principle no limitation on the types of structures that can carry their coordinates in the results. The implementation is on-going (see [issue #69](https://github.com/kermitt2/grobid/issues/69)), and it is expected that eventually more or less any structure could be associated with its coordinates in the original PDF.
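For instance, with the REST service running locally, coordinates for several of these elements can be requested at once by repeating the `teiCoordinates` parameter (a sketch, assuming the default `localhost:8070` endpoint):

```console
curl -v --form input=@./thefile.pdf --form teiCoordinates=persName --form teiCoordinates=biblStruct localhost:8070/api/processFulltextDocument
```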

2 changes: 2 additions & 0 deletions doc/Grobid-batch.md
@@ -60,6 +60,8 @@ WARNING: the expected extension of the PDF files to be processed is .pdf

* -teiCoordinates: output a subset of the identified structures with coordinates in the original PDF, by default no coordinates are present

* -segmentSentences: add sentence segmentation level structures for paragraphs in the TEI XML result; by default no sentence segmentation is done

Example:
```bash
> java -Xmx4G -jar grobid-core/build/libs/grobid-core-0.6.1-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processFullText
```
9 changes: 9 additions & 0 deletions doc/Grobid-service.md
@@ -158,6 +158,7 @@ Convert the complete input document into TEI XML format (header, body and biblio
| | | | `includeRawCitations` | optional | `includeRawCitations` is a boolean value, `0` (default, do not include raw reference string in the result) or `1` (include raw reference string in the result). |
| | | | `includeRawAffiliations` | optional | `includeRawAffiliations` is a boolean value, `0` (default, do not include raw affiliation string in the result) or `1` (include raw affiliation string in the result). |
| | | | `teiCoordinates` | optional | list of element names for which coordinates in the PDF document have to be added, see [Coordinates of structures in the original PDF](Coordinates-in-PDF.md) for more details |
| | | | `segmentSentences` | optional | Paragraph structures in the resulting TEI will be further segmented into sentence elements `<s>` |

Response status codes:

@@ -171,6 +172,8 @@ Response status codes:

A `503` error with the default parallel mode normally means that all the threads available to GROBID are currently in use. The client needs to re-send the query after a wait time that allows the server to free some threads. The appropriate wait time depends on the service and the capacities of the server; we suggest 5-10 seconds for the `processFulltextDocument` service.
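Client-side, this retry-after-wait behaviour can be sketched as follows (illustrative only; the `sendWithRetry` helper, the retry cap and the abstraction of the request as an `IntSupplier` returning a status code are assumptions, not part of the GROBID API):

```java
import java.util.function.IntSupplier;

class GrobidRetry {
    // Retry a request while the server answers 503 (all GROBID threads busy),
    // waiting a fixed delay between attempts. The request is abstracted as an
    // IntSupplier returning the HTTP status code.
    static int sendWithRetry(IntSupplier request, int maxRetries, long waitMillis) {
        int status = request.getAsInt();
        for (int attempt = 0; status == 503 && attempt < maxRetries; attempt++) {
            try {
                Thread.sleep(waitMillis); // 5-10 seconds suggested for processFulltextDocument
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
            status = request.getAsInt();
        }
        return status;
    }
}
```

In practice the supplier would issue the HTTP request (e.g. via `java.net.http.HttpClient`) and return the response status code.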

The optional sentence segmentation in the TEI XML result is based on the algorithm selected in the Grobid property file (under `grobid-home/config/grobid.properties`). As of August 2020, available segmenters are the [Pragmatic_Segmenter](https://github.com/diasks2/pragmatic_segmenter) (recommended) and [OpenNLP sentence detector](https://opennlp.apache.org/docs/1.5.3/manual/opennlp.html#tools.sentdetect).
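For illustration of what a sentence segmenter does, here is a deliberately naive splitter (NOT what GROBID uses; the actual segmenters are the two listed above, which additionally handle abbreviations, initials, citations, etc.):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NaiveSentenceSplitter {
    // Split on sentence-final punctuation followed by whitespace or end of string.
    // A real segmenter also copes with "e.g.", "Fig. 1", author initials, etc.,
    // which this sketch would split incorrectly.
    static List<String> split(String paragraph) {
        List<String> sentences = new ArrayList<>();
        Matcher m = Pattern.compile("[^.!?]+[.!?]+(\\s+|$)").matcher(paragraph);
        while (m.find()) {
            sentences.add(m.group().trim());
        }
        return sentences;
    }
}
```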

You can test this service with the **cURL** command lines, for instance fulltext extraction (header, body and citations) from a PDF file in the current directory:

```console
curl -v --form input=@./thefile.pdf localhost:8070/api/processFulltextDocument
```

@@ -195,6 +198,12 @@ Regarding the bibliographical references, it is possible to include the original raw reference string in the result:

```console
curl -v --form input=@./thefile.pdf --form includeRawCitations=1 localhost:8070/api/processFulltextDocument
```

Example with the additional sentence segmentation of paragraphs requested, together with bounding box coordinates of the sentence structures:

```console
curl -v --form input=@./thefile.pdf --form segmentSentences=1 --form teiCoordinates=s localhost:8070/api/processFulltextDocument
```
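With `segmentSentences=1` and `teiCoordinates=s`, the paragraphs of the resulting TEI are segmented into `<s>` elements carrying bounding box coordinates. The fragment below only illustrates the shape of the output; the `coords` values are invented:

```xml
<p>
  <s coords="1,53.4,123.0,210.5,9.8">First sentence of the paragraph.</s>
  <s coords="1,53.4,135.2,198.7,9.8">Second sentence.</s>
</p>
```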

#### /api/processReferences

Extract and convert all the bibliographical references present in the input document into TEI XML or [BibTeX] format.
17 changes: 17 additions & 0 deletions grobid-core/src/main/java/org/grobid/core/data/Figure.java
@@ -336,6 +336,23 @@ public String toTEI(GrobidAnalysisConfig config, Document doc, TEIFormatter form
//Element desc = XmlBuilderUtils.teiElement("figDesc",
// LayoutTokensUtil.normalizeText(caption.toString()));
}

if (desc != null && config.isWithSentenceSegmentation()) {
formatter.segmentIntoSentences(desc, this.captionLayoutTokens, config);

// to sentence-segment the figure caption, we turn the description into a <p>
// and wrap it in a <div> inside a <figDesc>
desc.setLocalName("p");

Element div = XmlBuilderUtils.teiElement("div");
div.appendChild(desc);

Element figDesc = XmlBuilderUtils.teiElement("figDesc");
figDesc.appendChild(div);

desc = figDesc;
}

figureElement.appendChild(desc);
}
if ((graphicObjects != null) && (graphicObjects.size() > 0)) {
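After this re-wrapping, a sentence-segmented caption has roughly the following TEI shape (content invented for illustration):

```xml
<figDesc>
  <div>
    <p>
      <s>First sentence of the caption.</s>
      <s>Second sentence.</s>
    </p>
  </div>
</figDesc>
```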
16 changes: 16 additions & 0 deletions grobid-core/src/main/java/org/grobid/core/data/Table.java
@@ -130,6 +130,22 @@ public String toTEI(GrobidAnalysisConfig config, Document doc, TEIFormatter form
} else {
desc.appendChild(textNode(clusterContent));
}

if (desc != null && config.isWithSentenceSegmentation()) {
formatter.segmentIntoSentences(desc, this.captionLayoutTokens, config);

// to sentence-segment the table caption, we turn the description into a <p>
// and wrap it in a <div> inside a <figDesc>
desc.setLocalName("p");

Element div = XmlBuilderUtils.teiElement("div");
div.appendChild(desc);

Element figDesc = XmlBuilderUtils.teiElement("figDesc");
figDesc.appendChild(div);

desc = figDesc;
}
}
} else {
desc.appendChild(LayoutTokensUtil.normalizeText(caption.toString()).trim());
@@ -11,7 +11,6 @@
import org.grobid.core.layout.Block;
import org.grobid.core.layout.Cluster;
import org.grobid.core.layout.LayoutToken;
//import org.grobid.core.utilities.Pair;
import org.grobid.core.utilities.TextUtilities;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
@@ -75,7 +74,7 @@ public class BasicStructureBuilder {
* @param doc a document
* @return if found numbering
*/
public boolean filterLineNumber(Document doc) {
/*public boolean filterLineNumber(Document doc) {
// we first test if we have a line numbering by checking if we have an increasing integer
// at the begin or the end of each block
boolean numberBeginLine = false;
@@ -182,7 +181,7 @@ public boolean filterLineNumber(Document doc) {
}
return foundNumbering;
}
}*/

/**
* Cluster the blocks following the font, style and size aspects
244 changes: 0 additions & 244 deletions grobid-core/src/main/java/org/grobid/core/document/Document.java
@@ -586,250 +586,6 @@ protected static int getCoordItem(ElementCounter<Integer> cnt, boolean getMin) {
return res;
}

/**
* Identify the header parts with heuristics
*/
public SortedSet<DocumentPiece> getDocumentPartsWithHeuristics() {
String theHeader = getHeader();
if ((theHeader == null) || (theHeader.trim().length() <= 1)) {
theHeader = getHeaderLastHope();
}

SortedSet<DocumentPiece> theDocumentParts = new TreeSet<DocumentPiece>();

DocumentPointer start = null;
DocumentPointer end = null;
Integer previousBlocknum = null;
for (Integer blocknum : blockDocumentHeaders) {
Block block = blocks.get(blocknum);

if (previousBlocknum != null && blocknum != previousBlocknum+1) {
// new piece
DocumentPiece thePiece = new DocumentPiece(start, end);
theDocumentParts.add(thePiece);
start = null;
}

if (start == null) {
start = new DocumentPointer(this, blocknum, block.getStartToken());
end = new DocumentPointer(this, blocknum, block.getEndToken());
} else {
// the block is adjacent to the previous one, so we just extend the end pointer
end = new DocumentPointer(this, blocknum, block.getEndToken());
}

previousBlocknum = blocknum;
}

// add last piece
if (start != null && end != null) {
DocumentPiece thePiece = new DocumentPiece(start, end);
theDocumentParts.add(thePiece);
}

return theDocumentParts;
}

/**
* heuristics to get the header section...
* -> it is now covered by the CRF segmentation model
*/
@Deprecated
public String getHeader() {
//if (firstPass)
//BasicStructureBuilder.firstPass(this);

// try first to find the introduction in a safe way
String tmpRes = getHeaderByIntroduction();
if (tmpRes != null) {
if (tmpRes.trim().length() > 0) {
return tmpRes;
}
}

// we apply a heuristics based on the size of first blocks
String res = null;
beginBody = -1;
StringBuilder accumulated = new StringBuilder();
int i = 0;
int nbLarge = 0;
boolean abstractCandidate = false;
for (Block block : blocks) {
String localText = block.getText();
if ((localText == null) || (localText.startsWith("@"))) {
accumulated.append("\n");
continue;
}
localText = localText.trim();
localText = localText.replace(" ", " ");

Matcher ma0 = BasicStructureBuilder.abstract_.matcher(localText);
if ((block.getNbTokens() > 60) || (ma0.find())) {
if (!abstractCandidate) {
// first large block, it should be the abstract
abstractCandidate = true;
} else if (beginBody == -1) {
// second large block, it should be the first paragraph of
// the body
beginBody = i;
for (int j = 0; j <= i + 1; j++) {
Integer inte = j;
if (blockDocumentHeaders == null)
blockDocumentHeaders = new ArrayList<Integer>();
if (!blockDocumentHeaders.contains(inte))
blockDocumentHeaders.add(inte);
}
res = accumulated.toString();
nbLarge = 1;
} else if (block.getNbTokens() > 60) {
nbLarge++;
if (nbLarge > 5) {
return res;
}
}
} else {
Matcher m = BasicStructureBuilder.introduction
.matcher(localText);
if (abstractCandidate) {
if (m.find()) {
// we clearly found the begining of the body
beginBody = i;
for (int j = 0; j <= i; j++) {
Integer inte = j;
if (blockDocumentHeaders == null)
blockDocumentHeaders = new ArrayList<Integer>();
if (!blockDocumentHeaders.contains(inte)) {
blockDocumentHeaders.add(inte);
}
}
return accumulated.toString();
} else if (beginBody != -1) {
if (localText.startsWith("(1|I|A)\\.\\s")) {
beginBody = i;
for (int j = 0; j <= i; j++) {
Integer inte = j;
if (blockDocumentHeaders == null)
blockDocumentHeaders = new ArrayList<Integer>();
if (!blockDocumentHeaders.contains(inte))
blockDocumentHeaders.add(inte);
}
return accumulated.toString();
}
}
} else {
if (m.find()) {
// we clearly found the begining of the body with the
// introduction section
beginBody = i;
for (int j = 0; j <= i; j++) {
Integer inte = j;
if (blockDocumentHeaders == null)
blockDocumentHeaders = new ArrayList<Integer>();
if (!blockDocumentHeaders.contains(inte))
blockDocumentHeaders.add(inte);
}
res = accumulated.toString();
}
}
}

if ((i > 6) && (i > (blocks.size() * 0.6))) {
if (beginBody != -1) {
return res;
} else
return null;
}

accumulated.append(localText).append("\n");
i++;
}

return res;
}

/**
* We return the first page as header estimation... better than nothing when
* nothing is not acceptable.
* <p/>
* -> now covered by the CRF segmentation model
*/
@Deprecated
public String getHeaderLastHope() {
String res;
StringBuilder accumulated = new StringBuilder();
int i = 0;
if ((pages == null) || (pages.size() == 0)) {
return null;
}
for (Page page : pages) {
if ((page.getBlocks() == null) || (page.getBlocks().size() == 0))
continue;
for (Block block : page.getBlocks()) {
String localText = block.getText();
if ((localText == null) || (localText.startsWith("@"))) {
accumulated.append("\n");
continue;
}
localText = localText.trim();
localText = localText.replace(" ", " ");
accumulated.append(localText);
Integer inte = Integer.valueOf(i);
if (blockDocumentHeaders == null)
blockDocumentHeaders = new ArrayList<Integer>();
if (!blockDocumentHeaders.contains(inte))
blockDocumentHeaders.add(inte);
i++;
}
beginBody = i;
break;
}

return accumulated.toString();
}

/**
* We try to match the introduction section in a safe way, and consider if
* minimum requirements are met the blocks before this position as header.
* <p/>
* -> now covered by the CRF segmentation model
*/
@Deprecated
public String getHeaderByIntroduction() {
String res;
StringBuilder accumulated = new StringBuilder();
int i = 0;
for (Block block : blocks) {
String localText = block.getText();
if ((localText == null) || (localText.startsWith("@"))) {
accumulated.append("\n");
continue;
}
localText = localText.trim();

Matcher m = BasicStructureBuilder.introductionStrict
.matcher(localText);
if (m.find()) {
accumulated.append(localText);
beginBody = i;
for (int j = 0; j < i + 1; j++) {
Integer inte = j;
if (blockDocumentHeaders == null)
blockDocumentHeaders = new ArrayList<Integer>();
if (!blockDocumentHeaders.contains(inte))
blockDocumentHeaders.add(inte);
}
res = accumulated.toString();

return res;
}

accumulated.append(localText);
i++;
}

return null;
}

/**
* Return all blocks without markers.
* <p/>
