Merge ff0bd69 into ec78709

kermitt2 committed Aug 24, 2020
2 parents ec78709 + ff0bd69 commit 7283c69
Showing 84 changed files with 4,610 additions and 1,053 deletions.
2 changes: 2 additions & 0 deletions build.gradle
@@ -246,6 +246,8 @@ project("grobid-core") {
implementation "joda-time:joda-time:2.9.9"
implementation "org.apache.lucene:lucene-analyzers-common:4.5.1"
implementation 'black.ninia:jep:3.8.2'
implementation 'org.apache.opennlp:opennlp-tools:1.9.1'
implementation group: 'org.jruby', name: 'jruby-complete', version: '9.2.13.0'

shadedLib "org.apache.lucene:lucene-analyzers-common:4.5.1"
}
9 changes: 5 additions & 4 deletions doc/Coordinates-in-PDF.md
@@ -4,13 +4,14 @@

### Limitations

Since April 2017 (GROBID version 0.4.2 and higher), coordinate areas can be obtained for the following document substructures:

* ```ref``` for bibliographical, figure, table and formula reference markers - for example (_Toto and al. 1999_), see _Fig. 1_, as shown by _formula (245)_, etc.,
* ```biblStruct``` for a bibliographical reference,
* ```persName``` for a complete author name,
* ```figure``` for figure AND table,
* ```formula``` for mathematical equations,
* ```s``` for optional sentence structure.

However, there is in principle no limitation on the types of structures that can carry their coordinates in the results. The implementation is on-going (see [issue #69](https://github.com/kermitt2/grobid/issues/69)), and it is expected that eventually more or less any structure could be associated with its coordinates in the original PDF.
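For instance, with the REST service running locally, coordinates for several of these elements can be requested at once by repeating the `teiCoordinates` parameter (a sketch, assuming the default `localhost:8070` endpoint):

```console
curl -v --form input=@./thefile.pdf --form teiCoordinates=persName --form teiCoordinates=biblStruct localhost:8070/api/processFulltextDocument
```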

2 changes: 2 additions & 0 deletions doc/Grobid-batch.md
@@ -60,6 +60,8 @@ WARNING: the expected extension of the PDF files to be processed is .pdf

* -teiCoordinates: output a subset of the identified structures with coordinates in the original PDF, by default no coordinates are present

* -segmentSentences: add sentence segmentation level structures for paragraphs in the TEI XML result; by default no sentence segmentation is done

Example:
```bash
> java -Xmx4G -jar grobid-core/build/libs/grobid-core-0.6.1-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processFullText
```
9 changes: 9 additions & 0 deletions doc/Grobid-service.md
@@ -158,6 +158,7 @@ Convert the complete input document into TEI XML format (header, body and biblio
| | | | `includeRawCitations` | optional | `includeRawCitations` is a boolean value, `0` (default, do not include raw reference string in the result) or `1` (include raw reference string in the result). |
| | | | `includeRawAffiliations` | optional | `includeRawAffiliations` is a boolean value, `0` (default, do not include raw affiliation string in the result) or `1` (include raw affiliation string in the result). |
| | | | `teiCoordinates` | optional | list of element names for which coordinates in the PDF document have to be added, see [Coordinates of structures in the original PDF](Coordinates-in-PDF.md) for more details |
| | | | `segmentSentences` | optional | Paragraph structures in the resulting TEI will be further segmented into sentence elements `<s>` |

Response status codes:

@@ -171,6 +172,8 @@ Response status codes:

A `503` error with the default parallel mode normally means that all the threads available to GROBID are currently in use. The client needs to re-send the query after a wait time that allows the server to free some threads. The appropriate wait time depends on the service and the capacities of the server; we suggest 5-10 seconds for the `processFulltextDocument` service.
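Client-side, this retry-after-wait behaviour can be sketched as follows (illustrative only; the `sendWithRetry` helper, the retry cap and the abstraction of the request as an `IntSupplier` returning a status code are assumptions, not part of the GROBID API):

```java
import java.util.function.IntSupplier;

class GrobidRetry {
    // Retry a request while the server answers 503 (all GROBID threads busy),
    // waiting a fixed delay between attempts. The request is abstracted as an
    // IntSupplier returning the HTTP status code.
    static int sendWithRetry(IntSupplier request, int maxRetries, long waitMillis) {
        int status = request.getAsInt();
        for (int attempt = 0; status == 503 && attempt < maxRetries; attempt++) {
            try {
                Thread.sleep(waitMillis); // 5-10 seconds suggested for processFulltextDocument
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
            status = request.getAsInt();
        }
        return status;
    }
}
```

In practice the supplier would issue the HTTP request (e.g. via `java.net.http.HttpClient`) and return the response status code.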

The optional sentence segmentation in the TEI XML result is based on the algorithm selected in the Grobid property file (under `grobid-home/config/grobid.properties`). As of August 2020, available segmenters are the [Pragmatic_Segmenter](https://github.com/diasks2/pragmatic_segmenter) (recommended) and [OpenNLP sentence detector](https://opennlp.apache.org/docs/1.5.3/manual/opennlp.html#tools.sentdetect).
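For illustration of what a sentence segmenter does, here is a deliberately naive splitter (NOT what GROBID uses; the actual segmenters are the two listed above, which additionally handle abbreviations, initials, citations, etc.):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NaiveSentenceSplitter {
    // Split on sentence-final punctuation followed by whitespace or end of string.
    // A real segmenter also copes with "e.g.", "Fig. 1", author initials, etc.,
    // which this sketch would split incorrectly.
    static List<String> split(String paragraph) {
        List<String> sentences = new ArrayList<>();
        Matcher m = Pattern.compile("[^.!?]+[.!?]+(\\s+|$)").matcher(paragraph);
        while (m.find()) {
            sentences.add(m.group().trim());
        }
        return sentences;
    }
}
```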

You can test this service with the **cURL** command lines, for instance fulltext extraction (header, body and citations) from a PDF file in the current directory:

```console
curl -v --form input=@./thefile.pdf localhost:8070/api/processFulltextDocument
```

@@ -195,6 +198,12 @@ Regarding the bibliographical references, it is possible to include the original raw reference string in the result:

```console
curl -v --form input=@./thefile.pdf --form includeRawCitations=1 localhost:8070/api/processFulltextDocument
```

Example with the additional sentence segmentation of paragraphs requested, together with bounding box coordinates of the sentence structures:

```console
curl -v --form input=@./thefile.pdf --form segmentSentences=1 --form teiCoordinates=s localhost:8070/api/processFulltextDocument
```
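With `segmentSentences=1` and `teiCoordinates=s`, the paragraphs of the resulting TEI are segmented into `<s>` elements carrying bounding box coordinates. The fragment below only illustrates the shape of the output; the `coords` values are invented:

```xml
<p>
  <s coords="1,53.4,123.0,210.5,9.8">First sentence of the paragraph.</s>
  <s coords="1,53.4,135.2,198.7,9.8">Second sentence.</s>
</p>
```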

#### /api/processReferences

Extract and convert all the bibliographical references present in the input document into TEI XML or [BibTeX] format.
17 changes: 17 additions & 0 deletions grobid-core/src/main/java/org/grobid/core/data/Figure.java
@@ -336,6 +336,23 @@ public String toTEI(GrobidAnalysisConfig config, Document doc, TEIFormatter form
//Element desc = XmlBuilderUtils.teiElement("figDesc",
// LayoutTokensUtil.normalizeText(caption.toString()));
}

if (desc != null && config.isWithSentenceSegmentation()) {
formatter.segmentIntoSentences(desc, this.captionLayoutTokens, config);

// to sentence-segment the figure caption, we turn the description into a <p>
// and wrap it in a <div> inside a <figDesc>
desc.setLocalName("p");

Element div = XmlBuilderUtils.teiElement("div");
div.appendChild(desc);

Element figDesc = XmlBuilderUtils.teiElement("figDesc");
figDesc.appendChild(div);

desc = figDesc;
}

figureElement.appendChild(desc);
}
if ((graphicObjects != null) && (graphicObjects.size() > 0)) {
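After this re-wrapping, a sentence-segmented caption has roughly the following TEI shape (content invented for illustration):

```xml
<figDesc>
  <div>
    <p>
      <s>First sentence of the caption.</s>
      <s>Second sentence.</s>
    </p>
  </div>
</figDesc>
```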
16 changes: 16 additions & 0 deletions grobid-core/src/main/java/org/grobid/core/data/Table.java
@@ -130,6 +130,22 @@ public String toTEI(GrobidAnalysisConfig config, Document doc, TEIFormatter form
} else {
desc.appendChild(textNode(clusterContent));
}

if (desc != null && config.isWithSentenceSegmentation()) {
formatter.segmentIntoSentences(desc, this.captionLayoutTokens, config);

// to sentence-segment the table caption, we turn the description into a <p>
// and wrap it in a <div> inside a <figDesc>
desc.setLocalName("p");

Element div = XmlBuilderUtils.teiElement("div");
div.appendChild(desc);

Element figDesc = XmlBuilderUtils.teiElement("figDesc");
figDesc.appendChild(div);

desc = figDesc;
}
}
} else {
desc.appendChild(LayoutTokensUtil.normalizeText(caption.toString()).trim());
@@ -11,7 +11,6 @@
import org.grobid.core.layout.Block;
import org.grobid.core.layout.Cluster;
import org.grobid.core.layout.LayoutToken;
//import org.grobid.core.utilities.Pair;
import org.grobid.core.utilities.TextUtilities;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
@@ -75,7 +74,7 @@ public class BasicStructureBuilder {
* @param doc a document
* @return if found numbering
*/
public boolean filterLineNumber(Document doc) {
/*public boolean filterLineNumber(Document doc) {
// we first test if we have a line numbering by checking if we have an increasing integer
// at the begin or the end of each block
boolean numberBeginLine = false;
@@ -182,7 +181,7 @@ public boolean filterLineNumber(Document doc) {
}
return foundNumbering;
}
}*/

/**
* Cluster the blocks following the font, style and size aspects
244 changes: 0 additions & 244 deletions grobid-core/src/main/java/org/grobid/core/document/Document.java
@@ -586,250 +586,6 @@ protected static int getCoordItem(ElementCounter<Integer> cnt, boolean getMin) {
return res;
}

/**
* Identify the header parts with heuristics
*/
public SortedSet<DocumentPiece> getDocumentPartsWithHeuristics() {
String theHeader = getHeader();
if ((theHeader == null) || (theHeader.trim().length() <= 1)) {
theHeader = getHeaderLastHope();
}

SortedSet<DocumentPiece> theDocumentParts = new TreeSet<DocumentPiece>();

DocumentPointer start = null;
DocumentPointer end = null;
Integer previousBlocknum = null;
for (Integer blocknum : blockDocumentHeaders) {
Block block = blocks.get(blocknum);

if (previousBlocknum != null && blocknum != previousBlocknum+1) {
// new piece
DocumentPiece thePiece = new DocumentPiece(start, end);
theDocumentParts.add(thePiece);
start = null;
}

if (start == null) {
start = new DocumentPointer(this, blocknum, block.getStartToken());
end = new DocumentPointer(this, blocknum, block.getEndToken());
} else {
// the block is adjacent to the previous one, so we just extend the end pointer
end = new DocumentPointer(this, blocknum, block.getEndToken());
}

previousBlocknum = blocknum;
}

// add last piece
if (start != null && end != null) {
DocumentPiece thePiece = new DocumentPiece(start, end);
theDocumentParts.add(thePiece);
}

return theDocumentParts;
}

/**
* heuristics to get the header section...
* -> it is now covered by the CRF segmentation model
*/
@Deprecated
public String getHeader() {
//if (firstPass)
//BasicStructureBuilder.firstPass(this);

// try first to find the introduction in a safe way
String tmpRes = getHeaderByIntroduction();
if (tmpRes != null) {
if (tmpRes.trim().length() > 0) {
return tmpRes;
}
}

// we apply a heuristics based on the size of first blocks
String res = null;
beginBody = -1;
StringBuilder accumulated = new StringBuilder();
int i = 0;
int nbLarge = 0;
boolean abstractCandidate = false;
for (Block block : blocks) {
String localText = block.getText();
if ((localText == null) || (localText.startsWith("@"))) {
accumulated.append("\n");
continue;
}
localText = localText.trim();
localText = localText.replace(" ", " ");

Matcher ma0 = BasicStructureBuilder.abstract_.matcher(localText);
if ((block.getNbTokens() > 60) || (ma0.find())) {
if (!abstractCandidate) {
// first large block, it should be the abstract
abstractCandidate = true;
} else if (beginBody == -1) {
// second large block, it should be the first paragraph of
// the body
beginBody = i;
for (int j = 0; j <= i + 1; j++) {
Integer inte = j;
if (blockDocumentHeaders == null)
blockDocumentHeaders = new ArrayList<Integer>();
if (!blockDocumentHeaders.contains(inte))
blockDocumentHeaders.add(inte);
}
res = accumulated.toString();
nbLarge = 1;
} else if (block.getNbTokens() > 60) {
nbLarge++;
if (nbLarge > 5) {
return res;
}
}
} else {
Matcher m = BasicStructureBuilder.introduction
.matcher(localText);
if (abstractCandidate) {
if (m.find()) {
// we clearly found the begining of the body
beginBody = i;
for (int j = 0; j <= i; j++) {
Integer inte = j;
if (blockDocumentHeaders == null)
blockDocumentHeaders = new ArrayList<Integer>();
if (!blockDocumentHeaders.contains(inte)) {
blockDocumentHeaders.add(inte);
}
}
return accumulated.toString();
} else if (beginBody != -1) {
if (localText.startsWith("(1|I|A)\\.\\s")) {
beginBody = i;
for (int j = 0; j <= i; j++) {
Integer inte = j;
if (blockDocumentHeaders == null)
blockDocumentHeaders = new ArrayList<Integer>();
if (!blockDocumentHeaders.contains(inte))
blockDocumentHeaders.add(inte);
}
return accumulated.toString();
}
}
} else {
if (m.find()) {
// we clearly found the begining of the body with the
// introduction section
beginBody = i;
for (int j = 0; j <= i; j++) {
Integer inte = j;
if (blockDocumentHeaders == null)
blockDocumentHeaders = new ArrayList<Integer>();
if (!blockDocumentHeaders.contains(inte))
blockDocumentHeaders.add(inte);
}
res = accumulated.toString();
}
}
}

if ((i > 6) && (i > (blocks.size() * 0.6))) {
if (beginBody != -1) {
return res;
} else
return null;
}

accumulated.append(localText).append("\n");
i++;
}

return res;
}

/**
* We return the first page as header estimation... better than nothing when
* nothing is not acceptable.
* <p/>
* -> now covered by the CRF segmentation model
*/
@Deprecated
public String getHeaderLastHope() {
String res;
StringBuilder accumulated = new StringBuilder();
int i = 0;
if ((pages == null) || (pages.size() == 0)) {
return null;
}
for (Page page : pages) {
if ((page.getBlocks() == null) || (page.getBlocks().size() == 0))
continue;
for (Block block : page.getBlocks()) {
String localText = block.getText();
if ((localText == null) || (localText.startsWith("@"))) {
accumulated.append("\n");
continue;
}
localText = localText.trim();
localText = localText.replace(" ", " ");
accumulated.append(localText);
Integer inte = Integer.valueOf(i);
if (blockDocumentHeaders == null)
blockDocumentHeaders = new ArrayList<Integer>();
if (!blockDocumentHeaders.contains(inte))
blockDocumentHeaders.add(inte);
i++;
}
beginBody = i;
break;
}

return accumulated.toString();
}

/**
* We try to match the introduction section in a safe way, and consider if
* minimum requirements are met the blocks before this position as header.
* <p/>
* -> now covered by the CRF segmentation model
*/
@Deprecated
public String getHeaderByIntroduction() {
String res;
StringBuilder accumulated = new StringBuilder();
int i = 0;
for (Block block : blocks) {
String localText = block.getText();
if ((localText == null) || (localText.startsWith("@"))) {
accumulated.append("\n");
continue;
}
localText = localText.trim();

Matcher m = BasicStructureBuilder.introductionStrict
.matcher(localText);
if (m.find()) {
accumulated.append(localText);
beginBody = i;
for (int j = 0; j < i + 1; j++) {
Integer inte = j;
if (blockDocumentHeaders == null)
blockDocumentHeaders = new ArrayList<Integer>();
if (!blockDocumentHeaders.contains(inte))
blockDocumentHeaders.add(inte);
}
res = accumulated.toString();

return res;
}

accumulated.append(localText);
i++;
}

return null;
}

/**
* Return all blocks without markers.
* <p/>
