Merge branch 'master' of https://github.com/kermitt2/grobid
kermitt2 committed Nov 1, 2020
2 parents e686dfc + a9eae59 commit 0d568c5
Showing 4 changed files with 11 additions and 28 deletions.
8 changes: 4 additions & 4 deletions doc/Coordinates-in-PDF.md
@@ -11,11 +11,12 @@ Since April 2017, GROBID version 0.4.2 and higher, coordinate areas can be obtai
* ```persName``` for a complete author name,
* ```figure``` for figure AND table,
* ```formula``` for mathematical equations,
* ```s``` for optional sentence structure.
* ```head``` for section titles,
* ```s``` for optional sentence structure (the GROBID fulltext service must be called with the `segmentSentences` parameter to provide the optional sentence-level elements).

However, there is normally no particular limitation to the type of structures which can have their coordinates in the results, the implementation is on-going, see [issue #69](https://github.com/kermitt2/grobid/issues/69), and it is expected that more or less any structures could be associated with their coordinates in the original PDF.

Coordinates are currently available in full text processing (returning a TEI document) and the PDF annotation services (returning JSON).
Coordinates are currently available in full text processing (returning a TEI document) and the PDF annotation services (returning JSON for `ref`, `figure` and `formula` only).

### GROBID service

@@ -47,8 +48,7 @@ Example (under the project main directory `grobid/`):
> java -Xmx1024m -jar grobid-core/build/libs/grobid-core-0.5.0-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -teiCoordinates -exe processFullText
```

See the [batch mode details](https://grobid.readthedocs.io/en/latest/Grobid-batch/#processfulltext). With the batch mode, it is currently not possible to cherry pick up certain elements, coordinates will appear for all.

See the [batch mode details](https://grobid.readthedocs.io/en/latest/Grobid-batch/#processfulltext). With the batch mode, it is currently not possible to cherry pick up certain elements, coordinates will appear for all. Again, we recommend to use the service for significantly better performances and more customization options.

## Coordinate system in the PDF
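
For readers consuming these attributes, here is a minimal sketch (not part of this commit) of how a `coords` value, such as the one this change attaches to `head` elements, might be parsed in Java. It assumes the value is a semicolon-separated list of bounding boxes, each written as `page,x,y,width,height`, as the GROBID coordinates documentation describes it. The `CoordsParser` and `Box` names are hypothetical and do not exist in GROBID.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a GROBID class.
public class CoordsParser {

    /** One bounding box: page number, upper-left corner, width and height. */
    public static final class Box {
        public final int page;
        public final double x;
        public final double y;
        public final double width;
        public final double height;

        Box(int page, double x, double y, double width, double height) {
            this.page = page;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }

        @Override
        public String toString() {
            return "page " + page + " at (" + x + ", " + y + "), " + width + " x " + height;
        }
    }

    /**
     * Parses a coords attribute value under the assumed layout
     * "page,x,y,width,height;page,x,y,width,height;...".
     * Chunks that do not have exactly five fields are skipped;
     * non-numeric fields raise NumberFormatException.
     */
    public static List<Box> parse(String coords) {
        List<Box> boxes = new ArrayList<>();
        if (coords == null || coords.trim().isEmpty()) {
            return boxes;
        }
        for (String chunk : coords.split(";")) {
            String[] parts = chunk.trim().split(",");
            if (parts.length != 5) {
                continue;
            }
            boxes.add(new Box(
                    Integer.parseInt(parts[0].trim()),
                    Double.parseDouble(parts[1].trim()),
                    Double.parseDouble(parts[2].trim()),
                    Double.parseDouble(parts[3].trim()),
                    Double.parseDouble(parts[4].trim())));
        }
        return boxes;
    }

    public static void main(String[] args) {
        // Illustrative value only, not taken from a real GROBID result.
        for (Box box : parse("1,53.8,194.57,292.06,8.97;1,53.8,205.0,120.5,8.97")) {
            System.out.println(box);
        }
    }
}
```

A section title that wraps over several lines would yield several boxes, so callers would typically iterate over the returned list rather than expect a single rectangle.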

@@ -1195,11 +1195,6 @@ public StringBuilder toTEITextPiece(StringBuilder buffer,
if (clusterLabel.equals(TaggingLabels.SECTION)) {
String clusterContent = LayoutTokensUtil.normalizeDehyphenizeText(cluster.concatTokens());
curDiv = teiElement("div");
/*if (config.isGenerateTeiIds()) {
String divID = KeyGen.getKey().substring(0, 7);
addXmlId(curDiv, "_" + divID);
}*/

Element head = teiElement("head");
// section numbers
org.grobid.core.utilities.Pair<String, String> numb = getSectionNumber(clusterContent);
@@ -1215,6 +1210,13 @@ public StringBuilder toTEITextPiece(StringBuilder buffer,
addXmlId(head, "_" + divID);
}

if (config.isGenerateTeiCoordinates("head") ) {
String coords = LayoutTokensUtil.getCoordsString(cluster.concatTokens());
if (coords != null) {
head.addAttribute(new Attribute("coords", coords));
}
}

curDiv.appendChild(head);
divResults.add(curDiv);
} else if (clusterLabel.equals(TaggingLabels.EQUATION) ||

@@ -110,10 +110,6 @@ public List<OffsetPosition> runSentenceDetection(String text, List<OffsetPositio
return null;
try {
List<OffsetPosition> sentencePositions = sdf.getInstance().detect(text);
/*System.out.println(text);
for(OffsetPosition position : sentencePositions) {
System.out.println("detect: " + text.substring(position.start, position.end));
}*/

// to be sure, we sort the forbidden positions
if (forbidden == null)
@@ -166,12 +162,6 @@ public List<OffsetPosition> runSentenceDetection(String text, List<OffsetPositio
if (textLayoutTokens == null || textLayoutTokens.size() == 0)
return finalSentencePositions;


/*System.out.println("before finalSentencePositions.size(): " + finalSentencePositions.size());
for(OffsetPosition position : finalSentencePositions) {
System.out.println(text.substring(position.start, position.end));
}*/

int pos = 0;

// init sentence index
@@ -225,8 +215,6 @@ public List<OffsetPosition> runSentenceDetection(String text, List<OffsetPositio
}

if (pushedEnd > 0) {
//System.out.println("found extra ref marker: " + text.substring(finalSentencePositions.get(currentSentenceIndex).end,
// finalSentencePositions.get(currentSentenceIndex).end+pushedEnd+1));

OffsetPosition newPosition = finalSentencePositions.get(currentSentenceIndex);
newPosition.end += pushedEnd+1;
@@ -267,12 +255,6 @@ public List<OffsetPosition> runSentenceDetection(String text, List<OffsetPositio
// here, for instance non-breakable italic or bold chunks, or adding sentence split based on
// spacing/indent

/*System.out.println(text);
System.out.println("after finalSentencePositions.size(): " + finalSentencePositions.size());
for(OffsetPosition position : finalSentencePositions) {
System.out.println(text.substring(position.start, position.end));
}*/

return finalSentencePositions;
} catch (Exception e) {
LOGGER.warn("Cannot detect sentences. ", e);

@@ -677,7 +677,6 @@ public Response processPDFReferenceAnnotation(final InputStream inputStream,
.generateTeiCoordinates(elementWithCoords)
.consolidateCitations(consolidateCitations)
.includeRawCitations(includeRawCitations)
.generateTeiCoordinates(elementWithCoords)
.build();

DocumentSource documentSource = DocumentSource.fromPdf(originFile);