Skip to content

Commit

Permalink
Merge branch 'master' into bugfix/sent-seg-ack-fund
Browse files Browse the repository at this point in the history
  • Loading branch information
lfoppiano committed Jun 9, 2024
2 parents cf6fb98 + cb7118d commit bbca7dd
Show file tree
Hide file tree
Showing 14 changed files with 921 additions and 226 deletions.
61 changes: 34 additions & 27 deletions doc/Grobid-service.md
Original file line number Diff line number Diff line change
Expand Up @@ -133,24 +133,30 @@ Extract the header of the input PDF document, normalize it and convert it into a

`consolidateHeader` is a string of value `0` (no consolidation), `1` (consolidate and inject all extra metadata, default value), or `2` (consolidate the header metadata and inject DOI only).

| method | request type | response type | parameters | requirement | description |
|--- |--- |--- |--- |--- |--- |
| POST, PUT | `multipart/form-data` | `application/xml` | `input` | required | PDF file to be processed |
| | | | `consolidateHeader` | optional | consolidateHeader is a string of value `0` (no consolidation), `1` (consolidate and inject all extra metadata, default value), `2` (consolidate the header and inject DOI only), or `3` (consolidate using only extracted DOI - if extracted) . |
| | | | `includeRawAffiliations` | optional | `includeRawAffiliations` is a boolean value, `0` (default, do not include raw affiliation string in the result) or `1` (include raw affiliation string in the result). |
| | | | `includeRawCopyrights` | optional | `includeRawCopyrights` is a boolean value, `0` (default, do not include raw copyrights/license string in the result) or `1` (include raw copyrights/license string in the result). |
| method | request type | response type | parameters | requirement | description |
|------------|-----------------------|---------------------|--------------------------|----------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| POST, PUT | `multipart/form-data` | `application/xml` | `input` | required | PDF file to be processed |
| | | | `consolidateHeader` | optional | consolidateHeader is a string of value `0` (no consolidation), `1` (consolidate and inject all extra metadata, default value), `2` (consolidate the header and inject DOI only), or `3` (consolidate using only extracted DOI - if extracted) . |
| | | | `includeRawAffiliations` | optional | `includeRawAffiliations` is a boolean value, `0` (default, do not include raw affiliation string in the result) or `1` (include raw affiliation string in the result). |
| | | | `includeRawCopyrights` | optional | `includeRawCopyrights` is a boolean value, `0` (default, do not include raw copyrights/license string in the result) or `1` (include raw copyrights/license string in the result). |

Use `Accept: application/x-bibtex` to retrieve BibTeX format instead of XML TEI.

However, please bear in mind the following information:

* the TEI XML format is much richer and structured, it should be preferred if there is no particular reason to use BibTeX, so we recommend to always use `Accept: application/xml`.
* always supply an `Accept` header or the response type may be inconsistent. There is no easy way to supply a default response type in the API. See discussion [#1093](https://github.com/kermitt2/grobid/issues/1093).

Use `Accept: application/x-bibtex` to retrieve BibTeX format instead of XML TEI. Note: the TEI XML format is much richer and structured, it should be preferred if there is no particular reason to use BibTeX, so we recommend to always use `Accept: application/xml`.

Response status codes:

| HTTP Status code | reason |
|--- |--- |
| 200 | Successful operation. |
| 204 | Process was completed, but no content could be extracted and structured |
| 400 | Wrong request, missing parameters, missing header |
| 500 | Indicate an internal service error, further described by a provided message |
| 503 | The service is not available, which usually means that all the threads are currently used |
| HTTP Status code | reason |
|-------------------|-------------------------------------------------------------------------------------------|
| 200 | Successful operation. |
| 204 | Process was completed, but no content could be extracted and structured |
| 400 | Wrong request, missing parameters, missing header |
| 500 | Indicate an internal service error, further described by a provided message |
| 503 | The service is not available, which usually means that all the threads are currently used |

A `503` error with the default parallel mode normally means that all the threads available to GROBID are currently used. The client need to re-send the query after a wait time that will allow the server to free some threads. The wait time depends on the service and the capacities of the server, we suggest 2 seconds for the `processHeaderDocument` service.

Expand All @@ -170,19 +176,20 @@ curl -v -H "Accept: application/x-bibtex" --form input=@./thefile.pdf localhost:

Convert the complete input document into TEI XML format (header, body and bibliographical section).

| method | request type | response type | parameters | requirement | description |
|--- |--- |--- |--- |--- |--- |
| POST, PUT | `multipart/form-data` | `application/xml` | `input` | required | PDF file to be processed |
| | | | `consolidateHeader` | optional | `consolidateHeader` is a string of value `0` (no consolidation), `1` (consolidate and inject all extra metadata, default value), `2` (consolidate the citation and inject DOI only), or `3` (consolidate using only extracted DOI - if extracted). |
| | | | `consolidateCitations` | optional | `consolidateCitations` is a string of value `0` (no consolidation, default value) or `1` (consolidate and inject all extra metadata), or `2` (consolidate the citation and inject DOI only). |
| | | | `consolidatFunders` | optional | `consolidateFunders` is a string of value `0` (no consolidation, default value) or `1` (consolidate and inject all extra metadata), or `2` (consolidate the funder and inject DOI only). |
| | | | `includeRawCitations` | optional | `includeRawCitations` is a boolean value, `0` (default, do not include raw reference string in the result) or `1` (include raw reference string in the result). |
| | | | `includeRawAffiliations` | optional | `includeRawAffiliations` is a boolean value, `0` (default, do not include raw affiliation string in the result) or `1` (include raw affiliation string in the result). |
| | | | `includeRawCopyrights` | optional | `includeRawCopyrights` is a boolean value, `0` (default, do not include raw copyrights/license string in the result) or `1` (include raw copyrights/license string in the result). |
| | | | `teiCoordinates` | optional | list of element names for which coordinates in the PDF document have to be added, see [Coordinates of structures in the original PDF](Coordinates-in-PDF.md) for more details |
| | | | `segmentSentences` | optional | Paragraphs structures in the resulting TEI will be further segmented into sentence elements <s> |
| | | | `start` | optional | Start page number of the PDF to be considered, previous pages will be skipped/ignored, integer with first page starting at `1`, (default `-1`, start from the first page of the PDF) |
| | | | `end` | optional | End page number of the PDF to be considered, next pages will be skipped/ignored, integer with first page starting at `1` (default `-1`, end with the last page of the PDF) |
| method | request type | response type | parameters | requirement | description |
|--- |--- |--- |--------------------------|-----------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| POST, PUT | `multipart/form-data` | `application/xml` | `input` | required | PDF file to be processed |
| | | | `consolidateHeader` | optional | `consolidateHeader` is a string of value `0` (no consolidation), `1` (consolidate and inject all extra metadata, default value), `2` (consolidate the citation and inject DOI only), or `3` (consolidate using only extracted DOI - if extracted). |
| | | | `consolidateCitations` | optional | `consolidateCitations` is a string of value `0` (no consolidation, default value) or `1` (consolidate and inject all extra metadata), or `2` (consolidate the citation and inject DOI only). |
| | | | `consolidatFunders` | optional | `consolidateFunders` is a string of value `0` (no consolidation, default value) or `1` (consolidate and inject all extra metadata), or `2` (consolidate the funder and inject DOI only). |
| | | | `includeRawCitations` | optional | `includeRawCitations` is a boolean value, `0` (default, do not include raw reference string in the result) or `1` (include raw reference string in the result). |
| | | | `includeRawAffiliations` | optional | `includeRawAffiliations` is a boolean value, `0` (default, do not include raw affiliation string in the result) or `1` (include raw affiliation string in the result). |
| | | | `includeRawCopyrights` | optional | `includeRawCopyrights` is a boolean value, `0` (default, do not include raw copyrights/license string in the result) or `1` (include raw copyrights/license string in the result). |
| | | | `teiCoordinates` | optional | list of element names for which coordinates in the PDF document have to be added, see [Coordinates of structures in the original PDF](Coordinates-in-PDF.md) for more details |
| | | | `segmentSentences` | optional | Paragraphs structures in the resulting TEI will be further segmented into sentence elements <s> |
| | | | `generateIds` | optional | if supplied as a string equal to `1`, it generates uniqe identifiers for each text component |
| | | | `start` | optional | Start page number of the PDF to be considered, previous pages will be skipped/ignored, integer with first page starting at `1`, (default `-1`, start from the first page of the PDF) |
| | | | `end` | optional | End page number of the PDF to be considered, next pages will be skipped/ignored, integer with first page starting at `1` (default `-1`, end with the last page of the PDF) |

Response status codes:

Expand Down
2 changes: 1 addition & 1 deletion grobid-core/src/main/java/org/grobid/core/data/Figure.java
Original file line number Diff line number Diff line change
Expand Up @@ -432,7 +432,7 @@ public String toTEI(GrobidAnalysisConfig config, Document doc, TEIFormatter form
}

if (desc != null && config.isWithSentenceSegmentation()) {
formatter.segmentIntoSentences(desc, this.captionLayoutTokens, config, doc.getLanguage());
formatter.segmentIntoSentences(desc, this.captionLayoutTokens, config, doc.getLanguage(), doc.getPDFAnnotations());

// we need a sentence segmentation of the figure caption, for that we need to introduce
// a <div>, then a <p>
Expand Down
8 changes: 2 additions & 6 deletions grobid-core/src/main/java/org/grobid/core/data/Note.java
Original file line number Diff line number Diff line change
@@ -1,8 +1,8 @@
package org.grobid.core.data;

import org.apache.commons.collections4.CollectionUtils;
import org.apache.commons.lang3.StringUtils;
import org.grobid.core.layout.LayoutToken;
import org.grobid.core.layout.Page;
import org.grobid.core.utilities.*;

import java.util.List;
Expand Down Expand Up @@ -127,10 +127,6 @@ public void setNoteType(NoteType noteType) {
}

public String getNoteTypeName() {
if (this.noteType == NoteType.FOOT) {
return "foot";
} else {
return "margin";
}
return StringUtils.lowerCase(noteType.name());
}
}
4 changes: 2 additions & 2 deletions grobid-core/src/main/java/org/grobid/core/data/Table.java
Original file line number Diff line number Diff line change
Expand Up @@ -141,7 +141,7 @@ public String toTEI(GrobidAnalysisConfig config, Document doc, TEIFormatter form
}

if (desc != null && config.isWithSentenceSegmentation()) {
formatter.segmentIntoSentences(desc, this.captionLayoutTokens, config, doc.getLanguage());
formatter.segmentIntoSentences(desc, this.captionLayoutTokens, config, doc.getLanguage(), doc.getPDFAnnotations());

// we need a sentence segmentation of the table caption, for that we need to introduce
// a <div>, then a <p>
Expand Down Expand Up @@ -215,7 +215,7 @@ public String toTEI(GrobidAnalysisConfig config, Document doc, TEIFormatter form

if (noteNode != null && config.isWithSentenceSegmentation()) {
// we need a sentence segmentation of the figure caption
formatter.segmentIntoSentences(noteNode, this.noteLayoutTokens, config, doc.getLanguage());
formatter.segmentIntoSentences(noteNode, this.noteLayoutTokens, config, doc.getLanguage(), doc.getPDFAnnotations());
}

// enclose note content in a <p> element
Expand Down
Loading

0 comments on commit bbca7dd

Please sign in to comment.