Merge pull request #483 from kermitt2/option-442
Add optional raw reference string in results, see #442
kermitt2 committed Aug 15, 2019
2 parents 379c77a + 662c814 commit cad7683
Showing 15 changed files with 196 additions and 278 deletions.
199 changes: 1 addition & 198 deletions LICENSE
@@ -187,7 +187,7 @@
same "printed page" as the copyright notice for easier
identification within third-party archives.

-Copyright 2008-2018 GROBID's contributors
+Copyright 2008-2019 GROBID's contributors

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
@@ -201,200 +201,3 @@
See the License for the specific language governing permissions and
limitations under the License.


-Apache CouchDB Subcomponents
-
-The Apache CouchDB project includes a number of subcomponents with separate
-copyright notices and license terms. Your use of the code for the these
-subcomponents is subject to the terms and conditions of the following licenses.
-
-For the m4/ac_check_icu.m4 component:
-
-Copyright (c) 2005 Akos Maroy <darkeye@tyrell.hu>
-
-Copying and distribution of this file, with or without modification, are
-permitted in any medium without royalty provided the copyright notice
-and this notice are preserved.
-
-For the share/www/script/jquery.js component:
-
-Copyright (c) 2009 John Resig, http://jquery.com/
-
-Permission is hereby granted, free of charge, to any person obtaining
-a copy of this software and associated documentation files (the
-"Software"), to deal in the Software without restriction, including
-without limitation the rights to use, copy, modify, merge, publish,
-distribute, sublicense, and/or sell copies of the Software, and to
-permit persons to whom the Software is furnished to do so, subject to
-the following conditions:
-
-The above copyright notice and this permission notice shall be
-included in all copies or substantial portions of the Software.
-
-THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
-EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
-MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
-NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
-LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
-OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
-WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
-
-For the share/www/script/jquery-ui-1.8.11.custom.min.js and
-share/www/style/jquery-ui-1.8.11.custom.css components:
-
-Copyright (c) 2011 Paul Bakaus, http://jqueryui.com/
-
-This software consists of voluntary contributions made by many
-individuals (AUTHORS.txt, http://jqueryui.com/about) For exact
-contribution history, see the revision history and logs, available
-at http://jquery-ui.googlecode.com/svn/
-
-Permission is hereby granted, free of charge, to any person obtaining
-a copy of this software and associated documentation files (the
-"Software"), to deal in the Software without restriction, including
-without limitation the rights to use, copy, modify, merge, publish,
-distribute, sublicense, and/or sell copies of the Software, and to
-permit persons to whom the Software is furnished to do so, subject to
-the following conditions:
-
-The above copyright notice and this permission notice shall be
-included in all copies or substantial portions of the Software.
-
-THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
-EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
-MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
-NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
-LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
-OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
-WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
-
-For the share/www/script/jquery.form.js component:
-
-http://malsup.com/jquery/form/
-
-Permission is hereby granted, free of charge, to any person obtaining
-a copy of this software and associated documentation files (the
-"Software"), to deal in the Software without restriction, including
-without limitation the rights to use, copy, modify, merge, publish,
-distribute, sublicense, and/or sell copies of the Software, and to
-permit persons to whom the Software is furnished to do so, subject to
-the following conditions:
-
-The above copyright notice and this permission notice shall be
-included in all copies or substantial portions of the Software.
-
-THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
-EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
-MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
-NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
-LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
-OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
-WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
-
-For the share/www/script/json2.js component:
-
-Public Domain
-
-No warranty expressed or implied. Use at your own risk.
-
-For the src/mochiweb component:
-
-Copyright (c) 2007 Mochi Media, Inc.
-
-Permission is hereby granted, free of charge, to any person obtaining
-a copy of this software and associated documentation files (the
-"Software"), to deal in the Software without restriction, including
-without limitation the rights to use, copy, modify, merge, publish,
-distribute, sublicense, and/or sell copies of the Software, and to
-permit persons to whom the Software is furnished to do so, subject to
-the following conditions:
-
-The above copyright notice and this permission notice shall be
-included in all copies or substantial portions of the Software.
-
-THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
-EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
-MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
-NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
-LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
-OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
-WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
-
-For the src/ibrowse component:
-
-Copyright (c) 2006, Chandrashekhar Mullaparthi
-All rights reserved.
-
-Redistribution and use in source and binary forms, with or without
-modification, are permitted provided that the following conditions are met:
-
-* Redistributions of source code must retain the above copyright notice,
-this list of conditions and the following disclaimer.
-* Redistributions in binary form must reproduce the above copyright notice,
-this list of conditions and the following disclaimer in the documentation
-and/or other materials provided with the distribution.
-* Neither the name of the T-Mobile nor the names of its contributors may be
-used to endorse or promote products derived from this software without
-specific prior written permission.
-
-THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
-ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
-WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
-DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
-ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
-(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
-LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
-ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
-(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-
-For the src/erlang-oauth component:
-
-Copyright (c) 2008-2009 Tim Fletcher <http://tfletcher.com/>
-
-Permission is hereby granted, free of charge, to any person
-obtaining a copy of this software and associated documentation
-files (the "Software"), to deal in the Software without
-restriction, including without limitation the rights to use,
-copy, modify, merge, publish, distribute, sublicense, and/or sell
-copies of the Software, and to permit persons to whom the
-Software is furnished to do so, subject to the following
-conditions:
-
-The above copyright notice and this permission notice shall be
-included in all copies or substantial portions of the Software.
-
-THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
-EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
-OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
-NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
-HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
-WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
-FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
-OTHER DEALINGS IN THE SOFTWARE.
-
-For the src/etap component:
-
-Copyright (c) 2008-2009 Nick Gerakines <nick@gerakines.net>
-
-Permission is hereby granted, free of charge, to any person
-obtaining a copy of this software and associated documentation
-files (the "Software"), to deal in the Software without
-restriction, including without limitation the rights to use,
-copy, modify, merge, publish, distribute, sublicense, and/or sell
-copies of the Software, and to permit persons to whom the
-Software is furnished to do so, subject to the following
-conditions:
-
-The above copyright notice and this permission notice shall be
-included in all copies or substantial portions of the Software.
-
-THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
-EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
-OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
-NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
-HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
-WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
-FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
-OTHER DEALINGS IN THE SOFTWARE.

17 changes: 17 additions & 0 deletions doc/Grobid-service.md
@@ -139,6 +139,7 @@ Convert the complete input document into TEI XML format (header, body and biblio
| POST, PUT | multipart/form-data | application/xml | input | required | PDF file to be processed |
| | | |consolidateHeader| optional | consolidateHeader is a string of value 0 (no consolidation) or 1 (consolidate and inject all extra metadata, default value), or 2 (consolidate and inject only the DOI value). |
| | | |consolidateCitations| optional | consolidateCitations is a string of value 0 (no consolidation, default value) or 1 (consolidate and inject all extra metadata), or 2 (consolidate and inject only the DOI value). |
+| | | |includeRawCitations| optional | includeRawCitations is a boolean value, 0 (default. do not include raw reference string in the result) or 1 (include raw reference string in the result). |
| | | |teiCoordinates| optional | list of element names for which coordinates in the PDF document have to be added, see [Coordinates of structures in the original PDF](Coordinates-in-PDF.md) for more details |

Response status codes:
@@ -170,6 +171,12 @@ fulltext extraction and add coordinates for all the supported coordinate element
> curl -v --form input=@./12248_2011_Article_9260.pdf --form teiCoordinates=persName --form teiCoordinates=figure --form teiCoordinates=ref --form teiCoordinates=biblStruct --form teiCoordinates=formula localhost:8070/api/processFulltextDocument
```

+Regarding the bibliographical references, it is possible to include the original raw reference string in the parsed bibliographical result with the parameter `includeRawCitations` set to 1:
+
+```bash
+curl -v --form input=@./thefile.pdf --form includeRawCitations=1 localhost:8070/api/processFulltextDocument
+```
+
#### /api/processReferences

Extract and convert all the bibliographical references present in the input document into TEI XML format.
@@ -178,6 +185,7 @@ Extract and convert all the bibliographical references present in the input docu
|--- |--- |--- |--- |--- |--- |
| POST, PUT | multipart/form-data | application/xml | input | required | PDF file to be processed |
| | | |consolidateCitations| optional | is a string of value 0 (no consolidation, default value) or 1 (consolidate all found bib. ref. and inject all extra metadata), or 2 (consolidate all found bib. ref. and inject only the DOI value). |
+| | | |includeRawCitations| optional | includeRawCitations is a boolean value, 0 (default. do not include raw reference string in the result) or 1 (include raw reference string in the result). |

Response status codes:

@@ -196,6 +204,12 @@ You can test this service with the **cURL** command lines, for instance extracti
curl -v --form input=@./thefile.pdf localhost:8070/api/processReferences
```

+It is possible to include the original raw reference string in the parsed result with the parameter `includeRawCitations` set to 1:
+
+```bash
+curl -v --form input=@./thefile.pdf --form includeRawCitations=1 localhost:8070/api/processReferences
+```
+
### Raw text to TEI conversion services

#### /api/processDate
@@ -461,6 +475,7 @@ Extract and parse the patent and non patent citations in the description of a pa
|--- |--- |--- |--- |--- |--- |
| POST, PUT | application/x-www-form-urlencoded | application/xml | input | required | patent text to be processed as raw string|
| | | |consolidateCitations| optional | consolidateCitations is a string of value 0 (no consolidation, default value) or 1 (consolidate the citation and inject extra metadata) or 2 (consolidate and inject DOI only) |
+| | | |includeRawCitations| optional | for non patent citations, includeRawCitations is a boolean value, 0 (default. do not include raw reference string in the result) or 1 (include raw reference string in the result). |


Response status codes:
@@ -518,6 +533,7 @@ Extract and parse the patent and non patent citations in the description of a pa
|--- |--- |--- |--- |--- |--- |
| POST, PUT | multipart/form-data | application/xml | input | required | XML file in ST36 standard of the patent document to be processed |
| | | |consolidateCitations| optional | consolidateCitations is a string of value 0 (no consolidation, default value) or 1 (consolidate the citation and inject extra metadata) or 2 (consolidate and inject DOI only) |
+| | | |includeRawCitations| optional | for non patent citations, includeRawCitations is a boolean value, 0 (default. do not include raw reference string in the result) or 1 (include raw reference string in the result). |


Response status codes:
@@ -543,6 +559,7 @@ Extract and parse the patent and non patent citations in the description of a pa
|--- |--- |--- |--- |--- |--- |
| POST, PUT | multipart/form-data | application/xml | input | required | PDF file of the patent document to be processed |
| | | |consolidateCitations| optional | consolidateCitations is a string of value 0 (no consolidation, default value) or 1 (consolidate the citation and inject extra metadata) or 2 (consolidate and inject DOI only) |
+| | | |includeRawCitations| optional | for non patent citations, includeRawCitations is a boolean value, 0 (default. do not include raw reference string in the result) or 1 (include raw reference string in the result). |


Response status codes:
23 changes: 19 additions & 4 deletions grobid-core/src/main/java/org/grobid/core/data/BibDataSet.java
@@ -1,6 +1,7 @@
package org.grobid.core.data;

import java.util.*;
+import org.grobid.core.engines.config.GrobidAnalysisConfig;

/**
* Structure for representing the different information for a citation and its different context of citation.
@@ -98,18 +99,32 @@ public String toString() {
+ ", refSymbol=" + refSymbol + ", rawBib=" + rawBib
+ ", confidence=" + confidence + ", offsets=" + offsets + "]";
}

+public String toTEI() {
+return toTEI(false);
+}

-public String toTEI() {
+public String toTEI(boolean includeRawCitations) {
if (resBib != null) {
-return resBib.toTEI(-1);
+GrobidAnalysisConfig config = GrobidAnalysisConfig.builder()
+.includeRawCitations(includeRawCitations)
+.build();
+return resBib.toTEI(-1, 0, config);
} else {
return "";
}
}

-public String toTEI(int p) {
+public String toTEI(int p) {
+return toTEI(p, false);
+}
+
+public String toTEI(int p, boolean includeRawCitations) {
if (resBib != null) {
-return resBib.toTEI(p);
+GrobidAnalysisConfig config = GrobidAnalysisConfig.builder()
+.includeRawCitations(includeRawCitations)
+.build();
+return resBib.toTEI(p, 0, config);
} else {
return "";
}
23 changes: 20 additions & 3 deletions grobid-core/src/main/java/org/grobid/core/data/BiblioItem.java
@@ -2699,10 +2699,16 @@ else if (this.getYear().length() == 4)
}

if (dedication != null) {
+for (int i = 0; i < indent + 1; i++) {
+tei.append("\t");
+}
tei.append("<note type=\"dedication\">" + TextUtilities.HTMLEncode(dedication) + "</note>\n");
}

if (book_type != null) {
+for (int i = 0; i < indent + 1; i++) {
+tei.append("\t");
+}
tei.append("<note type=\"report_type\">" + TextUtilities.HTMLEncode(book_type) + "</note>\n");
}

@@ -2793,9 +2799,20 @@ else if (this.getYear().length() == 4)
for (int i = 0; i < indent + 1; i++) {
tei.append("\t");
}
-tei.append("<div type=\"abstract\">" + abstract_ + "</div>\n");
+tei.append("<div type=\"abstract\">" + TextUtilities.HTMLEncode(abstract_) + "</div>\n");
}
}
+
+if (config.getIncludeRawCitations() && !StringUtils.isEmpty(reference) ) {
+for (int i = 0; i < indent + 1; i++) {
+tei.append("\t");
+}
+String localReference = TextUtilities.HTMLEncode(reference);
+localReference = localReference.replace("\n", " ");
+localReference = localReference.replaceAll("( )+", " ");
+tei.append("<note type=\"raw_reference\">" + localReference + "</note>\n");
+}
+
for (int i = 0; i < indent; i++) {
tei.append("\t");
}
@@ -3910,7 +3927,7 @@ public String toTEIAuthorBlock(int nbTag, boolean withCoordinates) {
private static volatile Pattern page = Pattern.compile("(\\d+)");

/**
-* Correct fields of the first biblio item based on the second one and he reference string.
+* Correct fields of the first biblio item based on the second one and the reference string.
*/
public void postProcessPages() {
if (pageRange != null) {
@@ -3971,7 +3988,7 @@ public static void injectDOI(BiblioItem bib, BiblioItem bibo) {
}

/**
-* Correct fields of the first biblio item based on the second one and he reference string
+* Correct fields of the first biblio item based on the second one and the reference string
*/
public static void correct(BiblioItem bib, BiblioItem bibo) {
//System.out.println("correct: \n" + bib.toTEI(0));
@@ -1439,6 +1439,7 @@ public StringBuilder toTEIReferences(StringBuilder tei,
if (bds.size() > 0) {
for (BibDataSet bib : bds) {
BiblioItem bit = bib.getResBib();
+bit.setReference(bib.getRawBib());
if (bit != null) {
tei.append("\n" + bit.toTEI(p, 0, config));
} else {
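
The raw-reference handling added to `BiblioItem.java` in this commit (encode the string, replace newlines with spaces, collapse runs of spaces, wrap the result in a `<note type="raw_reference">` element) can be sketched in isolation. This is a minimal sketch, not GROBID's code: the class `RawReferenceNote` and the helper `escapeXml` are hypothetical names, and `escapeXml` only approximates `TextUtilities.HTMLEncode`, whose exact behaviour is not shown in this diff.

```java
// Hypothetical, self-contained illustration of the raw_reference note logic.
final class RawReferenceNote {

    private RawReferenceNote() {
    }

    // Assumed stand-in for TextUtilities.HTMLEncode: escapes only the
    // characters that are unsafe in XML text content.
    static String escapeXml(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':
                    sb.append("&amp;");
                    break;
                case '<':
                    sb.append("&lt;");
                    break;
                case '>':
                    sb.append("&gt;");
                    break;
                default:
                    sb.append(c);
            }
        }
        return sb.toString();
    }

    // Same three steps as in the diff: encode, turn newlines into spaces,
    // then collapse any run of spaces into a single space.
    static String normalize(String reference) {
        String local = escapeXml(reference);
        local = local.replace("\n", " ");
        local = local.replaceAll("( )+", " ");
        return local;
    }

    // Emits the note only when the option is on and the reference is non-empty,
    // mirroring the guard on config.getIncludeRawCitations() in the diff.
    static String toNote(String reference, boolean includeRawCitations) {
        if (!includeRawCitations || reference == null || reference.isEmpty()) {
            return "";
        }
        return "<note type=\"raw_reference\">" + normalize(reference) + "</note>\n";
    }
}
```

With the option off, `toNote` returns an empty string, so existing output is unchanged unless `includeRawCitations` is explicitly requested.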