Merge branch 'master' into new-figure-table-models
lfoppiano committed Dec 17, 2023
2 parents e2bf621 + 6bd974d commit d189cb5
Showing 64 changed files with 1,111 additions and 374 deletions.
28 changes: 28 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,34 @@ All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).

## [0.8.0] - 2023-11-19

### Added

+ Extraction of funder and funding information with a specific new model, see https://github.com/kermitt2/grobid/pull/1046 for details
+ Optional consolidation of funder with CrossRef Funder Registry
+ Identification of acknowledged entities in the acknowledgement section
+ Optional coordinates in title elements

### Changed

+ Dropwizard upgrade to 4.0
+ Minimum JDK/JVM requirement for building/running the project is now 1.11
+ Logging now with Logback, removal of Log4j2, optional logs in json format
+ General review of logs
+ Enable Github actions / Disable circleci

### Fixed

+ Set dynamic memory limit in pdfalto_server #1038
+ Logging to files when training models now works as expected
+ Various dependency upgrades
+ Fix #1051 with possible problematic PDF
+ Fix #1036 for pdfalto memory limit
+ Fix readthedocs build #1040
+ Fix for null equation #1030
+ Other minor fixes

## [0.7.3] – 2023-05-13

### Added
8 changes: 4 additions & 4 deletions Dockerfile.delft
@@ -2,14 +2,14 @@

## See https://grobid.readthedocs.io/en/latest/Grobid-docker/

## usage example with version 0.7.3:
## docker build -t grobid/grobid:0.7.3 --build-arg GROBID_VERSION=0.7.3 --file Dockerfile.delft .
## usage example with version 0.8.0:
## docker build -t grobid/grobid:0.8.0 --build-arg GROBID_VERSION=0.8.0 --file Dockerfile.delft .

## no GPU:
## docker run -t --rm --init -p 8070:8070 -p 8071:8071 -v /home/lopez/grobid/grobid-home/config/grobid.properties:/opt/grobid/grobid-home/config/grobid.properties:ro grobid/grobid:0.7.3
## docker run -t --rm --init -p 8070:8070 -p 8071:8071 -v /home/lopez/grobid/grobid-home/config/grobid.properties:/opt/grobid/grobid-home/config/grobid.properties:ro grobid/grobid:0.8.0

## allocate all available GPUs (only Linux with proper nvidia driver installed on host machine):
## docker run --rm --gpus all --init -p 8070:8070 -p 8071:8071 -v /home/lopez/grobid/grobid-home/config/grobid.properties:/opt/grobid/grobid-home/config/grobid.properties:ro grobid/grobid:0.7.3
## docker run --rm --gpus all --init -p 8070:8070 -p 8071:8071 -v /home/lopez/grobid/grobid-home/config/grobid.properties:/opt/grobid/grobid-home/config/grobid.properties:ro grobid/grobid:0.8.0

# -------------------
# build builder image
3 changes: 2 additions & 1 deletion Readme.md
@@ -24,14 +24,15 @@ The following functionalities are available:
- __Header extraction and parsing__ from articles in PDF format. The extraction here covers the usual bibliographical information (e.g. title, abstract, authors, affiliations, keywords, etc.).
- __References extraction and parsing__ from articles in PDF format, around .87 F1-score against an independent PubMed Central set of 1943 PDFs containing 90,125 references, and around .90 on a similar bioRxiv set of 2000 PDFs (using the Deep Learning citation model). All the usual publication metadata are covered (including DOI, PMID, etc.).
- __Citation contexts recognition and resolution__ of the full bibliographical references of the article. The accuracy of citation contexts resolution is between .76 and .91 F1-score depending on the evaluation collection (this corresponds to both the correct identification of the citation callout and its correct association with a full bibliographical reference).
- __Full text extraction and structuring__ from PDF articles, including a model for the overall document segmentation and models for the structuring of the text body (paragraph, section titles, reference and footnote callouts, figures, tables, etc.).
- __Full text extraction and structuring__ from PDF articles, including a model for the overall document segmentation and models for the structuring of the text body (paragraph, section titles, reference and footnote callouts, figures, tables, data availability statements, etc.).
- __PDF coordinates__ for extracted information, making it possible to create "augmented" interactive PDFs based on the bounding boxes of the identified structures.
- Parsing of __references in isolation__ (above .90 F1-score at instance-level, .95 F1-score at field level, using the Deep Learning model).
- __Parsing of names__ (e.g. person title, forenames, middle name, etc.), in particular author names in header, and author names in references (two distinct models).
- __Parsing of affiliation and address__ blocks.
- __Parsing of dates__, ISO normalized day, month, year.
- __Consolidation/resolution of the extracted bibliographical references__ using the [biblio-glutton](https://github.com/kermitt2/biblio-glutton) service or the [CrossRef REST API](https://github.com/CrossRef/rest-api-doc). In both cases, DOI/PMID resolution performance is higher than 0.95 F1-score from PDF extraction.
- __Extraction and parsing of patent and non-patent references in patent__ publications.
- __Extraction of Funders and funding information__ with optional matching of extracted funders with the CrossRef Funder Registry.

In a complete PDF processing, GROBID manages 55 final labels used to build relatively fine-grained structures, from traditional publication metadata (title, author first/last/middle names, affiliation types, detailed address, journal, volume, issue, pages, DOI, PMID, etc.) to full text structures (section title, paragraph, reference markers, head/foot notes, figure captions, etc.).

116 changes: 65 additions & 51 deletions build.gradle
@@ -34,6 +34,8 @@ allprojects {

tasks.withType(JavaCompile) {
options.encoding = 'UTF-8'
// note: the following is not working
options.compilerArgs << '-parameters'
}
}

@@ -44,17 +46,17 @@ subprojects {
publishing {
publications {
mavenJava(MavenPublication) {
//from components.java
artifact jar
from components.java
//artifact jar
}
}
repositories {
mavenLocal()
}
}

sourceCompatibility = 1.8
targetCompatibility = 1.8
sourceCompatibility = 1.11
targetCompatibility = 1.11

repositories {
mavenCentral()
@@ -64,11 +66,11 @@ subprojects {
maven { url "https://jitpack.io" }
}

/*configurations {
configurations {
all*.exclude group: 'org.slf4j', module: "slf4j-log4j12"
//all*.exclude group: 'log4j', module: "log4j"
// implementation.setCanBeResolved(true)
}*/
all*.exclude group: 'log4j', module: "log4j"
implementation.setCanBeResolved(true)
}

ext {
// treating them separately, these jars will be flattened into grobid-core.jar on installing,
@@ -84,29 +86,31 @@ subprojects {
// packaging local libs inside grobid-core.jar
implementation fileTree(dir: new File(rootProject.rootDir, 'grobid-core/localLibs'), include: localLibs)

testImplementation "junit:junit:4.12"
testImplementation "org.easymock:easymock:3.4"
testRuntimeOnly 'org.junit.vintage:junit-vintage-engine:5.9.3'
testImplementation(platform('org.junit:junit-bom:5.9.3'))
testImplementation('org.junit.jupiter:junit-jupiter')
testImplementation 'org.easymock:easymock:5.1.0'
testImplementation "org.powermock:powermock-api-easymock:2.0.7"
testImplementation "org.powermock:powermock-module-junit4:2.0.7"
testImplementation "xmlunit:xmlunit:1.6"
testImplementation "org.hamcrest:hamcrest-all:1.3"

implementation "com.cybozu.labs:langdetect:1.1-20120112"
implementation "com.rockymadden.stringmetric:stringmetric-core_2.10:0.27.3"
implementation "com.rockymadden.stringmetric:stringmetric-core_2.11:0.27.4"
implementation "commons-pool:commons-pool:1.6"
implementation "commons-io:commons-io:2.5"
implementation "org.apache.commons:commons-lang3:3.6"
implementation "org.apache.commons:commons-collections4:4.1"
implementation 'org.apache.commons:commons-text:1.8'
implementation 'org.apache.commons:commons-text:1.11.0'
implementation "commons-dbutils:commons-dbutils:1.7"
implementation "com.google.guava:guava:28.2-jre"
implementation "com.google.guava:guava:31.0.1-jre"
implementation "org.apache.httpcomponents:httpclient:4.5.3"
implementation "black.ninia:jep:4.0.2"

implementation "com.fasterxml.jackson.core:jackson-core:2.10.1"
implementation "com.fasterxml.jackson.core:jackson-databind:2.10.1"
implementation "com.fasterxml.jackson.module:jackson-module-afterburner:2.10.1"
implementation "com.fasterxml.jackson.dataformat:jackson-dataformat-yaml:2.10.1"
implementation "com.fasterxml.jackson.core:jackson-core:2.14.3"
implementation "com.fasterxml.jackson.core:jackson-databind:2.14.3"
implementation "com.fasterxml.jackson.module:jackson-module-afterburner:2.14.3"
implementation "com.fasterxml.jackson.dataformat:jackson-dataformat-yaml:2.14.3"
implementation 'org.apache.xmlgraphics:batik-anim:1.14'
implementation 'org.apache.xmlgraphics:batik-bridge:1.14'
implementation 'org.apache.xmlgraphics:batik-svg-dom:1.14'
@@ -151,6 +155,8 @@ subprojects {
// }

test {
useJUnitPlatform()

testLogging.showStandardStreams = true
// enable for having separate test executor for different tests
forkEvery = 1
@@ -174,7 +180,7 @@ subprojects {

if (JavaVersion.current().compareTo(JavaVersion.VERSION_1_8) > 0) {
jvmArgs "--add-opens", "java.base/java.util.stream=ALL-UNNAMED",
"--add-opens", "java.base/java.io=ALL-UNNAMED"
"--add-opens", "java.base/java.io=ALL-UNNAMED", "--add-opens", "java.xml/jdk.xml.internal=ALL-UNNAMED"
}
systemProperty "java.library.path","${System.getProperty('java.library.path')}:" + libraries
}
@@ -201,9 +207,8 @@ project("grobid-core") {
}

// Logs
api 'org.slf4j:slf4j-api:1.7.25'
//api 'org.slf4j:slf4j-log4j12:1.7.25'
runtimeOnly 'org.slf4j:slf4j-jdk14:1.7.25'
implementation 'org.slf4j:slf4j-api:1.7.30'
implementation 'ch.qos.logback:logback-classic:1.2.3'

implementation "org.apache.pdfbox:pdfbox:2.0.18"

@@ -232,6 +237,7 @@ project("grobid-core") {
it.isDirectory() ? [] : localLibs.contains(it.getName()) ? zipTree(it) : []
}
}
exclude("logback.xml")
duplicatesStrategy = DuplicatesStrategy.EXCLUDE
}

@@ -275,6 +281,7 @@ project("grobid-core") {

project("grobid-home") {
task packageGrobidHome(type: Zip) {
zip64 true
from('.') {
include("config/*")
include("language-detection/**")
@@ -334,17 +341,11 @@ project(":grobid-service") {
// "${System.env.CONDA_PREFIX}/lib/python${pythonVersion}/site-packages/jep"
// }
systemProperty "java.library.path", javaLibraryPath

}

configurations {
all*.exclude group: 'org.slf4j', module: "slf4j-jdk14"
all*.exclude group: 'org.slf4j', module: "slf4j-log4j12"
all*.exclude group: 'log4j', module: "log4j"
}

tasks.distZip.enabled = true
tasks.distTar.enabled = false
//tasks.distZip.zip64 = true
tasks.shadowDistZip.enabled = false
tasks.shadowDistTar.enabled = false

@@ -354,19 +355,25 @@ project(":grobid-service") {
dependencies {
implementation project(':grobid-core')
implementation project(':grobid-trainer')
implementation "io.dropwizard:dropwizard-core:1.3.23"
implementation "io.dropwizard:dropwizard-assets:1.3.23"
implementation "com.hubspot.dropwizard:dropwizard-guicier:1.3.5.0"
implementation "io.dropwizard:dropwizard-testing:1.3.23"
implementation "io.dropwizard:dropwizard-forms:1.3.23"
implementation "io.dropwizard:dropwizard-client:1.3.23"
implementation "io.dropwizard:dropwizard-auth:1.3.23"

//Dropwizard
implementation 'ru.vyarus:dropwizard-guicey:7.0.0'

implementation 'io.dropwizard:dropwizard-bom:4.0.0'
implementation 'io.dropwizard:dropwizard-core:4.0.0'
implementation 'io.dropwizard:dropwizard-assets:4.0.0'
implementation 'io.dropwizard:dropwizard-testing:4.0.0'
implementation 'io.dropwizard.modules:dropwizard-testing-junit4:4.0.0'
implementation 'io.dropwizard:dropwizard-forms:4.0.0'
implementation 'io.dropwizard:dropwizard-client:4.0.0'
implementation 'io.dropwizard:dropwizard-auth:4.0.0'
implementation 'io.dropwizard.metrics:metrics-core:4.2.22'
implementation 'io.dropwizard.metrics:metrics-servlets:4.2.22'

implementation "org.apache.pdfbox:pdfbox:2.0.3"
implementation "javax.activation:activation:1.1.1"
implementation "io.prometheus:simpleclient_dropwizard:0.11.0"
implementation "io.prometheus:simpleclient_servlet:0.11.0"

testImplementation "io.dropwizard:dropwizard-testing:1.3.17"
implementation "io.prometheus:simpleclient_dropwizard:0.16.0"
implementation "io.prometheus:simpleclient_servlet:0.16.0"
}

shadowJar {
@@ -377,6 +384,8 @@ project(":grobid-service") {
attributes 'Main-Class': 'org.grobid.core.main.batch.GrobidMain'
}

exclude("logback.xml")

duplicatesStrategy = DuplicatesStrategy.EXCLUDE
}

@@ -387,12 +396,15 @@ project(":grobid-service") {
distributions {
main {
contents {
from(new File(rootProject.rootDir, "grobid-service/README.md")) {
into "doc"
}
//from(new File(rootProject.rootDir, "../grobid-home/config/grobid.yaml")) {
// into "config"
//from(new File(rootProject.rootDir, "grobid-service/README.md")) {
// into "doc"
//}
from(new File(rootProject.rootDir, "../grobid-home/config/grobid.yaml")) {
into "config"
}
from(new File(rootProject.rootDir, "grobid-service/build/scripts/*")) {
into "bin"
}
}
}
}
@@ -414,15 +426,13 @@ project(":grobid-trainer") {
implementation project(':grobid-core')
implementation "com.rockymadden.stringmetric:stringmetric-core_2.10:0.27.3"
implementation "me.tongfei:progressbar:0.9.0"
//implementation 'org.slf4j:slf4j-log4j12:1.7.25'
implementation 'org.slf4j:slf4j-api:1.7.25'
//implementation 'org.slf4j:slf4j-jdk14:1.7.25'

// logs
implementation 'org.slf4j:slf4j-api:1.7.30'
implementation 'ch.qos.logback:logback-classic:1.2.3'
}

configurations {
//all*.exclude group: 'org.slf4j', module: "slf4j-jdk14"
//all*.exclude group: 'org.slf4j', module: "slf4j-log4j12"
//all*.exclude group: 'log4j', module: "log4j"
}

jar {
@@ -431,6 +441,7 @@ project(":grobid-trainer") {
it.isDirectory() ? [] : localLibs.contains(it.getName()) ? zipTree(it) : []
}
}
exclude("logback.xml")

duplicatesStrategy = DuplicatesStrategy.EXCLUDE
}
@@ -443,6 +454,10 @@ project(":grobid-trainer") {
attributes 'Main-Class': 'org.grobid.trainer.TrainerRunner'
}

from('src/main/resources') {
include '*.xml'
}

duplicatesStrategy = DuplicatesStrategy.EXCLUDE
}

@@ -594,7 +609,6 @@ coveralls {
sourceDirs = files(subprojects.sourceSets.main.allSource.srcDirs).files.absolutePath
}


tasks.coveralls {
dependsOn codeCoverageReport
}
22 changes: 15 additions & 7 deletions doc/Configuration.md
@@ -114,11 +114,11 @@ When executing the service, models can be loaded in a lazy manner (if you plan t

```yml
# for **service only**: how to load the models,
# false -> models are loaded when needed (default), avoiding putting in memory useless models but slow down significantly
# the service at first call
# true -> all the models are loaded into memory at the server startup, slow the start of the services and models not
# used will take some memory, but server is immediately warm and ready
modelPreload: false
# false -> models are loaded when needed, avoiding putting in memory useless models (only in case of CRF) but slow down
# significantly the service at first call
# true -> all the models are loaded into memory at the server startup (default), slow the start of the services
# and models not used will take some more memory (only in case of CRF), but server is immediately warm and ready
modelPreload: true
```
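
The trade-off described in the comments above can be sketched as follows. This is an illustrative Python toy, not GROBID's implementation: `ModelRegistry`, the loader functions, and the model names are all hypothetical and only stand in for "something expensive to load".

```python
from typing import Callable, Dict, List


class ModelRegistry:
    """Toy registry contrasting eager and lazy model loading (not GROBID code)."""

    def __init__(self, loaders: Dict[str, Callable[[], object]], preload: bool) -> None:
        self._loaders = loaders
        self._models: Dict[str, object] = {}
        if preload:
            # Analogue of modelPreload: true -> pay the full loading cost at startup.
            for name, loader in loaders.items():
                self._models[name] = loader()

    def get(self, name: str) -> object:
        # Analogue of modelPreload: false -> load on first use, so the first call is slow.
        if name not in self._models:
            self._models[name] = self._loaders[name]()
        return self._models[name]

    def loaded(self) -> List[str]:
        """Names of the models currently held in memory."""
        return sorted(self._models)


# Hypothetical model names and loaders, used only for illustration.
LOADERS = {"header": lambda: "header-model", "citation": lambda: "citation-model"}

eager = ModelRegistry(LOADERS, preload=True)
lazy = ModelRegistry(LOADERS, preload=False)
```

With `preload=True` every model is in memory before the first request; with `preload=False` memory is only used for the models actually requested, at the price of a slower first call.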

Finally the following part specifies the port to be used by the GROBID web service:
@@ -207,15 +207,23 @@ logging:
level: INFO
loggers:
org.apache.pdfbox.pdmodel.font.PDSimpleFont: "OFF"
org.glassfish.jersey.internal: "OFF"
com.squarespace.jersey2.guice.JerseyGuiceUtils: "OFF"
appenders:
- type: console
threshold: ALL
threshold: WARN
timeZone: UTC
# uncomment to have the logs in json format
#layout:
# type: json
- type: file
currentLogFilename: logs/grobid-service.log
threshold: ALL
threshold: INFO
archive: true
archivedLogFilenamePattern: logs/grobid-service-%d.log
archivedFileCount: 5
timeZone: UTC
# uncomment to have the logs in json format
#layout:
# type: json
```
3 changes: 2 additions & 1 deletion doc/Coordinates-in-PDF.md
@@ -13,7 +13,8 @@ Since April 2017, GROBID version 0.4.2 and higher, coordinate areas can be obtai
* ```formula``` for mathematical equations,
* ```head``` for section titles,
* ```s``` for optional sentence structure (the GROBID fulltext service must be called with the `segmentSentences` parameter to provide the optional sentence-level elements),
* ```note``` for foot note elements.
* ```note``` for foot note elements,
* ```title``` for the title elements (main article title and cited reference titles).

However, there is normally no particular limitation on the types of structures that can have their coordinates in the results. The implementation is ongoing, see [issue #69](https://github.com/kermitt2/grobid/issues/69), and it is expected that more or less any structure could be associated with its coordinates in the original PDF.
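
As a rough illustration of how a client might consume such coordinates, here is a minimal Python sketch. It assumes that each element carries a `coords` attribute made of semicolon-separated bounding boxes, each written as `page,x,y,width,height`; that format is not shown in this excerpt, so treat it as an assumption. The `BoundingBox` type, the `parse_coords` helper, and the sample value are hypothetical.

```python
from typing import List, NamedTuple


class BoundingBox(NamedTuple):
    page: int
    x: float
    y: float
    width: float
    height: float


def parse_coords(coords: str) -> List[BoundingBox]:
    """Parse an assumed 'p,x,y,w,h;p,x,y,w,h' string into bounding boxes."""
    boxes = []
    for chunk in coords.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # tolerate a trailing semicolon
        page, x, y, width, height = chunk.split(",")
        boxes.append(
            BoundingBox(int(page), float(x), float(y), float(width), float(height))
        )
    return boxes


# Made-up attribute value, for illustration only.
example = "1,53.80,194.57,450.75,7.17;1,53.80,202.54,210.00,7.17"
boxes = parse_coords(example)
```

A structure that spans several lines or pages yields several boxes, which is why the helper returns a list rather than a single rectangle.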
