Commit

Merge branch 'master' into feature/add-training-data
lfoppiano committed Mar 30, 2021
2 parents 35e918e + 98793c3 commit 810c990
Showing 83 changed files with 12,292 additions and 5,794 deletions.
1 change: 0 additions & 1 deletion .dockerignore
@@ -59,7 +59,6 @@ software-mentions
grobid-books
grobid-smecta
grobid-keyterm
dataseer-ml
grobid-home/models/quantities
grobid-home/models/dictionaries-lexical-entries
grobid-home/models/units
5 changes: 4 additions & 1 deletion .gitignore
@@ -75,4 +75,7 @@ grobid-home/models/superconductors*
grobid-home/models/values
grobid-home/models/dataseer
grobid-home/models/*-bert*/
grobid-home/models/*scibert*/
grobid-home/models/*scibert*/

Dockerfile.dataseer
Dockerfile.software
33 changes: 28 additions & 5 deletions CHANGELOG.md
@@ -8,17 +8,40 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).

### Added

+ Option to get sentence segmented text in extracted structures (choice between the Pragmatic Segmenter, integrated via JRuby, and OpenNLP sentence detector)
+ Option to get PDF coordinates for `<s>` structures
### Changed

### Fixed


## [0.6.2] – 2021-03-20

### Added

+ Docker image covering both Deep Learning and CRF models, with GPU detection and preloading of embeddings
+ For Deep Learning models, labeling is now done by batch: application of the citation DL model is 4 times faster for BidLSTM-CRF (with or without features) and 6 times faster for SciBERT
+ More tests for sentence segmentation
+ Add the ORCID of persons when available from the PDF or via consolidation (i.e. if present in the CrossRef metadata)
+ Add BidLSTM-CRF-FEATURES header model (with feature channel)
+ Add bioRxiv end-to-end evaluation
+ Bounding boxes for optional section titles coordinates

### Changed

+ Update of TEI XML schema to allow `<s>` structures in the result
+ Reduce the size of docker images
+ Improve end-to-end evaluation: multithreaded processing of PDF, progress bar, output the evaluation report in markdown format
+ Update of several models covering CRF, BidLSTM-CRF and BidLSTM-CRF-FEATURES, mainly improving citation and author recognition
+ OpenNLP is the default optional sentence segmenter (after benchmarking, similar results to the Pragmatic Segmenter on scholarly documents, but 30 times faster)
+ Refine sentence segmentation to exploit layout information and predicted reference callouts
+ Update jep version to 3.9.1

### Fixed

+ Structuring of the abstract is restored
+ Deprecated CrossRef `query.title` field for the CrossRef consolidation service
+ Ignore invalid utf-8 sequences
+ Update CrossRef multithreaded calls to avoid using the unreliable time interval returned by the CrossRef REST API service, update usage of `Crossref-Plus-API-Token` and update the deprecated crossref field `query.title`
+ Missing last table or figure when generating training data for the fulltext model
+ Fix an error related to the feature value for the reference callout for the fulltext model
+ Review/correct DeLFT configuration documentation, with a step-by-step configuration documentation
+ Other minor fixes

## [0.6.1] – 2020-08-12

107 changes: 0 additions & 107 deletions Dockerfile

This file was deleted.

70 changes: 24 additions & 46 deletions Dockerfile.crf
@@ -18,13 +18,10 @@ FROM openjdk:8u212-jdk as builder
USER root

RUN apt-get update && \
apt-get -y --no-install-recommends install libxml2
apt-get -y --no-install-recommends install unzip

WORKDIR /opt/grobid-source

RUN mkdir -p .gradle
VOLUME /opt/grobid-source/.gradle

# gradle
COPY gradle/ ./gradle/
COPY gradlew ./
@@ -38,70 +35,51 @@ COPY grobid-core/ ./grobid-core/
COPY grobid-service/ ./grobid-service/
COPY grobid-trainer/ ./grobid-trainer/

# cleaning unused native libraries before packaging
RUN rm -rf grobid-home/pdf2xml/lin-32
RUN rm -rf grobid-home/pdf2xml/mac-64
RUN rm -rf grobid-home/pdf2xml/win-*
RUN rm -rf grobid-home/lib/lin-32
RUN rm -rf grobid-home/lib/win-*
RUN rm -rf grobid-home/lib/mac-64

# cleaning Delft models
RUN rm -rf grobid-home/models/*-BidLSTM_CRF*

RUN ./gradlew clean assemble --no-daemon --info --stacktrace

WORKDIR /opt/grobid
RUN unzip -o /opt/grobid-source/grobid-service/build/distributions/grobid-service-*.zip && \
mv grobid-service* grobid-service
RUN unzip -o /opt/grobid-source/grobid-home/build/distributions/grobid-home-*.zip && \
chmod -R 755 /opt/grobid/grobid-home/pdf2xml
RUN rm -rf grobid-source

# -------------------
# build runtime image
# -------------------
FROM openjdk:8u212-jre-slim
FROM openjdk:11-jre-slim

RUN apt-get update && \
apt-get -y --no-install-recommends install libxml2 unzip

WORKDIR /opt

COPY --from=builder /opt/grobid-source/grobid-core/build/libs/grobid-core-*-onejar.jar ./grobid/grobid-core-onejar.jar
COPY --from=builder /opt/grobid-source/grobid-service/build/distributions/grobid-service-*.zip ./grobid-service.zip
COPY --from=builder /opt/grobid-source/grobid-home/build/distributions/grobid-home-*.zip ./grobid-home.zip

RUN unzip -o ./grobid-service.zip -d ./grobid && \
mv ./grobid/grobid-service-* ./grobid/grobid-service

RUN unzip ./grobid-home.zip -d ./grobid && \
mkdir -p /opt/grobid/grobid-home/tmp

RUN rm *.zip

# below to allow logs to be written in the container
# RUN mkdir -p logs

VOLUME ["/opt/grobid/grobid-home/tmp"]
apt-get -y --no-install-recommends install libxml2 && \
rm -rf /var/lib/apt/lists/*

WORKDIR /opt/grobid

ENV JAVA_OPTS=-Xmx4g
COPY --from=builder /opt/grobid .

# Add Tini
ENV TINI_VERSION v0.18.0
ADD https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini /tini
RUN chmod +x /tini
ENTRYPOINT ["/tini", "-s", "--"]

CMD ["./grobid-service/bin/grobid-service", "server", "grobid-service/config/config.yaml"]

ARG GROBID_VERSION

LABEL \
authors="Luca Foppiano <luca.foppiano@inria.fr>, Patrice Lopez <patrice.lopez@science-miner.org>" \
org.label-schema.name="Grobid" \
authors="The contributors" \
org.label-schema.name="GROBID" \
org.label-schema.description="Image with GROBID service" \
org.label-schema.url="https://github.com/kermitt2/grobid" \
org.label-schema.version=${GROBID_VERSION}

## Docker tricks:

# - remove all stopped containers
# > docker rm $(docker ps -a -q)

# - remove all unused images
# > docker rmi $(docker images --filter "dangling=true" -q --no-trunc)

# - remove all untagged images
# > docker rmi $(docker images | grep "^<none>" | awk '{print $3}')

# - "Cannot connect to the Docker daemon. Is the docker daemon running on this host?"
# > docker-machine restart

RUN chmod -R 755 /opt/grobid/grobid-home/pdf2xml
RUN chmod 777 /opt/grobid/grobid-home/tmp
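
Note: the `Dockerfile.crf` hunks above declare `ARG GROBID_VERSION` and use it in the `org.label-schema.version` label, so the version has to be passed at build time. The following is a minimal sketch of how such an image might be built and started, not a command taken from this commit. The image tag `grobid-crf`, the default version value, and the published port 8070 are assumptions made for illustration. The script only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch only: the tag name, default version and port are assumptions.
GROBID_VERSION="${GROBID_VERSION:-0.6.2}"   # placeholder; set to the version being built
IMAGE="grobid-crf:${GROBID_VERSION}"        # hypothetical image tag

# Print the command in dry-run mode (the default), execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Build from the repository root, passing the version consumed by ARG GROBID_VERSION.
run docker build -f Dockerfile.crf \
  --build-arg GROBID_VERSION="${GROBID_VERSION}" \
  -t "${IMAGE}" .

# Start the service; publishing port 8070 assumes the default service configuration.
run docker run --rm -p 8070:8070 "${IMAGE}"
```

Run it with `DRY_RUN=0` from a checkout of the repository to execute the commands instead of printing them.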
