Merge pull request #165 from lfoppiano/remove-seed-delft
Refactor Dockerfile
lfoppiano committed Dec 15, 2023
2 parents ae43ed4 + 5e21eef commit 340464f
Showing 8 changed files with 460 additions and 27 deletions.
22 changes: 22 additions & 0 deletions .github/workflows/ci-build.yml
@@ -28,3 +28,25 @@ jobs:
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}
          format: jacoco

  docker-build:
    needs: [ build ]
    runs-on: ubuntu-latest

    steps:
      - name: Create more disk space
        run: sudo rm -rf /usr/share/dotnet && sudo rm -rf /opt/ghc && sudo rm -rf "/usr/local/share/boost" && sudo rm -rf "$AGENT_TOOLSDIRECTORY"
      - uses: actions/checkout@v2
      - name: Build and push
        id: docker_build
        uses: mr-smithers-excellent/docker-build-push@v5
        with:
          dockerfile: Dockerfile.local
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
          image: lfoppiano/grobid-quantities
          registry: docker.io
          pushImage: ${{ github.event_name != 'pull_request' }}
          tags: latest-develop
      - name: Image digest
        run: echo ${{ steps.docker_build.outputs.digest }}
75 changes: 62 additions & 13 deletions CHANGELOG.md
@@ -4,55 +4,104 @@ All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).

## [Unreleased]
## [0.8.0]

## [0.7.1] – 2021-09-06
### Added

+ Docker image snapshots are built and pushed on dockerhub at each commit
+ new Dockerfile.local that does not clone from github

### Changed

+ Updated to Grobid version 0.8.0
+ Updated to Dropwizard version 4.x (from version 1.x)

## [0.7.3] – 2023-06-26

### Added

+ Added additional units in the lexicon
+ Added missing log when exceptions are raised
+ Introduced Kotlin for new development

### Changed

+ Upgrade to Grobid 0.7.3 and support for JDK > 11
+ Updated Docker image to support JDK 17 and use the gradle distribution script instead of the JAR directly
+ Transitioned from circleci to GitHub actions

### Fixed

+ Fix notation lexicon #97
+ Fix list and labelled sequence extraction with DL BERT models #153
+ Improve recognition of composed units using sentence segmentation #155 #87

## [0.7.2] – 2023-01-20

### Added

+ Create holdout set by @lfoppiano in #145
+ Add additional DL and transformers models by @lfoppiano in #146

### Changed

+ Update to Grobid 0.7.2

### Fixed

+ Fix value parser's incorrect recognition by @lfoppiano in #141

## [0.7.1] – 2022-09-02

### Added

+ New BidLSTM_CRF models for quantities, values and units parsing #129
+ Add docker image on hub.docker.com #142
+ Update to Grobid 0.7.1 #137

### Changed

+ Use the grobid sentence segmentation for the quantified object sentence splitting #138

### Fixed

+ Fixes incorrect boxes colors #125
+ Fixed lexicon #134

## [0.7.0] – 2021-08-06

### Added

+ Docker image #128
+ Configurable number of parallel requests
+ Various improvements in the unit normalisation and update of the Unit of measurement library to version 2.x #95

### Changed

+ Retrained models with CRF
+ Grobid 0.7.0 #123

### Fixed

+ Coveralls build #127
+ Fixed command line parameters #119



## [0.6.0] – 2020-04-30

### Added

+ First official release
+ Extraction of quantities, units and values using CRF
+ Support for Text and PDF

### Changed

+ Added evaluation measurement and models

### Fixed


[Unreleased]: https://github.com/kermitt2/grobid/compare/0.6.0...HEAD

[0.6.0]: https://github.com/kermitt2/grobid/compare/0.6.0

<!-- markdownlint-disable-file MD024 MD033 -->
121 changes: 121 additions & 0 deletions Dockerfile.local
@@ -0,0 +1,121 @@
## Docker GROBID-quantities image using deep learning models and/or CRF models, and various python modules
## Borrowed from https://github.com/kermitt2/grobid/blob/master/Dockerfile.delft
## See https://grobid.readthedocs.io/en/latest/Grobid-docker/

## usage example with grobid: https://github.com/kermitt2/grobid/blob/master/Dockerfile.delft

## docker build -t lfoppiano/grobid-quantities:0.7.0 --build-arg GROBID_VERSION=0.7.0 --file Dockerfile .

## no GPU:
## docker run -t --rm --init -p 8060:8060 -p 8061:8061 -v config.yml:/opt/grobid/grobid-quantities:ro lfoppiano/grobid-quantities:0.7.1

## allocate all available GPUs (only Linux with proper nvidia driver installed on host machine):
## docker run --rm --gpus all --init -p 8072:8072 -p 8073:8073 -v grobid.yaml:/opt/grobid/grobid-home/config/grobid.yaml:ro lfoppiano/grobid-superconductors:0.3.0-SNAPSHOT

# -------------------
# build builder image
# -------------------

FROM openjdk:17-jdk-slim as builder

USER root

RUN apt-get update && \
apt-get -y --no-install-recommends install apt-utils libxml2 git unzip

WORKDIR /opt/grobid

RUN mkdir -p grobid-quantities-source grobid-home/models
COPY src grobid-quantities-source/src
COPY settings.gradle grobid-quantities-source/
COPY resources/config/config-docker.yml grobid-quantities-source/resources/config/config.yml
COPY resources/models grobid-quantities-source/resources/models
COPY resources/clearnlp/models/* grobid-quantities-source/resources/clearnlp/models/
COPY build.gradle grobid-quantities-source/
COPY gradle.properties grobid-quantities-source/
COPY gradle grobid-quantities-source/gradle/
COPY gradlew grobid-quantities-source/
COPY .git grobid-quantities-source/.git
COPY localLibs grobid-quantities-source/localLibs

# Preparing models
WORKDIR /opt/grobid/grobid-quantities-source
RUN rm -rf /opt/grobid/grobid-home/models/*
RUN ./gradlew clean assemble -x shadowJar --no-daemon --stacktrace --info
#RUN ./gradlew copyModels --info --no-daemon
RUN ./gradlew downloadTransformers --no-daemon --info --stacktrace && rm -f /opt/grobid/grobid-home/models/*.zip

# Preparing distribution
WORKDIR /opt/grobid
RUN unzip -o /opt/grobid/grobid-quantities-source/build/distributions/grobid-quantities-*.zip -d grobid-quantities_distribution && mv grobid-quantities_distribution/grobid-quantities-* grobid-quantities

WORKDIR /opt

# -------------------
# build runtime image
# -------------------

FROM grobid/grobid:0.7.3 as runtime

# setting locale is likely useless but to be sure
ENV LANG C.UTF-8

RUN apt-get update && \
apt-get -y --no-install-recommends install git wget

WORKDIR /opt/grobid

RUN mkdir -p /opt/grobid/grobid-quantities/resources/clearnlp/models /opt/grobid/grobid-quantities/resources/clearnlp/config
COPY --from=builder /opt/grobid/grobid-home/models ./grobid-home/models
COPY --from=builder /opt/grobid/grobid-quantities ./grobid-quantities/
COPY --from=builder /opt/grobid/grobid-quantities-source/resources/config/config.yml ./grobid-quantities/resources/config/
COPY --from=builder /opt/grobid/grobid-quantities-source/resources/clearnlp/models/* ./grobid-quantities/resources/clearnlp/models/

VOLUME ["/opt/grobid/grobid-home/tmp"]

RUN ln -s /opt/grobid/grobid-quantities/resources /opt/grobid/resources

# JProfiler
#RUN wget https://download-gcdn.ej-technologies.com/jprofiler/jprofiler_linux_12_0_2.tar.gz -P /tmp/ && \
# tar -xzf /tmp/jprofiler_linux_12_0_2.tar.gz -C /usr/local &&\
# rm /tmp/jprofiler_linux_12_0_2.tar.gz

WORKDIR /opt/grobid
ARG GROBID_VERSION
ENV GROBID_VERSION=${GROBID_VERSION:-latest}
ENV GROBID_QUANTITIES_OPTS "-Djava.library.path=/opt/grobid/grobid-home/lib/lin-64:/usr/local/lib/python3.8/dist-packages/jep --add-opens java.base/java.lang=ALL-UNNAMED"

# This code removes the fixed seed in DeLFT to increase the uncertainty
#RUN sed -i '/seed(7)/d' /usr/local/lib/python3.8/dist-packages/delft/utilities/Utilities.py
#RUN sed -i '/from numpy\.random import seed/d' /usr/local/lib/python3.8/dist-packages/delft/utilities/Utilities.py

EXPOSE 8060 8061 5005

#CMD ["java", "-agentpath:/usr/local/jprofiler12.0.2/bin/linux-x64/libjprofilerti.so=port=8849", "-jar", "grobid-superconductors/grobid-quantities-${GROBID_VERSION}-onejar.jar", "server", "grobid-superconductors/config.yml"]
#CMD ["sh", "-c", "java -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=0.0.0.0:5005 -jar grobid-quantities/grobid-quantities-${GROBID_VERSION}-onejar.jar server grobid-quantities/config.yml"]
#CMD ["sh", "-c", "java -jar grobid-quantities/grobid-quantities-${GROBID_VERSION}-onejar.jar server grobid-quantities/config.yml"]
CMD ["./grobid-quantities/bin/grobid-quantities", "server", "grobid-quantities/resources/config/config.yml"]


LABEL \
authors="Luca Foppiano, Patrice Lopez" \
org.label-schema.name="grobid-quantities" \
org.label-schema.description="Docker image for grobid-quantities service" \
org.label-schema.url="https://github.com/kermitt2/grobid-quantities" \
org.label-schema.version=${GROBID_VERSION}


## Docker tricks:

# - remove all stopped containers
# > docker rm $(docker ps -a -q)

# - remove all unused images
# > docker rmi $(docker images --filter "dangling=true" -q --no-trunc)

# - remove all untagged images
# > docker rmi $(docker images | grep "^<none>" | awk '{print $3}')

# - "Cannot connect to the Docker daemon. Is the docker daemon running on this host?"
# > docker-machine restart

11 changes: 7 additions & 4 deletions build.gradle
@@ -341,6 +344,9 @@ publishing {

def conf = new org.yaml.snakeyaml.Yaml().load(new File("resources/config/config.yml").newInputStream())
def grobidHome = conf.grobidHome.replace("\$", "").replace('{', "").replace("GROBID_HOME:- ", "").replace("}", "")
if (grobidHome.startsWith("../")) {
    grobidHome = "${rootProject.rootDir}/${grobidHome}"
}

/** Model management **/

@@ -354,7 +357,7 @@ task copyModels(type: Copy) {
    include "**/preprocessor.json"
    exclude "**/features-engineering/**"
    exclude "**/result-logs/**"
    into "${rootDir}/${grobidHome}/models/"
    into "${grobidHome}/models/"

    doLast {
        print "Copy models under grobid-home: ${grobidHome}"
@@ -365,11 +368,11 @@ task downloadTransformers(dependsOn: copyModels) {
    doLast {
        download {
            src "https://transformers-data.s3.eu-central-1.amazonaws.com/quantities-transformers.zip"
            dest "${rootDir}/${grobidHome}/models/quantities-transformers.zip"
            dest "${grobidHome}/models/quantities-transformers.zip"
            overwrite false
            print "Download bulky transformers files under grobid-home: ${grobidHome}"
        }
        ant.unzip(src: "${rootDir}/${grobidHome}/models/quantities-transformers.zip", dest: "${rootDir}/${grobidHome}/models/")
        ant.unzip(src: "${grobidHome}/models/quantities-transformers.zip", dest: "${grobidHome}/models/")
    }
}
}

@@ -396,4 +399,4 @@ release {
    git {
        requireBranch.set('test')
    }
}
}
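
The `grobidHome` handling in the build.gradle hunks above strips the `${GROBID_HOME:- ...}` placeholder syntax from the configured value and, when the result starts with `../`, anchors it at the project root, so the `copyModels` and `downloadTransformers` tasks no longer prepend `rootDir` themselves. The Python sketch below mirrors only that string logic, for illustration. The function name `resolve_grobid_home` and the sample values are assumptions and are not part of the commit; the real value comes from `resources/config/config.yml`.

```python
def resolve_grobid_home(raw: str, root_dir: str) -> str:
    """Mirror the string handling applied to conf.grobidHome in build.gradle.

    `raw` is the grobidHome value as written in the config file (hypothetical
    examples below); `root_dir` stands in for Gradle's rootProject.rootDir.
    """
    value = raw
    # Same sequence of replacements as the Groovy code: drop the placeholder syntax.
    for token in ("$", "{", "GROBID_HOME:- ", "}"):
        value = value.replace(token, "")
    # Only paths starting with "../" are anchored at the project root.
    if value.startswith("../"):
        value = f"{root_dir}/{value}"
    return value


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to exercise both branches.
    print(resolve_grobid_home("${GROBID_HOME:- ../grobid-home}", "/work/grobid-quantities"))
    print(resolve_grobid_home("/opt/grobid/grobid-home", "/work/grobid-quantities"))
```

Under this logic an absolute path such as `/opt/grobid/grobid-home` is used as-is, which would explain why the model paths in the tasks drop the `${rootDir}/` prefix.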
7 changes: 3 additions & 4 deletions scripts/dataset_analysis_quantities.py
@@ -6,9 +6,8 @@
from pathlib import Path

from bs4 import BeautifulSoup, NavigableString, Tag

from grobid_superconductors.commons.grobid_tokenizer import tokenizeSimple
from grobid_superconductors.commons.quantities_tei_parser import get_children_list
from supermat.grobid_tokenizer import tokenizeSimple
from supermat.supermat_tei_parser import get_children_list


def process_dir(input):
@@ -53,7 +52,7 @@ def process_file(input):

    document_statistics['batch'] = batch

    children = get_children_list(soup)
    children = get_children_list(soup, use_paragraphs=True)

    for paragraph in children:
        for item in paragraph: