update docker file and script for new delft version
kermitt2 committed Mar 30, 2022
1 parent 79ebdd7 commit 0caf242
Showing 5 changed files with 45 additions and 44 deletions.
10 changes: 5 additions & 5 deletions Dockerfile.delft
@@ -61,8 +61,8 @@ RUN rm -rf grobid-source
# -------------------

# use NVIDIA Container Toolkit to automatically recognize possible GPU drivers on the host machine
FROM tensorflow/tensorflow:1.15.5-gpu
CMD nvidia-smi
FROM tensorflow/tensorflow:2.7.0-gpu
#CMD nvidia-smi

# setting locale is likely useless but to be sure
ENV LANG C.UTF-8
@@ -117,19 +117,19 @@ RUN curl --fail --show-error --location -q ${JDK_URL} -o /tmp/openjdk.tar.gz \
--directory "${TEMP_JDK_HOME}" \
--strip-components 1 \
--no-same-owner \
&& JAVA_HOME=${TEMP_JDK_HOME} pip3 install jep==3.9.1 \
&& JAVA_HOME=${TEMP_JDK_HOME} pip3 install jep==4.0.2 \
&& rm -f /tmp/openjdk.tar.gz \
&& rm -rf "${TEMP_JDK_HOME}"
ENV LD_LIBRARY_PATH=/usr/local/lib/python3.6/dist-packages/jep:${LD_LIBRARY_PATH}
# remove libjep.so because we are providing our own version in the virtual env
RUN rm /opt/grobid/grobid-home/lib/lin-64/libjep.so
RUN rm /opt/grobid/grobid-home/lib/lin-64/jep/libjep.so

# preload embeddings, for GROBID all the RNN models use glove-840B (default for the script), ELMo is currently not loaded
# to be done: mechanism to download GROBID fine-tuned models based on SciBERT if selected

COPY --from=builder /opt/grobid-source/grobid-home/scripts/preload_embeddings.py .
COPY --from=builder /opt/grobid-source/grobid-home/config/embedding-registry.json .
RUN python3 preload_embeddings.py
RUN python3 preload_embeddings.py --registry ./embedding-registry.json
RUN ln -s /opt/grobid /opt/delft

CMD ["./grobid-service/bin/grobid-service"]
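As a rough sketch of how the resulting DeLFT-enabled image can be used with the NVIDIA Container Toolkit mentioned in the Dockerfile comment (the image tag and port mapping below are assumptions for illustration, not part of this commit):

```shell
# build the DeLFT-enabled image (the tag is hypothetical)
docker build -t grobid/grobid-delft:local -f Dockerfile.delft .

# run it with GPU access via the NVIDIA Container Toolkit;
# 8070 is the usual GROBID service port
docker run --rm --gpus all -p 8070:8070 grobid/grobid-delft:local
```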
39 changes: 10 additions & 29 deletions doc/Deep-Learning-models.md
@@ -2,19 +2,19 @@

## Integration with DeLFT

Since version `0.5.4` (2018), it is possible to use in GROBID recent Deep Learning sequence labelling models trained with [DeLFT](https://github.com/kermitt2/delft). The available neural models include in particular BidLSTM-CRF with Glove embeddings, with additional feature channel (for layout features), with ELMo, and BERT fine-tuned architectures with CRF activation layer (e.g. SciBERT-CRF), which can be used as alternative to the default Wapiti CRF.
Since version `0.5.4` (2018), it is possible to use recent Deep Learning sequence labelling models trained with [DeLFT](https://github.com/kermitt2/delft) in GROBID. The available neural models include in particular BidLSTM-CRF with GloVe embeddings, with an additional feature channel (for layout features), with ELMo, and transformer-based fine-tuned architectures with or without a CRF activation layer (e.g. SciBERT-CRF), which can be used as an alternative to the default Wapiti CRF.

These architectures have been tested on Linux 64bit and macOS.

Integration is realized via Java Embedded Python [JEP](https://github.com/ninia/jep), which uses a JNI of CPython. This integration is two times faster than the TensorFlow Java API and significantly faster than RPC serving (see [here](https://www.slideshare.net/FlinkForward/flink-forward-berlin-2017-dongwon-kim-predictive-maintenance-with-apache-flink)), and it does not require modifying DeLFT as would be the case with a Py4J gateway (socket-based).

There are no neural model for the segmentation and the fulltext models, because the input sequences for these models are too large for the current supported Deep Learning architectures. The problem would need to be formulated differently for these tasks or to use alternative DL architectures (with sliding window, etc.).
There are currently no neural models for the segmentation and fulltext models, because the input sequences for these models are too large for the currently supported Deep Learning architectures. The problem would need to be formulated differently for these tasks, or alternative DL architectures (with sliding windows, etc.) would need to be used.

Low level models not using layout features (author name, dates, affiliations...) perform better than CRF. When layout features are involved, neural models with an additional feature channel should be preferred (e.g. `BidLSTM_CRF_FEATURES` in DeLFT), or they will perform significantly worse than Wapiti CRF.
Low-level models not using layout features (author names, dates, affiliations...) usually perform better than CRF and do not require a feature channel. When layout features are involved, neural models with an additional feature channel (e.g. `BidLSTM_CRF_FEATURES` in DeLFT) should be preferred over those without a feature channel.

See some evaluations under `grobid-trainer/docs`.

Current neural models can be up to 100 time slower than CRF, depending on the architecture. However when sequences can be processed in batch (e.g. for the citation model), overall runtime remains good with clear accuracy gain. This is where the possibility to mix CRF and Deep Learning models for different structuring tasks is very useful, as it permits to adjust the balance between possible accuracy and scalability in a fine-grained manner.
Current neural models can be up to 50 times slower than CRF, depending on the architecture and the available CPU/GPU. However, when sequences can be processed in batches (e.g. for the citation model), overall runtime remains good, with a clear accuracy gain. This is where the possibility to mix CRF and Deep Learning models for different structuring tasks is very useful, as it permits adjusting the balance between accuracy and scalability in a fine-grained manner, using a reasonable amount of memory.

### Getting started with Deep Learning

@@ -83,33 +83,15 @@ If you are using a Python environment for the DeLFT installation, you can set th

Normally by setting the Python environment path in the config file (e.g. `pythonVirtualEnv: "../delft/env"`), you will not need to launch GROBID in the same activated environment.

<span>4.</span> Install [JEP](https://github.com/ninia/jep) manually and preferably globally (outside a virtual env. and not under `~/.local/lib/python3.*/site-packages/`):
<span>4.</span> Install [JEP](https://github.com/ninia/jep) manually and preferably globally (outside a virtual env. and not under `~/.local/lib/python3.*/site-packages/`).

```shell
git clone https://github.com/ninia/jep
cd jep
sudo -E python3 setup.py build install
```

the `sudo -E` should ensure that JEP is installed globally and that the right JVM version is used (`-E` indicates to preserve the environment variables, in particular the `JAVA_HOME`). Installing JEP globally is the only safe way we found to be sure that JEP will work correctly in the JVM.

(here we are unfortunately touching the limit of the messy Python package management system, an install of JEP in a virtualenv should isolate the library depending on the pip/python version, but the JVM might not be able to found and linked these local/isolated libraries, even when the JVM is launched in the virtual env. A global install of JEP should however always work. pip in a virtual env still uses global python libraries when installed, but when pip uses user-level local libraries (e.g. libraries installed under `~/.local/lib/python3.8/site-packages/`) in the virtual env, we did not find a reliable way to make JEP working in the JVM.)

Copy the built JEP library to the `grobid-home/lib/` area - for instance on a Linux 64 machine with Python 3.8:
We provide an installation script for Linux under `grobid-home/scripts`. This script should be launched from the GROBID root directory (`grobid/`), e.g.:

```shell
lopez@work:~/jep$ cp build/lib.linux-x86_64-3.8/jep/jep.cpython-38-x86_64-linux-gnu.so ~/grobid/grobid-home/lib/lin-64/libjep.so
./grobid-home/scripts/install_jep_lib.sh
```

This will ensure that the GROBID native JEP library matches the JEP version of the system/environment (i.e. same OS and the same python version) and the JEP jar version (GROBID contains a default `libjep.so` for python3.8 and JEP `4.0.2`, it will not work with other version of python and other version of JEP).

Also note that the installed version of JEP must be the same as given in `grobid/build.gradle`. The python version used in the system or environment must match the native JEP library. At the time this documentation is written, we are using JEP version `4.0.2`:

```
implementation 'black.ninia:jep:4.0.2'
```

So if another version of JEP is installed globally on the system, you should update the GROBID `build.gradle` to match this version.
This script will install the right version of the native JEP library according to the local architecture and python version.
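As a quick sanity check (a sketch, not an official step), you can verify that the globally installed JEP matches the version pinned in the install script (`v4.0.2`) and that the native library is present under `grobid-home`:

```shell
# Python version of the environment used when building the native JEP library
python3 --version

# the globally installed JEP version should match the one pinned in install_jep_lib.sh (v4.0.2)
pip3 show jep

# the native library installed by the script (Linux 64 path)
ls grobid-home/lib/lin-64/jep/
```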

<span>5.</span> Run GROBID, this is the "*but on my machine it works*" moment:

@@ -166,12 +148,11 @@ The following models (citation) will run with DeLFT `BidLSTM-BidLSTM_CRF_FEATURE
architecture: "BidLSTM_CRF_FEATURES"
```

**NOTE**: use the underscore for models whose name contains hyphens.

**NOTE**: model names normally all use underscores, not hyphens. If that is not the case, replace hyphens with underscores.

## Troubleshooting

1. If there is a dependency problem when JEP starts usually the virtual machine is crashing.
1. If there is a dependency problem when JEP starts, the JVM usually stops.
We are still investigating this part; please feel free to submit issues should you run into such problems.
See the discussion [here](https://github.com/kermitt2/grobid/pull/454)

23 changes: 17 additions & 6 deletions grobid-home/config/embedding-registry.json
@@ -78,20 +78,31 @@
"embeddings-contextualized": [
{
"name": "elmo-en",
"path-config": "/media/lopez/T5/embeddings/elmo_2x4096_512_2048cnn_2xhighway_5.5B_options.json",
"path_weights": "/media/lopez/T5/embeddings/elmo_2x4096_512_2048cnn_2xhighway_5.5B_weights.hdf5",
"path-vocab": "data/models/ELMo/en/vocab_test.txt",
"path-config": "/media/lopez/T51/embeddings/elmo_2x4096_512_2048cnn_2xhighway_5.5B/elmo_2x4096_512_2048cnn_2xhighway_5.5B_options.json",
"path_weights": "/media/lopez/T51/embeddings/elmo_2x4096_512_2048cnn_2xhighway_5.5B/elmo_2x4096_512_2048cnn_2xhighway_5.5B_weights.hdf5",
"path-vocab": "data/models/ELMo/en/vocab.txt",
"path-cache": "data/models/ELMo/en/",
"cache-training": true,
"lang": "en",
"url_config": "https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway_5.5B/elmo_2x4096_512_2048cnn_2xhighway_5.5B_options.json",
"url_weights": "https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway_5.5B/elmo_2x4096_512_2048cnn_2xhighway_5.5B_weights.hdf5"
},
{
"name": "elmo-pubmed",
"path-config": "/media/lopez/T51/embeddings/elmo-pubmed/elmo_2x4096_512_2048cnn_2xhighway_options_pubmed.json",
"path_weights": "/media/lopez/T51/embeddings/elmo-pubmed/elmo_2x4096_512_2048cnn_2xhighway_weights_PubMed_only.hdf5",
"path-vocab": "data/models/ELMo/en/vocab.txt",
"path-cache": "data/models/ELMo/en/",
"cache-training": true,
"lang": "en",
"url_config": "https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/contributed/pubmed/elmo_2x4096_512_2048cnn_2xhighway_options.json",
"url_weights": "https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/contributed/pubmed/elmo_2x4096_512_2048cnn_2xhighway_weights_PubMed_only.hdf5"
},
{
"name": "elmo-fr",
"path-config": "/media/lopez/T5/embeddings/elmo_fr_020_options.json",
"path_weights": "/media/lopez/T5/embeddings/elmo_fr_020_weights.hdf5",
"path-vocab": "data/models/ELMo/fr/vocab_test.txt",
"path-config": "/media/lopez/T51/embeddings/elmo_fr_020/elmo_fr_020_options.json",
"path_weights": "/media/lopez/T51/embeddings/elmo_fr_020/elmo_fr_020_weights.hdf5",
"path-vocab": "data/models/ELMo/fr/vocab.txt",
"path-cache": "data/models/ELMo/fr/",
"cache-training": true,
"lang": "fr",
1 change: 1 addition & 0 deletions grobid-home/scripts/install_jep_lib.sh
@@ -15,6 +15,7 @@ pwd
git clone --branch v4.0.2 https://github.com/ninia/jep
cd jep
echo "building jep library..."
#sudo -E python3 setup.py build install
python3 setup.py build install
echo "build sucessful"

16 changes: 12 additions & 4 deletions grobid-home/scripts/preload_embeddings.py
@@ -19,11 +19,17 @@
from delft.utilities.Embeddings import Embeddings, open_embedding_file
from delft.utilities.Utilities import download_file
import lmdb
import json

map_size = 100 * 1024 * 1024 * 1024

def preload(embeddings_name, input_path=None):
embeddings = Embeddings(embeddings_name, path='./embedding-registry.json', load=False)
def preload(embeddings_name, input_path=None, registry_path=None):
resource_registry = None
if registry_path is not None:
with open(registry_path, 'r') as f:
resource_registry = json.load(f)

embeddings = Embeddings(embeddings_name, resource_registry=resource_registry, load=False)

description = embeddings.get_description(embeddings_name)
if description is None:
@@ -80,11 +86,13 @@ def preload(embeddings_name, input_path=None):
)
parser.add_argument("--input", help="path to the embeddings file to be loaded located on the host machine (where the docker image is built),"
" this is optional, without this parameter the embeddings file will be downloaded from the url indicated"
" in the mebddings registry, embedding-registry.json")
" in the embddings registry, embedding-registry.json")
parser.add_argument("--registry", help="path to the embedding registry to be considered for setting the paths/urls to embeddings")

args = parser.parse_args()

embeddings_name = args.embedding
input_path = args.input
registry_path = args.registry

preload(embeddings_name, input_path)
preload(embeddings_name, input_path, registry_path)
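For reference, a minimal sketch of how the updated script can be invoked. The `--registry` call mirrors the one added in `Dockerfile.delft`; the local file path in the second command is purely illustrative:

```shell
# preload the default embeddings (glove-840B for the GROBID RNN models), downloading them if needed
python3 preload_embeddings.py --registry ./embedding-registry.json

# or preload from a local embeddings file instead of downloading (the path is illustrative)
python3 preload_embeddings.py --input /path/to/glove.840B.300d.txt --registry ./embedding-registry.json
```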
