
[wip] added grobid with delft image #441

Closed
wants to merge 1,609 commits into kermitt2:master from de-code:added-grobid-with-delft-image

Conversation

de-code
Collaborator

@de-code de-code commented Jun 24, 2019

This is an attempt to provide a Docker image that contains GROBID with DeLFT.

Currently not using the TensorFlow GPU version. There is probably less benefit for prediction, but it could be added back later.

What's currently missing is adding the embedding. The currently trained model uses glove.840B.300d, which seems massive (though "massive" is probably a relative term in this context).

  • 5.3G glove.840B.300d.txt (uncompressed)
  • 2.1G glove.840B.300d.txt.gz (gzip)
  • 2.0G glove.840B.300d.txt.gz (gzip, "best")
  • 1.7G glove.840B.300d.txt.xz (lzma, "extreme")
  • 2.9G glove.840B.300d.bin (convertvec txt2bin)
  • 2.5G glove.840B.300d.bin.gz (gzip)
  • 5.7G data.mdb (delft's lmdb cache)
  • 4.5G data.mdb.gz (gzip)

/cc @kermitt2 @lfoppiano

lfoppiano and others added 30 commits July 25, 2018 08:45
Update documentation and fix build on readthedocs
@de-code
Collaborator Author

de-code commented Jun 24, 2019

Maybe the best option would be to ask users to mount the embeddings.

We could also download on demand, but that could make things complicated with multiple instances using the same shared volume. In any case, it would still make start-up very slow.
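For the on-demand route, a file lock on the shared volume would at least prevent multiple instances from downloading the same embedding concurrently. A rough sketch, assuming a POSIX filesystem; the URL and paths are placeholders, not actual DeLFT behaviour:

```python
import os
import fcntl
import urllib.request

# Placeholder locations, not DeLFT's actual configuration.
EMBEDDING_URL = "https://example.org/glove.840B.300d.txt.gz"
TARGET = "/data/glove.840B.300d.txt.gz"

def ensure_embedding(url=EMBEDDING_URL, target=TARGET):
    """Download the embedding once, even if several instances share the volume."""
    lock_path = target + ".lock"
    with open(lock_path, "w") as lock_file:
        # Block until this process holds an exclusive lock on the shared volume.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            if not os.path.exists(target):
                tmp = target + ".part"
                urllib.request.urlretrieve(url, tmp)
                os.rename(tmp, target)  # atomic on the same filesystem
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return target
```

This would serialise the download across instances, but the first start-up would still be slow; the lock only avoids redundant downloads and half-written files.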

@coveralls

coveralls commented Jun 24, 2019

Coverage Status

Coverage remained the same at 36.674% when pulling b71408c on de-code:added-grobid-with-delft-image into 9eac968 on kermitt2:master.

@kermitt2
Owner

As an important remark, predictions are 50 times slower without a GPU for BERT, for instance, and it's even worse with the current RNN models and for training in general. So without a GPU, there is no practical usage of the DL models in GROBID.

Regarding the embeddings, I planned to add to DeLFT an automatic download of the required embeddings, cf. here, as is done in spaCy or flair, so embeddings would be downloaded at first launch. We generally want to avoid users doing it manually; it's painful, and the .bin format, for instance, is hard to support on Windows.

@de-code
Collaborator Author

de-code commented Jun 25, 2019

As an important remark, predictions are 50 times slower without a GPU for BERT, for instance, and it's even worse with the current RNN models and for training in general. So without a GPU, there is no practical usage of the DL models in GROBID.

Okay, fair enough. The default model BidLSTM_CRF seems to be all right on a CPU, but I haven't tried the others (not least because of the training time).

It would be good to have a CPU-friendly model though. There may be other factors, like using layout features, that get us more improvement than a huge model. But that's another discussion. :-)

Regarding the embeddings, I planned to add to DeLFT an automatic download of the required embeddings, cf. here, as is done in spaCy or flair, so embeddings would be downloaded at first launch. We generally want to avoid users doing it manually; it's painful, and the .bin format, for instance, is hard to support on Windows.

An automatic download would certainly be preferable. In terms of deployment, it could make sense to be able to do it at deployment time rather than at startup. With spaCy I usually run the download command ahead of time (and the new models are much smaller than they used to be). It could be okay at startup as well, as long as the server doesn't start listening before it's ready.

Have you thought about where you would host the models? spaCy does it as an attachment to a GitHub release. I'm not sure if there are limitations on the size.

@de-code
Collaborator Author

de-code commented Jun 25, 2019

BTW I started experimenting with adding it as a GitHub release asset (like spaCy), though not using the full-blown embedding: https://github.com/elifesciences/sciencebeam-models/releases

@de-code
Collaborator Author

de-code commented Jun 26, 2019

I actually just tried to add a bigger file and got this response (of course we could split it): Yowza, that’s a big file. Try again with a file smaller than 2GB.

See also distributing-large-binaries:

We don't limit the total size of your binary release files, nor the bandwidth used to deliver them. However, each individual file must be under 2 GB in size.
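Since each release asset must stay under 2 GB, one workaround is to split the archive into numbered parts and concatenate them after download. A minimal sketch; the chunk size and part-naming scheme here are assumptions for illustration, not an existing tool:

```python
import os

# Stay just under GitHub's 2 GB per-asset limit.
CHUNK_SIZE = 2 * 1024 ** 3 - 1

def split_file(path, chunk_size=CHUNK_SIZE):
    """Split `path` into numbered parts.

    The parts sort lexicographically, so they can be reassembled with
    `cat path.part-* > path` after download.
    """
    parts = []
    with open(path, "rb") as source:
        index = 0
        while True:
            chunk = source.read(chunk_size)
            if not chunk:
                break
            part_path = "%s.part-%03d" % (path, index)
            with open(part_path, "wb") as out:
                out.write(chunk)
            parts.append(part_path)
            index += 1
    return parts
```

Each part would then be uploaded as a separate release asset, at the cost of an extra reassembly step on the consumer side.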

@de-code
Collaborator Author

de-code commented Jun 26, 2019

I started a conda version of the Dockerfile to see whether that works. Currently I am having difficulties installing all of the dependencies in requirements.txt.

@de-code
Collaborator Author

de-code commented Jun 26, 2019

Switched conda to use the non-GPU version of TensorFlow for now. Unfortunately the container is exiting with code 139. Something may not be right; it could be the installation of jep.

@de-code
Collaborator Author

de-code commented Jun 26, 2019

@lfoppiano have you ever tried JEP with conda? It seems they officially only support CPython, and conda seems to come with its own flavour.

@lfoppiano
Collaborator

@de-code yes, and AFAIK it was working (on Mac).

@de-code
Collaborator Author

de-code commented Jul 18, 2019

If you want to see this in action, a variation of it is now working:

docker pull elifesciences/sciencebeam-trainer-delft-grobid_unstable
docker run --rm -p 8070:8070 elifesciences/sciencebeam-trainer-delft-grobid_unstable

You probably want to mount /data to avoid having to download and preprocess the pre-trained word embeddings every time, e.g.:

docker run --rm -p 8070:8070 -v $PWD/data:/data \
    elifesciences/sciencebeam-trainer-delft-grobid_unstable

It will download the word embedding automatically. The current embedding registry points to https://github.com/elifesciences/sciencebeam-models/releases/tag/v0.0.1
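The registry-based lookup could work roughly like the sketch below; the registry file format, field names, and paths here are assumptions for illustration, not DeLFT's actual implementation:

```python
import json
import os
import urllib.request

# Hypothetical registry file mapping embedding names to download URLs.
REGISTRY_PATH = "/data/embedding-registry.json"
DATA_DIR = "/data/embeddings"

def load_embedding(name, registry_path=REGISTRY_PATH, data_dir=DATA_DIR):
    """Return the local path of a named embedding, downloading it on first use."""
    with open(registry_path) as f:
        registry = {entry["name"]: entry for entry in json.load(f)["embeddings"]}
    entry = registry[name]
    local_path = os.path.join(data_dir, os.path.basename(entry["url"]))
    if not os.path.exists(local_path):
        os.makedirs(data_dir, exist_ok=True)
        urllib.request.urlretrieve(entry["url"], local_path)
    return local_path
```

With a mounted /data volume, the download then happens only on the first container start, and later runs pick up the cached file.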

@de-code
Collaborator Author

de-code commented Nov 29, 2019

Closing this PR. Let me know if you are interested in it.

@de-code de-code closed this Nov 29, 2019