
[wip] added grobid with delft image #441

Closed
wants to merge 1,609 commits into kermitt2:master from de-code:added-grobid-with-delft-image

Conversation

de-code
Collaborator

@de-code de-code commented Jun 24, 2019

This is an attempt to provide a Docker image that contains GROBID with DeLFT.

Currently not using the TensorFlow GPU version. There is probably less benefit for prediction, but it could be added back later.

What's currently missing is adding the embedding. The currently trained model uses glove.840B.300d, which seems massive (though "massive" is probably a relative term in this context).

  • 5.3G glove.840B.300d.txt (uncompressed)
  • 2.1G glove.840B.300d.txt.gz (gzip)
  • 2.0G glove.840B.300d.txt.gz (gzip, "best")
  • 1.7G glove.840B.300d.txt.xz (lzma, "extreme")
  • 2.9G glove.840B.300d.bin (convertvec txt2bin)
  • 2.5G glove.840B.300d.bin.gz (gzip)
  • 5.7G data.mdb (delft's lmdb cache)
  • 4.5G data.mdb.gz (gzip)

/cc @kermitt2 @lfoppiano

lfoppiano and others added 30 commits July 25, 2018 08:45
Update documentation and fix build on readthedocs
@de-code
Collaborator Author

de-code commented Jun 24, 2019

Maybe the best option would be to ask users to mount the embeddings.

We could also download on demand, but that could make things complicated with multiple instances using the same shared volume. In any case, it would still make start-up very slow.
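For the on-demand route, a file lock on the shared volume would at least prevent multiple instances from downloading the same embedding concurrently. A rough sketch, assuming a POSIX filesystem; the URL and paths are placeholders, not actual DeLFT behaviour:

```python
import os
import fcntl
import urllib.request

# Placeholder locations, not DeLFT's actual configuration.
EMBEDDING_URL = "https://example.org/glove.840B.300d.txt.gz"
TARGET = "/data/glove.840B.300d.txt.gz"

def ensure_embedding(url=EMBEDDING_URL, target=TARGET):
    """Download the embedding once, even if several instances share the volume."""
    lock_path = target + ".lock"
    with open(lock_path, "w") as lock_file:
        # Block until this process holds an exclusive lock on the shared volume.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            if not os.path.exists(target):
                tmp = target + ".part"
                urllib.request.urlretrieve(url, tmp)
                os.rename(tmp, target)  # atomic on the same filesystem
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return target
```

This would serialise the download across instances, but the first start-up would still be slow; the lock only avoids redundant downloads and half-written files.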

@coveralls

coveralls commented Jun 24, 2019

Coverage Status

Coverage remained the same at 36.674% when pulling b71408c on de-code:added-grobid-with-delft-image into 9eac968 on kermitt2:master.

@kermitt2
Owner

As an important remark, predictions are 50 times slower without a GPU for BERT, for instance, and it's even worse with the current RNN models and for training in general. So without a GPU, there is no practical usage of the DL models in GROBID.

Regarding the embeddings, I planned to add to DeLFT an automatic download of the required embeddings, cf. here, as is done in spaCy or flair, so embeddings would be downloaded at first launch. We generally want to avoid users doing it manually; it's painful, and the .bin format, for instance, is hard to support on Windows.

@de-code
Collaborator Author

de-code commented Jun 25, 2019

As an important remark, predictions are 50 times slower without a GPU for BERT, for instance, and it's even worse with the current RNN models and for training in general. So without a GPU, there is no practical usage of the DL models in GROBID.

Okay, fair enough. The default model BidLSTM_CRF seems to be all right on a CPU, but I haven't tried the others (not least because of the training time).

It would be good to have a CPU-friendly model though. There may be other factors, like using layout features, that get us more improvement than a huge model. But that's another discussion. :-)

Regarding the embeddings, I planned to add to DeLFT an automatic download of the required embeddings, cf. here, as is done in spaCy or flair, so embeddings would be downloaded at first launch. We generally want to avoid users doing it manually; it's painful, and the .bin format, for instance, is hard to support on Windows.

An automatic download would certainly be preferable. In terms of deployment, it could make sense to be able to do it at deployment time rather than at startup. With spaCy I usually run the download command ahead of time (and the new models are much smaller than they used to be). It could be okay at startup as well, as long as the server doesn't start listening before it's ready.

Have you thought about where you would host the models? spaCy does it as an attachment to a GitHub release. I'm not sure if there are limitations on the size.

@de-code
Collaborator Author

de-code commented Jun 25, 2019

BTW I started experimenting with adding it as a GitHub release asset (like spaCy), though not using the full-blown embedding: https://github.com/elifesciences/sciencebeam-models/releases

@de-code
Collaborator Author

de-code commented Jun 26, 2019

I actually just tried to add a bigger file and got this response (of course we could split it): Yowza, that’s a big file. Try again with a file smaller than 2GB.

See also distributing-large-binaries:

We don't limit the total size of your binary release files, nor the bandwidth used to deliver them. However, each individual file must be under 2 GB in size.
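Since each release asset must stay under 2 GB, one workaround is to split the archive into numbered parts and concatenate them after download. A minimal sketch; the chunk size and part-naming scheme here are assumptions for illustration, not an existing tool:

```python
import os

# Stay just under GitHub's 2 GB per-asset limit.
CHUNK_SIZE = 2 * 1024 ** 3 - 1

def split_file(path, chunk_size=CHUNK_SIZE):
    """Split `path` into numbered parts.

    The parts sort lexicographically, so they can be reassembled with
    `cat path.part-* > path` after download.
    """
    parts = []
    with open(path, "rb") as source:
        index = 0
        while True:
            chunk = source.read(chunk_size)
            if not chunk:
                break
            part_path = "%s.part-%03d" % (path, index)
            with open(part_path, "wb") as out:
                out.write(chunk)
            parts.append(part_path)
            index += 1
    return parts
```

Each part would then be uploaded as a separate release asset, at the cost of an extra reassembly step on the consumer side.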

@de-code
Collaborator Author

de-code commented Jun 26, 2019

I started a conda version of the Dockerfile to see whether that works. Currently I am having difficulties installing all of the dependencies in requirements.txt.

@de-code
Collaborator Author

de-code commented Jun 26, 2019

Switched conda to use the non-GPU version of TensorFlow for now. Unfortunately the container is exiting with code 139. Something may not be right; it could be the installation of jep.

@de-code
Collaborator Author

de-code commented Jun 26, 2019

@lfoppiano have you ever tried JEP with conda? It seems they officially only support CPython, and conda seems to come with its own flavour.

@lfoppiano
Collaborator

@de-code yes, and AFAIK it was working (on Mac).

@de-code
Collaborator Author

de-code commented Jul 18, 2019

If you want to see this in action, a variation of it is now working:

docker pull elifesciences/sciencebeam-trainer-delft-grobid_unstable
docker run --rm -p 8070:8070 elifesciences/sciencebeam-trainer-delft-grobid_unstable

You probably want to mount /data to avoid having to download and preprocess the pre-trained word embeddings every time, e.g.:

docker run --rm -p 8070:8070 -v $PWD/data:/data \
    elifesciences/sciencebeam-trainer-delft-grobid_unstable

It will download the word embedding automatically. The current embedding registry points to https://github.com/elifesciences/sciencebeam-models/releases/tag/v0.0.1
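The registry-based lookup could work roughly like the sketch below; the registry file format, field names, and paths here are assumptions for illustration, not DeLFT's actual implementation:

```python
import json
import os
import urllib.request

# Hypothetical registry file mapping embedding names to download URLs.
REGISTRY_PATH = "/data/embedding-registry.json"
DATA_DIR = "/data/embeddings"

def load_embedding(name, registry_path=REGISTRY_PATH, data_dir=DATA_DIR):
    """Return the local path of a named embedding, downloading it on first use."""
    with open(registry_path) as f:
        registry = {entry["name"]: entry for entry in json.load(f)["embeddings"]}
    entry = registry[name]
    local_path = os.path.join(data_dir, os.path.basename(entry["url"]))
    if not os.path.exists(local_path):
        os.makedirs(data_dir, exist_ok=True)
        urllib.request.urlretrieve(entry["url"], local_path)
    return local_path
```

With a mounted /data volume, the download then happens only on the first container start, and later runs pick up the cached file.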

@de-code
Collaborator Author

de-code commented Nov 29, 2019

Closing this PR. Let me know if you are interested in it.

@de-code de-code closed this Nov 29, 2019