[wip] added grobid with delft image #441
Update documentation and fix build on readthedocs
Migration to Gradle 5
Swap w/h in coordinates documentation
These were the wrong way round.
Maybe the best option would be to ask users to mount the embeddings. We could also download on-demand, but that could make things complicated with multiple instances using the same shared volume. In any case, it would still make the start-up very slow.
As an important remark, predictions are 50 times slower without a GPU for BERT, for instance, and it's even worse with the current RNN models and for training in general. So without a GPU, there is no practical usage of the DL models in GROBID. Regarding the embeddings, I plan to add automatic download of the required embeddings to DeLFT, cf. here, as is done in spaCy or flair, so embeddings would be downloaded at first launch. We generally want to avoid users doing it manually: it's painful, and the .bin format, for instance, is hard to support on Windows.
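The first-launch download pattern described above could be sketched roughly like this (a minimal illustration only; the helper name and cache layout are assumptions, not DeLFT's actual implementation):

```shell
# Sketch of a download-if-missing helper for embeddings
# (all names here are illustrative; this is not DeLFT's actual code).
download_embedding_if_missing() {
    # $1: URL of the embedding archive, $2: local cache directory
    url="$1"
    cache_dir="$2"
    target="$cache_dir/$(basename "$url")"
    if [ -f "$target" ]; then
        echo "embedding already cached: $target"
        return 0
    fi
    mkdir -p "$cache_dir"
    # download to a temporary name first, so a partial download is never
    # mistaken for a complete embedding (e.g. by another instance
    # sharing the same volume)
    curl -fsSL "$url" -o "$target.part" && mv "$target.part" "$target"
}
```

The temp-file-then-rename step matters for the shared-volume scenario mentioned earlier: the final name only ever appears once the file is complete.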
Okay, fair enough. It would be good to have a CPU-friendly default model though. There may be other factors, like using layout features, that get us more improvement than a huge model. But that's another discussion. :-)
An automatic download would certainly be preferable. In terms of deployment, it could make sense to be able to do it at deployment time rather than at startup. With spaCy I usually run the download command ahead of time (and with the new release that's a much smaller model than it used to be). It could be okay at start-up time as well, as long as the server doesn't start listening before it's ready. Have you thought about where you would host the models? spaCy does it as an attachment to a GitHub release. Not sure if there are limitations on the size.
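On the "server shouldn't start listening before it's ready" point, a deployment can also guard against this externally by polling a health endpoint before routing traffic. A sketch (the helper itself is made up; GROBID's `/api/isalive` endpoint is used as the probe in the usage comment):

```shell
# Generic readiness wait: retry a probe command until it succeeds
# (helper name and retry parameters are illustrative).
wait_until_ready() {
    # $1: probe command, $2: max attempts (default 30, one second apart)
    max="${2:-30}"
    attempts=0
    until eval "$1" > /dev/null 2>&1; do
        attempts=$((attempts + 1))
        [ "$attempts" -ge "$max" ] && return 1
        sleep 1
    done
    return 0
}

# e.g. only mark the deployment healthy once GROBID answers:
# wait_until_ready "curl -fs http://localhost:8070/api/isalive"
```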
BTW I started experimenting with adding it as a GitHub release asset (like spaCy), though not using the full-blown embedding: https://github.com/elifesciences/sciencebeam-models/releases
I actually just tried to add a bigger file and got an error as a response (of course we could split it). See also distributing-large-binaries:
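If splitting turns out to be necessary, plain `split`/`cat` would do it (a sketch; the helper names and the idea of sizing chunks to stay under the asset limit are assumptions):

```shell
# Split a large file into release-sized chunks and reassemble them later
# (helper names and chunk sizes are illustrative).
split_for_release() {
    # $1: file to split, $2: chunk size, e.g. 1900m
    split -b "$2" "$1" "$1.part-"
}

join_release_parts() {
    # $1: original file name, $2: output path
    # split's alphabetical suffixes (aa, ab, ...) sort in the right order
    cat "$1.part-"* > "$2"
}
```

The parts would be uploaded as individual release assets and reassembled after download.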
I started a …
Switched conda to use the non-GPU version of TensorFlow for now. Unfortunately the container is exiting with code 139 (128 + SIGSEGV, i.e. a segmentation fault), so something may not be right. It could be the installation of jep.
@lfoppiano have you ever tried JEP with conda? It seems they officially only support CPython, and conda seems to come with its own flavour.
@de-code yes, and AFAIK it was working (on Mac)
If you want to see this in action, a variation of it is now working:

```
docker pull elifesciences/sciencebeam-trainer-delft-grobid_unstable
docker run --rm -p 8070:8070 elifesciences/sciencebeam-trainer-delft-grobid_unstable
```

You probably want to mount the data directory:

```
docker run --rm -p 8070:8070 -v $PWD/data:/data \
    elifesciences/sciencebeam-trainer-delft-grobid_unstable
```

It will download the word embedding automatically. The current embedding registry points to https://github.com/elifesciences/sciencebeam-models/releases/tag/v0.0.1
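For reference, a registry entry enabling that automatic download might look something like the following (an illustrative fragment only: the exact field names in DeLFT's embedding registry may differ, and the `url` field and download path are the assumptions made here):

```json
{
    "embeddings": [
        {
            "name": "glove-840B",
            "type": "glove",
            "format": "vec",
            "lang": "en",
            "url": "https://github.com/elifesciences/sciencebeam-models/releases/download/v0.0.1/glove.840B.300d.txt.gz"
        }
    ]
}
```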
Closing this PR. Let me know if you are interested in it.
This is an attempt to provide a Docker image that contains GROBID with DeLFT.
Currently not using the TensorFlow GPU version. There is probably less benefit for prediction, but it could be added back later.
What's currently missing is to add the embedding. The currently trained model is using `glove.840B.300d`, which seems massive (although that is probably a relative term in this context). Variants considered:

- glove.840B.300d.txt (uncompressed)
- glove.840B.300d.txt.gz (gzip)
- glove.840B.300d.txt.gz (gzip, "best")
- glove.840B.300d.txt.xz (lzma, "extreme")
- glove.840B.300d.bin (convertvec txt2bin)
- glove.840B.300d.bin.gz (gzip)
- data.mdb (delft's lmdb cache)
- data.mdb.gz (gzip)

/cc @kermitt2 @lfoppiano
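For anyone wanting to reproduce the size comparison of the text-based variants, something along these lines works (a sketch; the helper name is made up, and `xz` is skipped if it isn't installed):

```shell
# Produce the compressed variants listed above next to the input file
# (the helper name is illustrative).
compare_compression() {
    # $1: input file, e.g. glove.840B.300d.txt
    gzip -c "$1" > "$1.gz"             # gzip, default level
    gzip -9 -c "$1" > "$1.best.gz"     # gzip "best"
    if command -v xz > /dev/null 2>&1; then
        xz -9e -c "$1" > "$1.xz"       # lzma "extreme"
    fi
    ls -l "$1" "$1".*                  # compare resulting sizes
}
```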