
allow model specific crf engine configuration #559

Merged

Conversation

@de-code
Collaborator

de-code commented Mar 13, 2020

This will allow experimentation with a DeLFT segmentation model, for example.

It will still use Wapiti for the segmentation and fulltext models by default.

You can enable DeLFT for the segmentation model by adding this to grobid.properties:

grobid.crf.engine.segmentation=delft
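
For illustration, here is a minimal sketch (with made-up class and helper names, not the actual GROBID code) of how a per-model lookup with a global default engine could work:

    import java.util.Properties;

    // Sketch only: resolve the engine for a model from "grobid.crf.engine.<model>",
    // falling back to the global "grobid.crf.engine" default (wapiti).
    public class EngineLookupSketch {

        static String engineFor(Properties props, String modelName) {
            String specific = props.getProperty("grobid.crf.engine." + modelName);
            return specific != null ? specific : props.getProperty("grobid.crf.engine", "wapiti");
        }

        public static void main(String[] args) {
            Properties props = new Properties();
            props.setProperty("grobid.crf.engine", "wapiti");              // global default
            props.setProperty("grobid.crf.engine.segmentation", "delft");  // per-model override

            System.out.println(engineFor(props, "segmentation")); // delft
            System.out.println(engineFor(props, "fulltext"));     // wapiti (falls back to default)
        }
    }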

/cc @kermitt2 @lfoppiano

@coveralls

coveralls commented Mar 13, 2020

Coverage Status

Coverage increased (+0.001%) to 38.436% when pulling e9c371b on elifesciences:make-crf-engine-model-specific into ff10968 on kermitt2:master.

@lfoppiano
Collaborator

What about removing this constraint and leaving it in the configuration only?

GROBID would use the default engine for each model, regardless of the model name, unless a different engine is specified in the configuration...

grobid.crf.engine.segmentation=wapiti
grobid.crf.engine.fulltext=wapiti

My motivation is purely to avoid if-else branches when we can use the configuration to tune this, and possibly to avoid modifying the code if, for example, new models need to default to Wapiti.

For example, as we discussed in another issue, the reference-segmenter should also use CRF for long sequences (see kermitt2/delft#97 (comment)).

if (
GrobidProperties.getGrobidCRFEngine() == GrobidCRFEngine.DELFT
&& (
modelName.equals("fulltext")
Collaborator

I wonder why you removed the use of the constants?

Collaborator Author

It seemed to cause some class loading issues when actually running the service (although the tests were fine). That is probably because GrobidModels depends on GrobidProperties (which then shouldn't depend on GrobidModels).
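
As a side note, here is a toy example (not the GROBID classes, and not necessarily the exact failure seen here) of how two classes whose static initializers depend on each other can observe half-initialized state, which is the kind of cycle being avoided:

    // Toy example: Config's initializer reads Models, and Models' initializer reads
    // back into Config while Config is still initializing, so it sees a null field.
    public class CircularInitDemo {

        static class Config {
            // Reading Models.SEGMENTATION triggers Models' initialization.
            static final String SEGMENTATION_KEY = "grobid.crf.engine." + Models.SEGMENTATION;
        }

        static class Models {
            static {
                // Runs while Config is still initializing: SEGMENTATION_KEY is still null here.
                System.out.println("Seen from Models: " + Config.SEGMENTATION_KEY);
            }
            static final String SEGMENTATION = "segmentation".trim(); // deliberately not a compile-time constant
        }

        public static void main(String[] args) {
            // Prints "Seen from Models: null", then the fully built key.
            System.out.println("Seen from main:   " + Config.SEGMENTATION_KEY);
        }
    }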

@kermitt2
Owner

Yes having the engine specified in the config file is indeed much nicer.

However, even if delft is specified in the config file, maybe we still need a mechanism where we can fall back to Wapiti if the DeLFT model does not exist no?

It's hard to imagine having any sort of usable deep learning models for segmentation, fulltext and reference segmenter (likely also table) without something like a sliding window implemented.

Related point, it would probably be good to move the property file to a yaml file :D

@lfoppiano
Collaborator

Yes having the engine specified in the config file is indeed much nicer.

However, even if delft is specified in the config file, maybe we still need a mechanism where we can fall back to Wapiti if the DeLFT model does not exist no?

Ah yes, that's it. Even better. This would simplify the implementation, making it even more robust. 👍

It's hard to imagine having any sort of usable deep learning models for segmentation, fulltext and reference segmenter (likely also table) without something like a sliding window implemented.
👍

Related point, it would probably be good to move the property file to a yaml file :D

👍 (maybe in a different issue, if it does not exist already)

@de-code
Collaborator Author

de-code commented Mar 17, 2020

What about removing this constraint and leaving it in the configuration only?

That sounds good to me.

Yes having the engine specified in the config file is indeed much nicer.
However, even if delft is specified in the config file, maybe we still need a mechanism where we can fall back to Wapiti if the DeLFT model does not exist no?

Ah yes, that's it. Even better. This would simplify the implementation, making it even more robust.

I don't know how easy that would be to implement. My reservation would be that it adds magic and isn't obvious (I didn't quite like the condition in the code either). If the configuration told GROBID to use a DeLFT model and there is no such model, then I would consider that an error. Maybe it's best not to hide it. But if you feel strongly about it we could implement it. (I just wouldn't know how to at the moment)
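
For reference, a rough sketch of what such a fallback could look like (the file layout and names below are assumptions for illustration, not the PR's implementation):

    import java.io.File;

    // Sketch only: choose DeLFT for a model only when a trained model is actually
    // present on disk; otherwise fall back to Wapiti and log it loudly.
    public class EngineFallbackSketch {

        static String resolveEngine(String modelName, String configuredEngine, File modelsDir) {
            if ("delft".equals(configuredEngine)) {
                // Assumed layout: grobid-home/models/<model>/model_weights.hdf5
                File delftModel = new File(new File(modelsDir, modelName), "model_weights.hdf5");
                if (delftModel.exists()) {
                    return "delft";
                }
                System.err.println("No DeLFT model found for '" + modelName + "', falling back to wapiti");
            }
            return "wapiti";
        }

        public static void main(String[] args) {
            File modelsDir = new File("grobid-home/models");
            System.out.println(resolveEngine("segmentation", "delft", modelsDir));
        }
    }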

It's hard to imagine having any sort of usable deep learning models for segmentation, fulltext and reference segmenter (likely also table) without something like a sliding window implemented.

Actually, with the dataset I am working with, and the way I auto-annotated it, I seem to get significantly better results for the header area using a DL segmentation model. I am not sure whether that is in part due to the line numbers. I haven't evaluated non-header elements yet. But this would give me a stronger reason to look into it. (I also somewhat integrated Wapiti into sciencebeam-trainer-delft, mainly so that I can train and evaluate it in the same way. I am retraining the model, which gives me more confidence that it was trained with exactly the same data. I will report back after that.)

@de-code
Collaborator Author

de-code commented Mar 17, 2020

Related point, it would probably be good to move the property file to a yaml file :D

(maybe in a different issue, if it does not exist already)

I like properties files because they are simple. YAML is fine too.

In any case, I am almost exclusively using environment variables for the configuration, because that is more Docker friendly and requires less customisation. I have one set of "deployment parameters" (actually helm arguments), which describe how ScienceBeam / GROBID is being deployed and configured. But there are certainly good reasons to use a configuration file for other use-cases.

@de-code
Collaborator Author

de-code commented Mar 27, 2020

Probably should also be replacing hyphen (-) or slash (/) with underscore (_) for reference-segmenter, name/citation and name/header etc.

@kermitt2 kermitt2 self-requested a review March 28, 2020 10:02
@kermitt2
Owner

Probably should also be replacing hyphen (-) or slash (/) with underscore (_) for reference-segmenter, name/citation and name/header etc.

The / in name/citation or name/header is used for resolving the path to the resources and models, so it has a different interpretation than the hyphen (which is just a normal part of the model name).

@de-code
Collaborator Author

de-code commented Mar 31, 2020

Probably should also be replacing hyphen (-) or slash (/) with underscore (_) for reference-segmenter, name/citation and name/header etc.

The / in name/citation or name/header is used for resolving the path to the resources and models, so it has a different interpretation than the hyphen (which is just a normal part of the model name).

Okay, I see now that it replaces the / with - in getModelName. I made a minor modification to only replace the hyphen with an underscore, so that the property name is consistent with other property names, e.g. grobid.crf.engine.name_citation instead of grobid.crf.engine.name-citation.
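
To make the naming convention concrete, a small sketch (helper name made up) of how model names map to property keys under this scheme:

    // Sketch only: the slash in a model name is already resolved to a hyphen by
    // getModelName (e.g. name/citation -> name-citation); the property key then
    // replaces the hyphen with an underscore.
    public class PropertyKeySketch {

        static String propertyKey(String modelName) {
            return "grobid.crf.engine." + modelName.replaceAll("-", "_");
        }

        public static void main(String[] args) {
            System.out.println(propertyKey("segmentation"));        // grobid.crf.engine.segmentation
            System.out.println(propertyKey("reference-segmenter")); // grobid.crf.engine.reference_segmenter
            System.out.println(propertyKey("name-citation"));       // grobid.crf.engine.name_citation
        }
    }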

@lfoppiano
Collaborator

I'm testing it and it works with the GROBID models.

After setting the following in the configuration file:

grobid.crf.engine=delft

grobid.crf.engine.segmentation=wapiti
grobid.crf.engine.fulltext=wapiti
grobid.crf.engine.reference_segmenter=wapiti
grobid.crf.engine.figure=wapiti

I see the corresponding models loaded with Wapiti or DeLFT accordingly:

Apr 09 12:00:29 falcon bash[18375]: INFO  [2020-04-09 03:00:29,937] org.grobid.core.jni.DeLFTModel: Loading DeLFT model for citation...
Apr 09 12:00:31 falcon bash[18375]: INFO  [2020-04-09 03:00:31,100] org.grobid.core.jni.WapitiModel: Loading model: /data/workspace/services/grobidl/grobid-home/models/fulltext/model.wapiti (size: 22836546)
Apr 09 12:00:31 falcon bash[18375]: [Wapiti] Loading model: "/data/workspace/services/grobidl/grobid-home/models/fulltext/model.wapiti"
Apr 09 12:00:32 falcon bash[18375]: Model path: /data/workspace/services/grobidl/grobid-home/models/fulltext/model.wapiti
Apr 09 12:00:32 falcon bash[18375]: [Wapiti] Loading model: "/data/workspace/services/grobidl/grobid-home/models/segmentation/model.wapiti"
Apr 09 12:00:32 falcon bash[18375]: INFO  [2020-04-09 03:00:32,569] org.grobid.core.jni.WapitiModel: Loading model: /data/workspace/services/grobidl/grobid-home/models/segmentation/model.wapiti (size: 17807323)
Apr 09 12:00:33 falcon bash[18375]: Model path: /data/workspace/services/grobidl/grobid-home/models/segmentation/model.wapiti
Apr 09 12:00:33 falcon bash[18375]: [Wapiti] Loading model: "/data/workspace/services/grobidl/grobid-home/models/reference-segmenter/model.wapiti"
Apr 09 12:00:33 falcon bash[18375]: INFO  [2020-04-09 03:00:33,811] org.grobid.core.jni.WapitiModel: Loading model: /data/workspace/services/grobidl/grobid-home/models/reference-segmenter/model.wapiti (size: 4921245)
Apr 09 12:00:34 falcon bash[18375]: Model path: /data/workspace/services/grobidl/grobid-home/models/reference-segmenter/model.wapiti
Apr 09 12:00:34 falcon bash[18375]: [Wapiti] Loading model: "/data/workspace/services/grobidl/grobid-home/models/figure/model.wapiti"
Apr 09 12:00:34 falcon bash[18375]: INFO  [2020-04-09 03:00:34,071] org.grobid.core.jni.WapitiModel: Loading model: /data/workspace/services/grobidl/grobid-home/models/figure/model.wapiti (size: 422671)
Apr 09 12:00:34 falcon bash[18375]: Model path: /data/workspace/services/grobidl/grobid-home/models/figure/model.wapiti
Apr 09 12:00:34 falcon bash[18375]: running thread: 32
Apr 09 12:00:34 falcon bash[18375]: INFO  [2020-04-09 03:00:34,089] org.grobid.core.jni.DeLFTModel: Loading DeLFT model for table...

I tested the same when running a sub-module; for example, I tried to force the use of Wapiti for quantities, units and values:

grobid.crf.engine.quantities=wapiti
grobid.crf.engine.units=wapiti
grobid.crf.engine.values=wapiti

and I got them loaded correctly:

Apr 09 12:15:37 falcon bash[20246]: INFO  [2020-04-09 03:15:37,319] com.hubspot.dropwizard.guicier.DropwizardModule: Added guice injected health check: org.grobid.service.controller.HealthCheck
Apr 09 12:15:37 falcon bash[20246]: INFO  [2020-04-09 03:15:37,424] org.grobid.core.jni.DeLFTModel: Loading DeLFT model for superconductors...
Apr 09 12:15:37 falcon bash[20246]: running thread: 1
Apr 09 12:15:37 falcon bash[20246]: INFO  [2020-04-09 03:15:37,425] org.grobid.core.jni.JEPThreadPool: Creating JEP instance for thread 14
Apr 09 12:15:40 falcon bash[20246]: INFO  [2020-04-09 03:15:40,031] org.grobid.core.jni.WapitiModel: Loading model: /data/workspace/services/grobidl/grobid-superconductors/../grobid-home/models/quantities/model.wapiti (size: 12680546)
Apr 09 12:15:40 falcon bash[20246]: [Wapiti] Loading model: "/data/workspace/services/grobidl/grobid-superconductors/../grobid-home/models/quantities/model.wapiti"
Apr 09 12:15:40 falcon bash[20246]: Model path: /data/workspace/services/grobidl/grobid-superconductors/../grobid-home/models/quantities/model.wapiti
Apr 09 12:15:40 falcon bash[20246]: [Wapiti] Loading model: "/data/workspace/services/grobidl/grobid-superconductors/../grobid-home/models/units/model.wapiti"
Apr 09 12:15:40 falcon bash[20246]: INFO  [2020-04-09 03:15:40,843] org.grobid.core.jni.WapitiModel: Loading model: /data/workspace/services/grobidl/grobid-superconductors/../grobid-home/models/units/model.wapiti (size: 47661)
Apr 09 12:15:40 falcon bash[20246]: Model path: /data/workspace/services/grobidl/grobid-superconductors/../grobid-home/models/units/model.wapiti
Apr 09 12:15:40 falcon bash[20246]: [Wapiti] Loading model: "/data/workspace/services/grobidl/grobid-superconductors/../grobid-home/models/values/model.wapiti"
Apr 09 12:15:40 falcon bash[20246]: INFO  [2020-04-09 03:15:40,897] org.grobid.core.jni.WapitiModel: Loading model: /data/workspace/services/grobidl/grobid-superconductors/../grobid-home/models/values/model.wapiti (size: 90108)
Apr 09 12:15:40 falcon bash[20246]: Model path: /data/workspace/services/grobidl/grobid-superconductors/../grobid-home/models/values/model.wapiti

@lfoppiano lfoppiano added this to the 0.6.0 milestone Apr 17, 2020
@kermitt2 kermitt2 modified the milestones: 0.6.0, 0.6.1 Apr 24, 2020
@lfoppiano
Collaborator

lfoppiano commented Jul 28, 2020

I thought this was already merged in version 0.6.0, wasn't it?
If it's not, I've been testing it on my workstation for months, with no problems so far.

@de-code
Collaborator Author

de-code commented Jul 28, 2020

I thought this was already merged in version 0.6.0, wasn't it?
If it's not, I've been testing it on my workstation for months, with no problems so far.

It doesn't seem to have been merged yet. I was reminded of it because I didn't pay enough attention when I merged with master and it got disabled in my fork / branch.

@kermitt2
Owner

kermitt2 commented Oct 3, 2020

@lfoppiano @de-code

I am only testing this feature now and it seems it is not working as expected in its current state - or I might have misunderstood something:

  • apparently it is not working when we select WAPITI as the default engine and one or several DeLFT models:
grobid.crf.engine=wapiti

and then select DeLFT for a particular model:

grobid.crf.engine.citation=delft

The JEP native lib is not correctly loaded in this case and we get the usual error:

! java.lang.UnsatisfiedLinkError: jep.Jep.init(Ljava/lang/ClassLoader;ZZ)J
  • apparently it's not working for models that have a hyphen in the model name:
grobid.crf.engine=delft

grobid.crf.engine.segmentation=wapiti
grobid.crf.engine.fulltext=wapiti
grobid.crf.engine.reference-segmenter=wapiti

We get:

INFO  [2020-10-03 20:25:45,810] org.grobid.core.jni.DeLFTModel: Loading DeLFT model for reference-segmenter...

while the Wapiti models are correctly loaded for fulltext and segmentation.

Changing GrobidProperties.java line 710 from:

    private static String getModelPropertySuffix(final String modelName) {
        return modelName.replaceAll("-", "_");
    }

to

    private static String getModelPropertySuffix(final String modelName) {
        return modelName;
    }

fixes the loading problem, but I suspect that it might break elsewhere.

@de-code
Collaborator Author

de-code commented Oct 4, 2020

Hi @kermitt2, it was changed to expect the property name to have an underscore (see #559 (comment)).

i.e. it should work with:

grobid.crf.engine.reference_segmenter=wapiti

or by setting the environment variable GROBID__CRF__ENGINE__REFERENCE_SEGMENTER to wapiti.
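
The convention implied by that example is that a double underscore in the environment variable name stands for the dot in the property name, while single underscores are kept. A small sketch of that mapping (not GROBID's actual parsing code):

    // Sketch only: GROBID__CRF__ENGINE__REFERENCE_SEGMENTER -> grobid.crf.engine.reference_segmenter
    public class EnvToPropertySketch {

        static String toPropertyName(String envName) {
            return envName.replace("__", ".").toLowerCase();
        }

        public static void main(String[] args) {
            System.out.println(toPropertyName("GROBID__CRF__ENGINE__REFERENCE_SEGMENTER"));
        }
    }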

@de-code de-code deleted the make-crf-engine-model-specific branch October 4, 2020 13:06
@kermitt2
Owner

kermitt2 commented Oct 4, 2020

Yes, for the second problem, it works with an underscore instead of the hyphen!

Hard to guess when working with the property file, but it motivates replacing the property file with a yaml config file :)
I guess in the meantime it has to be documented in https://grobid.readthedocs.io/en/latest/Deep-Learning-models/#getting-started-with-dl

@de-code
Collaborator Author

de-code commented Oct 5, 2020

For the first problem, you are right, it isn't handled yet. It will work if you set it up such that it doesn't require the code hacking the library path (which I had issues with). I am setting LD_LIBRARY_PATH instead. I guess to make the loading module work for the model-specific CRF engine configuration, we could have a method that gives you all of the configured CRF engines for any model, e.g. wapiti and delft, and then iterate through that list when initializing them (I just wouldn't be able to test it well).
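
A rough sketch of that idea (class and key names are illustrative): collect every engine named in the configuration, the global default plus any per-model override, so that each backend's native libraries can be initialized up front:

    import java.util.LinkedHashSet;
    import java.util.Properties;
    import java.util.Set;

    // Sketch only: gather the distinct set of configured CRF engines across all models.
    public class ConfiguredEnginesSketch {

        static Set<String> configuredEngines(Properties props) {
            Set<String> engines = new LinkedHashSet<>();
            engines.add(props.getProperty("grobid.crf.engine", "wapiti")); // default engine
            for (String key : props.stringPropertyNames()) {
                if (key.startsWith("grobid.crf.engine.")) {                // per-model overrides
                    engines.add(props.getProperty(key));
                }
            }
            return engines;
        }

        public static void main(String[] args) {
            Properties props = new Properties();
            props.setProperty("grobid.crf.engine", "wapiti");
            props.setProperty("grobid.crf.engine.citation", "delft");
            System.out.println(configuredEngines(props)); // [wapiti, delft]
        }
    }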

@lfoppiano
Collaborator

I've updated the documentation a bit (and removed the useless conda requirement files) in b0560b5.

@lfoppiano
Collaborator

I think I've fixed the first problem. It's in branch https://github.com/kermitt2/grobid/tree/bugfix/problem-pr-559
