
allow model specific crf engine configuration #559

Merged

Conversation

@de-code
Collaborator

de-code commented Mar 13, 2020

This will allow experimentation with a DeLFT segmentation model, for example.

It will still use Wapiti for the segmentation and fulltext models by default.

You can enable DeLFT for the segmentation model by adding this to grobid.properties:

grobid.crf.engine.segmentation=delft
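
For illustration, here is a minimal sketch (with made-up class and helper names, not the actual GROBID code) of how a per-model lookup with a global default engine could work:

    import java.util.Properties;

    // Sketch only: resolve the engine for a model from "grobid.crf.engine.<model>",
    // falling back to the global "grobid.crf.engine" default (wapiti).
    public class EngineLookupSketch {

        static String engineFor(Properties props, String modelName) {
            String specific = props.getProperty("grobid.crf.engine." + modelName);
            return specific != null ? specific : props.getProperty("grobid.crf.engine", "wapiti");
        }

        public static void main(String[] args) {
            Properties props = new Properties();
            props.setProperty("grobid.crf.engine", "wapiti");              // global default
            props.setProperty("grobid.crf.engine.segmentation", "delft");  // per-model override

            System.out.println(engineFor(props, "segmentation")); // delft
            System.out.println(engineFor(props, "fulltext"));     // wapiti (falls back to default)
        }
    }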

/cc @kermitt2 @lfoppiano

@coveralls

coveralls commented Mar 13, 2020

Coverage Status

Coverage increased (+0.001%) to 38.436% when pulling e9c371b on elifesciences:make-crf-engine-model-specific into ff10968 on kermitt2:master.

@lfoppiano
Collaborator

What about removing this constraint and leaving it in the configuration only?

GROBID would use the default engine for each model, regardless of the model name, unless a different engine is specified in the configuration...

grobid.crf.engine.segmentation=wapiti
grobid.crf.engine.fulltext=wapiti

My motivation is purely to avoid if-else branches when we can use the configuration to tune this, and possibly to avoid modifying the code if, for example, new models need to default to Wapiti.

For example, as we discussed in another issue, the reference-segmenter should also use CRF for long sequences (see kermitt2/delft#97 (comment)).

if (
GrobidProperties.getGrobidCRFEngine() == GrobidCRFEngine.DELFT
&& (
modelName.equals("fulltext")
Collaborator

I wonder why you removed the use of the constants?

Collaborator Author

It seemed to cause some class loading issues when actually running the service (although the tests were fine). That is probably because GrobidModels depends on GrobidProperties (which then shouldn't depend on GrobidModels).
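
As a side note, here is a toy example (not the GROBID classes, and not necessarily the exact failure seen here) of how two classes whose static initializers depend on each other can observe half-initialized state, which is the kind of cycle being avoided:

    // Toy example: Config's initializer reads Models, and Models' initializer reads
    // back into Config while Config is still initializing, so it sees a null field.
    public class CircularInitDemo {

        static class Config {
            // Reading Models.SEGMENTATION triggers Models' initialization.
            static final String SEGMENTATION_KEY = "grobid.crf.engine." + Models.SEGMENTATION;
        }

        static class Models {
            static {
                // Runs while Config is still initializing: SEGMENTATION_KEY is still null here.
                System.out.println("Seen from Models: " + Config.SEGMENTATION_KEY);
            }
            static final String SEGMENTATION = "segmentation".trim(); // deliberately not a compile-time constant
        }

        public static void main(String[] args) {
            // Prints "Seen from Models: null", then the fully built key.
            System.out.println("Seen from main:   " + Config.SEGMENTATION_KEY);
        }
    }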

@kermitt2
Owner

Yes having the engine specified in the config file is indeed much nicer.

However, even if delft is specified in the config file, maybe we still need a mechanism where we can fall back to Wapiti if the DeLFT model does not exist no?

It's hard to imagine having any sort of usable deep learning models for segmentation, fulltext and reference segmenter (likely also table) without something like a sliding window implemented.

Related point, it would probably be good to move the property file to a yaml file :D

@lfoppiano
Collaborator

Yes having the engine specified in the config file is indeed much nicer.

However, even if delft is specified in the config file, maybe we still need a mechanism where we can fall back to Wapiti if the DeLFT model does not exist no?

Ah yes, that's it. Even better. This would simplify the implementation, making it even more robust. 👍

It's hard to imagine having any sort of usable deep learning models for segmentation, fulltext and reference segmenter (likely also table) without something like a sliding window implemented.
👍

Related point, it would probably be good to move the property file to a yaml file :D

👍 (maybe in a different issue, if it does not exist already)

@de-code
Collaborator Author

de-code commented Mar 17, 2020

What about removing this constraint and leaving it in the configuration only?

That sounds good to me.

Yes having the engine specified in the config file is indeed much nicer.
However, even if delft is specified in the config file, maybe we still need a mechanism where we can fall back to Wapiti if the DeLFT model does not exist no?

Ah yes, that's it. Even better. This would simplify the implementation, making it even more robust.

I don't know how easy that would be to implement. My reservation would be that it adds magic and isn't obvious (I didn't quite like the condition in the code either). If the configuration told GROBID to use a DeLFT model and there is no such model, then I would consider that an error. Maybe it's best not to hide it. But if you feel strongly about it we could implement it. (I just wouldn't know how to at the moment)
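
For reference, a rough sketch of what such a fallback could look like (the file layout and names below are assumptions for illustration, not the PR's implementation):

    import java.io.File;

    // Sketch only: choose DeLFT for a model only when a trained model is actually
    // present on disk; otherwise fall back to Wapiti and log it loudly.
    public class EngineFallbackSketch {

        static String resolveEngine(String modelName, String configuredEngine, File modelsDir) {
            if ("delft".equals(configuredEngine)) {
                // Assumed layout: grobid-home/models/<model>/model_weights.hdf5
                File delftModel = new File(new File(modelsDir, modelName), "model_weights.hdf5");
                if (delftModel.exists()) {
                    return "delft";
                }
                System.err.println("No DeLFT model found for '" + modelName + "', falling back to wapiti");
            }
            return "wapiti";
        }

        public static void main(String[] args) {
            File modelsDir = new File("grobid-home/models");
            System.out.println(resolveEngine("segmentation", "delft", modelsDir));
        }
    }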

It's hard to imagine having any sort of usable deep learning models for segmentation, fulltext and reference segmenter (likely also table) without something like a sliding window implemented.

Actually, with the dataset I am working with, and the way I auto-annotated it, I seem to get significantly better results for the header area using a DL segmentation model. I am not sure whether that is in part due to the line numbers. I haven't evaluated non-header elements yet. But this would give me a stronger reason to look into it. (I also somewhat integrated Wapiti into sciencebeam-trainer-delft, mainly so that I can train and evaluate it in the same way. I am retraining the model, which gives me more confidence that it was trained with exactly the same data. I will report back after that.)

@de-code
Collaborator Author

de-code commented Mar 17, 2020

Related point, it would probably be good to move the property file to a yaml file :D

(maybe in a different issue, if it does not exist already)

I like properties files because they are simple. YAML is fine too.

In any case, I am almost exclusively using environment variables for the configuration, because that is more Docker friendly and requires less customisation. I have one set of "deployment parameters" (actually helm arguments), which describe how ScienceBeam / GROBID is being deployed and configured. But there are certainly good reasons to use a configuration file for other use-cases.

@de-code
Collaborator Author

de-code commented Mar 27, 2020

Probably should also be replacing hyphen (-) or slash (/) with underscore (_) for reference-segmenter, name/citation and name/header etc.

@kermitt2 kermitt2 self-requested a review March 28, 2020 10:02
@kermitt2
Owner

Probably should also be replacing hyphen (-) or slash (/) with underscore (_) for reference-segmenter, name/citation and name/header etc.

The / in name/citation or name/header is used for resolving the path to the resources and models, so it has a different interpretation than the hyphen (which is just a normal part of the model name).

@de-code
Collaborator Author

de-code commented Mar 31, 2020

Probably should also be replacing hyphen (-) or slash (/) with underscore (_) for reference-segmenter, name/citation and name/header etc.

The / in name/citation or name/header is used for resolving the path to the resources and models, so it has a different interpretation than the hyphen (which is just a normal part of the model name).

Okay, I see now that it replaces the / with - in getModelName. I made a minor modification to only replace the hyphen with an underscore, so that the property name is consistent with other property names, e.g. grobid.crf.engine.name_citation instead of grobid.crf.engine.name-citation.
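
To make the naming convention concrete, a small sketch (helper name made up) of how model names map to property keys under this scheme:

    // Sketch only: the slash in a model name is already resolved to a hyphen by
    // getModelName (e.g. name/citation -> name-citation); the property key then
    // replaces the hyphen with an underscore.
    public class PropertyKeySketch {

        static String propertyKey(String modelName) {
            return "grobid.crf.engine." + modelName.replaceAll("-", "_");
        }

        public static void main(String[] args) {
            System.out.println(propertyKey("segmentation"));        // grobid.crf.engine.segmentation
            System.out.println(propertyKey("reference-segmenter")); // grobid.crf.engine.reference_segmenter
            System.out.println(propertyKey("name-citation"));       // grobid.crf.engine.name_citation
        }
    }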

@lfoppiano
Collaborator

I'm testing it and it works with the GROBID models.

After setting the following in the configuration file:

grobid.crf.engine=delft

grobid.crf.engine.segmentation=wapiti
grobid.crf.engine.fulltext=wapiti
grobid.crf.engine.reference_segmenter=wapiti
grobid.crf.engine.figure=wapiti

I see the corresponding models loaded with Wapiti or DeLFT accordingly:

Apr 09 12:00:29 falcon bash[18375]: INFO  [2020-04-09 03:00:29,937] org.grobid.core.jni.DeLFTModel: Loading DeLFT model for citation...
Apr 09 12:00:31 falcon bash[18375]: INFO  [2020-04-09 03:00:31,100] org.grobid.core.jni.WapitiModel: Loading model: /data/workspace/services/grobidl/grobid-home/models/fulltext/model.wapiti (size: 22836546)
Apr 09 12:00:31 falcon bash[18375]: [Wapiti] Loading model: "/data/workspace/services/grobidl/grobid-home/models/fulltext/model.wapiti"
Apr 09 12:00:32 falcon bash[18375]: Model path: /data/workspace/services/grobidl/grobid-home/models/fulltext/model.wapiti
Apr 09 12:00:32 falcon bash[18375]: [Wapiti] Loading model: "/data/workspace/services/grobidl/grobid-home/models/segmentation/model.wapiti"
Apr 09 12:00:32 falcon bash[18375]: INFO  [2020-04-09 03:00:32,569] org.grobid.core.jni.WapitiModel: Loading model: /data/workspace/services/grobidl/grobid-home/models/segmentation/model.wapiti (size: 17807323)
Apr 09 12:00:33 falcon bash[18375]: Model path: /data/workspace/services/grobidl/grobid-home/models/segmentation/model.wapiti
Apr 09 12:00:33 falcon bash[18375]: [Wapiti] Loading model: "/data/workspace/services/grobidl/grobid-home/models/reference-segmenter/model.wapiti"
Apr 09 12:00:33 falcon bash[18375]: INFO  [2020-04-09 03:00:33,811] org.grobid.core.jni.WapitiModel: Loading model: /data/workspace/services/grobidl/grobid-home/models/reference-segmenter/model.wapiti (size: 4921245)
Apr 09 12:00:34 falcon bash[18375]: Model path: /data/workspace/services/grobidl/grobid-home/models/reference-segmenter/model.wapiti
Apr 09 12:00:34 falcon bash[18375]: [Wapiti] Loading model: "/data/workspace/services/grobidl/grobid-home/models/figure/model.wapiti"
Apr 09 12:00:34 falcon bash[18375]: INFO  [2020-04-09 03:00:34,071] org.grobid.core.jni.WapitiModel: Loading model: /data/workspace/services/grobidl/grobid-home/models/figure/model.wapiti (size: 422671)
Apr 09 12:00:34 falcon bash[18375]: Model path: /data/workspace/services/grobidl/grobid-home/models/figure/model.wapiti
Apr 09 12:00:34 falcon bash[18375]: running thread: 32
Apr 09 12:00:34 falcon bash[18375]: INFO  [2020-04-09 03:00:34,089] org.grobid.core.jni.DeLFTModel: Loading DeLFT model for table...

I tested the same when running a sub-module; for example, I tried to force the use of Wapiti for quantities, units and values:

grobid.crf.engine.quantities=wapiti
grobid.crf.engine.units=wapiti
grobid.crf.engine.values=wapiti

and I got them loaded correctly:

Apr 09 12:15:37 falcon bash[20246]: INFO  [2020-04-09 03:15:37,319] com.hubspot.dropwizard.guicier.DropwizardModule: Added guice injected health check: org.grobid.service.controller.HealthCheck
Apr 09 12:15:37 falcon bash[20246]: INFO  [2020-04-09 03:15:37,424] org.grobid.core.jni.DeLFTModel: Loading DeLFT model for superconductors...
Apr 09 12:15:37 falcon bash[20246]: running thread: 1
Apr 09 12:15:37 falcon bash[20246]: INFO  [2020-04-09 03:15:37,425] org.grobid.core.jni.JEPThreadPool: Creating JEP instance for thread 14
Apr 09 12:15:40 falcon bash[20246]: INFO  [2020-04-09 03:15:40,031] org.grobid.core.jni.WapitiModel: Loading model: /data/workspace/services/grobidl/grobid-superconductors/../grobid-home/models/quantities/model.wapiti (size: 12680546)
Apr 09 12:15:40 falcon bash[20246]: [Wapiti] Loading model: "/data/workspace/services/grobidl/grobid-superconductors/../grobid-home/models/quantities/model.wapiti"
Apr 09 12:15:40 falcon bash[20246]: Model path: /data/workspace/services/grobidl/grobid-superconductors/../grobid-home/models/quantities/model.wapiti
Apr 09 12:15:40 falcon bash[20246]: [Wapiti] Loading model: "/data/workspace/services/grobidl/grobid-superconductors/../grobid-home/models/units/model.wapiti"
Apr 09 12:15:40 falcon bash[20246]: INFO  [2020-04-09 03:15:40,843] org.grobid.core.jni.WapitiModel: Loading model: /data/workspace/services/grobidl/grobid-superconductors/../grobid-home/models/units/model.wapiti (size: 47661)
Apr 09 12:15:40 falcon bash[20246]: Model path: /data/workspace/services/grobidl/grobid-superconductors/../grobid-home/models/units/model.wapiti
Apr 09 12:15:40 falcon bash[20246]: [Wapiti] Loading model: "/data/workspace/services/grobidl/grobid-superconductors/../grobid-home/models/values/model.wapiti"
Apr 09 12:15:40 falcon bash[20246]: INFO  [2020-04-09 03:15:40,897] org.grobid.core.jni.WapitiModel: Loading model: /data/workspace/services/grobidl/grobid-superconductors/../grobid-home/models/values/model.wapiti (size: 90108)
Apr 09 12:15:40 falcon bash[20246]: Model path: /data/workspace/services/grobidl/grobid-superconductors/../grobid-home/models/values/model.wapiti

@lfoppiano lfoppiano added this to the 0.6.0 milestone Apr 17, 2020
@kermitt2 kermitt2 modified the milestones: 0.6.0, 0.6.1 Apr 24, 2020
@lfoppiano
Collaborator

lfoppiano commented Jul 28, 2020

I thought this was already merged in version 0.6.0, wasn't it?
If it's not, I've been testing it on my workstation for months, with no problems so far.

@de-code
Collaborator Author

de-code commented Jul 28, 2020

I thought this was already merged in version 0.6.0, wasn't it?
If it's not, I've been testing it on my workstation for months, with no problems so far.

It doesn't seem to have been merged yet. I was reminded of it because I didn't pay enough attention when I merged with master and it got disabled in my fork / branch.

@kermitt2
Owner

kermitt2 commented Oct 3, 2020

@lfoppiano @de-code

I am only testing this feature now and it seems it is not working as expected in its current state - or I might have misunderstood something:

  • apparently it is not working when we select WAPITI as the default engine and one or several DeLFT models:
grobid.crf.engine=wapiti

and then select DeLFT for a particular model:

grobid.crf.engine.citation=delft

The JEP native lib is not correctly loaded in this case and we get the usual error:

! java.lang.UnsatisfiedLinkError: jep.Jep.init(Ljava/lang/ClassLoader;ZZ)J
  • apparently it's not working for models that have a hyphen in the model name:
grobid.crf.engine=delft

grobid.crf.engine.segmentation=wapiti
grobid.crf.engine.fulltext=wapiti
grobid.crf.engine.reference-segmenter=wapiti

We get:

INFO  [2020-10-03 20:25:45,810] org.grobid.core.jni.DeLFTModel: Loading DeLFT model for reference-segmenter...

while the Wapiti models are correctly loaded for fulltext and segmentation.

Changing GrobidProperties.java line 710 from:

    private static String getModelPropertySuffix(final String modelName) {
        return modelName.replaceAll("-", "_");
    }

to

    private static String getModelPropertySuffix(final String modelName) {
        return modelName;
    }

fixes the loading problem, but I suspect that it might break elsewhere.

@de-code
Collaborator Author

de-code commented Oct 4, 2020

Hi @kermitt2, it was changed to expect the property name to have an underscore (see #559 (comment)).

i.e. it should work with:

grobid.crf.engine.reference_segmenter=wapiti

or by setting the environment variable GROBID__CRF__ENGINE__REFERENCE_SEGMENTER to wapiti.
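
The convention implied by that example is that a double underscore in the environment variable name stands for the dot in the property name, while single underscores are kept. A small sketch of that mapping (not GROBID's actual parsing code):

    // Sketch only: GROBID__CRF__ENGINE__REFERENCE_SEGMENTER -> grobid.crf.engine.reference_segmenter
    public class EnvToPropertySketch {

        static String toPropertyName(String envName) {
            return envName.replace("__", ".").toLowerCase();
        }

        public static void main(String[] args) {
            System.out.println(toPropertyName("GROBID__CRF__ENGINE__REFERENCE_SEGMENTER"));
        }
    }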

@de-code de-code deleted the make-crf-engine-model-specific branch October 4, 2020 13:06
@kermitt2
Owner

kermitt2 commented Oct 4, 2020

Yes, for the second problem, it works with an underscore instead of the hyphen!

Hard to guess when working with the property file, but it motivates replacing the property file with a yaml config file :)
I guess in the meantime it has to be documented in https://grobid.readthedocs.io/en/latest/Deep-Learning-models/#getting-started-with-dl

@de-code
Collaborator Author

de-code commented Oct 5, 2020

For the first problem, you are right, it isn't handled yet. It will work if you set it up such that it doesn't require the code hacking the library path (which I had issues with). I am setting LD_LIBRARY_PATH instead. I guess to make the loading module work for the model-specific CRF engine configuration, we could have a method that gives you all of the configured CRF engines for any model, e.g. wapiti and delft, and then iterate through that list when initializing them (I just wouldn't be able to test it well).
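
A rough sketch of that idea (class and key names are illustrative): collect every engine named in the configuration, the global default plus any per-model override, so that each backend's native libraries can be initialized up front:

    import java.util.LinkedHashSet;
    import java.util.Properties;
    import java.util.Set;

    // Sketch only: gather the distinct set of configured CRF engines across all models.
    public class ConfiguredEnginesSketch {

        static Set<String> configuredEngines(Properties props) {
            Set<String> engines = new LinkedHashSet<>();
            engines.add(props.getProperty("grobid.crf.engine", "wapiti")); // default engine
            for (String key : props.stringPropertyNames()) {
                if (key.startsWith("grobid.crf.engine.")) {                // per-model overrides
                    engines.add(props.getProperty(key));
                }
            }
            return engines;
        }

        public static void main(String[] args) {
            Properties props = new Properties();
            props.setProperty("grobid.crf.engine", "wapiti");
            props.setProperty("grobid.crf.engine.citation", "delft");
            System.out.println(configuredEngines(props)); // [wapiti, delft]
        }
    }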

@lfoppiano
Collaborator

I've updated the documentation a bit (and removed the useless conda requirement files) in b0560b5.

@lfoppiano
Collaborator

I think I've fixed the first problem. It's in branch https://github.com/kermitt2/grobid/tree/bugfix/problem-pr-559
