allow model specific crf engine configuration #559
Conversation
What about removing this constraint and leaving it to the configuration only? Grobid would use the default engine for each model, regardless of the model name, unless it's specified otherwise in the configuration...
My motivation is purely to avoid having an if-else when we can use the configuration to tune this, and eventually to avoid modifying the code if new models need to default to Wapiti, for example. As we discussed in another issue, the reference-segmenter should probably also use CRF (Wapiti) for the long sequences (see comment kermitt2/delft#97 (comment)).
```java
if (
    GrobidProperties.getGrobidCRFEngine() == GrobidCRFEngine.DELFT
    && (
        modelName.equals("fulltext")
```
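The configuration-driven alternative discussed above could be sketched roughly like this. This is a hypothetical illustration, not the PR's actual code: the class name `EngineSelector`, the method `engineFor`, and the override map are all assumptions.

```java
// A sketch (not the PR's actual code) of the configuration-driven alternative
// discussed above: look up a per-model engine override, otherwise fall back to
// the globally configured engine, so no hard-coded model-name conditions remain.
import java.util.Map;

class EngineSelector {
    enum Engine { WAPITI, DELFT }

    private final Engine defaultEngine;
    private final Map<String, Engine> overrides; // keyed by model name, e.g. "segmentation"

    EngineSelector(Engine defaultEngine, Map<String, Engine> overrides) {
        this.defaultEngine = defaultEngine;
        this.overrides = overrides;
    }

    Engine engineFor(String modelName) {
        // No if-else on specific model names: the configuration decides.
        return overrides.getOrDefault(modelName, defaultEngine);
    }
}
```

With this shape, defaulting a new model to Wapiti becomes a configuration change rather than a code change, which matches the motivation above.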
I wonder why did you remove the use of the constants?
It seems to cause some class loading issues when actually running the service (although the tests were fine). That is probably because `GrobidModels` depends on `GrobidProperties` (which then shouldn't depend on `GrobidModels`).
Yes, having the engine specified in the config file is indeed much nicer. However, it's hard to imagine having any sort of usable deep learning models for [...]. Related point: it would probably be good to move the property file to a YAML file :D
Ah yes, that's it. Even better. This would simplify the implementation, making it even more robust. 👍
👍 (maybe in a different issue, if it does not exist already)
That sounds good to me.
I don't know how easy that would be to implement. My reservation is that it adds magic and isn't obvious (I didn't quite like the condition in the code either). If the configuration told GROBID to use a DeLFT model and there is no such model, then I would consider that an error. Maybe best not to hide it. But if you feel strongly about it we could implement it. (I just wouldn't know how to at the moment)
Actually, with the dataset I am working with, and the way I auto-annotated it, I seem to get significantly better results for the header area using a DL segmentation model. I am not sure whether that is in part due to the line numbers. I haven't evaluated non-header elements yet. But this would give me a stronger reason to look into it. (I also somewhat integrated Wapiti into sciencebeam-trainer-delft, mainly so that I can train and evaluate it in the same way. I am retraining the model, which gives me more confidence that it was trained with the exact same data. I will report back after that)
I like properties files because they are simple. YAML is fine too. In any case, I am almost exclusively using environment variables for the configuration, because that is more Docker friendly and requires less customisation. I have one set of "deployment parameters" (actually helm arguments), which describe how ScienceBeam / GROBID is being deployed and configured. But there are certainly good reasons to use a configuration file for other use-cases.
Probably should also be replacing the hyphen (`-`) with an underscore (`_`).

Okay, I see now that it replaces the hyphen.
I'm testing it and it works with grobid models. After setting the following in the configuration file:
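The configuration snippet itself wasn't captured in this export. A hedged reconstruction of what such a setting could look like, assuming a global `grobid.crf.engine` property plus per-model overrides as discussed in this thread (the property names are assumptions, not verified against the final PR):

```properties
# global default engine, plus a per-model override (names assumed)
grobid.crf.engine=wapiti
grobid.crf.engine.header=delft
```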
I obtain the corresponding models loaded with Wapiti / DeLFT.
I tested the same when running a sub-module, for example I've tried to force the use of wapiti for quantities, units and values:
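The sub-module snippet also wasn't captured; a sketch under the same assumed naming scheme, forcing Wapiti for the grobid-quantities models:

```properties
# hypothetical per-model overrides for the quantities sub-module (names assumed)
grobid.crf.engine.quantities=wapiti
grobid.crf.engine.units=wapiti
grobid.crf.engine.values=wapiti
```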
and I got them loaded correctly:
I thought this was merged already in version 0.6.0, wasn't it?
It doesn't seem to have been merged yet. I got reminded of it because I didn't pay enough attention when merging with master, and it got disabled in my fork / branch.
I am only testing this feature now and it seems this is not working as expected in its current state - or I might have misunderstood something:

and then select one DeLFT model for a particular model.

The JEP native lib is not correctly loaded in this case and we have the usual error.

We have ->

while the Wapiti models are correctly loaded for the other models. Changing in

to

fixes the loading problem, but I suspect that it might break elsewhere.
Hi @kermitt2, it was changed to expect the property name to have an underscore (see #559 (comment)), i.e. it should work with:

or via setting the corresponding environment variable.
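The exact snippet wasn't captured here; based on the underscore convention described above, the working form would look something like this (the property name itself is an assumption):

```properties
# note the underscore in the model name: reference_segmenter, not reference-segmenter
grobid.crf.engine.reference_segmenter=delft
```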
Yes, for the second problem, it works with an underscore instead of the hyphen! Hard to guess when working with the property file, but it's more motivation to replace the property file with a YAML config file :)
For the first problem, you are right, it isn't handled yet. It will work if you set things up such that the code doesn't need to hack the library path (which I had issues with). I am setting [...]
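One way to avoid the code-level library-path hack mentioned above is to make the JEP native library visible to the JVM via the environment before starting GROBID. This is a hedged sketch only; the path is illustrative and depends on where `jep` is installed:

```shell
# Illustrative only: point the dynamic linker at the directory containing
# libjep.so (adjust to your Python site-packages location).
JEP_LIB_DIR="/usr/local/lib/python3.7/site-packages/jep"
export LD_LIBRARY_PATH="${JEP_LIB_DIR}${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```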
I've updated the documentation a bit (and removed the unused conda requirement files) in b0560b5.
I think I've fixed the first problem. It's in branch https://github.com/kermitt2/grobid/tree/bugfix/problem-pr-559 |
This will allow experimentation with a DeLFT segmentation model, for example.
It will still use Wapiti for the segmentation and fulltext models by default.
You could enable DeLFT for the segmentation model by adding this to the `grobid.properties`:

/cc @kermitt2 @lfoppiano
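The `grobid.properties` snippet referenced above wasn't captured in this export; a plausible sketch under the assumed per-model naming scheme:

```properties
# keep Wapiti as the global default, use DeLFT only for segmentation (names assumed)
grobid.crf.engine=wapiti
grobid.crf.engine.segmentation=delft
```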