Results seem to be vastly improved by first parsing citation with GROBID #25

Closed
bfirsh opened this issue Apr 27, 2019 · 4 comments

bfirsh commented Apr 27, 2019

I think this is what you're getting at in #13 and #21, but I figure it'd be useful to share my specific experience with using biblio-glutton.

I'm using biblio-glutton to add citation links to https://www.arxiv-vanity.com/. It works really well -- thank you!

The implementation is a bit weird though. I thought I would just be able to call /service/lookup?biblio=... and be done with it. That gave me very few positive results though -- I tried several papers and got perhaps 5 positive matches in total.

I found I got much better results (most citations working, almost all on some papers) by first parsing the citation with GROBID then passing atitle and firstAuthor to biblio-glutton. I am getting almost perfect results -- I haven't seen a false positive yet.

Here's the lookup code I'm using, if you're interested. You can see the high-level logic at the bottom where it first does a grobid call, then a biblio-glutton call.

This simple improvement makes me wonder -- why doesn't biblio-glutton do this internally? Am I doing something stupid?
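
For readers following along, the two-step flow described above (parse with GROBID, then look up by atitle and firstAuthor) can be sketched roughly as follows. This is not the code linked in the comment. It assumes GROBID's /api/processCitation endpoint on localhost:8070 and a biblio-glutton instance on localhost:8080; both URLs and all helper names (parse_grobid_tei, build_lookup_url, resolve) are hypothetical choices for illustration.

```python
import json
import urllib.error
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Assumed local endpoints; adjust to your own deployment.
GROBID_URL = "http://localhost:8070/api/processCitation"
GLUTTON_URL = "http://localhost:8080/service/lookup"


def _local(tag):
    """Strip any XML namespace so matching works with or without the TEI namespace."""
    return tag.rsplit("}", 1)[-1]


def _first(element, name):
    """Return the first descendant (or the element itself) with the given local name."""
    return next((el for el in element.iter() if _local(el.tag) == name), None)


def parse_grobid_tei(tei_xml):
    """Extract (article title, first author surname) from a TEI biblStruct."""
    analytic = _first(ET.fromstring(tei_xml), "analytic")
    if analytic is None:
        return None, None
    title = _first(analytic, "title")
    surname = _first(analytic, "surname")
    atitle = "".join(title.itertext()).strip() if title is not None else None
    author = "".join(surname.itertext()).strip() if surname is not None else None
    return atitle, author


def build_lookup_url(atitle, first_author, base=GLUTTON_URL):
    """Build a lookup URL using the atitle and firstAuthor parameters."""
    query = urllib.parse.urlencode({"atitle": atitle, "firstAuthor": first_author})
    return f"{base}?{query}"


def resolve(raw_citation):
    """Parse a raw citation with GROBID, then query biblio-glutton. Needs both services running."""
    data = urllib.parse.urlencode({"citations": raw_citation}).encode("utf-8")
    with urllib.request.urlopen(GROBID_URL, data=data) as resp:  # POST, form-encoded
        tei_xml = resp.read().decode("utf-8")
    atitle, author = parse_grobid_tei(tei_xml)
    if not atitle or not author:
        return None  # nothing usable to look up
    try:
        with urllib.request.urlopen(build_lookup_url(atitle, author)) as resp:
            return json.loads(resp.read())
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None  # no match found
        raise
```

Only resolve() touches the network; the parsing and URL-building helpers can be checked offline against a saved GROBID response.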

kermitt2 commented

Hello @bfirsh !

Thanks a lot for the interest in glutton!

These are all good remarks and from this I think I need to iterate on the documentation to make things clearer for the users.

The logic of the lookup?biblio=... is described in the following: #12 (comment)

Basically, if you only have the full raw bibliographical string, which I think is your case, the best approach is to pass it to glutton and configure a GROBID service for glutton to use.

In the config file biblio-glutton/lookup/data/config/config.yml:

# Grobid URL
grobidPath: http://localhost:8070/api 

OK, it should be called grobidServiceURL or something like that, but if it points to a GROBID instance, glutton does exactly what you are suggesting: it calls GROBID to parse the raw bibliographical string and exploits 1) the parsed metadata for faster matching and 2) atitle and firstAuthor automatically for post-validating the best match(es).

This way of calling glutton is close to what you are doing in your code, but glutton might use more metadata to speed up the matching (look-up with jtitle, volume, first_page) and avoid, as far as possible, a full reference search, which is time-consuming.

Normally, if the raw reference string is passed alone (without atitle and firstAuthor) and no GROBID service is configured and available, almost all calls will return 404 with the message "Post-validation not possible, ..." (which should explain your very bad results), because post-validation is mandatory by default. Adding the parameter &postValidate=false bypasses the post-validation, but the number of false positives will rise sharply.
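
As a rough illustration of the call described above, the following sketch builds a lookup?biblio=... URL, with an option to append postValidate=false, and maps the HTTP status to an outcome. The host and port (localhost:8080) and the helper names are assumptions for the example; only the biblio and postValidate parameters and the 404 behaviour come from the discussion.

```python
import urllib.parse

# Assumed address of a running biblio-glutton lookup service.
GLUTTON_LOOKUP = "http://localhost:8080/service/lookup"


def build_biblio_lookup_url(raw_citation, post_validate=True, base=GLUTTON_LOOKUP):
    """Build a lookup URL from a raw reference string.

    Post-validation is left at its default (on) unless post_validate is False,
    in which case postValidate=false is appended.
    """
    params = {"biblio": raw_citation}
    if not post_validate:
        params["postValidate"] = "false"
    return base + "?" + urllib.parse.urlencode(params)


def interpret_status(status_code):
    """Map an HTTP status from the lookup service to a coarse outcome label."""
    if status_code == 200:
        return "match"
    if status_code == 404:
        # Either nothing matched, or post-validation could not be carried out.
        return "no validated match"
    return "error"
```

Leaving post_validate at its default keeps the stricter behaviour; switching it off trades precision for recall, as noted above.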

bfirsh commented Apr 29, 2019

Interesting -- I am running it with grobid and I have configured the grobidPath. I know it can talk to grobid, because it refuses to resolve at all if it isn't connected to grobid (it says "you need either grobid or pass firstAuthor" or something along those lines). It's definitely calling grobid too -- I can see it in grobid's logs.

Even still, I get much better results by parsing first and then passing firstAuthor and atitle.

Maybe something is broken, but it is swallowing the real error?

kermitt2 commented May 3, 2019

I ran some tests regarding the use of GROBID by glutton and the calls are working fine: if GROBID is running and the config points to the running service, you should not get the message "Post-validation not possible, no title/first author provided for validation and GROBID is not available."

However, there was a bug in the way the GROBID response was parsed: the same parser instance was reused for each GROBID response, resulting in wrong metadata after the second call. This might be the reason for the loss of accuracy. It's fixed in the current master.
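
To illustrate the general class of bug (this is a toy model, not glutton's actual parser), the sketch below shows one way that reusing a stateful parser instance can leak fields from an earlier response into a later one, and how creating a fresh instance per response avoids it. The CitationParser class and the key=value response format are invented for the example.

```python
class CitationParser:
    """Toy stateful parser: accumulates parsed fields on the instance."""

    def __init__(self):
        self.fields = {}

    def parse(self, response):
        # Only keys present in this response are overwritten; nothing is reset,
        # so values from a previous call can survive into this result.
        for line in response.splitlines():
            key, _, value = line.partition("=")
            if value:
                self.fields[key.strip()] = value.strip()
        return dict(self.fields)


def parse_all_shared(responses):
    """Buggy pattern: one parser instance reused for every response."""
    parser = CitationParser()
    return [parser.parse(r) for r in responses]


def parse_all_fresh(responses):
    """Fixed pattern: a new parser instance for each response."""
    return [CitationParser().parse(r) for r in responses]


responses = [
    "atitle=First paper\nfirstAuthor=Smith",
    "atitle=Second paper",  # no author in this response
]
shared = parse_all_shared(responses)
fresh = parse_all_fresh(responses)
```

With the shared instance, the second result carries over the first response's firstAuthor; with fresh instances it contains only its own atitle. Stale metadata of this kind would then feed a wrong author into post-validation.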

bfirsh commented Jun 10, 2019

Looks like it was that bug. I updated to upstream master and this seems to be fixed. Thanks! :)
