Add REST service for mixed matching approach #12

Closed
kermitt2 opened this issue Jan 7, 2019 · 10 comments

@kermitt2
Owner

kermitt2 commented Jan 7, 2019

From GROBID we usually have both the raw full reference string and the parsed extracted fields (authors and title).

In addition to what is currently supported, it would be nice for the matching service to accept as input:

  • the raw full reference string, plus the author list (all authors separated by some delimiter, not just the first author), plus the title
  • this would allow adding an optional post-validation using all the authors + title
  • this would also support the following approach: match first with first author last name + title (3-4 times faster), and only if that fails, or if author + title metadata are not available, try matching with the raw full reference string (more expensive)
  • this could be exploited for more precise matching after a search-based blocking (sometimes two versions of the same paper, one from a conference and a more complete one in a journal issue, have the same title but one more author, so the full list of authors is useful for selecting the best candidate)
  • finally, for queries with a raw reference string only, we could optionally integrate into glutton the parsing of the reference with GROBID, and add the post-validation in all cases. So even with a raw reference, we would have a reliable post-validation/selection integrated.

Post-validation is necessary to avoid false positives due to the search-based step (which is normally only the blocking step).
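To make the post-validation idea concrete, here is a minimal sketch in Python. It is not glutton's implementation: the record layout (`title`, `authors`), the function names, the use of `difflib` for fuzzy comparison and the 0.8 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Fuzzy similarity in [0, 1] between two strings, ignoring case and spacing."""
    norm = lambda s: " ".join(s.lower().split())
    return SequenceMatcher(None, norm(a), norm(b)).ratio()


def author_overlap(candidate_authors, query_authors, threshold=0.8):
    """Fraction of query surnames that fuzzily match some candidate surname."""
    if not query_authors:
        return 0.0
    hits = sum(
        any(similarity(q, c) >= threshold for c in candidate_authors)
        for q in query_authors
    )
    return hits / len(query_authors)


def post_validate(candidate, authors, title=None, threshold=0.8):
    """Accept a candidate only if the first author (and the title, when given) agree.

    Hypothetical rule: the thread does not specify the exact comparison used.
    """
    if author_overlap(candidate["authors"], authors[:1], threshold) < 1.0:
        return False
    return title is None or similarity(candidate["title"], title) >= threshold


def select_best(candidates, authors, title=None):
    """Among validated candidates, prefer the one whose full author list agrees best."""
    valid = [c for c in candidates if post_validate(c, authors, title)]
    score = lambda c: (author_overlap(c["authors"], authors)
                       + author_overlap(authors, c["authors"]))
    return max(valid, key=score, default=None)
```

With a conference version listing two authors and a journal version listing three under the same title, `select_best` picks the version whose author list matches the query in both directions, which is the case the full author list is meant to resolve.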

@kermitt2
Owner Author

I am doing a new iteration on this.

The "super" matching service would work as follows:

GET host:port/service/lookup?doi=DOI&atitle=ARTICLE_TITLE&firstAuthor=FIRST_AUTHOR_SURNAME&jtitle=JOURNAL_TITLE&volume=VOLUME&firstPage=FIRST_PAGE&biblio=BIBLIO_STRING[&postValidate=true][&parseReference=true]
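As an illustration of how a client might assemble such a request, here is a small Python sketch. The parameter names come from the URL above; the helper name `build_lookup_url`, the Python argument names and the `http://localhost:8080` base URL are assumptions, and the endpoint described here is a proposal rather than a confirmed API.

```python
from urllib.parse import urlencode


def build_lookup_url(base_url, doi=None, atitle=None, first_author=None,
                     jtitle=None, volume=None, first_page=None, biblio=None,
                     post_validate=True, parse_reference=True):
    """Build the proposed lookup URL, leaving out any field that was not provided."""
    fields = {
        "doi": doi,
        "atitle": atitle,
        "firstAuthor": first_author,
        "jtitle": jtitle,
        "volume": volume,
        "firstPage": first_page,
        "biblio": biblio,
    }
    query = {name: value for name, value in fields.items() if value is not None}
    # Both flags default to true in the proposal; they are sent explicitly here.
    query["postValidate"] = "true" if post_validate else "false"
    query["parseReference"] = "true" if parse_reference else "false"
    return f"{base_url}/service/lookup?{urlencode(query)}"


# Hypothetical host and port, for illustration only.
url = build_lookup_url("http://localhost:8080", atitle="A title", first_author="Smith")
```

`urlencode` takes care of escaping spaces and punctuation, which matters for the `biblio` parameter since raw reference strings routinely contain commas, ampersands and parentheses.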

The logic would be the following one:

  1. try to match with the DOI if present; if successful, try the post-validation (if author at least is present); if the validated DOI lookup fails go to 2, otherwise 5

  2. try to match with author and article title metadata if present; if successful, try the post-validation; if the validated lookup fails go to 3, otherwise 5

  3. try to match with journal title, volume and first page if present; if successful, try the post-validation (at least with author is present); if the validated lookup fails go to 4, otherwise 5

  4. try full string matching, if the full string is present; if successful, try the post-validation if at least the author is present; if no author is present, call the GROBID citation parser to get at least the author (and the article title if possible) and post-validate with that; if this validated match fails too go to 6, otherwise 5

  5. return 200 with the matched DOI and extended metadata

  6. return 404

Validation takes place if postValidate is true (default). GROBID parsing takes place if parseReference is true (default).
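The cascade above can be sketched as follows. This is only an illustration under stated assumptions, not the service's code: all matchers and the validator are injected as hypothetical callables, step 4 (including any GROBID parsing) is delegated to a `full_string_step` callable, and a hit in steps 1 to 3 is accepted without validation when no author is available, which is one possible reading of "if author at least is present".

```python
def lookup(query, match_doi, match_author_title, match_journal,
           full_string_step, validate, post_validate=True):
    """Return (status, record) following steps 1-6; every callable is hypothetical."""
    author = query.get("firstAuthor")
    title = query.get("atitle")

    def accepted(record):
        if record is None:
            return False
        # Assumption: without an author there is nothing to validate against,
        # so the hit is accepted as is in steps 1-3.
        if not post_validate or not author:
            return True
        return validate(record, author, title)

    # Step 1: strong identifier.
    if query.get("doi"):
        record = match_doi(query["doi"])
        if accepted(record):
            return 200, record

    # Step 2: first author + article title.
    if author and title:
        record = match_author_title(author, title)
        if accepted(record):
            return 200, record

    # Step 3: journal title + volume + first page.
    if all(query.get(k) for k in ("jtitle", "volume", "firstPage")):
        record = match_journal(query["jtitle"], query["volume"], query["firstPage"])
        if accepted(record):
            return 200, record

    # Step 4: raw reference string; validation and parsing happen inside the callable.
    if query.get("biblio"):
        record = full_string_step(query["biblio"], author, title)
        if record is not None:
            return 200, record  # step 5

    return 404, None  # step 6
```

Passing the matchers in as arguments keeps the control flow testable without a running index, and makes the ordering from cheapest to most expensive lookup easy to see.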

Ideally it should work with PMID, PMCID or ISTEX_ID instead of DOI (any "strong" identifier).

@lfoppiano lfoppiano self-assigned this Jan 26, 2019
lfoppiano added a commit that referenced this issue Jan 30, 2019
lfoppiano added a commit that referenced this issue Jan 30, 2019
@lfoppiano
Collaborator

Point 4 is not clear: what should we do if the post-validation doesn't work? Why should the result be re-validated with the author/title extracted from the full string? Shouldn't we extract them with GROBID and then try to match as in (2)?

Second, are you sure you want to change postValidate and parseReference to true by default?

@kermitt2
Owner Author

kermitt2 commented Feb 1, 2019

about point 4) I think it is clear but I can rephrase with more details:

4. try full string matching, if a full string is present.
4.1 if full string matching is successful
             if at least the author is present -> try to post-validate the matching result with this author and, if available, the title
             if author/title are not present -> call the GROBID citation parser to get at least the author (and the article title if possible) -> try to post-validate with that
4.2 if the post-validation fails, or if no author/title is available after GROBID parsing, or if full string matching initially failed, go to 6, otherwise 5

In 2) we use the provided metadata only, Grobid is not called. So it's a different setting, with higher priority as it does not involve costly Grobid parsing.
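Step 4 on its own can be sketched like this. It is a hedged illustration, not the actual code: `match_biblio`, `validate` and `parse_reference` are hypothetical callables (the last one stands in for the GROBID citation parser and is assumed to return an `(author, title)` pair), and returning a failure when parsing is disabled and no author is available is an assumption, since the thread does not cover that case.

```python
def full_string_step(biblio, author, title, match_biblio, validate,
                     parse_reference, post_validate=True, use_parser=True):
    """Return the matched record, or None when the flow should end in a 404."""
    record = match_biblio(biblio)
    if record is None:
        return None  # 4.2: full string matching failed
    if not post_validate:
        return record
    if not author:
        if not use_parser:
            return None  # assumption: cannot validate, so do not trust the hit
        # 4.1: no author provided, fall back to the (costly) citation parser.
        parsed_author, parsed_title = parse_reference(biblio)
        if not parsed_author:
            return None  # 4.2: no author available even after parsing
        author = parsed_author
        title = title or parsed_title
    # With a provided author the parser is never called, even if validation fails.
    return record if validate(record, author, title) else None
```

The parser is only reached when the caller supplied no author, which keeps the expensive parsing off the common path where metadata is already available.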

@kermitt2
Owner Author

kermitt2 commented Feb 1, 2019

Second, are you sure you want to change postValidate and parseReference to true by default?

Yes, because without post-validation the false positives due to full reference string matching will kill the accuracy. Parsing the reference is a way to allow this post-validation even when no metadata are provided, so it is something to exploit as much as possible.

lfoppiano added a commit that referenced this issue Feb 2, 2019
@lfoppiano
Collaborator

at least with author is present

What does that mean?

If the title is not present it is ignored, and if it is present it is considered for the post-validation, whereas if the author is not present the post-validation will fail? Is that correct?

@kermitt2
Owner Author

kermitt2 commented Feb 3, 2019

at least with author present

So this is only for step 4)

If the title is not provided as metadata but the author is, we can post-validate with the author alone. My observation is that this is enough in most cases, although it might require more tests. This is an acceptable trade-off (there is also a good chance that the title is not present in the raw bibliographical reference anyway). Of course, if both title and author are provided, we use both.

If the post-validation with author only passes, we are done, success.

If the post-validation with author only fails, we are also done, 404. There is no additional grobid reference parsing in this case (this is the trade-off).

@lfoppiano
Collaborator

Another question... back to point 1 (but it applies at every step).

If the author is not present and postValidate = true, it's a 404, right?

If this is the case, it might be a problem for DOI lookup, because if post-validation is not disabled it will return 404.

@kermitt2
Owner Author

kermitt2 commented Feb 3, 2019

Mmm, if the author is not present, we don't post-validate in (1); we then basically go to (3), and if we reach (4) we parse the raw reference string to try to get an author. There is no "author is not present and postValidate = true" case possible in the whole process. The availability of at least an author name (provided or extracted) is a condition for post-validating.

@kermitt2
Owner Author

So done with PR #18

@kermitt2
Owner Author

Finally, the mixed matching approach has been removed in version 0.2 because the full matching (which is more accurate) has been made much faster (almost as fast as the previous mixed matching), removing the interest of the mixed matching.
