Using refextract for unstructured references #156

Open
fschwenn opened this issue Jul 7, 2017 · 1 comment

Comments

@fschwenn
Contributor

fschwenn commented Jul 7, 2017

When the metadata for an article includes references, but only in unstructured form, refextract should be run in the workflow after the individual spider (pipeline.py?).

At the moment, refextract is only called if a fulltext is attached, but this won't be the case for all records. And in some cases, even with a fulltext, it is better to start from a list of individual unstructured references than from the complete PDF, where refextract first has to find such a list.

@michamos
Contributor

As I already said by email, I don't think this should be done in hepcrawl (the contents of the email follow).

There are several cases that can arise:

  1. The publisher makes available a full structured reference
  2. The publisher makes available a list of unstructured references
  3. The publisher does not make any reference available in the metadata

In case 1., we don't need refextract as Hepcrawl can do the
conversion from the publisher's reference format to ours, whereas in
case 3. there is nothing Hepcrawl can do besides providing the PDF.

So case 2. remains, but I think it would be better to have Hepcrawl
populate the raw references in the record, and run refextract (or in
the future maybe Grobid) in the workflow, as is currently done to extract
references from the PDF. We should have a task there that does reference
extraction from raw references in case they have been provided but
there are no parsed references. In this way, we cleanly separate the
task of translating between metadata formats (Hepcrawl) from the task of
parsing references (refextract), and in the future we can easily swap
refextract for Grobid when it is mature enough.
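
To make the proposed workflow task concrete, here is a minimal sketch in Python. It is not the actual workflow code: the record layout (a `references` list whose entries carry a `raw_ref` string and an optional parsed `reference` dict), the function names, and `toy_parser` are all assumptions made for illustration. The parser is passed in as a plain callable, so a wrapper around refextract could later be replaced by one around Grobid without touching the task itself.

```python
from typing import Any, Callable, Dict

# Hypothetical types: a record is a plain dict, a parser maps one raw
# reference string to a structured reference dict.
Record = Dict[str, Any]
Parser = Callable[[str], Dict[str, Any]]


def needs_parsing(entry: Dict[str, Any]) -> bool:
    """True when a raw reference was provided but no parsed one exists (case 2)."""
    return bool(entry.get("raw_ref")) and not entry.get("reference")


def parse_raw_references(record: Record, parser: Parser) -> Record:
    """Return a copy of the record with raw-only references parsed.

    Entries that are already structured (case 1) are left untouched, and a
    record without references (case 3) is returned with an empty list.
    """
    updated = []
    for entry in record.get("references", []):
        if needs_parsing(entry):
            # Copy the entry and attach the parser's output to it.
            entry = dict(entry, reference=parser(entry["raw_ref"]))
        updated.append(entry)
    return dict(record, references=updated)


def toy_parser(raw: str) -> Dict[str, Any]:
    """Stand-in for a real parser; it only stores the trimmed raw string."""
    return {"misc": [raw.strip()]}


if __name__ == "__main__":
    record = {
        "references": [
            {"raw_ref": " A. Author, Some Journal 1 (2017) 1 "},
            {"raw_ref": "B. Author, 2016", "reference": {"misc": ["kept"]}},
        ]
    }
    print(parse_raw_references(record, toy_parser))
```

Keeping the parser as an argument mirrors the separation argued for above: the spider only fills in the raw references, and the choice of parsing tool lives entirely in the workflow.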
