RefExtract: introduce author extraction mode #799

tiborsimko opened this Issue · 1 comment

Originally on 2011-08-30

RefExtract should be enhanced with an author extraction mode that behaves like giva. That is, given an input PDF file, one should be able to run:

$ refextract --extract-authors -f 1:file.pdf

and RefExtract should study the beginning portion of the file, looking for authors and affiliations, and it should output something like:

    <datafield tag="100" ind1=" " ind2=" ">
      <subfield code="a">Doe, J</subfield>
      <subfield code="u">U. Foo</subfield>
    </datafield>
    <datafield tag="700" ind1=" " ind2=" ">
      <subfield code="a">Bloggs, J</subfield>
      <subfield code="u">U. Bar</subfield>
    </datafield>
    <datafield tag="700" ind1=" " ind2=" ">
      <subfield code="a">Mustermann, E</subfield>
      <subfield code="u">U. Xyzzy</subfield>
      <subfield code="u">U. Zyxxy</subfield>
    </datafield>
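
For illustration, the layout in the sample (first author in tag 100, remaining authors in tag 700, name in `$a`, one `$u` per affiliation) could be serialised roughly as follows. This is only a sketch: the function name `authors_to_marcxml` and the input shape (a list of `(name, affiliations)` pairs) are assumptions made for the example, not part of RefExtract, and the hard part of the task (detecting authors and affiliations in the PDF text) is not shown.

```python
from xml.sax.saxutils import escape


def authors_to_marcxml(authors):
    """Render (name, affiliations) pairs as MARCXML datafields.

    Hypothetical helper: the first author is written to tag 100,
    every following author to tag 700, as in the sample output.
    """
    lines = []
    for index, (name, affiliations) in enumerate(authors):
        tag = "100" if index == 0 else "700"
        lines.append('<datafield tag="%s" ind1=" " ind2=" ">' % tag)
        # $a holds the author name; escape() protects &, < and >.
        lines.append('  <subfield code="a">%s</subfield>' % escape(name))
        # One $u subfield per affiliation, in the order given.
        for affiliation in affiliations:
            lines.append('  <subfield code="u">%s</subfield>' % escape(affiliation))
        lines.append("</datafield>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(authors_to_marcxml([
        ("Doe, J", ["U. Foo"]),
        ("Bloggs, J", ["U. Bar"]),
        ("Mustermann, E", ["U. Xyzzy", "U. Zyxxy"]),
    ]))
```

Keeping serialisation separate from detection would let the same output code serve whatever extraction heuristics are chosen.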

In other words, refextract would provide two modes: the traditional --extract-references mode, which would remain the default, and a new --extract-authors mode. Adding the latter is the task of this ticket.
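
A minimal sketch of how the two modes might be exposed on the command line, assuming an `argparse`-based front end. This is not the actual refextract option handling: `build_parser` and the `mode` and `fulltext` attribute names are hypothetical, and the `-f` value is kept as an opaque string because its format is not specified in this ticket.

```python
import argparse


def build_parser():
    """Build a hypothetical parser with two mutually exclusive modes."""
    parser = argparse.ArgumentParser(prog="refextract")
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--extract-references", dest="mode",
                      action="store_const", const="references",
                      help="extract references (default)")
    mode.add_argument("--extract-authors", dest="mode",
                      action="store_const", const="authors",
                      help="extract authors and affiliations")
    # With neither flag given, fall back to the traditional mode.
    parser.set_defaults(mode="references")
    # The value (e.g. "1:file.pdf") is passed through uninterpreted here.
    parser.add_argument("-f", dest="fulltext", metavar="SPEC")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.mode, args.fulltext)
```

Putting both flags in one mutually exclusive group makes the parser reject a command line that requests both modes at once, while existing invocations without either flag keep their current behaviour.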

(Note that this may later touch on the question of marking detected fields with provenance information in $2 and $9, so that author extraction on the back end can be automatised and refextract-found fields won't overwrite human-edited fields.)

@tiborsimko tiborsimko closed this

Originally on 2011-11-23

Merged in [b175937].
