Including DocExtract #944

Closed
invenio-developers opened this Issue · 5 comments

invenio-developers
Collaborator

Originally by adeiana (@Osso) on 2012-03-13

I am asking to merge the new refextractor.

It is squashed in a single commit. You can access the branch here:
http://invenio-software.org/repo/personal/invenio-adeiana/commit/?h=refextract-merge&id=b37f6956e3472c88871ec0924456fec0283a3235

invenio-developers
Collaborator

Originally by adeiana (@Osso) on 2012-06-05

The new branch is here: adeiana/944-refextract

invenio-developers
Collaborator

Originally by Alessio Deiana alessio.deiana@cern.ch on 2012-11-27

In 9c44fff:

#CommitTicketReference repository="" revision="9c44fffa48aba22a416ebd09a41fa15a97158148"
DocExtract: new docextract and refextract modules

- Adds DocExtract as a way to easily access all text mining facilities.
  It will allow extracting references, authors, plots, etc. (closes #944)

- Moves the refextract scripts from the bibedit module into their own
  module.

- Adds a new API to use the refextract module. It includes calls to:
  - update_references(): update references by passing a record id;

  - extract_references_from_*(): extract and parse references from
    file/url/record id/string;

  - new function that returns the marcxml of the record with
    updated references;

  - new function to check if a record has a fulltext (pdf) attached.

- Refextract filters out null characters from text converted from PDFs,
  as bibupload refuses them.

- Adds several updates to refextract parsing:

  - handling of JHEP-like journals, as they need the last 2 digits
    of the year prepended to the volume;

  - adds support for ISBN. They are added in a new subfield called $$i;

  - adds support for references like CERN-LHCC2003-01 by transforming
    it to CERN-LHCC-2003-01;

  - adds a new subfield <subfield code="t">Text</subfield> where
    refextract stores references to quoted text "Text".

- Adds a new option to the bibtask mode of refextract,
  "--no-overwrite", which checks each record for existing references
  before parsing it. If the record already has references, it skips it.

- Fixes recent records detection:

  - only stores last_updated when running on recent records.
    This avoids the case where parsing the most recent record via --recids n
    updates the last_updated field and makes refextract skip all records
    preceding n;

  - only updates last_id and last_updated when, respectively, the new id is bigger
    and the new last_updated is more recent. This prevents storing an old date
    when parsing old records.

- Handles the format arXiv:9910.1234 [physics.ins-det].

- Fixes numeration checking when looking for the end of references.

- Reworks xbook as a single tag: xbook used to store the book title;
  the title is now always stored in $$t.

- New authors recognized:
    - Figuera-O'Farrill
    - P. Pre'
    - Dan V. Schroeder

- Adds 9+ and w+ to report numbers format.

- Handles Sci.Eng. 450(1-3), 3, 2007 (no space after volume).

- Handles PoS LAT2007 (2007) 12 journal.

- Handles report numbers like CERN/LHCC/98-013.

- Handles urls like http://server/?q=1&w=2.

- Handles C67:674,1998 numeration.

- Adds a new way to recognize journals, which is needed when we recognize
  short titles. Often the short titles or initials of a journal conflict
  with other names,
  e.g. DAN (the journal) and Dan (a common first name).
  We handle it via precise regular expressions.

- Match Acknowldgment and Acknowledgment as end of sections.

- Format hep report numbers to hep-th/999999.

- Recognizes Roman numerals as volume numbers.

- Removes [] and () from o subfield.

- Removes extra spaces at the end of lines.

- Does not try to detect C and D as Roman numerals. It would result in some
  series letters being detected instead.

- Does not detect "B, 07" volumes anymore, since some of these come from
  distinct journals, e.g. Phys.Rev. and Phys.Rev.B.

- Format hep-ex report numbers.

- Tweaks how the beginning and the end of the references sections are found.

- Allows dashes as separators for numeration.

- REST api to run refextract.

- Defaults to inspire format on CLI when running on an inspire site.

- Handles journals with the series included in the title.

- Introduces a separator in journals kb:
  Phys.Rev.B maps to Phys.Rev.;B.

- Handles Phys.Rev.;B by splitting the B from the journal title and adding
  it in front of the volume.

- Repackages docextract and refextract in one directory.

- Search hook for searching from a reference.

- Updates binaries to use template.in for custom python binaries paths.

- Splits daemon functionality which remains in refextract
  and cli functionality which is moved to docextract.

- Recognizes publishers.

- Removes JINST from special journals.

- Moves special journals kb to a file.

- Allows extracting references from an arXiv id.

- kbs loading optimization: they are now cached in memory after being loaded.

- Create RT tickets after extracting references.

- Fixes footer removal when references section contains ")".

- Escape ibid authors for xml (was leading to bibupload failed tasks).

- Handle erratum-ibid (closes #1014)

- Transforms hep-lat-9999 to hep-lat/9999 and astro-php-09 to astro-ph/09.

- arXiv papers can have several revisions over the first week,
  and curation of these papers is delayed by that one week.
  As a result, we decided to re-extract references when an arXiv record
  is modified in its first week.
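
A few of the normalizations listed in the commit message above can be sketched in Python. This is only an illustration under stated assumptions, not the code from 9c44fff: the function names, the regular expressions and the list of hep-* categories are all hypothetical, and the real refextract rules are likely broader.

```python
import re

# Hypothetical helpers illustrating some rules from the commit message.
# None of these names come from refextract itself.

def strip_null_chars(text):
    """Drop NUL characters from converted PDF text (bibupload refuses them)."""
    return text.replace("\x00", "")

# CERN-LHCC2003-01 -> CERN-LHCC-2003-01 (assumed pattern: 4-digit year)
_CERN_LHCC = re.compile(r"\bCERN-LHCC(\d{4})-(\d+)\b")
# hep-lat-9999 -> hep-lat/9999 (assumed category list)
_HEP_DASH = re.compile(r"\b(hep-(?:th|ph|ex|lat))-(\d+)\b")

def normalize_report_number(reportnum):
    """Apply the two report number rewrites sketched above."""
    reportnum = _CERN_LHCC.sub(r"CERN-LHCC-\1-\2", reportnum)
    reportnum = _HEP_DASH.sub(r"\1/\2", reportnum)
    return reportnum

def jhep_volume(year, volume):
    """Prepend the last two digits of the year to a JHEP-like volume."""
    return "%02d%s" % (year % 100, volume)

def split_series(kb_title, volume):
    """Split 'Phys.Rev.;B' at the kb separator and move the series letter
    in front of the volume."""
    title, sep, series = kb_title.partition(";")
    if sep:
        return title, series + volume
    return kb_title, volume
```

For example, `split_series("Phys.Rev.;B", "76")` yields the title `Phys.Rev.` with volume `B76`, and a title without a separator is returned unchanged.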