Originally by adeiana (@Osso) on 2012-03-13
I am asking to merge the new refextractor.
It is squashed in a single commit. You can access the branch here:
Originally by adeiana (@Osso) on 2012-06-05
The new branch is here: adeiana/944-refextract
Originally by Alessio Deiana firstname.lastname@example.org on 2012-11-27
#CommitTicketReference repository="" revision="9c44fffa48aba22a416ebd09a41fa15a97158148"
DocExtract: new docextract and refextract modules
- Adds DocExtract as a way to easily access all text mining facilities
It will allow extracting references, authors, plots, etc. (closes #944)
- Moves the refextract scripts from the bibedit module into its own
- Adds a new api to use the refextract module. It includes calls to:
- update_references(): update references by passing a record id;
- extract_references_from_*(): extract and parse references from
- new function that returns the marcxml of the record with
- new function to check if a record has a fulltext (pdf) attached.
- Refextract filters out null characters from the text converted from PDFs,
as they are refused by bibupload.
- Adds several updates to refextract parsing:
- handling of JHEP-like journals, as they need the last 2 digits
of the year prepended to the volume;
- adds support for ISBN. They are added in a new subfield called $$i;
- adds support for references like CERN-LHCC2003-01 by transforming
it to CERN-LHCC-2003-01;
- adds a new subfield <subfield code="t">Text</subfield> where
refextract stores references to quoted text "Text".
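
Two of the parsing updates above can be illustrated with a minimal sketch. The function names jhep_volume and normalize_lhcc are hypothetical and are not taken from the refextract code; the regular expression is an assumption based only on the CERN-LHCC2003-01 example.

```python
import re


def jhep_volume(volume, year):
    """Prepend the last two digits of the year to a JHEP-like volume.

    Hypothetical helper: 'volume' is kept as a string so that leading
    zeros survive.
    """
    return "%02d%s" % (year % 100, volume)


def normalize_lhcc(report_number):
    """Insert the missing dash between 'CERN-LHCC' and the year.

    Report numbers that already contain the dash are returned unchanged.
    """
    return re.sub(r"^(CERN-LHCC)(\d{4}-\d+)$", r"\1-\2", report_number)
```

For instance, jhep_volume("05", 2003) gives "0305" and normalize_lhcc("CERN-LHCC2003-01") gives "CERN-LHCC-2003-01".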
- Adds a new option to the bibtask mode of refextract,
"--no-overwrite", which checks each record for existing references
before parsing it. If the record already has references, it skips it.
- Fixes recent records detection:
- only stores last_updated when running on recent records.
This prevents parsing the most recent reference via --recids n, updating
the last_updated field, and having refextract skip all references preceding n;
- only updates last_id and last_updated when, respectively, the new id is bigger
and the new last_updated is more recent. This prevents storing an old date
when parsing old records.
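
The "only move forward" rule described above could look roughly like this. It is a sketch, not the actual refextract code: the state dictionary and the update_progress name are assumptions made for illustration.

```python
from datetime import datetime


def update_progress(state, record_id, modified):
    """Advance last_id and last_updated independently, never backward.

    'state' is a hypothetical dict holding the values stored between runs.
    """
    if record_id > state["last_id"]:
        state["last_id"] = record_id
    if modified > state["last_updated"]:
        state["last_updated"] = modified
    return state


state = {"last_id": 10, "last_updated": datetime(2012, 11, 1)}
# An old record leaves the stored progress untouched.
update_progress(state, 5, datetime(2012, 1, 1))
```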
- Handles the format arXiv:9910.1234 [physics.ins-det].
- Fixes numeration checking when looking for the end of references.
- Reworks xbook as a single tag: xbook used to store the book title;
the title is now always stored in $$t.
- New authors recognized:
- P. Pre'
- Dan V. Schroeder
- Adds 9+ and w+ to the report number format.
- Handles Sci.Eng. 450(1-3), 3, 2007 (no space after volume).
- Handles PoS LAT2007 (2007) 12 journal.
- Handles report numbers like CERN/LHCC/98-013.
- Handles urls like http://server/?q=1&w=2.
- Handles C67:674,1998 numeration.
- Adds a new way to recognize journals, which is needed when we recognize
short titles. Often the short title or initials of a journal conflict
with other names,
e.g. DAN (the journal) and Dan (a common first name).
We handle this via precise regular expressions.
- Match Acknowldgment and Acknowledgment as end of sections.
- Format hep report numbers to hep-th/999999.
- Recognizes Roman numerals as volume numbers.
- Removes  and () from o subfield.
- Removes extra spaces at the end of lines.
- Does not try to detect C and D as Roman numerals. It would result in some
series letters being detected instead.
- Does not detect "B, 07" volumes anymore, since some of these come from journals
which are distinct: Phys.Rev. and Phys.Rev.B.
- Format hep-ex report numbers.
- Tweaks how the beginning and the end of the references sections are found.
- Allows dashes as separators for numeration.
- REST api to run refextract.
- Defaults to inspire format on CLI when running on an inspire site.
- Handles journals with the series included in the title.
- Introduces a separator in journals kb:
Phys.Rev.B maps to Phys.Rev.;B.
- Handles Phys.Rev.;B by splitting the B from the journal title and adding
it in front of the volume.
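
The separator handling described above can be sketched as follows. The split_series name is hypothetical; only the "Phys.Rev.;B" form of the kb value comes from this commit message.

```python
def split_series(kb_title, volume):
    """Split a kb title such as 'Phys.Rev.;B' at the ';' separator and
    put the series letter in front of the volume.

    Titles without a separator are returned with the volume unchanged.
    """
    title, separator, series = kb_title.partition(";")
    if separator:
        volume = series + volume
    return title, volume
```

With this sketch, split_series("Phys.Rev.;B", "76") returns ("Phys.Rev.", "B76").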
- Repackages docextract and refextract in one directory.
- Search hook for searching from a reference.
- Updates binaries to use template.in for custom python binaries paths.
- Splits daemon functionality which remains in refextract
and cli functionality which is moved to docextract.
- Recognizes publishers.
- Removes JINST from special journals.
- Moves special journals kb to a file
- Allows extracting references from an arXiv id.
- kbs loading optimization: they are now cached in memory after being loaded.
- Create RT tickets after extracting references.
- Fixes footer removal when references section contains ")".
- Escape ibid authors for xml (was leading to bibupload failed tasks).
- Handle erratum-ibid (closes #1014)
- Transforms hep-lat-9999 to hep-lat/9999 and astro-php-09 to astro-ph/09.
- arXiv papers can have several revisions during their first week,
and curation of these papers is delayed by that one week.
As a result, we decided to re-extract references when an arXiv record
is modified during its first week.
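
The first-week rule for arXiv records might be expressed like this. It is only a sketch: the should_reextract name, the use of creation and modification timestamps, and the exact 7-day window are assumptions, not details confirmed by the commit.

```python
from datetime import datetime, timedelta

# Assumed length of the "first week" window.
FIRST_WEEK = timedelta(days=7)


def should_reextract(created, modified):
    """Return True when a record was modified within a week of creation."""
    return created < modified <= created + FIRST_WEEK
```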