Currently, the code is a proof of concept for the task of extracting Canonical References as defined here.
As input data I used the XML output of JSTOR’s DFR API.
This API lets you retrieve different pieces of information about a given document in the JSTOR archive. For my purposes, I call the API to get the references the document contains. JSTOR already pre-processes the text content of its documents, identifying some chunks of information.
However, the references as extracted by JSTOR are a mix of canonical and modern bibliographic references. What CRefEx currently does is keep only the canonical references from the reference list returned by JSTOR’s DFR API, filtering out the rest.
The following Python snippet, for instance, fetches and prints the references for this paper using the API:
import urllib

# you'll be prompted to enter your credentials
print urllib.urlopen("http://dfr.jstor.org/resource/10.2307/40236128?view=references").read()
Canonical References are tagged using a modified version of the IOB2 notation (used to prepare corpora for the CoNLL-2003 NER shared task). Specifically, I introduced the tags B-CRF and I-CRF to mark tokens that are, respectively, at the beginning of (B-CRF) or inside (I-CRF) a Canonical Reference.
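To make the tagging scheme concrete, here is a minimal sketch of how a token sequence could be labelled with B-CRF and I-CRF. It is not taken from the CRefEx code: the function names `iob_tags` and `extract_refs`, the `(start, end)` span format and the sample sentence are assumptions made for illustration, and `O` is the standard IOB2 tag for tokens outside any reference.

```python
def iob_tags(tokens, spans, label="CRF"):
    """Return one IOB2-style tag per token.

    spans is a list of (start, end) token indexes, end exclusive,
    each covering one canonical reference (hypothetical input format).
    """
    tags = ["O"] * len(tokens)
    for start, end in spans:
        tags[start] = "B-" + label          # first token of the reference
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # remaining tokens of the reference
    return tags


def extract_refs(tokens, tags, label="CRF"):
    """Rebuild the reference strings from a tagged token sequence."""
    refs, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B-" + label:
            if current:                     # close the previous reference
                refs.append(" ".join(current))
            current = [token]
        elif tag == "I-" + label and current:
            current.append(token)
        else:
            if current:
                refs.append(" ".join(current))
            current = []
    if current:
        refs.append(" ".join(current))
    return refs


# Made-up sample sentence containing two canonical references.
tokens = ["See", "Hom.", "Il.", "1.1", "and", "Verg.", "Aen.", "2.3", "."]
tags = iob_tags(tokens, [(1, 4), (5, 8)])
for token, tag in zip(tokens, tags):
    print("%s\t%s" % (token, tag))
print(extract_refs(tokens, tags))
```

Because every reference opens with its own B-CRF tag, two adjacent references stay distinguishable even when no O token separates them.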
After you have installed all the dependencies, install the CRefEx Python module as usual by typing:
python setup.py install
CRefEx relies on the following external modules/libraries:
Before running the code, set the following variables according to your local settings:
To run the provided example, type: