mromanello edited this page · 28 revisions

What’s it all about?

Currently, the code is a proof of concept for the task of extracting Canonical References as defined here.

As input data I used the XML output of JSTOR’s DFR API.
This API lets you retrieve different pieces of information for a given document in the JSTOR archive. For my purposes, I call the API to get the references the document contains: JSTOR already pre-processes the text content of its documents, identifying some chunks of information.

However, the references extracted by JSTOR are a mix of canonical and modern bibliographic references. What CRefEx currently does is keep only the canonical references from the reference list returned by JSTOR’s DFR API.


The following Python snippet, for instance, fetches and prints the references for this paper using the API:

import urllib

# the API URL for the paper is missing from this page: fill it in before running
# you'll be prompted to enter your credentials
print urllib.urlopen("").read()


Canonical References are tagged with a modified version of the IOB2 notation (used to prepare corpora for the CoNLL 2003 NER shared task). Specifically, I introduced the tags B-CRF and I-CRF to mark tokens that are, respectively, at the beginning of (B-CRF) or inside (I-CRF) a Canonical Reference.
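
To make the tagging scheme concrete, here is a minimal sketch in Python 3 of how B-CRF and I-CRF labels could be assigned to a token list and then read back as reference strings. It is not taken from the CRefEx code: the function names, the example sentence, and the use of the plain IOB2 tag O for tokens outside any reference are all assumptions made for illustration.

```python
def tag_tokens(tokens, spans):
    """Label each token as O, B-CRF or I-CRF.

    spans is a list of (start, end) token indices, end exclusive,
    marking where each Canonical Reference sits in the token list.
    """
    tags = ["O"] * len(tokens)  # assumption: O marks tokens outside a reference
    for start, end in spans:
        tags[start] = "B-CRF"  # first token of the reference
        for i in range(start + 1, end):
            tags[i] = "I-CRF"  # remaining tokens of the same reference
    return tags


def extract_references(tokens, tags):
    """Rebuild the reference strings from a tagged token sequence."""
    references, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B-CRF":
            # a new reference starts; close the previous one if still open
            if current:
                references.append(" ".join(current))
            current = [token]
        elif tag == "I-CRF" and current:
            current.append(token)
        else:
            # any other tag ends the reference in progress
            if current:
                references.append(" ".join(current))
            current = []
    if current:
        references.append(" ".join(current))
    return references


# Invented example sentence containing two Canonical References.
tokens = ["See", "Hom.", "Il.", "1.1", "and", "Verg.", "Aen.", "6.851", "."]
tags = tag_tokens(tokens, [(1, 4), (5, 8)])
for token, tag in zip(tokens, tags):
    print(token, tag)
print(extract_references(tokens, tags))
```

Because every reference opens with B-CRF, two references that sit next to each other stay distinct even when no O token separates them.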



After you have installed all the dependencies, install the CRefEx Python module as usual by typing:

python install


CRefEx relies on the following external modules/libraries:

  • CRF++: a C++ implementation of CRF written by Taku Kudo, which provides a Python bridge (I’m using version 0.53).
    To install the Python module that calls CRF++, please refer to the instructions in the ./python/README file (the path is relative to CRF++’s installation folder).
  • two Python scripts for cross-validation, written by Michael G. Noll

Run the example

Before running the code, set the following variables according to your local settings:

  • EVAL_PATH: path to the directory where some data files will be written
  • LOG_FILE: path to the log file
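
As a rough illustration of the two settings above, the Python 3 sketch below defines them and makes sure the target directories exist. The values are placeholders rather than CRefEx defaults, and `prepare_paths` is a hypothetical helper that is not part of CRefEx; where the variables actually live in the code is not shown here.

```python
import os

# Placeholder values, not CRefEx defaults: adjust them to your machine.
EVAL_PATH = os.path.join(os.path.expanduser("~"), "crefex", "eval")
LOG_FILE = os.path.join(os.path.expanduser("~"), "crefex", "crefex.log")


def prepare_paths(eval_path, log_file):
    """Create the evaluation directory and the log file's parent directory."""
    os.makedirs(eval_path, exist_ok=True)
    log_dir = os.path.dirname(log_file)
    if log_dir:  # an empty string means the log goes in the current directory
        os.makedirs(log_dir, exist_ok=True)
    return eval_path, log_file
```

Calling `prepare_paths(EVAL_PATH, LOG_FILE)` once before running the example avoids failures caused by a missing output directory.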

To run the provided example, type:
