mromanello edited this page · 28 revisions

What’s it all about?

Currently, the code is a proof of concept for the task of extracting Canonical References as defined here.

As input data I used the XML output of JSTOR’s DFR API.
This API returns different pieces of information about a given document in the JSTOR archive; for my purposes I call it to get the references contained in the document. JSTOR already pre-processes the text content of its documents, identifying some chunks of information.





After you have installed all the dependencies, install the CRefEx Python module as usual by typing:

python install


CRefEx relies on the following external modules/libraries:

  • CRF++: a C++ implementation of CRFs (conditional random fields) written by Taku Kudo (I’m using version 0.53), which provides a Python bridge.
    To install the Python module that calls CRF++, please refer to the instructions in the ./python/README file (the path is relative to CRF++’s installation folder)
  • two Python scripts for cross-validation, written by Michael G. Noll
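
For orientation, CRF++ reads its training and test data as plain text: one token per line, whitespace-separated columns with the label in the last column, and an empty line between sequences. The sketch below writes tokens into that layout. The function name `to_crfpp_format`, the feature column and the labels `B-REF`/`I-REF`/`O` are illustrative assumptions, not names taken from CRefEx.

```python
def to_crfpp_format(sequences):
    """Serialize sequences of (token, feature..., label) rows for CRF++.

    Each row becomes one tab-separated line; sequences are separated
    by an empty line, as CRF++ expects.
    """
    blocks = []
    for seq in sequences:
        lines = ["\t".join(str(col) for col in row) for row in seq]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


# Hypothetical data: token, a made-up shape feature, and the label.
example = [
    [("Hom.", "CAP", "B-REF"), ("Il.", "CAP", "I-REF"), ("1.1", "NUM", "I-REF")],
    [("see", "LOW", "O"), ("also", "LOW", "O")],
]

print(to_crfpp_format(example))
```

Every row must carry the same number of columns, otherwise CRF++ rejects the file.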

Run the example

Before running the code, set the following variables according to your local settings:

  • EVAL_PATH: path to the directory where some data files will be written
  • LOG_FILE: path to the log file
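
As a rough illustration of the two settings above, the following sketch assigns them and makes sure the locations exist before anything is written. It assumes Python 3; the concrete paths, the helper `prepare_environment` and the logger name are hypothetical, and where CRefEx itself reads these variables is not shown here.

```python
import logging
import os
import tempfile

# Hypothetical values: replace them with paths that suit your machine.
EVAL_PATH = os.path.join(tempfile.gettempdir(), "crefex_eval")
LOG_FILE = os.path.join(EVAL_PATH, "crefex.log")


def prepare_environment(eval_path, log_file):
    """Create the output directory and return a logger writing to log_file."""
    os.makedirs(eval_path, exist_ok=True)
    os.makedirs(os.path.dirname(log_file) or ".", exist_ok=True)
    logger = logging.getLogger("crefex_example")  # illustrative logger name
    logger.setLevel(logging.INFO)
    logger.addHandler(logging.FileHandler(log_file))
    return logger


logger = prepare_environment(EVAL_PATH, LOG_FILE)
logger.info("Data files will be written to %s", EVAL_PATH)
```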

To run the example provided, type:
