mromanello edited this page Jun 3, 2011 · 1 revision

What’s it all about?

Currently, the code is a proof of concept for the task of extracting Canonical References as defined here.

As input data I used the XML output of JSTOR’s DFR API.
This API returns different pieces of information for a given document in the JSTOR archive. For my purposes, I call the API to get the references the document contains, since JSTOR already pre-processes the text of its documents and identifies certain chunks of information in it.

However, the references extracted by JSTOR are a mix of canonical and modern bibliographic references. CRefEx currently filters the list returned by JSTOR’s DFR API so that only the canonical references are kept.


The following Python snippet, for instance, fetches and prints the references for this paper using the API:

import urllib
# Python 2: you'll be prompted to enter your credentials
response = urllib.urlopen("http://dfr.jstor.org/resource/10.2307/40236128?view=references")
# urlopen returns a file-like object, so read() is needed to get the XML body
print response.read()


Canonical References are tagged using a modified version of the IOB2 notation (used to prepare corpora for the CoNLL 2003 NER shared task). Specifically, I introduced the tags B-CRF and I-CRF for tokens that are, respectively, at the beginning of (B-CRF) or inside (I-CRF) a Canonical Reference.
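To make the notation concrete, here is a minimal sketch of how a tagged token sequence maps back to whole references. The example sentence, its tokenization, the `O` tag for tokens outside a reference (as in standard IOB2), and the helper name `extract_references` are all assumptions made for illustration; none of them is taken from the CRefEx code.

```python
# Illustrative data only: tokens and tags are made up for this example.
tokens = ["cf.", "Hom.", "Il.", "1.125", "and", "Verg.", "Aen.", "6.851"]
tags = ["O", "B-CRF", "I-CRF", "I-CRF", "O", "B-CRF", "I-CRF", "I-CRF"]


def extract_references(tokens, tags):
    """Group tokens into references by following the B-CRF / I-CRF tags.

    Hypothetical helper, not part of CRefEx.
    """
    references, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B-CRF":
            # A B- tag always opens a new reference, closing any open one.
            if current:
                references.append(" ".join(current))
            current = [token]
        elif tag == "I-CRF" and current:
            # An I- tag continues the reference opened by the last B- tag.
            current.append(token)
        else:
            # "O" (or a stray I-CRF with no open reference) ends the current one.
            if current:
                references.append(" ".join(current))
            current = []
    if current:
        references.append(" ".join(current))
    return references


print(extract_references(tokens, tags))
```

Because every reference starts with B-CRF, two references that directly follow one another can still be told apart, which is the main reason to prefer IOB2 over a plain inside/outside scheme.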


For each token in the training set the following set of features is extracted:

  • features concerning the presence of brackets
  • features concerning the presence of punctuation
  • features concerning the text case (e.g. upper case, lower case, etc.)
  • features concerning the presence of numbers

This list of features is not fixed. For instance, one useful addition would be a feature recording the alphabet(s) that the characters of a token belong to (e.g. Latin or Greek).
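As a rough illustration of the feature groups listed above, the sketch below computes one simple feature per group for a single token, plus the alphabet feature mentioned as a possible addition. The function name `token_features`, the feature names, and the exact definitions (for example, which characters count as brackets) are assumptions made for this example and may well differ from what CRefEx actually extracts.

```python
import string
import unicodedata

BRACKETS = "()[]{}"


def token_features(token):
    """Return a dict of simple surface features for one token.

    Hypothetical sketch: feature names and definitions are illustrative.
    """
    letters = [ch for ch in token if ch.isalpha()]

    # Text case, judged on the alphabetic characters only.
    if not letters:
        case = "no_letters"
    elif all(ch.isupper() for ch in letters):
        case = "upper"
    elif all(ch.islower() for ch in letters):
        case = "lower"
    elif letters[0].isupper() and all(ch.islower() for ch in letters[1:]):
        case = "title"
    else:
        case = "mixed"

    # Alphabet(s), read off the start of each letter's Unicode character name.
    scripts = set()
    for ch in letters:
        name = unicodedata.name(ch, "")
        if name.startswith("LATIN"):
            scripts.add("latin")
        elif name.startswith("GREEK"):
            scripts.add("greek")
        else:
            scripts.add("other")

    return {
        "has_bracket": any(ch in BRACKETS for ch in token),
        # Punctuation other than brackets (ASCII only in this sketch).
        "has_punctuation": any(
            ch in string.punctuation and ch not in BRACKETS for ch in token
        ),
        "case": case,
        "has_digit": any(ch.isdigit() for ch in token),
        "scripts": sorted(scripts),
    }


for example in ["Hom.", "(1.125)", "\u03bb\u03cc\u03b3\u03bf\u03c2"]:
    print(example, token_features(example))
```

Returning a plain dict per token keeps the feature set easy to extend: adding a new feature means adding one key, without touching the code that consumes the features.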