Home

mromanello edited this page · 28 revisions

What is it all about?

Currently, the code is a proof of concept for the task of extracting Canonical References as defined here.

As input data I used the XML output of JSTOR’s DFR API.
This API returns different pieces of information about a given document in the JSTOR archive. For my purposes, I call the API to get the references the document contains: JSTOR already pre-processes the text of its documents and identifies some chunks of information in it.

However, the references extracted by JSTOR are a mix of canonical and modern bibliographic references. What CRefEx currently does is keep only the canonical references from the reference list returned by JSTOR’s DFR API.

Data

The following Python snippet, for instance, fetches and prints the references for this paper using the API:

import urllib
# you'll be prompted to enter your credentials
url = "http://dfr.jstor.org/resource/10.2307/40236128?view=references"
# urlopen returns a file-like object; read() gives the XML body (Python 2)
print urllib.urlopen(url).read()

Tagging

The tagging of Canonical References is done using a modified version of the IOB2 notation (used to prepare corpora for the CoNLL 2003 NER shared task). Specifically, I introduced the tags B-CRF and I-CRF to mark tokens that are at the beginning of (B-CRF) or inside (I-CRF) a Canonical Reference.
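To make the notation concrete, here is a minimal sketch of how a token sequence could be labelled with B-CRF and I-CRF, and how the references could be read back from the tags. It is not taken from CRefEx: the helper names `iob2_tags` and `extract_references`, the sample sentence, and the use of `O` for tokens outside any reference (the usual IOB2 convention) are assumptions made for illustration.

```python
def iob2_tags(tokens, spans):
    """Label tokens given (start, end) token spans, with end exclusive.

    Hypothetical helper: the first token of each span gets B-CRF,
    the remaining ones I-CRF, and everything else O.
    """
    tags = ["O"] * len(tokens)
    for start, end in spans:
        tags[start] = "B-CRF"
        for i in range(start + 1, end):
            tags[i] = "I-CRF"
    return tags


def extract_references(tokens, tags):
    """Rebuild the tagged references from parallel token/tag lists."""
    refs, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B-CRF":
            # A new reference starts; close the previous one, if any.
            if current:
                refs.append(" ".join(current))
            current = [token]
        elif tag == "I-CRF" and current:
            current.append(token)
        else:
            if current:
                refs.append(" ".join(current))
            current = []
    if current:
        refs.append(" ".join(current))
    return refs


# Invented sample sentence containing two canonical references.
tokens = ["See", "Hom.", "Il.", "1.1", "and", "Verg.", "Aen.", "2.3", "."]
spans = [(1, 4), (5, 8)]
tags = iob2_tags(tokens, spans)

for token, tag in zip(tokens, tags):
    print(token, tag)
print(extract_references(tokens, tags))
```

Because every reference opens with B-CRF, two adjacent references remain distinguishable even when no untagged token separates them.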

Training

Installation

After you have installed all the dependencies, install the CRefEx Python module as usual by typing:

python setup.py install

Dependencies

CRefEx relies on the following external modules/libraries:

  • CRF++: a C++ implementation of CRF written by Taku Kudo (I’m using version 0.53), which provides a Python bridge.
    For the installation of the Python module that calls CRF++, please refer to the instructions in the ./python/README file (the path is relative to CRF++’s installation folder).
  • two Python scripts for cross-validation (partitioner.py and crossvalidationconstructor.py) written by Michael G. Noll
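
For readers unfamiliar with cross-validation, the sketch below shows the general idea of splitting a data set into k train/test partitions. It is not the code of partitioner.py or crossvalidationconstructor.py, whose interfaces are not described here; the function name `k_fold_partitions`, its parameters, and the placeholder data are assumptions made for illustration.

```python
import random


def k_fold_partitions(items, k, seed=0):
    """Yield (train, test) pairs for k-fold cross-validation.

    Hypothetical helper: items are shuffled once with a fixed seed,
    dealt into k folds, and each fold serves as the test set once.
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


# Placeholder data standing in for tagged training instances.
instances = ["instance-%d" % n for n in range(10)]
for train, test in k_fold_partitions(instances, k=5):
    print(len(train), len(test))
```

Each instance ends up in exactly one test set, so evaluation scores averaged over the k runs cover the whole data set.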

Run the example

Before running the code, set the following variables according to your local settings:

  • EVAL_PATH in eval.py: path to the directory where some data files will be written to
  • LOG_FILE: path to the log file

To run the example provided, type:

python test_crex.py