This repository has been archived by the owner on Nov 14, 2019. It is now read-only.

Initial Commit
jakevdp committed Oct 15, 2011
0 parents commit 356cbfe
Showing 16 changed files with 22,233 additions and 0 deletions.
7 changes: 7 additions & 0 deletions .gitignore
@@ -0,0 +1,7 @@
*~
*.pyc
*.so
*.model
build
src/build
output/*
4 changes: 4 additions & 0 deletions Makefile
@@ -0,0 +1,4 @@
all: crfsuite.so

crfsuite.so: src/crfsuite.pyx
	python setup.py build_ext --inplace
56 changes: 56 additions & 0 deletions README.rst
@@ -0,0 +1,56 @@
==========
pyCRFsuite
==========

This is a Python wrapper for crfsuite, a fast implementation of Conditional
Random Fields.

Authors
-------

- Jake Vanderplas <vanderplas@astro.washington.edu>


Installation
------------

Currently the package is set up only for in-place installation. It requires
the ``crfsuite`` library to be installed: see
http://www.chokkan.org/software/crfsuite/

Once this is installed, simply type ``make`` in the top-level directory.

Testing
-------
There are a few basic test scripts in the top-level directory. ``test.py``
reads a small dataset from ``example_files``, then runs a basic training and
tagging operation. ``crfsuite_test.csh`` runs the same operation using the
command-line frontend provided by crfsuite. To compare the results of the
training and tagging, run ``compare_output.csh``, which prints every
place where the two tagging results differ.
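
The comparison step can also be sketched in Python. This is a minimal illustration, not part of the package: the function names ``tail`` and ``tagging_differences`` are hypothetical, and the toy strings stand in for the two outputs that would normally be read from files under ``output/``. The default of 1026 lines mirrors the value used in ``compare_output.csh``.

```python
import difflib


def tail(text, n):
    """Return the last n lines of text as a list of strings."""
    lines = text.splitlines()
    return lines[-n:] if n > 0 else []


def tagging_differences(c_output, python_output, n_lines=1026):
    """Diff the last n_lines of two tagging outputs in unified-diff form."""
    return list(difflib.unified_diff(
        tail(c_output, n_lines),
        tail(python_output, n_lines),
        fromfile="crfsuite",
        tofile="python",
        lineterm="",
    ))


if __name__ == "__main__":
    # Toy stand-ins for the two outputs (hypothetical data). The first
    # line of each plays the role of log output that is cut off by tail.
    c_out = "training log line\nB-NP\nI-NP\nO\n"
    py_out = "different log line\nB-NP\nB-VP\nO\n"
    for line in tagging_differences(c_out, py_out, n_lines=3):
        print(line)
```

An empty list from ``tagging_differences`` means the two tails are identical, which is the outcome the comparison is meant to check for.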

TODO
----
This is still a very incomplete wrapper. Search for ``TODO`` in
``src/crfsuite.pyx`` to see some issues that need to be addressed.

Issues
------
There are a few 'features' in crfsuite that make efficient Python wrapping
difficult.

- **Model File Output**: as currently written, crfsuite writes the result of
  training directly to a binary file. The library is not configured to
  allow writing the model to memory. This means that a Python wrapper must
  write the model to disk, then read the model back into memory before
  performing any tagging operation. It would be better if the model could be
  saved directly to a CRFsuite model structure, though given the very
  large datasets for which crfsuite is designed, it's clear why the author
  made this choice.

- **Memory mapping**: as currently written, crfsuite data is not stored in
  contiguous arrays. This means that there is no way to map a crfsuite data
  structure to a NumPy array, and any input to crfsuite will need to be
  copied in memory. Addressing this would require significant upstream
  changes: the ``crfsuite_item_t`` structure would have to use an array of
  floats and an array of ints rather than an array of attribute structures.
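
The memory-layout point can be illustrated with a small pure-Python sketch using ``ctypes``. The ``Attribute`` structure below is a hypothetical stand-in for "an array of attribute structures"; its field names and types are assumptions made for illustration, not the definitions from the crfsuite headers. It shows why two flat arrays (one of ints, one of floats) cannot simply be handed over and must instead be copied element by element into interleaved structs.

```python
import ctypes
from array import array


class Attribute(ctypes.Structure):
    """Hypothetical (attribute id, value) pair; layout is an assumption."""
    _fields_ = [("aid", ctypes.c_int), ("value", ctypes.c_double)]


def to_attribute_structs(ids, values):
    """Copy two flat sequences into one array of interleaved structs."""
    if len(ids) != len(values):
        raise ValueError("ids and values must have the same length")
    structs = (Attribute * len(ids))()
    for i, (aid, value) in enumerate(zip(ids, values)):
        structs[i].aid = aid
        structs[i].value = value
    return structs


# Two contiguous, homogeneous buffers: the layout an array library prefers.
ids = array("i", [3, 7, 42])
values = array("d", [1.0, 0.5, 2.0])

# The struct array interleaves ids and values, so neither input buffer
# can be reused in place; every element has to be copied.
structs = to_attribute_structs(ids, values)
print([(s.aid, s.value) for s in structs])
```

If the library accepted one array of ids and one array of values, the two ``array`` buffers could in principle be shared without this copy loop.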
10 changes: 10 additions & 0 deletions compare_output.csh
@@ -0,0 +1,10 @@
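# Keep only the last 1026 lines of each run's output. Piping into tail
# avoids reading from and redirecting to the same file, which would
# truncate it before tail could read it.
csh crfsuite_test.csh | tail -n 1026 > output/out_c.txt

python test.py | tail -n 1026 > output/out_python.txt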

echo "Differences between crfsuite tagging and python tagging:"
echo ""
diff output/out_c.txt output/out_python.txt
echo "--------------------------------------------------------"
2 changes: 2 additions & 0 deletions crfsuite_test.csh
@@ -0,0 +1,2 @@
crfsuite learn --model='output/model.dat' 'example_files/train_small.txt'
crfsuite tag -r -p -i --model='output/model.dat' 'example_files/test_small.txt'
2 changes: 2 additions & 0 deletions example_files/README.txt
@@ -0,0 +1,2 @@
These example files are taken from roughly the first 1000 lines of the examples
packaged with CRFsuite. They allow tests to be run very quickly.
987 changes: 987 additions & 0 deletions example_files/test_small.txt

Large diffs are not rendered by default.

980 changes: 980 additions & 0 deletions example_files/train_small.txt

Large diffs are not rendered by default.

65 changes: 65 additions & 0 deletions include/iwa.h
@@ -0,0 +1,65 @@
/*
* A parser for Item With Attributes (IWA) format.
*
* Copyright (c) 2007-2010, Naoaki Okazaki
* All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
* * Neither the names of the authors nor the names of its contributors
* may be used to endorse or promote products derived from this
* software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
* "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
* LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
* A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
* OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
* EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
* PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
* PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
* LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
* NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
* SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

/* $Id$ */

#ifndef __IWA_H__
#define __IWA_H__

#include <stdio.h>  /* FILE, used by iwa_reader() */

#ifdef __cplusplus
extern "C" {
#endif/*__cplusplus*/

typedef struct tag_iwa iwa_t;

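enum {
    IWA_NONE,   /* nothing to report, e.g. an empty line */
    IWA_EOF,    /* end of the input stream */
    IWA_BOI,    /* beginning of an item */
    IWA_EOI,    /* end of an item */
    IWA_ITEM,   /* one attribute of the current item */
};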

struct tag_iwa_token {
    int type;
    const char *attr;
    const char *value;
};
typedef struct tag_iwa_token iwa_token_t;

iwa_t* iwa_reader(FILE *fp);
const iwa_token_t* iwa_read(iwa_t* iwa);
void iwa_delete(iwa_t* iwa);

#ifdef __cplusplus
}
#endif/*__cplusplus*/

#endif/*__IWA_H__*/
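
To make the token stream declared in this header more concrete, here is a rough Python sketch of a tokenizer that emits the same five token types. It is not a port of ``iwa.c``: the input format is assumed (one item per line, tab-separated fields, an optional ``:value`` suffix on each field, blank lines between sequences), escape handling is ignored, and the generator name ``iwa_tokens`` is hypothetical.

```python
# Token types, in the same order as the enum in iwa.h.
IWA_NONE, IWA_EOF, IWA_BOI, IWA_EOI, IWA_ITEM = range(5)


def iwa_tokens(lines):
    """Yield (type, attr, value) tuples for an iterable of text lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            # Blank line: no item here (assumed to separate sequences).
            yield (IWA_NONE, None, None)
            continue
        yield (IWA_BOI, None, None)
        for field in line.split("\t"):
            # Split "attr:value" at the first colon; value is optional.
            attr, sep, value = field.partition(":")
            yield (IWA_ITEM, attr, value if sep else None)
        yield (IWA_EOI, None, None)
    yield (IWA_EOF, None, None)


if __name__ == "__main__":
    sample = ["B-NP\tw[0]=the\tpos=DT:2.0\n", "\n"]
    for token in iwa_tokens(sample):
        print(token)
```

A caller would loop over the tokens much as C code would loop over ``iwa_read()``, starting a new item on ``IWA_BOI``, collecting attributes on ``IWA_ITEM``, and closing the item on ``IWA_EOI``.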
12 changes: 12 additions & 0 deletions setup.py
@@ -0,0 +1,12 @@
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

setup(
    cmdclass={'build_ext': build_ext},
    ext_modules=[Extension("crfsuite",
                           ["src/crfsuite.pyx", "src/iwa.c"],
                           libraries=["crfsuite"],
                           library_dirs=["/usr/local/lib"],
                           include_dirs=["include"])],
)