This repository has been archived by the owner on Nov 14, 2019. It is now read-only.
0 parents
commit 356cbfe
Showing 16 changed files with 22,233 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
*~ | ||
*pyc | ||
*so | ||
*.model | ||
build | ||
src/build | ||
output/* |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
all: crfsuite.so | ||
|
||
crfsuite.so: src/crfsuite.pyx | ||
	python setup.py build_ext --inplace |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,56 @@ | ||
========== | ||
pyCRFsuite | ||
========== | ||
|
||
This is a Python wrapper for crfsuite, a fast implementation of Conditional | ||
Random Fields. | ||
|
||
Authors | ||
------- | ||
|
||
- Jake Vanderplas <vanderplas@astro.washington.edu> | ||
|
||
|
||
Installation | ||
------------ | ||
|
||
Currently the package is set up only for in-place installation. It requires | ||
the ``crfsuite`` library to be installed: see | ||
http://www.chokkan.org/software/crfsuite/ | ||
|
||
Once this is installed, simply type ``make`` in the top-level directory. | ||
|
||
Testing | ||
------- | ||
There are a few basic test scripts in the top-level directory. ``test.py`` will | ||
read a small dataset from ``example_files``, then run a basic training and | ||
tagging operation. ``crfsuite_test.csh`` runs the same operation using the | ||
command-line frontend provided by crfsuite. To compare the results of the | ||
training and tagging, run ``compare_output.csh``. This will print all the | ||
places where the tagging results differ. | ||
|
||
TODO | ||
---- | ||
This is still a very incomplete wrapper. Search for ``TODO`` within | ||
``src/crfsuite.pyx`` to see some issues that need to be addressed. | ||
|
||
Issues | ||
------ | ||
There are a few 'features' in crfsuite that make efficient Python wrapping | ||
difficult. | ||
|
||
- **Model File Output**: as currently written, crfsuite writes the result of | ||
  training directly to a binary file. The library is not configured to | ||
  write the model to memory. This means that a Python wrapper must | ||
  write the model to disk, then read the model into memory before performing | ||
  any tagging operation. It would be better if the model could be saved | ||
  directly to an in-memory CRFsuite model structure, though when dealing with | ||
  the very large datasets for which crfsuite is designed, it's clear why the | ||
  author made the choice he did. | ||
|
||
- **Memory mapping**: as currently written, crfsuite data is not stored in | ||
  contiguous arrays. This means that there is no way to map a crfsuite data | ||
  structure to a numpy array, and any input to crfsuite will need to be | ||
  copied in memory. Addressing this would require significant upstream | ||
  changes: the ``crfsuite_item_t`` structure would have to use an array of | ||
  floats and an array of ints rather than an array of attribute structures. |
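The memory-mapping issue can be illustrated without crfsuite itself. The following is a minimal sketch, assuming an item is represented in Python as a list of (attribute id, value) pairs; `Attribute` and `item_to_arrays` are hypothetical names used only for illustration and are not part of this wrapper or of crfsuite. It shows the copy that is needed to turn an array of structures into two contiguous buffers.

```python
from array import array
from collections import namedtuple

# Hypothetical stand-in for one attribute structure: an integer id and a value.
Attribute = namedtuple("Attribute", ["aid", "value"])


def item_to_arrays(item):
    """Copy an array-of-structures item into two contiguous typed arrays."""
    aids = array("i", (a.aid for a in item))      # contiguous C ints
    values = array("d", (a.value for a in item))  # contiguous C doubles
    return aids, values


# Illustrative data only.
item = [Attribute(3, 1.0), Attribute(7, 0.5), Attribute(12, 2.0)]
aids, values = item_to_arrays(item)

# Each typed array exposes the buffer protocol, so it can be viewed
# without a further copy.
view = memoryview(values)
print(list(aids), list(values), view.contiguous)
```

Buffers laid out this way could be wrapped by numpy without copying (for example with `numpy.frombuffer`), which is the kind of mapping the attribute-structure layout does not allow.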
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
csh crfsuite_test.csh > output/out_c_full.txt | ||
tail -n 1026 output/out_c_full.txt > output/out_c.txt | ||
|
||
python test.py > output/out_python_full.txt | ||
tail -n 1026 output/out_python_full.txt > output/out_python.txt | ||
|
||
echo "Differences between crfsuite tagging and python tagging:" | ||
echo "" | ||
diff output/out_c.txt output/out_python.txt | ||
echo "--------------------------------------------------------" |
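The comparison performed by this script can also be sketched in Python, which avoids shell redirection entirely. This is a minimal sketch, not part of the commit: `tail_lines` and `tagging_differences` are hypothetical helper names, the sample strings are illustrative data rather than real crfsuite output, and the default of 1026 lines is taken from the script above.

```python
import difflib


def tail_lines(text, n):
    """Return the last n lines of text, like `tail -n`."""
    return text.splitlines()[-n:] if n > 0 else []


def tagging_differences(c_output, python_output, n=1026):
    """Diff the last n lines of the two tagging outputs."""
    a = tail_lines(c_output, n)
    b = tail_lines(python_output, n)
    return list(
        difflib.unified_diff(a, b, "out_c.txt", "out_python.txt", lineterm="")
    )


# Illustrative data: one log line followed by three tag lines.
c_out = "training log\nB-NP\nI-NP\nO\n"
py_out = "other log\nB-NP\nB-VP\nO\n"
diff = tagging_differences(c_out, py_out, n=3)
print("\n".join(diff) if diff else "no differences")
```

Only the trailing lines are compared, so differences in the training log that precedes the tagging output are ignored, as in the shell version.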
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
crfsuite learn --model='output/model.dat' 'example_files/train_small.txt' | ||
crfsuite tag -r -p -i --model='output/model.dat' 'example_files/test_small.txt' |
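For driving the same two commands from Python, a minimal sketch follows. The `crfsuite` subcommands, flags and paths are copied from the script above; the function names `learn_command`, `tag_command` and `run_pipeline` are hypothetical, and the sketch assumes the `crfsuite` executable is on the `PATH`. The module-level code only prints the commands and does not run them.

```python
import subprocess


def learn_command(model_path, train_path):
    """Build the argument list for the training step."""
    return ["crfsuite", "learn", "--model=" + model_path, train_path]


def tag_command(model_path, test_path):
    """Build the argument list for the tagging step."""
    return ["crfsuite", "tag", "-r", "-p", "-i", "--model=" + model_path, test_path]


def run_pipeline(model_path, train_path, test_path):
    """Train, then tag; return the tagger's standard output as text."""
    subprocess.run(learn_command(model_path, train_path), check=True)
    result = subprocess.run(
        tag_command(model_path, test_path),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


# Show the commands without executing them.
print(" ".join(learn_command("output/model.dat", "example_files/train_small.txt")))
print(" ".join(tag_command("output/model.dat", "example_files/test_small.txt")))
```

Passing the arguments as a list avoids shell quoting, so the single quotes used in the csh script are not needed.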
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
These example files are taken from roughly the first 1000 lines of the examples | ||
packaged with CRFsuite. They allow tests to be run very quickly. |
Large diffs are not rendered by default.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,65 @@ | ||
/* | ||
* A parser for Item With Attributes (IWA) format. | ||
* | ||
* Copyright (c) 2007-2010, Naoaki Okazaki | ||
* All rights reserved. | ||
* | ||
* Redistribution and use in source and binary forms, with or without | ||
* modification, are permitted provided that the following conditions are met: | ||
* * Redistributions of source code must retain the above copyright | ||
* notice, this list of conditions and the following disclaimer. | ||
* * Redistributions in binary form must reproduce the above copyright | ||
* notice, this list of conditions and the following disclaimer in the | ||
* documentation and/or other materials provided with the distribution. | ||
* * Neither the names of the authors nor the names of its contributors | ||
* may be used to endorse or promote products derived from this | ||
* software without specific prior written permission. | ||
* | ||
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS | ||
* "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT | ||
* LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR | ||
* A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER | ||
* OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, | ||
* EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, | ||
* PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR | ||
* PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF | ||
* LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING | ||
* NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS | ||
* SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | ||
*/ | ||
|
||
/* $Id$ */ | ||
|
||
#ifndef __IWA_H__ | ||
#define __IWA_H__ | ||
|
||
#ifdef __cplusplus | ||
extern "C" { | ||
#endif/*__cplusplus*/ | ||
|
||
typedef struct tag_iwa iwa_t; | ||
|
||
enum { | ||
    IWA_NONE, | ||
    IWA_EOF, | ||
    IWA_BOI, | ||
    IWA_EOI, | ||
    IWA_ITEM, | ||
}; | ||
|
||
struct tag_iwa_token { | ||
    int type; | ||
    const char *attr; | ||
    const char *value; | ||
}; | ||
typedef struct tag_iwa_token iwa_token_t; | ||
|
||
iwa_t* iwa_reader(FILE *fp); | ||
const iwa_token_t* iwa_read(iwa_t* iwa); | ||
void iwa_delete(iwa_t* iwa); | ||
|
||
#ifdef __cplusplus | ||
} | ||
#endif/*__cplusplus*/ | ||
|
||
#endif/*__IWA_H__*/ |
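The header declares only the token types and the reader functions, so the token stream itself is not specified here. The following Python sketch shows one plausible reading of it, under stated assumptions: fields are tab-separated, a field is split at its first colon into `attr` and `value`, each non-blank line produces a begin-of-item token, one item token per field, and an end-of-item token, a blank line produces `NONE`, and the stream ends with `EOF`. The names `Iwa` and `iwa_tokens` are hypothetical, and escaped colons are not handled.

```python
from enum import IntEnum


class Iwa(IntEnum):
    # Same order as the anonymous enum in the header, so values start at 0.
    NONE = 0
    EOF = 1
    BOI = 2
    EOI = 3
    ITEM = 4


def iwa_tokens(lines):
    """Yield (type, attr, value) tuples for an iterable of text lines.

    Assumed semantics, not stated in the header: see the lead-in above.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            yield (Iwa.NONE, None, None)
            continue
        yield (Iwa.BOI, None, None)
        for field in line.split("\t"):
            # A field without a colon gets an empty value.
            attr, _, value = field.partition(":")
            yield (Iwa.ITEM, attr, value)
        yield (Iwa.EOI, None, None)
    yield (Iwa.EOF, None, None)


# Illustrative input: one item line followed by a blank line.
sample = ["B-NP\tw[0]=the\tbias:1.0\n", "\n"]
tokens = list(iwa_tokens(sample))
for tok in tokens:
    print(tok[0].name, tok[1], tok[2])
```

A generator mirrors the pull-style `iwa_read` call in the header: the consumer asks for one token at a time and stops when it sees the end-of-file token.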
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
from distutils.core import setup | ||
from distutils.extension import Extension | ||
from Cython.Distutils import build_ext | ||
|
||
setup( | ||
    cmdclass = {'build_ext': build_ext}, | ||
    ext_modules = [Extension("crfsuite", ["src/crfsuite.pyx", | ||
                                          'src/iwa.c'], | ||
                             libraries = ["crfsuite"], | ||
                             library_dirs = ["/usr/local/lib"], | ||
                             include_dirs = ["include"])] | ||
) |