
Training IOB Chunkers
---------------------

The train_chunker.py script can use any corpus included with NLTK that implements a chunked_sents() method.

Train the default sequential backoff tagger based chunker on the treebank_chunk corpus::

    python train_chunker.py treebank_chunk

To train a NaiveBayes classifier based chunker::

    python train_chunker.py treebank_chunk --classifier NaiveBayes

To train on the conll2000 corpus::

    python train_chunker.py conll2000

To train on a custom corpus, whose fileids end in ".pos", using a ChunkedCorpusReader::

    python train_chunker.py /path/to/corpus --reader nltk.corpus.reader.chunked.ChunkedCorpusReader --fileids '.+\.pos'
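The default chunker treats IOB chunking as a tagging problem: a sequence of backoff taggers learns to map part-of-speech tags to IOB labels (B-NP, I-NP, O), falling back to a default when a tag is unseen. A minimal pure-Python sketch of that idea, with an invented toy corpus for illustration (the real script uses NLTK's tagger classes over much larger data)::

```python
from collections import Counter, defaultdict

def train_unigram_iob(train_sents, default="O"):
    """Learn the most frequent IOB label for each POS tag.

    train_sents: sentences as lists of (word, pos, iob) triples.
    Returns a function mapping (word, pos) pairs to IOB triples,
    backing off to `default` for unseen POS tags.
    """
    counts = defaultdict(Counter)
    for sent in train_sents:
        for _, pos, iob in sent:
            counts[pos][iob] += 1
    best = {pos: c.most_common(1)[0][0] for pos, c in counts.items()}

    def tag(tagged_words):
        return [(w, pos, best.get(pos, default)) for w, pos in tagged_words]

    return tag

# Toy training data in CoNLL-style IOB triples (invented for illustration).
train = [
    [("the", "DT", "B-NP"), ("cat", "NN", "I-NP"), ("sat", "VBD", "O")],
    [("a", "DT", "B-NP"), ("dog", "NN", "I-NP"), ("barked", "VBD", "O")],
]
chunk_tag = train_unigram_iob(train)
print(chunk_tag([("the", "DT"), ("mat", "NN")]))
# → [('the', 'DT', 'B-NP'), ('mat', 'NN', 'I-NP')]
```

The real sequential chunker chains bigram and trigram taggers in front of this unigram step, so context can override the per-tag default.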

The corpus path can be absolute, or relative to an nltk_data directory. For example, both corpora/treebank/tagged and /usr/share/nltk_data/corpora/treebank/tagged will work.

You can also restrict the files used with the --fileids option::

    python train_chunker.py conll2000 --fileids train.txt

For a complete list of usage options::

    python train_chunker.py --help

There are also many usage examples shown in Chapter 5 of Python 3 Text Processing with NLTK 3 Cookbook.

Using a Trained Chunker
-----------------------

You can use a trained chunker by loading the pickle file using nltk.data.load::

    >>> import nltk.data
    >>> tagger = nltk.data.load("chunkers/NAME_OF_CHUNKER.pickle")
Or if your chunker pickle file is not in an nltk_data subdirectory, you can load it with pickle.load (note the binary mode)::

    >>> import pickle
    >>> tagger = pickle.load(open("/path/to/NAME_OF_CHUNKER.pickle", "rb"))
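Pickle files must always be opened in binary mode, and it is tidier to close the file with a context manager. A self-contained round trip, using a stand-in dict in place of a real trained chunker just to keep the example runnable::

```python
import os
import pickle
import tempfile

# Stand-in for a trained chunker object; any picklable object
# round-trips the same way.
chunker = {"name": "demo_chunker"}

path = os.path.join(tempfile.mkdtemp(), "NAME_OF_CHUNKER.pickle")
with open(path, "wb") as f:   # write in binary mode
    pickle.dump(chunker, f)

with open(path, "rb") as f:   # read in binary mode
    tagger = pickle.load(f)

print(tagger["name"])
# → demo_chunker
```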
Either method will return an object that supports the ChunkParserI interface. But before you can use this chunker, you must have a trained part-of-speech tagger. You first use the tagger to tag a sentence, and then parse the tagged sentence with the chunker.parse(sent) method::

    >>> chunker.parse(tagged_words)

chunker.parse(tagged_words) will return a Tree whose subtrees are the chunks, and whose leaves are the original tagged words.
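That tree is equivalent to a flat list of IOB-tagged tokens. A rough pure-Python sketch of how B-/I-/O labels group leaves into chunks (NLTK's conlltags2tree builds a real Tree; this simplified version returns nested lists instead)::

```python
def iob_to_chunks(triples):
    """Group (word, pos, iob) triples into chunks.

    Returns a list whose items are either a single (word, pos) pair
    (outside any chunk) or a (chunk_type, [(word, pos), ...]) pair.
    """
    result, current = [], None
    for word, pos, iob in triples:
        if iob.startswith("B-"):
            # B- starts a new chunk of the given type.
            current = (iob[2:], [(word, pos)])
            result.append(current)
        elif iob.startswith("I-") and current is not None:
            # I- continues the currently open chunk.
            current[1].append((word, pos))
        else:
            # "O" (or a stray I- with no open chunk) sits outside chunks.
            current = None
            result.append((word, pos))
    return result

triples = [("the", "DT", "B-NP"), ("cat", "NN", "I-NP"), ("sat", "VBD", "O")]
print(iob_to_chunks(triples))
# → [('NP', [('the', 'DT'), ('cat', 'NN')]), ('sat', 'VBD')]
```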

All of the chunkers demonstrated at text-processing.com were trained with train_chunker.py.