Training Part of Speech Taggers
train_tagger.py script can use any corpus included with NLTK that implements a
tagged_sents() method. It can also train on the
timit corpus, which includes tagged sentences that are not available through the
Example usage can be found in Training Part of Speech Taggers with NLTK Trainer.
- Train the default sequential backoff tagger on the treebank corpus:
python train_tagger.py treebank
- To use a brill tagger with the default initial tagger:
python train_tagger.py treebank --brill
- To train a NaiveBayes classifier based tagger, without a sequential backoff tagger:
python train_tagger.py treebank --sequential '' --classifier NaiveBayes
- To train a unigram tagger:
python train_tagger.py treebank --sequential u
- To train on the switchboard corpus:
python train_tagger.py switchboard
- To train on a custom corpus, whose fileids end in ".pos", using a TaggedCorpusReader:
python train_tagger.py /path/to/corpus --reader nltk.corpus.reader.tagged.TaggedCorpusReader --fileids '.+\.pos'
The corpus path can be absolute, or relative to a nltk_data directory. For example, both
/usr/share/nltk_data/corpora/treebank/tagged will work.
- You can also restrict the files used with the
python train_tagger.py conll2000 --fileids train.txt
- For a complete list of usage options:
python train_tagger.py --help
There are also many usage examples shown in Chapter 4 of Python 3 Text Processing with NLTK 3 Cookbook.
Using a Trained Tagger
- You can use a trained tagger by loading the pickle file using nltk.data.load:
>>> import nltk.data >>> tagger = nltk.data.load("taggers/NAME_OF_TAGGER.pickle")
- Or if your tagger pickle file is not in a
nltk_datasubdirectory, you can load it with pickle.load:
>>> import pickle >>> tagger = pickle.load(open("/path/to/NAME_OF_TAGGER.pickle"))
Either method will return an object that supports the TaggerI interface.
- Once you have a
taggerobject, you can use it to tag sentences (or lists of words) with the
>>> tagger.tag(['some', 'words', 'in', 'a', 'sentence'])
tagger.tag(words) will return a list of 2-tuples of the form
All of the taggers demonstrated at text-processing.com were trained with