This repository contains a Python utility for training and testing part-of-speech taggers from a provided training file.
This package assumes a training file structured according to the following rules:
- Each line contains one sentence (i.e., sentences are separated by the
newline character
\n
) - Each sentence is white-space delimited---a space should precede every punctuation mark.
- Each token in each sentence consists of three parts: the word/punctuation
mark itself, the separator character, and the associated tag. For example,
here's a breakdown of a token
Men_PN1s
:Men
-- the word as it would appear in an untagged sentence._
-- the separator character (the underbar_
is the default, but other separators may be specified. The slash/
is also common).PN1s
-- the (part-of-speech) tag, here indicating a first-person singular pronoun.
- UTF-8 is the default encoding, but other encodings may be specified.
Men_PN1s besh_NU minut_N usul_N oynidim_Vt-PST.dir-1s1 ._PUNCT
Sen_PN2si poluni_N-ACC yéding_Vt-PST.dir-2si2 dédi_Vt-PST.dir-3s2 Tursun_Npr ._PUNCT
Xinjiangda_Ntop-LOC turghan_Vi-REL.PST méning_PN1s.GEN ayalim_N-POSS1s qaytip_Vi-CNV keldi_Vdirc-PST.dir-3s2 ._PUNCT
This repository includes a sample file, uyghurtagger.train
, structured
according to the standards described above. The Uyghur sentences in this
file are taken from the public online corpus of the
Uyghur Light Verbs Project
(PI Arienne M. Dwyer, NSF BCS-1053152).
Train a Brill tagger on a provided training file:
import razmetka.tag
btt = razmetka.tag.TTBrillTaggerTrainer(file_name='uyghurtagger.train',
language='Uyghur')
btt.train(verbose=True)
Train and test Stanford log-linear taggers from a provided training file using ten-fold cross-validation:
import razmetka.tag
tst = razmetka.tag.TaggerTester(file_name='uyghurtagger.train',
language='Uyghur')
tst.split_groups()
tst.estimate_tagger_accuracy()
tst.print_results()
Repeat the entire ten-fold cross-validation process multiple times:
import razmetka.tag
razmetka.tag.repeat_tagger_tests(fname='uyghurtagger.train',
number_of_tests=3, language='Uyghur')
The Razmetka
package requires NLTK 3.0+.
- Use
with nltk.compat.TemporaryDirectory() as tempdir:
for storing the properties files and training files generated when using the Stanford NLP POS tagger.
This Python package is being written to support the work of the Annotating Turki Manuscripts Online project (Principal Investigators: Arienne M. Dwyer and C.M. Sperberg-McQueen), sponsored by the Luce Foundation. The support of the Luce Foundation is gratefully acknowledged.