Brill Tagger on third-party corpora #27

alexrudnick · 2012-01-17T06:13:50Z

Originally reported by starkman (sourceforge.net user: starkmanuk) on
2008-03-09

Hello all, I am having a bit of trouble whilst trying to use the Brill
tagger. Below is my code:

from nltk.corpus import treebank
from nltk import tag
from nltk.tag import brill
from nltk.corpus import reader
from nltk.corpus.reader import TaggedCorpusReader

root = r'C:\lob'  # raw string so the backslash is kept literally
reader = TaggedCorpusReader(root, 'a.txt', sep='/')
tagged_data = reader.tagged_sents()
nn_cd_tagger = tag.RegexpTagger([(r'^-?[0-9]+(.[0-9]+)?$', 'CD'),
(r'.*', 'NN')])

# train is the proportion of data used in training; the rest is reserved
# for testing.

print "Loading tagged data... "

cutoff = int(num_sents*train)
training_data = tagged_data[:cutoff]
gold_data = tagged_data[cutoff:num_sents]
testing_data = [[t[0] for t in sent] for sent in gold_data]
print "Done loading."

# Start Unigram tagger

print "Training unigram tagger:"
unigram_tagger = tag.UnigramTagger(training_data,
                                   backoff=nn_cd_tagger)
if gold_data:
    print " [accuracy: %f]" % tag.accuracy(unigram_tagger, gold_data)

# Start Bigram tagger

print "Training bigram tagger:"
bigram_tagger = tag.BigramTagger(training_data,
                                 backoff=unigram_tagger)
if gold_data:
    print " [accuracy: %f]" % tag.accuracy(bigram_tagger, gold_data)

# Brill tagger

templates = [
    brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (1,1)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (2,2)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (1,2)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (1,3)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (1,1)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (2,2)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (1,2)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (1,3)),
    brill.ProximateTokensTemplate(brill.ProximateTagsRule, (-1, -1), (1,1)),
    brill.ProximateTokensTemplate(brill.ProximateWordsRule, (-1, -1), (1,1)),
]
trainer = brill.FastBrillTaggerTrainer(bigram_tagger, templates, trace)
# trainer = brill.BrillTaggerTrainer(u, templates, trace)
brill_tagger = trainer.train(training_data, max_rules, min_score)

It is of course a modification of the example Brill tagger from the API. I
receive an error when it comes to computing the last line, brill_tagger =
trainer.train(training_data, max_rules, min_score). This is the error:

Traceback (most recent call last):
  File "<pyshell#1>", line 1, in <module>
    brilltagger()
  File "<pyshell#0>", line 80, in brilltagger
    brill_tagger = trainer.train(training_data, max_rules, min_score)
  File "C:\Python25\Lib\site-packages\nltk\tag\brill.py", line 869, in train
    rule = self._best_rule(train_sents, test_sents, min_score)
  File "C:\Python25\Lib\site-packages\nltk\tag\brill.py", line 1008, in _best_rule
    max_score = max(self._rules_by_score)
ValueError: max() arg is an empty sequence

I think I can decipher the error (i.e. the argument to max() is empty), but
I am unsure why this is occurring. I am using sections from the LOB corpus;
that part of the corpus has been modified so that NLTK can read each word and
its associated tag. This seems to work, and I can print both the word and
its tag correctly, as with the other predefined corpora that are bundled with
NLTK. Is it possible that the text I am passing to the Brill tagger simply
has no rules?

Kind Regards,
David

Migrated from http://code.google.com/p/nltk/issues/detail?id=67
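
One way to investigate the question in the report: in transformation-based learning, candidate rules are generated from tokens that the initial tagger gets wrong on the training sentences. If the training slice is empty, or the initial tagger already reproduces every training tag, there may be nothing left to score. The following is a minimal diagnostic sketch in plain Python with no NLTK calls. `count_initial_errors` and the `tag_sentence` callable are hypothetical names introduced for illustration, and whether this is the actual cause of the traceback was not confirmed in this thread.

```python
def count_initial_errors(training_data, tag_sentence):
    """Return (num_tokens, num_errors) for an initial tagger on its own training data.

    training_data: list of sentences, each a list of (word, tag) pairs.
    tag_sentence:  hypothetical callable mapping a list of words to a list
                   of (word, tag) pairs (an assumption for this sketch).
    """
    num_tokens = 0
    num_errors = 0
    for sent in training_data:
        words = [word for (word, _tag) in sent]
        predicted = tag_sentence(words)
        for (_word, gold_tag), (_w, guess_tag) in zip(sent, predicted):
            num_tokens += 1
            if gold_tag != guess_tag:
                num_errors += 1
    return num_tokens, num_errors


# Toy stand-in for an initial tagger: tags every word as 'NN'.
def tag_all_nn(words):
    return [(word, 'NN') for word in words]


toy_training = [[('the', 'AT'), ('dog', 'NN')], [('cats', 'NNS')]]
tokens, errors = count_initial_errors(toy_training, tag_all_nn)
print("tokens=%d errors=%d" % (tokens, errors))
```

With the code from the report, `tag_sentence` would presumably be `bigram_tagger.tag`. A result of zero tokens would point at the cutoff computation, and zero errors would suggest the initial tagger already fits the training slice.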

earlier comments

paulbone.au said, at 2008-11-06T07:21:14.000Z:

David/starkman,

If you have a copy of the LOB corpus, particularly the a.txt file, could you provide
it so I can test this against it?

Thanks.

paulbone.au said, at 2008-11-06T07:33:20.000Z:

I've committed a fix for this but am unable to test it until I have a failing test case. I'll leave the bug open.
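
The fix itself is not shown in this thread. As an illustration only, one common way to avoid `ValueError: max() arg is an empty sequence` is to check the container before calling `max()` and treat the empty case as "no rule found". The sketch below uses a hypothetical `best_score` helper and a plain dict standing in for `_rules_by_score`; it is not NLTK's code and makes no claim about what was actually committed.

```python
def best_score(rules_by_score, min_score):
    """Return the highest score that is >= min_score, or None if there is none.

    rules_by_score: dict mapping an integer score to the rules with that
                    score (a stand-in for the trainer's internal mapping).
    """
    if not rules_by_score:
        # Nothing to choose from: max() would raise ValueError here.
        return None
    top = max(rules_by_score)
    if top < min_score:
        return None
    return top


print(best_score({}, 2))
print(best_score({3: ['rule-a'], 1: ['rule-b']}, 2))
```

A caller would then stop training when the helper returns None rather than letting the exception propagate.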

StevenBird1 said, at 2009-01-08T23:35:27.000Z:

Wrote to starkmanuk to request data.

stevenbird · 2014-04-20T22:52:45Z

Closing, stale. Cf #555

stevenbird closed this as completed Apr 20, 2014
