Brill Tagger on third-party corpora #27

Closed
alexrudnick opened this issue Jan 17, 2012 · 1 comment

Comments

@alexrudnick
Member

Originally reported by starkman (sourceforge.net user: starkmanuk) on
2008-03-09

Hello all, I am having a bit of trouble trying to use the Brill
tagger. Below is my code:

from nltk.corpus import treebank
from nltk import tag
from nltk.tag import brill
from nltk.corpus import reader
from nltk.corpus.reader import TaggedCorpusReader

root = 'C:\lob'
reader = TaggedCorpusReader(root, 'a.txt', sep='/')
tagged_data = reader.tagged_sents()
nn_cd_tagger = tag.RegexpTagger([(r'^-?[0-9]+(.[0-9]+)?$', 'CD'),
                                 (r'.*', 'NN')])

# train is the proportion of data used in training; the rest is reserved
# for testing.

print "Loading tagged data... "

cutoff = int(num_sents*train)
training_data = tagged_data[:cutoff]
gold_data = tagged_data[cutoff:num_sents]
testing_data = [[t[0] for t in sent] for sent in gold_data]
print "Done loading."

# Start Unigram tagger

print "Training unigram tagger:"
unigram_tagger = tag.UnigramTagger(training_data,
                                   backoff=nn_cd_tagger)
if gold_data:
    print " [accuracy: %f]" % tag.accuracy(unigram_tagger, gold_data)

# Start Bigram tagger

print "Training bigram tagger:"
bigram_tagger = tag.BigramTagger(training_data,
                                 backoff=unigram_tagger)
if gold_data:
    print " [accuracy: %f]" % tag.accuracy(bigram_tagger, gold_data)

# Brill tagger

templates = [
    brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (1,1)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (2,2)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (1,2)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (1,3)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (1,1)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (2,2)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (1,2)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (1,3)),
    brill.ProximateTokensTemplate(brill.ProximateTagsRule, (-1, -1), (1,1)),
    brill.ProximateTokensTemplate(brill.ProximateWordsRule, (-1, -1), (1,1)),
]
trainer = brill.FastBrillTaggerTrainer(bigram_tagger, templates, trace)
# trainer = brill.BrillTaggerTrainer(u, templates, trace)
brill_tagger = trainer.train(training_data, max_rules, min_score)

It is, of course, a modification of the example Brill tagger from the API. I
receive an error when it comes to computing the last line,
brill_tagger = trainer.train(training_data, max_rules, min_score). This is the error:

Traceback (most recent call last):
  File "<pyshell#1>", line 1, in <module>
    brilltagger()
  File "<pyshell#0>", line 80, in brilltagger
    brill_tagger = trainer.train(training_data, max_rules, min_score)
  File "C:\Python25\Lib\site-packages\nltk\tag\brill.py", line 869, in train
    rule = self._best_rule(train_sents, test_sents, min_score)
  File "C:\Python25\Lib\site-packages\nltk\tag\brill.py", line 1008, in _best_rule
    max_score = max(self._rules_by_score)
ValueError: max() arg is an empty sequence

I think I can decipher the error (i.e. max() is being given an empty sequence), but I am unsure
why this is occurring. I am using sections from the LOB corpus; that part of
the corpus has been modified so that NLTK can read each word and its
associated tag. This seems to work, and I can print out both the word and
its tag correctly, as with the other predefined corpora that are bundled with
NLTK. Is it possible that the text I am passing to the Brill tagger simply
has no rules?

Kind Regards,
David

Migrated from http://code.google.com/p/nltk/issues/detail?id=67
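
One way to look into the question raised in the report is to check whether the initial tagger makes any mistakes on the training sentences at all. A Brill-style trainer proposes corrective rules from the tokens the initial tagger gets wrong, so an empty training slice (for example a cutoff of 0) or an error-free baseline would plausibly leave it with no candidate rules. That is an assumption about the cause, not a confirmed diagnosis for this data. The sketch below is plain Python and does not call NLTK; `count_baseline_errors` and `toy_tagger` are hypothetical names, and `toy_tagger` stands in for something like `bigram_tagger.tag`.

```python
def count_baseline_errors(tag_sentence, tagged_sents):
    """Return (errors, total) for an initial tagger on gold-tagged sentences.

    tag_sentence: callable mapping a list of words to a list of (word, tag) pairs
    tagged_sents: list of sentences, each a list of (word, tag) pairs
    """
    errors = 0
    total = 0
    for sent in tagged_sents:
        words = [word for word, _ in sent]
        predicted = tag_sentence(words)
        for (_, gold_tag), (_, guess_tag) in zip(sent, predicted):
            total += 1
            if gold_tag != guess_tag:
                errors += 1
    return errors, total


# Hypothetical stand-in for a trained tagger: a lookup table with an 'NN' default.
lexicon = {"the": "AT", "dog": "NN", "barks": "VBZ"}


def toy_tagger(words):
    return [(w, lexicon.get(w, "NN")) for w in words]


training_data = [[("the", "AT"), ("dog", "NN"), ("barks", "VBZ")]]
errors, total = count_baseline_errors(toy_tagger, training_data)
print("baseline errors: %d of %d tokens" % (errors, total))
```

If the count comes back as zero errors, or zero tokens, the trainer would have nothing to correct, which would be consistent with the empty sequence seen in the traceback.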


earlier comments

paulbone.au said, at 2008-11-06T07:21:14.000Z:

David/starkman,

If you have a copy of the LOB corpus, particularly the a.txt file could you provide
it so I can test this against it.

Thanks.

paulbone.au said, at 2008-11-06T07:33:20.000Z:

I've committed a fix for this but am unable to test it until I have a failing test case. I'll leave the bug open.
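
The committed fix itself is not shown in this thread. Purely as an illustration of the kind of guard that avoids calling max() on an empty collection, here is a minimal sketch. The function `best_rule` and its `rules_by_score` argument are hypothetical; they echo the names in the traceback but are not NLTK's code.

```python
def best_rule(rules_by_score, min_score):
    """Return a highest-scoring rule, or None when no rule qualifies.

    rules_by_score: dict mapping an integer score to a list of rules
    min_score: lowest score a rule must reach to be accepted
    """
    if not rules_by_score:
        # max() on an empty sequence raises ValueError, so stop early.
        return None
    max_score = max(rules_by_score)
    if max_score < min_score:
        return None
    return rules_by_score[max_score][0]


# No candidate rules at all: returns None instead of raising.
print(best_rule({}, 2))
# One rule scores at or above min_score, so it is returned.
print(best_rule({3: ["rule A"], 1: ["rule B"]}, 2))
```

A training loop built on such a helper could simply stop when it receives None.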

StevenBird1 said, at 2009-01-08T23:35:27.000Z:

Wrote to starkmanuk to request data.

@stevenbird
Member

Closing, stale. Cf #555
