
Unseen token error in nltk.model.NgramModel.entropy #167

Open
alexrudnick opened this Issue Jan 17, 2012 · 3 comments

alexrudnick (Member) commented Jan 17, 2012

NLTK version 2.0b9

I'm attempting to compute the entropy of a piece of text with respect to a given model, and there seems to be an issue with unseen tokens (even when a smoothing estimator is supplied to the model).

from nltk.model import NgramModel
from nltk.corpus import brown
from nltk.probability import WittenBellProbDist

estimator = lambda fdist, bins: WittenBellProbDist(fdist,0.2)
lm = NgramModel(3, brown.words(categories='news'), estimator)

sent = "this is a sentence with the word aardvark".split()
lm.entropy(sent)

Traceback (most recent call last):
File "<pyshell#28>", line 1, in <module>
lm.entropy("this is a sentence with the word aardvark".split())
File "C:\Python26\lib\site-packages\nltk\model\ngram.py", line 133, in entropy
e += self.logprob(token, context)
File "C:\Python26\lib\site-packages\nltk\model\ngram.py", line 98, in logprob
return -log(self.prob(word, context), 2)
File "C:\Python26\lib\site-packages\nltk\model\ngram.py", line 79, in prob
return self._alpha(context) * self._backoff.prob(word, context[1:])
File "C:\Python26\lib\site-packages\nltk\model\ngram.py", line 82, in prob
"context %s" % (word, ' '.join(context)))
TypeError: not all arguments converted during string formatting
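Incidentally, the final TypeError masks the intended error message: the format string built at ngram.py line 82 has a single %s placeholder but is given a two-element tuple. A minimal reproduction of that secondary bug, with hypothetical word/context values standing in for nltk's internals:

```python
# Stand-in values; in nltk this happens while building an error message.
word, context = "aardvark", ("the", "word")
try:
    # One %s placeholder, but a two-element tuple on the right-hand side:
    "context %s" % (word, ' '.join(context))
except TypeError as exc:
    print(exc)  # not all arguments converted during string formatting
```

So even when the model does raise the "no probability for this word in this context" error it intends to, the user sees this formatting TypeError instead. The fix on nltk's side would be to include both values in the placeholders, e.g. "word %s, context %s".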

An exception is not raised (and a value is returned) if the entropy is computed using the same sentence but with 'aardvark' removed, which leads me to believe it's an error with unseen tokens. However, my understanding is that the model should be equipped to handle unseen tokens as indicated by the estimator.
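For what it's worth, the whole point of a smoothing estimator such as Lidstone is that unseen tokens receive nonzero probability, so log-probabilities (and hence entropy) stay finite. A toy, pure-Python sketch of that idea (not NLTK's implementation; the function name and the bins default are my own assumptions):

```python
from collections import Counter

def lidstone_prob(fdist, word, gamma=0.2, bins=None):
    """Add-gamma (Lidstone) smoothing: every word, seen or unseen,
    gets probability mass, so entropy computations never hit zero."""
    if bins is None:
        bins = len(fdist) + 1   # reserve one extra bin for unseen words (an assumption here)
    n = sum(fdist.values())     # total observations
    return (fdist[word] + gamma) / (n + gamma * bins)

fdist = Counter("the cat sat on the mat".split())
print(lidstone_prob(fdist, "aardvark"))  # nonzero, despite zero counts
```

This is what the reporter's understanding amounts to: with an estimator installed, an unseen token should fall through to a small smoothed probability rather than raise an exception.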

Migrated from http://code.google.com/p/nltk/issues/detail?id=694


earlier comments

SilaiSantai said, at 2011-07-29T11:23:30.000Z:

I am having the exact same problem (though I use a LidstoneProbDist as my estimator). I also looked at it and it seems that it is unable to handle unseen words in the text.

This would seem to be a major issue for NgramModel, as it is not much use if it cannot handle unseen words.

SilaiSantai said, at 2011-07-29T12:20:45.000Z:

The fix written here seems to solve the issue; at least it now yields a result:

http://code.google.com/p/nltk/issues/detail?id=673&can=1&q=ngrammodel

paulproteus commented Sep 18, 2013

It would be nice if someone can attempt to reproduce this bug.

If it is still an issue, it seems like a reasonably good first bug for someone new to NLTK to solve.

alexrudnick (Member) commented Aug 4, 2014

The particular problem that ssingh.math saw seems to have been fixed along with #158, so this works with a LidstoneProbDist, but it's still breaking for me with a WittenBellProbDist.

I'm pretty sure WittenBellProbDist is broken, though. If you don't pass a "bins" parameter, it will always set its "Z" field to 0 and then try to divide by that.
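To illustrate that division by zero: in Witten-Bell smoothing, the probability mass reserved for unseen events is spread over Z = bins - T empty bins (T being the number of distinct seen types), so a bins default equal to T leaves Z at zero. A rough sketch of the arithmetic (my own simplification, not NLTK's code):

```python
from collections import Counter

def wb_unseen_prob(samples, bins=None):
    """Witten-Bell probability for a single unseen event: T / (Z * (N + T))."""
    fdist = Counter(samples)
    t, n = len(fdist), sum(fdist.values())  # distinct types, total observations
    if bins is None:
        bins = t          # mirrors the problematic default: no bins left for unseen events
    z = bins - t          # number of zero-count bins
    # With z == 0 this raises ZeroDivisionError instead of a sensible estimate.
    return t / (z * (n + t))

print(wb_unseen_prob("a a b".split(), bins=5))  # 2 / (3 * 5)
try:
    wb_unseen_prob("a a b".split())
except ZeroDivisionError:
    print("Z == 0 when bins is left at its default")
```

So the sensible behaviors would be either to require bins explicitly or to default it to something strictly larger than the number of observed types.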

kruskod pushed a commit to kruskod/nltk that referenced this issue Jul 15, 2015

stevenbird (Member) commented Nov 20, 2016

@Copper-Head would you please confirm if this issue is current?
