Unseen token error in nltk.model.NgramModel.entropy #167

alexrudnick opened this Issue Jan 17, 2012 · 5 comments


alexrudnick commented Jan 17, 2012

NLTK version 2.0b9

I'm attempting to compute the entropy of a piece of text with respect to a given model and there seems to be an issue with unseen tokens (even when an estimator is used for the model).

from nltk.model import NgramModel
from nltk.corpus import brown
from nltk.probability import WittenBellProbDist

estimator = lambda fdist, bins: WittenBellProbDist(fdist,0.2)
lm = NgramModel(3, brown.words(categories='news'), estimator)

sent = "this is a sentence with the word aardvark".split()
lm.entropy(sent)

Traceback (most recent call last):
  File "<pyshell#28>", line 1, in <module>
    lm.entropy("this is a sentence with the word aardvark".split())
  File "C:\Python26\lib\site-packages\nltk\model\", line 133, in entropy
    e += self.logprob(token, context)
  File "C:\Python26\lib\site-packages\nltk\model\", line 98, in logprob
    return -log(self.prob(word, context), 2)
  File "C:\Python26\lib\site-packages\nltk\model\", line 79, in prob
    return self._alpha(context) * self._backoff.prob(word, context[1:])
  File "C:\Python26\lib\site-packages\nltk\model\", line 82, in prob
    "context %s" % (word, ' '.join(context)))
TypeError: not all arguments converted during string formatting
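Incidentally, the TypeError at the bottom of the traceback is itself a secondary bug in NLTK's error-raising code: the format string has a single %s placeholder but is given a two-element tuple, so the intended "unseen word" error never gets raised. A minimal reproduction (the corrected message at the end is my guess at the intent, not NLTK's actual string):

```python
word = "aardvark"
context = ("the", "word")

# One %s placeholder with a two-element tuple raises exactly the
# TypeError shown above, masking the intended error message.
try:
    "context %s" % (word, " ".join(context))
except TypeError as e:
    print(e)  # not all arguments converted during string formatting

# A placeholder per argument formats cleanly:
print("%s not found in context %s" % (word, " ".join(context)))
# -> aardvark not found in context the word
```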

An exception is not raised (and a value is returned) if the entropy is computed using the same sentence but with 'aardvark' removed, which leads me to believe it's an error with unseen tokens. However, my understanding is that the model should be equipped to handle unseen tokens as indicated by the estimator.
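One thing worth noting about the snippet above: the estimator lambda discards the `bins` argument that NgramModel passes in and hard-codes 0.2 (a float) in its place, so the smoother never learns how many unseen event types to reserve mass for. A forwarding estimator would pass `bins` through. The sketch below uses a hypothetical `make_estimator` helper and a stand-in `DummySmoother` class so it runs without NLTK; with NLTK installed, the same shape would wrap `WittenBellProbDist`. Whether forwarding `bins` alone avoids the exception on NLTK 2.0b9 is untested.

```python
def make_estimator(smoother_cls):
    """Build an estimator that forwards NgramModel's `bins` argument
    to the smoother instead of dropping it (hypothetical helper)."""
    def estimator(fdist, bins):
        return smoother_cls(fdist, bins)
    return estimator

# Stand-in for a ProbDist class so this sketch runs without NLTK;
# with NLTK you would pass WittenBellProbDist here instead.
class DummySmoother:
    def __init__(self, fdist, bins):
        self.fdist = fdist
        self.bins = bins

estimator = make_estimator(DummySmoother)
smoothed = estimator({"a": 3, "b": 1}, bins=26)
print(smoothed.bins)  # 26 -- bins reached the smoother instead of being lost
```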

Migrated from

earlier comments

SilaiSantai said, at 2011-07-29T11:23:30.000Z:

I am having the exact same problem (though I use a LidstoneProbDist as my estimator). I also looked at it and it seems that it is unable to handle unseen words in the text.

This would seem to be a major issue for NgramModel, as it is not of much use if it cannot handle unseen words.

SilaiSantai said, at 2011-07-29T12:20:45.000Z:

The fix written here:

seems to solve the issue; at least it now yields a result.


paulproteus commented Sep 18, 2013

It would be nice if someone could attempt to reproduce this bug.

If it is still an issue, it seems like a reasonably good first bug for someone new to NLTK to solve.



alexrudnick commented Aug 4, 2014

The particular problem that ssingh.math saw seems to have been fixed along with #158, so this works with a LidstoneProbDist, but it's still breaking for me with a WittenBellProbDist.

I'm pretty sure WittenBellProbDist is broken, though. If you don't pass a "bins" parameter, it will always set its "Z" field to 0 and then try to divide by that.
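For reference, here is a simplified pure-Python sketch of the Witten–Bell estimate (the function name and variable names are mine, not NLTK's; the formulas follow the WittenBellProbDist docstring). With Z = bins - T unseen bins, a seen sample gets c / (N + T) and an unseen one gets T / (Z * (N + T)), so when bins defaults to the number of seen types, Z is zero and the unseen branch divides by zero, which is the failure mode described above.

```python
from collections import Counter

def witten_bell_prob(counts, sample, bins):
    """Simplified Witten-Bell estimate (a sketch, not NLTK's code).

    counts -- mapping from sample to observed frequency
    bins   -- total number of possible sample types, seen and unseen
    """
    N = sum(counts.values())  # total observations
    T = len(counts)           # number of distinct seen types
    Z = bins - T              # number of unseen bins
    if Z <= 0:
        # The failure mode described above: bins defaulting to the
        # number of seen types makes Z zero (division by zero).
        raise ValueError("bins must exceed the number of seen types")
    c = counts.get(sample, 0)
    if c > 0:
        return c / (N + T)        # seen sample
    return T / (Z * (N + T))      # mass reserved for unseen samples

counts = Counter("abracadabra")   # 11 tokens, 5 distinct types
print(witten_bell_prob(counts, "a", bins=10))  # 5/16 = 0.3125
print(witten_bell_prob(counts, "z", bins=10))  # 5/(5*16) = 0.0625
```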

kruskod pushed a commit to kruskod/nltk that referenced this issue Jul 15, 2015



stevenbird commented Nov 20, 2016

@Copper-Head would you please confirm if this issue is current?



Copper-Head commented Aug 25, 2018

@stevenbird we can close this now!

@stevenbird stevenbird closed this Aug 25, 2018



stevenbird commented Aug 25, 2018

Thanks @Copper-Head
