Unseen token error in nltk.model.NgramModel.entropy #167
NLTK version 2.0b9
I'm attempting to compute the entropy of a piece of text with respect to a given model and there seems to be an issue with unseen tokens (even when an estimator is used for the model).
from nltk.model import NgramModel
from nltk.probability import WittenBellProbDist

estimator = lambda fdist, bins: WittenBellProbDist(fdist, 0.2)
sent = "this is a sentence with the word aardvark".split()
Traceback (most recent call last):
An exception is not raised (and a value is returned) when the entropy is computed for the same sentence with 'aardvark' removed, which leads me to believe the error involves unseen tokens. However, my understanding is that the model should be equipped to handle unseen tokens, since an estimator was supplied.
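For reference, here is the behaviour one would expect from a smoothing estimator — a self-contained sketch of add-gamma (Lidstone-style) smoothing, not NLTK's actual implementation; the `lidstone_prob` helper and the choice of one extra bin for unseen words are my own assumptions for illustration. With smoothing, an unseen token still receives nonzero probability, so the cross-entropy stays finite instead of raising an exception:

```python
import math
from collections import Counter

def lidstone_prob(counts, total, bins, word, gamma=0.2):
    # Add-gamma smoothing: an unseen word gets gamma / (N + gamma * bins),
    # which is small but strictly positive.
    return (counts.get(word, 0) + gamma) / (total + gamma * bins)

train = "this is a sentence with the word".split()
counts = Counter(train)
bins = len(counts) + 1  # reserve one bin for unseen words (an assumption)
total = sum(counts.values())

test = "this is a sentence with the word aardvark".split()
# Per-word cross-entropy in bits; 'aardvark' is unseen but still scored.
entropy = -sum(math.log2(lidstone_prob(counts, total, bins, w))
               for w in test) / len(test)
print(entropy)
```

This is a unigram toy rather than an n-gram model, but it shows the property the bug report relies on: with any smoothing estimator, entropy over text containing unseen tokens should be a finite number, not an exception.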
Migrated from http://code.google.com/p/nltk/issues/detail?id=694
SilaiSantai said, at 2011-07-29T11:23:30.000Z:
I am having the exact same problem (though I use a LidstoneProbDist as my estimator). Having looked into it, it seems the model cannot handle unseen words in the text.
This seems like a major issue for NgramModel, as the class is of little use if it cannot handle unseen words.
SilaiSantai said, at 2011-07-29T12:20:45.000Z:
The fix written here:
seems to solve the issue. At least it now yields a result.
The particular problem that ssingh.math saw seems to have been fixed along with #158, so this now works with a LidstoneProbDist, but it still breaks for me with a WittenBellProbDist.
I'm fairly sure WittenBellProbDist itself is broken, though: if you don't pass a "bins" parameter, it always ends up with its "Z" field set to 0 and then tries to divide by it.
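To make the Z = 0 failure concrete, here is a toy sketch of the Witten-Bell formula as I read it (not NLTK's code; `witten_bell_prob` is a hypothetical helper): seen words get c / (N + T), and the leftover mass T / (N + T) is split evenly across the Z = bins - T unseen bins. If bins defaults to the number of seen types T, then Z is 0 and scoring any unseen word divides by zero:

```python
from collections import Counter

def witten_bell_prob(counts, word, bins=None):
    # T = number of seen word types, N = total token count,
    # Z = number of bins reserved for unseen words.
    T = len(counts)
    N = sum(counts.values())
    if bins is None:
        bins = T          # a default like this makes Z == 0 below
    Z = bins - T
    if word in counts:
        return counts[word] / (N + T)
    # Unseen word: mass T / (N + T) shared evenly across the Z unseen bins.
    return T / (Z * (N + T))  # ZeroDivisionError when bins == T

counts = Counter("this is a sentence with the word".split())
print(witten_bell_prob(counts, "word", bins=len(counts) + 1))  # seen: fine
try:
    witten_bell_prob(counts, "aardvark")  # bins left at default -> Z == 0
except ZeroDivisionError:
    print("ZeroDivisionError: Z == 0 when bins defaults to the seen-type count")
```

So the workaround, assuming this matches NLTK's behaviour, would be to always pass an explicit bins value larger than the number of seen types when constructing a WittenBellProbDist.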