nltk.lm.api entropy formula source? #2961
Comments
@mbauwens Hello! I dug into the code a little bit, and the commit info is:
Some quick searching indicates that JBG might refer to Jordan Boyd-Graber, who has taught about Language Models at the University of Maryland. That said, I wasn't able to find a specific paper, nor am I certain that it indeed refers to Boyd-Graber. We would be best off just asking @iliakur, the author of the NLTK language model package.
Thanks for the quick reply! Ok, interesting. After a quick Google search, it does seem Jordan Boyd-Graber (if that's what "JBG" references) actually also references the Shannon Entropy formula (see slide 16, second formula), so that makes it even more 'perplexing'... :) Given that it's hard to find the source of this formula, and general theory (as far as I know) usually references Shannon Entropy in the domain of information theory/communication/language, it might not be a bad idea to change this formula into the more well-documented one? Unless there were clear theoretical arguments in favour of the formula that's used right now, of course. Looking forward to hearing @iliakur's thoughts!
Hi all! I'm currently swamped, will respond to this in the course of this week. @mbauwens thanks for the detailed issue 🙏
Small update: my formula for perplexity based on Jurafsky (in the original post) was completely wrong, so I corrected that. I dug a little deeper by comparing Shannon (entropy and perplexity) with the NLTK formulas (entropy and perplexity) as well as Jurafsky's implementation of perplexity. I've written a little program so you can easily compare the outcome of the formulas by supplying a probability list to the main function below.
```python
from numpy import log2, prod, mean, ones, random  # ones and random are used in the examples further down

def shannon_entropy(problist):
    """Shannon Entropy: negative sum over all probabilities * log2_probabilities
    https://en.wikipedia.org/wiki/Entropy_(information_theory)
    """
    return -1 * sum([prob * log2(prob) for prob in problist])

def nltk_entropy(problist):
    """From nltk.lm: negative average of log2_probabilities
    https://www.nltk.org/api/nltk.lm.api.html#nltk.lm.api.LanguageModel.entropy
    """
    return -1 * mean([log2(prob) for prob in problist])

def perplexity(entropy):
    """From nltk.lm: 2**entropy
    This is a general formula for perplexity, not specific to NLTK
    https://www.nltk.org/api/nltk.lm.api.html#nltk.lm.api.LanguageModel.perplexity
    """
    return pow(2.0, entropy)

def jurafsky_perplexity(problist):
    """From Jurafsky, Stanford NLP: product of all probabilities ** (-1 / count_of_probabilities)
    https://www.youtube.com/watch?v=NCyCkgMLRiY
    """
    return pow(prod(problist), -1 / len(problist))

def uncertainty(problist):
    prob_sum = sum(problist)  # just to be sure our total probability mass == 1
    shan_ent = shannon_entropy(problist)
    shan_prplx = perplexity(shan_ent)
    nltk_ent = nltk_entropy(problist)
    nltk_prplx = perplexity(nltk_ent)
    juraf_prplx = jurafsky_perplexity(problist)
    print(
        f"Probability sum: {prob_sum}\n\n"
        f"Shannon entropy: {shan_ent}\n"
        f"Shannon perplexity: {shan_prplx}\n\n"
        f"NLTK entropy: {nltk_ent}\n"
        f"NLTK perplexity: {nltk_prplx}\n\n"
        f"Jurafsky perplexity: {juraf_prplx}"
    )
# some examples with equal probability
# coin toss
uncertainty([0.5]*2)
# Probability sum: 1.0
#
# Shannon entropy: 1.0
# Shannon perplexity: 2.0
#
# NLTK entropy: 1.0
# NLTK perplexity: 2.0
#
# Jurafsky perplexity: 2.0
uncertainty([0.1]*10) # float accuracy issues do occur
# Probability sum: 0.9999999999999999
#
# Shannon entropy: 3.321928094887362
# Shannon perplexity: 9.999999999999998
#
# NLTK entropy: 3.3219280948873626
# NLTK perplexity: 10.000000000000002
#
# Jurafsky perplexity: 10.0
# some examples with UNequal probability
# 'weighted' coin toss
uncertainty([0.7, 0.3])
# Probability sum: 1.0
#
# Shannon entropy: 0.8812908992306927
# Shannon perplexity: 1.8420227750373133
#
# NLTK entropy: 1.1257693834979823
# NLTK perplexity: 2.182178902359924
#
# Jurafsky perplexity: 2.182178902359924
uncertainty([0.1, 0.25, 0.35, 0.15, 0.15])
# Probability sum: 1.0
#
# Shannon entropy: 2.1833830982290134
# Shannon perplexity: 4.542174393148615
#
# NLTK entropy: 2.4620864912099063
# NLTK perplexity: 5.5101305142271135
#
# Jurafsky perplexity: 5.510130514227115
# tested with a very large, random probability list
uncertainty(random.dirichlet(ones(5000), size=1).tolist()[0])
# Probability sum: 0.9999999999999974
#
# Shannon entropy: 11.690225456458114
# Shannon perplexity: 3304.5210178220477
#
# NLTK entropy: 13.104676557913615
# NLTK perplexity: 8808.475027062655
#
# Jurafsky perplexity: inf  (the product of 5000 tiny probabilities underflows to 0.0, so 0.0 ** (-1/N) comes out as inf)
```
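Worth noting why the NLTK and Jurafsky perplexities come out identical in every example, even when the two entropies differ: the NLTK route computes

$$2^{-\frac{1}{N}\sum_{i=1}^{N}\log_2 p_i} = \left(\prod_{i=1}^{N} p_i\right)^{-1/N},$$

which is exactly Jurafsky's formula, so any difference in the outputs above is just floating-point noise. The Shannon variant is a different quantity because it weights each log probability by $p_i$ rather than averaging uniformly over the supplied probabilities.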
I'm still a bit tight on time, but I can at least provide some leads. I can confirm Tom's hunch: the entropy/perplexity formula is taken verbatim from Jordan Boyd-Graber. He was kind enough to share a script in this comment. From scanning your comment, @mbauwens, it seems you've arrived at something similar independently. Does this mean all is well, or should I do some more digging?
Well, this cleared up where the formula is from, yes! In that sense, the issue can be closed. The only 'issue' I still have is that this formula for entropy/perplexity is apparently less well documented, whereas the Shannon Entropy formula is more commonly found. I don't know which one is 'better', though (I assume they're just different). Future users might benefit from an update. So, not sure if it's worth it, but it might be nice to include a reference to Jordan Boyd-Graber/Jurafsky in the documentation.
I'm pretty new to contributing to open-source Git projects, but the how-to is pretty clear and the edit is very straightforward, so I'd be okay with suggesting it via a PR, if you agree with the statement above. If you don't believe this is necessary for the NLTK core and this is something users should add themselves (like I now did in my project), that's also fine. 🙂 Either way, thanks for having taken the time to look into my question 👍
👍 from me for your plan. Adding methods and docs is generally not bad imo; we should just make sure we take this opportunity to think about the relevant interfaces. Ping me when you make the PR, I'll gladly do the review. Or ping me if you need help in the process of making the PR too :) Thanks for your work on this!
This is done via GitHub, actually. The API Reference pages, such as this page, are automatically generated from comments in https://github.com/nltk/nltk/blob/develop/nltk/lm/api.py. You can change the comments in the code, and after a website push the site will be updated. Similarly, the Example Usage pages, such as this page, are generated from doctest files like this one: https://github.com/nltk/nltk/blob/develop/nltk/test/lm.doctest. That said, some of these doctest files are really just tests, while others actually explain how to use the sections.
Will do, @iliakur! I'm actually investigating some variants of entropy because I've come across some practical issues/limitations when using the measure. Would it be okay if I suggested some optional parameters for the entropy function so that the method allows for more flexibility? I'm planning to test whether the parameters do improve the quality of the measure and, if so, I'll document it thoroughly. Still a work in progress, but I do plan to finish it by 17 June (right on time to possibly present it at the CLIN 32 conference in Tilburg).
Update: I have found the reasoning behind negative average log probability as a measure of entropy (the Shannon-McMillan-Breiman theorem). It's actually mentioned in chapter 3 of Speech and Language Processing (Jurafsky and Martin, 2021). I probably overlooked it, as it is mainly explained using mathematical formulas (and I'm a mere computational linguist 😃). I'll include this information in my suggested PR, as it does clarify some things.
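Roughly, the theorem says that for a long enough sample drawn from the distribution (the i.i.d. case is enough for intuition), the negative average log2 probability of the observed tokens converges to the Shannon entropy. That is why the unweighted average is a valid entropy estimate over real text. A quick sanity check with the weighted coin from my comparison above (sample size and token indices here are just for illustration):

```python
from numpy import log2
from numpy.random import choice

# Negative average log2 probability of a *sampled* sequence approaches
# the Shannon entropy of the distribution it was sampled from.
p = [0.7, 0.3]
tokens = choice(len(p), size=100_000, p=p)              # sample token ids according to p
avg_neg_logprob = -log2([p[t] for t in tokens]).mean()  # NLTK-style estimate over the sample
shannon = -sum(prob * log2(prob) for prob in p)         # Shannon entropy of p

print(avg_neg_logprob)  # ≈ 0.881, fluctuates slightly with the sample
print(shannon)          # 0.8812908992306927
```

This is also why the unweighted NLTK average in my script (1.1258 for [0.7, 0.3]) differs from the Shannon value (0.8813): the comparison script feeds each probability exactly once instead of in proportion to how often the event actually occurs.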
@mbauwens optional parameters should be ok as long as their default values result in existing behavior. The idea being that folks who aren't passing these parameters explicitly should not have their code break all of a sudden :)
Ok, that was also what I was planning 😃 Do you prefer a new method, or optional parameters on the existing entropy method?
Could you outline the two alternatives a bit more? Just the level of method signatures should be enough to help us decide between the two.
Sorry @iliakur for the late reply! I was a bit busy and still had to try out some things. I've just presented the alternatives at Computational Linguistics in The Netherlands (CLIN 32) and wrote a demo in a GitHub repo: https://github.com/mbauwens/clin32-entropy. Basically, the different variations of entropy would be:
With the final variant (length-normalised, relative-frequency-weighted), I did seem to get promising results. I think it would be most elegant if the functions were separated, but it would be more efficient and compact to have a single function with parameters; a rough sketch of both options follows.
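At signature level, the two options could look something like this (names and parameters here are placeholders for illustration, not the actual proposal from the repo):

```python
# Option A: one method with optional parameters; the defaults keep today's behaviour.
def entropy(self, text_ngrams, length_normalised=False, frequency_weighted=False):
    ...

# Option B: separate methods, leaving the existing entropy() untouched.
def length_normalised_entropy(self, text_ngrams):
    ...

def frequency_weighted_entropy(self, text_ngrams):
    ...
```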
@mbauwens should I close this issue?
This might not be a question directly related to the discussion here, but perhaps it's a good place (and time) to ask it: what would be the theoretical and methodological support for using the currently implemented formula for computing entropy, i.e. the one that does not have p(x) in it, but only log probs? I am trying to motivate the implementation in NLTK, as it has been criticised by some of my colleagues for not being correct; in their view, the correct one would be Shannon's definition. Any references or hints on that topic?
Based on this comment, I disagree with the claim that the current implementation is incorrect. The comment clearly states that the two variations of entropy we're discussing are both recognised. In fact, Shannon also had a hand in the one that's currently implemented. To address your request for more support for the current version, see this comment. Basically, the current implementation is based on the Jurafsky book and a well-known NLP researcher's code.
I agree that this is not an incorrect implementation, but just a specific implementation.
My bad for the long time between creating this issue and closing it! I've added PR #3229, which reflects the outcome of this discussion in the documentation. Because of different priorities, I haven't added the new functionality I proposed earlier in this thread. This PR should close this issue!
Hi! I've been experimenting with training and testing a standard trigram language model on my own dataset. Upon investigating the entropy method of the LM class, I was a bit confused. The docs only mention a very brief description: "Calculate cross-entropy of model for given evaluation text." It seems to me that this entropy measure just averages the ngram negative log probability (so trigram in my case) and makes it positive by multiplying by -1:
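Roughly, it seems to boil down to something like this (a paraphrase of nltk/lm/api.py rather than the literal source, with the model's log2 scorer passed in as a callable):

```python
from numpy import mean

def nltk_style_entropy(logscore, text_ngrams):
    """logscore(word, context) -> log2 probability of `word` given `context`.

    Average the per-ngram log2 probabilities and flip the sign.
    """
    return -1 * mean([logscore(ngram[-1], ngram[:-1]) for ngram in text_ngrams])
```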
So my general question is: Where does this formula for entropy stem from? Is there any paper referencing the method? I'm just a bit stuck with the different versions of entropy that exist and I don't know which one is used here (and therefore I don't know how to interpret it correctly).
Other formulas of entropy/perplexity
As a formula, I know the Shannon entropy, which in Python would be:
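Something along these lines:

```python
from numpy import log2

# Shannon entropy: negative sum of p * log2(p) over the whole distribution
def shannon_entropy(problist):
    return -1 * sum(prob * log2(prob) for prob in problist)
```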
And there's also the perplexity formula of Jurafsky, which returns a different score than lm.perplexity (which is 2**entropy); a sketch of it follows after the update below. Thanks!
PS: My apologies if this issue lacks some information - I don't often open issues on GitHub 😅 Let me know if I need to update something!
UPDATE: my Python implementation of the Jurafsky perplexity was totally wrong, as I had written it too quickly. I've updated it to reflect the actual scores from the web lecture.
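The corrected version is essentially the N-th root of the inverse probability of the sequence, the same as the jurafsky_perplexity helper in the comparison script earlier in this thread:

```python
from numpy import prod

# Jurafsky-style perplexity: (product of the token probabilities) ** (-1/N)
def jurafsky_perplexity(problist):
    return pow(prod(problist), -1 / len(problist))
```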