
Mutual information is greater than information entropy #11

Closed
lupupu opened this issue Dec 12, 2021 · 25 comments

@lupupu

lupupu commented Dec 12, 2021

Thank you very much. Could you explain why the mutual information is greater than the information entropy? The code is as follows:

import numpy as np
# np.random.seed(1)
x = np.random.standard_normal(1000)
y = x + 0.1 * np.random.standard_normal(x.shape)

# KSG mutual information estimate (scikit-learn)
from sklearn.feature_selection import mutual_info_regression
mi = mutual_info_regression(x.reshape(-1, 1), y.reshape(-1), discrete_features=False)

# Kozachenko-Leonenko entropy estimates (this package)
from entropy_estimators import continuous
infor_ent_x = continuous.get_h(x, k=3)
infor_ent_y = continuous.get_h(y, k=3)

print(mi, infor_ent_x, infor_ent_y)  # mi is larger than both marginal entropies
@paulbrodersen
Owner

paulbrodersen commented Dec 13, 2021

Hi, thanks for raising the issue. Both packages implement the Kozachenko-Leonenko estimator for the entropy and the Kraskov et al. estimator for the mutual information. At first I thought the difference might come down to the parameterisation, but then I noticed that you had already matched the k parameter. I then dug into their source code a bit. There is a minor implementation difference, as they add some small noise by default. However, that is unlikely to explain the rather large discrepancy. The major discrepancy and (to me) the smoking gun is in this line:

mi = (
    digamma(n_samples)
    + digamma(n_neighbors)
    - np.mean(digamma(nx + 1))
    - np.mean(digamma(ny + 1))
)

If you read the paper, you will notice that the correct equation (equation 8) is instead:

mi = digamma(k) - np.mean(digamma(nx+1) + digamma(ny+1)) + digamma(n)

I tested my implementation against distributions for which the mutual information can be computed analytically, so I am fairly sure that this equation is not only the intended but also the correct one.

Tl;dr: they may have put some parentheses in the wrong place.

@lupupu
Author

lupupu commented Dec 13, 2021

Sorry, I can't find the difference between eq. (8) in the paper and mi (line 70 of _mutual_info.py) in sklearn. What is the problem with the brackets (parentheses)? Or maybe I didn't understand what you mean.

@paulbrodersen
Owner

np.mean(digamma(nx+1) + digamma(ny+1)) != np.mean(digamma(nx+1)) + np.mean(digamma(ny+1))

The expression in the paper includes the left-hand term, the code in scikit-learn the term on the right.

@lupupu
Author

lupupu commented Dec 13, 2021

By linearity of expectation, the expectation of a sum is equal to the sum of the expectations. So

np.mean(digamma(nx+1) + digamma(ny+1)) == np.mean(digamma(nx+1)) + np.mean(digamma(ny+1))

@paulbrodersen
Owner

Only if nx and ny are uncorrelated. Which they are not.

@paulbrodersen
Owner

I might be having a bit of a brain fart, though. I have a cold, and every thought takes ages.

@paulbrodersen
Owner

paulbrodersen commented Dec 13, 2021

I think you are right. I had to run some numbers at the ipython prompt to help my reduced mental capacities understand basic math again. In that case, I don't know where the difference comes from, at least not today. I will look into it when my brain is in working condition again.
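
Something along these lines, for anyone following along (a minimal sketch with made-up, strongly correlated neighbour counts, not the actual nx/ny from the estimator):

import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)
nx = rng.integers(1, 20, size=1000)      # stand-in neighbour counts
ny = nx + rng.integers(0, 3, size=1000)  # strongly correlated with nx

lhs = np.mean(digamma(nx + 1) + digamma(ny + 1))
rhs = np.mean(digamma(nx + 1)) + np.mean(digamma(ny + 1))
print(np.isclose(lhs, rhs))  # True -- linearity of expectation holds regardless of correlation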

@lupupu
Author

lupupu commented Dec 13, 2021

Thank you for your careful reply. Good health comes first; I wish you a speedy recovery. Maybe you'll know the answer once you have recovered from your cold.

@paulbrodersen
Owner

Actually, I don't think there is a difference at all. The definitional or so-called "naive" estimator of the mutual information is:

I(X;Y) = H(X) + H(Y) - H(X,Y)

If we plug in your example:

import numpy as np
from sklearn.feature_selection import mutual_info_regression
from entropy_estimators import continuous

np.random.seed(1)
x = np.random.standard_normal(10000)
y = x + 0.1 * np.random.standard_normal(x.shape)

print(mutual_info_regression(x.reshape(-1, 1), y.reshape(-1, ), discrete_features=False, n_neighbors=3))
# [2.31452164]

hx = continuous.get_h(x, k=3)
hy = continuous.get_h(y, k=3)
hxy = continuous.get_h(np.c_[x, y], k=3)

mi = hx + hy - hxy
print(mi)
# 2.325853446732216

I would say those estimates are pretty close, given that we are using two different methods to get the result.

@lupupu
Author

lupupu commented Dec 13, 2021

I find that mi > hx, which puzzles me.

@paulbrodersen
Owner

That continues to be a strong point.

However, I am by now fully convinced that the entropy computations are fine:

import scipy.stats as st
from entropy_estimators import continuous
distribution = st.norm(0, 1)
analytical = distribution.entropy()
empirical = continuous.get_h(distribution.rvs(10000), k=3)
print(analytical, empirical)
# 1.4189385332046727 1.4197821857006883

This leaves the following options:

  1. The scikit-learn maintainers and I independently made the same mistake when implementing the KSG estimator for mutual information.
  2. The KSG estimator is wrong. This is somewhat true, as there is a bias, which is discussed in the paper itself (Fig. 4 - Fig. 8). However, when I implemented it, I checked it extensively against analytical results, and I don't remember seeing large deviations for Gaussian distributions.
  3. Neither of us is thinking particularly clearly today.

My money is on option 3.

@paulbrodersen
Owner

Actually, options 1 and 2 are ruled out as possible explanations by the fact that the naive estimator I implemented above returns the same result for the mutual information as the KSG estimator...
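
For what it's worth, for this toy example the mutual information can also be computed analytically (assuming X and the added noise are independent Gaussians with variances 1 and 0.01):

$$ I(X;Y) = \frac{1}{2}\ln\left(1 + \frac{\sigma_X^2}{\sigma_{\text{noise}}^2}\right) = \frac{1}{2}\ln(101) \approx 2.31, \qquad h(X) = \frac{1}{2}\ln(2\pi e) \approx 1.42, $$

which matches both estimators above, so the mutual information really is larger than the marginal entropy here.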

@lupupu
Author

lupupu commented Dec 14, 2021

Thanks a lot. I'll take the time to study it again.

@lupupu
Author

lupupu commented Dec 17, 2021

Mutual information is not necessarily less than information entropy. I was misled by a picture on Wikipedia.
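
If I understand it correctly now: for continuous variables

$$ I(X;Y) = h(X) - h(X \mid Y), $$

and the conditional differential entropy $h(X \mid Y)$ can be negative (here roughly $\tfrac{1}{2}\ln(2\pi e \cdot 0.01) \approx -0.9$), so the Venn-diagram intuition from the discrete case no longer bounds $I(X;Y)$ by $h(X)$.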

@lupupu
Author

lupupu commented Dec 17, 2021

(Screenshot of the Wikipedia figure referred to above.)

@shyshy903

Why is the mutual information not necessarily less than the information entropy?
Recently I also ran into this problem: the mi is larger than the entropy.

@lupupu
Author

lupupu commented Mar 23, 2022

mi is larger than entropy. This is indeed a serious problem.

@braniii

braniii commented Jun 22, 2022

The fact that the mutual information can be larger than the entropy is due to the fact that the continuous Shannon entropy formula,

$$ H(X) = - \int \mathrm{d}x\, p(x) \log p(x) $$

is not correct. Jaynes found that the correct formula should instead be¹

$$ H(X) = - \int \mathrm{d}x\, p(x) \log \frac{p(x)}{m(x)}. $$

This way, the entropy becomes invariant under scaling $X\to cX$ or shifting the data $X\to c + X$. Kraskov et al. were using the former definition of Shannon, which introduces a term that depends directly on the scaling. Their entropy estimator is given by

$$ H(X) = \psi(N) - \psi(k) + d \langle \log \varepsilon \rangle, $$

where the last term $\langle \log \varepsilon \rangle$ can become arbitrarily negative by transforming $X\to cX$ with $c\to 0$ (so that all nearest-neighbour distances get smaller).

A first-order fix to make the entropy more meaningful is to use a scaler, e.g. RobustScaler, before performing the entropy estimation. Nevertheless, this does not ensure that $H(X,Y) \ge I(X,Y)$!
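
A minimal sketch of what that first-order fix could look like (RobustScaler from scikit-learn is just one possible choice of scaler):

import numpy as np
from sklearn.preprocessing import RobustScaler
from entropy_estimators import continuous

x = np.random.standard_normal(1000) / 100   # deliberately small scale

h_raw = continuous.get_h(x, k=3)            # strongly negative, driven by the 1/100 factor
x_scaled = RobustScaler().fit_transform(x.reshape(-1, 1)).ravel()
h_scaled = continuous.get_h(x_scaled, k=3)  # no longer depends on the arbitrary 1/100 factor

print(h_raw, h_scaled)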

@singledoggy

singledoggy commented Mar 1, 2023

For example, in

import numpy as np
from sklearn.feature_selection import mutual_info_regression
from entropy_estimators import continuous

np.random.seed(1)
x = np.random.standard_normal(10000)
y = x + 0.1 * np.random.standard_normal(x.shape)

how can H(X,Y) be equal to the following?

hxy = continuous.get_h(np.c_[x, y], k=3)

H(X,Y) here is from a 10000*10000-dimensional vector, and you can't just get a joint distribution from the marginal distributions, so typically you have to assume that it is a multivariate normal distribution, and that is what you do when the marginal distributions are normal. But if the marginal distribution is not normal, what can we assume?
Or how can we create a normal joint distribution from two non-normal marginal distributions?

@paulbrodersen
Owner

@singledoggy You seem to have a separate question. Please open another issue.

@singledoggy

Sorry, I was thinking about this and suddenly jumped to another topic without realizing it.
I think the scaling is the right cause. We can see that h(x) and h(y) are affected by the scale, while mi(x, y) isn't.

import numpy as np
x = np.random.standard_normal(1000)
y = x + 0.1 * np.random.standard_normal(x.shape)

from sklearn.feature_selection import mutual_info_regression
from entropy_estimators import continuous
mi_reg = mutual_info_regression(x.reshape(-1, 1), y.reshape(-1, ), discrete_features=False)
mi_this = continuous.get_mi(x.reshape(-1, 1), y.reshape(-1, ))

infor_ent_x = continuous.get_h(x, k=3)
infor_ent_y = continuous.get_h(y, k=3)

print(infor_ent_x, infor_ent_y, mi_reg, mi_this)
# output: 1.4006482292373557 1.428777538316779 [2.37495851] 2.170190377210962

import numpy as np
x = np.random.standard_normal(1000) / 100
y = x + 0.1 * np.random.standard_normal(x.shape) / 100

from sklearn.feature_selection import mutual_info_regression
from entropy_estimators import continuous
mi_reg = mutual_info_regression(x.reshape(-1, 1), y.reshape(-1, ), discrete_features=False)
mi_this = continuous.get_mi(x.reshape(-1, 1), y.reshape(-1, ))

infor_ent_x = continuous.get_h(x, k=3)
infor_ent_y = continuous.get_h(y, k=3)

print(infor_ent_x, infor_ent_y, mi_reg, mi_this)
# output: -3.1608263267673697 -3.1215943346483987 [2.43274451] 2.24871532950979
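
This is consistent with the scaling law for the differential entropy,

$$ h(cX) = h(X) + \ln|c|, \qquad \ln\tfrac{1}{100} \approx -4.61, $$

which is roughly the difference between the entropies in the two runs ($-3.16 - 1.40 \approx -4.6$), while the mutual information stays essentially unchanged.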

@singledoggy

Another example is two independent normal distributions: as $\sigma$ changes, their entropy can change from negative to positive, but the mutual information would be 0, so the mutual information can be larger than, smaller than, or equal to the entropy.

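Concretely, for a normal distribution

$$ h(X) = \tfrac{1}{2}\ln(2\pi e \sigma^2), $$

which is negative for $\sigma < 1/\sqrt{2\pi e} \approx 0.24$, while $I(X;Y) = 0$ for independent $X$ and $Y$.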

@mahlzahn
Contributor

Another example is two independent normal distributions: as σ changes, their entropy can change from negative to positive, but the mutual information would be 0, so the mutual information can be larger than, smaller than, or equal to the entropy.

That’s a nice intuitive counterexample! (I propose closing this issue with that.)

@braniii

braniii commented May 14, 2024

Hi everyone, to follow up on this discussion. As @singledoggy and I have explained, the reason that the mutual information of continuous variables can be larger than the entropies is that, in the limit $|\mathcal{X}|\to\infty$, the Shannon entropy $H(X) = -\sum_{x\in\mathcal{X}} p_x \ln p_x$ does not converge to $H(X) = -\int \mathrm{d}x\, p(x) \ln p(x)$. The probability density is a dimensional quantity, and therefore its logarithm, $\ln p(x)$, is not well defined: changing the units changes the values.
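
Concretely (a standard result for sufficiently well-behaved densities): if $X$ is discretized with bin width $\Delta$, the discrete Shannon entropy behaves like

$$ H(X^\Delta) \approx -\int \mathrm{d}x\, p(x)\ln p(x) - \ln\Delta, $$

which diverges as $\Delta\to 0$; the finite part that remains, the differential entropy, inherits a dependence on the units in which $X$ is measured.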

For anyone who actually needs to estimate a normalized mutual information and it is not enough to know why the Kraskov estimator fails: I've recently published an article, arXiv:2405.04980, where I discuss this issue in detail and present a generalization of the Kraskov estimator that is able to estimate normalized mutual information as well. The source code is available on github at moldyn/NorMI.

@paulbrodersen
Owner

@braniii Thanks for taking the time to explain the issue clearly and for providing an alternative. I have linked your last comment at the top of the README to direct people your way.
