
Mutual information is greater than information entropy #11

Closed
lupupu opened this issue Dec 12, 2021 · 25 comments

@lupupu

lupupu commented Dec 12, 2021

Thank you very much. Could you explain why the mutual information is greater than the information entropy? The code is as follows:

import numpy as np
# np.random.seed(1)
x = np.random.standard_normal(1000)
y = x + 0.1 * np.random.standard_normal(x.shape)

# KSG mutual information estimate (scikit-learn)
from sklearn.feature_selection import mutual_info_regression
mi = mutual_info_regression(x.reshape(-1, 1), y.reshape(-1), discrete_features=False)

# Kozachenko-Leonenko entropy estimates (this package)
from entropy_estimators import continuous
infor_ent_x = continuous.get_h(x, k=3)
infor_ent_y = continuous.get_h(y, k=3)

print(mi, infor_ent_x, infor_ent_y)  # mi is larger than both marginal entropies
@paulbrodersen
Owner

paulbrodersen commented Dec 13, 2021

Hi, thanks for raising the issue. Both packages implement the Kozachenko-Leonenko estimator for the entropy and the Kraskov et al. estimator for the mutual information. At first I thought the difference might come down to the parameterisation, but then I noticed that you had already matched the k parameter. I then dug into their source code a bit. There is a minor implementation difference, as they add some small noise by default. However, that is unlikely to explain the rather large discrepancy. The major discrepancy and (to me) the smoking gun is in this line:

mi = (
    digamma(n_samples)
    + digamma(n_neighbors)
    - np.mean(digamma(nx + 1))
    - np.mean(digamma(ny + 1))
)

If you read the paper, you will notice that the correct equation (equation 8) is instead:

mi = digamma(k) - np.mean(digamma(nx+1) + digamma(ny+1)) + digamma(n)

I tested my implementation against distributions for which the mutual information can be computed analytically, so I am fairly sure that this equation is not only the intended but also the correct one.

Tl;dr: they may have put some parentheses in the wrong place.

@lupupu
Author

lupupu commented Dec 13, 2021

Sorry, I can't find the difference between eq. (8) in the paper and mi (line 70 of _mutual_info.py) in sklearn. What is the problem with the brackets (parentheses)? Or maybe I didn't understand what you mean.

@paulbrodersen
Owner

np.mean(digamma(nx+1) + digamma(ny+1)) != np.mean(digamma(nx+1)) + np.mean(digamma(ny+1))

The expression in the paper includes the left-hand term, the code in scikit-learn the term on the right.

@lupupu
Author

lupupu commented Dec 13, 2021

By linearity of expectation, the expectation of a sum is equal to the sum of the expectations. So

np.mean(digamma(nx+1) + digamma(ny+1)) == np.mean(digamma(nx+1)) + np.mean(digamma(ny+1))

@paulbrodersen
Owner

Only if nx and ny are uncorrelated. Which they are not.

@paulbrodersen
Owner

I might be having a bit of a brain fart, though. I have a cold, and every thought takes ages.

@paulbrodersen
Owner

paulbrodersen commented Dec 13, 2021

I think you are right. I had to run some numbers at the ipython prompt to help my reduced mental capacities understand basic math again. In that case, I don't know where the difference comes from, at least not today. I will look into it when my brain is in working condition again.
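
Something along these lines, for anyone following along (a minimal sketch with made-up, strongly correlated neighbour counts, not the actual nx/ny from the estimator):

import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)
nx = rng.integers(1, 20, size=1000)      # stand-in neighbour counts
ny = nx + rng.integers(0, 3, size=1000)  # strongly correlated with nx

lhs = np.mean(digamma(nx + 1) + digamma(ny + 1))
rhs = np.mean(digamma(nx + 1)) + np.mean(digamma(ny + 1))
print(np.isclose(lhs, rhs))  # True -- linearity of expectation holds regardless of correlation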

@lupupu
Author

lupupu commented Dec 13, 2021

Thank you for your careful reply. Good health comes first; I wish you a speedy recovery. Maybe you'll know the answer once you have recovered from your cold.

@paulbrodersen
Owner

Actually, I don't think there is a difference at all. The definitional or so-called "naive" estimator of the mutual information is:

I(X;Y) = H(X) + H(Y) - H(X,Y)

If we plug in your example:

import numpy as np
from sklearn.feature_selection import mutual_info_regression
from entropy_estimators import continuous

np.random.seed(1)
x = np.random.standard_normal(10000)
y = x + 0.1 * np.random.standard_normal(x.shape)

print(mutual_info_regression(x.reshape(-1, 1), y.reshape(-1, ), discrete_features=False, n_neighbors=3))
# [2.31452164]

hx = continuous.get_h(x, k=3)
hy = continuous.get_h(y, k=3)
hxy = continuous.get_h(np.c_[x, y], k=3)

mi = hx + hy - hxy
print(mi)
# 2.325853446732216

I would say those estimates are pretty close, given that we are using two different methods to get the result.

@lupupu
Author

lupupu commented Dec 13, 2021

I find that mi > hx, which puzzles me.

@paulbrodersen
Owner

That continues to be a strong point.

However, I am by now fully convinced that the entropy computations are fine:

import scipy.stats as st
from entropy_estimators import continuous
distribution = st.norm(0, 1)
analytical = distribution.entropy()
empirical = continuous.get_h(distribution.rvs(10000), k=3)
print(analytical, empirical)
# 1.4189385332046727 1.4197821857006883

This leaves the following options:

  1. The scikit-learn maintainers and I independently made the same mistake when implementing the KSG estimator for mutual information.
  2. The KSG estimator is wrong. This is somewhat true, as there is a bias, which is discussed in the paper itself (Fig. 4 - Fig. 8). However, when I implemented it, I checked it extensively against analytical results, and I don't remember seeing large deviations for Gaussian distributions.
  3. Neither of us is thinking particularly clearly today.

My money is on option 3.

@paulbrodersen
Owner

Actually, options 1 and 2 are ruled out as possible explanations by the fact that the naive estimator I implemented above returns the same result for the mutual information as the KSG estimator...
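
For what it's worth, for this toy example the mutual information can also be computed analytically (assuming X and the added noise are independent Gaussians with variances 1 and 0.01):

$$ I(X;Y) = \frac{1}{2}\ln\left(1 + \frac{\sigma_X^2}{\sigma_{\text{noise}}^2}\right) = \frac{1}{2}\ln(101) \approx 2.31, \qquad h(X) = \frac{1}{2}\ln(2\pi e) \approx 1.42, $$

which matches both estimators above, so the mutual information really is larger than the marginal entropy here.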

@lupupu
Author

lupupu commented Dec 14, 2021

Thanks a lot. I'll take the time to study it again.

@lupupu
Author

lupupu commented Dec 17, 2021

Mutual information is not necessarily less than information entropy. I was misled by a picture on Wikipedia.
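
If I understand it correctly now: for continuous variables

$$ I(X;Y) = h(X) - h(X \mid Y), $$

and the conditional differential entropy $h(X \mid Y)$ can be negative (here roughly $\tfrac{1}{2}\ln(2\pi e \cdot 0.01) \approx -0.9$), so the Venn-diagram intuition from the discrete case no longer bounds $I(X;Y)$ by $h(X)$.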

@lupupu
Author

lupupu commented Dec 17, 2021

(Screenshot of the Wikipedia figure referred to above.)

@shyshy903

Why is the mutual information not necessarily less than the information entropy?
Recently I also ran into this problem: the mi is larger than the entropy.

@lupupu
Author

lupupu commented Mar 23, 2022

mi is larger than entropy. This is indeed a serious problem.

@braniii

braniii commented Jun 22, 2022

The fact that the mutual information can be larger than the entropy is due to the fact that the continuous Shannon entropy formula,

$$ H(X) = - \int \mathrm{d}x\, p(x) \log p(x) $$

is not correct. Jaynes found that the correct formula should instead be¹

$$ H(X) = - \int \mathrm{d}x\, p(x) \log \frac{p(x)}{m(x)}. $$

This way, the entropy becomes invariant under scaling $X\to cX$ or shifting the data $X\to c + X$. Kraskov et al. were using the former definition of Shannon, which introduces a term that depends directly on the scaling. Their entropy estimator is given by

$$ H(X) = \psi(N) - \psi(k) + d \langle \log \varepsilon \rangle, $$

where the last term $\langle \log \varepsilon \rangle$ can become arbitrarily negative by transforming $X\to cX$ with $c\to 0$ (so that all nearest-neighbour distances get smaller).

A first-order fix to make the entropy more meaningful is to use a scaler, e.g. RobustScaler, before performing the entropy estimation. Nevertheless, this does not ensure that $H(X,Y) \ge I(X,Y)$!
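
A minimal sketch of what that first-order fix could look like (RobustScaler from scikit-learn is just one possible choice of scaler):

import numpy as np
from sklearn.preprocessing import RobustScaler
from entropy_estimators import continuous

x = np.random.standard_normal(1000) / 100   # deliberately small scale

h_raw = continuous.get_h(x, k=3)            # strongly negative, driven by the 1/100 factor
x_scaled = RobustScaler().fit_transform(x.reshape(-1, 1)).ravel()
h_scaled = continuous.get_h(x_scaled, k=3)  # no longer depends on the arbitrary 1/100 factor

print(h_raw, h_scaled)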

@singledoggy

singledoggy commented Mar 1, 2023

For example, in

import numpy as np
from sklearn.feature_selection import mutual_info_regression
from entropy_estimators import continuous

np.random.seed(1)
x = np.random.standard_normal(10000)
y = x + 0.1 * np.random.standard_normal(x.shape)

how can H(X,Y) be equal to the following?

hxy = continuous.get_h(np.c_[x, y], k=3)

H(X,Y) here is from a 10000*10000-dimensional vector, and you can't just get a joint distribution from the marginal distributions, so typically you have to assume that it is a multivariate normal distribution, and that is what you do when the marginal distributions are normal. But if the marginal distribution is not normal, what can we assume?
Or how can we create a normal joint distribution from two non-normal marginal distributions?

@paulbrodersen
Owner

@singledoggy You seem to have a separate question. Please open another issue.

@singledoggy

Sorry, I was thinking about this and suddenly jumped to another topic without realizing it.
I think the scaling is the right cause. We can see that h(x) and h(y) are affected by the scale, while mi(x, y) isn't.

import numpy as np
x = np.random.standard_normal(1000)
y = x + 0.1 * np.random.standard_normal(x.shape)

from sklearn.feature_selection import mutual_info_regression
from entropy_estimators import continuous
mi_reg = mutual_info_regression(x.reshape(-1, 1), y.reshape(-1, ), discrete_features=False)
mi_this = continuous.get_mi(x.reshape(-1, 1), y.reshape(-1, ))

infor_ent_x = continuous.get_h(x, k=3)
infor_ent_y = continuous.get_h(y, k=3)

print(infor_ent_x, infor_ent_y, mi_reg, mi_this)
# output: 1.4006482292373557 1.428777538316779 [2.37495851] 2.170190377210962

import numpy as np
x = np.random.standard_normal(1000) / 100
y = x + 0.1 * np.random.standard_normal(x.shape) / 100

from sklearn.feature_selection import mutual_info_regression
from entropy_estimators import continuous
mi_reg = mutual_info_regression(x.reshape(-1, 1), y.reshape(-1, ), discrete_features=False)
mi_this = continuous.get_mi(x.reshape(-1, 1), y.reshape(-1, ))

infor_ent_x = continuous.get_h(x, k=3)
infor_ent_y = continuous.get_h(y, k=3)

print(infor_ent_x, infor_ent_y, mi_reg, mi_this)
# output: -3.1608263267673697 -3.1215943346483987 [2.43274451] 2.24871532950979
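
This is consistent with the scaling law for the differential entropy,

$$ h(cX) = h(X) + \ln|c|, \qquad \ln\tfrac{1}{100} \approx -4.61, $$

which is roughly the difference between the entropies in the two runs ($-3.16 - 1.40 \approx -4.6$), while the mutual information stays essentially unchanged.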

@singledoggy

Another example is two independent normal distributions: as $\sigma$ changes, their entropy can change from negative to positive, but the mutual information would be 0, so the mutual information can be larger than, smaller than, or equal to the entropy.

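Concretely, for a normal distribution

$$ h(X) = \tfrac{1}{2}\ln(2\pi e \sigma^2), $$

which is negative for $\sigma < 1/\sqrt{2\pi e} \approx 0.24$, while $I(X;Y) = 0$ for independent $X$ and $Y$.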

@mahlzahn
Contributor

Another example is two independent normal distributions: as σ changes, their entropy can change from negative to positive, but the mutual information would be 0, so the mutual information can be larger than, smaller than, or equal to the entropy.

That’s a nice intuitive counterexample! (I propose closing this issue with that.)

@braniii

braniii commented May 14, 2024

Hi everyone, to follow up on this discussion. As @singledoggy and I have explained, the reason that the mutual information of continuous variables can be larger than the entropies is that, in the limit $|\mathcal{X}|\to\infty$, the Shannon entropy $H(X) = -\sum_{x\in\mathcal{X}} p_x \ln p_x$ does not converge to $H(X) = -\int \mathrm{d}x\, p(x) \ln p(x)$. The probability density is a dimensional quantity, and therefore its logarithm, $\ln p(x)$, is not well defined: changing the units changes the values.
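
Concretely (a standard result for sufficiently well-behaved densities): if $X$ is discretized with bin width $\Delta$, the discrete Shannon entropy behaves like

$$ H(X^\Delta) \approx -\int \mathrm{d}x\, p(x)\ln p(x) - \ln\Delta, $$

which diverges as $\Delta\to 0$; the finite part that remains, the differential entropy, inherits a dependence on the units in which $X$ is measured.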

For anyone who actually needs to estimate a normalized mutual information and it is not enough to know why the Kraskov estimator fails: I've recently published an article, arXiv:2405.04980, where I discuss this issue in detail and present a generalization of the Kraskov estimator that is able to estimate normalized mutual information as well. The source code is available on github at moldyn/NorMI.

@paulbrodersen
Owner

@braniii Thanks for taking the time to explain the issue clearly and for providing an alternative. I have linked your last comment at the top of the README to direct people your way.
