# Mutual Information

YT Video - https://www.youtube.com/watch?v=eJIp_mgVLwE&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&index=8

A measure of how much knowing one variable reduces uncertainty about another.

MI = 0 exactly when the variables are independent.

The more one variable tells you about the other, the higher the MI.

### Formula

$$I(X;Y) = \sum_{x,y} p(x,y)\,\log_2 \frac{p(x,y)}{p(x)\,p(y)}$$


### Derivation Sketch

Show that I(X; Y) = H(X) + H(Y) – H(X, Y), where H denotes entropy.
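
One way to sketch this, assuming the usual definitions $H(X) = -\sum_x p(x)\log_2 p(x)$ and $H(X,Y) = -\sum_{x,y} p(x,y)\log_2 p(x,y)$ (standard, but not stated in these notes), is to split the logarithm in the MI formula and then use the marginals $\sum_y p(x,y) = p(x)$ and $\sum_x p(x,y) = p(y)$:

$$
\begin{aligned}
I(X;Y) &= \sum_{x,y} p(x,y)\,\log_2 \frac{p(x,y)}{p(x)\,p(y)} \\
&= \sum_{x,y} p(x,y)\log_2 p(x,y) \;-\; \sum_{x,y} p(x,y)\log_2 p(x) \;-\; \sum_{x,y} p(x,y)\log_2 p(y) \\
&= -H(X,Y) \;-\; \sum_{x} p(x)\log_2 p(x) \;-\; \sum_{y} p(y)\log_2 p(y) \\
&= H(X) + H(Y) - H(X,Y)
\end{aligned}
$$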


In [2]:
import numpy as np

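def mutual_info(counts):
    """Mutual information in bits from a 2D table of joint counts."""
    # Accept lists or integer arrays and normalise to a joint distribution
    pxy = np.asarray(counts, dtype=float)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1)[:, None]      # marginal of X as a column vector
    py = pxy.sum(axis=0)[None, :]      # marginal of Y as a row vector
    mask = pxy > 0                     # skip empty cells, since 0 * log(0) is taken as 0
    # px @ py is the outer product p(x)p(y), the joint expected under independence
    return np.sum(pxy[mask] * np.log2(pxy[mask] / (px @ py)[mask]))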


### Example

In [3]:
counts = np.array([
    [30, 10],
    [10, 50]
])

print("Mutual information:", mutual_info(counts))


Mutual information: 0.2564258916820028
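
As a check on the number above, the formula can be evaluated by hand. Dividing the counts by their total of 100 gives joint probabilities 0.3, 0.1, 0.1 and 0.5, and both marginals are (0.4, 0.6). The intermediate values below are rounded to five decimals:

$$
\begin{aligned}
I(X;Y) &= 0.3\log_2\frac{0.3}{0.4\cdot 0.4} + 0.1\log_2\frac{0.1}{0.4\cdot 0.6} + 0.1\log_2\frac{0.1}{0.6\cdot 0.4} + 0.5\log_2\frac{0.5}{0.6\cdot 0.6} \\
&\approx 0.27207 - 0.12630 - 0.12630 + 0.23697 \\
&\approx 0.2564 \text{ bits}
\end{aligned}
$$

The two off-diagonal cells contribute negative terms because they occur less often than independence would predict, but the total is still positive.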


### Feature Selection with Mutual Information

In [4]:
from sklearn.feature_selection import mutual_info_classif
from sklearn.datasets import load_iris

# Load dataset
iris = load_iris()
X = iris.data        # Features
y = iris.target      # Target labels

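# Estimate MI between each (continuous) feature and the class label.
# The scores are in nats (natural log), not bits, and the k-nearest-neighbour
# estimator is randomised, so values vary slightly unless random_state is set.
mi_scores = mutual_info_classif(X, y, discrete_features=False)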

# Display scores
for feature, score in zip(iris.feature_names, mi_scores):
    print(f"{feature}: {score:.4f}")


sepal length (cm): 0.5014
sepal width (cm): 0.2415
petal length (cm): 1.0024
petal width (cm): 0.9915
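
To turn these scores into an actual selection, one simple option is to keep the highest-scoring features. The sketch below is only an illustration: the scores are copied from the output above rather than recomputed, and the cutoff `k = 2` is a hypothetical choice, not something the notes prescribe.

```python
import numpy as np

# Scores copied from the output above (they may differ slightly between runs)
feature_names = ["sepal length (cm)", "sepal width (cm)",
                 "petal length (cm)", "petal width (cm)"]
mi_scores = np.array([0.5014, 0.2415, 1.0024, 0.9915])

k = 2  # hypothetical number of features to keep
top_idx = np.argsort(mi_scores)[::-1][:k]   # indices of the k largest scores
selected = [feature_names[i] for i in top_idx]

print("Selected features:", selected)
```

With these scores the two petal measurements are kept, which matches the ranking visible in the printed output.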


### Connection Between Surprise and Entropy

Entropy is the expected surprise.

Mutual Information is built on entropy.

Entropy measures how surprising a variable is, on average:
- H(X): average surprise of X
- H(Y): average surprise of Y
- H(X, Y): average surprise of the pair (X, Y) taken together
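
To make the link between surprise and entropy concrete, here is a minimal sketch. It assumes surprise is measured as log2(1 / p) in bits; the helper names `surprise` and `entropy` and the two coin distributions are illustrative choices, not taken from the video.

```python
import numpy as np

def surprise(p):
    # Surprise of an outcome with probability p, in bits: log2(1 / p)
    return np.log2(1.0 / np.asarray(p, dtype=float))

def entropy(probs):
    # Entropy = expected surprise; zero-probability outcomes contribute nothing
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]
    return float(np.sum(probs * surprise(probs)))

fair_coin = [0.5, 0.5]
biased_coin = [0.9, 0.1]

print("Surprise of each biased outcome:", surprise(biased_coin))
print("Entropy of fair coin:", entropy(fair_coin))
print("Entropy of biased coin:", entropy(biased_coin))
```

The rare outcome of the biased coin is very surprising, but it happens so seldom that the average surprise ends up lower than for the fair coin.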

So mutual information is the amount by which the surprise of the pair falls short of the combined surprise of the two variables taken separately:
I(X; Y) = H(X) + H(Y) – H(X, Y)
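
The identity can be checked numerically on the example counts from earlier. This is a small sketch under the assumption that entropy is computed in bits; the `entropy` helper is written here for illustration and the counts are repeated so the cell runs on its own.

```python
import numpy as np

def entropy(p):
    # Shannon entropy in bits of a probability array (any shape); 0 * log 0 is treated as 0
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

counts = np.array([[30, 10],
                   [10, 50]])
pxy = counts / counts.sum()   # joint distribution p(x, y)
px = pxy.sum(axis=1)          # marginal p(x)
py = pxy.sum(axis=0)          # marginal p(y)

h_x, h_y, h_xy = entropy(px), entropy(py), entropy(pxy)
mi_from_entropies = h_x + h_y - h_xy

print("H(X) =", h_x)
print("H(Y) =", h_y)
print("H(X, Y) =", h_xy)
print("H(X) + H(Y) - H(X, Y) =", mi_from_entropies)
```

Up to floating-point rounding, the last line should agree with the value of about 0.2564 that `mutual_info(counts)` printed for the same table.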


### Mutual Information when Variables are Independent

When variables are independent, p(x,y) = p(x)p(y), so every log term is log₂ 1 = 0 and mutual information = 0

In [5]:
# Perfectly independent joint distribution
indep_counts = np.array([
    [25, 25],
    [25, 25]
])

print("MI for independent variables:", mutual_info(indep_counts))


MI for independent variables: 0.0


### Mutual Information when Variables are Dependent

When variables are dependent, mutual information > 0

In [6]:
# Perfectly dependent: X predicts Y
dependent_counts = np.array([
    [50, 0],
    [0, 50]
])

print("MI for perfectly dependent variables:", mutual_info(dependent_counts))


MI for perfectly dependent variables: 1.0
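
The value 1.0 here is specific to a 50/50 split. When one variable determines the other exactly, MI equals the entropy H(X), which is 1 bit only when the two outcomes are equally likely. The sketch below illustrates this with a skewed table; `mutual_info_bits` repeats the computation of `mutual_info` so the cell is self-contained, and the 80/20 counts are made up for the example.

```python
import numpy as np

def mutual_info_bits(counts):
    # Same computation as mutual_info above, repeated so this cell runs on its own
    pxy = np.asarray(counts, dtype=float)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log2(pxy[mask] / (px * py)[mask])))

# X still determines Y exactly, but the two outcomes are no longer equally likely
skewed_counts = np.array([[80, 0],
                          [0, 20]])
px = skewed_counts.sum(axis=1) / skewed_counts.sum()
h_x = float(-np.sum(px * np.log2(px)))

mi = mutual_info_bits(skewed_counts)
print("MI:", mi)
print("H(X):", h_x)
```

Both printed values should be about 0.72 bits, so a perfectly dependent pair can still have MI below 1.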


### Summary

- Mutual Information (MI) tells us how much knowing one variable reduces uncertainty about another.
- It’s built on entropy, which measures average surprise.
- MI = 0 → variables are completely independent
- MI > 0 → knowing one tells us something about the other
- Captures non-linear relationships and works with non-numeric (categorical) variables
- Common uses:
    - Feature selection
    - Dependency detection
    - Evaluating clustering vs. ground truth
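
The point about non-linear relationships can be illustrated with a small made-up example in which Y = X² for X uniform on {-2, ..., 2}. This is a sketch under those assumptions, not an example from the video: Pearson correlation comes out as zero, while MI detects the dependency.

```python
import numpy as np

# X is uniform on {-2, ..., 2} and Y = X**2: a deterministic but non-linear link
x = np.array([-2, -1, 0, 1, 2])
y = x ** 2

# Pearson correlation misses the relationship entirely
corr = np.corrcoef(x, y)[0, 1]

# Build joint counts by indexing the distinct values of x and y
x_vals, x_idx = np.unique(x, return_inverse=True)
y_vals, y_idx = np.unique(y, return_inverse=True)
counts = np.zeros((len(x_vals), len(y_vals)))
np.add.at(counts, (x_idx, y_idx), 1)

# Same MI formula as before, in bits
pxy = counts / counts.sum()
px = pxy.sum(axis=1, keepdims=True)
py = pxy.sum(axis=0, keepdims=True)
mask = pxy > 0
mi = float(np.sum(pxy[mask] * np.log2(pxy[mask] / (px * py)[mask])))

print("Pearson correlation:", corr)
print("Mutual information (bits):", mi)
```

Because Y is fully determined by X, the MI here equals H(Y), roughly 1.52 bits, even though the linear correlation is zero.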
