# Machine Learning
### Mutual information: Discrete-discrete case
In machine learning, **mutual information** (**MI**) measures how much information observing a feature $X$ provides about the target variable $Y$. Specifically, for two **discrete** random variables $X$ (feature) and $Y$ (target), the mutual information is defined as:
<br>$\large I(X;Y)=\sum_{x\in \mathcal{X}}\sum_{y\in \mathcal{Y}} p(x,y) \cdot \log\left(\frac{p(x,y)}{p(x) \cdot p(y)}\right)$
<br>Where
- $p(x,y)$: joint probability mass function (joint PMF) 
- $p(x),p(y)$: marginal probabilities  
- $\mathcal{X}$ and $\mathcal{Y}$: supports (sets of possible values) of $X$ and $Y$
- Unit: nats (if natural log) or bits (if log₂)
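
To make the definition concrete, here is a minimal sketch that evaluates the double sum directly on a joint PMF table. The function name `mi_from_joint` and the 2×2 table are hypothetical, chosen only for illustration; terms with $p(x,y)=0$ are skipped, following the usual convention that they contribute zero.

```python
import numpy as np

def mi_from_joint(joint, base=2):
    """MI of a joint PMF given as a 2-D array (rows: x, columns: y)."""
    joint = np.asarray(joint, dtype=float)
    p_x = joint.sum(axis=1, keepdims=True)   # marginal p(x), shape (|X|, 1)
    p_y = joint.sum(axis=0, keepdims=True)   # marginal p(y), shape (1, |Y|)
    mask = joint > 0                         # terms with p(x,y) = 0 contribute 0
    ratio = joint[mask] / (p_x @ p_y)[mask]  # p(x,y) / (p(x) * p(y))
    return float(np.sum(joint[mask] * np.log(ratio)) / np.log(base))

# Hypothetical joint PMF of two binary variables (entries sum to 1)
joint_pmf = np.array([[0.4, 0.1],
                      [0.1, 0.4]])
print(f"I(X;Y) = {mi_from_joint(joint_pmf):.4f} bits")  # about 0.2781 bits
```

Both marginals are uniform here, so the result equals $1-H(0.8,0.2)\approx 0.2781$ bits.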

<hr> 

Some properties of **mutual information** (**MI**):
- **Symmetry:** $I(X;Y)=I(Y;X)$.
- **Non-negativity:** $I(X;Y)\ge 0$, with equality iff $X$ and $Y$ are independent.
- **Upper bound:** $I(X;Y)\le \min(H(X),H(Y))$.
- **Additivity** for independent variables: If the pairs $(X_1,Y_1)$ and $(X_2,Y_2)$ are independent of each other, then $I(X_1,X_2;Y_1,Y_2)=I(X_1;Y_1)+I(X_2;Y_2)$.
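
The first three properties can be checked numerically on sampled data. The sketch below is only an illustration: the helper names `H` and `I`, the random seed, and the way `Y` is generated from `X` are all assumptions, and MI is estimated from empirical frequencies.

```python
import numpy as np
from collections import Counter

def H(values, base=2):
    """Empirical entropy of a sequence of hashable symbols."""
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)) / np.log(base))

def I(X, Y, base=2):
    """Empirical MI via I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return H(X, base) + H(Y, base) - H(list(zip(X, Y)), base)

rng = np.random.default_rng(0)               # fixed seed, chosen arbitrarily
X = rng.integers(0, 3, size=200).tolist()
noise = rng.integers(0, 2, size=200).tolist()
Y = [(x + e) % 3 for x, e in zip(X, noise)]  # Y depends on X, with noise

tol = 1e-9                                   # slack for floating-point rounding
symmetric = abs(I(X, Y) - I(Y, X)) < tol
non_negative = I(X, Y) >= -tol
bounded = I(X, Y) <= min(H(X), H(Y)) + tol
print(symmetric, non_negative, bounded)
```

Additivity is not checked here, because two finite samples are rarely exactly independent in their empirical frequencies.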

<hr>

Computing MI from **entropies**:
- $I(X;Y)=H(X)+H(Y)-H(X,Y)$
- $I(X;Y)=H(X)-H(X\mid Y)=H(Y)-H(Y\mid X)$
- $I(X;Y)=H(X,Y)-H(Y\mid X)-H(X\mid Y)$
- Generally, for $p$ variables (this generalization is known as the total correlation): $I(X_1,X_2,\ldots,X_p)=\sum_{i=1}^p H(X_i)-H(X_1,X_2,\ldots,X_p)$
    - Reminder: $H(X)=-\sum_x p(x)\cdot \log p(x)$
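
As a sanity check, the two-variable identities above can be evaluated side by side. This is a sketch under stated assumptions: `H` and `H_cond` are hypothetical helper names, the conditional entropy is computed directly as $H(A\mid B)=\sum_b p(b)\,H(A\mid B=b)$, and the sample lists mirror the partial-dependence example used later in this notebook.

```python
import numpy as np
from collections import Counter

def H(values, base=2):
    """Empirical entropy of a sequence of hashable symbols."""
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)) / np.log(base))

def H_cond(A, B, base=2):
    """H(A|B) = sum_b p(b) * H(A | B=b), computed directly from the samples."""
    n = len(B)
    total = 0.0
    for b, count_b in Counter(B).items():
        a_given_b = [a for a, bb in zip(A, B) if bb == b]
        total += (count_b / n) * H(a_given_b, base)
    return total

X = ['A', 'A', 'A', 'B', 'B', 'C']
Y = ['X', 'X', 'Y', 'X', 'Y', 'Y']
H_xy = H(list(zip(X, Y)))                  # joint entropy H(X,Y)

mi_a = H(X) + H(Y) - H_xy                  # first identity
mi_b = H(X) - H_cond(X, Y)                 # second identity, left form
mi_c = H(Y) - H_cond(Y, X)                 # second identity, right form
mi_d = H_xy - H_cond(Y, X) - H_cond(X, Y)  # third identity
print(f"{mi_a:.4f} {mi_b:.4f} {mi_c:.4f} {mi_d:.4f}")
```

All four values should agree up to floating-point rounding.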

<hr>

In the following,
- We compute MI for two discrete random variables from their empirical joint and marginal probabilities.
- A minimal implementation of MI is given.
- MI is then computed from three entropies: $H(X)$, $H(Y)$, and $H(X,Y)$.
- Finally, a function computes MI between any number of random variables using entropies and `Counter` from `collections`.

<hr>

https://github.com/ostad-ai/Machine-Learning
<br> Explanation: https://www.pinterest.com/HamedShahHosseini/Machine-Learning/background-knowledge

In [1]:
# Import required module
import numpy as np

In [2]:
def mi_discrete_discrete(X, Y, base=2):
    """
    MI for discrete-discrete variables using empirical probabilities.
    X and Y are equal-length sequences of labels; base=2 gives bits, base=np.e gives nats.
    """
    X = np.asarray(X).flatten()
    Y = np.asarray(Y).flatten()
    n = len(X)
    
    # Create contingency table
    # Get unique values
    x_vals = np.unique(X)
    y_vals = np.unique(Y)
    
    # Joint probability p(x,y)
    joint_counts = np.zeros((len(x_vals), len(y_vals)))
    
    for i, x in enumerate(x_vals):
        for j, y in enumerate(y_vals):
            joint_counts[i, j] = np.sum((X == x) & (Y == y))
    
    joint_probs = joint_counts / n
    
    # Marginal probabilities
    p_x = np.sum(joint_probs, axis=1)
    p_y = np.sum(joint_probs, axis=0)
    
    # Compute MI
    mi = 0
    for i in range(len(x_vals)):
        for j in range(len(y_vals)):
            if joint_probs[i, j] > 0:
                mi += joint_probs[i, j] * np.log(joint_probs[i, j] / (p_x[i] * p_y[j]))
    
    # Convert to desired base and ensure non-negative
    mi = max(mi / np.log(base), 0)
    
    return mi

In [3]:
# Examples
print('=== Computing MI ===')
# Example 1: Perfect dependence
X1 = ['A', 'B', 'C', 'A', 'B', 'C']
Y1 = ['A', 'B', 'C', 'A', 'B', 'C']
mi1 = mi_discrete_discrete(X1, Y1, base=2)
print(f"Perfect dependence MI: {mi1:.4f} bits")

# Example 2: Complete independence
X2 = ['A', 'A', 'B', 'B']
Y2 = ['X', 'Y', 'X', 'Y']
mi2 =  mi_discrete_discrete(X2, Y2, base=2)
print(f"Complete independence MI: {mi2:.4f} bits")

# Example 3: Partial dependence
X3 = ['A', 'A', 'A', 'B', 'B', 'C']
Y3 = ['X', 'X', 'Y', 'X', 'Y', 'Y']
mi3 = mi_discrete_discrete(X3, Y3, base=2)
print(f"Partial dependence MI: {mi3:.4f} bits")

=== Computing MI ===
Perfect dependence MI: 1.5850 bits
Complete independence MI: 0.0000 bits
Partial dependence MI: 0.2075 bits


<hr style="height:3px; background:lightgreen">

# Minimal code for computing MI

In [4]:
# Computing MI with a minimal code

def mi_minimal(X, Y, base=2):
    """Minimal mutual information implementation"""
    n = len(X)
    joint, mx, my = {}, {}, {}
    for x, y in zip(X, Y):
        joint[(x, y)] = joint.get((x, y), 0) + 1
        mx[x] = mx.get(x, 0) + 1
        my[y] = my.get(y, 0) + 1
    
    # Sum p(x,y) * log(p(x,y) / (p(x) * p(y))) over observed pairs only
    mi = sum((c/n) * np.log((c/n) / ((mx[x]/n) * (my[y]/n)))
             for (x, y), c in joint.items()) / np.log(base)

    return max(mi, 0)

print('=== Computing MI with the minimal code ===')
print(f"Perfect dependence MI: {mi_minimal(X1, Y1):.4f} bits")
print(f"Complete independence MI: {mi_minimal(X2, Y2):.4f} bits")
print(f"Partial dependence MI: {mi_minimal(X3, Y3):.4f} bits")

=== Computing MI with the minimal code ===
Perfect dependence MI: 1.5850 bits
Complete independence MI: 0.0000 bits
Partial dependence MI: 0.2075 bits


<hr style="height:3px; background:lightgreen">

# Using entropies to compute MI

In [5]:
# Compute three entropies to get MI
def entropy(arr, base=2):
    """Empirical entropy; pass a tuple of arrays to get their joint entropy."""
    if isinstance(arr, tuple):
        # Stack the variables as columns and count unique rows (joint outcomes)
        _, counts = np.unique(np.column_stack(arr), axis=0, return_counts=True)
    else:
        _, counts = np.unique(arr, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p)) / np.log(base)

def mi_entropy(X, Y, base=2):
    """MI from entropies: I(X;Y) = H(X) + H(Y) - H(X,Y)"""
    X, Y = np.array(X), np.array(Y)
    return entropy(X, base) + entropy(Y, base) - entropy((X, Y), base)

# Example
print('=== Computing MI with Entropies ===')
print(f"Perfect dependence MI: {mi_entropy(X1, Y1):.4f} bits")
print(f"Complete independence MI: {mi_entropy(X2, Y2):.4f} bits")
print(f"Partial dependence MI: {mi_entropy(X3, Y3):.4f} bits")

=== Computing MI with Entropies ===
Perfect dependence MI: 1.5850 bits
Complete independence MI: 0.0000 bits
Partial dependence MI: 0.2075 bits


<hr style="height:3px; background:lightgreen">

# Bonus
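#### Computing MI between any number of random variables (two or more)
- For more than two variables, the function returns the sum of the individual entropies minus the joint entropy (the total correlation).
- For one random variable, it returns its entropy.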

In [6]:
# Import Counter
from collections import Counter

def mi(*args, base=2):
    """Minimal MI with variable arguments"""
    n = len(args[0])
    
    # Single entropy function
    def H(vals):
        counts = Counter(vals)
        p = np.array(list(counts.values())) / n
        p = p[p > 0]  # Remove zero probabilities
        return -np.sum(p * np.log(p)) / np.log(base)
    
    # Individual entropies
    H_individual = [H(arg) for arg in args]
    
    # Joint entropy
    if len(args) == 1:
        return H_individual[0]
    else:
        H_joint = H(zip(*args))
        return sum(H_individual) - H_joint
    
#---------------------------
# Example
X4 = ['A', 'B', 'A', 'B']
Y4 = ['X', 'X', 'X', 'Y'] 
Z4 = ['1', '2', '1', '2']

print('=== Computing MI for any number of random variables ===')
print(f"MI between two variables, I(X4;Y4): {mi(X4, Y4):.4f} bits") 
print(f"MI between three variables I(X4;Y4;Z4): {mi(X4, Y4,Z4):.4f} bits")
print(f"MI for one variable = entropy I(X4): {mi(X4):.4f} bits")

=== Computing MI for any number of random variables ===
MI between two variables, I(X4;Y4): 0.3113 bits
MI between three variables I(X4;Y4;Z4): 1.3113 bits
MI for one variable = entropy I(X4): 1.0000 bits
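
The three-variable value can be related back to ordinary pairwise MI. The sum of entropies minus the joint entropy satisfies a chain rule: for three variables it equals $I(X;Y)+I((X,Y);Z)$, where the pair $(X,Y)$ is treated as a single variable. The sketch below checks this on the same sample lists; the helper names `H` and `total_correlation` are hypothetical and independent of the `mi` function above.

```python
import numpy as np
from collections import Counter

def H(values, base=2):
    """Empirical entropy of a sequence of hashable symbols."""
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)) / np.log(base))

def total_correlation(*variables, base=2):
    """Sum of individual entropies minus the joint entropy."""
    return sum(H(v, base) for v in variables) - H(list(zip(*variables)), base)

X4 = ['A', 'B', 'A', 'B']
Y4 = ['X', 'X', 'X', 'Y']
Z4 = ['1', '2', '1', '2']

XY = list(zip(X4, Y4))  # treat the pair (X, Y) as a single variable
tc = total_correlation(X4, Y4, Z4)
chain = total_correlation(X4, Y4) + total_correlation(XY, Z4)
print(f"{tc:.4f} {chain:.4f}")  # both about 1.3113 bits
```

Here $I(X;Y)\approx 0.3113$ bits and $I((X,Y);Z)=1$ bit, which together give the 1.3113 bits reported above.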
