# Machine Learning
### Mutual information for continuous-discrete case
**Mutual information** (**MI**) between a continuous feature variable $X$ and a discrete target $Y$ with support $\mathcal{Y}=\{y_1,y_2,\dots,y_K\}$ is defined by:
<br>$\large I(X;Y)=\sum_k\int p(x,y_k)\cdot \log\left(\frac{p(x,y_k)}{p(x)\cdot p(y_k)}\right)\,dx$
<br>Where
- $p(x,y_k)$: joint pdf/pmf (probability density function/probability mass function)
- $p(x),p(y_k)$: marginal pdf of $X$ and marginal pmf of $Y$ 
- Unit: nats (if natural log) or bits (if log₂)

Alternatively, we may write:
<br>$\large I(X;Y)=\int \sum_k p(x,y_k)\cdot \log\left(\frac{p(x,y_k)}{p(x)\cdot p(y_k)}\right)\,dx$
<br>Substituting $p(x,y_k)=p(x|y_k)\cdot p(y_k)$ into the formula above gives:
<br>$\large I(X;Y)=\int \sum_k p(x|y_k)\cdot p(y_k)\cdot \log\left(\frac{p(x|y_k)}{p(x)}\right)\,dx$ &nbsp;&nbsp;&nbsp; (1)
<br>**Hint:** One way to compute mutual information is to evaluate formula (1) by numerical integration.
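
When the class-conditional densities are known, formula (1) can be evaluated directly on a grid. The sketch below assumes Gaussian conditionals and borrows the mixture parameters from the synthetic example used later in this notebook; the function name `mi_formula1_known` is hypothetical, and a plain Riemann sum stands in for a more careful quadrature rule.

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

def mi_formula1_known(weights, means, stds, n_grid=20001):
    """Evaluate formula (1) on a grid when p(x|y_k) are known Gaussians (illustrative)."""
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    # Grid wide enough to cover every component (8 standard deviations each side)
    x = np.linspace(np.min(means - 8 * stds), np.max(means + 8 * stds), n_grid)
    dx = x[1] - x[0]
    # log p(x|y_k) for every class, shape [K, n_grid]
    log_cond = norm.logpdf(x[None, :], loc=means[:, None], scale=stds[:, None])
    # log p(x) = log sum_k p(y_k) p(x|y_k), computed stably
    log_px = logsumexp(log_cond + np.log(weights)[:, None], axis=0)
    # Integrand of formula (1): sum_k p(y_k) p(x|y_k) log(p(x|y_k)/p(x))
    integrand = np.sum(weights[:, None] * np.exp(log_cond) * (log_cond - log_px[None, :]), axis=0)
    return np.sum(integrand) * dx  # Riemann sum over the grid

weights = [0.3, 0.4, 0.3]
mi_true = mi_formula1_known(weights, means=[0, 2, -1], stds=[1, 1.5, 0.8])
print(f"I(X;Y) with known densities: {mi_true:.4f} nats")
```

Because no density is estimated here, the result can serve as a reference value for estimators that rely on KDE.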
<hr>
A second way to compute mutual information is to estimate the expectation in its definition by the Monte Carlo method. Accordingly, we approximate the following formula:
<br>$\large I(X;Y)=\sum_y\int p(x,y)\cdot \log\left(\frac{p(x,y)}{p(x)\cdot p(y)}\right)\,dx=\mathbb{E}_{p(x,y)}\left[\log\left(\frac{p(x,y)}{p(x)\cdot p(y)}\right)\right]$
<br> by the sample mean over $n$ pairs $(x_i,y_i)$ drawn from the joint distribution, which leads to:
<br> $\large I(X;Y)\approx\frac{1}{n}\sum_{i=1}^n \log\left(\frac{p(x_i,y_i)}{p(x_i)\cdot p(y_i)}\right)$
<br>Replacing $p(x_i,y_i)$ by its equivalent $p(x_i|y_i)\cdot p(y_i)$ leads to:
<br>$\large I(X;Y)\approx \frac{1}{n}\sum_{i=1}^n \log\left(\frac{p(x_i|y_i)}{p(x_i)}\right)$ &nbsp;&nbsp;&nbsp; (2)
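
Formula (2) can be sanity-checked at two extremes when the true densities are available. The sketch below assumes Gaussian class conditionals; `sample_mixture` and `mi_formula2_known` are hypothetical helper names. With identical class densities the estimate should be 0, and with well-separated classes it should approach $H(Y)=\log 2$ nats for two equally likely classes.

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

def sample_mixture(rng, n, weights, means, stds):
    """Draw (x_i, y_i) pairs: y from the class prior, x from the class Gaussian."""
    y = rng.choice(len(weights), size=n, p=weights)
    x = rng.normal(np.asarray(means, dtype=float)[y], np.asarray(stds, dtype=float)[y])
    return x, y

def mi_formula2_known(x, y, weights, means, stds):
    """Monte Carlo estimate of formula (2) using the true Gaussian densities."""
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    # log p(x_i | y_i): each sample under its own class density
    log_cond = norm.logpdf(x, loc=means[y], scale=stds[y])
    # log p(x_i) = log sum_k p(y_k) p(x_i | y_k)
    log_all = norm.logpdf(x[:, None], loc=means[None, :], scale=stds[None, :])
    log_px = logsumexp(log_all + np.log(weights)[None, :], axis=1)
    return np.mean(log_cond - log_px)

rng = np.random.default_rng(0)
w = [0.5, 0.5]

# Identical class densities: X carries no information about Y
x, y = sample_mixture(rng, 5000, w, [0, 0], [1, 1])
mi_same = mi_formula2_known(x, y, w, [0, 0], [1, 1])

# Well-separated classes: X almost determines Y
x, y = sample_mixture(rng, 5000, w, [-10, 10], [1, 1])
mi_far = mi_formula2_known(x, y, w, [-10, 10], [1, 1])

print(f"identical classes: {mi_same:.6f} nats")
print(f"separated classes: {mi_far:.6f} nats (H(Y) = {np.log(2):.6f})")
```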
<hr>

**Reminder:** $I(X;Y)=H(X)-H(X|Y)=H(Y)-H(Y|X)$
<br> Where:
- $H(X)=-\int p(x)\cdot \log p(x)\,dx$ (Differential entropy)
- $H(Y)=-\sum_y p(y)\cdot \log p(y)$ (Shannon entropy)
- $H(X|Y)=\sum_y p(y)\cdot H(X|Y=y)$ (Conditional entropy)
- $H(X|Y=y)=-\int p(x|y)\cdot \log p(x|y)\,dx$
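
The identity $I(X;Y)=H(X)-H(X|Y)$ can be recovered from formula (1). The short derivation below is a sketch that uses only the definitions listed above. Splitting the logarithm in formula (1) gives:
<br>$\large I(X;Y)=\sum_k p(y_k)\int p(x|y_k)\cdot \log p(x|y_k)\,dx-\int \Big(\sum_k p(y_k)\cdot p(x|y_k)\Big)\cdot \log p(x)\,dx$
<br>The first term equals $-\sum_k p(y_k)\cdot H(X|Y=y_k)=-H(X|Y)$. In the second term, $\sum_k p(y_k)\cdot p(x|y_k)=p(x)$, so the term becomes $-\int p(x)\cdot \log p(x)\,dx=H(X)$. Hence:
<br>$\large I(X;Y)=H(X)-H(X|Y)$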
<hr>

In the following, we estimate **conditional** and **marginal** densities by 1D KDE (kernel density estimation). Then, we use formula (1) or (2) to compute **mutual information**.
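
As a minimal sketch of the density-estimation step, the block below builds a 1D Gaussian KDE with Scott's bandwidth $h=\sigma\cdot n^{-1/5}$ and checks that the estimated density integrates to roughly 1. The function names `scott_bandwidth` and `gaussian_kde_1d` are hypothetical and only mirror the idea; the standard-normal sample is an assumption chosen for illustration.

```python
import numpy as np

def scott_bandwidth(samples):
    """Scott's rule in 1D: h = sigma * n^(-1/5)."""
    samples = np.asarray(samples, dtype=float)
    return samples.std() * samples.size ** (-1 / 5)

def gaussian_kde_1d(x_eval, samples, bandwidth):
    """Average of Gaussian kernels centred on the samples, evaluated at x_eval."""
    x_eval = np.asarray(x_eval, dtype=float)
    samples = np.asarray(samples, dtype=float)
    z = (x_eval[:, None] - samples[None, :]) / bandwidth  # shape [M, N]
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, 500)
h = scott_bandwidth(samples)

# Evaluate on a grid that extends 5 bandwidths beyond the data range
grid = np.linspace(samples.min() - 5 * h, samples.max() + 5 * h, 2001)
density = gaussian_kde_1d(grid, samples, h)
area = np.sum(density) * (grid[1] - grid[0])  # Riemann sum of the KDE

print(f"bandwidth h = {h:.3f}, integral of KDE = {area:.4f}")
```

The same construction is applied once to all samples for $p(x)$ and once per class for $p(x|y_k)$, each with its own bandwidth.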

<hr>

https://github.com/ostad-ai/Machine-Learning
<br> Explanation: https://www.pinterest.com/HamedShahHosseini/Machine-Learning/

In [1]:
# Import required module
import numpy as np

In [2]:
class MutualInfoContinuousDiscrete:
    """
    Estimate Mutual Information between continuous X and discrete Y 
    by Monte Carlo using KDE with Scott's bandwidth rule.
    """
    
    def __init__(self):
        # Only Scott's rule is used
        pass
    
    def _scott_bandwidth(self, X):
        """Scott's rule for 1D: h = σ * n^(-1/5)"""
        return np.std(X) * len(X) ** (-1/5)
    
    def _kde_logpdf(self, X_train, X_test, bandwidth):
        """Compute log PDF using Gaussian KDE."""
        # Vectorized computation for efficiency
        diff = X_test[:, None] - X_train[None, :]
        z = diff / bandwidth
        kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2*np.pi))
        pdf = np.mean(kernels, axis=1)
        return np.log(np.maximum(pdf, 1e-100))
    
    def fit(self, X, Y):
        """Estimate Mutual Information."""
        X = np.asarray(X).ravel()
        Y = np.asarray(Y).ravel()
        
        if len(X) != len(Y):
            raise ValueError("X and Y must have same length")
        
        n = len(X)
        
        # Get classes and probabilities
        unique_classes, class_counts = np.unique(Y, return_counts=True)
        p_y = class_counts / n
        
        # Marginal bandwidth
        h_marginal = self._scott_bandwidth(X)
        
        # Estimate log p(x) for all samples
        log_px = self._kde_logpdf(X, X, h_marginal)
        
        # Estimate p(x|y) for each class
        log_px_given_y = np.zeros(n)
        
        for cls in unique_classes:
            mask = (Y == cls)
            X_class = X[mask]
            
            if len(X_class) > 1:
                h_class = self._scott_bandwidth(X_class)
                log_px_given_y[mask] = self._kde_logpdf(X_class, X[mask], h_class)
            else:
                log_px_given_y[mask] = log_px[mask]  # Fallback: too few samples for a class KDE, so the log-ratio is 0
        
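        # Compute MI directly via formula (2)
        mi = np.mean(log_px_given_y - log_px)
        self.mi_ = max(mi, 0)  # True MI is non-negative; clip negative estimates caused by noise
        
        # Optional: also compute entropies from the same log-densities, so
        # H_X_ - H_X_given_Y_ equals the unclipped mi by construction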
        self.H_X_ = -np.mean(log_px)
        self.H_X_given_Y_ = np.sum([
            p_y[i] * -np.mean(log_px_given_y[Y == cls]) 
            for i, cls in enumerate(unique_classes)
        ])
        
        return self
    
    def get_mutual_info(self):
        return self.mi_
    
    def get_entropies(self):
        return {'H_X': self.H_X_, 'H_X_given_Y': self.H_X_given_Y_}

In [3]:
# Example
# Create synthetic data
X = np.concatenate([
    np.random.normal(0, 1, 300),
    np.random.normal(2, 1.5, 400),
    np.random.normal(-1, 0.8, 300)
])
Y = np.concatenate([np.zeros(300), np.ones(400), np.full(300, 2)])

# Compute MI
estimator = MutualInfoContinuousDiscrete()
estimator.fit(X, Y)
entropies=estimator.get_entropies()
print(f"MI by direct formula: {estimator.get_mutual_info()}")
print(f"Entropies: {entropies}")
MI_entropies=entropies['H_X']-entropies['H_X_given_Y']
print(f'MI by entropies: I(X;Y)=H(X)-H(X|Y) -> {MI_entropies} ')

MI by direct formula: 0.4325848741329876
Entropies: {'H_X': 1.9346301530769638, 'H_X_given_Y': 1.5020452789439762}
MI by entropies: I(X;Y)=H(X)-H(X|Y) -> 0.43258487413298763 
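
Two quick checks help interpret the value printed above. Since $I(X;Y)=H(Y)-H(Y|X)$ and $H(Y|X)\ge 0$ for a discrete $Y$, the true MI cannot exceed $H(Y)$; and a value in nats converts to bits by dividing by $\log 2$. The sketch below hard-codes the printed estimate and the class sizes 300/400/300 of the example, so it applies to that run only.

```python
import numpy as np

mi_nats = 0.4325848741329876              # estimate printed by the example run above
class_counts = np.array([300, 400, 300])  # class sizes used in the example

# Shannon entropy of Y in nats: an upper bound on I(X;Y)
p_y = class_counts / class_counts.sum()
H_Y = -np.sum(p_y * np.log(p_y))

# Convert nats to bits
mi_bits = mi_nats / np.log(2)

print(f"H(Y) = {H_Y:.4f} nats (upper bound on MI)")
print(f"MI   = {mi_nats:.4f} nats = {mi_bits:.4f} bits")
```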


<hr style="background:lightgreen; height:3px">

# Bonus 
#### Mutual information by formula (1) and integration

In [4]:
# Using formula (1) to estimate Mutual Information
from scipy.stats import norm
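try:
    # The `simps` alias was removed in recent SciPy releases; `simpson` replaces it
    from scipy.integrate import simpson as simps
except ImportError:
    # Older SciPy versions without `simpson`
    from scipy.integrate import simps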

def kde_pdf(x, samples, bandwidth):
    """
    Compute KDE PDF at points x using Gaussian kernel.
    
    Args:
        x: array of points to evaluate PDF (shape: [M])
        samples: observed data (shape: [N])
        bandwidth: scalar bandwidth
    
    Returns:
        pdf: estimated PDF at x (shape: [M])
    """
    samples = np.asarray(samples)
    x = np.asarray(x)
    # Compute pairwise differences
    diff = (x[:, None] - samples[None, :]) / bandwidth  # Shape: [M, N]
    kernel_vals = norm.pdf(diff)  # Gaussian kernel
    pdf = np.mean(kernel_vals, axis=1) / bandwidth
    return pdf

def mutual_information_integration(X, Y, bandwidth=None, n_grid=1000):
    """
    Estimate I(X; Y) where X is continuous, Y is discrete by formula (1)
    
    Args:
        X: continuous data (1D array, shape [N])
        Y: discrete labels (1D array, shape [N])
        bandwidth: float or None (if None, use Scott's rule)
        n_grid: number of points for numerical integration
    
    Returns:
        mi: estimated mutual information (scalar)
    """
    X = np.asarray(X).flatten()
    Y = np.asarray(Y).flatten()
    N = len(X)
    
    # Empirical probabilities for Y
    unique_labels, counts = np.unique(Y, return_counts=True)
    p_y = counts / N  # p(y_k)
    
    # Marginal bandwidth: Scott's rule unless a bandwidth is provided
    if bandwidth is None:
        bandwidth_global = X.std() * N ** (-1/5)
    else:
        bandwidth_global = bandwidth
    
    # Integration grid
    x_min, x_max = X.min(), X.max()
    x_grid = np.linspace(x_min - 3*bandwidth_global, x_max + 3*bandwidth_global, n_grid)
    
    # Estimate marginal p(x) with global bandwidth
    p_x = kde_pdf(x_grid, X, bandwidth_global)
    
    # Avoid log(0) by clipping densities
    eps = 1e-10
    p_x = np.clip(p_x, eps, None)
    
    # Initialize integrand
    integrand = np.zeros_like(x_grid)
    
    # Loop over each class
    for k, label in enumerate(unique_labels):
        # Get samples for this class
        X_k = X[Y == label]
        N_k = len(X_k)
        
        if N_k == 0:
            continue
        # DIFFERENT BANDWIDTH PER CLASS
        if bandwidth is None:
            bandwidth_k = X_k.std() * N_k ** (-1/5) if N_k > 1 else bandwidth_global
        else:
            bandwidth_k = bandwidth    
        # Estimate p(x | y_k)
        p_x_given_y = kde_pdf(x_grid, X_k, bandwidth_k)
        p_x_given_y = np.clip(p_x_given_y, eps, None)
        
        # Compute contribution: p(y_k) * p(x|y_k) * log(p(x|y_k)/p(x))
        log_ratio = np.log(p_x_given_y / p_x)
        integrand += p_y[k] * p_x_given_y * log_ratio
    
    # Numerical integration (Simpson's rule)
    mi = simps(integrand, x=x_grid)
    return mi

In [7]:
# Comparing MI estimation by formulae (1) and (2)
def compare_methods():
    """Compare both methods on synthetic data with known properties"""
    
    # Create test data: 3 classes with different distributions
    n_samples = 1000
    Y = np.random.choice([0, 1, 2], size=n_samples, p=[0.4, 0.35, 0.25])
    X = np.zeros(n_samples)
    
    # Class 0: N(0, 1)
    X[Y == 0] = np.random.normal(0, 1, np.sum(Y == 0))
    # Class 1: N(2, 1.5)  
    X[Y == 1] = np.random.normal(2, 1.5, np.sum(Y == 1))
    # Class 2: N(-1, 0.8)
    X[Y == 2] = np.random.normal(-1, 0.8, np.sum(Y == 2))
    
    # Formula (1) (integration)
    mi_F1 = mutual_information_integration(X, Y, n_grid=2000)
    
    # Formula (2) (Monte Carlo)
    mi_F2 = MutualInfoContinuousDiscrete().fit(X, Y).get_mutual_info()
    
    print("Comparison Results:")
    print(f"MI by formula (1) (integration): {mi_F1} nats")
    print(f"MI by formula (2) (Monte Carlo): {mi_F2} nats")
    print(f"Difference: {abs(mi_F1 - mi_F2):.4f} nats")    
    return mi_F1, mi_F2

# Run comparison
mi_F1, mi_F2 = compare_methods()

Comparison Results:
MI by formula (1) (integration): 0.34188036858603205 nats
MI by formula (2) (Monte Carlo): 0.3651783944021437 nats
Difference: 0.0233 nats
