# Machine Learning
### Differential entropy
For a continuous random variable $X$ with probability density function (PDF) $p(x)$, the **differential entropy** $h(X)$ is defined as:
<br> $\large h(X)=-\int_{\mathcal{X}}p(x)\,\log p(x)\,dx$
<br> Where:
- The integral is taken over the support of $X$ denoted by $\mathcal{X}$.
- The logarithm is typically base $e$ (units: nats) or base $2$ (units: bits).
- Unlike discrete entropy, differential entropy can be negative, and it is not invariant under change of variables.
- It measures the "spread" or uncertainty of a continuous distribution, but does not represent the average code length (as in discrete entropy).
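
The properties listed above can be checked on a small example. The sketch below uses the closed-form entropy of a uniform distribution (the helper name `uniform_entropy` is only illustrative) to show a negative entropy, the shift of $\log 4$ produced by rescaling $Y=4X$, and the conversion from nats to bits.

```python
import numpy as np

def uniform_entropy(a, b):
    """Closed-form differential entropy of U(a, b) in nats: log(b - a)."""
    return np.log(b - a)

# X ~ U(0, 0.5): the support is shorter than 1, so the entropy is negative
h_narrow = uniform_entropy(0.0, 0.5)

# Y = 4X ~ U(0, 2): rescaling changes the entropy by log(4)
h_scaled = uniform_entropy(0.0, 2.0)
shift = h_scaled - h_narrow

# Converting nats to bits: divide by ln(2)
h_bits = h_narrow / np.log(2)

print(f"h(X) = {h_narrow:.4f} nats = {h_bits:.4f} bits")
print(f"h(4X) - h(X) = {shift:.4f} nats, log(4) = {np.log(4):.4f}")
```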

<hr>

**Reminder 1:** Theoretical Values:
- Gaussian distribution $N(\mu,\sigma^2)$ has $h(X)=\frac{1}{2}\log(2\pi e \sigma^2)$
- Uniform distribution $U(a,b)$ has $h(X)=\log(b-a)$
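
As a quick sanity check, these closed forms can be compared with the `entropy()` method of SciPy's frozen distributions, which returns the differential entropy in nats. The parameter values below are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import norm, uniform

# Arbitrary illustrative parameters
mu, sigma = 1.0, 2.0
a, b = -1.0, 3.0

# Gaussian: 0.5 * log(2*pi*e*sigma^2), independent of mu
h_gauss_formula = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
h_gauss_scipy = float(norm(loc=mu, scale=sigma).entropy())

# Uniform: log(b - a); SciPy parameterizes U(a, b) as loc=a, scale=b-a
h_unif_formula = np.log(b - a)
h_unif_scipy = float(uniform(loc=a, scale=b - a).entropy())

print(f"Gaussian: formula {h_gauss_formula:.6f}, SciPy {h_gauss_scipy:.6f}")
print(f"Uniform:  formula {h_unif_formula:.6f}, SciPy {h_unif_scipy:.6f}")
```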

<hr>

Given data points drawn from a distribution with PDF $p(x)$, we may estimate the differential entropy in two ways:
- **Monte Carlo estimation**: $h(X)\approx -\frac{1}{n} \sum_{i=1}^n \log p(x_i)$, $x_i\sim p(x)$
- **Numerical integration**: $h(X)\approx \int_\mathcal{X} f(x)\,dx$, $f(x)=-p(x)\cdot \log p(x)$, with the integral evaluated numerically on a grid
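
Before bringing in a density estimate, both formulas can be tried with a density that is known exactly. The sketch below is a minimal illustration that assumes a standard normal and uses SciPy's `norm.logpdf` and `quad`. The sample size and the integration limits of $\pm 10$ are arbitrary choices.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

rng = np.random.default_rng(0)
h_true = 0.5 * np.log(2 * np.pi * np.e)

# Monte Carlo: average of -log p(x_i) over samples x_i ~ p
x = rng.standard_normal(100_000)
h_mc = -np.mean(norm.logpdf(x))

# Numerical integration of f(x) = -p(x) * log p(x)
def integrand(t):
    return -norm.pdf(t) * norm.logpdf(t)

# The tails beyond +-10 contribute a negligible amount for N(0, 1)
h_int, _ = quad(integrand, -10.0, 10.0)

print(f"Monte Carlo: {h_mc:.4f}, integration: {h_int:.4f}, theory: {h_true:.4f}")
```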

**Hint:** In practice $p(x)$ is unknown, so for both methods above we estimate it from the $n$ data points $x_1, x_2, \dots, x_n$ by 1D KDE (kernel density estimation): $\hat{p}(x)=\frac{1}{n\cdot h}\sum_{i=1}^n K\left(\frac{x-x_i}{h}\right)$, where the **kernel** $K$ is Gaussian: $K(u)=\frac{1}{\sqrt{2\pi}}e^{-u^2/2}$, and $h$ is called the **bandwidth** (not to be confused with the entropy $h(X)$).
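
The KDE formula can be written directly in NumPy. The following is a minimal sketch in which the function name `gaussian_kde_1d` and the two-point dataset are only illustrative. It checks one value by hand and confirms that the estimate integrates to about 1.

```python
import numpy as np

def gaussian_kde_1d(x, data, h):
    """Evaluate p_hat(x) = 1/(n*h) * sum_i K((x - x_i)/h) with a Gaussian kernel K."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    data = np.asarray(data, dtype=float)
    u = (x[:, None] - data[None, :]) / h           # shape (len(x), n)
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)   # Gaussian kernel values
    return K.sum(axis=1) / (len(data) * h)

# Two data points and bandwidth 1: at x = 0.5 both kernels contribute K(0.5)
data = np.array([0.0, 1.0])
p_mid = gaussian_kde_1d(0.5, data, h=1.0)[0]
expected = np.exp(-0.125) / np.sqrt(2 * np.pi)

# A density estimate should integrate to (approximately) 1
grid = np.linspace(-10.0, 11.0, 4001)
dx = grid[1] - grid[0]
total_mass = np.sum(gaussian_kde_1d(grid, data, h=1.0)) * dx

print(f"p_hat(0.5) = {p_mid:.6f}, expected {expected:.6f}, mass = {total_mass:.6f}")
```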

<br>**Reminder 2:** For KDE (**kernel density estimation**), you may see our previous posts.
<hr>

In the following, we estimate **differential entropy** with both methods: *Monte Carlo-based* and *integration-based*.

<hr>

https://github.com/ostad-ai/Machine-Learning
<br> Explanation: https://www.pinterest.com/HamedShahHosseini/Machine-Learning/

In [1]:
# Import required module
import numpy as np

In [2]:
def _scott_bandwidth(X):
    """Scott's rule for 1D: h = σ * n^(-1/5)"""
    return np.std(X) * len(X) ** (-1/5)

def _kde_logpdf(X_train, X_test, bandwidth):
    """Log of the Gaussian KDE at X_test, computed stably with the log-sum-exp trick."""
    n_train = len(X_train)
    const = -np.log(n_train * bandwidth * np.sqrt(2*np.pi))
    
    log_pdf = np.zeros(len(X_test))
    
    for i, x in enumerate(X_test):
        # Compute exponents directly (no exp() yet)
        z = (x - X_train) / bandwidth
        exponents = -0.5 * z**2  # These can be very negative
        
        # Log-sum-exp trick
        max_exp = np.max(exponents)
        log_sum = max_exp + np.log(np.sum(np.exp(exponents - max_exp)))
        
        log_pdf[i] = const + log_sum
    
    return log_pdf
    
def diff_entropy_MC(data):
    """
    Estimate differential entropy using Monte Carlo and KDE.    
    Parameters:
    -----------
    data : array-like
        Continuous observations
    Returns:
    --------
    entropy : float
        Differential entropy in nats
    """
    data = np.asarray(data).flatten()
    
    # Fit KDE on the data and evaluate its log-density at the same points
    bandwidth = _scott_bandwidth(data)
    log_pdf = _kde_logpdf(data, data, bandwidth)
    
    # Estimate entropy: h = -E[log p(x)] ≈ -mean(log p_hat(x_i))
    return -np.mean(log_pdf)

# Example: standard normal distribution
# (theoretical h = 0.5 * log(2*pi*e) ≈ 1.4189 nats)
np.random.seed(42)
normal_data = np.random.normal(0, 1, 1000)
h_est_MC = diff_entropy_MC(normal_data)
h_true = 0.5 * np.log(2 * np.pi * np.e)  # ≈ 1.4189

print(f"Estimated by MC h(X): {h_est_MC} nats")
print(f"Theoretical h(X): {h_true} nats")

Estimated by MC h(X): 1.3932904433936801 nats
Theoretical h(X): 1.4189385332046727 nats
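
The log-sum-exp step inside `_kde_logpdf` matters when every exponent is very negative, for example at a test point far from all data. The sketch below uses made-up exponents of $-1000$ to show the naive computation underflowing while the shifted version stays finite. `scipy.special.logsumexp` is used only as a cross-check.

```python
import numpy as np
from scipy.special import logsumexp

# Illustrative exponents, e.g. -0.5 * z**2 for a point far from all data
exponents = np.array([-1000.0, -1000.0])

# Naive: exp(-1000) underflows to 0.0, so the log of the sum is -inf
with np.errstate(divide="ignore"):
    naive = np.log(np.sum(np.exp(exponents)))

# Stable: subtract the maximum before exponentiating, add it back afterwards
m = np.max(exponents)
stable = m + np.log(np.sum(np.exp(exponents - m)))  # equals -1000 + log(2)

# Cross-check against SciPy's implementation
reference = logsumexp(exponents)

print(f"naive = {naive}, stable = {stable:.6f}, scipy = {reference:.6f}")
```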


In [3]:
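from scipy.stats import norm
from scipy.integrate import simpson

def simps(y, x):
    """Thin wrapper: `scipy.integrate.simps` was removed in SciPy 1.14 in favor of `simpson`."""
    return simpson(y, x=x)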

def kde_pdf(x, samples, bandwidth):
    """
    Compute Gaussian KDE PDF at points x.
    
    Args:
        x: array of evaluation points (shape [M])
        samples: observed data (shape [N])
        bandwidth: scalar bandwidth (h)
    
    Returns:
        pdf: estimated PDF at x (shape [M])
    """
    samples = np.asarray(samples)
    x = np.asarray(x)
    diff = (x[:, None] - samples[None, :]) / bandwidth
    kernel_vals = norm.pdf(diff)  # Gaussian kernel
    return np.mean(kernel_vals, axis=1) / bandwidth

def diff_entropy_intg(samples, bandwidth=None, n_grid=1000):
    """
    Estimate differential entropy from 1D samples using KDE.
    
    Args:
        samples: 1D array of data points (shape [N])
        bandwidth: float (if None, use Scott's rule)
        n_grid: number of points for numerical integration
    
    Returns:
        h: estimated differential entropy (scalar, in nats)
    """
    samples = np.asarray(samples).flatten()
    N = len(samples)
    
    # Bandwidth selection (Scott's rule)
    if bandwidth is None:
        std = samples.std()
        if std == 0:
            return -np.inf  # Degenerate distribution
        bandwidth = std * N ** (-1/5)
    
    # Integration grid (extend beyond data range)
    x_min, x_max = samples.min(), samples.max()
    x_grid = np.linspace(x_min - 3*bandwidth, x_max + 3*bandwidth, n_grid)
    
    # Estimate PDF via KDE
    p_x = kde_pdf(x_grid, samples, bandwidth)
    
    # Avoid log(0)
    eps = 1e-15
    p_x = np.clip(p_x, eps, None)
    
    # Compute integrand: -p(x) * log(p(x))
    integrand = -p_x * np.log(p_x)
    
    # Numerical integration (Simpson's rule)
    h = simps(integrand, x_grid)
    return h

# Example usage
if __name__ == "__main__":
    np.random.seed(42)
    
    # Standard normal distribution (theoretical h = 0.5 * log(2*pi*e) ≈ 1.4189 nats)
    samples = np.random.normal(0, 1, 10000)
    h_est = diff_entropy_intg(samples)
    h_true = 0.5 * np.log(2 * np.pi * np.e)  # ≈ 1.4189
    
    print(f"Estimated by integration h(X): {h_est} nats")
    print(f"Theoretical h(X): {h_true} nats")

Estimated by integration h(X): 1.4344746695229058 nats
Theoretical h(X): 1.4189385332046727 nats
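
As an independent cross-check of the two KDE-based estimates, SciPy (version 1.6 or later) provides `scipy.stats.differential_entropy`. Note that it is a different technique: it relies on sample-spacing estimators rather than KDE, so its value will not match the numbers above exactly. The seed and sample size below are arbitrary.

```python
import numpy as np
from scipy.stats import differential_entropy

rng = np.random.default_rng(0)
samples = rng.standard_normal(10_000)

# Spacing-based estimate in nats (natural log is the default base)
h_spacing = differential_entropy(samples)
h_true = 0.5 * np.log(2 * np.pi * np.e)

print(f"Spacing-based estimate: {h_spacing:.4f} nats")
print(f"Theoretical h(X): {h_true:.4f} nats")
```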
