# Divergence vs Metric Functions


## Introduction

In this notebook, we will discuss the difference between divergence and metric functions. We will also see how they are related to each other.

## Divergence

A divergence measures how different two probability distributions are. It is a non-negative scalar that equals zero exactly when the two distributions coincide, but unlike a metric it need not be symmetric or satisfy the triangle inequality. Divergences are widely used in machine learning and statistics to compare probability distributions.

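There are many different divergence functions, such as the Kullback-Leibler (KL) divergence and the Jensen-Shannon (JS) divergence. Cross entropy is closely related but is not itself a divergence: it equals the KL divergence plus the entropy of the reference distribution, so it is generally not zero when the two distributions match. Each of these functions has its own properties and applications.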

The word "divergence" means different things in different fields. For example:

- In statistics and machine learning, a divergence measures the difference between two probability distributions. This is the sense used in this notebook.

- In vector calculus, as used in physics and mathematics, the divergence is a differential operator that measures the rate at which a vector field spreads out from a point. It is unrelated to the statistical notion.

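In the loose sense used here (non-negative, and zero only for identical distributions), every metric on probability distributions is a divergence, but not every divergence is a metric. The KL divergence, for instance, is neither symmetric nor subject to the triangle inequality.

For example:

The Hellinger distance and the Jensen-Shannon distance (the square root of the Jensen-Shannon divergence) are both divergence functions that are also metric functions. The Jensen-Shannon divergence itself is symmetric but does not satisfy the triangle inequality. The names "divergence" and "distance" are not always used consistently, so which one appears is partly a question of context and community.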
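One quick way to see that a divergence need not be a metric is to check symmetry for the KL divergence. The sketch below uses two illustrative two-outcome distributions (the values of `P` and `Q` and the helper `kl_divergence` are assumptions chosen for this example) and evaluates the divergence in both directions.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) in nats for strictly positive discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

# Illustrative distributions over two outcomes
P = np.array([0.5, 0.5])
Q = np.array([0.9, 0.1])

kl_pq = kl_divergence(P, Q)
kl_qp = kl_divergence(Q, P)

print(f"KL(P || Q) = {kl_pq:.3f}")  # 0.511
print(f"KL(Q || P) = {kl_qp:.3f}")  # 0.368
```

Because the two directions give different values, the KL divergence fails the symmetry property and therefore cannot be a metric.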

## Metric Function

A metric function (or distance function) on a set $X$ takes two points of $X$ as input and outputs a scalar value. It must satisfy the following properties:

1. Non-negativity: $d(x, y) \geq 0$ for all $x, y \in X$.

2. Identity of indiscernibles: $d(x, y) = 0$ if and only if $x = y$.

3. Symmetry: $d(x, y) = d(y, x)$ for all $x, y \in X$.

4. Triangle inequality: $d(x, y) + d(y, z) \geq d(x, z)$ for all $x, y, z \in X$.
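The four properties above can be checked numerically on a small set of points. The sketch below is illustrative only: `is_metric_on` is a hypothetical helper, and passing the check on a finite sample does not prove that a function is a metric, although a single failure does disprove it. It compares the Euclidean distance with the squared Euclidean distance, which violates the triangle inequality.

```python
import itertools
import numpy as np

def is_metric_on(points, d, tol=1e-12):
    """Check the four metric properties of d on a finite collection of points."""
    for x, y in itertools.product(points, repeat=2):
        # 1. Non-negativity
        if d(x, y) < -tol:
            return False
        # 2. Identity of indiscernibles: d(x, y) == 0 exactly when x == y
        if (abs(d(x, y)) <= tol) != bool(np.array_equal(x, y)):
            return False
        # 3. Symmetry
        if abs(d(x, y) - d(y, x)) > tol:
            return False
    for x, y, z in itertools.product(points, repeat=3):
        # 4. Triangle inequality
        if d(x, y) + d(y, z) < d(x, z) - tol:
            return False
    return True

def euclidean(x, y):
    return float(np.linalg.norm(x - y))

def squared_euclidean(x, y):
    return float(np.sum((x - y) ** 2))

# Three points on a line: 0, 1 and 2
points = [np.array([0.0]), np.array([1.0]), np.array([2.0])]

print(is_metric_on(points, euclidean))          # True
print(is_metric_on(points, squared_euclidean))  # False
```

The squared distance fails because going from 0 to 2 directly costs 4, while going through 1 costs only 1 + 1 = 2.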

## JS Distance is a Metric Function

The Jensen-Shannon divergence is a symmetric divergence function. It is not itself a metric, but its square root, the Jensen-Shannon distance, is. The Jensen-Shannon divergence is defined as:

$JSD(P, Q) = \frac{1}{2} KL(P || M) + \frac{1}{2} KL(Q || M)$

where $KL(P || Q)$ is the Kullback-Leibler divergence between two probability distributions $P$ and $Q$, and $M = \frac{1}{2}(P + Q)$ is the average of the two distributions.

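Its construction offers some advantages over the KL divergence. Because the mixture $M$ is positive wherever $P$ or $Q$ is positive, both KL terms are always finite. The JS divergence is therefore symmetric, bounded above by $\log 2$, and defined when the **model distribution $Q$ is zero for some values of $x$, which is not the case for the KL divergence** $KL(P || Q)$.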
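A small numerical check helps to make these points concrete. The sketch below uses three illustrative distributions over two outcomes (`P`, `Q`, `R` and the helpers `kl` and `jsd` are assumptions made for this example). It shows that the KL divergence is infinite when the second distribution has a zero where the first does not, that the JS divergence stays finite, and that the triangle inequality holds for the square root of the JS divergence but not for the JS divergence itself on this triple.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) in nats. Terms with p == 0 contribute zero; q == 0 < p gives inf."""
    mask = p > 0
    with np.errstate(divide="ignore"):
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jsd(p, q):
    """Jensen-Shannon divergence in nats."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Illustrative distributions: P and Q have disjoint support, R sits between them
P = np.array([1.0, 0.0])
Q = np.array([0.0, 1.0])
R = np.array([0.5, 0.5])

print("KL(P || Q):", kl(P, Q))    # inf
print("JSD(P, Q): ", jsd(P, Q))   # log(2), about 0.693

# Triangle inequality through the intermediate point R
lhs_div = jsd(P, R) + jsd(R, Q)
lhs_dist = np.sqrt(jsd(P, R)) + np.sqrt(jsd(R, Q))
print("Divergence satisfies triangle inequality:", lhs_div >= jsd(P, Q))           # False
print("Distance satisfies triangle inequality:  ", lhs_dist >= np.sqrt(jsd(P, Q)))  # True
```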


## Wasserstein Distance

The Wasserstein distance is another metric function used to measure the difference between two probability distributions. It is defined as:

$W(P, Q) = \inf_{\pi \in \Pi(P, Q)} \int c(x, y) \, d\pi(x, y)$

where:

* $\Pi(P, Q)$ is the set of all joint probability distributions (couplings) with marginals $P$ and $Q$.

* $c(x, y)$ is the cost of transporting a unit of mass from $x$ to $y$. When the cost is itself a distance, such as $c(x, y) = ||x - y||$, $W$ is a metric, known as the Wasserstein-1 or earth mover's distance.
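For small discrete distributions, the infimum over joint distributions can be computed directly as a linear program. The following is a minimal sketch under stated assumptions: the two histograms are illustrative values, the cost is taken to be $c(i, j) = |i - j|$ between bin indices, and `scipy.optimize.linprog` with the HiGHS solver is used to search over transport plans.

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative histograms over four unit-spaced bins
P = np.array([0.1, 0.3, 0.4, 0.2])
Q = np.array([0.2, 0.1, 0.3, 0.4])
n = len(P)

# Cost of moving one unit of mass from bin i to bin j
idx = np.arange(n)
cost = np.abs(idx[:, None] - idx[None, :]).astype(float)

# The transport plan is an n x n matrix, flattened row by row.
# Row sums must equal P and column sums must equal Q.
row_sums = np.kron(np.eye(n), np.ones((1, n)))
col_sums = np.kron(np.ones((1, n)), np.eye(n))
A_eq = np.vstack([row_sums, col_sums])
b_eq = np.concatenate([P, Q])

result = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq,
                 bounds=(0, None), method="highs")
plan = result.x.reshape(n, n)

print("Optimal transport cost:", result.fun)
print("Optimal transport plan:")
print(np.round(plan, 3))
```

For these inputs the optimal cost comes out at approximately 0.4, up to solver tolerance. In one dimension the same value can be obtained much more cheaply from the cumulative distribution functions, so the linear program is mainly useful for seeing what the definition means.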

### Wasserstein Loss Formula for GANs

In the context of Generative Adversarial Networks (GANs), the Wasserstein distance between the real and generated data distributions serves as the loss function. Using the Euclidean norm as the transport cost, the Wasserstein loss function is defined as:

$W(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} \mathbb{E}_{(x, y) \sim \gamma} [||x - y||]$

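where $P_r$ is the real data distribution, $P_g$ is the generated data distribution, and $\Pi(P_r, P_g)$ is the set of all joint probability distributions with marginals $P_r$ and $P_g$. This infimum is intractable to compute directly for high-dimensional data, so in practice Wasserstein GANs estimate it through the equivalent Kantorovich-Rubinstein dual form, using a Lipschitz-constrained critic network.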



In [2]:
import numpy as np

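def wasserstein_distance(P, Q):
    """
    Compute the 1D Wasserstein-1 distance between two histograms P and Q.

    Both histograms must be defined on the same bins, which are assumed to be
    equally spaced with unit width, so the result is expressed in bin widths.

    Parameters:
    P (ndarray): 1D array of probabilities or histogram counts for distribution P.
    Q (ndarray): 1D array of probabilities or histogram counts for distribution Q.

    Returns:
    float: The Wasserstein-1 distance between P and Q.
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    if P.shape != Q.shape:
        raise ValueError("P and Q must be defined on the same bins")

    # Normalize the histograms (make sure they sum to 1)
    P = P / np.sum(P)
    Q = Q / np.sum(Q)

    # Compute cumulative distribution functions (CDF)
    cdf_P = np.cumsum(P)
    cdf_Q = np.cumsum(Q)

    # In 1D, the Wasserstein-1 distance equals the L1 distance between the CDFs
    return float(np.sum(np.abs(cdf_P - cdf_Q)))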

# Example distributions (histograms)
P = np.array([0.1, 0.3, 0.4, 0.2])
Q = np.array([0.2, 0.1, 0.3, 0.4])

# Compute Wasserstein distance
distance = wasserstein_distance(P, Q)
print("Wasserstein distance:", distance)

Wasserstein distance: 0.3999999999999999


### Benefits of using Wasserstein distance for GANs:

Compared to the KL and JS divergences, the Wasserstein distance tends to give more stable GAN training and more useful gradients for the generator.

Why is this? Early in training, the real and generated distributions often have little or no overlapping support. In that regime the KL divergence is infinite and the JS divergence saturates at $\log 2$, so neither tells the generator in which direction to move. The Wasserstein distance stays finite and shrinks smoothly as the generated distribution moves toward the real one, which yields informative gradients.

Being a metric, and so satisfying the triangle inequality, makes the Wasserstein distance a well-behaved notion of distance. What makes it easier to optimize than other divergence functions, however, is that its value reflects how far apart the two distributions are.
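The difference in gradient behavior can be illustrated with two distributions whose supports do not overlap. In the sketch below (the one-hot histograms and the helpers `w1_hist` and `js_div` are assumptions made for this illustration), all the mass of `p` sits in bin 0 while all the mass of `q` sits in a bin that is moved progressively further away. The Wasserstein distance grows with the shift, whereas the JS divergence stays at $\log 2$ and so carries no information about how far apart the distributions are.

```python
import numpy as np

def w1_hist(p, q):
    """1D Wasserstein-1 distance between histograms on unit-spaced bins."""
    return float(np.sum(np.abs(np.cumsum(p) - np.cumsum(q))))

def js_div(p, q):
    """Jensen-Shannon divergence in nats."""
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute zero
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

n_bins = 8
p = np.zeros(n_bins)
p[0] = 1.0  # all mass in bin 0

w1_values, js_values = [], []
for shift in range(1, 5):
    q = np.zeros(n_bins)
    q[shift] = 1.0  # all mass in bin `shift`, no overlap with p
    w1_values.append(w1_hist(p, q))
    js_values.append(js_div(p, q))
    print(f"shift={shift}  W1={w1_values[-1]:.1f}  JS={js_values[-1]:.4f}")
```

The KL divergence between any of these pairs is infinite, since each distribution places mass where the other has none.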


### Example: Iris Data

In [1]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from scipy.spatial.distance import jensenshannon
from scipy.stats import wasserstein_distance
import numpy as np

# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit Naive Bayes model
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)
nb_probs = nb_model.predict_proba(X_test)

# Fit Logistic Regression model
lr_model = LogisticRegression(max_iter=200)
lr_model.fit(X_train, y_train)
lr_probs = lr_model.predict_proba(X_test)

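# Calculate KL divergence KL(NB || LR) in nats, averaged over the test samples
# (assumes no predicted probability is exactly zero; otherwise the result is NaN or inf)
kl_divergence = np.sum(nb_probs * np.log(nb_probs / lr_probs), axis=1).mean()

# Calculate JS distance (square root of the JS divergence) per sample, then average
js_distance = jensenshannon(nb_probs, lr_probs, axis=1).mean()

# Calculate Wasserstein distance
# Note: scipy's wasserstein_distance treats its two positional arguments as samples,
# so this call compares the predicted probability values themselves as 1D samples.
# To weight the class indices by probability instead, pass the indices as the values
# and the probabilities as u_weights and v_weights.
wasserstein_dist = np.mean([wasserstein_distance(nb_probs[i], lr_probs[i]) for i in range(len(nb_probs))])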

print(f"KL Divergence: {kl_divergence}")
print(f"JS Distance: {js_distance}")
print(f"Wasserstein Distance: {wasserstein_dist}")

KL Divergence: 0.07279188207395282
JS Distance: 0.13794915308916347
Wasserstein Distance: 0.052222296152880265
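When using `scipy.stats.wasserstein_distance` with probability vectors, it matters whether the vectors are passed as sample values or as weights. The sketch below shows the difference on two illustrative histograms, assuming bins at positions 0, 1, 2, 3 (the `bins` array is an assumption of this example). Passed positionally, the two arrays are treated as samples, and since they contain the same four numbers the distance is zero. Passed as `u_weights` and `v_weights` over the bin positions, they are treated as probability masses.

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Illustrative histograms
P = np.array([0.1, 0.3, 0.4, 0.2])
Q = np.array([0.2, 0.1, 0.3, 0.4])
bins = np.arange(len(P))  # assumed bin positions 0, 1, 2, 3

# Positional arguments are samples: both arrays hold the values
# {0.1, 0.2, 0.3, 0.4}, so the two empirical distributions are identical
as_samples = wasserstein_distance(P, Q)

# Bin positions as values, probabilities as weights
as_weights = wasserstein_distance(bins, bins, u_weights=P, v_weights=Q)

print("Treated as samples:", as_samples)
print("Treated as weights:", as_weights)
```

For class probabilities such as those in the Iris example, the weighted form also requires choosing positions for the classes. Class labels are nominal, so any such ordering is a modelling assumption rather than a property of the data.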
