In [2]:
from IPython.display import IFrame

# Optimal transport and Wasserstein distance

This notebook was inspired by the following sources:
* [This Wikipedia article](https://en.wikipedia.org/wiki/Wasserstein_metric)
* [This blog article](http://modelai.gettysburg.edu/2020/wgan/Resources/Lesson4/IntuitiveGuideOT1.htm)
* [This short course from Carnegie Mellon University](http://www.stat.cmu.edu/~larry/=sml/Opt.pdf)

In [3]:
#IFrame("doc/OptimalTransportWasserteinDistance/optimal_transport.pdf", width=1200, height=800)

## Introduction from Mindcodec

### Basic introduction

This content is reproduced from the [mindcodec website](http://modelai.gettysburg.edu/2020/wgan/Resources/Lesson4/IntuitiveGuideOT1.htm); all credit for this part goes to its author.

Optimal transport problems can be formulated in a very intuitive way. Consider the following example: an online retailer has $N$ warehouses (storage areas) and there are $K$ customers who ordered e-book readers. The $n$-th storage area $x_n$ contains $m_n$ readers while the $k$-th customer $y_k$ ordered $h_k$ readers. The transport cost $c(x,y)$ is the distance between the storage area $x$ and the address of customer $y$.
The optimal transport problem consists of finding the least expensive way of moving all the readers stored in the storage areas to the customers who ordered them.
A transportation map $\Gamma$ is a matrix whose entry $\Gamma_{nk}$ represents the number of e-book readers sent from the $n$-th storage area to the $k$-th customer. For consistency, the sum of all the readers leaving the $n$-th storage area has to be equal to the total number of readers stored in that area, while the sum of all the readers arriving at a customer’s house has to be equal to the number of e-book readers she ordered. These are the hard constraints of the transport problem and can be written in formulas as follows:

\begin{align*}
  \sum_{k} \Gamma_{nk} = m_n
\end{align*}
and
\begin{align*}
  \sum_{n} \Gamma_{nk} = h_k
\end{align*}

The final constraint is that the entries of the matrix have to be non-negative (a negative number of readers cannot be shipped). The optimal solution is the transportation matrix that minimizes the total cost while respecting the constraints:

\begin{align*}
  \hat{\Gamma} = \underset{\Gamma \in \mathbb{R}_{+}^{N \times K}}{argmin} \sum_{n,k} \Gamma_{nk} c(x_n, y_k)
\end{align*}

In this expression we are assuming that transporting $L$ e-readers from $x_n$ to $y_k$ is $L$ times more expensive than transporting one reader. Note that this assumption is not realistic in most real-world transportation problems, since the transportation cost usually does not scale linearly with the number of transported units. Nevertheless, this simplified problem gives rise to a very elegant and useful mathematical theory.
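For a tiny instance, the problem above can be solved by brute force. The sketch below is only an illustration: the supplies, demands and costs are made-up numbers, and `feasible_plans` and `total_cost` are hypothetical helper names. It relies on the fact that a transportation problem with integer supplies and demands always has an optimal plan with integer entries, so enumerating integer matrices is enough here. Realistic problem sizes would call for a linear programming solver instead.

```python
from itertools import product


def feasible_plans(supply, demand):
    """Yield every non-negative integer matrix with the given row and column sums."""
    n_rows, n_cols = len(supply), len(demand)
    # An entry can never exceed the supply of its row or the demand of its column.
    ranges = [range(min(supply[n], demand[k]) + 1)
              for n in range(n_rows) for k in range(n_cols)]
    for flat in product(*ranges):
        plan = [list(flat[n * n_cols:(n + 1) * n_cols]) for n in range(n_rows)]
        rows_ok = all(sum(plan[n]) == supply[n] for n in range(n_rows))
        cols_ok = all(sum(plan[n][k] for n in range(n_rows)) == demand[k]
                      for k in range(n_cols))
        if rows_ok and cols_ok:
            yield plan


def total_cost(plan, cost):
    """Linear cost: sum over n, k of Gamma_nk * c(x_n, y_k)."""
    return sum(g * c
               for plan_row, cost_row in zip(plan, cost)
               for g, c in zip(plan_row, cost_row))


# Made-up example: 2 storage areas, 3 customers.
supply = [3, 2]            # m_n
demand = [2, 2, 1]         # h_k
cost = [[1, 4, 6],         # c(x_0, y_k)
        [5, 2, 3]]         # c(x_1, y_k)

best_plan = min(feasible_plans(supply, demand), key=lambda p: total_cost(p, cost))
best_cost = total_cost(best_plan, cost)
print(best_plan, best_cost)
```

Each row of `best_plan` sums to the stock of one storage area and each column to the order of one customer, which is exactly the pair of hard constraints stated earlier.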

### Probabilistic formulation
In machine learning and statistics it is often useful to reformulate the optimal transport problem in probabilistic terms. Consider two finite probability spaces $(X, P)$ and $(Y, Q)$ where $X$ and $Y$ are finite sets and $P$ and $Q$ are probability functions assigning a probability to each element of their set. The optimal transport between $P$ and $Q$ is the conditional probability function $\Gamma(y|x)$ that minimizes the following cost function:

\begin{align*}
  \underset{\Gamma}{argmin} \sum_{n,k} \Gamma(y_k|x_n) P(x_n) c(x_n, y_k)
\end{align*}
subject to the following marginalization constraint:
\begin{align*}
  \sum_{n} \Gamma(y_k|x_n) P(x_n) = Q(y_k)
\end{align*}
This simply means that the marginal distribution over $y$ of the joint probability $\Gamma(y_k|x_n) P(x_n)$ is $Q(y_k)$.

In other words, $\Gamma(y|x)$ transports the distribution $P(x)$ into the distribution $Q(y)$.
This transportation can be interpreted as a stochastic function that takes $x$ as input and outputs $y$ with probability $\Gamma(y|x)$. The problem thus consists of finding a stochastic transport that maps the probability distribution $P$ into the probability distribution $Q$ while minimizing the expected transportation cost. This problem is formally identical to the deterministic problem that I introduced in the previous section: the transportation matrix $\Gamma_{nk}$ is given by $\Gamma(y_k|x_n) P(x_n)$, with $P(x_n)$ and $Q(y_k)$ playing the roles of $m_n$ and $h_k$.
Because $\sum_k \Gamma(y_k|x_n) = 1$, the first constraint (number of e-book readers per warehouse) is automatically fulfilled, while the second constraint (number of e-book readers per customer) still needs to be enforced.
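The link between the joint transport plan and the conditional probability can be checked numerically. The following sketch uses a made-up joint plan (the numbers are illustrative assumptions, not taken from the source) and exact fractions to avoid rounding: it recovers $P$ as the row marginal, forms the conditional by dividing each row by $P(x_n)$, and verifies the marginalization constraint.

```python
from fractions import Fraction as F

# Made-up joint plan: joint[n][k] = Gamma(y_k | x_n) * P(x_n).
joint = [[F(2, 5), F(1, 5), F(0)],
         [F(0),    F(1, 5), F(1, 5)]]

# Summing a row over k gives P(x_n).
P = [sum(row) for row in joint]

# Conditional probability Gamma(y_k | x_n): each row is a distribution over y.
conditional = [[g / P[n] for g in row] for n, row in enumerate(joint)]

# Marginalization constraint: sum over n of Gamma(y_k | x_n) P(x_n) = Q(y_k).
Q = [sum(conditional[n][k] * P[n] for n in range(len(P)))
     for k in range(len(joint[0]))]

print("P =", P)
print("Gamma(y|x) =", conditional)
print("Q =", Q)
```

Every row of `conditional` sums to one by construction, which is why the constraint on $P$ holds automatically and only the constraint on $Q$ restricts the choice of transport.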

### Continuous formulation
It is straightforward to extend the definition of probabilistic optimal transport to continuous probability distributions. This can be done by replacing the probabilities $P(x)$ and $Q(y)$ with the probability densities $p(x)$ and $q(y)$ and the summation with an integration:

\begin{align*}
  \underset{\gamma}{argmin} \int \gamma(y|x) p(x) c(x, y) dxdy
\end{align*}

Analogously, the marginalization constraint becomes:
\begin{align*}
  \int \gamma(y|x) p(x) dx = q(y)
\end{align*}

This continuous optimal transport problem is usually introduced in a slightly different (and in my opinion less intuitive) form. I will denote the joint density $\gamma(y|x) p(x)$ as $\gamma(x,y)$.
It is easy to see that the problem can be reformulated as follows:

\begin{align*}
  \underset{\gamma}{argmin} \int \gamma(x,y) c(x, y) dxdy
\end{align*}

with the two marginalization constraints:
\begin{align*}
  \int \gamma(x,y)dx = q(y)
\end{align*}
and
\begin{align*}
  \int \gamma(x,y)dy = p(x)
\end{align*}
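
The second constraint may look new, but it follows directly from the definition $\gamma(x,y) = \gamma(y|x) p(x)$, since a conditional density integrates to one over $y$:

\begin{align*}
  \int \gamma(x,y) dy = \int \gamma(y|x) p(x) dy = p(x) \int \gamma(y|x) dy = p(x)
\end{align*}

The first constraint is the original marginalization constraint rewritten in terms of the joint density:

\begin{align*}
  \int \gamma(x,y) dx = \int \gamma(y|x) p(x) dx = q(y)
\end{align*}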

### Optimal transport divergences
In many situations the primary interest is not to obtain the optimal transportation map. Instead, we are often interested in using the optimal transportation cost as a statistical divergence between two probability distributions. A statistical divergence is a function that takes two probability distributions as input and outputs a non-negative number that is zero if and only if the two distributions are identical. Statistical divergences such as the KL divergence are widely used in statistics and machine learning as a way of measuring dissimilarity between two probability distributions. They play a central role in several of the most active areas of statistical machine learning, such as generative modeling and variational Bayesian inference.
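
As a small illustration of what a statistical divergence looks like in practice, here is a minimal sketch of the KL divergence for discrete distributions. The function name `kl_divergence` and the two example distributions are assumptions made for this illustration. It shows the defining properties (zero for identical inputs, positive otherwise) and also that a divergence does not have to be symmetric.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    # Terms with p_i = 0 contribute nothing; q_i must be positive wherever p_i is.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Made-up example distributions on two outcomes.
p = [0.5, 0.5]
q = [0.9, 0.1]

kl_pp = kl_divergence(p, p)   # identical distributions
kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)   # arguments swapped

print(kl_pp, kl_pq, kl_qp)
```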

### Optimal transport divergences and the Wasserstein distance
An optimal transport divergence is defined as the optimal transportation cost between two probability distributions:

\begin{align*} 
  OT_c[p,q] = \underset{\gamma}{inf} \int \gamma(x,y) c(x,y) dxdy
\end{align*}

where the optimization is subject to the usual marginalization constraints. This expression provides a valid divergence as long as the cost is non-negative and vanishes exactly when $x = y$. Clearly, the properties of an optimal transport divergence depend on its cost function. A common choice is the squared Euclidean distance:

\begin{align*}
  c(x,y) = \|x-y\|_2^2
\end{align*}

Using the squared Euclidean distance as the cost function, we obtain the famous (squared) 2-Wasserstein distance:

\begin{align*} 
  W_2[p,q]^2 = \underset{\gamma}{inf} \int \gamma(x,y) \|x-y\|_2^2 dxdy
\end{align*}

The square root of $W_2[p,q]^2$ is a proper metric between probability distributions, as it satisfies the triangle inequality. Using a proper metric such as the Wasserstein distance instead of other kinds of optimal transport divergences is not crucial for most machine learning applications, but it often simplifies the mathematical treatment. Finally, given an integer $k \geq 1$, the $k$-Wasserstein distance is defined as follows:

\begin{align*} 
  W_k[p,q]^k = \underset{\gamma}{inf} \int \gamma(x,y) \|x-y\|_k^k dxdy
\end{align*}
 
where $\|\cdot\|_k$ denotes the $L_k$ norm, raised here to the $k$-th power.
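
In one dimension the definition above becomes easy to compute for empirical distributions. The sketch below assumes two samples of equal size with uniform weights; in that setting the optimal plan pairs the $i$-th smallest point of one sample with the $i$-th smallest point of the other, and every $L_k$ norm of a scalar is just its absolute value. The function name `wasserstein_k_1d` and the sample values are made up for this illustration.

```python
def wasserstein_k_1d(xs, ys, k=2):
    """k-Wasserstein distance between two equal-size 1-D samples with uniform weights."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes samples of equal size")
    # In one dimension the optimal plan matches sorted points in order.
    paired = zip(sorted(xs), sorted(ys))
    # W_k^k is the average transport cost |x - y|^k under that matching.
    cost = sum(abs(a - b) ** k for a, b in paired) / len(xs)
    return cost ** (1.0 / k)


# Made-up samples.
xs = [0.0, 1.0, 3.0]
ys = [9.0, 5.0, 6.0]
zs = [1.0, 2.0, 2.0]

w1 = wasserstein_k_1d(xs, ys, k=1)
w2 = wasserstein_k_1d(xs, ys, k=2)
print(w1, w2)

# Triangle inequality check for W_2 on these three samples.
print(w2 <= wasserstein_k_1d(xs, zs) + wasserstein_k_1d(zs, ys))
```

Taking the $k$-th root at the end is what turns the optimal transport cost into a distance; without it the triangle inequality is not guaranteed.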