In [1]:
#math and linear algebra stuff
import numpy as np

#Statistics stuff
import scipy.stats as scs

#plots
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = (16.0, 9.0)
#mpl.rc('text', usetex = True)
import matplotlib.pyplot as plt
%matplotlib inline

# Normalizing flows for Deep Learning

## Introduction

This notebook is a follow-up to the InformationTheoryOptimization notebook. Many of the basic concepts introduced there will be used here, but we will give a short reminder first.

## Definitions

### Bayes theorem

Although Bayesian statistics is not really a novel approach, the recent rise of probabilistic programming libraries, the availability of large amounts of data, and computers powerful enough to run Markov chains have made the tedious task of running Bayesian inference algorithms a lot easier.

Let's recall some simple elements about Bayes' theorem that we have seen in other notebooks:

For two given random variables, $\theta$ and $x$, we can write:
\begin{equation}
    P(\theta|x) = \frac{P(x|\theta)P(\theta)}{P(x)}
\end{equation}


We could use a slightly more formal version that includes a model $\mathcal{M}$, parametrized by $\theta$, that we believe is the underlying model for the random variable $x$:

\begin{equation}
    P(\theta|x,\mathcal{M}) = \frac{P(x|\theta, \mathcal{M})P(\theta|\mathcal{M})}{P(x|\mathcal{M})}
\end{equation}

* Usually, $\theta$ is the random variable that we would like to characterize. It usually consists of a set of parameters for a given statistical model (mean and variance for a normal distribution, for instance).

* In contrast, $x$, called the evidence, is usually derived from a known dataset, or from sampling a real life process. We will see shortly how we can define an empirical distribution from a set of samples.

* We call $P(x|\theta)$ the likelihood of $x$, given the model parameters $\theta$. This is the likelihood (or its logarithm) that we tried to maximize with various instances of the expectation maximization algorithm in the InformationTheory notebook, by finding the optimal model parameters $\theta$.
Maximum likelihood had some success in the past for classes of problems where the likelihood or its logarithm was concave and differentiable, so that gradient based methods could find a global maximum.

* We call $P(\theta|x)$ the posterior distribution of $\theta$, given known data $x$ (usually an empirical distribution). As opposed to the prior, it gives an idea of the probability of the model AFTER some data ($x$) has been seen.

* We call $P(\theta)$ the prior (a priori) distribution of the variable $\theta$. It can be derived when we have a priori knowledge about the model parameters $\theta$; it may also consist of a priori knowledge about some surrogate parameters that we integrate out as a marginal distribution and scale. A prior allows us to compute the probability of a given set of parameters $\theta \in \mathbb{R}^n$ that feed the model $\mathcal{M}$.

A prior usually writes:
  \begin{align*}
    y &\mapsto P(\theta=y) \\
    \mathbb{R}^n &\rightarrow [0,1]
  \end{align*}
  
A concrete example is the use of the Markov random field (or Gibbs random field) framework in an n-dimensional space such as an image, where we can consider the pixels as the set of vertices of a graph in which only neighbouring pixels are connected by edges. The probability of a given graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ is given by $P(\mathcal{G}) = \alpha e^{-\beta U(\mathcal{G})}$ where $\alpha$ and $\beta$ are normalization factors, and $U(\mathcal{G}) = \sum_{c\in C} V_c(\mathcal{G})$ is a sum of clique potentials $V_c(\mathcal{G})$ over all possible cliques $C$. A very common instance of Gibbs measure is the Gaussian-like distribution where we have, for each pair of neighbouring pixels $p_1-p_2$ (a clique): $V_{p_1-p_2}(\mathcal{G}) = (\mathcal{G}_{p_1}-\mathcal{G}_{p_2})^2$
When one has no prior knowledge about $\theta$, the highest entropy hypothesis is implicitly used. More pragmatically, we write $P(\theta=y) \sim Uniform(\mathrm{supp}(\theta))$ where $\mathrm{supp}(\theta)$ is the support of $\theta$.
Notice that, for a discrete process in which new data keeps arriving, the prior distribution can simply be taken to be the posterior of the previous step.

* Usually $P(x)$ is the empirical distribution that corresponds to a given set of outcomes (actual data). It represents the probability of the given outcome of the experiment regardless of the value of the model parameters $\theta$. We can see $P(x)$ as a marginalization of $\theta$ in the joint distribution $P(x,\theta)=P(x|\theta)P(\theta)$, i.e. $P(x) = \int_{\theta} P(x,\theta)$, so that it acts as a constant that normalizes the likelihood / prior product and makes sure the resulting posterior is a distribution.
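
To make the role of each term concrete, here is a minimal sketch of Bayes' theorem on a discretized support. The coin flip data (`n_flips`, `n_heads`), the grid resolution and the uniform prior are assumptions chosen for illustration only; they do not come from another notebook.

```python
import numpy as np
import scipy.stats as scs

# Hypothetical data x: 7 heads observed in 10 coin flips
n_flips, n_heads = 10, 7

# Discretized support of theta, the unknown probability of heads
theta = np.linspace(0.0, 1.0, 1001)
d_theta = theta[1] - theta[0]

# Prior P(theta): uniform, i.e. the highest entropy hypothesis
prior = np.ones_like(theta)

# Likelihood P(x|theta), up to the binomial coefficient
# (a constant in theta that cancels out in the normalization)
likelihood = theta**n_heads * (1.0 - theta)**(n_flips - n_heads)

# Evidence P(x): marginalization of theta in the joint P(x, theta)
evidence = np.sum(likelihood * prior) * d_theta

# Posterior P(theta|x): normalized likelihood / prior product
posterior = likelihood * prior / evidence

# Most probable parameter after having seen the data
theta_map = theta[np.argmax(posterior)]
print(theta_map)
```

With a uniform prior, the posterior of this toy model is known in closed form (a Beta distribution with parameters `n_heads + 1` and `n_flips - n_heads + 1`), which gives a simple way to check the grid approximation against `scs.beta.pdf`.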

### Bayesian inference and the sampling task

As briefly seen in the InformationTheory notebook, in the general case, computing the posterior over the whole support of $\theta$ is an extremely complex problem, especially if $\theta$ is a vector from a high dimensional space and the actual distribution of the posterior is very complex (not smooth, many local dense areas, etc.). MCMC methods often come into play here: one computes $P(\theta=y_0|x)$ for a given $y_0$, then jumps to another point in space $y_1$ with a given probability that depends on $P(\theta=y|x)$.
Little by little one gets a discrete approximation of the posterior, but it can be very slow.
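
The jumping procedure described above can be sketched with a random walk Metropolis-Hastings sampler, one common member of the MCMC family. The function name `metropolis_hastings`, the step size and the toy target (a normal distribution centred at 2, known only up to a constant) are assumptions made for this illustration.

```python
import numpy as np

def metropolis_hastings(log_target, y0, n_steps=20000, step=1.0, seed=0):
    """Random walk Metropolis-Hastings for a 1D unnormalized log density."""
    rng = np.random.default_rng(seed)
    samples = np.empty(n_steps)
    y, log_p = y0, log_target(y0)
    for i in range(n_steps):
        # Propose a jump to a neighbouring point
        y_new = y + step * rng.normal()
        log_p_new = log_target(y_new)
        # Accept the jump with probability min(1, p(y_new) / p(y))
        if np.log(rng.uniform()) < log_p_new - log_p:
            y, log_p = y_new, log_p_new
        samples[i] = y
    return samples

def log_target(y):
    # Hypothetical posterior, known only up to its normalization constant
    return -0.5 * (y - 2.0) ** 2

samples = metropolis_hastings(log_target, y0=0.0)
burned = samples[2000:]  # discard the first steps (burn-in)
print(burned.mean(), burned.std())
```

Only ratios of the target appear in the acceptance test, so the evidence $P(x)$ never has to be computed; this is what makes the approach usable when the normalization constant is intractable.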

## Kullback Leibler divergence

The Kullback Leibler divergence, also called the relative entropy or discrimination information, can be computed between two distributions, say $x$ and $y$, and writes $ D_{KL}(x\|y) $.
This quantity (which is not a metric in the mathematical sense, as it is not symmetric) represents the expected difference in the number of bits needed to encode a symbol drawn from the distribution $x$, depending on whether the encoding is optimal for the distribution $x$ or for the distribution $y$.

The general case reads:

\begin{align}
    D_{KL}(x\|y) &= \mathbb{E}_x \left[ log \left( \frac{x}{y} \right) \right]
\end{align}

In the discrete case it reads:

\begin{align}
    D_{KL}(x\|y) &= \sum_{i\in I} p(x_i) log(p(x_i)) -\sum_{i\in I} p(x_i) log(p(y_i)) \\
                 &= -H(x) + H_y(x)
\end{align}
or equivalently
$$
    D_{KL}(x\|y) = \sum_{i\in I} p(x_i) log\left(\frac{p(x_i)}{p(y_i)}\right)
$$
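
As a quick numerical check of the discrete formulas above, the following sketch computes the divergence both as a difference between cross entropy and entropy and with `scipy.stats.entropy`, which returns the relative entropy when it is given two distributions. The two probability vectors are arbitrary values chosen for illustration.

```python
import numpy as np
import scipy.stats as scs

# Two arbitrary discrete distributions over the same three symbols
p_x = np.array([0.5, 0.25, 0.25])
p_y = np.array([0.8, 0.1, 0.1])

# Entropy H(x) and cross entropy H_y(x), in bits
entropy_x = -np.sum(p_x * np.log2(p_x))
cross_entropy = -np.sum(p_x * np.log2(p_y))

# D_KL(x||y) = -H(x) + H_y(x)
kl_xy = cross_entropy - entropy_x

# Same quantity computed in the other direction
kl_yx = np.sum(p_y * np.log2(p_y / p_x))

print(kl_xy, scs.entropy(p_x, p_y, base=2))
print(kl_yx, scs.entropy(p_y, p_x, base=2))
```

The two directions give different values, which illustrates that the divergence is not symmetric.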


### KL Divergence and maximum likelihood
This chapter was inspired by [this article](http://www.hongliangjie.com/2012/07/12/maximum-likelihood-as-minimize-kl-divergence/)
by Liangjie Hong

#### Definitions
Let's consider $P(x_i|\theta)$ a distribution used to generate (or sample) $N$ points $x_i, i=0,\dots,N-1$

We can define a general distribution model as:
$$
    P_\theta(x) = P(x|\theta)
$$
And, to handle data from the real world, we will define the following empirical distribution (also called probability mass function):
$$
    P_D(x) = \sum_{i=0}^{N-1} \frac{1}{N} \delta(x-x_i)
$$

#### KL Divergence between model and data
We can now express the Kullback Leibler divergence between the model distribution and the empirical one:

\begin{align}
    D_{KL} \left( P_D(x)||P_\theta(x) \right) &= \int P_D(x) log(P_D(x)) dx -\int P_D(x) log(P_\theta(x)) dx \\
    &= -H(P_D(x)) - \int P_D(x) log(P_\theta(x)) dx
\end{align}

As $-H(P_D(x))$ is a constant term, related only to the empirical distribution of real data, we will call it $\epsilon$, and we may get rid of it later.
Instead we will mainly consider the model related term, i.e. the one that depends on $\theta$:
$$
   D_{KL} \left( P_D(x)||P_\theta(x) \right) = \epsilon -\langle P_D(x) , log(P_\theta(x)) \rangle_{L^2}
$$

Thanks to the definition of the empirical distribution, we can express this continuous expression as a discrete summation:

\begin{align}
    \langle P_D(x) , log(P_\theta(x)) \rangle_{L^2} &= \int \sum_{i=0}^{N-1} \frac{1}{N} \delta(x-x_i) log(P(x|\theta)) dx \\
    &= \sum_{i=0}^{N-1} \frac{1}{N} log(P(x_i|\theta)) \\
    &= \frac{1}{N} \sum_{i=0}^{N-1} log(P(x_i|\theta))
\end{align}

The last line amounts to the average log-likelihood of the dataset. We can then conclude that maximizing the log-likelihood of a dataset $x$ with respect to a set of distribution parameters $\theta$ is equivalent to minimizing the Kullback Leibler divergence between the empirical distribution and the model distribution.
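
The equivalence can be checked numerically with a minimal sketch, assuming a toy model $P(x|\theta)$ that is a normal distribution with unknown mean $\theta$ and unit variance. The sample size, the true mean and the grid of candidate values are arbitrary choices for illustration.

```python
import numpy as np
import scipy.stats as scs

# N samples x_i drawn from a "real" process (hypothetical: normal, mean 3)
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=1.0, size=500)

# Candidate values for the model parameter theta (the mean)
thetas = np.linspace(0.0, 6.0, 601)

# (1/N) sum_i log P(x_i|theta), evaluated for each candidate theta
avg_log_lik = np.array(
    [scs.norm.logpdf(x, loc=t, scale=1.0).mean() for t in thetas]
)

# Model related term of the KL divergence (epsilon is dropped)
kl_model_term = -avg_log_lik

theta_ml = thetas[np.argmax(avg_log_lik)]    # maximum likelihood
theta_kl = thetas[np.argmin(kl_model_term)]  # minimum KL divergence
print(theta_ml, theta_kl, x.mean())
```

Both criteria select the same grid point, which is also the grid point closest to the sample mean, the closed form maximum likelihood estimate for this particular model.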

### Practical KL Divergence between two data sets
We are interested in computing the Kullback Leibler divergence between a model $X$ and a real dataset $Y$, both indexed by $i$, where $i$ is the index of a specific event whose occurrence probability is $X_i$ according to the model, and $Y_i$ in practice.

As in the definition of the empirical distribution, our datasets must verify:
$$
    \sum_{i=0}^{N-1} X_i = \sum_{i=0}^{N-1} Y_i = 1
$$

then the expression of the KL divergence is simply:
$$
    D_{KL}(Y\|X) = \sum_{i=0}^{N-1} log \left( \frac{Y_i}{X_i} \right) Y_i
$$

Moreover there is an important condition that must be respected: 
$$
    X_i = 0 \implies Y_i = 0
$$
If so, when the event $k$ has no occurrence, we have $X_k=Y_k=0$ and $log \left( \frac{Y_k}{X_k} \right) Y_k$ has a limit in $0$ which is $0$. Otherwise, the Kullback Leibler divergence cannot be computed.

The explanation is pretty straightforward: if an event, or a symbol from $Y$, has no occurrence in $X$, it has no reason to be encoded using the probability distribution of $X$. The average number of bits needed to encode a symbol from $Y$ is then meaningless, as some symbols cannot be represented.
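
A minimal sketch of this practical computation, including the handling of empty events, could look as follows. The function name `kl_divergence` is hypothetical, and returning `inf` when the condition is violated is a convention assumed here rather than the only possible choice (raising an error would be equally reasonable).

```python
import numpy as np

def kl_divergence(y, x):
    """D_KL(Y||X) in nats, for two discrete distributions given as arrays."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    # Condition X_i = 0 => Y_i = 0 violated: the divergence is undefined,
    # reported here as infinity (assumed convention)
    if np.any((x == 0) & (y > 0)):
        return np.inf
    # Events with Y_i = 0 contribute 0, the limit of t * log(t) in 0
    mask = y > 0
    return float(np.sum(y[mask] * np.log(y[mask] / x[mask])))

print(kl_divergence([0.5, 0.5, 0.0], [0.25, 0.25, 0.5]))  # finite
print(kl_divergence([0.5, 0.5, 0.0], [0.5, 0.5, 0.0]))    # identical inputs
print(kl_divergence([0.5, 0.5], [1.0, 0.0]))              # condition violated
```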

## What is variational inference ?

### Variational distribution for the posterior

We are still in the framework of Bayes' theorem, and we want to compute our posterior $p$, which stands for $p(\theta|x)$.

In variational inference, you create a variational distribution $q$ over your model parameters $\theta$, which we will call latent variables here. A variational distribution is parametrized by a variational parameter that we will call $\nu$:

$$
 q(\theta|\nu)
$$

The first idea of variational inference, is that we will try to find $\nu$ such that $q(\theta|\nu)$ becomes close to our posterior distribution $p(\theta|x)$.

### KL-divergence for variational/posterior closeness

The closeness of those distributions will be computed using the Kullback-Leibler divergence:

\begin{align}
    D_{KL}(q\|p) &= \mathbb{E}_q \left[ log \left( \frac{q(\theta)}{p(\theta|x)} \right) \right]
\end{align}

As already mentioned, it is good to remember that the KL divergence is not symmetric. In terms of semantics, one must keep in mind that the measure relates to an expectation over the $q$ distribution. So basically, events that are rare with respect to $q$ won't result in a large divergence, even if their occurrence is high in $p$ (they are coded with few bits and seldom occur, so that's not a big deal). However, if high probability events from $q$ are rare in the posterior (hence coded with a high number of bits), you will need to code these events (or symbols) many times, with many bits, and the divergence with respect to an optimal coding will be high.

### Jensen's inequality and evidence lower bound (ELBO)

Let's recall, that, for concave functions, over a domain $d$ we have
\begin{align*}
  f\left(\mathbb{E}_d \left[ x \right] \right) \geq \mathbb{E}_d \left[f\left( x \right)\right]
\end{align*}

Let's now rewrite the log probability of our evidence as a marginalization over the latent variable, and introduce our variational distribution $q(\theta)$ with a small trick:
\begin{align*}
  log(p(x)) &= log \left(\int_{\theta} p(x,\theta) \right) \\
  &= log \left(\int_{\theta} p(x,\theta) \frac{q(\theta)}{q(\theta)} \right) \\
  &= log \left(\mathbb{E}_q \left[ \frac{p(x,\theta)}{q(\theta)} \right] \right) \\
\end{align*}

As $log$ is a nice concave function, we can give a lower bound of the evidence in the previous expression
\begin{align*}
  log(p(x)) &= log \left(\mathbb{E}_q \left[ \frac{p(x,\theta)}{q(\theta)} \right] \right) \\
  log(p(x)) &\geq \mathbb{E}_q \left[ log \left(\frac{p(x,\theta)}{q(\theta)} \right) \right] \\
  log(p(x)) \geq \text{ELBO} &= \mathbb{E}_q \left[ log \left(p(x,\theta)\right) -log\left( q(\theta) \right) \right] \\
  \text{ELBO} &= \underbrace{ \mathbb{E}_q \left[log \left(p(x,\theta)\right)\right]}_{\text{Expectation under} \; q \; \text{of the joint evidence, latent variable probability}} - 
  \underbrace{\mathbb{E}_q \left[log \left( q(\theta) \right) \right]}_{\text{Negative entropy of our variational distribution }q} \\
\end{align*}

OK, so what is the actual interest of this lower bound expression? We need to understand how this latter expression is linked with the KL divergence between $q$ and $p$, which reads:

\begin{align}
    D_{KL}(q\|p) &= \mathbb{E}_q \left[ log \left( \frac{q(\theta)}{p(\theta|x)} \right) \right] \\
    &= \mathbb{E}_q \left[ log \left( q(\theta) \right) \right] - \mathbb{E}_q \left[ log \left( p(\theta|x) \right) \right] \\
        &= \mathbb{E}_q \left[ log \left( q(\theta) \right) \right] - \mathbb{E}_q \left[ log \left( p(\theta,x) \right) \right] + \mathbb{E}_q \left[ log \left( p(x)\right) \right] \\
    &= -\left( \underbrace{\mathbb{E}_q \left[ log \left( p(x,\theta) \right) \right] - \mathbb{E}_q \left[ log \left( q(\theta) \right) \right]}_{\text{= ELBO}}\right) + log \left( p(x)\right)\\
\end{align}

The expectation in the last term, $log \left( p(x)\right)$, goes away, as this term does not depend on $\theta$, which has already been marginalized away. This overall evidence log probability is a constant with respect to $q$; hence minimizing $D_{KL}(q\|p)$ is equivalent to maximizing the ELBO, which is exactly what we need in order to find an optimal surrogate $q(\theta)$ to approximate our posterior $p(\theta|x)$.
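
The identity $log(p(x)) = \text{ELBO} + D_{KL}(q\|p)$ can be verified numerically on a toy model with a discrete latent variable, where the evidence is a plain sum. All the probability values below are arbitrary numbers chosen for illustration.

```python
import numpy as np

# Toy model: theta takes 3 values, x is one fixed observation
prior = np.array([0.5, 0.3, 0.2])        # p(theta)
likelihood = np.array([0.1, 0.4, 0.7])   # p(x|theta) for the observed x
joint = prior * likelihood               # p(x, theta)
evidence = joint.sum()                   # p(x), marginalization of theta
posterior = joint / evidence             # p(theta|x)

def elbo(q):
    # E_q[log p(x, theta)] - E_q[log q(theta)]
    return np.sum(q * (np.log(joint) - np.log(q)))

def kl(q, p):
    # D_KL(q||p) = E_q[log(q / p)]
    return np.sum(q * np.log(q / p))

# An arbitrary variational distribution over theta
q = np.array([0.2, 0.3, 0.5])

print(np.log(evidence), elbo(q) + kl(q, posterior))
print(elbo(q) <= np.log(evidence))
```

The bound becomes tight when $q$ equals the exact posterior, since the KL term then vanishes.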

### Mean field variational inference

There are plenty of families of distributions that have been studied in the framework of variational inference, exponential families, neural networks, Gaussian processes, ...

Let's assume that the variational distribution factorizes:

\begin{align}
    q(\theta_0, \theta_1, \dots , \theta_{m-1}) = \prod_{j=0}^{m-1} q(\theta_j) \\
\end{align}

This would of course only be exact if all dimensions were independent. But it is an easy way to start with a simple variational inference workflow:

1. Choose a model distribution $q$ (normal for instance ?)
2. Derive ELBO expression
3. Perform coordinate ascent on each of the factors $q(\theta_j)$
4. Repeat until convergence
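
The workflow above can be sketched on a case where every step is available in closed form: approximating a correlated bivariate normal target with a factorized normal $q(\theta_0)q(\theta_1)$. The coordinate update used below is the standard textbook result for a Gaussian target; it is not derived in this notebook and is taken here as an assumption, as are the target mean and covariance.

```python
import numpy as np

# Hypothetical target distribution: a correlated bivariate normal
mu = np.array([1.0, -1.0])
cov = np.array([[1.0, 0.8],
                [0.8, 1.0]])
prec = np.linalg.inv(cov)  # precision matrix of the target

# Means of the two variational factors q(theta_0) and q(theta_1)
m = np.zeros(2)

# Coordinate ascent: update one factor while the other is kept fixed
for _ in range(100):
    m[0] = mu[0] - prec[0, 1] / prec[0, 0] * (m[1] - mu[1])
    m[1] = mu[1] - prec[1, 0] / prec[1, 1] * (m[0] - mu[0])

# Variances of the optimal factors (they do not depend on the means)
var = 1.0 / np.diag(prec)

print(m)    # converges to the target mean
print(var)  # smaller than the true marginal variances
```

The factorized approximation recovers the mean but underestimates the marginal variances (about 0.36 instead of 1 here), a typical consequence of ignoring the dependence between dimensions.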

## Transforming distributions with bijectors

### Introduction

The fundamental equation we need here has been known for a very long time. It relates to the change of variable, initially used to integrate quantities in physics.
Given an invertible function $f:\underset{Z}{z} \mapsto \underset{X}{x}$ and a distribution $p_{\theta}(z), \; z \in Z$, the change of variable formula reads

\begin{align*}
  p_{\theta}'(x) &= p_{\theta}\left( f^{-1}(x) \right) |det \left( J_{f^{-1}} \right)| \\
  p_{\theta}'(x) &= p_{\theta}\left( f^{-1}(x) \right) \frac{1}{|det \left( J_{f}\right)|} \\
\end{align*}

Where $|det \left( J_{f}\right)|$ is the absolute value of the determinant of the Jacobian of $f$.
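
A one dimensional sanity check of the formula, assuming a standard normal base distribution and the bijection $f(z) = e^z$ (both chosen for illustration): the transformed density should then match the log-normal density available in `scipy.stats`.

```python
import numpy as np
import scipy.stats as scs

# Points x where we want to evaluate the transformed density
x = np.linspace(0.1, 5.0, 50)

# Inverse transform z = f^{-1}(x), with f = exp
z = np.log(x)

# Second form of the formula: the Jacobian of f at z is exp(z)
p_x = scs.norm.pdf(z) / np.abs(np.exp(z))

# First form of the formula: the Jacobian of f^{-1} at x is 1 / x
p_x_inv = scs.norm.pdf(z) * np.abs(1.0 / x)

# Reference: exp of a standard normal variable is log-normal with s = 1
reference = scs.lognorm.pdf(x, s=1.0)

print(np.allclose(p_x, reference), np.allclose(p_x_inv, reference))
```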

### Quick recall, what is a Jacobian ?

We first recall that, for a function $f:\underset{\mathbb{R}^n}{x} \mapsto \underset{\mathbb{R}^n}{f(x)}$, we can derive the matrix of partial derivatives, called the Jacobian matrix:

\begin{align*}
  J=\begin{pmatrix}
    \frac{\partial f_0}{\partial x_0} & \frac{\partial f_0}{\partial x_1} & \dots & \frac{\partial f_0}{\partial x_{n-1}}\\
    \frac{\partial f_1}{\partial x_0} & \frac{\partial f_1}{\partial x_1} & \dots & \frac{\partial f_1}{\partial x_{n-1}}\\
  \vdots & \vdots & \vdots & \vdots \\
  \frac{\partial f_{n-1}}{\partial x_0} & \frac{\partial f_{n-1}}{\partial x_1} & \dots & \frac{\partial f_{n-1}}{\partial x_{n-1}}
\end{pmatrix}
\end{align*}
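
When no analytical expression is at hand, the Jacobian matrix can be approximated column by column with central finite differences. The helper `numerical_jacobian` and the polar to cartesian example are assumptions introduced for illustration; in practice automatic differentiation would usually be preferred.

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Central finite difference approximation of the Jacobian of f at x."""
    x = np.asarray(x, dtype=float)
    n = x.size
    jac = np.empty((n, n))
    for j in range(n):
        dx = np.zeros(n)
        dx[j] = eps
        # Column j holds the partial derivatives with respect to x_j
        jac[:, j] = (f(x + dx) - f(x - dx)) / (2.0 * eps)
    return jac

def polar_to_cartesian(x):
    # Hypothetical example function from R^2 to R^2
    r, phi = x
    return np.array([r * np.cos(phi), r * np.sin(phi)])

jac = numerical_jacobian(polar_to_cartesian, [2.0, 0.3])
print(jac)
print(np.linalg.det(jac))  # analytically, the determinant equals r
```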

### Chain rule for transformation composition

As seen in the DifferentiatingPerceptron notebook, for complex functions that are applied one after another, as in most deep learning models (except that here we are only interested in bijective layers), the chain rule applies, and we have, for an initial distribution:
\begin{align*}
  p_{\theta}(z), \; z \in Z
\end{align*}
And a chain of bijective transforms:
\begin{align*}
  x = f_{\theta}(z) = f_{K-1} \circ f_{K-2} \circ \dots \circ f_{1} \circ f_{0}(z)
\end{align*}
The following holds
\begin{align*}
  p_{\theta}'(x) &= p_{\theta}(z) \prod_{i=0}^{K-1}\frac{1}{|det \left( J_{f_i}\right)|} \\
  p_{\theta}'(x) &= p_{\theta}\left( f_{\theta}^{-1}(x) \right) \frac{1}{|det \left( J_{f_{\theta}}\right)|} \\
\end{align*}

Throughout this notebook, we will mostly use $z$ as a placeholder for the latent variable, and $f_{\theta}$ as the bijective transformation, parametrized by $\theta$, that generates the observed variable $x$. The latter will often be high dimensional, as for images or time series (note that a bijection requires $z$ and $x$ to share the same dimension).

Now, let's assume we would like to compute the (log) likelihood of a given sample $x$:
\begin{align*}
  log (p_{\theta}'(x)) &= log(p_{\theta}(z)) - \sum_{i=0}^{K-1} log \left( |det \left( J_{f_i}\right)| \right) \\
\end{align*}

Depending on the set of invertible transforms you use, and provided you have access to automatic differentiation, this expression of the likelihood can be used within an optimization objective in order to learn useful representations encoded by $f_{\theta}$, given a set of training examples $\{x_0, x_1, \dots x_{n-1} \in X\}$.
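
The log likelihood expression above can be illustrated with a minimal sketch: a chain of scalar affine bijectors $f_i(z) = a_i z + b_i$ applied to a standard normal base distribution. The `scales` and `shifts` values are arbitrary assumptions; an affine chain is chosen because the resulting distribution is still normal, so the result can be checked against `scipy.stats`.

```python
import numpy as np
import scipy.stats as scs

# Hypothetical parameters of K = 3 affine bijectors f_i(z) = a_i * z + b_i
scales = np.array([2.0, -0.5, 3.0])
shifts = np.array([1.0, 0.0, -2.0])

def forward(z):
    # x = f_{K-1} o ... o f_0 (z)
    for a, b in zip(scales, shifts):
        z = a * z + b
    return z

def inverse(x):
    # z = f_0^{-1} o ... o f_{K-1}^{-1} (x)
    for a, b in zip(scales[::-1], shifts[::-1]):
        x = (x - b) / a
    return x

def log_prob(x):
    # log p(z) - sum_i log |det J_{f_i}|, with |det J_{f_i}| = |a_i|
    z = inverse(x)
    return scs.norm.logpdf(z) - np.sum(np.log(np.abs(scales)))

x = np.array([-3.0, 0.0, 4.5])

# Reference: an affine chain maps N(0, 1) to N(forward(0), prod(a_i)^2)
loc = forward(0.0)
scale = np.abs(np.prod(scales))
print(log_prob(x))
print(scs.norm.logpdf(x, loc=loc, scale=scale))
```

For richer transforms the Jacobian determinant depends on $z$ and must be evaluated at each intermediate point of the chain; the constant determinant used here is specific to affine maps.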

### VAE vs GANs vs NF

As we have seen previously, VAEs use variational inference: they optimize the ELBO, a lower bound on the log evidence, and the encoder only provides a surrogate for the posterior distribution of the latent variables given a sample. We don't get the exact posterior.

In GANs, there is basically no encoder, hence it is impossible to compute the likelihood of a latent variable given a sample. We can only train the generator/discriminator in a min/max fashion, with the usual risks of mode collapse or infinite oscillations.

From a theoretical point of view, NF has very interesting properties, as it allows:

* Exact likelihood evaluation, and potentially optimization during training, through $log(p_{\theta}(z)) - \sum_{i=0}^{K-1} log \left( |det \left( J_{f_i}\right)| \right)$
* Exact posterior inference for a given $x$ through the invertible transform $z=f_{\theta}^{-1}(x)$

Unfortunately, all those theoretical advantages will come with complex numerical challenges, such as finding transformations with tractable and numerically stable Jacobian determinant computations.

## Normalizing flows: main references
Here are a few very important references we might refer to.

Basic introduction with a great youtube video:
https://www.youtube.com/watch?v=i7LjDvsLWCg

In [2]:
from IPython.display import IFrame

Let's start with the amazing work of Mahdi Karami. The document is a PhD thesis dedicated to the study of tools for variational inference, normalizing flows, and the analysis and design of operators with tractable inverse and Jacobian computation, with additional numerical and convex optimization tools. It also contains applications to latent space representations of complex distributions met in signal/image processing, and much more:

In [6]:
IFrame("doc/NormalizingFlows/Karami_Mahdi_202008_PhD.pdf", width=1200, height=800)

Then let's move on to a very interesting review paper: Normalizing Flows: An Introduction and Review of Current Methods by Ivan Kobyzev, Simon J.D. Prince, and Marcus A. Brubaker.

In [7]:
IFrame("doc/NormalizingFlows/1908.09257.pdf", width=1200, height=800)

Another reference paper, that exposes a lot of important properties of Normalizing flows, and design of methods to leverage them in machine learning / representation learning: Normalizing Flows for Probabilistic Modeling and Inference by George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, Balaji Lakshminarayanan

In [8]:
IFrame("doc/NormalizingFlows/1912.02762.pdf", width=1200, height=800)

Then a paper more targeted to application: Same Same But DifferNet: Semi-Supervised Defect Detection with Normalizing Flows by Marco Rudolph, Bastian Wandt and Bodo Rosenhahn:

In [10]:
IFrame("doc/NormalizingFlows/2008.12577.pdf", width=1200, height=800)