# Flow Model for Lattice Field Theory

Author: Haimeng Zhao

Email: haimengzhao@icloud.com

This notebook contains code and detailed notes on using flow-based models to sample from complicated probability distributions, especially those encountered in many-body systems.

The method implemented here is based on several papers ([arXiv:1904.12072](https://inspirehep.net/literature/1731778), [arXiv:2002.02428](https://inspirehep.net/literature/1779199), and [arXiv:2003.06413](https://inspirehep.net/literature/1785309)) and a tutorial [arXiv:2101.08176](https://arxiv.org/abs/2101.08176). 

We first import some useful packages and check whether GPUs are available (if not, CPUs will be used instead).

In [8]:
import numpy as np
import torch
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

# Use CPU or GPU
if torch.cuda.is_available():
    torch_device = 'cuda'
    float_dtype = np.float32 # single
    torch.set_default_tensor_type(torch.cuda.FloatTensor)
else:
    torch_device = 'cpu'
    float_dtype = np.float64 # double
    torch.set_default_tensor_type(torch.DoubleTensor)
print(f"TORCH DEVICE: {torch_device}")

TORCH DEVICE: cuda


Here we borrow some useful functions from 2101.08176.

- `grab` detaches a tensor, moves it to the CPU, and converts it to a NumPy array.

- `init_live_plot`, `moving_average`, and `update_plots` produce live-updating plots for monitoring the training process.

In [12]:
'''
Ref: 2101.08176
'''

def grab(var):
    return var.detach().cpu().numpy()

from IPython.display import display

def init_live_plot(dpi=125, figsize=(8,4)):
    fig, ax_ess = plt.subplots(1,1, dpi=dpi, figsize=figsize)
    # N_era and N_epoch are globals that must be defined before this function is called
    plt.xlim(0, N_era*N_epoch)
    plt.ylim(0, 1)
    
    ess_line = plt.plot([0],[0], alpha=0.5) # dummy
    plt.grid(False)
    plt.ylabel('ESS')
    
    ax_loss = ax_ess.twinx()
    loss_line = plt.plot([0],[0], alpha=0.5, c='orange') # dummy
    plt.grid(False)
    plt.ylabel('Loss')
    
    plt.xlabel('Epoch')

    display_id = display(fig, display_id=True)

    return dict(
        fig=fig, ax_ess=ax_ess, ax_loss=ax_loss,
        ess_line=ess_line, loss_line=loss_line,
        display_id=display_id
    )

def moving_average(x, window=10):
    if len(x) < window:
        return np.mean(x, keepdims=True)
    else:
        return np.convolve(x, np.ones(window), 'valid') / window

def update_plots(history, fig, ax_ess, ax_loss, ess_line, loss_line, display_id):
    Y = np.array(history['ess'])
    Y = moving_average(Y, window=15)
    ess_line[0].set_ydata(Y)
    ess_line[0].set_xdata(np.arange(len(Y)))
    Y = history['loss']
    Y = moving_average(Y, window=15)
    loss_line[0].set_ydata(np.array(Y))
    loss_line[0].set_xdata(np.arange(len(Y)))
    ax_loss.relim()
    ax_loss.autoscale_view()
    fig.canvas.draw()
    display_id.update(fig) # need to force colab to update plot

## Introduction

The essential idea behind this method is to use a neural network to learn a transformation from a simple distribution that is easy to sample (the prior) to the complicated target distribution (see the figure below, from 1904.12072).

![Fig. 1 of 1904.12072](./assets/normalizing_flow.png)

For example, we can choose the prior to be a simple normal distribution.

In [11]:
class NormalPrior:
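    def __init__(self, mean, var):
        self.mean = torch.flatten(mean)
        self.var = torch.flatten(var)
        # Note: Normal takes a standard deviation (scale) as its second argument,
        # so `var` is effectively used as a standard deviation here.
        self.dist = torch.distributions.normal.Normal(self.mean, self.var)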
        self.shape = mean.shape

    def log_prob(self, x):
        logp = self.dist.log_prob(x.reshape(x.shape[0], -1))
        return torch.sum(logp, dim=1)

    def sample(self, batch_size):
        x = self.dist.sample((batch_size,))
        return x.reshape(batch_size, *self.shape)

Note that here for the purpose of lattice field theory, we assume the target density $p(x)$ to be $e^{-S}/Z$, where $S$ is the action of the field theory and $Z$ the normalization constant.

Once the neural network has learned the desired transformation, we can sample from the prior and apply the network to map those samples to the desired distribution.

In [16]:
def apply_flow_to_prior(prior, layers, batch_size):
    # sample from prior
    x = prior.sample(batch_size)
    logq = prior.log_prob(x)

    # flow through the model
    for layer in layers:
        x, logJ = layer.forward(x)
        logq = logq - logJ
    
    return x, logq

Since the neural network represents a gradual "flow" of the distribution from the prior to the target, it is called a flow model. To further stress that each slice of the flow is itself a distribution satisfying the normalization condition $\int p(x) dx = 1$, the model is often called a normalizing flow.

However, the trained neural network is only an approximation, so samples drawn directly through the flow are biased. To produce unbiased samples and thus measure observables, we still need a resampling procedure such as Markov chain Monte Carlo (MCMC). (If you only need an approximation, direct sampling through the flow is enough.)

This time, however, MCMC requires significantly less burn-in time and is more robust to the model parameters, thanks to the flow. Furthermore, sampling with the flow is also free from the fermion sign problem (though this requires the flow to have a property called equivariance, which happens to be satisfied by the well-known transformer model from natural language processing). The resulting MCMC-flow hybrid sampler can therefore be much more efficient than plain MCMC.

To quantify the quality of the flow model's samples, we can use the effective sample size (ESS), which plays a role similar to that of the autocorrelation time in MCMC:
$$
\mathrm{ESS} = \frac{ \left(\frac{1}{N} \sum_i p(x_i)/q(x_i) \right)^2 }{ \frac{1}{N} \sum_i \left( p(x_i)/q(x_i) \right)^2 } \in [0, 1].
$$
A larger ESS indicates better sampling, and $\mathrm{ESS}=1$ corresponds to sampling exactly from the desired distribution.
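
The `train` function later in this notebook calls a helper named `compute_ess(logp, logq)`. A minimal sketch of such a helper, following the formula above, might look like the following. It assumes `logp` and `logq` are 1D tensors of per-sample log densities, and it works in log space with `torch.logsumexp` for numerical stability. Because the formula is a ratio, a constant shift in `logp` (such as the unknown $\log Z$) cancels, so the unnormalized $-S$ can be passed directly.

```python
import torch

def compute_ess(logp, logq):
    """Per-sample effective sample size from log densities.

    logp may be unnormalized (e.g. -S): a constant shift cancels in the ratio.
    """
    logw = logp - logq  # log importance weights log(p/q)
    # log[(sum_i w_i)^2 / sum_i w_i^2], computed stably in log space
    log_ess = 2 * torch.logsumexp(logw, dim=0) - torch.logsumexp(2 * logw, dim=0)
    # dividing by N gives the normalized ESS of the formula above
    return torch.exp(log_ess) / len(logw)

# Sanity checks on hand-made weights (illustrative values only)
logq = torch.zeros(4)
ess_uniform = compute_ess(torch.full((4,), 3.0), logq)  # all weights equal
ess_peaked = compute_ess(torch.tensor([0.0, -1e3, -1e3, -1e3]), logq)  # one weight dominates
```

With equal weights the ESS is 1, while a single dominant weight among $N$ samples gives $1/N$, which is the smallest value a finite sample can produce.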

For later use, we borrow the MCMC code from 2101.08176.

In [18]:
'''
Ref: 2101.08176
'''

def serial_sample_generator(model, action, batch_size, N_samples):
    '''
    Generate samples from prior and flow through the model, 
    yield logq and logp alongside.

    Here for lattice field theory purpose, we assume p = e^{-S}/Z, 
    where S is the action of the field theory.
    '''
    layers, prior = model['layers'], model['prior']
    layers.eval()
    x, logq, logp = None, None, None
    for i in range(N_samples):
        batch_i = i % batch_size
        if batch_i == 0:
            # we're out of samples to propose, generate a new batch
            x, logq = apply_flow_to_prior(prior, layers, batch_size=batch_size)
            logp = -action(x)
        yield x[batch_i], logq[batch_i], logp[batch_i]
    
def make_mcmc_ensemble(model, action, batch_size, N_samples):
    history = {
        'x' : [],
        'logq' : [],
        'logp' : [],
        'accepted' : []
    }

    # build Markov chain
    sample_gen = serial_sample_generator(model, action, batch_size, N_samples)
    for new_x, new_logq, new_logp in sample_gen:
        if len(history['logp']) == 0:
            # always accept first proposal, Markov chain must start somewhere
            accepted = True
        else: 
            # Metropolis acceptance condition
            last_logp = history['logp'][-1]
            last_logq = history['logq'][-1]
            p_accept = torch.exp((new_logp - new_logq) - (last_logp - last_logq))
            p_accept = min(1, p_accept)
            draw = torch.rand(1) # ~ [0,1]
            if draw < p_accept:
                accepted = True
            else:
                accepted = False
                new_x = history['x'][-1]
                new_logp = last_logp
                new_logq = last_logq
        # Update Markov chain
        history['logp'].append(new_logp)
        history['logq'].append(new_logq)
        history['x'].append(new_x)
        history['accepted'].append(accepted)
    return history
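
The acceptance rule used above is that of an independence Metropolis sampler: each proposal is drawn from $q$ without reference to the current state, and it is accepted with probability $\min\left(1, \frac{p(x')/q(x')}{p(x)/q(x)}\right)$. As a rough, self-contained illustration, the sketch below applies the same rule in one dimension. The fixed Gaussian `q` is only a stand-in for a trained flow, and the quadratic `action` is a hypothetical target chosen so that the exact answer (mean 1) is known.

```python
import torch

torch.manual_seed(0)

# Stand-ins (assumptions): q plays the role of the flow, p ∝ exp(-S) is a unit Gaussian at 1
q = torch.distributions.Normal(0.0, 1.5)

def action(x):
    return 0.5 * (x - 1.0) ** 2

N = 5000
xs = q.sample((N,))   # independent proposals
logq = q.log_prob(xs)
logp = -action(xs)    # unnormalized log p

cur = 0               # index of the current state; the first proposal is always accepted
chain, n_accept = [xs[0]], 0
for i in range(1, N):
    # log of [p(x')/q(x')] / [p(x)/q(x)]
    log_ratio = (logp[i] - logq[i]) - (logp[cur] - logq[cur])
    if torch.rand(1) < torch.exp(log_ratio):
        cur = i
        n_accept += 1
    chain.append(xs[cur])  # a rejected proposal repeats the current state

chain = torch.stack(chain)
accept_rate = n_accept / (N - 1)
chain_mean = chain.mean().item()
```

The better `q` approximates `p`, the closer the acceptance rate gets to 1, which is why a well-trained flow makes this kind of chain efficient.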

## Normalizing Flow

Now we dive into the details of normalizing flows.

Translating the above intuition into rigorous math, we aim to find a transformation $f(z)$ that maps a random variable $z$ with a simple prior density $r(z)$ to the output variable $x = f(z)$ with density $q(x)$. By the change-of-variable formula we have

$$
q(x) = r(z)|J|^{-1} = r(z)\left |\det \frac{\partial f_i(z)}{\partial z_j}\right |^{-1},
$$

where $J = \det \frac{\partial f_i(z)}{\partial z_j}$ is the Jacobian determinant.
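
As a quick numerical check of this formula, consider a one-dimensional affine map $f(z) = az + b$ applied to a standard normal prior. The values of `a` and `b` below are arbitrary illustrative choices. The formula gives $\log q(x) = \log r(z) - \log|a|$, which should agree with the known closed form $x \sim \mathcal{N}(b, |a|)$.

```python
import torch

r = torch.distributions.Normal(0.0, 1.0)  # prior r(z)
a, b = 2.0, -1.0                          # hypothetical flow f(z) = a*z + b

z = torch.linspace(-3.0, 3.0, 7)
x = a * z + b

# log|det df/dz| = log|a|, the same for every z in this simple case
logJ = torch.log(torch.abs(torch.tensor(a)))
logq_flow = r.log_prob(z) - logJ          # log q(x) = log r(z) - log|J|

# Closed form: an affine map of a standard normal is Normal(b, |a|)
logq_exact = torch.distributions.Normal(b, abs(a)).log_prob(x)
max_err = (logq_flow - logq_exact).abs().max().item()
```

This is the same bookkeeping that `apply_flow_to_prior` performs layer by layer with `logq = logq - logJ`.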

That's all it is! A change of variable. Now we only need to train a neural network to find the optimal $f$ which minimizes the distance between the output density $q(x)$ and the target density $p(x)$:
$$
f = \operatorname*{arg\,min}_f \, d(q, p).
$$

A common choice for measuring the distance between two distributions is the Kullback-Leibler (KL) divergence
$$
D_{KL}(q||p) = \int dx \ q(x)[\log q(x) - \log p(x)],
$$
which can be estimated by
$$
\hat{D}_{KL}(q||p) = \frac{1}{N}\sum_{i=1}^N[\log q(x_i) - \log p(x_i)], \quad x_i \sim q.
$$

In [15]:
def kl_divergence(logp, logq):
    return torch.mean(logq - logp)

Here we can see the advantage of flow models over traditional methods. We only need samples from the "model distribution" $q(x)$, which are easily generated from the prior, while traditional methods such as HMC require sampling from $p(x)$ itself.
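
As a sanity check of the estimator, the sketch below compares it with a case where the answer is known in closed form: for two unit-variance Gaussians whose means differ by 1, $D_{KL}(q||p) = 1/2$. The distributions and sample size are illustrative assumptions. It also shows what happens when the unnormalized $\log p = -S$ is used, as in the training code: the estimate is shifted by the constant $-\log Z$, so the loss can be negative while its gradient is unaffected.

```python
import math
import torch

torch.manual_seed(0)

def kl_divergence(logp, logq):  # same estimator as in the cell above
    return torch.mean(logq - logp)

q = torch.distributions.Normal(0.0, 1.0)  # model distribution
p = torch.distributions.Normal(1.0, 1.0)  # target distribution
x = q.sample((100_000,))                  # samples come from q only

kl_est = kl_divergence(p.log_prob(x), q.log_prob(x)).item()
kl_exact = 0.5                            # (mean difference)^2 / 2 for unit variances

# Unnormalized target: p ∝ exp(-S) with S = (x - 1)^2 / 2 and Z = sqrt(2*pi)
action = 0.5 * (x - 1.0) ** 2
loss_unnorm = kl_divergence(-action, q.log_prob(x)).item()
log_Z = 0.5 * math.log(2 * math.pi)
# loss_unnorm estimates kl_exact - log_Z, which is negative here
```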

In short, our training procedure consists of the following steps:
1. Draw samples from the prior and flow them through the model.
2. Estimate the KL divergence.
3. Use an optimization method such as SGD or Adam to minimize the KL divergence.

During training, we monitor the KL divergence (loss function) and ESS to keep track of the training process.

In [19]:
def train(model, action, optimizer, batch_size, metrics):
    layers, prior = model['layers'], model['prior']
    optimizer.zero_grad()

    x, logq = apply_flow_to_prior(prior, layers, batch_size)
    logp = -action(x)  # unnormalized log p; the missing -log Z only shifts the loss by a constant
    loss = kl_divergence(logp, logq)

    loss.backward()
    optimizer.step()

    metrics['loss'].append(grab(loss))
    metrics['logp'].append(grab(logp))
    metrics['logq'].append(grab(logq))
    metrics['ess'].append(grab(compute_ess(logp, logq)))
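
To see the three training steps end to end on something tiny, here is a self-contained sketch that trains a single affine layer to match a one-dimensional Gaussian target. The `AffineLayer` class, the quadratic `action`, and the hyperparameters are hypothetical choices made for illustration and are not the architecture used for lattice fields. The layer follows the same convention as above: `forward` returns the transformed sample together with `logJ`.

```python
import torch

torch.manual_seed(0)

class AffineLayer(torch.nn.Module):
    """Hypothetical toy flow: x = exp(log_s) * z + t."""
    def __init__(self):
        super().__init__()
        self.log_s = torch.nn.Parameter(torch.zeros(1))
        self.t = torch.nn.Parameter(torch.zeros(1))

    def forward(self, z):
        x = torch.exp(self.log_s) * z + self.t
        logJ = self.log_s.expand(z.shape[0])  # log|dx/dz| for each sample
        return x, logJ

def action(x):
    # target p ∝ exp(-S): a Gaussian with mean 2 and standard deviation 0.5
    return 0.5 * ((x - 2.0) / 0.5) ** 2

prior = torch.distributions.Normal(0.0, 1.0)
layer = AffineLayer()
optimizer = torch.optim.Adam(layer.parameters(), lr=0.02)

for step in range(1000):
    optimizer.zero_grad()
    z = prior.sample((256,))               # 1. draw from the prior and flow
    x, logJ = layer(z)
    logq = prior.log_prob(z) - logJ
    loss = torch.mean(logq + action(x))    # 2. reverse KL, up to the constant log Z
    loss.backward()
    optimizer.step()                       # 3. gradient step

learned_mean = layer.t.item()
learned_std = torch.exp(layer.log_s).item()
```

After training, `learned_mean` and `learned_std` should be close to the target values 2 and 0.5. An affine map can represent this Gaussian target exactly, which is not the case for realistic field theories.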

What remains is to design a flow $f$ that is expressive enough while keeping its Jacobian determinant tractable.