# The basics

This notebook walks you through the basics of PyTorch/Zuko distributions and transformations, how to parametrize probabilistic models, how to instantiate pre-built normalizing flows and finally how to create custom flow architectures. Training is covered in other tutorials.

In [1]:
import torch
import zuko

## Distributions and transformations

PyTorch defines two components for probabilistic modeling: the [`Distribution`](torch.distributions.distribution.Distribution) and the [`Transform`](torch.distributions.transforms.Transform). A distribution object represents the probability distribution $p(X)$ of a random variable $X$. A distribution must implement the `sample` and `log_prob` methods, meaning that we can draw realizations $x \sim p(X)$ from the distribution and evaluate the log-likelihood $\log p(X = x)$ of realizations.

In [2]:
distribution = torch.distributions.Normal(torch.tensor(0.0), torch.tensor(1.0))

x = distribution.sample()
log_p = distribution.log_prob(x)

x, log_p

(tensor(0.2122), tensor(-0.9415))
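
As a sanity check, the value returned by `log_prob` can be compared against the closed-form log-density of the standard normal distribution. The snippet below is a minimal sketch that reuses the realization printed above; the name `log_p_manual` is introduced here for illustration only.

```python
import math

import torch

distribution = torch.distributions.Normal(torch.tensor(0.0), torch.tensor(1.0))

x = torch.tensor(0.2122)  # the realization printed above
log_p = distribution.log_prob(x)

# Closed-form log-density of N(0, 1): -x^2 / 2 - log(2 * pi) / 2
log_p_manual = -0.5 * x**2 - 0.5 * math.log(2 * math.pi)

print(log_p, log_p_manual)
```

Both values should agree up to floating-point precision.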

A transform object represents a bijective transformation $f: X \mapsto Y$ from a domain to a co-domain. A transformation must implement a forward call $y = f(x)$, an inverse call $x = f^{-1}(y)$ and the `log_abs_det_jacobian` method to compute the log-absolute-determinant of the transformation's Jacobian $\log \left| \det \frac{\partial f(x)}{\partial x} \right|$.

In [3]:
transform = torch.distributions.AffineTransform(torch.tensor(2.0), torch.tensor(3.0))

y = transform(x)
x_ = transform.inv(y)
ladj = transform.log_abs_det_jacobian(x, y)

y, x_, ladj

(tensor(2.6367), tensor(0.2122), tensor(1.0986))
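
For a scalar transformation, the Jacobian reduces to the derivative $\frac{\partial y}{\partial x}$, so the result of `log_abs_det_jacobian` can be cross-checked with automatic differentiation. The following sketch assumes the same affine transformation as above; `dy_dx` and `ladj_autograd` are illustrative names.

```python
import torch

transform = torch.distributions.AffineTransform(torch.tensor(2.0), torch.tensor(3.0))

x = torch.tensor(0.2122, requires_grad=True)
y = transform(x)  # y = 2 + 3 * x

# For a scalar transformation, the Jacobian is the derivative dy/dx
(dy_dx,) = torch.autograd.grad(y, x)
ladj_autograd = dy_dx.abs().log()

ladj = transform.log_abs_det_jacobian(x, y)

print(ladj, ladj_autograd)
```

Here both quantities equal $\log 3$, the logarithm of the scale of the affine transformation.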

Combining a base distribution $p(Z)$ and a transformation $f: X \mapsto Z$ defines a new distribution $p(X)$. The likelihood is given by the change of variables formula

$$ p(X = x) = p(Z = f(x)) \left| \det \frac{\partial f(x)}{\partial x} \right| $$

and sampling from $p(X)$ can be performed by first drawing realizations $z \sim p(Z)$ and then applying the inverse transformation $x = f^{-1}(z)$. Such a combination of a base distribution and a bijective transformation is sometimes called a *normalizing flow*, as the base distribution is often Gaussian (normal).

In [4]:
flow = zuko.distributions.NormalizingFlow(transform, distribution)

x = flow.sample()
log_p = flow.log_prob(x)

x, log_p

(tensor(-0.6321), tensor(0.1743))
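
To make the change of variables formula concrete, the sampling and likelihood computations can be reproduced by hand with plain PyTorch objects. This is a sketch under the convention stated above ($f$ maps $x$ to $z$), not a description of how Zuko implements it internally; the `reference` distribution is derived here for verification only.

```python
import torch

base = torch.distributions.Normal(torch.tensor(0.0), torch.tensor(1.0))
transform = torch.distributions.AffineTransform(torch.tensor(2.0), torch.tensor(3.0))

# Sampling: draw z ~ p(Z), then apply the inverse transformation x = f^{-1}(z)
z = base.sample((4,))
x = transform.inv(z)

# Likelihood: log p(X = x) = log p(Z = f(x)) + log |det df(x)/dx|
fx = transform(x)
log_p = base.log_prob(fx) + transform.log_abs_det_jacobian(x, fx)

# Since z = 2 + 3 * x is standard normal, x = (z - 2) / 3 follows N(-2/3, (1/3)^2)
reference = torch.distributions.Normal(torch.tensor(-2 / 3), torch.tensor(1 / 3))

print(log_p, reference.log_prob(x))
```

The two log-likelihoods should match up to floating-point precision, which confirms that the flow above is simply a Gaussian with mean $-2/3$ and standard deviation $1/3$.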

## Lazy parametrization

When designing the distributions module, the PyTorch team decided that distributions and transformations should be lightweight objects that are used as part of computations but destroyed afterwards. Consequently, the [`Distribution`](torch.distributions.distribution.Distribution) and [`Transform`](torch.distributions.transforms.Transform) classes are not sub-classes of [`torch.nn.Module`](torch.nn.Module), which means that we cannot retrieve their parameters with `.parameters()`, send their internal tensors to GPU with `.to('cuda')` or train them as regular neural networks. In addition, the concepts of conditional distribution and transformation, which are essential for probabilistic inference, are impossible to express with the current interface.
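
The limitation can be observed directly. In the sketch below, the trainable tensors `loc` and `log_scale` are hypothetical names introduced for illustration; because the distribution is not a module, they have to be tracked by hand and passed to the optimizer explicitly.

```python
import torch

# Hypothetical trainable parameters, stored outside of the distribution
loc = torch.nn.Parameter(torch.tensor(0.0))
log_scale = torch.nn.Parameter(torch.tensor(0.0))

distribution = torch.distributions.Normal(loc, log_scale.exp())

# A distribution object is not a module and has no `.parameters()` method
is_module = isinstance(distribution, torch.nn.Module)
has_parameters = hasattr(distribution, "parameters")

print(is_module, has_parameters)

# The parameters must be handed to the optimizer manually
optimizer = torch.optim.SGD([loc, log_scale], lr=1e-1)

loss = -distribution.log_prob(torch.tensor(1.0))
loss.backward()
optimizer.step()

print(loc, log_scale)
```

This bookkeeping quickly becomes tedious when the parameters are the outputs of a neural network.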

To solve these problems, [`zuko`](zuko) defines two concepts: the [`LazyDistribution`](zuko.flows.core.LazyDistribution) and the [`LazyTransform`](zuko.flows.core.LazyTransform), which are modules whose forward pass returns a distribution or transformation, respectively. These components hold the parameters of the distributions/transformations as well as the recipe to build them, such that the actual distribution/transformation objects are lazily built and destroyed when necessary. Importantly, because the creation of the distribution/transformation object is delayed, a condition, if there is one, can easily be taken into account. This design enables lazy distributions to act like distributions while retaining features inherent to modules, such as trainable parameters.

In the following cell, we define a simple Gaussian model of the form $\mathcal{N}(x | \mu_\phi(c), \sigma_\phi^2(c))$ where $c$ is a context vector and $\phi$ are the parameters of the model.

In [5]:
class GaussianModel(zuko.flows.LazyDistribution):
    def __init__(self, context: int):
        super().__init__()

        self.hyper = torch.nn.Sequential(
            torch.nn.Linear(context, 64),
            torch.nn.ReLU(),
            torch.nn.Linear(64, 2), # mu, log sigma
        )

    def forward(self, c: torch.Tensor):
        mu, log_sigma = self.hyper(c).unbind(dim=-1)
        
        return torch.distributions.Normal(mu, log_sigma.exp())

model = GaussianModel(8)
model

GaussianModel(
  (hyper): Sequential(
    (0): Linear(in_features=8, out_features=64, bias=True)
    (1): ReLU()
    (2): Linear(in_features=64, out_features=2, bias=True)
  )
)

Calling the forward method of the model with a context $c$ returns a distribution object, which we can use to draw realizations or evaluate the likelihood of realizations.

In [6]:
distribution = model(c=torch.randn(8))
distribution

Normal(loc: -0.05041343718767166, scale: 1.0097345113754272)

In [7]:
distribution.sample()

tensor(0.7096)

In [8]:
distribution.log_prob(torch.randn(3))

tensor([-2.2154, -1.0755, -1.0740], grad_fn=<SubBackward0>)

The result of `log_prob` is part of a computation graph (it has a `grad_fn`) and can be used to train the parameters of the model. Conversely, the result of `sample` is not part of a computation graph. If you need to train the parameters through sampling, you should use the `rsample` method instead, which is not implemented by all distributions.

In [9]:
distribution.rsample()

tensor(-0.6246, grad_fn=<AddBackward0>)
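
The difference between `sample` and `rsample` can be checked on a plain PyTorch distribution. The sketch below uses a Gaussian with a trainable mean `mu` (an illustrative name), for which the reparametrized sample is $x = \mu + \sigma \epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$.

```python
import torch

mu = torch.tensor(0.5, requires_grad=True)
distribution = torch.distributions.Normal(mu, torch.tensor(1.0))

x = distribution.sample()  # detached from the computation graph
x_r = distribution.rsample()  # mu + sigma * eps, differentiable w.r.t. mu

print(x.requires_grad, x_r.requires_grad)

# For a Gaussian, d(x_r)/d(mu) = 1 since x_r = mu + sigma * eps
(grad_mu,) = torch.autograd.grad(x_r, mu)

# The `has_rsample` attribute tells whether reparametrized sampling is supported
categorical = torch.distributions.Categorical(logits=torch.zeros(3))

print(distribution.has_rsample, categorical.has_rsample)
```

Discrete distributions such as `Categorical` do not support reparametrized sampling, which is why `rsample` is not available everywhere.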

Importantly, if you modify the parameters of the model, for example with gradient descent steps, you must always remember to call the forward method again to re-build the distribution with the new parameters.

In [10]:
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for _ in range(8):
    c = torch.randn(8)
    x = torch.normal(torch.sum(c), torch.prod(torch.abs(c)))

    loss = -model(c).log_prob(x)  # -log p(x | c)
    loss.backward()

    optimizer.step()
    optimizer.zero_grad()
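
The following sketch illustrates why the distribution has to be re-built after an update. `TinyModel` is a hypothetical stand-in for a lazy distribution, written with plain PyTorch so that the values are easy to follow; a distribution object created before the optimization step keeps the scale that was computed at that time.

```python
import torch


class TinyModel(torch.nn.Module):  # hypothetical stand-in for a lazy distribution
    def __init__(self):
        super().__init__()
        self.log_sigma = torch.nn.Parameter(torch.tensor(0.0))

    def forward(self):
        return torch.distributions.Normal(torch.tensor(0.0), self.log_sigma.exp())


model = TinyModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

stale = model()  # built with log_sigma = 0, hence scale = 1

loss = -stale.log_prob(torch.tensor(2.0))
loss.backward()

optimizer.step()  # updates log_sigma in place
optimizer.zero_grad()

fresh = model()  # re-built with the updated log_sigma

print(stale.scale, fresh.scale)
```

The `stale` object still reports a scale of 1, whereas `fresh` reflects the updated parameter.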

## Normalizing flows

In the same spirit, a normalizing flow in Zuko is a special `LazyDistribution` that contains a `LazyTransform` and a base `LazyDistribution`. To increase expressivity, the transformation is usually the composition of a sequence of "simple" transformations. For example, a masked autoregressive flow (MAF) is usually the composition of a few masked autoregressive transformations, paired with a standard Gaussian base distribution.

In [11]:
flow = zuko.flows.MAF(features=5, context=8, transforms=3, hidden_features=(64, 128, 256))
flow

MAF(
  (transform): LazyComposedTransform(
    (0): MaskedAutoregressiveTransform(
      (base): MonotonicAffineTransform()
      (order): [0, 1, 2, 3, 4]
      (hyper): MaskedMLP(
        (0): MaskedLinear(in_features=13, out_features=64, bias=True)
        (1): ReLU()
        (2): MaskedLinear(in_features=64, out_features=128, bias=True)
        (3): ReLU()
        (4): MaskedLinear(in_features=128, out_features=256, bias=True)
        (5): ReLU()
        (6): MaskedLinear(in_features=256, out_features=10, bias=True)
      )
    )
    (1): MaskedAutoregressiveTransform(
      (base): MonotonicAffineTransform()
      (order): [4, 3, 2, 1, 0]
      (hyper): MaskedMLP(
        (0): MaskedLinear(in_features=13, out_features=64, bias=True)
        (1): ReLU()
        (2): MaskedLinear(in_features=64, out_features=128, bias=True)
        (3): ReLU()
        (4): MaskedLinear(in_features=128, out_features=256, bias=True)
        (5): ReLU()
        (6): MaskedLinear(in_features=256, out_fea

In the previous cell, we instantiated a conditional flow (5 sample features and 8 context features) with 3 affine autoregressive transformations, each parameterized by a masked MLP with an increasing number of hidden neurons. Zuko provides many pre-built flow architectures including [`NICE`](zuko.flows.coupling.NICE), [`MAF`](zuko.flows.autoregressive.MAF), [`NSF`](zuko.flows.spline.NSF), [`CNF`](zuko.flows.continuous.CNF) and many others. We recommend trying `MAF` and `NSF` first, as they are efficient baselines.

### Custom architecture

Alternatively, a flow can be built as a custom [`Flow`](zuko.flows.core.Flow) object given a sequence of lazy transformations and a base lazy distribution. The following is a condensed example of many things that are possible in Zuko. But remember, with great power comes great responsibility (and great bugs).

In [12]:
from zuko.flows import (
    Flow,
    GeneralCouplingTransform,
    MaskedAutoregressiveTransform,
    NeuralAutoregressiveTransform,
    Unconditional,
)
from zuko.distributions import BoxUniform
from zuko.transforms import (
    AffineTransform,
    MonotonicRQSTransform,
    RotationTransform,
    SigmoidTransform,
)

flow = Flow(
    transform=[
        # Preprocessing
        Unconditional(  # [0, 255] -> ]0, 1[
            AffineTransform,  # y = loc + scale * x
            torch.tensor(1 / 512),  # loc
            torch.tensor(1 / 256),  # scale
            buffer=True,  # not trainable
        ),
        Unconditional(lambda: SigmoidTransform().inv),  # y = logit(x)
        # Transformations
        MaskedAutoregressiveTransform(  # autoregressive transform (affine by default)
            features=5,
            context=8,
            passes=5,  # fully-autoregressive
            hidden_features=(64, 64),
        ),
        Unconditional(RotationTransform, torch.randn(5, 5)),  # trainable rotation
        GeneralCouplingTransform(  # coupling transform
            features=5,
            context=8,
            univariate=MonotonicRQSTransform,  # rational-quadratic spline
            shapes=([8], [8], [7]),  # shapes of the spline parameters (8 bins)
            hidden_features=(128, 256, 512),
            activation=torch.nn.ELU,  # ELU activation in hyper-network
        ).inv,  # inverse
        Unconditional(  # ignore context
            NeuralAutoregressiveTransform(  # neural autoregressive transform
                features=5,
                order=[4, 2, 0, 3, 1],  # autoregressive order (permutation of the 5 features)
                passes=2,  # 2-pass autoregressive (equivalent to coupling)
            )
        ),
    ],
    base=Unconditional(  # ignore context
        BoxUniform,
        torch.full((5,), -3.0),
        torch.full((5,), +3.0),
        buffer=True,  # not trainable
    ),
)

flow

Flow(
  (transform): LazyComposedTransform(
    (0): Unconditional(AffineTransform())
    (1): Unconditional(Inverse(SigmoidTransform()))
    (2): MaskedAutoregressiveTransform(
      (base): MonotonicAffineTransform()
      (order): [0, 1, 2, 3, 4]
      (hyper): MaskedMLP(
        (0): MaskedLinear(in_features=13, out_features=64, bias=True)
        (1): ReLU()
        (2): MaskedLinear(in_features=64, out_features=64, bias=True)
        (3): ReLU()
        (4): MaskedLinear(in_features=64, out_features=10, bias=True)
      )
    )
    (3): Unconditional(RotationTransform())
    (4): LazyInverse(
      (transform): GeneralCouplingTransform(
        (base): MonotonicRQSTransform(bins=8)
        (mask): [0, 1, 0, 1, 0]
        (hyper): MLP(
          (0): Linear(in_features=10, out_features=128, bias=True)
          (1): ELU(alpha=1.0)
          (2): Linear(in_features=128, out_features=256, bias=True)
          (3): ELU(alpha=1.0)
          (4): Linear(in_features=256, out_features=51