# Building a Basic Masked Autoregressive Flow (MAF) in PyTorch

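Start by importing PyTorch and the modules used throughout: `torch.nn` for the layer classes and `torch.nn.functional` for the functional `linear` and `relu` calls.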

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F

## Define a Masked Linear Layer
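In an autoregressive model, the output for variable i may depend only on the inputs that come before it in the chosen ordering, never on "future" inputs. A custom MaskedLinear layer helps enforce this by multiplying its weight matrix element-wise by a binary mask, which removes the forbidden connections.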

In [3]:
class MaskedLinear(nn.Linear):
    def __init__(self, in_features, out_features, bias=True):
        super(MaskedLinear, self).__init__(in_features, out_features, bias)
        # Initialize the mask as ones. We'll set it later.
        self.register_buffer('mask', torch.ones(out_features, in_features))
        
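    def set_mask(self, mask):
        # Copy the provided mask into our buffer. Check the shape first,
        # because copy_ would otherwise silently broadcast a smaller mask.
        if mask.shape != self.mask.shape:
            raise ValueError(
                f"expected mask of shape {tuple(self.mask.shape)}, got {tuple(mask.shape)}"
            )
        self.mask.copy_(mask)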
        
    def forward(self, x):
        # Apply the mask to the weights before performing the linear transformation.
        return F.linear(x, self.weight * self.mask, self.bias)


**Explanation:**

- MaskedLinear extends nn.Linear.
- It holds a mask (binary matrix) that is applied element-wise to the weights before the linear transformation.
- This mask will be set later to enforce that each output only depends on a subset of inputs.
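
As a quick check of the points above, the following sketch applies a strictly lower-triangular mask and inspects the Jacobian of the layer. The layer is repeated in compact form so the snippet runs on its own; the 3x3 size and the triangular mask are choices made here purely for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedLinear(nn.Linear):
    # Compact copy of the layer above so that this snippet is self-contained.
    def __init__(self, in_features, out_features, bias=True):
        super().__init__(in_features, out_features, bias)
        self.register_buffer('mask', torch.ones(out_features, in_features))

    def set_mask(self, mask):
        self.mask.copy_(mask)

    def forward(self, x):
        return F.linear(x, self.weight * self.mask, self.bias)


torch.manual_seed(0)
layer = MaskedLinear(3, 3)

# Illustrative mask: output i may only see inputs 0..i-1 (strictly lower triangular).
layer.set_mask(torch.tril(torch.ones(3, 3), diagonal=-1))

x = torch.randn(3)
# jacobian[i, j] = d output_i / d input_j; for a linear layer this is weight * mask.
jacobian = torch.autograd.functional.jacobian(layer, x)
print(jacobian)
```

Every entry on or above the diagonal of the printed matrix is zero, which confirms that the masked connections carry no signal. Because the mask is registered as a buffer, it follows the module across devices and is saved with it, but it is not a trainable parameter.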

## Build a MADE Block

A MADE (Masked Autoencoder for Distribution Estimation) block uses masked layers to model an autoregressive factorization of the joint distribution. Each MADE outputs parameters (e.g., shift and log-scale) for each input variable.
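
Written out, the autoregressive factorization reads as follows. The symbols $D$ (number of variables), $\mu_i$ (shift) and $\alpha_i$ (log-scale) are notation chosen here for illustration, and the Gaussian form of each conditional is the usual choice for MAF-style models, assumed here rather than stated in this notebook.

$$
p(x) = \prod_{i=1}^{D} p(x_i \mid x_{<i}), \qquad
p(x_i \mid x_{<i}) = \mathcal{N}\!\left(x_i;\ \mu_i(x_{<i}),\ \exp\!\big(\alpha_i(x_{<i})\big)^{2}\right)
$$

The masks exist to guarantee that $\mu_i$ and $\alpha_i$ are functions of $x_{<i}$ only.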

In [4]:
class MADE(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        """
        Args:
            input_size: Number of input features (and output variables).
            hidden_size: Number of units in the hidden layer.
            output_size: Typically 2 * input_size (for shift and log-scale per variable).
        """
        super(MADE, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size

        # Define two masked linear layers: input -> hidden, then hidden -> output.
        self.fc1 = MaskedLinear(input_size, hidden_size)
        self.fc2 = MaskedLinear(hidden_size, output_size)

        # Create the masks to enforce the autoregressive property.
        self.create_masks()

    def create_masks(self):
        # Define degrees for input neurons: here we assume a natural ordering.
        input_degrees = torch.arange(1, self.input_size + 1)

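        # Assign random degrees to hidden neurons between 1 and input_size - 1 (inclusive).
        # The max(...) guard keeps randint's half-open range non-empty when input_size == 1.
        hidden_degrees = torch.randint(1, max(2, self.input_size), (self.hidden_size,))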

        # For the output, repeat the input degrees to match the output size.
        output_degrees = input_degrees.repeat(self.output_size // self.input_size)

        # Create the mask for the first layer:
        # Allow connection if hidden_degree >= input_degree.
        mask1 = (hidden_degrees.unsqueeze(1) >= input_degrees.unsqueeze(0)).float()
        self.fc1.set_mask(mask1)

        # Create the mask for the second layer:
        # Allow connection if output_degree > hidden_degree.
        mask2 = (output_degrees.unsqueeze(1) > hidden_degrees.unsqueeze(0)).float()
        self.fc2.set_mask(mask2)

    def forward(self, x):
        # Pass input through first masked layer with ReLU activation.
        h = F.relu(self.fc1(x))
        # The output layer provides parameters (e.g., mu and log_scale for each variable).
        out = self.fc2(h)
        return out


**Explanation:**

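- **Degrees:** We assign an integer “degree” to every input, hidden, and output neuron. Input i gets degree i (the natural ordering), hidden neurons get random degrees between 1 and input_size - 1, and the output degrees repeat the input degrees once for the shifts and once for the log-scales.
- **Masking:**
    - For fc1, a connection is allowed if the hidden neuron’s degree is greater than or equal to the input neuron’s degree.
    - For fc2, a connection is allowed if the output neuron’s degree is strictly greater than the hidden neuron’s degree.
    - Chained together, these two rules mean that an output with degree d can only be reached from inputs with degree less than d, which is the autoregressive property.
- **Output:** The MADE block produces a vector of length output_size. The first half holds the shifts and the second half the log-scales, one of each per input variable.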
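
The masking rules can be checked without building any layers. The sketch below rebuilds the two masks exactly as `create_masks` does and multiplies them to count the paths from each input to each output; the sizes `D = 4` and `H = 16` and the variable names are chosen here only for illustration.

```python
import torch

torch.manual_seed(0)
D, H = 4, 16  # illustrative sizes: 4 input variables, 16 hidden units

input_degrees = torch.arange(1, D + 1)
hidden_degrees = torch.randint(1, D, (H,))
output_degrees = input_degrees.repeat(2)  # once for shifts, once for log-scales

mask1 = (hidden_degrees.unsqueeze(1) >= input_degrees.unsqueeze(0)).float()   # shape (H, D)
mask2 = (output_degrees.unsqueeze(1) > hidden_degrees.unsqueeze(0)).float()   # shape (2D, H)

# connectivity[i, j] counts the hidden units that link input j to output i.
connectivity = mask2 @ mask1                                                  # shape (2D, D)
shift_paths, log_scale_paths = connectivity.chunk(2, dim=0)

print(shift_paths)
print(log_scale_paths)
```

Both printed matrices are strictly lower triangular: there is no path from input j to an output of degree i unless j < i. Note that the first output of each half has no incoming paths at all, so the shift and log-scale of the first variable come from the bias terms alone.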

## Build the MAF by Stacking MADE Blocks
A MAF (Masked Autoregressive Flow) stacks multiple MADE blocks. Each block transforms the input while accumulating the log-determinant of the Jacobian.
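
For a single block, the transformation and its log-determinant can be written as below. The symbols $\mu_i$, $\alpha_i$ (shift and log-scale), $K$ (number of blocks), $J_k$ (Jacobian of block $k$) and $p_Z$ (base density) are notation introduced here, and the last line is the standard change-of-variables formula rather than something this notebook computes.

$$
z_i = \big(x_i - \mu_i(x_{<i})\big)\exp\!\big(-\alpha_i(x_{<i})\big), \qquad
\log\left|\det \frac{\partial z}{\partial x}\right| = -\sum_{i=1}^{D} \alpha_i(x_{<i})
$$

$$
\log p(x) = \log p_Z(z) + \sum_{k=1}^{K} \log\left|\det J_k\right|
$$

The determinant is cheap because $z_i$ depends only on $x_{\le i}$, so the Jacobian is triangular and its determinant is the product of the diagonal entries $\exp(-\alpha_i)$.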

In [5]:
class MAF(nn.Module):
    def __init__(self, input_size, hidden_size, n_layers):
        """
        Args:
            input_size: Dimensionality of the data.
            hidden_size: Number of hidden units per MADE block.
            n_layers: Number of MADE blocks to stack.
        """
        super(MAF, self).__init__()
        self.n_layers = n_layers
        # Create a ModuleList of MADE blocks.
        self.layers = nn.ModuleList([
            MADE(input_size, hidden_size, input_size * 2)
            for _ in range(n_layers)
        ])

    def forward(self, x):
        # Initialize the log-determinant of the Jacobian.
        log_det = torch.zeros(x.size(0), device=x.device)
        # Pass the data sequentially through each MADE block.
        for layer in self.layers:
            # Each layer outputs a vector which we split into shift (mu) and log-scale (log_scale).
            out = layer(x)
            mu, log_scale = out.chunk(2, dim=1)
            # Apply the affine transformation:
            # x_new = (x - mu) * exp(-log_scale)
            x = (x - mu) * torch.exp(-log_scale)
            # Update the log-determinant. The Jacobian of this step is triangular
            # with diagonal exp(-log_scale), so the block contributes -sum(log_scale).
            log_det -= log_scale.sum(dim=1)
        return x, log_det


**Explanation:**

- **Stacking:** We create several MADE blocks and store them in a ModuleList so that their parameters are registered with the model.
- **Transformation:** Each block applies an affine transformation to x using the shift and log-scale that the MADE block outputs: x_new = (x - mu) * exp(-log_scale).
- **Jacobian:** The log-determinant of the Jacobian is accumulated across layers, with each block contributing -sum(log_scale). This quantity is required for calculating the likelihood during training.
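
To make the link to the likelihood concrete, here is a minimal sketch of how the two return values of `forward` could be combined into a log-density. The helper name `flow_log_prob` is hypothetical, and the standard normal base distribution is an assumption; the notebook itself does not fix a base distribution.

```python
import math
import torch


def flow_log_prob(z, log_det):
    """Hypothetical helper: log p(x) = log N(z; 0, I) + log|det J| per sample."""
    # Log-density of a standard normal, summed over the feature dimension.
    base_log_prob = -0.5 * (z ** 2 + math.log(2 * math.pi)).sum(dim=1)
    return base_log_prob + log_det


# Sanity check with the identity transformation (z = x, log_det = 0):
# the result must match the standard normal log-density of x.
torch.manual_seed(0)
x = torch.randn(4, 2)
log_px = flow_log_prob(x, torch.zeros(4))
reference = torch.distributions.Normal(0.0, 1.0).log_prob(x).sum(dim=1)
print(torch.allclose(log_px, reference))
```

With a model such as the `MAF` class above, a maximum-likelihood training loss would then be the negative mean of `flow_log_prob(z, log_det)` over a batch, where `z, log_det = maf(x)`.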


## Example Usage of the MAF
Finally, we create an instance of the MAF, feed in some dummy data, and inspect the output and log-determinant.

In [6]:
# Define the dimensionality of our data.
input_size = 2      # For example, 2-dimensional data
hidden_size = 10    # Number of hidden units in each MADE block
n_layers = 3        # Number of MADE blocks to stack

# Instantiate the MAF model.
maf = MAF(input_size, hidden_size, n_layers)

# Create a batch of random data (e.g., 5 samples).
x = torch.randn(5, input_size)

# Pass the data through the MAF.
z, log_det = maf(x)

print("Input x:")
print(x)
print("\nTransformed z (latent representation):")
print(z)
print("\nLog-determinant of the Jacobian:")
print(log_det)


Input x:
tensor([[-0.5727, -2.4861],
        [-0.7693, -0.7929],
        [ 0.0873, -0.2468],
        [-2.0814, -0.5374],
        [-0.8872, -0.1615]])

Transformed z (latent representation):
tensor([[-0.4669, -2.3552],
        [-0.5883, -0.8918],
        [-0.0591, -0.4736],
        [-1.3990, -0.3023],
        [-0.6612, -0.3560]], grad_fn=<MulBackward0>)

Log-determinant of the Jacobian:
tensor([-0.6593, -0.7061, -0.6280, -0.9459, -0.7251], grad_fn=<SubBackward0>)


This notebook demonstrated a basic implementation of a Masked Autoregressive Flow (MAF): custom masked layers, a MADE block, and a stack of such blocks that forms the flow. The forward pass implemented here maps data x to a latent z and returns the log-determinant needed for density evaluation. Run in the opposite direction, the same structure transforms a simple base distribution into a complex target distribution, an essential step in simulation-based inference.
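
Going from the base distribution to data space requires inverting each affine step, which the classes above do not implement. The sketch below shows one way the inverse of a single step could look. It uses a hand-written `toy_conditioner` as a hypothetical stand-in for a MADE block (its output i depends only on earlier inputs), and `forward_step` and `inverse_step` are illustrative names, not part of the notebook.

```python
import torch


def toy_conditioner(x):
    # Hypothetical stand-in for a MADE block: column i of `earlier` is the
    # sum of x[:, :i], so output i depends only on earlier variables.
    earlier = torch.cumsum(x, dim=1) - x
    mu = 0.5 * earlier
    log_scale = 0.1 * torch.tanh(earlier)
    return torch.cat([mu, log_scale], dim=1)


def forward_step(x, conditioner):
    # Same affine map as in MAF.forward: z = (x - mu) * exp(-log_scale).
    mu, log_scale = conditioner(x).chunk(2, dim=1)
    return (x - mu) * torch.exp(-log_scale)


def inverse_step(z, conditioner):
    # Recover x one variable at a time: x_i needs mu_i and log_scale_i,
    # which depend only on the already recovered x_0 .. x_{i-1}.
    x = torch.zeros_like(z)
    for i in range(z.size(1)):
        mu, log_scale = conditioner(x).chunk(2, dim=1)
        x[:, i] = z[:, i] * torch.exp(log_scale[:, i]) + mu[:, i]
    return x


torch.manual_seed(0)
x = torch.randn(5, 3)
z = forward_step(x, toy_conditioner)
x_rec = inverse_step(z, toy_conditioner)
print(torch.allclose(x, x_rec, atol=1e-5))
```

The loop makes the cost asymmetry visible: density evaluation needs one pass through the conditioner, while inversion needs one pass per input dimension.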

Feel free to modify the architecture or experiment with different parameters to further your understanding!