# Normalization Flows

Think of normalizing flows as a way to transform a simple probability distribution (like a Gaussian) into a more complex one that matches your data. It's similar to how a skilled sculptor can take a plain block of clay and shape it into an intricate statue through a series of careful transformations.

The key insight is that we can create complex distributions by applying a sequence of invertible transformations to a simple base distribution. Let's break this down:

1) The Base Distribution:
We start with a simple distribution that's easy to sample from and evaluate, typically a standard normal distribution. Let's call this z ∼ p(z).

2) The Transformation Process:
We apply a series of invertible functions (f₁, f₂, ..., fₖ) to transform samples from this simple distribution. Each transformation must be:
- Invertible (bijective): Every point in the input space maps to exactly one point in the output space, and vice versa
- Differentiable: We need to be able to compute derivatives for training

3) The Change of Variables Formula:
This is where the "normalizing" part comes in. When we transform a probability distribution, we need to account for how the transformation stretches or compresses space. The formula for this is:

$$p_x(x) = p_z(z) \left|\det\frac{\partial f^{-1}(x)}{\partial x}\right|$$

where x is our transformed variable and z = f⁻¹(x) is the inverse transformation.

Here's a concrete example:
Imagine you have a dataset of house prices that follows a complex multimodal distribution. You could:
1. Start with samples from a normal distribution
2. Apply a series of transformations that gradually "mold" this distribution into the shape of your house price distribution
3. Train the parameters of these transformations by maximizing the likelihood of your data

The power of normalizing flows comes from their ability to:
- Generate new samples by transforming samples from the base distribution
- Compute exact probabilities (unlike VAEs or GANs)
- Learn complex distributions through composition of simple transformations

Let me explain the key components of the following implementation for normalization flow:

1. Base Distribution and Transformation:
We start with a standard normal distribution as our base distribution. The `PlanarFlow` class implements a single transformation layer that modifies the input using the formula:

$$f(z) = z + u \cdot \tanh(w^T z + b)$$

where w, u, and b are learnable parameters. This transformation is invertible under certain conditions and can express complex distributions when composed multiple times.

2. Model Architecture:
The code includes three main classes:
- `PlanarFlow`: Implements a single planar transformation
- `NormalizingFlow`: Composes multiple planar transformations
- `HousePriceFlow`: Specializes the flow for house price modeling

3. Change of Variables Formula:
The log probability computation in `HousePriceFlow` implements the change of variables formula:

$$\log p_X(x) = \log p_Z(z) - \log \left|\det\frac{\partial f}{\partial z}\right|$$

where:
- p_Z is the base distribution (standard normal)
- z is the transformed variable
- The determinant term accounts for the volume change in the transformation

4. Training Process:
The model is trained by maximizing the log likelihood of the observed data. For house prices, we:
- Take the log of prices to make the distribution more manageable
- Normalize the log prices to have zero mean and unit variance
- Train the flow to learn this normalized distribution

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

class PlanarFlow(nn.Module):
    """
    Implements a single layer of Planar Flow transformation.
    This is one of the simplest normalizing flow transformations.
    """
    def __init__(self, dim):
        super().__init__()
        # Initialize transformation parameters
        self.weight = nn.Parameter(torch.randn(1, dim))      # w vector
        self.bias = nn.Parameter(torch.randn(1))             # b scalar
        self.scale = nn.Parameter(torch.randn(1, dim))       # u vector
        
    def forward(self, z):
        # Compute activation: h(z) = tanh(w·z + b)
        activation = torch.tanh(torch.mm(z, self.weight.T) + self.bias)
        
        # Transform the input: f(z) = z + u·h(z)
        z_next = z + self.scale * activation
        
        # Compute log determinant of Jacobian
        phi = (1 - activation**2) * self.weight  # d/dz of tanh
        log_det = torch.log(torch.abs(1 + torch.mm(phi, self.scale.T)))
        
        return z_next, log_det

class NormalizingFlow(nn.Module):
    """
    Complete normalizing flow model composed of multiple planar transformations.
    """
    def __init__(self, dim, n_flows):
        super().__init__()
        # Create a sequence of planar flows
        self.flows = nn.ModuleList([PlanarFlow(dim) for _ in range(n_flows)])
        
    def forward(self, z):
        # Keep track of total log determinant
        total_log_det = torch.zeros(z.size(0), 1).to(z.device)
        
        # Apply sequence of transformations
        for flow in self.flows:
            z, log_det = flow(z)
            total_log_det += log_det
            
        return z, total_log_det

class HousePriceFlow(nn.Module):
    """
    Normalizing flow model specifically for house prices.
    Uses a standard normal as base distribution.
    """
    def __init__(self, input_dim=1, n_flows=4):
        super().__init__()
        self.flow = NormalizingFlow(input_dim, n_flows)
        
    def forward(self, x):
        # Transform data to base distribution (inverse flow)
        z, log_det = self.flow(x)
        
        # Compute log probability under base distribution (standard normal)
        log_prob_base = -0.5 * (z**2 + np.log(2 * np.pi))
        
        # Compute total log probability using change of variables formula
        log_prob = log_prob_base - log_det
        
        return -torch.mean(log_prob)  # Return negative log likelihood for minimization

# Training function
def train_flow_model(model, data_loader, n_epochs=100, lr=1e-3):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    
    for epoch in range(n_epochs):
        total_loss = 0
        for batch in data_loader:
            # Extract the tensor from the batch (batch is a tuple from DataLoader)
            x = batch[0]  # Get the first element of the batch tuple
            
            optimizer.zero_grad()
            
            # Forward pass
            loss = model(x)  # Now we pass the tensor, not the batch tuple
            
            # Backward pass
            loss.backward()
            optimizer.step()
            
            total_loss += loss.item()
            
        if (epoch + 1) % 10 == 0:
            print(f'Epoch [{epoch+1}/{n_epochs}], Loss: {total_loss/len(data_loader):.4f}')

# Example usage
if __name__ == "__main__":
    # Generate synthetic house price data (log-normal distribution)
    np.random.seed(42)
    n_samples = 1000
    log_prices = np.random.normal(12, 0.5, n_samples)  # mu=12 corresponds to ~$160k median price
    prices = np.exp(log_prices)
    
    # Normalize data
    prices_normalized = (np.log(prices) - np.mean(log_prices)) / np.std(log_prices)
    
    # Convert to PyTorch tensors
    prices_tensor = torch.FloatTensor(prices_normalized).reshape(-1, 1)
    dataset = torch.utils.data.TensorDataset(prices_tensor)
    dataloader = torch.utils.data.DataLoader(dataset, batch_size=64, shuffle=True)
    
    # Create and train model
    model = HousePriceFlow(input_dim=1, n_flows=4)
    
    # Debug information
    print("Data shape check:")
    for batch in dataloader:
        x = batch[0]
        print(f"Batch shape: {x.shape}")
        print(f"Sample from batch:\n{x[:5]}")  # Print first 5 values
        break
    
    train_flow_model(model, dataloader)
    
    # Generate new samples
    with torch.no_grad():
        z = torch.randn(1000, 1)  # Sample from base distribution
        samples, _ = model.flow(z)
        
        # Transform back to original scale
        samples = samples.numpy()
        samples = np.exp(samples * np.std(log_prices) + np.mean(log_prices))
        
    print("\nGenerated house price statistics:")
    print(f"Mean: ${samples.mean():.2f}")
    print(f"Median: ${np.median(samples):.2f}")
    print(f"Std: ${samples.std():.2f}")

Data shape check:
Batch shape: torch.Size([64, 1])
Sample from batch:
tensor([[ 0.2014],
        [ 1.9023],
        [-1.3687],
        [-0.3077],
        [-1.4579]])
Epoch [10/100], Loss: -2.8643
Epoch [20/100], Loss: -5.0963
Epoch [30/100], Loss: -6.5152
Epoch [40/100], Loss: -4.9171
Epoch [50/100], Loss: -5.5873
Epoch [60/100], Loss: -6.4098
Epoch [70/100], Loss: -10.1757
Epoch [80/100], Loss: -11.7389
Epoch [90/100], Loss: -12.1170
Epoch [100/100], Loss: -11.7268

Generated house price statistics:
Mean: $134787.69
Median: $134753.33
Std: $536.49


In [3]:
import torch
import torch.nn as nn

class InvertibleLayer(nn.Module):
    """A simple invertible layer for demonstration"""
    def __init__(self, dim):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(dim, dim))
        
    def forward(self, x):
        return x + torch.tanh(x @ self.weight)
        
    def inverse(self, y):
        # In practice, you might need to use fixed-point iteration
        # This is a simplified inverse for demonstration
        x = y
        for _ in range(5):  # Fixed-point iteration
            x = y - torch.tanh(x @ self.weight)
        return x

class MemoryEfficientNetwork(nn.Module):
    def __init__(self, dim, num_layers):
        super().__init__()
        self.layers = nn.ModuleList([InvertibleLayer(dim) for _ in range(num_layers)])
        
    def forward(self, x):
        # Standard forward pass - store all intermediates
        intermediates = [x]
        current = x
        for layer in self.layers:
            current = layer(current)
            intermediates.append(current)
        return current, intermediates
    
    def forward_memory_efficient(self, x):
        # Memory efficient forward pass - only store final output
        current = x
        for layer in self.layers:
            current = layer(current)
        return current
        
    def reconstruct_intermediate(self, output, layer_index):
        # Reconstruct a specific intermediate activation
        current = output
        for layer in reversed(self.layers[layer_index:]):
            current = layer.inverse(current)
        return current

# Example usage
dim = 10
num_layers = 5
model = MemoryEfficientNetwork(dim, num_layers)
x = torch.randn(1, dim)

# Standard forward pass (storing all intermediates)
output_standard, intermediates = model(x)
print(f"Memory used with intermediates: {len(intermediates)} tensors")

# Memory efficient forward pass
output_efficient = model.forward_memory_efficient(x)
print(f"Memory used without intermediates: 1 tensor")

# Reconstruct middle layer activation
reconstructed = model.reconstruct_intermediate(output_efficient, layer_index=2)
print("\nReconstruction error:")
print(f"Max difference: {(intermediates[2] - reconstructed).abs().max().item():.6f}")

Memory used with intermediates: 6 tensors
Memory used without intermediates: 1 tensor

Reconstruction error:
Max difference: 0.798676
