# First-Order Data Generator Tutorial

##  Learning Objectives

In this tutorial you will learn:
1. What First-Order data is and why it is important
2. How to use the `FirstOrderDataGenerator`
3. How to save and load distributions
4. How to work with `FirstOrderDataset` and DataLoader
5. How to train a model with soft targets

**Prerequisites:** Basic knowledge of PyTorch


##  Part 1: Introduction and Setup

### What is First-Order Data?

In supervised machine learning we model the data as samples from a joint distribution $p(X, Y)$, where:
- $X$ = Input features
- $Y$ = Target labels

The **conditional distribution** $p(Y|X)$ tells us: "Given a specific $x$, how likely are the different classes?"

**Problem:** Normally we don't have access to $p(Y|X)$!

**Solution:** We approximate this with a well-trained model $\hat{h}$:
$$\hat{h}(x) \approx p(\cdot | x)$$

We call these approximations **First-Order Data**.
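
As a minimal sketch of this idea, assuming a classifier that outputs logits, a softmax turns each row of logits into a probability vector that can stand in for $p(\cdot | x)$. The logit values below are made up for illustration.

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for two inputs and three classes (values made up for illustration).
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.0, 0.0, 0.0]])

# Softmax turns each row of logits into a probability vector over the classes.
probs = F.softmax(logits, dim=-1)

print(probs)
print(probs.sum(dim=-1))  # each row sums to 1 (up to floating-point error)
```

Equal logits, as in the second row, give a uniform distribution over the classes.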

### Installation and Imports

In [None]:
# Imports
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import torch
from torch import nn
import torch.nn.functional as F
from torch.utils.data import Dataset

# Import First-Order Generator
# NOTE: Adjust the import path to your project structure!
# PyTorch-specific import (recommended)
from probly.data_generator.torch_first_order_generator import (
    FirstOrderDataGenerator,
    FirstOrderDataset,
)

# Alternative: Factory Pattern (for multi-framework)
# from probly.data_generator.factory import create_data_generator  # noqa: ERA001
# generator = create_data_generator('pytorch', model, dataset)  # noqa: ERA001

print("All imports successful!")
print(f"PyTorch Version: {torch.__version__}")
print(f"Device: {'cuda' if torch.cuda.is_available() else 'cpu'}")

##  Part 2: Prepare Example Data

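We create a simple synthetic dataset and a "Teacher" model whose outputs we treat as ground truth. The labels are drawn at random and the Teacher is left untrained, so its accuracy against the labels will sit near chance (about 33% for three classes); for this tutorial only its output distributions matter.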

In [None]:
# Example dataset
class SimpleDataset(Dataset):
    """A simple dataset for demonstration purposes."""

    def __init__(self, n_samples: int = 200, input_dim: int = 10, n_classes: int = 3, seed: int = 42) -> None:
        """Initialize dataset."""
        torch.manual_seed(seed)
        self.n_samples = n_samples
        self.input_dim = input_dim
        self.n_classes = n_classes

        # Generate synthetic data
        self.data = torch.randn(n_samples, input_dim)
        self.labels = torch.randint(0, n_classes, (n_samples,))

    def __len__(self) -> int:
        """Return length."""
        return self.n_samples

    def __getitem__(self, idx: int) -> tuple:
        """Get item."""
        return self.data[idx], self.labels[idx]


# Create dataset
dataset = SimpleDataset(n_samples=200, input_dim=10, n_classes=3)
print(f"Dataset created: {len(dataset)} samples, {dataset.n_classes} classes")

# Look at example sample
sample_x, sample_y = dataset[0]
print("\nExample sample:")
print(f"  Input shape: {sample_x.shape}")
print(f"  Label: {sample_y}")

In [None]:
# Teacher model (represents the "ground truth")
class TeacherModel(nn.Module):
    """A simple neural network as Teacher model."""

    def __init__(self, input_dim: int = 10, n_classes: int = 3) -> None:
        """Initialize model."""
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(32, n_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Forward pass."""
        return self.network(x)  # Returns logits


# Create model
teacher_model = TeacherModel(input_dim=10, n_classes=3)
teacher_model.eval()  # Important: In evaluation mode!

print("Teacher model created")
print("\nModel architecture:")
print(teacher_model)

##  Part 3: Generate First-Order Distributions

Now we use the `FirstOrderDataGenerator` to calculate a probability distribution for each sample in the dataset.
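
To make the idea concrete, here is a rough sketch of what such a generation step could look like in plain PyTorch. It is not the library's implementation: the helper `sketch_generate_distributions`, the toy model and the toy data are hypothetical, and the return type (a dict from sample index to a list of probabilities) is an assumption.

```python
import torch
import torch.nn.functional as F
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def sketch_generate_distributions(model: nn.Module, dataset, batch_size: int = 32) -> dict[int, list[float]]:
    """Hypothetical helper: map each dataset index to a softmax distribution."""
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=False)  # keep index order
    model.eval()  # disable dropout so outputs are deterministic
    distributions: dict[int, list[float]] = {}
    index = 0
    with torch.no_grad():  # inference only, no gradients needed
        for inputs, _labels in loader:
            probs = F.softmax(model(inputs), dim=-1)
            for row in probs:
                distributions[index] = row.tolist()
                index += 1
    return distributions


# Tiny stand-in model and data (made up for illustration).
toy_data = TensorDataset(torch.randn(8, 4), torch.randint(0, 3, (8,)))
toy_model = nn.Linear(4, 3)
toy_distributions = sketch_generate_distributions(toy_model, toy_data, batch_size=4)
print(len(toy_distributions), toy_distributions[0])
```

Iterating with `shuffle=False` is what keeps each distribution aligned with its dataset index.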

In [None]:
# Initialize generator
generator = FirstOrderDataGenerator(
    model=teacher_model,
    device="cuda" if torch.cuda.is_available() else "cpu",
    batch_size=32,
    output_mode="logits",  # Our model outputs logits
    model_name="teacher_v1",
)

print("FirstOrderDataGenerator initialized")
print(f"  Device: {generator.device}")
print(f"  Batch Size: {generator.batch_size}")
print(f"  Output Mode: {generator.output_mode}")

In [None]:
# Generate distributions
print("Generating First-Order distributions...\n")

distributions = generator.generate_distributions(
    dataset,
    progress=True,  # Shows progress
)

print(f"\n{len(distributions)} distributions generated!")

In [None]:
# Look at example distributions
print("Example distributions:\n")

for i in range(5):
    dist = distributions[i]
    print(f"Sample {i}:")
    print(f"  Distribution: [{dist[0]:.4f}, {dist[1]:.4f}, {dist[2]:.4f}]")
    print(f"  Sum: {sum(dist):.6f} (should be ≈ 1.0)")
    print(f"  Most likely class: {np.argmax(dist)}")
    print()

In [None]:
# Visualization: Distributions for first 10 samples
fig, axes = plt.subplots(2, 5, figsize=(15, 6))
axes = axes.flatten()

for i in range(10):
    ax = axes[i]
    dist = distributions[i]

    ax.bar(range(len(dist)), dist, color=["blue", "orange", "green"])
    ax.set_title(f"Sample {i}")
    ax.set_xlabel("Class")
    ax.set_ylabel("Probability")
    ax.set_ylim([0, 1])
    ax.grid(axis="y", alpha=0.3)

plt.tight_layout()
plt.suptitle("First-Order Distributions for the First 10 Samples", y=1.02, fontsize=14)
plt.show()

print("Visualization: Each bar shows the probability for a class")

##  Part 4: Save and Load Distributions

We can save the generated distributions as a JSON file to reuse them later.
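
The following sketch shows a JSON round trip with the standard library only. The file layout (a `meta` object next to a `distributions` object) is an assumption for illustration and may differ from what `save_distributions` actually writes. The point to note is that JSON object keys are always strings, so integer sample indices have to be converted back after loading.

```python
import json
import tempfile
from pathlib import Path

# Made-up distributions keyed by sample index.
example_distributions = {0: [0.7, 0.2, 0.1], 1: [0.1, 0.1, 0.8]}
example_meta = {"model_name": "teacher_v1", "n_classes": 3}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "distributions.json"

    # Assumed layout: one object holding metadata and distributions.
    payload = {"meta": example_meta, "distributions": example_distributions}
    path.write_text(json.dumps(payload), encoding="utf-8")

    raw = json.loads(path.read_text(encoding="utf-8"))

# JSON object keys are always strings, so convert them back to int.
restored = {int(k): v for k, v in raw["distributions"].items()}
print(restored == example_distributions)
```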

In [None]:
# Create directory
output_dir = Path("tutorial_output")
output_dir.mkdir(exist_ok=True)

# Define path
save_path = output_dir / "first_order_distributions.json"

# Define metadata
metadata = {
    "dataset": "SimpleDataset",
    "n_samples": len(dataset),
    "n_classes": dataset.n_classes,
    "input_dim": dataset.input_dim,
    "note": "Generated for tutorial purposes",
    "teacher_architecture": "Simple 3-layer network",
}

# Save
print(f"Saving distributions to: {save_path}")
generator.save_distributions(
    path=save_path,
    distributions=distributions,
    meta=metadata,
)
print("Successfully saved!")

# Show file size
file_size = save_path.stat().st_size / 1024  # in KB
print(f"\nFile size: {file_size:.2f} KB")

In [None]:
# Load distributions
print("Loading distributions...\n")

loaded_distributions, loaded_metadata = generator.load_distributions(save_path)

print("Successfully loaded!\n")
print("Metadata:")
for key, value in loaded_metadata.items():
    print(f"  - {key}: {value}")

# Verification
print("\n✓ Verification:")
print(f"  Number of distributions: {len(loaded_distributions)}")
print(f"  Data identical: {distributions == loaded_distributions}")

##  Part 5: Using FirstOrderDataset

`FirstOrderDataset` is a PyTorch Dataset wrapper that combines the original dataset with the First-Order distributions.
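
For intuition, a wrapper of this kind can be sketched in a few lines. The class `PairedDistributionDataset` below is hypothetical and only illustrates the pattern; the real `FirstOrderDataset` may validate and convert its inputs differently.

```python
import torch
from torch.utils.data import Dataset, TensorDataset


class PairedDistributionDataset(Dataset):
    """Hypothetical wrapper: returns (input, label, distribution) per index."""

    def __init__(self, base_dataset: Dataset, distributions: dict[int, list[float]]) -> None:
        if len(distributions) != len(base_dataset):
            msg = "Need exactly one distribution per sample."
            raise ValueError(msg)
        self.base_dataset = base_dataset
        self.distributions = distributions

    def __len__(self) -> int:
        return len(self.base_dataset)

    def __getitem__(self, idx: int):
        x, y = self.base_dataset[idx]
        # Convert the stored list of floats into a tensor for training.
        dist = torch.tensor(self.distributions[idx], dtype=torch.float32)
        return x, y, dist


# Made-up base data with two samples and three classes.
toy_base = TensorDataset(torch.randn(2, 4), torch.tensor([0, 2]))
toy_paired = PairedDistributionDataset(toy_base, {0: [0.7, 0.2, 0.1], 1: [0.1, 0.1, 0.8]})
x0, y0, d0 = toy_paired[0]
print(x0.shape, y0.item(), d0)
```

The length check guards against the index misalignment that the best practices later in this tutorial warn about.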

In [None]:
# Create FirstOrderDataset
fo_dataset = FirstOrderDataset(
    base_dataset=dataset,
    distributions=loaded_distributions,
)

print(f"FirstOrderDataset created with {len(fo_dataset)} samples\n")

# Get one sample
input_tensor, label, distribution = fo_dataset[0]

print("Sample 0:")
print(f"  Input Shape: {input_tensor.shape}")
print(f"  Label: {label}")
print(f"  Distribution Shape: {distribution.shape}")
print(f"  Distribution: [{distribution[0]:.4f}, {distribution[1]:.4f}, {distribution[2]:.4f}]")
print(f"  Sum: {distribution.sum():.6f}")

In [None]:
# Iterate through multiple samples
print("Iterating through multiple samples:\n")

for i in range(5):
    input_tensor, label, distribution = fo_dataset[i]
    predicted_class = torch.argmax(distribution).item()
    confidence = distribution[predicted_class].item()

    print(f"Sample {i}:")
    print(f"  Ground Truth Label: {label}")
    print(f"  Predicted Class: {predicted_class}")
    print(f"  Confidence: {confidence:.4f}")
    print(f"  Match: {'✓' if predicted_class == label else '✗'}")
    print()

##  Part 6: Create DataLoader

For training we need a DataLoader that provides batches with First-Order distributions.
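
As a small illustration of the batching itself, the sketch below uses a `TensorDataset` with made-up data as a stand-in for `FirstOrderDataset`. PyTorch's default collation stacks each element of the (input, label, distribution) triples into its own batch tensor.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Made-up data: 10 samples, 4 features, 3 classes.
inputs = torch.randn(10, 4)
labels = torch.randint(0, 3, (10,))
dists = torch.softmax(torch.randn(10, 3), dim=-1)

# TensorDataset yields (input, label, distribution) triples.
loader = DataLoader(TensorDataset(inputs, labels, dists), batch_size=4, shuffle=True)

for xb, yb, db in loader:
    print(xb.shape, yb.shape, db.shape)
```

With 10 samples and a batch size of 4, the last batch holds only 2 samples unless `drop_last=True` is set.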

In [None]:
# Create DataLoader with First-Order distributions.
# `fo_dataset` (from Part 5) already yields (input, label, distribution) triples,
# so a standard PyTorch DataLoader can batch it directly.
from torch.utils.data import DataLoader

fo_loader = DataLoader(
    fo_dataset,
    batch_size=32,
    shuffle=True,
    num_workers=0,  # For Windows compatibility
)

print("DataLoader created")
print("  Batch Size: 32")
print(f"  Number of batches: {len(fo_loader)}")
print("  Shuffle: True")

In [None]:
# Look at first batch
batch_inputs, batch_labels, batch_distributions = next(iter(fo_loader))

print("\nFirst batch:")
print(f"  Inputs Shape: {batch_inputs.shape}")
print(f"  Labels Shape: {batch_labels.shape}")
print(f"  Distributions Shape: {batch_distributions.shape}")
print("\n  First 3 distributions in batch:")
for i in range(3):
    dist = batch_distributions[i]
    print(f"    Sample {i}: [{dist[0]:.4f}, {dist[1]:.4f}, {dist[2]:.4f}]")

##  Part 7: Train Student Model with Soft Targets

Now we train a "Student" model that tries to learn the distributions of the Teacher model. This is called **Knowledge Distillation**.
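
The loss used for this kind of distillation is the KL divergence between the Teacher distribution $p$ and the Student distribution $q$. The sketch below, with made-up numbers, checks that `F.kl_div` with `reduction="batchmean"` matches the formula $\sum_k p_k (\log p_k - \log q_k)$ averaged over the batch. Note that `F.kl_div` expects log-probabilities as its first argument and probabilities as its second.

```python
import torch
import torch.nn.functional as F

# Made-up student logits and teacher distributions for two samples.
student_logits = torch.tensor([[1.0, 0.0, -1.0], [0.2, 0.1, 0.3]])
teacher_probs = torch.tensor([[0.7, 0.2, 0.1], [0.3, 0.3, 0.4]])

log_q = F.log_softmax(student_logits, dim=-1)  # student log-probabilities

# F.kl_div expects log-probabilities as input and probabilities as target.
loss_builtin = F.kl_div(log_q, teacher_probs, reduction="batchmean")

# The same quantity written out: sum_k p_k * (log p_k - log q_k), averaged over the batch.
loss_manual = (teacher_probs * (teacher_probs.log() - log_q)).sum(dim=-1).mean()

print(loss_builtin.item(), loss_manual.item())
```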

In [None]:
# Student model (smaller network)
class StudentModel(nn.Module):
    """A smaller model that learns from the Teacher."""

    def __init__(self, input_dim: int = 10, n_classes: int = 3) -> None:
        """Initialize model."""
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(input_dim, 32),  # Smaller than Teacher
            nn.ReLU(),
            nn.Linear(32, n_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Forward pass."""
        return self.network(x)


# Create Student model
student_model = StudentModel(input_dim=10, n_classes=3)
device = "cuda" if torch.cuda.is_available() else "cpu"
student_model = student_model.to(device)

print("Student model created")
print("\nComparison Teacher vs Student:")
print(f"  Teacher parameters: {sum(p.numel() for p in teacher_model.parameters())}")
print(f"  Student parameters: {sum(p.numel() for p in student_model.parameters())}")
ratio = sum(p.numel() for p in teacher_model.parameters()) / sum(p.numel() for p in student_model.parameters())
print(f"  Student is {ratio:.1f}x smaller!")

In [None]:
# Training setup
optimizer = torch.optim.Adam(student_model.parameters(), lr=0.001)
epochs = 10

# Lists for tracking
train_losses = []

print("Starting training...\n")

for epoch in range(epochs):
    student_model.train()
    epoch_loss = 0.0
    n_batches = 0

    for inputs, _labels, target_distributions in fo_loader:
        # Move data to device
        batch_inputs = inputs.to(device)
        batch_target_distributions = target_distributions.to(device)

        # Forward pass
        logits = student_model(batch_inputs)

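        # KL Divergence Loss
        # Student tries to imitate the Teacher distributions.
        # F.kl_div expects log-probabilities as input and probabilities as target.
        log_probs = F.log_softmax(logits, dim=-1)
        loss = F.kl_div(
            log_probs,
            batch_target_distributions,  # use the copy that lives on `device`
            reduction="batchmean",
        )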

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()
        n_batches += 1

    avg_loss = epoch_loss / n_batches
    train_losses.append(avg_loss)

    print(f"Epoch {epoch + 1}/{epochs} - Loss: {avg_loss:.4f}")

print("\nTraining completed!")

In [None]:
# Visualization: Training Loss
plt.figure(figsize=(10, 6))
plt.plot(range(1, epochs + 1), train_losses, marker="o", linewidth=2, markersize=8)
plt.xlabel("Epoch", fontsize=12)
plt.ylabel("KL Divergence Loss", fontsize=12)
plt.title("Training Loss over Epochs", fontsize=14, fontweight="bold")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"\nLoss reduction: {train_losses[0]:.4f} → {train_losses[-1]:.4f}")
print(f"   Improvement: {(1 - train_losses[-1] / train_losses[0]) * 100:.1f}%")

##  Part 8: Evaluation - Teacher vs Student

Let's compare the predictions of the Teacher model with those of the trained Student model.

In [None]:
# Evaluation mode; make sure both models live on the same device as the inputs
student_model.eval()
teacher_model.to(device).eval()

# Collect predictions
all_inputs = []
teacher_probs_list = []
student_probs_list = []
true_labels = []

with torch.no_grad():
    for i in range(len(dataset)):
        x, y = dataset[i]
        x = x.unsqueeze(0).to(device)  # Batch dimension

        # Teacher predictions
        teacher_logits = teacher_model(x)
        teacher_probs = F.softmax(teacher_logits, dim=-1)

        # Student predictions
        student_logits = student_model(x)
        student_probs = F.softmax(student_logits, dim=-1)

        all_inputs.append(x.cpu())
        teacher_probs_list.append(teacher_probs.cpu())
        student_probs_list.append(student_probs.cpu())
        true_labels.append(y)

# Convert to tensors
teacher_probs_all = torch.cat(teacher_probs_list, dim=0)
student_probs_all = torch.cat(student_probs_list, dim=0)
true_labels_all = torch.tensor(true_labels)

print("Evaluation completed")
print(f"\nEvaluated on {len(dataset)} samples")

In [None]:
# Calculate accuracy
teacher_predictions = torch.argmax(teacher_probs_all, dim=-1)
student_predictions = torch.argmax(student_probs_all, dim=-1)

teacher_accuracy = (teacher_predictions == true_labels_all).float().mean().item()
student_accuracy = (student_predictions == true_labels_all).float().mean().item()

print("Accuracy:")
print(f"  Teacher: {teacher_accuracy * 100:.2f}%")
print(f"  Student: {student_accuracy * 100:.2f}%")
print(f"\n  Difference: {abs(teacher_accuracy - student_accuracy) * 100:.2f} percentage points")

In [None]:
# Calculate KL Divergence between Teacher and Student.
# F.kl_div expects log-probabilities as input; the Student outputs are already
# probabilities, so take their log instead of applying log_softmax again.
kl_div = F.kl_div(
    torch.log(student_probs_all.clamp_min(1e-12)),
    teacher_probs_all,
    reduction="batchmean",
).item()

print(f"Average KL Divergence between Teacher and Student: {kl_div:.4f}")
print("\nInterpretation:")
print("   The closer this value is to 0, the better the Student")
print("   has learned the Teacher distributions.")

In [None]:
# Visualization: Teacher vs Student for some samples
n_samples_to_show = 6
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
axes = axes.flatten()

for i in range(n_samples_to_show):
    ax = axes[i]

    teacher_dist = teacher_probs_all[i].numpy()
    student_dist = student_probs_all[i].numpy()
    true_label = true_labels_all[i].item()

    x = np.arange(len(teacher_dist))
    width = 0.35

    bars1 = ax.bar(x - width / 2, teacher_dist, width, label="Teacher", alpha=0.8)
    bars2 = ax.bar(x + width / 2, student_dist, width, label="Student", alpha=0.8)

    # Mark true label
    ax.axvline(x=true_label, color="red", linestyle="--", linewidth=2, label="True Label")

    ax.set_title(f"Sample {i} (True Label: {true_label})", fontweight="bold")
    ax.set_xlabel("Class")
    ax.set_ylabel("Probability")
    ax.set_ylim([0, 1])
    ax.legend()
    ax.grid(axis="y", alpha=0.3)

plt.tight_layout()
plt.suptitle("Comparison: Teacher vs Student Predictions", y=1.02, fontsize=16, fontweight="bold")
plt.show()

print("\nThe bars show how similar the Student predictions are to the Teacher predictions")

##  Part 9: Calculate Coverage Metric

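Coverage measures how closely the Student distributions match the Teacher distributions: it is the fraction of samples whose Student distribution lies within an L1 distance of ε from the Teacher distribution.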
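
Written as a formula, with $\hat{h}(x_i)$ the Teacher distribution for sample $i$ and $\hat{q}(x_i)$ the Student distribution (the symbol $\hat{q}$ is introduced here for illustration only), the metric computed in this part can be stated as:

$$\text{Coverage}_\varepsilon = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[\lVert \hat{q}(x_i) - \hat{h}(x_i) \rVert_1 \le \varepsilon\right]$$

For example, a Teacher distribution $(0.7, 0.2, 0.1)$ and a Student distribution $(0.6, 0.3, 0.1)$ have L1 distance $0.1 + 0.1 + 0 = 0.2$, so this sample counts as covered for $\varepsilon = 0.25$ but not for $\varepsilon = 0.15$. The L1 distance between two probability vectors always lies between 0 and 2.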

In [None]:
def compute_coverage(pred_probs: torch.Tensor, target_probs: torch.Tensor, epsilon: float = 0.15) -> float:
    """Calculates epsilon-Credal Coverage.

    A prediction "covers" the target if the L1 distance <= epsilon.
    """
    l1_distance = torch.sum(torch.abs(pred_probs - target_probs), dim=-1)
    covered = (l1_distance <= epsilon).float()
    return covered.mean().item()


# Coverage for different epsilon values
epsilons = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30]
coverages = []

for eps in epsilons:
    cov = compute_coverage(student_probs_all, teacher_probs_all, epsilon=eps)
    coverages.append(cov)
    print(f"Coverage at ε = {eps:.2f}: {cov * 100:.2f}%")

In [None]:
# Visualization: Coverage vs Epsilon
plt.figure(figsize=(10, 6))
plt.plot(epsilons, [c * 100 for c in coverages], marker="o", linewidth=2, markersize=10)
plt.xlabel("Epsilon (ε)", fontsize=12)
plt.ylabel("Coverage (%)", fontsize=12)
plt.title("Coverage as a Function of Epsilon", fontsize=14, fontweight="bold")
plt.grid(True, alpha=0.3)
plt.ylim([0, 105])

# Mark an example epsilon (the middle of the tested values, not an optimum)
optimal_idx = len(coverages) // 2
plt.axvline(x=epsilons[optimal_idx], color="red", linestyle="--", alpha=0.5, label="Example ε")
plt.legend()

plt.tight_layout()
plt.show()

print("\nInterpretation:")
print("   The higher the epsilon, the more predictions are counted as 'covered'.")
print("   A good model has high coverage at small epsilon!")

##  Part 10: Advanced Analyses

Let's look at which samples the Student performs best and worst on.

In [None]:
# Calculate L1 distances for all samples
l1_distances = torch.sum(torch.abs(student_probs_all - teacher_probs_all), dim=-1)

# Find best and worst samples
best_indices = torch.argsort(l1_distances)[:5]  # 5 best
worst_indices = torch.argsort(l1_distances, descending=True)[:5]  # 5 worst

print("Top 5 Samples (smallest L1 distance):")
for i, idx_tensor in enumerate(best_indices):
    idx = idx_tensor.item()
    dist = l1_distances[idx].item()
    print(f"  {i + 1}. Sample {idx}: L1 distance = {dist:.4f}")

print("\nBottom 5 Samples (largest L1 distance):")
for i, idx_tensor in enumerate(worst_indices):
    idx = idx_tensor.item()
    dist = l1_distances[idx].item()
    print(f"  {i + 1}. Sample {idx}: L1 distance = {dist:.4f}")

In [None]:
# Histogram of L1 distances
plt.figure(figsize=(10, 6))
plt.hist(l1_distances.numpy(), bins=30, edgecolor="black", alpha=0.7)
plt.xlabel("L1 Distance", fontsize=12)
plt.ylabel("Number of Samples", fontsize=12)
plt.title("Distribution of L1 Distances between Teacher and Student", fontsize=14, fontweight="bold")
plt.axvline(
    x=l1_distances.mean().item(),
    color="red",
    linestyle="--",
    linewidth=2,
    label=f"Mean: {l1_distances.mean().item():.3f}",
)
plt.legend()
plt.grid(axis="y", alpha=0.3)
plt.tight_layout()
plt.show()

print("\nL1 Distance Statistics:")
print(f"   Mean: {l1_distances.mean().item():.4f}")
print(f"   Median: {l1_distances.median().item():.4f}")
print(f"   Standard deviation: {l1_distances.std().item():.4f}")
print(f"   Min: {l1_distances.min().item():.4f}")
print(f"   Max: {l1_distances.max().item():.4f}")

##  Part 11: Summary and Best Practices

### What Have We Learned?

1. **First-Order Data** are approximations of the conditional distribution $p(Y|X)$
2. The **FirstOrderDataGenerator** makes it easy to generate these
3. Distributions can be **saved and loaded** (JSON format)
4. **FirstOrderDataset** combines data with distributions
5. **Knowledge Distillation** uses First-Order data as soft targets
6. **Coverage** is an important metric for uncertainty quantification

### Best Practices

**DO:**
- Always set model to `eval()` mode before generation
- Add metadata when saving
- Use `shuffle=False` when generating
- Regularly verify distributions (sum = 1.0)

**DON'T:**
- Don't leave model in training mode
- Don't ignore index alignment
- Don't save without metadata
- Don't forget device consistency
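
The "verify distributions" item from the list above can be automated with a small check. The helper `check_distributions` and its tolerance are hypothetical, a sketch rather than part of the library.

```python
import torch


def check_distributions(probs: torch.Tensor, atol: float = 1e-5) -> bool:
    """Hypothetical check: entries are non-negative and each row sums to 1."""
    non_negative = bool((probs >= 0).all())
    row_sums = probs.sum(dim=-1)
    sums_to_one = bool(torch.allclose(row_sums, torch.ones_like(row_sums), atol=atol))
    return non_negative and sums_to_one


good = torch.tensor([[0.7, 0.2, 0.1], [0.3, 0.3, 0.4]])
bad = torch.tensor([[0.7, 0.2, 0.2]])  # sums to 1.1
print(check_distributions(good), check_distributions(bad))
```

Running such a check right after generation and again after loading from disk catches both model-side and serialization-side problems early.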

### Next Steps

1. Try it with your own models and datasets
2. Experiment with different `output_mode` settings
3. Implement custom `input_getter` functions
4. Explore advanced use cases (e.g. Credal Sets)
5. Compare different Teacher models

##  Practice Exercises

Try the following extensions:

1. **Experiment 1**: Change the Teacher architecture and observe the effects on Coverage
2. **Experiment 2**: Use different temperatures for Softmax (`F.softmax(logits/T, dim=-1)`)
3. **Experiment 3**: Implement an ensemble approach with multiple Teacher models
4. **Experiment 4**: Visualize confidence calibration
5. **Experiment 5**: Test with a real dataset (e.g. MNIST or CIFAR-10)
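
As a starting point for Experiment 2, the sketch below applies the temperature formula to made-up logits. Temperatures above 1 flatten the distribution, and temperatures below 1 sharpen it.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 0.5, -1.0])  # made-up logits for one sample

for temperature in (0.5, 1.0, 2.0, 5.0):
    # Dividing the logits by T before the softmax controls how peaked the result is.
    probs = F.softmax(logits / temperature, dim=-1)
    print(f"T={temperature}: {[round(p, 3) for p in probs.tolist()]}")
```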

##  Further Resources

- **Documentation**: `data_generation_guide.md`
- **API Reference**: `api_reference.md`
- **Example Script**: `simple_usage.py`
- **Tests**: `test_first_order_generator.py`


