# Aggregation Experiments: Analysis

This notebook presents an empirical analysis of the three mechanisms (support expansion, binding set contraction, and feasibility expansion) on toy reference-generation tasks. For each mechanism, we analyze one task and compare direct prompting against aggregation on that task.

For each experiment we have:
- **Aggregation operation**: L_A and L_B (the individual lists), L_A_B_agg (their set aggregate), and L_f (the list from a direct prompt expected to land close to L_A_B_agg)
- **Enumeration of all prompts**: responses to every prompt obtained by combining the prompt dimensions with set operations

We compute a Wilson confidence interval for each coordinate of each output vector, which gives one CI box per list or prompt. P_agg is the single enumerated prompt whose CI box is closest (minimum L1 distance) to the CI box of L_A_B_agg. A large L1 distance between P_agg and L_A_B_agg indicates that aggregation reaches an output vector that no prompt in the enumerated set elicits directly.

## 1. Data Generation

The data in `experiment_data/` was generated using three scripts. Below are the commands (commented out) for reference.

In [None]:
# --- Data generation commands (do NOT run in notebook) ---

# Generate aggregate prompts (L_A, L_B, L_A_B_agg, L_f) for all 3 experiments:
# !python scripts/generate_aggregate.py --seeds 30 --list-length 20

# Enumerate all set-theoretic prompt combinations:
# !python scripts/enumerate_prompts.py --experiment binding_set --seeds 30
# !python scripts/enumerate_prompts.py --experiment feasibility --seeds 30
# !python scripts/enumerate_prompts.py --experiment support --seeds 30

## 2. Load Data

In [1]:
import json
import sys
import os
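# Notebooks do not define __file__, so add the working directory to the
# import path so that the local `config` and `analyze` modules can be imported.
sys.path.insert(0, os.getcwd())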

from config import OUTPUT_DIMS, LIST_LABELS, get_list_name, get_vector_with_stats, get_vector
from analyze import wilson_ci, EXPERIMENTS

DATA_DIR = "experiment_data"
EXPERIMENT_NAMES = ["support", "binding_set", "feasibility"]

# Load aggregate + direct prompting data
aggregate_data = {}
all_prompts_data = {}

for exp in EXPERIMENT_NAMES:
    with open(os.path.join(DATA_DIR, f"{exp}_aggregate.json")) as f:
        aggregate_data[exp] = json.load(f)
    with open(os.path.join(DATA_DIR, f"{exp}_all_prompts.json")) as f:
        all_prompts_data[exp] = json.load(f)

# Print summary
for exp in EXPERIMENT_NAMES:
    cfg = aggregate_data[exp]["config"]
    n_prompts = len(all_prompts_data[exp]["prompts"])
    print(f"{exp}:")
    print(f"  Model: {cfg['model']}, Seeds: {cfg['num_seeds']}, List length: {cfg['list_length']}")
    print(f"  Dims: {OUTPUT_DIMS[exp]}")
    print(f"  Enumerated prompts: {n_prompts}")
    print()

support:
  Model: gpt-4o-mini, Seeds: 30, List length: 20
  Dims: ['x1', 'x2', 'x3']
  Enumerated prompts: 10

binding_set:
  Model: gpt-4o-mini, Seeds: 30, List length: 20
  Dims: ['x1', 'x2', 'x3', 'x4', 'x5']
  Enumerated prompts: 10

feasibility:
  Model: gpt-4o-mini, Seeds: 30, List length: 20
  Dims: ['x1', 'x2', 'x3']
  Enumerated prompts: 40



## 3. Wilson Confidence Interval Analysis

For each dimension, every generated paper counts as one Bernoulli trial, so the sample size is `n = n_seeds * list_length` (30 × 20 = 600 for the runs loaded above).
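
For readers unfamiliar with the interval, here is a minimal sketch of the standard Wilson score interval for a binomial proportion. It is an illustration only: the project's `analyze.wilson_ci` is not shown in this notebook and may use a different critical value or variant, so this sketch is not guaranteed to reproduce the intervals printed below. The name `wilson_interval` and the default `z = 1.96` (a two-sided 95% level) are assumptions made for this example.

```python
import math


def wilson_interval(p_hat, n, z=1.96):
    """Standard Wilson score interval for a proportion p_hat observed over n trials.

    Hypothetical helper for illustration; z=1.96 is an assumed 95% critical value.
    Returns (lo, hi), clipped to [0, 1].
    """
    if n <= 0:
        raise ValueError("n must be positive")
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Sample size as defined above: seeds x list length.
n = 30 * 20
for p in (0.0, 0.5, 1.0):
    lo, hi = wilson_interval(p, n)
    print(f"p_hat={p:.1f}: [{lo:.4f}, {hi:.4f}]")
```

Unlike the normal-approximation interval, the Wilson interval keeps a nonzero width when the observed proportion is exactly 0 or 1, which matters here because several coordinates of L_A_B_agg sit at those extremes.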

In [2]:
def compute_wilson_ci_box(stats, dims, n):
    """Compute Wilson CI for each dimension. Returns {dim: (lo, hi)}."""
    return {d: wilson_ci(stats[d]["mean"], n) for d in dims}


def format_ci(mean, lo, hi):
    """Format a CI as 'mean [lo, hi]'."""
    return f"{mean:.3f} [{lo:.3f}, {hi:.3f}]"

In [8]:
for exp in EXPERIMENT_NAMES:
    data = aggregate_data[exp]
    dims = OUTPUT_DIMS[exp]
    cfg = data["config"]
    n = cfg["num_seeds"] * cfg["list_length"]

    print(f"{'='*80}")
    print(f"{EXPERIMENTS[exp].name} (n = {cfg['num_seeds']} seeds x {cfg['list_length']} papers = {n})")
    print(f"{'='*80}")

    # Header
    dim_width = 22
    header = f"{'Label':<25s}" + "".join(f"{d:>{dim_width}s}" for d in dims)
    print(header)
    print("-" * len(header))

    # Aggregate prompts (L_A, L_B, L_A_B_agg, L_f)
    for generic in ["L_A", "L_B", "L_A_B_agg", "L_f"]:
        list_name = get_list_name(exp, generic)
        stats = get_vector_with_stats(data, exp, generic)
        ci_box = compute_wilson_ci_box(stats, dims, n)
        label = f"{generic} ({list_name})"
        row = f"{label:<25s}"
        for d in dims:
            row += f"{format_ci(stats[d]['mean'], ci_box[d][0], ci_box[d][1]):>{dim_width}s}"
        print(row)

    print()

Binding Set Contraction (n = 30 seeds x 20 papers = 600)
Label                                        x1                    x2                    x3                    x4                    x5
---------------------------------------------------------------------------------------------------------------------------------------
L_A (L_y1)                 0.832 [0.800, 0.859]  0.108 [0.086, 0.136]  0.000 [0.000, 0.005]  0.000 [0.000, 0.005]  0.010 [0.005, 0.022]
L_B (L_y2)                 0.167 [0.139, 0.199]  0.000 [0.000, 0.005]  0.728 [0.691, 0.762]  0.047 [0.033, 0.067]  0.014 [0.007, 0.027]
L_A_B_agg (L_intersection)  1.000 [0.995, 1.000]  0.000 [0.000, 0.005]  0.000 [0.000, 0.005]  0.000 [0.000, 0.005]  0.000 [0.000, 0.005]
L_f (L_x1_or)              0.885 [0.857, 0.908]  0.000 [0.000, 0.005]  0.070 [0.052, 0.093]  0.007 [0.003, 0.017]  0.000 [0.000, 0.005]

Feasibility Expansion (n = 30 seeds x 20 papers = 600)
Label                                        x1                    x2 

## 4. Find P_agg

P_agg is the single prompt (from the enumerated set) whose Wilson CI box has the smallest min-L1 distance to L_A_B_agg's CI box.

**Min-L1 between CI boxes:** For each coordinate, the gap is `max(0, lo_a - hi_b, lo_b - hi_a)` (zero if the intervals overlap). The min-L1 distance is the sum of gaps across all coordinates.
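
As a worked check of this definition, the sketch below recomputes the gaps by hand for the Support experiment, using the interval endpoints printed (rounded to four decimals) in the Support output later in this section. The helper name `interval_gap` is hypothetical and exists only for this example.

```python
def interval_gap(a, b):
    """Distance between two closed intervals (lo, hi); zero when they overlap."""
    lo_a, hi_a = a
    lo_b, hi_b = b
    return max(0.0, lo_a - hi_b, lo_b - hi_a)


# Endpoints copied from the printed Support output (already rounded to 4 decimals).
agg_box = {"x1": (0.3189, 0.3953), "x2": (0.3874, 0.4663), "x3": (0.0369, 0.0728)}
prompt_box = {"x1": (0.6022, 0.7314), "x2": (0.0000, 0.0149), "x3": (0.0347, 0.1019)}

gaps = {d: interval_gap(agg_box[d], prompt_box[d]) for d in agg_box}
total = sum(gaps.values())

for d, g in gaps.items():
    print(f"{d}: gap = {g:.4f}")
print(f"min-L1 = {total:.4f}")
```

The x3 intervals overlap, so that coordinate contributes nothing, and the x1 and x2 gaps sum to roughly 0.579. This agrees with the reported min L1 of 0.5793 up to the rounding of the printed endpoints.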

In [3]:
def min_l1_between_ci_boxes(box_a, box_b, dims):
    """
    Compute minimum L1 distance between two CI boxes.
    
    Each box is {dim: (lo, hi)}.
    For each coordinate: gap = max(0, lo_a - hi_b, lo_b - hi_a)
    Returns (total_gap, per_dim_gaps).
    """
    gaps = {}
    for d in dims:
        lo_a, hi_a = box_a[d]
        lo_b, hi_b = box_b[d]
        gap = max(0, lo_a - hi_b, lo_b - hi_a)
        gaps[d] = gap
    return sum(gaps.values()), gaps

In [4]:
for exp in EXPERIMENT_NAMES:
    dims = OUTPUT_DIMS[exp]
    data = aggregate_data[exp]
    cfg = data["config"]
    n_agg = cfg["num_seeds"] * cfg["list_length"]

    # Wilson CI box for L_A_B_agg
    agg_stats = get_vector_with_stats(data, exp, "L_A_B_agg")
    agg_box = compute_wilson_ci_box(agg_stats, dims, n_agg)

    print(f"{'='*80}")
    print(f"{EXPERIMENTS[exp].name}: Finding P_agg")
    print(f"{'='*80}")
    print(f"L_A_B_agg CI box (n={n_agg}):")
    for d in dims:
        print(f"  {d}: [{agg_box[d][0]:.4f}, {agg_box[d][1]:.4f}]  (mean={agg_stats[d]['mean']:.4f})")
    print()

    # Compute min-L1 for each prompt
    results = []
    prompts = all_prompts_data[exp]["prompts"]
    list_length = all_prompts_data[exp]["config"]["list_length"]

    for prompt in prompts:
        n_prompt = prompt["num_seeds"] * list_length
        prompt_box = {d: wilson_ci(prompt["mean"][d], n_prompt) for d in dims}
        total_gap, per_dim = min_l1_between_ci_boxes(agg_box, prompt_box, dims)
        results.append({
            "label": prompt["label"],
            "seeds": prompt["num_seeds"],
            "n": n_prompt,
            "min_l1": total_gap,
            "gaps": per_dim,
            "mean": prompt["mean"],
            "ci_box": prompt_box,
        })

    # Sort by min_l1
    results.sort(key=lambda r: r["min_l1"])

    # P_agg details
    p_agg = results[0]
    print(f"\nP_agg = '{p_agg['label']}' with min L1 = {p_agg['min_l1']:.4f}")
    print("\nPer-coordinate comparison:")
    print(f"{'Dim':<6s}{'L_A_B_agg CI':>25s}{'P_agg CI':>25s}{'Gap':>10s}")
    print("-" * 66)
    for d in dims:
        agg_ci = f"[{agg_box[d][0]:.4f}, {agg_box[d][1]:.4f}]"
        p_ci = f"[{p_agg['ci_box'][d][0]:.4f}, {p_agg['ci_box'][d][1]:.4f}]"
        print(f"{d:<6s}{agg_ci:>25s}{p_ci:>25s}{p_agg['gaps'][d]:>10.4f}")

    print()

Support Expansion: Finding P_agg
L_A_B_agg CI box (n=600):
  x1: [0.3189, 0.3953]  (mean=0.3562)
  x2: [0.3874, 0.4663]  (mean=0.4263)
  x3: [0.0369, 0.0728]  (mean=0.0520)


P_agg = 'y1∧¬y2' with min L1 = 0.5793

Per-coordinate comparison:
Dim                L_A_B_agg CI                 P_agg CI       Gap
------------------------------------------------------------------
x1             [0.3189, 0.3953]         [0.6022, 0.7314]    0.2068
x2             [0.3874, 0.4663]         [0.0000, 0.0149]    0.3725
x3             [0.0369, 0.0728]         [0.0347, 0.1019]    0.0000

Binding Set Contraction: Finding P_agg
L_A_B_agg CI box (n=600):
  x1: [0.9950, 1.0000]  (mean=1.0000)
  x2: [0.0000, 0.0050]  (mean=0.0000)
  x3: [0.0000, 0.0050]  (mean=0.0000)
  x4: [0.0000, 0.0050]  (mean=0.0000)
  x5: [0.0000, 0.0050]  (mean=0.0000)


P_agg = 'y1' with min L1 = 0.2143

Per-coordinate comparison:
Dim                L_A_B_agg CI                 P_agg CI       Gap
-------------------------------------