# Experiment 2a: Mechanism Decomposition

**Design:** 5 conditions (baseline, priming, feedback, monitored, combined), N=20 per cell.  
**Candidates:** 3 (Google prestige-high, Millbrook prestige-low, Ambiguous).  
**Measures:** 3 binary real-world-impact features per run (`unsupported_caveat`, `unconfigured_criteria`, `evidence_gap`).  
**Total:** 3 candidates × 5 conditions × 20 runs = 300 pipeline runs.

Pre-registered: evidence text, fixtures, and measures unchanged from Experiment 1.  
See `Experiment 2a Design.md` in the blog vault for the full pre-registration.
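
As a quick sanity check on the design arithmetic, the grid can be enumerated directly. This is a minimal sketch: the labels are copied from the design summary above, while the variable names are illustrative and are not part of the `hiring_agents` package.

```python
from itertools import product

# Labels taken from the design summary; variable names are illustrative only.
candidates = ["Google", "Millbrook", "Ambiguous"]
conditions = ["baseline", "priming", "feedback", "monitored", "combined"]
runs_per_cell = 20

# One entry per pipeline run: (candidate, condition, run index).
design_grid = list(product(candidates, conditions, range(runs_per_cell)))

# A "cell" is one candidate × condition combination.
cells = {(cand, cond) for cand, cond, _ in design_grid}

print(len(cells))        # 15 cells
print(len(design_grid))  # 300 pipeline runs
```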

In [1]:
import os

# --- Config ---
MODEL = "vertex_ai/gemini-2.5-pro"
os.environ.setdefault("VERTEXAI_PROJECT", "cytora-dev-risk-stream")
os.environ.setdefault("VERTEXAI_LOCATION", "us-central1")

N_RUNS = 20  # Set to 2 for verification, 20 for the real experiment
N_WORKERS = 8  # Parallel workers for concurrent API calls (set to 1 for sequential)

In [2]:
import sys
sys.path.insert(0, "..")

from hiring_agents.agent import run_screening
from hiring_agents.borderline_fixtures import (
    ambiguous_candidate,
    prestige_high,
    prestige_low,
)
from hiring_agents.fixtures import senior_python_engineer_role
from hiring_agents.llm import LiteLLMClient
from hiring_agents.monitoring_evidence import (
    combined_evidence,
    priming_only_evidence,
    feedback_only_evidence,
    monitored_only_evidence,
)
from hiring_agents.state import TransactionState
from hiring_agents.analysis import (
    has_unsupported_caveats,
    mentions_unconfigured_criteria,
    flags_evidence_gap,
)

## Runner

Runs every candidate × condition cell `N_RUNS` times and codes the three binary measures for each run.

In [3]:
from concurrent.futures import ThreadPoolExecutor, as_completed
import threading

ROLE = senior_python_engineer_role()

CANDIDATES = {
    "Google": prestige_high,
    "Millbrook": prestige_low,
    "Ambiguous": ambiguous_candidate,
}

CONDITION_ORDER = ["baseline", "priming", "feedback", "monitored", "combined"]

CONDITIONS = {
    "baseline": LiteLLMClient(model=MODEL),
    "priming": LiteLLMClient(model=MODEL, monitoring_context=priming_only_evidence()),
    "feedback": LiteLLMClient(model=MODEL, monitoring_context=feedback_only_evidence()),
    "monitored": LiteLLMClient(model=MODEL, monitoring_context=monitored_only_evidence()),
    "combined": LiteLLMClient(model=MODEL, monitoring_context=combined_evidence()),
}


def _run_one(cond_name, cand_label, factory, run_idx):
    """Run a single pipeline call and return the result dict."""
    client = CONDITIONS[cond_name]
    state = TransactionState(
        application=factory(),
        role_requirements=ROLE,
    )
    output = run_screening(state, llm_client=client)
    summary = state.summarisation.summary if state.summarisation else ""
    combined_text = output.reasoning + " " + summary

    all_passed = all(
        c.status == "pass" for c in output.criteria_evaluated
    )

    return {
        "candidate": cand_label,
        "condition": cond_name,
        "run": run_idx,
        "match": output.match,
        "action": output.recommended_action,
        "confidence": output.confidence,
        "reasoning": output.reasoning,
        "summary": summary,
        "criteria": [
            {"criterion": c.criterion, "status": c.status, "detail": c.detail}
            for c in output.criteria_evaluated
        ],
        "all_criteria_passed": all_passed,
        "unsupported_caveat": has_unsupported_caveats(combined_text, all_passed),
        "unconfigured_criteria": mentions_unconfigured_criteria(combined_text),
        "evidence_gap": flags_evidence_gap(combined_text),
    }


def run_experiment(n_runs=N_RUNS, n_workers=N_WORKERS):
    """Run all candidates × conditions, return list of result dicts."""
    tasks = [
        (cond_name, cand_label, factory, run_idx)
        for cond_name in CONDITION_ORDER
        for cand_label, factory in CANDIDATES.items()
        for run_idx in range(n_runs)
    ]
    total = len(tasks)
    results = []
    lock = threading.Lock()
    done_count = [0]

    def run_and_track(args):
        result = _run_one(*args)
        with lock:
            done_count[0] += 1
            if done_count[0] % 10 == 0 or done_count[0] == total:
                print(f"  {done_count[0]}/{total} runs complete")
        return result

    if n_workers <= 1:
        # Sequential fallback
        for t in tasks:
            results.append(run_and_track(t))
    else:
        with ThreadPoolExecutor(max_workers=n_workers) as pool:
            futures = [pool.submit(run_and_track, t) for t in tasks]
            for future in as_completed(futures):
                results.append(future.result())

    return results

## Execute

In [4]:
raw_results = run_experiment(n_runs=N_RUNS)
print(f"Total runs: {len(raw_results)}")

  10/300 runs complete
  20/300 runs complete
  30/300 runs complete
  40/300 runs complete
  50/300 runs complete
  60/300 runs complete
  70/300 runs complete
  80/300 runs complete
  90/300 runs complete
  100/300 runs complete
  110/300 runs complete
  120/300 runs complete
  130/300 runs complete
  140/300 runs complete
  150/300 runs complete
  160/300 runs complete
  170/300 runs complete
  180/300 runs complete
  190/300 runs complete
  200/300 runs complete
  210/300 runs complete
  220/300 runs complete
  230/300 runs complete
  240/300 runs complete
  250/300 runs complete
  260/300 runs complete
  270/300 runs complete
  280/300 runs complete
  290/300 runs complete
  300/300 runs complete
Total runs: 300


## Results Table

Counts and proportions for each measure, per candidate and condition, ordered baseline → priming → feedback → monitored → combined.

In [5]:
import pandas as pd

df = pd.DataFrame(raw_results)

measures = ["unsupported_caveat", "unconfigured_criteria", "evidence_gap"]

summary = (
    df.groupby(["candidate", "condition"])[measures]
    .agg(["sum", "mean"])
)

summary.columns = [f"{m}_{stat}" for m, stat in summary.columns]
for m in measures:
    summary = summary.rename(columns={
        f"{m}_sum": f"{m}_count",
        f"{m}_mean": f"{m}_prop",
    })

# Reorder conditions
summary = summary.reindex(CONDITION_ORDER, level="condition")

summary.style.format({
    col: "{:.2f}" for col in summary.columns if col.endswith("_prop")
})

| candidate | condition | unsupported_caveat_count | unsupported_caveat_prop | unconfigured_criteria_count | unconfigured_criteria_prop | evidence_gap_count | evidence_gap_prop |
|---|---|---|---|---|---|---|---|
| Ambiguous | baseline | 0 | 0.00 | 0 | 0.00 | 1 | 0.05 |
| Ambiguous | priming | 1 | 0.05 | 0 | 0.00 | 11 | 0.55 |
| Ambiguous | feedback | 3 | 0.15 | 0 | 0.00 | 14 | 0.70 |
| Ambiguous | monitored | 0 | 0.00 | 0 | 0.00 | 2 | 0.10 |
| Ambiguous | combined | 1 | 0.05 | 0 | 0.00 | 12 | 0.60 |
| Google | baseline | 0 | 0.00 | 20 | 1.00 | 0 | 0.00 |
| Google | priming | 0 | 0.00 | 20 | 1.00 | 0 | 0.00 |
| Google | feedback | 3 | 0.15 | 20 | 1.00 | 5 | 0.25 |
| Google | monitored | 0 | 0.00 | 20 | 1.00 | 0 | 0.00 |
| Google | combined | 0 | 0.00 | 20 | 1.00 | 0 | 0.00 |


## Statistical Tests

Fisher's exact test per candidate per measure: each condition vs baseline.  
Bonferroni-corrected α = 0.05 / 4 = **0.0125** (4 comparisons per cell).
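
To make the test concrete, here is a minimal standard-library sketch of a two-sided Fisher's exact test, applied to one comparison from this experiment (Google, `evidence_gap`, baseline 0/20 vs feedback 5/20). The helper `fisher_exact_two_sided` is a hypothetical illustration and not part of the notebook's pipeline, which uses `scipy.stats.fisher_exact`. It assumes the usual two-sided definition: sum the probabilities of all tables no more likely than the observed one.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x hits in row 1 with all margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum every table that is at most as probable as the observed one
    # (small relative tolerance guards against floating-point ties).
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, p)


ALPHA = 0.05 / 4  # Bonferroni: 4 comparisons against baseline per cell

# Rows are [hits, misses]: baseline 0/20, feedback 5/20.
p_value = fisher_exact_two_sided(0, 20, 5, 15)
print(round(p_value, 4))  # 0.0471
print(p_value < ALPHA)    # False
```

This comparison would pass an uncorrected 0.05 threshold but does not pass the Bonferroni-corrected 0.0125.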

In [6]:
from scipy.stats import fisher_exact

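cell_sizes = df.groupby(["candidate", "condition"]).size()
assert cell_sizes.nunique() == 1, "Cells are unbalanced; a single n is not valid"
n = cell_sizes.iloc[0]  # N per cell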
ALPHA = 0.0125  # Bonferroni: 0.05 / 4 comparisons

stats_rows = []
for cand in CANDIDATES:
    for measure in measures:
        baseline_hits = df[
            (df["candidate"] == cand) & (df["condition"] == "baseline")
        ][measure].sum()

        for cond in CONDITION_ORDER[1:]:  # skip baseline
            cond_hits = df[
                (df["candidate"] == cand) & (df["condition"] == cond)
            ][measure].sum()

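            # Row 0 = baseline, row 1 = condition, so odds_ratio is the odds of a
            # hit under baseline divided by the odds under the condition: values
            # below 1 mean the condition raised the rate. It is NaN or inf when a
            # row or column of the table is all zeros.
            table = [
                [int(baseline_hits), int(n - baseline_hits)],
                [int(cond_hits), int(n - cond_hits)],
            ]
            odds_ratio, p_value = fisher_exact(table)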

            stats_rows.append({
                "candidate": cand,
                "measure": measure,
                "condition": cond,
                "baseline_count": int(baseline_hits),
                "baseline_prop": baseline_hits / n,
                "cond_count": int(cond_hits),
                "cond_prop": cond_hits / n,
                "odds_ratio": odds_ratio,
                "p_value": p_value,
                "sig": "*" if p_value < ALPHA else "",
            })

stats_df = pd.DataFrame(stats_rows)
stats_df.style.format({
    "baseline_prop": "{:.2f}",
    "cond_prop": "{:.2f}",
    "odds_ratio": "{:.2f}",
    "p_value": "{:.4f}",
}).apply(
    lambda row: [
        "background-color: #ffffcc" if row["p_value"] < ALPHA else "" for _ in row
    ],
    axis=1,
)

|  | candidate | measure | condition | baseline_count | baseline_prop | cond_count | cond_prop | odds_ratio | p_value | sig |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Google | unsupported_caveat | priming | 0 | 0.00 | 0 | 0.00 |  | 1.0000 |  |
| 1 | Google | unsupported_caveat | feedback | 0 | 0.00 | 3 | 0.15 | 0.00 | 0.2308 |  |
| 2 | Google | unsupported_caveat | monitored | 0 | 0.00 | 0 | 0.00 |  | 1.0000 |  |
| 3 | Google | unsupported_caveat | combined | 0 | 0.00 | 0 | 0.00 |  | 1.0000 |  |
| 4 | Google | unconfigured_criteria | priming | 20 | 1.00 | 20 | 1.00 |  | 1.0000 |  |
| 5 | Google | unconfigured_criteria | feedback | 20 | 1.00 | 20 | 1.00 |  | 1.0000 |  |
| 6 | Google | unconfigured_criteria | monitored | 20 | 1.00 | 20 | 1.00 |  | 1.0000 |  |
| 7 | Google | unconfigured_criteria | combined | 20 | 1.00 | 20 | 1.00 |  | 1.0000 |  |
| 8 | Google | evidence_gap | priming | 0 | 0.00 | 0 | 0.00 |  | 1.0000 |  |
| 9 | Google | evidence_gap | feedback | 0 | 0.00 | 5 | 0.25 | 0.00 | 0.0471 |  |


## Prestige Pair Comparison

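Within each condition, compare rates between Google and Millbrook using Fisher's exact test.  
If the system is unbiased with respect to prestige, the two candidates should show statistically indistinguishable rates. Rows are highlighted at an uncorrected α = 0.05.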

In [7]:
paired_rows = []
for condition in CONDITION_ORDER:
    for measure in measures:
        g_hits = df[
            (df["candidate"] == "Google") & (df["condition"] == condition)
        ][measure].sum()
        m_hits = df[
            (df["candidate"] == "Millbrook") & (df["condition"] == condition)
        ][measure].sum()

        table = [
            [int(g_hits), int(n - g_hits)],
            [int(m_hits), int(n - m_hits)],
        ]
        odds_ratio, p_value = fisher_exact(table)

        paired_rows.append({
            "condition": condition,
            "measure": measure,
            "google_count": int(g_hits),
            "google_prop": g_hits / n,
            "millbrook_count": int(m_hits),
            "millbrook_prop": m_hits / n,
            "odds_ratio": odds_ratio,
            "p_value": p_value,
        })

paired_df = pd.DataFrame(paired_rows)
paired_df.style.format({
    "google_prop": "{:.2f}",
    "millbrook_prop": "{:.2f}",
    "odds_ratio": lambda v: "\u2014" if pd.isna(v) or v == float("inf") else f"{v:.2f}",
    "p_value": "{:.4f}",
}).apply(
    lambda row: [
        "background-color: #ffffcc" if row["p_value"] < 0.05 else "" for _ in row
    ],
    axis=1,
)

|  | condition | measure | google_count | google_prop | millbrook_count | millbrook_prop | odds_ratio | p_value |
|---|---|---|---|---|---|---|---|---|
| 0 | baseline | unsupported_caveat | 0 | 0.00 | 0 | 0.00 | — | 1.0000 |
| 1 | baseline | unconfigured_criteria | 20 | 1.00 | 15 | 0.75 | — | 0.0471 |
| 2 | baseline | evidence_gap | 0 | 0.00 | 0 | 0.00 | — | 1.0000 |
| 3 | priming | unsupported_caveat | 0 | 0.00 | 0 | 0.00 | — | 1.0000 |
| 4 | priming | unconfigured_criteria | 20 | 1.00 | 14 | 0.70 | — | 0.0202 |
| 5 | priming | evidence_gap | 0 | 0.00 | 0 | 0.00 | — | 1.0000 |
| 6 | feedback | unsupported_caveat | 3 | 0.15 | 2 | 0.10 | 1.59 | 1.0000 |
| 7 | feedback | unconfigured_criteria | 20 | 1.00 | 6 | 0.30 | — | 0.0000 |
| 8 | feedback | evidence_gap | 5 | 0.25 | 4 | 0.20 | 1.33 | 1.0000 |
| 9 | monitored | unsupported_caveat | 0 | 0.00 | 0 | 0.00 | — | 1.0000 |
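
One way to read the Google vs Millbrook comparison is as a difference in proportions per condition. The sketch below computes that difference for `unconfigured_criteria`, using only the counts visible in the rows above (baseline, priming, feedback). Defining the "differential" as Google's proportion minus Millbrook's is an assumption made for illustration, and the variable names are hypothetical.

```python
N_PER_CELL = 20

# (google_count, millbrook_count) for unconfigured_criteria,
# copied from the output rows shown above.
unconfigured_counts = {
    "baseline": (20, 15),
    "priming": (20, 14),
    "feedback": (20, 6),
}

# Assumed definition: differential = Google proportion - Millbrook proportion.
differential = {
    cond: (google - millbrook) / N_PER_CELL
    for cond, (google, millbrook) in unconfigured_counts.items()
}

for cond, diff in differential.items():
    print(f"{cond}: {diff:+.2f}")
# baseline: +0.25
# priming: +0.30
# feedback: +0.70
```

In these three conditions Google's count stays at 20/20, so the differential changes only because Millbrook's count changes.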


## Raw Narratives

All reasoning and summary texts, for manual inspection. Expand a candidate/condition block to read its runs.

In [8]:
from html import escape

from IPython.display import HTML, display

html_parts = []
for cand in CANDIDATES:
    for condition in CONDITION_ORDER:
        # Parallel runs finish in arbitrary order, so sort by run index for display.
        subset = df[
            (df["candidate"] == cand) & (df["condition"] == condition)
        ].sort_values("run")
        label = f"{cand} \u2014 {condition} ({len(subset)} runs)"
        inner = ""
        for _, row in subset.iterrows():
            codes = []
            if row["unsupported_caveat"]:
                codes.append("CAVEAT")
            if row["unconfigured_criteria"]:
                codes.append("UNCONFIGURED")
            if row["evidence_gap"]:
                codes.append("GAP")
            code_str = ", ".join(codes) if codes else "none"
            # Escape model output so stray markup cannot break the rendered HTML.
            inner += (
                f"<p><strong>Run {row['run'] + 1}</strong> "
                f"[{code_str}]<br>"
                f"<em>Reasoning:</em> {escape(row['reasoning'])}<br>"
                f"<em>Summary:</em> {escape(row['summary'])}</p>\n"
            )
        html_parts.append(
            f"<details><summary><strong>{label}</strong></summary>\n{inner}</details>\n"
        )

display(HTML("\n".join(html_parts)))

## Notes

### Mechanism attribution

- **Priming effect:**
- **Feedback effect:**
- **Monitoring effect:**
- **Combined (positive control):**

### Prestige leakage by condition

- **Google vs Millbrook differential:**
- **Does any mechanism reduce the differential?**

### Flash vs Pro replication

- **Does the combined condition replicate Experiment 1 (Pro)?**

### Implications for Experiment 2b

- **Conditions to retain/drop:**
- **Next steps:**