# Experiment 1: Monitoring Evidence Effects on Narrative Bias

**Design:** Baseline vs Combined monitoring evidence, N=20 per cell.  
**Candidates:** 5 (prestige pair, non-traditional pair, ambiguity singleton).  
**Measures:** 3 binary real-world-impact features per run.  
**Total:** 5 candidates x 2 conditions x 20 runs = 200 pipeline runs.

Pre-registered: fixtures, monitoring evidence text, and `llm.py` are unchanged from pilot.

In [None]:
import os

# --- Config ---
MODEL = "vertex_ai/gemini-2.5-pro"
os.environ.setdefault("VERTEXAI_PROJECT", "your-gcp-project")  # Set to your GCP project
os.environ.setdefault("VERTEXAI_LOCATION", "us-central1")

N_RUNS = 20  # Set to 2 for verification, 20 for the real experiment

In [None]:
from hiring_agents.agent import run_screening
from hiring_agents.borderline_fixtures import (
    ambiguous_candidate,
    conventional_background,
    non_traditional_background,
    prestige_high,
    prestige_low,
)
from hiring_agents.fixtures import senior_python_engineer_role
from hiring_agents.llm import LiteLLMClient
from hiring_agents.monitoring_evidence import combined_evidence
from hiring_agents.state import TransactionState
from hiring_agents.analysis import (
    has_unsupported_caveats,
    mentions_unconfigured_criteria,
    flags_evidence_gap,
)

## Runner

Runs all candidates x conditions, codes the three binary measures per run.

In [3]:
ROLE = senior_python_engineer_role()

CANDIDATES = {
    "Google": prestige_high,
    "Millbrook": prestige_low,
    "Conventional": conventional_background,
    "Non-traditional": non_traditional_background,
    "Ambiguous": ambiguous_candidate,
}


def run_experiment(n_runs=N_RUNS):
    """Run all candidates x conditions, return list of result dicts."""
    baseline_client = LiteLLMClient(model=MODEL)
    monitored_client = LiteLLMClient(model=MODEL, monitoring_context=combined_evidence())

    results = []
    conditions = [
        ("baseline", baseline_client),
        ("monitored", monitored_client),
    ]

    for cond_name, client in conditions:
        for cand_label, factory in CANDIDATES.items():
            for run_idx in range(n_runs):
                state = TransactionState(
                    application=factory(),
                    role_requirements=ROLE,
                )
                output = run_screening(state, llm_client=client)
                summary = state.summarisation.summary if state.summarisation else ""
                combined_text = output.reasoning + " " + summary

                # Check if all criteria passed (for unsupported caveat measure)
                all_passed = all(
                    c.status == "pass" for c in output.criteria_evaluated
                )

                results.append({
                    "candidate": cand_label,
                    "condition": cond_name,
                    "run": run_idx,
                    "match": output.match,
                    "action": output.recommended_action,
                    "confidence": output.confidence,
                    "reasoning": output.reasoning,
                    "summary": summary,
                    "criteria": [
                        {"criterion": c.criterion, "status": c.status, "detail": c.detail}
                        for c in output.criteria_evaluated
                    ],
                    "all_criteria_passed": all_passed,
                    # Binary measures
                    "unsupported_caveat": has_unsupported_caveats(combined_text, all_passed),
                    "unconfigured_criteria": mentions_unconfigured_criteria(combined_text),
                    "evidence_gap": flags_evidence_gap(combined_text),
                })

                # Progress indicator
                done = len(results)
                total = len(CANDIDATES) * len(conditions) * n_runs
                if done % 10 == 0 or done == total:
                    print(f"  {done}/{total} runs complete")

    return results

## Execute

In [4]:
raw_results = run_experiment(n_runs=N_RUNS)
print(f"Total runs: {len(raw_results)}")

  10/200 runs complete
  20/200 runs complete
  30/200 runs complete
  40/200 runs complete
  50/200 runs complete
  60/200 runs complete
  70/200 runs complete
  80/200 runs complete
  90/200 runs complete
  100/200 runs complete
  110/200 runs complete
  120/200 runs complete
  130/200 runs complete
  140/200 runs complete
  150/200 runs complete
  160/200 runs complete
  170/200 runs complete
  180/200 runs complete
  190/200 runs complete
  200/200 runs complete
Total runs: 200


## Results Table

Proportions per candidate per condition per measure.

In [5]:
import pandas as pd

df = pd.DataFrame(raw_results)

measures = ["unsupported_caveat", "unconfigured_criteria", "evidence_gap"]

summary = (
    df.groupby(["candidate", "condition"])[measures]
    .agg(["sum", "mean"])
)

# Flatten multi-level columns for readability
summary.columns = [f"{m}_{stat}" for m, stat in summary.columns]
for m in measures:
    summary = summary.rename(columns={
        f"{m}_sum": f"{m}_count",
        f"{m}_mean": f"{m}_prop",
    })

summary.style.format({
    col: "{:.2f}" for col in summary.columns if col.endswith("_prop")
})

Unnamed: 0_level_0,Unnamed: 1_level_0,unsupported_caveat_count,unsupported_caveat_prop,unconfigured_criteria_count,unconfigured_criteria_prop,evidence_gap_count,evidence_gap_prop
candidate,condition,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Ambiguous,baseline,0,0.0,0,0.0,0,0.0
Ambiguous,monitored,1,0.05,0,0.0,14,0.7
Conventional,baseline,0,0.0,20,1.0,0,0.0
Conventional,monitored,0,0.0,20,1.0,0,0.0
Google,baseline,0,0.0,20,1.0,0,0.0
Google,monitored,0,0.0,20,1.0,0,0.0
Millbrook,baseline,0,0.0,11,0.55,0,0.0
Millbrook,monitored,0,0.0,14,0.7,0,0.0
Non-traditional,baseline,0,0.0,20,1.0,0,0.0
Non-traditional,monitored,0,0.0,20,1.0,0,0.0


## Statistical Tests

Fisher's exact test per candidate per measure: baseline vs monitored.

In [6]:
from scipy.stats import fisher_exact

n = df.groupby(["candidate", "condition"]).size().iloc[0]  # N per cell

stats_rows = []
for cand in CANDIDATES:
    for measure in measures:
        baseline_hits = df[(df["candidate"] == cand) & (df["condition"] == "baseline")][measure].sum()
        monitored_hits = df[(df["candidate"] == cand) & (df["condition"] == "monitored")][measure].sum()

        # 2x2 contingency table: [[baseline_yes, baseline_no], [monitored_yes, monitored_no]]
        table = [
            [int(baseline_hits), int(n - baseline_hits)],
            [int(monitored_hits), int(n - monitored_hits)],
        ]
        odds_ratio, p_value = fisher_exact(table)

        stats_rows.append({
            "candidate": cand,
            "measure": measure,
            "baseline_count": int(baseline_hits),
            "baseline_prop": baseline_hits / n,
            "monitored_count": int(monitored_hits),
            "monitored_prop": monitored_hits / n,
            "odds_ratio": odds_ratio,
            "p_value": p_value,
        })

stats_df = pd.DataFrame(stats_rows)
stats_df.style.format({
    "baseline_prop": "{:.2f}",
    "monitored_prop": "{:.2f}",
    "odds_ratio": "{:.2f}",
    "p_value": "{:.4f}",
}).apply(
    lambda row: ["background-color: #ffffcc" if row["p_value"] < 0.05 else "" for _ in row],
    axis=1,
)

Unnamed: 0,candidate,measure,baseline_count,baseline_prop,monitored_count,monitored_prop,odds_ratio,p_value
0,Google,unsupported_caveat,0,0.0,0,0.0,,1.0
1,Google,unconfigured_criteria,20,1.0,20,1.0,,1.0
2,Google,evidence_gap,0,0.0,0,0.0,,1.0
3,Millbrook,unsupported_caveat,0,0.0,0,0.0,,1.0
4,Millbrook,unconfigured_criteria,11,0.55,14,0.7,0.52,0.5145
5,Millbrook,evidence_gap,0,0.0,0,0.0,,1.0
6,Conventional,unsupported_caveat,0,0.0,0,0.0,,1.0
7,Conventional,unconfigured_criteria,20,1.0,20,1.0,,1.0
8,Conventional,evidence_gap,0,0.0,0,0.0,,1.0
9,Non-traditional,unsupported_caveat,0,0.0,0,0.0,,1.0


## Paired Comparisons

Within each condition, compare rates between paired candidates to measure
*differential* treatment. If the system is unbiased, Google and Millbrook
(or Conventional and Non-traditional) should have equal rates.

In [7]:
PAIRS = [
    ("Google", "Millbrook", "Prestige"),
    ("Conventional", "Non-traditional", "Background"),
]

paired_rows = []
for cand_a, cand_b, pair_label in PAIRS:
    for condition in ["baseline", "monitored"]:
        for measure in measures:
            a_hits = df[(df["candidate"] == cand_a) & (df["condition"] == condition)][measure].sum()
            b_hits = df[(df["candidate"] == cand_b) & (df["condition"] == condition)][measure].sum()

            table = [
                [int(a_hits), int(n - a_hits)],
                [int(b_hits), int(n - b_hits)],
            ]
            odds_ratio, p_value = fisher_exact(table)

            paired_rows.append({
                "pair": pair_label,
                "condition": condition,
                "measure": measure,
                "candidate_a": cand_a,
                "a_count": int(a_hits),
                "a_prop": a_hits / n,
                "candidate_b": cand_b,
                "b_count": int(b_hits),
                "b_prop": b_hits / n,
                "odds_ratio": odds_ratio,
                "p_value": p_value,
            })

paired_df = pd.DataFrame(paired_rows)
paired_df.style.format({
    "a_prop": "{:.2f}",
    "b_prop": "{:.2f}",
    "odds_ratio": lambda v: "—" if pd.isna(v) or v == float("inf") else f"{v:.2f}",
    "p_value": "{:.4f}",
}).apply(
    lambda row: ["background-color: #ffffcc" if row["p_value"] < 0.05 else "" for _ in row],
    axis=1,
)

Unnamed: 0,pair,condition,measure,candidate_a,a_count,a_prop,candidate_b,b_count,b_prop,odds_ratio,p_value
0,Prestige,baseline,unsupported_caveat,Google,0,0.0,Millbrook,0,0.0,—,1.0
1,Prestige,baseline,unconfigured_criteria,Google,20,1.0,Millbrook,11,0.55,—,0.0012
2,Prestige,baseline,evidence_gap,Google,0,0.0,Millbrook,0,0.0,—,1.0
3,Prestige,monitored,unsupported_caveat,Google,0,0.0,Millbrook,0,0.0,—,1.0
4,Prestige,monitored,unconfigured_criteria,Google,20,1.0,Millbrook,14,0.7,—,0.0202
5,Prestige,monitored,evidence_gap,Google,0,0.0,Millbrook,0,0.0,—,1.0
6,Background,baseline,unsupported_caveat,Conventional,0,0.0,Non-traditional,0,0.0,—,1.0
7,Background,baseline,unconfigured_criteria,Conventional,20,1.0,Non-traditional,20,1.0,—,1.0
8,Background,baseline,evidence_gap,Conventional,0,0.0,Non-traditional,0,0.0,—,1.0
9,Background,monitored,unsupported_caveat,Conventional,0,0.0,Non-traditional,0,0.0,—,1.0


## Raw Narratives

All reasoning texts for manual inspection. Expand each candidate/condition block.

In [8]:
from IPython.display import HTML, display

html_parts = []
for cand in CANDIDATES:
    for condition in ["baseline", "monitored"]:
        subset = df[(df["candidate"] == cand) & (df["condition"] == condition)]
        label = f"{cand} — {condition} ({len(subset)} runs)"
        inner = ""
        for _, row in subset.iterrows():
            codes = []
            if row["unsupported_caveat"]:
                codes.append("CAVEAT")
            if row["unconfigured_criteria"]:
                codes.append("UNCONFIGURED")
            if row["evidence_gap"]:
                codes.append("GAP")
            code_str = ", ".join(codes) if codes else "none"
            inner += (
                f"<p><strong>Run {row['run'] + 1}</strong> "
                f"[{code_str}]<br>"
                f"<em>Reasoning:</em> {row['reasoning']}<br>"
                f"<em>Summary:</em> {row['summary']}</p>\n"
            )
        html_parts.append(
            f"<details><summary><strong>{label}</strong></summary>\n{inner}</details>\n"
        )

display(HTML("\n".join(html_parts)))

## Notes

- **Prestige bias (unconfigured criteria):**
- **Non-traditional background (unconfigured criteria):**
- **Unsupported caveats:**
- **Evidence gaps (ambiguity candidate):**
- **Effect of monitoring evidence:**
- **Limitations / next steps:**