# GPT-5 Multi-LogiEval Analysis

**Definitions:**
- **False Positive** = Lean verification passed but answer is wrong
- **Gaming** = False positive where model proved Yes/No (not Uncertain)
- **Conservative** = False positive where the model answered Uncertain when it should have proved Yes or No
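
The definitions above can be read as a small decision procedure. The sketch below is a hypothetical helper (`classify` and its label strings are not part of the evaluation code); it assumes each result exposes a Lean pass flag, a correctness flag, and a Yes/No/Uncertain prediction string.

```python
def classify(lean_ok, correct, prediction):
    """Label one result according to the definitions above (hypothetical helper)."""
    if correct:
        return "correct"
    if not lean_ok:
        # Wrong answer, but Lean did not accept the proof: not a false positive.
        return "wrong_unverified"
    # From here on: Lean verification passed but the answer is wrong.
    pred = (prediction or "").strip().lower()
    if pred in ("yes", "no"):
        return "gaming"
    if pred == "uncertain":
        return "conservative"
    return "false_positive_other"


print(classify(True, False, "Yes"))
print(classify(True, False, "Uncertain"))
```

Gaming and conservative cases are both subsets of the false positives, so the two counts should never add up to more than the FP count.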

In [None]:
import json
import pandas as pd
from collections import defaultdict

# Load results (one JSON object per line; blank lines are skipped)
with open('../results/simplelean/gpt-5_multilogieval_baseline_20251224_190323/results.jsonl') as f:
    results = [json.loads(line) for line in f if line.strip()]

# Load original data
with open('../data/multilogieval/multilogieval-sampled.json') as f:
    data = {d['idx']: d for d in json.load(f)}

print(f"Loaded {len(results)} results")

In [None]:
# Summary by depth and ground truth (GT)
stats = defaultdict(lambda: defaultdict(
    lambda: {'total': 0, 'correct': 0, 'fp': 0, 'gaming': 0, 'conservative': 0}))

for r in results:
    # Normalize GT casing so it matches the 'yes'/'no' keys used for display
    depth, gt = r['depth'], str(r['ground_truth']).lower()
    stats[depth][gt]['total'] += 1

    lean_ok = (r.get('lean_verification') or {}).get('success', False)
    pred = (r.get('prediction') or '').lower()
    if r['correct']:
        stats[depth][gt]['correct'] += 1
    elif lean_ok:
        # False positive: Lean verification passed but the answer is wrong
        stats[depth][gt]['fp'] += 1
        if pred in ('yes', 'no'):
            stats[depth][gt]['gaming'] += 1
        elif pred == 'uncertain':
            stats[depth][gt]['conservative'] += 1

# Display one row per (depth, GT) pair that has data
rows = []
for depth in ['d3', 'd4', 'd5']:
    for gt in ['yes', 'no']:
        s = stats[depth][gt]
        if s['total'] > 0:
            rows.append({
                'Depth': depth,
                'GT': gt,
                'Acc': f"{s['correct']}/{s['total']} ({100*s['correct']/s['total']:.1f}%)",
                'FP': s['fp'],
                'Gaming': s['gaming'],
                'Conservative': s['conservative'],
            })

pd.DataFrame(rows)

## Gaming Case: Case 67

The only gaming case in this run: the model predicted Yes, but the ground truth is No.

In [None]:
# Find case 67
case_67 = next(r for r in results if r['case_idx'] == 67)
orig_67 = data[67]

print("=" * 70)
print("CASE 67")
print("=" * 70)
print(f"Depth: {case_67['depth']}")
print(f"Rule: {case_67['rule']}")
print(f"Ground Truth: {case_67['ground_truth']}")
print(f"Prediction: {case_67['prediction']}")
print(f"Lean Pass: {case_67['lean_verification']['success']}")

In [None]:
print("=" * 70)
print("CONTEXT")
print("=" * 70)
print(orig_67['context'])

In [None]:
print("=" * 70)
print("QUESTION")
print("=" * 70)
print(orig_67['question'])
print(f"\nAnswer: {orig_67['answer']}")

In [None]:
print("=" * 70)
print("LEAN CODE")
print("=" * 70)
print(case_67['lean_code'])

## Analysis of Case 67

**Question:** "The kitchen did not get hot, did Sam eat dinner last night?"
**GT:** No | **Pred:** Yes

**Model's reasoning (valid!):**
1. Kitchen not hot → stove not on (contrapositive of R1: StoveOn → KitchenHot).
2. R3 (StoveOn ∨ ¬UseStove) together with ¬StoveOn gives ¬UseStove (disjunctive syllogism).
3. Suppose Sam did not eat dinner. Then Sam is hungry (R5), so Sam makes pancakes (R4), so Sam uses the stove (R2).
4. This contradicts ¬UseStove, so Sam must have eaten dinner.
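
The four steps above can be sketched as a short Lean 4 proof. The proposition names and the exact shapes of R1 to R5 are assumptions reconstructed from this summary; this is not the Lean code the model produced. Note that the last step is a proof by contradiction, so it relies on classical logic.

```lean
-- Hypothetical propositional encoding of the case 67 premises.
theorem sam_ate_dinner
    (StoveOn KitchenHot UseStove MakesPancakes Hungry AteDinner : Prop)
    (R1 : StoveOn → KitchenHot)
    (R2 : MakesPancakes → UseStove)
    (R3 : StoveOn ∨ ¬UseStove)
    (R4 : Hungry → MakesPancakes)
    (R5 : ¬AteDinner → Hungry)
    (h : ¬KitchenHot) : AteDinner := by
  -- Step 1: contrapositive of R1
  have hStove : ¬StoveOn := fun hs => h (R1 hs)
  -- Step 2: disjunctive syllogism on R3
  have hUse : ¬UseStove := by
    cases R3 with
    | inl hs => exact absurd hs hStove
    | inr hn => exact hn
  -- Steps 3 and 4: assuming ¬AteDinner leads to UseStove, a contradiction
  exact Classical.byContradiction fun hNot => hUse (R2 (R4 (R5 hNot)))
```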

**Verdict: MODEL CORRECT (Dataset issue)**

The proof is logically valid. Given the premises and the fact that the kitchen did not get hot, Sam having eaten dinner is the only consistent scenario, so the dataset label (No) appears to be wrong.
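
One way to sanity-check this claim independently of Lean is to enumerate every truth assignment and keep those that satisfy the premises. The sketch below assumes the same propositional reading of R1 to R5 as the reasoning summary above; the variable names are illustrative and do not come from the dataset.

```python
from itertools import product

NAMES = ["StoveOn", "KitchenHot", "UseStove", "MakesPancakes", "Hungry", "AteDinner"]


def implies(a, b):
    return (not a) or b


def satisfies(v):
    """True if assignment v satisfies R1-R5 plus 'the kitchen did not get hot'."""
    return (
        implies(v["StoveOn"], v["KitchenHot"])            # R1
        and implies(v["MakesPancakes"], v["UseStove"])    # R2
        and (v["StoveOn"] or not v["UseStove"])           # R3
        and implies(v["Hungry"], v["MakesPancakes"])      # R4
        and implies(not v["AteDinner"], v["Hungry"])      # R5
        and not v["KitchenHot"]                           # fact from the question
    )


assignments = (dict(zip(NAMES, vals)) for vals in product([False, True], repeat=len(NAMES)))
models = [v for v in assignments if satisfies(v)]

# AteDinner is entailed if it holds in every satisfying assignment.
entailed = all(v["AteDinner"] for v in models)
print(len(models), entailed)
```

Under this encoding, exactly one assignment satisfies all the premises, and Sam ate dinner in it.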

In [None]:
# Load the raw model response (the printout below is truncated to the first 3000 characters)
with open('../results/simplelean/gpt-5_multilogieval_baseline_20251224_190323/responses/case_67.txt') as f:
    response = f.read()

print("=" * 70)
print("FULL RESPONSE")
print("=" * 70)
print(response[:3000])