# V6 Classification Manual Verification

Manually verify the V6 root-cause categories that GPT-4o assigned to the 27 false negatives of the GPT-5 FOLIO baseline.

In [None]:
import json
import pandas as pd
import re

# Load v6 classification results
df = pd.read_csv('../results/error_analysis/gpt-5_folio_baseline_v6.csv')

# Load FOLIO data
with open('../data/folio/original/folio-validation.json') as f:
    folio_data = json.load(f)

# Load baseline results, keyed by case index (blank lines are skipped)
with open('../results/simplelean/gpt-5_folio_baseline/results.jsonl') as f:
    records = [json.loads(line) for line in f if line.strip()]
baseline = {r['case_idx']: r for r in records}

print(f'Loaded {len(df)} classifications')
print('\nCategory distribution:')
print(df['root_cause_category'].value_counts())

In [None]:
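def show_case(case_idx):
    """Display full details for a case."""
    matches = df[df['case_idx'] == case_idx]
    if matches.empty:
        # Fail with a readable message instead of a bare IndexError
        raise ValueError(f'case_idx {case_idx} not found in v6 classifications')
    row = matches.iloc[0]
    folio = folio_data[case_idx]
    result = baseline[case_idx]
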
    print('=' * 70)
    print(f'CASE {case_idx}')
    print('=' * 70)
    print(f'Ground Truth: {row["ground_truth"]}')
    print(f'Prediction: {row["prediction"]}')
    print(f'V6 Category: {row["root_cause_category"]}')
    print(f'V6 Explanation: {row["error_description"]}')
    print(f'Problematic Element: {row["problematic_axiom"]}')
    print()
    print('--- PREMISES ---')
    print(folio['premises'])
    print()
    print('--- CONCLUSION ---')
    print(folio['conclusion'])
    print()
    print('--- LEAN CODE ---')
    print(result.get('lean_code', 'N/A'))

## AXIOM_FABRICATION Cases (11)

The model introduced axioms stating facts that are not in the premises.

In [None]:
fab_cases = df[df['root_cause_category'] == 'AXIOM_FABRICATION']['case_idx'].tolist()
print(f'AXIOM_FABRICATION cases: {fab_cases}')

In [None]:
# Case 70 - Known gaming case
show_case(70)

**Case 70 Verdict:** CORRECT - The model added `axiom A5_stock : Stock KO`, but the premises only say "KO is a mature stock"; they never state on their own that KO is a stock.

In [None]:
# Case 89 - Questionable classification
show_case(89)

**Case 89 Verdict:** WRONG CLASSIFICATION

V6 assigns AXIOM_FABRICATION because axiom K1 ("all books contain knowledge") differs from the premise ("books contain tons of knowledge").

BUT the proof doesn't use K1! The proof chain is:
1. Harry read Walden (H3)
2. Walden is a book (H2), Harry is a person (H1)
3. Person reads book → gains knowledge (R1) → Harry gains knowledge
4. Gains knowledge → smarter (R2) → Harry is smarter
5. Therefore ∃p. Person p ∧ Smarter p ∧ GainsKnowledge p

**Should be: FAITHFUL** (model's reasoning is correct, GT may be wrong)
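
The "is the suspect axiom actually used?" check from Case 89 can be roughly automated. The sketch below is a hypothetical helper, not part of the V6 pipeline: it looks for axiom names in the text after the first `:=` of the last theorem. The sample Lean snippet is a reconstruction from the proof chain above, so its exact signatures are assumptions. A textual match is only a heuristic; Lean's `#print axioms` command is the authoritative way to list what a theorem depends on.

```python
import re

def axioms_used_in_proof(lean_code, axiom_names):
    """Heuristically report which axiom names appear in the last proof body."""
    # Find the last theorem/lemma/example declaration in the file.
    decls = list(re.finditer(r'^\s*(?:theorem|lemma|example)\b', lean_code,
                             flags=re.MULTILINE))
    if not decls:
        return {name: False for name in axiom_names}
    tail = lean_code[decls[-1].start():]
    # Treat everything after the first ':=' as the proof body.
    _, sep, proof = tail.partition(':=')
    if not sep:
        proof = ''
    # Drop Lean line comments so commented-out names do not count.
    proof = re.sub(r'--.*', '', proof)
    return {
        name: re.search(rf'(?<![\w.]){re.escape(name)}(?!\w)', proof) is not None
        for name in axiom_names
    }

# Hypothetical reconstruction of the Case 89 formalization (assumed signatures).
sample_lean = """
axiom H1 : Person Harry
axiom H2 : Book Walden
axiom H3 : Read Harry Walden
axiom K1 : ∀ b, Book b → ContainsKnowledge b
axiom R1 : ∀ p b, Person p → Book b → Read p b → GainsKnowledge p
axiom R2 : ∀ p, GainsKnowledge p → Smarter p

theorem goal : ∃ p, Person p ∧ Smarter p ∧ GainsKnowledge p :=
  ⟨Harry, H1, R2 Harry (R1 Harry Walden H1 H2 H3), R1 Harry Walden H1 H2 H3⟩
"""

usage = axioms_used_in_proof(sample_lean, ['K1', 'R1', 'R2', 'H1'])
print(usage)  # K1 is reported as unused; R1, R2 and H1 are used
```

In the notebook this could be called as `axioms_used_in_proof(baseline[89]['lean_code'], ['K1'])`, assuming the `lean_code` field is present for that case.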

In [None]:
# Case 36
show_case(36)

**Case 36 Verdict:** PARTIALLY CORRECT

V6 says AXIOM_FABRICATION for R2 "conductors leading orchestras".

Actual issues:
1. R2: Premise "Orchestras are led by conductors" → `Leads l o → Conductor l` has the **wrong direction**
   - Premise says: Conductor → LeadsOrchestra
   - Axiom says: LeadsOrchestra → Conductor
2. R1: Premise "Composers write music pieces" → `Composer x → MusicPiece y → Wrote x y`
   - This says every composer wrote every piece!
   - Should be: `Wrote x y ∧ MusicPiece y → Composer x`

**Better classification: FORMALIZATION_ERROR** (wrong implication direction)
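
To make the two problems in Case 36 concrete, here is a minimal sketch that evaluates the formulas in a tiny finite model. The world below (names, orchestra, who wrote what) is entirely hypothetical and is not taken from FOLIO. It only shows that the curried axiom for R1 is much stronger than an "at least one piece" reading, and that the two implication directions for R2 are not interchangeable.

```python
from itertools import product

# Hypothetical toy world: each composer wrote exactly one piece.
composers = {'composer_a', 'composer_b'}
pieces = {'piece_1', 'piece_2'}
wrote = {('composer_a', 'piece_1'), ('composer_b', 'piece_2')}

# R1 as formalized: for all x y, Composer x -> MusicPiece y -> Wrote x y.
# It holds only if every composer wrote every piece.
r1_as_formalized = all((c, p) in wrote for c, p in product(composers, pieces))

# One weaker reading (an assumption, shown for contrast only):
# every composer wrote at least one piece.
every_composer_wrote_something = all(
    any((c, p) in wrote for p in pieces) for c in composers
)

# Hypothetical toy world for R2: one conductor leads nothing.
conductors = {'conductor_a', 'conductor_b'}
leads = {('conductor_a', 'orchestra_1')}
leaders = {leader for leader, _ in leads}

# Leads l o -> Conductor l: every leader is a conductor.
leads_implies_conductor = leaders <= conductors
# Conductor l -> leads some orchestra: every conductor is a leader.
conductor_implies_leads = conductors <= leaders

print(r1_as_formalized, every_composer_wrote_something)
print(leads_implies_conductor, conductor_implies_leads)
```

The same world satisfies one direction of R2 and falsifies the other, which is why a reversed implication changes what can be proved.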

In [None]:
# Case 46
show_case(46)

**Case 46 Verdict:** ???

In [None]:
# Case 118
show_case(118)

**Case 118 Verdict:** ???

In [None]:
# Show remaining AXIOM_FABRICATION cases
for case_idx in [5, 29, 41, 77, 92, 141]:
    show_case(case_idx)
    print('\n')

## FAITHFUL Cases (8)

V6 judged the formalization correct, which implies a potential ground-truth (GT) issue.

In [None]:
faithful_cases = df[df['root_cause_category'] == 'FAITHFUL']['case_idx'].tolist()
print(f'FAITHFUL cases: {faithful_cases}')

In [None]:
# Case 202 - Known potential GT error
show_case(202)

**Case 202 Verdict:** CORRECT - The model's reasoning is valid: Ailton is Ailton Silva, who was loaned to Braga; Braga is a football club, so he was loaned to a football club.

In [None]:
# Show remaining FAITHFUL cases
for case_idx in [102, 122, 123, 127, 128, 130, 196]:
    show_case(case_idx)
    print('\n')

## DATASET_BUG Cases (2)

Premises contain contradictions.

In [None]:
# Case 75 and 159 - Known bad stories (368, 435)
show_case(75)
print('\n')
show_case(159)

**Cases 75, 159 Verdict:** CORRECT - These come from stories 368 and 435, which have contradictory premises.

## Other Cases

In [None]:
# VACUOUS_TRUTH - Case 83
show_case(83)

**Case 83 Verdict:** CORRECT - The model's proof goes through by vacuous truth (the antecedent is impossible).
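
As a reminder of why vacuous truth is a red flag, here is a small sketch unrelated to the actual Case 83 premises. The domain and predicates are made up for illustration. When no element satisfies the antecedent, a universally quantified implication holds for any consequent, including contradictory ones.

```python
def implies(antecedent, consequent):
    """Material implication: false only when antecedent holds and consequent fails."""
    return (not antecedent) or consequent

def impossible(n):
    """Hypothetical antecedent that no element of the domain satisfies."""
    return n < 0

domain = range(10)

# Both claims are "proved", even though their consequents contradict each other.
claim = all(implies(impossible(n), n == 42) for n in domain)
opposite_claim = all(implies(impossible(n), n != 42) for n in domain)

print(claim, opposite_claim)
```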

In [None]:
# GOAL_MISMATCH cases
goal_cases = df[df['root_cause_category'] == 'GOAL_MISMATCH']['case_idx'].tolist()
print(f'GOAL_MISMATCH cases: {goal_cases}')
for case_idx in goal_cases:
    show_case(case_idx)
    print('\n')

In [None]:
# REASONING_ERROR - Case 119
show_case(119)

In [None]:
# MISSING_PREMISE - Case 99
show_case(99)

## Summary

| Category | Cases | V6 Correct | Issues Found |
|----------|-------|------------|--------------|
| AXIOM_FABRICATION | 11 | ~8/11 | Case 89: should be FAITHFUL (proof doesn't use fabricated axiom). Case 36: should be FORMALIZATION_ERROR. Cases 46 and 118: no verdict yet. |
| FAITHFUL | 8 | 8/8 | Case 202 confirmed correct |
| DATASET_BUG | 2 | 2/2 | Known bad stories (368, 435) |
| VACUOUS_TRUTH | 1 | 1/1 | Case 83 correct |
| GOAL_MISMATCH | 3 | ? | Needs verification |
| REASONING_ERROR | 1 | ? | Needs verification |
| MISSING_PREMISE | 1 | ? | Needs verification |

**Key Findings:**
1. V6 sometimes assigns AXIOM_FABRICATION when the actual issue is a FORMALIZATION_ERROR (wrong implication direction), as in Case 36
2. V6 may miss that a fabricated axiom isn't actually used in the proof (Case 89)
3. FAITHFUL cases appear to be genuine GT issues where the model's reasoning is correct
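
To keep the summary reproducible, the explicit verdicts written in this notebook can be recorded in a small structure and tallied. This is a minimal sketch: the dictionary name and the verdict labels (`correct`, `wrong`, `partial`, `pending`) are hypothetical, and it covers only cases with a verdict cell above, not the ones shown in bulk loops.

```python
from collections import Counter, defaultdict

# case_idx -> (V6 category, manual verdict, proposed category if different)
manual_verdicts = {
    70: ('AXIOM_FABRICATION', 'correct', None),
    89: ('AXIOM_FABRICATION', 'wrong', 'FAITHFUL'),
    36: ('AXIOM_FABRICATION', 'partial', 'FORMALIZATION_ERROR'),
    46: ('AXIOM_FABRICATION', 'pending', None),
    118: ('AXIOM_FABRICATION', 'pending', None),
    202: ('FAITHFUL', 'correct', None),
    75: ('DATASET_BUG', 'correct', None),
    159: ('DATASET_BUG', 'correct', None),
    83: ('VACUOUS_TRUTH', 'correct', None),
}

# Count verdicts per V6 category.
tally = defaultdict(Counter)
for v6_category, verdict, _ in manual_verdicts.values():
    tally[v6_category][verdict] += 1

# Cases where the manual review proposes a different category.
reclassified = {
    case_idx: proposed
    for case_idx, (_, _, proposed) in manual_verdicts.items()
    if proposed is not None
}

for category, counts in tally.items():
    print(category, dict(counts))
print('Proposed reclassifications:', reclassified)
```

Filling in the `pending` entries and the bulk-displayed cases would let the "V6 Correct" column be computed rather than estimated.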