# Candidate Validator Model Analysis

This notebook analyzes whether the validator model needs expensive reasoning tokens, or whether a cheaper configuration gives comparable answers.

It compares four configurations:
- Gemini 3 Flash with reasoning (current validator)
- Gemini 3 Flash without reasoning
- DeepSeek v3.2 with reasoning
- DeepSeek v3.2 without reasoning

In [1]:
import json
import pandas as pd
from pathlib import Path
from collections import defaultdict

In [2]:
# Load a pinned results file (swap in another path to analyze a different run).
# To use the latest run instead, build a sorted list of result paths and take the last one.
results_file = Path("data/candidate_validators_questions_fast_20260122_084427_20260122_095954.json")
if results_file.exists():
    with results_file.open() as f:
        results = json.load(f)
    print(f"Loaded {len(results)} questions from {results_file}")
else:
    print(f"{results_file} not found. Run evaluate_candidate_validator_models.py first.")
    results = []

Loaded 313 questions from data/candidate_validators_questions_fast_20260122_084427_20260122_095954.json


In [3]:
if results:
    models = list(results[0]["evaluations"].keys())
    print(f"Models evaluated: {models}")

Models evaluated: ['gemini_thinking', 'gemini_no_thinking', 'deepseek_thinking', 'deepseek_no_thinking']


## Agreement with Generator (Grok)

In [4]:
if results:
    for model in models:
        agree = sum(1 for r in results 
                    if r["evaluations"].get(model, {}).get("answer") == r["generator_answer_idx"])
        pct = 100 * agree / len(results)
        print(f"{model:25s}: {agree}/{len(results)} ({pct:.1f}%) agree with generator")

gemini_thinking          : 254/313 (81.2%) agree with generator
gemini_no_thinking       : 219/313 (70.0%) agree with generator
deepseek_thinking        : 242/313 (77.3%) agree with generator
deepseek_no_thinking     : 262/313 (83.7%) agree with generator
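
With n=313, a gap of a few percentage points between two models can fall within sampling noise. As a rough check, the sketch below attaches a 95% Wilson score interval to each agreement rate. The interval is an addition for illustration, not part of the original analysis; `wilson_interval` is a hypothetical helper, and the counts are copied from the output above.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    if n == 0:
        return (0.0, 0.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (center - half, center + half)

# Agreement counts copied from the output above (n=313 questions)
counts = {
    "gemini_thinking": 254,
    "gemini_no_thinking": 219,
    "deepseek_thinking": 242,
    "deepseek_no_thinking": 262,
}
for model, agree in counts.items():
    lo, hi = wilson_interval(agree, 313)
    print(f"{model:25s}: {100 * agree / 313:.1f}% (95% CI {100 * lo:.1f}-{100 * hi:.1f}%)")
```

Overlapping intervals suggest that the ranking between two models should be treated with caution at this sample size.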


## Agreement with Gemini Thinking (Current Validator)

In [5]:
if results and "gemini_thinking" in models:
    reference_model = "gemini_thinking"
    print(f"Agreement with {reference_model}:\n")
    
    for model in models:
        if model == reference_model:
            continue
        agree = sum(1 for r in results 
                    if r["evaluations"].get(model, {}).get("answer") == 
                       r["evaluations"].get(reference_model, {}).get("answer"))
        pct = 100 * agree / len(results)
        print(f"{model:25s}: {agree}/{len(results)} ({pct:.1f}%)")

Agreement with gemini_thinking:

gemini_no_thinking       : 219/313 (70.0%)
deepseek_thinking        : 247/313 (78.9%)
deepseek_no_thinking     : 260/313 (83.1%)
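
The comparison above uses `==` on `.get("answer")`, so a question where both models returned no answer (`None == None`) is counted as agreement. A minimal sketch of how the two kinds of match could be separated; `split_agreement` is a hypothetical helper and the toy records are illustrative, not taken from the results file.

```python
def split_agreement(results, model, reference):
    """Count matches on a real answer separately from matches where both are None."""
    real_matches = 0
    null_matches = 0
    for r in results:
        a = r["evaluations"].get(model, {}).get("answer")
        b = r["evaluations"].get(reference, {}).get("answer")
        if a is None and b is None:
            null_matches += 1
        elif a == b:
            real_matches += 1
    return real_matches, null_matches

# Toy records in the same shape as the results file (values are illustrative only)
toy = [
    {"evaluations": {"ref": {"answer": 2}, "cand": {"answer": 2}}},
    {"evaluations": {"ref": {"answer": None}, "cand": {"answer": None}}},
    {"evaluations": {"ref": {"answer": 1}, "cand": {"answer": 3}}},
    {"evaluations": {"ref": {"answer": 0}, "cand": {}}},
]
real, null = split_agreement(toy, "cand", "ref")
print(f"real matches: {real}, both-null matches: {null}")
```

Running this on `results` would show how much of each percentage above comes from shared nulls rather than shared answers.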


## Agreement with Gemini Thinking (Where Gemini Thinking Agrees with Generator)

In [6]:
if results and "gemini_thinking" in models:
    reference_model = "gemini_thinking"
    
    # Filter to cases where gemini_thinking agrees with generator
    gt_agrees_with_gen = [r for r in results 
                          if r["evaluations"].get(reference_model, {}).get("answer") == r["generator_answer_idx"]]
    
    print(f"Cases where {reference_model} agrees with generator: {len(gt_agrees_with_gen)}/{len(results)}\n")
    print(f"Agreement with {reference_model} (filtered, excluding nulls):\n")
    
    for model in models:
        if model == reference_model:
            continue
        
        # Get reference answer (we know it's not null since it agreed with generator)
        ref_not_null = [r for r in gt_agrees_with_gen 
                        if r["evaluations"].get(reference_model, {}).get("answer") is not None]
        
        # Count nulls for this model (where reference is not null)
        model_nulls = sum(1 for r in ref_not_null 
                         if r["evaluations"].get(model, {}).get("answer") is None)
        null_pct = 100 * model_nulls / len(ref_not_null) if ref_not_null else 0
        
        # Non-null cases for both models
        valid_cases = [r for r in ref_not_null 
                       if r["evaluations"].get(model, {}).get("answer") is not None]
        
        # Agreement among valid (non-null) cases
        agree = sum(1 for r in valid_cases 
                    if r["evaluations"].get(model, {}).get("answer") == 
                       r["evaluations"].get(reference_model, {}).get("answer"))
        pct = 100 * agree / len(valid_cases) if valid_cases else 0
        
        print(f"{model:25s}: {agree}/{len(valid_cases)} ({pct:.1f}%) agree | {model_nulls} nulls ({null_pct:.1f}%)")

Cases where gemini_thinking agrees with generator: 254/313

Agreement with gemini_thinking (filtered, excluding nulls):

gemini_no_thinking       : 201/243 (82.7%) agree | 11 nulls (4.3%)
deepseek_thinking        : 226/228 (99.1%) agree | 26 nulls (10.2%)
deepseek_no_thinking     : 241/248 (97.2%) agree | 6 nulls (2.4%)
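
The same three quantities (agreements, valid cases, nulls) are recomputed in several cells of this notebook. One possible way to factor them out is sketched below; `conditional_agreement` is a hypothetical helper, and the toy records only mimic the shape of the results file.

```python
def conditional_agreement(subset, model, reference):
    """Return (agree, valid, nulls) for `model` against `reference`.

    Only records where the reference answered are considered; records where
    `model` returned None are counted as nulls and left out of `valid`.
    """
    def answer(r, name):
        return r["evaluations"].get(name, {}).get("answer")

    ref_not_null = [r for r in subset if answer(r, reference) is not None]
    valid = [r for r in ref_not_null if answer(r, model) is not None]
    nulls = len(ref_not_null) - len(valid)
    agree = sum(1 for r in valid if answer(r, model) == answer(r, reference))
    return agree, len(valid), nulls

# Toy records (illustrative values only)
toy = [
    {"evaluations": {"ref": {"answer": 1}, "cand": {"answer": 1}}},
    {"evaluations": {"ref": {"answer": 2}, "cand": {"answer": 0}}},
    {"evaluations": {"ref": {"answer": 3}, "cand": {"answer": None}}},
    {"evaluations": {"ref": {"answer": None}, "cand": {"answer": 1}}},
]
agree, valid, nulls = conditional_agreement(toy, "cand", "ref")
print(f"{agree}/{valid} agree | {nulls} nulls")
```

Note that agreement and null rate use different denominators: agreement is over valid cases, while the null rate is over all cases where the reference answered.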


## Pairwise Agreement Matrix

In [7]:
if results:
    agreement_matrix = {}
    for m1 in models:
        agreement_matrix[m1] = {}
        for m2 in models:
            agree = sum(1 for r in results 
                        if r["evaluations"].get(m1, {}).get("answer") == 
                           r["evaluations"].get(m2, {}).get("answer"))
            agreement_matrix[m1][m2] = agree / len(results)
    
    df = pd.DataFrame(agreement_matrix).T * 100
    print("Pairwise agreement (%):\n")
    print(df.round(1).to_string())

Pairwise agreement (%):

                      gemini_thinking  gemini_no_thinking  deepseek_thinking  deepseek_no_thinking
gemini_thinking                 100.0                70.0               78.9                  83.1
gemini_no_thinking               70.0               100.0               67.1                  69.6
deepseek_thinking                78.9                67.1              100.0                  83.1
deepseek_no_thinking             83.1                69.6               83.1                 100.0
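
Because `None == None` counts as a match, the matrix above mixes shared answers with shared nulls (which is also why the diagonal is 100% for models that have nulls). A sketch of an alternative matrix restricted to questions both models answered is shown below, assuming answers are plain values such as option indices; `answers_frame` and `pairwise_non_null_agreement` are hypothetical helpers.

```python
import pandas as pd

def answers_frame(results, models):
    """One row per question, one column per model; missing answers become NaN/None."""
    return pd.DataFrame(
        [{m: r["evaluations"].get(m, {}).get("answer") for m in models} for r in results]
    )

def pairwise_non_null_agreement(answers):
    """Percent agreement per model pair, over questions both models answered."""
    models = list(answers.columns)
    out = pd.DataFrame(index=models, columns=models, dtype=float)
    for m1 in models:
        for m2 in models:
            both = answers[m1].notna() & answers[m2].notna()
            if both.any():
                out.loc[m1, m2] = 100 * (answers.loc[both, m1] == answers.loc[both, m2]).mean()
    return out

# Toy records (illustrative values only)
toy = [
    {"evaluations": {"a": {"answer": 1}, "b": {"answer": 1}}},
    {"evaluations": {"a": {"answer": 2}, "b": {"answer": 3}}},
    {"evaluations": {"a": {"answer": None}, "b": {"answer": None}}},
    {"evaluations": {"a": {"answer": 0}, "b": {}}},
]
matrix = pairwise_non_null_agreement(answers_frame(toy, ["a", "b"]))
print(matrix.round(1).to_string())
```

Each cell then has its own denominator, so cells are not directly comparable in sample size; reporting the count of shared non-null questions alongside would help.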


## Cases Where Models Disagree

In [8]:
# if results and "gemini_thinking" in models and "gemini_no_thinking" in models:
#     disagree_cases = []
#     for r in results:
#         gt_ans = r["evaluations"].get("gemini_thinking", {}).get("answer")
#         no_think_ans = r["evaluations"].get("gemini_no_thinking", {}).get("answer")
#         if gt_ans != no_think_ans:
#             disagree_cases.append({
#                 "idx": r["idx"],
#                 "level": r["level"],
#                 "subject": r["subject"],
#                 "generator": r["generator_answer_idx"],
#                 "gemini_thinking": gt_ans,
#                 "gemini_no_thinking": no_think_ans
#             })
    
#     print(f"Gemini thinking vs no-thinking disagree on {len(disagree_cases)} questions:")
#     if disagree_cases:
#         df = pd.DataFrame(disagree_cases)
#         print(df.to_string())

## Agreement by Difficulty Level

In [9]:
if results:
    levels = sorted(set(r["level"] for r in results))
    print("Agreement with generator by level:\n")
    
    for level in levels:
        level_results = [r for r in results if r["level"] == level]
        print(f"Level {level} (n={len(level_results)}):")
        for model in models:
            agree = sum(1 for r in level_results 
                        if r["evaluations"].get(model, {}).get("answer") == r["generator_answer_idx"])
            pct = 100 * agree / len(level_results) if level_results else 0
            print(f"  {model:25s}: {pct:.1f}%")
        print()

Agreement with generator by level:

Level 1 (n=64):
  gemini_thinking          : 81.2%
  gemini_no_thinking       : 76.6%
  deepseek_thinking        : 79.7%
  deepseek_no_thinking     : 85.9%

Level 2 (n=66):
  gemini_thinking          : 81.8%
  gemini_no_thinking       : 77.3%
  deepseek_thinking        : 83.3%
  deepseek_no_thinking     : 89.4%

Level 3 (n=66):
  gemini_thinking          : 89.4%
  gemini_no_thinking       : 72.7%
  deepseek_thinking        : 81.8%
  deepseek_no_thinking     : 84.8%

Level 4 (n=61):
  gemini_thinking          : 78.7%
  gemini_no_thinking       : 67.2%
  deepseek_thinking        : 72.1%
  deepseek_no_thinking     : 83.6%

Level 5 (n=56):
  gemini_thinking          : 73.2%
  gemini_no_thinking       : 53.6%
  deepseek_thinking        : 67.9%
  deepseek_no_thinking     : 73.2%
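
The per-level breakdown can also be produced as a single table, which makes the trend across levels easier to scan. A minimal sketch using a pandas `groupby`, assuming each record carries `level`, `generator_answer_idx`, and `evaluations` as in the cells above; `agreement_by_level` is a hypothetical helper and the toy records are illustrative.

```python
import pandas as pd

def agreement_by_level(results, models):
    """Percent of questions per level where each model matches the generator."""
    rows = []
    for r in results:
        row = {"level": r["level"]}
        for m in models:
            row[m] = r["evaluations"].get(m, {}).get("answer") == r["generator_answer_idx"]
        rows.append(row)
    # The mean of a boolean column is the fraction of True values
    return pd.DataFrame(rows).groupby("level")[models].mean() * 100

# Toy records (illustrative values only)
toy = [
    {"level": 1, "generator_answer_idx": 0,
     "evaluations": {"a": {"answer": 0}, "b": {"answer": 0}}},
    {"level": 1, "generator_answer_idx": 2,
     "evaluations": {"a": {"answer": 2}, "b": {"answer": 1}}},
    {"level": 2, "generator_answer_idx": 3,
     "evaluations": {"a": {"answer": None}, "b": {"answer": 3}}},
]
table = agreement_by_level(toy, ["a", "b"])
print(table.round(1).to_string())
```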



## Token Usage and Cost Comparison

In [10]:
if results:
    usage_stats = defaultdict(lambda: {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0, "cost": 0.0})
    
    for r in results:
        for model, eval_data in r["evaluations"].items():
            usage = eval_data.get("usage", {})
            usage_stats[model]["prompt_tokens"] += usage.get("prompt_tokens", 0)
            usage_stats[model]["completion_tokens"] += usage.get("completion_tokens", 0)
            usage_stats[model]["total_tokens"] += usage.get("total_tokens", 0)
            usage_stats[model]["cost"] += usage.get("cost", 0) or 0
    
    print(f"Token usage and cost summary (n={len(results)}):\n")
    for model in models:
        stats = usage_stats[model]
        avg_completion = stats["completion_tokens"] / len(results) if results else 0
        cost_per_1k = 1000 * stats["cost"] / len(results) if results else 0
        print(f"{model}:")
        print(f"  Prompt tokens:     {stats['prompt_tokens']:,}")
        print(f"  Completion tokens: {stats['completion_tokens']:,} (avg {avg_completion:.0f}/question)")
        print(f"  Total tokens:      {stats['total_tokens']:,}")
        print(f"  Cost per 1000:     ${cost_per_1k:.2f}")
        print()

Token usage and cost summary (n=313):

gemini_thinking:
  Prompt tokens:     120,411
  Completion tokens: 1,623,383 (avg 5187/question)
  Total tokens:      1,743,794
  Cost per 1000:     $15.75

gemini_no_thinking:
  Prompt tokens:     115,074
  Completion tokens: 327,393 (avg 1046/question)
  Total tokens:      442,467
  Cost per 1000:     $3.32

deepseek_thinking:
  Prompt tokens:     96,896
  Completion tokens: 506,346 (avg 1618/question)
  Total tokens:      603,242
  Cost per 1000:     $0.89

deepseek_no_thinking:
  Prompt tokens:     112,267
  Completion tokens: 373,765 (avg 1194/question)
  Total tokens:      486,032
  Cost per 1000:     $0.71
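
To put the costs on one scale, the sketch below expresses each configuration relative to the current validator. The dollar figures are copied from the output above; since they are rounded to cents, the ratios are approximate.

```python
# Cost per 1000 questions, copied from the output above
cost_per_1k = {
    "gemini_thinking": 15.75,
    "gemini_no_thinking": 3.32,
    "deepseek_thinking": 0.89,
    "deepseek_no_thinking": 0.71,
}

baseline = cost_per_1k["gemini_thinking"]  # current validator
ratios = {model: baseline / cost for model, cost in cost_per_1k.items()}

for model, ratio in ratios.items():
    print(f"{model:25s}: ${cost_per_1k[model]:.2f} per 1000, "
          f"baseline costs {ratio:.1f}x as much")
```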



## Error Analysis

In [11]:
if results:
    print("Error counts by model:\n")
    for model in models:
        errors = sum(1 for r in results if "error" in r["evaluations"].get(model, {}))
        parse_failures = sum(1 for r in results 
                            if r["evaluations"].get(model, {}).get("answer") is None 
                            and "error" not in r["evaluations"].get(model, {}))
        print(f"{model:25s}: {errors} API errors, {parse_failures} parse failures")

Error counts by model:

gemini_thinking          : 1 API errors, 22 parse failures
gemini_no_thinking       : 1 API errors, 20 parse failures
deepseek_thinking        : 45 API errors, 4 parse failures
deepseek_no_thinking     : 6 API errors, 3 parse failures
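
Adding the two columns gives the number of questions for which each model produced no usable answer. The sketch below assumes that every API error also leaves `answer` as `None`; the output does not state this directly, although it is consistent with the 313 - 254 - 36 = 23 null answers implied for `gemini_thinking` elsewhere in the notebook.

```python
n_questions = 313

# (API errors, parse failures), copied from the output above
error_counts = {
    "gemini_thinking": (1, 22),
    "gemini_no_thinking": (1, 20),
    "deepseek_thinking": (45, 4),
    "deepseek_no_thinking": (6, 3),
}

# Assumption: an API error always means no answer was recorded
unusable = {model: api + parse for model, (api, parse) in error_counts.items()}

for model, count in unusable.items():
    print(f"{model:25s}: {count}/{n_questions} "
          f"({100 * count / n_questions:.1f}%) without a usable answer")
```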


## Recommendation Summary

In [22]:
if results and "gemini_thinking" in models:
    reference = "gemini_thinking"
    
    usage_stats = defaultdict(lambda: {"cost": 0.0})
    for r in results:
        for model, eval_data in r["evaluations"].items():
            usage = eval_data.get("usage", {})
            usage_stats[model]["cost"] += usage.get("cost", 0) or 0
    
    gt_agrees = [r for r in results 
                 if r["evaluations"].get(reference, {}).get("answer") == r["generator_answer_idx"]]
    gt_disagrees = [r for r in results 
                    if r["evaluations"].get(reference, {}).get("answer") != r["generator_answer_idx"]
                    and r["evaluations"].get(reference, {}).get("answer") is not None]
    
    def calc_agreement(subset):
        rows = []
        for model in models:
            cost_per_1k = 1000 * usage_stats[model]["cost"] / len(results)
            ref_not_null = [r for r in subset if r["evaluations"].get(reference, {}).get("answer") is not None]
            model_nulls = sum(1 for r in ref_not_null if r["evaluations"].get(model, {}).get("answer") is None)
            valid = [r for r in ref_not_null if r["evaluations"].get(model, {}).get("answer") is not None]
            agree = sum(1 for r in valid 
                        if r["evaluations"].get(model, {}).get("answer") == 
                           r["evaluations"].get(reference, {}).get("answer"))
            cond_agree_pct = agree / len(valid) if valid else 0
            null_pct = model_nulls / len(ref_not_null) if ref_not_null else 0
            rows.append({"Cost per 1000": f"${cost_per_1k:.2f}", "Agree %": cond_agree_pct, "Null %": null_pct})
        df = pd.DataFrame(rows, index=models)
        df["Agree %"] = (df["Agree %"] * 100).round(1)
        df["Null %"] = (df["Null %"] * 100).round(1)
        return df
    
    print(f"When gemini_thinking AGREES with generator (n={len(gt_agrees)}):")
    display(calc_agreement(gt_agrees))
    
    # Breakdown by level
    levels = sorted(set(r["level"] for r in gt_agrees))
    agree_by_level = {model: {} for model in models}
    null_by_level = {model: {} for model in models}
    
    for level in levels:
        level_subset = [r for r in gt_agrees if r["level"] == level]
        for model in models:
            ref_not_null = [r for r in level_subset if r["evaluations"].get(reference, {}).get("answer") is not None]
            model_nulls = sum(1 for r in ref_not_null if r["evaluations"].get(model, {}).get("answer") is None)
            valid = [r for r in ref_not_null if r["evaluations"].get(model, {}).get("answer") is not None]
            agree = sum(1 for r in valid if r["evaluations"].get(model, {}).get("answer") == r["evaluations"].get(reference, {}).get("answer"))
            agree_by_level[model][level] = 100 * agree / len(valid) if valid else None
            null_by_level[model][level] = 100 * model_nulls / len(ref_not_null) if ref_not_null else None
    
    print("\nAgree % by level (where GT agrees with generator):")
    display(pd.DataFrame(agree_by_level).T.round(1))
    
    print("\nNull % by level (where GT agrees with generator):")
    display(pd.DataFrame(null_by_level).T.round(1))
    
    # Breakdown by subject
    subjects = sorted(set(r["subject"] for r in gt_agrees))
    agree_by_subject = {model: {} for model in models}
    null_by_subject = {model: {} for model in models}
    
    for subject in subjects:
        subj_subset = [r for r in gt_agrees if r["subject"] == subject]
        for model in models:
            ref_not_null = [r for r in subj_subset if r["evaluations"].get(reference, {}).get("answer") is not None]
            model_nulls = sum(1 for r in ref_not_null if r["evaluations"].get(model, {}).get("answer") is None)
            valid = [r for r in ref_not_null if r["evaluations"].get(model, {}).get("answer") is not None]
            agree = sum(1 for r in valid if r["evaluations"].get(model, {}).get("answer") == r["evaluations"].get(reference, {}).get("answer"))
            agree_by_subject[model][subject] = 100 * agree / len(valid) if valid else None
            null_by_subject[model][subject] = 100 * model_nulls / len(ref_not_null) if ref_not_null else None
    
    print("\nAgree % by subject (where GT agrees with generator):")
    display(pd.DataFrame(agree_by_subject).T.round(1))
    
    print("\nNull % by subject (where GT agrees with generator):")
    display(pd.DataFrame(null_by_subject).T.round(1))
    
    print(f"\nWhen gemini_thinking DISAGREES with generator (n={len(gt_disagrees)}):")
    display(calc_agreement(gt_disagrees))

When gemini_thinking AGREES with generator (n=254):


                      Cost per 1000  Agree %  Null %
gemini_thinking              $15.75    100.0     0.0
gemini_no_thinking            $3.32     82.7     4.3
deepseek_thinking             $0.89     99.1    10.2
deepseek_no_thinking          $0.71     97.2     2.4



Agree % by level (where GT agrees with generator):


                          1      2      3      4      5
gemini_thinking       100.0  100.0  100.0  100.0  100.0
gemini_no_thinking     88.5   88.2   85.5   79.2   67.6
deepseek_thinking      95.9  100.0  100.0  100.0  100.0
deepseek_no_thinking   98.0  100.0   94.8   95.7   97.4



Null % by level (where GT agrees with generator):


                          1      2      3      4      5
gemini_thinking         0.0    0.0    0.0    0.0    0.0
gemini_no_thinking      0.0    5.6    6.8    0.0    9.8
deepseek_thinking       5.8    3.7   10.2   16.7   17.1
deepseek_no_thinking    1.9    0.0    1.7    2.1    7.3



Agree % by subject (where GT agrees with generator):


                      algebra  counting_and_probability  geometry  intermediate_algebra  number_theory  prealgebra  precalculus
gemini_thinking         100.0                     100.0     100.0                 100.0          100.0       100.0        100.0
gemini_no_thinking       88.9                      76.9      78.6                  73.3           82.9        95.3         76.9
deepseek_thinking       100.0                     100.0      96.0                 100.0          100.0        97.5        100.0
deepseek_no_thinking    100.0                     100.0      89.3                  93.3           97.5        97.7        100.0



Null % by subject (where GT agrees with generator):


                      algebra  counting_and_probability  geometry  intermediate_algebra  number_theory  prealgebra  precalculus
gemini_thinking           0.0                       0.0       0.0                   0.0            0.0         0.0          0.0
gemini_no_thinking        2.7                       0.0       6.7                   3.2            0.0         0.0         21.2
deepseek_thinking         2.7                       7.7      16.7                  12.9            7.3         7.0         21.2
deepseek_no_thinking      2.7                       0.0       6.7                   3.2            2.4         0.0          3.0



When gemini_thinking DISAGREES with generator (n=36):


                      Cost per 1000  Agree %  Null %
gemini_thinking              $15.75    100.0     0.0
gemini_no_thinking            $3.32     46.7    16.7
deepseek_thinking             $0.89     54.5    38.9
deepseek_no_thinking          $0.71     52.9     5.6
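
The percentages in this last table rest on only 36 questions, and fewer once nulls are removed. The sketch below backs out the implied counts from the rounded percentages; the counts are reconstructed rather than read from the data, so they should be treated as approximate.

```python
n_disagree = 36  # questions where gemini_thinking answered and disagreed with the generator

# (Agree %, Null %), copied from the table above
reported = {
    "gemini_no_thinking": (46.7, 16.7),
    "deepseek_thinking": (54.5, 38.9),
    "deepseek_no_thinking": (52.9, 5.6),
}

implied = {}
for model, (agree_pct, null_pct) in reported.items():
    nulls = round(null_pct / 100 * n_disagree)  # Null % is over all 36 questions
    valid = n_disagree - nulls                  # Agree % is over non-null answers only
    agree = round(agree_pct / 100 * valid)
    implied[model] = (agree, valid, nulls)
    print(f"{model:25s}: ~{agree}/{valid} agree, ~{nulls} nulls")
```

With denominators this small, a change of one or two answers moves the agreement rate by several percentage points.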
