# Confidence vs Performance Analysis

This notebook analyzes the "Self-confidence index" of chess agents and correlates it with their performance (Elo, Game Duration).

## Metrics Definition
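- **Self-confidence index**: The number of `get_legal_moves` requests per move made, calculated as `get_legal_moves_count / make_move_count`. A lower ratio means the model makes more of its moves without first requesting the legal-move list.
- **Confidence Score**: `1 - min(1.0, ratio)`, where `ratio` is the index above. A higher score means the model requests legal moves less often relative to making moves; any ratio of 1 or more maps to a score of 0.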
- **Performance**: Elo ratings (from `elo_refined.csv`) and Game Duration (from logs).
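
As a quick check of the definitions above, here is a minimal sketch of the score computed from the two counts. The function name `confidence_score` is illustrative and not part of `data_processing`; returning NaN when no moves were made mirrors the division-by-zero guard used later in this notebook.

```python
def confidence_score(get_legal_moves_count: int, make_move_count: int) -> float:
    """Return 1 - min(1, requests per move); NaN when no moves were made."""
    if make_move_count <= 0:
        return float("nan")
    ratio = get_legal_moves_count / make_move_count
    return 1.0 - min(1.0, ratio)


print(confidence_score(25, 100))   # 0.75: one request per four moves
print(confidence_score(150, 100))  # 0.0: more than one request per move is clamped
print(confidence_score(0, 100))    # 1.0: the model never requested legal moves
```

Because of the clamp, every model that averages one or more requests per move receives the same score of 0, so the raw ratio is kept alongside the score in the tables below.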

In [None]:
import sys
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from collections import defaultdict

# Add project root to path to allow importing from data_processing
# Assuming notebook is in analysis/ folder, project root is one level up
project_root = os.path.abspath(os.path.join(os.getcwd(), ".."))
if project_root not in sys.path:
    sys.path.append(project_root)

from data_processing.get_refined_csv import (
    load_game_logs, 
    GameMode, 
    MODEL_OVERRIDES, 
    ALIASES, 
    FILTER_OUT_MODELS,
    MODELS_METADATA_CSV
)

## Data Loading

We load game logs from both Random-vs-LLM and Dragon-vs-LLM modes. 
**Crucially**, we exclude `_logs/_pre_aug_2025/no_reflection` because those logs contain invalid action stats.

In [None]:
# Define directories, explicitly EXCLUDING "_logs/_pre_aug_2025/no_reflection"
RANDOM_LOGS_DIRS = [
    os.path.join(project_root, "_logs/rand_vs_llm"),
    os.path.join(project_root, "_logs/_pre_aug_2025/new"),
    # "_logs/_pre_aug_2025/no_reflection"  <-- EXCLUDED
]

ENGINE_LOGS_DIRS = [
    os.path.join(project_root, "_logs/engine_vs_llm"),
    os.path.join(project_root, "_logs/_pre_aug_2025/dragon_vs_llm"),
]

print("Loading Random-vs-LLM logs...")
random_logs = load_game_logs(
    logs_dirs=RANDOM_LOGS_DIRS, 
    model_overrides=MODEL_OVERRIDES, 
    mode=GameMode.RANDOM_VS_LLM
)

print("Loading Dragon-vs-LLM logs...")
dragon_logs = load_game_logs(
    logs_dirs=ENGINE_LOGS_DIRS, 
    model_overrides=MODEL_OVERRIDES, 
    mode=GameMode.DRAGON_VS_LLM
)

all_logs = random_logs + dragon_logs
print(f"Total games loaded: {len(all_logs)}")

## Feature Extraction

We aggregate `get_legal_moves_count` and `make_move_count` per model to calculate the confidence metrics.
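
The aggregation pools counts across all of a model's games before dividing, rather than averaging per-game ratios. The sketch below uses made-up records (the model names and counts are hypothetical) to show that the two approaches can give different numbers: pooling weights long games more heavily.

```python
from collections import defaultdict

# Hypothetical per-game records: (model, get_legal_moves_count, make_move_count)
games = [
    ("model-a", 10, 40),
    ("model-a", 0, 10),
    ("model-b", 30, 30),
]

totals = defaultdict(lambda: [0, 0])   # model -> [requests, moves]
per_game = defaultdict(list)           # model -> list of per-game ratios

for model, glm, mm in games:
    totals[model][0] += glm
    totals[model][1] += mm
    if mm > 0:
        per_game[model].append(glm / mm)

# Pooled ratio: total requests divided by total moves (the approach used here)
pooled = {m: g / k for m, (g, k) in totals.items() if k > 0}
# Alternative: unweighted mean of per-game ratios
mean_of_ratios = {m: sum(r) / len(r) for m, r in per_game.items()}

print(pooled)          # {'model-a': 0.2, 'model-b': 1.0}
print(mean_of_ratios)  # {'model-a': 0.125, 'model-b': 1.0}
```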

In [None]:
model_stats = defaultdict(lambda: {
    "total_games": 0, 
    "get_legal_moves": 0, 
    "make_moves": 0,
    "total_game_duration": 0.0
})

for log in all_logs:
    model_name = log.player_black.model
    # Apply aliases
    model_name = ALIASES.get(model_name, model_name)
    
    if model_name in FILTER_OUT_MODELS:
        continue
        
    stats = model_stats[model_name]
    stats["total_games"] += 1
    
    # Some logs use -1 to mark a missing count. Add a game's counts only when
    # both are present, so the pooled ratio is not skewed by a lone numerator
    # or denominator.
    glm = log.player_black.get_legal_moves_count
    mm = log.player_black.make_move_count
    
    if glm >= 0 and mm >= 0:
        stats["get_legal_moves"] += glm
        stats["make_moves"] += mm
        
    stats["total_game_duration"] += log.game_duration

# Convert to DataFrame
data = []
for model, stats in model_stats.items():
    if stats["total_games"] < 10:  # Filter out models with very few games for stability
        continue
        
    make_moves = stats["make_moves"]
    get_legal = stats["get_legal_moves"]
    
    # Avoid division by zero
    if make_moves > 0:
        ratio = get_legal / make_moves
        # Confidence Score: 1 - ratio (clamped at 0)
        # If ratio > 1 (checking multiple times per move), confidence score goes to 0
        confidence_score = 1.0 - min(1.0, ratio)
    else:
        ratio = np.nan
        confidence_score = np.nan
        
    avg_duration = stats["total_game_duration"] / stats["total_games"]
    
    data.append({
        "Player": model,
        "Games": stats["total_games"],
        "Get_Legal_Moves": get_legal,
        "Make_Moves": make_moves,
        "Legal_Moves_per_Move_Ratio": ratio,
        "Confidence_Score": confidence_score,
        "Avg_Game_Duration": avg_duration
    })

df_confidence = pd.DataFrame(data)
df_confidence.sort_values("Confidence_Score", ascending=False, inplace=True)
df_confidence.head(10)

## Merge with Elo Data

We import the calculated Elo ratings from `elo_refined.csv`.
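
An inner merge silently drops any model whose name does not match between the two tables. One way to see which models would be lost is a left merge with pandas' `indicator=True` option, sketched here on toy frames (the player names and values are invented for illustration):

```python
import pandas as pd

# Toy stand-ins for df_confidence and df_elo
df_conf = pd.DataFrame({
    "Player": ["model-a", "model-b", "model-x"],
    "Confidence_Score": [0.9, 0.4, 0.7],
})
df_elo_toy = pd.DataFrame({
    "Player": ["model-a", "model-b"],
    "elo": [1500.0, 1200.0],
})

# indicator=True adds a "_merge" column: "both", "left_only" or "right_only"
check = df_conf.merge(df_elo_toy, on="Player", how="left", indicator=True)
unmatched = check.loc[check["_merge"] == "left_only", "Player"].tolist()
print(unmatched)  # ['model-x']
```

An unexpectedly long list of unmatched players usually points at a naming mismatch (for example an alias applied on one side only) rather than genuinely missing ratings.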

In [None]:
elo_file = os.path.join(project_root, "data_processing/elo_refined.csv")
df_elo = pd.read_csv(elo_file)

# Ensure Elo is numeric
df_elo["elo"] = pd.to_numeric(df_elo["elo"], errors="coerce")

# Merge
df_merged = pd.merge(df_confidence, df_elo[["Player", "elo", "win_loss", "player_wins_percent"]], on="Player", how="inner")

# Drop rows without Elo if analyzing Elo correlation
df_elo_analysis = df_merged.dropna(subset=["elo"])

print(f"Models with Confidence Stats: {len(df_confidence)}")
print(f"Models with Elo (matched): {len(df_elo_analysis)}")
df_merged.head()

## Analysis & Visualization

In [None]:
sns.set_theme(style="whitegrid")

# 1. Distribution of Confidence Scores
plt.figure(figsize=(10, 6))
sns.histplot(df_confidence["Confidence_Score"], bins=20, kde=True)
plt.title("Distribution of Self-Confidence Scores")
plt.xlabel("Confidence Score (1 - LegalRequests/Moves)")
plt.ylabel("Count of Models")
plt.show()

In [None]:
# 2. Confidence vs Elo
plt.figure(figsize=(12, 8))
sns.scatterplot(data=df_elo_analysis, x="Confidence_Score", y="elo", size="Games", sizes=(20, 200), alpha=0.7)

# Add regression line
sns.regplot(data=df_elo_analysis, x="Confidence_Score", y="elo", scatter=False, color="red")

# Label outliers: very high Elo, or very high / very low confidence
for _, row in df_elo_analysis.iterrows():
    if row["elo"] > 2500 or row["Confidence_Score"] > 0.95 or row["Confidence_Score"] < 0.1:
        plt.text(row["Confidence_Score"]+0.01, row["elo"], row["Player"], fontsize=8, alpha=0.7)

plt.title("Model Performance (Elo) vs Self-Confidence")
plt.xlabel("Confidence Score (Higher = Fewer Legal Move Checks)")
plt.ylabel("Elo Rating")
plt.show()

correlation = df_elo_analysis["Confidence_Score"].corr(df_elo_analysis["elo"])
print(f"Correlation between Confidence and Elo: {correlation:.3f}")

In [None]:
# 3. Confidence vs Game Duration
plt.figure(figsize=(12, 8))
sns.scatterplot(data=df_merged, x="Confidence_Score", y="Avg_Game_Duration", size="Games", sizes=(20, 200), alpha=0.7)
sns.regplot(data=df_merged, x="Confidence_Score", y="Avg_Game_Duration", scatter=False, color="green")

plt.title("Game Duration vs Self-Confidence")
plt.xlabel("Confidence Score")
plt.ylabel("Average Game Duration (Normalized 0-1)")
plt.show()

corr_duration = df_merged["Confidence_Score"].corr(df_merged["Avg_Game_Duration"])
print(f"Correlation between Confidence and Game Duration: {corr_duration:.3f}")

In [None]:
# 4. Detailed Correlation Matrix
cols = ["elo", "Confidence_Score", "Legal_Moves_per_Move_Ratio", "Avg_Game_Duration", "win_loss", "player_wins_percent"]
corr_matrix = df_merged[cols].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Matrix of Metrics")
plt.show()

## Confidence & Elo Table

Below is a summary table of models sorted from highest to lowest Self-Confidence.

In [None]:
# Prepare simple table with relevant columns
df_table = df_elo_analysis[["Player", "Confidence_Score", "elo"]].copy()

# Sort descending by Confidence Score
df_table.sort_values(by="Confidence_Score", ascending=False, inplace=True)

# Reset index for display
df_table.reset_index(drop=True, inplace=True)

# Display the full table (use head()/tail() instead for a shorter view)
print("Models sorted by Self-Confidence (Highest to Lowest):\n")
print(df_table.to_string(index=True, float_format="{:.3f}".format))

## Findings Summary

1. **Confidence Metric**: We defined confidence as the complement of how often the model checks for legal moves (`1 - min(1.0, get_legal_moves / make_moves)`). A score of 1.0 means the model never checks legal moves; a score of 0 means it checks at least once per move on average.
2. **Distribution**: The histogram shows how models vary in their reliance on the legal moves tool.
3. **Performance Link**: The scatter plots reveal whether "confident" models (those that don't check moves) tend to have higher Elo or last longer in games.
4. **Summary Table**: The table above lists each model's confidence score alongside its Elo, sorted by confidence.
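
The correlations printed above come from pandas' `Series.corr`, which computes the Pearson coefficient by default. For reference, here is a minimal pure-Python sketch of that coefficient; the helper name `pearson` and the sample values are illustrative only and are not results from this analysis.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; NaN if either is constant."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


# Made-up scores and ratings with a perfectly linear relationship
scores = [0.1, 0.5, 0.9]
ratings = [1000.0, 1400.0, 1800.0]
r = pearson(scores, ratings)
print(round(r, 3))
```

A coefficient near +1 or -1 indicates a strong linear relationship and a value near 0 indicates none; it does not by itself establish that confidence causes higher or lower Elo.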