## Analyzing abnormal game endings (instruction-following issues)

This analysis uses a subset of logs recorded after March 2025; logs before that date have a known issue with the tracking of wrong actions and wrong moves.

During the move chat, the following "mistakes" are tracked:
- Wrong actions - the bot processing the LLM response can't match it to any of the allowed actions
- Wrong moves - the make_move action is identified, yet the requested move can't be made. Typically, even after requesting the list of legal moves, a model hallucinates a move that is not on the list already present in the dialog
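
The distinction between the two mistake types can be sketched as below. This is a minimal illustration, not the game bot's actual parser: the reply format (`"<action> [<move>]"`), the `classify_reply` helper, and the `Outcome` enum are assumptions made for the example. Only the action names come from the text above.

```python
from enum import Enum

# Action names as mentioned in this notebook; the reply format is assumed.
ALLOWED_ACTIONS = ("get_current_board", "get_legal_moves", "make_move")


class Outcome(Enum):
    OK = "ok"
    WRONG_ACTION = "wrong_action"
    WRONG_MOVE = "wrong_move"


def classify_reply(reply: str, legal_moves: set[str]) -> Outcome:
    """Hypothetical classifier: label a model reply as OK, wrong action, or wrong move."""
    tokens = reply.strip().split()
    # Wrong action: the reply does not start with any allowed action.
    if not tokens or tokens[0] not in ALLOWED_ACTIONS:
        return Outcome.WRONG_ACTION
    # Wrong move: make_move was recognized, but the move is missing or not legal.
    if tokens[0] == "make_move":
        if len(tokens) != 2 or tokens[1] not in legal_moves:
            return Outcome.WRONG_MOVE
    return Outcome.OK


legal = {"e2e4", "g1f3"}
print(classify_reply("make_move e2e4", legal))   # a legal move
print(classify_reply("make_move e2e5", legal))   # recognized action, illegal move
print(classify_reply("I think e4 is best", legal))  # no recognizable action
```

In this sketch a hallucinated move (one that was never on the legal list) is counted as a wrong move rather than a wrong action, matching the definitions above.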

A game can be interrupted because the model fails to follow the game protocol. The reasons are:
- Too many wrong actions - the model produced more than 2 responses that the game bot failed to parse OR that did not yield a valid move
- Max turns reached - while deciding on the next move, the chat completions dialog lasted for more than 10 turns. This typically indicates loops, such as going in circles with actions like get_current_board/get_legal_moves
- Model errors - e.g. timeouts, when a model failed to respond within a reasonable amount of time, or when a specific API code was returned indicating a model error. Connectivity and infrastructure issues are discarded (the log is deleted)
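
The interruption rules above can be expressed as a small decision function. This is a hedged sketch: the thresholds (more than 2 failed responses, more than 10 turns) come from the list above, while the function name `interruption_reason`, its parameters, the order of the checks, and the assumption that both counters are kept per move chat are illustrative and not taken from the game bot's code.

```python
from typing import Optional

MAX_MISTAKES = 2   # more than 2 unparseable or invalid responses ends the game
MAX_TURNS = 10     # more than 10 dialog turns for a single move ends the game


def interruption_reason(mistakes: int, turns: int,
                        model_error: bool = False) -> Optional[str]:
    """Hypothetical helper: return the interruption reason, or None to continue."""
    if model_error:
        # Timeouts or API codes that signal a model-side failure.
        return "Model error"
    if mistakes > MAX_MISTAKES:
        return "Too many wrong actions"
    if turns > MAX_TURNS:
        return "Max turns reached"
    return None


print(interruption_reason(mistakes=3, turns=4))    # Too many wrong actions
print(interruption_reason(mistakes=0, turns=11))   # Max turns reached
print(interruption_reason(mistakes=2, turns=10))   # None (still within limits)
```

Note that both limits are strict ("more than"), so exactly 2 mistakes or exactly 10 turns does not interrupt the game in this sketch.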

In [1]:
# Loaded variable 'df' from URI: /Users/admin/src/llm_chess/data_processing/refined.csv
import pandas as pd
df = pd.read_csv(r'refined_abn_analysis.csv')
# Print fields (columns) in the dataframe
print("Fields in the dataframe:")
print(df.columns.tolist())
print("\n" + "="*50 + "\n")

# Print basic statistics
print("Dataframe shape:")
print(f"Rows: {len(df)}, Columns: {len(df.columns)}")
print("\n" + "="*50 + "\n")

print("Data types:")
print(df.dtypes)
print("\n" + "="*50 + "\n")

print("Summary statistics:")
print(df.describe())
print("\n" + "="*50 + "\n")

print("Info:")
df.info()
print("\n" + "="*50 + "\n")

print("First few rows:")
print(df.head())



Fields in the dataframe:
['Player', 'total_games', 'player_wins', 'opponent_wins', 'draws', 'player_wins_percent', 'player_draws_percent', 'average_moves', 'moe_average_moves', 'total_moves', 'player_wrong_actions', 'player_wrong_moves', 'wrong_actions_per_1000moves', 'wrong_moves_per_1000moves', 'mistakes_per_1000moves', 'moe_mistakes_per_1000moves', 'player_avg_material', 'opponent_avg_material', 'material_diff_player_llm_minus_opponent', 'moe_material_diff_llm_minus_rand', 'completion_tokens_black_per_move', 'moe_completion_tokens_black_per_move', 'moe_black_llm_win_rate', 'moe_draw_rate', 'moe_black_llm_loss_rate', 'win_loss', 'moe_win_loss', 'win_loss_non_interrupted', 'moe_win_loss_non_interrupted', 'game_duration', 'moe_game_duration', 'games_interrupted', 'games_interrupted_percent', 'moe_games_interrupted', 'games_not_interrupted', 'games_not_interrupted_percent', 'moe_games_not_interrupted', 'average_game_cost', 'moe_average_game_cost', 'price_per_1000_moves', 'moe_price_per_

In [5]:
# %% cell 3 code - Statistics on abnormal game terminations

print("="*70)
print("STATISTICS ON ABNORMAL GAME TERMINATIONS")
print("="*70)

# Select models with at least one abnormal finish (abnormal_finishes_percent > 0)
models_with_abnormal_finishes = df[df['abnormal_finishes_percent'] > 0]
num_models_with_abnormal = len(models_with_abnormal_finishes)
total_models = len(df)

print(f"\nModels with abnormal finishes: {num_models_with_abnormal} out of {total_models}")
print(f"Percentage: {(num_models_with_abnormal / total_models * 100):.1f}%")

# For models with abnormal finishes, show breakdown of termination reasons
if num_models_with_abnormal > 0:
    print("\n" + "="*70)
    print("BREAKDOWN OF TERMINATION REASONS (relative to failed games)")
    print("="*70)
    
    # Create a working copy to avoid SettingWithCopyWarning
    maf = models_with_abnormal_finishes.copy()
    
    # Calculate relative percentages (share of the abnormal finishes)
    # We divide the specific reason % by the total abnormal % to get the portion of failures
    maf['rel_wrong'] = maf['abnormal_too_many_wrong_actions_percent'] / maf['abnormal_finishes_percent'] * 100
    maf['rel_max_turns'] = maf['abnormal_max_turns_percent'] / maf['abnormal_finishes_percent'] * 100
    maf['rel_unknown'] = maf['abnormal_unknown_issue_percent'] / maf['abnormal_finishes_percent'] * 100
    maf['rel_error'] = maf['abnormal_error_percent'] / maf['abnormal_finishes_percent'] * 100
    
    # Calculate average of these relative percentages across models
    avg_wrong_actions = maf['rel_wrong'].mean()
    avg_max_turns = maf['rel_max_turns'].mean()
    avg_unknown = maf['rel_unknown'].mean()
    avg_error = maf['rel_error'].mean()
    
    print(f"\nAverage breakdown of failure reasons (sums to ~100%):")
    print(f"  Too many wrong actions: {avg_wrong_actions:.2f}%")
    print(f"  Max turns reached:      {avg_max_turns:.2f}%")
    print(f"  Unknown issue:          {avg_unknown:.2f}%")
    print(f"  Error:                  {avg_error:.2f}%")
    
    # Add breakdown of wrong actions vs wrong moves
    print("\n" + "="*70)
    print("BREAKDOWN OF WRONG ACTIONS vs WRONG MOVES")
    print("="*70)
    
    # Calculate averages for wrong actions and wrong moves per 1000 moves
    avg_wrong_actions_per_1000 = maf['wrong_actions_per_1000moves'].mean()
    avg_wrong_moves_per_1000 = maf['wrong_moves_per_1000moves'].mean()
    
    # Calculate total and relative percentages
    total_mistakes_per_1000 = avg_wrong_actions_per_1000 + avg_wrong_moves_per_1000
    if total_mistakes_per_1000 > 0:
        rel_wrong_actions = (avg_wrong_actions_per_1000 / total_mistakes_per_1000) * 100
        rel_wrong_moves = (avg_wrong_moves_per_1000 / total_mistakes_per_1000) * 100
    else:
        rel_wrong_actions = 0
        rel_wrong_moves = 0
    
    print(f"\nAverage mistakes per 1000 moves:")
    print(f"  Wrong actions: {avg_wrong_actions_per_1000:.2f} ({rel_wrong_actions:.1f}%)")
    print(f"  Wrong moves:   {avg_wrong_moves_per_1000:.2f} ({rel_wrong_moves:.1f}%)")
    print(f"  Total:         {total_mistakes_per_1000:.2f}")
    
    # Show detailed breakdown for each model with abnormal finishes
    print("\n" + "-"*70)
    print("Per-model breakdown (of failed games):")
    print("-"*70)
    
    breakdown_df = maf[[
        'Player',
        'abnormal_finishes_percent',
        'rel_wrong',
        'rel_max_turns',
        'rel_unknown',
        'rel_error'
    ]].copy()
    
    breakdown_df = breakdown_df.rename(columns={
        'Player': 'Model',
        'abnormal_finishes_percent': 'Total Abn% (of games)',
        'rel_wrong': 'Wrong Actions%',
        'rel_max_turns': 'Max Turns%',
        'rel_unknown': 'Unknown%',
        'rel_error': 'Error%'
    })
    
    breakdown_df = breakdown_df.sort_values('Total Abn% (of games)', ascending=False)
    
    # Format the float columns for cleaner output
    format_mapping = {
        'Total Abn% (of games)': '{:.1f}%',
        'Wrong Actions%': '{:.1f}%',
        'Max Turns%': '{:.1f}%',
        'Unknown%': '{:.1f}%',
        'Error%': '{:.1f}%'
    }
    
    print(breakdown_df.to_string(index=False, formatters={k: v.format for k, v in format_mapping.items()}))
else:
    print("\nNo models with abnormal finishes found.")

STATISTICS ON ABNORMAL GAME TERMINATIONS

Models with abnormal finishes: 54 out of 76
Percentage: 71.1%

BREAKDOWN OF TERMINATION REASONS (relative to failed games)

Average breakdown of failure reasons (sums to ~100%):
  Too many wrong actions: 64.79%
  Max turns reached:      13.96%
  Unknown issue:          0.00%
  Error:                  21.25%

BREAKDOWN OF WRONG ACTIONS vs WRONG MOVES

Average mistakes per 1000 moves:
  Wrong actions: 122.70 (62.1%)
  Wrong moves:   74.86 (37.9%)
  Total:         197.56

----------------------------------------------------------------------
Per-model breakdown (of failed games):
----------------------------------------------------------------------
                                   Model Total Abn% (of games) Wrong Actions% Max Turns% Unknown% Error%
                gemma-3-4b-it@iq4_qs@PGN                100.0%          97.0%       3.0%     0.0%   0.0%
               google_gemma-3-4b-it@bf16                100.0%          98.5%       1.5%     