# Chess Tournament Evaluation

## Overview

This notebook demonstrates running competitive chess tournaments to evaluate your fine-tuned model. You'll learn how to:

- Run tournaments with TrueSkill ratings
- Leverage automatic request batching for throughput
- Analyze model performance with detailed metrics
- Compare against multiple Stockfish baselines

**Prerequisites:**
- Complete Chess-Deployment.ipynb
- vLLM server running with your chess model

**Duration:** 20-30 minutes

## Step 1: Verify vLLM Server is Running

Before starting the tournament, let's verify the vLLM server is still running.

In [None]:
import openai

MODEL_PATH = "kunhunjon/ChessLM_Qwen3_Trainium_AWS_Format"

client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

try:
    response = client.chat.completions.create(
        model=MODEL_PATH,
        messages=[{"role": "user", "content": "Test"}],
        max_tokens=5,
        extra_body={"chat_template_kwargs": {"enable_thinking": False}}
    )
    print("vLLM server is running and ready!")
except Exception as e:
    print(f"Server not responding: {e}")
    print("\nPlease start the server from Chess-Deployment.ipynb first.")
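
If the server was launched only moments ago, a single test request can fail while the model is still loading. A minimal retry sketch follows. `wait_until_ready` and its parameters are hypothetical helpers, not part of the workshop code; the probe is passed in as a callable so that any check, such as the chat completion call above, can be reused.

```python
import time
from typing import Callable


def wait_until_ready(probe: Callable[[], None], attempts: int = 10, delay_s: float = 5.0) -> bool:
    """Call `probe` until it stops raising, or give up after `attempts` tries.

    Hypothetical helper: `probe` is any zero-argument callable that raises on failure.
    """
    for attempt in range(1, attempts + 1):
        try:
            probe()
            return True
        except Exception as exc:
            print(f"Attempt {attempt}/{attempts} failed: {exc}")
            # Sleep only if another attempt will follow
            if attempt < attempts:
                time.sleep(delay_s)
    return False
```

One way to use it would be to wrap the request from the cell above in a `lambda` and pass that as `probe`.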

## Step 2: Play a Single Game

Let's start with a single game to verify everything works.

In [None]:
%cd /home/ubuntu/environment/neuron-workshops/labs/vLLM/Chess/

# Play single game: vLLM vs Stockfish (skill 5)
!python -m assets.run_game \
  --agent1 vllm \
  --agent2 stockfish-skill5-depth10 \
  --num-games 1 \
  --verbose

## Step 3: Run a Small Tournament (Sequential)

Now let's run a small tournament with **parallelism=1** (sequential games) to establish a baseline.

In [None]:
import time

print("Running 4 games sequentially (parallelism=1)...\n")

start_time = time.time()

!python -m assets.run_game \
  --agent vllm \
  --agent stockfish-skill1-depth1 \
  --num-games 4 \
  --parallelism 1 \
  --output-dir tournament_sequential

sequential_time = time.time() - start_time

print(f"\nSequential tournament completed in {sequential_time:.1f} seconds")

## Step 4: Run Tournament with Concurrency

Now let's run the same tournament with **parallelism=4** to see the throughput improvement from automatic request batching.

In [None]:
print("Running 4 games in parallel (parallelism=4)...\n")

start_time = time.time()

!python -m assets.run_game \
  --agent vllm \
  --agent stockfish-skill1-depth1 \
  --num-games 4 \
  --parallelism 4 \
  --output-dir tournament_parallel

parallel_time = time.time() - start_time

print(f"\nParallel tournament completed in {parallel_time:.1f} seconds")
print(f"\nSpeedup: {sequential_time/parallel_time:.2f}x faster with parallelism=4")
print(f"Time saved: {sequential_time - parallel_time:.1f} seconds ({(sequential_time - parallel_time)/sequential_time*100:.1f}%)")

### Understanding the Speedup

**How it works:**

1. **Process-level parallelism** (`--parallelism 4`):
   - Tournament scheduler runs 4 games simultaneously in separate processes
   - Each game makes HTTP requests to the vLLM server independently

2. **Request-level batching** (vLLM server):
   - Server configured with `max_num_seqs=4` and `continuous_batching=true`
   - When 4 games request moves at similar times, vLLM automatically batches them
   - Batched requests are processed together on Neuron cores

**Expected results:**
- Sequential (parallelism=1): ~0.65s per move, games run one after another
- Parallel (parallelism=4): ~1.86s per move, but 4 games run simultaneously
- **Throughput improvement: ~1.4x** (about 40% more moves per second, so the same workload finishes in roughly 29% less wall-clock time)

**Why not 4x speedup?**
- Individual request latency rises under batching (~0.65s to ~1.86s per move)
- Requests from different games do not all arrive at the same moment, so a batch can be only partly full
- Batch efficiency is ~35% of the ideal: a 1.4x speedup against a theoretical 4x
- The saving is still worthwhile: the tournament finishes sooner even though each move takes longer
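
The relationship between these figures can be checked with a few lines of arithmetic. The sketch below uses the throughput values quoted in Step 9 of this notebook (1.53 and 2.15 req/s); `batching_summary` is a hypothetical helper written for illustration only.

```python
def batching_summary(base_throughput: float, batched_throughput: float, parallelism: int) -> dict:
    """Derive speedup, batch efficiency, and wall-clock saving from two throughput readings."""
    speedup = batched_throughput / base_throughput
    return {
        "speedup": speedup,
        # Share of the ideal N-times speedup that was actually achieved
        "efficiency_pct": speedup / parallelism * 100,
        # A fixed workload takes 1/speedup of the original time
        "time_saved_pct": (1 - 1 / speedup) * 100,
    }


summary = batching_summary(base_throughput=1.53, batched_throughput=2.15, parallelism=4)
print(f"Speedup:    {summary['speedup']:.2f}x")
print(f"Efficiency: {summary['efficiency_pct']:.0f}% of ideal")
print(f"Time saved: {summary['time_saved_pct']:.0f}%")
```

Note that a 1.4x throughput gain corresponds to about 29% less wall-clock time, not 40% less, because time scales with the inverse of throughput.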

## Step 5: Run Full Tournament

Now let's run a comprehensive tournament against multiple opponents to evaluate model strength.

In [None]:
!python -m assets.run_game \
  --agent vllm \
  --agent stockfish-skill1-depth2 \
  --agent stockfish-skill5-depth10 \
  --agent stockfish-skill10-depth15 \
  --num-games 20 \
  --parallelism 4 \
  --output-dir tournament_full

print("\nTournament complete! Results saved to tournament_full/")

## Step 6: Analyze Results

Let's load and analyze the tournament results.

In [None]:
import json
import pandas as pd

# Load tournament results
with open('tournament_full/tournament.json') as f:
    results = json.load(f)

agents_data = []
for agent_name, stats in results['agents'].items():
    totals = stats['totals']
    rating = stats['rating']

    games  = totals.get('games', 0)
    wins   = totals.get('wins', 0)
    losses = totals.get('losses', 0)
    draws  = totals.get('draws', 0)

    # Score rate: a draw counts as half a win
    win_rate = (wins + 0.5 * draws) / games * 100 if games > 0 else 0.0

    agents_data.append({
        'Agent': agent_name,
        'Games': games,
        'Wins': wins,
        'Losses': losses,
        'Draws': draws,
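        'Win Rate': f"{win_rate:.1f}%",
        # Keep Rating numeric for now so the sort below is numeric, not lexicographic
        'Rating': rating['conservative'],
        'Mu': f"{rating['mu']:.2f}",
        'Sigma': f"{rating['sigma']:.2f}",
    })

df = pd.DataFrame(agents_data)
df = df.sort_values('Rating', ascending=False)
# Format the rating for display only after sorting
df['Rating'] = df['Rating'].map(lambda r: f"{r:.1f}")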

print("Tournament Standings:")
print("=" * 80)
print(df.to_string(index=False))
print("\nNote: Rating = mu - 3*sigma (conservative estimate)")


## Step 7: Analyze Your Model's Performance

Let's look at detailed metrics for your vLLM model.

In [None]:
import json

with open('tournament_full/tournament.json') as f:
    results = json.load(f)

# Get vLLM model statistics
vllm_stats = results['agents']['vllm']
totals = vllm_stats['totals']
rating = vllm_stats['rating']
engine = vllm_stats.get('engine_metrics_avg', {})
illegal = vllm_stats.get('illegal_metrics', {})

games  = totals.get('games', 0)
wins   = totals.get('wins', 0)
losses = totals.get('losses', 0)
draws  = totals.get('draws', 0)
as_white = totals.get('as_white', 0)
as_black = totals.get('as_black', 0)

win_rate = (wins + 0.5 * draws) / games * 100 if games > 0 else 0.0

print("Your Model Performance:")
print("=" * 60)
print(f"\nGames Played: {games}")
print(f"Record: {wins}-{losses}-{draws}")
print(f"Win Rate: {win_rate:.1f}%")

print(f"\nTrueSkill Rating:")
print(f"  Mu (skill estimate): {rating['mu']:.2f}")
print(f"  Sigma (uncertainty): {rating['sigma']:.2f}")
print(f"  Conservative rating (mu - 3σ): {rating['conservative']:.1f}")

# Engine metrics (per-game averages from Stockfish analysis)
if engine:
    print(f"\nMove Quality (engine-based):")
    print(f"  Accuracy: {engine.get('accuracy_pct', 0.0):.1f}% (matches Stockfish top move)")
    print(f"  Avg Centipawn Loss: {engine.get('acpl', 0.0):.1f}")

# Illegal move stats
if illegal:
    print(f"\nIllegal Move Metrics:")
    print(f"  Illegal attempts: {illegal.get('attempts', 0)}")
    print(f"  Total move attempts: {illegal.get('move_attempts', 0)}")
    print(f"  Illegal %: {illegal.get('illegal_pct', 0.0):.2f}%")

print(f"\nGames as White: {as_white}")
print(f"Games as Black: {as_black}")


### Interpreting Results

**TrueSkill Rating (approximate Elo equivalents):**
- Conservative rating 15-20: ~1200-1400 Elo (beginner)
- Conservative rating 20-25: ~1400-1600 Elo (intermediate)
- Conservative rating 25-30: ~1600-1800 Elo (advanced)
- Conservative rating 30+: ~1800+ Elo (expert)
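
As a worked example, the conservative rating and the bands listed above can be expressed in a few lines. This is a sketch only: `conservative_rating` and `skill_band` are hypothetical helpers, and because the bands share their endpoints (20 appears in two of them), the code assumes each boundary value belongs to the higher band.

```python
def conservative_rating(mu: float, sigma: float, k: float = 3.0) -> float:
    """Lower-bound skill estimate: the mean minus k standard deviations."""
    return mu - k * sigma


def skill_band(rating: float) -> str:
    """Map a conservative rating to the rough bands used in this notebook.

    Assumption: boundary values (20, 25, 30) fall into the higher band.
    """
    if rating >= 30:
        return "expert"
    if rating >= 25:
        return "advanced"
    if rating >= 20:
        return "intermediate"
    if rating >= 15:
        return "beginner"
    return "below the listed bands"


# A brand-new TrueSkill player starts at mu=25, sigma=25/3, so the
# conservative rating is about zero until games reduce the uncertainty.
print(f"{conservative_rating(25.0, 25.0 / 3):.2f}")

# Example values chosen for illustration only
example = conservative_rating(mu=30.0, sigma=2.0)
print(example, skill_band(example))
```

Because sigma shrinks as more games are played, two agents with the same mu can land in different bands if one of them has played far fewer games.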

**Move Quality:**
- Accuracy >60%: Model frequently finds Stockfish's top move
- ACPL <50: Good move quality (average mistake < half a pawn)
- ACPL <30: Excellent move quality (near-optimal play)
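
To make the two move-quality metrics concrete, here is a minimal sketch of how they could be computed from per-move engine analysis. The input format (played move, engine's top move, centipawn loss) and the function `move_quality` are assumptions made for illustration; the tournament tool may compute its `accuracy_pct` and `acpl` values differently.

```python
from typing import Iterable, Tuple


def move_quality(moves: Iterable[Tuple[str, str, float]]) -> dict:
    """Compute accuracy and average centipawn loss (ACPL).

    Each item is (played_move, engine_best_move, centipawn_loss), a hypothetical
    record format. Loss is 0 when the played move is as good as the best move.
    """
    moves = list(moves)
    if not moves:
        raise ValueError("at least one analysed move is required")

    matches = sum(1 for played, best, _ in moves if played == best)
    total_loss = sum(max(0.0, loss) for _, _, loss in moves)

    return {
        "accuracy_pct": matches / len(moves) * 100,
        "acpl": total_loss / len(moves),
    }


# Illustrative data, not taken from a real game
sample = [("e4", "e4", 0), ("Nf3", "Nf3", 0), ("h4", "d4", 60), ("Bc4", "Bb5", 20)]
quality = move_quality(sample)
print(f"Accuracy: {quality['accuracy_pct']:.1f}%  ACPL: {quality['acpl']:.1f}")
```

In this sample, two of four moves match the engine's choice and the total loss is 80 centipawns, giving 50% accuracy and an ACPL of 20.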

## Step 8: View Sample Game

Let's look at a sample game from the tournament.

In [None]:
# Get first game involving vLLM (either side)
vllm_game = None
for game in results['games']:
    if 'vllm' in [game['white_agent_spec'], game['black_agent_spec']]:
        vllm_game = game
        break

if vllm_game:
    print("Sample Game:")
    print("=" * 60)
    print(f"Game ID: {vllm_game['id']}")
    print(f"White: {vllm_game['white_agent_spec']}")
    print(f"Black: {vllm_game['black_agent_spec']}")
    print(f"Result: {vllm_game['result']}")  # "1-0", "0-1", "1/2-1/2", or "*"
    print(f"Moves: {vllm_game['moves_played']}")
    print(f"Reason: {vllm_game['game_over_reason']}")
    print(f"Final FEN: {vllm_game['final_fen']}")
    
    # PGN path on disk
    pgn_path = vllm_game.get('pgn_path', '')
    if pgn_path:
        print(f"\nPGN file: {pgn_path}")
        print("You can open this file and paste the PGN into: https://lichess.org/paste")
    else:
        print("\nNo PGN path recorded for this game.")
else:
    print("No vLLM games found in results.")
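
The same game records can be tallied into a win/loss/draw record from one agent's point of view. The sketch below relies on the `white_agent_spec`, `black_agent_spec`, and `result` fields used above, and assumes results follow the standard PGN strings (`1-0`, `0-1`, `1/2-1/2`, `*`); `tally_record` is a hypothetical helper.

```python
from collections import Counter


def tally_record(games, agent="vllm"):
    """Count wins, losses, draws, and unfinished games for `agent`."""
    record = Counter()
    for game in games:
        white = game["white_agent_spec"]
        black = game["black_agent_spec"]
        if agent not in (white, black):
            continue  # the agent did not play in this game

        result = game["result"]
        if result == "1/2-1/2":
            record["draws"] += 1
        elif result in ("1-0", "0-1"):
            # The agent wins if White won and it was White, or Black won and it was Black
            agent_won = (result == "1-0") == (white == agent)
            record["wins" if agent_won else "losses"] += 1
        else:
            record["unfinished"] += 1  # "*" or any unrecognised value
    return record


# Illustrative records, not real tournament output
sample_games = [
    {"white_agent_spec": "vllm", "black_agent_spec": "stockfish-skill1-depth2", "result": "1-0"},
    {"white_agent_spec": "stockfish-skill1-depth2", "black_agent_spec": "vllm", "result": "1-0"},
    {"white_agent_spec": "stockfish-skill5-depth10", "black_agent_spec": "vllm", "result": "1/2-1/2"},
    {"white_agent_spec": "vllm", "black_agent_spec": "stockfish-skill5-depth10", "result": "*"},
]
print(dict(tally_record(sample_games)))
```

With real data, `tally_record(results['games'])` should agree with the `totals` shown in Step 7 if the assumptions about the result strings hold.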


## Step 9: Compare Different Parallelism Levels

Let's visualize how concurrency affects performance.

In [None]:
# This data comes from the concurrency benchmark we ran
concurrency_data = {
    'Concurrency': [1, 2, 4, 8],
    'Throughput (req/s)': [1.53, 1.89, 2.15, 2.17],
    'Mean Latency (s)': [0.653, 1.057, 1.860, 3.315],
    'Speedup': [1.00, 1.23, 1.40, 1.42]
}

df_concurrency = pd.DataFrame(concurrency_data)

print("Concurrency Performance:")
print("=" * 70)
print(df_concurrency.to_string(index=False))
print("\nKey Insights:")
print("- Throughput plateaus at parallelism=4 (matches server max_num_seqs)")
print("- 1.4x throughput improvement with automatic batching")
print("- Individual latency increases, but total time decreases")
print("- Minimal benefit beyond parallelism=4 (server saturated)")
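
A quick sanity check on the table is Little's law: the average number of requests in flight equals throughput multiplied by mean latency. The sketch below applies it to the benchmark figures above. The interpretation assumes the benchmark tried to keep a fixed number of requests in flight at each concurrency level, which this notebook does not state explicitly.

```python
concurrency = [1, 2, 4, 8]
throughput_rps = [1.53, 1.89, 2.15, 2.17]
mean_latency_s = [0.653, 1.057, 1.860, 3.315]


def in_flight_estimates(throughput, latency):
    """Little's law: requests in flight = throughput x mean latency."""
    return [t * l for t, l in zip(throughput, latency)]


estimates = in_flight_estimates(throughput_rps, mean_latency_s)
for level, estimate in zip(concurrency, estimates):
    print(f"concurrency={level}: throughput x latency = {estimate:.2f}")
```

The product is close to 1, 2, and 4 for the first three rows and about 7.2 for the last. Latency roughly doubling from concurrency 4 to 8 while throughput barely moves is consistent with the extra requests waiting for a batch slot rather than being processed sooner.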

## Summary

Congratulations! You've successfully:

- Run chess tournaments with TrueSkill ratings  
- Leveraged automatic request batching for 1.4x throughput  
- Evaluated model strength against multiple baselines  
- Analyzed performance with detailed metrics  
- Understood concurrency trade-offs  

**Key Takeaways:**

1. **Automatic Concurrency**: Process-level parallelism (`p_map`) + vLLM continuous batching work together automatically
2. **Optimal Parallelism**: Set `--parallelism` to match `max_num_seqs` (4 in this workshop's server configuration) for best throughput
3. **Throughput vs Latency**: Individual requests slower, but total throughput higher
4. **Real-world Benefit**: ~40% higher throughput with parallelism=4, which cuts tournament wall-clock time by roughly 29%

**Next Steps:**

- Fine-tune your model further based on tournament weaknesses
- Experiment with different training datasets
- Test against stronger opponents (Stockfish skill 15-20)
- Deploy to production with learned configurations
- Share your results with the community!