# Chess Tournament Evaluation

## Overview

This notebook demonstrates running competitive chess tournaments to evaluate your fine-tuned model. You'll learn how to:

- Run tournaments with TrueSkill ratings
- Leverage automatic request batching for throughput
- Analyze model performance with detailed metrics
- Compare against multiple Stockfish baselines

**Prerequisites:**
- Complete Chess-Deployment.ipynb
- vLLM server running with your chess model

**Duration:** 20-30 minutes

## Step 1: Verify vLLM Server is Running

Before starting the tournament, let's verify the vLLM server is still running.

In [2]:
import openai

MODEL_ID = "kunhunjon/ChessLM_Qwen3_Trainium_AWS_Format"

client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

try:
    response = client.chat.completions.create(
        model=MODEL_ID,
        messages=[{"role": "user", "content": "Test"}],
        max_tokens=5,
        extra_body={"chat_template_kwargs": {"enable_thinking": False}}
    )
    print(" vLLM server is running and ready!")
except Exception as e:
    print(f" Server not responding: {e}")
    print("\nPlease start the server from Chess-Deployment.ipynb first.")

 vLLM server is running and ready!


## Step 2: Play a Single Game

Let's start with a single game to verify everything works.

In [8]:
%cd /home/ubuntu/environment/neuron-workshops/labs/vLLM/Chess/

# Play single game: vLLM vs Stockfish (skill 5)
!python -m assets.run_game \
  --agent1 vllm \
  --agent2 stockfish-skill5-depth10 \
  --num-games 1 \
  --verbose

/home/ubuntu/environment/neuron-workshops/labs/vLLM/Chess



=== Multi-Game Chess Tournament (Duel Mode) ===
🎮 Agent 1: vllm
🎮 Agent 2: stockfish-skill5-depth10
📊 Number of Games: 1
⏱️  Max Moves per Game: 200
⏰ Time Limit per Move: 15.0s
🔊 Verbose: True
💾 Output File: games.pgn

🚀 Running 1 game...
  0%|                                                     | 0/1 [00:00<?, ?it/s]
🎲 Game 1: vllm (White) vs stockfish-skill5-depth10 (Black)
Starting new game: VLLMAgent (White) vs StockfishAgent (Black)
Initial position: rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1

Initial Position:
 8  ♜  ♞  ♝  ♛  ♚  ♝  ♞  ♜
 7  ♟  ♟  ♟  ♟  ♟  ♟  ♟  ♟
 6  ·  ·  ·  ·  …

## Step 3: Run a Small Tournament (Sequential)

Now let's run a small tournament with **parallelism=1** (sequential games) to establish a baseline.

In [12]:
import time

print("Running 4 games sequentially (parallelism=1)...\n")

start_time = time.time()

!python -m assets.run_game \
  --agent vllm \
  --agent stockfish-skill1-depth1 \
  --num-games 4 \
  --parallelism 1 \
  --output-dir tournament_sequential

sequential_time = time.time() - start_time

print(f"\n✅ Sequential tournament completed in {sequential_time:.1f} seconds")

Running 4 games sequentially (parallelism=1)...



=== N-agent TrueSkill Tournament ===
👥 Agents (2): vllm, stockfish-skill1-depth1
📊 Target Games: 4
⏱️  Max Moves per Game: 200
⏰ Time Limit per Move: 15.0s
🗂️  Output Dir: tournament_sequential
🧮 Scheduler: trueskill
🎯 Max Games/Agent (soft): unlimited
  0%|                                                     | 0/1 [00:00<?, ?it/s]
📊 Game 1: 1 moves | Total: 1 moves across all games
📊 Game 1: 10 moves | Total: 10 moves across all games
📊 Game 1: 20 moves | Total: 20 moves across all games
📊 Game 1: 30 moves | Total: 30 moves across all games
100%|█████████████████████████████████████████████| 1/1 [00:12<00:00, 12.87s/it]
╭─────────────────────────── 🏆 Tournament Progress ───────────────────────────╮
…

## Step 4: Run Tournament with Concurrency

Now let's run the same tournament with **parallelism=4** to see the throughput improvement from automatic request batching.

In [None]:
print("Running 4 games in parallel (parallelism=4)...\n")

start_time = time.time()

!python -m assets.run_game \
  --agent vllm \
  --agent stockfish-skill1-depth1 \
  --num-games 4 \
  --parallelism 4 \
  --output-dir tournament_parallel

parallel_time = time.time() - start_time

print(f"\n Parallel tournament completed in {parallel_time:.1f} seconds")
print(f"\nSpeedup: {sequential_time/parallel_time:.2f}x faster with parallelism=4")
print(f"Time saved: {sequential_time - parallel_time:.1f} seconds ({(sequential_time - parallel_time)/sequential_time*100:.1f}%)")

Running 4 games in parallel (parallelism=4)...



=== N-agent TrueSkill Tournament ===
👥 Agents (2): vllm, stockfish-skill1-depth2
📊 Target Games: 4
⏱️  Max Moves per Game: 200
⏰ Time Limit per Move: 15.0s
🗂️  Output Dir: tournament_parallel
🧮 Scheduler: trueskill
🎯 Max Games/Agent (soft): unlimited
  0%|                                                     | 0/4 [00:00<?, ?it/s]
📊 Game 3: 1 moves | Total: 1 moves across all games
📊 Game 1: 1 moves | Total: 1 moves across all games
📊 Game 2: 1 moves | Total: 1 moves across all games
📊 Game 4: 1 moves | Total: 1 moves across all games
📊 Game 1: 10 moves | Total: 10 moves across all games
📊 Game 3: 10 moves | Total: 10 moves across all games
📊 Game 2: 10 moves | Total: 10 moves across all games
📊 Game 4: 10 moves | Total: 10 moves across all games
📊 Game 3: 20 moves | Total: 20 moves across all games
📊 Game 1: 20 moves | Total: 20 moves across all games
📊 Game 1: 30 moves | Total: 30 moves across all games
100%|███████…

### Understanding the Speedup

**How it works:**

1. **Process-level parallelism** (`--parallelism 4`):
   - Tournament scheduler runs 4 games simultaneously in separate processes
   - Each game makes HTTP requests to the vLLM server independently

2. **Request-level batching** (vLLM server):
   - Server configured with `max_num_seqs=4` and `continuous_batching=true`
   - When 4 games request moves at similar times, vLLM automatically batches them
   - Batched requests are processed together on Neuron cores
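
The process-level half of this picture can be illustrated with a toy simulation. This is only a sketch: it uses threads from the standard library rather than the separate processes the tournament runner uses, `play_game` and `run_games` are hypothetical names, and each "move" is a fixed `time.sleep` standing in for an HTTP request to the vLLM server.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def play_game(game_id: int, moves: int = 3, move_latency_s: float = 0.01) -> dict:
    """Toy stand-in for one game: every move blocks on a simulated request."""
    for _ in range(moves):
        time.sleep(move_latency_s)  # stands in for an HTTP call to the model server
    return {"game": game_id, "moves": moves}


def run_games(num_games: int, parallelism: int):
    """Run `num_games` toy games with at most `parallelism` in flight at once."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        results = list(pool.map(play_game, range(1, num_games + 1)))
    return results, time.perf_counter() - start


seq_results, seq_time = run_games(num_games=4, parallelism=1)
par_results, par_time = run_games(num_games=4, parallelism=4)

print(f"parallelism=1: {len(seq_results)} games in {seq_time:.3f}s")
print(f"parallelism=4: {len(par_results)} games in {par_time:.3f}s")
```

Because the simulated latency does not grow with load, this toy approaches the ideal 4x. On a real server the per-request latency rises as requests are batched, which is why the measured gain is smaller.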

**Expected results:**
- Sequential (parallelism=1): ~0.65s per move, games run one after another
- Parallel (parallelism=4): ~1.86s per move, but 4 games run simultaneously
- **Throughput improvement: ~1.4x** (about 40% more moves per second, or roughly 29% less wall-clock time for the same number of games)

**Why not 4x speedup?**
- Per-request latency rises from ~0.65s to ~1.86s due to batching overhead
- Not all requests arrive at exactly the same time (timing variance)
- Server batch efficiency: ~35% of the theoretical 4x maximum (1.4 / 4)
- The net result is still a meaningful reduction in total tournament time
## Step 5: Run Full Tournament

Now let's run a comprehensive tournament against multiple opponents to evaluate model strength.

In [13]:
!python -m assets.run_game \
  --agent vllm \
  --agent stockfish-skill1-depth2 \
  --agent stockfish-skill5-depth10 \
  --agent stockfish-skill10-depth15 \
  --num-games 20 \
  --parallelism 4 \
  --output-dir tournament_full

print("\n Tournament complete! Results saved to tournament_full/")

=== N-agent TrueSkill Tournament ===
👥 Agents (4): vllm, stockfish-skill1-depth2, stockfish-skill5-depth10, stockfish-skill10-depth15
📊 Target Games: 20
⏱️  Max Moves per Game: 200
⏰ Time Limit per Move: 15.0s
🗂️  Output Dir: tournament_full
🧮 Scheduler: trueskill
🎯 Max Games/Agent (soft): unlimited
  0%|                                                     | 0/4 [00:00<?, ?it/s]
📊 Game 4: 1 moves | Total: 1 moves across all games
📊 Game 4: 10 moves | Total: 10 moves across all games
📊 Game 4: 20 moves | Total: 20 moves across all games
📊 Game 4: 30 moves | Total: 30 moves across all games
📊 Game 3: 1 moves | Total: 1 moves across all games
📊 Game 2: 1 moves | Total: 1 moves across all games
📊 Game 1: 1 moves | Total: 1 moves across all games
📊 Game 4: 40 moves | Total: 40 moves across all games
📊 Game 4: 50 moves | Total: 50 moves across all games
📊 Game 4: 60 moves | Total: 60 moves across all games
📊 Game 4: 70 moves | Total: 70 …

## Step 6: Analyze Results

Let's load and analyze the tournament results.

In [14]:
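import json
import pandas as pd

# Load tournament results
with open('tournament_full/tournament.json') as f:
    results = json.load(f)

# Extract agent statistics.
# The per-agent record may not contain a 'games' key (indexing it directly
# raises KeyError: 'games'), so read fields with .get() and derive the game
# count from wins + losses + draws when it is absent.
agents_data = []
for agent_name, stats in results['agents'].items():
    wins = stats.get('wins', 0)
    losses = stats.get('losses', 0)
    draws = stats.get('draws', 0)
    games = stats.get('games', wins + losses + draws)
    mu = stats.get('mu', float('nan'))
    sigma = stats.get('sigma', float('nan'))
    agents_data.append({
        'Agent': agent_name,
        'Games': games,
        'Wins': wins,
        'Losses': losses,
        'Draws': draws,
        'Win Rate (%)': round(wins / games * 100, 1) if games else 0.0,
        # Fall back to mu - 3*sigma if no precomputed conservative rating exists
        'Rating': stats.get('conservative', mu - 3 * sigma),
        'Mu': mu,
        'Sigma': sigma,
    })

# Keep Rating numeric so the sort is by value; sorting formatted strings
# would place "9.5" above "10.2".
df = pd.DataFrame(agents_data)
df = df.sort_values('Rating', ascending=False)

print("Tournament Standings:")
print("=" * 80)
print(df.to_string(index=False, float_format=lambda x: f"{x:.2f}"))
print("\nNote: Rating = mu - 3*sigma (conservative estimate)")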
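
To make the "Rating = mu - 3*sigma" note concrete, here is a small standalone sketch. The function name and the two example agents are hypothetical. The first calculation uses TrueSkill's default prior for an unrated player (mu = 25, sigma = 25/3), whose conservative rating is 0.

```python
def conservative_rating(mu: float, sigma: float, k: float = 3.0) -> float:
    """Lower bound on skill: the mean minus k standard deviations."""
    return mu - k * sigma


# TrueSkill's default prior for an unrated player.
new_player = conservative_rating(25.0, 25.0 / 3.0)
print(f"Unrated player: {new_player:.1f}")

# Hypothetical agents, for illustration only: (mu, sigma).
agents = {
    "high_mu_uncertain": (28.0, 6.0),
    "lower_mu_confident": (26.0, 2.0),
}

# Rank by conservative rating, highest first.
ranked = sorted(agents, key=lambda name: conservative_rating(*agents[name]), reverse=True)
for name in ranked:
    mu, sigma = agents[name]
    print(f"{name}: mu={mu:.1f} sigma={sigma:.1f} rating={conservative_rating(mu, sigma):.1f}")
```

The agent with the lower mu ranks first here because its uncertainty is much smaller. This is why a short tournament, where sigma is still large, tends to understate a model's rating.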

## Step 7: Analyze Your Model's Performance

Let's look at detailed metrics for your vLLM model.

In [None]:
# Get vLLM model statistics
vllm_stats = results['agents']['vllm']

# Read fields defensively: tournament.json may omit some keys (e.g. 'games')
wins = vllm_stats.get('wins', 0)
losses = vllm_stats.get('losses', 0)
draws = vllm_stats.get('draws', 0)
games = vllm_stats.get('games', wins + losses + draws)
mu = vllm_stats.get('mu', float('nan'))
sigma = vllm_stats.get('sigma', float('nan'))
conservative = vllm_stats.get('conservative', mu - 3 * sigma)

print("Your Model Performance:")
print("=" * 60)
print(f"\nGames Played: {games}")
print(f"Record (W-L-D): {wins}-{losses}-{draws}")
print(f"Win Rate: {wins / games * 100:.1f}%" if games else "Win Rate: N/A")
print("\nTrueSkill Rating:")
print(f"  Mu (skill estimate): {mu:.2f}")
print(f"  Sigma (uncertainty): {sigma:.2f}")
print(f"  Conservative rating: {conservative:.1f}")

# Engine metrics (if available)
metrics = results.get('engine_metrics', {}).get('vllm')
if metrics:
    print("\nMove Quality:")
    print(f"  Accuracy: {metrics.get('accuracy_pct', 'N/A')}% (matches Stockfish top move)")
    print(f"  Avg Centipawn Loss: {metrics.get('acpl', 'N/A')}")

print(f"\nGames as White: {vllm_stats.get('as_white', 'N/A')}")
print(f"Games as Black: {vllm_stats.get('as_black', 'N/A')}")

### Interpreting Results

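**TrueSkill Rating** (rough guide only: TrueSkill and Elo are different scales, and ratings from a small tournament depend on the opponent pool):
- Conservative rating 15-20: ~1200-1400 Elo (beginner)
- Conservative rating 20-25: ~1400-1600 Elo (intermediate)
- Conservative rating 25-30: ~1600-1800 Elo (advanced)
- Conservative rating 30+: ~1800+ Elo (expert)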

**Move Quality:**
- Accuracy >60%: Model frequently finds Stockfish's top move
- ACPL <50: Good move quality (average mistake < half a pawn)
- ACPL <30: Excellent move quality (near-optimal play)
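
As a sketch of how these two metrics are commonly computed, the snippet below takes hypothetical per-move records and derives accuracy and ACPL from them. The record layout and the function name `move_quality` are assumptions for illustration, and the tournament tool's exact definitions may differ (for example, it may cap very large losses). Evaluations are in centipawns from the moving side's point of view.

```python
def move_quality(records):
    """Return (accuracy_pct, acpl) for a list of per-move records.

    Each record holds the move played, the engine's top move, and the
    engine evaluation (centipawns) after each of those moves.
    """
    if not records:
        return 0.0, 0.0
    # Share of moves that match the engine's first choice.
    matches = sum(1 for r in records if r["played"] == r["best"])
    # Loss per move: how much worse the played move is than the best one.
    losses = [max(0, r["best_eval_cp"] - r["played_eval_cp"]) for r in records]
    accuracy_pct = 100.0 * matches / len(records)
    acpl = sum(losses) / len(losses)
    return accuracy_pct, acpl


# Hypothetical data, for illustration only.
sample = [
    {"played": "e2e4", "best": "e2e4", "best_eval_cp": 30, "played_eval_cp": 30},
    {"played": "g1f3", "best": "d2d4", "best_eval_cp": 40, "played_eval_cp": 25},
    {"played": "f1c4", "best": "f1c4", "best_eval_cp": 35, "played_eval_cp": 35},
    {"played": "d1h5", "best": "e1g1", "best_eval_cp": 50, "played_eval_cp": -95},
]

accuracy_pct, acpl = move_quality(sample)
print(f"Accuracy: {accuracy_pct:.1f}%  ACPL: {acpl:.1f}")
```

A single blunder (the last move loses 145 centipawns) dominates the average, so ACPL is best read alongside accuracy rather than on its own.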

## Step 8: View Sample Game

Let's look at a sample game from the tournament.

In [None]:
# Get first game involving vLLM
vllm_game = None
for game in results['games']:
    if 'vllm' in [game['white_agent'], game['black_agent']]:
        vllm_game = game
        break

if vllm_game:
    print("Sample Game:")
    print("=" * 60)
    print(f"White: {vllm_game['white_agent']}")
    print(f"Black: {vllm_game['black_agent']}")
    print(f"Result: {vllm_game['result']}")
    print(f"Moves: {vllm_game['moves_played']}")
    print(f"Reason: {vllm_game['termination_reason']}")
    print(f"\nPGN file: {vllm_game['pgn_file']}")
    print(f"\nYou can view this game at: https://lichess.org/paste")
else:
    print("No vLLM games found in results.")

## Step 9: Compare Different Parallelism Levels

Let's visualize how concurrency affects performance.

In [None]:
# This data comes from the concurrency benchmark we ran
concurrency_data = {
    'Concurrency': [1, 2, 4, 8],
    'Throughput (req/s)': [1.53, 1.89, 2.15, 2.17],
    'Mean Latency (s)': [0.653, 1.057, 1.860, 3.315],
    'Speedup': [1.00, 1.23, 1.40, 1.42]
}

df_concurrency = pd.DataFrame(concurrency_data)

print("Concurrency Performance:")
print("=" * 70)
print(df_concurrency.to_string(index=False))
print("\nKey Insights:")
print("- Throughput plateaus at parallelism=4 (matches server max_num_seqs)")
print("- 1.4x throughput improvement with automatic batching")
print("- Individual latency increases, but total time decreases")
print("- Minimal benefit beyond parallelism=4 (server saturated)")
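
The speedup column and the saturation point can also be derived from the throughput figures rather than typed in by hand. The sketch below uses plain Python. The helper `saturation_point` and its 5% threshold are assumptions chosen for illustration.

```python
concurrency = [1, 2, 4, 8]
throughput = [1.53, 1.89, 2.15, 2.17]  # req/s, from the table above

# Speedup relative to the single-request baseline.
speedups = [t / throughput[0] for t in throughput]


def saturation_point(levels, rates, min_gain=0.05):
    """Return the last level before the relative gain drops below `min_gain`."""
    for i in range(1, len(levels)):
        if rates[i] / rates[i - 1] - 1 < min_gain:
            return levels[i - 1]
    return levels[-1]


for level, rate, speedup in zip(concurrency, throughput, speedups):
    print(f"concurrency={level}: {rate:.2f} req/s, speedup {speedup:.2f}x")

print(f"Throughput saturates at concurrency={saturation_point(concurrency, throughput)}")
```

The recomputed speedups differ from the table by about 0.01, presumably because the table's throughput values are themselves rounded. Going from 4 to 8 concurrent requests adds under 1% throughput, which is what marks 4 as the saturation point.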

## Summary

Congratulations! You've successfully:

✅ Run chess tournaments with TrueSkill ratings  
✅ Leveraged automatic request batching for 1.4x throughput  
✅ Evaluated model strength against multiple baselines  
✅ Analyzed performance with detailed metrics  
✅ Understood concurrency trade-offs  

**Key Takeaways:**

1. **Automatic Concurrency**: Process-level parallelism (`p_map`) + vLLM continuous batching work together automatically
2. **Optimal Parallelism**: Set `--parallelism` to match `max_num_seqs` (typically 4) for best throughput
3. **Throughput vs Latency**: Individual requests slower, but total throughput higher
4. **Real-world Benefit**: Tournaments reach ~1.4x throughput with parallelism=4, which is roughly 29% less wall-clock time for the same number of games

**Next Steps:**

- Fine-tune your model further based on tournament weaknesses
- Experiment with different training datasets
- Test against stronger opponents (Stockfish skill 15-20)
- Deploy to production with learned configurations
- Share your results with the community!