# Chess Model Deployment with vLLM on Trainium

## Overview

This notebook demonstrates deploying a fine-tuned chess model using vLLM on AWS Trainium. You'll learn how to:

- Set up vLLM with Neuron backend for Trainium
- Deploy a compiled chess model with batch support
- Test the deployed model with interactive games
- Validate performance and understand concurrency benefits

**Prerequisites:**
- Complete the fine-tuning lab or use a pre-trained chess model
- Trainium instance (trn1.2xlarge or larger)
- Neuron SDK 2.20+ (pre-installed in Workshop Studio)

**Duration:** 30-40 minutes (including model compilation)

## Step 1: Install Dependencies

First, let's install the chess environment and its dependencies.

In [None]:
%cd /home/ubuntu/environment/neuron-workshops/labs/vLLM/Chess/assets

# Install chess environment dependencies
!pip install -q -r requirements.txt

print("\n✅ Dependencies installed successfully!")

## Step 2: Verify Neuron SDK Environment

Let's check that your Neuron SDK installation has the required versions.

In [None]:
import subprocess
import sys

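def get_version(package):
    """Return the installed version of `package`, or "Not installed"."""
    try:
        result = subprocess.run([sys.executable, "-m", "pip", "show", package],
                                capture_output=True, text=True, check=True)
    except subprocess.CalledProcessError:
        # pip show exits non-zero when the package is missing
        return "Not installed"
    for line in result.stdout.splitlines():
        if line.startswith('Version:'):
            return line.split(':', 1)[1].strip()
    # No Version line in the output: treat as not installed instead of returning None
    return "Not installed"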

print("Neuron SDK Environment:")
print(f"  neuronx-cc: {get_version('neuronx-cc')}")
print(f"  torch-neuronx: {get_version('torch-neuronx')}")
print(f"  optimum-neuron: {get_version('optimum-neuron')}")
print(f"  torch: {get_version('torch')}")

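# Check minimum versions
import re

min_versions = {
    'neuronx-cc': '2.15',
    'torch-neuronx': '2.1',
    'optimum-neuron': '0.3.0'
}

def version_tuple(version):
    """Convert '2.15.128.0+abc123' to (2, 15, 128, 0) for numeric comparison."""
    return tuple(int(n) for n in re.findall(r'\d+', version.split('+')[0]))

print("\nVersion Check:")
all_good = True
for pkg, min_ver in min_versions.items():
    current = get_version(pkg)
    # Compare numerically: as strings, '2.9' >= '2.15' would wrongly be True
    if current and current != "Not installed" and version_tuple(current) >= version_tuple(min_ver):
        print(f"✓ {pkg}: {current} (minimum: {min_ver})")
    else:
        print(f"⚠ {pkg}: {current} (minimum required: {min_ver})")
        all_good = False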

if all_good:
    print("\n✅ Environment ready for chess deployment!")
else:
    print("\n⚠️  Some packages may need updating. See README troubleshooting section.")

## Step 3: Check Neuron Hardware

Verify Neuron cores are available and not in use.

In [None]:
!neuron-ls

**Expected output:** You should see 2 Neuron cores available for trn1.2xlarge.

If cores are in use, you can free them with: `!pkill -f vllm`
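
A leftover server process can also be spotted from Python before you kill anything. The sketch below only checks whether something is listening on the server port. It assumes port 8000, taken from the `base_url` used later in this notebook, and the helper name `port_in_use` is hypothetical. A listening port does not prove the Neuron cores are held, so treat it as a complement to `neuron-ls`, not a replacement.

```python
import socket

def port_in_use(host: str = "127.0.0.1", port: int = 8000, timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise
        return sock.connect_ex((host, port)) == 0

if port_in_use():
    print("Port 8000 is already in use; a previous vLLM server may still be running.")
else:
    print("Port 8000 is free.")
```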

## Step 4: Configure Model Path

Set the path to your fine-tuned chess model. If you completed the fine-tuning lab, update the path below.

Otherwise, we'll use a pre-compiled model from the repository.
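
If you are unsure which of the two locations exists on your instance, a small lookup can pick the first one that is present. This is only a sketch: the two paths are the ones listed in the cell below, the fine-tuned path is a placeholder you may need to change, and the helper name `first_existing_dir` is hypothetical.

```python
import os

# Candidate locations, in order of preference (both taken from the cell below).
CANDIDATE_PATHS = [
    "/home/ubuntu/environment/ChessLM_Qwen3_Trainium",  # Option 1: your fine-tuned model
    "/home/ubuntu/ChessLM_Qwen3_Trainium",              # Option 2: pre-compiled default
]

def first_existing_dir(paths):
    """Return the first path that is an existing directory, or None."""
    for path in paths:
        if os.path.isdir(path):
            return path
    return None

print(f"Selected model path: {first_existing_dir(CANDIDATE_PATHS)}")
```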

In [None]:
import os

# Option 1: Use your fine-tuned model (update path if you completed fine-tuning lab)
# MODEL_PATH = "/home/ubuntu/environment/ChessLM_Qwen3_Trainium"

# Option 2: Use pre-compiled model from repository (default)
MODEL_PATH = "/home/ubuntu/ChessLM_Qwen3_Trainium"

# Verify model exists
if os.path.exists(MODEL_PATH):
    print(f"✅ Model found at: {MODEL_PATH}")
    
    # Check if model is compiled
    if os.path.exists(os.path.join(MODEL_PATH, "model.pt")):
        print("✅ Pre-compiled model found (model.pt)")
        
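        # Check neuron_config (skip the summary if the file is missing)
        import json
        config_path = os.path.join(MODEL_PATH, "neuron_config.json")
        if os.path.exists(config_path):
            with open(config_path) as f:
                config = json.load(f)
            print(f"   Batch size: {config.get('batch_size', 'N/A')}")
            print(f"   Continuous batching: {config.get('continuous_batching', 'N/A')}")
            print(f"   Tensor parallel: {config.get('tp_degree', 'N/A')}")
        else:
            print("   neuron_config.json not found; skipping config summary.")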
    else:
        print("⚠️  Model not compiled yet. Will compile on first run (10-30 min).")
else:
    print(f"❌ Model not found at: {MODEL_PATH}")
    print("   Please update MODEL_PATH or complete the fine-tuning lab first.")

## Step 5: Start vLLM Server

Now let's start the vLLM server with Neuron backend. The server is configured for:
- **batch_size=4**: Can process up to 4 requests concurrently
- **continuous_batching=true**: Automatically groups concurrent requests
- **tensor_parallel_size=2**: Uses both Neuron cores on trn1.2xlarge

**Note:** First-time startup includes model compilation which takes 10-30 minutes. Subsequent starts are fast (<1 min).

In [None]:
# Update vllm.sh with correct model path
import os

vllm_script = "vllm-server/vllm.sh"
with open(vllm_script, 'r') as f:
    content = f.read()

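# Update MODEL_PATH in script (matches whatever path is currently set,
# so re-running this cell with a different MODEL_PATH still works)
import re

content, n_replaced = re.subn(
    r'MODEL_PATH="[^"]*"',
    lambda _: f'MODEL_PATH="{MODEL_PATH}"',
    content,
    count=1,
)

if n_replaced:
    with open(vllm_script, 'w') as f:
        f.write(content)
    print(f"✅ Updated vllm.sh with model path: {MODEL_PATH}")
else:
    print('⚠️  No MODEL_PATH="..." assignment found in vllm.sh. Edit the script manually.')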

In [None]:
# Start vLLM server in background
!cd vllm-server && bash vllm.sh

print("\nvLLM server starting...")
print("This may take 10-30 minutes on first run (model compilation).")
print("Subsequent runs will be fast (<1 minute).")
print("\nWait for 'Application startup complete' message before proceeding.")

**While waiting for server startup:**

You can monitor the server logs in the terminal:
```bash
tail -f /home/ubuntu/environment/neuron-workshops/labs/vLLM/Chess/assets/vllm-server/vllm-server.log
```

Or check Neuron core usage:
```bash
watch -n 2 neuron-ls
```
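
If you would rather wait from inside the notebook, the log can be polled for the startup message instead of watched by hand. The sketch below is one possible approach, not part of the workshop assets: the function name `wait_for_log_line` is hypothetical, the default timeout of 1800 seconds mirrors the 30-minute upper bound quoted above, and a log left over from an earlier run may already contain the message.

```python
import time

def wait_for_log_line(log_path, needle="Application startup complete",
                      timeout_s=1800.0, poll_s=5.0):
    """Poll log_path until some line contains needle.

    Returns True when the line is found, False if timeout_s elapses first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with open(log_path, "r", errors="replace") as f:
                if any(needle in line for line in f):
                    return True
        except FileNotFoundError:
            pass  # the server may not have created the log yet
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

A call such as `wait_for_log_line("vllm-server/vllm-server.log")` would block until the message appears or the timeout expires.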

## Step 6: Test Server Connection

Once the server has started, let's test the connection.
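
The cell below waits a fixed 10 seconds, which can be too short or needlessly long. A polling loop is a common alternative. The sketch here assumes the server exposes the OpenAI-compatible `GET /v1/models` route on `localhost:8000`, and the names `models_endpoint_ok` and `wait_until_ready` are hypothetical.

```python
import http.client
import time
import urllib.request

def models_endpoint_ok(base_url="http://localhost:8000/v1", timeout_s=2.0):
    """Return True if GET {base_url}/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=timeout_s) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status, or malformed reply
        return False

def wait_until_ready(probe=models_endpoint_ok, timeout_s=60.0, poll_s=2.0):
    """Call probe() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

Calling `wait_until_ready()` in place of `time.sleep(10)` returns as soon as the endpoint responds, and a `False` result indicates that the server did not become ready within the timeout.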

In [None]:
import time
import openai

# Give server a moment to finish startup
print("Waiting for server to be fully ready...")
time.sleep(10)

# Initialize OpenAI client pointing to vLLM server
client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # placeholder; vLLM only checks the key if started with --api-key
)

# Test connection
try:
    response = client.chat.completions.create(
        model=MODEL_PATH,
        messages=[{"role": "user", "content": "Hello"}],
        max_tokens=10,
        extra_body={"chat_template_kwargs": {"enable_thinking": False}}
    )
    print("✅ vLLM server is running!")
    print(f"\nTest response: {response.choices[0].message.content}")
except Exception as e:
    print(f"❌ Connection failed: {e}")
    print("\nTroubleshooting:")
    print("1. Check if server is still starting (check logs)")
    print("2. Verify Neuron cores are not in use: neuron-ls")
    print("3. Check for errors: tail vllm-server/vllm-server.log")

## Step 7: Play Your First Chess Game

Now that the server is running, let's play a quick game against a random opponent to verify everything works.

In [None]:
# Import chess environment
import sys
sys.path.insert(0, '/home/ubuntu/environment/neuron-workshops/labs/vLLM/Chess/assets')

from env import ChessEnvironment
from agents import VLLMAgent, RandomAgent

# Create agents
vllm_agent = VLLMAgent(
    base_url="http://localhost:8000/v1",
    model=MODEL_PATH,
    temperature=0.1
)

random_agent = RandomAgent()

# Create environment
env = ChessEnvironment(
    white_agent=vllm_agent,
    black_agent=random_agent,
    max_moves=100
)

# Play game
print("Playing chess game: vLLM vs Random\n")
result = env.play_game(verbose=True)

print("\n" + "="*60)
print("Game Result")
print("="*60)
print(f"Result: {result['result']}")
print(f"Moves played: {result['moves_played']}")
print(f"Reason: {result['game_over_reason']}")
print(f"\nWhite (vLLM) illegal attempts: {result['white_illegal_attempts']}")
print(f"Black (Random) illegal attempts: {result['black_illegal_attempts']}")

## Step 8: Test Against Stockfish

Now let's test against a real chess engine to get a sense of the model's strength.

In [None]:
from agents import StockfishAgent

# Create Stockfish opponent (skill level 1 = beginner)
stockfish = StockfishAgent(skill_level=1, depth=2)

# Create environment
env = ChessEnvironment(
    white_agent=vllm_agent,
    black_agent=stockfish,
    max_moves=100
)

# Play game
print("Playing chess game: vLLM vs Stockfish (skill=1)\n")
result = env.play_game(verbose=True)

print("\n" + "="*60)
print("Game Result")
print("="*60)
print(f"Result: {result['result']}")
print(f"Moves played: {result['moves_played']}")
print(f"Reason: {result['game_over_reason']}")
print(f"\nYour model's illegal move attempts: {result['white_illegal_attempts']}")

## Step 9: Validate Performance

Let's measure single-move latency to understand performance.

In [None]:
import chess
import time

# Create a fresh board
board = chess.Board()
legal_moves = list(board.legal_moves)

# Measure latency over 10 requests, each asked from the same starting position
latencies = []
print("Measuring latency over 10 move requests...\n")

for i in range(10):
    # perf_counter is monotonic and higher resolution than time.time
    start = time.perf_counter()
    move, comment = vllm_agent.choose_move(
        board=board.copy(),
        legal_moves=legal_moves,
        move_history=[],
        side_to_move="White"
    )
    latency = time.perf_counter() - start
    latencies.append(latency)
    print(f"Request {i+1}: {latency:.3f}s")

# Calculate statistics
import statistics
print("\n" + "="*60)
print("Performance Metrics")
print("="*60)
print(f"Mean latency: {statistics.mean(latencies):.3f}s")
print(f"Median latency: {statistics.median(latencies):.3f}s")
print(f"Min latency: {min(latencies):.3f}s")
print(f"Max latency: {max(latencies):.3f}s")
print(f"Throughput: {1/statistics.mean(latencies):.2f} moves/second")

# Expected performance
expected_latency = 0.65  # seconds
if statistics.mean(latencies) < expected_latency * 1.2:
    print("\n✅ Performance looks good!")
else:
    print(f"\n⚠️  Latency higher than expected (~{expected_latency}s). Check server configuration.")

## Understanding Concurrency Benefits

Your vLLM server is configured with:
- **`max_num_seqs=4`**: Can batch up to 4 requests
- **`continuous_batching=true`**: Automatically groups concurrent requests

### How It Works:

When you run multiple games in parallel (covered in Chess-Tournament.ipynb), the tournament scheduler uses Python multiprocessing to run games concurrently. Each game makes HTTP requests to the same vLLM server.

**With 1 game (sequential):**
- Latency: ~0.65s per move
- Throughput: 1.54 moves/second (1 / 0.65)

**With 4 games (parallel):**
- Latency per game: ~1.86s per move (each request shares a batch with requests from the other games, so an individual move takes longer)
- Total throughput: 2.15 moves/second across all games (4 / 1.86)
- **Improvement: 1.4x throughput** (the same set of games finishes in roughly 72% of the sequential wall-clock time)

**Why the improvement?**
The vLLM server batches concurrent requests together, processing them simultaneously on the Neuron cores. This amortizes the overhead of model inference across multiple requests.
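
The figures above follow from simple arithmetic, reproduced in the sketch below. The latencies are the ones quoted in this section. The function name `throughput` is hypothetical, and the model assumes every concurrent game receives one move per latency interval.

```python
def throughput(concurrent_games: int, latency_s: float) -> float:
    """Moves per second when each of concurrent_games gets one move every latency_s."""
    return concurrent_games / latency_s

single = throughput(1, 0.65)      # one game, 0.65 s per move
parallel = throughput(4, 1.86)    # four games, 1.86 s per move each
speedup = parallel / single       # throughput ratio
time_fraction = single / parallel # wall-clock time relative to sequential

print(f"Sequential: {single:.2f} moves/s")
print(f"Parallel:   {parallel:.2f} moves/s")
print(f"Speedup:    {speedup:.2f}x")
print(f"Wall-clock: {time_fraction:.0%} of sequential")
```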

### Testing Concurrency:

You'll see this in action in the next notebook (Chess-Tournament.ipynb), where running with `--parallelism 4` enables automatic batching.
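
To get a feel for the effect before opening that notebook, you can time a handful of requests issued at once. The sketch below uses threads instead of the multiprocessing used by the tournament scheduler, and `fake_request` is a stand-in that only sleeps. The names `timed_batch` and `fake_request` are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def timed_batch(send_request, n_requests=4):
    """Run send_request(i) for i in range(n_requests) concurrently.

    Returns (wall_seconds, per_request_seconds).
    """
    def timed(i):
        start = time.perf_counter()
        send_request(i)
        return time.perf_counter() - start

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_requests) as pool:
        latencies = list(pool.map(timed, range(n_requests)))
    return time.perf_counter() - wall_start, latencies

def fake_request(i):
    time.sleep(0.05)  # stand-in for an HTTP call to the vLLM server

wall, latencies = timed_batch(fake_request)
print(f"Wall time: {wall:.3f}s for {len(latencies)} requests")
print(f"Throughput: {len(latencies) / wall:.2f} requests/second")
```

Replacing `fake_request` with a function that calls `client.chat.completions.create(...)` as in Step 6 would measure the real server, with `n_requests=4` matching the batch size configured above.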

## Summary

Congratulations! You've successfully:

✅ Deployed a chess model via vLLM on Trainium  
✅ Configured vLLM for batching (max_num_seqs=4, continuous_batching)  
✅ Tested the model with interactive games  
✅ Validated performance metrics  
✅ Understood concurrency benefits (1.4x throughput improvement)

**Next Steps:**

Continue to [Chess-Tournament.ipynb](Chess-Tournament.ipynb) to:
- Run competitive tournaments with TrueSkill ratings
- Evaluate model strength against various Stockfish levels
- See automatic batching in action with parallel games
- Analyze detailed performance metrics

**Keep the vLLM server running** for the next notebook!