<a href="https://colab.research.google.com/github/kuds/reinforce-tactics/blob/main/notebooks/ppo_baseline_benchmark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Reinforce Tactics — PPO Baseline Training Benchmarks

This notebook trains a **MaskablePPO** agent against `SimpleBot` on the 6×6 beginner map
and records reference metrics at four training checkpoints:

| Checkpoint | Timesteps |
|------------|-----------|
| 1 | 10,000 |
| 2 | 50,000 |
| 3 | 200,000 |
| 4 | 1,000,000 |

At each checkpoint the agent is evaluated over **50 episodes** and we record:
- **Win rate** (% of games won against SimpleBot)
- **Average episode reward**
- **Average episode length** (steps)

The goal is to provide a **reference curve** so that users can run the same
notebook and compare their results to known-good training runs.

**Runtime:** CPU is fine (~20–40 min total). A GPU may shorten this.

---

### Why MaskablePPO?

The game has a `MultiDiscrete` action space where many action combinations
are invalid at any given time (e.g. you can’t attack a tile with no enemy).
**Action masking** prevents the agent from sampling these invalid actions,
which typically yields 2–3× faster convergence compared to plain PPO.
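
To make the idea concrete, here is a minimal NumPy sketch of masking applied to a single categorical action distribution. It is a conceptual illustration only, not the code `sb3-contrib` runs internally; `masked_softmax`, the logits, and the mask are all hypothetical.

```python
import numpy as np

def masked_softmax(logits, mask):
    """Softmax over `logits` where masked-out (False) entries get probability 0."""
    logits = np.asarray(logits, dtype=np.float64)
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        raise ValueError("at least one action must be valid")
    # Invalid actions get a logit of -inf, so exp(-inf) == 0 removes them.
    masked = np.where(mask, logits, -np.inf)
    masked = masked - masked[mask].max()   # shift for numerical stability
    exp = np.exp(masked)
    return exp / exp.sum()

logits = np.array([1.0, 2.0, 0.5, 3.0])        # hypothetical policy outputs
mask = np.array([True, False, True, False])    # hypothetical validity mask
probs = masked_softmax(logits, mask)

# Sampling from the masked distribution can only return valid actions.
rng = np.random.default_rng(0)
samples = rng.choice(len(probs), size=1000, p=probs)
print(probs, sorted(set(samples.tolist())))
```

Because invalid actions carry zero probability, they are never sampled and contribute nothing to the policy gradient, which is why the agent does not have to spend training steps discovering that they are useless.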

## 1. Setup

In [None]:
# Install dependencies
!pip install -q gymnasium stable-baselines3 sb3-contrib tensorboard pandas numpy torch matplotlib

import torch
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Device: {DEVICE}")
if DEVICE == 'cuda':
    print(f"  GPU: {torch.cuda.get_device_name(0)}")

In [None]:
# Clone repo and install as a package
import os, sys
from pathlib import Path

REPO_DIR = Path('reinforce-tactics')
if REPO_DIR.exists():
    os.chdir(REPO_DIR)
elif Path('notebooks').exists():
    # Already inside the repo
    os.chdir('..')
else:
    print('Cloning repository...')
    !git clone https://github.com/kuds/reinforce-tactics.git
    os.chdir(REPO_DIR)

# Install the package so all imports resolve
!pip install -q -e .

if os.getcwd() not in sys.path:
    sys.path.insert(0, os.getcwd())

print(f"Working directory: {os.getcwd()}")

## 2. Imports

In [None]:
import json
import time
from datetime import datetime

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sb3_contrib import MaskablePPO
from stable_baselines3.common.callbacks import CheckpointCallback, BaseCallback

from reinforcetactics.rl.masking import make_maskable_env, make_maskable_vec_env

print('All imports successful.')

## 3. Configuration

In [None]:
# --- Benchmark settings ---
MAP_FILE        = 'maps/1v1/beginner.csv'   # 6x6 beginner map
OPPONENT        = 'bot'                      # SimpleBot
MAX_STEPS       = 500                        # max steps per episode
N_ENVS          = 4                          # parallel training envs
SEED            = 42

# Action space mode:
#   'flat_discrete'  — exact per-action masks (recommended, eliminates invalid actions)
#   'multi_discrete' — per-dimension masks (over-approximation, original behaviour)
ACTION_SPACE    = 'flat_discrete'

# Checkpoints to evaluate
CHECKPOINTS     = [10_000, 50_000, 200_000, 1_000_000]
EVAL_EPISODES   = 50                         # episodes per evaluation

# PPO hyperparameters
PPO_CONFIG = dict(
    learning_rate = 3e-4,
    n_steps       = 2048,
    batch_size    = 64,
    n_epochs      = 10,
    gamma         = 0.99,
    gae_lambda    = 0.95,
    clip_range    = 0.2,
    ent_coef      = 0.01,
    vf_coef       = 0.5,
    max_grad_norm = 0.5,
)

# Output paths
BENCHMARK_DIR = Path('benchmarks/ppo_vs_simplebot')
BENCHMARK_DIR.mkdir(parents=True, exist_ok=True)

print(f'Map:          {MAP_FILE}')
print(f'Opponent:     {OPPONENT}')
print(f'Action space: {ACTION_SPACE}')
print(f'Checkpoints:  {CHECKPOINTS}')
print(f'Eval eps:     {EVAL_EPISODES}')
print(f'Output dir:   {BENCHMARK_DIR}')

## 4. Create environments

In [None]:
# Training envs (vectorized, headless)
vec_env = make_maskable_vec_env(
    n_envs=N_ENVS,
    map_file=MAP_FILE,
    opponent=OPPONENT,
    max_steps=MAX_STEPS,
    seed=SEED,
    use_subprocess=False,   # DummyVecEnv (safer in notebooks)
    action_space_type=ACTION_SPACE,
)

# Separate eval env (single env; evaluation uses deterministic actions)
eval_env = make_maskable_env(
    map_file=MAP_FILE,
    opponent=OPPONENT,
    max_steps=MAX_STEPS,
    action_space_type=ACTION_SPACE,
)

print(f'Observation space: {vec_env.observation_space}')
print(f'Action space:      {vec_env.action_space}')

## 5. Create MaskablePPO model

In [None]:
model = MaskablePPO(
    'MultiInputPolicy',
    vec_env,
    verbose=0,
    tensorboard_log=str(BENCHMARK_DIR / 'tensorboard'),
    device=DEVICE,
    seed=SEED,
    **PPO_CONFIG,
)

print('MaskablePPO model created.')
print(f'Policy:  {model.policy.__class__.__name__}')
print(f'Device:  {model.device}')

## 6. Evaluation helper

In [None]:
def evaluate_model(model, env, n_episodes=50):
    """
    Evaluate a trained model and return summary statistics.

    Returns dict with: win_rate, avg_reward, std_reward,
    avg_length, std_length, wins, losses, draws
    """
    wins, losses, draws = 0, 0, 0
    rewards, lengths = [], []

    for _ in range(n_episodes):
        obs, _ = env.reset()
        done = False
        ep_reward = 0.0
        ep_len = 0

        while not done:
            masks = env.action_masks()
            action, _ = model.predict(obs, deterministic=True, action_masks=masks)
            obs, reward, terminated, truncated, info = env.step(action)
            ep_reward += reward
            ep_len += 1
            done = terminated or truncated

        rewards.append(ep_reward)
        lengths.append(ep_len)

        winner = info.get('winner')
        if winner == 1:
            wins += 1
        elif winner is not None:
            losses += 1
        else:
            draws += 1

    return {
        'win_rate':    wins / n_episodes,
        'avg_reward':  float(np.mean(rewards)),
        'std_reward':  float(np.std(rewards)),
        'avg_length':  float(np.mean(lengths)),
        'std_length':  float(np.std(lengths)),
        'wins':        wins,
        'losses':      losses,
        'draws':       draws,
    }

print('evaluate_model() defined.')

## 7. Train and evaluate at each checkpoint

We train incrementally: 0 → 10K → 50K → 200K → 1M timesteps,
evaluating at each checkpoint.
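
One detail worth keeping in mind: PPO collects experience in whole rollouts of `N_ENVS × n_steps` timesteps, so a `learn()` call ends at the first rollout boundary at or beyond its budget rather than exactly on it. The sketch below works through that arithmetic for this notebook's settings, assuming each stage's budget is measured from the actual timestep count reached so far. `timesteps_after_learn` is a hypothetical helper, not part of Stable-Baselines3.

```python
import math

N_ENVS = 4        # from the configuration cell
N_STEPS = 2048    # PPO_CONFIG['n_steps']
CHECKPOINTS = [10_000, 50_000, 200_000, 1_000_000]

ROLLOUT = N_ENVS * N_STEPS   # timesteps collected per rollout

def timesteps_after_learn(current, budget):
    """Timesteps reached after training `budget` more steps from `current`,
    assuming training only stops on whole-rollout boundaries."""
    if budget <= 0:
        return current
    rollouts = math.ceil(budget / ROLLOUT)
    return current + rollouts * ROLLOUT

actual = []
current = 0
for target in CHECKPOINTS:
    current = timesteps_after_learn(current, target - current)
    actual.append(current)

for target, reached in zip(CHECKPOINTS, actual):
    print(f"target {target:>9,}  ->  reached {reached:>9,}")
```

Under these assumptions the first "10K" checkpoint is really evaluated after two full rollouts, which is worth remembering when comparing early checkpoints between runs that use a different `N_ENVS` or `n_steps`.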

In [None]:
results = []
trained_so_far = 0
start_time = time.time()

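for checkpoint_ts in CHECKPOINTS:
    # PPO trains in whole rollouts of N_ENVS * n_steps timesteps, so learn()
    # can overshoot its budget. Measure progress with model.num_timesteps
    # (updated below) so the overshoot does not accumulate across checkpoints.
    steps_to_train = max(checkpoint_ts - trained_so_far, 0)
    print(f'\n{"="*60}')
    print(f'Training {trained_so_far:,} -> {checkpoint_ts:,} '
          f'({steps_to_train:,} steps)...')
    print(f'{"="*60}')

    t0 = time.time()
    model.learn(
        total_timesteps=steps_to_train,
        reset_num_timesteps=False,
        progress_bar=True,
    )
    train_time = time.time() - t0
    trained_so_far = model.num_timesteps   # actual count, may exceed checkpoint_ts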

    # Save checkpoint
    ckpt_path = BENCHMARK_DIR / f'model_{checkpoint_ts}.zip'
    model.save(str(ckpt_path))
    print(f'Saved checkpoint: {ckpt_path}')

    # Evaluate
    print(f'Evaluating over {EVAL_EPISODES} episodes...')
    metrics = evaluate_model(model, eval_env, n_episodes=EVAL_EPISODES)
    metrics['timesteps'] = checkpoint_ts
    metrics['train_time_s'] = round(train_time, 1)
    results.append(metrics)

    print(f'  Win rate:       {metrics["win_rate"]*100:.1f}%')
    print(f'  Avg reward:     {metrics["avg_reward"]:.1f} '
          f'(+/- {metrics["std_reward"]:.1f})')
    print(f'  Avg length:     {metrics["avg_length"]:.1f} '
          f'(+/- {metrics["std_length"]:.1f})')
    print(f'  W/L/D:          {metrics["wins"]}/{metrics["losses"]}/{metrics["draws"]}')
    print(f'  Training time:  {train_time:.1f}s')

total_time = time.time() - start_time
print(f'\nTotal wall time: {total_time/60:.1f} minutes')

## 8. Results table

In [None]:
df = pd.DataFrame(results)
df['win_rate_pct'] = (df['win_rate'] * 100).round(1)
df['avg_reward'] = df['avg_reward'].round(1)
df['avg_length'] = df['avg_length'].round(1)

display_df = df[['timesteps', 'win_rate_pct', 'avg_reward', 'avg_length',
                  'wins', 'losses', 'draws', 'train_time_s']].copy()
display_df.columns = ['Timesteps', 'Win Rate (%)', 'Avg Reward',
                       'Avg Length', 'Wins', 'Losses', 'Draws',
                       'Train Time (s)']
display_df = display_df.set_index('Timesteps')
display_df

## 9. Training curves

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

ts = [r['timesteps'] for r in results]

# Win rate
ax = axes[0]
wr = [r['win_rate'] * 100 for r in results]
ax.plot(ts, wr, 'o-', color='#2196F3', linewidth=2, markersize=8)
ax.set_xlabel('Timesteps')
ax.set_ylabel('Win Rate (%)')
ax.set_title('Win Rate vs SimpleBot')
ax.set_xscale('log')
ax.set_ylim(-5, 105)
ax.axhline(y=70, color='green', linestyle='--', alpha=0.5, label='70% target')
ax.legend()
ax.grid(True, alpha=0.3)

# Average reward
ax = axes[1]
avg_r = [r['avg_reward'] for r in results]
std_r = [r['std_reward'] for r in results]
ax.plot(ts, avg_r, 'o-', color='#FF9800', linewidth=2, markersize=8)
ax.fill_between(ts,
                [a - s for a, s in zip(avg_r, std_r)],
                [a + s for a, s in zip(avg_r, std_r)],
                alpha=0.2, color='#FF9800')
ax.set_xlabel('Timesteps')
ax.set_ylabel('Average Reward')
ax.set_title('Average Episode Reward')
ax.set_xscale('log')
ax.grid(True, alpha=0.3)

# Episode length
ax = axes[2]
avg_l = [r['avg_length'] for r in results]
std_l = [r['std_length'] for r in results]
ax.plot(ts, avg_l, 'o-', color='#4CAF50', linewidth=2, markersize=8)
ax.fill_between(ts,
                [a - s for a, s in zip(avg_l, std_l)],
                [a + s for a, s in zip(avg_l, std_l)],
                alpha=0.2, color='#4CAF50')
ax.set_xlabel('Timesteps')
ax.set_ylabel('Average Length (steps)')
ax.set_title('Average Episode Length')
ax.set_xscale('log')
ax.grid(True, alpha=0.3)

fig.suptitle('PPO Baseline Benchmarks  |  6x6 beginner map  |  vs SimpleBot',
             fontsize=13, fontweight='bold', y=1.02)
fig.tight_layout()

fig.savefig(str(BENCHMARK_DIR / 'training_curves.png'),
            dpi=150, bbox_inches='tight')
print(f'Saved plot: {BENCHMARK_DIR / "training_curves.png"}')
plt.show()

## 9b. Diagnose training failures

If you see **0% win rate** and all games ending as **draws at 500 steps**, the agent
is not learning to win — it's farming shaping rewards. Run the cell below to
diagnose the specific failure mode and get targeted recommendations.

In [None]:
def diagnose_training(results):
    """Analyze benchmark results and print a diagnosis."""
    if not results:
        print("No results to analyze.")
        return

    # --- Collect signals ---
    win_rates = [r['win_rate'] for r in results]
    avg_rewards = [r['avg_reward'] for r in results]
    avg_lengths = [r['avg_length'] for r in results]
    draw_counts = [r['draws'] for r in results]
    loss_counts = [r['losses'] for r in results]
    win_counts = [r['wins'] for r in results]
    n_eval = results[0].get('wins', 0) + results[0].get('losses', 0) + results[0].get('draws', 0)

    all_draws = all(d == n_eval for d in draw_counts)
    all_max_len = all(abs(l - MAX_STEPS) < 1.0 for l in avg_lengths)
    rewards_positive = all(r > 0 for r in avg_rewards)
    rewards_declining = len(avg_rewards) >= 2 and avg_rewards[-1] < avg_rewards[0] * 0.7
    no_wins = all(w == 0.0 for w in win_rates)
    late_losses = loss_counts[-1] > 0 and all(lc == 0 for lc in loss_counts[:-1])
    all_losses_late = loss_counts[-1] == n_eval if len(loss_counts) > 0 else False
    negative_rewards = all(r < -1000 for r in avg_rewards[:-1]) if len(avg_rewards) > 1 else False

    print("=" * 65)
    print("  TRAINING DIAGNOSTICS")
    print("=" * 65)

    # --- Summary table ---
    print(f"\n{'Timesteps':>12}  {'WR':>6}  {'Reward':>10}  {'Length':>8}  {'W/L/D'}")
    print("-" * 58)
    for r in results:
        ts = r['timesteps']
        wr = r['win_rate'] * 100
        rw = r['avg_reward']
        ln = r['avg_length']
        wld = f"{r['wins']}/{r['losses']}/{r['draws']}"
        print(f"{ts:>12,}  {wr:>5.1f}%  {rw:>10.1f}  {ln:>8.1f}  {wld}")

    # --- Failure mode detection ---
    print(f"\n{'─' * 65}")
    print("  DIAGNOSIS")
    print(f"{'─' * 65}")

    if no_wins and all_draws and all_max_len and rewards_positive:
        # Flat-discrete stalemate pattern
        print("""
FAILURE MODE: Shaping-reward stalemate (flat_discrete)

The agent takes only valid actions (exact masking is working), earns
positive shaping rewards by creating units and controlling structures,
but never finishes a game within the step limit.

Root cause: The agent has NEVER experienced a +1000 win or -1000 loss
reward. With no terminal signal, it optimizes shaping rewards instead
of pursuing victory. Both sides create units and trade blows without
either achieving total elimination within 500 steps.
""")
        if rewards_declining:
            print(f"⚠  Reward dropped from {avg_rewards[0]:.0f} → {avg_rewards[-1]:.0f}")
            print(f"   by {results[-1]['timesteps']:,} steps, suggesting policy degradation over time.\n")

    elif no_wins and negative_rewards and all_max_len:
        # Multi-discrete invalid action penalty pattern
        print("""
FAILURE MODE: Invalid-action penalty flood (multi_discrete)

~99% of sampled actions are game-invalid due to per-dimension mask
over-approximation. The -10 penalty per invalid action dominates the
reward signal (about -4,950 per 500-step episode), making it impossible
to learn from actual gameplay.
""")
        if all_losses_late:
            print("At the final checkpoint the agent collapsed to spamming end_turn to avoid\n"
                  "penalties, letting SimpleBot win every game.\n")

    elif no_wins and late_losses:
        print("""
FAILURE MODE: Policy collapse

Early episodes stalemate (draws), then the agent collapses to a
degenerate strategy (e.g., always ending turn) and starts losing.
""")
    else:
        # Partial learning or unknown
        if max(win_rates) > 0:
            best_idx = win_rates.index(max(win_rates))
            print(f"\nBest win rate: {max(win_rates)*100:.1f}% at "
                  f"{results[best_idx]['timesteps']:,} timesteps.")
            if win_rates[-1] < max(win_rates):
                print("Win rate is declining — possible overfitting or learning rate too high.")
        else:
            print("\n0% win rate across all checkpoints. Review environment configuration.")

    # --- Should I train longer? ---
    print(f"{'─' * 65}")
    print("  SHOULD YOU TRAIN LONGER?")
    print(f"{'─' * 65}")

    if no_wins and all_draws:
        print("""
  NO. Training longer will not help.

  The agent is stuck in a local optimum (maximizing shaping rewards
  without learning to win). More timesteps will not change this —
  the agent needs a different reward structure to break out.
""")
    elif no_wins:
        print("""
  NO. The current configuration has a fundamental issue preventing
  the agent from learning to win. Fix the issues below first.
""")
    elif len(win_rates) >= 2 and win_rates[-1] > win_rates[-2]:
        print("""
  MAYBE. Win rate is still increasing — more training could help.
  But check if the rate of improvement is slowing significantly.
""")
    else:
        print("""
  PROBABLY NOT. Win rate has plateaued or is declining.
  Consider tuning hyperparameters or reward configuration.
""")

    # --- Recommendations ---
    print(f"{'─' * 65}")
    print("  RECOMMENDED FIXES (in priority order)")
    print(f"{'─' * 65}")

    fixes = []
    if no_wins and all_draws and all_max_len:
        fixes = [
            ("Reduce max_steps to 200",
             "  500 steps is far too many for a 6×6 map. Shorter episodes force\n"
             "  earlier confrontation and make terminal rewards reachable.\n"
             "  → Change: MAX_STEPS = 200"),
            ("Add a truncation penalty",
             "  When episodes are truncated (timeout), the agent sees no terminal\n"
             "  signal. Penalize truncation so the agent learns that stalling is bad.\n"
             "  → Add to reward_config: 'draw': -200.0"),
            ("Reduce shaping reward magnitudes",
             "  Shaping rewards (structure_control=5.0, unit_diff=1.0) are too\n"
             "  generous and create a comfortable local optimum. Scale them down.\n"
             "  → Set: structure_control=1.0, unit_diff=0.3, income_diff=0.05"),
            ("Increase turn_penalty",
             "  The current -0.1 per end_turn creates no urgency. Make stalling\n"
             "  costly so the agent pushes to end games decisively.\n"
             "  → Set: turn_penalty=-1.0"),
            ("Increase ent_coef for exploration",
             "  ent_coef=0.01 allows the policy to narrow too quickly. Higher\n"
             "  entropy keeps the agent exploring aggressive strategies.\n"
             "  → Set: ent_coef=0.05 (try range 0.02–0.1)"),
            ("Start with a weaker opponent",
             "  Train against 'random' first so the agent experiences wins early,\n"
             "  then graduate to 'bot' (SimpleBot).\n"
             "  → Change: OPPONENT = 'random' for initial training"),
        ]
    elif negative_rewards:
        fixes = [
            ("Switch to flat_discrete action space",
             "  Eliminates 99% invalid actions caused by per-dimension masking.\n"
             "  → Set: ACTION_SPACE = 'flat_discrete'  (already the notebook default)"),
            ("Reduce invalid_action penalty",
             "  If using multi_discrete, reduce from -10 to -0.1.\n"
             "  → Add to reward_config: 'invalid_action': -0.1"),
        ]

    if not fixes:
        fixes = [
            ("Review environment and reward configuration",
             "  Check that action masking, opponent, and rewards are correctly set."),
        ]

    for i, (title, detail) in enumerate(fixes, 1):
        print(f"\n  {i}. {title}\n{detail}")

    print(f"\n{'=' * 65}")


diagnose_training(results)

## 9c. Evaluation replay log

Record every move the agent makes during evaluation and save to JSON.
This lets you inspect exactly what the agent is doing each step — is it
spamming end_turn? Never attacking? Ignoring enemies?

Set `N_REPLAY_EPISODES` to control how many episodes to record
(default 3 to keep file sizes small).

In [None]:
N_REPLAY_EPISODES = 3   # episodes to record per checkpoint

ACTION_NAMES = [
    'create_unit', 'move', 'attack', 'seize', 'heal',
    'end_turn', 'paralyze', 'haste', 'defence_buff', 'attack_buff',
]
UNIT_NAMES = ['W', 'M', 'C', 'A', 'K', 'R', 'S', 'B']


def _snapshot_game_state(env):
    """Capture a summary of the current game state."""
    gs = env.unwrapped.game_state
    ap = env.unwrapped.agent_player
    opp = 3 - ap
    return {
        'agent_gold': gs.player_gold.get(ap, 0),
        'opponent_gold': gs.player_gold.get(opp, 0),
        'agent_units': sum(1 for u in gs.units if u.player == ap),
        'opponent_units': sum(1 for u in gs.units if u.player == opp),
        'agent_structures': len(gs.grid.get_capturable_tiles(player=ap)),
        'opponent_structures': len(gs.grid.get_capturable_tiles(player=opp)),
        'turn': gs.turn_number,
    }


def evaluate_with_replay(model, env, n_episodes=3):
    """
    Run evaluation episodes and record every action to a replay log.

    Returns:
        List of episode dicts, each containing 'steps', 'outcome', etc.
    """
    episodes = []

    for ep_idx in range(n_episodes):
        obs, _ = env.reset()
        done = False
        ep_reward = 0.0
        steps = []
        step_num = 0

        while not done:
            masks = env.action_masks()
            action, _ = model.predict(obs, deterministic=True, action_masks=masks)

            # Decode the action BEFORE stepping
            raw_action = action
            if ACTION_SPACE == 'flat_discrete':
                action_idx = int(action)
                inner = env.unwrapped
                if 0 <= action_idx < len(inner._current_actions):
                    action_arr = inner._current_actions[action_idx]
                else:
                    action_arr = np.array([5, 0, 0, 0, 0, 0])
            else:
                action_arr = np.asarray(action)

            action_type = int(action_arr[0])
            if action_type < len(ACTION_NAMES):
                action_name = ACTION_NAMES[action_type]
            else:
                action_name = f'unknown_{action_type}'
            unit_type = UNIT_NAMES[int(action_arr[1]) % len(UNIT_NAMES)]
            from_pos = [int(action_arr[2]), int(action_arr[3])]
            to_pos = [int(action_arr[4]), int(action_arr[5])]

            # Step
            obs, reward, terminated, truncated, info = env.step(raw_action)
            ep_reward += reward
            step_num += 1
            done = terminated or truncated

            step_record = {
                'step': step_num,
                'action': action_name,
                'unit_type': unit_type if action_type == 0 else None,
                'from': from_pos,
                'to': to_pos,
                'reward': round(float(reward), 3),
                'cumulative_reward': round(float(ep_reward), 3),
                'valid': info.get('valid_action', True),
                'game_state': _snapshot_game_state(env),
            }
            steps.append(step_record)

        winner = info.get('winner')
        if winner == 1:
            outcome = 'win'
        elif winner is not None:
            outcome = 'loss'
        else:
            outcome = 'draw'

        episodes.append({
            'episode': ep_idx,
            'outcome': outcome,
            'total_reward': round(float(ep_reward), 2),
            'length': step_num,
            'steps': steps,
        })

    return episodes


print('evaluate_with_replay() defined.')

In [None]:
# Record replays from the final checkpoint
print(f'Recording {N_REPLAY_EPISODES} replay episodes from final model...\n')
replay_episodes = evaluate_with_replay(model, eval_env, n_episodes=N_REPLAY_EPISODES)

# Save to JSON
replay_path = BENCHMARK_DIR / 'eval_replays.json'
replay_data = {
    'metadata': {
        'timesteps': CHECKPOINTS[-1],
        'map': MAP_FILE,
        'opponent': OPPONENT,
        'action_space': ACTION_SPACE,
        'max_steps': MAX_STEPS,
    },
    'episodes': replay_episodes,
}
with open(replay_path, 'w') as f:
    json.dump(replay_data, f, indent=2)
print(f'Saved replays: {replay_path}')

from collections import Counter

# --- Print summary for each episode ---
for ep in replay_episodes:
    print(f'\n{"─" * 55}')
    print(f'Episode {ep["episode"]}  |  outcome={ep["outcome"]}  |  '
          f'length={ep["length"]}  |  reward={ep["total_reward"]}')
    print(f'{"─" * 55}')

    # Action distribution
    action_counts = Counter(s['action'] for s in ep['steps'])
    total = len(ep['steps'])
    print(f'\n  Action distribution ({total} steps):')
    for action, count in action_counts.most_common():
        pct = count / total * 100
        bar = '#' * int(pct / 2)
        print(f'    {action:15s}  {count:4d}  ({pct:5.1f}%)  {bar}')

    # Invalid action count
    invalid = sum(1 for s in ep['steps'] if not s['valid'])
    if invalid > 0:
        print(f'\n  Invalid actions: {invalid}/{total} ({invalid/total*100:.1f}%)')

    # Game state at end
    final = ep['steps'][-1]['game_state']
    print(f'\n  Final state (step {ep["length"]}):')
    print(f'    Agent:    {final["agent_units"]} units, '
          f'{final["agent_structures"]} structures, '
          f'{final["agent_gold"]} gold')
    print(f'    Opponent: {final["opponent_units"]} units, '
          f'{final["opponent_structures"]} structures, '
          f'{final["opponent_gold"]} gold')

    # Show first 10 and last 10 moves
    steps = ep['steps']
    print(f'\n  First 10 moves:')
    for s in steps[:10]:
        gs = s['game_state']
        ut = f' ({s["unit_type"]})' if s['unit_type'] else ''
        valid_marker = '' if s['valid'] else ' [INVALID]'
        print(f'    step {s["step"]:3d}  {s["action"]:15s}{ut:5s}  '
              f'{s["from"]}→{s["to"]}  r={s["reward"]:+.1f}  '
              f'units={gs["agent_units"]}v{gs["opponent_units"]}  '
              f'turn={gs["turn"]}{valid_marker}')

    if len(steps) > 20:
        print(f'    ... ({len(steps) - 20} steps omitted) ...')

    if len(steps) > 10:
        print(f'  Last 10 moves:')
        for s in steps[-10:]:
            gs = s['game_state']
            ut = f' ({s["unit_type"]})' if s['unit_type'] else ''
            valid_marker = '' if s['valid'] else ' [INVALID]'
            print(f'    step {s["step"]:3d}  {s["action"]:15s}{ut:5s}  '
                  f'{s["from"]}→{s["to"]}  r={s["reward"]:+.1f}  '
                  f'units={gs["agent_units"]}v{gs["opponent_units"]}  '
                  f'turn={gs["turn"]}{valid_marker}')
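
The saved replay file can also be analysed offline. As a rough sketch, the helper below pools the `action` field across all recorded episodes to show how often each action type is chosen; a value near 1.0 for `end_turn` would point to the turn-spamming behaviour described above. `action_fractions` is a hypothetical helper, and the `sample` dict is hand-written illustrative data that mirrors the replay schema rather than real output.

```python
import json
from collections import Counter

def action_fractions(replay_data):
    """Fraction of steps per action name, pooled across all recorded episodes."""
    counts = Counter(
        step['action']
        for episode in replay_data['episodes']
        for step in episode['steps']
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: n / total for name, n in counts.items()}

# Hand-written stand-in for eval_replays.json (illustrative data only).
sample = {
    'metadata': {'timesteps': 0, 'map': 'example', 'opponent': 'bot'},
    'episodes': [
        {'episode': 0, 'outcome': 'draw', 'steps': [
            {'step': 1, 'action': 'end_turn'},
            {'step': 2, 'action': 'move'},
            {'step': 3, 'action': 'end_turn'},
            {'step': 4, 'action': 'end_turn'},
        ]},
    ],
}

# Round-trip through JSON to mimic loading the file from disk.
fractions = action_fractions(json.loads(json.dumps(sample)))
print(fractions)
```

To run it on real data, replace `sample` with the result of `json.load()` on `eval_replays.json` in the benchmark directory.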

## 10. Save results

In [None]:
# Save benchmark results as JSON
benchmark_data = {
    'metadata': {
        'date': datetime.now().isoformat(),
        'map': MAP_FILE,
        'opponent': OPPONENT,
        'action_space': ACTION_SPACE,
        'max_steps': MAX_STEPS,
        'n_envs': N_ENVS,
        'eval_episodes': EVAL_EPISODES,
        'seed': SEED,
        'device': DEVICE,
        'ppo_config': PPO_CONFIG,
    },
    'results': results,
}

results_path = BENCHMARK_DIR / 'benchmark_results.json'
with open(results_path, 'w') as f:
    json.dump(benchmark_data, f, indent=2)

print(f'Saved results:  {results_path}')

# Also save as CSV for easy viewing
csv_path = BENCHMARK_DIR / 'benchmark_results.csv'
df.to_csv(csv_path, index=False)
print(f'Saved CSV:      {csv_path}')

# List all saved files
print(f'\nAll benchmark files:')
for p in sorted(BENCHMARK_DIR.iterdir()):
    size = p.stat().st_size
    if size > 1024 * 1024:
        size_str = f'{size / 1024 / 1024:.1f} MB'
    elif size > 1024:
        size_str = f'{size / 1024:.1f} KB'
    else:
        size_str = f'{size} B'
    print(f'  {p.name:40s}  {size_str}')

## 11. TensorBoard (optional)

Launch TensorBoard to inspect detailed training metrics (loss, entropy, etc.).

In [None]:
# Uncomment to launch TensorBoard inline:
# %load_ext tensorboard
# %tensorboard --logdir benchmarks/ppo_vs_simplebot/tensorboard

print('To view TensorBoard locally, run:')
print(f'  tensorboard --logdir {BENCHMARK_DIR / "tensorboard"}')

## 12. Interpreting the results

### What to expect

| Timesteps | Expected Win Rate | Notes |
|-----------|-------------------|-------|
| 10K | 0–15% | Agent is mostly random, learning basic actions |
| 50K | 15–40% | Agent starts making meaningful moves |
| 200K | 40–70% | Competent play, learns unit creation and combat |
| 1M | 60–90%+ | Strong play against SimpleBot |

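**Note:** Exact numbers depend on the random seed, library versions, and
hardware, and each win rate is estimated from only 50 episodes, so individual
checkpoints are noisy. What matters is that your curve has a similar *shape*:
win rate rising with training, with diminishing returns after ~200K steps.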
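
To judge whether a gap between your win rate and the table is meaningful, it helps to attach a confidence interval to each estimate. The sketch below uses the Wilson score interval for a binomial proportion. `wilson_interval` is a hypothetical helper, and treating episodes as independent trials is an assumption.

```python
import math

def wilson_interval(wins, n, z=1.96):
    """Approximate 95% Wilson score interval for a win rate of `wins` out of `n`."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = wins / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

# 25 wins in 50 evaluation episodes
low, high = wilson_interval(25, 50)
print(f"25/50 wins: {low:.1%} to {high:.1%}")

# 0 wins in 50 evaluation episodes
low0, high0 = wilson_interval(0, 50)
print(f" 0/50 wins: {low0:.1%} to {high0:.1%}")
```

With 50 episodes, a measured 50% win rate is compatible with a true win rate anywhere from roughly 37% to 63%, so two runs that differ by ten points at one checkpoint are not necessarily behaving differently. Raising `EVAL_EPISODES` narrows the interval at the cost of longer evaluation.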

### If your results differ significantly

- **Much worse:** Check that action masking is working (the agent should
  rarely attempt invalid actions), verify that the map file path is correct,
  and run the diagnostics in section 9b.
- **Much better:** You may have found better hyperparameters! Consider
  contributing them back.
- **Unstable (oscillating win rate):** Try reducing the learning rate
  or increasing the batch size.
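
One common way to reduce the learning rate is to decay it over training instead of picking a smaller constant. Stable-Baselines3 accepts a callable for `learning_rate` that receives the remaining training progress (1.0 at the start, 0.0 at the end). The sketch below shows a linear decay. `linear_schedule` is a hypothetical helper, and whether a decaying rate actually stabilises this benchmark is untested here.

```python
from typing import Callable

def linear_schedule(initial_value: float, final_value: float = 0.0) -> Callable[[float], float]:
    """Return a schedule that decays linearly from initial_value to final_value."""
    def schedule(progress_remaining: float) -> float:
        # progress_remaining goes from 1.0 (start of training) to 0.0 (end)
        return final_value + progress_remaining * (initial_value - final_value)
    return schedule

lr_schedule = linear_schedule(3e-4)
for progress in (1.0, 0.5, 0.0):
    print(f"progress_remaining={progress:.1f} -> lr={lr_schedule(progress):.2e}")
```

To try it, replace `learning_rate = 3e-4` in `PPO_CONFIG` with `learning_rate = linear_schedule(3e-4)`. Two caveats: a function is not JSON-serializable, so the results-saving cell would need to record the schedule parameters instead, and because this notebook calls `learn()` once per checkpoint, it is worth checking how the progress value evolves across those calls before relying on the schedule.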

### Next steps

1. **Try different maps:** Larger maps (10×10, 14×14) are harder
2. **Tune hyperparameters:** Adjust `ent_coef`, `learning_rate`, etc.
3. **Self-play training:** See `train/train_self_play.py`
4. **AlphaZero:** See `train/train_alphazero.py` for MCTS-based training

In [None]:
# Clean up environments
vec_env.close()
eval_env.close()
print('Done.')