<a href="https://colab.research.google.com/github/kuds/reinforce-tactics/blob/main/notebooks/%5BReinforce%20Tactics%5D%20PPO.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🎮 Reinforce Tactics - RL Training in Google Colab

This notebook trains an AI agent to play Reinforce Tactics using Reinforcement Learning.

**Features:**
- Headless training (no GUI needed)
- Stable-Baselines3 PPO algorithm
- TensorBoard monitoring
- Model saving and evaluation
- GPU acceleration support

**Runtime:** Use GPU runtime for faster training (Runtime → Change runtime type → GPU)

## 📦 Setup and Installation

In [None]:
# Install dependencies
!pip install -q gymnasium stable-baselines3[extra] tensorboard pandas numpy

# Check if GPU is available
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"\n✅ Using device: {device}")
if device == 'cuda':
    print(f"   GPU: {torch.cuda.get_device_name(0)}")

In [None]:
# Clone repository (replace with your repo URL)
import os
from pathlib import Path

if not Path('reinforcetactics').exists():
    print("📥 Setting up Reinforce Tactics...")
    # For this demo, we'll create the necessary files
    # In practice, you would: !git clone https://github.com/your-repo/reinforcetactics.git
    !mkdir -p reinforcetactics/{core,game,rl,ui,utils}
    !touch reinforcetactics/__init__.py
    print("✅ Setup complete!")
else:
    print("✅ Reinforce Tactics already set up")

# Add to Python path
import sys
if '/content' not in sys.path:
    sys.path.insert(0, '/content')

## 📝 Create Minimal Game Files

Because the full codebase is not available inside this notebook, the cells below write minimal versions of the key files: a constants module and a simplified Gymnasium environment.

In [None]:
%%writefile reinforcetactics/constants.py
"""Game constants"""

TILE_SIZE = 32
MIN_MAP_SIZE = 20
STARTING_GOLD = 250

UNIT_DATA = {
    'W': {'name': 'Warrior', 'cost': 200, 'movement': 3, 'health': 15, 'attack': 10},
    'M': {'name': 'Mage', 'cost': 250, 'movement': 2, 'health': 10, 'attack': {'adjacent': 8, 'range': 12}},
    'C': {'name': 'Cleric', 'cost': 200, 'movement': 2, 'health': 8, 'attack': 2}
}

HEADQUARTERS_INCOME = 150
BUILDING_INCOME = 100
TOWER_INCOME = 50

TOWER_MAX_HEALTH = 30
BUILDING_MAX_HEALTH = 40
HEADQUARTERS_MAX_HEALTH = 50

COUNTER_ATTACK_MULTIPLIER = 0.9
PARALYZE_DURATION = 3
HEAL_AMOUNT = 5
STRUCTURE_REGEN_RATE = 0.5

In [None]:
%%writefile reinforcetactics/rl/simple_env.py
"""Simplified Gymnasium environment for Colab training"""

import gymnasium as gym
from gymnasium import spaces
import numpy as np

class SimpleStrategyEnv(gym.Env):
    """Simplified strategy game environment for demonstration."""

    def __init__(self, grid_size=10, max_steps=200):
        super().__init__()

        self.grid_size = grid_size
        self.max_steps = max_steps
        self.current_step = 0

        # Simplified observation: grid + global features
        self.observation_space = spaces.Dict({
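            # The HP channel stores values up to 100, so the upper bound must cover it
            'grid': spaces.Box(low=0, high=100, shape=(grid_size, grid_size, 4), dtype=np.float32),
            # Gold grows by 50 on every end-turn action and has no fixed cap
            'global_features': spaces.Box(low=0, high=np.inf, shape=(6,), dtype=np.float32)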
        })

        # Simplified action space: [action_type, x, y]
        self.action_space = spaces.MultiDiscrete([4, grid_size, grid_size])

        self.reset()

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)

        self.current_step = 0
        self.player_hp = 100
        self.enemy_hp = 100
        self.player_gold = 500
        self.enemy_gold = 500
        self.player_units = 3
        self.enemy_units = 3

        # Simple grid representation
        self.grid = np.zeros((self.grid_size, self.grid_size, 4), dtype=np.float32)

        # Place player units
        self.grid[1, 1, 0] = 1  # Player marker
        self.grid[1, 1, 1] = 100  # HP

        # Place enemy units
        self.grid[8, 8, 0] = 2  # Enemy marker
        self.grid[8, 8, 1] = 100  # HP

        return self._get_obs(), {}

    def _get_obs(self):
        return {
            'grid': self.grid.copy(),
            'global_features': np.array([
                self.player_gold,
                self.enemy_gold,
                self.current_step / self.max_steps,
                self.player_units,
                self.enemy_units,
                max(self.player_hp, 0)  # HP can drop below 0 on the final hit; clip to the space's lower bound
            ], dtype=np.float32)
        }

    def step(self, action):
        self.current_step += 1

        action_type, x, y = action
        x = min(x, self.grid_size - 1)
        y = min(y, self.grid_size - 1)

        reward = 0.0

        # Simple action logic
        if action_type == 0:  # Attack
            if self.grid[y, x, 0] == 2:  # Enemy present
                # Use the env's seeded generator so reset(seed=...) makes runs reproducible
                damage = int(self.np_random.integers(5, 15))
                self.enemy_hp -= damage
                reward += damage * 0.5

        elif action_type == 1:  # Create unit
            if self.player_gold >= 100:
                self.player_gold -= 100
                self.player_units += 1
                reward += 5

        elif action_type == 2:  # Capture
            reward += 0.1

        elif action_type == 3:  # End turn
            self.player_gold += 50
            # Enemy turn (simple AI)
            if self.np_random.random() > 0.5 and self.enemy_gold >= 100:
                self.enemy_gold -= 100
                self.enemy_units += 1
            self.enemy_gold += 50

        # Enemy attacks (simplified)
        if self.np_random.random() > 0.7:
            damage = int(self.np_random.integers(3, 10))
            self.player_hp -= damage
            reward -= damage * 0.3

        # Check termination
        terminated = False
        if self.enemy_hp <= 0:
            reward += 1000
            terminated = True
        elif self.player_hp <= 0:
            reward -= 1000
            terminated = True

        truncated = self.current_step >= self.max_steps

        return self._get_obs(), reward, terminated, truncated, {
            'player_hp': self.player_hp,
            'enemy_hp': self.enemy_hp,
            'player_units': self.player_units,
            'enemy_units': self.enemy_units
        }

    def render(self):
        pass  # Headless mode
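
The environment's action is a `MultiDiscrete` triple `[action_type, x, y]`. As a reading aid, the sketch below decodes such a triple into text. The action names come from the comments in `step()`; the `ACTION_NAMES` table and the `describe_action` helper are hypothetical and not part of the environment. Because `step()` indexes the grid as `grid[y, x]`, `x` is treated as the column and `y` as the row.

```python
# Hypothetical lookup table; the meanings mirror the comments in step().
ACTION_NAMES = {0: 'attack', 1: 'create unit', 2: 'capture', 3: 'end turn'}

def describe_action(action):
    """Turn a MultiDiscrete sample [action_type, x, y] into readable text."""
    action_type, x, y = (int(v) for v in action)
    return f"{ACTION_NAMES[action_type]} at column {x}, row {y}"

# The enemy marker is placed at grid[8, 8], so this action targets it.
print(describe_action([0, 8, 8]))
```

Only the attack action uses the coordinates in this simplified environment; the other three action types ignore `x` and `y`.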

## 🚀 Training Configuration

In [None]:
# Training configuration
config = {
    'total_timesteps': 100_000,      # Increase for better results (500k - 1M recommended)
    'learning_rate': 3e-4,
    'n_steps': 2048,
    'batch_size': 64,
    'n_epochs': 10,
    'gamma': 0.99,
    'gae_lambda': 0.95,
    'clip_range': 0.2,
    'ent_coef': 0.01,
    'vf_coef': 0.5,
    'max_grad_norm': 0.5,
    'n_envs': 4,                     # Parallel environments
    'device': device,
}

print("📋 Training Configuration:")
for key, value in config.items():
    print(f"   {key}: {value}")
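
To make these numbers concrete, the sketch below works through how a PPO run in Stable-Baselines3 spends its timestep budget: each iteration collects `n_steps` transitions from every environment, then makes `n_epochs` passes over that rollout in minibatches of `batch_size`. The helper name `ppo_budget` is hypothetical and only does arithmetic on the values from the `config` dictionary above.

```python
import math

def ppo_budget(total_timesteps, n_steps, n_envs, batch_size, n_epochs):
    """Arithmetic only: how a PPO run divides its timestep budget."""
    rollout_size = n_steps * n_envs              # transitions collected per iteration
    minibatches = rollout_size // batch_size     # minibatches per epoch
    updates_per_iter = minibatches * n_epochs    # gradient steps per iteration
    # Training stops after the first rollout that reaches the budget,
    # so the iteration count is rounded up.
    iterations = math.ceil(total_timesteps / rollout_size)
    return {
        'rollout_size': rollout_size,
        'minibatches_per_epoch': minibatches,
        'updates_per_iteration': updates_per_iter,
        'iterations': iterations,
        'actual_timesteps': iterations * rollout_size,
    }

budget = ppo_budget(100_000, 2048, 4, 64, 10)
for name, value in budget.items():
    print(f"{name}: {value}")
```

With the defaults, one rollout holds 8,192 transitions and the run needs 13 rollouts, so it collects 106,496 timesteps rather than exactly 100,000.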

## 🎮 Create Training Environment

In [None]:
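from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.vec_env import DummyVecEnv, SubprocVecEnv
from stable_baselines3.common.utils import set_random_seed
from reinforcetactics.rl.simple_env import SimpleStrategyEnv

def make_env(rank, seed=0):
    """Return a factory that builds one seeded, monitored environment."""
    def _init():
        env = SimpleStrategyEnv(grid_size=10, max_steps=200)
        # Monitor records episode returns and lengths; without it SB3 cannot
        # log rollout/ep_rew_mean or rollout/ep_len_mean to TensorBoard.
        env = Monitor(env)
        env.reset(seed=seed + rank)
        return env
    set_random_seed(seed)
    return _init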

# Create vectorized environments
print(f"🎮 Creating {config['n_envs']} parallel environments...")

if config['n_envs'] == 1:
    env = DummyVecEnv([make_env(0)])
else:
    env = SubprocVecEnv([make_env(i) for i in range(config['n_envs'])])

# Create evaluation environment
eval_env = DummyVecEnv([make_env(999)])

print("✅ Environments created!")
print(f"   Observation space: {env.observation_space}")
print(f"   Action space: {env.action_space}")

## 📊 Setup TensorBoard

In [None]:
%load_ext tensorboard

# Create log directory
from datetime import datetime
log_dir = f"./logs/ppo_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
os.makedirs(log_dir, exist_ok=True)

print(f"📁 Log directory: {log_dir}")
print("\n🎯 Open TensorBoard from the 'View Training Progress' cell once training has written logs.")

## 🤖 Create and Train Model

In [None]:
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import EvalCallback, CheckpointCallback

print("🤖 Creating PPO model...")

# Create model
model = PPO(
    "MultiInputPolicy",
    env,
    learning_rate=config['learning_rate'],
    n_steps=config['n_steps'],
    batch_size=config['batch_size'],
    n_epochs=config['n_epochs'],
    gamma=config['gamma'],
    gae_lambda=config['gae_lambda'],
    clip_range=config['clip_range'],
    ent_coef=config['ent_coef'],
    vf_coef=config['vf_coef'],
    max_grad_norm=config['max_grad_norm'],
    verbose=1,
    tensorboard_log=log_dir,
    device=config['device']
)

# Setup callbacks
eval_callback = EvalCallback(
    eval_env,
    best_model_save_path=f"{log_dir}/best_model",
    log_path=f"{log_dir}/eval",
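    # eval_freq counts calls per environment, so divide by n_envs
    # to evaluate roughly every 10k total timesteps.
    eval_freq=max(10_000 // config['n_envs'], 1),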
    n_eval_episodes=5,
    deterministic=True
)

checkpoint_callback = CheckpointCallback(
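    # save_freq counts calls per environment, so divide by n_envs
    # to checkpoint roughly every 20k total timesteps.
    save_freq=max(20_000 // config['n_envs'], 1),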
    save_path=f"{log_dir}/checkpoints",
    name_prefix="ppo_model"
)

print("✅ Model created!")
print(f"\n🎓 Training for {config['total_timesteps']:,} timesteps...\n")

# Train the model
try:
    model.learn(
        total_timesteps=config['total_timesteps'],
        callback=[eval_callback, checkpoint_callback],
        progress_bar=True
    )
    print("\n✅ Training completed successfully!")
except KeyboardInterrupt:
    print("\n⚠️  Training interrupted by user")

# Save final model
final_model_path = f"{log_dir}/final_model.zip"
model.save(final_model_path)
print(f"💾 Final model saved to: {final_model_path}")

## 📈 View Training Progress

In [None]:
%tensorboard --logdir {log_dir}

## 🧪 Evaluate Trained Agent

In [None]:
import numpy as np
from tqdm import tqdm

def evaluate_agent(model, env, n_episodes=20):
    """Evaluate the trained agent."""
    print(f"\n🧪 Evaluating agent over {n_episodes} episodes...\n")

    episode_rewards = []
    episode_lengths = []
    wins = 0

    for episode in tqdm(range(n_episodes)):
        obs = env.reset()
        done = False
        episode_reward = 0.0
        episode_length = 0

        while not done:
            action, _ = model.predict(obs, deterministic=True)
            # VecEnv.step returns batched arrays; index 0 is the single eval env
            obs, rewards, dones, infos = env.step(action)
            episode_reward += float(rewards[0])
            episode_length += 1
            done = bool(dones[0])

            # A final-step reward above 500 can only come from the +1000 win bonus
            if done and rewards[0] > 500:
                wins += 1

        episode_rewards.append(episode_reward)
        episode_lengths.append(episode_length)

    # Print results
    print("\n" + "="*60)
    print("📊 Evaluation Results")
    print("="*60)
    print(f"Episodes:           {n_episodes}")
    print(f"Win Rate:           {wins/n_episodes*100:.1f}% ({wins}/{n_episodes})")
    print(f"Mean Reward:        {np.mean(episode_rewards):.2f} ± {np.std(episode_rewards):.2f}")
    print(f"Mean Length:        {np.mean(episode_lengths):.1f} steps")
    print(f"Best Reward:        {np.max(episode_rewards):.2f}")
    print(f"Worst Reward:       {np.min(episode_rewards):.2f}")
    print("="*60)

    return {
        'win_rate': wins/n_episodes,
        'mean_reward': np.mean(episode_rewards),
        'std_reward': np.std(episode_rewards),
        'mean_length': np.mean(episode_lengths)
    }

# Evaluate the trained model
eval_results = evaluate_agent(model, eval_env, n_episodes=20)

## 📊 Visualize Results

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

# Load training data from TensorBoard logs
from tensorboard.backend.event_processing import event_accumulator

def load_tensorboard_data(log_dir):
    """Load data from TensorBoard logs."""
    ea = event_accumulator.EventAccumulator(log_dir)
    ea.Reload()

    # Get available tags
    tags = ea.Tags()['scalars']

    data = {}
    for tag in tags:
        events = ea.Scalars(tag)
        data[tag] = pd.DataFrame([
            {'step': e.step, 'value': e.value} for e in events
        ])

    return data

# Try to load and plot training data
try:
    # Find the PPO_1 subdirectory
    ppo_dirs = [d for d in os.listdir(log_dir) if d.startswith('PPO_')]
    if ppo_dirs:
        tb_log_dir = os.path.join(log_dir, ppo_dirs[0])
        data = load_tensorboard_data(tb_log_dir)

        # Create plots
        fig, axes = plt.subplots(2, 2, figsize=(15, 10))
        fig.suptitle('Training Progress', fontsize=16, fontweight='bold')

        # Plot episode reward
        if 'rollout/ep_rew_mean' in data:
            df = data['rollout/ep_rew_mean']
            axes[0, 0].plot(df['step'], df['value'])
            axes[0, 0].set_title('Episode Reward Mean')
            axes[0, 0].set_xlabel('Steps')
            axes[0, 0].set_ylabel('Reward')
            axes[0, 0].grid(True, alpha=0.3)

        # Plot episode length
        if 'rollout/ep_len_mean' in data:
            df = data['rollout/ep_len_mean']
            axes[0, 1].plot(df['step'], df['value'], color='orange')
            axes[0, 1].set_title('Episode Length Mean')
            axes[0, 1].set_xlabel('Steps')
            axes[0, 1].set_ylabel('Length')
            axes[0, 1].grid(True, alpha=0.3)

        # Plot loss
        if 'train/loss' in data:
            df = data['train/loss']
            axes[1, 0].plot(df['step'], df['value'], color='red')
            axes[1, 0].set_title('Training Loss')
            axes[1, 0].set_xlabel('Steps')
            axes[1, 0].set_ylabel('Loss')
            axes[1, 0].grid(True, alpha=0.3)

        # Plot learning rate
        if 'train/learning_rate' in data:
            df = data['train/learning_rate']
            axes[1, 1].plot(df['step'], df['value'], color='green')
            axes[1, 1].set_title('Learning Rate')
            axes[1, 1].set_xlabel('Steps')
            axes[1, 1].set_ylabel('LR')
            axes[1, 1].grid(True, alpha=0.3)

        plt.tight_layout()
        plt.show()

        print("✅ Training plots generated!")
    else:
        print("⚠️  No TensorBoard data found yet. Train for a few more steps.")

except Exception as e:
    print(f"⚠️  Could not generate plots: {e}")
    print("   This is normal if training just started.")

## 💾 Save and Download Model

In [None]:
from google.colab import files
import shutil

# Create a zip file with all models and logs
print("📦 Packaging models and logs...")

output_zip = f"reinforcetactics_training_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
shutil.make_archive(output_zip, 'zip', log_dir)

print(f"✅ Created {output_zip}.zip")
print(f"   Size: {os.path.getsize(output_zip + '.zip') / 1024 / 1024:.2f} MB")

# Download
print("\n⬇️  Downloading...")
files.download(output_zip + '.zip')

print("\n✅ Download complete!")
print("\n📝 The zip contains:")
print("   - final_model.zip (trained model)")
print("   - best_model/ (best performing checkpoint)")
print("   - checkpoints/ (periodic checkpoints)")
print("   - eval/ (EvalCallback evaluation results)")
print("   - PPO_*/ (TensorBoard event files)")

## 📂 Load and Test a Saved Model

In [None]:
# Load a previously saved model
print("📂 Loading saved model...")

loaded_model = PPO.load(final_model_path, env=eval_env)
print("✅ Model loaded successfully!")

# Quick test
print("\n🎮 Running quick test...")
obs = eval_env.reset()
for i in range(5):
    action, _ = loaded_model.predict(obs, deterministic=True)
    obs, reward, done, info = eval_env.step(action)
    print(f"Step {i+1}: Action={action}, Reward={reward[0]:.2f}")
    if done[0]:
        print("Episode finished!")
        break

print("\n✅ Model test complete!")

## 🚀 Advanced: Hyperparameter Tuning

Try different hyperparameters to improve performance:

In [None]:
# Example hyperparameter configurations to try
hyperparam_configs = [
    {'learning_rate': 3e-4, 'n_steps': 2048, 'ent_coef': 0.01, 'name': 'baseline'},
    {'learning_rate': 1e-4, 'n_steps': 4096, 'ent_coef': 0.005, 'name': 'conservative'},
    {'learning_rate': 5e-4, 'n_steps': 1024, 'ent_coef': 0.02, 'name': 'aggressive'},
]

print("🔬 Hyperparameter configurations available:")
for i, cfg in enumerate(hyperparam_configs):
    print(f"\n{i+1}. {cfg['name'].title()}:")
    print(f"   - Learning Rate: {cfg['learning_rate']}")
    print(f"   - Steps: {cfg['n_steps']}")
    print(f"   - Entropy Coef: {cfg['ent_coef']}")

print("\n💡 To try a different configuration:")
print("   1. Update the 'config' dictionary above")
print("   2. Re-run the training cells")
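
As a sketch of step 1, one of the entries above can be merged into a copy of the base configuration instead of editing values by hand. The `apply_overrides` helper and the `base_config` stand-in are hypothetical; in the notebook you would pass the real `config` dictionary.

```python
# Stand-in for the notebook's `config` dictionary (subset of keys).
base_config = {'learning_rate': 3e-4, 'n_steps': 2048, 'ent_coef': 0.01, 'batch_size': 64}

def apply_overrides(base, overrides):
    """Return a copy of `base` with `overrides` applied, skipping the 'name' label."""
    merged = dict(base)
    merged.update({k: v for k, v in overrides.items() if k != 'name'})
    return merged

conservative = {'learning_rate': 1e-4, 'n_steps': 4096, 'ent_coef': 0.005, 'name': 'conservative'}
new_config = apply_overrides(base_config, conservative)
print(new_config)
```

Working on a copy keeps the baseline values intact, so runs with different presets stay comparable.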

## 💡 Tips for Better Training

1. **Use GPU Runtime**: A GPU runtime can speed up training, though the gain depends on network size and is often modest for a small policy like this one
2. **Train Longer**: Increase `total_timesteps` to 500k-1M for better results
3. **Monitor TensorBoard**: Watch for signs of overfitting or instability
4. **Tune Hyperparameters**: Try different learning rates and batch sizes
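5. **Save Checkpoints**: `CheckpointCallback` saves the model periodically; `save_freq` counts steps per environment, so the interval in total timesteps is `save_freq * n_envs`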
6. **Curriculum Learning**: Start with easier opponents, gradually increase difficulty
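
On checkpoint and evaluation timing: Stable-Baselines3 callbacks count calls to `env.step()`, and each call advances every parallel environment by one step. An interval expressed in total timesteps therefore has to be divided by `n_envs`. A minimal sketch of that conversion follows; the helper name `per_env_freq` is hypothetical.

```python
def per_env_freq(total_timestep_interval, n_envs):
    """Convert an interval in total timesteps to a per-environment call count."""
    # Floor division, with a minimum of 1 so the callback still fires.
    return max(total_timestep_interval // n_envs, 1)

save_freq = per_env_freq(20_000, 4)   # checkpoint about every 20k total timesteps
eval_freq = per_env_freq(10_000, 4)   # evaluate about every 10k total timesteps
print(save_freq, eval_freq)
```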

## 🐛 Troubleshooting

- **Out of Memory**: Reduce `n_envs` or `batch_size`
- **Slow Training**: Enable GPU runtime
- **Unstable Learning**: Reduce `learning_rate`, increase `batch_size`
- **Not Learning**: Check reward shaping, increase exploration (raise `ent_coef`)
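
For the unstable-learning case, one common option is to decay the learning rate over the run instead of fixing a lower constant. Stable-Baselines3 accepts a callable for `learning_rate` that receives the remaining training progress (1.0 at the start, 0.0 at the end). The sketch below shows a linear decay; `linear_schedule` is an illustrative helper, not something this notebook defines.

```python
def linear_schedule(initial_value):
    """Build a schedule that decays linearly from initial_value to 0."""
    def schedule(progress_remaining):
        # progress_remaining goes from 1.0 (start of training) to 0.0 (end)
        return progress_remaining * initial_value
    return schedule

lr = linear_schedule(3e-4)
print(lr(1.0), lr(0.5), lr(0.0))
```

Passing `learning_rate=linear_schedule(3e-4)` to `PPO(...)` would replace the constant rate used in the training cell.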

## 📚 Next Steps

1. Deploy the trained model to play against humans
2. Implement self-play for stronger agents
3. Add hierarchical RL for complex strategies
4. Create a tournament system for multiple agents
