<a href="https://colab.research.google.com/github/kuds/reinforce-tactics/blob/main/notebooks/ppo_6x6_beginner_training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🎮 Reinforce Tactics - PPO Training on 6x6 Beginner Map

This notebook trains a PPO (Proximal Policy Optimization) agent to play against SimpleBot on the 6x6 beginner map in headless mode.

**Features:**
- Headless training (no GUI rendering) for fast RL training
- 6x6 beginner map for quick training iterations
- SimpleBot opponent for consistent baseline
- Stable-Baselines3 PPO algorithm with MultiInputPolicy
- TensorBoard monitoring
- Checkpoint saving and model evaluation
- GPU acceleration support

**Map Layout (6x6 beginner):**
```
h_1,b_1,p,p,p,p
b_1,p,p,p,p,p
p,p,t,t,p,p
p,p,t,t,p,p
p,p,p,p,p,b_2
p,p,p,p,b_2,h_2
```
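
The layout is a plain comma-separated grid, so it can be inspected with a few lines of standard-library Python. The sketch below is illustrative only: it assumes the `_1` / `_2` suffix marks the owning player (the notebook does not state this), and `load_grid`, `swap_owner`, and `is_point_symmetric` are hypothetical helper names, not part of Reinforce Tactics. It counts the tile codes and checks whether the two starting corners mirror each other under a 180-degree rotation.

```python
import csv
import io
from collections import Counter

# The 6x6 beginner layout, copied from the notebook text
MAP_CSV = """\
h_1,b_1,p,p,p,p
b_1,p,p,p,p,p
p,p,t,t,p,p
p,p,t,t,p,p
p,p,p,p,p,b_2
p,p,p,p,b_2,h_2
"""


def load_grid(text):
    """Parse CSV text into a list of rows, skipping empty lines."""
    return [row for row in csv.reader(io.StringIO(text)) if row]


def swap_owner(tile):
    """Swap the assumed owner suffix (_1 <-> _2); unowned tiles pass through."""
    base, sep, owner = tile.partition("_")
    if not sep:
        return tile
    return f"{base}_{'2' if owner == '1' else '1'}"


def is_point_symmetric(grid):
    """True if rotating the grid by 180 degrees and swapping owners reproduces it."""
    rotated = [[swap_owner(tile) for tile in reversed(row)] for row in reversed(grid)]
    return rotated == grid


grid = load_grid(MAP_CSV)
tile_counts = Counter(tile for row in grid for tile in row)

print(f"Grid size: {len(grid)}x{len(grid[0])}")
print(f"Tile counts: {dict(tile_counts)}")
print(f"Point-symmetric with owners swapped: {is_point_symmetric(grid)}")
```

If the symmetry check holds, neither player gains a positional advantage from the map itself, which is useful to know when interpreting win rates later on.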

**Runtime:** Use a GPU runtime for faster training (Runtime → Change runtime type → GPU)

## 📦 Setup and Installation

In [None]:
# Install dependencies
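# Note: For production use, pin specific versions for reproducibility.
# The pins must be mutually compatible (SB3 2.0.0 requires gymnasium 0.28.1; gymnasium 0.29.x needs SB3 >= 2.1.0), e.g.:
# !pip install -q gymnasium==0.29.1 "stable-baselines3[extra]==2.1.0" tensorboard==2.14.0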
!pip install -q gymnasium stable-baselines3[extra] tensorboard pandas numpy torch

# Check if GPU is available
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"\n✅ Using device: {device}")
if device == 'cuda':
    print(f"   GPU: {torch.cuda.get_device_name(0)}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB")
else:
    print("   ⚠️  No GPU detected. Training will be slower. Consider switching to GPU runtime.")

In [None]:
# Clone the Reinforce Tactics repository
import os
from pathlib import Path

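REPO_DIR = 'reinforce-tactics'

# Guard against re-running this cell from inside the repository,
# which would otherwise clone a nested copy and chdir into it
already_inside = Path.cwd().name == REPO_DIR

if already_inside:
    print("✅ Already inside the repository")
elif Path(REPO_DIR).exists():
    print("✅ Repository already cloned")
else:
    print("📥 Cloning Reinforce Tactics repository...")
    !git clone https://github.com/kuds/reinforce-tactics.git
    print("✅ Repository cloned!")

# Change to repository directory
if not already_inside:
    os.chdir(REPO_DIR)
print(f"\n📂 Current directory: {os.getcwd()}")

# Add to Python path
import sys
if os.getcwd() not in sys.path:
    sys.path.insert(0, os.getcwd())
    print("✅ Added to Python path")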

## 📚 Import Required Modules

In [None]:
import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.callbacks import CheckpointCallback, EvalCallback
from stable_baselines3.common.evaluation import evaluate_policy
import gymnasium as gym

# Import Reinforce Tactics environment
from reinforcetactics.rl.gym_env import StrategyGameEnv

print("✅ All modules imported successfully!")

## 🎯 Environment Configuration

We'll create the StrategyGameEnv with:
- **Map**: 6x6 beginner map (`maps/1v1/6x6_beginner.csv`)
- **Opponent**: SimpleBot (consistent baseline opponent)
- **Render Mode**: None (headless mode for fast training)
- **Max Steps**: 500 steps per episode

In [None]:
# Create directories for logs and models
!mkdir -p logs/ppo_6x6_beginner
!mkdir -p models/ppo_6x6_beginner

# Environment configuration
MAP_FILE = 'maps/1v1/6x6_beginner.csv'
LOG_DIR = './logs/ppo_6x6_beginner'
MODEL_DIR = './models/ppo_6x6_beginner'

def make_env():
    """Create and wrap the environment."""
    env = StrategyGameEnv(
        map_file=MAP_FILE,
        opponent='bot',  # SimpleBot opponent
        render_mode=None,  # Headless mode
        max_steps=500
    )
    env = Monitor(env, LOG_DIR)
    return env

# Create vectorized environment
env = DummyVecEnv([make_env])

print("✅ Environment created successfully!")
print(f"\n📊 Environment Details:")
print(f"   Map: {MAP_FILE}")
print(f"   Opponent: SimpleBot")
print(f"   Render Mode: None (headless)")
print(f"   Observation Space: {env.envs[0].observation_space}")
print(f"   Action Space: {env.envs[0].action_space}")

## 🤖 PPO Model Configuration

We'll configure PPO with the following hyperparameters:
- **Policy**: MultiInputPolicy (for Dict observation space)
- **Learning Rate**: 3e-4
- **Steps per Rollout**: 2048
- **Batch Size**: 64
- **Epochs**: 10
- **Gamma**: 0.99 (discount factor)
- **GAE Lambda**: 0.95
- **Clip Range**: 0.2
- **Entropy Coefficient**: 0.01 (for exploration)
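
To make these numbers concrete, the sketch below works out how much data and how many gradient steps the settings imply. It assumes a single environment (as in this notebook) and the 100,000-timestep budget used in the training section; `ppo_update_budget` is a hypothetical helper written for illustration, not an SB3 function. It relies on SB3's PPO collecting `n_steps * n_envs` transitions per rollout and then making `n_epochs` passes over them in minibatches of `batch_size`.

```python
import math


def ppo_update_budget(total_timesteps, n_steps, batch_size, n_epochs, n_envs=1):
    """Estimate rollout and gradient-step counts for a PPO run."""
    rollout_size = n_steps * n_envs
    minibatches_per_epoch = math.ceil(rollout_size / batch_size)
    # Training stops after the first rollout that reaches the budget,
    # so the number of rollouts is rounded up
    n_rollouts = math.ceil(total_timesteps / rollout_size)
    gradient_steps_per_rollout = minibatches_per_epoch * n_epochs
    return {
        "rollout_size": rollout_size,
        "minibatches_per_epoch": minibatches_per_epoch,
        "gradient_steps_per_rollout": gradient_steps_per_rollout,
        "n_rollouts": n_rollouts,
        "timesteps_collected": n_rollouts * rollout_size,
        "total_gradient_steps": n_rollouts * gradient_steps_per_rollout,
    }


budget = ppo_update_budget(
    total_timesteps=100_000, n_steps=2048, batch_size=64, n_epochs=10
)
for name, value in budget.items():
    print(f"{name}: {value:,}")
```

Under these assumptions the run performs 49 rollouts, so it collects 100,352 timesteps (slightly more than requested) and takes 320 gradient steps per rollout.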

In [None]:
# PPO hyperparameters
ppo_config = {
    'learning_rate': 3e-4,
    'n_steps': 2048,
    'batch_size': 64,
    'n_epochs': 10,
    'gamma': 0.99,
    'gae_lambda': 0.95,
    'clip_range': 0.2,
    'ent_coef': 0.01,
    'verbose': 1,
    'tensorboard_log': LOG_DIR,
    'device': device
}

# Create PPO model
model = PPO(
    policy='MultiInputPolicy',  # Required for Dict observation space
    env=env,
    **ppo_config
)

print("✅ PPO model created successfully!")
print(f"\n📊 Model Configuration:")
for key, value in ppo_config.items():
    print(f"   {key}: {value}")

# Print model architecture
print(f"\n🏗️  Model Architecture:")
print(model.policy)

## 💾 Set Up Training Callbacks

We'll use callbacks to:
- Save model checkpoints every 10,000 steps
- Evaluate the model periodically during training
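
Both `save_freq` and `eval_freq` are counted in callback calls (one per vectorized `env.step()`), not in total timesteps. With the single environment used here the two coincide, but with `n_envs` parallel environments each call advances `n_envs` timesteps. The sketch below shows the adjustment suggested in the SB3 documentation and lists when the callbacks would fire; `per_env_frequency` and `trigger_timesteps` are hypothetical helper names, and the 100,000-timestep budget is taken from the training section.

```python
def per_env_frequency(target_timesteps, n_envs=1):
    """Convert a frequency in total timesteps into a frequency in callback calls."""
    return max(target_timesteps // n_envs, 1)


def trigger_timesteps(freq_calls, n_envs, total_timesteps):
    """Total-timestep counts at which a callback with this frequency fires."""
    step = freq_calls * n_envs
    return list(range(step, total_timesteps + 1, step))


# Single environment, as configured in this notebook
save_freq = per_env_frequency(10_000, n_envs=1)
eval_freq = per_env_frequency(20_000, n_envs=1)
checkpoints = trigger_timesteps(save_freq, 1, 100_000)
evaluations = trigger_timesteps(eval_freq, 1, 100_000)

print(f"Checkpoints saved at: {checkpoints}")
print(f"Evaluations run at:   {evaluations}")

# With 4 parallel environments, keep the same spacing in total timesteps
print(f"save_freq for 4 envs: {per_env_frequency(10_000, n_envs=4)}")
```

With these settings a 100,000-timestep run yields ten checkpoints and five evaluations of five episodes each.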

In [None]:
# Checkpoint callback - save model every 10k steps
checkpoint_callback = CheckpointCallback(
    save_freq=10000,
    save_path=MODEL_DIR,
    name_prefix='ppo_6x6_beginner',
    save_replay_buffer=False,
    save_vecnormalize=False,
)

# Create evaluation environment
eval_env = DummyVecEnv([make_env])

# Evaluation callback - evaluate every 20k steps
eval_callback = EvalCallback(
    eval_env,
    best_model_save_path=MODEL_DIR,
    log_path=LOG_DIR,
    eval_freq=20000,
    n_eval_episodes=5,
    deterministic=True,
    render=False
)

# Combine callbacks
callbacks = [checkpoint_callback, eval_callback]

print("✅ Callbacks configured successfully!")
print(f"   - Checkpoint every 10,000 steps")
print(f"   - Evaluation every 20,000 steps (5 episodes)")

## 🏋️ Training Section

Now we'll train the PPO agent. For this demo, we'll train for **100,000 timesteps**.

**Note**: For better results, increase to 500,000 - 1,000,000 timesteps. Training time depends on hardware:
- With GPU: ~5-10 minutes for 100k steps
- Without GPU: ~30-60 minutes for 100k steps

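To watch metrics while training runs, launch TensorBoard (the `%tensorboard` cell in the next section) before starting the training cell. Notebook cells execute one at a time, so a cell queued during training only runs once training has finished.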

In [None]:
# Training configuration
TOTAL_TIMESTEPS = 100000  # Increase to 500k-1M for better results

print(f"🏋️  Starting PPO training...")
print(f"   Total timesteps: {TOTAL_TIMESTEPS:,}")
print(f"   Device: {device}")
print(f"   Map: 6x6 beginner")
print(f"   Opponent: SimpleBot")
print(f"\n📊 Training progress will be displayed below...\n")

# Train the model
model.learn(
    total_timesteps=TOTAL_TIMESTEPS,
    callback=callbacks,
    progress_bar=True
)

# Save final model
final_model_path = f"{MODEL_DIR}/ppo_6x6_beginner_final"
model.save(final_model_path)

print(f"\n✅ Training complete!")
print(f"   Final model saved to: {final_model_path}")

## 📈 Monitor Training with TensorBoard

Launch TensorBoard to visualize training metrics in real-time:

In [None]:
%load_ext tensorboard
%tensorboard --logdir {LOG_DIR}

## 🎯 Evaluation Section

Now let's evaluate the trained model against SimpleBot and calculate the win rate.

In [None]:
# Load the trained model
print("📥 Loading trained model...")
trained_model = PPO.load(final_model_path, env=env)
print("✅ Model loaded successfully!")

In [None]:
# Evaluate the model
print("🎯 Evaluating trained model...\n")

n_eval_episodes = 20
mean_reward, std_reward = evaluate_policy(
    trained_model,
    eval_env,
    n_eval_episodes=n_eval_episodes,
    deterministic=True,
    return_episode_rewards=False
)

print(f"\n📊 Evaluation Results ({n_eval_episodes} episodes):")
print(f"   Mean Reward: {mean_reward:.2f} ± {std_reward:.2f}")

In [None]:
# Detailed evaluation with episode-by-episode results
print("\n🔍 Detailed Episode-by-Episode Evaluation...\n")

wins = 0
losses = 0
draws = 0
episode_rewards = []
episode_lengths = []

for episode in range(n_eval_episodes):
    obs = eval_env.reset()
    done = False
    episode_reward = 0.0
    episode_length = 0

    while not done:
        action, _ = trained_model.predict(obs, deterministic=True)
        # VecEnv.step returns batched arrays; index 0 is the single environment
        obs, rewards, dones, infos = eval_env.step(action)
        episode_reward += float(rewards[0])
        episode_length += 1
        done = bool(dones[0])

    # Check winner from the info dict of the final step
    episode_stats = infos[0].get('episode_stats', {})
    winner = episode_stats.get('winner', None)

    if winner == 1:
        wins += 1
        result = "WIN ✅"
    elif winner == 2:
        losses += 1
        result = "LOSS ❌"
    else:
        draws += 1
        result = "DRAW ⚖️"

    episode_rewards.append(episode_reward)
    episode_lengths.append(episode_length)

    print(f"Episode {episode + 1:2d}: {result} | Reward: {episode_reward:7.2f} | Length: {episode_length:3d} steps")

# Calculate statistics
win_rate = (wins / n_eval_episodes) * 100
avg_reward = np.mean(episode_rewards)
avg_length = np.mean(episode_lengths)

print("\n" + "="*60)
print("📊 EVALUATION SUMMARY")
print("="*60)
print(f"Total Episodes:     {n_eval_episodes}")
print(f"Wins:               {wins} ({win_rate:.1f}%)")
print(f"Losses:             {losses} ({(losses/n_eval_episodes)*100:.1f}%)")
print(f"Draws/Timeouts:     {draws} ({(draws/n_eval_episodes)*100:.1f}%)")
print(f"Average Reward:     {avg_reward:.2f}")
print(f"Average Length:     {avg_length:.1f} steps")
print("="*60)

## 👀 Watch a Game (Optional)

If you want to see the agent play, you can run a single episode and print the actions:

In [None]:
print("🎮 Watching a single game...\n")

obs = eval_env.reset()
done = False
step_count = 0
total_reward = 0

while not done and step_count < 50:  # Limit to 50 steps for display
    action, _ = trained_model.predict(obs, deterministic=True)
    obs, reward, done, info = eval_env.step(action)

    step_count += 1
    total_reward += reward[0]

    # Decode action for display
    action_types = ['CREATE_UNIT', 'MOVE', 'ATTACK', 'SEIZE', 'HEAL', 'END_TURN']
    action_type_idx = action[0][0]
    action_type = action_types[action_type_idx] if action_type_idx < len(action_types) else 'UNKNOWN'

    print(f"Step {step_count:2d}: Action={action_type:12s} | Reward={reward[0]:7.2f} | Total={total_reward:7.2f}")

    if done:
        episode_stats = info[0].get('episode_stats', {})
        winner = episode_stats.get('winner', None)
        if winner == 1:
            print("\n✅ Agent WON!")
        elif winner == 2:
            print("\n❌ Agent LOST!")
        else:
            print("\n⚖️  Game ended in a draw or timeout")
        break

if not done:
    print(f"\n⏸️  Game still in progress after {step_count} steps")

## 💾 Model Saving and Loading

Your trained model has been saved automatically. Here's how to load and use it:

In [None]:
print("📦 Model Locations:")
print(f"\n   Final Model:")
print(f"   {final_model_path}.zip")
print(f"\n   Checkpoints (every 10k steps):")
# Group ls with its fallback so the message still prints when no checkpoint exists
# (in `ls | tail || echo`, the fallback depends on tail's exit status, not on ls)
!(ls -lh {MODEL_DIR}/ppo_6x6_beginner_*_steps.zip 2>/dev/null || echo "   No checkpoints found yet") | tail -5
print(f"\n   Best Model (from evaluation):")
!ls -lh {MODEL_DIR}/best_model.zip 2>/dev/null || echo "   No best model found yet"

print("\n" + "="*60)
print("📝 How to Load and Use the Model:")
print("="*60)
print("\n1. Load the model:")
print(f"   model = PPO.load('{final_model_path}')")
print("\n2. Create environment:")
print("   env = StrategyGameEnv(map_file='maps/1v1/6x6_beginner.csv', opponent='bot')")
print("\n3. Use the model:")
print("   obs, info = env.reset()")
print("   action, _states = model.predict(obs, deterministic=True)")
print("   obs, reward, terminated, truncated, info = env.step(action)")
print("\n4. Download from Colab (optional):")
print("   from google.colab import files")
print(f"   files.download('{final_model_path}.zip')")

## ⬇️ Download Model (Optional)

If you're running in Colab, you can download the trained model to your local machine:

In [None]:
# Uncomment to download the model
# from google.colab import files
# files.download(f'{final_model_path}.zip')
print("💡 Uncomment the code above to download the trained model")

## 🔬 Hyperparameter Tuning (Advanced)

Try different hyperparameters to improve performance:

In [None]:
print("🔬 Suggested Hyperparameter Configurations:\n")

configs = [
    {
        'name': 'Baseline (Current)',
        'learning_rate': 3e-4,
        'n_steps': 2048,
        'batch_size': 64,
        'ent_coef': 0.01,
        'description': 'Standard PPO configuration'
    },
    {
        'name': 'Conservative',
        'learning_rate': 1e-4,
        'n_steps': 4096,
        'batch_size': 128,
        'ent_coef': 0.005,
        'description': 'Slower, more stable learning'
    },
    {
        'name': 'Aggressive',
        'learning_rate': 5e-4,
        'n_steps': 1024,
        'batch_size': 32,
        'ent_coef': 0.02,
        'description': 'Faster learning with more exploration'
    },
    {
        'name': 'High Exploration',
        'learning_rate': 3e-4,
        'n_steps': 2048,
        'batch_size': 64,
        'ent_coef': 0.05,
        'description': 'More exploration for diverse strategies'
    }
]

for i, config in enumerate(configs, 1):
    print(f"{i}. {config['name']}:")
    print(f"   Description: {config['description']}")
    print(f"   - learning_rate: {config['learning_rate']}")
    print(f"   - n_steps: {config['n_steps']}")
    print(f"   - batch_size: {config['batch_size']}")
    print(f"   - ent_coef: {config['ent_coef']}")
    print()

print("💡 To try a different configuration:")
print("   1. Modify the ppo_config dictionary in the 'PPO Model Configuration' cell")
print("   2. Re-run the training cells")
print("   3. Compare results using TensorBoard")
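
Rather than editing `ppo_config` by hand, a preset can be merged over the baseline settings. The sketch below is one possible way to do that; `BASE_CONFIG`, `PRESETS`, and `apply_preset` are hypothetical names, and the values are copied from the configurations listed above. It also checks that `batch_size` divides the rollout size evenly. SB3 only warns about a truncated final minibatch in that case, so raising an error here is a deliberately stricter choice.

```python
# Baseline values from the PPO Model Configuration section
BASE_CONFIG = {
    'learning_rate': 3e-4,
    'n_steps': 2048,
    'batch_size': 64,
    'n_epochs': 10,
    'ent_coef': 0.01,
}

# Only the keys that differ from the baseline
PRESETS = {
    'conservative': {'learning_rate': 1e-4, 'n_steps': 4096, 'batch_size': 128, 'ent_coef': 0.005},
    'aggressive': {'learning_rate': 5e-4, 'n_steps': 1024, 'batch_size': 32, 'ent_coef': 0.02},
    'high_exploration': {'ent_coef': 0.05},
}


def apply_preset(base, overrides, n_envs=1):
    """Return a new config with overrides applied, leaving `base` untouched."""
    config = {**base, **overrides}
    rollout_size = config['n_steps'] * n_envs
    if rollout_size % config['batch_size'] != 0:
        raise ValueError(
            f"batch_size={config['batch_size']} does not divide "
            f"the rollout size {rollout_size}"
        )
    return config


for name, overrides in PRESETS.items():
    print(name, apply_preset(BASE_CONFIG, overrides))
```

The resulting dictionary can be unpacked into `PPO(...)` alongside the remaining keys of `ppo_config` (`gamma`, `device`, and so on).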

## 💡 Tips for Better Training

### Performance Optimization
1. **Use GPU Runtime**: Change to GPU runtime in Colab for 5-10x faster training
   - Runtime → Change runtime type → GPU (T4 or better)
2. **Parallel Environments**: Run several environment copies with `SubprocVecEnv` instead of a single-environment `DummyVecEnv` to spread environment stepping across CPU cores
3. **Increase Batch Size**: If you have enough memory, larger batch sizes can stabilize training
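
As a rough illustration of the parallel-environments tip, the sketch below uses SB3's `make_vec_env` with `vec_env_cls=SubprocVecEnv`, which also writes one `Monitor` log per worker instead of sharing a single file. The helper names `pick_n_envs` and `build_parallel_env` and the core-count heuristic are assumptions made for this example, not recommendations from the notebook or from SB3.

```python
import os


def pick_n_envs(cpu_count=None, reserve=1, cap=8):
    """Hypothetical heuristic: leave `reserve` cores free and never exceed `cap`."""
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    return max(1, min(cap, cpu_count - reserve))


def build_parallel_env(env_cls, env_kwargs, n_envs, monitor_dir):
    """Create `n_envs` copies of the environment, each in its own process."""
    # Imported lazily so pick_n_envs can be used without SB3 installed
    from stable_baselines3.common.env_util import make_vec_env
    from stable_baselines3.common.vec_env import SubprocVecEnv

    return make_vec_env(
        env_cls,
        n_envs=n_envs,
        env_kwargs=env_kwargs,
        monitor_dir=monitor_dir,  # one Monitor log file per worker
        vec_env_cls=SubprocVecEnv,
    )


n_envs = pick_n_envs()
print(f"Suggested number of parallel environments: {n_envs}")

# In this notebook the call could look like this (not executed here):
# env = build_parallel_env(
#     StrategyGameEnv,
#     dict(map_file=MAP_FILE, opponent='bot', render_mode=None, max_steps=500),
#     n_envs,
#     LOG_DIR,
# )
```

Keep in mind that each rollout then contains `n_steps * n_envs` transitions, so `n_steps` and the callback frequencies may need to be scaled down to keep the same schedule in total timesteps.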

### Training Duration
1. **Quick Test**: 50k-100k timesteps (~5-10 min on GPU)
2. **Decent Agent**: 500k timesteps (~30-60 min on GPU)
3. **Strong Agent**: 1M-2M timesteps (~2-4 hours on GPU)

### Improving Agent Performance
1. **Reward Shaping**: Adjust reward coefficients in the environment configuration
2. **Curriculum Learning**: Start with easier opponents, gradually increase difficulty
3. **Hyperparameter Tuning**: Try different learning rates, entropy coefficients
4. **Longer Rollouts**: Increase `n_steps` for better credit assignment
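
One common variation when tuning the learning rate is to decay it over the course of training instead of keeping it fixed. SB3's PPO accepts a callable for `learning_rate` that receives the remaining progress (1.0 at the start, 0.0 at the end). The sketch below shows a linear decay; `linear_schedule` is a helper defined here for illustration, and the end value of 3e-5 is an arbitrary assumption.

```python
from typing import Callable


def linear_schedule(initial_value: float, final_value: float = 0.0) -> Callable[[float], float]:
    """Build a schedule that moves linearly from initial_value to final_value."""

    def schedule(progress_remaining: float) -> float:
        # progress_remaining goes from 1.0 (start of training) to 0.0 (end)
        return final_value + progress_remaining * (initial_value - final_value)

    return schedule


lr_schedule = linear_schedule(3e-4, 3e-5)
for progress_remaining in (1.0, 0.5, 0.0):
    print(f"progress_remaining={progress_remaining:.1f} -> lr={lr_schedule(progress_remaining):.2e}")

# Possible use in the notebook (not executed here):
# ppo_config['learning_rate'] = linear_schedule(3e-4, 3e-5)
```

A decaying rate can make late training more stable at the cost of slower final improvement, so it is worth comparing against the constant baseline in TensorBoard.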

### Monitoring and Debugging
1. **TensorBoard**: Monitor loss curves, reward trends, and policy entropy
2. **Episode Stats**: Track win rate, average reward, episode length
3. **Action Distribution**: Check if agent is exploring enough
4. **Invalid Actions**: Monitor the rate of invalid actions
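
When tracking win rate, it helps to remember that a small evaluation set gives a noisy estimate. The sketch below computes a Wilson score interval for a win rate; this is a standard statistical technique chosen for this example rather than something the notebook uses, and the 14-of-20 result is a made-up input, not a measured outcome.

```python
import math


def wilson_interval(wins, n_games, z=1.96):
    """Wilson score interval for a win probability (z=1.96 is about 95% confidence)."""
    if n_games <= 0:
        raise ValueError("n_games must be positive")
    p_hat = wins / n_games
    denom = 1 + z**2 / n_games
    centre = (p_hat + z**2 / (2 * n_games)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / n_games + z**2 / (4 * n_games**2)
    ) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Hypothetical result: 14 wins in 20 evaluation games
low, high = wilson_interval(14, 20)
print(f"Observed win rate: {14 / 20:.0%}, plausible range: {low:.0%} to {high:.0%}")
```

With only 20 games, an observed 70% win rate is compatible with a true rate anywhere from roughly 48% to 85%, so increasing `n_eval_episodes` is advisable before comparing two configurations.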

## 🐛 Troubleshooting

| Problem | Solution |
|---------|----------|
| **Out of Memory** | Reduce `n_steps`, `batch_size`, or number of parallel environments |
| **Slow Training** | Enable GPU runtime, use `SubprocVecEnv` |
| **Unstable Learning** | Reduce learning rate, increase batch size |
| **Not Learning** | Check reward shaping, increase `ent_coef` for more exploration |
| **High Invalid Actions** | Implement action masking in the environment |
| **Agent Gets Stuck** | Increase entropy coefficient, add reward shaping |

## 📚 Next Steps

1. **Scale Up**: Train on larger maps (10x10, 14x14)
2. **Stronger Opponents**: Test against human players or other trained agents
3. **Self-Play**: Implement self-play for continual improvement
4. **Hierarchical RL**: Use goal-based policies for complex strategies
5. **Multi-Agent**: Train multiple agents in competitive scenarios
6. **Transfer Learning**: Use the trained model as a starting point for larger maps

## 📖 References

- [Stable-Baselines3 Documentation](https://stable-baselines3.readthedocs.io/)
- [PPO Paper](https://arxiv.org/abs/1707.06347)
- [Reinforce Tactics Repository](https://github.com/kuds/reinforce-tactics)
- [Gymnasium Documentation](https://gymnasium.farama.org/)