# Reinforcement Learning with Ray RLlib - Comprehensive Guide

This notebook shows how to configure and train PPO and IMPALA with Ray RLlib, with explicit control over training, model, and environment-runner hyperparameters.

## Key Hyperparameter Categories

### 1. Training Hyperparameters
- **gamma (γ)**: Discount factor (0-1). How much to value future rewards vs immediate rewards
- **lr**: Learning rate. Step size for gradient descent updates
- **train_batch_size**: Total samples used per training iteration
- **entropy_coeff**: Weight of the entropy bonus in the loss. Higher values encourage exploration by penalizing near-deterministic policies
- **vf_loss_coeff**: Weight of value function loss in total loss
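
To make the role of `gamma` concrete, here is a minimal sketch in plain Python (not RLlib code) that computes the discounted return of one episode. The reward sequence is a made-up example; the helper name `discounted_return` is an assumption for illustration only.

```python
def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t] for one episode."""
    total = 0.0
    # Work backwards using G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical episode: reward 1.0 at each of 100 steps.
rewards = [1.0] * 100
for gamma in (0.9, 0.99, 1.0):
    print(f"gamma={gamma}: return={discounted_return(rewards, gamma):.2f}")
```

With a constant reward, the return saturates near `1 / (1 - gamma)`, which is a rough way to think about how many future steps a given `gamma` takes into account.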

### 2. Model Architecture
- **fcnet_hiddens**: List of hidden layer sizes (e.g., [256, 256] = 2 layers, 256 units each)
- **fcnet_activation**: Activation function (relu, tanh, elu, etc.)
- **vf_share_layers**: Whether policy and value networks share parameters
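
As a rough illustration of what `fcnet_hiddens` and `vf_share_layers` mean for model size, the sketch below counts weights and biases of a plain fully connected network. It assumes CartPole-v1 dimensions (4 observation features, 2 discrete actions) and simple linear output heads; RLlib's actual module layout may differ, and `mlp_param_count` is a hypothetical helper.

```python
def mlp_param_count(input_dim, hiddens, output_dim):
    """Count weights plus biases of a fully connected network."""
    sizes = [input_dim, *hiddens, output_dim]
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


obs_dim, num_actions = 4, 2  # CartPole-v1
hiddens = [256, 256]

# Separate networks: one for the policy logits, one for the scalar value.
policy_params = mlp_param_count(obs_dim, hiddens, num_actions)
value_params = mlp_param_count(obs_dim, hiddens, 1)
separate_total = policy_params + value_params

# Shared trunk: hidden layers counted once, plus two small output heads.
trunk = mlp_param_count(obs_dim, hiddens[:-1], hiddens[-1])
shared_total = trunk + (hiddens[-1] * num_actions + num_actions) + (hiddens[-1] + 1)

print(f"separate networks: {separate_total} parameters")
print(f"shared trunk:      {shared_total} parameters")
```

Under these assumptions, sharing layers roughly halves the parameter count, at the price of coupling the policy and value gradients through the same hidden layers.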

### 3. Environment Runners
- **num_env_runners**: Parallel workers collecting experience
- **rollout_fragment_length**: Steps each environment collects before its fragment is sent to the learner
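
The following back-of-the-envelope sketch relates the env-runner settings to `train_batch_size`. It is simple arithmetic under the assumption that every environment contributes one full fragment per sampling round; RLlib's real scheduling (especially IMPALA's asynchronous sampling) is more involved. Both helper names are hypothetical.

```python
import math


def steps_per_sampling_round(num_env_runners, num_envs_per_env_runner,
                             rollout_fragment_length):
    """Environment steps gathered when every env delivers one fragment."""
    return num_env_runners * num_envs_per_env_runner * rollout_fragment_length


def rounds_needed(train_batch_size, steps_per_round):
    """Sampling rounds required to fill one train batch."""
    return math.ceil(train_batch_size / steps_per_round)


# Values taken from the PPO configuration in this notebook.
ppo_steps = steps_per_sampling_round(4, 1, 200)
print(ppo_steps, rounds_needed(4000, ppo_steps))

# Values taken from the IMPALA configuration in this notebook.
impala_steps = steps_per_sampling_round(4, 1, 50)
print(impala_steps, rounds_needed(512, impala_steps))
```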

### 4. Algorithm-Specific
- **PPO**: Uses clipped surrogate objective with `clip_param` for stable updates
- **IMPALA**: Uses V-trace for off-policy correction with importance sampling

---


## PPO (Proximal Policy Optimization)

PPO is an on-policy algorithm that uses a clipped surrogate objective to prevent large policy updates.

**Key features:**
- On-policy: Trains only on data collected by the current (or a very recent) policy
- Stable training through policy clipping
- Good for continuous and discrete action spaces

**Algorithm Overview:**
1. Collect trajectories using current policy
2. Compute advantages using value function
3. Update policy using clipped objective
4. Update value function
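
Steps 2 and 3 above can be sketched in plain Python. This is a minimal illustration, not RLlib's implementation: `gae_advantages` computes generalized advantage estimates (the `lam` argument plays the role of the GAE lambda), and `clipped_surrogate` evaluates the per-sample clipped objective that `clip_param` controls. The function names and the toy inputs are assumptions.

```python
import math


def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one trajectory fragment."""
    advantages = [0.0] * len(rewards)
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error, then exponentially weighted sum of TD errors.
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def clipped_surrogate(logp_new, logp_old, advantage, clip_param=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_param, min(ratio, 1.0 + clip_param))
    return min(ratio * advantage, clipped_ratio * advantage)


# Toy fragment: two steps, zero value estimates, no discounting.
print(gae_advantages([1.0, 1.0], [0.0, 0.0], 0.0, gamma=1.0, lam=1.0))

# A ratio of 1.5 with a positive advantage is capped at 1 + clip_param.
print(clipped_surrogate(math.log(1.5), 0.0, advantage=1.0))
```

The `min` makes the objective pessimistic: the policy gains nothing from pushing the probability ratio beyond the clip range in the direction the advantage favors, which is what limits the size of each update.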


In [None]:
from ray.rllib.algorithms.ppo import PPOConfig

# Create and configure PPO with detailed hyperparameters
config = (
    PPOConfig()
    .environment("CartPole-v1")
    
    # Training hyperparameters
    .training(
        gamma=0.99,                      # Discount factor for future rewards
        lr=5e-5,                         # Learning rate for optimizer
        train_batch_size=4000,           # Total batch size for training
        minibatch_size=128,              # Minibatch size per SGD step (sgd_minibatch_size in older RLlib releases)
        num_epochs=30,                   # Passes over each train batch (num_sgd_iter in older RLlib releases)
        clip_param=0.2,                  # PPO clipping parameter
        vf_clip_param=10.0,              # Value function clipping
        entropy_coeff=0.01,              # Entropy regularization coefficient
        vf_loss_coeff=1.0,               # Value function loss coefficient
        kl_coeff=0.0,                    # Initial KL penalty coefficient (0.0 turns the penalty off)
        kl_target=0.01,                  # Target KL divergence (only matters when the KL penalty is active)
    )
    
    # Model architecture
    .rl_module(
        model_config={                    # Named model_config_dict in older RLlib releases
            "fcnet_hiddens": [256, 256],  # Hidden layer sizes
            "fcnet_activation": "relu",    # Activation function
        }
    )
    
    # Rollout configuration
    .env_runners(
        num_env_runners=4,                # Number of parallel workers
        num_envs_per_env_runner=1,       # Environments per worker
        rollout_fragment_length=200,      # Steps per rollout fragment
    )
    
    # Evaluation
    .evaluation(
        evaluation_interval=10,           # Evaluate every N training iterations
        evaluation_duration=10,           # Number of episodes for evaluation
        evaluation_num_env_runners=1,    # Parallel evaluators
    )
)

# Build and train
algo_ppo = config.build_algo()
result = algo_ppo.train()
print(f"PPO Training - Episode Reward: {result['env_runners']['episode_return_mean']:.2f}")


## IMPALA (Importance Weighted Actor-Learner Architecture)

IMPALA is an off-policy algorithm designed for distributed training with high throughput.

**Key features:**
- Off-policy: Can learn from older data
- V-trace for importance sampling correction
- Designed for massively parallel environments
- **Note**: Running in local mode on Windows (distributed training requires Linux/Mac)

**Algorithm Overview:**
1. Actors collect experience continuously
2. Learner trains asynchronously on batches
3. V-trace corrects for off-policy data
4. Policy updates distributed to actors
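
Step 3 can be illustrated with a small, self-contained sketch of V-trace value targets in plain Python. It follows the standard recursive form of V-trace and is not RLlib's implementation. `clip_rho` plays the role of `vtrace_clip_rho_threshold`; `clip_c` and the function name are hypothetical names chosen for this example.

```python
import math


def vtrace_targets(logp_target, logp_behaviour, rewards, values,
                   bootstrap_value, gamma=0.99, clip_rho=1.0, clip_c=1.0):
    """V-trace value targets v_t for one trajectory fragment."""
    n = len(rewards)
    # Importance ratios pi(a|x) / mu(a|x) between learner and actor policy.
    ratios = [math.exp(lt - lb) for lt, lb in zip(logp_target, logp_behaviour)]
    rhos = [min(clip_rho, r) for r in ratios]  # clipped weights on TD errors
    cs = [min(clip_c, r) for r in ratios]      # clipped trace coefficients
    next_values = list(values[1:]) + [bootstrap_value]

    targets = [0.0] * n
    acc = 0.0  # v_{t+1} - V(x_{t+1}); zero beyond the end of the fragment
    for t in reversed(range(n)):
        delta = rhos[t] * (rewards[t] + gamma * next_values[t] - values[t])
        acc = delta + gamma * cs[t] * acc
        targets[t] = values[t] + acc
    return targets


# On-policy data (identical log-probs): targets reduce to n-step returns.
print(vtrace_targets([0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [0.0, 0.0],
                     bootstrap_value=0.0, gamma=1.0))
```

When the actor's policy lags behind the learner's, the ratios move away from 1; clipping them bounds the variance of the correction at the cost of some bias.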


In [None]:
from ray.rllib.algorithms.impala import IMPALAConfig

# Create and configure IMPALA with detailed hyperparameters (Windows-compatible)
config = (
    IMPALAConfig()
    .environment("CartPole-v1")
    
    # Distributed training configuration (local mode for Windows)
    .learners(
        num_learners=0,                   # 0 = local mode (no distributed learners)
    )
    
    # Environment runners (data collection workers)
    .env_runners(
        num_env_runners=4,                # Number of parallel environment workers
        num_envs_per_env_runner=1,       # Environments per worker
        rollout_fragment_length=50,       # Steps collected per rollout
    )
    
    # Training hyperparameters
    .training(
        gamma=0.99,                       # Discount factor for future rewards
        lr=0.0005,                        # Learning rate for optimizer
        train_batch_size=512,             # Batch size for training
        grad_clip=40.0,                   # Gradient clipping threshold
        grad_clip_by="global_norm",       # Gradient clipping method
        entropy_coeff=0.01,               # Entropy regularization (exploration)
        vf_loss_coeff=0.5,                # Value function loss coefficient
        vtrace_clip_rho_threshold=1.0,    # V-trace importance weight clip (value targets)
        vtrace_clip_pg_rho_threshold=1.0, # V-trace importance weight clip (policy gradient)
        num_epochs=1,                     # Passes over each batch (num_sgd_iter in older RLlib releases)
        replay_ratio=0.0,                 # Experience replay ratio
        replay_buffer_num_slots=0,        # Replay buffer size (0 = disabled)
    )
    
    # Model architecture
    .rl_module(
        model_config={                    # Named model_config_dict in older RLlib releases
            "fcnet_hiddens": [256, 256],  # Hidden layer sizes for policy/value networks
            "fcnet_activation": "relu",    # Activation function (relu, tanh, etc.)
            "vf_share_layers": True,       # Share layers between policy and value function
        }
    )
    
    # Evaluation settings
    .evaluation(
        evaluation_interval=5,            # Evaluate every N training iterations
        evaluation_duration=10,           # Number of episodes for evaluation
        evaluation_num_env_runners=1,    # Parallel evaluators
        evaluation_config={
            "explore": False,             # Disable exploration during evaluation
        }
    )
    
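    # Resource allocation (in recent RLlib releases these settings live on
    # .learners() and .env_runners() rather than .resources())
    .learners(
        num_gpus_per_learner=0,           # GPUs per learner (0 = CPU only)
    )
    .env_runners(
        num_cpus_per_env_runner=1,        # CPUs per environment runner
    )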
)

# Build and train
algo_impala = config.build_algo()
result = algo_impala.train()

# Display results
print(f"\n{'='*60}")
print("IMPALA Training Results (Iteration 1)")
print(f"{'='*60}")
print(f"Episode Reward Mean:  {result['env_runners']['episode_return_mean']:.2f}")
print(f"Episode Length Mean:  {result['env_runners']['episode_len_mean']:.2f}")
print(f"Episodes This Iter:   {result['env_runners']['num_episodes']}")
print(f"{'='*60}\n")


In [None]:
import pandas as pd

def train_algorithm(algo, num_iterations=10, algorithm_name="Algorithm"):
    """Train an algorithm and track metrics"""
    results = []
    
    print(f"\nTraining {algorithm_name}...")
    print("=" * 70)
    
    for i in range(num_iterations):
        result = algo.train()
        
        episode_reward = result['env_runners']['episode_return_mean']
        episode_length = result['env_runners']['episode_len_mean']
        
        results.append({
            'iteration': i + 1,
            'reward_mean': episode_reward,
            'length_mean': episode_length,
        })
        
        print(f"Iter {i+1:2d} | Reward: {episode_reward:7.2f} | Length: {episode_length:6.2f}")
    
    print("=" * 70)
    
    return pd.DataFrame(results)

# Example usage:
# results_ppo = train_algorithm(algo_ppo, num_iterations=20, algorithm_name="PPO")
# results_impala = train_algorithm(algo_impala, num_iterations=20, algorithm_name="IMPALA")


## Hyperparameter Tuning Guide

### If training is unstable:
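- **Decrease learning rate** (lr): Try 1e-5 to 1e-4
- **Tighten gradient clipping** (grad_clip): Lower the threshold, e.g. 5.0 to 20.0 instead of 40.0
- **Adjust PPO clip_param**: Try 0.1 to 0.3; smaller values allow smaller policy changes per update
- **Increase batch size** (train_batch_size): Larger batches give less noisy gradient estimates, at the cost of slower iterations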
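
To show what `grad_clip` with `grad_clip_by="global_norm"` does conceptually, here is a minimal sketch of clipping by global norm on a flat list of gradient values. It is plain Python for illustration (real frameworks operate on tensors), and `clip_by_global_norm` is a hypothetical helper name.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together so their joint L2 norm is <= max_norm."""
    global_norm = math.sqrt(sum(g * g for g in grads))
    if global_norm <= max_norm:
        return list(grads), global_norm
    scale = max_norm / global_norm
    return [g * scale for g in grads], global_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=2.5)
print(norm_before, clipped)
```

Because every gradient is scaled by the same factor, the update direction is preserved and only its length is limited. A lower threshold therefore means more aggressive clipping.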

### If learning is too slow:
- **Increase learning rate** (lr): Try 1e-4 to 1e-3, and back off if training becomes unstable
- **Increase batch size** (train_batch_size): More samples per update
- **More env runners**: Collect data faster
- **Larger networks**: More capacity to learn complex policies

### For better exploration:
- **Increase entropy_coeff**: Try 0.01 to 0.1
- **Add noise to actions**: Use exploration config
- **Decrease gamma**: Shifts focus to short-term rewards; note that this changes what the agent optimizes rather than how it explores
- **Use curiosity-driven exploration**: Add intrinsic rewards
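
The effect of `entropy_coeff` is easier to see with numbers. The sketch below computes the entropy of a categorical action distribution in plain Python; the two example distributions are made up, and the way the bonus is scaled here (coefficient times entropy) is a simplified view of how an entropy term enters a policy loss.

```python
import math


def categorical_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


entropy_coeff = 0.01
for name, probs in [("uniform", [0.5, 0.5]), ("near-deterministic", [0.99, 0.01])]:
    h = categorical_entropy(probs)
    print(f"{name}: entropy={h:.4f}, bonus={entropy_coeff * h:.6f}")
```

A near-deterministic policy earns almost no bonus, so raising `entropy_coeff` increases the pressure to keep some probability on alternative actions.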

### For sample efficiency:
- **IMPALA with replay**: Set replay_ratio > 0 and replay_buffer_num_slots > 0
- **Larger networks**: Increase fcnet_hiddens (e.g., [512, 512])
- **More SGD iterations**: Increase num_epochs (formerly num_sgd_iter) so each batch is reused more often; most relevant for PPO
- **Lower discount factor**: Faster learning on short-term tasks

### Common hyperparameter ranges:

| Parameter | Typical Range | Good Starting Point | Notes |
|-----------|---------------|---------------------|-------|
| lr | 1e-5 to 1e-3 | 5e-5 | Lower for continuous control |
| gamma | 0.95 to 0.999 | 0.99 | Higher for long-horizon tasks |
| entropy_coeff | 0.001 to 0.1 | 0.01 | Higher for more exploration |
| train_batch_size | 512 to 8192 | 2048 | Depends on task complexity |
| fcnet_hiddens | [64,64] to [512,512] | [256,256] | Larger for complex tasks |
| clip_param (PPO) | 0.1 to 0.3 | 0.2 | Standard value works well |
| num_epochs (PPO; formerly num_sgd_iter) | 3 to 30 | 10-20 | More passes reuse each batch more, but too many can destabilize training |

### Algorithm Selection Guide:

**Use PPO when:**
- You want stable, reliable training
- Sample efficiency is important
- You have moderate compute resources
- Task has sparse rewards

**Use IMPALA when:**
- You need high throughput
- You have many parallel environments
- Wall-clock time is more important than sample efficiency
- You have distributed compute resources
