# Policy Gradient with Variance Reduction Techniques

This notebook implements the REINFORCE policy gradient algorithm with variance reduction techniques:
- **Reward-to-go**: Computing returns from current timestep onwards
- **Advantage normalization**: Normalizing advantages to have mean 0 and std 1
- **Baseline function**: Using value function to reduce variance

## Environments:
- **CartPole-v1**: Classic control problem
- **LunarLander-v2**: Rocket landing task with a continuous state space and discrete actions

## 1. Setup and Installation

In [None]:
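# Install required packages (uncomment if needed)
# LunarLander requires the Box2D extra. Gymnasium 1.0+ replaces LunarLander-v2 with LunarLander-v3.
# !pip install "gymnasium[box2d]" torch matplotlib numpy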

In [None]:
import gymnasium as gym
import numpy as np
import matplotlib.pyplot as plt
import torch
import random

# Import custom modules
from pg_agent import PolicyGradientAgent, device
from pg_utils import (
    plot_learning_curves, 
    plot_single_training_curve,
    compare_configurations,
    save_results, 
    load_results
)
from train_pg import train_policy_gradient

# Set random seeds
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

print(f"Using device: {device}")

## 2. Part 1: Environment Exploration

Load environments and understand their characteristics using random agents.

### 2.1 Load Environments and Inspect State/Action Spaces

In [None]:
def explore_environment(env_name):
    """Load environment and print state/action space information"""
    print(f"\n{'='*60}")
    print(f"Environment: {env_name}")
    print(f"{'='*60}")
    
    env = gym.make(env_name)
    
    print(f"\nObservation Space: {env.observation_space}")
    print(f"Action Space: {env.action_space}")
    
    if isinstance(env.observation_space, gym.spaces.Box):
        print(f"Observation Shape: {env.observation_space.shape}")
        print(f"Observation Low: {env.observation_space.low}")
        print(f"Observation High: {env.observation_space.high}")
    
    if isinstance(env.action_space, gym.spaces.Discrete):
        print(f"Number of Actions: {env.action_space.n}")
    
    # Sample observation
    obs, _ = env.reset(seed=SEED)
    print(f"\nSample Observation: {obs}")
    
    env.close()

# Explore CartPole
explore_environment("CartPole-v1")

# Explore LunarLander
explore_environment("LunarLander-v2")

### 2.2 Random Agent to Understand Reward Function

In [None]:
def test_random_agent(env_name, num_episodes=10, max_steps=1000):
    """Test random agent and analyze rewards"""
    print(f"\n{'='*60}")
    print(f"Random Agent Testing: {env_name}")
    print(f"{'='*60}")
    
    env = gym.make(env_name)
    episode_rewards = []
    episode_lengths = []
    all_rewards = []
    
    for episode in range(num_episodes):
        obs, _ = env.reset(seed=SEED + episode)
        total_reward = 0
        steps = 0
        done = False
        truncated = False
        
        while not (done or truncated) and steps < max_steps:
            action = env.action_space.sample()  # Random action
            obs, reward, done, truncated, info = env.step(action)
            total_reward += reward
            all_rewards.append(reward)
            steps += 1
        
        episode_rewards.append(total_reward)
        episode_lengths.append(steps)
        print(f"Episode {episode+1}: Reward = {total_reward:.2f}, Steps = {steps}")
    
    print(f"\nStatistics:")
    print(f"Mean Episode Reward: {np.mean(episode_rewards):.2f} ± {np.std(episode_rewards):.2f}")
    print(f"Mean Episode Length: {np.mean(episode_lengths):.2f} ± {np.std(episode_lengths):.2f}")
    print(f"Reward Range: [{np.min(all_rewards):.2f}, {np.max(all_rewards):.2f}]")
    
    env.close()
    return episode_rewards, episode_lengths

# Test random agents
cartpole_rewards, cartpole_lengths = test_random_agent("CartPole-v1", num_episodes=10)
lunar_rewards, lunar_lengths = test_random_agent("LunarLander-v2", num_episodes=10)

### 2.3 Observations from Random Agents

**CartPole-v1:**
- **State space**: 4D continuous (cart position, cart velocity, pole angle, pole angular velocity)
- **Action space**: 2 discrete actions (push left=0, push right=1)
- **Reward**: +1 for every timestep the pole stays upright
- **Random agent**: Typically achieves 15-30 steps before failure
- **Challenge**: Need to learn coordinated actions to balance pole
- **Success criterion**: Average reward of 475+ over 100 episodes

**LunarLander-v2:**
- **State space**: 8D continuous (position, velocity, angle, angular velocity, leg contact)
- **Action space**: 4 discrete actions (do nothing, fire left, fire main, fire right)
- **Reward**: 
  - Positive for moving towards landing pad
  - Large positive for landing successfully
  - Negative for crashing or using fuel
- **Random agent**: Typically achieves -200 to -100 (crashes)
- **Challenge**: Complex dynamics requiring careful control
- **Success criterion**: Average reward of 200+ over 100 episodes

## 3. Part 2: Policy Gradient Implementation

Implement and compare different variance reduction techniques.

### 3.1 Training Configurations

We will test 4 different configurations:
1. **Baseline**: Vanilla REINFORCE with full-trajectory returns (no reward-to-go, no advantage normalization, no value baseline)
2. **Reward-to-go**: With reward-to-go, no advantage normalization
3. **Advantage normalization**: With both reward-to-go and advantage normalization
4. **Full (with baseline)**: All variance reduction techniques including value baseline
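
Configurations 1 and 2 differ only in the weight attached to each log-probability term. The sketch below is a plain-Python illustration, not the code in `pg_agent.py`. The function name `policy_gradient_weights` is hypothetical, and the sketch assumes discounted sums, with discounting measured from the current timestep in the reward-to-go case.

```python
def policy_gradient_weights(rewards, gamma=0.99, reward_to_go=True):
    """Return one weight per timestep for a single episode.

    Illustrative helper (hypothetical name), not taken from pg_agent.py.
    """
    if reward_to_go:
        # Walk backwards so each step accumulates only the rewards that follow it.
        weights = [0.0] * len(rewards)
        running = 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            weights[t] = running
        return weights
    # Without reward-to-go, every timestep is weighted by the full discounted return.
    total = sum(gamma ** t * r for t, r in enumerate(rewards))
    return [total] * len(rewards)


episode_rewards = [1.0, 1.0, 1.0]
print(policy_gradient_weights(episode_rewards, gamma=0.5, reward_to_go=True))   # [1.75, 1.5, 1.0]
print(policy_gradient_weights(episode_rewards, gamma=0.5, reward_to_go=False))  # [1.75, 1.75, 1.75]
```

With reward-to-go, later actions receive smaller weights because fewer rewards remain for them to influence. Without it, every action in the episode is credited with the same total.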

### 3.2 Train on CartPole-v1

In [None]:
# Training parameters for CartPole
cartpole_params = {
    'env_name': 'CartPole-v1',
    'num_iterations': 100,
    'batch_size': 5000,
    'lr': 1e-2,
    'gamma': 0.99,
    'hidden_sizes': [32, 32],
    'max_episode_length': 500,
    'print_freq': 10,
    'seed': SEED
}

print("Training Policy Gradient on CartPole-v1 with different configurations...")
print("This will take approximately 15-20 minutes.\n")

In [None]:
# Configuration 1: Baseline (no variance reduction)
print("\n" + "="*80)
print("Configuration 1: Baseline (No Variance Reduction)")
print("="*80)

cartpole_baseline = train_policy_gradient(
    **cartpole_params,
    use_reward_to_go=False,
    use_advantage_normalization=False,
    use_baseline=False,
    save_path='pg_cartpole_baseline.pkl'
)

In [None]:
# Configuration 2: Reward-to-go only
print("\n" + "="*80)
print("Configuration 2: Reward-to-go")
print("="*80)

cartpole_rtg = train_policy_gradient(
    **cartpole_params,
    use_reward_to_go=True,
    use_advantage_normalization=False,
    use_baseline=False,
    save_path='pg_cartpole_rtg.pkl'
)

In [None]:
# Configuration 3: Reward-to-go + Advantage normalization
print("\n" + "="*80)
print("Configuration 3: Reward-to-go + Advantage Normalization")
print("="*80)

cartpole_rtg_norm = train_policy_gradient(
    **cartpole_params,
    use_reward_to_go=True,
    use_advantage_normalization=True,
    use_baseline=False,
    save_path='pg_cartpole_rtg_norm.pkl'
)

In [None]:
# Configuration 4: Full (all variance reduction techniques)
print("\n" + "="*80)
print("Configuration 4: Full (RTG + Norm + Baseline)")
print("="*80)

cartpole_full = train_policy_gradient(
    **cartpole_params,
    use_reward_to_go=True,
    use_advantage_normalization=True,
    use_baseline=True,
    save_path='pg_cartpole_full.pkl'
)

In [None]:
# Compare all CartPole configurations
cartpole_results = {
    'Baseline': cartpole_baseline,
    'Reward-to-go': cartpole_rtg,
    'RTG + Norm': cartpole_rtg_norm,
    'Full (RTG + Norm + Baseline)': cartpole_full
}

plot_learning_curves(cartpole_results, 'CartPole-v1', window=5, 
                    save_path='cartpole_comparison.png')

In [None]:
# Performance comparison
compare_configurations(cartpole_results, 'CartPole-v1', 
                      save_path='cartpole_performance.png')

### 3.3 Train on LunarLander-v2

In [None]:
# Training parameters for LunarLander
lunar_params = {
    'env_name': 'LunarLander-v2',
    'num_iterations': 200,
    'batch_size': 5000,
    'lr': 5e-3,
    'gamma': 0.99,
    'hidden_sizes': [64, 64],
    'max_episode_length': 1000,
    'print_freq': 10,
    'seed': SEED
}

print("Training Policy Gradient on LunarLander-v2 with different configurations...")
print("This will take approximately 30-40 minutes.\n")

In [None]:
# Configuration 1: Baseline (no variance reduction)
print("\n" + "="*80)
print("Configuration 1: Baseline (No Variance Reduction)")
print("="*80)

lunar_baseline = train_policy_gradient(
    **lunar_params,
    use_reward_to_go=False,
    use_advantage_normalization=False,
    use_baseline=False,
    save_path='pg_lunar_baseline.pkl'
)

In [None]:
# Configuration 2: Reward-to-go only
print("\n" + "="*80)
print("Configuration 2: Reward-to-go")
print("="*80)

lunar_rtg = train_policy_gradient(
    **lunar_params,
    use_reward_to_go=True,
    use_advantage_normalization=False,
    use_baseline=False,
    save_path='pg_lunar_rtg.pkl'
)

In [None]:
# Configuration 3: Reward-to-go + Advantage normalization
print("\n" + "="*80)
print("Configuration 3: Reward-to-go + Advantage Normalization")
print("="*80)

lunar_rtg_norm = train_policy_gradient(
    **lunar_params,
    use_reward_to_go=True,
    use_advantage_normalization=True,
    use_baseline=False,
    save_path='pg_lunar_rtg_norm.pkl'
)

In [None]:
# Configuration 4: Full (all variance reduction techniques)
print("\n" + "="*80)
print("Configuration 4: Full (RTG + Norm + Baseline)")
print("="*80)

lunar_full = train_policy_gradient(
    **lunar_params,
    use_reward_to_go=True,
    use_advantage_normalization=True,
    use_baseline=True,
    save_path='pg_lunar_full.pkl'
)

In [None]:
# Compare all LunarLander configurations
lunar_results = {
    'Baseline': lunar_baseline,
    'Reward-to-go': lunar_rtg,
    'RTG + Norm': lunar_rtg_norm,
    'Full (RTG + Norm + Baseline)': lunar_full
}

plot_learning_curves(lunar_results, 'LunarLander-v2', window=10, 
                    save_path='lunar_comparison.png')

In [None]:
# Performance comparison
compare_configurations(lunar_results, 'LunarLander-v2', 
                      save_path='lunar_performance.png')

## 4. Part 3: Impact of Batch Size on Policy Gradient

In this section, we study how batch size affects policy gradient estimates. Batch size determines how many timesteps we collect before performing a policy update.

**Key Questions:**
1. How does batch size affect variance of gradient estimates?
2. How does batch size impact convergence speed?
3. What is the trade-off between sample efficiency and computational cost?
4. What is the optimal batch size for each environment?

**Batch Sizes to Test:**
- **1000**: Small batch (high variance, fast updates)
- **2500**: Medium-small batch
- **5000**: Medium batch (baseline)
- **10000**: Large batch (low variance, slow updates)

We will use the **full configuration** (RTG + Norm + Baseline) for all experiments to isolate the effect of batch size.

### 4.1 Batch Size Experiments on CartPole-v1

In [None]:
# Batch sizes to test
batch_sizes = [1000, 2500, 5000, 10000]
cartpole_batch_results = {}

print("="*80)
print("BATCH SIZE STUDY: CartPole-v1")
print("="*80)
print(f"\nTesting {len(batch_sizes)} different batch sizes")
print(f"Batch sizes: {batch_sizes}")
print(f"Using full variance reduction (RTG + Norm + Baseline)")
print(f"This will take approximately {len(batch_sizes) * 4} minutes\n")

# Base parameters for CartPole
base_params = {
    'env_name': 'CartPole-v1',
    'num_iterations': 100,
    'lr': 1e-2,
    'gamma': 0.99,
    'use_reward_to_go': True,
    'use_advantage_normalization': True,
    'use_baseline': True,
    'hidden_sizes': [32, 32],
    'max_episode_length': 500,
    'print_freq': 20,
    'seed': SEED
}

for batch_size in batch_sizes:
    print(f"\n{'='*80}")
    print(f"Training with Batch Size: {batch_size}")
    print(f"{'='*80}\n")
    
    results = train_policy_gradient(
        **base_params,
        batch_size=batch_size,
        save_path=f'pg_cartpole_batch_{batch_size}.pkl'
    )
    
    cartpole_batch_results[f'Batch {batch_size}'] = results
    
    print(f"\nCompleted batch size {batch_size}")
    print(f"Final mean return: {results['iteration_returns'][-1]:.2f}")
    print(f"Best mean return: {max(results['iteration_returns']):.2f}")

print("\n" + "="*80)
print("CartPole batch size experiments completed!")
print("="*80)

In [None]:
# Plot learning curves for different batch sizes
plot_learning_curves(cartpole_batch_results, 'CartPole-v1 - Batch Size Comparison', 
                    window=5, save_path='cartpole_batch_comparison.png')

In [None]:
# Performance comparison
compare_configurations(cartpole_batch_results, 'CartPole-v1 - Batch Size Impact',
                      save_path='cartpole_batch_performance.png')

### 4.2 Batch Size Experiments on LunarLander-v2

In [None]:
# Batch sizes to test (same as CartPole)
lunar_batch_results = {}

print("="*80)
print("BATCH SIZE STUDY: LunarLander-v2")
print("="*80)
print(f"\nTesting {len(batch_sizes)} different batch sizes")
print(f"Batch sizes: {batch_sizes}")
print(f"Using full variance reduction (RTG + Norm + Baseline)")
print(f"This will take approximately {len(batch_sizes) * 10} minutes\n")

# Base parameters for LunarLander
base_params = {
    'env_name': 'LunarLander-v2',
    'num_iterations': 200,
    'lr': 5e-3,
    'gamma': 0.99,
    'use_reward_to_go': True,
    'use_advantage_normalization': True,
    'use_baseline': True,
    'hidden_sizes': [64, 64],
    'max_episode_length': 1000,
    'print_freq': 20,
    'seed': SEED
}

for batch_size in batch_sizes:
    print(f"\n{'='*80}")
    print(f"Training with Batch Size: {batch_size}")
    print(f"{'='*80}\n")
    
    results = train_policy_gradient(
        **base_params,
        batch_size=batch_size,
        save_path=f'pg_lunar_batch_{batch_size}.pkl'
    )
    
    lunar_batch_results[f'Batch {batch_size}'] = results
    
    print(f"\nCompleted batch size {batch_size}")
    print(f"Final mean return: {results['iteration_returns'][-1]:.2f}")
    print(f"Best mean return: {max(results['iteration_returns']):.2f}")

print("\n" + "="*80)
print("LunarLander batch size experiments completed!")
print("="*80)

In [None]:
# Plot learning curves for different batch sizes
plot_learning_curves(lunar_batch_results, 'LunarLander-v2 - Batch Size Comparison',
                    window=10, save_path='lunar_batch_comparison.png')

In [None]:
# Performance comparison
compare_configurations(lunar_batch_results, 'LunarLander-v2 - Batch Size Impact',
                      save_path='lunar_batch_performance.png')

### 4.3 Batch Size Analysis and Observations

#### Impact of Batch Size on Policy Gradient Estimates

**Theoretical Background:**

Batch size mainly affects the variance of policy gradient estimates; averaging over more samples does not change the quantity being estimated:
- **Larger batches**: Lower variance (more samples to average), but slower updates
- **Smaller batches**: Higher variance (fewer samples), but faster updates

The policy gradient estimator is:
```
∇J(θ) ≈ (1/N) ∑_{i=1}^N ∑_t ∇log π_θ(a_t|s_t) * A_t
```

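Where N is the number of trajectories in the batch. Because batch size is measured in timesteps here, N is roughly the batch size divided by the average episode length. For independent trajectories, the variance of this average decreases as O(1/N), so its standard deviation decreases as O(1/√N).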
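
The scaling of an average of N independent samples (variance proportional to 1/N, standard deviation proportional to 1/√N) can be checked numerically. The sketch below is a toy stand-in, not a measurement of the actual policy gradient: it averages unit-variance Gaussian draws in place of per-trajectory gradient terms, and `variance_of_batch_mean` is a hypothetical helper name.

```python
import random
import statistics


def variance_of_batch_mean(n_samples, n_trials=2000, seed=0):
    """Empirical variance of the mean of n_samples unit-variance Gaussian draws."""
    rng = random.Random(seed)
    means = [
        sum(rng.gauss(0.0, 1.0) for _ in range(n_samples)) / n_samples
        for _ in range(n_trials)
    ]
    return statistics.pvariance(means)


for n in (10, 40, 160):
    var = variance_of_batch_mean(n)
    # If variance scales as 1/N, the product N * variance stays close to 1.
    print(f"N={n:>3}  variance of mean={var:.4f}  N * variance={n * var:.2f}")
```

Quadrupling N should cut the variance by roughly a factor of four and the standard deviation by roughly a factor of two.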

#### Expected Observations:

**Small Batch Size (1000):**
- **Pros**: Fast updates, more iterations per unit time
- **Cons**: High variance, noisy gradients, unstable learning
- **Expected behavior**: Oscillating learning curves, may converge slowly

**Medium-Small Batch Size (2500):**
- **Pros**: Good balance, reasonable variance
- **Cons**: Still some noise in gradients
- **Expected behavior**: Moderate stability, decent convergence

**Medium Batch Size (5000):**
- **Pros**: Low variance, stable learning, good sample efficiency
- **Cons**: Slower updates than smaller batches
- **Expected behavior**: Smooth learning curves, reliable convergence

**Large Batch Size (10000):**
- **Pros**: Very low variance, very stable gradients
- **Cons**: Slow updates, may be sample inefficient
- **Expected behavior**: Very smooth curves, but may converge slowly in wall-clock time

#### Key Insights:

1. **Variance-Speed Tradeoff**: Larger batches reduce gradient variance but require more samples per update
2. **Sample Efficiency**: Medium batch sizes (2500-5000) often provide best sample efficiency
3. **Computational Efficiency**: Smaller batches allow more frequent updates
4. **Environment Dependency**: Optimal batch size depends on environment complexity
5. **Diminishing Returns**: Beyond a certain point, increasing batch size provides minimal benefit
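
Every run in this study uses the same `num_iterations`, so runs with larger batches also consume more environment steps in total. Curves plotted against iteration are therefore not compared at equal sample budgets. The sketch below converts iteration indices into approximate cumulative timesteps. `cumulative_timesteps` is a hypothetical helper, and it assumes each iteration collects exactly `batch_size` timesteps (a real collector may overshoot slightly while finishing its last episode).

```python
def cumulative_timesteps(batch_size, num_iterations):
    """Approximate environment steps consumed by the end of each iteration."""
    return [batch_size * (i + 1) for i in range(num_iterations)]


for batch_size in (1000, 2500, 5000, 10000):
    steps = cumulative_timesteps(batch_size, num_iterations=100)
    print(f"batch_size={batch_size:>5}  timesteps after 100 iterations={steps[-1]:,}")
```

Plotting returns against these values instead of the iteration index gives a like-for-like view of sample efficiency.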

#### Recommendations:

**For CartPole-v1:**
- Optimal batch size: 2500-5000
- Simpler environment, smaller batches work well
- Gradient noise is more tolerable because episodes are short and cheap to collect

**For LunarLander-v2:**
- Optimal batch size: 5000-10000
- More complex environment benefits from larger batches
- Lower variance crucial for stable learning

**General Guidelines:**
1. Start with batch size of 5000 timesteps
2. If learning is unstable, increase batch size
3. If learning is too slow, decrease batch size
4. Monitor both sample efficiency and wall-clock time
5. Consider computational resources when choosing batch size

In [None]:
# Detailed quantitative comparison
print("="*80)
print("BATCH SIZE IMPACT - QUANTITATIVE ANALYSIS")
print("="*80)

print("\n" + "="*80)
print("CartPole-v1 Results:")
print("="*80)
print(f"{'Batch Size':<15} {'Final (mean last 10)':<20} {'Best Return':<20} {'Std Dev (last 20)':<20}")
print("-"*80)

for config_name, results in cartpole_batch_results.items():
    returns = results['iteration_returns']
    final_return = np.mean(returns[-10:])
    best_return = np.max(returns)
    std_dev = np.std(returns[-20:])
    print(f"{config_name:<15} {final_return:<20.2f} {best_return:<20.2f} {std_dev:<20.2f}")

print("\n" + "="*80)
print("LunarLander-v2 Results:")
print("="*80)
print(f"{'Batch Size':<15} {'Final (mean last 10)':<20} {'Best Return':<20} {'Std Dev (last 20)':<20}")
print("-"*80)

for config_name, results in lunar_batch_results.items():
    returns = results['iteration_returns']
    final_return = np.mean(returns[-10:])
    best_return = np.max(returns)
    std_dev = np.std(returns[-20:])
    print(f"{config_name:<15} {final_return:<20.2f} {best_return:<20.2f} {std_dev:<20.2f}")

print("\n" + "="*80)
print("Key Observations:")
print("="*80)
print("1. Variance (Std Dev): Should decrease with larger batch sizes")
print("2. Final Return: Should be similar across batch sizes (if converged)")
print("3. Best Return: Indicates peak performance achieved")
print("4. Stability: Lower std dev indicates more stable learning")
print("="*80)

In [None]:
# Plot variance vs batch size
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# CartPole variance analysis
ax = axes[0]
batch_sizes_list = [1000, 2500, 5000, 10000]
cartpole_variances = []

cartpole_sizes = []  # batch sizes with results, so x and y stay the same length
for batch_size in batch_sizes_list:
    config_name = f'Batch {batch_size}'
    if config_name in cartpole_batch_results:
        returns = cartpole_batch_results[config_name]['iteration_returns']
        std_dev = np.std(returns[-20:])  # Std dev of returns over the last 20 iterations
        cartpole_sizes.append(batch_size)
        cartpole_variances.append(std_dev)

ax.plot(cartpole_sizes, cartpole_variances, marker='o', linewidth=2, markersize=8)
ax.set_xlabel('Batch Size', fontsize=12)
ax.set_ylabel('Standard Deviation (last 20 iter)', fontsize=12)
ax.set_title('CartPole-v1: Return Std Dev vs Batch Size', fontsize=13, fontweight='bold')
ax.grid(True, alpha=0.3)
ax.set_xscale('log')

# LunarLander variance analysis
ax = axes[1]
lunar_variances = []

lunar_sizes = []  # batch sizes with results, so x and y stay the same length
for batch_size in batch_sizes_list:
    config_name = f'Batch {batch_size}'
    if config_name in lunar_batch_results:
        returns = lunar_batch_results[config_name]['iteration_returns']
        std_dev = np.std(returns[-20:])  # Std dev of returns over the last 20 iterations
        lunar_sizes.append(batch_size)
        lunar_variances.append(std_dev)

ax.plot(lunar_sizes, lunar_variances, marker='o', linewidth=2, markersize=8, color='orange')
ax.set_xlabel('Batch Size', fontsize=12)
ax.set_ylabel('Standard Deviation (last 20 iter)', fontsize=12)
ax.set_title('LunarLander-v2: Return Std Dev vs Batch Size', fontsize=13, fontweight='bold')
ax.grid(True, alpha=0.3)
ax.set_xscale('log')

plt.tight_layout()
plt.savefig('batch_size_variance_analysis.png', dpi=300, bbox_inches='tight')
print("Variance analysis plot saved to batch_size_variance_analysis.png")
plt.show()

print("\nExpected Pattern: Std dev of returns should decrease as batch size increases")
print("Theory for an average of N independent samples: Var ∝ 1/N, i.e. std ∝ 1/√N")

## 5. Analysis and Observations

### 5.1 CartPole-v1 Results

**Expected Observations:**
- **Baseline**: High variance, slower convergence, unstable learning
- **Reward-to-go**: Reduced variance, faster convergence than baseline
- **RTG + Normalization**: More stable learning, consistent improvement
- **Full (with baseline)**: Best performance, fastest convergence, most stable

**Key Insights:**
1. Reward-to-go significantly reduces variance by only considering future rewards
2. Advantage normalization stabilizes training by centering advantages and rescaling them to unit standard deviation
3. Value baseline further reduces variance by subtracting state-dependent baseline
4. Combined techniques provide best results
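
Insight 3 relies on the fact that subtracting a baseline b(s_t) that does not depend on the action leaves the expected gradient unchanged. A short derivation for a discrete action space, as in both environments used here:

```latex
\begin{aligned}
\mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}
  \big[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t) \big]
  &= b(s_t) \sum_{a} \pi_\theta(a \mid s_t)\, \nabla_\theta \log \pi_\theta(a \mid s_t) \\
  &= b(s_t) \sum_{a} \nabla_\theta \pi_\theta(a \mid s_t) \\
  &= b(s_t)\, \nabla_\theta \sum_{a} \pi_\theta(a \mid s_t)
   = b(s_t)\, \nabla_\theta 1 = 0
\end{aligned}
```

Replacing the weight A_t with A_t - b(s_t) therefore changes only the variance of the estimate, not its mean.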

### 5.2 LunarLander-v2 Results

**Expected Observations:**
- More challenging environment shows greater benefit from variance reduction
- Baseline configuration may fail to learn or learn very slowly
- Full configuration should achieve positive rewards consistently
- Advantage normalization is crucial for this environment

**Key Insights:**
1. Variance reduction is essential for complex environments
2. Without proper techniques, policy gradient can be very unstable
3. Value baseline helps agent understand state-dependent expected returns
4. Normalization keeps the scale of the gradient consistent across batches, which makes the learning rate easier to tune

### 5.3 Comparison of Variance Reduction Techniques

**Reward-to-go (Ψt = Gt:∞):**
- **Pros**: Reduces variance by ignoring past rewards, unbiased estimator
- **Cons**: Still has high variance without normalization
- **Impact**: Moderate improvement in convergence speed

**Advantage Normalization:**
- **Pros**: Stabilizes gradients, prevents extreme updates, improves sample efficiency
- **Cons**: Adds computational overhead (minimal)
- **Impact**: Significant improvement in stability

**Value Baseline:**
- **Pros**: State-dependent variance reduction; the value network is fit to observed returns, so no hand-tuned baseline is needed
- **Cons**: Requires additional network and training
- **Impact**: Best variance reduction, fastest convergence
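
The value baseline and advantage normalization compose in a fixed order: subtract the baseline first, then standardize. The sketch below uses plain Python and is not the code in `pg_agent.py`. `normalized_advantages` is a hypothetical name, the value estimates are placeholder numbers standing in for a critic's outputs, and the population standard deviation is an assumption (an implementation may use the sample standard deviation instead).

```python
import statistics


def normalized_advantages(rewards_to_go, value_estimates, eps=1e-8):
    """Subtract a value baseline, then standardize across the batch."""
    advantages = [g - v for g, v in zip(rewards_to_go, value_estimates)]
    mean = statistics.fmean(advantages)
    std = statistics.pstdev(advantages)
    # eps guards against division by zero when all advantages are equal.
    return [(a - mean) / (std + eps) for a in advantages]


rewards_to_go = [3.0, 1.0, 2.5, 1.5]
value_estimates = [2.0, 2.0, 2.0, 2.0]  # placeholder critic outputs
print(normalized_advantages(rewards_to_go, value_estimates))
```

Actions that did better than the baseline end up with positive weights and the rest with negative weights, regardless of the raw reward scale.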

### 5.4 Recommendations

1. **Always use reward-to-go**: Minimal overhead, significant benefit
2. **Use advantage normalization**: Essential for stable training
3. **Use value baseline for complex tasks**: Worth the additional computation
4. **Tune learning rate carefully**: Too high causes instability, too low is slow
5. **Use appropriate batch size**: Larger batches reduce variance but increase computation

## 6. Command-Line Usage

The implementation can also be run from command line using `train_pg.py`:

```bash
# Basic usage with default settings
python train_pg.py --env CartPole-v1

# With custom parameters
python train_pg.py --env LunarLander-v2 --num_iterations 200 --batch_size 5000 --lr 0.005

# Without reward-to-go
python train_pg.py --env CartPole-v1 --no_reward_to_go

# Without advantage normalization
python train_pg.py --env CartPole-v1 --no_advantage_norm

# Without baseline
python train_pg.py --env CartPole-v1 --no_baseline

# All options disabled (baseline configuration)
python train_pg.py --env CartPole-v1 --no_reward_to_go --no_advantage_norm --no_baseline

# Custom network architecture
python train_pg.py --env LunarLander-v2 --hidden_sizes 128 128 64

# Save to specific path
python train_pg.py --env CartPole-v1 --save_path my_results.pkl
```

### Available Command-Line Arguments:
- `--env`: Environment name (default: CartPole-v1)
- `--num_iterations`: Number of training iterations (default: 100)
- `--batch_size`: Batch size in timesteps (default: 5000)
- `--lr`: Learning rate (default: 3e-4)
- `--gamma`: Discount factor (default: 0.99)
- `--reward_to_go` / `--no_reward_to_go`: Enable/disable reward-to-go
- `--advantage_norm` / `--no_advantage_norm`: Enable/disable advantage normalization
- `--baseline` / `--no_baseline`: Enable/disable value baseline
- `--hidden_sizes`: Network hidden layer sizes (default: 64 64)
- `--max_episode_length`: Maximum episode length (default: 1000)
- `--print_freq`: Print frequency (default: 10)
- `--seed`: Random seed (default: 42)
- `--save_path`: Path to save results (default: auto-generated)
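
For readers who want to build a similar interface, paired on/off flags like those listed above can be expressed with `argparse` as follows. This is a sketch under assumptions, not the parser in `train_pg.py`: the destination names mirror the keyword arguments used in this notebook, only a subset of the arguments is shown, and defaulting each technique to enabled is an assumption.

```python
import argparse


def build_parser():
    """Sketch of a parser matching the documented flags (not the actual train_pg.py)."""
    parser = argparse.ArgumentParser(description="Policy gradient training (sketch)")
    parser.add_argument("--env", default="CartPole-v1")
    parser.add_argument("--num_iterations", type=int, default=100)
    parser.add_argument("--batch_size", type=int, default=5000)
    parser.add_argument("--lr", type=float, default=3e-4)
    parser.add_argument("--gamma", type=float, default=0.99)
    parser.add_argument("--hidden_sizes", type=int, nargs="+", default=[64, 64])
    # Each technique gets an on/off flag pair writing to one shared destination.
    for name, dest in [("reward_to_go", "use_reward_to_go"),
                       ("advantage_norm", "use_advantage_normalization"),
                       ("baseline", "use_baseline")]:
        parser.add_argument(f"--{name}", dest=dest, action="store_true")
        parser.add_argument(f"--no_{name}", dest=dest, action="store_false")
        parser.set_defaults(**{dest: True})  # assumed default: technique enabled
    return parser


args = build_parser().parse_args(["--env", "LunarLander-v2", "--no_baseline",
                                  "--hidden_sizes", "128", "128", "64"])
print(args.use_baseline, args.hidden_sizes)  # False [128, 128, 64]
```

Writing both flags to the same destination means the last one given on the command line wins.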

## 7. Summary

This notebook demonstrated:
1. ✅ Environment exploration and random agent baselines
2. ✅ Policy gradient implementation with variance reduction techniques
3. ✅ Comparison of different configurations
4. ✅ Batch size study on both environments
5. ✅ Command-line interface for flexible training

**Key Takeaways:**
- Variance reduction is crucial for policy gradient methods
- Reward-to-go provides unbiased variance reduction
- Advantage normalization stabilizes training
- Value baseline offers state-dependent variance reduction
- Combined techniques provide best performance