# Lab Solutions: Deep RL with LunarLander

This notebook contains the complete lab code with expected results and additional insights.

---

In [None]:
!pip install stable-baselines3[extra] gymnasium[box2d] -q

In [None]:
import gymnasium as gym
import numpy as np
import matplotlib.pyplot as plt
from stable_baselines3 import PPO, A2C, DQN
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.callbacks import EvalCallback
import time

## Task 1 Solution: Exploring the Environment

**Expected output:**
- Observation space: Box(8,)
- Action space: Discrete(4)
- Random policy mean reward: ~-150 to -250 (crashes!)
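
To make the 8-dimensional observation and the 4 discrete actions easier to read while exploring, here is a small sketch that attaches names to them. The names follow the Gymnasium documentation for LunarLander (position, velocity, angle, angular velocity, and two leg-contact flags). The helper `describe_state` and the two constants are hypothetical names introduced for illustration; they are not part of the lab code.

```python
# Component names for the 8-dimensional LunarLander observation,
# in the order documented by Gymnasium.
OBSERVATION_NAMES = (
    "x_position", "y_position",
    "x_velocity", "y_velocity",
    "angle", "angular_velocity",
    "left_leg_contact", "right_leg_contact",
)

# Meaning of each of the 4 discrete actions.
ACTION_NAMES = {
    0: "do nothing",
    1: "fire left orientation engine",
    2: "fire main engine",
    3: "fire right orientation engine",
}


def describe_state(state):
    """Return a dict mapping each observation name to its float value."""
    if len(state) != len(OBSERVATION_NAMES):
        raise ValueError(
            f"expected {len(OBSERVATION_NAMES)} values, got {len(state)}"
        )
    return {name: float(value) for name, value in zip(OBSERVATION_NAMES, state)}


# Example with a made-up state vector (not real environment output).
example_state = [0.0, 1.4, 0.1, -0.5, 0.02, 0.0, 0.0, 0.0]
for name, value in describe_state(example_state).items():
    print(f"  {name:>18}: {value:+.3f}")
```

In the notebook, `describe_state(state)` can be called on the array returned by `env.reset(seed=42)` to see which entry is which.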

In [None]:
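# Note: Gymnasium 1.0+ registers this environment as 'LunarLander-v3'.
# Switch the ID (here and in later cells) if 'LunarLander-v2' raises a deprecation error.
env = gym.make('LunarLander-v2')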

print("Environment Info:")
print(f"  Observation space: {env.observation_space}")
print(f"  Action space: {env.action_space}")

state, _ = env.reset(seed=42)
print(f"\nInitial state: {state}")


def evaluate_random_policy(env, n_episodes=10):
    episode_rewards = []
    for episode in range(n_episodes):
        state, _ = env.reset()
        done = False
        total_reward = 0
        
        while not done:
            action = env.action_space.sample()
            state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            total_reward += reward
        
        episode_rewards.append(total_reward)
    return episode_rewards


random_rewards = evaluate_random_policy(env, n_episodes=10)
print(f"\nRandom Policy Performance:")
print(f"  Mean: {np.mean(random_rewards):.2f}")
print(f"  Std: {np.std(random_rewards):.2f}")
print(f"\n✓ Expected: Mean around -150 to -250 (random agents crash)")

## Task 2 Solution: Training PPO

**Expected results after 100k steps:**
- Mean reward: 180-240 (depends on random seed)
- Training time: ~3-5 minutes
- Should be close to solving (200+) or already solved

In [None]:
env = gym.make('LunarLander-v2')

model_ppo = PPO(
    'MlpPolicy',
    env,
    learning_rate=3e-4,
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    verbose=1,
    seed=42
)

print("Training PPO...\n")
start = time.time()
model_ppo.learn(total_timesteps=100_000)
train_time = time.time() - start

mean_reward, std_reward = evaluate_policy(model_ppo, env, n_eval_episodes=100)

print(f"\n{'='*50}")
print(f"PPO Results:")
print(f"  Training time: {train_time:.1f}s")
print(f"  Mean reward: {mean_reward:.2f} ± {std_reward:.2f}")
print(f"  Solved: {'✓' if mean_reward >= 200 else '✗'}")
print(f"{'='*50}")
print(f"\n✓ Expected: 180-240 mean reward")
print(f"  If <200: Train longer or try different seed")
print(f"  If >200: Great! PPO solved it!")
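
The PPO settings in this task determine how much data is collected and how many gradient steps are taken. The arithmetic below is a sketch assuming a single (non-vectorized) environment, as in this notebook, and assuming training always completes whole rollouts of `n_steps` transitions, which is why the step counter can end slightly above the requested 100,000. `ppo_update_budget` is a hypothetical helper, not a library function.

```python
import math


def ppo_update_budget(total_timesteps, n_steps, batch_size, n_epochs, n_envs=1):
    """Estimate rollouts and gradient steps for a PPO run."""
    rollout_size = n_steps * n_envs                     # transitions per rollout
    n_rollouts = math.ceil(total_timesteps / rollout_size)
    minibatches_per_epoch = rollout_size // batch_size  # assumes it divides evenly
    updates_per_rollout = minibatches_per_epoch * n_epochs
    return {
        "rollouts": n_rollouts,
        "timesteps_collected": n_rollouts * rollout_size,
        "minibatches_per_epoch": minibatches_per_epoch,
        "gradient_steps": n_rollouts * updates_per_rollout,
    }


budget = ppo_update_budget(total_timesteps=100_000, n_steps=2048,
                           batch_size=64, n_epochs=10)
for key, value in budget.items():
    print(f"{key}: {value}")
# rollouts: 49
# timesteps_collected: 100352
# minibatches_per_epoch: 32
# gradient_steps: 15680
```

Each rollout of 2048 transitions is split into 32 minibatches and revisited for 10 epochs, so every collected transition contributes to 10 gradient steps before it is discarded.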

## Task 3 Solution: Algorithm Comparison

**Expected performance rankings (after 100k steps):**
1. **PPO**: 180-240 (most reliable)
2. **DQN**: 150-220 (can be unstable)
3. **A2C**: 120-200 (faster but higher variance)
4. **Random**: -150 to -250 (baseline)

**Note:** Exact values depend on random seed, but relative ranking usually holds.

In [None]:
# Train A2C
print("Training A2C...\n")
env = gym.make('LunarLander-v2')
model_a2c = A2C('MlpPolicy', env, learning_rate=7e-4, verbose=1, seed=42)
model_a2c.learn(total_timesteps=100_000)
mean_a2c, std_a2c = evaluate_policy(model_a2c, env, n_eval_episodes=100)
print(f"A2C: {mean_a2c:.2f} ± {std_a2c:.2f}")

In [None]:
# Train DQN
print("\nTraining DQN...\n")
env = gym.make('LunarLander-v2')
model_dqn = DQN('MlpPolicy', env, learning_rate=1e-4, verbose=1, seed=42)
model_dqn.learn(total_timesteps=100_000)
mean_dqn, std_dqn = evaluate_policy(model_dqn, env, n_eval_episodes=100)
print(f"DQN: {mean_dqn:.2f} ± {std_dqn:.2f}")

In [None]:
# Comparison plot
algorithms = ['Random', 'PPO', 'A2C', 'DQN']
mean_rewards = [np.mean(random_rewards), mean_reward, mean_a2c, mean_dqn]
std_rewards = [np.std(random_rewards), std_reward, std_a2c, std_dqn]

plt.figure(figsize=(10, 6))
colors = ['red', 'blue', 'green', 'orange']
bars = plt.bar(algorithms, mean_rewards, yerr=std_rewards, 
               capsize=5, color=colors, alpha=0.7)
plt.axhline(y=200, color='black', linestyle='--', label='Solved (200)')
plt.ylabel('Mean Reward')
plt.title('Algorithm Comparison')
plt.legend()
plt.grid(True, alpha=0.3, axis='y')

for bar, mean in zip(bars, mean_rewards):
    plt.text(bar.get_x() + bar.get_width()/2., bar.get_height(),
             f'{mean:.0f}', ha='center', va='bottom')

plt.tight_layout()
plt.show()

print("\n✓ Typical observations:")
print("  - PPO usually performs best (180-240)")
print("  - A2C is faster to train but more variable (120-200)")
print("  - DQN needs careful tuning (150-220)")
print("  - All beat random (-200) by a large margin!")

## Task 4 Solution: Hyperparameter Experiments

**Expected results for learning rate experiments:**
- **LR=1e-4 (low)**: 100-160 (learns slowly, may not solve)
- **LR=3e-4 (default)**: 180-220 (good balance)
- **LR=1e-3 (high)**: 140-200 (can be unstable)

**Key insight:** Default learning rate (3e-4) is usually best!

In [None]:
def train_and_evaluate(learning_rate, timesteps=50_000):
    env = gym.make('LunarLander-v2')
    model = PPO('MlpPolicy', env, learning_rate=learning_rate, 
                verbose=0, seed=42)
    model.learn(total_timesteps=timesteps)
    mean, std = evaluate_policy(model, env, n_eval_episodes=50)
    return mean, std


learning_rates = [1e-4, 3e-4, 1e-3]
results = {}

print("Testing different learning rates...\n")
for lr in learning_rates:
    print(f"LR={lr:.0e}: ", end="")
    mean, std = train_and_evaluate(lr)
    results[lr] = (mean, std)
    print(f"{mean:.2f} ± {std:.2f}")

# Plot
plt.figure(figsize=(10, 6))
lrs = [f"{lr:.0e}" for lr in learning_rates]
means = [results[lr][0] for lr in learning_rates]
stds = [results[lr][1] for lr in learning_rates]

plt.bar(lrs, means, yerr=stds, capsize=5, alpha=0.7, color='steelblue')
plt.axhline(y=200, color='g', linestyle='--', label='Solved')
plt.xlabel('Learning Rate')
plt.ylabel('Mean Reward')
plt.title('Impact of Learning Rate (50k steps)')
plt.legend()
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

print("\n✓ Expected patterns:")
print("  - 1e-4: Too slow, doesn't reach 200 in 50k steps")
print("  - 3e-4: Sweet spot, approaches 200")
print("  - 1e-3: Can work but more variable")
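
Because every run in this experiment uses the same seed, a difference between two learning rates may partly reflect seed luck. One way to check is to repeat each setting over several seeds and compare the averages. The sketch below shows only the aggregation step: `sweep_over_seeds` is a hypothetical helper, and `fake_train_fn` is a stand-in that returns arbitrary placeholder numbers so the example runs instantly. It does not train anything and its outputs are not measured results.

```python
import statistics


def sweep_over_seeds(train_fn, learning_rates, seeds):
    """Run train_fn(lr, seed) for every combination and aggregate per learning rate.

    train_fn must return the mean evaluation reward of one training run.
    """
    summary = {}
    for lr in learning_rates:
        scores = [train_fn(lr, seed) for seed in seeds]
        summary[lr] = {
            "scores": scores,
            "mean": statistics.mean(scores),
            # Sample standard deviation across seeds (needs at least 2 seeds).
            "std": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        }
    return summary


# Stand-in for a real training run so the sketch executes quickly.
# The returned numbers are arbitrary placeholders, NOT measured results.
def fake_train_fn(learning_rate, seed):
    return 100.0 + seed


summary = sweep_over_seeds(fake_train_fn,
                           learning_rates=[1e-4, 3e-4, 1e-3],
                           seeds=[0, 1, 2])
for lr, stats in summary.items():
    print(f"LR={lr:.0e}: {stats['mean']:.1f} ± {stats['std']:.1f} "
          f"over {len(stats['scores'])} seeds")
```

To use this with the notebook's code, `train_and_evaluate` would need a `seed` argument passed through to `PPO(..., seed=seed)` in place of the hard-coded 42, and the function handed to `sweep_over_seeds` would return only the mean reward. Training time grows with the number of seeds, so three seeds at 50k steps is a reasonable starting point.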

## Task 5 Solution (Optional): Longer Training

**Expected results after 300k steps:**
- Mean reward: 230-270
- Much more consistent (lower std)
- Clearly solved!

**Training curve should show:**
- Initial rapid improvement (0-50k steps)
- Slower refinement (50k-150k steps)
- Plateau/small improvements (150k-300k steps)

In [None]:
eval_env = gym.make('LunarLander-v2')
eval_callback = EvalCallback(
    eval_env,
    best_model_save_path='./logs/',
    log_path='./logs/',
    eval_freq=5000,
    n_eval_episodes=20,
    deterministic=True
)

env = gym.make('LunarLander-v2')
model_long = PPO('MlpPolicy', env, verbose=1, seed=42)

print("Training for 300k steps...\n")
model_long.learn(total_timesteps=300_000, callback=eval_callback)

mean_long, std_long = evaluate_policy(model_long, env, n_eval_episodes=100)
print(f"\nFinal: {mean_long:.2f} ± {std_long:.2f}")
print(f"\n✓ Expected: 230-270 with low variance")
print(f"  Longer training → more consistent performance")
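
To check the training-curve shape described for this task, the evaluation history written by `EvalCallback` can be plotted. This sketch assumes the callback saved its results to `./logs/evaluations.npz` (the `log_path` used in this task) with arrays named `timesteps` and `results`, which is the layout Stable-Baselines3 documents. `load_eval_curve` and `plot_eval_curve` are hypothetical helper names.

```python
import os

import numpy as np


def load_eval_curve(npz_path):
    """Return (timesteps, mean_reward, std_reward) from an EvalCallback log."""
    data = np.load(npz_path)
    timesteps = data["timesteps"]   # shape: (n_evaluations,)
    results = data["results"]       # shape: (n_evaluations, n_eval_episodes)
    return timesteps, results.mean(axis=1), results.std(axis=1)


def plot_eval_curve(npz_path, solved_threshold=200):
    """Plot mean evaluation reward with a +/- 1 std band."""
    import matplotlib.pyplot as plt  # imported here so loading works without it

    timesteps, means, stds = load_eval_curve(npz_path)
    plt.figure(figsize=(10, 6))
    plt.plot(timesteps, means, label="Mean eval reward")
    plt.fill_between(timesteps, means - stds, means + stds, alpha=0.3)
    plt.axhline(y=solved_threshold, color="black", linestyle="--",
                label=f"Solved ({solved_threshold})")
    plt.xlabel("Timesteps")
    plt.ylabel("Mean Reward")
    plt.title("PPO evaluation curve")
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()


log_file = "./logs/evaluations.npz"
if os.path.exists(log_file):  # only plot once training has produced a log
    plot_eval_curve(log_file)
```

With `eval_freq=5000` and 300k training steps, the curve should contain roughly 60 evaluation points, which is enough to see the rapid early improvement and the later plateau.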

---

## Additional Insights

### Why PPO Usually Wins

1. **Clipped objective**: Prevents too-large policy updates
2. **Multiple epochs**: Reuses data efficiently
3. **Robust defaults**: Works well out-of-the-box
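
The first point can be made concrete with PPO's per-sample clipped surrogate, `min(r * A, clip(r, 1 - eps, 1 + eps) * A)`, where `r` is the ratio of new to old action probability and `A` is the advantage estimate. The sketch below assumes `eps = 0.2` (the Stable-Baselines3 default for `clip_range`) and uses made-up inputs. `clipped_surrogate` is a hypothetical helper for illustration, not library code.

```python
def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """Per-sample PPO objective: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: the gain from raising the action's probability is capped at r = 1.2.
print(clipped_surrogate(ratio=1.5, advantage=1.0))   # 1.2
# Negative advantage: the gain from lowering the probability is capped at r = 0.8.
print(clipped_surrogate(ratio=0.5, advantage=-1.0))  # -0.8
# Moving in the harmful direction is not clipped, so it is fully penalized.
print(clipped_surrogate(ratio=1.5, advantage=-1.0))  # -1.5
```

Once the ratio leaves the `[0.8, 1.2]` band in the beneficial direction, the objective stops improving, so the optimizer has no incentive to push the policy further in a single update.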

### When to Use Each Algorithm

- **PPO**: Default choice, most reliable
- **A2C**: When you need speed and can tolerate variance
- **DQN**: Discrete actions, when you have good hyperparameters

### Common Issues & Solutions

**Problem: Agent doesn't learn (stuck at -200)**
- Check learning rate (try 3e-4)
- Train longer (100k → 300k steps)
- Try different random seed

**Problem: High variance in performance**
- Increase batch size (64 → 128)
- Use more evaluation episodes (20 → 100)
- Train longer for stability

**Problem: Training is too slow**
- Reduce timesteps for experiments (100k → 50k)
- Use fewer evaluation episodes
- Try A2C (faster but less stable)
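
The advice on evaluation episodes involves a trade-off that can be quantified: the uncertainty of a mean reward estimate shrinks with the square root of the number of episodes. A sketch, assuming episode returns are roughly independent and using an illustrative per-episode standard deviation of 50 (a made-up value, not a measured result):

```python
import math


def standard_error(std_reward, n_episodes):
    """Standard error of the mean reward over n_episodes evaluation episodes."""
    return std_reward / math.sqrt(n_episodes)


for n in (20, 100):
    se = standard_error(std_reward=50.0, n_episodes=n)
    print(f"{n:>3} episodes: mean reward is known to about ±{se:.1f}")
#  20 episodes: mean reward is known to about ±11.2
# 100 episodes: mean reward is known to about ±5.0
```

Going from 20 to 100 episodes narrows the estimate by a factor of about 2.2 at five times the evaluation cost. It makes the reported mean more trustworthy but does not make the policy itself any less variable.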

### Hyperparameter Cheat Sheet

**Conservative (stable but slow):**
```python
PPO(..., learning_rate=1e-4, n_steps=2048, batch_size=128)
```

**Balanced (recommended):**
```python
PPO(..., learning_rate=3e-4, n_steps=2048, batch_size=64)
```

**Aggressive (fast but risky):**
```python
PPO(..., learning_rate=1e-3, n_steps=1024, batch_size=32)
```

---

## Summary of Expected Results

| Task | Algorithm | Steps | Expected Reward |
|------|-----------|-------|----------------|
| 1 | Random | - | -150 to -250 |
| 2 | PPO | 100k | 180 to 240 |
| 3 | A2C | 100k | 120 to 200 |
| 3 | DQN | 100k | 150 to 220 |
| 4 | PPO (LR=1e-4) | 50k | 100 to 160 |
| 4 | PPO (LR=3e-4) | 50k | 180 to 220 |
| 4 | PPO (LR=1e-3) | 50k | 140 to 200 |
| 5 | PPO | 300k | 230 to 270 |

**Remember:** These are typical ranges. Your exact results will vary based on:
- Random seed
- Hardware (CPU vs GPU)
- Library versions

As long as you're in the ballpark, you're doing fine! 🚀