# LunarLander-v3 Reinforcement Learning Comparison
## DQN vs DDQN vs PER

**Author:** Ethan Hulme  
**Course:** CIS2719 - Foundations of Robotics & AI (Coursework 2)  
**Date:** January 2026

---

## Overview

This notebook implements and compares three value-based deep reinforcement learning algorithms on the LunarLander-v3 environment from Gymnasium (the Farama Foundation's maintained fork of OpenAI Gym):

1. **DQN (Deep Q-Network)** - Standard value-based RL using experience replay and target networks
2. **DDQN (Double DQN)** - Addresses overestimation bias by decoupling action selection from evaluation
3. **PER (Prioritized Experience Replay)** - Samples transitions based on TD-error magnitude combined with DDQN targets

This notebook contains:
- Complete implementation code
- Results from 10 experimental runs (600 to 10,000 episodes)
- Learning curves and performance visualizations
- GIF animations showing trained agent behavior

## 1. Setup and Installation

First, we'll install all required dependencies. This includes:
- `gymnasium[box2d]` - The LunarLander environment
- `torch` - Deep learning framework
- `numpy` - Numerical computations
- `matplotlib` - Plotting
- `imageio` - GIF generation and display

In [None]:
# Install required packages
!pip install "gymnasium[box2d]" torch numpy matplotlib imageio imageio-ffmpeg pillow -q

# Import libraries
import os
import math
import random
import collections
from datetime import datetime
from IPython.display import Image, display
import base64

import numpy as np
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torch.optim as optim

print("✓ All packages installed successfully!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

## 2. Clone GitHub Repository (Optional)

If you want to access the pre-trained results and GIFs from GitHub, you can clone the repository here. Otherwise, we'll train from scratch.

In [None]:
# Clone the GitHub repository to access all test results
# Replace with your actual repository URL
!git clone https://github.com/humm3ll/LunarLander-v3-RL.git
%cd LunarLander-v3-RL

# List the contents
!ls -la

## 3. Complete Implementation Code

Below is the complete implementation of all three algorithms (DQN, DDQN, and PER) including:
- Q-Network architecture
- Replay buffers (standard and prioritized)
- Agent class with training logic
- Utility functions for environment creation, plotting, and GIF generation

### 3.1 Environment Setup and Q-Network

In [None]:
import gymnasium as gym
import imageio

# -----------------------------------------------------------
#  Utility: make environment with proper seeding
# -----------------------------------------------------------
def make_env(env_name: str, seed: int = 42, render_mode=None):
    """
    Helper to create and seed a Gymnasium environment.
    render_mode=None for fast training,
    render_mode='rgb_array' when generating a GIF.
    """
    env = gym.make(env_name, render_mode=render_mode)
    # reset returns (obs, info) in gymnasium
    env.reset(seed=seed)
    env.action_space.seed(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.manual_seed(seed)
    return env


# -----------------------------------------------------------
#  Q-network used by all algorithms
# -----------------------------------------------------------
class QNetwork(nn.Module):
    """
    Simple 2-hidden-layer MLP for approximating Q(s,a).
    Architecture kept small for stability.
    """

    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

print("✓ Environment and Q-Network defined")

### 3.2 Replay Buffers (Standard and Prioritized)

In [None]:
# -----------------------------------------------------------
#  Replay buffer (vanilla)
# -----------------------------------------------------------
class ReplayBuffer:
    """Standard replay buffer."""

    def __init__(self, capacity: int = 100_000):
        self.buffer = collections.deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return (
            np.array(states, dtype=np.float32),
            np.array(actions, dtype=np.int64),
            np.array(rewards, dtype=np.float32),
            np.array(next_states, dtype=np.float32),
            np.array(dones, dtype=np.float32),
        )

    def __len__(self):
        return len(self.buffer)


# -----------------------------------------------------------
#  Prioritised Replay Buffer (simple proportional PER)
# -----------------------------------------------------------
class PrioritizedReplayBuffer:
    """
    PER with proportional priorities.
    Each transition gets priority p_i; sampling probability is p_i^alpha / sum p_j^alpha.
    """

    def __init__(self, capacity: int = 100_000, alpha: float = 0.6):
        self.capacity = capacity
        self.alpha = alpha

        self.buffer = []
        self.priorities = np.zeros((capacity,), dtype=np.float32)
        self.pos = 0

    def __len__(self):
        return len(self.buffer)

    def push(self, state, action, reward, next_state, done):
        max_prio = self.priorities.max() if self.buffer else 1.0

        if len(self.buffer) < self.capacity:
            self.buffer.append((state, action, reward, next_state, done))
        else:
            self.buffer[self.pos] = (state, action, reward, next_state, done)

        self.priorities[self.pos] = max_prio
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size: int, beta: float = 0.4):
        if len(self.buffer) == self.capacity:
            prios = self.priorities
        else:
            prios = self.priorities[: self.pos]

        # convert priorities into a probability distribution
        probs = prios ** self.alpha
        probs /= probs.sum()

        indices = np.random.choice(len(self.buffer), batch_size, p=probs)
        samples = [self.buffer[idx] for idx in indices]
        states, actions, rewards, next_states, dones = zip(*samples)

        # importance-sampling weights (to correct the bias introduced by PER)
        total = len(self.buffer)
        weights = (total * probs[indices]) ** (-beta)
        weights /= weights.max()  # normalise for numerical stability

        return (
            np.array(states, dtype=np.float32),
            np.array(actions, dtype=np.int64),
            np.array(rewards, dtype=np.float32),
            np.array(next_states, dtype=np.float32),
            np.array(dones, dtype=np.float32),
            indices,
            np.array(weights, dtype=np.float32),
        )

    def update_priorities(self, indices, new_priorities):
        # Priorities must stay strictly positive; the caller is expected to
        # add a small epsilon before passing them in.
        for idx, prio in zip(indices, new_priorities):
            self.priorities[idx] = float(prio)

print("✓ Replay buffers defined")

### 3.3 DQN Agent (supports DQN, DDQN, and PER)

In [None]:
# -----------------------------------------------------------
#  DQN / DDQN / PER Agent
# -----------------------------------------------------------
class DQNAgent:
    """
    Generic agent that can run:
      - 'dqn'    : standard DQN
      - 'ddqn'   : Double DQN (separate action selection and evaluation)
      - 'per'    : Prioritised Replay + Double-DQN targets

    The internal logic is the same, but the replay buffer and TD-target change.
    """

    def __init__(
        self,
        state_dim: int,
        action_dim: int,
        algo_type: str = "dqn",
        gamma: float = 0.99,
        lr: float = 1e-3,
        batch_size: int = 64,
        buffer_capacity: int = 100_000,
        min_buffer_size: int = 10_000,
        target_update_freq: int = 1_000,
        device: str | None = None,
    ):
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.gamma = gamma
        self.batch_size = batch_size
        self.min_buffer_size = min_buffer_size
        self.target_update_freq = target_update_freq
        self.algo_type = algo_type.lower()
        self.device = device or ("cuda" if torch.cuda.is_available() else "cpu")

        # Q-networks
        self.q_net = QNetwork(state_dim, action_dim).to(self.device)
        self.target_net = QNetwork(state_dim, action_dim).to(self.device)
        self.target_net.load_state_dict(self.q_net.state_dict())
        self.target_net.eval()

        self.optimizer = optim.Adam(self.q_net.parameters(), lr=lr)

        # Loss: SmoothL1 (Huber) is standard for DQN
        # For PER we keep per-sample loss (reduction="none") so we can weight it.
        self.loss_fn = nn.SmoothL1Loss(
            reduction="none" if self.algo_type == "per" else "mean"
        )

        # Replay buffer
        if self.algo_type == "per":
            self.buffer = PrioritizedReplayBuffer(capacity=buffer_capacity)
            # parameters for annealing importance-sampling exponent beta
            self.beta_start = 0.4
            self.beta_frames = 200_000
        else:
            self.buffer = ReplayBuffer(capacity=buffer_capacity)

        # epsilon-greedy exploration schedule
        self.eps_start = 1.0
        self.eps_end = 0.05
        self.eps_decay = 250_000  # in frames
        self.frame_idx = 0

        self.training_steps = 0

    # ---------- exploration schedule ----------
    def epsilon(self) -> float:
        # Exponential decay: starts near 1, approaches eps_end as frame_idx grows.
        return self.eps_end + (self.eps_start - self.eps_end) * math.exp(
            -1.0 * self.frame_idx / self.eps_decay
        )

    # ---------- action selection ----------
    def select_action(self, state: np.ndarray) -> int:
        """Epsilon-greedy choice used during training."""
        self.frame_idx += 1
        if random.random() < self.epsilon():
            return random.randrange(self.action_dim)

        state_t = torch.tensor(
            state, dtype=torch.float32, device=self.device
        ).unsqueeze(0)
        with torch.no_grad():
            q_values = self.q_net(state_t)
        return int(q_values.argmax(dim=1).item())

    def greedy_action(self, state: np.ndarray) -> int:
        """Purely greedy action (used during evaluation / GIF generation)."""
        state_t = torch.tensor(
            state, dtype=torch.float32, device=self.device
        ).unsqueeze(0)
        with torch.no_grad():
            q_values = self.q_net(state_t)
        return int(q_values.argmax(dim=1).item())

    # ---------- replay interaction ----------
    def push(self, *transition):
        self.buffer.push(*transition)

    def can_update(self) -> bool:
        return len(self.buffer) >= self.min_buffer_size

    # ---------- TD-target computation ----------
    def compute_td_target(
        self, rewards: torch.Tensor, next_states: torch.Tensor, dones: torch.Tensor
    ) -> torch.Tensor:
        """
        Shared logic for computing TD targets.

        DQN:   max_a' Q_target(s', a')
        DDQN:  Q_target(s', argmax_a' Q_online(s', a'))
        PER:   uses DDQN-style target (usually more stable).
        """
        with torch.no_grad():
            next_q_target = self.target_net(next_states)  # [batch, actions]

            if self.algo_type in ("ddqn", "per"):
                # Double DQN: choose action via online net, evaluate via target net
                next_q_online = self.q_net(next_states)
                next_actions = next_q_online.argmax(dim=1, keepdim=True)
                next_q = next_q_target.gather(1, next_actions).squeeze(1)
            else:
                # Plain DQN: directly take the max over target network
                next_q, _ = next_q_target.max(dim=1)

            td_target = rewards + self.gamma * (1.0 - dones) * next_q
        return td_target

    # ---------- single gradient step ----------
    def update(self):
        if not self.can_update():
            return None

        self.training_steps += 1

        if self.algo_type == "per":
            # Anneal beta from beta_start -> 1.0 over beta_frames updates
            beta = min(
                1.0,
                self.beta_start
                + self.training_steps * (1.0 - self.beta_start) / self.beta_frames,
            )
            (
                states,
                actions,
                rewards,
                next_states,
                dones,
                indices,
                weights,
            ) = self.buffer.sample(self.batch_size, beta)
            weights = torch.tensor(
                weights, dtype=torch.float32, device=self.device
            )  # [batch]
        else:
            states, actions, rewards, next_states, dones = self.buffer.sample(
                self.batch_size
            )
            weights = torch.ones(self.batch_size, device=self.device)

        states = torch.tensor(states, dtype=torch.float32, device=self.device)
        actions = torch.tensor(actions, dtype=torch.int64, device=self.device).unsqueeze(
            1
        )
        rewards = torch.tensor(rewards, dtype=torch.float32, device=self.device)
        next_states = torch.tensor(
            next_states, dtype=torch.float32, device=self.device
        )
        dones = torch.tensor(dones, dtype=torch.float32, device=self.device)

        # Q(s,a) for actions taken
        q_values = self.q_net(states).gather(1, actions).squeeze(1)

        # TD target (depends on algo_type)
        td_target = self.compute_td_target(rewards, next_states, dones)

        # Per-sample loss for PER, mean loss otherwise
        loss_tensor = self.loss_fn(q_values, td_target)
        loss = (loss_tensor * weights).mean()

        self.optimizer.zero_grad()
        loss.backward()
        # Optional but often helps stability
        nn.utils.clip_grad_norm_(self.q_net.parameters(), max_norm=10.0)
        self.optimizer.step()

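        # Update priorities from the absolute TD error |Q(s,a) - target|
        # (the Huber loss value is not the same quantity as the TD error).
        if self.algo_type == "per":
            td_errors = (q_values - td_target).abs()
            new_priorities = td_errors.detach().cpu().numpy() + 1e-6
            self.buffer.update_priorities(indices, new_priorities)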

        # Hard update: copy online weights to the target network every N steps
        if self.training_steps % self.target_update_freq == 0:
            self.target_net.load_state_dict(self.q_net.state_dict())

        return float(loss.item())

print("✓ DQN Agent defined")

### 3.4 Training and Visualization Functions

In [None]:
# -----------------------------------------------------------
#  Training utilities
# -----------------------------------------------------------
def train_agent(
    env_name: str,
    algo_type: str,
    num_episodes: int = 600,
    seed: int = 42,
) -> tuple[DQNAgent, list[float]]:
    """
    Train a single agent on LunarLander and return the trained
    agent plus a list of episode returns.
    """

    env = make_env(env_name, seed=seed, render_mode=None)
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n

    agent = DQNAgent(
        state_dim=state_dim,
        action_dim=action_dim,
        algo_type=algo_type,
        gamma=0.99,
        lr=1e-3,
        batch_size=64,
        buffer_capacity=100_000,
        min_buffer_size=10_000,
        target_update_freq=1_000,
    )

    episode_rewards: list[float] = []

    print(f"\n=== Training {algo_type.upper()} on {env_name} for {num_episodes} episodes ===")
    for episode in range(1, num_episodes + 1):
        state, _ = env.reset()
        done = False
        total_reward = 0.0

        while not done:
            action = agent.select_action(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

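            # Store transition. Only true termination should cut off bootstrapping;
            # a time-limit truncation still has a meaningful next-state value.
            agent.push(state, action, reward, next_state, float(terminated))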

            state = next_state
            total_reward += reward

            # Perform gradient step when enough samples available
            if agent.can_update():
                agent.update()

        episode_rewards.append(total_reward)

        if episode % 20 == 0:
            last_mean = np.mean(episode_rewards[-20:])
            print(
                f"Episode {episode:4d} | "
                f"avg reward (last 20): {last_mean:7.2f} | "
                f"epsilon: {agent.epsilon():.3f}"
            )

    env.close()
    return agent, episode_rewards


def plot_learning_curves(results: dict, save_path: str = "learning_curves.png"):
    """
    Plot episode reward curves for each algorithm.
    `results` is a dict: algo_name -> list_of_episode_rewards
    """
    plt.figure(figsize=(10, 6))
    for label, rewards in results.items():
        rewards = np.array(rewards)
        # moving average for smoothing (window 20)
        window = 20
        if len(rewards) >= window:
            smooth = np.convolve(rewards, np.ones(window) / window, mode="valid")
            plt.plot(
                range(window, len(rewards) + 1),
                smooth,
                label=f"{label} (moving avg)",
            )
        plt.plot(np.arange(1, len(rewards) + 1), rewards, alpha=0.25, linestyle="--")

    plt.axhline(200, color="grey", linestyle=":", label="Solved threshold (≈200)")
    plt.xlabel("Episode")
    plt.ylabel("Episodic Return")
    plt.title("LunarLander-v3: DQN vs DDQN vs PER")
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig(save_path)
    print(f"Saved learning curves to {save_path}")


def generate_gif(
    agent: DQNAgent,
    env_name: str,
    filename: str,
    seed: int = 0,
    episodes: int = 3,
):
    """
    Roll out the learned policy (greedy) and record a GIF.
    """
    env = make_env(env_name, seed=seed, render_mode="rgb_array")
    frames = []

    for ep in range(episodes):
        state, _ = env.reset()
        done = False
        step = 0
        while not done:
            frame = env.render()  # rgb_array
            frames.append(frame)

            action = agent.greedy_action(state)
            next_state, _, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            state = next_state
            step += 1

    env.close()
    imageio.mimsave(filename, frames, fps=30)
    print(f"Saved GIF for policy to {filename}")

print("✓ Training and visualization functions defined")

## 4. Training Configuration

### Hyperparameters

All experiments used the following hyperparameters:

| Parameter | Value |
|-----------|-------|
| Learning rate | 1e-3 |
| Discount factor (γ) | 0.99 |
| Batch size | 64 |
| Replay buffer capacity | 100,000 |
| Min buffer size before training | 10,000 |
| Target network update frequency | 1,000 steps |
| Epsilon start | 1.0 |
| Epsilon end | 0.05 |
| Epsilon decay | Exponential, time constant 250,000 frames |
| Loss function | Smooth L1 (Huber) |
| Optimizer | Adam |
| Gradient clipping | Max norm 10.0 |
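
To make the exploration schedule in the table concrete, the sketch below evaluates the exponential decay at a few frame counts. It assumes the same formula as the agent's `epsilon()` method; the helper name `epsilon_at` is illustrative only and does not appear in the implementation.

```python
import math

# Values taken from the hyperparameter table above.
EPS_START, EPS_END, EPS_DECAY = 1.0, 0.05, 250_000


def epsilon_at(frame_idx: int) -> float:
    """Exploration rate after `frame_idx` environment steps (exponential decay)."""
    return EPS_END + (EPS_START - EPS_END) * math.exp(-frame_idx / EPS_DECAY)


for frames in (0, 50_000, 250_000, 1_000_000):
    print(f"{frames:>9,d} frames -> epsilon = {epsilon_at(frames):.3f}")
```

Because 250,000 is a time constant rather than an end point, epsilon is still roughly 0.40 after 250,000 frames and only approaches 0.05 asymptotically.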

**PER-Specific:**
- Alpha (priority exponent): 0.6
- Beta (importance sampling): 0.4 → 1.0 (annealed linearly over 200,000 gradient updates)
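
The roles of alpha and beta can be illustrated with a small standard-library sketch. It mirrors the formulas used by the prioritized buffer, but it computes weights over every stored item rather than over a sampled batch, and the helper names (`beta_at`, `sampling_probs`, `is_weights`) and the toy priorities are assumptions made for illustration.

```python
ALPHA, BETA_START, BETA_STEPS = 0.6, 0.4, 200_000  # values listed above


def beta_at(update_step: int) -> float:
    """Linearly anneal beta from BETA_START to 1.0 over BETA_STEPS updates."""
    return min(1.0, BETA_START + update_step * (1.0 - BETA_START) / BETA_STEPS)


def sampling_probs(priorities: list[float]) -> list[float]:
    """P(i) = p_i^alpha / sum_j p_j^alpha."""
    scaled = [p ** ALPHA for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


def is_weights(probs: list[float], beta: float) -> list[float]:
    """w_i = (N * P(i))^(-beta), normalised by the largest weight."""
    n = len(probs)
    raw = [(n * p) ** (-beta) for p in probs]
    largest = max(raw)
    return [w / largest for w in raw]


priorities = [0.1, 1.0, 4.0]  # toy |TD error| values (made up)
probs = sampling_probs(priorities)
weights = is_weights(probs, beta_at(0))
print([round(p, 3) for p in probs])
print([round(w, 3) for w in weights])
```

High-priority transitions are sampled more often but receive smaller weights, and as beta approaches 1.0 the weights compensate more fully for the non-uniform sampling.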

### Experimental Runs

The project includes results from 10 experimental runs:
- **Tests 1-5:** 600 episodes each
- **Tests 6-8:** 2,000 episodes each
- **Test 9:** 5,000 episodes
- **Test 10:** 10,000 episodes

## 5. Run Training (Optional)

**Note:** If you cloned the repository, all training results are already available. You can skip this section and go directly to the results visualization.

If you want to train from scratch, run the cells below. **Warning:** Training takes several hours depending on the number of episodes.

In [None]:
# Training example - 600 episodes
# Uncomment and run if you want to train from scratch

# ENV_NAME = "LunarLander-v3"
# NUM_EPISODES = 600

# results = {}
# trained_agents = {}

# # Train all three algorithms
# for algo in ["dqn", "ddqn", "per"]:
#     agent, rewards = train_agent(ENV_NAME, algo_type=algo, num_episodes=NUM_EPISODES)
#     results[algo.upper()] = rewards
#     trained_agents[algo.upper()] = agent

#     print(f"{algo.upper()} final mean reward over last 50 episodes: {np.mean(rewards[-50:]):.2f}")

# # Generate plots and GIFs
# plot_learning_curves(results, save_path="learning_curves.png")
# generate_gif(trained_agents["DQN"], ENV_NAME, "dqn_agent.gif")
# generate_gif(trained_agents["DDQN"], ENV_NAME, "ddqn_agent.gif")
# generate_gif(trained_agents["PER"], ENV_NAME, "per_agent.gif")

print("Training section ready (commented out). Uncomment to train from scratch.")

## 6. Display Pre-trained Results

Below we'll display the results from all 10 experimental runs, including learning curves and agent behavior GIFs.

### 6.1 Helper Function to Display Images and GIFs

In [None]:
from IPython.display import Image as IPImage, display, HTML

def display_image_from_file(filepath, width=800):
    """Display an image file in the notebook"""
    if os.path.exists(filepath):
        display(IPImage(filename=filepath, width=width))
    else:
        print(f"❌ File not found: {filepath}")

def display_gif_from_file(filepath, width=600):
    """Display a GIF file in the notebook"""
    if os.path.exists(filepath):
        with open(filepath, 'rb') as f:
            gif_data = f.read()
        gif_b64 = base64.b64encode(gif_data).decode('ascii')
        display(HTML(f'<img src="data:image/gif;base64,{gif_b64}" width="{width}"/>'))
    else:
        print(f"❌ File not found: {filepath}")

def list_test_folders(base_path="Test Results"):
    """List all test result folders"""
    if os.path.exists(base_path):
        folders = [f for f in os.listdir(base_path) if os.path.isdir(os.path.join(base_path, f))]
        return sorted(folders)
    else:
        print(f"❌ Directory not found: {base_path}")
        return []

print("✓ Display helper functions defined")

### 6.2 Test 1 (600 Episodes)

In [None]:
test_path = "Test Results/Test 1 (600 Eps)"

print("=" * 60)
print("TEST 1 - 600 EPISODES")
print("=" * 60)

# Display learning curve
print("\n📊 Learning Curves:")
display_image_from_file(f"{test_path}/learning_curves.png")

# Display agent GIFs
print("\n🎮 DQN Agent Behavior:")
display_gif_from_file(f"{test_path}/dqn_agent.gif")

print("\n🎮 DDQN Agent Behavior:")
display_gif_from_file(f"{test_path}/ddqn_agent.gif")

print("\n🎮 PER Agent Behavior:")
display_gif_from_file(f"{test_path}/per_agent.gif")

### 6.3 Test 6 (2000 Episodes)

In [None]:
test_path = "Test Results/Test 6 (2000 Eps)"

print("=" * 60)
print("TEST 6 - 2000 EPISODES")
print("=" * 60)

# Display learning curve
print("\n📊 Learning Curves:")
display_image_from_file(f"{test_path}/learning_curves.png")

# Display agent GIFs
print("\n🎮 DQN Agent Behavior:")
display_gif_from_file(f"{test_path}/dqn_agent.gif")

print("\n🎮 DDQN Agent Behavior:")
display_gif_from_file(f"{test_path}/ddqn_agent.gif")

print("\n🎮 PER Agent Behavior:")
display_gif_from_file(f"{test_path}/per_agent.gif")

### 6.4 Test 10 (10000 Episodes) - Best Performance

In [None]:
test_path = "Test Results/Test 10 (10000 Eps)"

print("=" * 60)
print("TEST 10 - 10000 EPISODES (MAXIMUM TRAINING)")
print("=" * 60)

# Display learning curve
print("\n📊 Learning Curves:")
display_image_from_file(f"{test_path}/learning_curves.png")

# Display agent GIFs
print("\n🎮 DQN Agent Behavior:")
display_gif_from_file(f"{test_path}/dqn_agent.gif")

print("\n🎮 DDQN Agent Behavior:")
display_gif_from_file(f"{test_path}/ddqn_agent.gif")

print("\n🎮 PER Agent Behavior:")
display_gif_from_file(f"{test_path}/per_agent.gif")

### 6.5 View All Other Test Results

In [None]:
# Display all test results dynamically
test_folders = list_test_folders("Test Results")
print(f"Found {len(test_folders)} test result folders:")
for folder in test_folders:
    print(f"  - {folder}")

# Display results from remaining tests (Tests 2-5, 7-9)
remaining_tests = [
    "Test 2 (600 Eps)",
    "Test 3 (600 Eps)",
    "Test 4 (600 Eps)",
    "Test 5 (600 Eps)",
    "Test 7 (2000 Eps)",
    "Test 8 (2000 Eps)",
    "Test 9 (5000 Eps)"
]

for test_name in remaining_tests:
    test_path = f"Test Results/{test_name}"
    if os.path.exists(test_path):
        print("\n" + "=" * 60)
        print(f"{test_name.upper()}")
        print("=" * 60)

        # Display learning curve
        print("\n📊 Learning Curves:")
        display_image_from_file(f"{test_path}/learning_curves.png")

        # Display agent GIFs (optionally show only learning curves to save space)
        # Uncomment below to show all GIFs for each test
        # print("\n🎮 DQN Agent:")
        # display_gif_from_file(f"{test_path}/dqn_agent.gif", width=400)
        # print("\n🎮 DDQN Agent:")
        # display_gif_from_file(f"{test_path}/ddqn_agent.gif", width=400)
        # print("\n🎮 PER Agent:")
        # display_gif_from_file(f"{test_path}/per_agent.gif", width=400)

## 7. Analysis and Discussion

### Algorithm Comparison

**DQN (Deep Q-Network)**
- Standard value-based RL with experience replay
- Uses max Q-value from target network for TD target
- Baseline for comparison

**DDQN (Double DQN)**
- Addresses overestimation bias in DQN
- Decouples action selection (online network) from action evaluation (target network)
- Generally more stable and achieves better performance

**PER (Prioritized Experience Replay)**
- Samples important transitions more frequently based on TD-error
- Uses importance sampling weights to correct for bias
- Can achieve faster learning and better sample efficiency
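
The difference between the DQN and Double DQN targets described above can be shown on a single toy transition. This is a plain-Python sketch with made-up Q-values, not the tensor code used by the agent; the function names `dqn_target` and `ddqn_target` are illustrative.

```python
GAMMA = 0.99


def dqn_target(reward, next_q_target, done):
    """Plain DQN: bootstrap from the target network's own maximum."""
    return reward + GAMMA * (1.0 - done) * max(next_q_target)


def ddqn_target(reward, next_q_online, next_q_target, done):
    """Double DQN: the online net picks the action, the target net scores it."""
    best_action = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    return reward + GAMMA * (1.0 - done) * next_q_target[best_action]


# Toy Q-values for one next state with four actions (made-up numbers).
q_online = [1.0, 2.5, 2.0, 0.5]  # online network prefers action 1
q_target = [1.2, 1.8, 3.0, 0.4]  # target network's largest estimate is action 2

print(dqn_target(1.0, q_target, done=0.0))             # bootstraps from 3.0
print(ddqn_target(1.0, q_online, q_target, done=0.0))  # bootstraps from 1.8
```

Since the Double DQN target evaluates one specific action under the target network, it can never exceed the plain DQN target for the same transition, which is how it dampens overestimation.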

### Key Observations

1. **Training Duration Impact:** Agents trained for 10,000 episodes show significantly more stable and higher performance than those trained for 600 episodes.

2. **Algorithm Performance:** DDQN typically shows improved stability over vanilla DQN, while PER can accelerate learning by focusing on informative transitions.

3. **Convergence:** The learning curves show that all three algorithms can solve the LunarLander task (achieving >200 reward) with sufficient training time.

4. **Variance:** Early training shows high variance in rewards, which decreases as the policy improves.
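
One way to quantify the variance observation is to compare the spread of returns across consecutive blocks of episodes. The sketch below is a minimal illustration on made-up returns; `window_stats` is a hypothetical helper and is not part of the notebook's code.

```python
import statistics


def window_stats(rewards, window=20):
    """Mean and population standard deviation for each full block of `window` episodes."""
    blocks = [
        rewards[i:i + window]
        for i in range(0, len(rewards) - window + 1, window)
    ]
    return [(statistics.fmean(b), statistics.pstdev(b)) for b in blocks]


# Made-up returns: an erratic early block followed by a steadier late block.
early = [-300.0, 100.0] * 10
late = [220.0, 260.0] * 10
for mean, spread in window_stats(early + late):
    print(f"mean = {mean:7.1f}, std = {spread:6.1f}")
```

Applied to real episode returns, a falling standard deviation alongside a rising mean would support the observation; the toy numbers here only demonstrate the calculation.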

## 8. Environment Information

### LunarLander-v3 Details

**State Space (8 dimensions):**
1. x position
2. y position
3. x velocity
4. y velocity
5. angle
6. angular velocity
7. left leg contact (boolean)
8. right leg contact (boolean)

**Action Space (4 discrete actions):**
- 0: Do nothing
- 1: Fire left engine
- 2: Fire main engine
- 3: Fire right engine
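
For readability when debugging, the observation vector and action indices listed above can be given names. The following is a small standard-library sketch; `LanderObs`, `ACTION_NAMES`, and `decode_obs` are hypothetical names introduced here, and the sample observation is made up.

```python
from typing import NamedTuple, Sequence


class LanderObs(NamedTuple):
    """Named view of the 8-dimensional observation, in the order listed above."""
    x: float
    y: float
    vx: float
    vy: float
    angle: float
    angular_velocity: float
    left_leg_contact: bool
    right_leg_contact: bool


ACTION_NAMES = {
    0: "do nothing",
    1: "fire left engine",
    2: "fire main engine",
    3: "fire right engine",
}


def decode_obs(obs: Sequence[float]) -> LanderObs:
    """Convert a raw observation vector into a named tuple."""
    if len(obs) != 8:
        raise ValueError(f"expected 8 values, got {len(obs)}")
    *continuous, left, right = obs
    return LanderObs(*map(float, continuous), bool(left), bool(right))


sample = [0.0, 1.4, 0.1, -0.5, 0.02, 0.0, 0.0, 1.0]  # made-up observation
decoded = decode_obs(sample)
print(decoded.vy, decoded.right_leg_contact, ACTION_NAMES[2])
```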

**Reward Structure:**
- Moving towards landing pad: positive reward
- Moving away from landing pad: negative reward
- Crashing: -100
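- Landing safely (coming to rest): +100, on top of roughly 100 to 140 points of shaping reward for descending from the top of the screen to the pad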
- Each leg contact: +10
- Firing main engine: -0.3 per frame
- Firing side engine: -0.03 per frame

**Success Criterion:**
- Average reward ≥ 200 over 100 consecutive episodes
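
The success criterion can be checked against a list of episode returns with a trailing-window mean. This is a minimal sketch on made-up data; `first_solved_episode` is a hypothetical helper, not a function from the notebook.

```python
from collections import deque
from typing import Iterable, Optional


def first_solved_episode(
    rewards: Iterable[float], threshold: float = 200.0, window: int = 100
) -> Optional[int]:
    """Return the 1-based episode at which the trailing `window`-episode
    mean first reaches `threshold`, or None if it never does."""
    recent = deque(maxlen=window)
    running_sum = 0.0
    for episode, reward in enumerate(rewards, start=1):
        if len(recent) == window:
            running_sum -= recent[0]  # value about to be evicted by append
        recent.append(reward)
        running_sum += reward
        if len(recent) == window and running_sum / window >= threshold:
            return episode
    return None


# Made-up returns: 150 poor episodes followed by 150 good ones.
toy_returns = [-100.0] * 150 + [250.0] * 150
print(first_solved_episode(toy_returns))
```

The same function could be applied to the `episode_rewards` list returned by `train_agent` to report when, if ever, a run meets the criterion.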

## 9. References

1. **Mnih, V., et al. (2015).** *Human-level control through deep reinforcement learning.* Nature, 518(7540), 529-533.
   - Original DQN paper introducing experience replay and target networks

2. **Van Hasselt, H., Guez, A., & Silver, D. (2016).** *Deep Reinforcement Learning with Double Q-learning.* Proceedings of the AAAI Conference on Artificial Intelligence.
   - Double DQN paper addressing overestimation bias

3. **Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2016).** *Prioritized Experience Replay.* International Conference on Learning Representations (ICLR).
   - PER paper introducing prioritized sampling from replay buffer

4. **Gymnasium Documentation:** https://gymnasium.farama.org/
   - The Farama Foundation's maintained fork of OpenAI Gym, including the LunarLander-v3 environment

## 10. Conclusion

This notebook demonstrates a comprehensive comparison of three value-based deep reinforcement learning algorithms (DQN, DDQN, and PER) on the LunarLander-v3 task. 

**Key Findings:**
- All three algorithms successfully learn to solve the LunarLander task
- DDQN shows improved stability over vanilla DQN by addressing overestimation bias
- PER can accelerate learning by prioritizing important transitions
- Extended training (10,000 episodes) yields significantly better and more stable policies

**Implementation Highlights:**
- Clean, modular code structure
- Shared agent architecture supporting all three algorithms
- A single hyperparameter configuration shared across all three algorithms
- Multiple experimental runs for validation
- Visual results (learning curves and agent behavior GIFs)

This work was completed as part of CIS2719 Coursework 2 - Foundations of Robotics & AI.

---

**Author:** Ethan Hulme  
**Date:** January 2026  
**GitHub Repository:** https://github.com/humm3ll/LunarLander-v3-RL