# Reinforcement Learning Project – Lunar Lander (Discrete Actions)

## Environment Overview

This notebook tackles the **LunarLander-v3** environment from Gymnasium, where the agent must learn to land a lunar module smoothly on a designated landing pad. The state space is continuous, so the action-value function has to be approximated rather than stored in a table, while the action space is discrete with four engine commands.

### Environment Characteristics:
- **Observation space**: Continuous 8-dimensional vector  
  (position, velocity, angle, angular velocity, and leg contact flags)
- **Action space**: Discrete, 4 actions  
  (do nothing, fire left orientation engine, fire main engine, fire right orientation engine)

## Algorithms Implemented

We implement and compare two value-based algorithms for this discrete-action environment:
1. **DQN** – Q-learning with a neural network Q-function, a replay buffer, and a target network (Section 2)
2. **Sarsa** – on-policy temporal-difference control (Section 3)



<div style="text-align: center;">
    <strong style="display: block; margin-bottom: 10px;">Group ??</strong> 
    <table style="margin: 0 auto; border-collapse: collapse; border: 1px solid black;">
        <tr>
            <th style="border: 1px solid white; padding: 8px;">Name</th>
            <th style="border: 1px solid white; padding: 8px;">Student ID</th>
        </tr>
        <tr>
            <td style="border: 1px solid white; padding: 8px;">Joana Rodrigues</td>
            <td style="border: 1px solid white; padding: 8px;">20240603</td>
        </tr>
        <tr>
            <td style="border: 1px solid white; padding: 8px;">Mara Simões</td>
            <td style="border: 1px solid white; padding: 8px;">20240326</td>
        </tr>
        <tr>
            <td style="border: 1px solid white; padding: 8px;">Matilde Street</td>
            <td style="border: 1px solid white; padding: 8px;">20240523</td>
        </tr>
        <tr>
            <td style="border: 1px solid white; padding: 8px;">Tomás Luzia</td>
            <td style="border: 1px solid white; padding: 8px;">20230477</td>
        </tr>
    </table>
</div>

### 🔗 Table of Contents <a id='table-of-contents'></a>
1. [Imports](#imports)
---

# 1. Imports

**Import Libraries**

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

import gymnasium as gym

# Importing necessary libraries for DQN implementation
import torch
import torch.nn as nn
import torch.nn.functional as F
import random
from collections import deque
import torch.optim as optim

**Load Environment**

In [3]:
# Create the environment
env = gym.make("LunarLander-v3", render_mode=None)  # use render_mode="human" to visualize the environment

# Observe the state and action spaces
print("Observation space:", env.observation_space)
print("Example of a state:", env.reset()[0])  # [0] because reset() returns (observation, info)

print("Action space:", env.action_space)
print("Number of possible actions:", env.action_space.n)

Observation space: Box([ -2.5        -2.5       -10.        -10.         -6.2831855 -10.
  -0.         -0.       ], [ 2.5        2.5       10.        10.         6.2831855 10.
  1.         1.       ], (8,), float32)
Example of a state: [-0.0027173   1.399282   -0.27525324 -0.51725024  0.00315551  0.06234898
  0.          0.        ]
Action space: Discrete(4)
Number of possible actions: 4


# 2. DQN

## 2.1. Neural Network

We define a fully connected neural network to approximate the Q-function. 
The input is the 8-dimensional state from the environment, and the output is a vector of 4 Q-values, one for each possible action.
This network will be used to predict the value of each action given a state.

In [4]:
class QNetwork(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(QNetwork, self).__init__()
        # Fully connected layers
        self.fc1 = nn.Linear(state_dim, 128)
        self.fc2 = nn.Linear(128, 128)
        self.out = nn.Linear(128, action_dim)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        q_values = self.out(x)
        return q_values

In [5]:
# Instantiate the model
state_dim = env.observation_space.shape[0]  # should be 8
action_dim = env.action_space.n             # should be 4
q_net = QNetwork(state_dim, action_dim)

## 2.2. Replay Buffer

We implement a replay buffer to store past transitions (state, action, reward, next_state, done).
During training, the agent samples random mini-batches from this buffer to break temporal correlations and stabilize learning.
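Random sampling makes each mini-batch closer to the independent, identically distributed data that supervised learning assumes, which is why the replay buffer is a standard component of Deep Q-Learning.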

In [6]:
class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        """Store a transition in the buffer."""
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Sample a random batch of transitions."""
        transitions = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*transitions)

        # Convert to tensors
        states = torch.tensor(np.array(states), dtype=torch.float32)
        actions = torch.tensor(actions, dtype=torch.int64).unsqueeze(1)
        rewards = torch.tensor(rewards, dtype=torch.float32).unsqueeze(1)
        next_states = torch.tensor(np.array(next_states), dtype=torch.float32)
        dones = torch.tensor(dones, dtype=torch.float32).unsqueeze(1)

        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)

In [7]:
buffer = ReplayBuffer(capacity=10000)
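
The fixed capacity comes from `deque(maxlen=capacity)`: once the buffer is full, each new transition pushes out the oldest one. The following is a minimal standalone sketch of that behaviour using only the standard library. The capacity of 3 and the placeholder transitions are assumptions chosen to make the eviction visible; they are not values used in this notebook.

```python
import random
from collections import deque

# Hypothetical tiny capacity so eviction is easy to see (the notebook uses 10000)
toy_buffer = deque(maxlen=3)

# Placeholder transitions: (state, action, reward, next_state, done)
for step in range(5):
    toy_buffer.append((step, 0, 0.0, step + 1, False))

# Only the 3 most recent transitions remain; the ones from steps 0 and 1 were evicted
stored_states = [transition[0] for transition in toy_buffer]
print(stored_states)

# Uniform sampling without replacement, the same call ReplayBuffer.sample relies on
random.seed(0)
batch = random.sample(list(toy_buffer), 2)
print(len(batch))
```

A consequence worth keeping in mind is that, with a capacity of 10000 transitions, experience from early (mostly random) episodes is gradually replaced by more recent experience as training proceeds.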

## 2.3. Epsilon-greedy

We implement an epsilon-greedy action selection strategy to balance exploration and exploitation:
- With probability ε, the agent selects a random action (exploration).
- With probability 1−ε, it selects the action with the highest predicted Q-value (exploitation).

This allows the agent to discover new strategies while gradually learning to exploit the best ones.

In [8]:
def select_action(state, q_network, epsilon, action_dim):
    """Selects an action using epsilon-greedy policy."""
    if np.random.rand() < epsilon:
        # Explore: random action
        return np.random.randint(action_dim)
    else:
        # Exploit: choose best action based on Q-values (no gradients needed for acting)
        state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0)  # shape (1, 8)
        with torch.no_grad():
            q_values = q_network(state_tensor)
        return torch.argmax(q_values).item()

In [9]:
state, _ = env.reset()
epsilon = 0.3

action = select_action(state, q_net, epsilon, action_dim)
print("Selected action:", action)

Selected action: 1
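
One detail of epsilon-greedy selection is easy to overlook: the random branch can also pick the greedy action, so the greedy action is chosen with probability (1 − ε) + ε / |A| rather than 1 − ε. The sketch below checks this empirically with plain Python. The Q-values and the helper `epsilon_greedy` are hypothetical stand-ins for the network output and for `select_action`.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the argmax of q_values."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Hypothetical Q-values for the 4 discrete actions; action 2 is the greedy one
q_values = [0.1, -0.4, 0.9, 0.3]
epsilon = 0.3
rng = random.Random(0)

n_trials = 100_000
greedy_count = sum(epsilon_greedy(q_values, epsilon, rng) == 2 for _ in range(n_trials))
empirical = greedy_count / n_trials

# Exploration can also land on the greedy action, so its total probability is
# (1 - epsilon) + epsilon / num_actions
expected = (1 - epsilon) + epsilon / len(q_values)
print(f"empirical={empirical:.3f}, expected={expected:.3f}")
```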


## 2.4. Target Network

We create a separate target network to provide stable Q-value targets during training.
Initially, the target network is a copy of the main Q-network.
Throughout training, it is updated periodically to reflect the weights of the current Q-network.
This technique reduces oscillations and helps stabilize the learning process.

In [10]:
# Create the target Q-network
target_q_net = QNetwork(state_dim, action_dim)
target_q_net.load_state_dict(q_net.state_dict())  # Copy weights from q_net
target_q_net.eval()  # Evaluation mode by convention (this network has no dropout or batchnorm layers)

QNetwork(
  (fc1): Linear(in_features=8, out_features=128, bias=True)
  (fc2): Linear(in_features=128, out_features=128, bias=True)
  (out): Linear(in_features=128, out_features=4, bias=True)
)

## 2.5. Training

We implement the main training loop where the agent interacts with the environment and stores transitions in the replay buffer.
Mini-batches are sampled to update the Q-network using target Q-values computed from a separate target network.
The network is trained by minimizing the mean squared error between the predicted Q-values and the target Q-values, and the target network is periodically updated to stabilize training.
Throughout training, we log the total reward per episode and track the value of epsilon to monitor the exploration-exploitation trade-off.
At the end, both reward and epsilon histories are saved to text files for visualization and analysis.
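
As a small worked example of the target computation described above, the sketch below evaluates y = r + γ · max Q_target(s', a') · (1 − done) for two transitions in plain Python. All numbers are hypothetical and only illustrate the arithmetic; in the training cell the same expression is computed on tensors.

```python
gamma = 0.99  # same discount factor as in the training cell

# Hypothetical mini-batch of two transitions
rewards = [1.0, -100.0]
dones = [0.0, 1.0]  # the second transition ends the episode

# Hypothetical target-network Q-values for the next states (4 actions each)
next_q = [
    [0.5, 2.0, -1.0, 0.0],
    [3.0, 1.0, 0.0, 2.0],
]

# y = r + gamma * max_a' Q_target(s', a') * (1 - done)
targets = [
    r + gamma * max(q) * (1.0 - d)
    for r, q, d in zip(rewards, next_q, dones)
]
print(targets)
```

For the terminal transition the bootstrap term is masked out, so the target reduces to the reward alone, regardless of what the target network predicts for the next state.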

In [None]:
# Hyperparameters
num_episodes = 500
batch_size = 64
gamma = 0.99
epsilon = 1.0
epsilon_min = 0.01
epsilon_decay = 0.995
target_update_freq = 1000  # steps
learning_rate = 1e-3

# Optimizer
optimizer = optim.Adam(q_net.parameters(), lr=learning_rate)

# Track total steps
total_steps = 0

# For logging
episode_rewards = []
epsilon_history = []

for episode in range(num_episodes):
    state, _ = env.reset()
    total_reward = 0
    done = False

    while not done:
        # Select action using epsilon-greedy
        action = select_action(state, q_net, epsilon, action_dim)

        # Interact with environment
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

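        # Store transition in replay buffer; store `terminated` (not `done`) so that
        # time-limit truncations still bootstrap from the next state in the target
        buffer.add(state, action, reward, next_state, terminated)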
        state = next_state
        total_reward += reward
        total_steps += 1

        # Start training only after batch is ready
        if len(buffer) >= batch_size:
            # Sample mini-batch
            states, actions, rewards, next_states, dones = buffer.sample(batch_size)

            # Compute Q-values for current states
            q_values = q_net(states).gather(1, actions)

            # Compute max Q-values for next states from target network
            with torch.no_grad():
                max_next_q = target_q_net(next_states).max(1)[0].unsqueeze(1)
                q_targets = rewards + gamma * max_next_q * (1 - dones)

            # Compute loss
            loss = F.mse_loss(q_values, q_targets)

            # Optimize
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # Periodically update target network
        if total_steps % target_update_freq == 0:
            target_q_net.load_state_dict(q_net.state_dict())

    # Decay epsilon
    epsilon = max(epsilon_min, epsilon * epsilon_decay)

    # Log metrics
    episode_rewards.append(total_reward)
    epsilon_history.append(epsilon)
    print(f"Episode {episode}, Reward: {total_reward:.2f}, Epsilon: {epsilon:.3f}")


# Save episode rewards to a file
with open("episode_rewards.txt", "w") as f:
    for r in episode_rewards:
        f.write(f"{r}\n")

# Save epsilon history to a file
with open("epsilon_history.txt", "w") as f:
    for eps in epsilon_history:
        f.write(f"{eps}\n")

Episode 0, Reward: -409.47, Epsilon: 0.995
Episode 1, Reward: -139.40, Epsilon: 0.990
Episode 2, Reward: -141.39, Epsilon: 0.985
Episode 3, Reward: -63.63, Epsilon: 0.980
Episode 4, Reward: -151.33, Epsilon: 0.975
Episode 5, Reward: -82.10, Epsilon: 0.970
Episode 6, Reward: -77.75, Epsilon: 0.966
Episode 7, Reward: -400.08, Epsilon: 0.961
Episode 8, Reward: -133.35, Epsilon: 0.956
Episode 9, Reward: -189.06, Epsilon: 0.951
Episode 10, Reward: -280.30, Epsilon: 0.946
Episode 11, Reward: -49.82, Epsilon: 0.942
Episode 12, Reward: -103.97, Epsilon: 0.937
Episode 13, Reward: -138.88, Epsilon: 0.932
Episode 14, Reward: -125.65, Epsilon: 0.928
Episode 15, Reward: -300.43, Epsilon: 0.923
Episode 16, Reward: -125.84, Epsilon: 0.918
Episode 17, Reward: -377.64, Epsilon: 0.914
Episode 18, Reward: -45.60, Epsilon: 0.909
Episode 19, Reward: -75.20, Epsilon: 0.905
Episode 20, Reward: -79.02, Epsilon: 0.900
Episode 21, Reward: -274.18, Epsilon: 0.896
Episode 22, Reward: -135.37, Epsilon: 0.891
...
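
The epsilon values in the log follow a simple closed form, since epsilon is multiplied by `epsilon_decay` once per episode. The sketch below, which assumes the hyperparameters from the training cell, computes where the schedule ends after 500 episodes and how many episodes it would take to reach `epsilon_min`.

```python
import math

# Values copied from the hyperparameters in the training cell
epsilon_start, epsilon_min, epsilon_decay = 1.0, 0.01, 0.995
num_episodes = 500

# Closed form of the per-episode multiplicative decay, clipped at epsilon_min
final_epsilon = max(epsilon_min, epsilon_start * epsilon_decay ** num_episodes)

# Smallest n with epsilon_start * epsilon_decay**n <= epsilon_min
episodes_to_min = math.ceil(
    math.log(epsilon_min / epsilon_start) / math.log(epsilon_decay)
)

print(f"epsilon after {num_episodes} episodes: {final_epsilon:.4f}")
print(f"episodes needed to reach epsilon_min: {episodes_to_min}")
```

Under these settings epsilon is still roughly 0.08 when training stops and would need over 900 episodes to reach the floor, so the agent keeps a noticeable amount of exploration throughout the run. Whether that is desirable depends on the experiment; a faster decay or more episodes are the obvious knobs.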

## 2.6. Visualization

In [12]:
# Load episode rewards
with open("episode_rewards.txt", "r") as f:
    episode_rewards = [float(line.strip()) for line in f.readlines()]

# Load epsilon history
with open("epsilon_history.txt", "r") as f:
    epsilon_history = [float(line.strip()) for line in f.readlines()]

# Create episode index
episodes = np.arange(len(episode_rewards))

# Compute moving average of rewards
window = 20
moving_avg = pd.Series(episode_rewards).rolling(window).mean()

In [13]:
import matplotlib
matplotlib.use("Agg")  # Use a non-interactive backend
import matplotlib.pyplot as plt

In [14]:
# Plot 1: Reward per episode + moving average
plt.figure(figsize=(12, 5))
plt.plot(episodes, episode_rewards, label="Reward per Episode", alpha=0.4)
plt.plot(episodes, moving_avg, label=f"{window}-Episode Moving Average", color='red', linewidth=2)
plt.xlabel("Episode")
plt.ylabel("Total Reward")
plt.title("DQN Training Performance on LunarLander-v3")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

In [None]:
# Plot 2: Epsilon decay over time
plt.figure(figsize=(12, 4))
plt.plot(episodes, epsilon_history, label="Epsilon", color='purple')
plt.xlabel("Episode")
plt.ylabel("Epsilon")
plt.title("Exploration Rate (Epsilon) Over Time")
plt.grid(True)
plt.tight_layout()
plt.show()

In [None]:
# Plot 3: Histogram of reward distribution
plt.figure(figsize=(8, 5))
plt.hist(episode_rewards, bins=30, edgecolor='black')
plt.xlabel("Total Reward")
plt.ylabel("Frequency")
plt.title("Reward Distribution Across Episodes")
plt.tight_layout()
plt.show()

# 3. Sarsa