# Deep Q-Network (DQN) Agent for LunarLander-v2

In this notebook, we'll implement a Deep Q-Network (DQN) to solve the "LunarLander-v2" environment from OpenAI Gym. The DQN agent will learn to land a spacecraft safely on the lunar surface. Our goal is an average score of 195 or above over 100 consecutive episodes, which is slightly looser than the 200-point mark that the Gym documentation treats as a solution.

### Import Libraries

In [1]:
import gym
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import random
from collections import namedtuple, deque
from tqdm import tqdm

### DQN Model

Below is the neural network that approximates the Q-value function: a fully connected network with two hidden layers of 64 units each. It takes the state as input and outputs one Q-value per action.


In [2]:
class DQN(nn.Module):
    """Fully connected Q-network: state vector in, one Q-value per action out."""

    def __init__(self, state_size, action_size):
        super().__init__()
        self.fc1 = nn.Linear(state_size, 64)
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, action_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        # No activation on the output layer: Q-values are unbounded.
        return self.fc3(x)
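
To get a feel for how small this network is, here is a back-of-the-envelope sketch that counts its trainable parameters. The helper `mlp_param_count` is hypothetical (it is not part of the notebook), and the layer sizes assume LunarLander-v2's 8-dimensional observation and 4 discrete actions, with a bias on every linear layer as in `nn.Linear`'s default.

```python
# Hypothetical helper: counts weights and biases of a fully connected network.
def mlp_param_count(layer_sizes):
    """Return the number of trainable parameters for the given layer widths."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus bias vector
    return total


# Assumed sizes: 8 state features, two hidden layers of 64 units, 4 actions.
print(mlp_param_count([8, 64, 64, 4]))  # 4996
```

At roughly five thousand parameters, the model is cheap enough to train on a CPU.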

### Replay Memory

We'll use a replay memory to store transitions that the agent observes, allowing us to reuse this data later. By sampling from it randomly, the transitions that build up a batch are decorrelated.

In [3]:
Transition = namedtuple('Transition', ('state', 'action', 'next_state', 'reward', 'done'))

class ReplayMemory:
    def __init__(self, capacity):
        self.memory = deque([], maxlen=capacity)

    def push(self, *args):
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
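
The two behaviours the buffer relies on, eviction of the oldest entries once capacity is reached and sampling without replacement, can be seen in a minimal standalone sketch. The capacity of 3 and the integer "transitions" are placeholders chosen for illustration.

```python
import random
from collections import deque

# Toy buffer with capacity 3; integers stand in for Transition tuples.
buffer = deque([], maxlen=3)
for t in range(5):
    buffer.append(t)

print(list(buffer))  # [2, 3, 4]: the two oldest entries were evicted

# random.sample draws without replacement, so a batch never repeats an entry.
batch = random.sample(list(buffer), 2)
print(len(batch), len(set(batch)))  # 2 2
```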

### DQN Agent

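The DQN agent interacts with the environment using an epsilon-greedy policy, stores each transition in the replay memory, and learns by replaying randomly sampled mini-batches. A separate target network, a periodically synced copy of the policy network, supplies the bootstrap values so that the regression targets stay fixed between syncs.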
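
As a sketch of what the `optimize_model` step in the cell below computes, the target and loss can be written as follows. The notation is an assumption of this note rather than the notebook's: $\theta$ for the policy-network weights, $\theta^-$ for the target-network weights, $d \in \{0, 1\}$ for the done flag, and $B$ for the batch size.

$$
y_i = r_i + \gamma \,(1 - d_i)\, \max_{a'} Q_{\theta^-}(s'_i, a'),
\qquad
L(\theta) = \frac{1}{B} \sum_{i=1}^{B} \bigl( Q_{\theta}(s_i, a_i) - y_i \bigr)^2
$$

The factor $(1 - d_i)$ removes the bootstrap term for terminal transitions, so the target reduces to the immediate reward $r_i$ there.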

In [4]:
class DQNAgent:
    def __init__(self, state_size, action_size, batch_size=128, gamma=0.99,
                 epsilon_start=1.0, epsilon_end=0.01, epsilon_decay=0.995,
                 learning_rate=0.0005):
        self.state_size = state_size
        self.action_size = action_size
        self.batch_size = batch_size
        self.gamma = gamma
        self.epsilon = epsilon_start
        self.epsilon_end = epsilon_end
        self.epsilon_decay = epsilon_decay
        self.learning_rate = learning_rate

        self.policy_net = DQN(state_size, action_size)
        self.target_net = DQN(state_size, action_size)
        self.target_net.load_state_dict(self.policy_net.state_dict())
        self.target_net.eval()

        self.optimizer = optim.Adam(self.policy_net.parameters(), lr=self.learning_rate)
        self.memory = ReplayMemory(10000)

    def select_action(self, state):
        sample = random.random()
        eps_threshold = self.epsilon
        if sample > eps_threshold:
            with torch.no_grad():
                return self.policy_net(state).max(1)[1].view(1, 1)
        else:
            return torch.tensor([[random.randrange(self.action_size)]], dtype=torch.long)

    def optimize_model(self):
        # Wait until the buffer holds at least one full batch.
        if len(self.memory) < self.batch_size:
            return

        transitions = self.memory.sample(self.batch_size)
        # Transpose a list of Transitions into one Transition of tuples.
        batch = Transition(*zip(*transitions))

        state_batch = torch.cat(batch.state)
        action_batch = torch.cat(batch.action)
        reward_batch = torch.cat(batch.reward)
        # next_state is always stored as a tensor, so no None filtering is needed;
        # terminal transitions are handled by done_batch below.
        next_state_batch = torch.cat(batch.next_state)
        done_batch = torch.tensor(batch.done, dtype=torch.float32)

        # Q(s, a) for the actions that were actually taken.
        state_action_values = self.policy_net(state_batch).gather(1, action_batch)

        # max over a' of Q_target(s', a'); no gradient flows into the target network.
        with torch.no_grad():
            next_state_values = self.target_net(next_state_batch).max(1)[0]

        # TD target: the bootstrap term is zeroed for terminal transitions.
        expected_state_action_values = reward_batch + self.gamma * next_state_values * (1 - done_batch)

        loss = F.mse_loss(state_action_values, expected_state_action_values.unsqueeze(1))
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

    def update_target_net(self):
        self.target_net.load_state_dict(self.policy_net.state_dict())

    def decay_epsilon(self):
        self.epsilon = max(self.epsilon_end, self.epsilon_decay * self.epsilon)
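
To see what the default exploration schedule implies, the sketch below replays the same `max(epsilon_end, epsilon_decay * epsilon)` rule outside the agent and counts how many per-episode decays it takes to hit the floor. The function name `episodes_until_floor` is hypothetical; the defaults mirror the constructor arguments above.

```python
def episodes_until_floor(eps_start=1.0, eps_end=0.01, decay=0.995):
    """Count the multiplicative decays applied before epsilon is clamped at eps_end."""
    eps, n = eps_start, 0
    while eps > eps_end:
        eps = max(eps_end, decay * eps)  # same rule as DQNAgent.decay_epsilon
        n += 1
    return n


n = episodes_until_floor()
print(n)
```

With these defaults epsilon only reaches its floor after a little more than 900 episodes, so the agent keeps exploring noticeably for most of a 1000-episode run. A smaller `epsilon_decay` would shift the agent toward greedy behaviour sooner.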

### Training the Agent

Now, let's train our DQN agent on the LunarLander environment. We run up to 1000 episodes. After every environment step the agent stores the transition and takes one optimization step; at the end of each episode it syncs the target network and decays epsilon.

In [None]:
env = gym.make('LunarLander-v2')
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
agent = DQNAgent(state_size, action_size)

num_episodes = 1000
scores = []

# Wrap the episode range with tqdm for the progress bar
for i_episode in tqdm(range(num_episodes), desc="Training Progress"):
    state_tuple = env.reset()
    state = state_tuple[0] if isinstance(state_tuple, tuple) else state_tuple
    state = torch.from_numpy(state).float().unsqueeze(0)
    total_reward = 0
    done = False

    while not done:
        action = agent.select_action(state)
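        output = env.step(action.item())
        next_state, reward = output[0], output[1]
        # Gym >= 0.26 returns (obs, reward, terminated, truncated, info);
        # older versions return (obs, reward, done, info).
        terminated = output[2]
        truncated = output[3] if len(output) == 5 else False
        done = terminated or truncated

        next_state = torch.from_numpy(next_state).float().unsqueeze(0)
        reward_tensor = torch.tensor([reward], dtype=torch.float)

        # Store `terminated`, not `done`: a time-limit cutoff should still bootstrap.
        agent.memory.push(state, action, next_state, reward_tensor, terminated)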
        state = next_state
        total_reward += reward
        agent.optimize_model()

    agent.update_target_net()
    agent.decay_epsilon()
    scores.append(total_reward)

    if i_episode % 50 == 0:
        print(f"Episode {i_episode}, Average Score: {np.mean(scores[-50:])}")

    # Stop once the average over the last 100 episodes reaches the target.
    if len(scores) >= 100 and np.mean(scores[-100:]) >= 195:
        print(f"Solved in {i_episode + 1} episodes!")
        break

env.close()

Training Progress:   0%|          | 2/1000 [00:00<01:12, 13.77it/s]

Episode 0, Average Score: -145.64225312385526


Training Progress:   5%|▌         | 51/1000 [00:18<05:49,  2.72it/s]

Episode 50, Average Score: -186.26312916575782


Training Progress:  10%|█         | 101/1000 [00:44<09:35,  1.56it/s]

Episode 100, Average Score: -106.56872105648213


Training Progress:  13%|█▎        | 128/1000 [01:40<1:24:59,  5.85s/it]
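
The stopping rule in the training loop can also be factored into a small helper that is easy to check on its own. This is a sketch under assumptions: `solved_at` is a hypothetical name, and the scores below are made up for illustration, not results from this notebook.

```python
from collections import deque


def solved_at(scores, window=100, threshold=195.0):
    """Return the 1-based episode at which the rolling mean first reaches threshold, or None."""
    recent = deque(maxlen=window)
    for episode, score in enumerate(scores, start=1):
        recent.append(score)
        # Only judge the agent once a full window of episodes is available.
        if len(recent) == window and sum(recent) / window >= threshold:
            return episode
    return None


# Made-up scores: three poor episodes followed by five good ones.
toy_scores = [100.0] * 3 + [200.0] * 5
print(solved_at(toy_scores, window=3, threshold=195.0))  # 6
```

Requiring a full window before testing the mean prevents a single lucky early episode from ending training prematurely.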