# Deep Q-Network with Lunar Lander

This notebook shows an implementation of a DQN on the LunarLander environment.
Details on the environment can be found [here](https://gym.openai.com/envs/LunarLander-v2/).

## 1. Setup

We first need to install some dependencies for using the environment:

In [1]:
!pip3 install box2d-py # for mac
# !pip3 install box2d-py # for windows
!pip3 install pyglet



In [2]:
import random
from time import time
from collections import deque
import numpy as np
import gym
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

In [3]:
env = gym.make('LunarLander-v2')
env.seed(0)
random.seed(0)
np.random.seed(0)  # choose_action draws random actions via np.random
torch.manual_seed(0)

<torch._C.Generator at 0x7f938b27f3f0>

In [4]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
device

device(type='cpu')

## 2. Define the neural network, the replay buffer and the agent

First, we define the neural network that predicts the Q-values for all actions, given a state as input.
This is a fully connected network with two hidden layers that use ReLU activations.
The last layer does not have any activation and outputs a Q-value for every action.

In [5]:
class QNetwork(nn.Module):
    def __init__(self, state_size, action_size):
        super(QNetwork, self).__init__()
        self.fc1 = nn.Linear(state_size, 32)
        self.fc2 = nn.Linear(32, 64)
        self.fc3 = nn.Linear(64, action_size)  
        
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)     
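
As a quick sanity check, the sketch below feeds a dummy batch through the network and inspects the output shape. The class is repeated so the snippet runs on its own; the sizes (8 state features, 4 actions) and the batch of 5 zero states are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class QNetwork(nn.Module):
    """Same architecture as the cell above, repeated so this sketch is self-contained."""

    def __init__(self, state_size, action_size):
        super().__init__()
        self.fc1 = nn.Linear(state_size, 32)
        self.fc2 = nn.Linear(32, 64)
        self.fc3 = nn.Linear(64, action_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)  # raw Q-values, no activation


# Assumed sizes: 8 state features and 4 actions
net = QNetwork(state_size=8, action_size=4)
dummy_states = torch.zeros(5, 8)          # hypothetical batch of 5 states
q_values = net(dummy_states)
print(q_values.shape)                     # one row per state, one column per action

greedy_actions = q_values.argmax(dim=1)   # index of the largest Q-value per state
print(greedy_actions.shape)
```

Each row of the output holds one Q-value per action, so taking the `argmax` along dimension 1 yields the greedy action for every state in the batch.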

Next, we define a replay buffer that saves previous transitions and provides a `sample` function to randomly extract a batch of experiences from the buffer.

Note that experiences are stored internally as NumPy arrays. The `sample` method converts them to PyTorch tensors before returning them.

In [6]:
class StateTransition:
    def __init__(self, state, action, reward, next_state, done):
        self.state = state
        self.action = action
        self.reward = reward
        self.next_state = next_state
        self.done = 1 if done else 0 # Convert done flag from boolean to int

class ReplayBuffer:
    def __init__(self, buffer_size, batch_size):
        self.batch_size = batch_size
        self.memory = deque(maxlen=buffer_size)
       
    def add(self, state, action, reward, next_state, done):
        state_transition = StateTransition(state, action, reward, next_state, done)
        self.memory.append(state_transition)
                
    def sample(self):
        state_transitions = random.sample(self.memory, self.batch_size)
        
        # Convert to PyTorch tensors
        states = np.vstack([s_t.state for s_t in state_transitions])
        states_tensor = torch.from_numpy(states).float().to(device)
        
        actions = np.vstack([s_t.action for s_t in state_transitions])
        actions_tensor = torch.from_numpy(actions).long().to(device)

        rewards = np.vstack([s_t.reward for s_t in state_transitions])
        rewards_tensor = torch.from_numpy(rewards).float().to(device)

        next_states = np.vstack([s_t.next_state for s_t in state_transitions])
        next_states_tensor = torch.from_numpy(next_states).float().to(device)
        
        dones = np.vstack([s_t.done for s_t in state_transitions])
        dones_tensor = torch.from_numpy(dones).float().to(device)
        
        return (states_tensor, actions_tensor, rewards_tensor, next_states_tensor, dones_tensor)
        
    def is_filled(self):
        return len(self.memory) >= self.batch_size  # use the buffer's own batch size, not the global constant
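
The following sketch illustrates the shapes that the `np.vstack` calls in `sample` produce. The three transitions are made-up values, not data from the environment. The point is that scalar actions, rewards, and done flags end up as column vectors of shape `(batch, 1)`.

```python
import numpy as np
import torch

# Three hypothetical transitions: (state, action, reward, done)
transitions = [
    (np.zeros(8, dtype=np.float32), 2, 1.5, False),
    (np.ones(8, dtype=np.float32), 0, -0.3, False),
    (np.full(8, 0.5, dtype=np.float32), 3, -100.0, True),
]

# vstack turns each 1-D state into a row and each scalar into a 1x1 row
states = np.vstack([t[0] for t in transitions])
actions = np.vstack([t[1] for t in transitions])
rewards = np.vstack([t[2] for t in transitions])
dones = np.vstack([int(t[3]) for t in transitions])

states_tensor = torch.from_numpy(states).float()
actions_tensor = torch.from_numpy(actions).long()   # gather() expects integer indices
rewards_tensor = torch.from_numpy(rewards).float()
dones_tensor = torch.from_numpy(dones).float()

print(states_tensor.shape, actions_tensor.shape,
      rewards_tensor.shape, dones_tensor.shape)
```

Keeping actions, rewards, and done flags as `(batch, 1)` columns means they line up element-wise with the output of `gather(1, actions)` when the loss is computed.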
    

In [7]:
BUFFER_SIZE = 100000    # Replay memory size
BATCH_SIZE = 64         # Number of experiences to sample from memory
GAMMA = 0.99            # Discount factor
TARGET_SYNC = 20        # How often (in steps) the target network is synchronized
       
class DQNAgent:
    def __init__(self, state_size, action_size):
        
        self.action_size = action_size
        
        # Initialize Q and Target Q networks
        self.q_network = QNetwork(state_size, action_size).to(device)
        self.target_network = QNetwork(state_size, action_size).to(device)
        
        self.optimizer = optim.Adam(self.q_network.parameters(), lr=0.001)
        
        # Initialize replay buffer
        self.memory = ReplayBuffer(BUFFER_SIZE, BATCH_SIZE)
        self.timestep = 0
    
    def train(self, state, action, reward, next_state, done):

        self.memory.add(state, action, reward, next_state, done)
        self.timestep += 1
        
        if not self.memory.is_filled(): # train only once the buffer holds at least one batch
            return

        states, actions, rewards, next_states, dones = self.memory.sample()
               
        # you need to implement the following method in task 5
        loss = self.calculate_loss(states, actions, rewards, next_states, dones) 
        
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        
        # Synchronize target network by copying weights
        if self.timestep % TARGET_SYNC == 0:
            self.target_network.load_state_dict(self.q_network.state_dict())
    
    
    def calculate_loss(self, states, actions, rewards, next_states, dones):
    
        action_values = self.target_network(next_states).detach()
        max_action_values = action_values.max(1)[0].unsqueeze(1)

        # If "done==1" just use reward, else update Q_target with discounted action values
        Q_target = rewards + (GAMMA * max_action_values * (1 - dones))
        Q_prediction = self.q_network(states).gather(1, actions)

        # Mean squared error between predicted and target Q-values
        loss = F.mse_loss(Q_prediction, Q_target)

        return loss
    
    def choose_action(self, state, epsilon):
        rnd = random.random()
        if rnd < epsilon:
            return np.random.randint(self.action_size)
        else:
            state = torch.from_numpy(state).float().unsqueeze(0).to(device)
            action_values = self.q_network(state)
            action = np.argmax(action_values.cpu().data.numpy())
            return action
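
The periodic synchronization in `train` can be illustrated in isolation. In this sketch two small `nn.Linear` layers are hypothetical stand-ins for the Q-network and the target network. It shows that `load_state_dict` copies the weights, so the target stays fixed while the Q-network keeps learning.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
q_network = nn.Linear(8, 4)        # hypothetical stand-in for the Q-network
target_network = nn.Linear(8, 4)   # hypothetical stand-in for the target network
x = torch.ones(1, 8)

# Copy the current weights into the target network
target_network.load_state_dict(q_network.state_dict())
after_sync = torch.allclose(q_network(x), target_network(x))

# One gradient step on the Q-network only (arbitrary loss, for illustration)
optimizer = torch.optim.SGD(q_network.parameters(), lr=0.1)
loss = q_network(x).pow(2).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()

# The target network keeps the old weights until the next sync
diverged = not torch.allclose(q_network(x), target_network(x))
print(after_sync, diverged)
```

Holding the target network fixed between syncs keeps the regression targets from shifting on every gradient step.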

### 3. Execute episodes and train the model

We first define the necessary parameters for training:

In [8]:
TARGET_SCORE = 200            # Train until this score is reached
MAX_EPISODE_LENGTH = 1000     # Max steps allowed in a single episode
EPSILON_MIN = 0.01            # Minimum epsilon 

Then we start executing episodes and observe the mean score over the last 100 episodes.
The environment is considered solved once this mean score reaches 200.

In [9]:
# Get state and action sizes
state_size = env.observation_space.shape[0]
action_size = env.action_space.n

print(f'State size: {state_size}, action size: {action_size}')
dqn_agent = DQNAgent(state_size, action_size)
start = time()
last_time = start

scores_window = deque(maxlen=100)
mean_score = 0
episode = 0

while True:
    episode += 1
    score = 0
    state = env.reset()

    for t in range(MAX_EPISODE_LENGTH):
        
        epsilon = max(1/episode, EPSILON_MIN)
        action = dqn_agent.choose_action(state, epsilon)
        next_state, reward, done, info = env.step(action)
        
        dqn_agent.train(state, action, reward, next_state, done)
        state = next_state        
        score += reward        
        if done:
            break

    scores_window.append(score)
    mean_score = np.mean(scores_window)
    
    if episode % 10 == 0:
        print(f'After {episode} episodes, average score is {mean_score:.2f}. ', end='')
        print(f'Took {time()-last_time:.0f} seconds.')
        last_time = time()
    
    if mean_score >= TARGET_SCORE:
        print(f'Environment solved in {episode} episodes. Average score: {mean_score:.2f}')
        break

print(f'Took {time()-start:.0f} seconds (~{(time()-start)//60} minutes)')

State size: 8, action size: 4
After 10 episodes, average score is -189.71. Took 3 seconds.
After 20 episodes, average score is -187.55. Took 4 seconds.
After 30 episodes, average score is -177.34. Took 14 seconds.
After 40 episodes, average score is -155.51. Took 21 seconds.
After 50 episodes, average score is -150.81. Took 17 seconds.
After 60 episodes, average score is -135.88. Took 18 seconds.
After 70 episodes, average score is -114.35. Took 17 seconds.
After 80 episodes, average score is -99.03. Took 20 seconds.
After 90 episodes, average score is -88.96. Took 22 seconds.
After 100 episodes, average score is -82.97. Took 16 seconds.
After 110 episodes, average score is -68.24. Took 21 seconds.
After 120 episodes, average score is -43.03. Took 14 seconds.
After 130 episodes, average score is -32.91. Took 13 seconds.
After 140 episodes, average score is -39.19. Took 12 seconds.
After 150 episodes, average score is -31.52. Took 26 seconds.
After 160 episodes, average score is -37.27.
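
The exploration rate in the loop above follows `max(1/episode, EPSILON_MIN)`. The short sketch below prints it for a few episode numbers; the helper name `epsilon_for` is introduced here for illustration only.

```python
EPSILON_MIN = 0.01


def epsilon_for(episode):
    """Exploration rate from the training loop: 1/episode, floored at EPSILON_MIN."""
    return max(1 / episode, EPSILON_MIN)


for episode in (1, 2, 10, 50, 100, 500):
    print(f'episode {episode:>3}: epsilon = {epsilon_for(episode):.3f}')
```

With these settings the agent acts fully at random in the first episode and reaches the minimum exploration rate at episode 100.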

### 4. Play an episode and record it

Use the trained model to play and record one episode. The recorded video is stored in the `video` subfolder on disk.

In [10]:
from time import sleep, time  # import names, not the module, so `time()` from the setup cell keeps working

FPS = 25
record_folder = "video"

env = gym.make('LunarLander-v2')
env = gym.wrappers.Monitor(env, record_folder, force=True)

state = env.reset()
total_reward = 0.0

while True:
    start_time = time()
    env.render()

    # Pick the greedy action from the trained Q-network
    state = torch.from_numpy(state).float().unsqueeze(0).to(device)
    action_values = dqn_agent.q_network(state)
    action = np.argmax(action_values.cpu().data.numpy())

    state, reward, done, _ = env.step(action)
    total_reward += reward
    if done:
        break
    
    # Sleep for the rest of the frame to cap playback at FPS
    compute_time = time() - start_time
    delta = 1/FPS - compute_time
    if delta > 0:
        sleep(delta)

print(f"Total reward: {total_reward:.2f}" )
env.close()

Unknown encoder 'libx264'
ERROR: VideoRecorder encoder failed: None

Total reward: 215.93

ERROR: VideoRecorder encoder exited with status 1


### 5. Task: Implement the following functions to make the code above work

For the agent to learn anything, we need to implement the `calculate_loss` function in the code above. To make this easier, we work through the following subtasks.

As an example, we are given a tiny replay buffer that contains only two transitions, each of the form `state`, `action`, `reward`, `next_state` and `done`.

The resulting tensors `states`, `actions`, `rewards`, `next_states` and `dones` have the same format as the inputs to the function `calculate_loss`.

In [11]:
state_1 = [ 0.64,  0.38,  0.04, -0.10, -0.22, -.00,  0.00,  0.00]
state_2 = [ 0.00,  0.35,  0.41, -0.59, -0.66, -0.23,  0.00,  0.00]
states = torch.FloatTensor([state_1, state_2])

actions = torch.LongTensor([[2],[1]])

rewards = torch.FloatTensor([[1.8670],[1.2630]])

next_state_1 = [-0.60,  0.94, -0.04, -0.13,  0.27, 0.70,  0.00,  0.00]
next_state_2 = [-0.60,  0.94, -0.04, -0.13,  0.27, 0.70,  0.00,  0.00]
next_states = torch.FloatTensor([next_state_1, next_state_2])

dones = torch.FloatTensor([[0],[1]])

#### Subtask 1:

We first calculate the Q-learning target. As a first step, we use the `target_network` to calculate the Q-values for every state in the `next_states` tensor.

In [12]:
q_values = dqn_agent.target_network(next_states)
q_values

tensor([[49.8446, 44.1981, 52.0392, 56.0594],
        [49.8446, 44.1981, 52.0392, 56.0594]], grad_fn=<AddmmBackward0>)

Since we do not want to backpropagate on these values, we detach them from the computational graph as follows:

In [13]:
q_values = q_values.detach()
q_values

tensor([[49.8446, 44.1981, 52.0392, 56.0594],
        [49.8446, 44.1981, 52.0392, 56.0594]])
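
The effect of `detach` can be checked on a small stand-alone example. Here a single `nn.Linear` layer is a hypothetical stand-in for the target network. The sketch also shows `torch.no_grad()`, an alternative way to get values without gradient tracking.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(8, 4)            # hypothetical stand-in for the target network
next_states = torch.zeros(2, 8)

q_values = layer(next_states)
print(q_values.requires_grad)      # True: still attached to the graph

detached = q_values.detach()
print(detached.requires_grad)      # False: gradients stop here
print(torch.equal(q_values, detached))  # True: the values are unchanged

# Alternative: skip graph construction altogether
with torch.no_grad():
    q_no_grad = layer(next_states)
print(q_no_grad.requires_grad)     # False
```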

Since we are using Q-learning, we are only interested in the maximum value per row.
Implement code that reduces the tensor above to a torch tensor of shape `[2, 1]` that contains only the maximum Q-value for every state.

In [14]:
max_q_values = q_values.max(1)[0].unsqueeze(1)
max_q_values

tensor([[56.0594],
        [56.0594]])
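
For reference, `max(1)` returns both the maximum values and their indices along dimension 1. The sketch below reuses the Q-values printed above and shows an equivalent spelling with `keepdim=True`, which avoids the separate `unsqueeze`.

```python
import torch

q_values = torch.tensor([[49.8446, 44.1981, 52.0392, 56.0594],
                         [49.8446, 44.1981, 52.0392, 56.0594]])

values, indices = q_values.max(1)      # maxima and their column indices per row
max_a = values.unsqueeze(1)            # shape [2] -> [2, 1]

# Equivalent: keep the reduced dimension directly
max_b = q_values.max(dim=1, keepdim=True).values

print(max_a.shape, indices.tolist())
print(torch.equal(max_a, max_b))
```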

Now we are ready to calculate the Q-learning targets using the tensors `rewards` and `dones` as seen in the lecture. Remember: the target consists only of the reward if the done flag is set for a transition.

In [15]:
GAMMA = 0.99
Q_targets = rewards + (GAMMA * max_q_values * (1 - dones))
Q_targets

tensor([[57.3658],
        [ 1.2630]])
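
Written out, the target for transition $i$ and the arithmetic for the two example transitions look as follows. The symbols $y_i$, $r_i$, $d_i$ and $Q_{\text{target}}$ are notation chosen here for the target, reward, done flag and target network; they may differ from the lecture's notation.

```latex
y_i = r_i + \gamma \,(1 - d_i)\, \max_{a'} Q_{\text{target}}(s'_i, a')

y_1 = 1.8670 + 0.99 \cdot (1 - 0) \cdot 56.0594 = 1.8670 + 55.4988 = 57.3658

y_2 = 1.2630 + 0.99 \cdot (1 - 1) \cdot 56.0594 = 1.2630
```

The second transition is terminal, so the bootstrapped term vanishes and only the reward remains.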

#### Subtask 2:

We now calculate the prediction of the network on the current states. For this we use the `q_network` of the agent.

In [16]:
predictions = dqn_agent.q_network(states)
predictions

tensor([[45.6411, 48.2667, 47.1487, 43.2907],
        [ 2.1099,  9.9811,  6.8551, -2.1050]], grad_fn=<AddmmBackward0>)

For every state, this returns the Q-values for all actions. However, we only need the Q-value of the action that was actually taken in this transition (this is stored in `actions`).
Next, extract the Q-value for the taken action.

In [17]:
q_value_action = predictions.gather(1, actions)
q_value_action

tensor([[47.1487],
        [ 9.9811]], grad_fn=<GatherBackward0>)
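
To see what `gather(1, actions)` does, the sketch below applies it to the prediction values printed above (copied as plain constants, so no gradient information is attached) and compares the result with explicit row and column indexing.

```python
import torch

predictions = torch.tensor([[45.6411, 48.2667, 47.1487, 43.2907],
                            [ 2.1099,  9.9811,  6.8551, -2.1050]])
actions = torch.tensor([[2], [1]])     # column index to pick in each row

# gather along dim 1: row i keeps the entry at column actions[i]
picked = predictions.gather(1, actions)

# Equivalent spelling with explicit row/column indexing
rows = torch.arange(predictions.shape[0])
same = predictions[rows, actions.squeeze(1)].unsqueeze(1)

print(picked)
print(torch.equal(picked, same))
```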

These values can now be used to define the loss for the current batch:

In [18]:
loss = F.mse_loss(q_value_action, Q_targets)
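
As a cross-check, the mean squared error can be computed by hand. This sketch copies the example values for the predicted Q-values and the targets from the outputs above as plain constants and compares `F.mse_loss` with the explicit formula.

```python
import torch
import torch.nn.functional as F

q_value_action = torch.tensor([[47.1487], [9.9811]])
Q_targets = torch.tensor([[57.3658], [1.2630]])

loss = F.mse_loss(q_value_action, Q_targets)

# Same quantity written out: mean of squared differences
manual = ((q_value_action - Q_targets) ** 2).mean()

print(loss.item(), manual.item())
```

Both tensors have shape `[2, 1]`, so the differences are taken element-wise without any broadcasting.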

#### Subtask 3:
Use the code from these examples to implement the `calculate_loss` function from above and train the agent.