In [1]:
%matplotlib inline

Reinforcement Learning (DQN) Tutorial
=====================================
In this exercise, you will practice how to use PyTorch to train a Deep Q-learning (DQN) agent
on the CartPole-v0 task from OpenAI Gym. Specifically, you will need to implement several functions/classes which are necessary components of the DQN algorithm:

1.  ``ReplayMemory``
2.  ``Network`` (the Q-network)
3.  ``optimize_model``
4.  ``select_action``

Each function/class has its own cell where you can see more details. Please complete the exercises marked with TODO.

**Packages**

We need OpenAI gym for the environment (Install using `pip install gym`).

In [2]:
import gym
from gym import wrappers
import random
import math
import torch
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable
import torch.nn.functional as F
import matplotlib.pyplot as plt
import pdb

# if gpu is to be used
use_cuda = torch.cuda.is_available()
FloatTensor = torch.cuda.FloatTensor if use_cuda else torch.FloatTensor
LongTensor = torch.cuda.LongTensor if use_cuda else torch.LongTensor
ByteTensor = torch.cuda.ByteTensor if use_cuda else torch.ByteTensor
Tensor = FloatTensor

## Hyperparameters
After implementing the neural network model and the other necessary functions, you can try further hyperparameter tuning.

In [3]:
# hyper parameters
EPISODES = 400  # number of episodes
EPS_START = 0.9  # e-greedy threshold start value
EPS_END = 0.05  # e-greedy threshold end value
EPS_DECAY = 200  # e-greedy threshold decay
GAMMA = 0.99  # Q-learning discount factor
LR = 0.01  # NN optimizer learning rate
HIDDEN_LAYER = 64  # NN hidden layer size
BATCH_SIZE = 64  # Q-learning batch size

## Environment
CartPole-v0 is a classic reinforcement learning environment from OpenAI Gym. In this environment, the agent has to decide between two actions $-$ moving the cart left or right $-$ so that the pole attached to it stays upright.

As the agent observes the current state of the environment and chooses an action, the environment transitions to a new state and returns a reward that indicates the consequences of the action. In this task, the reward is +1 for every timestep, and the episode terminates if the pole falls over too far or the cart moves more than 2.4 units away from the center. Better-performing episodes therefore run longer and accumulate a larger return. The inputs to the agent are 4 real values representing the environment state (cart position, cart velocity, pole angle, and pole angular velocity).

We first set up the CartPole-v0 environment using Gym.

In [4]:
env = gym.make('CartPole-v0')
env._max_episode_steps = 500
env = wrappers.Monitor(env, './tmp/cartpole-v0-1', force=True)

Replay Memory
-------------
We will be using experience replay memory for training our DQN. It stores
the transitions that the agent observes, allowing us to reuse this data
later. By sampling from it randomly, the transitions that build up a
batch are decorrelated. It has been shown that this greatly stabilizes
and improves the DQN training procedure.

For this, we're going to implement the replay memory buffer as a Python class:

-  ``ReplayMemory`` $-$ a cyclic buffer of bounded size that holds the
   transitions observed recently. It implements a ``.sample()``
   method for selecting a random batch of transitions for training, and a ``.push()`` method for adding a new transition, which removes the oldest saved transition if the buffer size exceeds the capacity. Each transition is a tuple consisting of state, action, next_state, reward.

In [5]:
class ReplayMemory:
    def __init__(self, capacity):
        self.capacity = capacity
        self.memory = []

    def push(self, transition):
        self.memory.append(transition)
        if len(self.memory) > self.capacity:
            del self.memory[0]

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
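
The list-based buffer above removes the oldest item with `del self.memory[0]`, which shifts every remaining element. As an alternative sketch (not the notebook's reference solution), the same cyclic behavior can be obtained from `collections.deque` with a `maxlen`. The class name `DequeReplayMemory` and the toy integer transitions are assumptions made for illustration only.

```python
import random
from collections import deque


class DequeReplayMemory:
    """Bounded replay buffer; the oldest transition is dropped automatically."""

    def __init__(self, capacity):
        # deque(maxlen=...) discards items from the left once it is full
        self.memory = deque(maxlen=capacity)

    def push(self, transition):
        self.memory.append(transition)

    def sample(self, batch_size):
        # copy to a list so random.sample works on a plain sequence
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


# toy transitions: (state, action, next_state, reward)
buffer = DequeReplayMemory(capacity=3)
for step in range(5):
    buffer.push((step, 0, step + 1, 1.0))
```

After five pushes into a buffer of capacity 3, only the three most recent transitions remain, and `buffer.sample(2)` draws two of them without replacement.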

## Q-Network
Next, we need to define our model. Our model will consist of fully connected layers that take in the
state returned by the environment. It has two
outputs, representing $Q(s, \mathrm{left})$ and
$Q(s, \mathrm{right})$ (where $s$ is the input to the
network). In effect, the network is trying to predict the *expected return* of
taking each action given the current input.

Define a 2-layer fully connected neural network with ${\rm tanh}$ activation at the hidden layer, followed by the output layer. The hidden layer size is set by the hyperparameter ``HIDDEN_LAYER`` and the size of the output is 2. You could also try any other architecture you want.

In [6]:
class Network(nn.Module):
    def __init__(self):
        nn.Module.__init__(self)
        self.l1 = nn.Linear(4, HIDDEN_LAYER)
        self.l2 = nn.Linear(HIDDEN_LAYER, 2)

    def forward(self, x):
        x = torch.tanh(self.l1(x))
        x = self.l2(x)
        return x

Create the model, memory buffer and optimizer.

In [7]:
model = Network()
if use_cuda:
    model.cuda()
memory = ReplayMemory(10000)
optimizer = optim.Adam(model.parameters(), LR)
steps_done = 0
episode_durations = []

DQN algorithm
-------------

Our environment is deterministic, so all equations presented here are
also formulated deterministically for the sake of simplicity. In the
reinforcement learning literature, they would also contain expectations
over stochastic transitions in the environment.

Our aim will be to train a policy that tries to maximize the discounted,
cumulative reward
$R_{t_0} = \sum_{t=t_0}^{\infty} \gamma^{t - t_0} r_t$, where
$R_{t_0}$ is also known as the *return*. The discount,
$\gamma$, should be a constant between $0$ and $1$
that ensures the sum converges. It makes rewards from the uncertain far
future less important for our agent than the ones in the near future
that it can be fairly confident about.
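
To make the return concrete, here is a small sketch that evaluates the discounted sum for a finite list of rewards. The helper name `discounted_return` is hypothetical. The reward lists assume the +1-per-step reward described earlier and ignore the -1 terminal adjustment used in the training loop.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * r_k over a finite list of rewards."""
    total = 0.0
    # iterate backwards so each step is one multiply and one add
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


GAMMA = 0.99
short_episode = discounted_return([1.0] * 10, GAMMA)
long_episode = discounted_return([1.0] * 200, GAMMA)
```

With a constant reward of 1, the return can never exceed $1 / (1 - \gamma) = 100$, which is why a discount below 1 keeps the infinite sum finite.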

The main idea behind Q-learning is that if we had a function
$Q^*: State \times Action \rightarrow \mathbb{R}$, that could tell
us what our return would be, if we were to take an action in a given
state, then we could easily construct a policy that maximizes our
rewards:

$$\pi^*(s) = \arg\!\max_a \ Q^*(s, a)$$
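
As an illustration of this policy for a single state, the sketch below picks the action index with the largest Q-value. The function name `greedy_action` and the Q-values are made up for the example. The comment assumes index 0 is left and index 1 is right, as in CartPole.

```python
def greedy_action(q_values):
    """Return the index of the largest Q-value (ties go to the lowest index)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


# hypothetical Q-values for one state: index 0 = left, index 1 = right
q_for_state = [0.3, 1.2]
best = greedy_action(q_for_state)
```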

However, we don't know everything about the world, so we don't have
access to $Q^*$. But, since neural networks are universal function
approximators, we can simply create one and train it to resemble
$Q^*$.

For our training update rule, we'll use the fact that the $Q$
function of any policy obeys the Bellman equation:

$$Q^{\pi}(s, a) = r + \gamma\,Q^{\pi}(s', \pi(s'))$$

For the greedy policy, $\pi(s') = \arg\max_b Q(s', b)$, and the difference
between the two sides of the equality is known as the
temporal difference error, $\delta$:

$$\delta = Q(s, a) - (r + \gamma \max_b Q(s', b))$$
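
A worked instance of this formula for one transition may help. The numbers below are arbitrary and the helper name `td_error` is an assumption for illustration. It mirrors the equation term by term and does not treat terminal states specially.

```python
def td_error(q_sa, reward, next_q_values, gamma):
    """delta = Q(s, a) - (r + gamma * max_b Q(s', b))."""
    target = reward + gamma * max(next_q_values)
    return q_sa - target


# Q(s, a) = 1.0, r = 1.0, Q(s', .) = [0.5, 2.0], gamma = 0.99
delta = td_error(q_sa=1.0, reward=1.0, next_q_values=[0.5, 2.0], gamma=0.99)
```

Here the target is $1 + 0.99 \cdot 2.0 = 2.98$, so $\delta \approx -1.98$: the current estimate is too low and training would push $Q(s, a)$ upward.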

Training
--------

First, we need to implement some utility functions for our training procedure:

-  ``select_action`` $-$ will select an action according to an epsilon-greedy
   policy. Simply put, we'll sometimes use our model for choosing
   the action, and sometimes we'll just sample one uniformly. The
   probability of choosing a random action will start at ``EPS_START``
   and will decay exponentially towards ``EPS_END``. ``EPS_DECAY``
   controls the rate of the decay.

-  ``optimize_model`` $-$ performs a single optimization step. It first samples a batch and concatenates
   the tensors of each field into a single one. It then uses the model to compute the Q-values for the sampled states and applies the Bellman equation to form the targets used to optimize the model.
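
To get a feel for the exploration schedule described above, the following sketch evaluates the exponential decay at a few step counts. It reuses the hyperparameter values from this notebook. The helper name `epsilon_at` is hypothetical.

```python
import math

EPS_START, EPS_END, EPS_DECAY = 0.9, 0.05, 200


def epsilon_at(step):
    """Probability of taking a random action after `step` action selections."""
    return EPS_END + (EPS_START - EPS_END) * math.exp(-step / EPS_DECAY)


# epsilon at the start, after EPS_DECAY steps, and after 5 * EPS_DECAY steps
schedule = {step: round(epsilon_at(step), 3) for step in (0, 200, 1000)}
```

The schedule starts at ``EPS_START``, has closed about 63% of the gap to ``EPS_END`` after ``EPS_DECAY`` steps, and is within about 0.006 of ``EPS_END`` after 1000 steps. Because the step counter is global rather than per episode, most exploration happens in the first few dozen short episodes.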

In [8]:
def select_action(state):
    global steps_done
    sample = random.random()
    eps_threshold = EPS_END + (EPS_START - EPS_END) * math.exp(-1. * steps_done / EPS_DECAY)
    steps_done += 1
    if sample > eps_threshold:
        return model(Variable(state).type(FloatTensor)).data.max(1)[1].view(1, 1)
    else:
        return LongTensor([[random.randrange(2)]])

In [9]:
def optimize_model():
    if len(memory) < BATCH_SIZE:
        return

    # random transition batch is taken from experience replay memory
    transitions = memory.sample(BATCH_SIZE)
    batch_state, batch_action, batch_next_state, batch_reward = zip(*transitions)
    batch_state = Variable(torch.cat(batch_state))
    batch_action = Variable(torch.cat(batch_action))
    batch_reward = Variable(torch.cat(batch_reward))
    batch_next_state = Variable(torch.cat(batch_next_state))

    # current Q values are estimated by NN for all actions
    current_q_values = model(batch_state).gather(1, batch_action)
    # expected Q values are estimated from the action that gives the maximum Q value
    max_next_q_values = model(batch_next_state).detach().max(1)[0]
    expected_q_values = batch_reward + (GAMMA * max_next_q_values)

    # loss is measured from error between current and newly expected Q values
    loss = F.smooth_l1_loss(current_q_values.squeeze(), expected_q_values)

    # backpropagation of loss to NN
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
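
The loss used above is the smooth L1 (Huber-style) loss. The pure-Python sketch below shows the per-element formula as I understand PyTorch's defaults (threshold `beta=1.0`, mean reduction). The function names are illustrative and it is not a replacement for `F.smooth_l1_loss`.

```python
def smooth_l1(prediction, target, beta=1.0):
    """Quadratic for small errors, linear for large ones."""
    diff = abs(prediction - target)
    if diff < beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta


def mean_smooth_l1(predictions, targets):
    """Average the per-element losses, mirroring a mean reduction."""
    losses = [smooth_l1(p, t) for p, t in zip(predictions, targets)]
    return sum(losses) / len(losses)


small_error = smooth_l1(0.5, 0.0)   # quadratic branch
large_error = smooth_l1(3.0, 0.0)   # linear branch
```

The linear branch limits the gradient magnitude for large temporal difference errors, which makes the update less sensitive to outlier targets than a squared error.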

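Below, you can find the main training loop. At the beginning of each
episode we reset the environment and initialize the ``state``. Then, we sample
an action, execute it, observe the next state and the reward (the
environment returns 1 at every step; the loop replaces it with -1 on the
step that ends the episode), and optimize our model once. When the episode
ends (the pole falls, the cart leaves the track, or the step limit is
reached), we start the next episode.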

In [10]:
for e in range(EPISODES):
    state = env.reset()
    steps = 0
    while True:
        env.render()
        action = select_action(FloatTensor([state]))
        next_state, reward, done, _ = env.step(action[0, 0].item())
        # negative reward when attempt ends
        if done:
            reward = -1
        memory.push((FloatTensor([state]),
                     action,  # action is already a tensor
                     FloatTensor([next_state]),
                     FloatTensor([reward])))

        optimize_model()
        state = next_state
        steps += 1

        if done:
            print("{2} Episode {0} finished after {1} steps"
                  .format(e, steps, '\033[92m' if steps >= 195 else '\033[99m'))
            episode_durations.append(steps)
            break

print('Complete')
env.render()
env.close()
plt.ioff()
plt.show()
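
The loop collects ``episode_durations``, but the cell above does not show how the list is summarized. One possible sketch is a moving average, which smooths the noisy per-episode step counts. The helper name `moving_average` and the window size are assumptions. The sample values are the first five durations from the run output in this notebook.

```python
def moving_average(values, window):
    """Average of each consecutive `window`-sized slice of `values`."""
    if window <= 0 or len(values) < window:
        return []
    running = sum(values[:window])
    averages = [running / window]
    # slide the window one step at a time, updating the running sum
    for i in range(window, len(values)):
        running += values[i] - values[i - window]
        averages.append(running / window)
    return averages


first_durations = [11, 16, 28, 12, 19]
smoothed = moving_average(first_durations, window=3)
```

A longer window (for example 100 episodes) compared against the 195-step threshold used in the print statement would give a steadier picture of progress than individual episodes.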

Episode 0 finished after 11 steps
Episode 1 finished after 16 steps
Episode 2 finished after 28 steps
Episode 3 finished after 12 steps
Episode 4 finished after 19 steps
Episode 5 finished after 15 steps
Episode 6 finished after 26 steps
Episode 7 finished after 13 steps
Episode 8 finished after 11 steps
Episode 9 finished after 9 steps
Episode 10 finished after 11 steps
Episode 11 finished after 10 steps
Episode 12 finished after 8 steps
Episode 13 finished after 9 steps
Episode 14 finished after 11 steps
Episode 15 finished after 13 steps
Episode 16 finished after 9 steps
Episode 17 finished after 14 steps
Episode 18 finished after 10 steps
Episode 19 finished after 13 steps
Episode 20 finished after 12 steps
Episode 21 finished after 12 steps
Episode 22 finished after 9 steps
Episode 23 finished after 9 steps
Episode 24 finished after 


KeyboardInterrupt

