In [2]:
import numpy as np
import tensorflow as tf
import tensorflow.contrib.slim as slim # slim is a wrapper that makes building networks easier
from collections import deque # deques make better replay buffers than lists since
                              # adding/removing from either end is O(1)

# Cartpole problem introduction

In this homework, we'll explore the cartpole problem:

<img src="cartpole.png">

A pole is balanced on top of a cart which moves along a one-dimensional track. The goal of the task is to keep the pole balanced by moving the cart side to side. To make this into an MDP like we've discussed, we need the following elements:

* *Agent:* the controller of the cart
* *Environment:* the cart/world/physics
* *State:* we'll define the state to be a tuple of (x position of cart, x velocity of cart, angle of pole, angular velocity of pole).
* *Terminal states:* we'll end the episode when the pole tips too far over (> 15 degrees, in this implementation) or when the cart goes too far to either side (> 2.5 units).
* *Actions:* to keep it simple, we'll have only two actions: apply a force of +F toward the right, or -F toward the left, which we'll call "right" and "left," respectively.
* *Rewards:* To keep things simple and clear, we'll only give a reward in terminal states. Since all terminal states are losing, the reward will be -1.
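
The terminal-state thresholds above map onto the constants used later in the `loses` method. As a quick sanity check, here is a minimal sketch showing that 15 degrees is about 0.262 radians, the value the code compares `phi` against. The helper `within_limits` and the constant names are hypothetical, introduced only for illustration; they are not part of the homework code.

```python
import math

MAX_ANGLE_RAD = math.radians(15)  # roughly 0.2618; the homework code rounds this to 0.262
MAX_POSITION = 2.5                # track half-width, in the same units as x


def within_limits(x, phi):
    """Hypothetical helper: True while the episode has not yet terminated."""
    return -MAX_POSITION < x < MAX_POSITION and -MAX_ANGLE_RAD < phi < MAX_ANGLE_RAD


print(round(MAX_ANGLE_RAD, 3))
print(within_limits(0.0, 0.0), within_limits(3.0, 0.0), within_limits(0.0, 0.3))
```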

We'll compare two Q-learning approaches to this task in this homework: 

* *Tabular:* "standard" Q-learning
* *DQN:* A deep Q-network that approximates the Q-function, loosely inspired by the Atari game-playing paper.

We'll also compare to a baseline controller that takes random actions at every step.
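
For reference, the one-step Q-learning update that both approaches build on can be written as follows. The symbols are chosen to match the parameter names in the controller code: $\eta$ is the update rate and $\gamma$ the discount factor; $s'$ is the state reached after taking action $a$ in state $s$, and $r$ is the reward received.

$$
y = \begin{cases} r & \text{if } s' \text{ is terminal} \\ r + \gamma \max_{a'} Q(s', a') & \text{otherwise} \end{cases}
\qquad
Q(s, a) \leftarrow (1 - \eta)\, Q(s, a) + \eta\, y
$$

The tabular controller applies this update directly to a table entry, while the DQN uses the same target $y$ as a regression target for its network output. In this task $r = 0$ in every non-terminal step, so the non-terminal target reduces to $\gamma \max_{a'} Q(s', a')$.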

Most of the code chunks in this document have been run for you already, since some of them (especially the DQN training) take a non-trivial amount of time. However, we encourage you to play around with the code and get your hands dirty.

# Conceptual questions

(There are 10 questions across 3 sections in this homework, some with code chunks interspersed. Make sure you answer all of them! They can be answered in a separate document or directly in this file, whichever you prefer.)

1\. Since the reward for every *episode* (not every action!) will be -1, why would a Q-learning system learn any interesting behavior on this task?

2\. Why might a DQN (or some other function approximator) be an appropriate choice here?

In [3]:
class cartpole_problem(object):
    """Class implementing the cartpole world -- you may want to glance at the
       methods to see if you can understand what's going on."""
    def __init__(self, max_lifetime=1000):
        self.delta_t = 0.05
        self.gravity = 9.8
        self.force = 1.
        self.cart_mass = 1.
        self.pole_mass = 0.2
        self.mass = self.cart_mass + self.pole_mass
        self.pole_half_length = 1.
        self.max_lifetime = max_lifetime

        self.reset_state()

    def get_state(self):
        """Returns current state as a tuple"""
        return (self.x, self.x_dot, self.phi, self.phi_dot)

    def reset_state(self):
        """Reset state variables to initial conditions"""
        self.x = 0.
        self.x_dot = 0.
        self.phi = 0.
        self.phi_dot = 0.

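    def tick(self, action):
        """Advance one time step using the equations of motion and the chosen action."""

        # Note: "left" maps to +force and "right" to -force, the reverse of the
        # labels in the introduction. The dynamics are mirror-symmetric, so
        # swapping the two labels does not change the task.
        if action == "left":
            action_force = self.force
        else:
            action_force = -self.force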

        dt = self.delta_t
        self.x += dt * self.x_dot
        self.phi += dt * self.phi_dot

        sin_phi = np.sin(self.phi)
        cos_phi = np.cos(self.phi)

        F = action_force + sin_phi * self.pole_mass * self.pole_half_length * (self.phi_dot**2)
        phi_2_dot = (sin_phi * self.gravity - cos_phi * F/ self.mass) / (0.5 * self.pole_half_length * (4./3 - self.pole_mass * cos_phi**2 / self.mass))
        x_2_dot = (F - self.pole_mass * self.pole_half_length * phi_2_dot) / self.mass

        self.x_dot += dt * x_2_dot
        self.phi_dot += dt * phi_2_dot


    def loses(self):
        """Loses if not within 2.5 units of start and 15 deg. of vertical"""
        return not (-2.5 < self.x < 2.5 and -0.262 < self.phi < 0.262)

    def run_trial(self, controller, testing=False):
        self.reset_state()
        i = 0
        while i < self.max_lifetime:
            i += 1
            this_state = self.get_state()
            this_action = controller.choose_action(this_state)
            self.tick(this_action)
            new_state = self.get_state()

            loss = self.loses()
            reward = -1. if loss else 0.
            if not testing:
                controller.update(this_state, this_action, new_state, reward)

            if loss:
                break

        if testing:
            print("Ran testing trial with %s Controller, achieved a lifetime of %i steps" % (controller.name, i))

        return i


    def run_k_trials(self, controller, k):
        """Runs k training trials using the specified controller and prints
           the average lifetime. Controller must have a choose_action(state)
           method which returns one of "left" and "right," and an
           update(state, action, next state, reward) method, which is called
           after every step."""
        avg_lifetime = 0.
        for i in range(k):
            avg_lifetime += self.run_trial(controller)

        avg_lifetime /= k
        print("Ran %i trials with %s Controller, (average lifetime of %f steps)" % (k,  controller.name, avg_lifetime))


In [4]:
class random_controller(object):
    """Random controller/base class for fancier ones."""
    def __init__(self):
        self.name = "Random"
        self.testing = False

    def set_testing(self):
        """Switch to testing mode (subclasses use this to turn off exploration)."""
        self.testing = True

    def set_training(self):
        """Switch to training mode (subclasses use this to turn exploration back on)."""
        self.testing = False

    def choose_action(self, state):
        """Takes a state and returns an action, "left" or "right," to take.
           this method chooses randomly, should be overridden by fancy
           controllers."""
        return np.random.choice(["left", "right"])

    def update(self, prev_state, action, new_state, reward):
        """Update the policy. Does nothing here; override in trainable controllers."""
        pass

In [7]:
cpp = cartpole_problem()

# try a few random controllers with different random seeds
# this gives a baseline for comparison
for i in range(10):
    np.random.seed(i)
    cpc = random_controller()
    cpp.run_trial(cpc, testing=True)


Ran testing trial with Random Controller, achieved a lifetime of 16 steps
Ran testing trial with Random Controller, achieved a lifetime of 15 steps
Ran testing trial with Random Controller, achieved a lifetime of 40 steps
Ran testing trial with Random Controller, achieved a lifetime of 18 steps
Ran testing trial with Random Controller, achieved a lifetime of 21 steps
Ran testing trial with Random Controller, achieved a lifetime of 26 steps
Ran testing trial with Random Controller, achieved a lifetime of 33 steps
Ran testing trial with Random Controller, achieved a lifetime of 16 steps
Ran testing trial with Random Controller, achieved a lifetime of 17 steps
Ran testing trial with Random Controller, achieved a lifetime of 14 steps


# Tabular Q learning

There is a difficulty in making this a tabular Q-learning problem: it's not a finite MDP! Since the state space is continuous, it's actually infinite. To avoid trying to make an infinite table, we'll discretize the space (quite drastically) by chopping the position and angle dimensions to only 3 values each, and the velocity dimensions to 5 values each, thus reducing the continuous state space to 3 × 5 × 3 × 5 = 225 discrete states. It's not perfect, but as you'll see below, it offers quite an improvement over the random controller.
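
To make the state count concrete, here is a small sketch that enumerates the discrete states using the same bin labels as the controller below (`disc` and `disc_dot`). The use of `itertools.product` is just one convenient way to list them and is an assumption of this sketch, not how the controller builds its table.

```python
from itertools import product

disc = [-1, 0, 1]             # bin labels for x and phi
disc_dot = [-2, -1, 0, 1, 2]  # bin labels for x_dot and phi_dot

# One tuple per discrete state, ordered as (x, x_dot, phi, phi_dot)
states = list(product(disc, disc_dot, disc, disc_dot))
print(len(states))
```

With two actions per state, the Q-table therefore holds 450 values in total.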

In [10]:

class tabular_Q_controller(random_controller):
    """Tabular Q-learning controller."""

    def __init__(self, epsilon=0.05, gamma=0.95, eta=0.1):
        """Epsilon: exploration probability (epsilon-greedy)
           gamma: discount factor
           eta: update rate"""
        super().__init__()
        self.name = "Tabular Q"
        disc = [-1, 0, 1]
        disc_dot = [-2, -1, 0, 1, 2]
        self.Q_table = {(x, x_dot, phi, phi_dot): {"left": 0.01-np.random.rand()/50, "right": 0.01-np.random.rand()/50} for x in disc for x_dot in disc_dot for phi in disc for phi_dot in disc_dot}
        self.eta = eta
        self.gamma = gamma
        self.epsilon = epsilon

    def discretize_state(self, state):
        """Convert continuous state into discrete with 3 possible values of each
           position, 5 possible values of each derivative."""
        x, x_dot, phi, phi_dot = state
        if x > 1.:
            x = 1
        elif x < -1.:
            x = -1
        else:
            x = 0

        if x_dot < -0.1:
            x_dot = -2
        elif x_dot > 0.1:
            x_dot = 2
        elif x_dot < -0.03:
            x_dot = -1
        elif x_dot > 0.03:
            x_dot = 1
        else:
            x_dot = 0

        if phi > 0.1:
            phi = 1
        elif phi < -0.1:
            phi = -1
        else:
            phi = 0

        if phi_dot < -0.1:
            phi_dot = -2
        elif phi_dot > 0.1:
            phi_dot = 2
        elif phi_dot < -0.03:
            phi_dot = -1
        elif phi_dot > 0.03:
            phi_dot = 1
        else:
            phi_dot = 0

        return (x, x_dot, phi, phi_dot)

    def choose_action(self, state):
        """Epsilon-greedy w.r.t the current Q-table."""
        state = self.discretize_state(state)
        if not self.testing and np.random.rand() < self.epsilon:
            return np.random.choice(["left", "right"])
        else:
            curr_Q_vals = self.Q_table[state]
            if curr_Q_vals["left"] > curr_Q_vals["right"]:
                return "left"
            return "right"

    def update(self, prev_state, action, new_state, reward):
        """Update Q table."""
        prev_state = self.discretize_state(prev_state)
        new_state = self.discretize_state(new_state)
        if reward != 0.:
            target = reward # reward states are terminal in this task
        else:
            target = self.gamma * max(self.Q_table[new_state].values())

        self.Q_table[prev_state][action] = (1 - self.eta) * self.Q_table[prev_state][action] + self.eta * target


In [11]:
np.random.seed(0)
tqc = tabular_Q_controller()
tqc.set_testing()
cpp.run_trial(tqc, testing=True)
# for trainable controllers, we'll run a few testing trials during
# training to see how they evolve
for step in range(5):
    tqc.set_training()
    cpp.run_k_trials(tqc, 1000)
    tqc.set_testing()
    cpp.run_trial(tqc, testing=True)

Ran testing trial with Tabular Q Controller, achieved a lifetime of 16 steps
Ran 1000 trials with Tabular Q Controller, (average lifetime of 80.216000 steps)
Ran testing trial with Tabular Q Controller, achieved a lifetime of 123 steps
Ran 1000 trials with Tabular Q Controller, (average lifetime of 80.714000 steps)
Ran testing trial with Tabular Q Controller, achieved a lifetime of 123 steps
Ran 1000 trials with Tabular Q Controller, (average lifetime of 90.360000 steps)
Ran testing trial with Tabular Q Controller, achieved a lifetime of 125 steps
Ran 1000 trials with Tabular Q Controller, (average lifetime of 72.991000 steps)
Ran testing trial with Tabular Q Controller, achieved a lifetime of 78 steps
Ran 1000 trials with Tabular Q Controller, (average lifetime of 87.945000 steps)
Ran testing trial with Tabular Q Controller, achieved a lifetime of 126 steps


# Tabular Q-learning questions

3\. The tabular Q-learning system does much better than a random controller, but it still only lives about 5 times as long. What could we do to improve the tabular Q system's performance on this task further? For whatever you propose, how would it affect training? 

4\. Try setting gamma = 0.0 (living in the moment). What happens? Why?

In [12]:
np.random.seed(0)
tqc = tabular_Q_controller(gamma=0.)
tqc.set_testing()
cpp.run_trial(tqc, testing=True)
for i in range(5):
    tqc.set_training()
    cpp.run_k_trials(tqc, 1000)
    tqc.set_testing()
    cpp.run_trial(tqc, testing=True)

Ran testing trial with Tabular Q Controller, achieved a lifetime of 16 steps
Ran 1000 trials with Tabular Q Controller, (average lifetime of 16.086000 steps)
Ran testing trial with Tabular Q Controller, achieved a lifetime of 16 steps
Ran 1000 trials with Tabular Q Controller, (average lifetime of 16.200000 steps)
Ran testing trial with Tabular Q Controller, achieved a lifetime of 16 steps
Ran 1000 trials with Tabular Q Controller, (average lifetime of 16.151000 steps)
Ran testing trial with Tabular Q Controller, achieved a lifetime of 16 steps
Ran 1000 trials with Tabular Q Controller, (average lifetime of 16.020000 steps)
Ran testing trial with Tabular Q Controller, achieved a lifetime of 16 steps
Ran 1000 trials with Tabular Q Controller, (average lifetime of 16.090000 steps)
Ran testing trial with Tabular Q Controller, achieved a lifetime of 16 steps


5\. What happens if we set gamma = 1 (living in all moments at once)? Naively, one might expect to get random behavior, since all trials get the same total reward, and gamma = 1 is essentially saying that the total reward is all that matters, not when the reward appears. However, this is not what actually happens. Why?

In [13]:
np.random.seed(0)
tqc = tabular_Q_controller(gamma=1.)
tqc.set_testing()
cpp.run_trial(tqc, testing=True)
for i in range(5):
    tqc.set_training()
    cpp.run_k_trials(tqc, 1000)
    tqc.set_testing()
    cpp.run_trial(tqc, testing=True)

Ran testing trial with Tabular Q Controller, achieved a lifetime of 16 steps
Ran 1000 trials with Tabular Q Controller, (average lifetime of 68.582000 steps)
Ran testing trial with Tabular Q Controller, achieved a lifetime of 43 steps
Ran 1000 trials with Tabular Q Controller, (average lifetime of 68.383000 steps)
Ran testing trial with Tabular Q Controller, achieved a lifetime of 79 steps
Ran 1000 trials with Tabular Q Controller, (average lifetime of 73.737000 steps)
Ran testing trial with Tabular Q Controller, achieved a lifetime of 98 steps
Ran 1000 trials with Tabular Q Controller, (average lifetime of 81.137000 steps)
Ran testing trial with Tabular Q Controller, achieved a lifetime of 123 steps
Ran 1000 trials with Tabular Q Controller, (average lifetime of 79.202000 steps)
Ran testing trial with Tabular Q Controller, achieved a lifetime of 104 steps


6\. What happens if you set epsilon = 1 (random behavior while training)? Why?

In [15]:
np.random.seed(0)
tqc = tabular_Q_controller(epsilon=1.)
tqc.set_testing()
cpp.run_trial(tqc, testing=True)
for i in range(5):
    tqc.set_training()
    cpp.run_k_trials(tqc, 1000)
    tqc.set_testing()
    cpp.run_trial(tqc, testing=True)

Ran testing trial with Tabular Q Controller, achieved a lifetime of 16 steps
Ran 1000 trials with Tabular Q Controller, (average lifetime of 18.413000 steps)
Ran testing trial with Tabular Q Controller, achieved a lifetime of 106 steps
Ran 1000 trials with Tabular Q Controller, (average lifetime of 18.725000 steps)
Ran testing trial with Tabular Q Controller, achieved a lifetime of 106 steps
Ran 1000 trials with Tabular Q Controller, (average lifetime of 18.308000 steps)
Ran testing trial with Tabular Q Controller, achieved a lifetime of 106 steps
Ran 1000 trials with Tabular Q Controller, (average lifetime of 18.500000 steps)
Ran testing trial with Tabular Q Controller, achieved a lifetime of 111 steps
Ran 1000 trials with Tabular Q Controller, (average lifetime of 18.631000 steps)
Ran testing trial with Tabular Q Controller, achieved a lifetime of 111 steps


7\. What happens if you set epsilon = 0 (no exploration)? Why does this happen here, and what might be different about other tasks that makes exploration important?

In [16]:
np.random.seed(0)
tqc = tabular_Q_controller(epsilon=0.)
tqc.set_testing()
cpp.run_trial(tqc, testing=True)
for i in range(5):
    tqc.set_training()
    cpp.run_k_trials(tqc, 1000)
    tqc.set_testing()
    cpp.run_trial(tqc, testing=True)

Ran testing trial with Tabular Q Controller, achieved a lifetime of 16 steps
Ran 1000 trials with Tabular Q Controller, (average lifetime of 97.499000 steps)
Ran testing trial with Tabular Q Controller, achieved a lifetime of 126 steps
Ran 1000 trials with Tabular Q Controller, (average lifetime of 126.000000 steps)
Ran testing trial with Tabular Q Controller, achieved a lifetime of 126 steps
Ran 1000 trials with Tabular Q Controller, (average lifetime of 126.000000 steps)
Ran testing trial with Tabular Q Controller, achieved a lifetime of 126 steps
Ran 1000 trials with Tabular Q Controller, (average lifetime of 126.000000 steps)
Ran testing trial with Tabular Q Controller, achieved a lifetime of 126 steps
Ran 1000 trials with Tabular Q Controller, (average lifetime of 126.000000 steps)
Ran testing trial with Tabular Q Controller, achieved a lifetime of 126 steps


Food for thought (no answer necessary): Are the discretization values very important? (The current values were picked by a few quick rounds of trial and error.) If we discretized the space more finely, would we see better results? Is it better to space the breaks linearly or quadratically?
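
If you want to experiment with the spacing of the breaks, the following is a minimal sketch of linearly and quadratically spaced bin edges. The functions `linear_edges`, `quadratic_edges`, and `discretize` are hypothetical helpers written for this illustration; they use `np.digitize` rather than the if/elif chains in the controller, and may treat values that fall exactly on an edge differently.

```python
import numpy as np


def linear_edges(limit, n_per_side):
    """Evenly spaced break points in [-limit, limit], with a bin centred on 0."""
    steps = np.arange(1, n_per_side + 1) / n_per_side
    return np.concatenate([-limit * steps[::-1], limit * steps])


def quadratic_edges(limit, n_per_side):
    """Break points that crowd toward 0, giving finer resolution near upright."""
    steps = (np.arange(1, n_per_side + 1) / n_per_side) ** 2
    return np.concatenate([-limit * steps[::-1], limit * steps])


def discretize(value, edges):
    """Map a continuous value to a signed bin index, with 0 as the central bin."""
    return int(np.digitize(value, edges)) - len(edges) // 2


lin = linear_edges(0.1, 2)     # breaks at -0.1, -0.05, 0.05, 0.1
quad = quadratic_edges(0.1, 2) # breaks at -0.1, -0.025, 0.025, 0.1
print(discretize(0.03, lin), discretize(0.03, quad))
```

In this example the value 0.03 lands in the central bin under linear spacing but in the next bin out under quadratic spacing, which is the kind of difference the question is asking you to think about.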

# DQN

In some ways, creating the DQN is simpler than creating the tabular Q-learning system. Neural nets can accept continuous input, so we can simply pass the current state to the network without discretizing.
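
Concretely, "passing the state to the network" only requires turning the 4-tuple into a 1 × 4 array, which is what the controller below does with `np.array(state, ndmin=2)`. A minimal sketch, using made-up state values:

```python
import numpy as np

# Made-up (x, x_dot, phi, phi_dot) values, for illustration only
state = (0.1, -0.2, 0.05, 0.3)

# ndmin=2 adds a leading batch dimension, giving one row with four features
net_input = np.array(state, ndmin=2, dtype=np.float32)
print(net_input.shape)
```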

As you'll see below, this system does quite a bit better. In fact, it reaches the time limit at which the cartpole code stops by default (1000 steps).

In [8]:
class dqn_controller(random_controller):
    """Simple deep-Q network controller -- 4 inputs (one for each state
       variable), two hidden layers, two outputs (Q-left, Q-right), and an
       optional replay buffer."""

    def __init__(self, epsilon=0.05, gamma=0.95, eta=1e-4, nh1=100, nh2=100, replay_buffer=True):
        """Epsilon: exploration probability (epsilon-greedy)
           gamma: discount factor
           eta: learning rate,
           nh1: number of hidden units in first hidden layer,
           nh2: number of hidden units in second hidden layer,
           replay_buffer: whether to use a replay buffer"""
        super().__init__()
        self.name = "DQN"
        self.eta = eta
        self.gamma = gamma
        self.epsilon = epsilon

        if replay_buffer:
            self.replay_buffer = deque()
            self.replay_buffer_max_size = 1000
        else:
            self.replay_buffer = None

        # network creation
        self.input = tf.placeholder(tf.float32, [1, 4])
        h1 = slim.layers.fully_connected(self.input, nh1, activation_fn=tf.nn.tanh)
        h2 = slim.layers.fully_connected(h1, nh2, activation_fn=tf.nn.tanh)
        self.Q_vals = slim.layers.fully_connected(h2, 2, activation_fn=tf.nn.tanh)

        # training stuff
        self.target =  tf.placeholder(tf.float32, [1, 2])
        self.loss = tf.nn.l2_loss(self.Q_vals - self.target)
        optimizer = tf.train.AdamOptimizer(self.eta, epsilon=1e-3) # (this is an unrelated epsilon)
        self.train = optimizer.minimize(self.loss)

        # session and init
        self.sess = tf.Session()
        self.sess.run(tf.global_variables_initializer())

    def choose_action(self, state):
        """Takes a state and returns an action, "left" or "right," to take.
           epsilon-greedy w.r.t current Q-function approx."""
        if not self.testing and np.random.rand() < self.epsilon:
            return np.random.choice(["left", "right"])
        else:
            curr_Q_vals = self.sess.run(self.Q_vals, feed_dict={self.input: np.array(state, ndmin=2)})
            if curr_Q_vals[0, 0] > curr_Q_vals[0, 1]:
                return "left"
            return "right"

    def update(self, prev_state, action, new_state, reward):
        """One Q-learning step on the network, using a transition sampled
           from the replay buffer if one is enabled."""
        if self.replay_buffer is not None:
            # put this (S, A, S, R) tuple in buffer
            self.replay_buffer.append((prev_state, action, new_state, reward))
            rb_len = len(self.replay_buffer)
            # pick a random (S, A, S, R) tuple from buffer
            (prev_state, action, new_state,reward) = self.replay_buffer[np.random.randint(0, rb_len)]

            # remove a memory if getting too full
            if rb_len > self.replay_buffer_max_size:
                self.replay_buffer.popleft()

        if reward != 0.:
            target_val = reward # reward states are terminal in this task
        else:
            new_Q_vals = self.sess.run(self.Q_vals, feed_dict={self.input: np.array(new_state, ndmin=2)})
            target_val = self.gamma * np.max(new_Q_vals)

        # hacky way to update only the correct Q value: make the target for the
        # other its current value
        target_Q_vals = self.sess.run(self.Q_vals, feed_dict={self.input: np.array(prev_state, ndmin=2)})
        if action == "left":
            target_Q_vals[0, 0] = target_val
        else:
            target_Q_vals[0, 1] = target_val

        self.sess.run(self.train, feed_dict={self.input: np.array(prev_state, ndmin=2), self.target: target_Q_vals.reshape([1,2])})


In [17]:
np.random.seed(0)
tf.set_random_seed(0)
dqn = dqn_controller(replay_buffer=True)
dqn.set_testing()
cpp.run_trial(dqn, testing=True)
for i in range(8):
    dqn.set_training()
    cpp.run_k_trials(dqn, 1000)
    dqn.set_testing()
    cpp.run_trial(dqn, testing=True)

Ran testing trial with DQN Controller, achieved a lifetime of 24 steps
Ran 1000 trials with DQN Controller, (average lifetime of 18.629000 steps)
Ran testing trial with DQN Controller, achieved a lifetime of 23 steps
Ran 1000 trials with DQN Controller, (average lifetime of 19.294000 steps)
Ran testing trial with DQN Controller, achieved a lifetime of 16 steps
Ran 1000 trials with DQN Controller, (average lifetime of 19.562000 steps)
Ran testing trial with DQN Controller, achieved a lifetime of 19 steps
Ran 1000 trials with DQN Controller, (average lifetime of 19.773000 steps)
Ran testing trial with DQN Controller, achieved a lifetime of 18 steps
Ran 1000 trials with DQN Controller, (average lifetime of 38.124000 steps)
Ran testing trial with DQN Controller, achieved a lifetime of 44 steps
Ran 1000 trials with DQN Controller, (average lifetime of 133.583000 steps)
Ran testing trial with DQN Controller, achieved a lifetime of 155 steps
Ran 1000 trials with DQN Controller, (average lifet

# DQN questions

8\. Why does the DQN take longer to learn than the tabular Q-learning system? (There are a number of potentially correct answers here.)

9\. In my implementation, I used the tanh activation function. Why might this be an appropriate choice here? More specifically, what are some activation functions that would NOT yield good results at the output layer?

10\. What happens if we turn off the replay buffer? Why might it be important?
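
To see the mechanics of the replay buffer in isolation from TensorFlow, here is a minimal stand-alone sketch. The class `ReplayBuffer` is hypothetical and is not the implementation used in `dqn_controller`: it relies on `deque(maxlen=...)` to discard the oldest transition automatically, whereas the controller above calls `popleft()` by hand once the buffer exceeds its maximum size.

```python
import random
from collections import deque


class ReplayBuffer:
    """Hypothetical fixed-size buffer of (prev_state, action, new_state, reward) tuples."""

    def __init__(self, max_size=1000):
        # With maxlen set, appending to a full deque drops the oldest entry
        self.buffer = deque(maxlen=max_size)

    def add(self, prev_state, action, new_state, reward):
        self.buffer.append((prev_state, action, new_state, reward))

    def sample(self):
        """Return one stored transition chosen uniformly at random."""
        return random.choice(self.buffer)

    def __len__(self):
        return len(self.buffer)


buf = ReplayBuffer(max_size=3)
for step in range(5):
    # Dummy transitions; the step index stands in for the state
    buf.add((step,), "left", (step + 1,), 0.0)

print(len(buf))     # only the 3 most recent transitions remain
print(buf.sample())
```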

In [18]:
np.random.seed(0)
tf.set_random_seed(0)
dqn = dqn_controller(replay_buffer=False)
dqn.set_testing()
cpp.run_trial(dqn, testing=True)
for i in range(8):
    dqn.set_training()
    cpp.run_k_trials(dqn, 1000)
    dqn.set_testing()
    cpp.run_trial(dqn, testing=True)

Ran testing trial with DQN Controller, achieved a lifetime of 51 steps
Ran 1000 trials with DQN Controller, (average lifetime of 17.409000 steps)
Ran testing trial with DQN Controller, achieved a lifetime of 21 steps
Ran 1000 trials with DQN Controller, (average lifetime of 19.705000 steps)
Ran testing trial with DQN Controller, achieved a lifetime of 47 steps
Ran 1000 trials with DQN Controller, (average lifetime of 20.104000 steps)
Ran testing trial with DQN Controller, achieved a lifetime of 21 steps
Ran 1000 trials with DQN Controller, (average lifetime of 19.570000 steps)
Ran testing trial with DQN Controller, achieved a lifetime of 17 steps
Ran 1000 trials with DQN Controller, (average lifetime of 18.803000 steps)
Ran testing trial with DQN Controller, achieved a lifetime of 21 steps
Ran 1000 trials with DQN Controller, (average lifetime of 18.665000 steps)
Ran testing trial with DQN Controller, achieved a lifetime of 21 steps
Ran 1000 trials with DQN Controller, (average lifetim

Food for thought: If you gave the DQN the same discretized states that the tabular controller gets, would it do any better than the tabular system does? (Try it out if you're curious!)